cs.CL

[1] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique,J. M Areeb Uzair Alam,Md Jobayer Rahman Rafy,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

TL;DR: This paper evaluates LLMs' ability to solve physics problems, improves performance using multi-agent frameworks, and introduces a new physics evaluation benchmark called PhysicsEval.

Details

Motivation: The motivation behind this paper is to improve the ability of LLMs to solve physics problems, a critical domain of natural language reasoning, and to create a comprehensive benchmark for evaluating such models. Method: The researchers evaluated frontier LLMs for solving physics problems, employed various inference-time techniques and agentic frameworks to enhance performance, and introduced a new benchmark called PhysicsEval with 19,609 problems from textbooks and online sources. Result: Significant improvements in model performance were observed when using multi-agent frameworks, particularly for problems the models initially struggled with. The paper also introduced PhysicsEval, a new benchmark for physics problem-solving evaluation. Conclusion: The study concludes that multi-agent frameworks significantly improve the performance of models on physics problems, especially on problems where the models initially performed poorly. Additionally, the creation of a new evaluation benchmark, PhysicsEval, contributes to future research in this field. Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

[2] Do LLMs produce texts with "human-like" lexical diversity?

Kelly Kendro,Jeffrey Maloney,Scott Jarvis

Main category: cs.CL

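TL;DR: This study finds that texts generated by large language models (LLMs) differ significantly from human writing in lexical diversity, and that newer ChatGPT models diverge from human writing even more.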

Details

Motivation: Although extensive research has examined whether LLM-generated text is truly human-like, the question remains unresolved. This study examines how similar or different LLM-generated text and human writing are from the perspective of lexical diversity. Method: The study compares texts generated by four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) with writing from 240 L1 and L2 English participants along six dimensions of lexical diversity: volume, abundance, variety-repetition, evenness, disparity, and dispersion. The analysis uses one-way MANOVAs, one-way ANOVAs, and Support Vector Machines. Result: LLM-generated texts differed significantly from human writing on every lexical diversity variable, with ChatGPT-o4 mini and -4.5 differing the most. Of these two, ChatGPT-4.5 showed higher lexical diversity despite producing fewer tokens. Human writers' lexical diversity did not differ significantly across subgroups (education level, language status). Conclusion: LLM-generated texts differ significantly from human writing in lexical diversity, most of all for ChatGPT-o4 mini and -4.5. This indicates that LLMs do not produce truly human-like texts and that newer LLMs produce less human-like texts than older models. Abstract: The degree to which LLMs produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the LLM-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and -4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity despite producing fewer tokens. The human writers' lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that LLMs do not produce human-like texts in relation to lexical diversity, and the newer LLMs produce less human-like texts than older models. We discuss the implications of these results for language pedagogy and related applications.

[3] Semiotic Complexity and Its Epistemological Implications for Modeling Culture

Zachary K. Stine,James E. Deitrick

Main category: cs.CL

TL;DR: The paper argues that computational humanists should theorize their modeling practices as translation work to avoid errors and ensure transparency. It highlights the issue of treating semiotically complex data as simple and provides recommendations for addressing this issue.

Details

Motivation: The motivation for this paper is the need for greater theorizing in computational humanities to achieve epistemological and interpretive clarity, which is essential for the maturation of the field. Method: The paper frames computational modeling as translation work between cultural, linguistic domains and computational, mathematical domains. It introduces the concept of semiotic complexity and analyzes how current modeling practices handle semiotically complex data. Result: The paper identifies a significant issue in current modeling practices where semiotically complex data is treated as semiotically simple, leading to potential translation errors. It highlights the importance of accounting for semiotic complexity in evaluation practices. Conclusion: The paper concludes that computational humanists should engage in more theorizing to ensure internal consistency, avoid translation errors, and facilitate interpretive transparency. It recommends researchers account for epistemological issues, particularly semiotic complexity, in their modeling practices. Abstract: Greater theorizing of methods in the computational humanities is needed for epistemological and interpretive clarity, and therefore the maturation of the field. In this paper, we frame such modeling work as engaging in translation work from a cultural, linguistic domain into a computational, mathematical domain, and back again. Translators benefit from articulating the theory of their translation process, and so do computational humanists in their work -- to ensure internal consistency, avoid subtle yet consequential translation errors, and facilitate interpretive transparency. Our contribution in this paper is to lay out a particularly consequential dimension of the lack of theorizing and the sorts of translation errors that emerge in our modeling practices as a result. 
Along these lines we introduce the idea of semiotic complexity as the degree to which the meaning of some text may vary across interpretive lenses, and make the case that dominant modeling practices -- especially around evaluation -- commit a translation error by treating semiotically complex data as semiotically simple when it seems epistemologically convenient by conferring superficial clarity. We then lay out several recommendations for researchers to better account for these epistemological issues in their own work.

[4] FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Mingda Chen,Yang Li,Xilun Chen,Adina Williams,Gargi Ghosh,Scott Yih

Main category: cs.CL

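TL;DR: This paper introduces FACTORY, a high-quality, human-verified prompt set for evaluating long-form factuality in language models, and finds that current state-of-the-art models still make substantial factual errors.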

Details

Motivation: Existing long-form factuality benchmarks often lack human verification, which can lead to quality issues, so a more reliable and challenging evaluation is needed. Method: The authors build FACTORY, a large-scale, human-verified prompt set developed with a model-in-the-loop approach, and use it alongside existing datasets to run human evaluations of 6 state-of-the-art language models. Result: On FACTORY, approximately 40% of the claims in the responses of state-of-the-art models are not factual, compared to only 10% on other datasets. Conclusion: FACTORY is a reliable and challenging benchmark. The analysis highlights its strengths over prior benchmarks and the need for models to reason across long-tailed facts. Abstract: Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.

[5] Is neural semantic parsing good at ellipsis resolution, or isn't it?

Xiao Zhang,Johan Bos

Main category: cs.CL

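TL;DR: Although neural semantic parsers perform well on standard tests, they struggle with English verb phrase ellipsis, a highly context-sensitive linguistic phenomenon.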

Details

Motivation: Neural semantic parsers show good overall performance across a variety of linguistic phenomena, but it is unclear how well they handle strongly context-sensitive phenomena such as English verb phrase ellipsis. Method: The authors construct a corpus of 120 cases of ellipsis with their fully resolved meaning representations and use it as a challenge set for a large battery of neural semantic parsers. Result: The parsers performed very well on the standard test set but failed on the instances with ellipsis. Conclusion: Neural semantic parsers perform well on the standard test set but fail on English verb phrase ellipsis, which indicates limits in handling highly context-sensitive phenomena. Abstract: Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are the otherwise known as powerful semantic parsers able to deal with ellipsis or aren't they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representation and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed in the instances with ellipsis. Data augmentation

[6] Comparison of Large Language Models for Deployment Requirements

Alper Yaman,Jannik Schwab,Christof Nitsche,Abhirup Sinha,Marco Huber

Main category: cs.CL

TL;DR: This paper presents a comparative list of foundational and domain-specific Large Language Models (LLMs) to help researchers and companies choose the optimal model based on features like release year, licensing, and hardware requirements. The list is published on GitLab and will be continuously updated.

Details

Motivation: The motivation behind this study is to navigate the rapidly evolving LLM landscape and facilitate LLM selection due to the increasing number of open-source foundational and fine-tuned models introduced over the past two years. Method: The paper presents a comparative list of foundational and domain-specific LLMs that is published on GitLab and will be continuously updated. Result: The result is the publication of a comparative list of foundational and domain-specific LLMs focusing on features like release year, licensing, and hardware requirements. Conclusion: The paper concludes that there is a growing need for a comparative list of foundational and domain-specific LLMs to help researchers and companies choose the optimal model based on features like release year, licensing, and hardware requirements. Abstract: Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPTs) are revolutionizing the generation of human-like text, producing contextually relevant and syntactically correct content. Despite challenges like biases and hallucinations, these Artificial Intelligence (AI) models excel in tasks, such as content creation, translation, and code generation. Fine-tuning and novel architectures, such as Mixture of Experts (MoE), address these issues. Over the past two years, numerous open-source foundational and fine-tuned models have been introduced, complicating the selection of the optimal LLM for researchers and companies regarding licensing and hardware requirements. To navigate the rapidly evolving LLM landscape and facilitate LLM selection, we present a comparative list of foundational and domain-specific models, focusing on features, such as release year, licensing, and hardware requirements. This list is published on GitLab and will be continuously updated.

[7] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Xiaofeng Wu,Alan Ritter,Wei Xu

Main category: cs.CL

TL;DR: This paper analyzes the challenges of table understanding tasks and proposes a taxonomy to support future research.

Details

Motivation: Tables have attracted significant attention in LLMs and multimodal LLMs because of their complex and flexible structure, but their two-dimensional nature and diverse formats and purposes have led to specialized rather than universal methods, which makes the landscape of table understanding tasks hard to navigate. Method: The survey analyzes the challenges and key concepts of table understanding through a taxonomy of tabular input representations and an introduction of the tasks. Result: The paper identifies three critical gaps in the field: (1) the predominance of retrieval-focused tasks; (2) significant challenges for models when processing complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) limited generalization of models across different tabular representations and formats. Conclusion: The paper presents a taxonomy of tabular input representations, introduces table understanding tasks, and highlights several gaps in the field that need further research. Abstract: Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction of table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long context, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.

[8] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform

Rana Aref Salama,Abdou Youssef,Mona Diab

Main category: cs.CL

TL;DR: This paper explores the use of Discrete Wavelet Transforms (DWT) in NLP, showing that DWT can compress word and sentence embeddings significantly while preserving or enhancing their performance in downstream tasks.

Details

Motivation: Wavelet transforms have proven effective in signal and image processing, and this paper explores their application in NLP to enhance embedding representation, compress data, and extract meaningful semantic features. Method: The paper employs Discrete Wavelet Transforms (DWT) on word and sentence embeddings to analyze and compress their representations, evaluating the results on semantic similarity and other downstream tasks. Result: DWT was able to reduce the dimensionality of embeddings by 50-93% with almost no performance loss in semantic similarity tasks and achieved better accuracy in most downstream tasks. Conclusion: Wavelet transforms, specifically DWT, have the potential to significantly reduce the dimensionality of word and sentence embeddings with minimal impact on performance and improved accuracy in downstream tasks, paving the way for more efficient NLP applications. Abstract: Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including Signal and Image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that Wavelet transforms can be applied to NLP capturing a variety of linguistic and semantic properties. In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality. We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. 
Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.
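
To make the compression idea concrete, here is a minimal sketch of how a discrete wavelet transform can shrink an embedding. It assumes the Haar wavelet and keeps only the approximation coefficients at each level; the abstract does not say which wavelet family or coefficient selection the authors use, and the `embedding` vector is a synthetic stand-in rather than output from a real model.

```python
import math

def haar_step(vec):
    """One level of the orthonormal Haar DWT on an even-length list."""
    s = math.sqrt(2.0)
    # Low-pass branch: scaled sums of neighbouring pairs (coarse shape).
    approx = [(vec[i] + vec[i + 1]) / s for i in range(0, len(vec), 2)]
    # High-pass branch: scaled differences of neighbouring pairs (fine detail).
    detail = [(vec[i] - vec[i + 1]) / s for i in range(0, len(vec), 2)]
    return approx, detail

def compress(vec, levels=1):
    """Apply `levels` Haar steps and keep only the approximation coefficients."""
    approx = list(vec)
    for _ in range(levels):
        approx, _detail = haar_step(approx)  # detail coefficients are discarded
    return approx

# Synthetic 768-dimensional vector standing in for a sentence embedding (assumption).
embedding = [math.sin(0.1 * i) for i in range(768)]
half = compress(embedding, levels=1)   # half the original dimensionality
small = compress(embedding, levels=4)  # one sixteenth of the original dimensionality
print(len(embedding), len(half), len(small))  # 768 384 48
```

Each level halves the dimensionality, so one level is a 50% reduction and four levels leave one sixteenth of the original dimensions (about a 94% reduction), which is in the neighbourhood of the 50-93% range quoted in the abstract. Whether downstream quality survives depends on the embedding model and task, which this sketch does not measure.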

[9] Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Bryce Anderson,Riley Galpin,Tom S. Juzek

Main category: cs.CL

TL;DR: The paper explores if the influence of Large Language Models has led to a shift in human language use, finding a moderate increase in LLM-associated word usage post-2022.

Details

Motivation: To determine if the changes in word usage attributed to the influence of Large Language Models (LLMs) reflect broader changes in the human language system itself. Method: A dataset of 22.1 million words from unscripted spoken language was constructed using conversational science and technology podcasts. Lexical trends were analyzed before and after ChatGPT's release in 2022, focusing on commonly LLM-associated words. Result: A moderate yet significant increase in the usage of LLM-associated words post-2022 was observed, suggesting a convergence between human word choices and LLM-associated patterns. Baseline synonym words, however, showed no significant directional shift. Conclusion: The study suggests that there may be a shift in human language use influenced by AI, particularly Large Language Models (LLMs), but whether this represents natural language change or a novel shift driven by AI exposure remains an open question. Abstract: In recent years, written language, particularly in science and education, has undergone remarkable shifts in word usage. These changes are widely attributed to the growing influence of Large Language Models (LLMs), which frequently rely on a distinct lexical style. Divergences between model output and target audience norms can be viewed as a form of misalignment. While these shifts are often linked to using Artificial Intelligence (AI) directly as a tool to generate text, it remains unclear whether the changes reflect broader changes in the human language system itself. To explore this question, we constructed a dataset of 22.1 million words from unscripted spoken language drawn from conversational science and technology podcasts. We analyzed lexical trends before and after ChatGPT's release in 2022, focusing on commonly LLM-associated words. Our results show a moderate yet significant increase in the usage of these words post-2022, suggesting a convergence between human word choices and LLM-associated patterns. 
In contrast, baseline synonym words exhibit no significant directional shift. Given the short time frame and the number of words affected, this may indicate the onset of a remarkable shift in language use. Whether this represents natural language change or a novel shift driven by AI exposure remains an open question. Similarly, although the shifts may stem from broader adoption patterns, it may also be that upstream training misalignments ultimately contribute to changes in human language use. These findings parallel ethical concerns that misaligned models may shape social and moral beliefs.

[10] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering

Peixian Li,Yu Tian,Ruiqi Tu,Chengkai Wu,Jingjing Ren,Jingsong Li

Main category: cs.CL

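TL;DR: The study proposes a new framework that integrates structured clinical reasoning and significantly improves the accuracy and reliability of large language models in medical diagnosis.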

Details

Motivation: Large language models perform well at medical text understanding and generation, but their diagnostic reliability in complex clinical scenarios remains limited. The study aims to improve their diagnostic accuracy and clinical reasoning ability. Method: The study proposes an Etiology-Aware Attention Steering Framework, which comprises Clinical Reasoning Scaffolding (CRS), an Etiology-Aware Head Identification algorithm, and Reasoning-Guided Parameter-Efficient Fine-tuning. Result: On the Consistent Diagnosis Cohort, the framework improves average diagnostic accuracy by 15.65% and the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort also confirms improved diagnostic accuracy. Conclusion: The study presents a practical and effective way to strengthen clinical reasoning in LLM-based diagnosis by aligning model attention with structured clinical reasoning scaffolding, offering a promising paradigm for more interpretable and reliable AI diagnostic systems. Abstract: Objective: Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. However, their diagnostic reliability in complex clinical scenarios remains limited. This study aims to enhance LLMs' diagnostic accuracy and clinical reasoning ability. Method: We propose an Etiology-Aware Attention Steering Framework to integrate structured clinical reasoning into LLM-based diagnosis. Specifically, we first construct Clinical Reasoning Scaffolding (CRS) based on authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. Next, we develop the Etiology-Aware Head Identification algorithm to pinpoint attention heads crucial for the model's etiology reasoning. To ensure reliable clinical reasoning alignment, we introduce the Reasoning-Guided Parameter-Efficient Fine-tuning that embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, our framework improves average diagnostic accuracy by 15.65% and boosts the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in enhancing diagnostic accuracy. Further assessments via Reasoning Attention Frequency indicate that our models exhibit enhanced reliability when faced with real-world complex scenarios. Conclusion: This study presents a practical and effective approach to enhance clinical reasoning in LLM-based diagnosis. 
By aligning model attention with structured CRS, the proposed framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.

[11] Systematic Evaluation of Optimization Techniques for Long-Context Language Models

Ammar Ahmed,Sheng Di,Franck Cappello,Zirui Liu,Jingoo Han,Ali Anwar

Main category: cs.CL

TL;DR: This paper benchmarks optimization techniques for large language models in long-context scenarios, revealing that combining methods can harm larger models due to compounded errors and emphasizing the importance of integrating system-level profiling with task-specific insights for optimal performance.

Details

Motivation: The motivation is to address the underexplored efficacy of optimization techniques in long-context scenarios and system evaluations for large language models, aiming to provide insights for practitioners and researchers to balance efficiency, accuracy, and scalability. Method: The paper systematically benchmarks optimization techniques (like pruning, quantization, and token dropping) on two LLM architectures, analyzing their individual and combined effects on performance metrics such as memory usage, latency, and throughput. It also evaluates scalability on a 70-billion parameter model. Result: The experiments revealed that naive combinations of optimization techniques can adversely affect larger models due to compounded approximation errors. Additionally, relying solely on F1 scores can obscure precision-recall trade-offs in question answering tasks. Conclusion: The paper concludes that combining optimization techniques for LLMs can lead to compounded approximation errors, especially in larger models, and that system-level profiling combined with task-specific insights is crucial for balancing efficiency, accuracy, and scalability. Abstract: Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. 
We subsequently study the scalability of individual optimization methods on a larger 70-billion-parameter model. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models, more so than their smaller counterparts, due to compounded approximation errors. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
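
The point about F1 hiding precision-recall trade-offs can be illustrated with a small worked example. The counts below are hypothetical, chosen only to show that two systems can share an F1 score while behaving very differently; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: the same number of correct items,
# but the two systems err in opposite directions.
cautious = precision_recall_f1(tp=60, fp=10, fn=30)  # few wrong items, many missed
verbose = precision_recall_f1(tp=60, fp=30, fn=10)   # few missed items, many wrong

for name, (p, r, f1) in [("cautious", cautious), ("verbose", verbose)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Both systems score F1 = 0.75, yet one has precision of about 0.857 with recall of about 0.667 and the other has the reverse. Reporting precision and recall next to F1 is one simple way to expose the kind of shift the abstract describes.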

[12] Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Kaiyan Zhao,Zhongtao Miao,Yoshimasa Tsuruoka

Main category: cs.CL

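TL;DR: MCSEO improves multimodal sentence embeddings by introducing fine-grained object-phrase alignment and consistently outperforms strong baselines on semantic textual similarity tasks.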

Details

Motivation: Multimodal sentence embedding models typically train on image-caption pairs, but these pairs can contain redundant or irrelevant information, so a method is needed to improve embedding quality. Method: MCSEO uses existing segmentation and object detection models to extract accurate object-phrase pairs, then uses those pairs to optimize a contrastive learning objective tailored to object-phrase correspondence. Result: Experiments show that MCSEO consistently outperforms strong baselines on semantic textual similarity (STS) tasks across different backbone models. Conclusion: By combining fine-grained object-phrase alignment with traditional image-caption alignment, MCSEO improves multimodal sentence embeddings and shows the importance of precise object-phrase alignment in multimodal representation learning. Abstract: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
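
As a rough illustration of what a contrastive objective over object-phrase pairs can look like, the sketch below computes an InfoNCE-style loss in which phrase `i` is the positive for object `i` and all other objects in the batch act as negatives. The exact loss used by MCSEO is not given in the abstract, so the InfoNCE form, the `temperature` value, and the toy two-dimensional vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(phrases, objects, temperature=0.05):
    """Mean InfoNCE loss where phrases[i] and objects[i] form the positive pair."""
    total = 0.0
    for i, phrase in enumerate(phrases):
        logits = [cosine(phrase, obj) / temperature for obj in objects]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(phrases)

# Toy embeddings (hypothetical): phrase vectors and object-region vectors.
phrases = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(phrases, [[0.9, 0.1], [0.1, 0.9]])     # positives match
misaligned = info_nce(phrases, [[0.1, 0.9], [0.9, 0.1]])  # positives swapped
print(aligned < misaligned)  # True
```

The loss is near zero when each phrase is closest to its own object and large when the pairing is swapped, which is the pressure a contrastive objective applies during training.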

[13] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Keer Lu,Chong Chen,Bin Cui,Huang Leng,Wentao Zhang

Main category: cs.CL

TL;DR: This paper introduces AdaPlan, a new agent paradigm for LLMs, and PilotRL, a training framework using reinforcement learning to improve long-term planning and execution, achieving superior performance over existing models.

Details

Motivation: Existing LLM agent paradigms like ReAct are limited in handling complex tasks requiring long-term strategic planning and suffer from poor generalization due to reliance on supervised fine-tuning. Method: Development of the adaptive global plan-based agent paradigm AdaPlan, and the global planning-guided training framework PilotRL, incorporating progressive reinforcement learning to enhance long-horizon decision-making and generalization ability. Result: LLaMA3.1-8B-Instruct combined with PilotRL outperformed GPT-4o by 3.60% and GPT-4o-mini by 55.78% at a comparable parameter scale. Conclusion: PilotRL, based on the AdaPlan paradigm, achieves state-of-the-art performance, surpassing even closed-source models like GPT-4o by significant margins. Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. 
We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.

[14] Lucy: edgerunning agentic web search on mobile with machine generated task vectors

Alan Dao,Dinh Bach Vu,Alex Nguyen,Norapat Buppodom

Main category: cs.CL

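TL;DR: This work proposes a new reasoning mechanism for small language models: by dynamically constructing and refining task vectors, a small model performs close to large models on knowledge-intensive tasks.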

Details

Motivation: Small language models are inherently limited on knowledge-intensive tasks, and most existing approaches treat reasoning as a fixed or heuristic process that cannot adapt dynamically. Method: The authors propose a new paradigm that views the model's internal reasoning as a dynamic task vector machine, optimize this process with RLVR, and combine it with MCP integration. Result: The resulting model, Lucy (1.7B parameters), reaches 78.3% accuracy on the SimpleQA benchmark, on par with much larger models such as DeepSeek-V3. Conclusion: Treating the model's internal reasoning as a dynamic task vector machine and optimizing it with RLVR lets small language models such as Lucy rival large models on knowledge-intensive tasks. Abstract: Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model's internal reasoning, delimited by and tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.

[15] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Jiyu Chen,Poh Seng Lim,Shuang Peng,Daxiong Luo,JungHau Foo,Yap Deep,Timothy Lee Jun Jie,Kelvin Teh Kae Wen,Fan Yang,Danyu Feng,Hao-Yun Chen,Peng-Wen Chen,Fangyuan Li,Xiaoxin Chen,Wong Wai Mun

Main category: cs.CL

TL;DR: This paper proposes EdgeInfinite-Instruct, a method to efficiently deploy large language models on edge devices, improving performance and reducing computational and memory costs.

Details

Motivation: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. Method: We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Result: EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. Conclusion: Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices. Abstract: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. 
To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.

[16] Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

Dingzirui Wang,Xuangliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

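TL;DR: This paper investigates why demonstrations in in-context learning (ICL) can be ineffective and proposes GradS, a gradient-flow-based method for selecting effective demonstrations, which improves model performance.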

Details

Motivation: Existing work assumes that the demonstrations in ICL are all effective, yet many studies show that not every demonstration improves performance, so the reasons behind demonstration ineffectiveness need to be examined. Method: The paper analyzes the causes of demonstration ineffectiveness using gradient flow and linear self-attention models, and proposes GradS, which uses gradient flow to select effective demonstrations. Result: As the number of model layers grows, the disparity in effectiveness among demonstrations is amplified; across four mainstream LLMs on five mainstream datasets, GradS achieves an average relative improvement of 6.8% over the strongest baselines. Conclusion: The paper explains why demonstrations become ineffective, proposes GradS, and validates its effectiveness experimentally. Abstract: Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes the effectiveness of the demonstrations provided within ICL, while many studies indicate that not all demonstrations are effective, failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on the relevance to the user query while overlooking the information that the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow of the demonstration with respect to a given user query as the criterion, thereby ensuring the effectiveness of the chosen ones. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the number of model layers increases, substantiating our derivations. 
Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.

Hengxing Cai,Jinhan Dong,Yijie Rao,Jingcheng Deng,Jingjun Tan,Qien Chen,Haidong Wang,Zhen Wang,Shiyu Huang,Agachai Sumalee,Renxin Zhong

Main category: cs.CL

TL;DR: This paper proposes SA-GCS, a new training framework combining Curriculum Learning with Reinforcement Learning for improved performance in UAV Vision-Language Navigation.

Details

Motivation: Existing RL methods face challenges in training data efficiency, convergence speed, and handling difficulty variation among samples. Method: Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS) integrates Curriculum Learning into Reinforcement Learning. Result: Experiments on the CityNav benchmark show SA-GCS outperforms baselines, achieves faster and stable convergence, and generalizes well across different model scales. Conclusion: SA-GCS improves training efficiency, accelerates convergence, and enhances overall model performance in UAV VLN tasks. Abstract: Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) aims to enable agents to accurately localize targets and plan flight paths in complex environments based on natural language instructions, with broad applications in intelligent inspection, disaster rescue, and urban monitoring. Recent progress in Vision-Language Models (VLMs) has provided strong semantic understanding for this task, while reinforcement learning (RL) has emerged as a promising post-training strategy to further improve generalization. However, existing RL methods often suffer from inefficient use of training data, slow convergence, and insufficient consideration of the difficulty variation among training samples, which limits further performance improvement. To address these challenges, we propose Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS), a novel training framework that systematically integrates Curriculum Learning (CL) into RL. SA-GCS employs a Semantic-Aware Difficulty Estimator (SA-DE) to quantify the complexity of training samples and a Gaussian Curriculum Scheduler (GCS) to dynamically adjust the sampling distribution, enabling a smooth progression from easy to challenging tasks. This design significantly improves training efficiency, accelerates convergence, and enhances overall model performance. 
Extensive experiments on the CityNav benchmark demonstrate that SA-GCS consistently outperforms strong baselines across all metrics, achieves faster and more stable convergence, and generalizes well across models of different scales, highlighting its robustness and scalability. The implementation of our approach is publicly available.
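
The idea of a Gaussian curriculum scheduler can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scheduler's formula, so the sketch assumes difficulty scores in [0, 1] (as a hypothetical difficulty estimator might produce), uses training progress as the Gaussian mean, and uses a fixed width `sigma`. All function and parameter names are hypothetical.

```python
import math
import random


def gaussian_curriculum_weights(difficulties, progress, sigma=0.15):
    """Return normalized sampling weights for each training sample.

    difficulties: per-sample scores in [0, 1] (assumed estimator output).
    progress: training progress in [0, 1], used here as the Gaussian mean
              (an assumption, not taken from the paper).
    sigma: width of the Gaussian (assumption). Very small values can make
           every weight underflow to zero, which is reported as an error.
    """
    if not difficulties:
        raise ValueError("difficulties must not be empty")
    raw = [math.exp(-((d - progress) ** 2) / (2 * sigma ** 2)) for d in difficulties]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all weights underflowed; increase sigma")
    return [w / total for w in raw]


def sample_batch(difficulties, progress, batch_size, sigma=0.15, rng=None):
    """Draw sample indices, favoring difficulties close to current progress."""
    rng = rng or random.Random()
    weights = gaussian_curriculum_weights(difficulties, progress, sigma)
    return rng.choices(range(len(difficulties)), weights=weights, k=batch_size)
```

Early in training (`progress` near 0) the weights concentrate on easy samples; as `progress` approaches 1 the mass shifts smoothly toward hard samples, which is the easy-to-challenging progression the abstract describes.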

[18] Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding

Rana Salama,Abdou Youssef,Mona Diab

Main category: cs.CL

TL;DR: This paper explores the application of wavelets and DCT to NLP tasks, specifically for consolidating information in word and sentence embeddings, with results showing improved performance over original embeddings.

Details

Motivation: Wavelets have shown promise in image and signal processing, suggesting potential for application in Natural Language Processing (NLP) tasks. Method: The paper applies Discrete Wavelet Transforms (DWT) to word and sentence embeddings and combines DWT with Discrete Cosine Transform (DCT) to compress sentences into fixed-size vectors. Result: The method effectively consolidates important information in word vectors while reducing dimensionality and proposes a non-parameterized model that compresses sentences with dense information into fixed-size vectors. Conclusion: The proposed paradigm's efficacy is demonstrated through downstream applications, yielding comparable and even superior results to original embeddings. Abstract: Wavelets have emerged as a cutting-edge technology in a number of fields. Concrete results of their application in image and signal processing suggest that wavelets can be effectively applied to Natural Language Processing (NLP) tasks that capture a variety of linguistic properties. In this paper, we leverage the power of applying Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We first evaluate, intrinsically and extrinsically, how wavelets can effectively be used to consolidate important information in a word vector while reducing its dimensionality. We further combine DWT with Discrete Cosine Transform (DCT) to propose a non-parameterized model that compresses a sentence with a dense amount of information into a fixed-size vector based on locally varying word features. We show the efficacy of the proposed paradigm on downstream application models, yielding comparable and even superior (in some tasks) results to the original embeddings.
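
One way such a DWT plus DCT pipeline could look is sketched below in plain Python. The abstract does not specify how the two transforms are combined, so this sketch makes its own assumptions: a single-level Haar DWT keeps only the approximation coefficients of each word vector (halving its dimensionality), and an unnormalized DCT-II along the word axis keeps the first `k` coefficients per dimension, padding with zeros for sentences shorter than `k`. The function names are hypothetical.

```python
import math


def haar_dwt_approx(vec):
    """Single-level Haar DWT approximation coefficients (halves the length)."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    s = math.sqrt(2.0)
    return [(vec[i] + vec[i + 1]) / s for i in range(0, len(vec), 2)]


def dct2(seq, k):
    """First k unnormalized DCT-II coefficients, zero-padded when k > len(seq)."""
    n = len(seq)
    out = []
    for j in range(k):
        if j < n:
            out.append(sum(x * math.cos(math.pi / n * (i + 0.5) * j)
                           for i, x in enumerate(seq)))
        else:
            out.append(0.0)
    return out


def sentence_embedding(word_vectors, k=2):
    """Compress a variable-length sentence into a vector of size k * (dim // 2)."""
    if not word_vectors:
        raise ValueError("sentence must contain at least one word vector")
    reduced = [haar_dwt_approx(v) for v in word_vectors]
    features = []
    for d in range(len(reduced[0])):
        column = [v[d] for v in reduced]  # one embedding dimension across words
        features.extend(dct2(column, k))
    return features
```

Because only the first `k` DCT coefficients are kept per dimension, sentences of any length map to vectors of the same size, and no parameters are learned.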

[19] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Minghao Guo,Xi Zhu,Jingyuan Huang,Kai Mei,Yongfeng Zhang

Main category: cs.CL

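TL;DR: This paper proposes ReaGAN, an agent-based graph neural network that uses node-level decision-making and retrieval-augmented generation to address the limitations of conventional GNNs in handling information imbalance and modeling global relationships.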

Details

Motivation: Because of their fixed propagation mechanisms, existing graph neural networks (GNNs) cannot handle the imbalance in node informativeness and ignore global semantic relationships in the graph. Method: The paper proposes ReaGAN, an agent-based framework in which each node acts as an agent that independently plans its next action based on its internal memory, combined with retrieval-augmented generation (RAG) to access semantically relevant content. Result: ReaGAN achieves competitive performance in few-shot in-context settings using a frozen LLM backbone, without fine-tuning. Conclusion: Through its agent-based framework, ReaGAN gives each node autonomous decision-making ability and uses RAG to build global relationships in the graph, demonstrating the potential of agentic planning and local-global retrieval in graph learning. Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness -- some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model's ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.

[20] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Yuqi Tang,Kehua Feng,Yunfeng Wang,Zhiwen Chen,Chengfei Lv,Gang Yu,Qiang Zhang,Keyan Ding

Main category: cs.CL

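TL;DR: This paper proposes an efficient multi-turn dialogue evaluator that aggregates the preference knowledge of multiple LLM judges into a single model, preserving the benefits of multi-judge feedback while substantially reducing evaluation cost. Experiments on multiple benchmarks show that the method outperforms existing baselines across diverse scenarios, demonstrating its efficiency and robustness.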

Details

Motivation: Evaluating the conversational abilities of large language models remains challenging. Existing LLM-as-a-judge methods suffer from various biases, and approaches that use multiple LLM judges, although effective, incur high computational overhead. Method: The paper proposes an efficient multi-turn dialogue evaluator that aggregates the preference knowledge of multiple LLM judges into a single model. Result: Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks show that the method outperforms existing baselines across diverse scenarios, demonstrating its efficiency and robustness. Conclusion: By aggregating the preference knowledge of multiple LLM judges into a single model, the proposed evaluator preserves the advantages of multi-judge feedback while substantially reducing evaluation cost, enabling fast and flexible dialogue quality assessment. Abstract: Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
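
A rough Python sketch of how multi-judge preferences might be folded into a single training target follows. The abstract does not describe the aggregation or the loss, so both choices here are assumptions made for illustration: pairwise votes are averaged into a soft label, and a single evaluator's predicted probability is scored against that label with binary cross-entropy. The function names are hypothetical.

```python
import math


def soft_preference_label(judge_votes):
    """Fraction of judges preferring response 'A' over response 'B'."""
    if not judge_votes:
        raise ValueError("at least one judge vote is required")
    if any(v not in ("A", "B") for v in judge_votes):
        raise ValueError("votes must be 'A' or 'B'")
    return sum(v == "A" for v in judge_votes) / len(judge_votes)


def binary_cross_entropy(pred_prob_a, target):
    """Loss of a single evaluator's P(A preferred) against the soft label."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(pred_prob_a, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Example: three judges vote, and one evaluator is scored against the aggregate.
target = soft_preference_label(["A", "A", "B"])
loss = binary_cross_entropy(0.5, target)
```

With a target like this, only the single evaluator has to run at inference time, which is the cost saving the abstract emphasizes.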

[21] GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts

Jeongwoo Kang,Markarit Vartampetian,Felix Herron,Yongxin Zhou,Diandra Fabre,Gabriela Gonzalez-Saez

Main category: cs.CL

TL;DR: This paper presents GETALP's submission to a question-answering task based on meeting transcripts, using a retrieval augmented generation system combined with Abstract Meaning Representations.

Details

Motivation: The motivation for this paper is to improve the quality of responses in a question-answering task based on meeting transcripts by incorporating Abstract Meaning Representations. Method: The authors propose three systems that combine retrieval augmented generation with Abstract Meaning Representations. Result: The results show that incorporating Abstract Meaning Representations leads to high-quality responses for approximately 35% of the questions and improves the ability to distinguish between different participants in 'who' questions. Conclusion: The conclusion is that combining retrieval augmented generation with Abstract Meaning Representations improves the quality of responses in a question-answering task based on meeting transcripts. Abstract: This paper documents GETALP's submission to the Third Run of the Automatic Minuting Shared Task at SIGDial 2025. We participated in Task B: question-answering based on meeting transcripts. Our method is based on a retrieval augmented generation (RAG) system and Abstract Meaning Representations (AMR). We propose three systems combining these two approaches. Our results show that incorporating AMR leads to high-quality responses for approximately 35% of the questions and provides notable improvements in answering questions that involve distinguishing between different participants (e.g., who questions).

[22] The Missing Parts: Augmenting Fact Verification with Half-Truth Detection

Yixuan Tang,Jincheng Wang,Anthony K. H. Tung

Main category: cs.CL

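TL;DR: This study proposes TRACER, a new framework for identifying claims in fact-checking that are misleading because of omitted information, and validates it on a new benchmark, PolitiFact-Hidden.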

Details

Motivation: Existing fact-checking systems struggle to identify claims that are factually correct yet misleading because critical context has been omitted, so a new framework is needed. Method: The paper proposes TRACER, a framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. Result: TRACER improves performance across multiple strong baselines, notably raising Half-True classification F1 by up to 16 points. Conclusion: TRACER is a modular re-assessment framework that can be integrated into existing fact-checking pipelines; by identifying omission-based misinformation, it handles half-true claims well and improves the reliability of fact verification. Abstract: Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about what is left unsaid. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification.

[23] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

Jiaxin Deng,Qingcheng Zhu,Junbiao Pang,Linlin Yang,Zhongqian Fu,Baochang Zhang

Main category: cs.CL

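TL;DR: This paper proposes EFlat-LoRA, which improves the generalization of LoRA by seeking flat minima in the low-rank subspace, and reports notable performance gains across multiple tasks and models.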

Details

Motivation: The relationship between LoRA's expressive ability and its generalization ability has received little study. Sharpness-Aware Minimization (SAM) is effective for CNNs and Transformers, but there has been no LoRA-specific method for exploring the relationship between sharpness and generalization. Method: The paper proposes Flat-LoRA and its efficient version, EFlat-LoRA, which transfer perturbations from the full parameter space to the low-rank subspace in order to seek flat minima. Result: Experiments show that on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively, and that on vision-language models such as Qwen-VL-Chat it improves performance by 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. Conclusion: EFlat-LoRA retains the efficiency of LoRA while improving generalization by seeking flat minima, and the results verify that LoRA's generalization is closely related to sharpness. Abstract: Little research explores the correlation between the expressive ability and generalization ability of low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat-LoRA and its efficient version, EFlat-LoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFlat-LoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, it shows performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is omitted by previous methods.

[24] The Prosody of Emojis

Giulio Zhou,Tsz Kin Lam,Alexandra Birch,Barry Haddow

Main category: cs.CL

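TL;DR: This study examines how emojis influence prosodic features in speech and shows that listeners can recover emoji meanings from prosodic variation, indicating that emojis convey prosodic intent in digital communication.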

Details

Motivation: Text-based settings lack prosodic features such as intonation, timing, and pitch, and emojis serve as visual surrogates that add affective and pragmatic nuance. The study investigates how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Method: Actual human speech data are collected through structured but open-ended production and perception tasks, and the relationship between prosody and emojis is analysed. Result: Speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to greater prosodic divergence. Conclusion: Emojis can act as meaningful carriers of prosodic intent in digital communication, offering insight into their communicative role in digitally mediated contexts. Abstract: Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emoji by analysing actual human speech data, collected through structured but open-ended production and perception tasks. This provides empirical evidence of how emoji semantics shape spoken delivery and perception. Results show that speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis can act as meaningful carriers of prosodic intent, offering insight into their communicative role in digitally mediated contexts.

[25] PaPaformer: Language Model from Pre-trained Paraller Paths

Joonas Tapaninaho,Mourad Oussala

Main category: cs.CL

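TL;DR: This paper proposes PaPaformer, an efficient decoder-only Transformer architecture that combines lower-dimensional parallel paths, reducing training time and the number of model parameters while improving performance.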

Details

Motivation: Training modern large language models requires ever more computation and time, and even small language models (SLMs) take several days to train, so more efficient training methods are needed. Method: The paper introduces PaPaformer, a decoder-only Transformer architecture variant that is trained and evaluated by combining lower-dimensional parallel paths into a larger model. Result: PaPaformer can be trained in hours rather than days or weeks, and its lower-dimensional paths can be trained individually on different types of data and then combined into one larger model. Conclusion: By training and combining lower-dimensional parallel paths, PaPaformer reduces the number of model parameters and the training time while improving performance, and it opens the possibility of customizing paths for specific task requirements. Abstract: The training of modern large-language models requires an increasing amount of computation power and time. Even smaller variants, such as small-language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days/weeks. We introduce PaPaformer, a decoder-only transformer architecture variant, whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method gives the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the use of a parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.

[26] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought

Jianwei Wang,Ziming Wu,Fuming Lai,Shaobing Lian,Ziqian Zeng

Main category: cs.CL

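TL;DR: SynAdapt is an efficient reasoning framework that uses synthetic continuous chain-of-thought (CCoT) as an alignment target to improve the accuracy and efficiency of large language models, and uses a difficulty classifier to identify hard questions for adaptive prompting, achieving the best accuracy-efficiency trade-off across various benchmarks.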

Details

Motivation: Existing continuous chain-of-thought (CCoT) methods suffer from indirect fine-tuning, limited alignment, or inconsistent targets, and models need to handle hard questions better. Method: SynAdapt generates synthetic CCoT as an alignment target and introduces a difficulty classifier to identify hard questions, then adaptively prompts the LLM to re-think them. Result: SynAdapt is effective on benchmarks of varying difficulty, achieving the best accuracy-efficiency trade-off. Conclusion: By combining synthetic CCoT with a difficulty classifier, SynAdapt provides an effective way to improve the reasoning ability of large language models. Abstract: While Chain-of-Thought (CoT) reasoning improves model performance, it incurs significant time costs due to the generation of discrete CoT tokens (DCoT). Continuous CoT (CCoT) offers a more efficient alternative, but existing CCoT methods are hampered by indirect fine-tuning, limited alignment, or inconsistent targets. To overcome these limitations, we propose SynAdapt, an innovative efficient reasoning framework. Specifically, SynAdapt generates the synthetic CCoT to serve as a precise and effective alignment target for LLMs. This synthetic CCoT explicitly guides the LLM to learn CCoT and derive accurate answers directly. Furthermore, relying solely on CCoT is insufficient for solving hard questions. To address this, SynAdapt integrates a difficulty classifier that leverages both question context and CCoT to identify hard questions. CCoT can effectively help identify hard questions after some brief reasoning. We then adaptively prompt the LLM to re-think these hard questions for improved performance. Extensive experimental results across various benchmarks from different difficulty levels strongly demonstrate the effectiveness of our method, achieving the best accuracy-efficiency trade-off.

[27] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

Mingruo Yuan,Shuyi Zhang,Ben Kao

Main category: cs.CL

TL;DR: CRUX improves confidence estimation for large language models by taking contextual information and consistency into account.

Details

Motivation: Current confidence estimation methods for large language models neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly when background knowledge is provided. Method: The CRUX framework uses two new metrics: contextual entropy reduction and unified consistency examination. Result: Experiments on three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) show that CRUX outperforms existing baselines, achieving the highest AUROC. Conclusion: CRUX is an effective confidence estimation framework that improves the reliability of large language models by integrating context faithfulness and consistency. Abstract: Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving a higher AUROC than existing baselines.
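
The notion of information gain from contrastive sampling with and without context can be made concrete with a small Python sketch. It is a simplification under stated assumptions, not the CRUX metric itself: the abstract gives no formula, so answers are treated as exact-match strings, entropy is the Shannon entropy (in nats) of the empirical answer distribution, and the function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def contextual_entropy_reduction(answers_without_context, answers_with_context):
    """Entropy drop when context is supplied; larger means context helped more."""
    return answer_entropy(answers_without_context) - answer_entropy(answers_with_context)


# Example: four samples disagree without context and agree once context is given.
gain = contextual_entropy_reduction(
    ["Paris", "Lyon", "Nice", "Lille"],
    ["Paris", "Paris", "Paris", "Paris"],
)
```

A real system would likely cluster semantically equivalent answers before counting them; exact string matching is used here only to keep the sketch self-contained.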

[28] GHTM: A Graph based Hybrid Topic Modeling Approach in Low-Resource Bengali Language

Farhana Haque,Md. Abdur Rahman,Sumon Ahmed

Main category: cs.CL

TL;DR: This paper proposes GHTM, a novel GCN and NMF-based topic modeling approach for Bengali, which outperforms existing methods and introduces a new Bengali dataset, NCTBText, to diversify available corpora.

Details

Motivation: Topic modeling in Bengali remains understudied due to morphological complexity and lack of resources, prompting the need for a more effective approach and diverse datasets. Method: The study proposes a novel GCN-based model, GHTM, which uses graph representations of document vectors and NMF decomposition to extract topics. It compares the model's performance against traditional methods (LDA, LSA, NMF) and contemporary frameworks (BERTopic, Top2Vec) on three Bengali datasets. Result: Experimental results show that the GHTM model surpasses other topic modeling techniques in topic coherence and diversity on Bengali datasets, demonstrating its effectiveness. Conclusion: The proposed GHTM model outperforms traditional and contemporary Bengali topic modeling techniques in topic coherence and diversity, and the study introduces a new Bengali dataset, NCTBText, to enrich existing corpora. Abstract: Topic modeling is a Natural Language Processing (NLP) technique that is used to identify latent themes and extract topics from text corpora by grouping similar documents based on their most significant keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to its morphological complexity, lack of adequate resources and initiatives. In this contribution, a novel Graph Convolutional Network (GCN) based model called GHTM (Graph-Based Hybrid Topic Model) is proposed. This model represents input vectors of documents as nodes in the graph, which GCN uses to produce semantically rich embeddings. The embeddings are then decomposed using Non-negative Matrix Factorization (NMF) to get the topical representations of the underlying themes of the text corpus. This study compares the proposed model against a wide range of Bengali topic modeling techniques, from traditional methods such as LDA, LSA, and NMF to contemporary frameworks such as BERTopic and Top2Vec on three Bengali datasets. 
The experimental results demonstrate the effectiveness of the proposed model by outperforming other models in topic coherence and diversity. In addition, we introduce a novel Bengali dataset called "NCTBText" sourced from Bengali textbook materials to enrich and diversify the predominantly newspaper-centric Bengali corpora.

[29] Prompting Science Report 3: I'll pay you or I'll kill you -- but will you care?

Lennart Meincke,Ethan Mollick,Lilach Mollick,Dan Shapiro

Main category: cs.CL

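TL;DR: This report studies how prompting strategies affect AI model performance, finding that simple prompt variations are generally ineffective, although certain prompts can significantly affect specific questions.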

Details

Motivation: To help business, education, and policy leaders better understand the technical details of working with AI, and to test some commonly held beliefs about prompting. Method: The report evaluates how threatening or incentivizing prompts affect model performance on the GPQA and MMLU-Pro benchmarks. Result: Experiments show that threats and incentives generally have no significant effect on model performance, although small prompt variations can produce significant differences on individual questions. Conclusion: Simple prompt variations may not be as effective as previously assumed, especially for difficult problems. However, specific prompting approaches can significantly affect individual questions. Abstract: This is the third in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate two commonly held prompting beliefs: a) offering to tip the AI model and b) threatening the AI model. Tipping was a commonly shared tactic for improving AI performance and threats have been endorsed by Google Founder Sergey Brin (All-In, May 2025, 8:20) who observed that 'models tend to do better if you threaten them,' a claim we subject to empirical testing here. We evaluate model performance on GPQA (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024). We demonstrate two things: - Threatening or tipping a model generally has no significant effect on benchmark performance. - Prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question. Taken together, this suggests that simple prompting variations might not be as effective as previously assumed, especially for difficult problems. However, as reported previously (Meincke et al. 2025a), prompting approaches can yield significantly different results for individual questions.

[30] DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Shantanu Thorat,Andrew Caines

Main category: cs.CL

TL;DR: This paper introduces DACTYL, a new dataset for AIG text detection that focuses on one-shot/few-shot generations and domain-specific CPT models. It also compares the performance of BCE-trained and DXO classifiers, showing that DXO classifiers generalize better, particularly on out-of-distribution texts.

Details

Motivation: Current AIG text detectors perform poorly in real-world settings, particularly with one-shot/few-shot and CPT-generated texts. This paper aims to address this issue by introducing a more robust dataset and exploring new optimization methods. Method: The paper introduces DACTYL, a dataset focused on one-shot/few-shot generations and texts from domain-specific CPT models. The authors train classifiers using standard binary cross-entropy (BCE) optimization and deep X-risk optimization (DXO), comparing their performance on the DACTYL test set and out-of-distribution texts. Result: Many existing AIG text detectors struggle significantly on the DACTYL dataset. BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, but DXO classifiers excel on out-of-distribution (OOD) texts and outperform BCE-trained classifiers in a mock deployment scenario involving student essays. Conclusion: DACTYL dataset proves to be a challenging benchmark for AIG text detectors, exposing vulnerabilities in their performance on one-shot/few-shot and CPT-generated texts. DXO classifiers are shown to have better generalization capabilities, especially in out-of-distribution scenarios. Abstract: Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. We rigorously examine the machine-learning procedure to build these detectors to address this. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as an example. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focusing on one-shot/few-shot generations. 
We also include texts from domain-specific continued-pre-trained (CPT) language models, where we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, the latter excels on out-of-distribution (OOD) texts. In our mock deployment scenario in student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 score points at the lowest false positive rates for both. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.
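The comparison above is reported in macro-F1, which averages the per-class F1 scores so that the human-written and AI-generated classes count equally. The following is a minimal sketch of that metric for binary labels; the function name `macro_f1` and the toy labels are illustrative assumptions, not code or data from the paper.

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores (illustrative helper, not from the paper)."""
    f1_scores = []
    for cls in classes:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0.0:
            f1_scores.append(0.0)
        else:
            f1_scores.append(2 * precision * recall / (precision + recall))
    return sum(f1_scores) / len(f1_scores)


# Toy example: 1 = AI-generated, 0 = human-written (made-up labels).
score = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the detector misses one AI-generated text, so the F1 of class 1 is 2/3, the F1 of class 0 is 4/5, and the macro average is 11/15.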

[31] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Wenxuan Wang,Zizhan Ma,Meidan Ding,Shiyi Zheng,Shengyuan Liu,Jie Liu,Jiaming Ji,Wenting Chen,Xiang Li,Linlin Shen,Yixuan Yuan

Main category: cs.CL

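TL;DR: This paper systematically reviews the development of large language models in medicine, proposes a taxonomy of reasoning enhancement techniques, analyzes their applications across domains, and outlines future research directions and challenges.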

Details

Motivation: Although large language models show strong capabilities in medicine, they still fall short in systematic, transparent, and verifiable reasoning, which has driven the development of models designed specifically for medical reasoning. Method: The paper systematically analyzes 60 seminal studies from 2022-2025, proposes a taxonomy of reasoning enhancement techniques, and surveys the evolution of evaluation benchmarks. Result: The paper presents a clear taxonomy of reasoning enhancement techniques, analyzes how these techniques are used across data modalities and clinical applications, and surveys the growing sophistication of evaluation benchmarks. Conclusion: The paper summarizes the state of large language models in medicine, proposes a taxonomy of training-time and test-time reasoning enhancement techniques, and identifies the challenges and directions for building efficient, robust, and socially responsible medical AI. Abstract: The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.

[32] MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

Farhan Farsi,Farnaz Aghababaloo,Shahriar Shariati Motlagh,Parsa Ghofrani,MohammadAli SadraeiJavaheri,Shayan Bali,Amirhossein Shabani,Farbod Bijary,Ghazal Zamaninejad,AmirMohammad Salehoof,Saeedeh Momtazi

Main category: cs.CL

TL;DR: This study addresses the lack of cultural and linguistic evaluation resources for non-English LLMs by introducing 19 new datasets focused on Persian language and Iranian culture and benchmarking 41 LLMs.

Details

Motivation: The motivation stems from the lack of evaluation resources for languages other than English and the cultural bias of most LLMs toward European and American contexts, which limits their familiarity with non-Western cultures like Iranian culture. Method: The study introduces 19 new evaluation datasets focused on the Persian language and Iranian culture, including topics such as Iranian law, Persian grammar, idioms, and university entrance exams. These datasets were used to benchmark 41 prominent LLMs. Result: The study successfully introduces 19 new evaluation datasets and benchmarks 41 prominent LLMs, addressing the gap in cultural and linguistic evaluation for Persian language and Iranian culture. Conclusion: The study contributes to bridging the cultural and linguistic evaluation gap in LLMs by focusing on Persian language and Iranian culture, offering 19 new evaluation datasets and benchmarking 41 LLMs. Abstract: As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.

[33] Team "better_call_claude": Style Change Detection using a Sequential Sentence Pair Classifier

Gleb Schmidt,Johannes Römisch,Mariia Halchynska,Svetlana Gorovaia,Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: This paper proposes a Sequential Sentence Pair Classifier (SSPC) for fine-grained style change detection, effectively identifying shifts at the sentence level across datasets of increasing difficulty.

Details

Motivation: Style change detection at the sentence level is a challenging problem in computational authorship analysis, especially when dealing with short, stylistically shallow sentences prevalent in the PAN 2025 shared task. Method: A Sequential Sentence Pair Classifier (SSPC) was developed, using a pre-trained language model (PLM) to represent sentences, a bidirectional LSTM (BiLSTM) to contextualize them, and a multi-layer perceptron to predict style switches between adjacent sentences. Result: The model achieved macro-F1 scores of 0.923 (EASY), 0.828 (MEDIUM), and 0.724 (HARD), outperforming both random baselines and Claude-3.7-sonnet's zero-shot performance. Conclusion: The proposed Sequential Sentence Pair Classifier (SSPC) effectively detects style changes at the sentence level, particularly for short and stylistically shallow sentences, outperforming baselines and achieving strong macro-F1 scores across three datasets of varying difficulty. Abstract: Style change detection - identifying the points in a document where writing style shifts - remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance - that is, a series of sentences - as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron for prediction per adjacency. 
Building on the work of previous PAN participants and classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year's shared task: the notorious problem of "stylistically shallow", short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN 2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet's zero-shot performance.

[34] Segment First, Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries

Shubham Kumar Nigam,Tanmay Dubey,Noel Shallum,Arnab Bhattacharya

Main category: cs.CL

TL;DR: TraceRetriever is a scalable and reliable pipeline for legal precedent retrieval that works with limited case information, integrating multiple models for improved search accuracy.

Details

Motivation: The growing complexity and volume of legal documents challenge traditional retrieval methods, and real-world legal search often operates with limited case information. Method: TraceRetriever integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier. Result: Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever effectively addresses the challenges of growing document volume while aligning with practical search constraints. Conclusion: TraceRetriever provides a reliable and scalable solution for legal precedent retrieval, enhancing legal research when only partial case knowledge is available. Abstract: Legal precedent retrieval is a cornerstone of the common law system, governed by the principle of stare decisis, which demands consistency in judicial decisions. However, the growing complexity and volume of legal documents challenge traditional retrieval methods. TraceRetriever mirrors real-world legal search by operating with limited case information, extracting only rhetorically significant segments instead of requiring complete documents. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments. Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever addresses growing document volume challenges while aligning with practical search constraints, providing a reliable and scalable foundation for precedent retrieval that enhances legal research when only partial case knowledge is available.
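The fusion step named in this abstract, Reciprocal Rank Fusion, can be sketched in a few lines. This is a generic illustration of the standard formula (each document scores the sum of 1 / (k + rank) over the input rankings), not the authors' implementation; the function name, the constant k = 60, and the case identifiers are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant; 60 is a common default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (BM25) and a dense (vector) retriever.
bm25_ranking = ["case_1", "case_2", "case_3"]
dense_ranking = ["case_2", "case_3", "case_1"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["case_2", "case_1", "case_3"]
```

Because RRF uses only ranks, it needs no score normalization between the lexical and dense retrievers; in a pipeline like the one described, the fused list would then be passed to a cross-encoder for final re-ranking.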

[35] Better Call Claude: Can LLMs Detect Changes of Writing Style?

Johannes Römisch,Svetlana Gorovaia,Mariia Halchynska,Gleb Schmidt,Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: This study shows that modern large language models can detect subtle sentence-level style changes in multi-author texts, outperforming traditional baselines and showing increased sensitivity to stylistic rather than semantic differences.

Details

Motivation: The motivation behind this study is to understand the capabilities of state-of-the-art LLMs in detecting subtle writing style changes at the sentence level, which is a challenging task in authorship analysis. The work also aims to establish a baseline performance for the PAN competition and investigate whether LLMs can detect purely stylistic differences independent of semantic content. Method: The authors benchmarked four state-of-the-art large language models (LLMs) using the official PAN 2024 and 2025 'Multi-Author Writing Style Analysis' datasets to evaluate zero-shot performance on sentence-level style change detection. They analyzed model sensitivity to stylistic variations and assessed the influence of semantics on predictions. Result: The results show that LLMs are sensitive to variations in writing style even at the sentence level. They outperformed existing PAN competition baselines, establishing a new challenging benchmark. The study also found evidence suggesting that the latest LLMs are more responsive to content-independent, stylistic cues than previously believed. Conclusion: The study concludes that state-of-the-art LLMs are sensitive to writing style variations at the sentence level and establish a challenging baseline for style change detection, with evidence suggesting that these models are more attuned to content-independent, purely stylistic signals. Abstract: This article explores the zero-shot performance of state-of-the-art large language models (LLMs) on one of the most challenging tasks in authorship analysis: sentence-level style change detection. Benchmarking four LLMs on the official PAN 2024 and 2025 "Multi-Author Writing Style Analysis" datasets, we present several observations. First, state-of-the-art generative models are sensitive to variations in writing style - even at the granular level of individual sentences. 
Second, their accuracy establishes a challenging baseline for the task, outperforming suggested baselines of the PAN competition. Finally, we explore the influence of semantics on model predictions and present evidence suggesting that the latest generation of LLMs may be more sensitive to content-independent and purely stylistic signals than previously reported.

[36] NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System

Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Shivam Mishra,Ajay Varghese Thomas,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya

Main category: cs.CL

TL;DR: The NyayaRAG framework combines case facts, statutes, and precedents to significantly improve legal judgment prediction and explanation.

Details

Motivation: Existing work relies mainly on internal case content and overlooks a core element of common law systems, namely the reliance on statutory provisions and judicial precedents, so models that better reflect real courtroom scenarios are needed. Method: The authors propose the NyayaRAG framework, which combines factual case descriptions, relevant legal statutes, and semantically retrieved prior cases, and evaluate its effectiveness in predicting court decisions and generating legal explanations. Result: Augmenting the inputs with structured legal knowledge significantly improves predictive accuracy and explanation quality, as verified across multiple input configurations and evaluation metrics. Conclusion: By combining factual case descriptions, relevant legal statutes, and semantically retrieved prior cases, NyayaRAG significantly improves the accuracy of legal judgment prediction and the quality of its explanations. Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

[37] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Yingxu Wang,Shiqi Fan,Mengzhu Wang,Siwei Liu

Main category: cs.CL

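TL;DR: This paper proposes DAMR, a new knowledge graph question answering method that combines symbolic search with adaptive path evaluation, addresses the shortcomings of existing methods, and performs strongly on multiple benchmarks.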

Details

Motivation: Existing KGQA methods suffer from limited adaptability, high computational cost, and inaccurate path evaluation. Method: The paper proposes DAMR, a new method that combines symbolic search with adaptive path evaluation. Result: Experiments show that DAMR significantly outperforms state-of-the-art methods on multiple KGQA benchmarks. Conclusion: DAMR significantly outperforms existing KGQA methods. Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either a retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-$k$ relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. 
Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
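To make the search component concrete, the sketch below shows two generic building blocks that a planner-guided MCTS over relations could use: pruning candidate relations to the top-k, and choosing which child to explore. The abstract does not state DAMR's selection rule, so the UCB1 formula, the exploration constant, the function names, and the relation scores (standing in for an LLM planner's relevance judgments) are all assumptions for illustration.

```python
import math


def top_k_relations(scored_relations, k):
    """Keep the k highest-scoring relations.

    scored_relations: dict mapping relation name -> relevance score
    (hypothetical stand-in for an LLM planner's output).
    """
    ordered = sorted(scored_relations.items(), key=lambda kv: (-kv[1], kv[0]))
    return [relation for relation, _ in ordered[:k]]


def uct_select(children, total_visits, c=1.4):
    """Pick the child with the highest UCB1 value (assumed rule, standard in MCTS).

    children: dict mapping child name -> (visit_count, accumulated_value).
    """
    def ucb(stats):
        visits, value = stats
        if visits == 0:
            return math.inf  # always try unvisited children first
        return value / visits + c * math.sqrt(math.log(total_visits) / visits)

    # Iterate in sorted order so ties resolve deterministically.
    return max(sorted(children), key=lambda name: ucb(children[name]))


candidates = {"born_in": 0.9, "spouse": 0.2, "capital_of": 0.7}
pruned = top_k_relations(candidates, k=2)  # ["born_in", "capital_of"]
```

Pruning before expansion keeps the branching factor at k per step, which is the search-space reduction the abstract attributes to the planner; the learned path scorer described in the abstract would supply the values that this sketch simply takes as given.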

[38] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

Sohaib Imran,Rob Lamb,Peter M. Atkinson

Main category: cs.CL

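TL;DR: The study finds that LLMs such as GPT 4o can reason over information in their training data to infer a chatbot's name and behavioral traits, which has implications for situational awareness in LLMs and for AI safety.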

Details

Motivation: The study explores whether large language models can reason about information in their training data, in particular whether they can perform abduction without the relevant facts in context. Method: The experiments train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots, and test whether the LLMs can infer the chatbots' names and behavioral traits. Result: GPT 4o can correctly infer at least one chatbot's name from responses characteristic of that chatbot, and prior training on behavior descriptions allows it to display behaviors more characteristic of the chatbot. Conclusion: LLMs such as GPT 4o can infer the most plausible explanation from relevant information in their training data, which indicates a capacity for situational awareness and has potential implications for AI safety. Abstract: Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI's GPT 4o LLM can correctly infer at least one chatbot's name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot's behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.

[39] Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents

Sarah Mercer,Daniel P. Martin,Phil Swatton

Main category: cs.CL

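TL;DR: This paper examines how well persona-based generative agents can represent human populations by recreating the HEXACO personality inventory experiment, running factor analysis on responses from GPT-4 powered agents, and comparing the results with the 2004 findings of Ashton, Lee, and Goldberg.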

Details

Motivation: To explore the validity of generative agents in social science research, particularly their potential for use in personality trait surveys. Method: The authors recreate the HEXACO personality inventory experiment by surveying 310 GPT-4 powered agents and conducting factor analysis on their responses, and they run a cross-model analysis to assess variability in personality profiling. Result: 1) A coherent and reliable personality structure was recoverable from the agents' responses, partially aligned with the HEXACO framework; 2) the derived personality dimensions were consistent and reliable within GPT-4 when coupled with a sufficiently curated population; 3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. Conclusion: Generative agents show potential for social science research but also face challenges and limitations; the paper provides practical guidance on designing consistent and representative agent personas. Abstract: Generative agents powered by Large Language Models demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations; we recreate the HEXACO personality inventory experiment by surveying 310 GPT-4 powered agents, conducting factor analysis on their responses, and comparing these results to the original findings presented by Ashton, Lee, & Goldberg in 2004. Our results found 1) a coherent and reliable personality structure was recoverable from the agents' responses demonstrating partial alignment to the HEXACO framework, 2) the derived personality dimensions were consistent and reliable within GPT-4, when coupled with a sufficiently curated population, and 3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. We discuss the practical considerations and challenges encountered during the experiment. This study contributes to the ongoing discourse on the potential benefits and limitations of using generative agents in social science research and provides useful guidance on designing consistent and representative agent personas to maximise coverage and representation of human personality traits.

[40] Agentic large language models improve retrieval-based radiology question answering

Sebastian Wind,Jeta Sopa,Daniel Truhn,Mahshad Lotfinia,Tri-Thien Nguyen,Keno Bressem,Lisa Adams,Mirabela Rusu,Harald Köstler,Gerhard Wellein,Andreas Maier,Soroosh Tayebi Arasteh

Main category: cs.CL

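TL;DR: This study develops a new agentic RAG framework that combines LLMs with a radiology knowledge base, improving the diagnostic accuracy and factuality of radiology question answering, with particularly strong gains for mid-sized models.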

Details

Motivation: Traditional single-step retrieval-augmented generation systems are limited in handling complex radiology questions, so a more effective framework is needed to improve the accuracy of clinical decision-making. Method: The authors propose an agentic RAG framework that lets LLMs autonomously decompose questions, iteratively retrieve radiology evidence, and dynamically synthesize responses, and they evaluate 24 LLMs on 104 expert-curated questions. Result: The agentic RAG framework significantly improved diagnostic accuracy (73% vs. 64%), reduced hallucinations (mean 9.4%), and retrieved clinically relevant information in 46% of cases, with especially strong gains for mid-sized and small-scale models. Conclusion: The study shows that an agentic RAG framework significantly improves the diagnostic accuracy and factual grounding of LLM-based radiology question answering, particularly for mid-sized models; future work is needed to validate its clinical utility. Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. 
These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
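The decompose, retrieve, and synthesize loop described in this abstract can be outlined as a small control-flow sketch. It is not the authors' system: the three callables are hypothetical placeholders for LLM and search components, and the step limit is an assumed safeguard.

```python
def agentic_answer(question, decompose, retrieve, synthesize, max_steps=3):
    """Answer a question by retrieving evidence for each sub-question.

    decompose, retrieve, and synthesize are caller-supplied callables
    (hypothetical stand-ins for an LLM planner, a knowledge-base search,
    and an LLM answer writer). max_steps caps the number of retrieval rounds.
    """
    evidence = []
    for sub_question in decompose(question)[:max_steps]:
        # One retrieval round per sub-question, accumulating evidence.
        evidence.extend(retrieve(sub_question))
    return synthesize(question, evidence)


# Toy stand-ins so the control flow can be exercised without any model.
def toy_decompose(question):
    return [part.strip() for part in question.split(" and ")]


def toy_retrieve(sub_question):
    return [f"note about {sub_question}"]


def toy_synthesize(question, evidence):
    return f"{len(evidence)} evidence items used"


answer = agentic_answer("finding A and finding B", toy_decompose, toy_retrieve, toy_synthesize)
```

The contrast with single-step RAG is that retrieval runs once per sub-question instead of once per query, which is what lets evidence for different parts of a complex case accumulate before the final answer is written.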

[41] GLiDRE: Generalist Lightweight model for Document-level Relation Extraction

Robin Armingaud,Romaric Besançon

Main category: cs.CL

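TL;DR: This paper proposes GLiDRE, a new document-level relation extraction model that builds on the key ideas of GLiNER and performs strongly in few-shot scenarios.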

Details

Motivation: The performance of current relation extraction methods in zero-shot or few-shot settings remains largely unexplored, and the GLiNER model has shown that a compact NER model can outperform much larger large language models. Method: Building on the key ideas of GLiNER, the authors construct GLiDRE, a new document-level relation extraction model, and benchmark it against state-of-the-art models across various data settings on the Re-DocRED dataset. Result: Experiments on the Re-DocRED dataset show that GLiDRE performs strongly in few-shot scenarios. Conclusion: GLiDRE achieves state-of-the-art performance in few-shot scenarios, and its code is publicly available. Abstract: Relation Extraction (RE) is a fundamental task in Natural Language Processing, and its document-level variant poses significant challenges, due to the need to model complex interactions between entities across sentences. Current approaches, largely based on the ATLOP architecture, are commonly evaluated on benchmarks like DocRED and Re-DocRED. However, their performance in zero-shot or few-shot settings remains largely underexplored due to the task's complexity. Recently, the GLiNER model has shown that a compact NER model can outperform much larger Large Language Models. With a similar motivation, we introduce GLiDRE, a new model for document-level relation extraction that builds on the key ideas of GLiNER. We benchmark GLiDRE against state-of-the-art models across various data settings on the Re-DocRED dataset. Our results demonstrate that GLiDRE achieves state-of-the-art performance in few-shot scenarios. Our code is publicly available.

[42] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

Qiyao Xue,Yuchen Dou,Ryan Shi,Xiang Lorraine Li,Wei Gao

Main category: cs.CL

TL;DR: MMBERT is a novel multimodal framework for hate speech detection in Chinese social networks, combining textual, speech, and visual data through a Mixture-of-Experts (MoE) architecture to outperform existing models.

Details

Motivation: Hate speech detection on Chinese social networks poses unique challenges due to the use of cloaking techniques and limited focus on multimodal strategies for the Chinese context, despite the recent advancements of large language models (LLMs) in this domain. Method: MMBERT uses a BERT-based multimodal framework integrating textual, speech, and visual modalities via a Mixture-of-Experts (MoE) architecture, supported by a progressive three-stage training paradigm, modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy. Result: MMBERT significantly outperforms fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs using in-context learning approaches on several Chinese hate speech datasets, proving its robustness against adversarial perturbations. Conclusion: The proposed MMBERT framework demonstrates superior performance over existing models in detecting hate speech on Chinese social networks, particularly by addressing the challenges of cloaking techniques and adversarial perturbations through its multimodal approach and MoE architecture. Abstract: Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. 
MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results on several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.

[43] ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation

Atakan Site,Emre Hakan Erdemir,Gülşen Eryiğit

Main category: cs.CL

TL;DR: This paper introduces a zero-shot system using LLMs to generate Python code for answering questions on tabular data, achieving strong results in the SemEval-2025 Task 8 competition.

Details

Motivation: The motivation is to address the challenge of question answering over tabular data from diverse domains using a zero-shot approach, leveraging the power of large language models for code generation. Method: The authors propose a Python code generation framework using state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Their approach is zero-shot and applied to both DataBench QA and DataBench Lite QA subtasks. Result: Different LLMs showed varying effectiveness in Python code generation, but overall, code generation outperformed alternative approaches. The system ranked eighth in Subtask I and sixth in Subtask II among systems that beat the baseline. Conclusion: The paper concludes that leveraging LLM-based code generation, particularly with open-source models, is effective for tabular question answering in the DataBench task. Their system achieved strong results, placing eighth in Subtask I and sixth in Subtask II among systems outperforming the baseline. Abstract: This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we propose a Python code generation framework utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. 
Although our ranking among zero-shot systems is unknown at the time of this paper's submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.

[44] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

Xushuo Tang,Yi Ding,Zhengyi Yang,Yin Chen,Yongrui Gu,Wenke Yang,Mingchen Ju,Xin Cao,Yongfei Liu,Wenjie Zhang

Main category: cs.CL

TL;DR: MISGENDERED+ evaluates modern LLMs on inclusive pronoun handling, showing progress but highlighting ongoing challenges with neopronouns and identity-sensitive reasoning.

Details

Motivation: The motivation stems from the increasing deployment of LLMs in sensitive contexts where fairness and inclusivity are crucial, and prior benchmarks like MISGENDERED were outdated and limited in scope. Method: The authors introduced MISGENDERED+, an updated benchmark for evaluating LLMs' pronoun fidelity, and tested five models—GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5—across zero-shot, few-shot, and gender identity inference tasks. Result: Results indicate significant improvements in pronoun handling compared to earlier models, particularly in binary and gender-neutral pronoun accuracy, but inconsistencies remain in neopronouns and reverse inference tasks. Conclusion: The study concludes that while there have been notable improvements in handling binary and gender-neutral pronouns in updated LLMs, accuracy on neopronouns and reverse inference tasks remains inconsistent, highlighting ongoing challenges in inclusive AI development. Abstract: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs' handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. 
We discuss implications, model-specific observations, and avenues for future inclusive AI research.

[45] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Jinsong Li,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang,Dahua Lin

Main category: cs.CL

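TL;DR: DAEDAL is a training-free denoising strategy for diffusion large language models that dynamically adjusts the generation length, improving generation efficiency and performance.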

Details

Motivation: Diffusion large language models (DLLMs) are constrained by a statically predefined generation length, which leads to poor performance on complex tasks or excessive computational overhead. Method: DAEDAL operates in two phases: 1) before denoising, it iteratively expands the initial length according to the needs of the task; 2) during denoising, it dynamically expands under-generated regions by inserting mask tokens. Result: Across multiple experiments, DAEDAL performs comparably to, and in some cases better than, carefully tuned fixed-length baselines, while raising the effective token ratio and improving computational efficiency. Conclusion: By exploiting the model's internal signals, DAEDAL achieves dynamic adaptive length expansion, resolving the static length constraint of diffusion large language models and improving computational efficiency and generation capability. Abstract: Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. 
By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
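
The first phase described above (start short, expand until the length looks sufficient) can be pictured with a small loop. This is only a sketch under stated assumptions: `completion_score` is a hypothetical callback standing in for the paper's sequence completion metric, and the step size, threshold, and length bounds are illustrative values that do not come from the abstract.

```python
from typing import Callable


def expand_initial_length(
    completion_score: Callable[[int], float],
    initial_len: int = 32,
    max_len: int = 1024,
    step: int = 32,
    threshold: float = 0.9,
) -> int:
    """Grow the generation length until a completion score passes a threshold.

    `completion_score(length)` is a hypothetical callback returning a value in
    [0, 1] that says how confident the model is that `length` tokens suffice.
    """
    if initial_len <= 0 or step <= 0 or max_len < initial_len:
        raise ValueError("expected 0 < initial_len <= max_len and step > 0")
    length = initial_len
    # Keep expanding while the score is too low and there is room to grow.
    while length < max_len and completion_score(length) < threshold:
        length = min(length + step, max_len)
    return length
```

A caller would pass a function that queries the model at the candidate length; the second phase (mask token insertion during denoising) is not covered by this sketch.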

cs.CV [Back]

[46] A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

Jie Zhu,Yiyang Su,Minchul Kim,Anil Jain,Xiaoming Liu

Main category: cs.CV

TL;DR: This paper proposes QME, a novel learnable score-fusion framework for whole-body biometric recognition that improves performance by addressing model misalignment and data variability issues.

Details

Motivation: Whole-body biometric recognition is a complex multimodal task, and traditional score-fusion methods struggle with variations in score distributions and data quality, limiting performance. Method: The authors introduced QME, a learnable score-fusion strategy using a Mixture of Experts (MoE), along with a pseudo-quality loss and a score triplet loss to improve performance. Result: Extensive experiments demonstrated that QME outperforms baseline methods, achieving state-of-the-art results across multiple whole-body biometric datasets. Conclusion: The proposed QME framework effectively enhances whole-body biometric recognition by addressing challenges like model misalignment and data variability, achieving state-of-the-art results. Abstract: Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. 
Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in both multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality.
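
For context, the conventional baseline mentioned in the abstract (weighted averaging of similarity matrices from each model) might look like the sketch below. It illustrates the fixed-weight baseline only, not QME's learned fusion; the function name and the plain nested-list matrix representation are assumptions made for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def fuse_scores(score_matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted average of same-shaped similarity matrices, one per model."""
    if not score_matrices or len(score_matrices) != len(weights):
        raise ValueError("need one weight per score matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rows, cols = len(score_matrices[0]), len(score_matrices[0][0])
    for matrix in score_matrices:
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            raise ValueError("all score matrices must share one shape")
    # Each fused entry is the normalized weighted sum over models.
    return [
        [
            sum(w * m[i][j] for w, m in zip(weights, score_matrices)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

Because the weights here are fixed for every probe, a low-quality input in one modality is weighted the same as a clean one, which is the limitation a quality-guided, learnable fusion aims to address.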

[47] Punching Bag vs. Punching Person: Motion Transferability in Videos

Raiyaan Abdullah,Jared Claypoole,Michael Cogswell,Ajay Divakaran,Yogesh Rawat

Main category: cs.CV

TL;DR: This paper explores the transferability of high-level motion concepts in action recognition models, introduces a new framework with three datasets, and highlights challenges in novel contexts while proposing ways to improve recognition through disentangling coarse and fine motions.

Details

Motivation: The paper investigates whether action recognition models can effectively transfer high-level motion concepts across diverse contexts, even within similar distributions, such as recognizing "punching" in unseen variations like "punching person." Method: A motion transferability framework with three datasets (Syn-TA, Kinetics400-TA, and Something-Something-v2-TA) was introduced, and 13 state-of-the-art models were evaluated on these benchmarks to analyze performance drops in novel contexts. Result: There is a significant drop in performance when recognizing high-level actions in novel contexts. Multimodal models struggle more with fine-grained unknown actions, and larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning. Conclusion: The study establishes a crucial benchmark for assessing motion transferability in action recognition, highlighting the importance of disentangling coarse and fine motions for improved recognition in challenging datasets. Abstract: Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. 
Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.

[48] The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

Mateo de Mayo,Daniel Cremers,Taihú Pire

Main category: cs.CV

TL;DR: This paper presents the Monado SLAM dataset to improve tracking performance for head-mounted devices in challenging scenarios.

Details

Motivation: Despite advances in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM), these systems still fall short on many challenges of head-mounted use cases, such as high-intensity motion, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, and sensor saturation. Method: The Monado SLAM dataset was built by capturing real sequences from multiple virtual reality headsets. Result: The Monado SLAM dataset is introduced to improve coverage of head-mounted tracking scenarios and is released under a permissive CC BY 4.0 license. Conclusion: The release of the Monado SLAM dataset is expected to advance VIO/SLAM research and development and improve tracking systems for head-mounted applications. Abstract: Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

[49] Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images

Basna Mohammed Salih Hasan,Ramadhan J. Mstafa

Main category: cs.CV

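TL;DR: This study proposes a convolutional neural network model for gender classification based on the periocular region, reaching accuracies of 99% and 96% on two datasets, which demonstrates the effectiveness and practicality of the approach.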

Details

Motivation: Gender classification is important in fields such as security, human-machine interaction, surveillance, and advertising, but its accuracy can be affected by cosmetics and disguise. The study therefore focuses on gender classification from the periocular region. Method: A sophisticated convolutional neural network (CNN) model is introduced, and color image databases are used to evaluate the effectiveness of the periocular region for gender classification. Result: The model reached 99% accuracy on the CVBL dataset and 96% accuracy on the (Female and Male) dataset, with a small number of learnable parameters (7,235,089). Conclusion: The study shows that the periocular gender classification model offers advantages in accuracy and practicality and is suitable for security and surveillance applications. Abstract: Gender classification has emerged as a crucial aspect in various fields, including security, human-machine interaction, surveillance, and advertising. Nonetheless, the accuracy of this classification can be influenced by factors such as cosmetics and disguise. Consequently, our study is dedicated to addressing this concern by concentrating on gender classification using color images of the periocular region. The periocular region refers to the area surrounding the eye, including the eyelids, eyebrows, and the region between them. It contains valuable visual cues that can be used to extract key features for gender classification. This paper introduces a sophisticated Convolutional Neural Network (CNN) model that utilizes color image databases to evaluate the effectiveness of the periocular region for gender classification. To validate the model's performance, we conducted tests on two eye datasets, namely CVBL and (Female and Male). The recommended architecture achieved an outstanding accuracy of 99% on the previously unused CVBL dataset while attaining a commendable accuracy of 96% with a small number of learnable parameters (7,235,089) on the (Female and Male) dataset. To ascertain the effectiveness of our proposed model for gender classification using the periocular region, we evaluated its performance through an extensive range of metrics and compared it with other state-of-the-art approaches. The results unequivocally demonstrate the efficacy of our model, thereby suggesting its potential for practical application in domains such as security and surveillance.

[50] World Consistency Score: A Unified Metric for Video Generation Quality

Akshat Rakheja,Aarsh Ashdhir,Aryan Bhattacharjee,Vanshika Sharma

Main category: cs.CV

TL;DR: This paper introduces the World Consistency Score (WCS), a new evaluation metric for generative video models that measures internal world consistency by combining four interpretable sub-components using a learned weighted formula trained on human preferences.

Details

Motivation: The motivation for WCS arises from the limitations of existing video evaluation metrics, which often focus solely on visual fidelity or prompt alignment while neglecting the internal world consistency of generated videos. This work aims to fill that gap by proposing a metric that evaluates temporal and physical coherence in video generation models. Method: WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - which are combined via a learned weighted formula to produce a single consistency score. The weights are trained using human preference data, and open-source tools such as trackers, action recognizers, CLIP embeddings, and optical flow are used to compute each submetric. Result: The experimental validation blueprint outlines the use of benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS's correlation with human evaluations, perform sensitivity analyses, and compare it against established metrics such as FVD, CLIPScore, VBench, and FVMD. Conclusion: The proposed World Consistency Score (WCS) provides a comprehensive and interpretable framework for evaluating video generation models, addressing gaps left by prior metrics that focused only on visual fidelity or prompt alignment. Abstract: We introduce World Consistency Score (WCS), a novel unified evaluation metric for generative video models that emphasizes internal world consistency of the generated videos. WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - each measuring a distinct aspect of temporal and physical coherence in a video. These submetrics are combined via a learned weighted formula to produce a single consistency score that aligns with human judgments. 
We detail the motivation for WCS in the context of existing video evaluation metrics, formalize each submetric and how it is computed with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow), and describe how the weights of the WCS combination are trained using human preference data. We also outline an experimental validation blueprint: using benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS's correlation with human evaluations, performing sensitivity analyses, and comparing WCS against established metrics (FVD, CLIPScore, VBench, FVMD). The proposed WCS offers a comprehensive and interpretable framework for evaluating video generation models on their ability to maintain a coherent "world" over time, addressing gaps left by prior metrics focused only on visual fidelity or prompt alignment.
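
To make the "learned weighted formula" concrete, the sketch below combines the four submetrics into one score. It is a hedged illustration, not the paper's formula: the normalized linear form, the function name, and the convention that every submetric lies in [0, 1] with higher meaning more consistent (so the flicker term is already inverted) are all assumptions, and the weights shown in the usage note are placeholders rather than learned values.

```python
from typing import Dict

SUBMETRICS = (
    "object_permanence",
    "relation_stability",
    "causal_compliance",
    "flicker_penalty",
)


def world_consistency_score(
    submetrics: Dict[str, float], weights: Dict[str, float]
) -> float:
    """Normalized weighted sum of the four submetrics (assumed linear form)."""
    missing = [n for n in SUBMETRICS if n not in submetrics or n not in weights]
    if missing:
        raise KeyError(f"missing submetric or weight: {missing}")
    total = sum(weights[n] for n in SUBMETRICS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight total keeps the score in the submetrics' range.
    return sum(weights[n] * submetrics[n] for n in SUBMETRICS) / total
```

With equal placeholder weights, submetric values of 0.8, 0.6, 1.0, and 0.4 average to 0.7; in the described setup the weights would instead be fit to human preference data.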

[51] GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

Li Mi,Manon Bechaz,Zeming Chen,Antoine Bosselut,Devis Tuia

Main category: cs.CV

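TL;DR: This paper proposes GeoExplorer, a new active geo-localization method that improves localization performance on unseen environments and targets through curiosity-driven exploration.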

Details

Motivation: Current AGL methods rely on distance-based rewards, but show poor robustness and generalization when distance estimation is difficult or when they encounter unseen targets and environments. Method: GeoExplorer uses a curiosity-driven exploration strategy based on intrinsic rewards instead of the conventional distance-based reward. Result: Extensive experiments on four AGL benchmarks validate the effectiveness of GeoExplorer in diverse settings, particularly when localizing unseen targets and environments. Conclusion: By leveraging curiosity-driven exploration, GeoExplorer addresses the limitations of the exploration strategy learned under conventional distance-based rewards, achieving stronger robustness and generalization on unseen targets and environments. Abstract: Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
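
One common way to define a goal-agnostic intrinsic reward is the prediction error of an environment model: the agent is rewarded where its model predicts the next observation poorly. The sketch below shows that generic formulation; whether GeoExplorer uses exactly this form is an assumption, and the function name and `scale` parameter are hypothetical.

```python
from typing import Sequence


def curiosity_reward(
    predicted_next: Sequence[float],
    observed_next: Sequence[float],
    scale: float = 1.0,
) -> float:
    """Intrinsic reward as the mean squared error of a next-state prediction."""
    if not predicted_next or len(predicted_next) != len(observed_next):
        raise ValueError("feature vectors must be non-empty and equal in length")
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted_next, observed_next))
    # Larger prediction error means a less familiar state and a higher reward.
    return scale * squared_error / len(predicted_next)
```

Note that the goal never appears in the computation, which is what makes such a reward usable when the distance to the goal is hard to estimate.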

[52] Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

Bhavya Goyal,Felipe Gutierrez-Barragan,Wei Lin,Andreas Velten,Yin Li,Mohit Gupta

Main category: cs.CV

TL;DR: This paper introduces Probabilistic Point Clouds (PPC), a new 3D representation enhancing LiDAR-based object detection by incorporating measurement uncertainty, leading to improved performance in challenging environments.

Details

Motivation: Modern LiDARs face challenges in real-world scenarios like long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors propagate to perception models, resulting in loss of accuracy. Method: Proposed Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute to encapsulate measurement uncertainty. Introduced inference approaches that leverage PPC for robust 3D object detection. Result: Demonstrated via simulations and real captures that PPC-based 3D inference methods are robust and versatile, applicable in various indoor and outdoor scenarios. Conclusion: PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, especially in challenging scenarios. Abstract: LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. 
We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at https://bhavyagoyal.github.io/ppc .
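
The representation itself is easy to picture: each point carries a probability alongside its coordinates. The sketch below shows that data layout and one very simple way a downstream step could use it, namely dropping low-confidence points. The type name, field names, and the thresholding step are illustrative assumptions; the abstract does not say that the paper's inference methods work this way.

```python
from typing import List, NamedTuple


class ProbPoint(NamedTuple):
    """A 3D point augmented with a measurement confidence in [0, 1]."""

    x: float
    y: float
    z: float
    p: float


def filter_confident(points: List[ProbPoint], min_prob: float = 0.5) -> List[ProbPoint]:
    """Keep only points whose confidence is at least `min_prob`."""
    if not 0.0 <= min_prob <= 1.0:
        raise ValueError("min_prob must lie in [0, 1]")
    return [pt for pt in points if pt.p >= min_prob]
```

A conventional point cloud corresponds to discarding `p` entirely, which is the loss of uncertainty information the abstract identifies.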

[53] On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI

David Restrepo,Ira Ktena,Maria Vakalopoulou,Stergios Christodoulidis,Enzo Ferrante

Main category: cs.CV

TL;DR: This paper introduces Selective Modality Shifting (SMS) to reveal text bias in Vision-Language Models used for medical tasks, showing that models often ignore visual data in favor of textual input, highlighting the need for better multimodal integration.

Details

Motivation: The motivation of the paper is to address the potential modality bias in Vision-Language Models (VLMs) used for clinical decision-making, where textual information might overshadow visual data, leading to suboptimal or unreliable model behavior. Method: The authors introduced Selective Modality Shifting (SMS), a perturbation-based method to evaluate modality reliance in binary classification tasks by swapping images or text between samples with opposing labels. They tested six open-source VLMs on two medical datasets: MIMIC-CXR and FairVLMed, analyzing model performance and calibration in both perturbed and unperturbed settings. Result: The evaluation of six VLMs showed a consistent over-reliance on text input, even when complementary visual information was available. Attention-based analysis confirmed that textual details often overshadowed image content. The modality bias persisted across both generalist and fine-tuned models. Conclusion: The study concludes that existing Vision-Language Models (VLMs) show a strong bias toward textual information in medical decision-making tasks, often overlooking important visual cues. This highlights the need for improved multimodal integration in medical models. Abstract: Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model's reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. 
We assess six open-source VLMs, four generalist models and two fine-tuned for medical data, on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing model performance and the calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
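
The perturbation at the heart of SMS, swapping one modality between samples with opposing labels, can be sketched as follows. The dictionary layout (`image`, `text`, `label` keys with binary labels 0 and 1), the random donor selection, and the function name are assumptions made for illustration; the paper's exact sampling procedure is not given in the abstract.

```python
import random
from typing import Any, Dict, List

Sample = Dict[str, Any]


def selective_modality_shift(
    samples: List[Sample], modality: str, seed: int = 0
) -> List[Sample]:
    """Replace one modality of each sample with that of an opposite-label sample."""
    if modality not in ("image", "text"):
        raise ValueError("modality must be 'image' or 'text'")
    by_label = {label: [s for s in samples if s["label"] == label] for label in (0, 1)}
    if not by_label[0] or not by_label[1]:
        raise ValueError("need at least one sample of each label")
    rng = random.Random(seed)
    shifted = []
    for sample in samples:
        donor = rng.choice(by_label[1 - sample["label"]])
        perturbed = dict(sample)  # shallow copy so the input list stays intact
        perturbed[modality] = donor[modality]
        shifted.append(perturbed)
    return shifted
```

If a model's predictions follow the swapped text rather than the untouched image, that indicates reliance on the text modality; comparing accuracy on the original and shifted sets quantifies it.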

[54] Graph Lineages and Skeletal Graph Products

Eric Mjolsness,Cory B. Scott

Main category: cs.CV

TL;DR: This paper introduces hierarchical graph lineages and an algebraic type theory for graded graphs, enabling efficient hierarchical model architectures and applications in deep learning and numerical methods.

Details

Motivation: The motivation is to develop a mathematical framework for hierarchical graph structures that can efficiently represent and operate on growing graphs. This is aimed at applications in machine learning, computational science, and related fields, enabling better hierarchical model architectures and algorithms. Method: The paper introduces structured graph lineages that grow hierarchically, using bipartite graphs, prolongation maps, and category theory to derive low-cost skeletal variants of standard algebraic graph operations. These operations are analyzed for their algebraic and category-theoretic properties and applied to demonstrate their utility in deep neural networks and multigrid methods. Result: The paper defines hierarchical graph lineages and graded graphs, derives skeletal variants of standard algebraic graph operations, and demonstrates their application in deep neural networks and multigrid numerical methods. It also establishes unary operators like thickening and escalation for additional functionality. Conclusion: The paper concludes that the proposed algebraic type theory for graded graphs and hierarchical graph lineages is well-suited for defining hierarchical model architectures and implementing local sampling, search, or optimization algorithms. Applications in deep neural networks and multigrid numerical methods are demonstrated. Abstract: Graphs, and sequences of growing graphs, can be used to specify the architecture of mathematical models in many fields including machine learning and computational science. 
Here we define structured graph "lineages" (ordered by level number) that grow in a hierarchical fashion, so that: (1) the number of graph vertices and edges increases exponentially in level number; (2) bipartite graphs connect successive levels within a graph lineage and, as in multigrid methods, can constrain matrices relating successive levels; (3) using prolongation maps within a graph lineage, process-derived distance measures between graphs at successive levels can be defined; (4) a category of "graded graphs" can be defined, and using it low-cost "skeletal" variants of standard algebraic graph operations and type constructors (cross product, box product, disjoint sum, and function types) can be derived for graded graphs and hence hierarchical graph lineages; (5) these skeletal binary operators have similar but not identical algebraic and category-theoretic properties to their standard counterparts; (6) graph lineages and their skeletal product constructors can approach continuum limit objects. Additional space-efficient unary operators on graded graphs are also derived: thickening, which creates a graph lineage of multiscale graphs, and escalation to a graph lineage of search frontiers (useful as a generalization of adaptive grids and in defining "skeletal" functions). The result is an algebraic type theory for graded graphs and (hierarchical) graph lineages. The approach is expected to be well suited to defining hierarchical model architectures - "hierarchitectures" - and local sampling, search, or optimization algorithms on them. We demonstrate such application to deep neural networks (including visual and feature scale spaces) and to multigrid numerical methods.
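
As a reference point for the skeletal variants, the standard products they are compared against can be written down directly. The sketch below assumes that "box product" denotes the Cartesian product of graphs and "cross product" the tensor (categorical) product, which is the usual reading of those names; it implements only these standard operations on simple undirected graphs, not the paper's skeletal constructions, and the function names are hypothetical.

```python
from itertools import combinations, product
from typing import FrozenSet, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def _edge_set(edges: Iterable[Edge]) -> Set[FrozenSet]:
    return {frozenset(e) for e in edges}


def box_product(nodes_g, edges_g, nodes_h, edges_h):
    """Cartesian product: move along an edge in exactly one factor."""
    eg, eh = _edge_set(edges_g), _edge_set(edges_h)
    nodes: List[Tuple] = list(product(nodes_g, nodes_h))
    edges = set()
    for (u, v), (u2, v2) in combinations(nodes, 2):
        if (u == u2 and frozenset((v, v2)) in eh) or (
            v == v2 and frozenset((u, u2)) in eg
        ):
            edges.add(frozenset(((u, v), (u2, v2))))
    return nodes, edges


def cross_product(nodes_g, edges_g, nodes_h, edges_h):
    """Tensor product: move along an edge in both factors at once."""
    eg, eh = _edge_set(edges_g), _edge_set(edges_h)
    nodes: List[Tuple] = list(product(nodes_g, nodes_h))
    edges = set()
    for (u, v), (u2, v2) in combinations(nodes, 2):
        if frozenset((u, u2)) in eg and frozenset((v, v2)) in eh:
            edges.add(frozenset(((u, v), (u2, v2))))
    return nodes, edges
```

For two single-edge graphs, the box product is a 4-cycle and the cross product is the pair of diagonals. Both products multiply the vertex counts of their factors, which is the cost that low-cost skeletal variants are meant to reduce.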

[55] Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

Xiangyu Kong,Hengde Zhu,Haoqin Sun,Zhihao Guo,Jiayan Gu,Xinyi Ni,Wei Zhang,Shizhe Liu,Siyang Song

Main category: cs.CV

TL;DR: This paper proposes a new method for real personality recognition by simulating internal cognition through expressive behaviors, using a 2D Graph Neural Network for improved recognition performance.

Details

Motivation: The motivation stems from the observation that existing methods for personality recognition often rely on external observers and do not accurately capture real personality traits from internal cognition. Method: The authors employed an end-to-end strategy involving cognition simulation, 2D graph construction, and personality recognition modules using a novel 2D Graph Neural Network (2D-GNN). Result: The result is a new approach for real personality recognition that efficiently simulates internal cognition from external behaviors, leading to improved recognition performance. Conclusion: The paper concludes that their proposed method can effectively recognize real personality traits by simulating internal cognition through expressive behaviors. Abstract: Automatic real personality recognition (RPR) aims to evaluate human real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers' personality impressions based on target individuals' expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from easily accessible external short audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that enforce the personalised network to reproduce the individual-specific facial reactions, is further encoded as a novel graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. 
To simulate real personality-related cognition, an end-to-end strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules.

[56] SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

Shayan Jalilian,Abdul Bais

Main category: cs.CV

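TL;DR: This paper proposes SAM-PTx, a parameter-efficient method that uses CLIP-derived text embeddings as class-level semantic guidance to exploit semantic text prompts in the Segment Anything Model (SAM).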

Details

Motivation: Although the Segment Anything Model (SAM) performs well at prompt-based segmentation, the potential of semantic text prompts remains underexplored compared with traditional spatial prompts such as points and boxes. Method: A lightweight adapter design, Parallel-Text, injects text embeddings into SAM's image encoder, modifying only the MLP-parallel branch of each Transformer block while preserving the attention pathway for spatial reasoning. Result: Supervised experiments and ablation studies on the COD10K dataset and on low-data subsets of COCO and ADE20K show that incorporating fixed text embeddings as input improves segmentation performance. Conclusion: By incorporating fixed text embeddings as input, SAM-PTx achieves semantics-guided segmentation while keeping most of the original architecture frozen, and shows better segmentation performance than purely spatial prompt baselines on the COD10K dataset and on low-data subsets of COCO and ADE20K. Abstract: The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.

[57] Object-Centric Cropping for Visual Few-Shot Classification

Aymane Abdali,Bartosz Boguslawski,Lucas Drumetz,Vincent Gripon

Main category: cs.CV

TL;DR: This research demonstrates that including local positioning information of objects significantly boosts Few-Shot Image Classification performance, with the Segment Anything Model and unsupervised methods offering notable improvements.

Details

Motivation: The motivation is to address performance deterioration in Few-Shot Image Classification caused by image ambiguities, such as multiple objects or complex backgrounds, by incorporating additional contextual information. Method: The study examines the impact of incorporating local positioning information of objects on Few-Shot Image Classification performance, leveraging the Segment Anything Model and unsupervised foreground object extraction techniques. Result: The results show a marked enhancement in classification performance across established benchmarks when additional local positioning information is incorporated. Conclusion: The research concludes that incorporating information about the local positioning of objects significantly improves Few-Shot Image Classification performance, with a significant fraction of improvement achievable through the use of the Segment Anything Model or unsupervised foreground object extraction methods. Abstract: In the domain of Few-Shot Image Classification, operating with as little as one example per class, the presence of image ambiguities stemming from multiple objects or complex backgrounds can significantly deteriorate performance. Our research demonstrates that incorporating additional information about the local positioning of an object within its image markedly enhances classification across established benchmarks. More importantly, we show that a significant fraction of the improvement can be achieved through the use of the Segment Anything Model, requiring only a pixel of the object of interest to be pointed out, or by employing fully unsupervised foreground object extraction methods.
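
Given a foreground mask, whether from the Segment Anything Model or an unsupervised extractor, an object-centric crop reduces to taking the mask's bounding box. The sketch below assumes images and masks are equally sized nested lists and that the mask is binary; the function names and the `margin` parameter are hypothetical, and obtaining the mask itself is outside the sketch.

```python
from typing import List, Optional, Sequence, Tuple


def mask_bounding_box(mask: Sequence[Sequence[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the nonzero region, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_to_object(image: List[list], mask: Sequence[Sequence[int]], margin: int = 0) -> List[list]:
    """Crop `image` to the mask's bounding box, padded by `margin` pixels."""
    if len(image) != len(mask) or any(len(a) != len(b) for a, b in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    box = mask_bounding_box(mask)
    if box is None:
        return image  # no foreground found, so fall back to the full image
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    top, left = max(0, top - margin), max(0, left - margin)
    bottom, right = min(height - 1, bottom + margin), min(width - 1, right + margin)
    return [row[left : right + 1] for row in image[top : bottom + 1]]
```

The cropped image would then be passed to the few-shot classifier in place of the full frame, removing distracting objects and background.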

[58] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network

Chenggang Guo,Hao Xu,XianMing Wan

Main category: cs.CV

TL;DR: The paper introduces a new model, MSF-UM, for depth map super-resolution that efficiently captures long-range dependencies and uses color image guidance for improved performance with reduced parameters.

Details

Motivation: The motivation is to overcome the limitations of traditional convolutional neural networks and transformers in depth map super-resolution, particularly regarding long-range dependencies and computational complexity. Method: The method involves a multi-scale fusion U-shaped Mamba (MSF-UM) model that integrates Mamba's efficient state-space modeling with a multi-scale U-shaped fusion structure guided by a color image. Result: The MSF-UM model achieves better reconstruction accuracy with fewer parameters and shows excellent generalization in large-scale depth map super-resolution tasks. Conclusion: The proposed MSF-UM model significantly reduces the number of parameters while achieving better reconstruction accuracy in depth map super-resolution. Abstract: Depth map super-resolution technology aims to improve the spatial resolution of low-resolution depth maps and effectively restore high-frequency detail information. Traditional convolutional neural networks have limitations in dealing with long-range dependencies and are unable to fully model the global contextual information in depth maps. Although transformers can model global dependencies, their computational complexity and memory consumption are quadratic, which significantly limits their ability to process high-resolution depth maps. In this paper, we propose a multi-scale fusion U-shaped Mamba (MSF-UM) model, a novel guided depth map super-resolution framework. The core innovation of this model is to integrate Mamba's efficient state-space modeling capabilities into a multi-scale U-shaped fusion structure guided by a color image. We design a structure that combines the residual dense channel attention block with the Mamba state space module, pairing the local feature extraction capability of the convolutional layer with the state space model's strength in modeling long-distance dependencies. 
At the same time, the model adopts a multi-scale cross-modal fusion strategy to make full use of the high-frequency texture information from the color image to guide the super-resolution process of the depth map. Compared with existing mainstream methods, the proposed MSF-UM significantly reduces the number of model parameters while achieving better reconstruction accuracy. Extensive experiments on multiple publicly available datasets validate the effectiveness of the model, especially showing excellent generalization ability in the task of large-scale depth map super-resolution.

[59] PointGauss: Point Cloud-Guided Multi-Object Segmentation for Gaussian Splatting

Wentao Sun,Hanqing Xu,Quanyun Wu,Dedong Zhang,Yiping Chen,Lingfei Ma,John S. Zelek,Jonathan Li

Main category: cs.CV

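TL;DR: PointGauss is an efficient 3D multi-object segmentation method that achieves fast segmentation through point cloud-guided parsing of Gaussian Splatting representations, and it introduces DesktopObjects-360, a new large-scale multi-object dataset.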

Details

Motivation: Existing methods suffer from prolonged initialization and limited multi-view consistency, and existing benchmarks are limited by single-object focus, inconsistent 3D evaluation, small scale, and partial coverage. Method: The method comprises a point cloud-based Gaussian primitive decoder and a GPU-accelerated 2D mask rendering system, which together provide efficient 3D segmentation and multi-view consistency. Result: Experiments show that PointGauss improves multi-view mIoU by 1.89 to 31.78% over previous state-of-the-art methods while maintaining superior computational efficiency. Conclusion: PointGauss is a point cloud-guided framework for Gaussian Splatting representations that enables efficient 3D multi-object segmentation, and the paper contributes DesktopObjects-360, a new large-scale dataset. Abstract: We introduce PointGauss, a novel point cloud-guided framework for real-time multi-object segmentation in Gaussian Splatting representations. Unlike existing methods that suffer from prolonged initialization and limited multi-view consistency, our approach achieves efficient 3D segmentation by directly parsing Gaussian primitives through a point cloud segmentation-driven pipeline. The key innovation lies in two aspects: (1) a point cloud-based Gaussian primitive decoder that generates 3D instance masks within 1 minute, and (2) a GPU-accelerated 2D mask rendering system that ensures multi-view consistency. Extensive experiments demonstrate significant improvements over previous state-of-the-art methods, achieving performance gains of 1.89 to 31.78% in multi-view mIoU, while maintaining superior computational efficiency. To address the limitations of current benchmarks (single-object focus, inconsistent 3D evaluation, small scale, and partial coverage), we present DesktopObjects-360, a novel comprehensive dataset for 3D segmentation in radiance fields, featuring: (1) complex multi-object scenes, (2) globally consistent 2D annotations, (3) large-scale training data (over 27 thousand 2D masks), (4) full 360° coverage, and (5) 3D evaluation masks.

[60] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

Hyundong Jin,Hyung Jin Chang,Eunwoo Kim

Main category: cs.CV

TL;DR: This paper proposes a framework for continual learning in vision-language models that better integrates language instructions using specialized visual projectors and expert management techniques.

Details

Motivation: Existing continual learning methods prioritize visual inputs over language instructions, leading to suboptimal task performance with repetitive textual instructions. Method: The paper introduces a mixture of visual projectors specialized for instruction contexts, an expert recommendation strategy, and expert pruning to manage interference. Result: Experiments showed that the proposed method outperforms existing continual learning approaches in vision-language tasks by better following instructions. Conclusion: The proposed framework improves continual learning in vision-language models by emphasizing language instructions and managing visual projectors efficiently. Abstract: Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. 
Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.

[61] Multimodal Referring Segmentation: A Survey

Henghui Ding,Song Tang,Shuting He,Chang Liu,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

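TL;DR: This paper surveys multimodal referring segmentation, summarizing a unified meta architecture and representative methods, discussing strategies for handling real-world complexity, and providing performance comparisons.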

Details

Motivation: Multimodal referring segmentation plays a crucial role in practical applications that require accurate object perception based on user instructions. Over the past decade the field has attracted broad attention, driven by advances in convolutional neural networks, transformers, and large language models. Method: The survey introduces the field's background, summarizes a unified meta architecture, reviews representative methods, discusses Generalized Referring Expression methods, and provides performance comparisons. Result: The paper delivers a comprehensive survey of multimodal referring segmentation, covering the background, a unified meta architecture, representative methods, Generalized Referring Expression methods, and performance comparisons on standard benchmarks. Conclusion: The paper offers a comprehensive survey of multimodal referring segmentation that summarizes a unified meta architecture and representative methods, discusses Generalized Referring Expression methods for handling real-world complexity, and compares performance on standard benchmarks. Abstract: Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

[62] Towards Robust Semantic Correspondence: A Benchmark and Insights

Wenyue Chong

Main category: cs.CV

TL;DR: This paper introduces a new benchmark for evaluating semantic correspondence robustness in challenging scenarios, revealing insights on model performance and robustness enhancement strategies.

Details

Motivation: Semantic correspondence is crucial for various computer vision tasks, yet its robustness in adverse conditions remains underexplored. This work aims to address this gap by establishing a benchmark for challenging scenarios. Method: A novel benchmark dataset with 14 challenging scenarios was created, and extensive evaluations were conducted on existing semantic correspondence approaches and robustness enhancement strategies. Result: Key findings include performance drops in adverse conditions for all methods, enhanced robustness using large-scale models (with limitations upon fine-tuning), and the superiority of DINO over Stable Diffusion, with their fusion yielding better results. Common data augmentations were found to be ineffective. Conclusion: The study concludes that while large-scale vision models improve robustness in semantic correspondence, their fine-tuning can lead to reduced relative robustness. Task-specific designs are necessary for robustness enhancement as general augmentations are ineffective. Abstract: Semantic correspondence aims to identify semantically meaningful relationships between different images and is a fundamental challenge in computer vision. It forms the foundation for numerous tasks such as 3D reconstruction, object tracking, and image editing. With the progress of large-scale vision models, semantic correspondence has achieved remarkable performance in controlled and high-quality conditions. However, the robustness of semantic correspondence in challenging scenarios is much less investigated. In this work, we establish a novel benchmark for evaluating semantic correspondence in adverse conditions. The benchmark dataset comprises 14 distinct challenging scenarios that reflect commonly encountered imaging issues, including geometric distortion, image blurring, digital artifacts, and environmental occlusion. 
Through extensive evaluations, we provide several key insights into the robustness of semantic correspondence approaches: (1) All existing methods suffer from noticeable performance drops under adverse conditions; (2) Using large-scale vision models can enhance overall robustness, but fine-tuning these models leads to a decline in relative robustness; (3) The DINO model outperforms Stable Diffusion in relative robustness, and their fusion achieves better absolute robustness. Moreover, we evaluate common robustness enhancement strategies for semantic correspondence and find that general data augmentations are ineffective, highlighting the need for task-specific designs. These results are consistent across both our dataset and real-world benchmarks.

[63] Privacy-Preserving Driver Drowsiness Detection with Spatial Self-Attention and Federated Learning

Tran Viet Khoa,Do Hai Son,Mohammad Abu Alsheikh,Yibeltal F Alem,Dinh Thai Hoang

Main category: cs.CV

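TL;DR: This paper proposes a new driver drowsiness detection method based on a spatial self-attention mechanism and federated learning that handles real-world data variability and enables early, reliable drowsiness detection in intelligent transportation systems.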

Details

Motivation: Driver drowsiness is one of the main causes of road accidents, yet detecting it accurately in real-world settings remains challenging, especially when facial data is decentralized and highly diverse. Method: The authors develop a new Spatial Self-Attention (SSA) mechanism combined with a Long Short-Term Memory (LSTM) network, and employ Gradient Similarity Comparison (GSC) to support federated learning. Result: Experimental results show that the framework reaches a detection accuracy of 89.9% in the federated learning setting and outperforms existing methods across various deployment scenarios. Conclusion: The paper proposes a new attention mechanism and federated learning approach that improve the accuracy of driver drowsiness detection, and it highlights their potential to enhance road safety in intelligent transportation systems. Abstract: Driver drowsiness is one of the main causes of road accidents and is recognized as a leading contributor to traffic-related fatalities. However, detecting drowsiness accurately remains a challenging task, especially in real-world settings where facial data from different individuals is decentralized and highly diverse. In this paper, we propose a novel framework for drowsiness detection that is designed to work effectively with heterogeneous and decentralized data. Our approach develops a new Spatial Self-Attention (SSA) mechanism integrated with a Long Short-Term Memory (LSTM) network to better extract key facial features and improve detection performance. To support federated learning, we employ a Gradient Similarity Comparison (GSC) that selects the most relevant trained models from different operators before aggregation. This improves the accuracy and robustness of the global model while preserving user privacy. We also develop a customized tool that automatically processes video data by extracting frames, detecting and cropping faces, and applying data augmentation techniques such as rotation, flipping, brightness adjustment, and zooming. Experimental results show that our framework achieves a detection accuracy of 89.9% in the federated learning setting, outperforming existing methods under various deployment scenarios. The results demonstrate the effectiveness of our approach in handling real-world data variability and highlight its potential for deployment in intelligent transportation systems to enhance road safety through early and reliable drowsiness detection.
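
The abstract describes GSC only as selecting the most relevant trained models before aggregation. The sketch below is one way such a selection step could look, not the authors' procedure: the cosine-similarity score, the reference update, the `top_k` cutoff, and the function names are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened update vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_and_aggregate(client_updates, reference, top_k):
    """Keep the top_k updates most similar to the reference, then average them.

    Hypothetical stand-in for a gradient-similarity selection step;
    the paper's actual scoring and aggregation rules may differ.
    """
    ranked = sorted(
        client_updates,
        key=lambda update: cosine_similarity(update, reference),
        reverse=True,
    )
    selected = ranked[:top_k]
    dim = len(reference)
    return [sum(u[i] for u in selected) / len(selected) for i in range(dim)]


# Three toy client updates; the third points away from the reference.
updates = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
aggregated = select_and_aggregate(updates, reference=[1.0, 0.0], top_k=2)
```

With these toy inputs the dissimilar third update is dropped, so the average is taken over the first two only.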

[64] TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models

Christian Simon,Masato Ishii,Akio Hayakawa,Zhi Zhong,Shusuke Takahashi,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

TL;DR: TITAN-Guide is an efficient guidance method for text-to-video diffusion models that reduces memory usage while improving control over the model.

Details

Motivation: Existing training-free guidance frameworks either have heavy memory requirements or offer poor control, which limits their use with computation-intensive models such as text-to-video diffusion models. Method: TITAN-Guide develops a method for optimizing diffusion latents without backpropagation and studies forward gradient descent for guided diffusion tasks. Result: Experiments show that TITAN-Guide manages memory effectively during latent optimization and significantly improves the performance of text-to-video diffusion models. Conclusion: TITAN-Guide provides an efficient guidance method for text-to-video diffusion models that overcomes memory limitations and offers better control during the guidance process. Abstract: Despite recent developments, conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on the base model. However, the existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit the applicability to control diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues and provides better control in the guidance process than its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descents for guided diffusion tasks with various options on directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, while previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at https://titanguide.github.io.
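
To make the idea of optimizing a latent without backpropagation concrete, here is a minimal sketch of a forward-gradient step on a toy objective. It is an illustration under stated assumptions, not TITAN-Guide itself: the quadratic `loss`, the Gaussian direction sampling, and the central-difference `directional_derivative` (standing in for a forward-mode Jacobian-vector product) are all hypothetical choices.

```python
import random


def loss(x):
    """Toy guidance loss: squared distance of the latent from the origin."""
    return sum(value * value for value in x)


def directional_derivative(f, x, direction, h=1e-5):
    """Central-difference estimate of the slope of f at x along direction.

    Uses only forward evaluations of f, so no backward pass is stored.
    """
    plus = [xi + h * di for xi, di in zip(x, direction)]
    minus = [xi - h * di for xi, di in zip(x, direction)]
    return (f(plus) - f(minus)) / (2.0 * h)


def forward_gradient_step(f, x, lr, rng):
    """One update x <- x - lr * (slope along v) * v for a random direction v."""
    direction = [rng.gauss(0.0, 1.0) for _ in x]
    slope = directional_derivative(f, x, direction)
    return [xi - lr * slope * di for xi, di in zip(x, direction)]


rng = random.Random(0)
latent = [1.0, -2.0, 0.5, 3.0]
initial_loss = loss(latent)
for _ in range(200):
    latent = forward_gradient_step(loss, latent, lr=0.01, rng=rng)
final_loss = loss(latent)
```

Each step needs only two forward evaluations of the guiding objective, which is where the memory saving over backpropagation comes from in this toy setting.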

[65] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Jin Lyu,Liang An,Li Lin,Pujin Cheng,Yebin Liu,Xiaoying Tang

Main category: cs.CV

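TL;DR: AniMer+ uses a novel network architecture and synthetic datasets to achieve efficient 3D reconstruction of mammals and birds, outperforming existing methods.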

Details

Motivation: Existing methods are constrained by limited network capacity and the lack of comprehensive multi-species datasets, so stronger models and datasets are needed for accurate cross-species reconstruction. Method: The paper introduces the AniMer+ framework, which combines a Mixture-of-Experts design with CtrlAni3D and CtrlAVES3D, two synthetic datasets generated by a diffusion model, to perform unified reconstruction. Result: AniMer+ performs strongly on multiple benchmarks, notably the out-of-domain Animal Kingdom dataset, and ablations validate both the synthetic data and the network architecture. Conclusion: AniMer+ offers a unified framework that uses a high-capacity, family-aware Vision Transformer and synthetic datasets to reconstruct mammals and birds efficiently, outperforming existing methods. Abstract: In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. 
Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.

[66] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence

Danzhen Fu,Jiagao Hu,Daiguo Zhou,Fei Wang,Zepeng Wang,Wenhua Liao

Main category: cs.CV

TL;DR: A new framework for pedestrian video editing in autonomous driving scenarios improves robustness by generating realistic and consistent multi-view pedestrian videos, enabling data augmentation and scenario simulation.

Details

Motivation: Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. Method: A framework integrating video inpainting and human motion control techniques to enable controllable pedestrian video editing, involving identifying pedestrian regions, expanding detection boxes, resizing/stitching regions into a unified canvas, and applying a binary mask for editable area designation. Result: Extensive experiments demonstrate high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. Conclusion: The proposed method provides a robust and versatile solution for multi-view pedestrian video generation, with potential applications in data augmentation and scenario simulation for autonomous driving. Abstract: Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. To address this limitation, we present a novel framework for controllable pedestrian video editing in multi-view driving scenarios by integrating video inpainting and human motion control techniques. Our approach begins by identifying pedestrian regions of interest across multiple camera views, expanding detection bounding boxes with a fixed ratio, and resizing and stitching these regions into a unified canvas while preserving cross-view spatial relationships. A binary mask is then applied to designate the editable area, within which pedestrian editing is guided by pose sequence control conditions. This enables flexible editing functionalities, including pedestrian insertion, replacement, and removal. Extensive experiments demonstrate that our framework achieves high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. 
These results establish the proposed method as a robust and versatile solution for multi-view pedestrian video generation, with broad potential for applications in data augmentation and scenario simulation in autonomous driving.
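
The region-preparation steps in the abstract (expanding a detection box by a fixed ratio and marking the editable area with a binary mask) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function names, the centre-anchored expansion, the clipping to the image bounds, and the toy canvas size are assumptions, and the cross-view stitching step is omitted.

```python
def expand_box(box, ratio, image_width, image_height):
    """Scale a (x_min, y_min, x_max, y_max) box about its centre, clipped to the image."""
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * ratio / 2.0
    half_h = (y_max - y_min) * ratio / 2.0
    return (
        max(0.0, center_x - half_w),
        max(0.0, center_y - half_h),
        min(float(image_width), center_x + half_w),
        min(float(image_height), center_y + half_h),
    )


def box_to_mask(box, image_width, image_height):
    """Binary mask (rows of 0/1) that is 1 inside the box, marking the editable area."""
    x_min, y_min, x_max, y_max = box
    return [
        [1 if x_min <= x < x_max and y_min <= y < y_max else 0
         for x in range(image_width)]
        for y in range(image_height)
    ]


# Toy 8x6 canvas with one pedestrian detection, expanded by a fixed ratio of 2.
expanded = expand_box((2, 1, 4, 3), ratio=2.0, image_width=8, image_height=6)
mask = box_to_mask(expanded, image_width=8, image_height=6)
editable_pixels = sum(sum(row) for row in mask)
```

Expanding the box before masking gives the inpainting model some surrounding context, so an inserted or replaced pedestrian does not have to fit the original detection exactly.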

[67] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement

Chunyan She,Fujun Han,Chengyu Fang,Shukai Duan,Lidan Wang

Main category: cs.CV

TL;DR: This paper proposes a two-stage enhancement pipeline for event cameras that improves low-light image enhancement by leveraging modality-specific advantages and mitigating spatial mismatches, outperforming existing methods.

Details

Motivation: Existing event-based methods do not fully exploit modality-specific advantages by directly combining frames and events into a single model, which limits performance in low-light image enhancement. Method: The method involves a visibility restoration network with amplitude-phase entanglement, a fusion strategy with dynamic alignment to address spatial mismatch, and a contrastive loss based on spatial-frequency interpolation for generating negative samples. Result: Experiments show that the proposed method achieves superior performance compared to state-of-the-art models in low-light image enhancement tasks. Conclusion: The proposed method, which decouples the enhancement pipeline into visibility restoration and structure refinement stages, outperforms state-of-the-art models in low-light image enhancement using event cameras. Abstract: The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. 
In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.

[68] DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios

Yufeng Zhong,Zhixiong Zeng,Lei Chen,Longrong Yang,Liming Zheng,Jing Huang,Siqi Yang,Lin Ma

Main category: cs.CV

TL;DR: DocTron-Formula is a unified OCR framework for mathematical formulas that leverages vision-language models and achieves superior performance on a newly introduced large-scale dataset, CSFormula.

Details

Motivation: Existing models struggle with the structural diversity, complexity, and variability of mathematical content in scientific literature, highlighting the need for a more robust and unified solution. Method: DocTron-Formula is built upon general vision-language models and is enhanced through supervised fine-tuning. Additionally, the CSFormula dataset is introduced for training and evaluation. Result: The proposed method outperforms specialized models in accuracy and robustness across various styles, domains, and complex layouts. Conclusion: DocTron-Formula provides a unified framework for OCR of mathematical formulas, achieving state-of-the-art performance and setting a new paradigm for understanding complex scientific documents. Abstract: Optical Character Recognition (OCR) for mathematical formulas is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.

[69] GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

Suhang Cai,Xiaohao Peng,Chong Wang,Xiaojie Cai,Jiangbo Qian

Main category: cs.CV

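TL;DR: This study proposes a new weakly-supervised video anomaly detection framework that generates semantically controllable and physically plausible synthetic videos to address data scarcity and high annotation cost, and it outperforms existing methods in experiments.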

Details

Motivation: The rarity, unpredictability, and high annotation cost of real-world anomalies limit the performance and generalization ability of existing models. Method: The paper proposes a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that uses text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. Result: Virtual videos augment the training data at low cost, and a synthetic sample loss scaling strategy controls the influence of the generated samples, which makes training more efficient. Conclusion: Experiments show that the proposed GV-VAD framework outperforms state-of-the-art methods on the UCF-Crime dataset. Abstract: Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at https://github.com/Sumutan/GV-VAD.git.
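
The abstract does not spell out the synthetic sample loss scaling strategy, so the following is only a sketch of the general idea of down-weighting generated samples in a batch loss. The constant `synthetic_scale`, the normalization by the weight sum, and the function name are assumptions, not details taken from the paper.

```python
def scaled_batch_loss(losses, is_synthetic, synthetic_scale):
    """Weighted mean of per-sample losses, with synthetic samples scaled down.

    Real samples get weight 1.0; synthetic samples get synthetic_scale.
    Dividing by the weight sum keeps the result on the scale of a mean.
    """
    weights = [synthetic_scale if flag else 1.0 for flag in is_synthetic]
    total = sum(w * l for w, l in zip(weights, losses))
    return total / sum(weights)


# Two real clips and one generated clip with a larger loss.
batch_loss = scaled_batch_loss(
    losses=[0.2, 0.4, 1.0],
    is_synthetic=[False, False, True],
    synthetic_scale=0.5,
)
```

Here the generated clip contributes half as much as a real clip, which limits how far imperfect synthetic videos can pull the detector during training.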

[70] Steering Guidance for Personalized Text-to-Image Diffusion Models

Sunghyun Park,Seokeon Choi,Hyoungwoo Park,Sungrack Yun

Main category: cs.CV

TL;DR: This paper proposes a new guidance method called personalization guidance that effectively balances target alignment and model editability in few-image fine-tuning of text-to-image diffusion models.

Details

Motivation: Fine-tuning text-to-image diffusion models with limited data creates a trade-off between subject fidelity and text editability, which existing methods like CFG and AG fail to balance effectively. Method: Personalization guidance utilizes an unlearned weak model conditioned on a null text prompt and dynamically interpolates weights between pre-trained and fine-tuned models during inference. Result: Experimental results show that the proposed method improves both text alignment and fidelity to the target distribution while maintaining compatibility with various fine-tuning strategies and without additional computational cost. Conclusion: The proposed personalization guidance method effectively balances the trade-off between aligning with the target distribution and preserving the original model's knowledge, outperforming existing guidance methods. Abstract: Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in a weak model through weight interpolation between pre-trained and fine-tuned models during inference. 
Unlike existing guidance methods, which depend solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
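
A rough sketch of the two ingredients named in the abstract, weight interpolation between pre-trained and fine-tuned models and guidance against a weak model, is given below. It is not the authors' implementation: the linear interpolation rule, the parameter `omega`, the extrapolation formula (borrowed from the usual classifier-free and autoguidance form), and the toy scalar "weights" are all assumptions.

```python
def interpolate_weights(pretrained, finetuned, omega):
    """Blend two parameter sets: omega=0 gives pre-trained, omega=1 gives fine-tuned."""
    return {
        name: (1.0 - omega) * pretrained[name] + omega * finetuned[name]
        for name in pretrained
    }


def guided_prediction(strong_pred, weak_pred, scale):
    """Extrapolate from the weak model's prediction toward the strong model's."""
    return [w + scale * (s - w) for s, w in zip(strong_pred, weak_pred)]


# Toy scalar parameters standing in for full model weights.
pretrained = {"w": 0.0, "b": 1.0}
finetuned = {"w": 2.0, "b": 3.0}
weak_weights = interpolate_weights(pretrained, finetuned, omega=0.25)

# Toy noise predictions from the fine-tuned (strong) and interpolated (weak) models.
guided = guided_prediction(strong_pred=[1.0, 2.0], weak_pred=[0.5, 1.0], scale=2.0)
```

In this sketch a smaller `omega` keeps the weak model closer to the pre-trained weights, which corresponds to a greater extent of unlearning of the fine-tuned concept.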

[71] Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

Lilika Makabe,Hiroaki Santo,Fumio Okura,Michael S. Brown,Yasuyuki Matsushita

Main category: cs.CV

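TL;DR: This paper proposes a practical and accurate calibration method for camera spectral sensitivity that requires only an uncalibrated diffraction grating sheet.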

Details

Motivation: Accurate calibration of camera spectral sensitivity is crucial for computer vision tasks such as color correction, illumination estimation, and material analysis. Method: By capturing the direct illumination and its diffracted pattern through the grating sheet, the method estimates the camera spectral sensitivity and the diffraction grating parameters in closed form. Result: Experiments show that the method outperforms conventional methods on both synthetic and real-world data, demonstrating its effectiveness and practicality. Conclusion: The paper proposes a diffraction grating-based calibration method for camera spectral sensitivity, and experiments show that it outperforms conventional reference target-based methods. Abstract: This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our method outperforms conventional reference target-based methods, underscoring its effectiveness and practicality.

[72] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

Angelos Vlachos,Giorgos Filandrianos,Maria Lymperaiou,Nikolaos Spanos,Ilias Mitsouras,Vasileios Karampinis,Athanasios Voulodimos

Main category: cs.CV

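TL;DR: This paper proposes a dual-agent framework that performs multi-image reasoning with a PromptEngineer and a VisionReasoner, achieving strong results across diverse tasks, including high performance on challenging ones.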

Details

Motivation: To address interleaved multimodal reasoning across diverse datasets and task formats with an automated, modular, training-free reasoning process. Method: A collaborative agent framework for multi-image reasoning that pairs a language-based PromptEngineer with a VisionReasoner, a large vision-language model (LVLM). Result: On the 18 diverse datasets of the 2025 MIRAGE Challenge, Claude 3.7 reaches 99.13% accuracy on TQA, 96.87% on DocVQA, and 75.28 ROUGE-L on MMCoQA. Conclusion: The dual-agent framework handles multi-image reasoning tasks effectively and generalizes across a range of tasks and datasets. Abstract: We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.

[73] Stable at Any Speed: Speed-Driven Multi-Object Tracking with Learnable Kalman Filtering

Yan Gong,Mengjun Chen,Hao Liu,Gao Yongsheng,Lei Yang,Naibang Wang,Ziying Song,Haoqun Ma

Main category: cs.CV

TL;DR: This paper proposes SG-LKF, a speed-guided learnable Kalman filter for MOT, which dynamically adapts to ego-vehicle speed and improves tracking performance in high-speed scenarios.

Details

Motivation: Conventional MOT methods neglect ego-vehicle speed-induced variations, leading to instability in dynamic, high-speed scenarios. This work addresses this limitation by incorporating speed-aware uncertainty modeling. Method: SG-LKF integrates a learnable Kalman filter with MotionScaleNet (MSNet), which dynamically predicts filter parameters based on speed. A self-supervised trajectory consistency loss is introduced to enhance association and continuity. Result: SG-LKF outperforms existing vision-based methods on KITTI 2D MOT (79.59% HOTA), KITTI 3D MOT (82.03% HOTA), and nuScenes 3D MOT (2.2% AMOTA improvement over SimpleTrack). Conclusion: The proposed SG-LKF method achieves state-of-the-art performance on vision-based MOT tasks by incorporating ego-vehicle speed into uncertainty modeling and optimizing trajectory consistency. Abstract: Multi-object tracking (MOT) enables autonomous vehicles to continuously perceive dynamic objects, supplying essential temporal cues for prediction, behavior understanding, and safe planning. However, conventional tracking-by-detection methods typically rely on static coordinate transformations based on ego-vehicle poses, disregarding ego-vehicle speed-induced variations in observation noise and reference frame changes, which degrades tracking stability and accuracy in dynamic, high-speed scenarios. In this paper, we investigate the critical role of ego-vehicle speed in MOT and propose a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adapts uncertainty modeling to ego-vehicle speed, significantly improving stability and accuracy in highly dynamic scenarios. Central to SG-LKF is MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP that adaptively predicts key parameters of SG-LKF. To enhance inter-frame association and trajectory continuity, we introduce a self-supervised trajectory consistency loss jointly optimized with semantic and positional constraints. 
Extensive experiments show that SG-LKF ranks first among all vision-based methods on KITTI 2D MOT with 79.59% HOTA, delivers strong results on KITTI 3D MOT with 82.03% HOTA, and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.
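
The idea of adapting uncertainty to ego-vehicle speed can be illustrated with a scalar Kalman filter whose measurement noise grows with speed. This is only a sketch under stated assumptions: in SG-LKF the filter parameters are predicted by MSNet, whereas here a fixed linear scaling (`alpha`, hypothetical) stands in for the learned network, and the state is a single scalar.

```python
def kalman_step(x, p, z, speed, q=0.01, r0=0.25, alpha=0.05):
    """One predict/update step of a scalar Kalman filter (random-walk state).

    x, p  : previous state estimate and its variance
    z     : new measurement
    speed : ego-vehicle speed; inflates the measurement noise (assumption)
    """
    # Hypothetical stand-in for a learned predictor: noise grows linearly with speed.
    r = r0 * (1.0 + alpha * speed)

    # Predict: a random-walk model keeps the state and adds process noise.
    x_pred = x
    p_pred = p + q

    # Update: the gain shrinks as measurement noise grows.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k


_, _, k_slow = kalman_step(0.0, 1.0, 1.0, speed=0.0)
_, _, k_fast = kalman_step(0.0, 1.0, 1.0, speed=30.0)
print(k_slow > k_fast)  # the filter trusts measurements less at high speed
```

With a larger assumed measurement noise at high speed, the gain drops and the track relies more on its prediction, which is the qualitative behavior a speed-aware filter aims for.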

[74] CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

Zongheng Tang,Yi Liu,Yifan Sun,Yulu Gao,Jinyu Chen,Runsheng Xu,Si Liu

Main category: cs.CV

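TL;DR: This paper proposes CoST, a new collaborative perception method that fuses multi-agent and multi-time observations into a unified spatio-temporal space, improving perception efficiency and accuracy.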

Details

Motivation: Existing methods usually treat multi-agent fusion and multi-time fusion as two consecutive steps, which is inefficient and limits performance. This paper aims to solve these problems with a unified spatio-temporal fusion approach. Method: The paper proposes Collaborative perception with Spatio-temporal Transformer (CoST), which fuses observations from different agents and different times into a unified spatio-temporal space simultaneously. Result: CoST improves perception accuracy while reducing the required transmission bandwidth; it is not tied to any specific method and is compatible with most existing methods. Conclusion: Through unified spatio-temporal fusion, CoST improves both efficiency and accuracy. Abstract: Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space, and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.

[75] Honey Classification using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi,Ratnadeep R. Deshmukh

Main category: cs.CV

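TL;DR: This paper proposes a machine learning-based method for automatically classifying honey botanical origins, achieving high accuracy through class transformation, Linear Discriminant Analysis, and SVM/KNN models.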

Details

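Motivation: To improve the accuracy and efficiency of honey botanical origin classification, the paper proposes an automated machine learning approach that addresses possible shortcomings of traditional methods. Method: The method has three main steps: dataset preparation, feature extraction, and classification. Dataset preparation uses a class transformation method to maximize the separability across classes; feature extraction uses Linear Discriminant Analysis (LDA) to extract relevant features and reduce dimensionality; classification uses Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models on the extracted features. Result: Experiments show that the method achieves state-of-the-art classification accuracy on a standard honey hyperspectral imaging dataset: 95.13% for hyperspectral image-based classification and 92.80% for hyperspectral instance-based classification. Conclusion: The paper proposes a machine learning-based method for automatically classifying honey botanical origins and validates it on a standard honey hyperspectral imaging dataset, reaching classification accuracies of 95.13% and 92.80%. Abstract: In this paper, we propose a machine learning-based method for automatically classifying honey botanical origins. Dataset preparation, feature extraction, and classification are the three main steps of the proposed method. We use a class transformation method in the dataset preparation phase to maximize the separability across classes. The feature extraction phase employs the Linear Discriminant Analysis (LDA) technique for extracting relevant features and reducing the number of dimensions. In the classification phase, we use Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models to classify the extracted features of honey samples into their botanical origins. We evaluate our system using a standard honey hyperspectral imaging (HSI) dataset. Experimental findings demonstrate that the proposed system produces state-of-the-art results on this dataset, achieving the highest classification accuracy of 95.13% for hyperspectral image-based classification and 92.80% for hyperspectral instance-based classification.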

[76] SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

Liang Han,Xu Zhang,Haichuan Song,Kanle Shi,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

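TL;DR: SparseRecon significantly improves sparse-view 3D reconstruction quality through a feature consistency loss and an uncertainty-guided depth constraint.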

Details

Motivation: Existing generalization-based methods perform poorly on unseen views, while the reconstruction quality of overfitting-based methods is constrained by limited geometric cues. Method: SparseRecon is a novel neural implicit reconstruction method based on volume rendering-based feature consistency and an uncertainty-guided depth constraint. Result: Experiments show that SparseRecon outperforms existing methods, especially with sparse inputs and small overlapping views. Conclusion: SparseRecon reconstructs high-quality 3D geometry from sparse views and is particularly suited to scenarios with small overlapping views. Abstract: Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by the limited geometric cues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods and can produce high-quality geometry with sparse-view input, especially in scenarios with small overlapping views. Project page: https://hanl2010.github.io/SparseRecon/.

[77] Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi,Sanghyeok Lee,Byungoh Ko,Eunseo Kim,Jihyung Kil,Hyunwoo J. Kim

Main category: cs.CV

TL;DR: This paper introduces Representation Shift, a training-free token compression method compatible with FlashAttention, reducing computation costs and improving speed in video tasks without retraining or attention maps.

Details

Motivation: The increasing computational cost of self-attention in Transformers and the memory overhead of attention map construction motivate the need for efficient, training-free token compression methods compatible with optimized attention kernels like FlashAttention. Method: The paper proposes Representation Shift, a metric that measures token importance based on the degree of change in token representations, enabling token compression without reliance on attention maps or retraining. Result: Experiments show that Representation Shift achieves up to 5.5% speedup in video-text retrieval and 4.4% in video QA while being compatible with FlashAttention and generalizing to CNNs and state space models. Conclusion: Representation Shift is a training-free, model-agnostic token compression method that effectively integrates with FlashAttention, offering significant speedups in video-text retrieval and video QA tasks. Abstract: Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. 
Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
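
A minimal sketch of attention-free token scoring in this spirit: each token is scored by how much its representation changes across a layer, and only the highest-scoring tokens are kept. The L2 distance, the top-k selection rule, and the function names are assumptions for illustration; the abstract only states that the metric measures the degree of change in each token's representation.

```python
import math


def representation_shift(before, after):
    """Per-token L2 distance between representations before and after a layer."""
    return [math.dist(b, a) for b, a in zip(before, after)]


def keep_top_tokens(before, after, keep):
    """Indices (in original order) of the `keep` tokens that changed most."""
    scores = representation_shift(before, after)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: four 2-D tokens entering and leaving a hypothetical layer.
tokens_in = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
tokens_out = [[0.0, 0.1], [1.0, 3.0], [2.0, 2.0], [6.0, 3.0]]
print(keep_top_tokens(tokens_in, tokens_out, keep=2))
```

Because the score needs only the layer's inputs and outputs, no attention map has to be materialized, which is what keeps this kind of scoring compatible with fused attention kernels.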

[78] Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models

Yuji Sato,Yasunori Ishii,Takayoshi Yamashita

Main category: cs.CV

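TL;DR: BiAnt is a large language model-based method for video-based long-term action anticipation that combines forward and backward prediction to overcome the unidirectional limitation of conventional methods, with improved performance validated on the Ego4D dataset.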

Details

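Motivation: Conventional methods extract features from past actions with encoders and predict future events with decoders; their unidirectional nature limits performance, so a more effective way to capture sub-actions within long-term actions is needed. Method: BiAnt uses a large language model for both forward and backward prediction to capture semantically distinct sub-actions within a scene. Result: Experiments on Ego4D show that BiAnt improves performance in terms of edit distance compared to baseline methods. Conclusion: BiAnt overcomes the unidirectional limitation of conventional methods and improves long-term action anticipation by combining forward and backward prediction. Abstract: Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.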
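
Edit distance between a predicted and a ground-truth action sequence counts the insertions, deletions, and substitutions needed to turn one into the other, so lower is better. The sketch below uses plain Levenshtein distance as an assumption; the exact variant and normalization used by the benchmark may differ, and the action labels are made up.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two sequences (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of pred and target[:j].
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


predicted = ["take", "cut", "put"]
ground_truth = ["take", "wash", "cut", "put"]
print(edit_distance(predicted, ground_truth))
```

Here the prediction misses one action, so a single insertion aligns the two sequences.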

[79] Advancing Welding Defect Detection in Maritime Operations via Adapt-WeldNet and Defect Detection Interpretability Analysis

Kamal Basha S,Athira Nambiar

Main category: cs.CV

TL;DR: This paper proposes Adapt-WeldNet and DDIA frameworks to enhance the performance and interpretability of weld defect detection systems, improving trust, safety, and reliability in critical offshore and marine operations.

Details

Motivation: The motivation is to overcome the limitations of traditional non-destructive testing methods and existing neural network-based approaches for weld defect detection, which often fail to detect subtle or internal defects and lack interpretability, posing safety concerns. Method: The paper proposes an adaptive framework called Adapt-WeldNet, which evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers. It also introduces the Defect Detection Interpretability Analysis (DDIA) framework, which uses Explainable AI (XAI) techniques like Grad-CAM and LIME, along with domain-specific evaluations by certified professionals. A Human-in-the-Loop (HITL) approach is also incorporated. Result: The result is the development of the Adapt-WeldNet framework for optimal defect detection and the DDIA framework for enhanced interpretability, which together improve the reliability, fairness, and accountability of defect detection systems through expert validation. Conclusion: The paper concludes that the proposed Adapt-WeldNet and DDIA frameworks significantly enhance the performance and interpretability of welding defect detection systems, thereby increasing trust, safety, and reliability in critical offshore and marine operations. Abstract: Weld defect detection is crucial for ensuring the safety and reliability of piping systems in the oil and gas industry, especially in challenging marine and offshore environments. Traditional non-destructive testing (NDT) methods often fail to detect subtle or internal defects, leading to potential failures and costly downtime. Furthermore, existing neural network-based approaches for defect classification frequently rely on arbitrarily selected pretrained architectures and lack interpretability, raising safety concerns for deployment. 
To address these challenges, this paper introduces "Adapt-WeldNet", an adaptive framework for welding defect detection that systematically evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters, optimizing defect detection and providing actionable insights. Additionally, a novel Defect Detection Interpretability Analysis (DDIA) framework is proposed to enhance system transparency. DDIA employs Explainable AI (XAI) techniques, such as Grad-CAM and LIME, alongside domain-specific evaluations validated by certified ASNT NDE Level II professionals. Incorporating a Human-in-the-Loop (HITL) approach and aligning with the principles of Trustworthy AI, DDIA ensures the reliability, fairness, and accountability of the defect detection system, fostering confidence in automated decisions through expert validation. By improving both performance and interpretability, this work enhances trust, safety, and reliability in welding defect detection systems, supporting critical operations in offshore and marine environments.

[80] $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models

Won June Cho,Hongjun Yoon,Daeky Jeong,Hyeongyeol Lim,Yosep Chong

Main category: cs.CV

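TL;DR: This paper proposes MV_Hybrid, a novel hybrid architecture combining state space models and ViT for pathology vision foundation models, which predicts gene expression and handles other downstream tasks more accurately, overcoming limitations of conventional ViT architectures.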

Details

Motivation: The high cost and technical complexity of spatial transcriptomics limit its clinical use, so a more practical way to predict spatial gene expression from routine histopathology images is needed. The authors hypothesize that architectural innovations beyond ViT may better capture the low-frequency, subtle morphological patterns that correlate with molecular phenotypes. Method: The authors propose MV_Hybrid, a hybrid backbone combining state space models (SSMs) with ViT, pretrain it on identical colorectal cancer datasets with the DINOv2 self-supervised learning method, and evaluate it under random split and leave-one-study-out (LOSO) settings. Result: In LOSO evaluation, MV_Hybrid achieves 57% higher gene expression prediction correlation than the best-performing ViT and 43% smaller performance degradation relative to the random split. MV_Hybrid also performs as well as or better than ViT on downstream tasks such as classification, patch retrieval, and survival prediction. Conclusion: MV_Hybrid is a promising next-generation pathology VFM backbone, showing better performance and robustness than ViT in gene expression prediction, classification, patch retrieval, and survival prediction. Abstract: Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare it with five other backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively.
Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: https://github.com/deepnoid-ai/MVHybrid.

[81] Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

Guanjie Huang,Danny H. K. Tsang,Shan Yang,Guangzhi Lei,Li Liu

Main category: cs.CV

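TL;DR: This paper proposes Cued-Agent, a new multi-agent system for automatic recognition of Mandarin Cued Speech that addresses the poor performance of conventional methods under limited data and performs well across multiple scenarios.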

Details

Motivation: Conventional Cued Speech recognition methods are constrained by limited data and complex multimodal fusion designs, resulting in suboptimal performance, so a new approach that remains effective with limited data is needed. Method: Cued-Agent uses a multi-agent system comprising a Multimodal Large Language Model-based Hand Recognition agent, a pretrained Transformer-based Lip Recognition agent, a Hand Prompt Decoding agent, and a Self-Correction Phoneme-to-Word agent. Result: Cued-Agent outperforms state-of-the-art methods in experiments, and the authors establish a mixed dataset of fourteen subjects. Conclusion: Cued-Agent recognizes Mandarin Cued Speech effectively, offering a new solution for communication with hearing-impaired individuals, and performs well in both normal and hearing-impaired scenarios. Abstract: Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-processing and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects.
Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.

[82] Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

Fei Zhang,Tianfei Zhou,Jiangchao Yao,Ya Zhang,Ivor W. Tsang,Yanfeng Wang

Main category: cs.CV

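TL;DR: This paper proposes DAPT, a framework that uses a decouple-before-align strategy to address information asymmetry in vision-language models, correcting the attention distribution and improving performance across multiple tasks.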

Details

Motivation: In prompt tuning (PT), the visual modality conveys more information than the textual modality, which biases the model's attention toward background regions; the visual and textual modalities need to be aligned to improve performance. Method: DAPT uses coarse and fine visual segmentation cues to explicitly decouple the visual modality into foreground and background representations, aligns them with the original foreground texts and hand-crafted background classes, and applies a visual pull-push regularization to strengthen visual attention. Result: DAPT shows superior performance in few-shot learning, base-to-novel generalization, and data-efficient learning. Conclusion: DAPT is an effective PT framework that uses the decouple-before-align concept to address information asymmetry in vision-language models and shows superior performance across multiple tasks. Abstract: Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation by exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.

[83] Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency

Xi Xue,Kunio Suzuki,Nabarun Goswami,Takuya Shintate

Main category: cs.CV

TL;DR: This paper proposes a dual-branch detection framework that combines RGB features and optical flow residuals to detect AI-generated videos, achieving strong performance across multiple models.

Details

Motivation: The motivation is to address the challenge of detecting AI-generated videos that exhibit high visual fidelity and coherent motion, where existing methods struggle to capture fine-grained temporal inconsistencies. Method: The method combines RGB appearance features with optical flow residuals using a dual-branch architecture. One branch detects appearance-level artifacts, while the other identifies motion anomalies in flow residuals. Result: Extensive experiments show that the proposed method is robust and generalizes well across ten diverse generative models, effectively detecting a wide range of forged videos. Conclusion: The proposed dual-branch framework leveraging spatial-temporal consistency is effective in detecting AI-generated forged videos, showing strong performance across diverse generative models. Abstract: The rapid advancement of diffusion-based video generation models has led to increasingly realistic synthetic content, presenting new challenges for video forgery detection. Existing methods often struggle to capture fine-grained temporal inconsistencies, particularly in AI-generated videos with high visual fidelity and coherent motion. In this work, we propose a detection framework that leverages spatial-temporal consistency by combining RGB appearance features with optical flow residuals. The model adopts a dual-branch architecture, where one branch analyzes RGB frames to detect appearance-level artifacts, while the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis. By integrating these complementary features, the proposed method effectively detects a wide range of forged videos. Extensive experiments on text-to-video and image-to-video tasks across ten diverse generative models demonstrate the robustness and strong generalization ability of the proposed approach.

[84] iSafetyBench: A video-language benchmark for safety in industrial environment

Raiyaan Abdullah,Yogesh Singh Rawat,Shruti Vyas

Main category: cs.CV

TL;DR: The paper introduces iSafetyBench, a video-language benchmark designed to evaluate model performance in industrial environments. It highlights the struggle of current models in recognizing hazardous activities and in multi-label scenarios, emphasizing the need for more robust safety-aware models.

Details

Motivation: The motivation is to address the gap in evaluating vision-language models' capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential. Method: The researchers introduced iSafetyBench, a new video-language benchmark for evaluating model performance in industrial environments. It contains 1,100 video clips annotated with open-vocabulary, multi-label action tags. Each clip is paired with multiple-choice questions for evaluation. Eight state-of-the-art models were tested under zero-shot conditions. Result: The results show that state-of-the-art video-language models struggle with iSafetyBench, especially in recognizing hazardous activities and in multi-label scenarios. Significant performance gaps were revealed. Conclusion: The study concludes that despite the strong performance of video-language models on existing benchmarks, they struggle with recognizing hazardous activities and multi-label scenarios in industrial environments. The introduction of iSafetyBench highlights the need for more robust, safety-aware multimodal models. Abstract: Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories.
Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/raiyaan-abdullah/iSafety-Bench.

[85] Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Janika Deborah Gajo,Gerarld Paul Merales,Jerome Escarcha,Brenden Ashley Molina,Gian Nartea,Emmanuel G. Maminta,Juan Carlos Roldan,Rowel O. Atienza

Main category: cs.CV

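TL;DR: This paper introduces Sari Sandbox, a high-fidelity 3D simulation environment for embodied agent research on retail shopping tasks, together with the accompanying benchmark dataset SariBench.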

Details

Motivation: To fill the gap in retail-specific simulation environments for embodied agent training by providing a high-fidelity, scalable simulation platform. Method: The authors develop a simulation environment called Sari Sandbox that supports virtual reality (VR) and a vision language model (VLM)-powered embodied agent, and introduce the SariBench dataset. Result: Sari Sandbox supports multiple store configurations and interactive items, and provides benchmarks and performance analysis against human performance. Conclusion: Sari Sandbox offers a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents on retail shopping tasks, and SariBench provides a dataset of human demonstrations. Abstract: We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari-sandbox-env.

[86] PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos

Tao Wu,Jingyuan Ye,Ying Fu

Main category: cs.CV

TL;DR: This paper proposes a multi-stage video restoration framework (PMR) to address atmospheric turbulence-induced distortions, combining a Dynamic Efficiency Index (DEI) for dynamic intensity quantification and achieving high-quality, efficient restoration.

Details

Motivation: Atmospheric turbulence causes geometric distortions and blurring in long-range dynamic scene videos, and existing methods struggle with edge detail restoration and mixed distortion elimination, especially under strong turbulence and complex dynamics. Method: A Dynamic Efficiency Index (DEI) is introduced to quantify video dynamic intensity, and a Physical Model-Driven Multi-Stage Video Restoration (PMR) framework is proposed, which includes de-tilting, motion segmentation enhancement, and de-blurring stages. Result: Experimental results show that the method effectively suppresses motion trailing artifacts, restores edge details, and exhibits strong generalization capabilities, particularly in high-turbulence and complex dynamic scenarios. Conclusion: The proposed PMR framework effectively addresses geometric distortions and blurring caused by atmospheric turbulence, achieving high restoration quality and efficiency, and will have significant applicability in real-world scenarios. Abstract: Geometric distortions and blurring caused by atmospheric turbulence degrade the quality of long-range dynamic scene videos. Existing methods struggle with restoring edge details and eliminating mixed distortions, especially under conditions of strong turbulence and complex dynamics. To address these challenges, we introduce a Dynamic Efficiency Index (DEI), which combines turbulence intensity, optical flow, and proportions of dynamic regions to accurately quantify video dynamic intensity under varying turbulence conditions and provide a high-dynamic turbulence training dataset. Additionally, we propose a Physical Model-Driven Multi-Stage Video Restoration (PMR) framework that consists of three stages: de-tilting for geometric stabilization, motion segmentation enhancement for dynamic region refinement, and de-blurring for quality restoration.
PMR employs lightweight backbones and stage-wise joint training to ensure both efficiency and high restoration quality. Experimental results demonstrate that the proposed method effectively suppresses motion trailing artifacts, restores edge details, and exhibits strong generalization capability, especially in real-world scenarios characterized by high turbulence and complex dynamics. We will make the code and datasets openly available.

[87] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model

Hanqi Chen,Xu Zhang,Xiaoliu Guan,Lielin Jiang,Guanzhong Wang,Zeyu Chen,Yi Liu

Main category: cs.CV

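TL;DR: This paper proposes Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features and adds a lightweight linear prediction mechanism, significantly speeding up inference for Diffusion Transformer models while preserving generation quality.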

Details

Motivation: Diffusion Transformers (DiTs) have strong generative capabilities, but their inherently sequential denoising process causes high inference latency, which limits deployment in real-time scenarios. Existing training-free acceleration methods typically overlook the evolving semantic focus across denoising stages and Transformer blocks, so a more efficient acceleration method is needed. Method: The Sortblock framework dynamically caches block-wise features based on their similarity across adjacent timesteps, adaptively determines a recomputation ratio by ranking the evolution of residuals, and adds a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Result: Experiments show that Sortblock achieves over 2x inference speedup across various tasks and DiT architectures, with only minor degradation in output quality. Conclusion: Sortblock is an effective, generalizable solution for accelerating diffusion-based generative models, significantly improving inference speed while preserving generation quality. Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
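
Editor's sketch (not the authors' code): the abstract describes ranking how block residuals evolve between adjacent timesteps, recomputing only the most-changed blocks, and linearly predicting features for skipped ones. The snippet below illustrates that idea under stated assumptions. The cosine-based change score, the fixed `recompute_ratio` argument (the paper determines this ratio adaptively), and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_blocks_to_recompute(prev_residuals, curr_residuals, recompute_ratio):
    """Rank blocks by residual change and return the indices to recompute.

    Assumption: change is scored as 1 - cosine similarity between the
    residuals of the same block at two adjacent timesteps.
    """
    change = [
        1.0 - cosine_similarity(p, c)
        for p, c in zip(prev_residuals, curr_residuals)
    ]
    num_recompute = max(1, math.ceil(recompute_ratio * len(change)))
    ranked = sorted(range(len(change)), key=lambda i: change[i], reverse=True)
    return sorted(ranked[:num_recompute])


def linear_predict(feature_two_steps_ago, feature_one_step_ago):
    """Linearly extrapolate a skipped block's feature from its last two values."""
    return [
        2.0 * later - earlier
        for earlier, later in zip(feature_two_steps_ago, feature_one_step_ago)
    ]


# Toy example: four blocks with 2-dimensional residuals.
prev = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
curr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
recompute = select_blocks_to_recompute(prev, curr, recompute_ratio=0.5)
predicted = linear_predict([1.0, 2.0], [2.0, 4.0])
```

In this toy run, blocks 1 and 2 change the most and are selected for recomputation; the remaining blocks would reuse cached features or the linear prediction.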

[88] DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Junyu Chen,Dongyun Zou,Wenkun He,Junsong Chen,Enze Xie,Song Han,Han Cai

Main category: cs.CV

TL;DR: DC-AE 1.5 enhances diffusion model performance by improving convergence through structured latent space and augmented training strategies, achieving better image generation quality at higher speeds.

Details

Motivation: The authors aim to overcome the issue where increasing latent channel numbers in autoencoders leads to slower convergence and poorer generation quality in diffusion models, limiting their performance. Method: The paper proposes two key innovations: Structured Latent Space to impose a channel-wise structure on the latent space, and Augmented Diffusion Training to accelerate convergence through additional objectives. Result: DC-AE 1.5 achieves faster convergence and better image generation quality compared to DC-AE, as demonstrated on the ImageNet 512x512 dataset. Conclusion: DC-AE 1.5 improves the convergence speed and generation quality of high-resolution diffusion models by introducing a structured latent space and augmented diffusion training. Abstract: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. 
Code: https://github.com/dc-ai-projects/DC-Gen.

[89] IN2OUT: Fine-Tuning Video Inpainting Model for Video Outpainting Using Hierarchical Discriminator

Sangwoo Youn,Minji Lee,Nokap Tony Park,Yeonggyoo Jeon,Taeyoung Na

Main category: cs.CV

TL;DR: This paper proposes a novel approach for video outpainting using a hierarchical discriminator and a specialized loss function, leading to improved performance over existing methods.

Details

Motivation: Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, the authors suggest using video inpainting models for outpainting, which excel in object flow learning and reconstruction. Method: The paper introduces a hierarchical discriminator that meets both global and local objectives and develops a specialized outpainting loss function that leverages both local and global features of the discriminator. Result: Fine-tuning on this adversarial loss function enhances the generator's ability to produce both visually appealing and globally coherent outpainted scenes. Conclusion: The proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Abstract: Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest the use of video inpainting models that excel in object flow learning and reconstruction in outpainting rather than solely generating the background as in existing methods. However, directly applying or fine-tuning inpainting models to outpainting has shown to be ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator's ability to produce both visually appealing and globally coherent outpainted scenes. 
Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials including the demo video and the code are available in SigPort.

[90] UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken

Runmin Cong,Zongji Yu,Hao Fang,Haoyan Sun,Sam Kwong

Main category: cs.CV

TL;DR: This paper proposes UIS-Mamba, a Mamba-based underwater instance segmentation model with Dynamic Tree Scan and Hidden State Weaken modules, achieving state-of-the-art performance on UIIS and USIS10K datasets.

Details

Motivation: Underwater Instance Segmentation tasks are crucial for underwater complex scene detection, but Mamba faces challenges due to underwater scene particularities like color distortion and blurred instance boundaries. Method: Proposed a Mamba-based underwater instance segmentation model named UIS-Mamba with Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW) modules. Result: Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets while maintaining a low number of parameters and computational complexity. Conclusion: UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, maintaining a low number of parameters and computational complexity. Abstract: Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. 
The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation to the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.

[91] Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

Seunggeun Chi,Enna Sachdeva,Pin-Hao Huang,Kwonjoon Lee

Main category: cs.CV

TL;DR: This paper introduces a novel approach to amodal completion by integrating physical constraints into a multi-regional inpainting technique, enhancing the realism and accuracy of object completions in dynamic human-object interaction scenarios.

Details

Motivation: The motivation is to overcome the limitations of existing methods that struggle with plausible completions in dynamic scenarios due to a limited understanding of human-object interactions. Method: The method involves using physical prior knowledge and a specialized multi-regional inpainting technique that defines primary and secondary regions based on occlusion likelihood, utilizing customized denoising strategies within a diffusion model. Result: The experimental results show significant improvement in accuracy and realism of generated completions in shape and visual detail, demonstrating robustness without ground-truth contact annotations. Conclusion: The paper concludes that the new approach significantly outperforms existing methods in HOI scenarios by incorporating physical constraints into a multi-regional inpainting technique, moving machine perception closer to human-like understanding. Abstract: Amodal completion, which is the process of inferring the full appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, such as those that use pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To solve this problem, we've developed a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method uses customized denoising strategies across these regions within a diffusion model. This improves the accuracy and realism of the generated completions in both their shape and visual detail. 
Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a more human-like understanding of dynamic environments. We also show that our pipeline is robust even without ground-truth contact annotations, which broadens its applicability to tasks like 3D reconstruction and novel view/pose synthesis.

[92] Reducing the gap between general purpose data and aerial images in concentrated solar power plants

M. A. Pérez-Cutiño,J. Valverde,J. Capitán,J. M. Díaz-Báñez

Main category: cs.CV

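TL;DR: This paper introduces AerialCSP, a new virtual dataset that addresses generalization and annotation problems in processing aerial images of CSP plants; pretraining models on it reduces the need for manual labeling and improves fault detection accuracy.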

Details

Motivation: Aerial images of CSP plants contain highly reflective surfaces and domain-specific elements, so existing machine learning models struggle to generalize to this setting, while collecting and labeling such data is costly and time-consuming. Method: The authors create AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants, for pretraining models before deployment. Result: AerialCSP provides annotated data for aerial inspection of CSP plants, and benchmarks show that pretraining significantly improves detection of rare and small defects. Conclusion: AerialCSP is a high-quality synthetic dataset that reduces the need for manually labeled data in practical applications while improving real-world fault detection accuracy. Abstract: In the context of Concentrated Solar Power (CSP) plants, aerial images captured by drones present a unique set of challenges. Unlike urban or natural landscapes commonly found in existing datasets, solar fields contain highly reflective surfaces and domain-specific elements that are uncommon in traditional computer vision benchmarks. As a result, machine learning models trained on generic datasets struggle to generalize to this setting without extensive retraining and large volumes of annotated data. However, collecting and labeling such data is costly and time-consuming, making it impractical for rapid deployment in industrial applications. To address this issue, we propose a novel approach: the creation of AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants. By generating synthetic data that closely mimic real-world conditions, our objective is to facilitate pretraining of models before deployment, significantly reducing the need for extensive manual labeling. Our main contributions are threefold: (1) we introduce AerialCSP, a high-quality synthetic dataset for aerial inspection of CSP plants, providing annotated data for object detection and image segmentation; (2) we benchmark multiple models on AerialCSP, establishing a baseline for CSP-related vision tasks; and (3) we demonstrate that pretraining on AerialCSP significantly improves real-world fault detection, particularly for rare and small defects, reducing the need for extensive manual labeling. AerialCSP is made publicly available at https://mpcutino.github.io/aerialcsp/.

[93] TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

Jiale Zhou,Wenhan Wang,Shikun Li,Xiaolei Qu,Xin Guo,Yizhong Liu,Wenzhong Tang,Xun Lin,Yefeng Zheng

Main category: cs.CV

TL;DR: This paper proposes TopoTTA, a test-time adaptation framework for tubular structure segmentation, which effectively handles domain shifts by enhancing topological representation and improving topological continuity.

Details

Motivation: The motivation is to overcome domain shifts in tubular structure segmentation, which affect topological structures and local features, leading to performance degradation in unseen target domains. Method: The method involves two stages: Stage 1 uses Topological Meta Difference Convolutions (TopoMDCs) to adapt models to cross-domain topological discrepancies, and Stage 2 employs a Topology Hard sample Generation (TopoHG) strategy to improve topological continuity. Result: Experiments show that TopoTTA achieves an average improvement of 31.81% in clDice across four scenarios and ten datasets, demonstrating its effectiveness in handling topological distribution shifts. Conclusion: TopoTTA is an effective test-time adaptation framework for tubular structure segmentation, addressing domain shifts and improving topological continuity. Abstract: Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. 
TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.

[94] SDMatte: Grafting Diffusion Models for Interactive Matting

Longfei Huang,Yu Liang,Hao Zhang,Jinwei Chen,Wei Dong,Lunde Chen,Wanyu Liu,Bo Li,Pengtao Jiang

Main category: cs.CV

TL;DR: SDMatte is a diffusion-driven interactive matting model that improves performance by incorporating visual prompts, enhanced embeddings, and attention mechanisms, leading to better extraction of fine-grained details.

Details

Motivation: Recent interactive matting methods perform well in capturing primary object regions but struggle with fine-grained edge details. Diffusion models offer strong potential for modeling complex data and generating realistic textures, which can improve interactive matting outcomes. Method: SDMatte utilizes diffusion models trained on large datasets, transforms text-driven interaction into visual prompt-driven interaction, integrates coordinate and opacity embeddings into U-Net, and implements masked self-attention to focus on specified areas. Result: Extensive experiments on multiple datasets validate the effectiveness and superior performance of SDMatte in interactive matting tasks. Conclusion: The proposed SDMatte model demonstrates superior performance in interactive matting by leveraging diffusion models and incorporating visual prompt-driven interaction, enhanced spatial and opacity sensitivity, and masked self-attention mechanisms. Abstract: Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte's sensitivity to spatial position information and opacity information. 
Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at https://github.com/vivoCameraResearch/SDMatte.
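
Editor's sketch (not the SDMatte implementation): the abstract mentions a masked self-attention mechanism that focuses the model on areas specified by visual prompts. A generic single-head version of masked self-attention can be illustrated as follows. The choice to mask keys outside the prompt region, the use of the tokens themselves as queries, keys, and values, and the function name are all assumptions made for illustration.

```python
import math


def masked_self_attention(tokens, key_mask):
    """Single-head self-attention where masked-out keys get zero weight.

    tokens:   list of equal-length float vectors (used as Q, K, and V here).
    key_mask: list of booleans; True means the token may be attended to.
    """
    if not any(key_mask):
        raise ValueError("key_mask must keep at least one token")
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Masked keys receive a score of -inf, so softmax gives them weight 0.
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
            for key, keep in zip(tokens, key_mask)
        ]
        max_score = max(scores)
        weights = [math.exp(s - max_score) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


# Toy example: the third token lies outside the prompt region and is ignored.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
attended = masked_self_attention(tokens, [True, True, False])
only_first = masked_self_attention(tokens, [True, False, False])
```

Because the third token is masked, every output row is a weighted average of the first two tokens only.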

[95] AutoDebias: Automated Framework for Debiasing Text-to-Image Models

Hongyi Cai,Mohammad Mahdinur Rahman,Mingkang Dong,Jie Li,Muxin Pu,Zhili Fang,Yinan Peng,Hanjun Luo,Yang Liu

Main category: cs.CV

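TL;DR: AutoDebias is a new framework that automatically identifies and mitigates social biases in text-to-image models; it works without prior knowledge of specific bias types and handles both subtle stereotypes and multiple interacting biases while preserving image quality.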

Details

Motivation: Existing debiasing methods work well for simple or well-known cases but struggle with subtle or overlapping biases. Method: AutoDebias uses vision-language models to detect biased visual patterns and constructs fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that promotes fairer outputs. Result: AutoDebias detects harmful patterns with 91.6% accuracy and reduces biased outputs from 90% to negligible levels while preserving the visual fidelity of the original model. Conclusion: AutoDebias is an effective framework that automatically identifies and mitigates harmful biases in text-to-image models while maintaining the original model's performance. Abstract: Text-to-Image (T2I) models generate high-quality images from text prompts but often exhibit unintended social biases, such as gender or racial stereotypes, even when these attributes are not mentioned. Existing debiasing methods work well for simple or well-known cases but struggle with subtle or overlapping biases. We propose AutoDebias, a framework that automatically identifies and mitigates harmful biases in T2I models without prior knowledge of specific bias types. Specifically, AutoDebias leverages vision-language models to detect biased visual patterns and constructs fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that promotes fairer outputs while preserving the original model's image quality and diversity. Unlike existing methods, AutoDebias effectively addresses both subtle stereotypes and multiple interacting biases. We evaluate the framework on a benchmark covering over 25 bias scenarios, including challenging cases where multiple biases occur simultaneously. AutoDebias detects harmful patterns with 91.6% accuracy and reduces biased outputs from 90% to negligible levels, while preserving the visual fidelity of the original model.

[96] Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang,Yiwu Zhong,Zi-Yuan Hu,Yeyao Tao,Liwei Wang

Main category: cs.CV

TL;DR: This paper introduces EgoMask and EgoMask-Train, addressing challenges in spatiotemporal video grounding for egocentric videos and showing improved model performance after fine-tuning.

Details

Motivation: Despite the growing importance of egocentric videos in applications like augmented reality and robotics, research in this area remains underdeveloped compared to exocentric videos. Method: Introduction of EgoMask, a pixel-level benchmark and EgoMask-Train, a large-scale training dataset for fine-grained spatiotemporal grounding in egocentric videos. Result: State-of-the-art spatiotemporal grounding models perform poorly on EgoMask but show significant improvement after fine-tuning on EgoMask-Train, while maintaining performance on exocentric datasets. Conclusion: EgoMask provides essential resources and insights for advancing egocentric video understanding. Abstract: Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. 
Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .

[97] CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text

Anju Rani,Daniel Ortiz-Arroyo,Petar Durdevic

Main category: cs.CV

TL;DR: CLIPTime is a new framework based on CLIP that predicts fungal growth stages and timestamps, effectively capturing temporal dynamics in biological growth.

Details

Motivation: Understanding the temporal dynamics of biological growth is critical across diverse fields, but vision-language models like CLIP have limited effectiveness in capturing temporal progression. Method: CLIPTime, a multimodal, multitask framework built upon the CLIP architecture, is proposed. It learns joint visual-textual embeddings and enables time-aware inference without requiring explicit temporal input during testing. It jointly performs classification and regression tasks using custom evaluation metrics like temporal accuracy and regression error. Result: Experimental results show that CLIPTime can accurately predict both the developmental stage and corresponding timestamp of fungal growth, demonstrating its capability in modeling biological progression. Conclusion: CLIPTime effectively models biological progression and produces interpretable, temporally grounded outputs, highlighting the potential of vision-language models in real-world biological monitoring applications. Abstract: Understanding the temporal dynamics of biological growth is critical across diverse fields such as microbiology, agriculture, and biodegradation research. Although vision-language models like Contrastive Language Image Pretraining (CLIP) have shown strong capabilities in joint visual-textual reasoning, their effectiveness in capturing temporal progression remains limited. To address this, we propose CLIPTime, a multimodal, multitask framework designed to predict both the developmental stage and the corresponding timestamp of fungal growth from image and text inputs. Built upon the CLIP architecture, our model learns joint visual-textual embeddings and enables time-aware inference without requiring explicit temporal input during testing. To facilitate training and evaluation, we introduce a synthetic fungal growth dataset annotated with aligned timestamps and categorical stage labels. 
CLIPTime jointly performs classification and regression, predicting discrete growth stages alongside continuous timestamps. We also propose custom evaluation metrics, including temporal accuracy and regression error, to assess the precision of time-aware predictions. Experimental results demonstrate that CLIPTime effectively models biological progression and produces interpretable, temporally grounded outputs, highlighting the potential of vision-language models in real-world biological monitoring applications.
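
Editor's sketch (not the CLIPTime training code): the abstract states that the model jointly performs classification of discrete growth stages and regression of continuous timestamps. One common way to train such a multitask model is a weighted sum of cross-entropy and squared error, illustrated below. The specific loss form, the `regression_weight` parameter, and the function names are assumptions; the abstract does not specify them.

```python
import math


def cross_entropy(logits, target_index):
    """Numerically stable cross-entropy for one sample given raw class logits."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(stage_logits, true_stage, predicted_time, true_time,
               regression_weight=1.0):
    """Stage classification loss plus weighted squared timestamp error."""
    classification = cross_entropy(stage_logits, true_stage)
    regression = (predicted_time - true_time) ** 2
    return classification + regression_weight * regression


# Toy example: two equally likely stages, timestamp off by 2 time units.
loss = joint_loss([0.0, 0.0], true_stage=0, predicted_time=3.0, true_time=1.0,
                  regression_weight=0.5)
```

Here the classification term is log(2) and the regression term contributes 0.5 * 4 = 2.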

[98] Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving

Stefan Englmeier,Max A. Büttner,Katharina Winter,Fabian B. Flohr

Main category: cs.CV

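TL;DR: This paper proposes a new multimodal framework for retrieving rare and complex human behavior scenarios in autonomous driving, combining SMPL motion sequences with video frames, and introduces the extension dataset WayMoCo.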

Details

Motivation: Autonomous driving systems must operate reliably in safety-critical scenarios involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. Method: The method combines SMPL (Skinned Multi-Person Linear)-based motion sequences with their corresponding video frames and encodes them into a shared multimodal embedding space aligned with natural language. Result: The paper introduces a new dataset, WayMoCo, and shows that the method outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval. Conclusion: The paper proposes a new context-aware motion retrieval framework to support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios. Abstract: Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and its context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
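
Editor's sketch (not the authors' pipeline): once motion-plus-frame clips and text queries live in a shared embedding space, retrieval reduces to ranking clips by similarity to the query embedding. The snippet below illustrates this with cosine similarity. The encoders that would produce the embeddings are omitted, and the scene names, vectors, and function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def retrieve(query_embedding, scene_embeddings, top_k=3):
    """Return the names of the top_k scenes most similar to the query."""
    query = normalize(query_embedding)
    scored = []
    for name, embedding in scene_embeddings.items():
        scene = normalize(embedding)
        similarity = sum(q * s for q, s in zip(query, scene))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy example with hypothetical 2-dimensional embeddings.
scenes = {
    "pedestrian_crossing": [1.0, 0.0],
    "cyclist_turning": [0.0, 1.0],
    "person_running": [1.0, 1.0],
}
top_two = retrieve([1.0, 0.1], scenes, top_k=2)
```

In this toy run the query is closest to `pedestrian_crossing`, followed by `person_running`.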

[99] PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA

Baisong Li,Xingwang Wang,Haixiao Xu

Main category: cs.CV

TL;DR: PIF-Net improves multispectral and hyperspectral image fusion by tackling the ill-posed problem through an invertible neural architecture and efficient fusion module, leading to superior performance.

Details

Motivation: MHIF is inherently ill-posed due to the trade-off between spectral and spatial information and data misalignment issues, which previous studies have not effectively addressed. Method: PIF-Net incorporates ill-posed priors using an invertible Mamba architecture for global spectral modeling and introduces a Fusion-Aware Low-Rank Adaptation module to dynamically calibrate spectral and spatial features. Result: Experiments on benchmark datasets show that PIF-Net significantly outperforms state-of-the-art methods in image restoration quality while ensuring model efficiency. Conclusion: The proposed PIF-Net effectively addresses the ill-posed nature of MHIF by integrating invertible Mamba architecture and a lightweight fusion module, achieving superior image restoration performance while maintaining computational efficiency. Abstract: The goal of multispectral and hyperspectral image fusion (MHIF) is to generate high-quality images that simultaneously possess rich spectral information and fine spatial details. However, due to the inherent trade-off between spectral and spatial information and the limited availability of observations, this task is fundamentally ill-posed. Previous studies have not effectively addressed the ill-posed nature caused by data misalignment. To tackle this challenge, we propose a fusion framework named PIF-Net, which explicitly incorporates ill-posed priors to effectively fuse multispectral images and hyperspectral images. To balance global spectral modeling with computational efficiency, we design a method based on an invertible Mamba architecture that maintains information consistency during feature transformation and fusion, ensuring stable gradient flow and process reversibility. Furthermore, we introduce a novel fusion module called the Fusion-Aware Low-Rank Adaptation module, which dynamically calibrates spectral and spatial features while keeping the model lightweight. 
Extensive experiments on multiple benchmark datasets demonstrate that PIF-Net achieves significantly better image restoration performance than current state-of-the-art methods while maintaining model efficiency.

[100] Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution

Yiwen Wang,Xinning Chai,Yuhong Zhang,Zhengxue Cheng,Jun Zhao,Rong Xie,Li Song

Main category: cs.CV

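TL;DR: This paper proposes SeTe-VSR, a new video super-resolution method that introduces semantic and spatio-temporal guidance and performs well at maintaining temporal consistency and improving detail recovery.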

Details

Motivation: Existing video super-resolution models offer insufficient control over the generation process, so achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. Method: High-level semantic information and spatio-temporal information are introduced in the latent diffusion space to guide the video super-resolution process. Result: SeTe-VSR outperforms existing methods in detail recovery and perceptual quality, preserving highly realistic visual content while significantly improving fidelity. Conclusion: By combining semantic and spatio-temporal guidance, SeTe-VSR balances detail recovery and temporal consistency in video super-resolution, demonstrating its effectiveness on complex tasks. Abstract: Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and temporal-spatio guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves high-reality visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.

[101] HyPCV-Former: Hyperbolic Spatio-Temporal Transformer for 3D Point Cloud Video Anomaly Detection

Jiaping Cao,Kangkang Zhou,Juan Du

Main category: cs.CV

TL;DR: HyPCV-Former is a novel method for video anomaly detection in 3D point cloud videos that achieves state-of-the-art results.

Details

Motivation: Previous methods using Euclidean representations are limited in capturing hierarchical event structures and spatio-temporal continuity. Method: HyPCV-Former extracts spatial features from point cloud sequences, embeds them into Lorentzian hyperbolic space, and uses hyperbolic multi-head self-attention to model temporal dynamics. Result: HyPCV-Former achieves state-of-the-art performance with a 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks. Conclusion: HyPCV-Former is a novel hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos that achieves state-of-the-art performance. Abstract: Video anomaly detection is a fundamental task in video surveillance, with broad applications in public safety and intelligent monitoring systems. Although previous methods leverage Euclidean representations in RGB or depth domains, such embeddings are inherently limited in capturing hierarchical event structures and spatio-temporal continuity. To address these limitations, we propose HyPCV-Former, a novel hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos. Our approach first extracts per-frame spatial features from point cloud sequences via point cloud extractor, and then embeds them into Lorentzian hyperbolic space, which better captures the latent hierarchical structure of events. To model temporal dynamics, we introduce a hyperbolic multi-head self-attention (HMHA) mechanism that leverages Lorentzian inner products and curvature-aware softmax to learn temporal dependencies under non-Euclidean geometry. Our method performs all feature transformations and anomaly scoring directly within full Lorentzian space rather than via tangent space approximation. 
Extensive experiments demonstrate that HyPCV-Former achieves state-of-the-art performance across multiple anomaly categories, with a 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks. The code will be released upon paper acceptance.
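
The Lorentzian inner product mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It assumes the unit hyperboloid (curvature -1) and a simple lift that computes the time-like coordinate from the Euclidean norm; the function names are hypothetical and this is not the authors' implementation, which has not been released.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v):
    """Place a Euclidean vector v on the unit hyperboloid <x, x>_L = -1.

    Assumption for illustration: the time-like coordinate is chosen as
    sqrt(1 + ||v||^2) so that the constraint holds.
    """
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # The clamp guards against arguments slightly below 1 from rounding.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
p = lift_to_hyperboloid([3.0, 4.0])
print(lorentz_inner(p, p))          # close to -1 by construction
print(lorentz_distance(origin, p))  # grows like log of the Euclidean norm
```

Distances that grow logarithmically with the Euclidean norm are one common reason hyperbolic space is said to suit hierarchical structure, although the abstract does not spell out this argument.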

[102] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Yuzhuo Chen,Zehua Ma,Jianhua Wang,Kai Kang,Shunyu Yao,Weiming Zhang

Main category: cs.CV

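TL;DR: LAMIC is a training-free multi-reference image composition method that introduces two attention mechanisms to achieve strong layout control, identity preservation, and background preservation.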

Details

Motivation: In controllable image synthesis, generating coherent and consistent images from multiple reference images remains a challenge. LAMIC is the first to extend single-reference diffusion models to multi-reference scenarios in a training-free manner. Method: Built on the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. Result: Experiments show that LAMIC achieves state-of-the-art performance on most major metrics, including ID-S, BG-S, IN-R, and AVG scores, and achieves the best DPG in complex composition tasks. Conclusion: LAMIC is a training-free multi-image composition framework with superior abilities in identity preservation, background preservation, layout control, and prompt following, demonstrating strong zero-shot generalization. Abstract: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. 
Our implementation is available at: https://github.com/Suchenl/LAMIC.

[103] SAMSA 2.0: Prompting Segment Anything with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA 2.0 improves hyperspectral medical image segmentation by combining spectral and spatial information without retraining, outperforming existing methods.

Details

Motivation: Traditional RGB-only models and prior spectral fusion methods underperform in hyperspectral medical imaging, especially in low-data and noisy scenarios. Method: SAMSA 2.0 uses spectral angle prompting to integrate spectral similarity with spatial cues for segmentation. Result: SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and +3.1% over prior spectral fusion approaches, with enhanced few-shot and zero-shot performance. Conclusion: SAMSA 2.0 improves segmentation accuracy and robustness in hyperspectral medical imaging by incorporating spectral angle prompting without retraining. Abstract: We present SAMSA 2.0, an interactive segmentation framework for hyperspectral medical imaging that introduces spectral angle prompting to guide the Segment Anything Model (SAM) using spectral similarity alongside spatial cues. This early fusion of spectral information enables more accurate and robust segmentation across diverse spectral datasets. Without retraining, SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and up to +3.1% over prior spectral fusion methods. Our approach enhances few-shot and zero-shot performance, demonstrating strong generalization in challenging low-data and noisy scenarios common in clinical imaging.
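
The spectral angle that gives the method its name is a standard similarity measure between two spectra. The sketch below computes it in pure Python and builds a per-pixel angle map against a reference spectrum. Using the spectrum of a user-clicked pixel as the reference, and the helper names, are assumptions for illustration; the abstract does not specify how the angle map is passed to SAM.

```python
import math

def spectral_angle(a, b):
    """Angle in radians between two non-zero spectra a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def spectral_angle_map(cube, reference):
    """Per-pixel angle to a reference spectrum.

    cube is a list of rows, each row a list of per-pixel spectra.
    """
    return [[spectral_angle(pixel, reference) for pixel in row] for row in cube]

# Toy 1x3 image with 3 spectral bands per pixel (values are made up).
cube = [[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 0.5]]]
reference = cube[0][0]  # hypothetical clicked pixel
print(spectral_angle_map(cube, reference))
```

Because the angle ignores vector magnitude, a pixel with the same spectral shape but different brightness scores as highly similar, which is one reason this measure is commonly used for hyperspectral data.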

[104] LesiOnTime -- Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI

Mohammed Kamran,Maria Bernathova,Raoul Varga,Christian Singer,Zsuzsanna Bago-Horvath,Thomas Helbich,Georg Langs,Philipp Seeböck

Main category: cs.CV

TL;DR: LesiOnTime combines temporal and clinical information to significantly improve segmentation of small early-stage lesions in breast MRI.

Details

Motivation: Existing deep learning methods focus mainly on segmenting large lesions and neglect the longitudinal and clinical information routinely used in clinical diagnosis, which is needed to detect small early-stage lesions. Method: The authors propose LesiOnTime, which comprises a Temporal Prior Attention block and a BI-RADS Consistency Regularization loss and combines longitudinal imaging with clinical scores for lesion segmentation. Result: On an in-house dataset, LesiOnTime outperforms existing methods by 5% in Dice, and ablation studies show that both components contribute. Conclusion: By integrating longitudinal imaging and BI-RADS scores, LesiOnTime improves small-lesion segmentation performance, underscoring the importance of temporal and clinical information in early breast cancer screening. Abstract: Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BI-RADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime

[105] Leveraging Convolutional and Graph Networks for an Unsupervised Remote Sensing Labelling Tool

Tulsi Patel,Mark W. Jones,Thomas Redfern

Main category: cs.CV

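TL;DR: This paper proposes an unsupervised method for labelling remote sensing imagery that uses convolutional and graph neural networks to extract a more robust feature space, enabling automatic labelling of geographical areas.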

Details

Motivation: Labelling remote sensing imagery is time-consuming and expensive, and traditional methods rely on pre-labelled data, which limits how efficiently new data can be labelled. Method: The paper takes an unsupervised approach that combines convolutional and graph neural networks to segment and encode Sentinel-2 satellite imagery, yielding a more robust feature space. Result: The method reduces outliers in the labelling tool, lets users label at a granular level, and forms rotationally invariant semantic relationships in the encoding space. Conclusion: The proposed method effectively addresses the limitations of traditional remote sensing image labelling and provides an efficient unsupervised labelling solution. Abstract: Machine learning for remote sensing imaging relies on up-to-date and accurate labels for model training and testing. Labelling remote sensing imagery is time and cost intensive, requiring expert analysis. Previous labelling tools rely on pre-labelled data for training in order to label new unseen data. In this work, we define an unsupervised pipeline for finding and labelling geographical areas of similar context and content within Sentinel-2 satellite imagery. Our approach removes limitations of previous methods by utilising segmentation with convolutional and graph neural networks to encode a more robust feature space for image comparison. Unlike previous approaches we segment the image into homogeneous regions of pixels that are grouped based on colour and spatial similarity. Graph neural networks are used to aggregate information about the surrounding segments enabling the feature representation to encode the local neighbourhood whilst preserving its own local information. This reduces outliers in the labelling tool, allows users to label at a granular level, and allows a rotationally invariant semantic relationship at the image level to be formed within the encoding space.

[106] EPANet: Efficient Path Aggregation Network for Underwater Fish Detection

Jinsong Yang,Zeyuan Hu,Yichen Li

Main category: cs.CV

TL;DR: This paper proposes EPANet, a lightweight and efficient network for underwater fish detection, which improves detection performance by enhancing feature integration and diversity.

Details

Motivation: Underwater fish detection is challenging due to low object resolution, background interference, and visual similarity between targets and surroundings, while existing methods suffer from high complexity and reduced efficiency. Method: The proposed EPANet includes an efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck) to enhance feature integration and local feature diversity. Result: Extensive experiments show that EPANet achieves superior detection accuracy and faster inference speed than state-of-the-art methods with comparable or lower parameter complexity. Conclusion: EPANet provides an efficient and accurate solution for underwater fish detection, outperforming existing methods in both detection accuracy and inference speed. Abstract: Underwater fish detection (UFD) remains a challenging task in computer vision due to low object resolution, significant background interference, and high visual similarity between targets and surroundings. Existing approaches primarily focus on local feature enhancement or incorporate complex attention mechanisms to highlight small objects, often at the cost of increased model complexity and reduced efficiency. To address these limitations, we propose an efficient path aggregation network (EPANet), which leverages complementary feature integration to achieve accurate and lightweight UFD. EPANet consists of two key components: an efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck). The EPA-FPN introduces long-range skip connections across disparate scales to improve semantic-spatial complementarity, while cross-layer fusion paths are adopted to enhance feature integration efficiency. 
The MS-DDSP bottleneck extends the conventional bottleneck structure by introducing finer-grained feature division and diverse convolutional operations, thereby increasing local feature diversity and representation capacity. Extensive experiments on benchmark UFD datasets demonstrate that EPANet outperforms state-of-the-art methods in terms of detection accuracy and inference speed, while maintaining comparable or even lower parameter complexity.

[107] Video Color Grading via Look-Up Table Generation

Seunghyun Shin,Dongmin Shin,Jisu Shin,Hae-Gon Jeon,Joon-Young Lee

Main category: cs.CV

TL;DR: This paper proposes a reference-based video color grading framework that uses a diffusion model to generate a look-up table (LUT), effectively transferring the artistic look, mood, and emotion from reference scenes to input videos while preserving structural details and enabling fast inference.

Details

Motivation: The motivation behind this research is to make video color grading more accessible to non-professionals by automating the process, which traditionally requires specialized skills and is time-consuming. The goal is to transfer artistic styles or moods from reference scenes to videos while preserving structural details. Method: The method involves generating a look-up table (LUT) using a diffusion model to align color attributes between reference scenes and the input video. The model ensures that high-level features like look, mood, and emotion are consistent between the reference and the video. Additionally, a pipeline is introduced to enhance low-level features such as contrast and brightness based on text prompts. Result: The experimental results, including extensive user studies, demonstrate the effectiveness of the proposed approach in achieving accurate and visually appealing color grading. The method allows for fast inference, maintains structural details, and provides customization options through text prompts. Conclusion: The paper concludes that their proposed reference-based video color grading framework effectively transfers the look, mood, and emotion from reference scenes to an input video using a LUT generated via a diffusion model. They also show that their method preserves structural details and allows for fast inference, with additional customization through text prompts for low-level features. Abstract: Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. 
Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to that of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate a user-preference via text prompts for low-level feature enhancement such as contrast and brightness, etc. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at https://github.com/seunghyuns98/VideoColorGrading.
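
To see why a LUT-based approach leaves structure untouched, note that a LUT maps each pixel's color independently of its neighbors. The following pure-Python sketch applies a 3D LUT to one RGB value with trilinear interpolation. The 3D layout, the grid size, and the interpolation scheme are assumptions for illustration; the abstract does not state which LUT format the diffusion model produces.

```python
def make_identity_lut(size):
    """Build a size^3 LUT that maps every grid color to itself (size >= 2)."""
    step = 1.0 / (size - 1)
    return [[[(r * step, g * step, b * step) for b in range(size)]
             for g in range(size)] for r in range(size)]

def apply_lut(lut, rgb):
    """Trilinearly interpolate a 3D LUT at an RGB value with channels in [0, 1]."""
    size = len(lut)
    idx, frac = [], []
    for c in rgb:
        pos = min(max(c, 0.0), 1.0) * (size - 1)
        lo = min(int(pos), size - 2)  # keep lo + 1 inside the grid
        idx.append(lo)
        frac.append(pos - lo)
    out = [0.0, 0.0, 0.0]
    # Blend the 8 surrounding grid entries, weighted by proximity.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((frac[0] if dr else 1.0 - frac[0]) *
                     (frac[1] if dg else 1.0 - frac[1]) *
                     (frac[2] if db else 1.0 - frac[2]))
                entry = lut[idx[0] + dr][idx[1] + dg][idx[2] + db]
                for k in range(3):
                    out[k] += w * entry[k]
    return tuple(out)

lut = make_identity_lut(5)
print(apply_lut(lut, (0.2, 0.55, 1.0)))  # identity LUT returns the input color
```

Grading a whole video then amounts to calling `apply_lut` on every pixel of every frame with the same table, which is also consistent with the fast inference the abstract reports.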

[108] Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

Daniel Wolf,Heiko Hillenhagen,Billurvan Taskin,Alex Bäuerle,Meinrad Beer,Michael Götz,Timo Ropinski

Main category: cs.CV

TL;DR: Current VLMs struggle with understanding relative anatomical positions in medical images, often relying on prior knowledge rather than image data, but visual prompts and the new MIRP benchmark offer avenues for improvement and research.

Details

Motivation: Accurate determination of relative positions of anatomical structures in medical images is a fundamental requirement for the clinical application of Vision-Language Models (VLMs), yet this capability remains underexplored. Method: The researchers evaluated leading VLMs (GPT-4o, Llama3.2, Pixtral, JanusPro) on their ability to determine relative positions in medical images, and explored whether visual prompts like alphanumeric or colored markers could enhance performance. Result: All tested VLMs performed poorly on relative positioning tasks in medical images, showing significantly lower accuracy compared to performance on natural images. Visual prompts provided only moderate improvements. Conclusion: The study concludes that current state-of-the-art Vision-Language Models (VLMs) struggle with determining relative positions in medical images, relying more on prior anatomical knowledge than image content, and the newly introduced MIRP dataset can facilitate future research in this domain. Abstract: Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs, GPT-4o, Llama3.2, Pixtral, and JanusPro, and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results remain significantly lower on medical images compared to observations made on natural images. 
Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.

[109] DBLP: Noise Bridge Consistency Distillation For Efficient And Reliable Adversarial Purification

Chihan Huang,Belal Alsinglawi,Islam Al-qudah

Main category: cs.CV

TL;DR: The paper proposes DBLP, an efficient diffusion-based framework for real-time adversarial purification, achieving SOTA robust accuracy and superior image quality.

Details

Motivation: The motivation behind the paper is the critical vulnerability of deep neural networks to adversarial perturbations and the limitation of existing diffusion-based adversarial purification methods that require intensive iterative denoising. Method: The paper introduces Diffusion Bridge Distillation for Purification (DBLP), a novel diffusion-based framework with a new objective called noise bridge distillation and adaptive semantic enhancement to improve adversarial purification. Result: Extensive experiments across multiple datasets demonstrate that DBLP achieves state-of-the-art (SOTA) robust accuracy, superior image quality, and around 0.2s inference time. Conclusion: The paper concludes that DBLP is a significant step towards real-time adversarial purification, achieving SOTA robust accuracy, superior image quality, and fast inference time. Abstract: Recent advances in deep neural networks (DNNs) have led to remarkable success across a wide range of tasks. However, their susceptibility to adversarial perturbations remains a critical vulnerability. Existing diffusion-based adversarial purification methods often require intensive iterative denoising, severely limiting their practical deployment. In this paper, we propose Diffusion Bridge Distillation for Purification (DBLP), a novel and efficient diffusion-based framework for adversarial purification. Central to our approach is a new objective, noise bridge distillation, which constructs a principled alignment between the adversarial noise distribution and the clean data distribution within a latent consistency model (LCM). To further enhance semantic fidelity, we introduce adaptive semantic enhancement, which fuses multi-scale pyramid edge maps as conditioning input to guide the purification process. 
Extensive experiments across multiple datasets demonstrate that DBLP achieves state-of-the-art (SOTA) robust accuracy, superior image quality, and around 0.2s inference time, marking a significant step toward real-time adversarial purification.

[110] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

Jizhihui Liu,Feiyi Du,Guangdao Zhu,Niu Lian,Jun Li,Bin Chen

Main category: cs.CV

TL;DR: HiPrune is a highly efficient, training-free token pruning method for VLMs that leverages hierarchical attention to maintain accuracy while drastically reducing computation and improving inference speed.

Details

Motivation: Vision-Language Models (VLMs) suffer from high computational overhead due to lengthy visual token sequences. Existing methods either rely on special tokens or require task-specific training, limiting scalability. There was a need for a more efficient, generalizable, and training-free solution. Method: HiPrune exploits the hierarchical attention structure within vision encoders to select three types of informative tokens—Anchor tokens, Buffer tokens, and Register tokens—without requiring retraining, and integrates seamlessly with any ViT-based VLM. Result: HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens and maintaining 99.5% accuracy with just 11.1% tokens. It also reduces inference FLOPs and latency by up to 9×. Conclusion: HiPrune is a training-free and model-agnostic token pruning framework that effectively reduces computational overhead in Vision-Language Models (VLMs) while maintaining high task accuracy and significantly improving inference efficiency. Abstract: Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. 
Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9×, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
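
The three token types described in the abstract can be sketched as a simple selection routine. This is a rough illustration under stated assumptions, not the released HiPrune code: attention is given as one score per token, tokens lie on a row-major grid, buffers are the 4-connected neighbors of anchors, and the function name and budgets `n_anchor` and `n_register` are hypothetical.

```python
def select_tokens(mid_attn, deep_attn, grid_w, n_anchor, n_register):
    """Pick anchor, buffer, and register token indices from attention scores.

    mid_attn / deep_attn: per-token scores from a middle and a deep layer.
    grid_w: width of the row-major token grid (an assumed layout).
    """
    n = len(mid_attn)
    grid_h = n // grid_w
    # Anchors: highest attention in the (object-centric) middle layer.
    by_mid = sorted(range(n), key=lambda i: mid_attn[i], reverse=True)
    anchors = set(by_mid[:n_anchor])
    # Buffers: grid neighbors of anchors, kept for spatial continuity.
    buffers = set()
    for i in anchors:
        row, col = divmod(i, grid_w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < grid_h and 0 <= c < grid_w:
                buffers.add(r * grid_w + c)
    buffers -= anchors
    # Registers: highest deep-layer attention among tokens not yet kept.
    kept = anchors | buffers
    rest = [i for i in range(n) if i not in kept]
    rest.sort(key=lambda i: deep_attn[i], reverse=True)
    registers = set(rest[:n_register])
    return sorted(anchors), sorted(buffers), sorted(registers)

# Toy 3x3 grid: the centre token dominates the middle layer.
mid = [0, 0, 0, 0, 9, 0, 0, 0, 0]
deep = [5, 0, 1, 0, 0, 0, 2, 0, 3]
print(select_tokens(mid, deep, grid_w=3, n_anchor=1, n_register=2))
```

All tokens outside the three returned sets would be dropped before the language model sees them, which is where the reported FLOP and latency savings would come from.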

[111] Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

Qi Chen,Lingxiao Yang,Yun Chen,Nailong Zhao,Jianhuang Lai,Jie Shao,Xiaohua Xie

Main category: cs.CV

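TL;DR: This paper proposes FreeCP, a training-free class purification framework for semantic segmentation that significantly improves open-vocabulary semantic segmentation by addressing class redundancy and visual-language ambiguity.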

Details

Motivation: Existing training-free methods overlook the impact of class redundancy and visual-language ambiguity on class activation, which can lead to suboptimal class activation maps and affinity-refined activation maps. Method: The paper proposes FreeCP, a new training-free method that purifies semantic categories and rectifies errors caused by redundancy and ambiguity, then uses the purified class representations to produce the final segmentation predictions. Result: Extensive experiments on eight benchmark datasets validate the effectiveness of FreeCP. Conclusion: FreeCP is an effective training-free class purification framework that significantly boosts segmentation performance when combined with existing OVSS methods. Abstract: Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training-free methods for OVSS. Existing training-free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual-language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity-refined activation maps. Motivated by these observations, we propose FreeCP, a novel training-free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP's effectiveness. Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.

[112] Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

Jens U. Kreber,Joerg Stueckler

Main category: cs.CV

TL;DR: PhysNAP is a novel diffusion-based method for generating physically plausible articulated objects that align with point clouds, using SDFs and physical constraints.

Details

Motivation: Articulated objects are common in everyday environments, and generating physically plausible versions that align with partial point clouds is challenging. Existing methods may not adequately address physical constraints or category-specific features. Method: PhysNAP uses a diffusion model guided by point cloud alignment loss and physical constraints (non-penetration and mobility) based on SDFs. It also incorporates category-aware improvements for better alignment. Result: PhysNAP improves constraint consistency in generated objects and provides a tradeoff with generative ability when evaluated on the PartNet-Mobility dataset. Conclusion: PhysNAP is an effective method for generating articulated objects that balances generative ability and constraint consistency, especially when compared to unguided diffusion models. Abstract: Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP can improve constraint consistency and provides a tradeoff with generative ability.

[113] Weakly Supervised Virus Capsid Detection with Image-Level Annotations in Electron Microscopy Images

Hannah Kniesel,Leon Sick,Tristan Payer,Tim Bergner,Kavitha Shaga Devan,Clarissa Read,Paul Walther,Timo Ropinski

Main category: cs.CV

TL;DR: This paper proposes a weakly supervised object detection method using image-level annotations and knowledge distillation, eliminating the need for expensive bounding box annotations.

Details

Motivation: Annotations for object detection are expensive and time-consuming, requiring expert knowledge. A more efficient approach is needed. Method: A domain-specific weakly supervised object detection algorithm using image-level annotations and knowledge distillation from a pre-trained model. Result: The proposed method successfully generates pseudo-labels that enable training an object detection model without bounding box annotations, achieving superior performance in limited annotation time scenarios. Conclusion: The proposed pseudo-labeling method outperforms other weak labeling approaches and even ground truth labels when annotation time is limited. Abstract: Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.

[114] CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry

Jingchao Xie,Oussema Dhaouadi,Weirong Chen,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

TL;DR: This paper introduces CoProU-VO, an improved unsupervised visual odometry method that combines uncertainty across frames, leading to better performance in dynamic environments.

Details

Motivation: Unsupervised visual odometry methods struggle in dynamic scenes due to the static scene assumption. Traditional uncertainty modeling only considers single-frame information, missing the temporal uncertainty across frames. Method: The paper proposes CoProU-VO, an end-to-end approach that uses a probabilistic formulation to combine uncertainty from target and reference frames. It is built on vision transformer backbones and simultaneously learns depth, uncertainty estimation, and camera poses. Result: Experiments on KITTI and nuScenes datasets show that CoProU-VO outperforms previous unsupervised monocular methods, particularly in challenging highway scenes. Ablation studies confirm the effectiveness of cross-frame uncertainty propagation. Conclusion: The paper concludes that CoProU-VO, by combining uncertainty modeling across temporal frames, achieves significant improvements in visual odometry, especially in dynamic scenes. Abstract: Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. 
To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.

[115] Uncertainty-Aware Likelihood Ratio Estimation for Pixel-Wise Out-of-Distribution Detection

Marc Hölle,Walter Kellermann,Vasileios Belagiannis

Main category: cs.CV

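TL;DR: This paper proposes an uncertainty-aware likelihood ratio estimation method for detecting unknown objects in semantic segmentation, addressing the tendency of existing methods to confuse rare objects with unknown objects in complex scenes.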

Details

Motivation: Semantic segmentation models often fail in real-world autonomous driving scenarios by misclassifying unknown objects, and existing pixel-wise out-of-distribution detection methods perform poorly in complex scenes, so a more effective way to distinguish known from unknown objects is needed. Method: Introduces an uncertainty-aware likelihood ratio estimation method that uses an evidential classifier within a likelihood ratio test to distinguish known from unknown pixel features of a semantic segmentation model while explicitly accounting for uncertainty. Result: Evaluated on five standard benchmark datasets, the method achieves the lowest average false positive rate (2.5%) while maintaining high average precision (90.91%), with negligible computational overhead. Conclusion: By incorporating uncertainty, the method leverages outlier exposure more effectively and addresses the limitations of existing methods in complex scenes. Abstract: Semantic segmentation models trained on known object classes often fail in real-world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel-wise out-of-distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty-aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state-of-the-art methods while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at https://github.com/glasbruch/ULRE.

[116] A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)

Yihe Tian,Kwan Man Cheng,Zhengbo Zhang,Tao Zhang,Suju Li,Dongmei Yan,Bing Xu

Main category: cs.CV

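TL;DR: This study proposes a new reconstruction framework and develops the Extended VIIRS-like Artificial Nighttime Light (EVAL) dataset for China, which significantly improves on the accuracy of existing products and extends the time range back to 1986.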

Details

Motivation: Nighttime light remote sensing data from the NPP-VIIRS sensor begin in 2012, which limits long-term time-series studies of earlier periods. Existing methods have two main shortcomings: underestimation of light intensity and structural omission. Method: Proposes a two-stage reconstruction framework consisting of construction and refinement. The construction stage uses a Hierarchical Fusion Decoder (HFD), and the refinement stage uses a Dual Feature Refiner (DFR) combined with high-resolution impervious surface masks. Result: Developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) dataset for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL raises R² from 0.68 to 0.80 and lowers RMSE from 1.27 to 0.99. Conclusion: EVAL significantly improves on existing state-of-the-art products, exhibits high temporal consistency, and correlates strongly with socioeconomic parameters, confirming its reliability for long-term analysis. Abstract: Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current methods still suffer from two significant shortcomings: the underestimation of light intensity and the structural omission. To overcome these limitations, we propose a novel reconstruction framework consisting of a two-stage process: construction and refinement. The construction stage features a Hierarchical Fusion Decoder (HFD) designed to enhance the fidelity of the initial reconstruction. The refinement stage employs a Dual Feature Refiner (DFR), which leverages high-resolution impervious surface masks to guide and enhance fine-grained structural details. Based on this framework, we developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL significantly outperforms existing state-of-the-art products, boosting the $\text{R}^2$ from 0.68 to 0.80 while lowering the RMSE from 1.27 to 0.99. Furthermore, EVAL exhibits excellent temporal consistency and maintains a high correlation with socioeconomic parameters, confirming its reliability for long-term analysis. 
The resulting EVAL dataset provides a valuable new resource for the research community and is publicly available at https://doi.org/10.11888/HumanNat.tpdc.302930.
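
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how RMSE and the coefficient of determination are commonly computed between reference and reconstructed pixel values. The function names and the flat-list inputs are illustrative assumptions; the abstract does not state which R² variant the authors used (for example, some studies report the squared Pearson correlation instead).

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumption: this is the standard regression definition, which may
    differ from the exact variant used in the paper.
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

A higher R² together with a lower RMSE, as reported for EVAL, means the reconstruction explains more of the variance in the reference data while deviating less from it on average.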

[117] Wukong Framework for Not Safe For Work Detection in Text-to-Image systems

Mingrui Liu,Sixiao Zhang,Cheng Long

Main category: cs.CV

TL;DR: Wukong is a novel NSFW detection framework for T2I systems that leverages early diffusion process outputs and pre-trained model parameters, providing efficient and accurate detection without full image generation.

Details

Motivation: To efficiently and accurately detect NSFW content in T2I generation while avoiding computational costs and adversarial vulnerabilities of existing methods. Method: Proposed Wukong, a transformer-based NSFW detection framework, which uses intermediate outputs from early denoising steps and reuses U-Net's pre-trained cross-attention parameters. Introduced a new dataset with prompts, seeds, and image-specific NSFW labels for evaluation. Result: Wukong significantly outperforms text-based safeguards and achieves comparable accuracy to image-based filters while offering much greater efficiency. Conclusion: Wukong is an efficient and accurate NSFW detection framework that operates within the diffusion process of T2I systems, leveraging intermediate outputs and reusing pre-trained cross-attention parameters to enable early detection without waiting for full image generation. Abstract: Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and (2) cross-attention layers in U-Net are crucial for aligning text and image regions. 
Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses U-Net's pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for full image generation. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this and two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to that of image filters, while offering much greater efficiency.

[118] GeoMoE: Divide-and-Conquer Motion Field Modeling with Mixture-of-Experts for Two-View Geometry

Jiajun Le,Jiayi Ma

Main category: cs.CV

TL;DR: GeoMoE is a new streamlined framework for two-view geometry that effectively models heterogeneous motion patterns using a Mixture-of-Experts approach, resulting in improved motion field estimation.

Details

Motivation: Traditional methods fail to account for the variability in motion fields caused by extreme viewpoint and scale changes and depth discontinuities, leading to inaccurate estimations. Method: The paper introduces GeoMoE, which uses a Probabilistic Prior-Guided Decomposition strategy and an MoE-Enhanced Bi-Path Rectifier to decompose and refine motion fields by assigning dedicated experts to motion sub-fields. Result: GeoMoE achieves superior performance in relative pose and homography estimation while demonstrating strong generalization capabilities. Conclusion: GeoMoE provides a new framework for motion field modeling in two-view geometry that effectively addresses heterogeneous motion patterns, outperforming existing methods in accuracy and generalization. Abstract: Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. 
Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at https://github.com/JiajunLe/GeoMoE.

[119] DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

Junzhe Lu,Jing Lin,Hongkun Dou,Ailing Zeng,Yue Deng,Xian Liu,Zhongang Cai,Lei Yang,Yulun Zhang,Haoqian Wang,Ziwei Liu

Main category: cs.CV

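TL;DR: This paper proposes DPoser-X, a diffusion-based prior model for 3D whole-body human poses, which solves pose-related tasks through variational diffusion sampling and introduces a truncated timestep scheduling method and a masked training mechanism to improve performance and generalization.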

Details

Motivation: Building a versatile and robust whole-body human pose prior remains challenging because of the inherent complexity of human poses and the scarcity of high-quality whole-body pose datasets. Method: The paper introduces a diffusion model named DPoser as a body pose prior and extends it to DPoser-X for expressive whole-body human pose modeling. The approach unifies various pose-centric tasks as inverse problems and solves them through variational diffusion sampling. To improve performance on downstream applications, it introduces a truncated timestep scheduling method designed for the characteristics of pose data. It also proposes a masked training mechanism that effectively combines whole-body and part-specific datasets. Result: Extensive experiments show that DPoser-X performs well across multiple benchmarks, including body, hand, face, and whole-body pose modeling, and consistently outperforms state-of-the-art alternatives. Conclusion: DPoser-X sets a new benchmark for whole-body human pose prior modeling and demonstrates the potential of diffusion models in this area. Abstract: We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.

[120] Backdoor Attacks on Deep Learning Face Detection

Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi

Main category: cs.CV

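TL;DR: This paper studies backdoor attacks on face detection, namely Object Generation Attacks (dubbed Face Generation Attacks) and a Landmark Shift Attack, and proposes corresponding mitigations.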

Details

Motivation: To improve the security and robustness of face recognition systems operating in unconstrained environments, researchers need to understand the attacks these systems may face in the face detection and alignment stages. Method: Demonstrates Object Generation Attacks on face detection, dubbed Face Generation Attacks, together with a Landmark Shift Attack, and proposes mitigation strategies. Result: Demonstrates for the first time a backdoor attack on the coordinate regression task of face detectors and shows that the attacks are effective. Conclusion: The paper presents Object Generation Attacks against face detection, demonstrates a Landmark Shift Attack that backdoors the coordinate regression task, and provides mitigations for these vulnerabilities. Abstract: Face Recognition Systems that operate in unconstrained environments capture images under varying conditions, such as inconsistent lighting or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.

[121] Minimum Data, Maximum Impact: 20 annotated samples for explainable lung nodule classification

Luisa Gallée,Catharina Silvia Lisson,Christoph Gerhard Lisson,Daniela Drees,Felix Weig,Daniel Vogele,Meinrad Beer,Michael Götz

Main category: cs.CV

TL;DR: This paper proposes using synthetic data generated by an enhanced Diffusion Model to improve the training of explainable AI models for medical image diagnosis, achieving significant performance gains in attribute and target prediction accuracy.

Details

Motivation: The motivation is to enhance clinicians' trust and usability in medical image diagnosis by integrating pathology-related visual attributes into AI decision-making, mirroring established radiological diagnostic criteria. However, the scarcity of large-scale attribute-annotated datasets limits the adoption of such models. Method: The researchers enhanced a Diffusion Model with attribute conditioning and trained it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. The generated synthetic images were then incorporated into the training of an explainable model to evaluate performance improvements. Result: Incorporating synthetic images into the training of the explainable model increased attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. Conclusion: The study concludes that synthetic data can effectively overcome the limitations of small annotated datasets, enhancing the performance and applicability of explainable models in medical image analysis. Abstract: Classification models that provide human-interpretable explanations enhance clinicians' trust and usability in medical image diagnosis. One research focus is the integration and prediction of pathology-related visual attributes used by radiologists alongside the diagnosis, aligning AI decision-making with clinical reasoning. Radiologists use attributes like shape and texture as established diagnostic criteria and mirroring these in AI decision-making both enhances transparency and enables explicit validation of model outputs. However, the adoption of such models is limited by the scarcity of large-scale medical image datasets annotated with these attributes. To address this challenge, we propose synthesizing attribute-annotated data using a generative model. 
We enhance the Diffusion Model with attribute conditioning and train it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. Incorporating its generated images into the training of an explainable model boosts performance, increasing attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. This work highlights the potential of synthetic data to overcome dataset limitations, enhancing the applicability of explainable models in medical image analysis.

[122] Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

Junhao Zheng,Jiahao Sun,Chenhao Lin,Zhengyu Zhao,Chen Ma,Chong Zhang,Cong Wang,Qian Wang,Chao Shen

Main category: cs.CV

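TL;DR: This paper presents the first comprehensive defense benchmark against adversarial patch attacks, revealing the shortcomings of existing methods and pointing to directions for improvement.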

Details

Motivation: Existing defenses against adversarial patch attacks lack a unified and comprehensive evaluation framework, leading to inconsistent and incomplete assessments. Method: Re-evaluates 11 representative defenses and builds a large-scale patch defense benchmark covering 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. Result: Three main findings: the difficulty of defending against naturalistic patches lies in the data distribution rather than in high frequencies; the average precision of the attacked object reflects defense performance better than patch detection accuracy; and adaptive attacks can bypass existing defenses, while defenses with complex/stochastic models or universal patch properties are relatively robust. Conclusion: The paper presents the first defense benchmark for patch attacks, reveals the limitations of existing defenses, and points to directions for improvement. Abstract: Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.
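
The AP@0.5 figure above counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full average-precision computation or the benchmark's own evaluation code. The `(x1, y1, x2, y2)` corner format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True when a prediction counts as a hit under the AP@threshold criterion."""
    return iou(pred_box, gt_box) >= threshold
```

An adversarial patch that shrinks or shifts a predicted box far enough to push its IoU below 0.5 turns a hit into a miss, which is why the average precision of the attacked object is sensitive to such attacks.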

[123] Can Large Pretrained Depth Estimation Models Help With Image Dehazing?

Hongfei Zhang,Kun Zhou,Ruizheng Wu,Jiangbo Lu

Main category: cs.CV

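TL;DR: This paper proposes an RGB-D fusion module that works with a variety of dehazing architectures and effectively handles the spatially varying haze found in real-world scenes.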

Details

Motivation: Image dehazing remains challenging because haze varies spatially in real-world scenes. The architecture-specific designs of existing methods limit their adaptability across different scenarios. Method: Building on pretrained depth representations, the paper proposes a plug-and-play RGB-D fusion module to address the challenge of spatially varying haze in real scenes. Result: Experiments show that the proposed module achieves good dehazing results on multiple benchmarks and can be combined with a variety of dehazing architectures. Conclusion: The proposed RGB-D fusion module performs well across diverse dehazing architectures, validating the effectiveness and broad applicability of the approach. Abstract: Image dehazing remains a challenging problem due to the spatially varying nature of haze in real-world scenes. While existing methods have demonstrated the promise of large-scale pretrained models for image dehazing, their architecture-specific designs hinder adaptability across diverse scenarios with different accuracy and efficiency requirements. In this work, we systematically investigate the generalization capability of pretrained depth representations, learned from millions of diverse images, for image dehazing. Our empirical analysis reveals that the learned deep depth features maintain remarkable consistency across varying haze levels. Building on this insight, we propose a plug-and-play RGB-D fusion module that seamlessly integrates with diverse dehazing architectures. Extensive experiments across multiple benchmarks validate both the effectiveness and broad applicability of our approach.

[124] D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Chende Zheng,Ruiqi Suo,Chenhao Lin,Zhengyu Zhao,Le Yang,Shuai Liu,Minghui Yang,Cong Wang,Chao Shen

Main category: cs.CV

TL;DR: The paper proposes a novel training-free method called D3 for detecting AI-generated videos by leveraging second-order temporal discrepancies, offering significant performance improvements over existing methods.

Details

Motivation: The motivation stems from the increasing ease of generating high-fidelity AI videos (e.g., Sora), raising concerns about synthetic content dissemination, and the limitations of existing detection methods due to insufficient exploration of temporal artifacts in synthetic videos. Method: D3 is a training-free detection method based on second-order temporal discrepancies, derived from a theoretical framework established through second-order dynamical analysis under Newtonian mechanics. It leverages the Second-order Central Difference features tailored for temporal artifact detection. Result: The D3 method outperformed the previous best method by 10.39% (absolute) mean Average Precision on the GenVideo dataset and demonstrated robust performance and computational efficiency across 4 open-source datasets totaling 40 subsets. Conclusion: The proposed D3 method demonstrates exceptional computational efficiency and robust performance in detecting AI-generated videos, offering a promising solution for addressing concerns over synthetic content dissemination. Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. 
We validate the superiority of D3 on 4 open-source datasets (GenVideo, VideoPhy, EvalCrafter, VidProM), comprising 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness. Our code is available at https://github.com/Zig-HS/D3.
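
As a rough illustration of the "difference of differences" idea, the sketch below applies the standard second-order central difference, x[t-1] - 2·x[t] + x[t+1], to a sequence of per-frame feature vectors. The function name and the plain-list representation are assumptions. How D3 extracts frame features and turns these differences into a real-versus-generated score is not described in the abstract and is not reproduced here.

```python
def second_order_central_difference(frames):
    """Second-order central difference along the time axis.

    `frames` is a list of equal-length feature vectors, one per video frame
    (a hypothetical representation chosen for this sketch). Returns one
    vector for each interior frame t: frames[t-1] - 2*frames[t] + frames[t+1].
    """
    if len(frames) < 3:
        raise ValueError("at least three frames are required")
    return [
        [nxt - 2 * cur + prev
         for prev, cur, nxt in zip(frames[t - 1], frames[t], frames[t + 1])]
        for t in range(1, len(frames) - 1)
    ]
```

Features that change at a constant rate give a zero second-order difference, while features that accelerate do not, so this quantity acts as a discrete analogue of acceleration along the temporal axis.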

[125] MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models

Jiale Li,Mingrui Wu,Zixiang Jin,Hao Chen,Jiayi Ji,Xiaoshuai Sun,Liujuan Cao,Rongrong Ji

Main category: cs.CV

TL;DR: This study introduces MIHBench, a benchmark for evaluating object-related hallucinations in multi-image MLLMs, and proposes a Dynamic Attention Balancing method to effectively reduce such hallucinations, enhancing model performance in multi-image scenarios.

Details

Motivation: While hallucinations in single-image MLLMs have been widely studied, those in multi-image settings remain largely unexplored. The authors aim to fill this gap by systematically analyzing multi-image hallucinations and proposing a dedicated benchmark and mitigation strategy. Method: The authors conducted a systematic study of hallucinations in multi-image MLLMs and introduced MIHBench, a benchmark for evaluating object-related hallucinations. They identified key factors contributing to hallucinations and proposed a Dynamic Attention Balancing mechanism to address these issues. Result: MIHBench was successfully developed with three core tasks targeting object-related hallucinations. The evaluation revealed key factors influencing hallucination occurrences, and the proposed Dynamic Attention Balancing mechanism showed significant improvements in reducing hallucinations across multiple state-of-the-art MLLMs. Conclusion: The study concludes that multi-image hallucinations in MLLMs are influenced by several key factors, such as the number of image inputs and single-image hallucination tendencies. The proposed Dynamic Attention Balancing mechanism effectively mitigates these hallucinations, enhancing the stability and accuracy of semantic integration in multi-image settings. Abstract: Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. 
MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.

[126] YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Guanning Zeng,Xiang Zhang,Zirui Wang,Haiyang Xu,Zeyuan Chen,Bingnan Li,Zhuowen Tu

Main category: cs.CV

TL;DR: YOLO-Count is a novel, differentiable model for object counting and quantity control in text-to-image generation, achieving state-of-the-art accuracy and effective guidance for generative models.

Details

Motivation: To tackle both general object counting challenges and enable precise quantity control in text-to-image generation, addressing variations in object size and spatial distribution. Method: The model introduces a 'cardinality' map as a novel regression target, leveraging representation alignment and a hybrid strong-weak supervision scheme. Its architecture is fully differentiable, allowing gradient-based optimization for accurate object count estimation and generative model guidance. Result: YOLO-Count achieves state-of-the-art counting accuracy and provides robust, effective quantity control for text-to-image systems, as demonstrated by extensive experiments. Conclusion: YOLO-Count is a differentiable open-vocabulary object counting model that effectively addresses general counting challenges and enables precise quantity control for text-to-image generation. Abstract: We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

[127] Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR

Adwait Chandorkar,Hasan Tercan,Tobias Meisen

Main category: cs.CV

TL;DR: This paper proposes Dense Backbone, a lightweight and efficient backbone for 3D object detection that significantly reduces computational costs while maintaining high detection accuracy, demonstrated through its integration with PillarNet.

Details

Motivation: Most LiDAR-based 3D object detection approaches rely on complex VGG-based or ResNet-based backbones, increasing model complexity. While lightweight backbones are well-explored in 2D object detection, research in 3D remains limited, motivating the need for a more efficient solution. Method: The authors introduced Dense Backbone, a lightweight backbone combining high processing speed, lightweight architecture, and robust detection accuracy. They adapted multiple state-of-the-art 3D object detectors, such as PillarNet, with this backbone and evaluated performance on the nuScenes test set. Result: DensePillarNet, the adaptation of PillarNet using Dense Backbone, achieved a 29% reduction in model parameters and a 28% reduction in latency with only a 2% drop in detection accuracy on the nuScenes test set. Conclusion: Dense Backbone is the first dense-layer-based lightweight backbone specifically designed for 3D object detection, offering a plug-and-play solution that significantly reduces computational costs while maintaining detection accuracy. Abstract: Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most of the approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases the model complexity. Lightweight backbone design is well-explored for 2D object detection, but research on 3D object detection remains limited. In this work, we introduce Dense Backbone, a lightweight backbone that combines the benefits of high processing speed, lightweight architecture, and robust detection accuracy. We adapt multiple state-of-the-art 3D object detectors, such as PillarNet, to use our backbone and show that these models retain most of their detection capability at a significantly reduced computational cost. 
To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone's plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.

[128] GECO: Geometrically Consistent Embedding with Lightspeed Inference

Regine Hartwig,Dominik Muhle,Riccardo Marin,Daniel Cremers

Main category: cs.CV

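TL;DR: GECO is a new feature learning method that achieves high speed and strong performance on geometry-aware tasks, and it proposes new evaluation metrics.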

Details

Motivation: Existing self-supervised vision foundation models lack awareness of underlying 3D geometry; GECO fills this gap. Method: The authors propose a training framework based on optimal transport that produces geometrically consistent features and enables supervision beyond keypoints. Result: GECO supports supervision even under occlusions and disocclusions, runs at 30 fps, 98.2% faster than prior methods, and improves PCK by 6.0%, 6.2%, and 4.1% on PFPascal, APK, and CUB, respectively. Conclusion: GECO runs faster than existing methods while achieving state-of-the-art performance on multiple datasets, and it introduces new metrics to advance geometry-aware feature learning. Abstract: Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning. Link to project page: https://reginehartwig.github.io/publications/geco/
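
The GECO abstract reports gains in PCK (Percentage of Correct Keypoints) without defining it. The sketch below shows one common form of the metric, in which a predicted keypoint counts as correct when it lies within alpha times the larger side of a reference box. The function name, the normalization by box size, and alpha = 0.1 are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def pck(pred, gt, bbox_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(h, w) of ground truth.

    Normalizing by the larger box side is one common convention (assumed here).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    threshold = alpha * max(bbox_size)
    # Euclidean distance between each predicted and ground-truth keypoint
    distances = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(distances <= threshold))

# Toy keypoints (made up): two predictions fall inside the 10-pixel threshold
gt_points = [(10, 10), (50, 40), (80, 90), (30, 70)]
pred_points = [(12, 11), (50, 55), (79, 92), (60, 70)]
score = pck(pred_points, gt_points, bbox_size=(100, 100), alpha=0.1)
print(score)  # 0.5
```

A score like this only checks distance to the annotated point, which is consistent with the abstract's remark that PCK alone does not capture geometric quality.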

[129] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

Laura Pedrouzo-Rodriguez,Pedro Delgado-DeRobles,Luis F. Gomez,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez

Main category: cs.CV

TL;DR: The paper investigates the use of facial motion patterns as behavioral biometrics for identity verification in avatar-mediated communication, introducing a new dataset of realistic avatar videos and proposing a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling for improved biometric verification.

Details

Motivation: The increasing use of photorealistic talking-head avatars in virtual meetings, gaming, and social platforms introduces serious security risks such as impersonation. The paper explores the challenge of biometric verification in avatar-mediated scenarios. Method: A lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling was proposed, using only facial landmarks to model dynamic facial gestures. Result: Experimental results demonstrated that facial motion cues enable meaningful identity verification with AUC values approaching 80%. Conclusion: Facial motion patterns can serve as reliable behavioral biometrics for identity verification in avatar-mediated communication. Abstract: Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
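
To make the reported AUC concrete, here is a minimal sketch of how the area under the ROC curve can be computed for a verification system: it equals the probability that a randomly chosen genuine trial scores higher than a randomly chosen impostor trial, with ties counted as half. The scores below are made up for illustration and the function name is hypothetical; the paper's own evaluation code is not shown in the abstract.

```python
def verification_auc(genuine_scores, impostor_scores):
    """AUC as the fraction of (genuine, impostor) pairs ranked correctly."""
    wins = 0.0
    for g in genuine_scores:
        for i in impostor_scores:
            if g > i:
                wins += 1.0   # genuine trial correctly scored above impostor
            elif g == i:
                wins += 0.5   # ties count as half
    return wins / (len(genuine_scores) * len(impostor_scores))

# Made-up similarity scores for genuine and impostor avatar videos
genuine = [0.9, 0.8, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
auc = verification_auc(genuine, impostor)
print(auc)  # 0.85
```

Under this reading, an AUC approaching 80% means a genuine video outranks an impostor video in roughly four out of five pairings, compared with 50% for chance.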

[130] SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation

Prerana Ramkumar

Main category: cs.CV

TL;DR: SU-ESRGAN is a new super-resolution framework for satellite imagery that improves semantic consistency and provides uncertainty estimation, making it suitable for critical applications like disaster response and urban planning.

Details

Motivation: Generative Adversarial Networks (GANs) have achieved realistic super-resolution of images but lack semantic consistency and per-pixel confidence, limiting their use in critical remote sensing applications. This work addresses these limitations by introducing a framework that ensures both image quality and reliability. Method: SU-ESRGAN combines ESRGAN with segmentation loss via DeepLabv3 for preserving class details and uses Monte Carlo dropout to generate pixel-wise uncertainty maps. The model is tested on aerial imagery and evaluated using metrics like PSNR, SSIM, and LPIPS. It is also fine-tuned for cross-domain applications and tested on two drone-based datasets. Result: SU-ESRGAN achieves results comparable to Baseline ESRGAN on aerial imagery in terms of PSNR, SSIM, and LPIPS. The model demonstrates strong adaptation to the Aerial Maritime Drone Dataset, indicating the importance of domain-aware training. Its modular design allows for integration in UAV data pipelines for on-board or post-processing super-resolution. Conclusion: The SU-ESRGAN model is a novel super-resolution framework for satellite imagery that integrates semantic segmentation and uncertainty estimation, enhancing credibility for critical applications like disaster response and urban planning. Its modular design allows for easy integration into UAV data pipelines, and it demonstrates strong adaptation to domain-specific data. Abstract: Generative Adversarial Networks (GANs) have achieved realistic super-resolution (SR) of images; however, they lack semantic consistency and per-pixel confidence, limiting their credibility in critical remote sensing applications such as disaster response, urban planning and agriculture. This paper introduces Semantic and Uncertainty-Aware ESRGAN (SU-ESRGAN), the first SR framework designed for satellite imagery to integrate the ESRGAN, segmentation loss via DeepLabv3 for class detail preservation and Monte Carlo dropout to produce pixel-wise uncertainty maps. The SU-ESRGAN produces results (PSNR, SSIM, LPIPS) comparable to the Baseline ESRGAN on aerial imagery. This novel model is valuable in satellite systems or UAVs that use wide field-of-view (FoV) cameras, trading off spatial resolution for coverage. The modular design allows integration in UAV data pipelines for on-board or post-processing SR to enhance imagery degraded by motion blur, compression and sensor limitations. Further, the model is fine-tuned to evaluate its performance on cross-domain applications. The tests are conducted on two drone-based datasets which differ in altitude and imaging perspective. Performance evaluation of the fine-tuned models shows a stronger adaptation to the Aerial Maritime Drone Dataset, whose imaging characteristics align with the training data, highlighting the importance of domain-aware training in SR applications.
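
The pixel-wise uncertainty maps mentioned above come from Monte Carlo dropout, which keeps dropout active at inference, runs several stochastic forward passes, and uses the per-pixel spread of the outputs as a confidence signal. The sketch below illustrates only that idea: `toy_forward` is a hypothetical stand-in for a super-resolution network (it does not upscale anything), and the number of passes and the dropout rate are assumed values.

```python
import numpy as np

def mc_dropout_uncertainty(forward, image, n_passes=20, seed=0):
    """Run several stochastic passes; return per-pixel mean and std."""
    rng = np.random.default_rng(seed)
    samples = np.stack([forward(image, rng) for _ in range(n_passes)])
    return samples.mean(axis=0), samples.std(axis=0)

def toy_forward(image, rng, drop_prob=0.2):
    """Hypothetical stand-in for a network with dropout left on at test time."""
    keep_mask = rng.random(image.shape) >= drop_prob
    # Inverted dropout: rescale kept activations so the expectation is unchanged
    return image * keep_mask / (1.0 - drop_prob)

image = np.array([[0.0, 0.5], [1.0, 0.25]])
mean_map, uncertainty_map = mc_dropout_uncertainty(toy_forward, image)
print(uncertainty_map.shape)  # (2, 2)
```

In this toy setup the zero-valued pixel always has zero spread, while brighter pixels vary more between passes. A real model would yield an uncertainty map at the super-resolved output size.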

[131] Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation

Irene Iele,Francesco Di Feola,Valerio Guarrasi,Paolo Soda

Main category: cs.CV

TL;DR: This paper proposes a dynamic Test-Time Adaptation (TTA) framework for medical image-to-image translation that improves model performance on out-of-distribution samples while preserving accuracy on in-distribution data.

Details

Motivation: Image-to-image translation in medical imaging faces challenges in handling out-of-distribution samples, leading to performance degradation. This work aims to address this limitation through dynamic, sample-specific adaptation. Method: The method introduces a Reconstruction Module to quantify domain shift and a Dynamic Adaptation Block to selectively modify features of a pretrained model, enabling dynamic adjustments during translation. Result: The proposed approach demonstrated consistent improvements in two medical image-to-image translation tasks (low-dose CT denoising and T1 to T2 MRI translation), outperforming both the baseline model and prior TTA methods. Conclusion: The proposed Test-Time Adaptation (TTA) framework enhances the resilience of image-to-image translation models in handling out-of-distribution samples without compromising performance on in-distribution samples, offering a dynamic, sample-specific approach for real-world applications. Abstract: Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of state-of-the-art methods that uniformly apply adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/cosbidev/Sample-Aware_TTA.

[132] Zero-Shot Anomaly Detection with Dual-Branch Prompt Learning

Zihan Wang,Samira Ebrahimi Kahou,Narges Armanfard

Main category: cs.CV

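TL;DR: This paper proposes PILOT, a zero-shot anomaly detection framework that handles domain shift through a dual-branch prompt learning mechanism and a label-free test-time adaptation strategy, demonstrating state-of-the-art performance on 13 industrial and medical benchmarks.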

Details

Motivation: Existing ZSAD methods perform poorly under domain shift because their training data come from limited training domains and do not generalize to new distributions. Method: The authors propose a dual-branch prompt learning mechanism and a label-free test-time adaptation strategy. Result: Extensive experiments on 13 industrial and medical benchmarks show that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift. Conclusion: PILOT is a state-of-the-art zero-shot anomaly detection framework that addresses domain shift through two innovations. Abstract: Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.
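
The second innovation relies on high-confidence pseudo-labels from unlabeled test data. The abstract does not say how confidence is measured, so the sketch below assumes the simplest possible rule: keep only test samples whose anomaly score is very low or very high and label them normal or anomalous accordingly. The function name and both thresholds are hypothetical.

```python
import numpy as np

def select_pseudo_labels(anomaly_scores, low=0.1, high=0.9):
    """Keep only confidently scored samples; label them 0 (normal) or 1 (anomalous)."""
    scores = np.asarray(anomaly_scores, dtype=float)
    confident = (scores <= low) | (scores >= high)
    indices = np.flatnonzero(confident)
    labels = (scores[indices] >= high).astype(int)
    return indices, labels

# Made-up anomaly scores in [0, 1] for six unlabeled test images
test_scores = [0.02, 0.55, 0.97, 0.40, 0.91, 0.08]
kept_indices, pseudo_labels = select_pseudo_labels(test_scores)
print(kept_indices.tolist(), pseudo_labels.tolist())  # [0, 2, 4, 5] [0, 1, 1, 0]
```

Only the retained samples would then contribute to updating the learnable prompt parameters; the ambiguous ones (scores 0.55 and 0.40 here) are skipped so that noisy labels do not drive the adaptation.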

[133] Cross-Dataset Semantic Segmentation Performance Analysis: Unifying NIST Point Cloud City Datasets for 3D Deep Learning

Alexander Nikitas Dimopoulos,Joseph Grasso

Main category: cs.CV

TL;DR: This study explores semantic segmentation performance in heterogeneously labeled point-cloud datasets for public safety, highlighting challenges like data unification and detection of small, safety-critical elements, with findings suggesting a need for standardized annotation and improved labeling techniques.

Details

Motivation: The study is motivated by challenges in unifying differently labeled 3D data from heterogeneously labeled point-cloud datasets relevant to public safety applications, including pre-incident planning systems derived from lidar scans. Method: The methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety-relevant features. Result: Results indicate performance variability: geometrically large objects achieve higher segmentation performance, while smaller safety-critical features exhibit lower recognition rates, impacted by class imbalance and limited geometric distinction. Conclusion: The study concludes that reliable point-cloud semantic segmentation for public safety requires standardized annotation protocols and improved labeling techniques to address data heterogeneity and detect small, safety-critical elements. Abstract: This study analyzes semantic segmentation performance across heterogeneously labeled point-cloud datasets relevant to public safety applications, including pre-incident planning systems derived from lidar scans. Using NIST's Point Cloud City dataset (Enfield and Memphis collections), we investigate challenges in unifying differently labeled 3D data. Our methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety-relevant features. Results indicate performance variability: geometrically large objects (e.g., stairs, windows) achieve higher segmentation performance, suggesting potential for navigational context, while smaller safety-critical features exhibit lower recognition rates. Performance is impacted by class imbalance and the limited geometric distinction of smaller objects in typical lidar scans, indicating limitations in detecting certain safety-relevant features using current point-cloud methods. Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization. Potential directions include automated labeling and multi-dataset learning strategies. We conclude that reliable point-cloud semantic segmentation for public safety necessitates standardized annotation protocols and improved labeling techniques to address data heterogeneity and the detection of small, safety-critical elements.
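
The study evaluates segmentation with IoU (intersection over union) per class. The sketch below computes it from per-point labels; the class ids and the toy labels are made up, with class 0 standing in for a large, frequent class and class 2 for a small, rare one.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """IoU per class: |pred AND gt| / |pred OR gt| over all points."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = {}
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        # A class absent from both prediction and ground truth has no defined IoU
        ious[c] = float(intersection / union) if union > 0 else float("nan")
    return ious

# Ten labeled points (made up): six of class 0, two of class 1, two of class 2
gt_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred_labels, gt_labels, num_classes=3)
print(ious)
```

Each class has exactly one point missed here, yet the rare class drops to an IoU of 0.5 while the frequent class stays near 0.71, which illustrates how a single error weighs more heavily on small, rare classes.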

Wenxuan Guo,Xiuwei Xu,Hang Yin,Ziwei Wang,Jianjiang Feng,Jie Zhou,Jiwen Lu

Main category: cs.CV

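TL;DR: This paper proposes IGL-Nav, an efficient visual navigation framework based on incremental 3D Gaussian localization, which performs image-goal navigation using geometric information and optimization via differentiable rendering.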

Details

Motivation: Conventional methods cannot fully model the geometric relationship between the explored 3D environment and the goal image, so an efficient and accurate image localization method is needed. Method: IGL-Nav builds on a renderable 3D Gaussian representation: it incrementally updates the scene representation with feed-forward monocular prediction, uses geometric information for coarse discrete-space matching, and solves the fine target pose by optimization via differentiable rendering. Result: IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations, handles the more challenging free-view image-goal setting, and can be deployed on a real-world robotic platform. Conclusion: Through its incremental 3D Gaussian localization framework, IGL-Nav effectively addresses image-goal visual navigation and outperforms existing state-of-the-art methods. Abstract: Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL or on modular policies with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during the agent's exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.

Table of Contents

cs.CL [Back]

[1] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

[2] Do LLMs produce texts with "human-like" lexical diversity?

[3] Semiotic Complexity and Its Epistemological Implications for Modeling Culture

[4] FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

[5] Is neural semantic parsing good at ellipsis resolution, or isn't it?

[6] Comparison of Large Language Models for Deployment Requirements

[7] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

[8] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform

[9] Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

[10] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering

[11] Systematic Evaluation of Optimization Techniques for Long-Context Language Models

[12] Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

[13] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

[14] Lucy: edgerunning agentic web search on mobile with machine generated task vectors

[15] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

[16] Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

[17] SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation

[18] Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding

[19] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

[20] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

[21] GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts

[22] The Missing Parts: Augmenting Fact Verification with Half-Truth Detection

[23] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

[24] The Prosody of Emojis

[25] PaPaformer: Language Model from Pre-trained Paraller Paths

[26] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought

[27] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

[28] GHTM: A Graph based Hybrid Topic Modeling Approach in Low-Resource Bengali Language

[29] Prompting Science Report 3: I'll pay you or I'll kill you -- but will you care?

[30] DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

[31] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

[32] MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

[33] Team "better_call_claude": Style Change Detection using a Sequential Sentence Pair Classifier

[34] Segment First, Retrieve Better: Realistic Legal Search via Rhetorical Role-Based Queries

[35] Better Call Claude: Can LLMs Detect Changes of Writing Style?

[36] NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System

[37] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

[38] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

[39] Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents

[40] Agentic large language models improve retrieval-based radiology question answering

[41] GLiDRE: Generalist Lightweight model for Document-level Relation Extraction

[42] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

[43] ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation

[44] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

[45] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

cs.CV [Back]

[46] A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

[47] Punching Bag vs. Punching Person: Motion Transferability in Videos

[48] The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

[49] Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images

[50] World Consistency Score: A Unified Metric for Video Generation Quality

[51] GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

[52] Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

[53] On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI

[54] Graph Lineages and Skeletal Graph Products

[55] Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

[56] SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

[57] Object-Centric Cropping for Visual Few-Shot Classification

[58] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network

[59] PointGauss: Point Cloud-Guided Multi-Object Segmentation for Gaussian Splatting

[60] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

[61] Multimodal Referring Segmentation: A Survey

[62] Towards Robust Semantic Correspondence: A Benchmark and Insights

[63] Privacy-Preserving Driver Drowsiness Detection with Spatial Self-Attention and Federated Learning

[64] TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models

[65] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

[66] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence

[67] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement

[68] DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios

[69] GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

[70] Steering Guidance for Personalized Text-to-Image Diffusion Models

[71] Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

[72] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

[73] Stable at Any Speed: Speed-Driven Multi-Object Tracking with Learnable Kalman Filtering

[74] CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

[75] Honey Classification using Hyperspectral Imaging and Machine Learning

[76] SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

[77] Representation Shift: Unifying Token Compression with FlashAttention