cs.CL

[1] MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts

Qing Wang,Xue Han,Jiahui Wang,Lehao Xing,Qian Hu,Lianlian Zhang,Chao Deng,Junlan Feng

Main category: cs.CL

TL;DR: MultiPL-MoE improves multilingual code generation in LLMs by combining token and segment-level MoE structures.

Details Motivation: Multilingual code generation remains extremely challenging despite LLMs' excellent code creation capabilities, and there is a need to improve MultiPL performance while retaining popular LLMs with restricted computational resources. Method: MultiPL-MoE utilizes a hybrid mixture of experts (MoE) to optimize expert selection at both token and segment levels. Result: The experimental results demonstrated the effectiveness of MultiPL-MoE in multilingual code generation. Conclusion: MultiPL-MoE is effective in improving multilingual code generation performance of LLMs. Abstract: Despite LLMs' excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intend to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: first, using a sliding window to partition the input token sequence into multiple segments; then, adopting an expert-choice routing strategy that allows experts to select the top-k segments. The experimental results demonstrated the effectiveness of MultiPL-MoE.
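
The two segment-level ideas in the MultiPL-MoE abstract (sliding-window segmentation, then expert-choice routing over segments) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `window` and `stride` parameters, and the plain score matrix are all assumptions made for illustration, and a real model would compute the scores with a learned gate.

```python
from typing import List, Sequence


def sliding_window_segments(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Partition a token sequence into segments with a sliding window.

    `window` and `stride` are hypothetical parameters; segments overlap
    whenever stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    segments: List[List[int]] = []
    start = 0
    while start < len(tokens):
        segments.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reached the end of the sequence
        start += stride
    return segments


def expert_choice_routing(scores: List[List[float]], k: int) -> List[List[int]]:
    """Expert-choice routing: each expert selects its own top-k segments.

    scores[e][s] is the (assumed, precomputed) affinity of expert e for
    segment s. Returns, per expert, the sorted indices of chosen segments.
    """
    chosen: List[List[int]] = []
    for expert_scores in scores:
        ranked = sorted(range(len(expert_scores)),
                        key=lambda s: expert_scores[s], reverse=True)
        chosen.append(sorted(ranked[:k]))
    return chosen


if __name__ == "__main__":
    segs = sliding_window_segments(list(range(7)), window=4, stride=2)
    print(segs)
    print(expert_choice_routing([[0.1, 0.9, 0.5], [0.8, 0.2, 0.7]], k=2))
```

Under expert-choice routing, a segment may be picked by several experts or by none, which is the main difference from token-choice routing, where each token picks its experts.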

[2] Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English

Nguyen Huu Nhat Minh,Tran Nguyen Anh,Truong Dinh Dung,Vo Van Nam,Le Pham Tuyen

Main category: cs.CL

TL;DR: This paper proposes a bilingual speech recognition approach to address cross-lingual phoneme recognition challenges in Vietnamese-English code-switching scenarios, achieving improved accuracy and robustness.

Details Motivation: The motivation stems from the challenge of cross-lingual phoneme recognition in mixing Vietnamese and English pronunciations, where Vietnamese relies on tonal variations and English features stress patterns and non-standard pronunciations. Method: The method involves constructing a representative bilingual phoneme set and using an end-to-end system leveraging the PhoWhisper pre-trained encoder to enhance phoneme recognition. Result: The experiments showed improved recognition accuracy in bilingual speech recognition for Vietnamese while addressing the complexities of tonal and stress-based phoneme recognition. Conclusion: The proposed bilingual speech recognition approach improves recognition accuracy for Vietnamese-English bilingual speech and offers a robust framework for handling tonal and stress-based phoneme recognition challenges. Abstract: Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy in bilingual speech recognition for Vietnamese but also provides a robust framework for addressing the complexities of tonal and stress-based phoneme recognition.

[3] Rethinking Reasoning in LLMs: Neuro-Symbolic Local RetoMaton Beyond ICL and CoT

Rushitha Santhoshi Mamidala,Anshuman Chhabra,Ankur Mali

Main category: cs.CL

TL;DR: This paper introduces a local automaton-based framework for reasoning in large language models, offering more reliable and interpretable results compared to prompt-based strategies.

Details Motivation: Prompt-based reasoning strategies like Chain-of-Thought (CoT) and In-Context Learning (ICL) are unreliable due to their implicit and fragile mechanisms. The paper aims to provide a more structured and trustworthy alternative through symbolic reasoning with WFAs. Method: The paper introduces a local automaton structure to the RetoMaton framework, replacing its global datastore with a WFA built from external domain corpora. This approach is evaluated on two pretrained LLMs across three reasoning tasks. Result: The proposed local RetoMaton variant consistently improved performance on reasoning tasks compared to the base model and prompting-based methods while enabling transparent retrieval dynamics. Conclusion: The paper concludes that the local RetoMaton variant, which uses a task-adaptive Weighted Finite Automaton (WFA), provides a more trustworthy and efficient approach to reasoning in large language models compared to prompt-based methods. Abstract: Prompt-based reasoning strategies such as Chain-of-Thought (CoT) and In-Context Learning (ICL) have become widely used for eliciting reasoning capabilities in large language models (LLMs). However, these methods rely on fragile, implicit mechanisms often yielding inconsistent outputs across seeds, formats, or minor prompt variations making them fundamentally unreliable for tasks requiring stable, interpretable reasoning. In contrast, automata-based neuro-symbolic frameworks like RetoMaton offer a more structured and trustworthy alternative by grounding retrieval in symbolic memory with deterministic transitions. In this work, we extend RetoMaton by replacing its global datastore with a local, task-adaptive Weighted Finite Automaton (WFA), constructed directly from external domain corpora. This local automaton structure promotes robust, context-aware retrieval while preserving symbolic traceability and low inference overhead. 
Unlike prompting, which entangles context and memory in opaque ways, our approach leverages the explicit structure of WFAs to provide verifiable and modular retrieval behavior, making it better suited for domain transfer and interoperability. We evaluate this local RetoMaton variant on two pretrained LLMs, LLaMA-3.2-1B and Gemma-3-1B-PT, across three reasoning tasks: TriviaQA (reading comprehension), GSM8K (multi-step math), and MMLU (domain knowledge). Compared to the base model and prompting-based methods, augmenting these setups with local RetoMaton consistently improves performance while enabling transparent and reproducible retrieval dynamics. Our results highlight a promising shift toward trustworthy, symbolic reasoning in modern LLMs via lightweight, automaton-guided memory.

[4] RAGAPHENE: A RAG Annotation Platform with Human Enhancements and Edits

Kshitij Fadnis,Sara Rosenthal,Maeda Hanafi,Yannis Katsis,Marina Danilevsky

Main category: cs.CL

TL;DR: RAGAPHENE is a chat-based annotation platform that helps create realistic multi-turn conversations to evaluate LLMs, ensuring factual accuracy and reducing hallucinations.

Details Motivation: The motivation is to create high-quality benchmarks for evaluating LLMs in Retrieval Augmented Generation (RAG) scenarios, ensuring factual accuracy and reducing hallucinations in multi-turn conversations. Method: The authors developed RAGAPHENE, a chat-based annotation platform designed to simulate real-world conversations, which is used to build evaluation benchmarks for LLMs. Result: RAGAPHENE has enabled approximately 40 annotators to build thousands of real-world conversations, demonstrating its effectiveness in simulating and generating evaluation data. Conclusion: RAGAPHENE is a successful chat-based annotation platform used by around 40 annotators to generate thousands of real-world conversations for evaluating LLMs in multi-turn RAG settings. Abstract: Retrieval Augmented Generation (RAG) is an important aspect of conversing with Large Language Models (LLMs) when factually correct information is important. LLMs may provide answers that appear correct, but could contain hallucinated information. Thus, building benchmarks that can evaluate LLMs on multi-turn RAG conversations has become an increasingly important task. Simulating real-world conversations is vital for producing high quality evaluation benchmarks. We present RAGAPHENE, a chat-based annotation platform that enables annotators to simulate real-world conversations for benchmarking and evaluating LLMs. RAGAPHENE has been successfully used by approximately 40 annotators to build thousands of real-world conversations.

[5] Leveraging Language Models and Machine Learning in Verbal Autopsy Analysis

Yue Chu

Main category: cs.CL

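TL;DR: This thesis examines the role of verbal autopsy (VA) narratives in automated cause-of-death (COD) classification. Using pretrained language models (PLMs) and machine learning techniques, it finds that narratives significantly improve classification accuracy, particularly for identifying non-communicable diseases. Multimodal approaches improve performance further, indicating that narratives and questions each contribute unique value.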

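Details Motivation: In countries without civil registration and vital statistics systems, cause-of-death estimation relies on verbal autopsy (VA). Existing automated VA cause classification algorithms use only the structured questions and ignore the narratives. This study explores how VA narratives can improve automated COD classification and evaluates the joint value of narratives and questions in a multimodal framework. Method: Using empirical data from South Africa, the thesis analyzes the role of VA narratives in COD classification with transformer-based pretrained language models (PLMs) and task-specific fine-tuning. It also explores several multimodal fusion strategies that combine narratives and questions in a unified framework, characterizes physician-perceived sufficiency of VA information, and studies how sufficiency affects classification accuracy. Result: With the narrative alone, transformer-based PLMs outperform existing question-only algorithms at both the individual and population levels, particularly in identifying non-communicable diseases. Multimodal fusion further improves classification performance, indicating that narratives and questions each provide unique information. Information sufficiency affects classification accuracy for both physicians and models. Conclusion: The thesis concludes that, with PLMs and machine learning (ML) techniques, VA narratives can significantly improve COD classification accuracy, especially for non-communicable diseases. Multimodal approaches improve performance further, showing that narratives and questions each have unique value. It also stresses the importance of high-quality data from more diverse settings for training and fine-tuning PLM/ML methods, and offers insights for rethinking and redesigning the VA instrument and interview. Abstract: In countries without civil registration and vital statistics, verbal autopsy (VA) is a critical tool for estimating cause of death (COD) and informing policy priorities. In VA, interviewers ask proximal informants for details on the circumstances preceding a death, in the form of unstructured narratives and structured questions. Existing automated VA cause classification algorithms only use the questions and ignore the information in the narratives. In this thesis, we investigate how the VA narrative can be used for automated COD classification using pretrained language models (PLMs) and machine learning (ML) techniques. Using empirical data from South Africa, we demonstrate that with the narrative alone, transformer-based PLMs with task-specific fine-tuning outperform leading question-only algorithms at both the individual and population levels, particularly in identifying non-communicable diseases. We explore various multimodal fusion strategies combining narratives and questions in unified frameworks. Multimodal approaches further improve performance in COD classification, confirming that each modality has unique contributions and may capture valuable information that is not present in the other modality. We also characterize physician-perceived information sufficiency in VA. 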
We describe variations in sufficiency levels by age and COD and demonstrate that classification accuracy is affected by sufficiency for both physicians and models. Overall, this thesis advances the growing body of knowledge at the intersection of natural language processing, epidemiology, and global health. It demonstrates the value of narrative in enhancing COD classification. Our findings underscore the need for more high-quality data from more diverse settings to use in training and fine-tuning PLM/ML methods, and offer valuable insights to guide the rethinking and redesign of the VA instrument and interview.

[6] FLAIRR-TS -- Forecasting LLM-Agents with Iterative Refinement and Retrieval for Time Series

Gunjan Jalori,Preetika Verma,Sercan Ö Arık

Main category: cs.CL

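TL;DR: FLAIRR-TS is a new test-time prompt optimization framework that uses an agentic system for adaptive prompt refinement and retrieval, offering a practical alternative to tuning while achieving strong time series forecasting performance.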

Details Motivation: Time series forecasting with large language models (LLMs) requires bridging numerical patterns and natural language, and existing approaches often need extensive preprocessing and fine-tuning. Crafting a prompt for each task is also onerous and ad hoc. Method: FLAIRR-TS is a test-time prompt optimization framework built on an agentic system: a forecaster agent generates forecasts from an initial prompt, and a refiner agent then improves the prompt based on past outputs and retrieved analogs. Result: Experiments on benchmark datasets show improved accuracy over static prompting and retrieval-augmented baselines, approaching the performance of specialized prompts. Conclusion: FLAIRR-TS provides a practical alternative to tuning, achieving strong performance through its agentic approach to adaptive prompt refinement and retrieval. Abstract: Time series forecasting with large language models (LLMs) requires bridging numerical patterns and natural language. Effective forecasting with LLMs often relies on extensive preprocessing and fine-tuning. Recent studies show that a frozen LLM can rival specialized forecasters when supplied with a carefully engineered natural-language prompt, but crafting such a prompt for each task is itself onerous and ad hoc. We introduce FLAIRR-TS, a test-time prompt optimization framework that utilizes an agentic system: a forecaster agent generates forecasts using an initial prompt, which is then refined by a refiner agent, informed by past outputs and retrieved analogs. This adaptive prompting generalizes across domains using creative prompt templates and generates high-quality forecasts without intermediate code generation. Experiments on benchmark datasets show improved accuracy over static prompting and retrieval-augmented baselines, approaching the performance of specialized prompts. FLAIRR-TS provides a practical alternative to tuning, achieving strong performance via its agentic approach to adaptive prompt refinement and retrieval.

[7] CORE: Lossless Compression for Retrieval-Augmented LLMs via Reinforcement Learning

Ziqiang Cui,Yunpeng Weng,Xing Tang,Peiyang Liu,Shiwei Li,Bowei He,Jiamin Chen,Xiuqiang He,Chen Ma

Main category: cs.CL

TL;DR: CORE is a reinforcement learning-based context compression method for RAG that improves response accuracy while significantly reducing input length.

Details Motivation: RAG enhances LLMs' knowledge timeliness and factual accuracy, but excessive retrieved documents increase computational costs. Existing compression methods often compromise task performance due to lack of defined targets. Method: CORE uses reinforcement learning, specifically GRPO, to train a compressor that optimizes context compression based on end-task performance as a reward signal. Result: Experiments on four datasets show that CORE achieves a 3% compression ratio, avoids performance degradation compared to using full documents, and improves the average Exact Match score by 3.3 points. Conclusion: The proposed CORE method achieves lossless context compression for RAG, improving the factual accuracy of responses while maintaining a high compression ratio. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels. Specifically, it utilizes end-task performance as a reward signal and applies Generalized Reinforcement Learning Policy Optimization (GRPO) to train the compressor. 
This end-to-end training framework enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
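
As a rough illustration of the two quantities the CORE abstract reports, the sketch below computes an Exact Match score that could serve as an end-task reward, and a compression ratio. It is not the authors' code: the abstract does not specify the exact reward, so using EM as the reward is an assumption, and the normalization steps (lowercasing, stripping punctuation and English articles) follow a common QA evaluation convention rather than anything stated in the source.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: Iterable[str]) -> float:
    """Return 1.0 if the normalized prediction equals any gold answer, else 0.0.

    Assumption: this score is used directly as the reward for the compressor.
    """
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(g) for g in gold_answers) else 0.0


def compression_ratio(compressed_tokens: int, original_tokens: int) -> float:
    """Fraction of the original input length kept after compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return compressed_tokens / original_tokens


if __name__ == "__main__":
    print(exact_match_reward("The Eiffel Tower.", ["eiffel tower"]))
    print(compression_ratio(30, 1000))
```

On this reading, a compression ratio of 3% means the compressed context keeps about 30 of every 1000 retrieved tokens.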

[8] Context-Adaptive Synthesis and Compression for Enhanced Retrieval-Augmented Generation in Complex Domains

Peiran Zhou,Junnan Zhu,Yichen Shen,Ruoxi Yu

Main category: cs.CL

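TL;DR: This paper proposes CASC, a new framework that intelligently processes retrieved contexts to address the information overload and inefficient synthesis that traditional RAG suffers from on complex multi-document tasks; it performs strongly in experiments.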

Details Motivation: In complex domains involving multiple, lengthy, or conflicting documents, traditional RAG suffers from information overload and inefficient synthesis, leading to inaccurate and unreliable answers. Method: The paper proposes the CASC framework, which includes a Context Analyzer & Synthesizer (CAS) module. Powered by a fine-tuned smaller LLM, the module performs key information extraction, cross-document consistency checking and conflict resolution, and question-oriented structured synthesis. Result: In experiments on the SciDocs-QA dataset, CASC consistently outperforms strong baselines. Conclusion: CASC addresses the information overload and inefficient synthesis of traditional RAG on multiple, lengthy, and conflicting documents, improving the accuracy and reliability of answers. Abstract: Large Language Models (LLMs) excel in language tasks but are prone to hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) mitigates these by grounding LLMs in external knowledge. However, in complex domains involving multiple, lengthy, or conflicting documents, traditional RAG suffers from information overload and inefficient synthesis, leading to inaccurate and untrustworthy answers. To address this, we propose CASC (Context-Adaptive Synthesis and Compression), a novel framework that intelligently processes retrieved contexts. CASC introduces a Context Analyzer & Synthesizer (CAS) module, powered by a fine-tuned smaller LLM, which performs key information extraction, cross-document consistency checking and conflict resolution, and question-oriented structured synthesis. This process transforms raw, scattered information into a highly condensed, structured, and semantically rich context, significantly reducing the token count and cognitive load for the final Reader LLM. We evaluate CASC on SciDocs-QA, a new challenging multi-document question answering dataset designed for complex scientific domains with inherent redundancies and conflicts. Our extensive experiments demonstrate that CASC consistently outperforms strong baselines.

[9] Reflective Agreement: Combining Self-Mixture of Agents with a Sequence Tagger for Robust Event Extraction

Fatemeh Haji,Mazal Bethany,Cho-Yu Jason Chiang,Anthony Rios,Peyman Najafirad

Main category: cs.CL

TL;DR: This paper proposes ARIS, a hybrid approach that combines a Self Mixture of Agents with a discriminative sequence tagger to address challenges in event extraction.

Details Motivation: Traditional discriminative models achieve high precision in event extraction but limited recall, especially for nuanced or infrequent events. Generative approaches based on large language models offer higher semantic flexibility and recall but suffer from hallucinations and inconsistent predictions. Method: The paper proposes the Agreement-based Reflective Inference System (ARIS), which combines a Self Mixture of Agents with a discriminative sequence tagger. ARIS uses structured model consensus, confidence-based filtering, and an LLM reflective inference module to resolve ambiguities reliably and improve overall event prediction quality. The authors also investigate decomposed instruction fine-tuning to strengthen the LLM's understanding of event extraction. Result: Experiments show that ARIS outperforms existing state-of-the-art event extraction methods on three benchmark datasets. Conclusion: ARIS combines the strengths of discriminative and generative models for event extraction and improves overall extraction performance. Abstract: Event Extraction (EE) involves automatically identifying and extracting structured information about events from unstructured text, including triggers, event types, and arguments. Traditional discriminative models demonstrate high precision but often exhibit limited recall, particularly for nuanced or infrequent events. Conversely, generative approaches leveraging Large Language Models (LLMs) provide higher semantic flexibility and recall but suffer from hallucinations and inconsistent predictions. To address these challenges, we propose Agreement-based Reflective Inference System (ARIS), a hybrid approach combining a Self Mixture of Agents with a discriminative sequence tagger. ARIS explicitly leverages structured model consensus, confidence-based filtering, and an LLM reflective inference module to reliably resolve ambiguities and enhance overall event prediction quality. We further investigate decomposed instruction fine-tuning for enhanced LLM event extraction understanding. Experiments demonstrate our approach outperforms existing state-of-the-art event extraction methods across three benchmark datasets.

[10] LongReasonArena: A Long Reasoning Benchmark for Large Language Models

Jiayu Ding,Shuming Ma,Lei Cui,Nanning Zheng,Furu Wei

Main category: cs.CL

TL;DR: The paper introduces LongReasonArena, a new benchmark for evaluating the long reasoning capabilities of large language models.

Details Motivation: Existing long-context benchmarks for large language models focus on comprehension of long inputs and overlook the evaluation of long reasoning abilities. Method: The tasks require executing multi-step algorithms, and the required reasoning length can be scaled arbitrarily by controlling the inputs. Result: Even an advanced model such as Deepseek-R1 achieves only 7.5% accuracy on the task, and accuracy declines linearly with the logarithm of the number of reasoning steps. Conclusion: LongReasonArena is a new benchmark for evaluating the long reasoning capabilities of large language models, and it poses a significant challenge for both open-source and proprietary models. Abstract: Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
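
The reported relationship between accuracy and reasoning length in LongReasonArena can be written compactly. The symbols below are assumptions introduced for illustration (the abstract gives no notation or fitted values): $n$ is the expected number of reasoning steps, and $\alpha$ and $\beta > 0$ are task-dependent constants.

```latex
% Accuracy declines linearly in the logarithm of the expected step count n.
\mathrm{Acc}(n) \approx \alpha - \beta \log n, \qquad \beta > 0
% Consequence: multiplying n by a constant factor c lowers accuracy by a fixed amount,
\mathrm{Acc}(n) - \mathrm{Acc}(c\,n) \approx \beta \log c .
```

Read this way, each tenfold increase in required reasoning steps costs roughly the same number of accuracy points, until accuracy reaches its floor.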

[11] Database Entity Recognition with Data Augmentation and Deep Learning

Zikun Fu,Chen Yang,Kourosh Davoudi,Ken Q. Pu

Main category: cs.CL

TL;DR: The paper presents a new approach to Database Entity Recognition in Natural Language Queries, including a benchmark, data augmentation method, and T5-based model, achieving better performance than existing methods.

Details Motivation: The paper aims to address the challenge of Database Entity Recognition in Natural Language Queries and advance the field with new contributions. Method: The paper introduces a novel data augmentation procedure and a specialized language model using T5 for DB-ER tasks. Result: The DB-ER tagger outperforms existing NER taggers, with data augmentation and T5 fine-tuning providing significant performance boosts. Conclusion: The paper concludes that their model outperforms state-of-the-art NER taggers in precision and recall, and data augmentation as well as T5 backbone fine-tuning significantly enhance performance. Abstract: This paper addresses the challenge of Database Entity Recognition (DB-ER) in Natural Language Queries (NLQ). We present several key contributions to advance this field: (1) a human-annotated benchmark for the DB-ER task, derived from popular text-to-SQL benchmarks, (2) a novel data augmentation procedure that leverages automatic annotation of NLQs based on the corresponding SQL queries, which are available in popular text-to-SQL benchmarks, (3) a specialized language-model-based entity recognition model using T5 as a backbone and two downstream DB-ER tasks: sequence tagging and token classification, for fine-tuning of the backbone and performing DB-ER respectively. We compared our DB-ER tagger with two state-of-the-art NER taggers, and observed better performance in both precision and recall for our model. The ablation evaluation shows that data augmentation boosts precision and recall by over 10%, while fine-tuning of the T5 backbone boosts these metrics by 5-10%.

[12] One Joke to Rule them All? On the (Im)possibility of Generalizing Humor

Mor Turgeman,Chen Shani,Dafna Shahaf

Main category: cs.CL

TL;DR: This paper explores whether Large Language Models can generalize humor understanding across different types of humor, finding that diverse training improves transferability and that certain humor types, like Dad Jokes, help enable better transfer.

Details Motivation: The motivation is to determine whether competence in specific humor tasks can generalize to novel, unseen humor types, especially as new forms of humor emerge in online contexts, and to assess whether LLMs can keep up with this evolving landscape. Method: The researchers conducted transfer learning experiments across four humor datasets, training LLMs under varied diversity settings (using 1-3 datasets) and testing on novel, unseen tasks. Result: Models achieved up to 75% accuracy on unseen datasets, with training on diverse sources improving transferability by 1.88-4.05% without significantly affecting in-domain performance; Dad Jokes were found to be surprisingly effective in enabling transfer. Conclusion: The study concludes that Large Language Models (LLMs) can achieve transfer learning across different humor types, with improved performance when trained on diverse datasets, and humor types like Dad Jokes are found to be effective in enabling transfer. Abstract: Humor is a broad and complex form of communication that remains challenging for machines. Despite its broadness, most existing research on computational humor traditionally focused on modeling a specific type of humor. In this work, we wish to understand whether competence on one or more specific humor tasks confers any ability to transfer to novel, unseen types; in other words, is this fragmentation inevitable? This question is especially timely as new humor types continuously emerge in online and social media contexts (e.g., memes, anti-humor, AI fails). If Large Language Models (LLMs) are to keep up with this evolving landscape, they must be able to generalize across humor types by capturing deeper, transferable mechanisms. To investigate this, we conduct a series of transfer learning experiments across four datasets, representing different humor tasks. We train LLMs under varied diversity settings (1-3 datasets in training, testing on a novel task). 
Experiments reveal that models are capable of some transfer, and can reach up to 75% accuracy on unseen datasets; training on diverse sources improves transferability (1.88-4.05%) with minimal-to-no drop in in-domain performance. Further analysis suggests relations between humor types, with Dad Jokes surprisingly emerging as the best enabler of transfer (but is difficult to transfer to). We release data and code.

[13] A perishable ability? The future of writing in the face of generative artificial intelligence

Evandro L. T. P. Cunha

Main category: cs.CL

TL;DR: This paper explores the potential decline of human writing skills due to increased reliance on AI-generated text, comparing it to historical precedents like the Greek Dark Ages.

Details Motivation: The motivation for the study is the rapid development and integration of generative AI tools in text creation, raising concerns about a potential decrease in human involvement in writing tasks. Method: The article employs a historical and technological analysis to explore the potential decline in human writing abilities due to the increasing use of generative artificial intelligence tools. Result: The study highlights a concern that human writing abilities may diminish as reliance on machines for text generation increases, drawing comparisons with historical periods where writing skills were lost. Conclusion: The article concludes that there is a possibility that human writing skills may decline due to reliance on generative AI tools, drawing a parallel with historical instances of writing skill loss, such as during the Greek Dark Ages. Abstract: The 2020s have been witnessing a very significant advance in the development of generative artificial intelligence tools, including text generation systems based on large language models. These tools have been increasingly used to generate texts in the most diverse domains -- from technical texts to literary texts --, which might eventually lead to a lower volume of written text production by humans. This article discusses the possibility of a future in which human beings will have lost or significantly decreased their ability to write due to the outsourcing of this activity to machines. This possibility parallels the loss of the ability to write in other moments of human history, such as during the so-called Greek Dark Ages (approx. 1200 BCE - 800 BCE).

[14] Heterogeneous LLM Methods for Ontology Learning (Few-Shot Prompting, Ensemble Typing, and Attention-Based Taxonomies)

Aleksandra Beliaeva,Temurbek Rahmatullaev

Main category: cs.CL

TL;DR: This paper proposes a comprehensive system for Tasks A, B, and C of the LLMs4OL 2025 challenge, using retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling. It achieved top results in all tasks, demonstrating the scalability and robustness of LLM-based architectures for ontology learning.

Details Motivation: The motivation is to develop a comprehensive system to address Tasks A, B, and C of the LLMs4OL 2025 challenge, which span the ontology construction pipeline: term extraction, typing, and taxonomy discovery. Method: The approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling. Task A uses a retrieval-augmented generation (RAG) pipeline for term and type extraction. Task B applies a dual strategy: RAG with few-shot prompting for few-shot settings and a confidence-weighted zero-shot classifier for zero-shot settings. Task C models taxonomy discovery using a cross-attention layer to predict is-a relations. Result: The system achieved top-ranking results on the official leaderboard across all three tasks. Conclusion: The paper concludes that their modular, task-specific solutions achieved top-ranking results in all three tasks of the LLMs4OL 2025 challenge, showcasing the scalability, adaptability, and robustness of LLM-based architectures for ontology learning across heterogeneous domains. Abstract: We present a comprehensive system for addressing Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the full ontology construction pipeline: term extraction, typing, and taxonomy discovery. Our approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling -- each tailored to the demands of the respective task. For Task A, we jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data was reformulated into a document to terms and types correspondence, while test-time inference leverages semantically similar training examples. This single-pass method requires no model finetuning and improves overall performance through lexical augmentation. Task B, which involves assigning types to given terms, is handled via a dual strategy. 
In the few-shot setting (for domains with labeled training data), we reuse the RAG scheme with few-shot prompting. In the zero-shot setting (for previously unseen domains), we use a zero-shot classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting. In Task C, we model taxonomy discovery as graph inference. Using embeddings of type labels, we train a lightweight cross-attention layer to predict is-a relations by approximating a soft adjacency matrix. These modular, task-specific solutions enabled us to achieve top-ranking results in the official leaderboard across all three tasks. Taken together these strategies showcase the scalability, adaptability, and robustness of LLM-based architectures for ontology learning across heterogeneous domains. Code is available at: https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek
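
The zero-shot classifier described for Task B (cosine similarity scores from several embedding models, combined with confidence-based weighting) might look roughly like the sketch below. The abstract does not define how confidence is computed, so the top-1 minus top-2 similarity margin used here is purely an assumption, as are the function names and the toy two-dimensional vectors; real embeddings would come from actual embedding models.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_zero_shot(term_vecs: List[List[float]],
                       type_vecs: List[Dict[str, List[float]]]) -> str:
    """Pick a type label by confidence-weighted voting over embedding models.

    term_vecs[m] is the term's embedding under model m, and
    type_vecs[m][label] is the embedding of a candidate type label under
    the same model. Hypothetical confidence: the margin between a model's
    best and second-best similarity.
    """
    combined: Dict[str, float] = {}
    for term_vec, label_vecs in zip(term_vecs, type_vecs):
        sims = {label: cosine(term_vec, vec) for label, vec in label_vecs.items()}
        ranked = sorted(sims.values(), reverse=True)
        confidence = ranked[0] - ranked[1] if len(ranked) > 1 else 1.0
        for label, sim in sims.items():
            combined[label] = combined.get(label, 0.0) + confidence * sim
    return max(combined, key=combined.get)


if __name__ == "__main__":
    labels = {"Person": [1.0, 0.0], "Place": [0.0, 1.0]}
    # Model A is decisive; model B is undecided, so it gets zero weight.
    print(classify_zero_shot([[1.0, 0.0], [1.0, 1.0]], [labels, labels]))
```

Weighting by margin lets a model that clearly separates the candidate types dominate one that scores them all about equally.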

[15] Bridging Language Gaps: Enhancing Few-Shot Language Adaptation

Philipp Borchert,Jochen De Weerdt,Marie-Francine Moens

Main category: cs.CL

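TL;DR: This paper proposes CoLAP, a method that combines contrastive learning with cross-lingual representations to address the data gap between high-resource and low-resource languages in multilingual NLP.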

Details Motivation: Disparities in language resources are a challenge in multilingual NLP: high-resource languages have abundant data, while low-resource languages lack sufficient data for effective training. Method: Contrastive learning is combined with cross-lingual representations to transfer task-specific knowledge from high-resource to low-resource languages. Result: CoLAP outperforms few-shot cross-lingual transfer baselines and in-context learning, even with limited data. Conclusion: CoLAP effectively narrows the cross-lingual performance gap, contributing to more efficient multilingual NLP techniques. Abstract: The disparity in language resources poses a challenge in multilingual NLP, with high-resource languages benefiting from extensive data, while low-resource languages lack sufficient data for effective training. Our Contrastive Language Alignment with Prompting (CoLAP) method addresses this gap by integrating contrastive learning with cross-lingual representations, facilitating task-specific knowledge transfer from high-resource to lower-resource languages. The primary advantage of our approach is its data efficiency, enabling rapid adaptation to new languages and reducing the need for large labeled datasets. We conduct experiments with multilingual encoder-only and decoder-only language models on natural language understanding tasks, including natural language inference and relation extraction, evaluating performance across both high- and low-resource languages. Our results demonstrate that CoLAP outperforms few-shot cross-lingual transfer baselines and in-context learning, even with limited available data. This effectively narrows the cross-lingual performance gap, contributing to the development of more efficient multilingual NLP techniques.

Sumon Kanti Dey,Jeanne M. Powell,Azra Ismail,Jeanmarie Perrone,Abeed Sarker

Main category: cs.CL

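TL;DR: This study introduces an NER framework that uses social media data to extract the clinical and social impacts of opioid use, and develops the RedditImpacts 2.0 dataset. Fine-tuned models outperform large language models but still fall short of expert-level agreement, which highlights the importance of domain-specific fine-tuning and the limits of current AI techniques on tasks requiring deep domain knowledge.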

Details Motivation: Social media is an underutilized source of insight into the clinical and social impacts of opioid use, which are often underreported in traditional healthcare settings. Method: The paper proposes a named entity recognition framework for extracting self-reported consequences from opioid-related social media narratives, and evaluates fine-tuned encoder models alongside state-of-the-art large language models in zero-shot and few-shot settings. Result: The fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61. It consistently outperforms large language models in precision, span accuracy, and adherence to task-specific guidelines, but still falls significantly short of inter-expert agreement. Conclusion: The study underscores the importance of domain-specific fine-tuning for clinical NLP tasks and shows that a gap remains between current state-of-the-art NER/AI techniques and expert intelligence. Abstract: Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. 
The best performing model, however, still significantly underperforms compared to inter-expert agreement (Cohen's kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.

[17] Automatic Question & Answer Generation Using Generative Large Language Model (LLM)

Md. Alvee Ehsan,A. S. M Mehedi Hasan,Kefaya Benta Shahnoor,Syeda Sumaiya Tasneem

Main category: cs.CL

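TL;DR: This study proposes an automatic question and answer generation tool based on natural language processing and a generative large language model, intended to help educators create assessment questions efficiently.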

Details Motivation: Student evaluation is an important part of education, but traditional text-based assessment requires instructors to manually create diverse and fair questions, which is time-consuming and laborious. An automated tool is needed to simplify this process. Method: A fine-tuned generative large language model (LLM) is combined with prompt engineering to perform Automatic Question Answer Generation (AQAG), with the RACE dataset used for training. Result: A customized model was developed that automatically generates questions and answers in the instructor's preferred style (such as multiple-choice, conceptual, or factual questions), saving time and resources. Conclusion: The study concludes that unsupervised learning methods in NLP, specifically fine-tuning the Llama 2-7B model on the RACE dataset, can deliver efficient and reliable automatic question and answer generation, giving educators an effective assessment tool. Abstract: In the realm of education, student evaluation holds equal significance as imparting knowledge. To be evaluated, students usually need to go through text-based academic assessment methods. Instructors need to make diverse sets of questions that need to be fair for all students to prove their adequacy over a particular topic. This can prove to be quite challenging as they may need to manually go through several different lecture materials. Our objective is to make this whole process much easier by implementing Automatic Question Answer Generation (AQAG), using a fine-tuned generative LLM. For tailoring the instructor's preferred question style (MCQ, conceptual, or factual questions), prompt engineering (PE) is being utilized. In this research, we propose to leverage unsupervised learning methods in NLP, primarily focusing on the English language. This approach empowers the base Meta-Llama 2-7B model to integrate the RACE dataset as training data for the fine-tuning process, creating a customized model that will offer efficient solutions for educators, instructors, and individuals engaged in text-based evaluations. A reliable and efficient tool for generating questions and answers can free up valuable time and resources, thus streamlining their evaluation processes.

[18] Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study

Manuel Mosquera,Melissa Robles,Johan Rodriguez,Ruben Manrique

Main category: cs.CL

TL;DR: The paper introduces a new method for low-resource machine translation that combines external dictionary tools with reinforcement learning, resulting in improved translation quality for the Spanish-Wayuunaiki language pair.

Details Motivation: Low-resource machine translation is challenging for large language models due to limited exposure to such languages during pretraining and scarce parallel data for fine-tuning. This motivates the need for a novel approach to enhance translation in these scenarios. Method: The paper proposes a novel approach involving the integration of an external dictionary tool and training models end-to-end using reinforcement learning along with supervised fine-tuning. Translation is framed as a tool-augmented decision-making problem, and the method combines supervised instruction tuning with Guided Reward Policy Optimization (GRPO). Result: Preliminary results show up to +3.37 BLEU improvement over previous work and an 18% relative gain compared to a supervised baseline without dictionary access. Ablation studies also assess the effects of model architecture and training strategy. Conclusion: Combining LLMs with external tools and utilizing reinforcement learning can significantly improve translation quality in low-resource language settings. Abstract: Low-resource machine translation remains a significant challenge for large language models (LLMs), which often lack exposure to these languages during pretraining and have limited parallel data for fine-tuning. We propose a novel approach that enhances translation for low-resource languages by integrating an external dictionary tool and training models end-to-end using reinforcement learning, in addition to supervised fine-tuning. Focusing on the Spanish-Wayuunaiki language pair, we frame translation as a tool-augmented decision-making problem in which the model can selectively consult a bilingual dictionary during generation. Our method combines supervised instruction tuning with Guided Reward Policy Optimization (GRPO), enabling the model to learn both when and how to use the tool effectively. BLEU similarity scores are used as rewards to guide this learning process. 
Preliminary results show that our tool-augmented models achieve up to +3.37 BLEU improvement over previous work, and an 18% relative gain compared to a supervised baseline without dictionary access, on the Spanish-Wayuunaiki test set from the AmericasNLP 2025 Shared Task. We also conduct ablation studies to assess the effects of model architecture and training strategy, comparing Qwen2.5-0.5B-Instruct with other models such as LLaMA and a prior NLLB-based system. These findings highlight the promise of combining LLMs with external tools and the role of reinforcement learning in improving translation quality in low-resource language settings.

[19] Rule Synergy Analysis using LLMs: State of the Art and Implications

Bahar Bateni,Benjamin Pratt,Jim Whitehead

Main category: cs.CL

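TL;DR: This paper studies how well large language models understand and reason about complex rule interactions in dynamic environments such as card games, and points to directions for improving model prediction of rules and their interaction effects.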

Details Motivation: To study how well large language models understand and reason about complex rule interactions in dynamic environments such as card games. Method: The authors introduce a dataset of card synergies from the game Slay the Spire, in which card pairs are classified by positive, negative, or neutral interaction, and evaluate large language models on it. Result: Large language models identify non-synergistic card pairs well but perform poorly at identifying positive and, especially, negative synergies. Common error types are categorized. Conclusion: LLMs excel at identifying non-synergistic pairs but struggle to detect positive and particularly negative synergies, indicating that further research is needed on predicting the effects of rules and their interactions. Abstract: Large language models (LLMs) have demonstrated strong performance across a variety of domains, including logical reasoning, mathematics, and more. In this paper, we investigate how well LLMs understand and reason about complex rule interactions in dynamic environments, such as card games. We introduce a dataset of card synergies from the game Slay the Spire, where pairs of cards are classified based on their positive, negative, or neutral interactions. Our evaluation shows that while LLMs excel at identifying non-synergistic pairs, they struggle with detecting positive and, particularly, negative synergies. We categorize common error types, including issues with timing, defining game states, and following game rules. Our findings suggest directions for future research to improve model performance in predicting the effect of rules and their interactions.

[20] Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding

Bowen Sun,Yujun Cai,Ming-Hsuan Yang,Yiwei Wang

Main category: cs.CL

TL;DR: Blockwise SFT aligns training with inference in diffusion models, improving performance on multiple tasks.

Details Motivation: Standard SFT misaligns with semi-autoregressive inference in discrete diffusion language models, causing issues like noisy prefixes and leaky suffixes. Method: Partitioning responses into fixed-size blocks and performing stochastic masking only on the active block while freezing preceding tokens and hiding future ones. Result: Experiments showed consistent performance gains over classical SFT under equal compute or token budgets, with improvements attributed to better training-inference alignment. Conclusion: Blockwise SFT improves the training-inference alignment for discrete diffusion language models, leading to better performance on tasks like GSM8K, MATH, and MetaMathQA. Abstract: Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
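
The masking scheme described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `build_blockwise_example`, the `[MASK]` token string, the uniformly sampled mask ratio, and the choice to compute loss only on masked positions inside the active block are all hypothetical. Future tokens are "hidden" here by replacing them with the mask token, which is one possible reading of the abstract.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def build_blockwise_example(prompt, response, block_size, rng):
    """Build one training input in the style of Blockwise SFT.

    Returns (inputs, loss_positions, (start, end)) where start/end delimit
    the active block inside the response.
    """
    if not response:
        raise ValueError("response must contain at least one token")
    n_blocks = (len(response) + block_size - 1) // block_size
    active = rng.randrange(n_blocks)            # pick one active block
    start = active * block_size
    end = min(start + block_size, len(response))

    # Stochastic masking inside the active block (assumed: uniform mask ratio).
    mask_ratio = rng.random()
    masked = [i for i in range(start, end) if rng.random() < mask_ratio]
    if not masked:                              # always supervise at least one token
        masked = [rng.randrange(start, end)]

    offset = len(prompt)
    inputs = list(prompt) + list(response)      # the prefix before `start` stays clean
    for i in masked:
        inputs[offset + i] = MASK
    for i in range(end, len(response)):         # future tokens are fully hidden
        inputs[offset + i] = MASK

    loss_positions = [offset + i for i in masked]
    return inputs, loss_positions, (start, end)


prompt = ["Q:", "2+2?"]
response = list("abcdefgh")
inputs, loss_positions, (start, end) = build_blockwise_example(
    prompt, response, block_size=3, rng=random.Random(0)
)
```

Whichever block is sampled, the tokens before it are left untouched and everything after it is masked, so the model sees the same kind of context it would see while decoding that block at inference time.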

[21] Alignment with Fill-In-the-Middle for Enhancing Code Generation

Houxing Ren,Zimu Lu,Weikang Shi,Haotian Hou,Yunqiao Yang,Ke Wang,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

TL;DR: This paper proposes a novel approach to enhance code generation in LLMs by splitting code into granular blocks and using AST splitting and curriculum training methods, showing significant improvements in benchmark datasets.

Details Motivation: The motivation is to address the challenge of limited verifiable training data with accurate test cases for improving code-related task performance in LLMs. Method: The method involves splitting code snippets into smaller blocks to generate more diverse DPO pairs, using AST splitting, and implementing curriculum training to improve DPO training for code generation tasks. Result: The experiments showed significant improvements in code generation tasks on benchmark datasets like HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench. Conclusion: The proposed approach of splitting code snippets into granular blocks and using AST splitting with curriculum training enhances DPO training, leading to significant improvements in code generation tasks. Abstract: The code generation capabilities of Large Language Models (LLMs) have advanced applications like tool invocation and problem-solving. However, improving performance in code-related tasks remains challenging due to limited training data that is verifiable with accurate test cases. While Direct Preference Optimization (DPO) has shown promise, existing methods for generating test cases still face limitations. In this paper, we propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce the Abstract Syntax Tree (AST) splitting and curriculum training method to enhance the DPO training. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench. Code and data are available at https://github.com/SenseLLM/StructureCoder.
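
To make the idea of splitting code into granular blocks along AST boundaries concrete, here is a minimal sketch using Python's standard `ast` module. It is an assumption-laden illustration, not the StructureCoder pipeline: the function name `split_function_body`, the choice of top-level statements in a function body as the block granularity, and the (prefix, middle, suffix) output shape are all hypothetical. Each triple could serve as a fill-in-the-middle task whose completed code is checked by the same test cases.

```python
import ast


def split_function_body(source):
    """Return (prefix, middle, suffix) triples, one per top-level statement
    in each function body, cut on AST statement boundaries."""
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)
    triples = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for stmt in node.body:
                first = stmt.lineno - 1      # ast line numbers are 1-based
                last = stmt.end_lineno       # inclusive end, so usable as a slice stop
                triples.append((
                    "".join(lines[:first]),
                    "".join(lines[first:last]),
                    "".join(lines[last:]),
                ))
    return triples


SAMPLE = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)

triples = split_function_body(SAMPLE)
```

Cutting on statement boundaries keeps each middle block syntactically complete, for example the whole `for` loop rather than half of it, which an arbitrary character-level or line-level split would not guarantee.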

[22] Emotion Transfer with Enhanced Prototype for Unseen Emotion Recognition in Conversation

Kun Peng,Cong Cao,Hao Peng,Guanlin Wu,Zhifeng Hao,Lei Jiang,Yanbing Liu,Philip S. Yu

Main category: cs.CL

TL;DR: The paper introduces the UERC task to recognize unseen emotions in conversations and proposes ProEmoTrans, a framework that tackles key challenges like implicit emotion expressions, long conversation encoding, and emotion transition transfer.

Details Motivation: Current ERC research works under a closed-domain assumption, but there's no clear consensus on emotion classification in psychology. This makes it difficult for models to recognize previously unseen emotions in real-world scenarios, highlighting the need for a new approach. Method: ProEmoTrans, a prototype-based emotion transfer framework, addresses three key challenges in UERC: implicit emotion expressions, utterance encoding in long conversations, and emotion transition transfer. It uses an LLM-enhanced description approach, a parameter-free mechanism for encoding, and an improved Attention Viterbi Decoding (AVD) method. Result: The experiments on three datasets showed that ProEmoTrans performs well and serves as a strong baseline for future research in the field of Unseen Emotion Recognition in Conversation. Conclusion: The paper proposes a new framework called ProEmoTrans for the newly introduced task of Unseen Emotion Recognition in Conversation (UERC), which provides a promising solution for recognizing unseen emotions in real-world applications. Abstract: Current Emotion Recognition in Conversation (ERC) research follows a closed-domain assumption. However, there is no clear consensus on emotion classification in psychology, which presents a challenge for models when it comes to recognizing previously unseen emotions in real-world applications. To bridge this gap, we introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time and propose ProEmoTrans, a solid prototype-based emotion transfer framework. This prototype-based approach shows promise but still faces key challenges: First, implicit expressions complicate emotion definition, which we address by proposing an LLM-enhanced description approach. Second, utterance encoding in long conversations is difficult, which we tackle with a proposed parameter-free mechanism for efficient encoding and overfitting prevention. 
Finally, the Markovian flow nature of emotions is hard to transfer, which we address with an improved Attention Viterbi Decoding (AVD) method to transfer seen emotion transitions to unseen emotions. Extensive experiments on three datasets show that our method serves as a strong baseline for preliminary exploration in this new area.

[23] Language Models Identify Ambiguities and Exploit Loopholes

Jio Choi,Mohit Bansal,Elias Stengel-Eskin

Main category: cs.CL

TL;DR: This paper explores how large language models can exploit loopholes by identifying ambiguities and performing sophisticated pragmatic reasoning, posing potential AI safety risks.

Details Motivation: The motivation is to examine ambiguity and pragmatics in LLMs as well as address an interesting and novel alignment problem where models might exploit ambiguities for their own advantage due to conflicting goals. Method: The authors designed scenarios where LLMs were given a goal and an ambiguous user instruction conflicting with the goal. These scenarios covered scalar implicature, structural ambiguities, and power dynamics. They measured the models' abilities to exploit loopholes to satisfy their given goals instead of the user's. Result: The models were found capable of identifying ambiguities and exploiting loopholes to fulfill their goals over the user's. The analysis also showed that these models explicitly identify and reason about both ambiguity and conflicting goals. Conclusion: The study concludes that both closed-source and stronger open-source LLMs can identify ambiguities and exploit loopholes, which presents a potential AI safety risk. Abstract: Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models' abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. 
Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.

[24] Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts

Jiaqi Deng,Yuho Lee,Nicole Hee-Yeon Kim,Hyangsuk Min,Taewon Yun,Minjeong Ban,Kim Yul,Hwanjun Song

Main category: cs.CL

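TL;DR: HAMLET is an automated framework for evaluating the long-context comprehension of large language models. Using a three-level key-fact hierarchy and query-focused summarization, it exposes model weaknesses in fine-grained comprehension and positional effects, and its automated evaluation is validated as efficient and reliable.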

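Details Motivation: Evaluating the long-context comprehension of large language models requires an automated, systematic framework that reduces the cost of human evaluation and accurately identifies where models struggle at different levels of comprehension. Method: HAMLET structures source texts into a three-level key-fact hierarchy (root, branch, and leaf levels) and uses query-focused summarization to evaluate how well models recall and faithfully represent information at each level. Result: In a systematic human study, HAMLET's fully automated evaluation reaches over 90% agreement with expert human judgments while reducing evaluation cost by up to 25 times. The study also finds that analytical queries are harder for LLMs than narrative ones, and that clear performance gaps exist between open-source and proprietary models and across model scales. Conclusion: HAMLET is a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). It reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects such as lost-in-the-middle. Abstract: We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. Our code and dataset are publicly available at https://github.com/DISL-Lab/HAMLET.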

[25] ArgCMV: An Argument Summarization Benchmark for the LLM-era

Omkar Gurjar,Agam Goyal,Eshwar Chandrasekharan

Main category: cs.CL

TL;DR: This paper introduces ArgCMV, a new and more complex dataset for argument key point extraction derived from real online debates, highlighting the shortcomings of current methods and setting the stage for improved summarization research.

Details Motivation: The motivation stems from the limitations of the existing ArgKP21 dataset, which is not representative of real human conversations, thus necessitating a more complex and realistic benchmark for key point extraction research. Method: The authors curated a new dataset called ArgCMV using SoTA LLMs, comprising 12K arguments from real online debates across 3K topics. They benchmarked existing KP extraction methods on this dataset and analyzed their performance. Result: The new ArgCMV dataset demonstrates higher complexity compared to ArgKP21, including longer arguments, co-referencing, subjective discourse, and a wider range of topics. Experimental results show that current methods do not perform well on this dataset. Conclusion: The paper concludes that existing KP extraction methods struggle with the new ArgCMV dataset and emphasizes the importance of developing more advanced models suited for complex, real-world argument summarization. Abstract: Key point extraction is an important task in argument summarization which involves extracting high-level short summaries from arguments. Existing approaches for KP extraction have been mostly evaluated on the popular ArgKP21 dataset. In this paper, we highlight some of the major limitations of the ArgKP21 dataset and demonstrate the need for new benchmarks that are more representative of actual human conversations. Using SoTA large language models (LLMs), we curate a new argument key point extraction dataset called ArgCMV comprising around 12K arguments from actual online human debates spread across over 3K topics. Our dataset exhibits higher complexity such as longer, co-referencing arguments, higher presence of subjective discourse units, and a larger range of topics over ArgKP21. We show that existing methods do not adapt well to ArgCMV and provide extensive benchmark results by experimenting with existing baselines and latest open source models. 
This work introduces a novel KP extraction dataset for long-context online discussions, setting the stage for the next generation of LLM-driven summarization research.

[26] Towards stable AI systems for Evaluating Arabic Pronunciations

Hadi Zaatiti,Hatem Hajri,Osama Abdullah,Nader Masmoudi

Main category: cs.CL

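TL;DR: This study analyzes how wav2vec 2.0 models perform on isolated Arabic letter recognition, substantially improves accuracy by training a lightweight neural network and applying adversarial training, and releases the data and code for future research.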

Details Motivation: Modern Arabic automatic speech recognition systems such as wav2vec 2.0 perform well at word- and sentence-level transcription but struggle to classify isolated letters, mainly because isolated letters lack co-articulatory cues, have no lexical context, and are very short. The study therefore targets this challenging phoneme-level task, which is crucial for language learning, speech therapy, and phonetic research. Method: The study uses a diverse, diacritised corpus of isolated Arabic letters, runs experiments with wav2vec 2.0 models, and trains a lightweight neural network to improve performance. It also tests robustness by adding small amplitude perturbations and applies adversarial training to restore performance. Result: State-of-the-art wav2vec 2.0 models reach only 35% accuracy on isolated Arabic letter recognition. Training a lightweight neural network on wav2vec embeddings raises performance to 65%. However, adding a small amplitude perturbation (epsilon = 0.05) cuts accuracy to 32%. With adversarial training, the accuracy drop on noisy speech is limited to 9% while clean-speech accuracy is preserved. Conclusion: The study shows the limitations of wav2vec 2.0 models on isolated Arabic letter recognition and improves robustness through adversarial training, laying the groundwork for future extension to word- and sentence-level frameworks. Abstract: Modern Arabic ASR systems such as wav2vec 2.0 excel at word- and sentence-level transcription, yet struggle to classify isolated letters. In this study, we show that this phoneme-level task, crucial for language learning, speech therapy, and phonetic research, is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds. Recogniser systems must therefore rely solely on variable acoustic cues, a difficulty heightened by Arabic's emphatic (pharyngealized) consonants and other sounds with no close analogues in many languages. This study introduces a diverse, diacritised corpus of isolated Arabic letters and demonstrates that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it. Training a lightweight neural network on wav2vec embeddings raises performance to 65%. However, adding a small amplitude perturbation (epsilon = 0.05) cuts accuracy to 32%. To restore robustness, we apply adversarial training, limiting the noisy-speech drop to 9% while preserving clean-speech accuracy. We detail the corpus, training pipeline, and evaluation protocol, and release, on demand, data and code for reproducibility. Finally, we outline future work extending these methods to word- and sentence-level frameworks, where precise letter pronunciation remains critical.

[27] Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs

Jun Bai,Minghao Tong,Yang Liu,Zixia Jia,Zilong Zheng

Main category: cs.CL

TL;DR: This paper introduces CEFT, a method for improving context faithfulness in language models by selectively fine-tuning specialized experts identified using Router Lens.

Details Motivation: Large language models often struggle with context faithfulness, producing irrelevant responses in context-dependent scenarios. This work explores whether certain experts in mixture-of-experts models specialize in utilizing context, offering a path to targeted optimization for better context grounding. Method: The authors propose Router Lens to identify context-faithful experts within mixture-of-experts architectures. They analyze how these experts process contextual information and develop CEFT, an approach that selectively fine-tunes these experts to enhance context grounding. Result: Router Lens successfully identifies context-faithful experts, revealing that they amplify attention to relevant contextual information. CEFT, the proposed fine-tuning method, demonstrates comparable or superior performance to full fine-tuning across various benchmarks and models, with significantly higher efficiency. Conclusion: CEFT is a lightweight optimization approach that effectively improves context faithfulness in large language models by selectively fine-tuning context-faithful experts, matching or surpassing the performance of full fine-tuning. Abstract: Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. 
Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.

[28] LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation

Yang Sun,Lixin Zou,Dan Luo,Zhiyong Xie,Long Zhang,Liming Dong,Yunwei Zhao,Xixun Lin,Yanxiong Lu,Chenliang Li

Main category: cs.CL

TL;DR: This paper introduces Layer Fused Decoding (LFD), a strategy to improve external knowledge integration in retrieval-augmented generation by leveraging intermediate layers of large language models, demonstrating improved performance with minimal cost.

Details Motivation: The motivation stems from the observation that injecting noise into retrieved documents can paradoxically improve the use of external knowledge in large language models (LLMs). This phenomenon allows for deeper analysis of how LLMs integrate external and internal knowledge, prompting the search for more effective knowledge utilization strategies. Method: The paper proposes Layer Fused Decoding (LFD), a method that combines representations from an intermediate layer with final-layer decoding outputs. An internal knowledge score (IKS) criterion is introduced to identify the optimal intermediate layer for knowledge integration. Result: Experimental results show that the proposed LFD method enhances the ability of RAG systems to exploit retrieved context knowledge effectively and efficiently across multiple benchmarks. Conclusion: The paper concludes that Layer Fused Decoding (LFD) enhances the ability of retrieval-augmented generation (RAG) systems to utilize external knowledge, leading to improved performance with minimal cost. Abstract: Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. 
Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
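
The layer-selection rule stated in the abstract (pick the layer with the lowest IKS value in the latter half of the layers) is simple enough to sketch directly. The abstract does not say how IKS is computed or how the selected layer's representation is fused with the final-layer output, so this sketch covers only the selection step. The function name `select_fusion_layer` and the example scores are hypothetical.

```python
def select_fusion_layer(iks_by_layer):
    """Return the index of the layer with the lowest internal knowledge score,
    searching only the latter half of the layers (0-based indices)."""
    n_layers = len(iks_by_layer)
    if n_layers == 0:
        raise ValueError("need at least one layer score")
    first_candidate = n_layers // 2
    return min(range(first_candidate, n_layers), key=lambda i: iks_by_layer[i])


# Made-up IKS values for a 6-layer model: layer 0 has the global minimum,
# but it is excluded because only layers 3..5 are candidates.
example_iks = [0.1, 0.9, 0.8, 0.4, 0.3, 0.7]
chosen = select_fusion_layer(example_iks)
```

Restricting the search to the latter half means a very low score in an early layer is never selected, as the example shows.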

[29] A Symbolic Adversarial Learning Framework for Evolving Fake News Generation and Detection

Chong Tian,Qirong Ho,Xiuying Chen

Main category: cs.CL

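TL;DR: SALF is an adversarial framework in which a fake-news generation agent and a debate-based detection agent iteratively refine each other through symbolic, prompt-level learning rather than numerical weight updates.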

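Details Motivation: Rapid LLM advancements make it easy to generate increasingly sophisticated misinformation, and existing detectors, whether fine-tuned small models or LLM-based, struggle with its dynamically evolving nature. Method: SALF pairs a generation agent that crafts deceptive narratives with a detection agent that uses structured debates to find logical and factual flaws. Both agents are represented with agent symbolic learning: the learnable weights are the agent prompts, and back-propagation and gradient descent are simulated on natural language representations of weights, loss, and gradients. Result: On two multilingual benchmark datasets, SALF-generated fake news degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average, and SALF improves detection of refined content by up to 7.7%. Conclusion: Symbolic adversarial learning both produces harder fake news and refines detectors, pointing toward more robust, adaptable fake news detection systems. Abstract: Rapid LLM advancements heighten fake news risks by enabling the automatic generation of increasingly sophisticated misinformation. Previous detection methods, including fine-tuned small models or LLM-based detectors, often struggle with its dynamically evolving nature. In this work, we propose a novel framework called the Symbolic Adversarial Learning Framework (SALF), which implements an adversarial training paradigm by an agent symbolic learning optimization process, rather than relying on numerical updates. SALF introduces a paradigm where the generation agent crafts deceptive narratives, and the detection agent uses structured debates to identify logical and factual flaws for detection, and they iteratively refine themselves through such adversarial interactions. Unlike traditional neural updates, we represent agents using agent symbolic learning, where learnable weights are defined by agent prompts, and simulate back-propagation and gradient descent by operating on natural language representations of weights, loss, and gradients. Experiments on two multilingual benchmark datasets demonstrate SALF's effectiveness, showing it generates sophisticated fake news that degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average. SALF also refines detectors, improving detection of refined content by up to 7.7%. We hope our work inspires further exploration into more robust, adaptable fake news detection systems.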

[30] Automatic integration of SystemC in the FMI standard for Software-defined Vehicle design

Giovanni Pollo,Andrei Mihai Albu,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Loris Panaro,Dario Soldi,Fabio Autieri,Sara Vinco

Main category: cs.CL

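TL;DR: This work proposes a method for automatically wrapping SystemC models using the FMI standard, improving collaboration, scalability, and IP protection for co-simulation in the automotive domain.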

Details Motivation: The evolution of the automotive sector requires robust co-simulation methodologies, but standardized interfaces are lacking and most simulation platforms are proprietary, which limits collaboration, scalability, and IP protection. Method: The approach combines the modeling accuracy of SystemC with the interoperability of FMI by wrapping SystemC models through the FMI standard, enabling secure and portable integration of embedded components. Result: Real-world case studies validate the method and show that it remains effective on complex designs. Conclusion: The paper presents a method for automatically wrapping SystemC models with the FMI standard, addressing the lack of standardized interfaces and the IP protection problem in automotive co-simulation. Abstract: The recent advancements of the automotive sector demand robust co-simulation methodologies that enable early validation and seamless integration across hardware and software domains. However, the lack of standardized interfaces and the dominance of proprietary simulation platforms pose significant challenges to collaboration, scalability, and IP protection. To address these limitations, this paper presents an approach for automatically wrapping SystemC models by using the Functional Mock-up Interface (FMI) standard. This method combines the modeling accuracy and fast time-to-market of SystemC with the interoperability and encapsulation benefits of FMI, enabling secure and portable integration of embedded components into co-simulation workflows. We validate the proposed methodology on real-world case studies, demonstrating its effectiveness with complex designs.

[31] Survey of Specialized Large Language Model

Chenghan Yang,Ruiyu Zhao,Yang Liu,Ling Jiang

Main category: cs.CL

TL;DR: This survey explores the evolution of specialized large language models (LLMs) across various domains, highlighting technical breakthroughs such as domain-native designs, parameter efficiency, and multimodal capabilities that address the limitations of general-purpose LLMs, with implications for the E-Commerce field.

Details Motivation: The motivation is to understand the paradigm shift in AI development driven by the rapid evolution of specialized LLMs and their impact on professional applications. Method: This survey systematically examines the progression of specialized LLMs across multiple domains, including healthcare, finance, legal, and technical fields, focusing on domain-native designs, parameter efficiency, and multimodal capabilities. Result: The analysis reveals that specialized LLMs overcome the limitations of general-purpose LLMs, showing consistent performance gains on domain-specific benchmarks, and highlights the implications for the E-Commerce field. Conclusion: The survey highlights the paradigm shift in AI development through the evolution of specialized LLMs and their technical breakthroughs, which address limitations of general-purpose LLMs and show implications for the E-Commerce field. Abstract: The rapid evolution of specialized large language models (LLMs) has transitioned from simple domain adaptation to sophisticated native architectures, marking a paradigm shift in AI development. This survey systematically examines this progression across healthcare, finance, legal, and technical domains. Besides the wide use of specialized LLMs, technical breakthroughs such as the emergence of domain-native designs beyond fine-tuning, a growing emphasis on parameter efficiency through sparse computation and quantization, and the increasing integration of multimodal capabilities are applied to recent LLM agents. Our analysis reveals how these innovations address fundamental limitations of general-purpose LLMs in professional applications, with specialized models consistently showing performance gains on domain-specific benchmarks. The survey further highlights the implications for the E-Commerce field, with the aim of filling gaps in that field.

[32] Building Task Bots with Self-learning for Enhanced Adaptability, Extensibility, and Factuality

Xiaoying Zhang

Main category: cs.CL

TL;DR: This thesis explores methods for creating adaptable task bots that can learn and adapt autonomously in changing environments with minimal human intervention.

Details Motivation: The motivation is to create task bots that can operate with minimal or zero human intervention in constantly changing environments. Method: The thesis examines obstacles and potential solutions, focusing on innovative techniques for autonomous learning and adaptation. Result: The result is an analysis of the challenges and possible solutions for creating adaptable task bots. Conclusion: This thesis concludes that developing adaptable, extensible, and accurate task bots is a significant challenge in dialog research. Abstract: Developing adaptable, extensible, and accurate task bots with minimal or zero human intervention is a significant challenge in dialog research. This thesis examines the obstacles and potential solutions for creating such bots, focusing on innovative techniques that enable bots to learn and adapt autonomously in constantly changing environments.

[33] Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models

Yilin Wang,Heng Wang,Yuyang Bai,Minnan Luo

Main category: cs.CL

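TL;DR: The paper proposes CSKS, a new framework that flexibly and efficiently controls an LLM's sensitivity to contextual knowledge, addressing knowledge conflicts.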

Details Motivation: Existing methods for adjusting an LLM's sensitivity to contextual knowledge are usually inefficient, ineffective, or unable to adjust sensitivity continuously, so a more efficient and flexible method is needed. Method: The difference between the output distributions of two small proxy models is used to shift the output distribution of the original LLM, thereby controlling its sensitivity to contextual knowledge. Result: Experiments show that CSKS achieves continuous control over an LLM's sensitivity to contextual knowledge without modifying the LLM's weights, while improving its handling of knowledge conflicts. Conclusion: CSKS achieves continuous and precise control over an LLM's sensitivity to contextual knowledge at low cost, making the model more flexible and practical when handling knowledge conflicts. Abstract: In Large Language Models (LLMs) generation, there exist knowledge conflicts and scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust LLMs' sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs' sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e. proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models' sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS's practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs' sensitivity to contextual knowledge, enabling both increased sensitivity and reduced sensitivity, thereby allowing LLMs to prioritize either contextual or parametric knowledge as needed flexibly. Our data and code are available at https://github.com/OliveJuiceLin/CSKS.
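
The proxy-model idea above can be sketched in a few lines. The sketch assumes that the shift is applied additively in logit space and scaled by a continuous coefficient `alpha`. The abstract only states that the difference between the two proxies' output distributions shifts the LLM's distribution, so the exact combination rule, the name `steer_distribution`, and the toy logits are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_distribution(llm_logits, proxy_a_logits, proxy_b_logits, alpha):
    """Shift the LLM's next-token logits by alpha * (proxy_a - proxy_b).

    alpha = 0 leaves the LLM distribution unchanged. Positive alpha moves it
    toward tokens favoured by proxy A, negative alpha toward proxy B.
    The LLM weights are never modified (assumed combination rule).
    """
    shifted = [
        base + alpha * (a - b)
        for base, a, b in zip(llm_logits, proxy_a_logits, proxy_b_logits)
    ]
    return softmax(shifted)


# Toy two-token vocabulary: token 0 = answer from the context,
# token 1 = answer from parametric memory (illustrative values only).
llm = [1.0, 1.0]
context_faithful_proxy = [2.0, 0.0]
parametric_proxy = [0.0, 2.0]
more_contextual = steer_distribution(llm, context_faithful_proxy, parametric_proxy, 1.0)
less_contextual = steer_distribution(llm, context_faithful_proxy, parametric_proxy, -1.0)
```

Because `alpha` is a real number, the sensitivity can be raised or lowered smoothly rather than toggled, which is the "continuous" property the abstract emphasizes.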

[34] CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese

Carlos Carvalho,Francisco Teixeira,Catarina Botelho,Anna Pompili,Rubén Solera-Ureña,Sérgio Paulo,Mariana Julião,Thomas Rolland,John Mendonça,Diogo Pereira,Isabel Trancoso,Alberto Abad

Main category: cs.CL

TL;DR: This paper introduces CAMÕES, the first open framework for Automatic Speech Recognition in European Portuguese and other varieties, achieving significant improvements in WER and establishing a new state-of-the-art.

Details Motivation: Existing resources for Automatic Speech Recognition in Portuguese are mostly focused on Brazilian Portuguese, leaving European Portuguese and other varieties under-explored. Method: The paper introduces CAMÕES, which includes a comprehensive evaluation benchmark with 46h of EP test data and state-of-the-art models, including foundation models and E-Branchformer models trained from scratch, using a dataset of 425h of EP data. Result: The results show comparable performance between fine-tuned foundation models and E-Branchformer models for European Portuguese, with the best-performing models achieving relative improvements above 35% WER compared to the strongest zero-shot foundation model. Conclusion: CAMÕES provides a new state-of-the-art framework for Automatic Speech Recognition in European Portuguese and other varieties, demonstrating comparable performance between fine-tuned foundation models and E-Branchformer models, with significant improvements in WER. Abstract: Existing resources for Automatic Speech Recognition in Portuguese are mostly focused on Brazilian Portuguese, leaving European Portuguese (EP) and other varieties under-explored. To bridge this gap, we introduce CAMÕES, the first open framework for EP and other Portuguese varieties. It consists of (1) a comprehensive evaluation benchmark, including 46h of EP test data spanning multiple domains; and (2) a collection of state-of-the-art models. For the latter, we consider multiple foundation models, evaluating their zero-shot and fine-tuned performances, as well as E-Branchformer models trained from scratch. A curated set of 425h of EP was used for both fine-tuning and training. Our results show comparable performance for EP between fine-tuned foundation models and the E-Branchformer. 
Furthermore, the best-performing models achieve relative improvements above 35% WER, compared to the strongest zero-shot foundation model, establishing a new state-of-the-art for EP and other varieties.
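
The "relative improvement in WER" quoted above is a ratio of error rates, not a difference in percentage points. A minimal sketch of both quantities follows, using the standard word-level edit-distance definition of WER. The function names and the example sentences and numbers are illustrative and are not taken from the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer, new_wer):
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: going from 20.0% to 13.0% WER is a 35% relative improvement.
example = relative_wer_improvement(20.0, 13.0)
```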

[35] NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Aritra Dutta,Swapnanil Mukherjee,Deepanway Ghosal,Somak Aditya

Main category: cs.CL

TL;DR: The paper presents NLKI, a framework that enhances small vision-language models (sVLMs) by integrating commonsense knowledge through natural language facts and explanations, significantly improving their performance on visual-question answering tasks and matching or surpassing that of larger models.

Details Motivation: The motivation for this study is to explore how effectively integrating commonsense knowledge impacts the performance of small vision-language models (sVLMs), which typically underperform compared to larger generative models due to missing knowledge in visual-question answering tasks. Method: The researchers introduced an end-to-end framework called NLKI, which retrieves natural language facts, generates explanations using an LLM, and feeds these signals to sVLMs. They applied this framework on two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Noise-robust losses were also employed during fine-tuning to handle label noise. Result: The NLKI framework significantly reduced hallucinations and improved end-to-end answer accuracy by up to 7% across three datasets. Fine-tuning with noise-robust losses further improved performance by 2.5% on CRIC and 5.5% on AOKVQA. As a result, FLAVA and other models matched or exceeded the performance of medium-sized models like Qwen-2 VL-2B and SmolVLM-2.5B. Conclusion: The study concludes that integrating commonsense knowledge effectively can significantly enhance the performance of small vision-language models (sVLMs), enabling them to match or surpass medium-sized models. Additionally, noise-aware training stabilizes these smaller models when using external knowledge augmentation. Abstract: Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. 
To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
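
For readers unfamiliar with the two noise-robust losses named above, the sketch below gives their usual per-example forms for a predicted probability vector and an integer label. The hyperparameter defaults (`q`, `alpha`, `beta`, `log_zero`) are common illustrative choices, not values reported in this abstract.

```python
import math


def generalized_cross_entropy(probs, label, q=0.7):
    """GCE loss (1 - p_y^q) / q, with 0 < q <= 1.

    q -> 0 approaches standard cross entropy; q = 1 gives a loss of
    1 - p_y, which is less sensitive to mislabeled examples.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (1.0 - probs[label] ** q) / q


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """SCE loss alpha * CE + beta * reverse CE for a one-hot label.

    Reverse CE swaps the roles of prediction and label. log(0) for the
    one-hot zeros is clipped to `log_zero`, so the reverse term reduces
    to -log_zero * (1 - p_y).
    """
    eps = 1e-12  # guards against log(0) in the forward term
    cross_entropy = -math.log(max(probs[label], eps))
    reverse_cross_entropy = -log_zero * (1.0 - probs[label])
    return alpha * cross_entropy + beta * reverse_cross_entropy


predicted = [0.7, 0.2, 0.1]
gce_value = generalized_cross_entropy(predicted, 0, q=1.0)
sce_value = symmetric_cross_entropy(predicted, 0)
```

Both losses bound the penalty a single confidently wrong label can contribute, which is why they are typical choices when a benchmark is known to contain label noise.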

[36] Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval

Wenhao Li,Yuxin Zhang,Gen Luo,Haiyuan Wan,Ziyang Gong,Fei Chao,Rongrong Ji

Main category: cs.CL

TL;DR: Spotlight Attention is a new method that uses non-linear hashing functions to speed up LLM inference.

Details Motivation: Existing random linear hashing methods are inefficient for managing the key-value cache of LLMs. Method: Spotlight Attention, which uses non-linear hashing functions, together with a training framework based on a Bradley-Terry ranking loss. Result: Hash codes at least 5× shorter, improved retrieval precision, and hashing retrieval over 512K tokens in under 100 μs. Conclusion: Spotlight Attention substantially improves LLM inference efficiency over existing linear hashing methods. Abstract: Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5× compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100 μs on a single A100 GPU, with end-to-end throughput up to 3× higher than vanilla decoding.
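
The retrieval step rests on a simple idea: once queries and keys are mapped to binary codes, candidate keys can be ranked by Hamming distance using XOR and a bit count. The sketch below shows only that ranking step in plain Python, with codes stored as integers. The learned non-linear hashing module and the CUDA kernels are out of scope, and the function names are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two hash codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def top_k_keys(query_code, key_codes, k):
    """Indices of the k cached keys whose codes are closest to the query code.

    Ties are broken by position in the cache because sorted() is stable.
    """
    ranked = sorted(
        range(len(key_codes)),
        key=lambda i: hamming_distance(query_code, key_codes[i]),
    )
    return ranked[:k]


# Toy 4-bit codes for four cached tokens (illustrative values only).
cache_codes = [0b1011, 0b0000, 0b1010, 0b0100]
selected = top_k_keys(0b1011, cache_codes, k=2)
```

Shorter codes mean fewer bits to XOR and count per key, which is why reducing hash code length translates directly into cheaper retrieval.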

[37] Uncovering the Bigger Picture: Comprehensive Event Understanding Via Diverse News Retrieval

Yixuan Tang,Yuanyuan Shi,Yiqun Sun,Anthony Kum Hoe Tung

Main category: cs.CL

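TL;DR: This paper proposes NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level, mitigating redundancy and promoting comprehensive event understanding.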

Details Motivation: Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. Method: The authors propose NEWSCOPE, a two-stage framework that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, and the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. Result: Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving higher diversity without compromising relevance. Conclusion: Through fine-grained, interpretable modeling, NEWSCOPE effectively mitigates redundancy and promotes comprehensive event understanding. Abstract: Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at https://github.com/tangyixuan/NEWSCOPE.
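
The abstract names Average Pairwise Distance as a diversity metric but does not define it. One plausible reading, sketched below, is the mean cosine distance over all pairs of retrieved-item embeddings. The choice of cosine distance and the function names are assumptions, so the paper's exact definition may differ.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(embeddings):
    """Mean cosine distance over all unordered pairs of embeddings.

    Returns 0.0 when fewer than two embeddings are given. Higher values
    indicate a more diverse result list (assumed metric definition).
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Two near-duplicate results and one distinct result (toy 2-d embeddings).
redundant = average_pairwise_distance([[1.0, 0.0], [1.0, 0.0]])
mixed = average_pairwise_distance([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```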

[38] Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance

Pedro Henrique Luz de Araujo,Paul Röttger,Dirk Hovy,Benjamin Roth

Main category: cs.CL

TL;DR: This paper investigates the effectiveness of expert persona prompting in language models, identifying three key criteria for success. It finds mixed results, with models being sensitive to irrelevant details and showing inconsistent gains, suggesting the need for more thoughtful persona design.

Details Motivation: Prior research on expert persona prompting shows mixed results regarding its effectiveness. This paper aims to understand when and why personas can improve performance by identifying key criteria and evaluating current models. Method: The authors analyze the literature on persona prompting to identify three desiderata: performance advantage of expert personas, robustness to irrelevant persona attributes, and fidelity to persona attributes. They then evaluate 9 state-of-the-art LLMs across 27 tasks based on these criteria. Result: Expert personas generally result in positive or neutral performance outcomes. However, models show high sensitivity to irrelevant persona details, causing significant performance drops. While higher education, specialization, and domain-relatedness can improve performance, their impact is often inconsistent or minimal. Conclusion: The study concludes that while expert personas can lead to improved or unchanged performance, they are often inconsistent in effectiveness across tasks and models. Additionally, mitigation strategies to improve robustness only work for the largest models, highlighting the need for careful persona design and better evaluation schemes. Abstract: Expert persona prompting -- assigning roles such as expert in math to language models -- is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not consider when and why personas should improve performance. We analyze the literature on persona prompting for task improvement and distill three desiderata: 1) performance advantage of expert personas, 2) robustness to irrelevant persona attributes, and 3) fidelity to persona attributes. We then evaluate 9 state-of-the-art LLMs across 27 tasks with respect to these desiderata. We find that expert personas usually lead to positive or non-significant performance changes. 
Surprisingly, models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. In terms of fidelity, we find that while higher education, specialization, and domain-relatedness can boost performance, their effects are often inconsistent or negligible across tasks. We propose mitigation strategies to improve robustness -- but find they only work for the largest, most capable models. Our findings underscore the need for more careful persona design and for evaluation schemes that reflect the intended effects of persona usage.

[39] T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Jie Zhang,Changzai Pan,Kaiwen Wei,Sishi Xiong,Yu Zhao,Xiangyu Li,Jiaxin Peng,Xiaoyan Gu,Jian Yang,Wenhan Chang,Zhenhe Wu,Jiang Zhong,Shuangyong Song,Yongxiang Li,Xuelong Li

Main category: cs.CL

TL;DR: This paper introduces T2R-bench, a new benchmark for evaluating the ability of LLMs to transform industrial table data into reports, revealing that current models still have significant room for improvement.

Details Motivation: The motivation stems from the challenge of converting table data into reports in industrial applications, hindered by complex tables and insufficient benchmarks for practical evaluation. Method: An analysis of 25 widely-used LLMs was conducted on the newly proposed T2R-bench, a bilingual benchmark consisting of 457 real-world industrial tables across 19 domains and 4 table types. A set of evaluation criteria was also proposed to assess report generation quality. Result: The experiments showed that even advanced LLMs, such as Deepseek-R1, only achieved an overall score of 62.71 on T2R-bench, highlighting the need for further improvements in this area. Conclusion: The study concludes that transforming table information into reports is a challenging task for LLMs, with current models like Deepseek-R1 achieving only moderate performance on the proposed T2R-bench benchmark. Abstract: Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose a set of evaluation criteria to fairly measure the quality of report generation. 
The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.

[40] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan,Xiufeng Yang,Zuchao Huang,Ercong Nie,Zifeng Ding,Zonggen Li,Xiaowen Ma,Hinrich Schütze,Volker Tresp,Yunpu Ma

Main category: cs.CL

TL;DR: This paper proposes Memory-R1, a reinforcement learning framework that enables LLMs to actively manage external memory, enhancing their reasoning capabilities with minimal training data.

Details Motivation: The motivation is to overcome the limitations of stateless LLMs with constrained context windows by introducing a learned mechanism for managing external memory, enabling long-horizon reasoning. Method: The paper introduces Memory-R1, which uses two specialized agents (Memory Manager and Answer Agent) trained with outcome-driven RL (PPO and GRPO) to manage and utilize external memory. Result: Memory-R1 outperformed the most competitive baseline with strong generalization across question types and LLM backbones, using as few as 152 question-answer pairs for training. Conclusion: The paper concludes that Memory-R1, a reinforcement learning framework, effectively enhances LLMs with adaptive external memory management capabilities, providing insights into more advanced reasoning systems. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations {ADD, UPDATE, DELETE, NOOP}, and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and use with minimal supervision. 
With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the most competitive existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behaviors in LLMs, pointing toward richer, more persistent reasoning systems.
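
The four structured memory operations listed above can be illustrated with a small data-structure sketch. It assumes the memory bank is a dictionary from entry id to text, which the abstract does not specify, and it shows only how a chosen operation would be applied. The RL-trained policy that chooses the operation is not modeled, and all names are hypothetical.

```python
from enum import Enum


class MemoryOp(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


def apply_operation(bank, op, key, value=None):
    """Return a new memory bank with one operation applied.

    bank: dict mapping entry id -> stored text (assumed representation).
    The input bank is left untouched so earlier states can be compared.
    """
    updated = dict(bank)
    if op is MemoryOp.ADD:
        updated[key] = value
    elif op is MemoryOp.UPDATE:
        if key not in updated:
            raise KeyError(f"cannot update missing entry {key!r}")
        updated[key] = value
    elif op is MemoryOp.DELETE:
        updated.pop(key, None)
    # MemoryOp.NOOP: nothing to change
    return updated


bank = {}
bank = apply_operation(bank, MemoryOp.ADD, "m1", "User adopted a dog")
bank = apply_operation(bank, MemoryOp.UPDATE, "m1", "User adopted two dogs")
```

Returning a fresh dictionary keeps each step side-effect free, which makes it easy to compare the bank before and after an operation when computing an outcome-based reward.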

[41] Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis

Anusha Kamath,Kanishk Singla,Rakesh Paul,Raviraj Joshi,Utkarsh Vaidya,Sanjay Singh Chauhan,Niranjan Wartikar

Main category: cs.CL

TL;DR: This paper introduces a new methodology to create high-quality Hindi LLM evaluation datasets, addressing the lack of benchmarks for assessing LLMs in low-resource languages.

Details Motivation: The motivation behind the paper is the lack of high-quality benchmarks for evaluating instruction-tuned Large Language Models (LLMs) in Hindi, which makes it challenging to assess their performance accurately. Method: The paper uses a methodology combining from-scratch human annotation with a translate-and-verify process to create a suite of five Hindi LLM evaluation datasets. Result: The result of the study is the creation of a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi, along with an extensive benchmarking analysis of open-source LLMs supporting Hindi. Conclusion: The paper concludes that the introduced Hindi LLM evaluation datasets serve as a replicable methodology for developing benchmarks in other low-resource languages and provide a detailed comparative analysis of open-source LLMs supporting Hindi. Abstract: Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.

[42] Scalable and consistent few-shot classification of survey responses using text embeddings

Jonas Timmann Mjaaland,Markus Fleten Kreutzer,Halvor Tyseng,Rebeckah K. Fussell,Gina Passante,N. G. Holmes,Anders Malthe-Sørenssen,Tor Ole B. Odden

Main category: cs.CL

TL;DR: This paper proposes a text embedding-based classification framework that enables effective qualitative analysis at scale.

Details Motivation: Traditional coding approaches are often time-consuming and prone to inconsistency. Existing Natural Language Processing solutions have limited applicability in qualitative analysis because they demand extensive labeled data, disrupt established qualitative workflows, and/or yield variable results. Method: The authors introduce a text embedding-based classification framework that requires only a handful of examples per category and fits well with standard qualitative workflows. Result: In benchmarking, the framework achieves a Cohen's Kappa ranging from 0.74 to 0.83 compared with expert human coders. Conclusion: Text embedding-assisted coding can flexibly scale to thousands of responses without sacrificing interpretability, opening avenues for deductive qualitative analysis at scale. Abstract: Qualitative analysis of open-ended survey responses is a commonly-used research method in the social sciences, but traditional coding approaches are often time-consuming and prone to inconsistency. Existing solutions from Natural Language Processing such as supervised classifiers, topic modeling techniques, and generative large language models have limited applicability in qualitative analysis, since they demand extensive labeled data, disrupt established qualitative workflows, and/or yield variable results. In this paper, we introduce a text embedding-based classification framework that requires only a handful of examples per category and fits well with standard qualitative workflows. When benchmarked against human analysis of a conceptual physics survey consisting of 2899 open-ended responses, our framework achieves a Cohen's Kappa ranging from 0.74 to 0.83 as compared to expert human coders in an exhaustive coding scheme. We further show how performance of this framework improves with fine-tuning of the text embedding model, and how the method can be used to audit previously-analyzed datasets. These findings demonstrate that text embedding-assisted coding can flexibly scale to thousands of responses without sacrificing interpretability, opening avenues for deductive qualitative analysis at scale.
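
Two pieces of the workflow above can be sketched briefly: assigning a response to a category from a handful of example embeddings, and measuring agreement with a human coder via Cohen's Kappa. The nearest-centroid rule is an assumption, since the abstract does not state the classification rule. The embeddings are toy vectors rather than output from a real embedding model, and the category names are invented.

```python
import math
from collections import Counter


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def classify(embedding, examples_by_category):
    """Pick the category whose mean example embedding is most similar (assumed rule)."""
    centroids = {
        category: [sum(column) / len(vectors) for column in zip(*vectors)]
        for category, vectors in examples_by_category.items()
    }
    return max(centroids, key=lambda category: _cosine(embedding, centroids[category]))


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


few_shot_examples = {
    "mentions_uncertainty": [[1.0, 0.0], [0.9, 0.1]],
    "mentions_equipment": [[0.0, 1.0], [0.1, 0.9]],
}
predicted = classify([0.8, 0.2], few_shot_examples)
agreement = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain accuracy when comparing an automated coder with human coders.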

[43] TokenVerse++: Towards Flexible Multitask Learning with Dynamic Task Activation

Shashi Kumar,Srikanth Madikeri,Esaú Villatoro-Tello,Sergio Burdisso,Pradeep Rangappa,Andrés Carofilis,Petr Motlicek,Karthik Pandia,Shankar Venkatesan,Kadri Hacioğlu,Andreas Stolcke

Main category: cs.CL

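TL;DR: This paper proposes TokenVerse++, which introduces learnable vectors for dynamic task activation, removing the inability of token-based multitask frameworks to use partially annotated data and achieving results on par with or exceeding the original framework across multiple tasks.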

Details Motivation: Token-based multitask frameworks such as TokenVerse require every training utterance to have labels for all tasks, which limits their ability to leverage partially annotated datasets and to scale effectively. Method: Learnable vectors are introduced in the acoustic embedding space of the XLSR-Transducer ASR model to enable dynamic task activation. Result: By training with utterances labeled for only a subset of tasks, the authors integrate a dataset with partial labels (specifically for ASR and an additional task, language identification), improving overall performance and achieving results on par with or exceeding TokenVerse across multiple tasks. Conclusion: TokenVerse++ is a more practical multitask framework than TokenVerse, matching or exceeding its multitask performance without sacrificing ASR performance. Abstract: Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, hindering their ability to leverage partially annotated datasets and scale effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by successfully integrating a dataset with partial labels, specifically for ASR and an additional task, language identification, improving overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.

[44] Beyond Shallow Heuristics: Leveraging Human Intuition for Curriculum Learning

Vanessa Toborek,Sebastian Müller,Tim Selbach,Tamás Horváth,Christian Bauckhage

Main category: cs.CL

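TL;DR: The study examines whether human intuition about linguistic difficulty can serve as an effective signal for curriculum learning, and finds that label-based curricula outperform competence-based strategies.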

Details Motivation: Curriculum learning aims to improve training by presenting data from "easy" to "hard", but defining and measuring linguistic difficulty remains an open challenge. Method: Experiments with a BERT-tiny model compare label-based curricula with competence-based strategies. Result: Adding simple data alone yields no clear benefit, but structuring the data via a curriculum, especially when simple data is introduced first, consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula yield no consistent gains over random ordering. Conclusion: Human intuition about linguistic difficulty can guide curriculum learning for language model pre-training. Abstract: Curriculum learning (CL) aims to improve training by presenting data from "easy" to "hard", yet defining and measuring linguistic difficulty remains an open challenge. We investigate whether human-curated simple language can serve as an effective signal for CL. Using the article-level labels from the Simple Wikipedia corpus, we compare label-based curricula to competence-based strategies relying on shallow heuristics. Our experiments with a BERT-tiny model show that adding simple data alone yields no clear benefit. However, structuring it via a curriculum -- especially when introduced first -- consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula lead to no consistent gains over random ordering, probably because they fail to effectively separate the two classes. Our results suggest that human intuition about linguistic difficulty can guide CL for language model pre-training.

[45] AI-Powered Detection of Inappropriate Language in Medical School Curricula

Chiman Salavati,Shannon Song,Scott A. Hale,Roberto E. Montenegro,Shiri Dori-Hacohen,Fabricio Murai

Main category: cs.CL

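TL;DR: This study uses small language models (SLMs) and large language models (LLMs) to automatically detect inappropriate language in medical instructional materials, finds that SLMs perform better, and shows that supplementing training with unflagged data as negative examples markedly improves classifier performance.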

Details Motivation: Inappropriate language in medical instructional materials can influence clinical training, patient interactions, and health outcomes, while manually identifying such language is prohibitively costly and impractical. Method: Small language models (SLMs) fine-tuned on labeled data are compared with pre-trained large language models (LLMs) using in-context learning, on a dataset of approximately 500 documents and over 12,000 pages. Result: Llama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs; the multilabel classifier performs best on annotated data, and supplementing training with unflagged excerpts as negative examples boosts the subcategory-specific classifiers' AUC by up to 25%. Conclusion: SLMs detect inappropriate language in medical curricula better than LLMs, and the subcategory-specific classifiers trained with unflagged excerpts as negative examples are the most effective models. Abstract: The use of inappropriate language -- such as outdated, exclusionary, or non-patient-centered terms -- in medical instructional materials can significantly influence clinical training, patient interactions, and health outcomes. Despite their reputability, many materials developed over past decades contain examples now considered inappropriate by current medical standards. Given the volume of curricular content, manually identifying instances of inappropriate use of language (IUL) and its subcategories for systematic review is prohibitively costly and impractical. To address this challenge, we conduct a first-in-class evaluation of small language models (SLMs) fine-tuned on labeled data and pre-trained LLMs with in-context learning on a dataset containing approximately 500 documents and over 12,000 pages. For SLMs, we consider: (1) a general IUL classifier, (2) subcategory-specific binary classifiers, (3) a multilabel classifier, and (4) a two-stage hierarchical pipeline for general IUL detection followed by multilabel classification. For LLMs, we consider variations of prompts that include subcategory definitions and/or shots. We found that both Llama-3 8B and 70B, even with carefully curated shots, are largely outperformed by SLMs. While the multilabel classifier performs best on annotated data, supplementing training with unflagged excerpts as negative examples boosts the specific classifiers' AUC by up to 25%, making them the most effective models for mitigating harmful language in medical curricula.

[46] Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement

Mohammed Rakibul Hasan,Rafi Majid,Ahanaf Tahmid

Main category: cs.CL

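TL;DR: This paper presents Bangla-Bayanno, a high-quality visual question answering dataset in Bangla, built with a multilingual LLM-assisted translation refinement pipeline and containing 52,650 question-answer pairs, with the aim of advancing research in low-resource multimodal AI.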

Details Motivation: Existing VQA datasets are either manually annotated with an emphasis on a specific domain, query type, or answer type, or are constrained by niche answer formats. In addition, translations from multilingual sources are of low quality, so a high-quality dataset for a low-resource language is needed. Method: A multilingual LLM-assisted translation refinement pipeline is used to mitigate human-induced errors and ensure clarity, yielding a dataset of 52,650 question-answer pairs across 4750+ images, with questions classified into three answer types: nominal, quantitative, and polar. Result: An open-source Bangla VQA dataset with 52,650 question-answer pairs across 4750+ images, with questions classified as nominal, quantitative, or polar. Conclusion: Bangla-Bayanno is an open-source, high-quality VQA dataset in Bangla, a low-resource language, intended to advance research in low-resource multimodal learning and facilitate the development of more inclusive AI systems. Abstract: In this paper, we introduce Bangla-Bayanno, an open-ended Visual Question Answering (VQA) Dataset in Bangla, a widely used, low-resource language in multimodal AI research. The majority of existing datasets are either manually annotated with an emphasis on a specific domain, query type, or answer type or are constrained by niche answer formats. In order to mitigate human-induced errors and guarantee lucidity, we implemented a multilingual LLM-assisted translation refinement pipeline. This dataset overcomes the issues of low-quality translations from multilingual sources. The dataset comprises 52,650 question-answer pairs across 4750+ images. Questions are classified into three distinct answer types: nominal (short descriptive), quantitative (numeric), and polar (yes/no). Bangla-Bayanno provides the most comprehensive open-source, high-quality VQA benchmark in Bangla, aiming to advance research in low-resource multimodal learning and facilitate the development of more inclusive AI systems.

[47] Logical Reasoning with Outcome Reward Models for Test-Time Scaling

Ramya Keerthy Thatikonda,Wray Buntine,Ehsan Shareghi

Main category: cs.CL

TL;DR: This paper introduces Outcome Reward Models (ORMs) trained on Chain-of-Thought (CoT) and echo-augmented data to improve deductive reasoning in large language models (LLMs), showing enhanced performance across multiple models and datasets.

Details Motivation: Enhancing LLM performance in deductive logical reasoning is under-explored despite progress in complex reasoning tasks. Method: Generated training data using Chain-of-Thought (CoT) and introduced an echo generation technique to expand error coverage; trained ORMs on this data. Result: ORMs trained on CoT and echo-augmented data showed improved performance across four LLMs on FOLIO, JustLogic, and ProverQA datasets. Conclusion: Outcome Reward Models trained on CoT and echo-augmented data improve LLM performance on deductive reasoning tasks. Abstract: Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While the combination of test-time scaling with dedicated outcome or process reward models has opened up new avenues to enhance LLMs' performance in complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs we mainly generate data using Chain-of-Thought (CoT) with single and multiple samples. Additionally, we propose a novel tactic to further expand the type of errors covered in the training dataset of the ORM. In particular, we propose an echo generation technique that leverages LLMs' tendency to reflect incorrect assumptions made in prompts to extract additional training data, covering previously unexplored error types. While a standard CoT chain may contain errors likely to be made by the reasoner, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
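One common way an outcome reward model is used for test-time scaling is best-of-N selection: sample several candidate reasoning chains and keep the one the ORM scores highest. The abstract does not spell out the selection procedure, so the sketch below is an assumption, and `toy_orm_score` is a hypothetical stand-in for a trained ORM.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def toy_orm_score(chain):
    """Hypothetical scorer: rewards chains that end in an explicit conclusion.

    A real ORM would be a trained model returning a correctness estimate.
    """
    return 1.0 if chain.rstrip().endswith("Therefore: True") else 0.0


samples = [
    "All birds fly. Tweety is a bird. Maybe?",
    "All birds fly. Tweety is a bird. Therefore: True",
]
chosen = best_of_n(samples, toy_orm_score)
```

Increasing the number of sampled candidates is the "scaling" knob here; the quality of the final answer then depends on how well the scorer separates correct from incorrect chains.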

[48] Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems

Jingyu Guo,Yingying Xu

Main category: cs.CL

TL;DR: This paper demonstrates that stereotypes can emerge spontaneously in LLM-based AI multi-agent systems through interactions, independent of training data biases. It emphasizes the need for further research to understand and mitigate the ethical impacts of such emergent biases.

Details Motivation: While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. The study aims to explore whether stereotypes can spontaneously emerge in AI agent interactions. Method: The researchers used a novel experimental framework simulating workplace interactions with neutral initial conditions to examine the emergence and evolution of stereotypes in LLM-based multi-agent systems. They conducted a comprehensive quantitative analysis across different LLM architectures. Result: 1. LLM-based AI agents develop stereotype-driven biases despite starting without predefined biases. 2. Stereotype effects intensify with increased interaction rounds and decision-making power, especially with hierarchical structures. 3. These systems exhibit group effects similar to human social behaviors, such as halo effects, confirmation bias, and role congruity. 4. Stereotype patterns are consistent across various LLM architectures. Conclusion: The study concludes that stereotypes can emerge as an inherent characteristic of multi-agent interactions in AI systems, rather than solely being a result of training data biases. This highlights the need for further research and strategies to address the ethical implications. Abstract: While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration. Through a novel experimental framework simulating workplace interactions with neutral initial conditions, we investigate the emergence and evolution of stereotypes in LLM-based multi-agent systems. 
Our findings reveal that (1) LLM-Based AI agents develop stereotype-driven biases in their interactions despite beginning without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision-making power, particularly after introducing hierarchical structures; (3) these systems exhibit group effects analogous to human social behavior, including halo effects, confirmation bias, and role congruity; and (4) these stereotype patterns manifest consistently across different LLM architectures. Through comprehensive quantitative analysis, these findings suggest that stereotype formation in AI systems may arise as an emergent property of multi-agent interactions, rather than merely from training data biases. Our work underscores the need for future research to explore the underlying mechanisms of this phenomenon and develop strategies to mitigate its ethical impacts.

[49] HEAL: A Hypothesis-Based Preference-Aware Analysis Framework

Yifu Huo,Chenglong Wang,Qiren Zhu,Shunjie Xing,Tong Xiao,Chunliang Zhang,Tongran Liu,Jinbo Zhu

Main category: cs.CL

TL;DR: This paper introduces HEAL, a new evaluation framework for preference alignment in LLMs, showing that current methods effectively capture preferences while suppressing negative samples, offering both theoretical and practical contributions to the field.

Details Motivation: Current preference optimization methods like DPO rely on single-response evaluations, ignoring other potential outputs. This paper aims to develop a more comprehensive framework for evaluating preference alignment in real-world applications. Method: The authors propose HEAL, a hypothesis-based preference-aware analysis framework, using ranking accuracy and preference strength correlation metrics, supported by the UniHypoBench benchmark for evaluation. Result: Extensive experiments using HEAL reveal that existing preference learning methods can effectively capture provided preferences and suppress negative samples, offering insights into their intrinsic mechanisms. Conclusion: Preference learning methods can effectively capture preferences while suppressing negative samples, contributing to both theoretical and practical advancements in preference optimization and alignment algorithms. Abstract: Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation for these methods relies on a single response and overlooks other potential outputs, which could also be generated in real-world applications within this hypothetical space. To address this issue, this paper presents a Hypothesis-based PrEference-aware AnaLysis Framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment. To facilitate this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. 
Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research through two significant avenues. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.
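The two metric families named in the abstract, ordinal consistency and continuous alignment, can be illustrated with a pairwise ranking accuracy and a Pearson correlation over a set of hypotheses. The exact definitions used by HEAL are not given in this digest, so the formulas and function names below are assumptions chosen only to illustrate the idea.

```python
import math
from itertools import combinations


def ranking_accuracy(model_scores, reference_scores):
    """Fraction of hypothesis pairs ordered the same way by both scorers."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(reference_scores)), 2)
        if reference_scores[i] != reference_scores[j]  # skip reference ties
    ]
    if not pairs:
        raise ValueError("no comparable pairs")
    agree = sum(
        (model_scores[i] - model_scores[j])
        * (reference_scores[i] - reference_scores[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant scores")
    return cov / math.sqrt(vx * vy)


# Toy hypothesis space: three responses scored by a policy and by a proxy model.
policy_scores = [0.1, 0.5, 0.3]
proxy_scores = [1.0, 3.0, 2.0]
acc = ranking_accuracy(policy_scores, proxy_scores)
corr = pearson(policy_scores, proxy_scores)
```

Ranking accuracy only checks whether the order is preserved, while the correlation also reflects whether the size of the score differences tracks the reference preference strength.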

[50] Dhati+: Fine-tuned Large Language Models for Arabic Subjectivity Evaluation

Slimane Bellaouar,Attia Nehar,Soumia Souffi,Mounia Bouameur

Main category: cs.CL

TL;DR: This paper introduces AraDhati+, a new dataset for Arabic subjectivity analysis, and proposes an effective model using fine-tuned language models and ensemble learning, achieving high accuracy despite limited resources.

Details Motivation: Arabic, a linguistically rich language, lacks large annotated datasets for subjectivity analysis, limiting the effectiveness of tools for this purpose. Method: Development of the AraDhati+ dataset and fine-tuning state-of-the-art Arabic language models (XLM-RoBERTa, AraBERT, ArabianGPT) combined with an ensemble decision approach. Result: The approach achieved an accuracy of 97.79% for Arabic subjectivity classification. Conclusion: The proposed approach effectively addresses the challenges of subjectivity classification in Arabic, demonstrating high accuracy and potential for improving Arabic language processing. Abstract: Despite its significance, Arabic, a linguistically rich and morphologically complex language, faces the challenge of being under-resourced. The scarcity of large annotated datasets hampers the development of accurate tools for subjectivity analysis in Arabic. Recent advances in deep learning and Transformers have proven highly effective for text classification in English and French. This paper proposes a new approach for subjectivity assessment in Arabic textual data. To address the dearth of specialized annotated datasets, we developed a comprehensive dataset, AraDhati+, by leveraging existing Arabic datasets and collections (ASTD, LABR, HARD, and SANAD). Subsequently, we fine-tuned state-of-the-art Arabic language models (XLM-RoBERTa, AraBERT, and ArabianGPT) on AraDhati+ for effective subjectivity classification. Furthermore, we experimented with an ensemble decision approach to harness the strengths of individual models. Our approach achieves a remarkable accuracy of 97.79% for Arabic subjectivity classification. Results demonstrate the effectiveness of the proposed approach in addressing the challenges posed by limited resources in Arabic language processing.

[51] Diffusion Language Models Know the Answer Before Decoding

Pengxiang Li,Yefan Zhou,Dilxat Muhtar,Lu Yin,Shilin Yan,Li Shen,Yi Liang,Soroush Vosoughi,Shiwei Liu

Main category: cs.CL

TL;DR: This paper introduces Prophet, a training-free decoding method for DLMs that accelerates inference by leveraging early answer convergence, significantly reducing decoding steps while maintaining output quality.

Details Motivation: DLMs are slower in inference compared to autoregressive models due to bidirectional attention and multiple refinement steps. This work aims to exploit early convergence properties to accelerate inference. Method: Prophet dynamically decides to stop refinement early based on the confidence gap between top-2 prediction candidates, allowing for early commit decoding without additional training. Result: Prophet reduced decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while preserving high generation quality on tasks like GSM8K and MMLU. Conclusion: Prophet enables faster decoding for Diffusion Language Models (DLMs) by leveraging early answer convergence, reducing decoding steps by up to 3.4x while maintaining generation quality. Abstract: Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. 
It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
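The stopping criterion described above can be sketched as a check on the gap between the two most probable tokens at each still-masked position. How Prophet aggregates gaps across positions and which threshold it uses are not stated in the abstract, so the mean aggregation, the `threshold` parameter, and the function names below are assumptions for illustration rather than the released implementation.

```python
def top2_gap(probs):
    """Difference between the largest and second-largest probability."""
    top = sorted(probs, reverse=True)[:2]
    return top[0] - top[1] if len(top) == 2 else top[0]


def should_commit(position_probs, threshold):
    """Decide whether to decode all remaining tokens in one step.

    position_probs: one probability distribution per still-masked position.
    threshold: assumed hyperparameter; larger values commit later.
    """
    if not position_probs:
        return True  # nothing left to refine
    gaps = [top2_gap(p) for p in position_probs]
    return sum(gaps) / len(gaps) >= threshold


confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.35, 0.25]]
commit_now = should_commit(confident, threshold=0.5)
keep_refining = not should_commit(uncertain, threshold=0.5)
```

A decoding loop would call such a check after each refinement step and, once it returns true, take the arg-max token at every remaining position instead of continuing to remask.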

[52] AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

Lisa Alazraki,Lihu Chen,Ana Brassard,Joe Stacey,Hossein A. Rahmani,Marek Rei

Main category: cs.CL

TL;DR: The paper introduces AgentCoMa, a benchmark combining commonsense and math reasoning, revealing a significant performance drop in LLMs when both reasoning types are required together, unlike humans who maintain high accuracy.

Details Motivation: Current compositional benchmarks for LLMs tend to focus on either commonsense or mathematical reasoning, but real-world tasks often require a combination of both. This work aims to address this gap by introducing a benchmark that tests both reasoning types together. Method: The authors introduced a new benchmark called Agentic Commonsense and Math (AgentCoMa), which combines both commonsense and math reasoning steps. They tested 61 LLMs across different sizes, model families, and training strategies. They also conducted interpretability studies involving neuron patterns, attention maps, and membership inference. Result: LLMs showed a ~30% average drop in accuracy when solving combined commonsense and math reasoning tasks compared to individual steps. This performance gap is larger than what is observed in benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators performed consistently well on both types of tasks. Conclusion: The study concludes that while LLMs perform well on isolated commonsense or mathematical reasoning tasks, their performance significantly drops when both reasoning types are combined, highlighting a substantial brittleness in mixed-type compositional reasoning. Abstract: Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. 
We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by ~30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.

[53] MathBuddy: A Multimodal System for Affective Math Tutoring

Debanjana Kar,Leopold Böss,Dacia Braca,Sebastian Maximilian Dennerlein,Nina Christine Hubig,Philipp Wintersberger,Yufang Hou

Main category: cs.CL

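TL;DR: MathBuddy is an emotionally aware LLM-powered math tutor that recognizes student emotions from conversational text and facial expressions and uses them to make tutoring interactions more empathetic; the study reports a marked improvement in pedagogical quality.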

Details Motivation: Current state-of-the-art learning models do not take the student's affective state into account, although multiple studies in educational psychology indicate that positive or negative emotional states affect a student's ability to learn. Method: The authors build MathBuddy, an emotionally aware LLM-powered math tutor that captures student emotions from two modalities, conversational text and facial expressions, and maps them to relevant pedagogical strategies to generate emotionally aware responses. Result: Evaluation with automatic metrics across eight pedagogical dimensions and with user studies shows a 23-point gain in win rate and a 3-point overall gain in DAMR scores. Conclusion: Modeling students' emotions can markedly improve the pedagogical abilities of LLM-based tutors. Abstract: The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student's affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student's learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student's emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student's emotions are captured from the conversational text as well as from their facial expressions. The student's emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally-aware response. We have effectively evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a massive 23 point performance gain using the win rate and a 3 point gain at an overall level using DAMR scores which strongly supports our hypothesis of improving LLM-based tutor's pedagogical abilities by modeling students' emotions.

[54] ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning

Yiming Du,Yifan Xiang,Bin Liang,Dahua Lin,Kam-Fai Wong,Fei Tan

Main category: cs.CL

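TL;DR: This paper proposes ReSURE, which dynamically adjusts the weight of unreliable supervision during training to address error propagation caused by low-quality data in multi-turn dialogue systems.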

Details Motivation: When fine-tuning multi-turn dialogue systems, low-quality data degrades performance, and existing static prefiltering methods fail to mitigate turn-level error propagation during training. Method: ReSURE estimates per-turn loss distributions using Welford's online statistics and dynamically reweights sample losses to reduce the influence of unreliable supervision. Result: Experiments show improved stability and response quality on both single-source and mixed-quality datasets, with positive Spearman correlations (0.21 to 1.0) between response scores and number of samples. Conclusion: By dynamically down-weighting unreliable supervision, ReSURE mitigates the propagation of supervision errors in multi-turn dialogue systems, with gains in stability and response quality across datasets. Abstract: Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose ReSURE (Regularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford's online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively. Code is publicly available at https://github.com/Elvin-Yiming-Du/ReSURE_Multi_Turn_Training.
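Welford's algorithm maintains a running mean and variance in a single pass, which is what makes on-the-fly reweighting possible without storing past losses. The sketch below implements the standard Welford update; the down-weighting rule in `loss_weight` (full weight up to `k` standard deviations above the mean, exponential decay beyond) is a hypothetical choice for illustration and is not taken from the ReSURE paper or repository.

```python
import math


class Welford:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; zero until two values have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def loss_weight(loss, stats, k=1.0):
    """Hypothetical rule: down-weight losses far above the running mean."""
    if stats.n < 2 or stats.std == 0.0:
        return 1.0
    z = (loss - stats.mean) / stats.std
    return 1.0 if z <= k else math.exp(-(z - k))


# One tracker per dialogue turn index would give per-turn statistics.
turn_stats = Welford()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    turn_stats.update(value)

typical = loss_weight(5.0, turn_stats)   # near the mean: full weight
outlier = loss_weight(20.0, turn_stats)  # far above the mean: reduced weight
```

In a training loop, each turn's loss would be multiplied by its weight before backpropagation, then fed into the tracker so the statistics follow the model as it improves.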

Boheng Mao

Main category: cs.CL

TL;DR: This paper proposes Selective Retrieval-Augmentation (SRA) to address the long-tail label distribution problem in legal text classification, showing improved performance on benchmark datasets without model changes or external data.

Details Motivation: Benchmark datasets for legal text classification often exhibit long-tail label distributions, leading to poor model performance on rare classes. Method: SRA augments low-frequency label samples in the training set via retrieval from training data, avoiding noise introduction and information leakage. Result: SRA outperforms all current LexGLUE baselines on LEDGAR and UNFAIR-ToS datasets, achieving higher micro-F1 and macro-F1 scores. Conclusion: Selective Retrieval-Augmentation (SRA) effectively improves long-tail legal text classification performance without modifying model architecture or requiring external data. Abstract: Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper proposes Selective Retrieval-Augmentation (SRA) as a solution to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data to ensure there is no potential information leakage, removing the need for external corpora simultaneously. The proposed SRA method is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). The results indicate that SRA attains higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines across both datasets, illustrating consistent improvements in long-tail legal text classification. The code repository is available at: https://github.com/Boheng-Mao/sra-legal
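The selective step described above can be sketched as: count label frequencies, and only for samples whose label is rare, retrieve the most similar other training example and append it to the input. The abstract does not specify the retriever, the rarity threshold, or how retrieved text is combined with the input, so the token-overlap similarity, the `max_count` cutoff, and the `[SEP]` concatenation below are all assumptions.

```python
from collections import Counter


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def selective_augment(train, max_count, sep=" [SEP] "):
    """Augment only samples whose label occurs at most max_count times.

    train: list of (text, label) pairs; retrieval uses the training set only.
    """
    counts = Counter(label for _, label in train)
    augmented = []
    for i, (text, label) in enumerate(train):
        candidates = [
            (jaccard(text, other), -j)
            for j, (other, _) in enumerate(train)
            if j != i
        ]
        if counts[label] > max_count or not candidates:
            augmented.append((text, label))  # frequent label: left untouched
            continue
        _, neg_index = max(candidates)  # ties resolve to the earliest example
        augmented.append((text + sep + train[-neg_index][0], label))
    return augmented


train = [
    ("the tenant pays rent monthly", "payment"),
    ("rent is paid by the tenant", "payment"),
    ("the law of texas governs", "law"),
]
result = selective_augment(train, max_count=1)
```

Because frequent labels pass through unchanged, the extra context is added only where the model has few examples, which matches the stated goal of avoiding noise for well-represented classes.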

[56] DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Liana Patel,Negar Arabzadeh,Harshit Gupta,Ankita Sundar,Ion Stoica,Matei Zaharia,Carlos Guestrin

Main category: cs.CL

TL;DR: This paper presents DeepScholar-bench, a live benchmark and holistic evaluation framework for assessing generative research synthesis.

Details Motivation: Existing question-answering benchmarks and expert-curated datasets fail to capture the complexity and evolving nature of generative research synthesis. Method: Queries are drawn from recent, high-quality ArXiv papers, focusing on a real research synthesis task: generating the related work section of a paper. Result: DeepScholar-base establishes a strong baseline, with performance competitive with or higher than each other method, yet no system exceeds a score of 19% across all metrics. Conclusion: DeepScholar-bench underscores the difficulty of generative research synthesis and its importance for progress towards AI systems capable of it. Abstract: The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions, knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AI's, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining competitive or higher performance than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. 
These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.

[57] Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Sheng Liu,Qiang Sheng,Danding Wang,Yang Li,Guang Yang,Juan Cao

Main category: cs.CL

TL;DR: IMAGINE synthesizes jailbreak-like instructions to fill the distributional gap, effectively improving LLM safety.

Details Motivation: Existing LLMs remain vulnerable to out-of-distribution malicious instructions, which highlights a distributional mismatch between training data and real-world attacks and forces developers into reactive patching cycles. Method: IMAGINE uses embedding space distribution analysis to generate jailbreak-like instructions, and an iterative optimization process that dynamically evolves the text generation distribution, thereby broadening the coverage of safety alignment data. Result: With a safety alignment corpus enhanced through IMAGINE, the framework significantly reduces attack success rate across multiple models. Conclusion: IMAGINE significantly reduces attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising model utility. Abstract: Despite advances in improving large language models (LLMs) to refuse to answer malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks where attackers generate instructions with distributions differing from safety alignment corpora. New attacks expose LLMs' inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.

[58] AraHealthQA 2025 Shared Task Description Paper

Hassan Alhuzali,Farah Shamout,Muhammad Abdul-Mageed,Chaimae Abouzahir,Mouath Abu-Daoud,Ashwag Alasmari,Walid Al-Eisawi,Renad Al-Monef,Ali Alqahtani,Lama Ayash,Nizar Habash,Leen Kharouf

Main category: cs.CL

TL;DR: The paper introduces AraHealthQA 2025, a shared task aimed at improving Arabic health question answering through two tracks: MentalQA for mental health questions and MedArabiQ for broader medical domains.

Details Motivation: The motivation is to address the paucity of high-quality Arabic medical QA resources. Method: The paper outlines the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarizes the overall outcomes. Result: The result is the introduction of AraHealthQA 2025, a comprehensive Arabic health question answering shared task with two complementary tracks: MentalQA and MedArabiQ. Conclusion: The paper concludes with reflections on the performance trends observed and prospects for future iterations in Arabic health QA. Abstract: We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with EMNLP 2025). This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.

[59] 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

Chengzu Li,Wenshan Wu,Huanyu Zhang,Qingtao Li,Zeyu Gao,Yan Xia,José Hernández-Orallo,Ivan Vulić,Furu Wei

Main category: cs.CL

TL;DR: This paper proposes a framework for evaluating the spatial reasoning abilities of multimodal large language models and introduces a high-quality benchmark called 11Plus-Bench.

Details Motivation: Spatial reasoning and perception are closely linked in human cognition, yet how multimodal large language models perform in this respect has received little study. Method: The authors develop a systematic evaluation framework and build a benchmark called 11Plus-Bench, which includes detailed expert annotations for analyzing model behavior. Result: Experiments show that current multimodal large language models exhibit early signs of spatial cognition. Although a large gap to human performance remains, their cognitive profiles resemble those of humans, while instance-level performance is still largely random. Conclusion: The paper reveals both the potential and the limitations of current multimodal large language models in spatial reasoning and offers actionable insights for future model design. Abstract: In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning capabilities and provide actionable insights for advancing model design.

[60] MovieCORE: COgnitive REasoning in Movies

Gueter Josmy Faure,Min-Hung Chen,Jia-Fong Yeh,Ying Cheng,Hung-Ting Su,Yung-Hao Tang,Shang-Hong Lai,Winston H. Hsu

Main category: cs.CL

TL;DR: MovieCORE is a new video question answering dataset that emphasizes deeper cognitive understanding of movies, using an innovative approach with large language models and an enhancement module to improve reasoning capabilities.

Details Motivation: The motivation is to move beyond surface-level comprehension in existing VQA datasets and focus on deeper cognitive understanding of movie content through System-2 thinking. Method: The authors introduce MovieCORE, a new VQA dataset, and use an agentic brainstorming approach with LLMs to generate high-quality question-answer pairs. They also propose the Agentic Choice Enhancement (ACE) module to improve model reasoning. Result: The authors developed cognitive tests to assess dataset quality and showed that their ACE module improves model reasoning capabilities post-training by up to 25%. Conclusion: The paper concludes that MovieCORE advances movie understanding in AI systems and provides insights into the capabilities and limitations of current VQA models in handling nuanced questions. Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. 
Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

cs.CV [Back]

[61] Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration

Jookyung Song,Mookyoung Kang,Nojun Kwak

Main category: cs.CV

TL;DR: A real-time generative drawing system that combines structural and semantic analysis for collaborative visual creation, allowing users of all skill levels to co-create with AI.

Details Motivation: The motivation is to develop a generative drawing system that interprets and integrates both formal intent (structural, compositional, and stylistic attributes) and contextual intent (semantic and thematic meaning) into a unified transformation process, differing from conventional text-prompt-based systems. Method: The method involves a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis, utilizing a touchscreen-based interface and distributed inference architecture. Result: The result is a low-latency, two-stage transformation system that supports multi-user collaboration on shared canvases, integrating both high-level semantic cues and ground-level geometric features. Conclusion: The paper concludes that their system enables synchronous, co-authored visual creation among participants regardless of artistic expertise, redefining human-AI interaction through co-creation and mutual enhancement. Abstract: This paper presents a real-time generative drawing system that interprets and integrates both formal intent - the structural, compositional, and stylistic attributes of a sketch - and contextual intent - the semantic and thematic meaning inferred from its visual content - into a unified transformation process. Unlike conventional text-prompt-based generative systems, which primarily capture high-level contextual descriptions, our approach simultaneously analyzes ground-level intuitive geometric features such as line trajectories, proportions, and spatial arrangement, and high-level semantic cues extracted via vision-language models. These dual intent signals are jointly conditioned in a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis. 
Implemented with a touchscreen-based interface and distributed inference architecture, the system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases. The resulting platform enables participants, regardless of artistic expertise, to engage in synchronous, co-authored visual creation, redefining human-AI interaction as a process of co-creation and mutual enhancement.

[62] TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Chenghao Liu,Jiachen Zhang,Chengxuan Li,Zhimu Zhou,Shixin Wu,Songfang Huang,Huiling Duan

Main category: cs.CV

TL;DR: Temporal Token Fusion (TTF) is a training-free method that improves the inference quality of vision-language-action models by combining historical and current visual representations; experiments show notable improvements across multiple tasks.

Details Motivation: Vision-language-action models ignore the temporal information inherent in manipulation tasks when processing visual inputs, which leaves them vulnerable to visual noise and overlooks the strong coherence between consecutive frames in manipulation sequences. Method: TTF uses dual-dimension detection, combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, to perform selective temporal token fusion, with a hard fusion strategy and keyframe anchoring to prevent error accumulation. Result: Experiments show consistent improvements on LIBERO, SimplerEnv, and real robot tasks: an average gain of 4.0 percentage points on LIBERO, a 4.8% relative improvement on SimplerEnv, and an 8.7% relative improvement on real robot tasks. Conclusion: Temporal Token Fusion (TTF) is an effective training-free method that improves the inference quality of vision-language-action models, and it is model-agnostic, applying across different architectures. Abstract: Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
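The pixel-difference half of the selective fusion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attention-based relevance check is omitted, and the function names, patch size, threshold, and keyframe flag are all hypothetical choices made for illustration. Frames are assumed to be grayscale 2D lists whose dimensions are divisible by the patch size, with one token per patch in row-major order.

```python
def patch_means(frame, patch):
    """Mean grayscale value of each patch x patch block, in row-major order."""
    height, width = len(frame), len(frame[0])
    means = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            block = [frame[i][j]
                     for i in range(r, r + patch)
                     for j in range(c, c + patch)]
            means.append(sum(block) / len(block))
    return means


def fuse_tokens(prev_tokens, curr_tokens, prev_frame, curr_frame,
                patch=2, threshold=5.0, is_keyframe=False):
    """Hard fusion: reuse the previous token wherever the patch barely changed.

    On a keyframe (or when there is no history) every current token is kept,
    which bounds how long a stale token can survive. The threshold is a
    hypothetical value, not one taken from the paper.
    """
    if is_keyframe or prev_tokens is None:
        return list(curr_tokens)
    prev_means = patch_means(prev_frame, patch)
    curr_means = patch_means(curr_frame, patch)
    fused = []
    for p_tok, c_tok, pm, cm in zip(prev_tokens, curr_tokens,
                                    prev_means, curr_means):
        changed = abs(cm - pm) > threshold
        fused.append(c_tok if changed else p_tok)
    return fused
```

In this sketch "hard" means each token is taken wholesale from either the previous or the current frame, with no blending; a keyframe simply resets the history.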

[63] Seeing Like a Designer Without One: A Study on Unsupervised Slide Quality Assessment via Designer Cue Augmentation

Tai Inui,Steven Oh,Magdeline Kuan

Main category: cs.CV

TL;DR: A new unsupervised pipeline effectively evaluates presentation slide quality by combining visual-design metrics and AI embeddings, showing strong correlation with human ratings and outperforming existing models.

Details Motivation: To create an objective and scalable method for evaluating the visual quality of presentation slides in real time, addressing the limitations of existing vision-language models. Method: An unsupervised slide-quality assessment pipeline was developed using seven visual-design metrics and CLIP-ViT embeddings, evaluated through Isolation Forest-based anomaly scoring. Result: The method achieved Pearson correlations up to 0.83 with human visual-quality ratings, significantly outperforming leading vision-language models. Conclusion: The study concludes that combining low-level design cues with multimodal embeddings effectively approximates audience perceptions of slide quality, providing a scalable and objective method for real-time feedback. Abstract: We present an unsupervised slide-quality assessment pipeline that combines seven expert-inspired visual-design metrics (whitespace, colorfulness, edge density, brightness contrast, text density, color harmony, layout balance) with CLIP-ViT embeddings, using Isolation Forest-based anomaly scoring to evaluate presentation slides. Trained on 12k professional lecture slides and evaluated on six academic talks (115 slides), our method achieved Pearson correlations up to 0.83 with human visual-quality ratings, 1.79x to 3.23x stronger than scores from leading vision-language models (ChatGPT o4-mini-high, ChatGPT o3, Claude Sonnet 4, Gemini 2.5 Pro). We demonstrate convergent validity with visual ratings, discriminant validity against speaker-delivery scores, and exploratory alignment with overall impressions. Our results show that augmenting low-level design cues with multimodal embeddings closely approximates audience perceptions of slide quality, enabling scalable, objective feedback in real time.
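The headline number in this entry is a Pearson correlation between model-derived slide scores and human ratings. As a reminder of what that statistic measures, here is a minimal standard-library sketch; the function name and the toy inputs are illustrative assumptions and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy usage: hypothetical per-slide quality scores vs. hypothetical human ratings.
model_scores = [0.2, 0.5, 0.7, 0.9]
human_ratings = [2.0, 3.0, 3.5, 4.5]
r = pearson(model_scores, human_ratings)
```

A value near 1 means the scores rank and space the slides much as the human raters did; it says nothing about whether the two scales agree in absolute terms.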

[64] Efficient Model-Based Purification Against Adversarial Attacks for LiDAR Segmentation

Alexandros Gkillas,Ioulia Kapsali,Nikos Piperigkos,Aris S. Lalos

Main category: cs.CV

TL;DR: This paper introduces an efficient model-based purification framework for adversarial defense in 2D range-view LiDAR segmentation, offering strong resilience and minimal computational overhead while demonstrating superior performance and real-world applicability.

Details Motivation: Modern LiDAR segmentation networks are vulnerable to adversarial attacks, and existing defenses are computationally intensive and not optimized for 2D range-view representations, which are widely used in state-of-the-art pipelines. Method: The paper proposes a direct attack formulation in the range-view domain and develops an explainable purification network based on a mathematically justified optimization problem. Result: The method achieves competitive performance on open benchmarks, consistently outperforming generative and adversarial training baselines. Conclusion: The paper concludes that the proposed efficient model-based purification framework provides strong adversarial resilience with minimal computational overhead, making it suitable for real-world deployment in autonomous driving scenarios. Abstract: LiDAR-based segmentation is essential for reliable perception in autonomous vehicles, yet modern segmentation networks are highly susceptible to adversarial attacks that can compromise safety. Most existing defenses are designed for networks operating directly on raw 3D point clouds and rely on large, computationally intensive generative models. However, many state-of-the-art LiDAR segmentation pipelines operate on more efficient 2D range-view representations. Despite their widespread adoption, dedicated lightweight adversarial defenses for this domain remain largely unexplored. We introduce an efficient model-based purification framework tailored for adversarial defense in 2D range-view LiDAR segmentation. We propose a direct attack formulation in the range-view domain and develop an explainable purification network based on a mathematically justified optimization problem, achieving strong adversarial resilience with minimal computational overhead. Our method achieves competitive performance on open benchmarks, consistently outperforming generative and adversarial training baselines. 
More importantly, real-world deployment on a demo vehicle demonstrates the framework's ability to deliver accurate operation in practical autonomous driving scenarios.

[65] Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

Ranjan Sapkota,Manoj Karkee

Main category: cs.CV

TL;DR: This paper reviews how Large Vision-Language Models (LVLMs) are transforming object detection by integrating language and vision, offering improved adaptability and contextual understanding over traditional methods.

Details Motivation: LVLMs have revolutionized object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. Method: A structured three-step research review process was used to analyze the state-of-the-art in LVLMs for object detection. Result: The review demonstrates LVLMs' effectiveness in diverse scenarios, compares their performance to traditional systems, and identifies limitations and solutions for future advancement. Conclusion: LVLMs are expected to meet or surpass traditional methods in object detection and will likely have a transformative impact on robotic applications. Abstract: The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches used in integration of visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated object detection and localization strategies. 
This review presents comprehensive visualizations demonstrating LVLMs' effectiveness in diverse scenarios including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review, it is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. The review also identifies a few major limitations of current LVLM models, proposes solutions to address those challenges, and presents a clear roadmap for future advancement in this field. We conclude, based on this study, that recent advancements in LVLMs have made and will continue to make a transformative impact on object detection and robotic applications.

[66] Large VLM-based Stylized Sports Captioning

Sauptik Dhar,Nicholas Buoncristiani,Joe Anakata,Haoyu Zhang,Michelle Munson

Main category: cs.CV

TL;DR: This paper discusses the limitations of large language models in the sports domain and proposes a two-level fine-tuned vision-language model pipeline for generating more natural descriptions of sports games.

Details Motivation: Existing large language models lack sufficient sports-domain jargon when generating natural language descriptions of game play, which limits their use in sports. Method: The paper proposes a two-level fine-tuned large vision-language model (LVLM) pipeline to improve the quality and naturalness of generated sports descriptions. Result: The method improves F1 by more than 8-10% and BERT score by more than 2-10%, with a small runtime memory footprint and fast execution time. Conclusion: The proposed two-level fine-tuned LVLM pipeline was applied successfully to live professional sports journalism during Super Bowl LIX, quickly generating highly accurate and stylized captions. Abstract: The advent of large (visual) language models (LLM / LVLM) has led to a deluge of automated human-like systems in several domains including social media content generation, search and recommendation, healthcare prognosis, AI assistants for cognitive tasks etc. Although these systems have been successfully integrated in production, very little focus has been placed on sports, particularly accurate identification and natural language description of the game play. Most existing LLM/LVLMs can explain generic sports activities, but lack sufficient domain-centric sports jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLM/LVLMs for generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address that. The proposed pipeline yields an improvement > 8-10% in the F1, and > 2-10% in BERT score compared to alternative approaches. In addition, it has a small runtime memory footprint and fast execution time. During Super Bowl LIX the pipeline proved its practical application for live professional sports journalism, generating highly accurate and stylized captions at the rate of 6 images per 3-5 seconds for over 1000 images during the game play.

[67] DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models

Abu Sufian,Anirudha Ghosh,Debaditya Barman,Marco Leo,Cosimo Distante

Main category: cs.CV

TL;DR: The DemoBias study finds that large vision-language models (LVLMs) show significant demographic biases in biometric face recognition tasks, particularly for certain ethnic groups: PaliGemma and LLaVA exhibit higher disparities, while BLIP-2 is comparatively consistent.

Details Motivation: Although large vision-language models perform well across many downstream tasks, demographic bias remains a critical problem in biometric face recognition. This bias can affect how fairly a model treats different ethnicity, gender, and age groups. Method: The authors fine-tuned three pre-trained large vision-language models, LLaVA, BLIP-2, and PaliGemma, and evaluated them on a generated demographic-balanced dataset. They used metrics such as group-specific BERTScores and the Fairness Discrepancy Rate to quantify performance disparities. Result: Experiments show that current large vision-language models perform unevenly across demographic groups: PaliGemma and LLaVA exhibit higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, while BLIP-2 shows comparatively consistent performance. Conclusion: The DemoBias study concludes that current large vision-language models show significant demographic biases in biometric face recognition tasks, with PaliGemma and LLaVA exhibiting higher disparities for certain ethnic groups and BLIP-2 being comparatively consistent. Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups, considering ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs: LLaVA, BLIP-2, and PaliGemma on our own generated demographic-balanced dataset. We utilize several evaluation metrics, like group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 demonstrated comparatively consistent performance. Repository: https://github.com/Sufianlab/DemoBias.

[68] Geo2Vec: Shape- and Distance-Aware Neural Representation of Geospatial Entities

Chen Chu,Cyrus Shahabi

Main category: cs.CV

TL;DR: Geo2Vec is a novel method for spatial representation learning that improves upon existing approaches by adaptively sampling points and encoding their signed distances without decomposition, resulting in efficient and accurate representations for all geo-entity types.

Details Motivation: Existing methods for spatial representation learning either target a single geo-entity type or decompose entities into simpler components, introducing high computational cost and relying on uniform, non-adaptive sampling that blurs fine-grained features like edges and boundaries. Geo2Vec was developed to address these limitations. Method: Geo2Vec uses a signed distance field (SDF) approach, adaptively sampling points and encoding their signed distances to capture geometry without decomposition. A neural network trained to approximate the SDF generates representations. A rotation-invariant positional encoding is also introduced to model high-frequency spatial variations and construct a structured and robust embedding space. Result: Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Conclusion: Geo2Vec is a novel spatial representation learning method that overcomes limitations of existing approaches by operating directly in the original space, adaptively sampling points, and encoding their signed distances to capture geometry without decomposition, resulting in compact, geometry-aware, and unified representations for all geo-entity types. Abstract: Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. 
To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: https://github.com/chuchen2017/GeoNeuralRepresentation.
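The sign convention mentioned in this abstract (positive outside, negative inside) can be made concrete with a small sketch of an exact signed distance to a simple polygon. This only illustrates the quantity a network would be trained to approximate; it is not Geo2Vec itself, and the helper names and the unit-square example are assumptions made for illustration.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment's line and clamp to the segment's extent.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def _inside(p, polygon):
    """Ray-casting point-in-polygon test for a simple polygon."""
    px, py = p
    inside = False
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % n]
        if (ay > py) != (by > py):  # edge straddles the horizontal ray
            x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
            if px < x_cross:
                inside = not inside
    return inside


def signed_distance(p, polygon):
    """Distance to the polygon boundary: positive outside, negative inside."""
    n = len(polygon)
    dist = min(_point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
               for i in range(n))
    return -dist if _inside(p, polygon) else dist


unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
center_sd = signed_distance((0.5, 0.5), unit_square)   # inside, so negative
outside_sd = signed_distance((2.0, 0.5), unit_square)  # outside, so positive
```

For points and polylines there is no interior, so the same function without the sign flip gives an unsigned distance; how the paper treats those entity types in detail is not stated in the abstract.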

[69] Advancements in Crop Analysis through Deep Learning and Explainable AI

Hamza Khan

Main category: cs.CV

TL;DR: This study develops a deep learning-based automated system that classifies rice grain varieties and diagnoses rice leaf diseases efficiently and accurately, improving the reliability and efficiency of agricultural quality inspection.

Details Motivation: Ensuring consumer satisfaction and strengthening national reputations requires monitoring rice crops and grain quality. Manual inspection is labour intensive, time consuming, and error prone, so automated solutions are needed for quality control and yield improvement. Method: The study proposes an automated approach that uses a convolutional neural network (CNN) to classify five rice grain varieties, and develops an accurate diagnostic method for rice leaf diseases that combines explainable artificial intelligence (XAI) with deep learning models including CNN, VGG16, ResNet50, and MobileNetV2. Result: Experiments show high classification accuracy with few misclassifications, confirming that the model is effective at distinguishing rice varieties. Conclusion: Deep learning shows strong potential in agricultural applications, paving the way for reliable, interpretable systems for automated crop quality inspection and disease diagnosis, ultimately benefiting farmers, consumers, and the agricultural economy. Abstract: Rice is a staple food of global importance in terms of trade, nutrition, and economic growth. Asian nations such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are leading producers of both long and short grain varieties, including basmati, jasmine, arborio, ipsala, and kainat saila. To ensure consumer satisfaction and strengthen national reputations, monitoring rice crops and grain quality is essential. Manual inspection, however, is labour intensive, time consuming and error prone, highlighting the need for automated solutions for quality control and yield improvement. This study proposes an automated approach to classify five rice grain varieties using Convolutional Neural Networks (CNN). A publicly available dataset of 75000 images was used for training and testing. Model evaluation employed accuracy, recall, precision, F1-score, ROC curves, and confusion matrices. Results demonstrated high classification accuracy with minimal misclassifications, confirming the model's effectiveness in distinguishing rice varieties. In addition, an accurate diagnostic method for rice leaf diseases such as Brown Spot, Blast, Bacterial Blight, and Tungro was developed. The framework combined explainable artificial intelligence (XAI) with deep learning models including CNN, VGG16, ResNet50, and MobileNetV2. Explainability techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) revealed how specific grain and leaf features influenced predictions, enhancing model transparency and reliability. 
The findings demonstrate the strong potential of deep learning in agricultural applications, paving the way for robust, interpretable systems that can support automated crop quality inspection and disease diagnosis, ultimately benefiting farmers, consumers, and the agricultural economy.

[70] Sistema de Reconocimiento Facial Federado en Conjuntos Abiertos basado en OpenMax

Ander Galván,Marivi Higuero,Jorge Sasiain,Eduardo Jacob

Main category: cs.CV

TL;DR: This paper proposes a facial recognition system that combines federated learning with the OpenMax algorithm to handle privacy and unknown-individual recognition in open-set scenarios.

Details Motivation: Facial recognition achieves high accuracy in specific scenarios but faces privacy and identity management challenges in open-set scenarios, especially when unknown individuals appear in the system. Method: The OpenMax algorithm is integrated into a federated learning framework, using mean activation vectors and local distance measures for training and evaluation. Result: Experimental results validate the effectiveness of the approach, showing that it can improve privacy protection and robustness in distributed environments. Conclusion: The paper proposes an OpenMax-based federated learning framework for open-set facial recognition that can identify both known and unknown individuals while preserving privacy. Abstract: Facial recognition powered by Artificial Intelligence has achieved high accuracy in specific scenarios and applications. Nevertheless, it faces significant challenges regarding privacy and identity management, particularly when unknown individuals appear in the operational context. This paper presents the design, implementation, and evaluation of a facial recognition system within a federated learning framework tailored to open-set scenarios. The proposed approach integrates the OpenMax algorithm into federated learning, leveraging the exchange of mean activation vectors and local distance measures to reliably distinguish between known and unknown subjects. Experimental results validate the effectiveness of the proposed solution, demonstrating its potential for enhancing privacy-aware and robust facial recognition in distributed environments. -- El reconocimiento facial impulsado por Inteligencia Artificial ha demostrado una alta precisión en algunos escenarios y aplicaciones. Sin embargo, presenta desafíos relacionados con la privacidad y la identificación de personas, especialmente considerando que pueden aparecer sujetos desconocidos para el sistema que lo implementa. En este trabajo, se propone el diseño, implementación y evaluación de un sistema de reconocimiento facial en un escenario de aprendizaje federado, orientado a conjuntos abiertos. Concretamente, se diseña una solución basada en el algoritmo OpenMax para escenarios de aprendizaje federado. La propuesta emplea el intercambio de los vectores de activación promedio y distancias locales para identificar de manera eficaz tanto personas conocidas como desconocidas. 
Los experimentos realizados demuestran la implementación efectiva de la solución propuesta.
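The idea of exchanging mean activation vectors and using distances to separate known from unknown subjects can be sketched in a heavily simplified form. This is not OpenMax proper: the original algorithm fits Weibull distributions to distance tails and recalibrates class scores, whereas the sketch below substitutes a fixed distance threshold. The function names, the count-weighted aggregation, and the threshold are all assumptions made for illustration.

```python
import math


def aggregate_means(client_stats):
    """Combine per-client class means into global means, weighted by sample count.

    client_stats: list of dicts mapping label -> (mean_vector, sample_count).
    Only these summaries are shared, never the underlying face images.
    """
    sums, counts = {}, {}
    for stats in client_stats:
        for label, (mean, count) in stats.items():
            acc = sums.setdefault(label, [0.0] * len(mean))
            for i, value in enumerate(mean):
                acc[i] += value * count
            counts[label] = counts.get(label, 0) + count
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def classify_open_set(activation, class_means, threshold):
    """Return the nearest known label, or "unknown" if it is too far away.

    A fixed threshold stands in for OpenMax's Weibull-based rejection.
    """
    best_label, best_dist = None, math.inf
    for label, mean in class_means.items():
        dist = math.dist(activation, mean)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else "unknown"


# Toy usage with two hypothetical clients and 2-D activations.
clients = [
    {"alice": ([0.0, 0.0], 1)},
    {"alice": ([2.0, 2.0], 3), "bob": ([10.0, 10.0], 2)},
]
global_means = aggregate_means(clients)
known = classify_open_set([1.4, 1.6], global_means, threshold=1.0)
stranger = classify_open_set([5.0, 5.0], global_means, threshold=1.0)
```

Weighting by sample count makes the aggregated mean equal to the mean over the pooled samples, which is the natural choice when clients hold different amounts of data per identity.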

[71] Automated classification of natural habitats using ground-level imagery

Mahdis Tourian,Sareh Rowlands,Remy Vandaele,Max Fancourt,Rebecca Mein,Hywel T. P. Williams

Main category: cs.CV

TL;DR: This study introduces a deep learning-based methodology for habitat classification using ground-level imagery, demonstrating its potential for large-scale ecological monitoring and offering a user-friendly web application for practical implementation.

Details Motivation: The motivation of the study is to develop an accurate, scalable method for habitat classification that relies solely on ground-level imagery, aiming to improve validation processes and enable broader applications in biodiversity conservation and land-use planning. Method: The study uses a DeepLabV3-ResNet101 classifier, with ground-level photographs categorized into 18 classes defined by the 'Living England' framework. Images were pre-processed with resizing, normalization, and augmentation, while re-sampling balanced classes and enhanced model robustness. Five-fold cross-validation was employed to evaluate the model. Result: The model showed strong performance with a mean F1-score of 0.61 across all 18 habitat classes. Visually distinct habitats, such as Bare Soil, Silt and Peat (BSSP) and Bare Sand (BS), achieved F1-scores above 0.90, while mixed or ambiguous classes scored lower. A web application was also developed to support practical use of the model. Conclusion: The study concludes that the deep learning approach using ground-level imagery can effectively classify habitats, demonstrating potential for ecological monitoring and offering scalable solutions, particularly through citizen-science imagery. Abstract: Accurate classification of terrestrial habitats is critical for biodiversity conservation, ecological monitoring, and land-use planning. Several habitat classification schemes are in use, typically based on analysis of satellite imagery with validation by field ecologists. Here we present a methodology for classification of habitats based solely on ground-level imagery (photographs), offering improved validation and the ability to classify habitats at scale (for example using citizen-science imagery). 
In collaboration with Natural England, a public sector organisation responsible for nature conservation in England, this study develops a classification system that applies deep learning to ground-level habitat photographs, categorising each image into one of 18 classes defined by the 'Living England' framework. Images were pre-processed using resizing, normalisation, and augmentation; re-sampling was used to balance classes in the training data and enhance model robustness. We developed and fine-tuned a DeepLabV3-ResNet101 classifier to assign a habitat class label to each photograph. Using five-fold cross-validation, the model demonstrated strong overall performance across 18 habitat classes, with accuracy and F1-scores varying between classes. Across all folds, the model achieved a mean F1-score of 0.61, with visually distinct habitats such as Bare Soil, Silt and Peat (BSSP) and Bare Sand (BS) reaching values above 0.90, and mixed or ambiguous classes scoring lower. These findings demonstrate the potential of this approach for ecological monitoring. Ground-level imagery is readily obtained, and accurate computational methods for habitat classification based on such data have many potential applications. To support use by practitioners, we also provide a simple web application that classifies uploaded images using our model.
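The mean F1-score quoted above can be illustrated with a short sketch. It assumes the reported figure is an unweighted (macro) average of per-class F1 scores, which the abstract does not state explicitly; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example with two of the class abbreviations named in the abstract.
y_true = ["BS", "BS", "BSSP", "BSSP"]
y_pred = ["BS", "BSSP", "BSSP", "BSSP"]
scores = per_class_f1(y_true, y_pred)
mean_score = macro_f1(y_true, y_pred)
```

Under macro averaging, a few visually distinct classes scoring above 0.90 can coexist with a much lower overall mean, because every class contributes equally regardless of how many images it has.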

[72] MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation

Ming Chen,Liyuan Cui,Wenyuan Zhang,Haoxian Zhang,Yan Zhou,Xiaohan Li,Xiaoqiang Liu,Pengfei Wan

Main category: cs.CV

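TL;DR: This paper proposes an efficient interactive digital-human video generation framework that supports multimodal control and low-latency streaming extrapolation, significantly improving the real-time interactive experience.

Details Motivation: Existing interactive digital-human video generation systems suffer from high latency, heavy computational cost, and limited controllability when handling diverse input signals, so a more efficient, low-latency, and controllable solution is needed. Method: With minimal modifications to a standard large language model (LLM), the framework accepts multimodal condition encodings including audio, pose, and text, and uses a diffusion head for the denoising process. It also introduces a deep compression autoencoder to reduce the long-horizon inference burden of the autoregressive model. Result: Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model tasks show that the method offers low latency, high efficiency, and fine-grained multimodal controllability. Conclusion: The paper proposes an autoregressive video generation framework that enables interactive multimodal control and low-latency streaming extrapolation, addressing the limitations of existing methods in latency, computational cost, and controllability. Abstract: Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64$\times$ reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.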

[73] Deep Data Hiding for ICAO-Compliant Face Images: A Survey

Jefferson David Rodriguez Chivata,Davide Ghiani,Simone Maurizio La Cava,Marco Micheletto,Giulia Orrù,Federico Lama,Gian Luca Marcialis

Main category: cs.CV

TL;DR: This paper examines digital watermarking and steganography as new methods for protecting ICAO-compliant images, in response to security challenges in identity verification.

Details Motivation: ICAO-compliant facial images are increasingly used for identity verification, but their standardization also facilitates harmful practices such as morphing and deepfakes, and traditional countermeasures such as Presentation Attack Detection (PAD) have limitations. Method: The paper comprehensively analyzes state-of-the-art techniques to evaluate the potential and drawbacks of these approaches in applications involving ICAO-compliant images, and their suitability under standard constraints. Result: It provides the first comprehensive analysis, highlights key trade-offs, and offers guidance for secure deployment in real-world identity systems. Conclusion: Digital watermarking and steganography are proposed as complementary solutions that embed tamper-evident signals in ICAO-compliant facial images, enabling persistent verification while preserving compliance. Abstract: ICAO-compliant facial images, initially designed for secure biometric passports, are increasingly becoming central to identity verification in a wide range of application contexts, including border control, digital travel credentials, and financial services. While their standardization enables global interoperability, it also facilitates practices such as morphing and deepfakes, which can be exploited for harmful purposes like identity theft and illegal sharing of identity documents. Traditional countermeasures like Presentation Attack Detection (PAD) are limited to real-time capture and offer no post-capture protection. This survey paper investigates digital watermarking and steganography as complementary solutions that embed tamper-evident signals directly into the image, enabling persistent verification without compromising ICAO compliance. We provide the first comprehensive analysis of state-of-the-art techniques to evaluate the potential and drawbacks of the underlying approaches concerning the applications involving ICAO-compliant images and their suitability under standard constraints. We highlight key trade-offs, offering guidance for secure deployment in real-world identity systems.

[74] PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI

Haoyang Su,Jin-Yi Xiang,Shaohao Rui,Yifan Gao,Xingyu Chen,Tingxuan Yin,Xiaosong Wang,Lian-Ming Wu

Main category: cs.CV

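TL;DR: PRISM is a new method that combines non-contrast cardiac cine magnetic resonance imaging with electronic health records to predict cardiovascular risk more accurately, and it shows superior performance across multiple clinical cohorts.

Details Motivation: Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. Method: PRISM uses a self-supervised framework that extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them with medically informed textual prompts to enable fine-grained risk prediction. Result: PRISM consistently surpasses classical survival prediction models and state-of-the-art deep learning baselines under both internal and external validation. Further clinical findings show that the imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across different cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered: lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as the dominant contributors among clinical and physiological EHR factors. Conclusion: By integrating visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis, PRISM provides valuable insights for cardiac risk prediction and outperforms classical models and state-of-the-art deep learning baselines across four independent clinical cohorts. Abstract: Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.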

[75] EffNetViTLoRA: An Efficient Hybrid Deep Learning Approach for Alzheimer's Disease Diagnosis

Mahdieh Behjat Khatooni,Mohsen Soryani

Main category: cs.CV

TL;DR: EffNetViTLoRA is a new deep learning model for Alzheimer's disease diagnosis that combines a CNN with a ViT and uses LoRA for knowledge transfer, achieving high accuracy and good generalization.

Details Motivation: Alzheimer's disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is crucial for managing it. Mild Cognitive Impairment (MCI) is a precursor stage of AD, and diagnosing it is particularly challenging because of the subtle differences between adjacent diagnostic categories. Method: EffNetViTLoRA combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT), is trained on the whole Alzheimer's Disease Neuroimaging Initiative (ADNI) Magnetic Resonance Imaging (MRI) dataset, and is fine-tuned with Low-Rank Adaptation (LoRA) to improve adaptability and reduce the risk of overfitting. Result: On the full ADNI dataset, the model achieves a classification accuracy of 92.52% and an F1-score of 92.76% across three diagnostic categories (AD, MCI, and CN), indicating strong performance and clinical reliability. Conclusion: EffNetViTLoRA is an efficient model for Alzheimer's disease diagnosis that combines the strengths of CNNs and ViTs and uses LoRA for knowledge transfer, improving generalization and diagnostic accuracy. Abstract: Alzheimer's disease (AD) is one of the most prevalent neurodegenerative disorders worldwide. As it progresses, it leads to the deterioration of cognitive functions. Since AD is irreversible, early diagnosis is crucial for managing its progression. Mild Cognitive Impairment (MCI) represents an intermediate stage between Cognitively Normal (CN) individuals and those with AD, and is considered a transitional phase from normal cognition to Alzheimer's disease. Diagnosing MCI is particularly challenging due to the subtle differences between adjacent diagnostic categories. In this study, we propose EffNetViTLoRA, a generalized end-to-end model for AD diagnosis using the whole Alzheimer's Disease Neuroimaging Initiative (ADNI) Magnetic Resonance Imaging (MRI) dataset. Our model integrates a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to capture both local and global features from MRI images. Unlike previous studies that rely on limited subsets of data, our approach is trained on the full T1-weighted MRI dataset from ADNI, resulting in a more robust and unbiased model. This comprehensive methodology enhances the model's clinical reliability. Furthermore, fine-tuning large pretrained models often yields suboptimal results when source and target dataset domains differ. To address this, we incorporate Low-Rank Adaptation (LoRA) to effectively adapt the pretrained ViT model to our target domain. This method enables efficient knowledge transfer and reduces the risk of overfitting. 
Our model achieves a classification accuracy of 92.52% and an F1-score of 92.76% across three diagnostic categories (AD, MCI, and CN) on the full ADNI dataset.
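The Low-Rank Adaptation step mentioned in the abstract can be sketched in a few lines. This follows the commonly used LoRA formulation, in which a frozen weight matrix W is augmented by a scaled low-rank product B·A; it is not the authors' implementation, and the dimensions, the `alpha` scaling value, and the helper names are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the only trainable matrices in this sketch.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable parameters added by one rank-r adapter."""
    return r * (d_in + d_out)


x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                # rank r = 1
B_zero = [[0.0], [0.0]]         # B initialised to zero: the adapter is a no-op
B_trained = [[1.0], [2.0]]      # hypothetical values after training

y_start = lora_forward(x, W, A, B_zero, alpha=2.0)
y_adapted = lora_forward(x, W, A, B_trained, alpha=2.0)
adapter_size = lora_trainable_params(768, 768, 8)
full_size = 768 * 768
```

With B initialised to zero the adapted layer reproduces the pretrained one exactly, and for a hypothetical 768 x 768 layer a rank-8 adapter trains 12,288 parameters instead of 589,824, which is the usual argument for why this kind of adaptation is less prone to overfitting.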

[76] Concurrent validity of computer-vision artificial intelligence player tracking software using broadcast footage

Zachary L. Crang,Rich D. Johnston,Katie L. Mills,Johsan Billingham,Sam Robertson,Michael H. Cole,Jonathon Weakley,Adam Hewitt,Grant M. Duthie

Main category: cs.CV

TL;DR: This study evaluated the accuracy of commercial computer-vision and AI player tracking software using broadcast footage from a FIFA World Cup match, finding that tactical feeds improve accuracy and both 720p and 1080p resolutions are suitable when appropriate models are used.

Details Motivation: The study aimed to determine the accuracy of commercially available computer-vision and AI player tracking software in measuring player position, speed, and distance using broadcast footage, as well as the impact of camera feed and resolution on accuracy. Method: Three commercial tracking providers using computer-vision and AI were tested for accuracy in measuring player position, speed, and distance using broadcast footage from a match at the 2022 Qatar FIFA World Cup. Tactical, programme, and camera 1 feeds were used. Instantaneous position and speed data were compared against a high-definition multi-camera tracking system (TRACAB Gen 5), and accuracy was assessed using root mean square error (RMSE) and mean bias. Result: Position RMSE ranged from 1.68 to 16.39 m, and speed RMSE ranged from 0.34 to 2.38 m/s. Total match distance mean bias ranged from -1745 m (-21.8%) to 1945 m (24.3%) across providers. Conclusion: Computer-vision and AI player tracking software can offer fair precision in tracking players, with accuracy dependent on camera feed and resolution. Tactical feeds are recommended for better player detection and accuracy, and both 720p and 1080p resolutions are suitable if appropriate models are used. Abstract: This study aimed to: (1) understand whether commercially available computer-vision and artificial intelligence (AI) player tracking software can accurately measure player position, speed and distance using broadcast footage and (2) determine the impact of camera feed and resolution on accuracy. Data were obtained from one match at the 2022 Qatar Federation Internationale de Football Association (FIFA) World Cup. Tactical, programme and camera 1 feeds were used. Three commercial tracking providers that use computer-vision and AI participated. Providers analysed instantaneous position (x, y coordinates) and speed ($\mathrm{m\,s^{-1}}$) of each player. 
Their data were compared with a high-definition multi-camera tracking system (TRACAB Gen 5). Root mean square error (RMSE) and mean bias were calculated. Position RMSE ranged from 1.68 to 16.39 m, while speed RMSE ranged from 0.34 to 2.38 $\mathrm{m\,s^{-1}}$. Total match distance mean bias ranged from -1745 m (-21.8%) to 1945 m (24.3%) across providers. Computer-vision and AI player tracking software offer the ability to track players with fair precision when players are detected by the software. Providers should use a tactical feed when tracking position and speed, which will maximise player detection, improving accuracy. Both 720p and 1080p resolutions are suitable, assuming appropriate computer-vision and AI models are implemented.
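The two agreement statistics used in this study, RMSE and mean bias, can be sketched as follows. The sketch assumes that position error is the Euclidean distance between time-aligned (x, y) samples and that bias is computed as provider minus reference; the abstract does not spell out either convention, and the sample values are made up.

```python
import math


def position_rmse(estimated, reference):
    """RMSE (metres) of the Euclidean error between paired (x, y) samples."""
    squared = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def scalar_rmse(estimated, reference):
    """RMSE for a scalar signal such as instantaneous speed."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


def mean_bias(estimated, reference):
    """Mean signed difference (estimated minus reference)."""
    diffs = [e - r for e, r in zip(estimated, reference)]
    return sum(diffs) / len(diffs)


# Hypothetical samples: provider output versus the reference system.
provider_xy = [(0.0, 0.0), (3.0, 4.0)]
reference_xy = [(0.0, 0.0), (0.0, 0.0)]
provider_speed = [5.0, 6.0]
reference_speed = [6.0, 6.0]

pos_err = position_rmse(provider_xy, reference_xy)
speed_err = scalar_rmse(provider_speed, reference_speed)
speed_bias = mean_bias(provider_speed, reference_speed)
```

RMSE penalises large individual errors and is always non-negative, whereas mean bias keeps its sign, which is why the study can report both under- and over-estimation of total match distance.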

[77] JVLGS: Joint Vision-Language Gas Leak Segmentation

Xinlong Zhao,Qixiang Pang,Shan Du

Main category: cs.CV

TL;DR: This paper proposes JVLGS, a new gas leak segmentation framework that combines visual and textual information to improve detection accuracy, with strong performance in both supervised and few-shot learning settings.

Details Motivation: Gas leaks pose serious threats to human health and contribute to atmospheric pollution, but the lack of effective detection methods makes timely and accurate identification of leaks difficult. Method: The paper proposes a new framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of the visual and textual modalities to enhance gas leak representation and segmentation, and adds a post-processing step to reduce false positives. Result: Experiments show that JVLGS significantly outperforms existing gas leak segmentation methods across diverse scenarios and performs well under both supervised and few-shot learning settings. Conclusion: JVLGS performs strongly in both supervised and few-shot learning settings and significantly outperforms state-of-the-art gas leak segmentation methods. Abstract: Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step to reduce false positives caused by noise and non-target objects, an issue that affects many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code available at: https://github.com/GeekEagle/JVLGS

[78] UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models

Yimu Wang,Weiming Zhuang,Chen Chen,Jiabo Huang,Jingtao Li,Lingjuan Lyu

Main category: cs.CV

TL;DR: This paper proposes UNIFORM, a framework for transferring knowledge from many diverse pre-trained models to a single student model, improving performance and scalability beyond existing methods.

Details Motivation: The authors aim to address the challenge of integrating knowledge from heterogeneous pre-trained models without strong assumptions on data distribution and network architecture. Method: The paper proposes UNIFORM, which uses a voting mechanism at both the logit and feature levels to integrate knowledge from multiple pre-trained models into a student model. Result: Experiments show that UNIFORM improves unsupervised object recognition performance compared to baseline methods and scales effectively with over one hundred teacher models. Conclusion: UNIFORM provides a scalable and effective solution for knowledge transfer from a diverse set of pre-trained models into a single student model, overcoming the limitations of existing methods. Abstract: In the era of deep learning, the increasing number of pre-trained models available online presents a wealth of knowledge. These models, developed with diverse architectures and trained on varied datasets for different tasks, provide unique interpretations of the real world. Their collective consensus is likely universal and generalizable to unseen data. However, effectively harnessing this collective knowledge poses a fundamental challenge due to the heterogeneity of pre-trained models. Existing knowledge integration solutions typically rely on strong assumptions about training data distributions and network architectures, limiting them to learning only from specific types of models and resulting in data and/or inductive biases. In this work, we introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model without such constraints. Specifically, we propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level -- incorporating teacher models that are capable of predicting target classes of interest -- and at the feature level, utilizing visual representations learned on arbitrary label spaces. 
Extensive experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines. Notably, it exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale.
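The abstract does not describe the voting mechanism in detail, so the following is only a generic illustration of logit-level consensus, not UNIFORM's actual rule. It assumes every teacher outputs logits over the same target classes and uses a plain majority vote over the teachers' top predictions; `vote_pseudo_label` is a hypothetical helper.

```python
from collections import Counter


def vote_pseudo_label(teacher_logits):
    """Majority vote over each teacher's arg-max class.

    teacher_logits: one list of logits per teacher, all over the same classes.
    Returns (winning_class_index, fraction_of_teachers_that_agree).
    """
    votes = [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in teacher_logits
    ]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


# Three hypothetical teachers scoring one unlabeled image over three classes.
teachers = [
    [2.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [3.0, 0.0, 1.0],
]
label, agreement = vote_pseudo_label(teachers)
```

The agreement fraction could serve as a confidence weight when training the student, so that samples on which the teachers disagree contribute less; whether UNIFORM weights samples this way is not stated in the abstract.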

[79] Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery

Xiangxu Wang,Tianhong Zhao,Wei Tu,Bowen Zhang,Guanzhou Chen,Jinzhou Cao

Main category: cs.CV

TL;DR: Sat2Flow is a framework that generates structurally coherent Origin-Destination flows using only satellite imagery, addressing limitations of existing methods that rely on costly auxiliary features and suffer from sensitivity to spatial topology.

Details Motivation: Existing methods suffer from two critical limitations: (1) reliance on auxiliary features that are costly to collect and have limited spatial coverage; and (2) sensitivity to spatial topology, where minor index reordering of urban regions disrupts structural coherence in generated flows. Method: Sat2Flow, a latent structure-aware diffusion-based framework that generates structurally coherent OD flows using solely satellite imagery as input. Our approach introduces a multi-kernel encoder to capture diverse regional interactions and employs a permutation-aware diffusion process that aligns latent representations across different regional orderings. Result: Experimental results on real-world urban datasets demonstrate that Sat2Flow outperforms both physics-based and data-driven baselines in numerical accuracy while preserving empirical distributions and spatial structures under index permutations. Conclusion: Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce urban environments, eliminating region-specific auxiliary data dependencies while maintaining structural invariance for robust mobility modeling. Abstract: Origin-Destination (OD) flow matrices are essential for urban mobility analysis, underpinning applications in traffic forecasting, infrastructure planning, and policy design. However, existing methods suffer from two critical limitations: (1) reliance on auxiliary features (e.g., Points of Interest, socioeconomic statistics) that are costly to collect and have limited spatial coverage; and (2) sensitivity to spatial topology, where minor index reordering of urban regions (e.g., census tract relabeling) disrupts structural coherence in generated flows. To address these challenges, we propose Sat2Flow, a latent structure-aware diffusion-based framework that generates structurally coherent OD flows using solely satellite imagery as input. 
Our approach introduces a multi-kernel encoder to capture diverse regional interactions and employs a permutation-aware diffusion process that aligns latent representations across different regional orderings. Through a joint contrastive training objective that bridges satellite-derived features with OD patterns, combined with equivariant diffusion training that enforces structural consistency, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experimental results on real-world urban datasets demonstrate that Sat2Flow outperforms both physics-based and data-driven baselines in numerical accuracy while preserving empirical distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce urban environments, eliminating region-specific auxiliary data dependencies while maintaining structural invariance for robust mobility modeling.
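The "index reordering" issue described above can be made concrete with a small sketch. Relabeling regions permutes the rows and columns of an OD matrix together, so a generator that is robust to reindexing should produce outputs that permute in the same way. The code below only demonstrates the permutation itself and one quantity it leaves unchanged; it does not model Sat2Flow, and the function names and flow values are hypothetical.

```python
def permute_od(flows, perm):
    """Relabel regions: new region i is old region perm[i].

    Rows (origins) and columns (destinations) are permuted together.
    """
    n = len(perm)
    return [[flows[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def total_flow(flows):
    """Sum of all origin-destination flows."""
    return sum(sum(row) for row in flows)


# Hypothetical 3-region OD matrix: flows[i][j] is trips from region i to j.
od = [
    [0, 5, 1],
    [2, 0, 4],
    [3, 6, 0],
]
perm = [2, 0, 1]
od_relabelled = permute_od(od, perm)
```

The relabelled matrix describes exactly the same mobility pattern, which is why sensitivity to region ordering is treated as a defect of the generator and not as a property of the data.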

[80] Weed Detection in Challenging Field Conditions: A Semi-Supervised Framework for Overcoming Shadow Bias and Data Scarcity

Alzayat Saleh,Shunsuke Hatano,Mostafa Rahimi Azghadi

Main category: cs.CV

TL;DR: This paper proposes a semi-supervised framework to improve weed detection in agriculture by addressing shadow bias and enhancing model performance with minimal labeled data.

Details Motivation: The motivation stems from the challenges faced by deep learning models in real-world agricultural environments, including environmental conditions like shadow bias and the high cost of data annotation. Method: The study uses a semi-supervised pipeline that leverages pseudo-labeling on a large set of unlabeled data to improve model robustness. Supervised baselines (ResNet, YOLO, RF-DETR) were established, and interpretability tools were used to diagnose model biases. Result: Strong supervised baselines were achieved with F1 scores up to 0.90 and mAP50 scores over 0.82. The semi-supervised framework effectively mitigated shadow bias and improved recall, validated in a low-data regime on a public benchmark. Conclusion: This study presents a semi-supervised framework for improving the robustness of deep learning models in identifying invasive weeds, particularly addressing shadow bias and enhancing recall for automated spraying systems in precision agriculture. Abstract: The automated management of invasive weeds is critical for sustainable agriculture, yet the performance of deep learning models in real-world fields is often compromised by two factors: challenging environmental conditions and the high cost of data annotation. This study tackles both issues through a diagnostic-driven, semi-supervised framework. Using a unique dataset of approximately 975 labeled and 10,000 unlabeled images of Guinea Grass in sugarcane, we first establish strong supervised baselines for classification (ResNet) and detection (YOLO, RF-DETR), achieving F1 scores up to 0.90 and mAP50 scores exceeding 0.82. Crucially, this foundational analysis, aided by interpretability tools, uncovered a pervasive "shadow bias," where models learned to misidentify shadows as vegetation. This diagnostic insight motivated our primary contribution: a semi-supervised pipeline that leverages unlabeled data to enhance model robustness. 
By training models on a more diverse set of visual information through pseudo-labeling, this framework not only helps mitigate the shadow bias but also provides a tangible boost in recall, a critical metric for minimizing weed escapes in automated spraying systems. To validate our methodology, we demonstrate its effectiveness in a low-data regime on a public crop-weed benchmark. Our work provides a clear and field-tested framework for developing, diagnosing, and improving robust computer vision systems for the complex realities of precision agriculture.
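Pseudo-labeling, as used in the pipeline above, is often implemented by keeping only confident predictions on unlabeled images and adding them to the training set. The sketch below shows that common recipe; the 0.9 confidence threshold, the tuple layout, and the file names are assumptions for illustration, since the abstract does not give the authors' selection rule.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep (image_id, label) pairs whose confidence meets the threshold.

    predictions: iterable of (image_id, predicted_label, confidence).
    """
    return [
        (image_id, label)
        for image_id, label, confidence in predictions
        if confidence >= threshold
    ]


def build_training_set(labeled, predictions, threshold=0.9):
    """Combine human-labeled samples with confident pseudo-labeled ones."""
    return list(labeled) + select_pseudo_labels(predictions, threshold)


# Hypothetical model outputs on unlabeled field images.
labeled = [("img_001.jpg", "weed"), ("img_002.jpg", "crop")]
unlabeled_predictions = [
    ("img_101.jpg", "weed", 0.97),
    ("img_102.jpg", "crop", 0.55),  # too uncertain, discarded
    ("img_103.jpg", "weed", 0.91),
]
training_set = build_training_set(labeled, unlabeled_predictions)
```

A stricter threshold admits fewer but cleaner pseudo-labels; a looser one adds more visual diversity at the cost of label noise, which matters when the goal is higher recall.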

[81] MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment

Zhiting Gao,Dan Song,Diqiong Jiang,Chao Xue,An-An Liu

Main category: cs.CV

TL;DR: This paper proposes TAPO and MotionFLUX, which improve both semantic alignment and generation speed in text-driven motion generation.

Details Motivation: To address the shortcomings of existing text-driven methods in aligning linguistic descriptions with motion semantics, as well as their slow inference. Method: The paper proposes TMR++ Aligned Preference Optimization (TAPO) and MotionFLUX; the latter is based on deterministic rectified flow matching and enables real-time motion generation. Result: Experiments show that the new approach performs well in semantic consistency and motion quality while also speeding up generation. Conclusion: Together, TAPO and MotionFLUX outperform existing state-of-the-art methods in semantic consistency and motion quality while accelerating generation. Abstract: Motion generation is essential for animating virtual characters and embodied agents. While recent text-driven methods have made significant strides, they often struggle with achieving precise alignment between linguistic descriptions and motion semantics, as well as with the inefficiencies of slow, multi-step inference. To address these issues, we introduce TMR++ Aligned Preference Optimization (TAPO), an innovative framework that aligns subtle motion variations with textual modifiers and incorporates iterative adjustments to reinforce semantic grounding. To further enable real-time synthesis, we propose MotionFLUX, a high-speed generation framework based on deterministic rectified flow matching. Unlike traditional diffusion models, which require hundreds of denoising steps, MotionFLUX constructs optimal transport paths between noise distributions and motion spaces, facilitating real-time synthesis. The linearized probability paths reduce the need for multi-step sampling typical of sequential methods, significantly accelerating inference time without sacrificing motion quality. Experimental results demonstrate that, together, TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality, while also accelerating generation speed. The code and pretrained models will be released.
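Why linearized probability paths need fewer sampling steps can be seen in a toy sketch of rectified flow. It assumes the common convention of noise at t = 0 and data at t = 1, a straight-line interpolation between the two, and a plain Euler sampler; the velocity function here is an idealised stand-in for a trained network, not MotionFLUX itself.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 (t=0) and data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0]
motion = [1.0, -2.0]  # stand-in for a motion feature vector


def ideal_velocity(x, t):
    """Idealised, perfectly straight field (a trained model only approximates this)."""
    return target_velocity(noise, motion)


one_step = euler_sample(noise, ideal_velocity, num_steps=1)
four_steps = euler_sample(noise, ideal_velocity, num_steps=4)
midpoint = interpolate(noise, motion, 0.5)
```

When the velocity is constant along the path, a single Euler step already lands on the target, so extra steps add nothing. A learned field is only approximately straight, which is why a small number of steps may still be used in practice.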

[82] CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning

Nannan Zhu,Yonghao Dong,Teng Wang,Xueqian Li,Shengjun Deng,Yijia Wang,Zheng Hong,Tiantian Geng,Guo Niu,Hanyan Huang,Xiongfei Yao,Shuaiwei Jiao

Main category: cs.CV

TL;DR: This paper presents CVBench, a benchmark for evaluating cross-video relational reasoning, which exposes the shortcomings of current MLLMs in multi-video reasoning and offers architectural insights for next-generation models.

Details Motivation: Multimodal large language models (MLLMs) perform well on single-video tasks, but their ability across multiple videos remains critically underexplored, even though this capability is essential for real-world applications. Method: CVBench comprises 1,000 question-answer pairs spanning three tiers: cross-video object association, cross-video event association, and cross-video complex reasoning. Result: Even top models such as GPT-4o achieve only 60% accuracy on causal reasoning tasks, compared with 91% for humans. Conclusion: Through rigorous evaluation, CVBench exposes the shortcomings of current MLLMs in multi-video reasoning and offers architectural insights for next-generation models. Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot or chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.

[83] WEBEYETRACK: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization

Eduardo Davalos,Yike Zhang,Namrata Srivastava,Yashvitha Thatigotla,Jorge A. Salas,Sara McFadden,Sun-Joo Cho,Amanda Goodwin,Ashwin TS,Gautam Biswas

Main category: cs.CV

TL;DR: WebEyeTrack is an efficient in-browser gaze estimation framework that addresses the model size, inference time, and privacy problems of existing methods.

Details Motivation: Existing gaze estimation methods show a gap with commercial eye-tracking solutions in real-world use, and webcam-based methods lack sufficient accuracy because of factors such as head movement. Method: The paper introduces WebEyeTrack, which combines model-based head pose estimation with on-device few-shot learning (requiring as few as nine calibration samples). Result: WebEyeTrack achieves an error margin of 2.32 cm on the GazeCapture dataset and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Conclusion: WebEyeTrack is a framework that integrates lightweight SOTA gaze estimation models in the browser; it adapts to new users and achieves real-time inference. Abstract: With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k < 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at https://github.com/RedForestAi/WebEyeTrack.

[84] MonoRelief V2: Leveraging Real Data for High-Fidelity Monocular Relief Recovery

Yu-Wei Zhang,Tongju Han,Lipeng Gao,Mingqiang Wei,Hui Liu,Changbao Li,Caiming Zhang

Main category: cs.CV

TL;DR: MonoRelief V2 is an improved version of MonoRelief V1 that is trained on a combination of pseudo-real and real-world datasets, achieving more accurate and more robust 2.5D relief recovery.

Details Motivation: MonoRelief V2 aims to address the robustness, accuracy, and efficiency problems that arise because MonoRelief V1 was trained only on synthetic data. Method: Approximately 15,000 pseudo-real images are generated, and the corresponding depth pseudo-labels are derived by fusing depth and normal predictions. In addition, a small-scale real-world dataset (800 samples) is built through multi-view reconstruction and detail refinement. Result: MonoRelief V2 shows state-of-the-art performance in both depth and normal prediction, highlighting its strong potential for a range of downstream applications. Conclusion: MonoRelief V2 is a more accurate and more robust 2.5D relief recovery model, trained on a combination of pseudo-real and real-world datasets to achieve more efficient performance. Abstract: This paper presents MonoRelief V2, an end-to-end model designed for directly recovering 2.5D reliefs from single images under complex material and illumination variations. In contrast to its predecessor, MonoRelief V1 [1], which was solely trained on synthetic data, MonoRelief V2 incorporates real data to achieve improved robustness, accuracy and efficiency. To overcome the challenge of acquiring a large-scale real-world dataset, we generate approximately 15,000 pseudo-real images using a text-to-image generative model, and derive corresponding depth pseudo-labels through fusion of depth and normal predictions. Furthermore, we construct a small-scale real-world dataset (800 samples) via multi-view reconstruction and detail refinement. MonoRelief V2 is then progressively trained on the pseudo-real and real-world datasets. Comprehensive experiments demonstrate its state-of-the-art performance both in depth and normal predictions, highlighting its strong potential for a range of downstream applications. Code is at: https://github.com/glp1001/MonoreliefV2.

[85] FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection

Yuhang Zhao,Zixing Wang

Main category: cs.CV

TL;DR: FlowDet is a high-speed, efficient object detection model specifically designed for real-time applications like intersection traffic monitoring, outperforming existing models in both accuracy and speed.

Details Motivation: End-to-end object detectors have high computational costs, especially in complex situations like intersection traffic monitoring, which presents a significant barrier. Method: FlowDet uses a decoupled encoder optimization strategy applied to the DETR architecture, incorporating a Geometric Deformable Unit (GDU) and a Scale-Aware Attention (SAA) module. Result: On the Intersection-Flow-5k dataset, FlowDet improves AP(test) by 1.5% and AP50(test) by 1.6% compared to RT-DETR, while reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Conclusion: FlowDet demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. Abstract: End-to-end object detectors offer a promising NMS-free paradigm for real-time applications, yet their high computational cost remains a significant barrier, particularly for complex scenarios like intersection traffic monitoring. To address this challenge, we propose FlowDet, a high-speed detector featuring a decoupled encoder optimization strategy applied to the DETR architecture. Specifically, FlowDet employs a novel Geometric Deformable Unit (GDU) for traffic-aware geometric modeling and a Scale-Aware Attention (SAA) module to maintain high representational power across extreme scale variations. To rigorously evaluate the model's performance in environments with severe occlusion and high object density, we collected the Intersection-Flow-5k dataset, a new challenging scene for this task. Evaluated on Intersection-Flow-5k, FlowDet establishes a new state-of-the-art. Compared to the strong RT-DETR baseline, it improves AP(test) by 1.5% and AP50(test) by 1.6%, while simultaneously reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Our work demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. 
The Intersection-Flow-5k dataset is available at https://github.com/AstronZh/Intersection-Flow-5K.

[86] DNP-Guided Contrastive Reconstruction with a Reverse Distillation Transformer for Medical Anomaly Detection

Luhu Li,Bowen Lin,Mukhtiar Khan,Shujun Fu

Main category: cs.CV

TL;DR: This paper proposes a novel framework for anomaly detection in medical images, combining trainable encoders and prototype-guided reconstruction to overcome domain gaps and prototype collapse.

Details Motivation: Medical image anomaly detection faces challenges due to limited annotations and domain gaps, while existing methods suffer from limited adaptability and prototype collapse. Method: The method combines a trainable encoder with prototype-guided reconstruction and a Diversity-Aware Alignment Loss to prevent prototype collapse and enhance feature learning. Result: The proposed framework shows significant improvements in representation quality and anomaly localization across medical imaging benchmarks. Conclusion: The proposed framework addresses the issues of domain adaptation and prototype collapse in anomaly detection for medical images, achieving better performance and interpretability. Abstract: Anomaly detection in medical images is challenging due to limited annotations and a domain gap compared to natural images. Existing reconstruction methods often rely on frozen pre-trained encoders, which limits adaptation to domain-specific features and reduces localization accuracy. Prototype-based learning offers interpretability and clustering benefits but suffers from prototype collapse, where few prototypes dominate training, harming diversity and generalization. To address this, we propose a unified framework combining a trainable encoder with prototype-guided reconstruction and a novel Diversity-Aware Alignment Loss. The trainable encoder, enhanced by a momentum branch, enables stable domain-adaptive feature learning. A lightweight Prototype Extractor mines informative normal prototypes to guide the decoder via attention for precise reconstruction. Our loss enforces balanced prototype use through diversity constraints and per-prototype normalization, effectively preventing collapse. Experiments on multiple medical imaging benchmarks show significant improvements in representation quality and anomaly localization, outperforming prior methods. 
Visualizations and prototype assignment analyses further validate the effectiveness of our anti-collapse mechanism and enhanced interpretability.

[87] Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation

Mingxi Fu,Fanglei Fu,Xitong Ling,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu

Main category: cs.CV

TL;DR: MPAMatch improves pathological image segmentation by introducing a dual contrastive learning approach and utilizing a pathology-pretrained model, achieving better performance than current methods.

Details Motivation: Pathological image segmentation faces challenges like ambiguous semantic boundaries and high annotation costs. Existing semi-supervised methods struggle with capturing high-level semantic priors. Method: MPAMatch uses a dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels. It also integrates a pathology-pretrained foundation model for feature extraction. Result: Experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI datasets demonstrate MPAMatch's superiority in structural and semantic modeling compared to existing methods. Conclusion: MPAMatch, a novel segmentation framework, outperforms state-of-the-art methods in both structural and semantic modeling for pathological image segmentation. Abstract: Pathological image segmentation faces numerous challenges, particularly due to ambiguous semantic boundaries and the high cost of pixel-level annotations. Although recent semi-supervised methods based on consistency regularization (e.g., UniMatch) have made notable progress, they mainly rely on perturbation-based consistency within the image modality, making it difficult to capture high-level semantic priors, especially in structurally complex pathology images. To address these limitations, we propose MPAMatch - a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in the dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels, providing supervision at both structural and semantic levels. This coarse-to-fine supervisory strategy not only enhances the discriminative capability on unlabeled samples but also introduces the text prototype supervision into segmentation for the first time, significantly improving semantic boundary modeling. 
In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni), enabling more effective extraction of pathology-relevant features. Extensive experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI show MPAMatch's superiority over state-of-the-art methods, validating its dual advantages in structural and semantic modeling.

[88] Interact-Custom: Customized Human Object Interaction Image Generation

Zhu Xu,Zhaowen Wang,Yuxin Peng,Yang Liu

Main category: cs.CV

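TL;DR: This paper proposes a new image generation task, CHOI, which aims to preserve the identities of the target human and object while controlling the interaction semantics between them.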

Details Motivation: Existing approaches mainly focus on preserving the appearance of target entities while neglecting fine-grained interaction control among them. Method: The authors design a two-stage model, Interact-Custom, and process a large-scale dataset. Result: Interact-Custom can preserve identity features and control interaction semantics at the same time. Conclusion: Interact-Custom performs well on the CHOI task and offers high content controllability. Abstract: Compositional Customized Image Generation aims to customize multiple target concepts within generated content, and has gained attention for its wide applications. Existing approaches mainly concentrate on the target entity's appearance preservation, while neglecting fine-grained interaction control among target entities. To equip the model with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human and object and interaction semantic control between them. Two primary challenges exist for CHOI: (1) Simultaneous identity preservation and interaction control require the model to decompose the human and object into self-contained identity features and pose-oriented interaction features, while current HOI image datasets fail to provide ideal samples for such feature-decomposed learning. (2) Inappropriate spatial configuration between human and object may lead to a lack of the desired interaction semantics. To tackle these challenges, we first process a large-scale dataset in which each sample contains the same human-object pair in different interactive poses. We then design a two-stage model, Interact-Custom, which first explicitly models the spatial configuration by generating a foreground mask depicting the interaction behavior and then, under the guidance of this mask, generates the target human and object interacting while preserving their identity features. Furthermore, if users provide the background image and the union location where the target human and object should appear, Interact-Custom can optionally take them as input, offering high content controllability. Extensive experiments on our tailored metrics for the CHOI task demonstrate the effectiveness of our approach.

[89] High-Speed FHD Full-Color Video Computer-Generated Holography

Haomiao Zhang,Miao Cao,Xuan Yu,Hui Luo,Yanling Piao,Mengjie Qin,Zhangyuan Li,Ping Wang,Xin Yuan

Main category: cs.CV

TL;DR: The paper introduces SGDDM and HoloMamba to enable high-speed, high-fidelity full-color holographic video generation by addressing existing limitations in phase smoothing, color crosstalk, and computational inefficiency.

Details Motivation: Existing methods for generating holographic video face limitations such as over-smoothed phases causing color crosstalk and frame-by-frame optimization neglecting spatial-temporal correlations, leading to trade-offs in frame rate, color fidelity, and computational efficiency. Method: SGDDM optimizes phase distributions via frequency modulation for high-fidelity full-color display at high frame rates, while HoloMamba is a lightweight asymmetric Mamba-Unet architecture modeling spatial-temporal correlations for enhanced reconstruction quality and efficiency. Result: SGDDM enables high-fidelity full-color display without compromising frame rate, and HoloMamba generates FHD full-color holographic video at over 260 FPS, more than 2.6x faster than prior methods. Conclusion: The paper proposes a novel high-speed full-color video CGH generation scheme that overcomes the limitations of existing methods by introducing SGDDM and HoloMamba. Abstract: Computer-generated holography (CGH) is a promising technology for next-generation displays. However, generating high-speed, high-quality holographic video requires both high frame rate display and efficient computation, and is constrained by two key limitations: (i) Learning-based models often produce over-smoothed phases with narrow angular spectra, causing severe color crosstalk in high frame rate full-color displays such as depth-division multiplexing and thus resulting in a trade-off between frame rate and color fidelity. (ii) Existing frame-by-frame optimization methods typically optimize frames independently, neglecting spatial-temporal correlations between consecutive frames and leading to computationally inefficient solutions. To overcome these challenges, in this paper, we propose a novel high-speed full-color video CGH generation scheme. 
First, we introduce Spectrum-Guided Depth Division Multiplexing (SGDDM), which optimizes phase distributions via frequency modulation, enabling high-fidelity full-color display at high frame rates. Second, we present HoloMamba, a lightweight asymmetric Mamba-Unet architecture that explicitly models spatial-temporal correlations across video sequences to enhance reconstruction quality and computational efficiency. Extensive simulated and real-world experiments demonstrate that SGDDM achieves high-fidelity full-color display without compromise in frame rate, while HoloMamba generates FHD (1080p) full-color holographic video at over 260 FPS, more than 2.6x faster than the prior state-of-the-art Divide-Conquer-and-Merge Strategy.

[90] Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction

Dat Nguyen Cong,Hieu Tran Bao,Hoang Thanh-Tung

Main category: cs.CV

TL;DR: This paper introduces a new method called Score-based Discriminator Correction (SBDC) to improve the performance of diffusion models in the presence of label noise in training data.

Details Motivation: Diffusion models are powerful for image and video generation but may be affected by label errors in the training data. The impact of such errors on model performance is not well understood, and techniques to mitigate them are limited. Method: The authors propose Score-based Discriminator Correction (SBDC), which uses a discriminator trained with adversarial loss to guide the generation process. This discriminator helps assess the authenticity of generated samples, focusing on correcting errors in the early stages of generation. The method does not require retraining the diffusion model and is computationally efficient. Result: Experiments under various noise conditions show that SBDC outperforms previous state-of-the-art methods in terms of generative performance, while only slightly increasing inference time. Conclusion: The proposed SBDC method effectively improves the robustness and controllability of diffusion models in the presence of label noise, without requiring retraining or significant computational overhead. Abstract: Diffusion models have gained prominence as state-of-the-art techniques for synthesizing images and videos, particularly due to their ability to scale effectively with large datasets. Recent studies have uncovered that these extensive datasets often contain mistakes from manual labeling processes. However, the extent to which such errors compromise the generative capabilities and controllability of diffusion models is not well studied. This paper introduces Score-based Discriminator Correction (SBDC), a guidance technique for aligning noisy pre-trained conditional diffusion models. The guidance is built on discriminator training using adversarial loss, drawing on prior noise detection techniques to assess the authenticity of each sample. We further show that limiting the usage of our guidance to the early phase of the generation process leads to better performance. 
Our method is computationally efficient, only marginally increases inference time, and does not require retraining diffusion models. Experiments on different noise settings demonstrate the superiority of our method over previous state-of-the-art methods.

[91] Generalizing Monocular 3D Object Detection

Abhinav Kumar

Main category: cs.CV

TL;DR: This thesis improves the generalization of Mono3D models across diverse scenarios by introducing methods that enhance occlusion robustness, improve generalization to new datasets, address large object detection, and analyze the extrapolation of models to unseen camera heights.

Details Motivation: The motivation is to improve the generalization of Mono3D models to diverse scenarios such as occlusions, varying datasets, object sizes, and camera parameters, which are critical for applications like autonomous driving, augmented reality, and robotics. Method: The thesis proposes GrooMeD-NMS for occlusion robustness, DEVIANT backbones for generalization to new datasets, SeaBird for large object detection, and mathematical analysis for extrapolation of Mono3D models to unseen camera heights. Result: The proposed methods enhance the robustness and generalization of Mono3D models across different scenarios, including occlusions, datasets, object sizes, and camera parameters. The thesis provides a mathematical analysis of the extrapolation of Mono3D models to unseen camera heights. Conclusion: The thesis successfully addresses the challenge of generalizing Mono3D models to diverse scenarios by proposing methods that enhance occlusion robustness, improve generalization to new datasets, address large object detection, and analyze the extrapolation of Mono3D models to unseen camera heights. Abstract: Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object's class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D environmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To enhance occlusion robustness, we propose a mathematically differentiable NMS (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it's not solely a data imbalance or receptive field problem but also a noise sensitivity issue. 
To mitigate this, we introduce a segmentation-based approach in bird's-eye view with dice loss (SeaBird). Finally, we mathematically analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings.
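
The abstract names dice loss but does not give SeaBird's exact formulation, so the following is only a minimal sketch of the standard soft dice loss on flat lists. The function name `dice_loss`, the `eps` smoothing term, and the toy inputs are assumptions for illustration.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss for flat lists of probabilities and binary labels."""
    # Overlap between predicted probabilities and ground-truth labels.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total predicted mass plus total ground-truth mass.
    total = sum(pred) + sum(target)
    # eps (an assumed smoothing constant) avoids division by zero on empty masks.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# Toy inputs: a perfect prediction and a fully disjoint one.
perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
disjoint = dice_loss([0.0, 1.0, 0.0, 1.0], [1, 0, 1, 0])
```

Because the loss is a ratio of overlap to total mass rather than a per-pixel sum, it is commonly chosen for segmentation targets whose foreground area varies widely.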

[92] Quantization Robustness to Input Degradations for Object Detection

Toghrul Karimov,Hassan Imani,Allan Kazakov

Main category: cs.CV

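TL;DR: This paper studies post-training quantization of YOLO models across several precision formats, evaluates how a degradation-aware calibration strategy affects robustness, and finds that the strategy may help only at larger model scales under specific noise conditions.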

Details Motivation: When deploying efficient object detection models on resource-constrained devices, the effect of quantization on model robustness is an important concern. Method: The authors study post-training quantization of YOLO models across multiple precision formats (FP32, FP16, Dynamic UINT8, and Static INT8) and introduce and evaluate a degradation-aware calibration strategy. Result: The Static INT8 format delivers a 1.5-3.3x speedup on clean data, but degradation-aware calibration does not bring significant robustness improvements across the various degradation conditions. Conclusion: Static INT8 TensorRT engines provide substantial speedups on clean data, but degradation-aware calibration does not yield broad robustness improvements across most models and degradation conditions. Abstract: Post-training quantization (PTQ) is crucial for deploying efficient object detection models, like YOLO, on resource-constrained devices. However, the impact of reduced precision on model robustness to real-world input degradations such as noise, blur, and compression artifacts is a significant concern. This paper presents a comprehensive empirical study evaluating the robustness of YOLO models (nano to extra-large scales) across multiple precision formats: FP32, FP16 (TensorRT), Dynamic UINT8 (ONNX), and Static INT8 (TensorRT). We introduce and evaluate a degradation-aware calibration strategy for Static INT8 PTQ, where the TensorRT calibration process is exposed to a mix of clean and synthetically degraded images. Models were benchmarked on the COCO dataset under seven distinct degradation conditions (including various types and levels of noise, blur, low contrast, and JPEG compression) and a mixed-degradation scenario. Results indicate that while Static INT8 TensorRT engines offer substantial speedups (~1.5-3.3x) with a moderate accuracy drop (~3-7% mAP50-95) on clean data, the proposed degradation-aware calibration did not yield consistent, broad improvements in robustness over standard clean-data calibration across most models and degradations. A notable exception was observed for larger model scales under specific noise conditions, suggesting model capacity may influence the efficacy of this calibration approach. These findings highlight the challenges in enhancing PTQ robustness and provide insights for deploying quantized detectors in uncontrolled environments. All code and evaluation tables are available at https://github.com/AllanK24/QRID.
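
To make the idea of calibrating on a mix of clean and degraded data concrete, here is a rough sketch using a simplified max-abs symmetric INT8 scheme on scalar activation-like values. This is not TensorRT's calibrator and not the authors' pipeline; the helper names, the Gaussian-noise degradation, and the sample sizes are all assumptions.

```python
import random


def calibrate_scale(samples):
    """Symmetric INT8 scale from the largest absolute calibration value."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the INT8 grid, clamp to [-127, 127], and dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
clean = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
# Hypothetical degradation: additive Gaussian noise on the same values.
noisy = [v + rng.gauss(0.0, 0.5) for v in clean]

scale_clean = calibrate_scale(clean)          # clean-only calibration
scale_mixed = calibrate_scale(clean + noisy)  # degradation-aware calibration

err_clean = mean_abs_error(clean, quantize(clean, scale_clean))
err_mixed = mean_abs_error(clean, quantize(clean, scale_mixed))
```

In this toy setting, the mixed calibration set can only widen the quantization range: degraded inputs are less likely to be clipped, but the step size on clean inputs becomes coarser. That trade-off is one plausible reason such a strategy need not help uniformly, though the sketch says nothing about the paper's actual measurements.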

[93] IELDG: Suppressing Domain-Specific Noise with Inverse Evolution Layers for Domain Generalized Semantic Segmentation

Qizhe Fan,Chaoyu Liu,Zhonghua Qiao,Xiaoqin Shen

Main category: cs.CV

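TL;DR: This paper proposes IELDM and IELFormer to improve generalization in cross-domain semantic segmentation; by integrating inverse evolution layers and a multi-scale frequency fusion module, they filter out undesirable generative patterns and suppress defect propagation.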

Details Motivation: Data generated by diffusion models often contain structural or semantic defects, which can cause performance degradation and error accumulation. Method: Inverse evolution layers (IELs) are integrated into the generative process, using Laplacian-based priors to highlight spatial discontinuities and semantic inconsistencies and thereby filter out undesirable generative patterns. IELs are also embedded into the decoder of the DGSS model, and a multi-scale frequency fusion module performs frequency-domain analysis to achieve structured integration of multi-resolution features. Result: Experiments on benchmark datasets show that the approach achieves better generalization than existing methods in cross-domain scenarios. Conclusion: The paper proposes IELDM, an enhanced diffusion-based data augmentation framework, and IELFormer, a new DGSS model, to improve generalization in cross-domain semantic segmentation. Abstract: Domain Generalized Semantic Segmentation (DGSS) focuses on training a model using labeled data from a source domain, with the goal of achieving robust generalization to unseen target domains during inference. A common approach to improve generalization is to augment the source domain with synthetic data generated by diffusion models (DMs). However, the generated images often contain structural or semantic defects due to training imperfections. Training segmentation models with such flawed data can lead to performance degradation and error accumulation. To address this issue, we propose to integrate inverse evolution layers (IELs) into the generative process. IELs are designed to highlight spatial discontinuities and semantic inconsistencies using Laplacian-based priors, enabling more effective filtering of undesirable generative patterns. Based on this mechanism, we introduce IELDM, an enhanced diffusion-based data augmentation framework that can produce higher-quality images. Furthermore, we observe that the defect-suppression capability of IELs can also benefit the segmentation network by suppressing artifact propagation. Based on this insight, we embed IELs into the decoder of the DGSS model and propose IELFormer to strengthen generalization capability in cross-domain scenarios. To further strengthen the model's semantic consistency across scales, IELFormer incorporates a multi-scale frequency fusion (MFF) module, which performs frequency-domain analysis to achieve structured integration of multi-resolution features, thereby improving cross-scale coherence. 
Extensive experiments on benchmark datasets demonstrate that our approach achieves superior generalization performance compared to existing methods.
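
The abstract says IELs rely on Laplacian-based priors to highlight spatial discontinuities but does not describe the layers themselves. As a loose illustration of why a Laplacian responds to discontinuities, the sketch below applies a plain 5-point discrete Laplacian to two toy grids. The function name and the grids are assumptions, and this is not the IEL implementation.

```python
def laplacian(grid):
    """5-point discrete Laplacian on interior cells; border cells stay 0."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (
                grid[y - 1][x] + grid[y + 1][x]
                + grid[y][x - 1] + grid[y][x + 1]
                - 4 * grid[y][x]
            )
    return out


# A constant region and a region with a vertical step edge (toy data).
flat = [[1] * 4 for _ in range(4)]
edge = [[0, 0, 1, 1] for _ in range(4)]

lap_flat = laplacian(flat)  # zero everywhere: no discontinuity
lap_edge = laplacian(edge)  # nonzero only next to the step
```

The response vanishes on smooth regions and changes sign across the step, which is the property that makes Laplacian-style operators useful for flagging abrupt structural changes.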

[94] Controllable Skin Synthesis via Lesion-Focused Vector Autoregression Model

Jiajun Sun,Zhen Yu,Siyuan Yan,Jason J. Ong,Zongyuan Ge,Lei Zhang

Main category: cs.CV

TL;DR: This study proposes LF-VAR, a new controllable skin image synthesis method that generates high-quality, clinically relevant skin images.

Details Motivation: Skin images from real-world clinical practice are often limited, leaving deep-learning models short of training data. Although many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over lesion location and type. Method: LF-VAR uses quantified lesion measurement scores and lesion type labels to guide skin image synthesis. It comprises a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) that encodes images into discrete latent representations and a Visual AutoRegressive (VAR) Transformer that synthesizes images from the tokenized representations. Result: The method achieves an overall FID score of 0.74 (average) across seven lesion types, improving upon the previous state of the art (SOTA) by 6.3%. Conclusion: The study proposes LF-VAR, a new model for clinically relevant and controllable skin image synthesis that produces high-fidelity images. The code is publicly available on GitHub. Abstract: Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion's location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on tokenized representations facilitates image synthesis. Lesion measurement from the lesion region and types as conditional embeddings are integrated to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model's effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.

[95] Divide, Weight, and Route: Difficulty-Aware Optimization with Dynamic Expert Fusion for Long-tailed Recognition

Xiaolei Wei,Yi Ouyang,Haibo Ye

Main category: cs.CV

TL;DR: DQRoute is a modular framework for long-tailed visual recognition that combines difficulty-aware optimization with dynamic expert collaboration to improve performance on rare and difficult classes.

Details Motivation: Long-tailed visual recognition is challenging due to class imbalance and varying classification difficulty. Existing methods that reweight classes by frequency often overlook intrinsically hard-to-learn categories. Method: DQRoute estimates class-wise difficulty using prediction uncertainty and historical performance, guiding training through adaptive loss weighting. It employs a mixture-of-experts architecture with expert specialization and confidence-based prediction weighting during inference. Result: Experiments on standard long-tailed benchmarks show that DQRoute significantly enhances performance, particularly for rare and difficult classes. Conclusion: DQRoute improves long-tailed visual recognition by integrating difficulty modeling with decentralized expert routing. Abstract: Long-tailed visual recognition is challenging not only due to class imbalance but also because of varying classification difficulty across categories. Simply reweighting classes by frequency often overlooks those that are intrinsically hard to learn. To address this, we propose DQRoute, a modular framework that combines difficulty-aware optimization with dynamic expert collaboration. DQRoute first estimates class-wise difficulty based on prediction uncertainty and historical performance, and uses this signal to guide training with adaptive loss weighting. On the architectural side, DQRoute employs a mixture-of-experts design, where each expert specializes in a different region of the class distribution. At inference time, expert predictions are weighted by confidence scores derived from expert-specific OOD detectors, enabling input-adaptive routing without the need for a centralized router. All components are trained jointly in an end-to-end manner. 
Experiments on standard long-tailed benchmarks demonstrate that DQRoute significantly improves performance, particularly on rare and difficult classes, highlighting the benefit of integrating difficulty modeling with decentralized expert routing.
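
The inference-time step described above, where expert predictions are weighted by confidence scores instead of a central router, can be sketched roughly as follows. In the paper the confidences come from expert-specific OOD detectors; here they are simply given numbers, and the softmax normalization, function names, and example values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_probs, confidences):
    """Combine per-expert class probabilities with confidence-derived weights."""
    weights = softmax(confidences)  # assumed normalization, not from the paper
    num_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(num_classes)
    ]


expert_probs = [
    [0.7, 0.2, 0.1],  # hypothetical head-class expert
    [0.1, 0.3, 0.6],  # hypothetical tail-class expert
]
confidences = [0.5, 2.0]  # hypothetical per-expert confidence for this input
fused = fuse_experts(expert_probs, confidences)
```

Since each expert supplies its own confidence for the current input, the weighting adapts per sample without a separately trained gating network, which matches the "decentralized" framing in the abstract.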

[96] Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception

Yang Li,Quan Yuan,Guiyang Luo,Xiaoyuan Fu,Rui Pan,Yujia Yang,Congzhang Shao,Yuewen Liu,Jinglin Li

Main category: cs.CV

TL;DR: This paper proposes a novel collaborative perception framework called CoPLOT that utilizes point-level optimized tokens for improved object recognition and localization, outperforming state-of-the-art models with reduced communication and computation overhead.

Details Motivation: Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. Method: CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. Result: Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Conclusion: CoPLOT outperforms state-of-the-art models with even lower communication and computation overhead. Abstract: Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. 
A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.

[97] UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks

Bikash Kumar Badatya,Vipul Baghel,Ravi Hegde

Main category: cs.CV

TL;DR: This paper introduces a lightweight, unsupervised skeleton-based method for action localization in sports videos, combining ASTGCN with a novel Action Dynamics Metric to achieve efficient and accurate results comparable to supervised approaches.

Details Motivation: Fine-grained action localization in sports videos is challenging due to rapid and subtle motion transitions. Existing methods are computationally intensive and rely on large annotated datasets, limiting their applicability in real-world scenarios. Method: The method involves pre-training an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task, and using a novel Action Dynamics Metric (ADM) derived from ASTGCN embeddings to detect motion boundaries. Result: The method achieves a mean Average Precision (mAP) of 82.66% and an average localization latency of 29.09 ms on the DSV Diving dataset, matching supervised approaches while maintaining efficiency. Conclusion: The proposed unsupervised method for action localization is computationally efficient, matches state-of-the-art supervised performance, and generalizes well to real-world scenarios without retraining. Abstract: Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions over short durations. Existing supervised and weakly supervised solutions often rely on extensive annotated datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios. In this work, we introduce a lightweight and unsupervised skeleton-based action localization pipeline that leverages spatio-temporal graph neural representations. Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task with blockwise partitions, enabling it to learn intrinsic motion dynamics without any manual labeling. At inference, we define a novel Action Dynamics Metric (ADM), computed directly from low-dimensional ASTGCN embeddings, which detects motion boundaries by identifying inflection points in its curvature profile. 
Our method achieves a mean Average Precision (mAP) of 82.66% and average localization latency of 29.09 ms on the DSV Diving dataset, matching state-of-the-art supervised performance while maintaining computational efficiency. Furthermore, it generalizes robustly to unseen, in-the-wild diving footage without retraining, demonstrating its practical applicability for lightweight, real-time action analysis systems in embedded or dynamic environments.
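
The abstract does not define the ADM, so the sketch below only illustrates the general idea of locating boundaries at inflection points of a 1D signal, using sign changes of the discrete second difference. The signal, function names, and the absence of smoothing are assumptions; a real embedding-derived signal would likely need denoising first.

```python
def second_difference(signal):
    """Discrete second derivative at each interior sample."""
    return [
        signal[i - 1] - 2 * signal[i] + signal[i + 1]
        for i in range(1, len(signal) - 1)
    ]


def inflection_indices(signal):
    """Indices in `signal` where the second difference changes sign."""
    curv = second_difference(signal)
    hits = []
    for k in range(1, len(curv)):
        if curv[k - 1] * curv[k] < 0:
            hits.append(k + 1)  # curv[k] is centered on signal[k + 1]
    return hits


# Hypothetical per-frame metric: accelerating rise, then decelerating rise.
signal = [0, 1, 3, 6, 8, 9, 9]
boundaries = inflection_indices(signal)
```

An exact zero in the second difference is not counted as a sign change here, so a production version would need an explicit tolerance or zero-crossing rule.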

[98] IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

Dongjin Kim,Jaekyun Ko,Muhammad Kashif Ali,Tae Hyun Kim

Main category: cs.CV

TL;DR: This paper proposes a dynamic kernel-based image denoising method that effectively handles unseen noise types and levels, offering efficient and high-quality restoration with a compact model.

Details Motivation: The motivation is to overcome the limitations of deep learning-based image denoising methods that rely on specific noise distributions, which limits their generalization to unseen noise types and levels. Method: The method involves a Feature Extraction Module, Global Statistics and Local Correlation Modules, and a Kernel Prediction Module that generates pixel-wise varying kernels adapted to local structures for iterative denoising. Result: The compact model trained on single-level Gaussian noise performs well across diverse noise types and levels, demonstrating efficiency and superior restoration quality. Conclusion: The paper concludes that the proposed method of dynamically generating kernels through efficient operations improves resilience to unseen noise and prevents overfitting, making it promising for practical image denoising. Abstract: Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning-based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources but they still suffer from overfitting. To address these issues, we conduct image denoising by utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improves resilience to unseen noise. Specifically, our method leverages a Feature Extraction Module for robust noise-invariant features, Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module then employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. 
Despite being trained on single-level Gaussian noise, our compact model (~ 0.04 M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
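
The iterative, pixel-wise filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `apply_pixelwise_kernels`, `iterative_denoise`, and the placeholder predictor `box_kernels` are hypothetical names, and the box filter merely stands in for a learned Kernel Prediction Module.

```python
import numpy as np

def apply_pixelwise_kernels(image, kernels):
    """Filter a 2-D image with a different k x k kernel at every pixel.

    image:   (H, W) float array
    kernels: (H, W, k, k) float array, one kernel per pixel
             (hypothetical stand-in for a kernel prediction network's output)
    """
    h, w = image.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(image, pad, mode="reflect")
    out = np.empty_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.sum(patch * kernels[y, x])
    return out

def iterative_denoise(image, predict_kernels, steps=3):
    """Re-predict the kernels and re-filter the image `steps` times."""
    x = image.astype(float)
    for _ in range(steps):
        x = apply_pixelwise_kernels(x, predict_kernels(x))
    return x

def box_kernels(image, k=3):
    """Placeholder predictor: the same normalized box kernel at every pixel."""
    h, w = image.shape
    return np.full((h, w, k, k), 1.0 / (k * k))
```

Calling `iterative_denoise(noisy, box_kernels, steps=3)` simply blurs the image three times; in the setting the abstract describes, the predictor would instead return kernels adapted to local structure.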

[99] Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models

Hou Xia,Zheren Fu,Fangcan Ling,Jiajun Li,Yi Tu,Zhendong Mao,Yongdong Zhang

Main category: cs.CV

TL;DR: Video-LevelGauge is a new benchmark for measuring positional bias in video language models. It reveals that many open-source models exhibit such bias, while commercial models like Gemini2.5-Pro perform consistently.

Details Motivation: The motivation stems from the lack of benchmarks focusing on contextual positional bias in LVLMs despite its importance in real-world applications. Method: The authors developed Video-LevelGauge, a benchmark using standardized probes and contextual setups, and applied statistical and morphological analysis to evaluate 27 LVLMs. Result: The benchmark includes 438 videos and 1,177 multiple-choice questions, revealing that many open-source models exhibit head or neighbor-content preferences, unlike commercial models. Conclusion: The study concludes that Video-LevelGauge effectively assesses positional bias in LVLMs, revealing significant biases in open-source models while commercial models like Gemini2.5-Pro perform consistently well. Abstract: Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. 
Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement.

[100] Scalable Object Detection in the Car Interior With Vision Foundation Models

Bálint Mészáros,Ahmet Firintepe,Sebastian Schmidt,Stephan Günnemann

Main category: cs.CV

TL;DR: This paper proposes ODAL, a new framework for car-interior scene understanding that leverages vision foundation models through a distributed architecture, and introduces a new evaluation metric, ODALbench. The fine-tuned ODAL-LLaVA model outperforms GPT-4o while reducing hallucinations.

Details Motivation: The computational resources of on-board systems are highly constrained, which limits the direct deployment of complex solutions in the vehicle. A new approach is therefore needed to overcome these resource limits and improve response quality for in-car object detection and localization. Method: The study adopts a distributed architecture that splits computation between the on-board system and the cloud, using vision foundation models for car-interior scene understanding. It also introduces a new metric, ODALbench, compares GPT-4o with the lightweight LLaVA 1.5 7B model, and explores how fine-tuning affects the lightweight model's performance. Result: The fine-tuned ODAL-LLaVA model reaches an ODAL$_{score}$ of 89%, a 71% improvement over its baseline. It also maintains high detection accuracy while significantly reducing hallucinations, with an ODAL$_{SNR}$ three times that of GPT-4o. Conclusion: The fine-tuned ODAL-LLaVA model performs well on in-car object detection and localization, outperforming GPT-4o by nearly 20% while significantly reducing hallucinations and achieving an ODAL$_{SNR}$ three times that of GPT-4o. Abstract: AI tasks in the car interior, like identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.

[101] Self-Rewarding Vision-Language Model via Reasoning Decomposition

Zongxia Li,Wenhao Yu,Chengsong Huang,Rui Liu,Zhenwen Liang,Fuxiao Liu,Jingxi Che,Dian Yu,Jordan Boyd-Graber,Haitao Mi,Dong Yu

Main category: cs.CV

TL;DR: Vision-SR1 is a self-rewarding method that enhances visual reasoning in VLMs by decomposing the process into visual perception and language reasoning, reducing hallucinations and language shortcuts without external supervision.

Details Motivation: VLMs often suffer from visual hallucinations and language shortcuts due to sparse visual signals and reliance on text priors. Existing methods using external supervision face challenges like cost and distributional shifts. Method: Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. It generates self-contained visual perceptions and uses the same VLM model to compute reward through language reasoning on these perceptions. Result: Vision-SR1 effectively improves visual reasoning and reduces visual hallucinations and language shortcuts without relying on external visual supervision. Conclusion: Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks. Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervisions via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. 
The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.

[102] Hardware-aware vs. Hardware-agnostic Energy Estimation for SNN in Space Applications

Matthias Höfflin,Jürgen Wassner

Main category: cs.CV

TL;DR: This paper investigates the energy efficiency of Spiking Neural Networks (SNNs) for 3-D satellite position estimation from monocular images, comparing hardware-aware and hardware-agnostic energy estimation methods. While SNNs achieved comparable performance to CNNs, significant energy savings were only observed under specific conditions such as neuromorphic hardware and high input sparsity, highlighting the importance of transparent evaluation methods for fair comparisons of neural network efficiency.

Details Motivation: SNNs are considered energy-efficient due to their biological inspiration, making them appealing for resource-constrained applications like space technology. However, recent studies have questioned this assumption, especially for digital implementations, prompting a closer investigation into their actual energy efficiency. Method: The authors investigated SNNs for multi-output regression tasks using a 3-D satellite position estimation dataset from monocular images. They compared hardware-aware and hardware-agnostic energy estimation methods, using a spiking neural network trained with the membrane potential of Leaky Integrate-and-Fire (LIF) neurons in the final layer. Result: The proposed SNN achieved comparable performance, measured by Mean Squared Error (MSE), to a reference Convolutional Neural Network (CNN). While hardware-agnostic methods predicted a 50-60% energy advantage for SNNs over CNNs, hardware-aware analysis showed that significant energy savings were only realized on neuromorphic hardware and with high input sparsity. The study also quantified how the dark pixel ratio influenced energy consumption. Conclusion: The paper concludes that the energy efficiency of SNNs compared to conventional ANNs is not as straightforward as previously thought and heavily depends on hardware implementation and input data characteristics. Abstract: Spiking Neural Networks (SNNs), inspired by biological intelligence, have long been considered inherently energy-efficient, making them attractive for resource-constrained domains such as space applications. However, recent comparative studies with conventional Artificial Neural Networks (ANNs) have begun to question this reputation, especially for digital implementations. This work investigates SNNs for multi-output regression, specifically 3-D satellite position estimation from monocular images, and compares hardware-aware and hardware-agnostic energy estimation methods. 
The proposed SNN, trained using the membrane potential of the Leaky Integrate-and-Fire (LIF) neuron in the final layer, achieves comparable Mean Squared Error (MSE) to a reference Convolutional Neural Network (CNN) on a photorealistic satellite dataset. Energy analysis shows that while hardware-agnostic methods predict a consistent 50-60% energy advantage for SNNs over CNNs, hardware-aware analysis reveals that significant energy savings are realized only on neuromorphic hardware and with high input sparsity. The influence of dark pixel ratio on energy consumption is quantified, emphasizing the impact of data characteristics and hardware assumptions. These findings highlight the need for transparent evaluation methods and explicit disclosure of underlying assumptions to ensure fair comparisons of neural network energy efficiency.
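
To make the hardware-agnostic side of the comparison concrete, one common form of such an estimate counts operations and multiplies by a fixed per-operation energy: multiply-accumulates (MACs) for the CNN, and spike-triggered accumulates (ACs) for the SNN. The sketch below assumes that form; it is not the paper's model. The per-operation energies are illustrative placeholders, and the layer sizes, spike rates, and function names are hypothetical.

```python
# Assumed per-operation energies in picojoules (illustrative placeholders,
# not values taken from the paper).
E_MAC_PJ = 4.6  # one multiply-accumulate
E_AC_PJ = 0.9   # one accumulate

def cnn_energy_pj(macs_per_layer):
    """Hardware-agnostic CNN estimate: every MAC is executed once."""
    return E_MAC_PJ * sum(macs_per_layer)

def snn_energy_pj(synaptic_ops_per_layer, spike_rates, timesteps):
    """Hardware-agnostic SNN estimate.

    synaptic_ops_per_layer: ops each layer would perform if every input spiked
    spike_rates:            average fraction of inputs spiking per timestep
    timesteps:              number of simulation timesteps
    Only spiking inputs trigger an accumulate, so sparser input costs less.
    """
    acs = sum(ops * rate for ops, rate in zip(synaptic_ops_per_layer, spike_rates))
    return E_AC_PJ * acs * timesteps

# Hypothetical two-layer network
macs = [1_000_000, 500_000]
rates = [0.05, 0.10]
ratio = snn_energy_pj(macs, rates, timesteps=8) / cnn_energy_pj(macs)
```

An estimate of this kind ignores memory access, data movement, and idle power, which is one reason a hardware-aware analysis can reach a different conclusion, as the abstract reports.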

[103] A Frequency-Aware Self-Supervised Learning for Ultra-Wide-Field Image Enhancement

Weicheng Liao,Zan Chen,Jianyang Xie,Yalin Zheng,Yuhui Ma,Yitian Zhao

Main category: cs.CV

TL;DR: This paper introduces a novel frequency-aware self-supervised learning method for ultra-wide-field retinal image enhancement that effectively improves image quality and disease diagnosis performance.

Details Motivation: Ultra-wide-field (UWF) retinal imaging often suffers from blurring and uneven illumination, which obscure fine details and hide pathological information. Although many enhancement methods have been proposed for other fundus images, they often fail to meet the specific requirements of UWF, in particular the need to preserve pathological details. Method: The paper uses a frequency-aware self-supervised learning method comprising frequency-decoupled image deblurring and Retinex-guided illumination compensation modules. Result: Experiments show that the method not only improves visualization quality but also improves disease diagnosis performance by restoring and correcting fine local details and uneven intensity. Conclusion: The paper proposes a new UWF retinal image enhancement method that offers a robust and clinically valuable tool for improving retinal disease management. Abstract: Ultra-Wide-Field (UWF) retinal imaging has revolutionized retinal diagnostics by providing a comprehensive view of the retina. However, it often suffers from quality-degrading factors such as blurring and uneven illumination, which obscure fine details and mask pathological information. While numerous retinal image enhancement methods have been proposed for other types of fundus imagery, they often fail to address the unique requirements of UWF, particularly the need to preserve pathological details. In this paper, we propose a novel frequency-aware self-supervised learning method for UWF image enhancement. It incorporates frequency-decoupled image deblurring and Retinex-guided illumination compensation modules. An asymmetric channel integration operation is introduced in the former module to combine global and local views by leveraging high- and low-frequency information, ensuring the preservation of fine and broader structural details. In addition, a color preservation unit is proposed in the latter Retinex-based module to provide multi-scale spatial and frequency information, enabling accurate illumination estimation and correction. Experimental results demonstrate that the proposed work not only enhances visualization quality but also improves disease diagnosis performance by restoring and correcting fine local details and uneven intensity. To the best of our knowledge, this work is the first attempt at UWF image enhancement, offering a robust and clinically valuable tool for improving retinal disease management.

[104] SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Texture 3D Human Reconstruction

Gangjian Zhang,Jian Shu,Nanjie Yao,Hao Wang

Main category: cs.CV

TL;DR: This paper proposes SAT, a novel framework for 3D human reconstruction that integrates multiple geometric priors and uses online data augmentation, achieving better results than current state-of-the-art methods.

Details Motivation: The research aims to overcome geometric ambiguity in single-view 2D images and the lack of 3D training data in 3D human reconstruction. Method: The SAT framework uses a two-process approach with a Supervisor Feature Regularization module for better fusion of geometric priors and an Online Animation Augmentation module to generate additional training data. Result: The SAT framework achieves high-quality textured 3D avatars and demonstrates superior performance on two benchmarks compared to existing methods. Conclusion: The proposed SAT framework effectively integrates multiple geometric priors and uses online augmentation to improve 3D human reconstruction, outperforming state-of-the-art methods. Abstract: Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. However, the geometric ambiguity inherent in a single 2D image and the scarcity of 3D human training data are the main obstacles limiting progress in this field. To address these issues, current methods employ prior geometric estimation networks to derive various human geometric forms, such as the SMPL model and normal maps. However, they struggle to integrate these modalities effectively, leading to view inconsistencies, such as facial distortions. To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. To further facilitate geometry learning, we introduce a Supervisor Feature Regularization module. By employing a multi-view network with the same structure to provide intermediate features as training supervision, these varied geometric priors can be better fused. To tackle data scarcity and further improve reconstruction quality, we also propose an Online Animation Augmentation module. 
By building a one-feed-forward animation network, we augment a massive number of samples from the original 3D human data online for model training. Extensive experiments on two benchmarks show the superiority of our approach compared to state-of-the-art methods.

[105] Synthetic Image Detection via Spectral Gaps of QC-RBIM Nishimori Bethe-Hessian Operators

V. S. Usatyuk,D. A. Sapozhnikov,S. I. Egorov

Main category: cs.CV

TL;DR: This paper introduces a novel unsupervised method for detecting synthetic images by constructing a Multi-Edge Type QC-LDPC graph based on deep image features and analyzing the Bethe-Hessian spectrum. The method is physics-inspired, model-agnostic, and robust to new generative architectures, achieving over 94% accuracy without labeled synthetic data or retraining.

Details Motivation: The motivation behind the research is the increasing ability of deep generative models like GANs and diffusion networks to produce images that are nearly indistinguishable from real photographs, posing challenges to media forensics and biometric security. Existing supervised and unsupervised detection methods have limitations in handling unseen generators or adversarial post-processing, prompting the need for a more robust and model-agnostic solution. Method: The method involves constructing a Multi-Edge Type QC-LDPC graph where image feature vectors, reduced to 32 dimensions using pretrained CNNs, serve as nodes. Pairwise similarities are transformed into edge couplings calibrated at the Nishimori temperature, forming a Random Bond Ising Model (RBIM). The detection relies on the Bethe-Hessian spectrum analysis, where real images exhibit a characteristic gap, unlike synthetic ones. This approach is unsupervised and does not rely on labeled synthetic data. Result: The proposed detector achieved over 94% accuracy on binary classification tasks (cat versus dog, male versus female) using real photos from Flickr-Faces-HQ and CelebA datasets and their synthetic counterparts generated by GANs and diffusion models. Spectral analysis revealed multiple well-separated gaps for real image sets and a collapsed spectrum for generated ones, validating the effectiveness of the method. Conclusion: The paper concludes that the proposed physics-inspired, model-agnostic detector is effective in distinguishing synthetic images from real ones, achieving over 94% accuracy without using labeled synthetic data or retraining the feature extractor. The method leverages the absence of a characteristic gap in the Bethe-Hessian spectrum of synthetic images and utilizes a novel LDPC graph construction with deep image features. 
Abstract: The rapid advance of deep generative models such as GANs and diffusion networks now produces images that are virtually indistinguishable from genuine photographs, undermining media forensics and biometric security. Supervised detectors quickly lose effectiveness on unseen generators or after adversarial post-processing, while existing unsupervised methods that rely on low-level statistical cues remain fragile. We introduce a physics-inspired, model-agnostic detector that treats synthetic-image identification as a community-detection problem on a sparse weighted graph. Image features are first extracted with pretrained CNNs and reduced to 32 dimensions; each feature vector becomes a node of a Multi-Edge Type QC-LDPC graph. Pairwise similarities are transformed into edge couplings calibrated at the Nishimori temperature, producing a Random Bond Ising Model (RBIM) whose Bethe-Hessian spectrum exhibits a characteristic gap when genuine community structure (real images) is present. Synthetic images violate the Nishimori symmetry and therefore lack such gaps. We validate the approach on two binary tasks, cat versus dog and male versus female, using real photos from Flickr-Faces-HQ and CelebA and synthetic counterparts generated by GANs and diffusion models. Without any labeled synthetic data or retraining of the feature extractor, the detector achieves over 94% accuracy. Spectral analysis shows multiple well-separated gaps for real image sets and a collapsed spectrum for generated ones. Our contributions are threefold: a novel LDPC graph construction that embeds deep image features; an analytical link between the Nishimori-temperature RBIM and the Bethe-Hessian spectrum, providing a Bayes-optimal detection criterion; and a practical, unsupervised synthetic image detector robust to new generative architectures. Future work will extend the framework to video streams and multi-class anomaly detection.
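
The idea of reading community structure from a Bethe-Hessian spectrum can be illustrated on a toy graph. The sketch below does not reproduce the paper's weighted, Nishimori-temperature operator on a QC-LDPC graph. As a stated simplification, it uses the standard unweighted Bethe-Hessian $H(r) = (r^2 - 1)I - rA + D$ with $r$ set to the square root of the mean degree, for which the number of negative eigenvalues is commonly used as an estimate of the number of communities. The helper names and the toy graph are hypothetical.

```python
import numpy as np

def bethe_hessian(adj, r):
    """Unweighted Bethe-Hessian H(r) = (r^2 - 1) I - r A + D."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = np.diag(adj.sum(axis=1))
    return (r ** 2 - 1.0) * np.eye(n) - r * adj + deg

def count_negative_eigenvalues(adj, r=None):
    """Return (number of negative eigenvalues, sorted eigenvalues) of H(r)."""
    adj = np.asarray(adj, dtype=float)
    if r is None:
        r = np.sqrt(adj.sum() / adj.shape[0])  # sqrt of the mean degree
    eigvals = np.linalg.eigvalsh(bethe_hessian(adj, r))
    return int(np.sum(eigvals < 0)), eigvals

def two_cliques(n):
    """Toy graph: two disjoint cliques of n nodes each (two clear communities)."""
    block = np.ones((n, n)) - np.eye(n)
    zero = np.zeros((n, n))
    return np.block([[block, zero], [zero, block]])

num_communities, spectrum = count_negative_eigenvalues(two_cliques(5))
```

On this toy graph the two negative eigenvalues sit well apart from the rest of the spectrum; the abstract's claim is that real image sets show such gaps in the weighted operator while generated sets do not.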

[106] LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation

Yupeng Zhang,Dezhi Zheng,Ping Lu,Han Zhang,Lei Wang,Liping Xiang,Cheng Luo,Kaijun Deng,Xiaowen Fu,Linlin Shen,Jinbao Wang

Main category: cs.CV

TL;DR: LabelGS is an enhanced 3D Gaussian Splatting method that addresses key problems in 3D scene segmentation, improving both efficiency and performance.

Details Motivation: 3D Gaussian Splatting lacks 3D segmentation ability, which limits its applicability to scene understanding tasks. Method: LabelGS augments the Gaussian representation with object labels and uses an Occlusion Analysis Model, a Main Gaussian Labeling model, and a Gaussian Projection Filter to optimize 3D scene segmentation. Result: LabelGS outperforms Feature-3DGS on 3D scene segmentation and achieves a 22× training speedup. Conclusion: LabelGS effectively addresses 3D Gaussian Splatting's lack of 3D segmentation ability and significantly improves training efficiency. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. However, 3DGS lacks 3D segmentation ability, which limits its applicability in tasks that require scene understanding. Identifying and isolating specific object components is crucial. To address this limitation, we propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object labels. LabelGS introduces cross-view consistent semantic masks for 3D Gaussians and employs a novel Occlusion Analysis Model to avoid overfitting to occlusion during optimization, a Main Gaussian Labeling model to lift 2D semantic priors to 3D Gaussians, and a Gaussian Projection Filter to avoid Gaussian label conflicts. Our approach achieves effective decoupling of Gaussian representations and refines the 3DGS optimization process through a random region sampling strategy, significantly improving efficiency. Extensive experiments demonstrate that LabelGS outperforms previous state-of-the-art methods, including Feature-3DGS, in the 3D scene segmentation task. Notably, LabelGS achieves a remarkable 22× speedup in training compared to Feature-3DGS, at a resolution of 1440×1080. Our code will be available at https://github.com/garrisonz/LabelGS.

[107] FreeVPS: Repurposing Training-Free SAM2 for Generalizable Video Polyp Segmentation

Qiang Hu,Ying Zhou,Gepeng Ji,Nick Barnes,Qiang Li,Zhiwei Wang

Main category: cs.CV

TL;DR: FreeVPS adapts SAM2 with two training-free modules to achieve efficient, stable video polyp segmentation suitable for real clinical scenarios.

Details Motivation: Existing video polyp segmentation methods struggle to balance spatiotemporal modeling and domain generalization, limiting their use in real clinical scenarios. Method: The video polyp segmentation task is recast as a track-by-detect paradigm that uses the spatial context of an image polyp segmentation model and the temporal modeling capability of SAM2, with two added modules: an intra-association filtering module that removes spatial inaccuracies and an inter-association refinement module that prevents error propagation. Result: FreeVPS achieves state-of-the-art performance in both in-domain and out-of-domain scenarios and shows robust tracking in long videos, making it suitable for reliable clinical analysis. Conclusion: With two training-free modules, FreeVPS effectively addresses SAM2's error accumulation during long-term polyp tracking, achieves high-performance video polyp segmentation both in-domain and out-of-domain, and demonstrates robust tracking in untrimmed colonoscopy videos. Abstract: Existing video polyp segmentation (VPS) paradigms usually struggle to balance spatiotemporal modeling and domain generalization, limiting their applicability in real clinical scenarios. To address this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by the image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of segment anything model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue by repurposing SAM2 as a video polyp segmenter with two training-free modules. In particular, the intra-association filtering module eliminates spatial inaccuracies originating from the detection stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. Both modules work synergistically to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS in long, untrimmed colonoscopy videos, underscoring its potential for reliable clinical analysis.

[108] Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning

Stelios Mylonas,Symeon Papadopoulos

Main category: cs.CV

TL;DR: This paper proposes a video deepfake detection framework with strong generalization that builds on the rich facial representations learned by face foundation models and improves detection with triplet loss variants and attribution-based supervision schemes.

Details Motivation: The growing realism and accessibility of deepfakes have raised concerns about media authenticity and information integrity, and existing detection models generalize poorly to real-world data. Method: The approach builds on FSFM, a self-supervised model trained on real face data, and fine-tunes it on an ensemble of deepfake datasets covering face-swapping and face-reenactment manipulations. Triplet loss variants are introduced to increase discriminative power, and attribution-based supervision schemes are explored to assess their effect on generalization. Result: Extensive experiments on diverse evaluation benchmarks show that the method performs well in challenging real-world scenarios, demonstrating its effectiveness and strong generalization. Conclusion: The proposed deepfake detection framework performs strongly on real-world data and offers a new direction for tackling deepfakes. Abstract: The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that takes advantage of the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned using an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings between real and fake samples. Additionally, we explore attribution-based supervision schemes, where deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.
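
For readers unfamiliar with the metric-learning term, the basic triplet loss that such variants build on can be sketched as below. This is the textbook margin-based form with Euclidean distance, not the specific variants used in the paper; the function name, margin value, and embedding shapes are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Basic margin-based triplet loss over a batch of embeddings.

    anchor, positive, negative: (batch, dim) arrays. The anchor and positive
    share a class (e.g. both real), the negative comes from the other class.
    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Hypothetical 2-D embeddings: the positive is close, the negative is far
a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 1.0]])
n = np.array([[3.0, 4.0]])
well_separated = triplet_loss(a, p, n)  # already satisfies the margin
violated = triplet_loss(a, n, p)        # positive and negative swapped
```

Minimizing this quantity pulls same-class embeddings together and pushes real and fake embeddings apart, which is the separability the abstract refers to.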

[109] POEv2: a flexible and robust framework for generic line segment detection and wireframe line segment detection

Chenguang Liu,Chisheng Wang,Yuhua Cai,Chuanhua Zhu,Qingquan Li

Main category: cs.CV

TL;DR: This paper proposes POEv2, an improved line segment detection framework that handles both generic and wireframe line segment detection and performs well on multiple datasets.

Details Motivation: Existing line segment detectors fall into generic and wireframe detectors; because they are designed for different goals, each performs poorly on the other's task. A robust framework that handles both is therefore needed. Method: POEv2 detects line segments from edge strength maps; it improves on the Pixel Orientation Estimation (POE) method and can be combined with any edge detector. Result: Experiments show that POEv2 combined with an efficient edge detector achieves state-of-the-art performance on three public datasets. Conclusion: POEv2 is an improved line segment detection framework that handles both generic and wireframe line segment detection and, combined with an efficient edge detector, achieves state-of-the-art performance on three public datasets. Abstract: Line segment detection in images has been studied for several decades. Existing line segment detectors can be roughly divided into two categories: generic line segment detectors and wireframe line segment detectors. Generic line segment detectors aim to detect all meaningful line segments in images, and traditional approaches usually fall into this category. Recent deep learning based approaches are mostly wireframe line segment detectors. They detect only line segments that are geometrically meaningful and have large spatial support. Because the design goals differ, generic line segment detectors do not perform satisfactorily on wireframe line segment detection, and vice versa. In this work, we propose a robust framework that can be used for both generic line segment detection and wireframe line segment detection. The proposed method is an improved version of the Pixel Orientation Estimation (POE) method and is thus named POEv2. POEv2 detects line segments from edge strength maps and can be combined with any edge detector. We show in our experiments that by combining the proposed POEv2 with an efficient edge detector, it achieves state-of-the-art performance on three publicly available datasets.

[110] SPLF-SAM: Self-Prompting Segment Anything Model for Light Field Salient Object Detection

Qiyao Xu,Qiming Wu,Xiaowei Li

Main category: cs.CV

TL;DR: This paper proposes SPLF-SAM, which combines UMFEB and MAFA modules to improve small-object detection in LF SOD and outperforms existing methods.

Details Motivation: Existing models neglect prompt information extraction and frequency-domain analysis, leaving small objects vulnerable to noise. Method: The proposed SPLF-SAM model includes a UMFEB module for multi-scale feature extraction and a MAFA module for frequency-domain filtering and denoising. Result: Experiments show that SPLF-SAM outperforms existing SOTA methods on multiple LF SOD datasets. Conclusion: With its UMFEB and MAFA modules, SPLF-SAM performs strongly on LF SOD and outperforms existing SOTA methods. Abstract: Segment Anything Model (SAM) has demonstrated remarkable capabilities in solving light field salient object detection (LF SOD). However, most existing models tend to neglect the extraction of prompt information under this task. Meanwhile, traditional models ignore the analysis of frequency-domain information, which leads to small objects being overwhelmed by noise. In this paper, we put forward a novel model called self-prompting light field segment anything model (SPLF-SAM), equipped with a unified multi-scale feature embedding block (UMFEB) and a multi-scale adaptive filtering adapter (MAFA). UMFEB is capable of identifying multiple objects of varying sizes, while MAFA, by learning frequency features, effectively prevents small objects from being overwhelmed by noise. Extensive experiments have demonstrated the superiority of our method over ten state-of-the-art (SOTA) LF SOD methods. Our code will be available at https://github.com/XucherCH/splfsam.

[111] FastAvatar: Towards Unified Fast High-Fidelity 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

Yue Wu,Yufan Wu,Wen Li,Yuxi Lu,Kairui Feng,Xuanhong Chen

Main category: cs.CV

TL;DR: FastAvatar is a fast and flexible 3D avatar framework that reconstructs high-quality 3D models in seconds, using a single model and diverse input data types, overcoming issues of prior methods like high complexity and low data utilization.

Details Motivation: 3D avatar reconstruction faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. FastAvatar was developed to overcome these issues and provide a fast, flexible, and high-quality solution. Method: FastAvatar uses a Large Gaussian Reconstruction Transformer with three key components: a VGGT-style transformer for aggregating multi-frame cues and predicting canonical 3DGS representations, multi-granular guidance encoding to handle animation-induced misalignment, and incremental Gaussian aggregation through landmark tracking and sliced fusion losses. Result: FastAvatar can reconstruct a high-quality 3DGS model within seconds using diverse daily recordings and a single unified model. Experimental results demonstrate that it offers higher quality and competitive speed compared to existing methods. Conclusion: FastAvatar offers a highly usable 3D avatar modeling solution by enabling fast, flexible, and high-quality reconstruction with improved data utilization and efficiency. Abstract: Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. 
FastAvatar's core is a Large Gaussian Reconstruction Transformer featuring three key designs: first, a VGGT-style transformer variant that aggregates multi-frame cues while injecting an initial 3D prompt to predict an aggregatable canonical 3DGS representation; second, multi-granular guidance encoding (camera pose, FLAME expression, head pose) that mitigates animation-induced misalignment for variable-length inputs; third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations, unlike prior work that wastes input data. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar has higher quality and highly competitive speed compared to existing methods.

[112] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions

Ahmed Emam,Mohamed Elbassiouny,Julius Miller,Patrick Donworth,Sabine Seidel,Ribana Roscher

Main category: cs.CV

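TL;DR: The study introduces BuzzSet, a new large-scale image dataset of pollinator insects, and achieves high-accuracy automated detection with the RF-DETR model.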

Details Motivation: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, but their populations are declining due to increasing anthropogenic and environmental stressors. BuzzSet, a new large-scale pollinator image dataset, is introduced to support scalable, automated pollinator monitoring. Method: Initial annotations were generated with a YOLOv12 model and then refined through human verification; strong baselines are provided with the RF-DETR transformer-based object detector. Result: The model achieves F1-scores of 0.94 and 0.92 for the honeybee and bumblebee classes, respectively, and the confusion matrix shows minimal misclassification between these classes. The unidentified class remains more challenging due to label ambiguity and lower sample frequency, yet still contributes useful insights for robustness evaluation. Overall detection quality is strong, with a best mAP@0.50 of 0.559. Conclusion: BuzzSet offers a valuable benchmark for small object detection, class separation under label noise, and ecological computer vision. Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to increasing anthropogenic and environmental stressors. To support scalable, automated pollinator monitoring, we introduce BuzzSet, a new large-scale dataset of high-resolution pollinator images collected in real agricultural field conditions. BuzzSet contains 7856 manually verified and labeled images, with over 8000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were generated using a YOLOv12 model trained on external data and refined via human verification using open-source labeling tools. All images were preprocessed into $256 \times 256$ tiles to improve the detection of small insects. We provide strong baselines using the RF-DETR transformer-based object detector. The model achieves high F1-scores of 0.94 and 0.92 for honeybee and bumblebee classes, respectively, with confusion matrix results showing minimal misclassification between these categories. The unidentified class remains more challenging due to label ambiguity and lower sample frequency, yet still contributes useful insights for robustness evaluation. Overall detection quality is strong, with a best mAP@0.50 of 0.559. BuzzSet offers a valuable benchmark for small object detection, class separation under label noise, and ecological computer vision.
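The tiling step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' preprocessing code: the function name `tile_boxes` is hypothetical, and the abstract does not say how image borders are handled, so shifting the last tile inward (which makes it overlap its neighbor) is an assumption made here for illustration.

```python
def tile_boxes(width, height, tile=256):
    """Return (left, top, right, bottom) crop boxes that cover an image
    with tile x tile windows. Hypothetical helper, not from the paper."""
    if width < tile or height < tile:
        raise ValueError("image is smaller than one tile")

    def starts(size):
        # Regular grid positions along one axis.
        positions = list(range(0, size - tile + 1, tile))
        # Assumption: if the grid does not reach the border, add one more
        # tile shifted inward so every pixel is covered.
        if positions[-1] + tile < size:
            positions.append(size - tile)
        return positions

    return [(x, y, x + tile, y + tile)
            for y in starts(height)
            for x in starts(width)]


if __name__ == "__main__":
    # Example image size chosen for illustration only.
    for box in tile_boxes(600, 256):
        print(box)
```

Each box can be passed to an image library's crop function; bounding-box annotations would need to be clipped and offset by the same `(left, top)` values.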

[113] AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning

Shu Shen,C. L. Philip Chen,Tong Zhang

Main category: cs.CV

TL;DR: The paper proposes Adaptive Intra-Network Modulation (AIM) to address imbalanced multimodal learning by tackling optimization bias, achieving better performance without compromising dominant or weak modalities.

Details Motivation: Existing methods hinder dominant modality learning to boost weaker ones, causing suboptimal performance. This work addresses the overlooked issue of optimization bias within networks. Method: AIM decouples under-optimized parameters of the dominant modality into Auxiliary Blocks and adaptively adjusts modulation strength across network depths for balanced learning. Result: Experimental results show AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks with strong generalizability. Conclusion: Adaptive Intra-Network Modulation (AIM) improves balanced modality learning in multimodal networks by addressing optimization bias, enhancing performance across various backbones and fusion strategies. Abstract: Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. 
This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.

[114] The Return of Structural Handwritten Mathematical Expression Recognition

Jakob Seitz,Tobias Lengfeld,Radu Timofte

Main category: cs.CV

TL;DR: The paper introduces a structural recognition approach for Handwritten Mathematical Expression Recognition with an automatic annotation system and a modular structural recognition system that achieves competitive performance on the CROHME-2023 benchmark while enabling transparent error analysis and interpretable outputs.

Details Motivation: The motivation for this research is the lack of explicit symbol-to-trace alignment in modern encoder-decoder architectures with large language models for LaTeX generation. This limitation hinders error analysis, interpretability, and spatially aware interactive applications requiring selective content updates. Method: The paper proposes a structural recognition approach with an automatic annotation system and a modular structural recognition system. The automatic annotation system uses a neural network to map LaTeX equations to raw traces for symbol segmentation, classification, and spatial relations. The modular structural recognition system independently optimizes segmentation, classification, and relation prediction. Result: The proposed recognition system combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to achieve competitive performance on the CROHME-2023 benchmark. Conclusion: The paper concludes that their structural recognition system achieves competitive performance on the CROHME-2023 benchmark and enables transparent error analysis and interpretable outputs by generating a complete graph structure linking handwritten traces to predicted symbols. Abstract: Handwritten Mathematical Expression Recognition is foundational for educational technologies, enabling applications like digital note-taking and automated grading. While modern encoder-decoder architectures with large language models excel at LaTeX generation, they lack explicit symbol-to-trace alignment, a critical limitation for error analysis, interpretability, and spatially aware interactive applications requiring selective content updates. 
This paper introduces a structural recognition approach with two innovations: (1) an automatic annotation system that uses a neural network to map LaTeX equations to raw traces, automatically generating annotations for symbol segmentation, classification, and spatial relations, and (2) a modular structural recognition system that independently optimizes segmentation, classification, and relation prediction. By leveraging a dataset enriched with structural annotations from our auto-labeling system, the proposed recognition system combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to achieve competitive performance on the CROHME-2023 benchmark. Crucially, our structural recognition system generates a complete graph structure that directly links handwritten traces to predicted symbols, enabling transparent error analysis and interpretable outputs.

[115] MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction

Han Jiao,Jiakai Sun,Yexing Xu,Lei Zhao,Wei Xing,Huaizhong Lin

Main category: cs.CV

TL;DR: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo) improves dynamic scene reconstruction by specializing modeling for high-dynamic regions and ensuring visual continuity with a cross-frame consistency loss.

Details Motivation: Deformation-based dynamic scene reconstruction methods using 3D Gaussian Splatting often result in blurred renderings and loss of fine motion details in highly dynamic regions due to the limitations of a single unified model. Method: MAPo introduces a dynamic score-based partitioning strategy to classify high- and low-dynamic 3D Gaussians, with recursive temporal partitioning and duplicated deformation networks for high-dynamic regions, and a cross-frame consistency loss to ensure visual continuity. Result: Experiments show that MAPo achieves superior rendering quality compared to baseline methods, particularly in regions with complex or rapid motions, while maintaining comparable computational costs. Conclusion: MAPo provides a high-fidelity solution for dynamic scene reconstruction by addressing the limitations of deformation-based methods through temporal partitioning and cross-frame consistency. Abstract: 3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. 
Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.

[116] StableIntrinsic: Detail-preserving One-step Diffusion Model for Multi-view Material Estimation

Xiuchao Wu,Pengfei Zhu,Jiangjing Lyu,Xinguo Liu,Jie Guo,Yanwen Guo,Weiwei Xu,Chengfei Lyu

Main category: cs.CV

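TL;DR: StableIntrinsic is a one-step diffusion model for multi-view material estimation that achieves higher-quality estimates by addressing over-smoothing and eliminating the detail loss caused by VAE encoding.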

Details Motivation: Existing diffusion-based material estimation methods use a multi-step denoising strategy, which is time-consuming for each estimation and, because inference is stochastic, leads to high variance in the estimated results. Method: StableIntrinsic uses a one-step diffusion model to estimate material parameters, applies losses in pixel space that are designed around the properties of each material, and introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding. Result: Experiments show that StableIntrinsic improves albedo PSNR by 9.9% over the current state of the art and reduces the MSE for metallic and roughness by 44.4% and 60.0%, respectively. Conclusion: StableIntrinsic is a one-step diffusion model for multi-view material estimation that, by addressing over-smoothing and eliminating the detail loss caused by VAE encoding, achieves higher estimation quality than existing techniques. Abstract: Recovering material information from images has been extensively studied in computer graphics and vision. Recent works in material estimation leverage diffusion models, showing promising results. However, these diffusion-based methods adopt a multi-step denoising strategy, which is time-consuming for each estimation. Such stochastic inference also conflicts with the deterministic material estimation task, leading to high variance in the estimated results. In this paper, we introduce StableIntrinsic, a one-step diffusion model for multi-view material estimation that can produce high-quality material parameters with low variance. To address the over-smoothing problem in one-step diffusion, StableIntrinsic applies losses in pixel space, with each loss designed based on the properties of the material. Additionally, StableIntrinsic introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding, while further enhancing the sharpness of material prediction results. The experimental results indicate that our method surpasses the current state-of-the-art techniques by achieving a $9.9\%$ improvement in the Peak Signal-to-Noise Ratio (PSNR) of albedo, and by reducing the Mean Square Error (MSE) for metallic and roughness by $44.4\%$ and $60.0\%$, respectively.

[117] Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models

Shay Shomer Chai,Wenxuan Peng,Bharath Hariharan,Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: This paper studies challenges in multi-object semantic alignment for text-to-image generation, particularly focusing on color attributes, and proposes an effective editing technique to address these issues.

Details Motivation: Contemporary text-to-image generation methods struggle with complex multi-object prompts, and current evaluation metrics are not sufficient. This work aims to address and evaluate semantic misalignments through a focused study on color attributes. Method: The authors performed a case study on colors as a fundamental attribute in text prompts, analyzed pretrained models' performance, and introduced a dedicated image editing technique to improve multi-object semantic alignment. Result: The analysis revealed that pretrained models struggle more with multi-color prompts than single-color ones, and existing methods do not reliably resolve these misalignments. The proposed image editing technique significantly improves performance across various metrics. Conclusion: The paper concludes that existing text-to-image generation methods struggle with multi-color prompts, and the introduced dedicated image editing technique effectively mitigates this semantic alignment issue. Abstract: Text-to-image generation has recently seen remarkable success, giving users the ability to create high-quality images from text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct at a larger scale. In this work, we perform a case study on colors, a fundamental attribute commonly associated with objects in text prompts, which offers a rich test bed for rigorous evaluation. 
Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes (far more so than with single-color prompts) and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.

[118] FusionSort: Enhanced Cluttered Waste Segmentation with Advanced Decoding and Comprehensive Modality Optimization

Muhammad Ali,Omar Ali AlSuwaidi

Main category: cs.CV

TL;DR: This paper proposes an improved Encoder-Decoder neural network for waste sorting automation, incorporating attention mechanisms and data fusion techniques, demonstrating superior performance in handling complex waste streams.

Details Motivation: Automating the sorting of non-biodegradable materials is challenging due to the complexity and variability of waste streams, which this paper aims to address. Method: The model uses an Encoder-Decoder structure with a Comprehensive Attention Block, Mamba architecture for attention, and a Data Fusion Block that applies PCA transformation to reduce dimensionality for multi-channel image processing. Result: The approach outperforms existing methods by a significant margin when evaluated on RGB, hyperspectral, multispectral, and combined RGB and hyperspectral data. Conclusion: The proposed enhanced neural architecture significantly improves the accuracy and efficiency of waste sorting systems compared to existing methods. Abstract: In the realm of waste management, automating the sorting process for non-biodegradable materials presents considerable challenges due to the complexity and variability of waste streams. To address these challenges, we introduce an enhanced neural architecture that builds upon an existing Encoder-Decoder structure to improve the accuracy and efficiency of waste sorting systems. Our model integrates several key innovations: a Comprehensive Attention Block within the decoder, which refines feature representations by combining convolutional and upsampling operations. In parallel, we utilize attention through the Mamba architecture, providing an additional performance boost. We also introduce a Data Fusion Block that fuses images with more than three channels. To achieve this, we apply PCA transformation to reduce the dimensionality while retaining the maximum variance and essential information across three dimensions, which are then used for further processing. We evaluated the model on RGB, hyperspectral, multispectral, and a combination of RGB and hyperspectral data. The results demonstrate that our approach outperforms existing methods by a significant margin.

[119] A bag of tricks for real-time Mitotic Figure detection

Christian Marzahl,Brian Napora

Main category: cs.CV

TL;DR: This paper proposes a robust and real-time method for mitotic figure detection in histopathology images using an efficient object detector and advanced training techniques, achieving high performance across diverse domains.

Details Motivation: Mitotic figure detection in histopathology images is challenging due to variations in slide scanners, staining protocols, tissue types, and the presence of artifacts. Method: The study builds on the efficient RTMDet single-stage object detector and addresses scanner variability and tumor heterogeneity through multi-domain training data, balanced sampling, careful augmentation, and targeted hard negative mining on necrotic and debris tissue. Result: In a grouped 5-fold cross-validation across multiple MF datasets, the model achieves an F1 score between 0.78 and 0.84. On the preliminary test set of the MIDOG 2025 challenge, the approach reaches an F1 of 0.81, outperforming larger models. Conclusion: The proposed solution offers a practical trade-off between accuracy and speed, making it attractive for real-world clinical adoption. Abstract: Mitotic figure (MF) detection in histopathology images is challenging due to large variations in slide scanners, staining protocols, tissue types, and the presence of artifacts. This paper presents a collection of training techniques - a bag of tricks - that enable robust, real-time MF detection across diverse domains. We build on the efficient RTMDet single stage object detector to achieve high inference speed suitable for clinical deployment. Our method addresses scanner variability and tumor heterogeneity via extensive multi-domain training data, balanced sampling, and careful augmentation. Additionally, we employ targeted, hard negative mining on necrotic and debris tissue to reduce false positives. In a grouped 5-fold cross-validation across multiple MF datasets, our model achieves an F1 score between 0.78 and 0.84. On the preliminary test set of the MItosis DOmain Generalization (MIDOG) 2025 challenge, our single-stage RTMDet-S based approach reaches an F1 of 0.81, outperforming larger models and demonstrating adaptability to new, unfamiliar domains. 
The proposed solution offers a practical trade-off between accuracy and speed, making it attractive for real-world clinical adoption.
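For readers less familiar with the metric reported above, the following minimal sketch shows how a detection F1 score is computed from true-positive, false-positive, and false-negative counts. The counts in the example are made up for illustration and are not taken from the paper; how detections are matched to ground-truth mitotic figures is outside the scope of this sketch.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative counts only (not results from the paper).
    print(round(f1_score(tp=81, fp=19, fn=19), 2))
```

Because F1 ignores true negatives, it suits detection tasks like this one, where background regions vastly outnumber the objects of interest.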

[120] Context-aware Sparse Spatiotemporal Learning for Event-based Vision

Shenqi Wang,Guangzhi Tang

Main category: cs.CV

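TL;DR: This work proposes Context-aware Sparse Spatiotemporal Learning (CSSL), a new framework that achieves strong performance and high sparsity on event-based vision tasks and is suited to neuromorphic processing.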

Details Motivation: Event-based cameras offer high temporal resolution, high dynamic range, and robustness to motion blur for robot perception, but existing deep learning methods fail to fully exploit the sparsity of event data and are complicated to integrate into resource-constrained edge applications. Neuromorphic computing is energy-efficient, yet spiking neural networks struggle to match the performance of state-of-the-art models on complex tasks. Method: The Context-aware Sparse Spatiotemporal Learning (CSSL) framework dynamically regulates neuron activations to reduce activation density without explicit sparsity constraints. Result: On event-based object detection and optical flow estimation, CSSL achieves high activation sparsity while maintaining comparable or superior performance, without manual tuning of sparsity-inducing loss terms. Conclusion: CSSL achieves performance comparable or superior to existing state-of-the-art methods on event-based object detection and optical flow estimation while maintaining high neuronal sparsity, demonstrating its potential for neuromorphic processing. Abstract: Event-based cameras have emerged as a promising paradigm for robot perception, offering high temporal resolution, high dynamic range, and robustness to motion blur. However, existing deep learning-based event processing methods often fail to fully leverage the sparse nature of event data, complicating their integration into resource-constrained edge applications. While neuromorphic computing provides an energy-efficient alternative, spiking neural networks struggle to match the performance of state-of-the-art models in complex event-based vision tasks, like object detection and optical flow. Moreover, achieving high activation sparsity in neural networks is still difficult and often demands careful manual tuning of sparsity-inducing loss terms. Here, we propose Context-aware Sparse Spatiotemporal Learning (CSSL), a novel framework that introduces context-aware thresholding to dynamically regulate neuron activations based on the input distribution, naturally reducing activation density without explicit sparsity constraints. Applied to event-based object detection and optical flow estimation, CSSL achieves comparable or superior performance to state-of-the-art methods while maintaining extremely high neuronal sparsity. Our experimental results highlight CSSL's crucial role in enabling efficient event-based vision for neuromorphic processing.

[121] AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment

Kaixuan Lu,Mehmet Onurcan Kaya,Dim P. Papadopoulos

Main category: cs.CV

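TL;DR: This paper proposes AutoQ-VIS, a novel unsupervised framework that uses quality-guided self-training to address the synthetic-to-real domain gap in video instance segmentation; experiments show that it outperforms existing methods.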

Details Motivation: Video instance segmentation faces significant annotation challenges because it requires both pixel-level masks and temporal consistency labels. Recent unsupervised methods such as VideoCutLER eliminate the dependence on optical flow through synthetic data, but they remain constrained by the synthetic-to-real domain gap. Method: AutoQ-VIS is a novel unsupervised framework that establishes a closed-loop system between pseudo-label generation and automatic quality assessment, progressively adapting from synthetic to real videos. Result: Experiments show that AutoQ-VIS reaches state-of-the-art performance of 52.6 AP50 on the YouTubeVIS-2019 validation set, surpassing the previous state-of-the-art method VideoCutLER by 4.4%, while requiring no human annotations. Conclusion: Through its quality-guided self-training framework, AutoQ-VIS bridges the domain gap between synthetic and real data and demonstrates its effectiveness for unsupervised video instance segmentation. Abstract: Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4$\%$, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.

[122] ERSR: An Ellipse-constrained pseudo-label refinement and symmetric regularization framework for semi-supervised fetal head segmentation in ultrasound images

Linkuan Zhou,Zhexin Chen,Yufei Shen,Junlin Xu,Ping Xuan,Yixin Zhu,Yuqi Fang,Cong Cong,Leyi Wei,Ran Su,Jia Zhou,Qiangguo Jin

Main category: cs.CV

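TL;DR: This paper proposes ERSR, a new semi-supervised method for fetal head segmentation in ultrasound images, and achieves state-of-the-art results.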

Details Motivation: Automated fetal head segmentation remains challenging because of poor ultrasound image quality and a lack of annotated data. Method: The paper proposes ERSR, a new semi-supervised framework comprising a dual-scoring adaptive filtering strategy, ellipse-constrained pseudo-label refinement, and symmetry-based multiple consistency regularization. Result: With 10% and 20% labeled data, ERSR reaches Dice scores of 92.05% and 95.36% on the HC18 dataset and 91.68% and 93.70% on the PSFH dataset, respectively. Conclusion: ERSR segments fetal head ultrasound images effectively and reaches state-of-the-art performance. Abstract: Automated segmentation of the fetal head in ultrasound images is critical for prenatal monitoring. However, achieving robust segmentation remains challenging due to the poor quality of ultrasound images and the lack of annotated data. Semi-supervised methods alleviate the lack of annotated data but struggle with the unique characteristics of fetal head ultrasound images, making it challenging to generate reliable pseudo-labels and enforce effective consistency regularization constraints. To address this issue, we propose a novel semi-supervised framework, ERSR, for fetal head ultrasound segmentation. Our framework consists of the dual-scoring adaptive filtering strategy, the ellipse-constrained pseudo-label refinement, and the symmetry-based multiple consistency regularization. The dual-scoring adaptive filtering strategy uses boundary consistency and contour regularity criteria to evaluate and filter teacher outputs. The ellipse-constrained pseudo-label refinement refines these filtered outputs by fitting least-squares ellipses, which strengthens pixels near the center of the fitted ellipse and suppresses noise simultaneously. The symmetry-based multiple consistency regularization enforces multi-level consistency across perturbed images, symmetric regions, and between original predictions and pseudo-labels, enabling the model to capture robust and stable shape representations. Our method achieves state-of-the-art performance on two benchmarks. On the HC18 dataset, it reaches Dice scores of 92.05% and 95.36% with 10% and 20% labeled data, respectively. On the PSFH dataset, the scores are 91.68% and 93.70% under the same settings.
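The Dice scores quoted above measure the overlap between a predicted mask and a reference mask. As a minimal sketch of the standard definition, $\text{Dice} = 2|A \cap B| / (|A| + |B|)$, the function below works on binary masks stored as nested lists; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as 2D lists of 0/1."""
    if len(pred) != len(target) or any(
        len(rp) != len(rt) for rp, rt in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    # Pixels that are foreground in both masks.
    intersection = sum(
        p & t for rp, rt in zip(pred, target) for p, t in zip(rp, rt)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


if __name__ == "__main__":
    # Toy masks for illustration only.
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(dice_score(pred, target))
```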

[123] Gradient Rectification for Robust Calibration under Distribution Shift

Yilin Zhang,Cai Xu,You Wu,Ziyu Guan,Wei Zhao

Main category: cs.CV

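TL;DR: This paper proposes a new method that improves the calibration of deep neural networks under distribution shift without requiring target-domain information, balancing reliability and performance through low-frequency filtering and a gradient rectification mechanism.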

Details Motivation: The reliability of deep neural networks in safety-critical applications is limited by overconfident predictions and by calibration that degrades under distribution shift, while existing methods are restricted in practice because they require target-domain information. Method: Taking a frequency-domain perspective, the paper introduces a low-frequency filtering strategy that encourages the model to rely on domain-invariant features, combined with a gradient rectification mechanism that preserves in-distribution calibration. Result: Experiments on synthetic and real-world datasets (such as CIFAR-10/100-C and WILDS) show that the proposed method significantly improves calibration under distribution shift while maintaining strong in-distribution performance. Conclusion: The paper proposes a new calibration framework that improves the reliability of deep neural networks under distribution shift without access to target-domain information, while preserving in-distribution performance. Abstract: Deep neural networks often produce overconfident predictions, undermining their reliability in safety-critical applications. This miscalibration is further exacerbated under distribution shift, where test data deviates from the training distribution due to environmental or acquisition changes. While existing approaches improve calibration through training-time regularization or post-hoc adjustment, their reliance on access to or simulation of target domains limits their practicality in real-world scenarios. In this paper, we propose a novel calibration framework that operates without access to target domain information. From a frequency-domain perspective, we identify that distribution shifts often distort high-frequency visual cues exploited by deep models, and introduce a low-frequency filtering strategy to encourage reliance on domain-invariant features. However, such information loss may degrade In-Distribution (ID) calibration performance. Therefore, we further propose a gradient-based rectification mechanism that enforces ID calibration as a hard constraint during optimization. Experiments on synthetic and real-world shifted datasets, including CIFAR-10/100-C and WILDS, demonstrate that our method significantly improves calibration under distribution shift while maintaining strong in-distribution performance.

[124] Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models

Xiaoqi Wang,Yun Zhang,Weisi Lin

Main category: cs.CV

TL;DR: This paper proposes RA-MIQA, a machine-centric image quality assessment framework, which outperforms existing methods and highlights the limitations of human visual system-based metrics in evaluating machine vision system quality.

Details Motivation: Machine vision systems are vulnerable to performance degradation under adverse visual conditions, which necessitates a framework to quantify the impact of image degradations on these systems. Method: The researchers established an MIQA paradigm and developed the RA-MIQA model, supported by a large-scale database (MIQD-2.5M), and benchmarked against existing IQA metrics through extensive experiments. Result: RA-MIQA demonstrated superior performance with SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while revealing task-specific degradation sensitivities. Conclusion: The study concludes that RA-MIQA outperforms existing methods in assessing machine vision system quality, highlighting the inadequacy of human visual system-based metrics for this purpose. Abstract: Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. 
Results demonstrate RA-MIQA's superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: https://github.com/XiaoqiWang/MIQA.
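The SRCC figures above refer to the Spearman rank correlation coefficient, which compares the ordering of predicted quality scores with the ordering of reference scores. The sketch below is a plain-Python illustration of the standard definition (Pearson correlation of average ranks, which handles ties); it is not the authors' evaluation code, and the sample values are invented.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    predicted = [0.9, 0.4, 0.7, 0.1]   # invented model scores
    reference = [0.8, 0.5, 0.3, 0.2]   # invented reference scores
    print(srcc(predicted, reference))
```

Because only ranks matter, SRCC is unaffected by any monotonic rescaling of the predicted scores, which is why it is a common choice for comparing quality metrics with different output ranges.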

[125] KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts

Taebaek Hwang,Minseo Kim,Gisang Lee,Seonuk Kim,Hyunjun Eun

Main category: cs.CV

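TL;DR: This paper introduces KRETA, a benchmark for Korean text-rich visual question answering that fills a gap for low-resource languages in this area.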

Details Motivation: Low-resource languages such as Korean lack comprehensive benchmarks, which hinders effective evaluation and comparison of vision-language models. Method: The authors developed a semi-automated VQA generation pipeline and a benchmark that supports multifaceted assessment across 15 domains and 26 image types. Result: The KRETA benchmark evaluates both visual text understanding and reasoning capabilities, with a generation pipeline designed to ensure data quality. Conclusion: KRETA is a diverse text-rich visual question answering benchmark designed for Korean, intended to advance research on multilingual vision-language models. Abstract: Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.

[126] Ego-centric Predictive Model Conditioned on Hand Trajectories

Binjie Zhang,Mike Zheng Shou

Main category: cs.CV

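TL;DR: This paper proposes a unified two-stage predictive framework that jointly predicts actions and their visual outcomes in egocentric scenarios, combining hand trajectories, language, and visual information to improve both action prediction and video generation.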

Details Motivation: In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interaction and for robotic planning, but existing methods cannot model the two jointly. Method: The first stage performs consecutive state modeling, processing heterogeneous inputs and explicitly predicting future hand trajectories; the second stage introduces causal cross-attention and uses the inferred action signals to guide an image-based latent diffusion model for frame-by-frame video generation. Result: Extensive experiments on Ego4D, BridgeData, and RLBench show that the method outperforms state-of-the-art baselines in both action prediction and future video synthesis. Conclusion: The paper proposes a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories, addressing the shortcomings of existing methods in action prediction and video generation. Abstract: In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.

[127] GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Seongheon Park,Yixuan Li

Main category: cs.CV

TL;DR: This paper proposes GLSim, an effective training-free framework for detecting object hallucinations in vision-language models by combining global and local image-text similarity signals.

Details Motivation: Object hallucination in vision-language models can hinder their safe use in real-world applications. Current methods for detecting hallucinations use either global or local perspectives alone, which limits their reliability. Method: GLSim uses global and local embedding similarity signals between image and text modalities to detect object hallucinations without requiring training. Result: GLSim outperforms existing object hallucination detection methods by a significant margin in diverse scenarios. Conclusion: GLSim is a reliable and effective method for detecting object hallucinations in vision-language models. Abstract: Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
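To make the idea of combining global and local similarity signals concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not GLSim's actual scoring function: the weighting parameter `alpha`, the use of a maximum over patch embeddings as the local signal, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def global_local_score(global_image_emb, patch_embs, object_text_emb, alpha=0.5):
    """Hypothetical object-existence score: higher means the object is more likely present.

    global signal: similarity between the whole-image embedding and the object text.
    local signal:  best similarity between any image patch and the object text.
    """
    global_sim = cosine(global_image_emb, object_text_emb)
    local_sim = max(cosine(p, object_text_emb) for p in patch_embs)
    return alpha * global_sim + (1.0 - alpha) * local_sim
```

A mentioned object whose score falls below a chosen threshold would be flagged as a possible hallucination; the threshold itself would have to be tuned on held-out data.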

[128] Multimodal Conditional MeshGAN for Personalized Aneurysm Growth Prediction

Long Chen,Ashiv Patel,Mengyun Qiao,Mohammad Yousuf Salmasi,Salah A. Hammouche,Vasilis Stavrinides,Jasleen Nagi,Soodeh Kalaie,Xiao Yun Xu,Wenjia Bai,Declan P. O'Regan

Main category: cs.CV

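TL;DR: MCMeshGAN is a novel multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction that combines a local KNN-based convolutional network with a global graph convolutional network and uses clinical attributes and the time interval to generate anatomically plausible, temporally controlled predictions.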

Details Motivation: Personalized, accurate prediction of aortic aneurysm progression remains challenging because both subtle local deformations and global anatomical changes must be modeled within complex 3D geometries. Method: MCMeshGAN introduces a dual-branch architecture that combines a novel local KNN-based convolutional network (KCN) with a global graph convolutional network (GCN), and uses a dedicated condition branch to encode clinical attributes and the target time interval to generate anatomically plausible predictions. Result: MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. Conclusion: MCMeshGAN provides a framework for personalized, accurate prediction of aortic aneurysm progression and a solid foundation for clinically deployable 3D disease trajectory modeling. Abstract: Personalized, accurate prediction of aortic aneurysm progression is essential for timely intervention but remains challenging due to the need to model both subtle local deformations and global anatomical changes within complex 3D geometries. We propose MCMeshGAN, the first multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction. MCMeshGAN introduces a dual-branch architecture combining a novel local KNN-based convolutional network (KCN) to preserve fine-grained geometric details and a global graph convolutional network (GCN) to capture long-range structural context, overcoming the over-smoothing limitations of deep GCNs. A dedicated condition branch encodes clinical attributes (age, sex) and the target time interval to generate anatomically plausible, temporally controlled predictions, enabling retrospective and prospective modeling. We curated TAAMesh, a new longitudinal thoracic aortic aneurysm mesh dataset consisting of 590 multimodal records (CT scans, 3D meshes, and clinical data) from 208 patients. Extensive experiments demonstrate that MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. This framework offers a robust step toward clinically deployable, personalized 3D disease trajectory modeling. The source code for MCMeshGAN and the baseline methods is publicly available at https://github.com/ImperialCollegeLondon/MCMeshGAN.

[129] Self-supervised structured object representation learning

Oussama Hadjerci,Antoine Letienne,Mohamed Abbas Hedjazi,Adel Hafiane

Main category: cs.CV

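TL;DR: The paper proposes a new self-supervised learning method for building structured visual representations that achieves strong object detection performance even with limited data.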

Details Motivation: Although existing SSL methods achieve good results in global image understanding, they are limited in capturing structured representations of scenes; the paper aims to address this. Method: Built on a novel ProtoScale module, the approach captures visual elements across multiple spatial scales and preserves full scene context across augmented views to improve performance in dense prediction tasks. Result: Validated on downstream object detection using a combined subset of multiple datasets (COCO and UA-DETRAC), the method learns object-centric representations that improve supervised object detection and outperforms state-of-the-art methods even with limited training data and fewer fine-tuning epochs. Conclusion: The paper proposes a new self-supervised learning method that progressively builds structured visual representations by combining semantic grouping, instance-level separation, and hierarchical structuring. Abstract: Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing the structured representation in scenes. In this work, we propose a self-supervised approach that progressively builds structured visual representations by combining semantic grouping, instance level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales. Unlike common strategies like DINO that rely on random cropping and global embeddings, we preserve full scene context across augmented views to improve performance in dense prediction tasks. We validate our method on downstream object detection tasks using a combined subset of multiple datasets (COCO and UA-DETRAC). Experimental results show that our method learns object centric representations that enhance supervised object detection and outperform the state-of-the-art methods, even when trained with limited annotated data and fewer fine-tuning epochs.

[130] TrajFusionNet: Pedestrian Crossing Intention Prediction via Fusion of Sequential and Visual Trajectory Representations

François G. Landry,Moulay A. Akhloufi

Main category: cs.CV

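TL;DR: TrajFusionNet is a new model for predicting pedestrian crossing intention that combines future pedestrian trajectory and vehicle speed predictions, achieving efficient inference and state-of-the-art performance.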

Details Motivation: With the introduction of autonomous vehicles on public roads, predicting pedestrian crossing intention has become an active research topic. Method: TrajFusionNet is a novel transformer-based model that combines future pedestrian trajectory and vehicle speed predictions to predict crossing intention. Result: TrajFusionNet achieves the lowest total inference time among state-of-the-art approaches and reaches state-of-the-art results on the three most commonly used datasets for pedestrian crossing intention prediction. Conclusion: TrajFusionNet delivers state-of-the-art performance on the three most commonly used datasets for pedestrian crossing intention prediction while achieving the lowest total inference time. Abstract: With the introduction of vehicles with autonomous capabilities on public roads, predicting pedestrian crossing intention has emerged as an active area of research. The task of predicting pedestrian crossing intention involves determining whether pedestrians in the scene are likely to cross the road or not. In this work, we propose TrajFusionNet, a novel transformer-based model that combines future pedestrian trajectory and vehicle speed predictions as priors for predicting crossing intention. TrajFusionNet comprises two branches: a Sequence Attention Module (SAM) and a Visual Attention Module (VAM). The SAM branch learns from a sequential representation of the observed and predicted pedestrian trajectory and vehicle speed. Complementarily, the VAM branch enables learning from a visual representation of the predicted pedestrian trajectory by overlaying predicted pedestrian bounding boxes onto scene images. By utilizing a small number of lightweight modalities, TrajFusionNet achieves the lowest total inference time (including model runtime and data preprocessing) among current state-of-the-art approaches. In terms of performance, it achieves state-of-the-art results across the three most commonly used datasets for pedestrian crossing intention prediction.

[131] Sky Background Building of Multi-objective Fiber spectra Based on Mutual Information Network

Hui Zhang,Jianghui Cai,Haifeng Yang,Ali Luo,Yuqing Yang,Xiao Kong,Zhichao Ding,Lichan Zhou,Qin Han

Main category: cs.CV

TL;DR: The paper proposes SMI, a new sky background estimation method using mutual information and all fiber spectra, improving background subtraction in multi-object spectroscopy.

Details Motivation: Traditional methods rely on average sky fiber spectra, which lack detailed modeling of the object's surrounding environment, leading to suboptimal sky background estimation. Method: SMI uses mutual information and an incremental training approach with two networks: one for wavelength calibration and sky feature extraction, and another for maximizing/minimizing mutual information to separate common and individual components of spectra. Result: Experiments on LAMOST spectra show that SMI performs better in estimating object sky background, particularly at the blue end of the spectrum. Conclusion: The proposed SMI method effectively estimates sky background by utilizing spectra from all fibers, addressing the limitations of traditional sky fiber-based approaches. Abstract: Sky background subtraction is a critical step in Multi-objective Fiber spectra processing. However, current subtraction relies mainly on sky fiber spectra to build Super Sky. These average spectra are lacking in the modeling of the environment surrounding the objects. To address this issue, a sky background estimation model, Sky background building based on Mutual Information (SMI), is proposed. SMI is based on mutual information and an incremental training approach. It utilizes spectra from all fibers in the plate to estimate the sky background. SMI contains two main networks. The first network applies a wavelength calibration module to extract sky features from spectra, and can effectively solve the feature shift problem according to the corresponding emission position. The second network employs an incremental training approach to maximize mutual information between representations of different spectra to capture the common component. Then, it minimizes the mutual information between adjoining spectra representations to obtain individual components. This network yields an individual sky background at each location of the object.
To verify the effectiveness of the method in this paper, we conducted experiments on the spectra of LAMOST. Results show that SMI can obtain a better object sky background during the observation, especially in the blue end.

[132] Multispectral LiDAR data for extracting tree points in urban and suburban areas

Narges Takhtkeshha,Gabriele Mazzacca,Fabio Remondino,Juha Hyyppä,Gottfried Mandlburger

Main category: cs.CV

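TL;DR: This study explores the use of multispectral LiDAR and deep learning models, in particular Superpoint Transformer, for monitoring urban tree dynamics and shows their potential to improve tree extraction accuracy.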

Details Motivation: Monitoring urban tree dynamics is vital for supporting greening policies and reducing risks to electrical infrastructure. Method: The study evaluates three state-of-the-art models, Superpoint Transformer (SPT), Point Transformer V3 (PTv3), and Point Transformer V1 (PTv1), for tree point extraction using multispectral (MS) light detection and ranging (LiDAR). Result: SPT shows notable time efficiency and accuracy, with a mean intersection over union (mIoU) of 85.28%. The highest detection accuracy is obtained by combining the pseudo normalized difference vegetation index (pNDVI) with spatial data, which reduces the error rate by 10.61 percentage points compared with using spatial information alone. Conclusion: These findings highlight the potential of MS-LiDAR and deep learning to improve tree extraction and further tree inventories. Abstract: Monitoring urban tree dynamics is vital for supporting greening policies and reducing risks to electrical infrastructure. Airborne laser scanning has advanced large-scale tree management, but challenges remain due to complex urban environments and tree variability. Multispectral (MS) light detection and ranging (LiDAR) improves this by capturing both 3D spatial and spectral data, enabling detailed mapping. This study explores tree point extraction using MS-LiDAR and deep learning (DL) models. Three state-of-the-art models are evaluated: Superpoint Transformer (SPT), Point Transformer V3 (PTv3), and Point Transformer V1 (PTv1). Results show the notable time efficiency and accuracy of SPT, with a mean intersection over union (mIoU) of 85.28%. The highest detection accuracy is achieved by incorporating pseudo normalized difference vegetation index (pNDVI) with spatial data, reducing error rate by 10.61 percentage points (pp) compared to using spatial information alone. These findings highlight the potential of MS-LiDAR and DL to improve tree extraction and further tree inventories.
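The two quantities named in this entry can be sketched in a few lines of plain Python. The mIoU function follows the standard per-class intersection-over-union definition. The pNDVI function is only an assumed form, a normalized difference of two LiDAR intensity channels; which channels the study actually uses is not stated in the abstract, so the argument names are hypothetical.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection over union across classes that occur in either label list."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

def pseudo_ndvi(nir_intensity, other_intensity):
    """Assumed pNDVI form: normalized difference of two per-point intensity channels."""
    total = nir_intensity + other_intensity
    if total == 0:
        return 0.0
    return (nir_intensity - other_intensity) / total

# Example: binary tree (1) vs. non-tree (0) labels for four points.
example_miou = mean_iou([1, 1, 0, 0], [1, 0, 0, 0], num_classes=2)
```

In the example, the non-tree class has IoU 2/3 and the tree class has IoU 1/2, so the mean is 7/12. A per-point pNDVI value would be appended to the spatial coordinates as an extra input feature.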

[133] PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos

Ziyun Qian,Runyu Xiao,Shuyuan Tu,Wei Xue,Dingkang Yang,Mingcheng Li,Dongliang Kou,Minghao Han,Zizhi Chen,Lihua Zhang

Main category: cs.CV

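TL;DR: This paper proposes PersonaAnimator, a new framework for video-to-video motion personalization that addresses motion style replication, the difficulty of acquiring data, and physical plausibility.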

Details Motivation: Existing motion generation methods have three main limitations: they do not learn style characteristics, they rely on hard-to-obtain motion capture data, and the generated motions may violate physical laws. Method: The authors propose the PersonaAnimator framework and a Physics-aware Motion Style Regularization mechanism, and build PersonaVid, the first video-based personalized motion dataset. Result: PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task. Conclusion: By learning personalized motion patterns from unconstrained videos, PersonaAnimator enables video-to-video motion personalization, and its physics-aware mechanism enforces the physical plausibility of the generated motions. Abstract: Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.

[134] Hyperspectral Sensors and Autonomous Driving: Technologies, Limitations, and Opportunities

Imad Ali Shah,Jiarong Li,Roshan George,Tim Brophy,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan

Main category: cs.CV

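TL;DR: This paper comprehensively reviews hyperspectral imaging for advanced driver assistance systems and autonomous driving, analyzes the gap to commercial readiness, and outlines future research directions.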

Details Motivation: Hyperspectral imaging offers fine spectral resolution beyond traditional RGB imaging, enabling material-level scene understanding, but its commercial readiness for automotive applications is unclear. Method: The authors benchmark 216 commercially available hyperspectral and multispectral imaging cameras and review related datasets and applications. Result: The analysis reveals a significant gap between HSI's research potential and its commercial readiness: only four cameras meet the defined performance thresholds, and none comply with AEC-Q100 temperature standards. In addition, current HSI datasets are limited in scale, spectral consistency, number of spectral channels, and environmental diversity. Conclusion: The paper summarizes the current state of HSI in automotive applications and proposes directions for future research. Abstract: Hyperspectral imaging (HSI) offers a transformative sensing modality for Advanced Driver Assistance Systems (ADAS) and autonomous driving (AD) applications, enabling material-level scene understanding through fine spectral resolution beyond the capabilities of traditional RGB imaging. This paper presents the first comprehensive review of HSI for automotive applications, examining the strengths, limitations, and suitability of current HSI technologies in the context of ADAS/AD. In addition to this qualitative review, we analyze 216 commercially available HSI and multispectral imaging cameras, benchmarking them against key automotive criteria: frame rate, spatial resolution, spectral dimensionality, and compliance with AEC-Q100 temperature standards. Our analysis reveals a significant gap between HSI's demonstrated research potential and its commercial readiness. Only four cameras meet the defined performance thresholds, and none comply with AEC-Q100 requirements. In addition, the paper reviews recent HSI datasets and applications, including semantic segmentation for road surface classification, pedestrian separability, and adverse weather perception. Our review shows that current HSI datasets are limited in terms of scale, spectral consistency, the number of spectral channels, and environmental diversity, posing challenges for the development of perception algorithms and the adequate validation of HSI's true potential in ADAS/AD applications. This review paper establishes the current state of HSI in automotive contexts as of 2025 and outlines key research directions toward practical integration of spectral imaging in ADAS and autonomous systems.

[135] Streamlining the Development of Active Learning Methods in Real-World Object Detection

Moussa Kassem Sbeyti,Nadja Klein,Michelle Karg,Christian Wirth,Sahin Albayrak

Main category: cs.CV

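TL;DR: This paper proposes OSS, a new evaluation metric for active learning that quantifies method effectiveness without training a detector and selects representative validation sets, reducing computational cost and improving evaluation reliability across multiple detector architectures and autonomous driving datasets.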

Details Motivation: Existing active learning methods incur high training costs on autonomous driving datasets, and method rankings vary substantially across validation sets, which compromises reliability in safety-critical systems; a new metric is needed to address these problems. Method: OSS measures the similarity between training sets and target domains using object-level features, which quantifies the effectiveness of active learning methods without training a detector and enables the selection of representative validation sets for robust evaluation. Result: OSS is validated on three autonomous driving datasets (KITTI, BDD100K, CODA) with two detector architectures (EfficientDet and YOLOv3), and the work is the first to unify AL training and evaluation strategies based on object similarity. Conclusion: The paper introduces object-based set similarity (OSS), a metric that addresses the computational efficiency and evaluation reliability challenges of active learning in practical deployment and provides a practical framework for real-world applications. Abstract: Active learning (AL) for real-world object detection faces computational and reliability challenges that limit practical deployment. Developing new AL methods requires training multiple detectors across iterations to compare against existing approaches. This creates high costs for autonomous driving datasets where the training of one detector requires up to 282 GPU hours. Additionally, AL method rankings vary substantially across validation sets, compromising reliability in safety-critical transportation systems. We introduce object-based set similarity ($\mathrm{OSS}$), a metric that addresses these challenges. $\mathrm{OSS}$ (1) quantifies AL method effectiveness without requiring detector training by measuring similarity between training sets and target domains using object-level features. This enables the elimination of ineffective AL methods before training. Furthermore, $\mathrm{OSS}$ (2) enables the selection of representative validation sets for robust evaluation. We validate our similarity-based approach on three autonomous driving datasets (KITTI, BDD100K, CODA) using uncertainty-based AL methods as a case study with two detector architectures (EfficientDet, YOLOv3). This work is the first to unify AL training and evaluation strategies in object detection based on object similarity. $\mathrm{OSS}$ is detector-agnostic, requires only labeled object crops, and integrates with existing AL pipelines.
This provides a practical framework for deploying AL in real-world applications where computational efficiency and evaluation reliability are critical. Code is available at https://mos-ks.github.io/publications/.

[136] Integrating SAM Supervision for 3D Weakly Supervised Point Cloud Segmentation

Lechun You,Zhonghua Wu,Weide Liu,Xulei Yang,Jun Cheng,Wei Zhou,Bharadwaj Veeravalli,Guosheng Lin

Main category: cs.CV

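TL;DR: This paper proposes a new method that maximizes the utility of sparse 3D annotations by incorporating segmentation masks generated by 2D foundation models, improving 3D weakly supervised segmentation.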

Details Motivation: Current 3D semantic segmentation methods usually focus on the 3D domain only and fail to exploit the complementary nature of 2D and 3D data. In addition, some methods extend original labels or generate pseudo labels to guide training but often fail to fully use these labels or handle the noise within them. Method: The method propagates 2D segmentation masks into 3D space by establishing geometric correspondences between 3D scenes and 2D views, and extends the sparse 3D annotations by selecting reliable pseudo labels through confidence- and uncertainty-based consistency regularization. Result: The approach substantially increases the number of available labels, bridges the gap between limited 3D annotations and the capabilities of 2D foundation models, and improves 3D weakly supervised segmentation performance. Conclusion: By combining 2D foundation models with 3D point cloud data, the paper improves 3D weakly supervised segmentation and offers a new way to address the difficulty of annotating 3D data. Abstract: Current methods for 3D semantic segmentation propose training models with limited annotations to address the difficulty of annotating large, irregular, and unordered 3D point cloud data. They usually focus on the 3D domain only, without leveraging the complementary nature of 2D and 3D data. Besides, some methods extend original labels or generate pseudo labels to guide the training, but they often fail to fully use these labels or address the noise within them. Meanwhile, the emergence of comprehensive and adaptable foundation models has offered effective solutions for segmenting 2D data. Leveraging this advancement, we present a novel approach that maximizes the utility of sparsely available 3D annotations by incorporating segmentation masks generated by 2D foundation models. We further propagate the 2D segmentation masks into the 3D space by establishing geometric correspondences between 3D scenes and 2D views. We extend the highly sparse annotations to encompass the areas delineated by 3D masks, thereby substantially augmenting the pool of available labels. Furthermore, we apply confidence- and uncertainty-based consistency regularization on augmentations of the 3D point cloud and select the reliable pseudo labels, which are further spread on the 3D masks to generate more labels. This innovative strategy bridges the gap between limited 3D annotations and the powerful capabilities of 2D foundation models, ultimately improving the performance of 3D weakly supervised segmentation.
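The step of propagating 2D masks into 3D through geometric correspondences can be illustrated with a pinhole camera projection. The sketch below is a simplified assumption, not the paper's pipeline: points are taken to be already expressed in the camera frame, the intrinsics `fx`, `fy`, `cx`, `cy` and all function names are hypothetical, and occlusion is ignored.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to integer pixel coordinates (u, v).

    Returns None for points at or behind the camera plane.
    """
    x, y, z = point
    if z <= 0:
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v

def propagate_mask_labels(points, mask, fx, fy, cx, cy, ignore_label=-1):
    """Assign each 3D point the 2D mask id at its projected pixel.

    `mask` is a list of rows (mask[v][u]); points that fall outside the
    image or behind the camera receive `ignore_label`.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for point in points:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            labels.append(ignore_label)
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            labels.append(mask[v][u])
        else:
            labels.append(ignore_label)
    return labels
```

In a multi-view setting, the same point would collect one candidate mask id per view, and a depth check against each view's depth map would be needed to discard occluded projections before grouping points into 3D masks.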

[137] WaveHiT-SR: Hierarchical Wavelet Network for Efficient Image Super-Resolution

Fayaz Ali,Muhammad Zawish,Steven Davy,Radu Timofte

Main category: cs.CV

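TL;DR: WaveHiT-SR is a new image super-resolution method that combines the wavelet transform with a hierarchical transformer, significantly improving efficiency and performance.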

Details Motivation: The window self-attention mechanism in conventional transformer-based super-resolution methods has high computational complexity, which limits the receptive field, so a more efficient approach is needed. Method: The method replaces small fixed windows with adaptive hierarchical windows and uses the wavelet transform to decompose images into multiple frequency subbands, progressively reconstructing high-resolution images to reduce computational complexity. Result: WaveHiT-SR achieves higher efficiency on several lightweight models (SwinIR-Light, SwinIR-NG, and SRFormer-Light), with fewer parameters, lower FLOPs, and faster speeds while maintaining performance. Conclusion: WaveHiT-SR is a new image super-resolution method built on the wavelet transform and a hierarchical transformer framework that addresses the high computational complexity of window self-attention and achieves state-of-the-art performance on several lightweight models. Abstract: Transformers have demonstrated promising performance in computer vision tasks, including image super-resolution (SR). The quadratic computational complexity of window self-attention mechanisms in many transformer-based SR methods forces the use of small, fixed windows, limiting the receptive field. In this paper, we propose a new approach by embedding the wavelet transform within a hierarchical transformer framework, called WaveHiT-SR. First, using adaptive hierarchical windows instead of static small windows allows the model to capture features across different levels and greatly improves the ability to model long-range dependencies. Secondly, the proposed model utilizes wavelet transforms to decompose images into multiple frequency subbands, allowing the network to focus on both global and local features while preserving structural details. By progressively reconstructing high-resolution images through hierarchical processing, the network reduces computational complexity without sacrificing performance. The multi-level decomposition strategy enables the network to capture fine-grained information in low-frequency components while enhancing high-frequency textures. Through extensive experimentation, we confirm the effectiveness and efficiency of our WaveHiT-SR. Our refined versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light deliver cutting-edge SR results, achieving higher efficiency with fewer parameters, lower FLOPs, and faster speeds.
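The decomposition of an image into frequency subbands can be illustrated with a single-level 2D Haar transform. The abstract does not say which wavelet WaveHiT-SR uses, so the Haar choice and the function name below are assumptions made purely for illustration; subband naming conventions (LH versus HL) also vary between libraries.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of a grayscale image.

    `image` is a list of rows with even height and width. Returns four
    half-resolution subbands: ll (low-pass approximation), lh (top minus
    bottom detail), hl (left minus right detail), hh (diagonal detail).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            # Each output coefficient summarizes one 2x2 block of pixels.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)
            lh_row.append((a + b - d - e) / 2.0)
            hl_row.append((a - b + d - e) / 2.0)
            hh_row.append((a - b - d + e) / 2.0)
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh
```

Applying the function again to the `ll` subband gives the next coarser level, which is the usual way a multi-level decomposition is built. Each level halves the spatial resolution, so attention windows of a fixed size cover a proportionally larger image region.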

[138] Reimagining Image Segmentation using Active Contour: From Chan Vese Algorithm into a Proposal Novel Functional Loss Framework

Gianluca Guzzetta

Main category: cs.CV

TL;DR: This paper studies the Chan-Vese algorithm for image segmentation and proposes a new functional loss.

Details Motivation: To improve image segmentation performance, the paper conducts a comprehensive study and analysis of the Chan-Vese model. Method: A discretized scheme is adopted, and the Chan-Vese functional loss is implemented in MATLAB and PyTorch. Result: Classical loss functions are compared with the proposed method on common computer vision segmentation datasets. Conclusion: The paper concludes that the Chan-Vese algorithm is effective for image segmentation and proposes a functional segmentation loss based on active contours. Abstract: In this paper, we present a comprehensive study and analysis of the Chan-Vese algorithm for image segmentation. We employ a discretized scheme derived from the empirical study of the Chan-Vese model's functional energy and its partial differential equation based on its level set function. We provide a proof of the results and an implementation using MATLAB. Leveraging modern computer vision methodologies, we propose a functional segmentation loss based on active contours, utilizing pytorch.nn.ModuleLoss and a level set based on the Chan-Vese algorithm. We compare our results with common computer vision segmentation datasets and evaluate the performance of classical loss functions against our proposed method. All code and materials used are available at https://github.com/gguzzy/chan_vese_functional_loss.
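For readers unfamiliar with the Chan-Vese energy, the sketch below computes its two region-fitting terms for a soft mask in plain Python. It is a simplified illustration rather than the loss from the linked repository: the contour-length regularizer is omitted, the image is flattened to a list of intensities, and the function name and the weights `lambda1` and `lambda2` are assumptions.

```python
def chan_vese_region_loss(image, mask, lambda1=1.0, lambda2=1.0):
    """Region-fitting part of the Chan-Vese energy for a soft mask.

    `image` holds pixel intensities and `mask` holds foreground
    probabilities in [0, 1], both as flat lists of equal length.
    c1 and c2 are the mask-weighted mean intensities inside and outside.
    """
    inside_weight = sum(mask)
    outside_weight = sum(1.0 - m for m in mask)
    c1 = sum(m * p for m, p in zip(mask, image)) / inside_weight if inside_weight > 0 else 0.0
    c2 = sum((1.0 - m) * p for m, p in zip(mask, image)) / outside_weight if outside_weight > 0 else 0.0
    fit_inside = sum(m * (p - c1) ** 2 for m, p in zip(mask, image))
    fit_outside = sum((1.0 - m) * (p - c2) ** 2 for m, p in zip(mask, image))
    return lambda1 * fit_inside + lambda2 * fit_outside
```

The loss is zero when the mask splits the image into two regions of constant intensity and grows as each region becomes less homogeneous, which is what makes it usable as an unsupervised or auxiliary segmentation objective.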

[139] Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models

Oliver Grainge,Sania Waheed,Jack Stilgoe,Michael Milford,Shoaib Ehsan

Main category: cs.CV

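TL;DR: This study evaluates the geolocation capabilities of 25 state-of-the-art vision-language models (VLMs) and finds relatively high accuracy on images resembling social media content, raising privacy concerns.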

Details Motivation: Vision-language models (VLMs) acting as accurate image geo-locators pose privacy risks, especially given the widespread sharing of photos on social media, yet there is little systematic evaluation of the geolocation precision of generative VLMs, their limits, and their potential for unintended inferences. Method: The authors comprehensively assess the geolocation capabilities of 25 state-of-the-art VLMs on four image datasets captured in diverse environments. Result: The results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Conclusion: Current VLMs perform poorly on generic street-level images but achieve notably high accuracy (61%) on images resembling social media content, raising serious privacy concerns. Abstract: Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and geography education. Recently, Vision-Language Models (VLMs) are increasingly demonstrating capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread uses of AI models and sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits and potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.

[140] GS: Generative Segmentation via Label Diffusion

Yuhao Chen,Shubin Chen,Liang Lin,Guangrun Wang

Main category: cs.CV

TL;DR: GS introduces a generative approach to language-driven image segmentation, directly generating segmentation masks from noise, and outperforms existing methods on challenging benchmarks.

Details Motivation: Traditional methods treat segmentation as a discriminative problem or auxiliary process; GS aims to make label generation the primary target for better spatial and semantic fidelity. Method: GS formulates segmentation as a generative task via label diffusion, directly generating segmentation masks from noise conditioned on the input image and language description. Result: GS achieves superior performance on the Panoptic Narrative Grounding benchmark, outperforming existing discriminative and diffusion-based methods. Conclusion: GS (Generative Segmentation) significantly outperforms existing methods, setting a new state-of-the-art for language-driven segmentation. Abstract: Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models-thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. 
To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.

[141] Segmentation Assisted Incremental Test Time Adaptation in an Open World

Manogna Sreenivas,Soma Biswas

Main category: cs.CV

TL;DR: This paper introduces SegAssist, a training-free segmentation-assisted active labeling module that enhances the generalization of Vision Language Models (VLMs) in dynamic environments by enabling continuous adaptation to unseen classes and domains during test time.

Details Motivation: The motivation stems from the challenge faced by deployed models in dynamic environments, where they encounter unfamiliar objects and distribution shifts. Traditional Test Time Adaptation approaches are limited to predefined classes, which hampers their ability to generalize in real-world scenarios with unseen classes and domains. Method: The paper proposes SegAssist, a segmentation-assisted active labeling module that is training-free. It repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. The approach is evaluated through extensive experiments on benchmark datasets. Result: The results show that SegAssist has the potential to significantly enhance the performance of VLMs in real-world scenarios that require continuous adaptation to emerging data. The method outperforms traditional Test Time Adaptation approaches by simultaneously addressing covariate and label shifts. Conclusion: The paper concludes that SegAssist enhances the performance of Vision Language Models (VLMs) in real-world scenarios by enabling continuous adaptation to emerging data, particularly for unseen classes and domains during test time. Abstract: In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation of Vision Language Models, tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. 
Towards this goal, we establish a new benchmark for Incremental Test Time Adaptation (ITTA), integrating single-image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation-assisted active labeling module, termed SegAssist, which is training-free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real-world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/

[142] OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

Peng-Hao Hsu,Ke Zhang,Fu-En Wang,Tao Tu,Ming-Feng Li,Yu-Lun Liu,Albert Y. C. Chen,Min Sun,Cheng-Hao Kuo

Main category: cs.CV

TL;DR: OpenM3D is an efficient, open-vocabulary indoor 3D object detector that uses image-based methods and graph embedding to achieve superior performance without human annotations.

Details Motivation: Open-vocabulary 3D object detection through image-based methods is underexplored compared to 3D point cloud-based methods, prompting the development of OpenM3D. Method: OpenM3D is a single-stage detector using 2D-induced voxel features and trained with a class-agnostic 3D localization loss and a voxel-semantic alignment loss. A 3D Pseudo Box Generation method using graph embedding combines 2D segments into coherent 3D structures. Result: OpenM3D achieves high precision and recall with pseudo-boxes and performs efficiently, requiring only multi-view images for input at 0.3 seconds per scene. Conclusion: OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector, demonstrates superior accuracy and speed compared to existing methods. Abstract: Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. 
Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.

[143] Patch Progression Masked Autoencoder with Fusion CNN Network for Classifying Evolution Between Two Pairs of 2D OCT Slices

Philippe Zhang,Weili Jiang,Yihao Li,Jing Zhang,Sarah Matta,Yubo Tan,Hui Lin,Haoshen Wang,Jiangtian Pan,Hui Xu,Laurent Borderie,Alexandre Le Guilcher,Béatrice Cochener,Chubin Ou,Gwenolé Quellec,Mathieu Lamard

Main category: cs.CV

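TL;DR: This paper presents methods for monitoring the progression of age-related macular degeneration (AMD) in optical coherence tomography (OCT) scans. The authors took part in the MARIO challenge and placed in the Top 10 on both tasks, using a fusion CNN network with model ensembling for Task 1 and a Patch Progression Masked Autoencoder for Task 2.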

Details Motivation: Timely diagnosis and consistent monitoring of neovascular AMD improve treatment outcomes, and tracking neovascular activity in OCT scans allows for more personalized and effective treatment plans. Method: In Task 1, a fusion CNN network with model ensembling classifies the evolution between two pairs of 2D slices from consecutive OCT acquisitions. In Task 2, the proposed Patch Progression Masked Autoencoder generates the OCT image for the next exam, and the Task 1 method then classifies the evolution between the current OCT and the generated one. Result: The team placed in the Top 10 on both tasks of the MARIO challenge. Conclusion: The results suggest that the proposed methods have potential for monitoring AMD progression; because some team members belong to the same organization as the challenge organizers, the team is not eligible for the prize. Abstract: Age-related Macular Degeneration (AMD) is a prevalent eye condition affecting visual acuity. Anti-vascular endothelial growth factor (anti-VEGF) treatments have been effective in slowing the progression of neovascular AMD, with better outcomes achieved through timely diagnosis and consistent monitoring. Tracking the progression of neovascular activity in OCT scans of patients with exudative AMD allows for the development of more personalized and effective treatment plans. This was the focus of the Monitoring Age-related Macular Degeneration Progression in Optical Coherence Tomography (MARIO) challenge, in which we participated. In Task 1, which involved classifying the evolution between two pairs of 2D slices from consecutive OCT acquisitions, we employed a fusion CNN network with model ensembling to further enhance the model's performance. For Task 2, which focused on predicting progression over the next three months based on current exam data, we proposed the Patch Progression Masked Autoencoder that generates an OCT for the next exam and then classifies the evolution between the current OCT and the one generated using our solution from Task 1. The results we achieved allowed us to place in the Top 10 for both tasks. Some team members are part of the same organization as the challenge organizers; therefore, we are not eligible to compete for the prize.

[144] PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

Zheng Li,Yanming Guo,WenZhe Liu,Xueyi Zhang,Zhaoyun Ding,Long Xu,Mingrui Lao

Main category: cs.CV

TL;DR: This paper introduces PAUL, a novel framework for addressing the Noisy Correspondence on Cross-View Geo-Localization (NC-CVGL) problem by leveraging data uncertainty and loss discrepancy to provide robust supervision for noisy samples.

Details Motivation: Existing cross-view geo-localization methods assume perfect alignment of image pairs during training, which is rarely the case in real-world scenarios due to GPS drift and other factors. This noisy correspondence issue has received limited attention in current research. Method: PAUL (Partition and Augmentation by Uncertainty Learning) partitions and augments training data based on estimated data uncertainty through uncertainty-aware co-augmentation and evidential co-training. Result: PAUL consistently achieves superior performance over other competitive noisy-correspondence-driven methods across various noise ratios, as validated by comprehensive experiments. Conclusion: PAUL provides robust supervision for noisy samples by leveraging data uncertainty and loss discrepancy for targeted partitioning and augmentation, leading to superior performance in cross-view geo-localization tasks under noisy correspondence scenarios. Abstract: Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, as it enables matching between drone-captured and satellite imagery. Most existing approaches embed multi-modal data into a joint feature space to maximize the similarity of paired images. However, these methods typically assume perfect alignment of image pairs during training, which rarely holds true in real-world scenarios. In practice, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research. In this paper, we formally introduce and address the Noisy Correspondence on Cross-View Geo-Localization (NC-CVGL) problem, aiming to bridge the gap between idealized benchmarks and practical applications. 
To this end, we propose PAUL (Partition and Augmentation by Uncertainty Learning), a novel framework that partitions and augments training data based on estimated data uncertainty through uncertainty-aware co-augmentation and evidential co-training. Specifically, PAUL selectively augments regions with high correspondence confidence and utilizes uncertainty estimation to refine feature learning, effectively suppressing noise from misaligned pairs. Distinct from traditional filtering or label correction, PAUL leverages both data uncertainty and loss discrepancy for targeted partitioning and augmentation, thus providing robust supervision for noisy samples. Comprehensive experiments validate the effectiveness of individual components in PAUL, which consistently achieves superior performance over other competitive noisy-correspondence-driven methods across various noise ratios.

[145] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Zhixuan Liang,Yizhuo Li,Tianshuo Yang,Chengyue Wu,Sitong Mao,Liuao Pei,Xiaokang Yang,Jiangmiao Pang,Yao Mu,Ping Luo

Main category: cs.CV

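TL;DR: This paper proposes Discrete Diffusion VLA, a new single-transformer policy that models action chunks with discrete diffusion, addressing the limitations of existing VLA decoders.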

Details Motivation: Existing VLA decoders either generate actions autoregressively in a fixed order or attach continuous diffusion or flow matching heads outside the backbone, requiring specialized training and iterative sampling that hinder a unified, scalable architecture. Method: A single-transformer policy models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. Result: The method achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3% overall on SimplerEnv Bridge, outperforming autoregressive and continuous diffusion baselines. Conclusion: Discrete Diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete-diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.
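
The adaptive decoding order and secondary remasking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `predict` callback, the per-round commit schedule, and the `remask_threshold` value are all assumptions made for illustration. The idea shown is only that masked positions are filled most-confident-first, and that already committed tokens can be reopened when the model's confidence in them drops.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for an action token that has not been decoded yet
Prediction = Tuple[int, float]  # (token id, confidence in [0, 1])


def decode_chunk(
    predict: Callable[[List[Optional[int]]], List[Prediction]],
    length: int,
    rounds: int = 4,
    remask_threshold: float = 0.2,  # hypothetical value, not from the paper
) -> List[Optional[int]]:
    """Fill an action chunk over several refinement rounds.

    `predict` is a hypothetical stand-in for the policy: given the partially
    decoded chunk, it returns one (token, confidence) pair per position.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    tokens: List[Optional[int]] = [MASK] * length
    for r in range(rounds):
        preds = predict(tokens)
        # Secondary remasking: reopen committed tokens the model now doubts.
        # Skipped in the final round so that every position ends up filled.
        if r < rounds - 1:
            for i, tok in enumerate(tokens):
                if tok is not MASK and preds[i][1] < remask_threshold:
                    tokens[i] = MASK
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        # Spread the remaining positions evenly over the remaining rounds.
        k = -(-len(masked) // (rounds - r))  # ceiling division
        # Adaptive order: commit the most confident ("easy") positions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens
```

With a toy `predict` that always returns the same tokens and confidences, the high-confidence positions are committed in the first round and the lowest-confidence position is committed last, regardless of its left-to-right index.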

[146] Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images

Changha Shin,Woong Oh Cho,Seon Joo Kim

Main category: cs.CV

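TL;DR: This paper introduces a new calibration framework that addresses the imperfect panoramas produced by consumer-grade dual-fisheye systems, and validates on real-world datasets that it outperforms existing 360-degree rendering models.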

Details Motivation: Consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. Method: A novel calibration framework incorporates a dual-fisheye camera model into the 3D Gaussian splatting pipeline. Result: Extensive evaluations on real-world datasets confirm that the method produces seamless renderings from imperfect images. Conclusion: The framework transforms imperfect omnidirectional inputs into flawless novel view synthesis and outperforms existing 360-degree rendering models. Abstract: 360-degree visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360-degree images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings, even from imperfect images, and outperforms existing 360-degree rendering models.

[147] AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Yuxin Guo,Teng Wang,Yuying Ge,Shijie Ma,Yixiao Ge,Wei Zou,Ying Shan

Main category: cs.CV

TL;DR: AudioStory is a unified framework that integrates large language models with TTA systems to generate structured, long-form audio narratives, demonstrating superior performance in instruction-following and audio fidelity.

Details Motivation: Recent advances in text-to-audio generation struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. AudioStory addresses this gap. Method: AudioStory uses a unified framework integrating large language models (LLMs) with TTA systems. It employs a decoupled bridging mechanism and end-to-end training to enhance collaboration between components and ensure temporal coherence and emotional tone consistency. Result: AudioStory excels in both single-audio generation and narrative audio generation, surpassing prior TTA baselines in instruction-following ability and audio fidelity. A benchmark, AudioStory-10K, was also established for evaluation. Conclusion: AudioStory demonstrates superiority in generating structured, long-form audio narratives by integrating large language models with TTA systems, outperforming prior TTA baselines in instruction-following ability and audio fidelity. Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. 
(2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory

[148] Bridging Domain Gaps for Fine-Grained Moth Classification Through Expert-Informed Adaptation and Foundation Model Priors

Ross J Gardiner,Guillaume Mougeot,Sareh Rowlands,Benno I Simmons,Flemming Helsing,Toke Thomas Høye

Main category: cs.CV

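TL;DR: This paper develops an efficient insect image classification method that uses knowledge distillation to transfer the capabilities of the high-performance BioCLIP2 model into a lightweight ConvNeXt-tiny model, addressing the domain shift in field image classification.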

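Details Motivation: Accurate species identification is challenging because of the domain shift between curated images and noisy field imagery, yet it is vital for understanding insect declines. Method: Limited expert-labelled field data is combined with knowledge distillation to transfer the capabilities of the high-performance BioCLIP2 foundation model into a lightweight ConvNeXt-tiny architecture. Result: In experiments on 101 Danish moth species from AMI camera systems, BioCLIP2 substantially outperforms other methods, and the distilled lightweight model achieves comparable accuracy at significantly reduced computational cost. Conclusion: The paper proposes a lightweight classification approach that combines limited expert-labelled field data with knowledge distillation from the BioCLIP2 foundation model, effectively bridging the domain gap for fine-grained classification and offering practical guidelines for developing efficient insect monitoring systems. Abstract: Labelling images of Lepidoptera (moths) from automated camera systems is vital for understanding insect declines. However, accurate species identification is challenging due to domain shifts between curated images and noisy field imagery. We propose a lightweight classification approach, combining limited expert-labelled field data with knowledge distillation from the high-performance BioCLIP2 foundation model into a ConvNeXt-tiny architecture. Experiments on 101 Danish moth species from AMI camera systems demonstrate that BioCLIP2 substantially outperforms other methods and that our distilled lightweight model achieves comparable accuracy with significantly reduced computational cost. These insights offer practical guidelines for the development of efficient insect monitoring systems and bridging domain gaps for fine-grained classification.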
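
The abstract does not state which distillation objective is used. As a hedged illustration of knowledge distillation in general, the sketch below uses the common temperature-scaled formulation: a KL term that pulls the student's softened predictions toward the teacher's, blended with ordinary cross-entropy on the expert label. The function names and the `temperature` and `alpha` defaults are assumptions, not values from the paper.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 2.0,  # assumed default, softens both distributions
    alpha: float = 0.5,        # assumed weight between soft and hard targets
) -> float:
    """Blend of teacher-matching KL divergence and label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # Standard cross-entropy of the student against the expert label.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target gradient scale comparable.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce
```

When the student's logits equal the teacher's, the KL term vanishes and only the label term remains; a student that disagrees with both the teacher and the label is penalized by both terms.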

[149] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

Zeyi Sun,Yuhang Cao,Jianze Liang,Qiushi Sun,Ziyu Liu,Zhixiong Zhang,Yuhang Zang,Xiaoyi Dong,Kai Chen,Dahua Lin,Jiaqi Wang

Main category: cs.CV

TL;DR: CODA is a new trainable compositional framework that combines planning and execution for improved performance in specialized domains like scientific computing.

Details Motivation: To overcome the limitations of existing compositional frameworks which are static and non-trainable, thus preventing adaptation from experience, especially in domains with scarce high-quality data. Method: CODA combines a generalist planner (Cerebrum) and a specialist executor (Cerebellum), trained through a two-stage pipeline involving Specialization and Generalization. Result: CODA outperforms existing baselines and achieves a new state of the art among open-source models on the ScienceBoard benchmark. Conclusion: CODA provides a trainable compositional framework that successfully integrates planning and execution capabilities, making it highly effective in specialized domains like scientific computing. Abstract: Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. 
In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
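
The Generalization stage described above amounts to pooling the successful trajectories from every per-application expert into one dataset for supervised fine-tuning. The sketch below shows only that aggregation step; the `Trajectory` type, its fields, and the function name are hypothetical and not taken from the CODA codebase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical record of one task attempt by a specialized planner."""
    app: str           # which application the expert was trained on
    steps: List[str]   # planner outputs along the trajectory
    success: bool      # whether the task was judged successful


def build_sft_dataset(
    per_expert: Dict[str, List[Trajectory]],
) -> List[Trajectory]:
    """Pool only the successful trajectories from all specialized experts."""
    dataset: List[Trajectory] = []
    for trajectories in per_expert.values():
        dataset.extend(t for t in trajectories if t.success)
    return dataset
```

Failed trajectories are dropped, so the consolidated dataset contains only behavior worth imitating, drawn from all applications at once.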