cs.CL

[1] Spatial ModernBERT: Spatial-Aware Transformer for Table and Key-Value Extraction in Financial Documents at Scale

Javis AI Team,Amrendra Singh,Maulik Shah,Dharshan Sampath

Main category: cs.CL

TL;DR: This paper introduces Spatial ModernBERT, a model that combines textual and spatial information to accurately extract tables and key-value pairs from financial documents.

Details Motivation: Extracting tables and key-value pairs from financial documents is essential for business workflows like auditing, data analytics, and automated invoice processing. Method: Spatial ModernBERT, a transformer-based model with spatial embeddings, was developed. The model uses token classification across three heads (Label Head, Column Head, Row Head) and a post-processing method using B-I-IB tagging for merging tokens, reconstructing the table layout, and extracting key-value pairs. Result: The model was pretrained on the PubTables-1M dataset and fine-tuned on a financial document dataset with a cross-entropy loss on each classification head, achieving robust performance. Conclusion: Spatial ModernBERT effectively utilizes textual and spatial cues to enable highly accurate extraction of tables and key-value pairs from complex financial documents. Abstract: Extracting tables and key-value pairs from financial documents is essential for business workflows such as auditing, data analytics, and automated invoice processing. In this work, we introduce Spatial ModernBERT, a transformer-based model augmented with spatial embeddings, to accurately detect and extract tabular data and key-value fields from complex financial documents. We cast the extraction task as token classification across three heads: (1) Label Head, classifying each token as a label (e.g., PO Number, PO Date, Item Description, Quantity, Base Cost, MRP, etc.); (2) Column Head, predicting column indices; (3) Row Head, distinguishing the start of item rows and header rows. The model is pretrained on the PubTables-1M dataset, then fine-tuned on a financial document dataset, achieving robust performance through cross-entropy loss on each classification head. We propose a post-processing method to merge tokens using B-I-IB tagging, reconstruct the tabular layout, and extract key-value pairs. 
Empirical evaluation shows that Spatial ModernBERT effectively leverages both textual and spatial cues, facilitating highly accurate table and key-value extraction in real-world financial documents.

[2] SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan,Michael Fu,Rui Yang,Chakkrit Tantithamthavorn

Main category: cs.CL

TL;DR: This paper introduces SEALGuard, a multilingual guardrail for LLM-powered systems that significantly improves detection of unsafe and jailbreak prompts across diverse languages.

Details Motivation: Current guardrails like LlamaGuard struggle with multilingual unsafe inputs, leaving systems vulnerable to prompts in low-resource languages. Method: Adapted a general-purpose multilingual language model into a guardrail using low-rank adaptation (LoRA), evaluated on SEALSBench. Result: SEALGuard achieved the best Defense Success Rate, precision, and F1-score compared to state-of-the-art guardrails. Conclusion: SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving Defense Success Rate by 48% over LlamaGuard. Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) of 9% and 18%, respectively, compared to its performance on English-only prompts. 
In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. SEALGuard advances the safety alignment of LLM systems by introducing an effective multilingual guardrail.

[3] Evaluating LLMs in Medicine: A Call for Rigor, Transparency

Mahmoud Alwakeel,Aditya Nagori,Vijay Krishnamoorthy,Rishikesan Kamaleswaran

Main category: cs.CL

TL;DR: The study highlights the inadequacies of current datasets used to evaluate large language models in medical question answering. It emphasizes the need for more realistic, transparent, and validated datasets through collaborative efforts.

Details Motivation: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of datasets used for their evaluation. Method: Widely-used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Result: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure to LLM training. Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure datasets and methodologies are rigorous, unbiased, and reflective of clinical complexities. Abstract: Objectives: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of datasets used for their evaluation. Materials and Methods: Widely-used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Alternatives, such as challenge questions in medical journals, were also analyzed to identify their potential as unbiased evaluation tools. Results: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure to LLM training. These gaps highlight the need for secure, comprehensive, and representative datasets. Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure datasets and methodologies are rigorous, unbiased, and reflective of clinical complexities.

[4] From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Seokhee Hong,Sunkyoung Kim,Guijin Son,Soyeon Kim,Yeonjung Hong,Jinsik Lee

Main category: cs.CL

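TL;DR: This paper introduces a new approach to building more reliable benchmark datasets for evaluating the applicability of large language models to Korean industrial and academic domains.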

Details Motivation: To evaluate how well large language models apply in real-world scenarios, benchmarks must cover industrial fields as well as academic domains. Method: The existing KMMLU dataset was reconstructed with critical errors removed to improve reliability, and a new benchmark, KMMLU-Pro, was created from Korean National Professional Licensure exams. Result: Experiments show that the two benchmarks comprehensively represent industrial knowledge in Korea. Conclusion: The paper presents two Korean expert-level benchmarks for evaluating the applicability of large language models in industrial and academic domains, and releases the datasets publicly. Abstract: The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We release our datasets publicly.

[5] Self-Improving Model Steering

Rongyi Zhu,Yuhui Wang,Tanqiu Jiang,Jiacheng Liang,Ting Wang

Main category: cs.CL

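TL;DR: This paper proposes SIMS, a self-improving model-steering framework that requires no external supervision and delivers stronger steering effectiveness and adaptability.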

Details Motivation: Conventional model-steering methods rely on externally annotated data, which limits their adaptability to different contexts and ties their effectiveness to annotation quality. Method: SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, and adopts new strategies such as prompt ranking and contrast sampling to enhance steering. Result: Extensive evaluation shows that SIMS substantially outperforms existing methods in steering effectiveness and adaptability. Conclusion: SIMS is a promising model-steering framework that points to a direction for future research on inference-time LLM alignment. Abstract: Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.

[6] Application of CARE-SD text classifier tools to assess distribution of stigmatizing and doubt-marking language features in EHR

Drew Walker,Jennifer Love,Swati Rajwal,Isabel C Walker,Hannah LF Cooper,Abeed Sarker,Melvin Livingston III

Main category: cs.CL

TL;DR: The study identifies disparities in stigmatizing language within EHRs, showing that historically marginalized patients face increased use of such language by multiple types of healthcare providers.

Details Motivation: To understand how electronic health records may perpetuate patient stigmatization through specific linguistic features across different healthcare teams. Method: Linguistic features like doubt markers and stigmatizing labels were identified using expanded lexicon matching and supervised learning classifiers on MIMIC-III EHR data. Poisson regression models were used to assess predictors of these linguistic features. Result: Higher rates of stigmatizing labels were found among Black or African American patients, those with government-run insurance or self-pay, and patients with certain health conditions. Male patients showed higher rates of doubt markers. Nurses and social workers contributed more to stigmatizing language patterns. Conclusion: Stigmatizing language in EHRs is more common among historically marginalized patient groups and is propagated by various healthcare providers. Abstract: Introduction: Electronic health records (EHR) are a critical medium through which patient stigmatization is perpetuated among healthcare teams. Methods: We identified linguistic features of doubt markers and stigmatizing labels in MIMIC-III EHR via expanded lexicon matching and supervised learning classifiers. Predictors of rates of linguistic features were assessed using Poisson regression models. Results: We found higher rates of stigmatizing labels per chart among patients who were Black or African American (RR: 1.16), patients with Medicare/Medicaid or government-run insurance (RR: 2.46), self-pay (RR: 2.12), and patients with a variety of stigmatizing disease and mental health conditions. Patterns among doubt markers were similar, though male patients had higher rates of doubt markers (RR: 1.25). We found increased stigmatizing labels used by nurses (RR: 1.40), and social workers (RR: 2.25), with similar patterns of doubt markers. 
Discussion: Stigmatizing language occurred at higher rates among historically stigmatized patients, perpetuated by multiple provider types.
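
For readers less familiar with the RR figures above: in a Poisson regression with a log link, a rate ratio is the exponentiated coefficient of a predictor. The standard form is sketched below. The symbols (Y_i for the count of a linguistic feature in chart i, x_ij for predictor j, beta_j for its coefficient) are notation introduced here for illustration, not the authors' own, and the abstract does not say whether an exposure offset was included.

```latex
\log \mathbb{E}[Y_i \mid x_i] = \beta_0 + \sum_j \beta_j x_{ij},
\qquad
\mathrm{RR}_j = e^{\beta_j}
```

Read this way, RR = 1.16 corresponds to an expected rate about 16% higher than in the reference group, holding the other predictors fixed.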

[7] Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery

Ana Chkhaidze,Reshanne R. Reeder,Connor Gag,Anastasia Kiyonaga,Seana Coulson

Main category: cs.CL

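TL;DR: The study finds that what individuals see during Ganzflicker-induced hallucinations relates to their visual imagery ability: strong imagers see more complex, naturalistic scenes while weak imagers see simple patterns, which may reflect individual differences in coordination between early visual areas and higher-order brain regions.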

Details Motivation: Recent proposals about the imagery spectrum (absent imagery, typical imagery, vivid imagery) suggest that these individual differences should affect the complexity of other internally generated visual experiences; this study tests that hypothesis. Method: Natural language processing tools were used to analyze free-text descriptions from more than 4,000 participants, comparing hallucination content across imagery phenotypes and assessing how well vision-language models and text-only language models capture the differences. Result: Strong imagers described more complex, naturalistic hallucination content, while weak imagers mainly reported simple geometric patterns; vision-language models captured these differences better than text-only models, and strong imagers used language with richer sensorimotor associations. Conclusion: The content individuals describe during Ganzflicker-induced visual hallucinations is closely related to their capacity for visual mental imagery: strong imagers tend to describe more complex, naturalistic content, while weak imagers more often report simple geometric patterns. Abstract: A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Recent proposals regarding the imagery spectrum, that is, differences in the visual system of individuals with absent imagery, typical imagery, and vivid imagery, suggest these differences should impact the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind's eye during Ganzflicker-induced hallucinations. Strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Embeddings from vision language models better captured these differences than text-only language models, and participants with stronger imagery used language with richer sensorimotor associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

[8] Lizard: An Efficient Linearization Framework for Large Language Models

Chien Van Nguyen,Ruiyi Zhang,Hanieh Deilamsalehy,Puneet Mathur,Viet Dac Lai,Haoliang Wang,Jayakumar Subramanian,Ryan A. Rossi,Trung Bui,Nikos Vlassis,Franck Dernoncourt,Thien Huu Nguyen

Main category: cs.CL

TL;DR: Lizard improves the efficiency of large language models for long-context tasks without sacrificing quality.

Details Motivation: Transformer-based LLMs face memory and computational bottlenecks due to the quadratic complexity of softmax attention and growing KV cache. Method: Lizard uses a hybrid mechanism combining gated linear attention and sliding window attention with meta memory, along with a hardware-aware training algorithm. Result: Lizard achieves near-lossless recovery of teacher model performance and significantly outperforms previous linearization methods, including an 18-point improvement on the 5-shot MMLU benchmark. Conclusion: Lizard is an effective framework for transforming Transformer-based LLMs into subquadratic architectures, enabling efficient infinite-context generation while maintaining output quality. Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. 
Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
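
To make the constant-memory claim concrete, here is a minimal sketch of a generic gated linear attention recurrence in plain Python. It is not Lizard's implementation: the scalar per-step gate, the absence of normalization, and the function name `gated_linear_attention` are simplifying assumptions, and the sliding-window and meta-memory components described in the abstract are not reproduced. The point is only that the state has a fixed size of d_k x d_v regardless of sequence length.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gated_linear_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    gates: List[float],
) -> List[Vector]:
    """Recurrent form of a scalar-gated linear attention layer (illustrative).

    State update: S_t = g_t * S_{t-1} + outer(k_t, v_t)
    Output:       o_t = q_t @ S_t
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # Fixed-size state: memory does not grow with the number of tokens.
    state: Matrix = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[Vector] = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory by the gate, then write the new key-value pair.
                state[i][j] = g * state[i][j] + k[i] * v[j]
        # Read from the state with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

A gate close to 1 keeps old context; a gate close to 0 forgets it. That is the sense in which a gate provides adaptive memory control in this simplified setting.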

[9] ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making

Bharadwaj Ravichandran,David Joy,Paul Elliott,Brian Hu,Jadie Adams,Christopher Funk,Emily Veenhuis,Anthony Hoogs,Arslan Basharat

Main category: cs.CL

TL;DR: The ALIGN system introduces a dynamic personalization framework for LLM-based decision-makers, enabling prompt-based alignment to fine-grained attributes. It provides structured output, modular algorithm support, and both qualitative and quantitative evaluation capabilities across multiple domains.

Details Motivation: Users have diverse values and preferences that influence decision-making when using large language models (LLMs) as decision aids. Existing LLM comparison tools focus primarily on benchmarking tasks, missing the need for dynamic alignment and personalization. This necessitates novel methods to align LLMs with individual preferences and ethical considerations across different domains. Method: ALIGN employs prompt-based alignment techniques to personalize LLM-based decision-makers according to specific attributes. It includes a modular backend that allows easy integration of different algorithms and supports swappable LLM backbones. The system offers structured output generation with reasoning and robust configuration management. A user interface enables side-by-side qualitative comparison, while quantitative analysis is conducted in two domains: demographic alignment for public opinion surveys and value alignment for medical triage decisions. Result: The ALIGN system successfully enables dynamic personalization of LLMs through prompt-based alignment, offering features like structured output with reasoning and modular algorithm integration. It supports both qualitative and quantitative analyses, with demonstrated effectiveness in aligning LLMs for public opinion surveys and medical triage scenarios. The open-source nature of ALIGN encourages further research into responsible and personalized LLM-based decision-making systems. Conclusion: The ALIGN system provides a framework for the dynamic personalization of LLM-based decision-makers, allowing alignment to fine-grained attributes through prompt-based methods. It features robust configuration management, structured output generation with reasoning, and supports various algorithm implementations. The open-source framework facilitates qualitative and quantitative comparisons in different domains, advancing research on reliable and responsible personalized LLM applications. 
Abstract: Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable, responsible, and personalized LLM-based decision-makers.

[10] OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Wasi Uddin Ahmad,Somshubra Majumdar,Aleksander Ficek,Sean Narenthiran,Mehrzad Samadi,Jocelyn Huang,Siddhartha Jain,Vahid Noroozi,Boris Ginsburg

Main category: cs.CL

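TL;DR: This paper presents OpenCodeReasoning-II, a large-scale dataset for code generation and critique, and combines it with a two-stage fine-tuning strategy that significantly improves Qwen2.5-Instruct models on code generation and competitive coding.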

Details Motivation: The test-time scaling potential of reasoning-based large language models (LLMs) creates significant opportunities in code generation and critique, but progress depends on large-scale, high-quality datasets. Method: A two-stage supervised fine-tuning strategy is used: the first stage fine-tunes for code generation, and the second stage jointly trains for code generation and critique. The LiveCodeBench benchmark is also extended to support C++. Result: The OpenCodeReasoning-II dataset contains 2.5M question-solution-critique triples, nearly twice the size of the previous largest public dataset; the fine-tuned Qwen2.5-Instruct models perform strongly on code generation and show significant gains on competitive coding. Conclusion: With two-stage supervised fine-tuning, Qwen2.5-Instruct models match or exceed the best prior open-weight distilled models in code generation, and integrating the generation and critique models significantly improves competitive coding performance. Abstract: Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
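
The abstract does not spell out how the generation and critique models are combined at test time. One common pattern, sketched here purely as an assumption, is to sample several candidate solutions and return the first one the critic accepts. The names `select_with_critique`, `generate`, and `critique` are hypothetical placeholders for model calls, not part of any released API.

```python
from typing import Callable, Optional

# Hypothetical signatures: generate(question, sample_index) -> solution text,
# critique(question, solution) -> True if the critic judges the solution correct.
Generator = Callable[[str, int], str]
Critic = Callable[[str, str], bool]


def select_with_critique(
    question: str,
    generate: Generator,
    critique: Critic,
    num_samples: int = 4,
) -> Optional[str]:
    """Sample candidates and return the first one the critic accepts.

    Falls back to the first candidate when the critic rejects all of them,
    and returns None only when no candidates were sampled.
    """
    candidates = [generate(question, i) for i in range(num_samples)]
    for candidate in candidates:
        if critique(question, candidate):
            return candidate
    return candidates[0] if candidates else None
```

Increasing `num_samples` spends more inference compute for a better chance that at least one candidate passes the critic, which is the general idea behind test-time scaling.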

[11] Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation

Jialong Mai,Xiaofen Xing,Yawei Li,Zhipeng Li,Jingyuan Xing,Xiangmin Xu

Main category: cs.CL

TL;DR: This paper introduces a novel Dynamic Parameter Memory (DPM) approach that enhances speech large language models' capability to process and understand emotions in long audio conversations, leading to improved performance on the IEMOCAP dataset.

Details Motivation: The motivation stems from the limitations faced by current speech large language models in processing long audio sequences due to their high frame rate and the inadequacy of input token compression methods in preserving emotional continuity across conversation turns. Method: The method involves the development and incorporation of a Dynamic Parameter Memory (DPM) mechanism, which encodes sentence-level information and emotions into a temporary LoRA module during inference. This allows for the processing of unlimited-length audio within the constraints of limited context windows in SLLMs. Result: Experimental results on the IEMOCAP dataset demonstrate that the DPM mechanism significantly improves the ability of SLLMs to recognize emotions in long audio sequences. Conclusion: The paper concludes that the proposed Dynamic Parameter Memory (DPM) mechanism significantly enhances the emotion recognition capabilities of speech large language models (SLLMs) when dealing with long audio sequences, achieving state-of-the-art performance. Abstract: Recent research has focused on applying speech large language models (SLLMs) to improve speech emotion recognition (SER). However, the inherently high frame rate of the speech modality severely limits the signal processing and understanding capabilities of SLLMs. For example, an SLLM with a 4K context window can only process 80 seconds of audio at a 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLMs overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLMs. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively "memorize" the contextual information. 
We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
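
The 80-second figure in the abstract follows from simple arithmetic. The sketch below assumes every feature frame costs exactly one token and that no tokens are reserved for the text prompt; both are simplifications, and `max_audio_seconds` is a name introduced here for illustration.

```python
def max_audio_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Seconds of audio that fit if every feature frame costs one token."""
    return context_tokens / frames_per_second


# Reading "4K" as 4,000 tokens reproduces the abstract's figure.
print(max_audio_seconds(4000, 50))  # 80.0
# Reading "4K" as 4,096 tokens gives a slightly larger budget.
print(max_audio_seconds(4096, 50))  # 81.92
```

Either reading gives well under two minutes of audio, which is why a multi-turn conversation cannot fit in a single context window at this frame rate.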

[12] CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

Taolin Zhang,Maosong Cao,Alexander Lam,Songyang Zhang,Kai Chen

Main category: cs.CL

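TL;DR: CompassJudger-2 is a new generalist judge model that overcomes the narrow specialization and limited robustness of current judge models through a task-driven, multi-domain data curation strategy; the paper also proposes JudgerBenchV2, a comprehensive benchmark for cross-domain judgment accuracy and rank consistency.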

Details Motivation: Current LLM-as-judge models suffer from narrow specialization and limited robustness when evaluating large language models, which undermines their capacity for comprehensive evaluation. Method: Judgment tasks are supervised with verifiable rewards, using rejection sampling to guide intrinsic critical reasoning and foster robust, generalizable judgment; a margin policy gradient loss is introduced to improve performance. Result: CompassJudger-2 achieves superior results on multiple judge and reward benchmarks, and its 7B model shows judgment accuracy competitive with significantly larger models such as DeepSeek-V3 and Qwen3-235B-A22B. Conclusion: CompassJudger-2 advances robust, scalable LLM judgment and establishes new performance and evaluation standards. Abstract: Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.

[13] OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering

Ali Vosoughi,Ayoub Shahnazari,Yufeng Xi,Zeliang Zhang,Griffin Hess,Chenliang Xu,Niaz Abdolrahim

Main category: cs.CL

TL;DR: OPENXRD is an open-book pipeline for crystallography question answering that uses GPT-4.5-generated summaries to improve the performance of smaller models on X-ray diffraction tasks.

Details Motivation: To address copyright issues with scanned textbooks and help smaller models better understand key concepts in X-ray diffraction by using AI-generated domain-specific references. Method: The study introduces OPENXRD, which integrates textual prompts with GPT-4.5-generated summaries to support crystallography question answering. It evaluates different vision-language models under closed-book and open-book conditions on 217 expert-level XRD questions. Result: Models using GPT-4.5-generated summaries showed significant accuracy improvements, especially those with limited prior training in crystallography. The approach successfully filled knowledge gaps and enhanced reasoning capabilities in scientific tasks. Conclusion: OPENXRD demonstrates that specialized open-book systems can be effective in materials science and provides a foundation for broader NLP tools in scientific domains. Abstract: This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. 
While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.

[14] PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning

Bhavinkumar Vinodbhai Kuwar,Bikrant Bikram Pratap Maurya,Priyanshu Gupta,Nitin Choudhury

Main category: cs.CL

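TL;DR: This paper proposes PU-Lie, a lightweight model for deception detection in strategic dialogues, suited especially to settings where labels for deceptive messages are scarce.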

Details Motivation: Detecting deception in strategic dialogues is a complex, high-stakes task because of the subtlety of language and the extreme class imbalance between deceptive and truthful communications. Method: A lightweight model combining frozen BERT embeddings, interpretable linguistic and game-specific features, and a Positive-Unlabeled (PU) learning objective. Result: PU-Lie achieves a new best macro F1 of 0.60 while reducing trainable parameters by over 650x, supported by comprehensive evaluations and ablation studies across seven models. Conclusion: PU-Lie performs well at detecting deceptive messages in strategic dialogues, particularly when labels for deceptive messages are scarce. Abstract: Detecting deception in strategic dialogues is a complex and high-stakes task due to the subtlety of language and extreme class imbalance between deceptive and truthful communications. In this work, we revisit deception detection in the Diplomacy dataset, where less than 5% of messages are labeled deceptive. We introduce a lightweight yet effective model combining frozen BERT embeddings, interpretable linguistic and game-specific features, and a Positive-Unlabeled (PU) learning objective. Unlike traditional binary classifiers, PU-Lie is tailored for situations where only a small portion of deceptive messages are labeled, and the majority are unlabeled. Our model achieves a new best macro F1 of 0.60 while reducing trainable parameters by over 650x. Through comprehensive evaluations and ablation studies across seven models, we demonstrate the value of PU learning, linguistic interpretability, and speaker-aware representations. Notably, we emphasize that in this problem setting, accurately detecting deception is more critical than identifying truthful messages. This priority guides our choice of PU learning, which explicitly models the rare but vital deceptive class.
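
A short illustration of why macro F1 is the natural metric under this kind of imbalance. The toy labels below are invented for illustration (5 deceptive messages out of 100, roughly matching the "less than 5%" figure) and are not drawn from the Diplomacy dataset; the helper names are likewise introduced here.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted mean of per-class F1, so the rare class counts as much as the common one."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy data (invented): 1 = deceptive, 0 = truthful.
y_true = [1] * 5 + [0] * 95
always_truthful = [0] * 100  # a classifier that never flags deception

accuracy = sum(t == p for t, p in zip(y_true, always_truthful)) / len(y_true)
print(accuracy)                            # 0.95
print(macro_f1(y_true, always_truthful))   # about 0.487
```

A classifier that never predicts deception reaches 95% accuracy on this toy data yet scores below 0.5 on macro F1, because its F1 on the deceptive class is zero.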

[15] RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking

Shuo Yang,Zijian Yu,Zhenzhe Ying,Yuqin Dai,Guoqing Wang,Jun Lan,Jinfeng Xu,Jinze Li,Edith C. H. Ngai

Main category: cs.CL

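TL;DR: This paper proposes RAMA, a new framework for verifying multimodal misinformation that integrates web-sourced evidence with multiple large language models to improve the accuracy and reliability of fact-checking.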

Details Motivation: To address the challenges that the rapid growth of multimodal misinformation poses for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. Method: Introduces RAMA, a retrieval-augmented multi-agent framework with three core innovations: strategic query formulation, cross-verification evidence aggregation, and a multi-agent ensemble architecture. Result: Experiments show that RAMA performs strongly on benchmark datasets, particularly in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Conclusion: RAMA provides an effective framework for multimodal misinformation verification and underscores the necessity of combining web-based evidence with multi-agent reasoning. Abstract: The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.

[16] Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models

Ameen Ali,Shahar Katz,Lior Wolf,Ivan Titov

Main category: cs.CL

TL;DR: This paper proposes a fine-tuning method that improves the generalization of large language models by identifying and removing neurons linked to dataset-specific patterns.

Details Motivation: LLMs often rely on dataset-specific correlations, which can degrade performance on novel tasks or distributions. Method: Integrated Gradients is used to identify and prune neurons that contribute to high-confidence predictions on specific datasets. Result: The pruning-based fine-tuning approach outperforms prior adaptation methods on multiple-choice benchmarks. Conclusion: Selective pruning of neurons related to dataset-specific mechanisms enhances generalization in LLMs. Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
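As a rough illustration of how Integrated Gradients can rank units for pruning, the sketch below attributes the output of a toy model to three "neuron activations" and picks the one with the largest attribution magnitude. The toy model, the zero baseline, the step count, and the "largest magnitude" pruning rule are all assumptions; the abstract does not give the paper's exact scoring or pruning criterion.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Midpoint Riemann approximation of Integrated Gradients.

    Averages the gradient along the straight path from baseline to x,
    then scales it by (x - baseline).
    """
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad


# Hypothetical toy model over three neuron activations: f(h) = tanh(w . h)
w = np.array([2.0, -1.0, 0.1])


def model(h):
    return np.tanh(w @ h)


def model_grad(h):
    # d/dh tanh(w . h) = (1 - tanh^2(w . h)) * w
    return (1.0 - np.tanh(w @ h) ** 2) * w


x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(model_grad, x, baseline)

# Candidate for pruning under the assumed rule: largest |attribution|.
prune_index = int(np.argmax(np.abs(attributions)))
```

A useful sanity check is the completeness property: the attributions should sum to approximately `model(x) - model(baseline)`.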

[17] Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Leiyu Pan,Bojian Xiong,Lei Yang,Renren Jin,Shaowei Zhang,Yue Chen,Ling Shi,Jiang Zhou,Junru Wu,Zhen Wang,Jianxiang Peng,Juesi Xiao,Tianyu Dong,Zhuowen Han,Zhuo Chen,Sangjee Dondrub,Caizang Tai,Haixing Zhao,Huaque Cairang,Suonan Cairang,Rou Te,Lengben Zhaxi,Gazang Zhaxi,Zhonglin Ye,Yuhui Zheng,Chunyan Peng,Secha Jia,Pema Tashi,Cizhen Jiacuo,Pema Dorjee,Hongkai Liu,Pema Yanggon,Tsehang Dorjee,Jiaxin Han,Qiongying Hu,Jilin Man,Huanke You,Yuqi Ren,Duo La,Deyi Xiong

Main category: cs.CL

TL;DR: This paper introduces Banzhida, a multilingual large language model that improves generative AI for Tibetan by addressing the lack of high-quality training data.

Details Motivation: Tibetan, as a low-resource language, is underrepresented in existing language models due to the scarcity of high-quality training data. Method: Continue pre/post-training a multilingual base model with a curated Tibetan pre-training corpus to develop Banzhida. Result: Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across multiple tasks. Conclusion: Banzhida, a multilingual large language model, advances generative AI for Tibetan and outperforms other models in various tasks. Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

Biagio Scalingi,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini

Main category: cs.CL

TL;DR: This study explores how visual metaphors (like melting glaciers as an ice grenade) affect climate change communication compared to literal images. While metaphors demand more cognitive effort and are less straightforward, they offer greater aesthetic appeal and a more positive emotional response. A new dataset, MetaClimage, is introduced to support future research.

Details Motivation: Visual metaphors are seen as useful tools for communicating complex environmental issues, yet their impact has been understudied due to limited availability of materials. This work aims to fill that gap by providing a structured dataset and analyzing how visual metaphors compare to literal imagery in terms of comprehension, emotion, and effectiveness. Method: The study presents the MetaClimage database, which includes metaphorical and literal images related to climate change, enriched with human ratings. Ratings covered difficulty, efficacy, artistic quality, and emotional arousal, along with participant-generated tags summarizing the message. Natural Language Processing was used to derive semantic and emotional variables from these tags. Result: Visual metaphors were rated as harder to understand but more aesthetically pleasing than literal images. They did not differ significantly in efficacy or emotional arousal overall, though arousal was higher in participants with high Need For Cognition. Metaphorical images received more descriptive tags, referenced more abstract concepts, and elicited more positively valenced and dominant language. Conclusion: Visual metaphors of climate change, while not more effective or arousing than literal images, generate greater aesthetic appreciation and a more positive emotional experience. They also impose a higher cognitive load but may encourage deeper cognitive processing. The MetaClimage database offers valuable resources for future research on this topic. Abstract: Visual metaphors of climate change (e.g., melting glaciers depicted as a melting ice grenade) are regarded as valuable tools for addressing the complexity of environmental challenges. However, few studies have examined their impact on communication, also due to scattered availability of material. 
Here, we present a novel database of Metaphors of Climate Change in Images (MetaClimage) https://doi.org/10.5281/zenodo.15861012, paired with literal images and enriched with human ratings. For each image, we collected values of difficulty, efficacy, artistic quality, and emotional arousal from human rating, as well as number of tags generated by participants to summarize the message. Semantic and emotion variables were further derived from the tags via Natural Language Processing. Visual metaphors were rated as more difficult to understand, yet more aesthetically pleasant than literal images, but did not differ in efficacy and arousal. The latter for visual metaphors, however, was higher in participants with higher Need For Cognition. Furthermore, visual metaphors received more tags, often referring to entities not depicted in the image, and elicited words with more positive valence and greater dominance than literal images. These results evidence the greater cognitive load of visual metaphors, which nevertheless might induce positive effects such as deeper cognitive elaboration and abstraction compared to literal stimuli. Furthermore, while they are not deemed as more effective and arousing, visual metaphors seem to generate superior aesthetic appreciation and a more positively valenced experience. Overall, this study contributes to understanding the impact of visual metaphors of climate change both by offering a database for future research and by elucidating a cost-benefit trade-off to take into account when shaping environmental communication.

[19] Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Deshan Sumanathilaka,Sameera Perera,Sachithya Dharmasiri,Maneesha Athukorala,Anuja Dilrukshi Herath,Rukshan Dias,Pasindu Gamage,Ruvan Weerasinghe,Y. H. P. P. Priyadarshana

Main category: cs.CL

TL;DR: The Swa-bhasha Resource Hub offers valuable datasets and tools for Romanized Sinhala to Sinhala transliteration, significantly contributing to Sinhala NLP advancements.

Details Motivation: To advance research in Sinhala Natural Language Processing by providing accessible datasets and tools for transliteration between Romanized Sinhala and Sinhala. Method: This paper presents an overview of the data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration, including a comparative analysis of existing transliteration applications. Result: A comprehensive collection of data resources and algorithms were developed and made publicly available through the Swa-bhasha Resource Hub. Conclusion: The Swa-bhasha Resource Hub has become a vital asset for advancing Sinhala NLP research and applications, providing datasets and tools for Romanized Sinhala to Sinhala transliteration. Abstract: The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.

[20] Psychology-Driven Enhancement of Humour Translation

Yuchen Su,Yonghua Zhu,Yang Chen,Diana Benavides-Prado,Michael Witbrock

Main category: cs.CL

TL;DR: This paper introduces a psychology-inspired method called Humour Decomposition Mechanism (HDM) that improves humour translation by mimicking human thought processes, resulting in better humour retention, fluency, and coherence.

Details Motivation: While LLMs excel at general translation tasks, they struggle with translating humour due to linguistic interference and lack of humour preservation. This limitation hinders effective cross-cultural communication. Method: We introduced a psychology-inspired Humour Decomposition Mechanism (HDM) using Chain-of-Thought (CoT) to mimic human cognitive processes, combined with humour theory, to improve the readability and humorous elements in translations. Result: Automatic evaluation on open-source humour datasets showed an improvement of 7.75% in humour, 2.81% in fluency, and 6.13% in coherence in translated texts using our method. Conclusion: The proposed psychology-inspired Humour Decomposition Mechanism (HDM) effectively enhances the ability of Large Language Models (LLMs) in translating humour, significantly improving humour, fluency, and coherence in translated texts. Abstract: Humour translation plays a vital role as a bridge between different cultures, fostering understanding and communication. Although most existing Large Language Models (LLMs) are capable of general translation tasks, these models still struggle with humour translation, which is especially reflected through linguistic interference and lacking humour in translated text. In this paper, we propose a psychology-inspired Humour Decomposition Mechanism (HDM) that utilises Chain-of-Thought (CoT) to imitate the ability of the human thought process, stimulating LLMs to optimise the readability of translated humorous texts. Moreover, we integrate humour theory in HDM to further enhance the humorous elements in the translated text. Our automatic evaluation experiments on open-source humour datasets demonstrate that our method significantly improves the quality of humour translation, yielding average gains of 7.75% in humour, 2.81% in fluency, and 6.13% in coherence of the generated text.

[21] ClaritySpeech: Dementia Obfuscation in Speech

Dominika Woszczyk,Ranya Aloufi,Soteris Demetriou

Main category: cs.CL

TL;DR: This paper introduces ClaritySpeech, a framework that enhances dementia-affected speech using ASR, text obfuscation, and TTS, improving clarity, privacy, and accessibility while preserving speaker identity.

Details Motivation: Dementia alters speech patterns, creating communication barriers and privacy concerns. Current speech technologies like ASR struggle with atypical speech, prompting the need for a solution that improves accessibility and maintains privacy. Method: ClaritySpeech integrates automatic speech transcription (ASR), text obfuscation, and zero-shot text-to-speech (TTS) to correct speech patterns affected by dementia without requiring fine-tuning in low-data environments. Result: Results showed a significant improvement in speech clarity with a drop in mean F1 score across adversarial settings (16% for ADReSS, 10% for ADReSSo), improved WER (from 0.73 to 0.08 for ADReSS and 0.15 for ADReSSo), and increased speech quality from 1.65 to ~2.15, while maintaining 50% speaker similarity. Conclusion: The study concludes that ClaritySpeech effectively enhances the clarity and quality of speech affected by dementia, while preserving speaker identity and improving privacy and accessibility. Abstract: Dementia, a neurodegenerative disease, alters speech patterns, creating communication barriers and raising privacy concerns. Current speech technologies, such as automatic speech transcription (ASR), struggle with dementia and atypical speech, further challenging accessibility. This paper presents a novel dementia obfuscation in speech framework, ClaritySpeech, integrating ASR, text obfuscation, and zero-shot text-to-speech (TTS) to correct dementia-affected speech while preserving speaker identity in low-data environments without fine-tuning. Results show a 16% and 10% drop in mean F1 score across various adversarial settings and modalities (audio, text, fusion) for ADReSS and ADReSSo, respectively, maintaining 50% speaker similarity. We also find that our system improves WER (from 0.73 to 0.08 for ADReSS and 0.15 for ADReSSo) and speech quality from 1.65 to ~2.15, enhancing privacy and accessibility.
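The WER figures quoted above are word error rates: the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The following is a minimal sketch of that standard definition; the abstract does not say which scoring tool or text normalization the authors used, so the whitespace tokenization here is an assumption.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example = wer("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.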

[22] DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao,Yijun Pan,Emily Xiao,Daisy Sheng,Niket Jain,Hanzhang Zhao,Ishita Dasgupta,Jiaqi W. Ma,Chenyan Xiong

Main category: cs.CL

TL;DR: DATE-LM is introduced as a unified benchmark for evaluating data attribution methods in LLMs, addressing gaps in systematic evaluation and highlighting performance trade-offs.

Details Motivation: There are critical gaps in systematic evaluation of data attribution methods in LLM research, necessitating a unified benchmark. Method: DATE-LM benchmark evaluates data attribution methods using three key tasks: training data selection, toxicity/bias filtering, and factual attribution across various LLM architectures. Result: Findings indicate no single method dominates all tasks, existence of trade-offs with simpler baselines, and sensitivity of performance to task-specific design. Conclusion: DATE-LM serves as a foundation for future data attribution research in LLMs and facilitates community engagement through a public leaderboard. Abstract: Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks -- training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.

[23] Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models

Mingchuan Yang,Ziyuan Huang

Main category: cs.CL

TL;DR: This study optimizes the DRAGON Longformer base model for clinical text classification, achieving notable improvements in performance by incorporating domain-specific enhancements.

Details Motivation: To improve the accuracy and effectiveness of clinical text classification for medical case descriptions using a domain-specific language model. Method: Hyperparameter tuning, domain-specific preprocessing, and architectural adjustments were applied to the DRAGON Longformer base model, including increasing sequence length, adjusting learning rates, extending training epochs, and incorporating medical terminology. Result: The optimized model achieved accuracy of 85.2%, precision of 84.1%, recall of 86.3%, and an F1-score of 85.2%, with statistically significant improvements (p < .001). Conclusion: The optimized DRAGON Longformer base model demonstrates significant improvements in clinical text classification, enhancing performance metrics and offering practical applications in healthcare settings. Abstract: This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p < .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. 
These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model's strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.
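The reported F1-scores can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A small sketch follows; the function name is illustrative, and it assumes the abstract's figures are plain binary-class precision and recall, which the abstract does not state explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the abstract, expressed as fractions.
optimized_f1 = f1_score(0.841, 0.863)
baseline_f1 = f1_score(0.68, 0.75)
```

The optimized figures give roughly 0.852, matching the reported 85.2%. The baseline figures give roughly 0.713, close to the reported 71.0%; the small gap may come from rounding or from an averaging scheme the abstract does not specify.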

[24] The CoNLL-2013 Shared Task on Grammatical Error Correction

Hwee Tou Ng,Siew Mei Wu,Yuanbin Wu,Christian Hadiwinoto,Joel Tetreault

Main category: cs.CL

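TL;DR: This paper describes the CoNLL-2013 shared task on grammatical error correction and summarizes the participating teams' approaches and the evaluation results.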

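Details Motivation: The CoNLL-2013 shared task was organized, with research teams invited to participate, to advance progress in grammatical error correction. Method: The paper gives the task definition, presents the data sets, and describes the evaluation metric and scorer used in the shared task. Result: It gives an overview of the approaches adopted by the participating teams and presents the evaluation results. Conclusion: The paper summarizes the various approaches adopted by the teams participating in the CoNLL-2013 shared task and presents the evaluation results. Abstract: The CoNLL-2013 shared task was devoted to grammatical error correction. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results.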

[25] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Yangning Li,Weizhi Zhang,Yuyao Yang,Wei-Chieh Huang,Yaozu Wu,Junyu Luo,Yuanchen Bei,Henry Peng Zou,Xiao Luo,Yusheng Zhao,Chunkit Chan,Yankai Chen,Zhongfen Deng,Yinghui Li,Hai-Tao Zheng,Dongyuan Li,Renhe Jiang,Ming Zhang,Yangqiu Song,Philip S. Yu

Main category: cs.CL

TL;DR: This paper explores how combining retrieval-augmented generation with reasoning improves the performance of large language models on complex, knowledge-intensive tasks.

Details Motivation: The motivation is to enhance the factuality and multi-step inference capabilities of Large Language Models through a combination of retrieval-augmented generation (RAG) and reasoning approaches. Method: The paper uses a survey methodology to synthesize reasoning-retrieval approaches, categorizing methods, datasets, and open challenges. Result: The paper identifies synergized RAG-Reasoning frameworks as state-of-the-art solutions for knowledge-intensive benchmarks, while spotlighting advanced reasoning techniques and providing a categorized overview of the field. Conclusion: The paper concludes by outlining research avenues towards deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. Abstract: Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different type supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.

[26] ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning

Changli Wang,Rui Wu,Fang Yin

Main category: cs.CL

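TL;DR: This paper introduces M2SaG, a new multimodal sarcasm generation dataset, and ViSP, a generation framework based on PPO and contrastive learning that outperforms existing methods at generating high-quality sarcastic text.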

Details Motivation: Despite progress in sarcasm research, sarcasm generation remains underexplored because of an overreliance on textual modalities that neglects visual cues, and because of the mismatch between image content and sarcastic intent in existing datasets. The paper therefore introduces M2SaG, a new multimodal sarcasm generation dataset. Method: Proposes ViSP, a generation framework that combines Proximal Policy Optimization (PPO) with contrastive learning and uses reward scores from DIP to steer the generation of sarcastic text. Result: ViSP is evaluated across five metric sets and surpasses all baselines, including large language models. Its generated texts also show higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739). Conclusion: By combining PPO and contrastive learning, ViSP effectively improves the quality of generated sarcastic text, and the release of the M2SaG dataset provides a valuable resource for future research. Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at https://github.com/wclapply/ViSP.

[27] Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis

Junjie Liu,Yuanhe Tian,Yan Song

Main category: cs.CL

TL;DR: This paper proposes an improved LLM-based approach for aspect-based sentiment analysis (ABSA) using data augmentation enhanced by reinforcement learning, showing significant performance gains on benchmark datasets.

Details Motivation: Existing approaches face challenges in capturing context due to short texts and limited, unbalanced labeled data. Data augmentation using LLMs can enrich training data but requires optimization for quality. Method: An LLM is used to generate augmented training data based on the original dataset. A reinforcement learning method is applied to enhance the quality of the generated data, aiming to improve the performance of the ABSA model. Result: Experimental results show that the proposed method achieves better performance over strong baselines and most existing studies in ABSA tasks. Conclusion: The proposed LLM-based ABSA approach with training data augmentation, optimized through reinforcement learning, demonstrates superior performance on English benchmark datasets compared to existing methods. Abstract: Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios to identify the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges to learn the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented data. In this paper, we propose an LLM-based ABSA approach with training data augmentation. Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training data with larger size and balanced label distributions to better train an ABSA model. 
Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation LLM. Experiment results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.

[28] GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities

Siyi Wu,Zeyu Wang,Xinyuan Song,Zhengpeng Zhou,Lifan Sun,Tianyu Shi

Main category: cs.CL

TL;DR: This paper introduces GoalfyMax, a protocol-driven multi-agent collaboration framework that improves adaptability, coordination, and knowledge retention in complex environments.

Details Motivation: Traditional single-purpose AI systems struggle with coordination, memory reuse, and task decomposition, limiting their effectiveness in complex, dynamic enterprise environments. There is a need for more intelligent, autonomous, and adaptable multi-agent systems. Method: The paper introduces GoalfyMax, which uses a protocol-driven approach with an Agent-to-Agent (A2A) communication layer based on the Model Context Protocol (MCP) and an Experience Pack (XP) architecture for memory retention. It also integrates multi-turn dialogue, memory modules, and safety validation for robust operation. Result: Empirical evaluations show that GoalfyMax outperforms baseline frameworks in adaptability, coordination, and experience reuse on complex task orchestration benchmarks and case studies. Conclusion: GoalfyMax is a scalable and future-ready framework for multi-agent intelligent systems, demonstrating superior adaptability, coordination, and experience reuse. Abstract: Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi-faceted tasks with high levels of autonomy and adaptability. However, traditional single-purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present GoalfyMax, a protocol-driven framework for end-to-end multi-agent collaboration. GoalfyMax introduces a standardized Agent-to-Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol-compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. 
Moreover, our system integrates advanced features including multi-turn contextual dialogue, long-short term memory modules, and dynamic safety validation, supporting robust, real-time strategy adaptation. Empirical results on complex task orchestration benchmarks and case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future-ready foundation for multi-agent intelligent systems.

[29] Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models

Junjie Wu,Gefei Gu,Yanan Zheng,Dit-Yan Yeung,Arman Cohan

Main category: cs.CL

TL;DR: This paper introduces Ref-Long, a benchmark for evaluating long-context referencing in language models, revealing key challenges and insights.

Details Motivation: Long-context referencing remains underexplored despite advancements in long-context language models (LCLMs). Method: Constructed Ref-Long benchmark with synthetic to realistic subsets; conducted human evaluations, task format adjustments, fine-tuning experiments, and error analyses. Result: Experiments on 13 LCLMs highlight challenges in contextual relationship understanding over simple retrieval tasks. Conclusion: Ref-Long benchmark reveals significant shortcomings in long-context referencing capabilities of LCLMs, even in advanced models like GPT-4o. Abstract: Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing -- a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data -- remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results on 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found at https://github.com/wujunjie1998/Ref-Long.

[30] How Important is `Perfect' English for Machine Translation Prompts?

Patrícia Schmidtová,Niyati Bafna,Seth Aycock,Gianluca Vico,Wiktor Kamzela,Katharina Hämmerl,Vilém Zouhar

Main category: cs.CL

TL;DR: This paper explores how errors in user prompts impact the performance of large language models in machine translation tasks. It finds that prompt quality significantly affects performance, different error types have varying impacts, and LLMs show resilience to extreme noise levels.

Details Motivation: Large language models are known for achieving top results in machine translation but are sensitive to prompt errors. This study aims to understand how such errors impact model performance and provide insights into their robustness and limitations. Method: A systematic evaluation was conducted on how humanly plausible and synthetic errors in user prompts affect LLMs' performance on machine translation and evaluation tasks. The analysis included both quantitative measures and qualitative insights. Result: Prompt quality has a strong effect on translation performance. Prompts with many errors may underperform compared to simpler or less detailed prompts without errors. Character-level and combined noise types degrade performance more than phrasal perturbations. LLMs demonstrate resilience by translating even when prompts are rendered illegible to humans due to random noise. Conclusion: The study concludes that prompt quality significantly affects the performance of large language models (LLMs) in machine translation and evaluation tasks. Lower prompt quality primarily leads to poorer instruction following rather than directly impacting translation quality. Interestingly, LLMs can still perform translations even under high levels of random noise. Abstract: Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. 
However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.

[31] Adapting Definition Modeling for New Languages: A Case Study on Belarusian

Daniela Kazakouskaya,Timothee Mickus,Janine Siewert

Main category: cs.CL

TL;DR: This paper explores adapting definition modeling systems to Belarusian using a new dataset, showing promise with minimal data but highlighting limitations in current evaluation metrics.

Details Motivation: Definition modeling has potential to assist lexicographers in documenting diverse languages, but more research is needed on leveraging pre-existing models for unsupported languages. Method: The authors proposed a novel dataset of 43,150 definitions for Belarusian and conducted experiments to adapt existing definition modeling systems to this language. Result: The experiments showed that only small amounts of data are needed to adapt definition modeling systems to Belarusian, but current automatic evaluation metrics fail to fully capture the quality of generated definitions. Conclusion: Adapting definition modeling systems to new languages like Belarusian requires minimal data, but current automatic metrics have gaps in capturing necessary aspects. Abstract: Definition modeling, the task of generating new definitions for words in context, holds great promise as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done in order to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting definition modeling systems requires minimal amounts of data, but that there currently are gaps in what automatic metrics capture.

[32] NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

Hanwool Lee,Sara Yu,Yewon Hwang,Jonghyun Choi,Heejae Ahn,Sungbum Jung,Youngjae Yu

Main category: cs.CL

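TL;DR: This paper presents NMIXX, a cross-lingual financial embedding model, and KorFinSTS, a Korean financial semantic textual similarity benchmark, which together improve the understanding of financial text in low-resource languages.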

Details Motivation: General-purpose sentence embedding models fall short in capturing domain-specific financial semantics, especially in low-resource languages such as Korean, because of domain jargon, meaning shifts over time, and mismatched bilingual vocabularies. Dedicated cross-lingual financial embedding models are therefore needed. Method: The paper proposes the NMIXX model, based on a multilingual bge-m3 variant and fine-tuned with 18.8K high-confidence triplets comprising in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translation pairs. It also introduces the KorFinSTS benchmark dataset for evaluation. Result: NMIXX's multilingual bge-m3 variant achieves Spearman's rho gains of +0.10 on English FinSTS and +0.22 on Korean KorFinSTS, clearly outperforming the other baselines. Its performance on general STS tasks drops slightly, indicating a trade-off. Conclusion: Fine-tuning on high-quality triplets gives NMIXX strong financial semantic understanding, particularly for low-resource languages such as Korean. The release of the KorFinSTS benchmark also provides a new standard for evaluating cross-lingual financial embedding models. Abstract: General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX's multilingual bge-m3 variant achieves Spearman's rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.

[33] SpreadPy: A Python tool for modelling spreading activation and superdiffusion in cognitive multiplex networks

Salvatore Citraro,Edith Haim,Alessandra Carini,Cynthia S. Q. Siew,Giulio Rossetti,Massimo Stella

Main category: cs.CL

TL;DR: SpreadPy is an open Python library that simulates spreading activation in cognitive networks, helping researchers explore how network structure influences cognitive processes and impairments through empirical or theoretical models.

Details Motivation: To understand structure-function relationships in cognitive processes and how activation dynamics reflect cognitive, psychological, and clinical phenomena. Method: Development of SpreadPy, a Python library for numerical simulations of spreading activation in single-layer and multiplex cognitive networks, demonstrated through three case studies. Result: Three key findings: (1) Activation patterns distinguish math anxiety levels via structural differences in knowledge networks; (2) Cognitive load affects lexical access during creativity tasks; (3) Simulated activation patterns correlate with clinical error types in aphasia patients. Conclusion: SpreadPy is a valuable tool for simulating spreading activation in cognitive networks, offering insights into individual differences and cognitive impairments while supporting reproducible research across multiple fields. Abstract: We introduce SpreadPy as a Python library for simulating spreading activation in cognitive single-layer and multiplex networks. Our tool is designed to perform numerical simulations testing structure-function relationships in cognitive processes. By comparing simulation results with grounded theories in knowledge modelling, SpreadPy enables systematic investigations of how activation dynamics reflect cognitive, psychological and clinical phenomena. We demonstrate the library's utility through three case studies: (1) Spreading activation on associative knowledge networks distinguishes students with high versus low math anxiety, revealing anxiety-related structural differences in conceptual organization; (2) Simulations of a creativity task show that activation trajectories vary with task difficulty, exposing how cognitive load modulates lexical access; (3) In individuals with aphasia, simulated activation patterns on lexical networks correlate with empirical error types (semantic vs. phonological) during picture-naming tasks, linking network structure to clinical impairments. 
SpreadPy's flexible framework allows researchers to model these processes using empirically derived or theoretical networks, providing mechanistic insights into individual differences and cognitive impairments. The library is openly available, supporting reproducible research in psychology, neuroscience, and education research.
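
To make the idea of spreading activation concrete, here is a minimal pure-Python sketch on a single-layer network. It does not use SpreadPy and does not reproduce its API; the function name `spread_activation`, the `retention` parameter, and the toy word network are all assumptions chosen for illustration. At each step a node keeps a fraction of its activation and passes the rest to its neighbours in equal shares, so the total activation is conserved.

```python
from collections import defaultdict


def spread_activation(edges, initial, steps, retention=0.5):
    """Spread activation over an undirected graph for a fixed number of steps.

    edges:     iterable of (node_a, node_b) pairs
    initial:   dict mapping node -> starting activation
    retention: fraction of activation a node keeps at every step (assumed value)
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    activation = dict(initial)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, value in activation.items():
            linked = neighbours.get(node, set())
            if not linked:
                # Isolated nodes simply keep what they have.
                nxt[node] += value
                continue
            nxt[node] += retention * value
            share = (1.0 - retention) * value / len(linked)
            for other in linked:
                nxt[other] += share
        activation = dict(nxt)
    return activation


# Toy associative network (hypothetical data, not from the paper).
toy_edges = [("math", "number"), ("math", "anxiety"), ("number", "count")]
result = spread_activation(toy_edges, {"math": 1.0}, steps=1)
```

After one step the seed node "math" retains half of its activation and each of its two neighbours receives a quarter. Comparing such trajectories across networks with different structure is the kind of analysis the abstract describes, although the library's actual update rule may differ.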

[34] An Exploration of Knowledge Editing for Arabic

Basel Mousi,Nadir Durrani,Fahim Dalvi

Main category: cs.CL

TL;DR: This study investigates Knowledge Editing in Arabic, showing that instruction-tuned methods outperform parameter-based ones in cross-lingual settings, with improvements via multilingual training.

Details Motivation: Knowledge Editing (KE) has been widely explored in English, but its behavior in morphologically rich languages like Arabic remains underexamined. Method: Four KE methods (ROME, MEMIT, ICE, LTE) were evaluated on Llama-2-7B-chat using Arabic translations of ZsRE and Counterfact benchmarks, analyzing multilingual and cross-lingual settings. Result: Parameter-based KE methods struggle with cross-lingual generalization, while instruction-tuned methods perform better. Multilingual LTE training enhances performance. Conclusion: Instruction-tuned methods perform more robustly in cross-lingual settings compared to parameter-based methods. Joint Arabic-English training improves editability and transfer. Abstract: While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training data for LTE to support future research.

Pawitsapak Akarajaradwong,Chompakorn Chaksangchaichot,Pirat Pothavorn,Attapol Thamrongrattanarit-Rutherford,Ekapol Chuangsuwanich,Sarana Nutanong

Main category: cs.CL

TL;DR: This paper applies Group-Relative Policy Optimization (GRPO) as a resource-efficient way to improve Thai legal LLMs on complex legal reasoning tasks, reporting significant improvements in citation accuracy and response quality.

Details Motivation: RAG systems have limited performance on Thai legal question answering, especially for questions requiring extensive and complex legal reasoning. This study aims to address these limitations. Method: Group-Relative Policy Optimization (GRPO) is introduced, leveraging BGE-M3 embeddings as a cost-efficient semantic-similarity reward to align LLMs. Result: Experiments on the NitiBench benchmark showed up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning, with computational expenses reduced up to 2.5x. Conclusion: The proposed GRPO approach effectively enhances Thai legal LLMs by improving law citation accuracy and response quality, demonstrating robustness on complex legal reasoning tasks compared to instruction tuning. Abstract: The Retrieval-Augmented Generation (RAG) systems' performance on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
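
The combination of an embedding-similarity reward with group-relative normalization can be sketched as follows. This is only an illustration under stated assumptions: the vectors below are toy placeholders standing in for BGE-M3 embeddings, the helper names `cosine` and `group_relative_advantages` are hypothetical, and the paper's actual reward may include further terms (for example for citations) that are not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each response is scored relative to the mean and standard deviation
    of its own group, which is the core idea behind GRPO-style advantages.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy embeddings (assumed): one reference answer and three sampled responses.
reference = [1.0, 0.0]
samples = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rewards = [cosine(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
```

Responses closer to the reference receive positive advantages and those further away receive negative ones, without any call to a large language model judge, which is where the abstract locates the cost saving.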

[36] MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs

Shulin Huang,Linyi Yang,Yue Zhang

Main category: cs.CL

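TL;DR: This paper proposes MCEval, a new multilingual evaluation framework for assessing the cultural understanding of large language models, and uses it to expose cultural bias and fairness issues.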

Details Motivation: Large language models exhibit cultural biases and limited cross-cultural understanding, particularly when serving diverse global user populations. Method: The authors propose MCEval, a new multilingual evaluation framework that uses dynamic cultural question construction and supports causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Result: Experiments reveal performance disparities across linguistic scenarios, showing that optimal cultural performance depends not only on training data distribution but also on language-culture alignment. The evaluation also exposes a fairness issue: approaches that appear successful in the English scenario can create substantial disadvantages. Conclusion: MCEval is a comprehensive multilingual cultural evaluation framework that offers deeper insight into LLMs' cultural understanding and exposes fairness issues. Abstract: Large language models exhibit cultural biases and limited cross-cultural understanding capabilities, particularly when serving diverse global user populations. We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction and enables causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Our comprehensive evaluation spans 13 cultures and 13 languages, systematically assessing both cultural awareness and cultural bias across different linguistic scenarios. The framework provides 39,897 cultural awareness instances and 17,940 cultural bias instances. Experimental results reveal performance disparities across different linguistic scenarios, demonstrating that optimal cultural performance is not only linked to training data distribution, but is also related to language-culture alignment. The evaluation results also expose the fairness issue, where approaches appearing successful in the English scenario create substantial disadvantages. MCEval represents the first comprehensive multilingual cultural evaluation framework that provides deeper insights into LLMs' cultural understanding.

[37] Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces

Baturay Saglam,Paul Kassianik,Blaine Nelson,Sajana Weerawardhena,Yaron Singer,Amin Karbasi

Main category: cs.CL

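TL;DR: This study shows that large language models encode semantics in low-dimensional, linearly separable subspaces of their hidden states, and demonstrates the potential of this structure for detecting adversarial content.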

Details Motivation: Understanding the latent space geometry of large language models is key to interpreting their behavior and improving alignment, yet it remains unclear how models organize internal representations related to semantic understanding. Method: A large-scale empirical study of hidden states in 11 decoder-only LLMs, covering 6 scientific topics and 12 layers per model. Result: High-level semantic information lies in low-dimensional subspaces that form linearly separable representations across domains. The separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors. In addition, reasoning patterns such as chain-of-thought can be captured through simple causal interventions. Conclusion: High-level semantic information in LLMs lies in low-dimensional subspaces, and these representations are linearly separable across domains. This geometry can be used to build tools that operate on latent representations to detect and mitigate harmful content. Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors -- even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.

[38] Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding

Qi Feng,Yihong Liu,Hinrich Schütze

Main category: cs.CL

TL;DR: This paper introduces a self-adaptive curriculum learning approach for NLP, leveraging pre-trained language models to determine example difficulty and optimize training order, resulting in faster convergence and better performance than random sampling.

Details Motivation: The motivation is to overcome the limitations of manually defined difficulty metrics (e.g., text length) in traditional curriculum learning, which may not accurately reflect the model's perspective. Method: The method involves predicting example difficulty scores using pre-trained language models (PLMs) and organizing training examples for fine-tuning in strategies such as easy-to-hard, hard-to-easy, or mixed sampling. Result: Experimental results on four natural language understanding (NLU) datasets show faster convergence and improved performance of the proposed method compared to random sampling. Conclusion: The proposed self-adaptive curriculum learning paradigm improves convergence speed and performance compared to standard random sampling in fine-tuning pre-trained language models. Abstract: Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics -- such as text length -- which may not accurately reflect the model's own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
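
The ordering strategies named above can be illustrated with a short sketch. It assumes the difficulty scores have already been produced by some model and are passed in as plain numbers; the function name `order_by_difficulty` is hypothetical, and the "mixed" strategy here (alternating the easiest and hardest remaining examples) is just one plausible reading, since the abstract does not define it.

```python
def order_by_difficulty(examples, scores, strategy="easy_to_hard"):
    """Return examples reordered according to per-example difficulty scores.

    scores[i] is the (assumed precomputed) difficulty of examples[i];
    lower means easier.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")

    # Sort by score only, so examples themselves never need to be comparable.
    ranked = [ex for _, ex in sorted(zip(scores, examples), key=lambda p: p[0])]

    if strategy == "easy_to_hard":
        return ranked
    if strategy == "hard_to_easy":
        return ranked[::-1]
    if strategy == "mixed":
        # Assumed interpretation: alternate easiest and hardest remaining items.
        mixed, lo, hi = [], 0, len(ranked) - 1
        while lo <= hi:
            mixed.append(ranked[lo])
            lo += 1
            if lo <= hi:
                mixed.append(ranked[hi])
                hi -= 1
        return mixed
    raise ValueError(f"unknown strategy: {strategy}")


# Toy data (hypothetical): four examples with made-up difficulty scores.
examples = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.4, 0.7]
easy_first = order_by_difficulty(examples, scores, "easy_to_hard")
```

The reordered list would then be fed to the fine-tuning loop in place of a randomly shuffled one, which is the baseline the paper compares against.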

[39] Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News

Gabriel Mordecki,Guillermo Moncecchi,Javier Couto

Main category: cs.CL

TL;DR: This paper redefines clickbait as a technique that creates curiosity by omitting information, introduces TA1C, a refined Spanish clickbait dataset, and achieves strong detection results.

Details Motivation: There is a lack of consensus on the definition of clickbait, which affects dataset creation and detection efforts. The authors aim to refine the conceptual boundaries and provide an objective framework for identifying clickbait. Method: The authors proposed a new definition of clickbait based on the concept of a 'curiosity gap' and developed TA1C, a dataset for clickbait detection in Spanish with refined annotation criteria to minimize subjectivity. They also implemented baseline models to evaluate performance. Result: The authors created the TA1C dataset consisting of 3,500 manually annotated tweets with high inter-annotator agreement (Fleiss' K = 0.825). Baseline models achieved an F1-score of 0.84. Conclusion: The paper concludes that the deliberate omission of information in headlines to create curiosity is the defining characteristic of clickbait, and their refined approach to dataset creation improves objectivity and detection performance. Abstract: We revise the definition of clickbait, which lacks current consensus, and argue that the creation of a curiosity gap is the key concept that distinguishes clickbait from other related phenomena such as sensationalism and headlines that do not deliver what they promise or diverge from the article. Therefore, we propose a new definition: clickbait is a technique for generating headlines and teasers that deliberately omit part of the information with the goal of raising the readers' curiosity, capturing their attention and enticing them to click. We introduce a new approach to the creation of clickbait detection datasets, by refining the concept limits and annotation criteria, minimizing the subjectivity in the decision as much as possible. Following it, we create and release TA1C (for Te Ahorré Un Click, Spanish for Saved You A Click), the first open source dataset for clickbait detection in Spanish. 
It consists of 3,500 tweets coming from 18 well-known media sources, manually annotated and reaching a 0.825 Fleiss' K inter-annotator agreement. We implement strong baselines that achieve 0.84 in F1-score.
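
For readers unfamiliar with the agreement statistic quoted above, the following sketch computes Fleiss' kappa from a table of rating counts using the standard textbook formula. The function name `fleiss_kappa` and the toy rating tables are assumptions for illustration; they are not the TA1C annotations and do not reproduce the reported 0.825.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must have the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item agreement: share of annotator pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement derived from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        # All ratings fall into a single category; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy table (hypothetical): 2 tweets, 3 annotators, categories [clickbait, not].
perfect = fleiss_kappa([[3, 0], [0, 3]])
```

A value of 1 means the annotators always agree, 0 means agreement is no better than chance, and negative values mean agreement is worse than chance.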

[40] Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Qinyuan Ye,Robin Jia,Xiang Ren

Main category: cs.CL

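TL;DR: This study uncovers the internal mechanisms behind large language models' generalization to unexpected tasks, in particular a function induction mechanism found in off-by-one addition.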

Details Motivation: To understand the mechanisms that let large language models perform well on unseen tasks, and in particular what drives task-level generalization. Method: Circuit-style interpretability techniques such as path patching are used to analyze the models' internal computations on the off-by-one addition task. Result: Three key findings: (1) a function induction mechanism explains the generalization from standard addition to off-by-one addition; (2) the induction of the +1 function is handled by multiple attention heads in parallel; (3) the mechanism is reused across a broader range of tasks. Conclusion: By studying generalization to unexpected tasks, specifically the function induction mechanism in off-by-one addition, the paper shows the importance of reusable and composable structures inside language models. Abstract: Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
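
The off-by-one addition task is easy to state in code. The sketch below builds a few-shot prompt in the style of the abstract's example (1+1=3, 2+2=5, 3+3=?) and computes the answer a model would need to give. The exact prompt format used in the paper is not given in this digest, so the newline-separated layout and the helper names `off_by_one_prompt` and `expected_answer` are assumptions.

```python
def off_by_one_prompt(demos, query):
    """Build a few-shot prompt whose demonstrations follow a + b + 1.

    demos: list of (a, b) pairs shown with their off-by-one result
    query: (a, b) pair left unanswered for the model to complete
    """
    lines = [f"{a}+{b}={a + b + 1}" for a, b in demos]
    lines.append(f"{query[0]}+{query[1]}=")
    return "\n".join(lines)


def expected_answer(query):
    """Answer under the counterfactual rule: standard addition, then +1."""
    return query[0] + query[1] + 1


prompt = off_by_one_prompt([(1, 1), (2, 2)], (3, 3))
target = expected_answer((3, 3))
```

A model that answers 7 rather than 6 for the final line has inferred the extra +1 step from the demonstrations, which is the behaviour the function induction analysis sets out to explain.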

[41] Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking

Hai Toan Nguyen,Tien Dat Nguyen,Viet Ha Nguyen

Main category: cs.CL

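TL;DR: This paper proposes a new method that improves RAG systems by integrating hierarchical text segmentation and clustering, markedly improving the precision and relevance of retrieval.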

Details Motivation: Traditional chunking methods fail to produce chunks that carry enough semantic meaning, so a new approach is needed that better captures textual structure and semantics. Method: A new framework combines hierarchical text segmentation with clustering to generate more semantically coherent chunks, and at inference time retrieves information using both segment-level and cluster-level vector representations. Result: Evaluations on the NarrativeQA, QuALITY, and QASPER datasets show that the proposed method achieves better results than traditional chunking techniques. Conclusion: Integrating hierarchical text segmentation and clustering improves the retrieval precision and contextual relevance of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.

[42] Tiny Reward Models

Sarah Pan

Main category: cs.CL

TL;DR: TinyRM is a lightweight bidirectional masked language model that matches the performance of much larger models in reward modeling tasks while using significantly fewer resources.

Details Motivation: Large decoder-based language models are increasingly used for reward modeling in reinforcement learning from human feedback (RLHF), but their inference costs have become a concern when deployed in test-time strategies. Method: TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench. Result: TinyRM models with as few as 400 million parameters perform comparably to models over 175 times larger on reasoning and safety preference modeling tasks. Conclusion: TinyRM, a family of small bidirectional masked language models, can rival much larger models in reasoning and safety preference modeling tasks while using significantly fewer resources. Abstract: Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.

[43] TextOmics-Guided Diffusion for Hit-like Molecular Generation

Hang Yuan,Chen Li,Wenjun Ma,Yuncheng Jiang

Main category: cs.CL

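TL;DR: To address the lack of heterogeneous data and unified frameworks in target-specific drug discovery, the paper introduces the TextOmics dataset and the ToDi generative framework, which generate molecules with therapeutic potential from omics expressions and molecular textual descriptions.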

Details Motivation: Target-specific drug discovery requires generating hit-like molecules with therapeutic potential, but the field lacks heterogeneous data and a unified framework for integrating different molecular representations. Method: The TextOmics dataset is constructed to establish correspondences between omics expressions and molecular textual descriptions; two encoders (OmicsEn and TextEn) capture multi-level biological and semantic associations; and a conditional diffusion model (DiffGen) is developed for controllable generation. Result: Experiments confirm the effectiveness of TextOmics and show that ToDi has notable potential in zero-shot therapeutic molecular generation. Conclusion: ToDi is a framework that generates biologically relevant, chemically valid molecules with therapeutic potential; it builds on the TextOmics dataset and outperforms existing state-of-the-art methods. Abstract: Hit-like molecular generation with therapeutic potential is essential for target-specific drug discovery. However, the field lacks heterogeneous data and unified frameworks for integrating diverse molecular representations. To bridge this gap, we introduce TextOmics, a pioneering benchmark that establishes one-to-one correspondences between omics expressions and molecular textual descriptions. TextOmics provides a heterogeneous dataset that facilitates molecular generation through representations alignment. Built upon this foundation, we propose ToDi, a generative framework that jointly conditions on omics expressions and molecular textual descriptions to produce biologically relevant, chemically valid, hit-like molecules. ToDi leverages two encoders (OmicsEn and TextEn) to capture multi-level biological and semantic associations, and develops conditional diffusion (DiffGen) for controllable generation. Extensive experiments confirm the effectiveness of TextOmics and demonstrate ToDi outperforms existing state-of-the-art approaches, while also showcasing remarkable potential in zero-shot therapeutic molecular generation. Sources are available at: https://github.com/hala-ToDi.

[44] Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media

Jun Li,Xiangmeng Wang,Haoyang Li,Yifei Yan,Hong Va Leong,Ling Feng,Nancy Xiaonan Yu,Qing Li

Main category: cs.CL

TL;DR: This paper introduces a new suicide risk prediction framework that integrates both risk and protective factors with dynamic modeling, achieving superior performance and clinical interpretability.

Details Motivation: Existing suicide risk detection models focus only on risk factors and fail to capture rapid fluctuations in mental health. This work addresses these limitations by integrating protective factors and temporal dynamics. Method: The authors proposed a Dynamic Factors Influence Learning approach using a new Protective Factor-Aware Dataset derived from Reddit posts with annotations of suicide risk factors over time. Result: Experiments showed the model outperforms state-of-the-art models and large language models across three datasets while providing interpretable insights into suicide risk patterns. Conclusion: The study concludes that incorporating protective factors and dynamic changes in mental states improves suicide risk prediction, offering better performance and interpretability for targeted interventions. Abstract: Suicide is a critical global health issue that requires urgent attention. Even though prior work has revealed valuable insights into detecting current suicide risk on social media, little attention has been paid to developing models that can predict subsequent suicide risk over time, limiting their ability to capture rapid fluctuations in individuals' mental state transitions. In addition, existing work ignores protective factors that play a crucial role in suicide risk prediction, focusing predominantly on risk factors alone. Protective factors such as social support and coping strategies can mitigate suicide risk by moderating the impact of risk factors. Therefore, this study proposes a novel framework for predicting subsequent suicide risk by jointly learning the dynamic influence of both risk factors and protective factors on users' suicide risk transitions. We propose a novel Protective Factor-Aware Dataset, which is built from 12 years of Reddit posts along with comprehensive annotations of suicide risk and both risk and protective factors. 
We also introduce a Dynamic Factors Influence Learning approach that captures the varying impact of risk and protective factors on suicide risk transitions, recognizing that suicide risk fluctuates over time according to established psychological theories. Our thorough experiments demonstrate that the proposed model significantly outperforms state-of-the-art models and large language models across three datasets. In addition, the proposed Dynamic Factors Influence Learning provides interpretable weights, helping clinicians better understand suicidal patterns and enabling more targeted intervention strategies.

[45] GeLaCo: An Evolutionary Approach to Layer Compression

David Ponce,Thierry Etchegoyhen,Javier Del Ser

Main category: cs.CL

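TL;DR: This paper proposes GeLaCo, which uses an evolutionary algorithm to efficiently explore the LLM compression space, significantly improving compression while preserving model performance.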

Details Motivation: Large language models (LLMs) face deployment barriers because of their substantial computational requirements, and model compression is an important means of mitigating these issues. Method: Population-based search with a module-wise similarity fitness function that captures attention, feed-forward, and hidden state representations enables effective model compression. Result: GeLaCo outperforms state-of-the-art alternatives in both perplexity-based and generative evaluations, and establishes the first Pareto frontier along compression and quality axes. Conclusion: GeLaCo is an evolutionary approach to LLM compression via layer collapse that supports efficient exploration of the compression solution space and performs well in both single- and multi-objective evolutionary compression search. Abstract: Large Language Models (LLM) have achieved remarkable performance across a large number of tasks, but face critical deployment and usage barriers due to substantial computational requirements. Model compression methods, which aim to reduce model size while preserving its capacity, are an important means to mitigate these issues. Promising approaches along these lines, such as structured pruning, typically require costly empirical search for optimal variants and may run the risk of ignoring better solutions. In this work we introduce GeLaCo, an evolutionary approach to LLM compression via layer collapse. Our approach supports an efficient exploration of the compression solution space via population-based search and a module-wise similarity fitness function capturing attention, feed-forward, and hidden state representations. GeLaCo also supports both single and multi-objective evolutionary compression search, establishing the first Pareto frontier along compression and quality axes. We evaluate GeLaCo solutions via both perplexity-based and generative evaluations over foundational and instruction-tuned models, outperforming state-of-the-art alternatives.
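
The population-based search mentioned in the GeLaCo abstract can be illustrated with a toy single-objective loop. This is only a sketch under stated assumptions, not the authors' algorithm: the candidate encoding (a tuple of kept layer indices), the mutation operator, the function name `evolve_kept_layers`, and the stand-in `toy_fitness` are all hypothetical, and the real fitness would be the module-wise similarity score, which the abstract does not specify.

```python
import random


def evolve_kept_layers(num_layers, keep, fitness, generations=40, pop_size=20, seed=0):
    """Toy population-based search over which layers to keep.

    A candidate is a sorted tuple of `keep` distinct layer indices.
    `fitness` is a hypothetical stand-in for a similarity-based score
    (higher is better). Requires keep < num_layers.
    """
    rng = random.Random(seed)

    def random_candidate():
        return tuple(sorted(rng.sample(range(num_layers), keep)))

    def mutate(candidate):
        # Swap one kept layer for one layer that is currently dropped.
        kept = list(candidate)
        dropped = [i for i in range(num_layers) if i not in candidate]
        kept[rng.randrange(keep)] = rng.choice(dropped)
        return tuple(sorted(kept))

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the better half, refill with mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors)) for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)


def toy_fitness(candidate):
    # Hypothetical score that rewards keeping layers far from the middle of a 12-layer stack.
    return sum(abs(i - 5.5) for i in candidate)


best = evolve_kept_layers(num_layers=12, keep=6, fitness=toy_fitness)
print(best, toy_fitness(best))
```

A multi-objective variant would score each candidate on compression and quality separately and keep the non-dominated set, which is how a Pareto frontier arises; that part is not sketched here.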

[46] Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Simon Münker

Main category: cs.CL

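TL;DR: The study finds that large language models (LLMs) fail to accurately represent diverse cultural moral frameworks and instead systematically homogenize moral diversity, revealing a fundamental limitation in how current AI represents human values.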

Details Motivation: To examine whether LLMs truly represent human values or merely average across them, particularly across diverse cultural moral frameworks. Method: The Moral Foundations Questionnaire is applied across 19 cultural contexts to compare AI-generated moral intuitions with human baseline data, contrasting multiple state-of-the-art LLMs against the human data. Result: LLMs fail to accurately represent human moral intuitions across cultural contexts, and increasing model size does not consistently improve cultural representation fidelity. Conclusion: LLMs cannot represent diverse cultural moral frameworks; instead, they systematically homogenize moral diversity across cultures. Abstract: Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs' origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn't consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.

[47] Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning

Chenxi Huang,Shaotian Yan,Liang Xie,Binbin Lin,Sinan Fan,Yue Xin,Deng Cai,Chen Shen,Jieping Ye

Main category: cs.CL

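TL;DR: This paper proposes CRFT, which improves performance on complex reasoning tasks by optimizing critical representations; it is more efficient than traditional PEFT methods and adapts to few-shot settings.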

Details Motivation: ReFT performs poorly on complex reasoning tasks because the fixed-position representations it modifies have an uncertain impact on the output, whereas complex reasoning tasks contain critical representations that substantially influence the final output. Method: Critical Representation Fine-Tuning (CRFT) identifies and optimizes critical representations through information flow analysis, dynamically optimizing them in a low-rank linear subspace within a supervised learning framework while freezing the base model. Result: CRFT's effectiveness and efficiency are validated on eight arithmetic and commonsense reasoning benchmarks, and it improves one-shot accuracy by 16.4% in few-shot settings. Conclusion: By identifying and optimizing the critical representations in complex reasoning tasks, CRFT offers a lightweight yet powerful alternative to traditional PEFT methods, works with the LLaMA and Mistral model families, and performs well in few-shot settings. Abstract: Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. 
Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.

[48] Fusing Large Language Models with Temporal Transformers for Time Series Forecasting

Chen Su,Yuanhe Tian,Qinyu Liu,Jun Zhang,Yan Song

Main category: cs.CL

TL;DR: This paper introduces a novel hybrid model combining LLMs and Transformers to improve time series forecasting by integrating semantic understanding and temporal dynamics.

Details Motivation: LLMs excel at reasoning over discrete tokens but perform poorly on continuous numerical time series data; vanilla Transformers model temporal data well but struggle with high-level semantics. A hybrid approach is needed to bridge these gaps. Method: A hybrid Transformer-based architecture was designed to fuse representations from LLMs (capturing high-level semantics) and vanilla Transformers (modeling temporal dynamics). Result: Experiments showed that the proposed method outperforms existing approaches in time series forecasting tasks by leveraging both semantic and temporal patterns. Conclusion: The proposed approach effectively combines LLMs and vanilla Transformers for time series forecasting, enhancing prediction accuracy by integrating semantic representations with temporal information. Abstract: Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. 
In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.

[49] Task-Based Flexible Feature Distillation for LLMs

Khouloud Saadi,Di Wang

Main category: cs.CL

TL;DR: This paper proposes a novel task-based feature distillation method for large language models that allows knowledge transfer between models of different sizes without introducing additional parameters, demonstrating enhanced performance on various NLP tasks.

Details Motivation: Traditional feature knowledge distillation methods limit the flexibility of the student's architecture due to the assumption of equal hidden sizes between teacher and student models. The common solution involving linear projectors introduces additional parameters and often degrades performance on downstream tasks. Method: Identifying the most task-relevant hidden units in the teacher model and directly distilling their activations to the student, leveraging the insight that only a subset of LLM components significantly contributes to a specific downstream task. Result: Empirical results indicate consistent improvements over prior approaches across diverse tasks like classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline. Conclusion: The proposed task-based feature distillation method successfully enables knowledge transfer between teacher and student models with different hidden layer dimensions without adding new parameters, showing improved performance across various tasks. Abstract: Knowledge Distillation (KD) in general and feature distillation in particular are promising techniques for reducing the high computational demand of large language models (LLMs). However, traditional feature KD methods typically assume that the teacher and the student share the same hidden size, limiting the flexibility of the student's architecture. A common solution to this problem involves training a linear projector to align their feature spaces, but this introduces additional parameters that must be learned from scratch and often degrades performance on downstream tasks, especially in generative settings. To address this issue, in this work, we propose a novel task-based feature distillation method that enables knowledge transfer between teacher and student models with different hidden layer dimensions, without introducing any new parameters. 
Leveraging the insight that only a subset of LLM components contribute significantly to a specific downstream task, our approach identifies the most task-relevant hidden units in the teacher and directly distills their activations to the student. Our method is flexible and easily integrates with other distillation frameworks. Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline.
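
The idea of distilling only the most task-relevant teacher units, so that no projector is needed when hidden sizes differ, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's procedure: the relevance scores are taken as given, the top-k selection rule, the index-order pairing of teacher and student units, the mean-squared-error loss, and the names `top_k_units` and `feature_distill_loss` are all hypothetical.

```python
def top_k_units(relevance, k):
    """Indices of the k highest-scoring teacher units, returned in ascending index order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])


def feature_distill_loss(teacher_act, student_act, selected):
    """Mean squared error between the selected teacher units and the student units.

    Assumes the student has exactly len(selected) hidden units, paired in index order.
    """
    if len(selected) != len(student_act):
        raise ValueError("number of selected teacher units must equal student hidden size")
    squared = [(teacher_act[i] - s) ** 2 for i, s in zip(selected, student_act)]
    return sum(squared) / len(squared)


# Hypothetical numbers: teacher hidden size 6, student hidden size 3.
relevance = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]   # assumed per-unit task relevance
teacher_act = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]  # teacher activations for one token
student_act = [1.0, 2.0, 2.0]                 # student activations for the same token

selected = top_k_units(relevance, k=len(student_act))
loss = feature_distill_loss(teacher_act, student_act, selected)
print(selected, loss)
```

Because the selection only indexes into the teacher's activations, the sketch adds no trainable parameters, which is the property the abstract emphasizes.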

[50] Abusive text transformation using LLMs

Rohitash Chandra,Jiyong Choi

Main category: cs.CL

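TL;DR: This study examines how well large language models (LLMs) transform text containing hate speech and swear words into non-abusive versions while preserving the text's sentiment and semantics.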

Details Motivation: Although large language models (LLMs) have shown significant advances in natural language processing tasks, their effectiveness at transforming abusive text into non-abusive versions remains to be explored. The study aims to use LLMs for this purpose while retaining the intent of the text. Method: State-of-the-art LLMs such as Gemini, GPT-4o, DeepSeek and Groq are evaluated on their ability to identify and transform text containing hate speech and swear words, and the raw and transformed datasets are assessed with sentiment analysis and semantic analysis. Result: Groq produces vastly different results from the other LLMs, and similarities are found between GPT-4o and DeepSeek-V3. Conclusion: Groq differs markedly from the other LLMs in its transformation results, and GPT-4o and DeepSeek-V3 behave similarly. Abstract: Although Large Language Models (LLMs) have demonstrated significant advancements in natural language processing tasks, their effectiveness in the classification and transformation of abusive text into non-abusive versions remains an area for exploration. In this study, we aim to use LLMs to transform abusive text (tweets and reviews) featuring hate speech and swear words into non-abusive text, while retaining the intent of the text. We evaluate the performance of state-of-the-art LLMs, such as Gemini, GPT-4o, DeepSeek and Groq, on their ability to identify abusive text. We then use them to transform and obtain a text that is clean from abusive and inappropriate content but maintains a similar level of sentiment and semantics, i.e. the transformed text needs to maintain its message. Afterwards, we evaluate the raw and transformed datasets with sentiment analysis and semantic analysis. Our results show Groq provides vastly different results when compared with other LLMs. We have identified similarities between GPT-4o and DeepSeek-V3.

[51] Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects

Renad Al-Monef,Hassan Alhuzali,Nora Alturayeif,Ashwag Alasmari

Main category: cs.CL

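TL;DR: This paper introduces Absher, a comprehensive benchmark for evaluating large language models on major Saudi dialects, and points out current models' shortcomings in cultural understanding and contextual reasoning.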

Details Motivation: As large language models (LLMs) become increasingly central to Arabic NLP applications, especially in linguistically diverse settings such as Saudi Arabia, evaluating their understanding of regional dialects and cultural nuances is essential. Method: The authors build Absher, a comprehensive benchmark of over 18,000 multiple-choice questions spanning six distinct categories, and evaluate several state-of-the-art LLMs on it. Result: The study provides detailed insights into LLM capabilities and limitations; models perform poorly particularly on tasks requiring cultural inference or contextual understanding. Conclusion: The evaluation reveals notable performance gaps in current LLMs on tasks requiring cultural inference or contextual understanding, highlighting the urgent need for dialect-aware training and culturally aligned evaluation methodologies. Abstract: As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces Absher, a comprehensive benchmark specifically designed to assess LLMs' performance across major Saudi dialects. Absher comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models. We also provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLMs' performance in real-world Arabic applications.

[52] Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation

Muzhaffar Hazman,Minh-Khoi Pham,Shweta Soundararajan,Goncalo Mordido,Leonardo Custode,David Lynch,Giorgio Cruciata,Yucheng Shi,Hongmeng Song,Wang Chao,Pan Yue,Aleksandar Milenovic,Alexandros Agapitos

Main category: cs.CL

TL;DR: This paper introduces an evolutionary search-based method for automated prompt optimization, particularly effective for smaller language models tackling complex tasks, outperforming existing state-of-the-art techniques.

Details Motivation: The motivation stems from the limitations of current prompt engineering automation techniques, which are largely evaluated on large LLMs and simple tasks, while smaller models and complex tasks demand more precise and detailed prompt optimization. Method: The method involves a two-phase evolutionary search: (1) grammar-guided genetic programming to synthesize prompt-creating programmes using syntactic, dictionary-based, and LLM-based functions, and (2) local search to fine-tune the best-performing programmes. Result: The approach outperforms PromptWizard, OPRO, and RL-Prompt on three relatively small LLMs across four challenging domain-specific tasks, demonstrating consistent performance improvements with minimal degradation. Conclusion: The proposed evolutionary search approach to automated discrete prompt optimisation effectively enhances prompt performance on small general-purpose LLMs for complex, domain-specific tasks, outperforming state-of-the-art methods with minimal degradation. Abstract: Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. 
In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.

[53] Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach

Mohammed Bouri,Adnane Saoud

Main category: cs.CL

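TL;DR: This paper proposes a new GBM-based method that significantly improves the robustness of models such as LSTM, S4, and CNN against adversarial attacks, and provides the first systematic analysis of S4 robustness.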

Details Motivation: Despite substantial progress, NLP models remain vulnerable to adversarial attacks such as synonym substitution. Prior work has focused mainly on improving robustness for feed-forward and convolutional architectures, while recurrent networks and modern state space models (such as S4) remain understudied. Method: The work computes the GBM for three architectures (LSTM, S4, and CNN), systematically analyzes the robustness of SSMs (S4), and introduces GBM regularization to strengthen resistance to input perturbations. Result: Experiments show that the method improves adversarial robustness by up to 8.8% over existing baselines and also improves generalization on clean text. Conclusion: The paper proposes a novel regularization technique based on Growth Bound Matrices (GBM) to improve the robustness of NLP models and validates it across multiple architectures and benchmark datasets. Abstract: Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Codes are available at https://github.com/BouriMohammed/GBM

[54] Using AI to replicate human experimental results: a motion study

Rosa Illan Castillo,Javier Valenzuela

Main category: cs.CL

TL;DR: This paper demonstrates that large language models, such as GPT-4, can effectively replicate human judgments in linguistic studies involving affective meanings, suggesting their potential as reliable tools for future research.

Details Motivation: To examine whether LLMs can reliably replicate nuanced human judgments in linguistic research, particularly in analyzing affective meanings. Method: Four psycholinguistic studies were conducted on affective meanings in temporal expressions involving manner-of-motion verbs, first with humans and then replicated using an LLM. Statistical analyses were used to compare results. Result: Results showed strong convergence between human and AI responses, with high correlations (Spearman's rho = .73-.96), indicating LLMs can produce valid interpretative outcomes. Conclusion: The study supports the use of large language models (LLMs) as credible collaborators in linguistic research, capable of augmenting traditional human-based experimentation. Abstract: This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research, focusing on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs. While LLMs like GPT-4 have shown promise across a range of tasks, their ability to replicate nuanced human judgements remains under scrutiny. We conducted four psycholinguistic studies (on emergent meanings, valence shifts, verb choice in emotional contexts, and sentence-emoji associations) first with human participants and then replicated the same tasks using an LLM. Results across all studies show a striking convergence between human and AI responses, with statistical analyses (e.g., Spearman's rho = .73-.96) indicating strong correlations in both rating patterns and categorical choices. While minor divergences were observed in some cases, these did not alter the overall interpretative outcomes. These findings offer compelling evidence that LLMs can augment traditional human-based experimentation, enabling broader-scale studies without compromising interpretative validity. 
This convergence not only strengthens the empirical foundation of prior human-based findings but also opens possibilities for hypothesis generation and data expansion through AI. Ultimately, our study supports the use of LLMs as credible and informative collaborators in linguistic inquiry.
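
For readers unfamiliar with the statistic quoted in the abstract, Spearman's rho is the Pearson correlation computed on ranks. The sketch below shows the calculation in plain Python; the two rating lists are made-up illustrative values, not data from the study, and the helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank for the tie group order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up ratings for five items (illustration only).
human_ratings = [1, 2, 3, 4, 5]
model_ratings = [2, 1, 4, 3, 5]
rho = spearman_rho(human_ratings, model_ratings)
print(round(rho, 2))  # 0.8
```

With no ties this agrees with the textbook shortcut 1 - 6 * sum(d^2) / (n * (n^2 - 1)): here the rank differences are -1, 1, -1, 1, 0, so rho = 1 - 24/120 = 0.8.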

[55] Meanings are like Onions: a Layered Approach to Metaphor Processing

Silvia Cappa,Anna Sofia Lippolis,Stefano Zoia

Main category: cs.CL

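TL;DR: This paper proposes an onion-like, layered model of metaphor processing that combines content analysis, conceptual blending, and pragmatic intentionality, giving computational systems a deeper approach to metaphor understanding.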

Details Motivation: Metaphorical meaning is not a flat mapping between concepts but a complex cognitive phenomenon that integrates multiple levels of interpretation. Method: The authors build a three-dimensional framework comprising content analysis, conceptual blending, and pragmatic intentionality, and unify these layers into a single formal framework. Result: The resulting layered model annotates metaphors through basic conceptual elements, models conceptual combinations, and introduces a pragmatic vocabulary to capture speaker intent and contextual effects. Conclusion: The paper proposes a multi-layered model of metaphor processing that lays the groundwork for deeper, more context-sensitive reasoning about metaphor in computational systems. Abstract: Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified model of metaphor processing that treats meaning as an onion: a multi-layered structure comprising (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality. This three-dimensional framework allows for a richer and more cognitively grounded approach to metaphor interpretation in computational systems. At the first level, metaphors are annotated through basic conceptual elements. At the second level, we model conceptual combinations, linking components to emergent meanings. Finally, at the third level, we introduce a pragmatic vocabulary to capture speaker intent, communicative function, and contextual effects, aligning metaphor understanding with pragmatic theories. By unifying these layers into a single formal framework, our model lays the groundwork for computational methods capable of representing metaphorical meaning beyond surface associations, toward deeper, more context-sensitive reasoning.

[56] From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Xinnan Dai,Kai Yang,Jay Revolinsky,Kai Guo,Aoran Wang,Bohang Zhang,Jiliang Tang

Main category: cs.CL

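TL;DR: This study reveals how sequence-based Transformers perform substructure extraction over graph data and demonstrates their ability to extract complex composite patterns from attributed graphs such as molecular graphs.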

Details Motivation: Although large language models (LLMs) can solve graph reasoning tasks, it remains unclear how a decoder-only Transformer architecture understands the underlying graph structure. Method: Through empirical results and theoretical analysis, the study examines the internal mechanisms of Transformers on the substructure extraction task and the impact of the input queries. Result: The authors present Induced Substructure Filtration (ISF), a perspective that explains substructure identification in multi-layer Transformers, and validate the ISF process in LLMs along with its handling of diverse graph types. Conclusion: Transformers understand graph structure through a mechanism called Induced Substructure Filtration (ISF) and can effectively extract substructures from graph data. Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.

[57] Referential ambiguity and clarification requests: comparing human and LLM behaviour

Chris Madge,Matthew Purver,Massimo Poesio

Main category: cs.CL

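TL;DR: This paper builds a new corpus to compare how humans and LLMs ask clarification questions about ambiguity in task-oriented dialogues, finding low correlation between the two and showing that reasoning approaches improve LLM behaviour.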

Details Motivation: The authors want to understand LLMs' ability to ask clarification questions in task-oriented dialogues and to explore how it differs from human behaviour and how it can be improved. Method: The study uses a new corpus combining two existing annotations of the Minecraft Dialogue Corpus to analyze how humans and LLMs behave when faced with ambiguity, and tests the effect of different reasoning approaches on LLMs' question-asking. Result: (1) Humans rarely ask clarification questions about referential ambiguity but often do so for task-based uncertainty; (2) conversely, LLMs ask more clarification questions about referential ambiguity but fewer about task uncertainty; (3) with reasoning approaches, LLMs' questions become more relevant and more frequent. Conclusion: LLMs' questioning behaviour differs markedly from that of humans, especially for task uncertainty. Introducing reasoning can increase the relevance and frequency of LLMs' clarification questions. Abstract: In this work we examine LLMs' ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus -- one for reference and ambiguity in reference, and one for SDRT including clarifications -- into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs' ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.

[58] From BERT to Qwen: Hate Detection across architectures

Ariadna Mon,Saúl Fenollosa,Jon Lecumberri

Main category: cs.CL

TL;DR: This paper investigates whether ultra-large autoregressive LLMs offer better hate-speech detection than classic encoder models, with inconclusive results.

Details Motivation: Online platforms face challenges in curbing hate speech without over-censoring legitimate discourse. Ultra-large autoregressive LLMs are promising due to their deeper context-awareness, but their practical effectiveness needs verification. Method: Benchmarking both classic bidirectional transformer encoders and next-generation LLMs on curated corpora of online interactions for hate-speech detection (Hate or No Hate). Result: It is unclear whether the increased scale of LLMs actually improves hate-speech detection performance compared to early encoder models. Conclusion: The study concludes that while ultra-large autoregressive LLMs offer deeper context-awareness, their practical improvement in hate-speech detection over classic encoders remains unverified. Abstract: Online platforms struggle to curb hate speech without over-censoring legitimate discourse. Early bidirectional transformer encoders made big strides, but the arrival of ultra-large autoregressive LLMs promises deeper context-awareness. Whether this extra scale actually improves practical hate-speech detection on real-world text remains unverified. Our study puts this question to the test by benchmarking both model families, classic encoders and next-generation LLMs, on curated corpora of online interactions for hate-speech detection (Hate or No Hate).

[59] MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking

Mohamed T. Younes,Omar Walid,Mai Hassan,Ali Hamdi

Main category: cs.CL

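TL;DR: This paper proposes MLAR, a novel applicant tracking system based on RPA and large language models (LLMs) that efficiently automates resume screening and candidate matching.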

Details Motivation: Traditional recruitment processes hit bottlenecks under time and resource constraints, particularly in resume screening and candidate shortlisting. Method: The system uses a three-layer structure in which large language models (LLMs) extract features from job descriptions, parse applicant resumes, and perform similarity matching, combined with advanced semantic algorithms to identify the best candidates. Result: When processing 2,400 resumes, MLAR averaged 5.4 seconds per resume, outperforming leading RPA platforms such as UiPath and Automation Anywhere. Conclusion: MLAR is an innovative ATS that, by integrating RPA and LLM technology, significantly improves the efficiency, accuracy, and scalability of the recruitment process. Abstract: This paper introduces an innovative Applicant Tracking System (ATS) enhanced by a novel Robotic Process Automation (RPA) framework, referred to hereafter as MLAR. Traditional recruitment processes often encounter bottlenecks in resume screening and candidate shortlisting due to time and resource constraints. MLAR addresses these challenges by employing Large Language Models (LLMs) in three distinct layers: extracting key characteristics from job postings in the first layer, parsing applicant resumes to identify education, experience, and skills in the second layer, and similarity matching in the third layer. These features are then matched through advanced semantic algorithms to identify the best candidates efficiently. Our approach integrates seamlessly into existing RPA pipelines, automating resume parsing, job matching, and candidate notifications. Extensive performance benchmarking shows that MLAR outperforms the leading RPA platforms, including UiPath and Automation Anywhere, in high-volume resume-processing tasks. When processing 2,400 resumes, MLAR achieved an average processing time of 5.4 seconds per resume, reducing processing time by approximately 16.9% compared to Automation Anywhere and 17.1% compared to UiPath. These results highlight the potential of MLAR to transform recruitment workflows by providing an efficient, accurate, and scalable solution tailored to modern hiring needs.

[60] Can You Detect the Difference?

İsmail Tarım,Aytuğ Onan

Main category: cs.CL

TL;DR: This paper compares diffusion-generated and autoregressive-generated text, showing that diffusion models like LLaDA mimic human text more closely in some aspects than traditional models, which necessitates the development of new AI-text detection methods specific to diffusion models.

Details Motivation: With the rapid advancement of large language models, there is growing concern about reliably detecting AI-generated text. While current stylometric metrics are effective for autoregressive models, their performance on diffusion-based models remains unknown. Method: The research conducts a systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2,000 samples. The analysis involves evaluating perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores. Result: LLaDA mimics human text closely in terms of perplexity and burstiness, leading to high false-negative rates for detectors optimized for AR models. In contrast, LLaMA shows lower perplexity but reduced lexical fidelity. No single metric effectively distinguishes diffusion-generated text from human writing. Conclusion: The study concludes that diffusion-generated text, such as LLaDA, closely resembles human text in certain stylometric properties, making it challenging for existing AI-text detectors designed for autoregressive models to identify. This highlights the need for new detection techniques tailored for diffusion models. Abstract: The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2 000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. 
We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.
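As a rough illustration of two of the stylometric metrics named above, the following sketch computes lexical diversity and burstiness for a piece of text. The definitions are assumptions, not taken from the paper: lexical diversity is treated as a type-token ratio, and burstiness as (sigma - mu) / (sigma + mu) over sentence lengths. The paper may define either metric differently, for example burstiness over per-sentence perplexity.

```python
import re
import statistics


def type_token_ratio(text):
    """Lexical diversity as unique words / total words (assumed definition)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def burstiness(text):
    """(sigma - mu) / (sigma + mu) over sentence lengths in words (assumed definition)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mu = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)
    return (sigma - mu) / (sigma + mu) if (sigma + mu) else 0.0


sample = "The cat sat. The cat sat on the mat and then slept for hours. Quiet."
print(type_token_ratio(sample), burstiness(sample))
```

Under these definitions, perfectly uniform sentence lengths give a burstiness of -1, and more varied lengths push the value toward +1. A detector that relies on one such number alone is the single-metric setting the abstract reports as insufficient.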

[61] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Sangmin Bae,Yujin Kim,Reza Bayat,Sungnyun Kim,Jiyoun Ha,Tal Schuster,Adam Fisch,Hrayr Harutyunyan,Ziwei Ji,Aaron Courville,Se-Young Yun

Main category: cs.CL

TL;DR: Mixture-of-Recursions (MoR) is a framework that efficiently combines parameter sharing and adaptive computation in language models, achieving better performance with reduced costs.

Details Motivation: The computational and memory demands of scaling language models make training and deployment expensive, and existing efficiency efforts typically target only one aspect of efficiency. Method: Mixture-of-Recursions (MoR) combines parameter sharing and adaptive computation within a single Recursive Transformer to improve efficiency. Result: MoR significantly lowers validation perplexity, improves few-shot accuracy, and delivers higher throughput compared to vanilla and existing recursive baselines while having smaller model sizes at equal training FLOPs. Conclusion: MoR is an effective path towards large-model quality without incurring large-model cost. Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. 
Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
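The core idea described above, one shared block reused across recursion steps with a router assigning each token its own depth, can be sketched in a few lines. This is a toy under stated assumptions: token states are scalars, `shared_block` is a stand-in for the shared layer stack, and `route_depths` is a hypothetical hand-written heuristic rather than the learned lightweight routers of the paper. Attention among active tokens and KV caching are not modeled.

```python
def shared_block(h):
    """Stand-in for the shared layer stack: the same function is reused at every step."""
    return 0.5 * h + 1.0


def route_depths(hidden, max_depth):
    """Hypothetical router: tokens with larger magnitude get more recursion steps."""
    return [min(max_depth, 1 + int(abs(h))) for h in hidden]


def mixture_of_recursions(hidden, max_depth=3):
    """Apply the shared block recursively, updating only tokens still active at each step."""
    depths = route_depths(hidden, max_depth)
    out = list(hidden)
    for step in range(1, max_depth + 1):
        # A token is active at this step only if its assigned depth reaches it.
        active = [i for i, d in enumerate(depths) if d >= step]
        for i in active:
            out[i] = shared_block(out[i])
    return out, depths


states, depths = mixture_of_recursions([0.2, 1.5, 4.0])
print(depths)  # → [1, 2, 3]
print(states)
```

Because the set of active tokens shrinks as the step index grows, later recursion steps touch fewer tokens, which is where the compute savings in this kind of design come from.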

[62] CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Hongchao Jiang,Yiming Chen,Yushi Cao,Hung-yi Lee,Robby T. Tan

Main category: cs.CL

TL;DR: This paper introduces CodeJudgeBench, a benchmark for evaluating LLM-as-a-Judge models in coding tasks, revealing that thinking models perform better but are still sensitive to presentation and prompting methods.

Details Motivation: The motivation was to explore the effectiveness of LLMs as judges in coding scenarios due to the lack of dedicated benchmarks, which is crucial for improving response quality and benchmarking different LLMs. Method: The researchers conducted a comprehensive benchmarking of 26 LLM-as-a-Judge models using CodeJudgeBench, which evaluates performance across code generation, repair, and unit test generation. They analyzed the impact of response presentation order and prompting strategies on judgment accuracy. Result: Recent thinking models significantly outperformed non-thinking models, with smaller thinking models like Qwen3-8B surpassing larger models up to 70B in size. Pair-wise comparison and retaining comments and reasoning in prompts improved judge performance. Conclusion: The study concludes that while thinking models outperform non-thinking models in the LLM-as-a-Judge paradigm for coding tasks, there is still significant randomness and sensitivity in judgment, raising concerns about reliability and consistency. Abstract: Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. 
Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
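The order sensitivity described above can be measured by judging every pair twice, once in each presentation order, and counting how often the same response wins. The sketch below is a minimal version of that check; `toy_judge` is a hypothetical stand-in for an LLM judge, and the metric name and the "first"/"second" return convention are assumptions rather than the benchmark's own protocol.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs where the judge prefers the same response in both orders."""
    consistent = 0
    for task, resp_a, resp_b in pairs:
        verdict = judge(task, resp_a, resp_b)          # "first" or "second"
        verdict_swapped = judge(task, resp_b, resp_a)  # same pair, order reversed
        picked_a = verdict == "first"
        picked_a_swapped = verdict_swapped == "second"
        consistent += picked_a == picked_a_swapped
    return consistent / len(pairs)


def toy_judge(task, first, second):
    """Hypothetical judge: prefers the longer response, ties go to the first slot."""
    return "second" if len(second) > len(first) else "first"


pairs = [
    ("sum a list", "return sum(xs)", "total = 0"),
    ("negate", "-x", "0-x"),
    ("tie", "ab", "cd"),
]
print(position_consistency(toy_judge, pairs))
```

The tie case shows the failure mode: the toy judge always picks whichever response appears first, so its verdict flips when the order is swapped.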

[63] REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan,Qizhi Pei,Yu Li,Qiyao Sun,Zinan Tang,H. Vicky Zhao,Conghui He,Lijun Wu

Main category: cs.CL

TL;DR: This paper introduces REST, a new framework for evaluating reasoning models by testing them on multiple problems at once, revealing hidden weaknesses and providing a more realistic assessment of their capabilities.

Details Motivation: Current evaluation methods for Large Reasoning Models are limited because they rely on single-question testing, which leads to issues like data contamination and does not adequately test models under real-world multi-context scenarios. Method: The researchers developed REST, a stress-testing framework that evaluates Large Reasoning Models (LRMs) by exposing them to multiple problems simultaneously. This approach assesses capabilities like contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Result: The REST framework revealed significant performance degradation in state-of-the-art models like DeepSeek-R1 when subjected to multi-problem stress testing. It also showed stronger discriminative power than traditional benchmarks, highlighting differences between models with similar performance in single-question tests. Conclusion: The study concludes that REST is a more effective and future-proof evaluation framework compared to traditional methods, as it better reflects real-world reasoning demands and reduces the need for continuous human annotation. Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing costly and perpetual creation of new questions with large human efforts, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. 
Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) the models trained with "long2short" technique preserve more accuracy of their single-problem performance under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
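A minimal sketch of the multi-problem setup might look like the following: several problems are packed into one prompt, and the response is scored per problem. The prompt template, the `Answer k:` output convention, and both function names are assumptions made for illustration; the paper's actual prompt format and answer extraction are not given in the abstract.

```python
import re


def build_rest_prompt(problems):
    """Concatenate several problems into a single prompt (template wording is assumed)."""
    lines = ["Solve all of the following problems. Give each answer as 'Answer k: <value>'."]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)


def score_response(response, gold):
    """Fraction of problems whose extracted answer matches the reference answer."""
    found = dict(re.findall(r"Answer (\d+):\s*(\S+)", response))
    correct = sum(found.get(str(k)) == ans for k, ans in enumerate(gold, start=1))
    return correct / len(gold)


prompt = build_rest_prompt(["What is 2 + 2?", "What is 3 * 3?"])
print(prompt)
print(score_response("Answer 1: 4\nAnswer 2: 8", ["4", "9"]))  # → 0.5
```

Comparing this per-prompt accuracy against the accuracy obtained when each problem is asked alone gives a simple measure of the degradation the abstract describes.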

cs.CV [Back]

[64] View Invariant Learning for Vision-Language Navigation in Continuous Environments

Josh Qixuan Sun,Xiaoying Xing,Huaiyuan Weng,Chul Min Yeum,Mark Crowley

Main category: cs.CV

TL;DR: This paper introduces V2-VLNCE and VIL, a generalized scenario and a view-invariant post-training strategy, to enhance the robustness of navigation policies in Vision-Language Navigation tasks under varying camera viewpoints.

Details Motivation: Most navigation policies in Vision-Language Navigation in Continuous Environments (VLNCE) are sensitive to changes in camera viewpoint, such as variations in height and angle. This limitation hinders their robustness and applicability in real-world settings. Method: The paper proposes VIL (View Invariant Learning), which uses a contrastive learning framework to learn sparse, view-invariant features. It also introduces a teacher-student framework for the Waypoint Predictor Module, employing end-to-end training to jointly optimize components. Result: VIL outperforms state-of-the-art approaches on V2-VLNCE by 8-15% in Success Rate on R2R-CE and RxR-CE datasets. It also achieves state-of-the-art results on RxR-CE compared to other map-free methods, showing improved robustness without compromising standard viewpoint performance. Conclusion: The study concludes that VIL is an effective post-training strategy for improving the robustness and performance of navigation policies in VLNCE, applicable as a plug-and-play method without diminishing standard viewpoint performance. Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. 
Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.

[65] Detecting Deepfake Talking Heads from Facial Biometric Anomalies

Justin D. Norman,Hany Farid

Main category: cs.CV

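TL;DR: This paper proposes a new deepfake video detection technique that exploits unnatural patterns in facial biometrics to counter fraud, scams, and political disinformation.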

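Details Motivation: Highly realistic voice cloning, together with visually convincing avatar, face-swap, or lip-sync deepfake video generation, makes it relatively easy to create a video of anyone saying anything, and such deepfakes are often used for fraud, scams, and political disinformation. Method: The paper proposes a new machine learning method for deepfake video detection that exploits unnatural patterns in facial biometrics. Result: The technique performs well at detection across a large set of deepfake techniques and impersonations, and is reliable under video laundering and when generalizing to previously unseen deepfake video generators. Conclusion: The authors propose a new forensic machine learning technique for detecting deepfake video impersonations and demonstrate its reliability across a large set of deepfake techniques, against video laundering, and in generalizing to previously unseen deepfake video generators. Abstract: The combination of highly realistic voice cloning, along with visually compelling avatar, face-swap, or lip-sync deepfake video generation, makes it relatively easy to create a video of anyone saying anything. Today, such deepfake impersonations are often used to power frauds, scams, and political disinformation. We propose a novel forensic machine learning technique for the detection of deepfake video impersonations that leverages unnatural patterns in facial biometrics. We evaluate this technique across a large dataset of deepfake techniques and impersonations, as well as assess its reliability to video laundering and its generalization to previously unseen video deepfake generators.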

[66] PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection

Mahdiyar Molahasani,Azadeh Motamedi,Michael Greenspan,Il-Min Kim,Ali Etemad

Main category: cs.CV

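TL;DR: PRISM is a debiasing method that needs no additional data: it uses a large language model to generate scenes containing spurious correlations and optimizes the embedding space with a contrastive-style loss, mitigating bias in vision-language models.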

Details Motivation: Vision-language models (VLMs) often inherit and amplify biases in their training data, skewing predictions toward certain classes. A method is therefore needed that mitigates this bias without relying on predefined bias categories or additional external data. Method: PRISM has two stages: first, simple class prompts are used to generate scene descriptions that contain spurious correlations; then a novel contrastive-style debiasing loss is used to learn a projection that maps embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Result: Experiments show that PRISM outperforms existing debiasing methods on the commonly used Waterbirds and CelebA datasets. Conclusion: PRISM is a new data-free and task-agnostic solution for mitigating bias in vision-language models, and it outperforms current debiasing methods on the Waterbirds and CelebA datasets. Abstract: We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://github.com/MahdiyarMM/PRISM.

[67] Video Inference for Human Mesh Recovery with Vision Transformer

Hanbyel Cho,Jaesung Ahn,Yooshin Cho,Junmo Kim

Main category: cs.CV

TL;DR: HMR-ViT improves Human Mesh Recovery by combining temporal and kinematic data, achieving strong results on benchmark datasets.

Details Motivation: To improve the accuracy of Human Mesh Recovery by incorporating both temporal information and kinematic relationships, which existing methods have not combined. Method: HMR-ViT constructs a Temporal-kinematic Feature Image using feature vectors from video frames, employing a Channel Rearranging Matrix to cluster similar kinematic features. This image is then encoded with a Vision Transformer, and SMPL parameters are inferred through a regression network. Result: The proposed method achieved competitive performance on the 3DPW and Human3.6M datasets. Conclusion: HMR-ViT is an effective method for Human Mesh Recovery that utilizes both temporal and kinematic information. Abstract: Human Mesh Recovery (HMR) from an image is a challenging problem because of the inherent ambiguity of the task. Existing HMR methods utilized either temporal information or kinematic relationships to achieve higher accuracy, but there is no method using both. Hence, we propose "Video Inference for Human Mesh Recovery with Vision Transformer (HMR-ViT)" that can take into account both temporal and kinematic information. In HMR-ViT, a Temporal-kinematic Feature Image is constructed using feature vectors obtained from video frames by an image encoder. When generating the feature image, we use a Channel Rearranging Matrix (CRM) so that similar kinematic features could be located spatially close together. The feature image is then further encoded using Vision Transformer, and the SMPL pose and shape parameters are finally inferred using a regression network. Extensive evaluation on the 3DPW and Human3.6M datasets indicates that our method achieves a competitive performance in HMR.

[68] From images to properties: a NeRF-driven framework for granular material parameter inversion

Cheng-Hsi Hsiao,Krishna Kumar

Main category: cs.CV

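TL;DR: This paper introduces a novel method combining NeRF with MPM simulation that estimates granular material properties (such as the friction angle) from purely visual observations, with an error within 2 degrees.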

Details Motivation: Directly measuring granular material properties may be impractical or impossible in real-world scenarios, so an inverse-analysis method that relies only on visual observations is needed. Method: NeRF reconstructs the 3D geometry from multi-view images of the initial state, and that geometry is used to initialize material point positions in the MPM simulation; Bayesian optimization then minimizes an image loss to estimate the friction angle. Result: Experiments show that the proposed method can estimate the friction angle from visual observations with an error within 2 degrees, demonstrating its effectiveness. Conclusion: The paper proposes a new framework combining NeRF and MPM simulation that can infer the friction angle of granular materials from visual observations, with an error within 2 degrees. Abstract: We introduce a novel framework that integrates Neural Radiance Fields (NeRF) with Material Point Method (MPM) simulation to infer granular material properties from visual observations. Our approach begins by generating synthetic experimental data, simulating a plow interacting with sand. The experiment is rendered into realistic images as the photographic observations. These observations include multi-view images of the experiment's initial state and time-sequenced images from two fixed cameras. Using NeRF, we reconstruct the 3D geometry from the initial multi-view images, leveraging its capability to synthesize novel viewpoints and capture intricate surface details. The reconstructed geometry is then used to initialize material point positions for the MPM simulation, where the friction angle remains unknown. We render images of the simulation under the same camera setup and compare them to the observed images. By employing Bayesian optimization, we minimize the image loss to estimate the best-fitting friction angle. Our results demonstrate that friction angle can be estimated with an error within 2 degrees, highlighting the effectiveness of inverse analysis through purely visual observations. This approach offers a promising solution for characterizing granular materials in real-world scenarios where direct measurement is impractical or impossible.
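The inversion loop in this entry (simulate with a candidate friction angle, render, compare with the observed images, pick the angle with the lowest image loss) can be sketched with toy stand-ins. Everything here is an assumption for illustration: `toy_render` replaces the MPM simulation and rendering with a 1-D pile profile, the loss is a plain mean squared error, and a simple grid search is swapped in for the Bayesian optimization used in the paper.

```python
import math


def toy_render(friction_angle_deg, n_pixels=32):
    """Hypothetical stand-in for 'run MPM, then render': a 1-D pile profile."""
    slope = math.tan(math.radians(friction_angle_deg))
    xs = [i / (n_pixels - 1) * 2 - 1 for i in range(n_pixels)]
    return [max(0.0, (1 - abs(x)) * slope) for x in xs]


def image_loss(rendered, observed):
    """Mean squared error between two equally sized 'images'."""
    return sum((p - q) ** 2 for p, q in zip(rendered, observed)) / len(rendered)


def estimate_friction_angle(observed, lo=20.0, hi=45.0, step=0.5):
    """Grid search over candidate angles (a stand-in for Bayesian optimization)."""
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return min(candidates, key=lambda ang: image_loss(toy_render(ang), observed))


observed = toy_render(33.0)  # synthetic 'observation' with a known angle
print(estimate_friction_angle(observed))  # → 33.0
```

Grid search evaluates the simulator at every candidate, which is affordable only for a toy. With a real simulation, each evaluation is expensive, so a sample-efficient optimizer such as Bayesian optimization is the natural choice.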

[69] VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels

Xiwei Xuan,Xiaoqi Wang,Wenbin He,Jorge Piazentin Ono,Liang Gou,Kwan-Liu Ma,Liu Ren

Main category: cs.CV

TL;DR: The VISTA framework improves the quality of labels generated by multi-modal models and performs well on open-vocabulary image segmentation.

Details Motivation: Existing approaches focus too much on data quantity and neglect quality, and either lack a comprehensive perspective or fail to uncover the full range of potential issues, which hurts the performance of multi-modal models. Method: The paper introduces VISTA, a visual analytics framework that combines multi-phased data validation strategies with human expert involvement to identify, understand, and correct FM-generated labels. Result: Detailed use cases on two benchmark datasets and expert reviews validate VISTA's effectiveness from both quantitative and qualitative perspectives. Conclusion: VISTA is a visual analytics framework that, by integrating multi-phased data validation strategies with human expertise, effectively improves multi-modal model performance in open-vocabulary image segmentation. Abstract: The advances in multi-modal foundation models (FMs) (e.g., CLIP and LLaVA) have facilitated the auto-labeling of large-scale datasets, enhancing model performance in challenging downstream tasks such as open-vocabulary object detection and segmentation. However, the quality of FM-generated labels is less studied as existing approaches focus more on data quantity over quality. This is because validating large volumes of data without ground truth presents a considerable challenge in practice. Existing methods typically rely on limited metrics to identify problematic data, lacking a comprehensive perspective, or apply human validation to only a small data fraction, failing to address the full spectrum of potential issues. To overcome these challenges, we introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. Targeting the complex and demanding domain of open-vocabulary image segmentation, VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels. Through detailed use cases on two benchmark datasets and expert reviews, we demonstrate VISTA's effectiveness from both quantitative and qualitative perspectives.

[70] BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis

Florian Kofler,Marcel Rosier,Mehdi Astaraki,Hendrik Möller,Ilhem Isra Mekki,Josef A. Buchner,Anton Schmick,Arianna Pfiffer,Eva Oswald,Lucas Zimmer,Ezequiel de la Rosa,Sarthak Pati,Julian Canisius,Arianna Piffer,Ujjwal Baid,Mahyar Valizadeh,Akis Linardos,Jan C. Peeken,Surprosanna Shit,Felix Steinbauer,Daniel Rueckert,Rolf Heckemann,Spyridon Bakas,Jan Kirschke,Constantin von See,Ivan Ezhov,Marie Piraud,Benedikt Wiestler,Bjoern Menze

Main category: cs.CV

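TL;DR: BrainLesion Suite is a versatile Python toolkit for building brain lesion image analysis pipelines, offering preprocessing, modality synthesis, lesion inpainting, tumor segmentation, and model evaluation, and it is applicable to a range of biomedical image analysis tasks.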

Details Motivation: BrainLesion Suite is designed to provide a 'brainless' development experience, reducing cognitive load and simplifying the creation of complex workflows for clinical and scientific research. Method: At the core of BrainLesion Suite is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. It leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. Result: BrainLesion Suite provides tools for quantifying segmentation model performance, such as panoptica for computing lesion-wise metrics. The individual BrainLesion Suite packages and tutorials are available on GitHub. Conclusion: BrainLesion Suite is a flexible toolkit for building modular brain lesion image analysis pipelines in Python. It was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, but it can also be adapted to other biomedical image analysis applications. Abstract: BrainLesion Suite is a versatile toolkit for building modular brain lesion image analysis pipelines in Python. Following Pythonic principles, BrainLesion Suite is designed to provide a 'brainless' development experience, minimizing cognitive effort and streamlining the creation of complex workflows for clinical and scientific practice. At its core is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. BrainLesion Suite leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. BrainLesion Suite also enables quantifying segmentation model performance, with tools such as panoptica to compute lesion-wise metrics. Although BrainLesion Suite was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, it can be adapted for other biomedical image analysis applications. The individual BrainLesion Suite packages and tutorials are accessible on GitHub.

[71] Can Contrastive Learning Improve Class-Imbalanced Diffusion Model?

Fang Chen,Alex Villa,Gongbo Liang,Xiaoyi Lu,Meng Tang

Main category: cs.CV

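TL;DR: This paper proposes a contrastive learning framework to address the loss of tail-class image diversity when class-conditional diffusion models are trained on long-tailed data.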

Details Motivation: Training data for class-conditional image synthesis usually follows a long-tailed distribution, which leaves tail classes with few samples, causes mode collapse, and reduces the diversity of generated images. The paper aims to improve tail-class image diversity without compromising the fidelity and diversity of head-class images. Method: The authors introduce two contrastive loss functions: (1) an unsupervised InfoNCE loss that uses negative samples to increase the distance/dissimilarity among synthetic images; (2) an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps to enhance tail-class diversity, making the denoising process insensitive to class conditions in the initial steps and enriching tail classes through knowledge sharing. Result: The contrastive learning framework is easy to implement and outperforms standard DDPM and other methods for class imbalance on several long-tailed datasets (CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT). Conclusion: The paper is the first to apply conditional-unconditional alignment to diffusion models, and it successfully uses contrastive learning to improve diffusion model performance under class imbalance. Abstract: Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity of tail class images without compromising the fidelity and diversity of head class images. We achieve this by introducing two deceptively simple but highly effective contrastive loss functions. Firstly, we employ an unsupervised InfoNCE loss utilizing negative samples to increase the distance/dissimilarity among synthetic images, particularly for tail classes. To further enhance the diversity of tail classes, our second loss is an MSE loss that contrasts class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. Conditional-unconditional alignment has been shown to enhance the performance of long-tailed GAN. We are the first to adapt such alignment to diffusion models. We successfully leveraged contrastive learning for class-imbalanced diffusion models. Our contrastive learning framework is easy to implement and outperforms standard DDPM and alternative methods for class-imbalanced diffusion models across various datasets, including CIFAR10/100-LT, PlacesLT, TinyImageNetLT, and ImageNetLT.
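The two loss terms named in this entry can be written down in their generic textbook form. The sketch below is not the authors' implementation: how anchors, positives, and negatives are chosen among synthetic images, which timesteps count as large, and how the two terms are weighted are all left out, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: negative log-softmax of the positive among all candidates."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def cond_uncond_mse(eps_cond, eps_uncond):
    """MSE between class-conditional and unconditional noise predictions."""
    return sum((c - u) ** 2 for c, u in zip(eps_cond, eps_uncond)) / len(eps_cond)


print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))  # small: positive matches anchor
print(cond_uncond_mse([1.0, 2.0], [1.0, 4.0]))         # → 2.0
```

Minimizing the InfoNCE term pushes an anchor away from its negatives in embedding space, while minimizing the MSE term pulls the conditional prediction toward the unconditional one, which is the alignment the abstract describes for large timesteps.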

[72] Infinite Video Understanding

Dell Zhang,Xiangyu Chen,Jixiang Luo,Mengxi Jia,Changzhi Sun,Ruilong Ren,Jingren Liu,Hao Sun,Xuelong Li

Main category: cs.CV

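TL;DR: This paper examines the limitations of current video understanding models on long-duration video and proposes a future research direction, "Infinite Video Understanding", aimed at driving innovation in streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, and event-centric reasoning.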

Details Motivation: Current state-of-the-art models still face significant computational and memory constraints on long video sequences, and struggle to maintain temporal coherence, track complex events, and preserve fine-grained details, so new research directions are needed to break through these bottlenecks. Method: By analyzing the computational and memory limits that existing models face on long-duration video, and drawing on recent progress in long-video understanding and related fields, the paper lays out a research framework and the core challenges for "Infinite Video Understanding". Result: The paper introduces the concept of Infinite Video Understanding and outlines the key research directions and core challenges toward achieving it. Conclusion: The paper proposes Infinite Video Understanding as the next frontier for multimedia research, aiming to drive innovation in streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Abstract: The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding -- the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. 
Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.

[73] BlindSight: Harnessing Sparsity for Efficient VLMs

Tharun Adithya Srikrishnan,Deval Shah,Steven K. Reinhardt

Main category: cs.CV

TL;DR: BlindSight is a training-free method that optimizes VLM inference by leveraging sparse attention patterns, reducing computational load while maintaining model accuracy.

Details Motivation: The integration of vision data in large vision-language models (VLMs) increases prompt length, leading to longer prefill durations due to the quadratic complexity of attention computation. This work aims to mitigate this bottleneck by exploiting inherent sparsity in attention computation. Method: BlindSight applies attention sparsity masks based on sparse attention patterns observed in VLMs, which fall into three categories: sink-only, document mask, and hybrid document-sink mask. It uses dataset samples to derive a prompt-agnostic sparsity categorization for each attention head. Result: Evaluated on VLMs such as Qwen2-VL, Qwen2.5-VL, and Gemma-3, BlindSight achieves a 32%-41% reduction in FLOPs on average, with accuracy within -2% to +2% of the original model on most evaluated multi-image understanding benchmarks. Conclusion: BlindSight is an effective training-free approach for optimizing VLM inference, achieving significant FLOPs reduction with minimal impact on accuracy. Abstract: Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity of the attention computation, this results in a longer prefill duration. An approach to mitigate this bottleneck is to leverage the inherent sparsity in the attention computation. In our analysis of attention patterns in VLMs, we observe that a substantial portion of layers exhibit minimal cross-image attention, except through attention-sink tokens per image. These sparse attention patterns fall into distinct categories: sink-only, document mask and a hybrid document-sink mask. Based on this, we propose BlindSight: a training-free approach to optimize VLM inference using an input template-aware attention sparsity mask. We utilize samples from a dataset to derive a prompt-agnostic sparsity categorization for every attention head. 
We evaluate the proposed technique using VLMs such as Qwen2-VL, Qwen2.5-VL and Gemma-3. BlindSight results in a 32%-41% reduction in FLOPs on average with -2%-+2% accuracy compared to the original model in most evaluated multi-image understanding benchmarks.
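
The three mask categories named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the exact semantics of each category, the helper names `build_mask` and `count_kept`, and the six-token layout are assumptions made for illustration. It treats each image as a "document", marks one sink token per image, and builds causal boolean masks under those assumptions.

```python
def build_mask(doc_ids, is_sink, kind):
    """Build a causal boolean mask; mask[i][j] says whether query i may attend to key j.

    Assumed semantics (illustrative only, not taken from the paper):
      - "document":  keys in the same document (image) as the query
      - "sink_only": sink keys, plus the query's own position
      - "hybrid":    keys in the same document, plus sink keys of any document
    """
    n = len(doc_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only keys at or before the query
            same_doc = doc_ids[i] == doc_ids[j]
            if kind == "document":
                allowed = same_doc
            elif kind == "sink_only":
                allowed = is_sink[j] or i == j
            elif kind == "hybrid":
                allowed = same_doc or is_sink[j]
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            mask[i][j] = allowed
    return mask


def count_kept(mask):
    """Number of (query, key) pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


# Hypothetical layout: two images of three tokens each, first token of each is a sink.
doc_ids = [0, 0, 0, 1, 1, 1]
is_sink = [True, False, False, True, False, False]
masks = {kind: build_mask(doc_ids, is_sink, kind)
         for kind in ("document", "sink_only", "hybrid")}

n = len(doc_ids)
causal_total = n * (n + 1) // 2
for kind, mask in masks.items():
    print(f"{kind}: keeps {count_kept(mask)} of {causal_total} causal entries")
```

Every skipped entry is an attention score that never has to be computed, which is where a FLOPs reduction would come from. In a real model the per-head choice of mask would be derived from dataset samples, as the abstract describes.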

[74] From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion

Zhenyu Yu,Mohd Yamani Idna Idris,Hua Wang,Pei Wang,Junyi Chen,Kun Wang

Main category: cs.CV

TL;DR: This paper reviews the progression of quantitative remote sensing inversion methods, highlighting the shift from physics-based models to foundation models and outlining challenges and future directions.

Details Motivation: The motivation is to analyze how remote sensing inversion has evolved from traditional physics-based paradigms to data-driven and foundation model-based approaches, especially with advancements in AI. Method: The paper systematically reviews and compares the evolution of inversion techniques from physical models to machine learning methods and then to foundation models. Result: The study identifies recent advances in foundation models like SatMAE, GFM, and mmEarth, focusing on self-supervised pretraining, multi-modal integration, and cross-task adaptation. Conclusion: The paper concludes that the future development of remote sensing inversion lies in next-generation foundation models with unified modeling capacity, cross-domain generalization, and physical interpretability. Abstract: Quantitative remote sensing inversion aims to estimate continuous surface variables-such as biomass, vegetation indices, and evapotranspiration-from satellite observations, supporting applications in ecosystem monitoring, carbon accounting, and land management. With the evolution of remote sensing systems and artificial intelligence, traditional physics-based paradigms are giving way to data-driven and foundation model (FM)-based approaches. This paper systematically reviews the methodological evolution of inversion techniques, from physical models (e.g., PROSPECT, SCOPE, DART) to machine learning methods (e.g., deep learning, multimodal fusion), and further to foundation models (e.g., SatMAE, GFM, mmEarth). We compare the modeling assumptions, application scenarios, and limitations of each paradigm, with emphasis on recent FM advances in self-supervised pretraining, multi-modal integration, and cross-task adaptation. We also highlight persistent challenges in physical interpretability, domain generalization, limited supervision, and uncertainty quantification. 
Finally, we envision the development of next-generation foundation models for remote sensing inversion, emphasizing unified modeling capacity, cross-domain generalization, and physical interpretability.

[75] Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim,Khai Loong Aw,Klemen Kotar,Cristobal Eyzaguirre,Wanhee Lee,Yunong Liu,Jared Watrous,Stefan Stojanov,Juan Carlos Niebles,Jiajun Wu,Daniel L. K. Yamins

Main category: cs.CV

TL;DR: This paper introduces KL-tracing, a zero-shot method for extracting optical flow from videos using generative models without fine-tuning, achieving superior performance on both real and synthetic datasets.

Details Motivation: The research is motivated by the need for a practical solution to extract optical flow from videos without requiring scarce labels or suffering from sim-to-real gaps in synthetic datasets. Method: The method involves KL-tracing, which injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. This is done without any flow-specific fine-tuning. Result: The proposed method using KL-tracing on the LRAS architecture achieves state-of-the-art results, showing a 16.6% relative improvement in endpoint error on the real-world TAP-Vid DAVIS dataset and a 4.7% relative improvement on the synthetic TAP-Vid Kubric dataset. Conclusion: The study concludes that counterfactual prompting of controllable generative video models offers a scalable and effective alternative to supervised or photometric-loss approaches for high-quality optical flow extraction. Abstract: Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. 
We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
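
The KL-tracing step can be sketched on a toy grid. This is a minimal illustration, not the paper's procedure: the grid, the four-way categorical distributions, the direction of the divergence (perturbed relative to clean), and the function names `kl_divergence` and `kl_trace` are all assumptions. The idea shown is that the location whose predictive distribution changes most after the perturbation is taken as the place the tracer moved to.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def kl_trace(clean, perturbed, source):
    """Estimate the displacement of the point at `source`.

    clean / perturbed: dict mapping (row, col) -> next-frame categorical
    distribution predicted without / with the tracer perturbation.
    Returns (flow, kl_map), where flow is (d_row, d_col).
    """
    kl_map = {loc: kl_divergence(perturbed[loc], clean[loc]) for loc in clean}
    target = max(kl_map, key=kl_map.get)  # location with the largest change
    flow = (target[0] - source[0], target[1] - source[1])
    return flow, kl_map


# Hypothetical 3x3 grid: the perturbation injected at (1, 1) in the first frame
# changes the predicted distribution at (1, 2) in the next frame.
uniform = [0.25, 0.25, 0.25, 0.25]
clean = {(r, c): list(uniform) for r in range(3) for c in range(3)}
perturbed = {loc: list(dist) for loc, dist in clean.items()}
perturbed[(1, 2)] = [0.7, 0.1, 0.1, 0.1]

flow, kl_map = kl_trace(clean, perturbed, source=(1, 1))
print("estimated flow (d_row, d_col):", flow)
```

A real system would obtain `clean` and `perturbed` from two one-step rollouts of the frozen video model; here they are hard-coded so the example runs on its own.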

[76] MI CAM: Mutual Information Weighted Activation Mapping for Causal Visual Explanations of Convolutional Neural Networks

Ram S Iyer,Narayan S Iyer,Rugmini Ammal P

Main category: cs.CV

TL;DR: This paper introduces MI CAM, a novel post-hoc visual explanation method for convolutional neural networks, which generates saliency visualizations using mutual information and activation maps. Validated by counterfactual analysis, it offers causal interpretations and achieves competitive performance compared to existing state-of-the-art techniques.

Details Motivation: With machine vision playing an increasingly important role in critical areas like healthcare and automated systems, there is a growing need to understand the internal mechanisms and decision-making processes of convolutional neural networks. Method: The paper proposes MI CAM, a post-hoc visual explanation method based on activation mapping. It produces saliency visualizations by weighting each feature map by its mutual information with the input image, then linearly combining the weights and activation maps. Counterfactual analysis validates the causal interpretations. Result: MI CAM performs on par with current state-of-the-art methods and outperforms some of them in both qualitative and quantitative evaluations, while providing interpretable, unbiased justifications for model inference. Conclusion: MI CAM produces visual explanations that are on par with or better than state-of-the-art methods in qualitative and quantitative measures, providing unbiased justifications for model inference. Abstract: With the intervention of machine vision in our crucial day to day necessities including healthcare and automated power plants, attention has been drawn to the internal mechanisms of convolutional neural networks, and the reason why the network provides specific inferences. This paper proposes a novel post-hoc visual explanation method called MI CAM based on activation mapping. Differing from previous class activation mapping based approaches, MI CAM produces saliency visualizations by weighting each feature map through its mutual information with the input image and the final result is generated by a linear combination of weights and activation maps. It also adheres to producing causal interpretations as validated with the help of counterfactual analysis. We aim to exhibit the visual performance and unbiased justifications for the model inferencing procedure achieved by MI CAM. 
Our approach performs on par with all state-of-the-art methods and outperforms some in terms of qualitative and quantitative measures. The implementation of the proposed method can be found at https://anonymous.4open.science/r/MI-CAM-4D27
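
The weighting scheme described above (mutual information as the weight, then a linear combination of activation maps) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the histogram-based mutual information estimator, the bin count, the function names, and the requirement that feature maps are already flattened and resized to the image size are all choices made here.

```python
import math
from collections import Counter


def quantize(values, bins):
    """Map real values to integer bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Histogram estimate of I(X; Y) in nats for two equal-length sequences."""
    xq, yq = quantize(x, bins), quantize(y, bins)
    n = len(xq)
    px, py, pxy = Counter(xq), Counter(yq), Counter(zip(xq, yq))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mi_weighted_map(image, feature_maps, bins=4):
    """Weight each feature map by its MI with the image, then sum the weighted maps."""
    weights = [mutual_information(image, fm, bins) for fm in feature_maps]
    saliency = [sum(w * fm[k] for w, fm in zip(weights, feature_maps))
                for k in range(len(image))]
    return weights, saliency


# Hypothetical flattened 8-pixel "image" and two feature maps of the same size.
image = [0, 1, 2, 3, 4, 5, 6, 7]
informative_map = list(image)   # shares all of its structure with the image
constant_map = [1] * 8          # carries no information about the image
weights, saliency = mi_weighted_map(image, [informative_map, constant_map])
print("weights:", weights)
```

The constant map receives zero weight because it shares no information with the input, so only the informative map contributes to the saliency values.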

[77] RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze

Yunsoo Kim,Jinge Wu,Honghan Wu

Main category: cs.CV

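TL;DR: RadEyeVideo feeds radiologists' eye-gaze data to general-domain large vision-language models as video input, significantly improving their performance on chest X-ray analysis.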

Details Motivation: Existing chest X-ray analysis methods usually ignore the temporal order of radiologists' eye movements, even though this information can provide valuable diagnostic cues. Method: RadEyeVideo supplies eye-fixation data as a video sequence and is evaluated on chest X-ray report generation and disease diagnosis using three general-domain LVLMs with video input capabilities. Result: RadEyeVideo improves performance by up to 24.6% on report generation and by 15.2% on average across both tasks, and it surpasses medical-specific models trained on large chest X-ray datasets. Conclusion: RadEyeVideo is a novel way to combine radiologists' eye-gaze data with large vision-language models. It improves model performance in chest X-ray analysis and shows that effectively integrating domain-expert knowledge with LVLMs can significantly strengthen general-domain models on clinical tasks. Abstract: Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% for both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM model, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, trained on large Chest X-ray data. This work highlights that domain expert's knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models' capabilities in clinical tasks. RadEyeVideo is a step toward a scalable human-centered approach of utilizing LVLMs in medical image analytics.

[78] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

Yiyang Chen,Shanshan Zhao,Lunhao Duan,Changxing Ding,Dacheng Tao

Main category: cs.CV

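TL;DR: This paper proposes PointSD, which uses the Stable Diffusion model to improve point cloud self-supervised learning. It replaces the text encoder with a 3D encoder and trains a point-to-image diffusion model for semantic learning, and experiments confirm its effectiveness.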

Details Motivation: Existing 3D diffusion models are limited by the small size of available 3D datasets, whereas text-to-image diffusion models trained on large-scale datasets (such as Stable Diffusion) are more capable and can help overcome this limitation. Method: The text encoder of the Stable Diffusion model is replaced with a 3D encoder to train a point-to-image diffusion model conditioned on point clouds. SD features are then extracted using noise-free images and point clouds, and a 3D backbone is trained to align its features with these SD features. Result: The PointSD framework enhances point cloud self-supervised learning and performs strongly on downstream tasks. The code is open source. Conclusion: By drawing on large-scale text-to-image diffusion models such as Stable Diffusion, PointSD improves point cloud self-supervised learning, and experiments show that the method performs well on downstream tasks. Abstract: Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.

[79] Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production

Maoxiao Ye,Xinfeng Ye,Mano Manoharan

Main category: cs.CV

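TL;DR: This paper proposes a new approach that combines autoregressive and diffusion models to improve the quality and real-time performance of sign language production, and introduces a multi-scale pose representation and a confidence-aware attention mechanism.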

Details Motivation: Traditional autoregressive methods suffer from error accumulation during inference, while recent diffusion-based methods are limited in real-time tasks by their iterative nature and the need to denoise entire sequences. Method: A new hybrid approach combines autoregressive and diffusion models for Sign Language Production (SLP), together with a Multi-Scale Pose Representation module and a Confidence-Aware Causal Attention mechanism. Result: Experiments on the PHOENIX14T and How2Sign datasets validate the method and show its advantages in generation quality and real-time streaming efficiency. Conclusion: The experimental results show that the proposed combination of autoregressive and diffusion models is effective in both generation quality and real-time streaming efficiency. Abstract: Earlier Sign Language Production (SLP) models typically relied on autoregressive methods that generate output tokens one by one, which inherently provide temporal alignment. Although techniques like Teacher Forcing can prevent model collapse during training, they still cannot solve the problem of error accumulation during inference, since ground truth is unavailable at that stage. In contrast, more recent approaches based on diffusion models leverage step-by-step denoising to enable high-quality generation. However, the iterative nature of these models and the requirement to denoise entire sequences limit their applicability in real-time tasks like SLP. To address it, we apply a hybrid approach combining autoregressive and diffusion models to SLP for the first time, leveraging the strengths of both models in sequential dependency modeling and output refinement. To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators and integrates them via a Multi-Scale Fusion module. Furthermore, we introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process, improving accuracy and robustness. Extensive experiments on the PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method in both generation quality and real-time streaming efficiency.

[80] RoHOI: Robustness Benchmark for Human-Object Interaction Detection

Di Wen,Kunyu Peng,Kailun Yang,Yufan Chen,Ruiping Liu,Junwei Zheng,Alina Roitberg,Rainer Stiefelhagen

Main category: cs.CV

TL;DR: This paper introduces RoHOI, a robustness benchmark for Human-Object Interaction detection, and proposes SAMPL, a novel learning strategy that significantly improves model resilience under real-world challenges.

Details Motivation: Existing HOI detection models degrade under real-world corruptions, so a robustness benchmark and solution are needed. Method: A Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy is proposed to enhance robust feature learning by using holistic and partial cues. Result: The RoHOI benchmark with 20 corruption types reveals performance drops in current models, while the proposed SAMPL approach outperforms state-of-the-art methods. Conclusion: The proposed SAMPL strategy improves robustness in HOI detection and sets a new standard, as demonstrated by extensive experiments. Abstract: Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusion, and noise. Our benchmark, RoHOI, includes 20 corruption types based on HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the related field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.

[81] Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning

Linlan Huang,Xusheng Cao,Haori Lu,Yifan Meng,Fei Yang,Xialei Liu

Main category: cs.CV

TL;DR: This paper proposes MG-CLIP, a method that leverages modality gap preservation and compensation to improve CLIP's continual learning performance, particularly in class-incremental scenarios.

Details Motivation: Existing works overlook the inherent modality gap in CLIP, which is crucial for its generalization and adaptability. This paper aims to address this issue within continual learning scenarios. Method: MG-CLIP focuses on modality gap preservation to mitigate forgetting and modality gap compensation to enhance adaptability to new data, based on observations of modality gap variations during fine-tuning. Result: Extensive experiments show that MG-CLIP outperforms existing approaches in class-incremental learning without requiring additional replay data. Conclusion: The proposed MG-CLIP method effectively improves CLIP's performance in class-incremental learning by preserving and compensating for the modality gap, offering a new perspective for continual learning. Abstract: Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. 
Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at https://github.com/linlany/MindtheGap.

[82] SnapMoGen: Human Motion Generation from Expressive Texts

Chuan Guo,Inwoo Hwang,Jian Wang,Bing Zhou

Main category: cs.CV

TL;DR: This paper introduces the SnapMoGen dataset and the MoMask++ model to improve fine-grained control and generalization in text-to-motion generation.

Details Motivation: Current methods are limited to synthesizing motion from short or general text prompts and lack fine-grained controllability and generalization. Method: The paper presents the SnapMoGen dataset and the MoMask++ model, which transforms motion into multi-scale token sequences and is trained as a single generative masked transformer. Result: The dataset contains 20K motion clips with descriptions averaging 48 words each, and MoMask++ performs strongly on benchmarks. Conclusion: MoMask++ achieves state-of-the-art performance, and using an LLM to reformat user prompts lets it handle casual input in SnapMoGen's expressive, narrative style. Abstract: Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/

[83] PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment

Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno

Main category: cs.CV

TL;DR: PoseLLM uses a nonlinear MLP connector to improve language-guided human pose estimation.

Details Motivation: Traditional human pose estimation methods are constrained by encoded keypoint priors, and LocLLM's linear projector cannot capture complex visual-textual interactions, which hurts localization accuracy. Method: The linear projector is replaced with a lightweight two-layer MLP vision-language connector that uses GELU activation for cross-modal feature transformation. Result: PoseLLM reaches 77.8 AP on the COCO validation set, +0.4 AP over LocLLM, and shows strong zero-shot generalization on the Human-Art and MPII datasets. Conclusion: By using a nonlinear MLP vision-language connector, PoseLLM significantly improves localization accuracy while preserving generalization. Abstract: Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
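
A two-layer MLP connector with GELU activation, as described in the abstract, has a simple shape: linear layer, GELU, linear layer. The sketch below is a toy version in plain Python, not the released PoseLLM code. The dimensions (`VISION_DIM`, `HIDDEN_DIM`, `LLM_DIM`), the random weights, and the function names are hypothetical and chosen only so the example runs on its own.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_connector(patch_feature, w1, b1, w2, b2):
    """Project one visual patch feature into the language model's embedding space."""
    hidden = [gelu(h) for h in linear(patch_feature, w1, b1)]
    return linear(hidden, w2, b2)


# Hypothetical sizes; real models use far larger dimensions.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 4, 8, 6
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w1, b1 = random_matrix(HIDDEN_DIM, VISION_DIM), [0.0] * HIDDEN_DIM
w2, b2 = random_matrix(LLM_DIM, HIDDEN_DIM), [0.0] * LLM_DIM

patch = [0.2, -1.0, 0.5, 0.3]
token_embedding = mlp_connector(patch, w1, b1, w2, b2)
print("projected embedding has", len(token_embedding), "dimensions")
```

Dropping the `gelu` call collapses the two layers into a single linear map, which is the linear-projector baseline the abstract contrasts against.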

[84] $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

Zhimin Liao,Ping Wei,Ruijie Zhang,Shuaijia Chen,Haoxuan Wang,Ziyang Ren

Main category: cs.CV

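TL;DR: This paper proposes $I^{2}$-World, an efficient 4D occupancy forecasting framework that decouples scene tokenization into intra-scene and inter-scene tokenizers and pairs them with an encoder-decoder architecture, significantly improving accuracy and computational efficiency.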

Details Motivation: Occupancy-based world models show promise for autonomous driving systems, but efficiently tokenizing complex 3D scenes remains a key challenge. Method: $I^{2}$-World is an encoder-decoder architecture that splits scene tokenization into an intra-scene tokenizer, which compresses 3D scenes, and an inter-scene tokenizer, which aggregates temporal dependencies. Result: Experiments show that $I^{2}$-World outperforms existing methods by 25.1% in mIoU and 36.9% in IoU, while requiring only 2.9 GB of training memory and reaching 37.0 FPS in real-time inference. Conclusion: $I^{2}$-World achieves state-of-the-art performance in 4D occupancy forecasting with excellent computational efficiency. Abstract: Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.

[85] Stable Score Distillation

Haiming Zhu,Yangyang Xu,Chenshu Xu,Tingrui Shen,Wenxi Liu,Yong Du,Jun Yu,Shengfeng He

Main category: cs.CV

TL;DR: Stable Score Distillation (SSD) improves the stability and effectiveness of text-guided image and 3D editing by simplifying the optimization process and enhancing alignment with the source prompt.

Details Motivation: Current diffusion-based methods like Delta Denoising Score face challenges in stability, spatial control, and editing strength because they rely on complex auxiliary structures that create conflicting optimization signals. Method: Stable Score Distillation (SSD) uses the Classifier-Free Guidance (CFG) equation for cross-prompt alignment, adds a constant-term null-text branch to stabilize optimization, and includes a prompt enhancement branch to boost editing strength. Result: SSD achieves state-of-the-art performance in 2D and 3D editing tasks such as NeRF and text-driven style edits with faster convergence and reduced complexity. Conclusion: SSD provides a robust and efficient solution for text-guided image and 3D editing with improved stability, alignment, and editing strength. Abstract: Text-guided image and 3D editing have advanced with diffusion-based models, yet methods like Delta Denoising Score often struggle with stability, spatial control, and editing strength. These limitations stem from reliance on complex auxiliary structures, which introduce conflicting optimization signals and restrict precise, localized edits. We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. Specifically, SSD utilizes the Classifier-Free Guidance (CFG) equation to achieve cross-prompt alignment, and introduces a constant-term null-text branch to stabilize the optimization process. This approach preserves the original content's structure and ensures that editing trajectories are closely aligned with the source prompt, enabling smooth, prompt-specific modifications while maintaining coherence in surrounding regions. Additionally, SSD incorporates a prompt enhancement branch to boost editing strength, particularly for style transformations. 
Our method achieves state-of-the-art results in 2D and 3D editing tasks, including NeRF and text-driven style edits, with faster convergence and reduced complexity, providing a robust and efficient solution for text-guided editing.
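
For reference, the Classifier-Free Guidance equation that the abstract mentions is usually written in the form below. This is the standard CFG expression only, not SSD's full objective, and the symbols are notation assumed here rather than taken from the paper: $\epsilon_\theta$ is the denoiser's noise prediction, $z_t$ the noisy latent at step $t$, $c$ the text prompt, $\varnothing$ the null (empty) prompt, and $w$ the guidance scale.

```latex
\tilde{\epsilon}_\theta(z_t, c)
  = \epsilon_\theta(z_t, \varnothing)
  + w \, \bigl( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \bigr)
```

With $w = 1$ this reduces to the plain conditional prediction $\epsilon_\theta(z_t, c)$, and larger $w$ pushes the prediction further along the direction from the null-text prediction toward the prompt-conditioned one. The null-text term $\epsilon_\theta(z_t, \varnothing)$ is the quantity a null-text branch would evaluate.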

[86] Learning and Transferring Better with Depth Information in Visual Reinforcement Learning

Zichun Xu,Yuntao Li,Zhaomin Wang,Lei Zhuang,Guocai Yang,Jingdong Zhao

Main category: cs.CV

TL;DR: This paper proposes a visual backbone using vision transformers to fuse RGB and depth data, enhancing model generalization through unsupervised learning and curriculum learning for sim2real transfer.

Details Motivation: The motivation is to enhance the generalization of visual models by leveraging the robustness of depth information and its inherent 3D spatial details. Method: The method involves processing different modalities with separate CNN stems, utilizing a scalable vision transformer for visual representations, incorporating a contrastive unsupervised learning scheme, and implementing curriculum learning for sim2real transfer. Result: The result demonstrates accelerated sample efficiency in reinforcement learning and effective deployment of domain randomization through a flexible curriculum learning schedule. Conclusion: The paper concludes that the proposed visual backbone effectively enhances generalization by fusing RGB and depth modalities. Abstract: Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme is designed with masked and unmasked tokens to accelerate the sample efficiency during the reinforcement learning progress. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes.

[87] Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning

Yongwei Jiang,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

TL;DR: This paper proposes LGSP-Prompt, a new FSCIL method that shifts prompt learning from the token dimension to the spatial dimension, effectively mitigating overfitting and achieving state-of-the-art performance on multiple benchmarks.

Details Motivation: Existing prompt-based methods show degraded performance in FSCIL settings, and the challenges of data scarcity and incremental learning need to be addressed. Method: LGSP-Prompt is proposed, which combines local spatial features with global frequency-domain representations and enables dynamic prompt selection. Result: Experiments show that LGSP-Prompt performs strongly on multiple FSCIL benchmarks, significantly outperforming existing methods. Conclusion: LGSP-Prompt effectively addresses token-dimension saturation in FSCIL and achieves state-of-the-art performance. Abstract: Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our implementation is available at https://github.com/Jywsuperman/LGSP.

[88] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models

Qiyan Zhao,Xiaofeng Zhang,Yiheng Li,Yun Xing,Xiaosong Yuan,Feilong Tang,Sinan Fan,Xuhang Chen,Xuyao Zhang,Dahan Wang

Main category: cs.CV

TL;DR: This paper introduces MCA-LLaVA, a method that improves multimodal alignment in LVLMs by addressing image alignment bias, thereby reducing hallucinations.

Details Motivation: Hallucinations in LVLMs are partly caused by misalignment between multimodal features, which is influenced by the long-term decay in RoPE. This work aims to address this issue by improving how image tokens interact with instruction tokens. Method: The study analyzes the impact of long-term decay in RoPE on multimodal alignment, proposes MCA-LLaVA based on Manhattan distance to extend decay to two-dimensional, multi-directional spatial modeling, and evaluates its performance across various benchmarks. Result: Experimental results show that MCA-LLaVA effectively reduces hallucinations and performs well across various benchmarks, demonstrating its generality and effectiveness. Conclusion: MCA-LLaVA effectively mitigates hallucinations in LVLMs by addressing image alignment bias through enhanced positional modeling that considers both one-dimensional sequence order and two-dimensional spatial positions. Abstract: Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. 
MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code is available at https://github.com/ErikZ719/MCA-LLaVA.
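The image alignment bias described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper names are hypothetical, and it assumes image tokens are flattened in row-major (raster) order with the instruction tokens placed immediately after them. Under that assumption, the 1D sequence distance to the instruction is far smaller for bottom-right tokens than for top-left ones, whereas a Manhattan distance on the 2D grid depends only on spatial position.

```python
def raster_distance(row, col, grid_size):
    """1D sequence distance from image token (row, col) to the first
    instruction token, assuming row-major flattening and an instruction
    that starts right after the last image token (an assumption)."""
    index = row * grid_size + col
    return grid_size * grid_size - index


def manhattan_distance(a, b):
    """Manhattan (L1) distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


grid_size = 4
# Top-left token is 16 steps from the instruction, bottom-right only 1,
# so a decay over 1D distance favors the bottom-right region.
top_left_1d = raster_distance(0, 0, grid_size)
bottom_right_1d = raster_distance(3, 3, grid_size)
# On the 2D grid the two corners are 6 steps apart, regardless of how
# the grid was flattened into a sequence.
corner_gap_2d = manhattan_distance((0, 0), (3, 3))
```

How MCA-LLaVA turns the Manhattan distance into a multi-directional decay inside RoPE is specified in the paper and its repository, not in this sketch.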

[89] THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage

Trong-Thuan Nguyen,Pha Nguyen,Jackson Cothren,Alper Yilmaz,Minh-Triet Tran,Khoa Luu

Main category: cs.CV

TL;DR: This paper introduces the THYME approach for dynamic scene graph generation, which effectively captures spatial and temporal dependencies, and presents a new aerial video dataset, AeroEye-v1.0, demonstrating improved performance over existing methods.

Details Motivation: Existing methods for video scene graph generation suffer from fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. Method: Temporal Hierarchical Cyclic Scene Graph (THYME) with hierarchical feature aggregation and cyclic temporal refinement, along with the introduction of the AeroEye-v1.0 aerial video dataset. Result: Extensive experiments show that the THYME approach outperforms state-of-the-art methods in scene understanding for both ground-view and aerial scenarios. Conclusion: The THYME approach improves dynamic scene graph generation by modeling multi-scale spatial context and enforcing temporal consistency, as demonstrated through experiments on ASPIRe and the new AeroEye-v1.0 dataset. Abstract: The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often suffer from fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which synergistically integrates hierarchical feature aggregation with cyclic temporal refinement to address these limitations. In particular, THYME effectively models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset enriched with five types of interactivity that overcome the constraints of existing datasets and provide a comprehensive benchmark for dynamic scene graph generation. 
Empirically, extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that the proposed THYME approach outperforms state-of-the-art methods, offering improved scene understanding in ground-view and aerial scenarios.

[90] Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves

Alexander C. Ogren,Berthy T. Feng,Jihoon Ahn,Katherine L. Bouman,Chiara Daraio

Main category: cs.CV

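TL;DR: The paper proposes a new method for inferring material thickness and stiffness by analyzing video of surface waves, with applications such as at-home health monitoring.

Details Motivation: Surface wave propagation carries information about the physical properties beneath a material's surface, and these properties can be inferred through video analysis. Method: A dispersion relation is extracted from the video, and a physics-based optimization problem is solved to find the best-fitting thickness and stiffness parameters. Result: On both simulated and real data, the method agrees closely with ground-truth measurements. Conclusion: The paper proposes a method for inferring the thickness and stiffness of a structure from video of surface waves and validates it on both simulated and real data. Abstract: Wave propagation on the surface of a material contains information about physical properties beneath its surface. We propose a method for inferring the thickness and stiffness of a structure from just a video of waves on its surface. Our method works by extracting a dispersion relation from the video and then solving a physics-based optimization problem to find the best-fitting thickness and stiffness parameters. We validate our method on both simulated and real data, in both cases showing strong agreement with ground-truth measurements. Our technique provides a proof-of-concept for at-home health monitoring of medically-informative tissue properties, and it is further applicable to fields such as human-computer interaction.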

[91] Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models

Xiao Liang,Di Wang,Zhicheng Jiao,Ronghan Li,Pengfei Yang,Quan Wang,Tat-Seng Chua

Main category: cs.CV

TL;DR: This paper proposes Expert-CFG, a framework that improves the reliability of Medical Vision Language Models by integrating clinical expertise without additional training, demonstrating strong performance on medical benchmarks.

Details Motivation: Current Medical Vision Language Models (MedVLMs) exhibit probabilistic uncertainties that lead to erroneous responses, which are particularly problematic in medical applications. Existing solutions rely on costly training adjustments and lack sufficient alignment with clinical expertise. Method: An expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) was developed. It uses uncertainty estimation to identify unreliable outputs, retrieves relevant references, and applies classifier-free guidance to refine token embeddings. Result: Expert-CFG with 4.2B parameters and limited expert annotations outperformed state-of-the-art models with 13B parameters across three medical visual question answering benchmarks. Conclusion: The Expert-CFG framework effectively aligns MedVLM with clinical expertise without additional training, making it feasible for deployment in resource-limited clinical settings. Abstract: The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses-an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. 
It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use.
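For readers unfamiliar with classifier-free guidance, the standard update can be sketched as follows. This is the generic formula, guided = unconditional + scale × (conditional − unconditional), applied to plain vectors; the function name and the vectors are hypothetical, and how Expert-CFG builds the conditional and unconditional branches from expert highlights and applies the update to token embeddings is not described in the abstract.

```python
def classifier_free_guidance(unconditional, conditional, guidance_scale):
    """Generic classifier-free guidance: move the unconditional vector
    toward (and, for scale > 1, past) the conditional one."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(unconditional, conditional)
    ]


# Hypothetical 2-dimensional vectors standing in for model outputs
# without and with the expert-highlighted context.
without_hint = [0.0, 1.0]
with_hint = [1.0, 1.0]
guided = classifier_free_guidance(without_hint, with_hint, guidance_scale=2.0)
```

A scale of 0 returns the unconditional vector and a scale of 1 returns the conditional one; larger values amplify the difference between the two.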

[92] Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline

Shiyi Mu,Zichong Gu,Hanqi Lyu,Yilin Gao,Shugong Xu

Main category: cs.CV

TL;DR: This paper proposes the S3AD algorithm to improve the generalization of 3D detection models for autonomous driving, particularly for detecting rare anomalies. It also introduces a new dataset, KITTI-AR, with additional categories to test and enhance anomaly detection capabilities.

Details Motivation: The motivation stems from the limitations of current 3D detection models trained on closed sets, which often fail to detect rare anomaly objects on open conventional roads. There is a need to improve the generalization capability of these models to handle arbitrary shapes and filter out anomalies effectively. Method: The paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to enhance generalization ability. It also introduces an anomaly scoring algorithm based on foreground confidence prediction. Additionally, a 3D rendering method is used to synthesize two augmented reality binocular stereo 3D detection datasets, KITTI-AR-ExD and KITTI-AR-OoD. Result: The result includes the successful development of the S3AD algorithm and the synthesis of the KITTI-AR dataset, which consists of 97 new categories. The KITTI-AR-ExD subset addresses sparse sample distribution with 39 common categories, while the KITTI-AR-OoD subset simulates zero-shot scenarios with 58 rare categories for evaluating 3D anomaly detection. Conclusion: This paper concludes that the proposed S3AD algorithm effectively enhances the generalization ability of 3D detection models for targets of arbitrary shapes and achieves target-level anomaly scoring. The creation of the KITTI-AR dataset further aids in verifying and enhancing the generalization of anomaly detection. Abstract: 3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to possess the capability to filter out anomalies. 
The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. To further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets, together named KITTI-AR. KITTI-AR extends KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which is not used in training in order to simulate zero-shot scenarios in real-world settings and serves solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at https://github.com/xxxx/xxx).

[93] 360-Degree Full-view Image Segmentation by Spherical Convolution compatible with Large-scale Planar Pre-trained Models

Jingguo Liu,Han Yu,Shigang Li,Jianfeng Li

Main category: cs.CV

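TL;DR: This work addresses the limited performance of panoramic image processing caused by models that cannot adapt to distortion. It introduces a new spherical sampling technique that lets existing 2D pre-trained models be used effectively and performs well in practice.

Details Motivation: Because large-scale, million-image datasets are currently lacking, panoramic image tasks rely mainly on existing 2D pre-trained image models as backbone networks, but these models cannot recognize the distortions and discontinuities inherent in panoramic images, which hurts their performance on these tasks. Method: Spherical discrete sampling is performed based on the weights of the pre-trained model to mitigate distortion and obtain favorable initial training values; the method is also applied to panoramic image segmentation, using features obtained from the spherical model as masks for specific channel attentions. Result: The proposed method achieves satisfactory results on the commonly used indoor dataset Stanford2D3D, performing well on panoramic image segmentation in particular. Conclusion: The paper proposes a new spherical sampling method for panoramic images that can directly use existing pre-trained models developed for 2D images, and it achieves notable results on the common indoor dataset Stanford2D3D. Abstract: Due to the current lack of large-scale datasets at the million-scale level, tasks involving panoramic images predominantly rely on existing two-dimensional pre-trained image benchmark models as backbone networks. However, these networks are not equipped to recognize the distortions and discontinuities inherent in panoramic images, which adversely affects their performance in such tasks. In this paper, we introduce a novel spherical sampling method for panoramic images that enables the direct utilization of existing pre-trained models developed for two-dimensional images. Our method employs spherical discrete sampling based on the weights of the pre-trained models, effectively mitigating distortions while achieving favorable initial training values. Additionally, we apply the proposed sampling method to panoramic image segmentation, utilizing features obtained from the spherical model as masks for specific channel attentions, which yields commendable results on the commonly used indoor dataset Stanford2D3D.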

[94] Online Long-term Point Tracking in the Foundation Model Era

Görkay Aydemir

Main category: cs.CV

TL;DR: This paper introduces Track-On, a transformer-based online tracking method that achieves state-of-the-art performance by integrating visual foundation models with temporal memory.

Details Motivation: Existing long-term tracking methods mostly work offline, but real-world applications require online tracking where only past frames are available. Visual foundation models offer robust spatial features but lack temporal reasoning. This work aims to integrate them into an online tracking framework. Method: Track-On is a transformer-based model that treats each tracked point as a query and processes video frames sequentially, using memory to propagate appearance and context across frames. Result: Track-On achieves state-of-the-art results on seven public benchmarks for online long-term point tracking. Conclusion: Track-On is a new state-of-the-art method for long-term point tracking in online settings, achieving excellent performance across seven benchmarks. Abstract: Point tracking aims to identify the same physical point across video frames and serves as a geometry-aware representation of motion. This representation supports a wide range of applications, from robotics to augmented reality, by enabling accurate modeling of dynamic environments. Most existing long-term tracking approaches operate in an offline setting, where future frames are available to refine predictions and recover from occlusions. However, real-world scenarios often demand online predictions: the model must operate causally, using only current and past frames. This constraint is critical in streaming video and embodied AI, where decisions must be made immediately based on past observations. Under such constraints, viewpoint invariance becomes essential. Visual foundation models, trained on diverse large-scale datasets, offer the potential for robust geometric representations. While they lack temporal reasoning on their own, they can be integrated into tracking pipelines to enrich spatial features. 
In this thesis, we address the problem of long-term point tracking in an online setting, where frames are processed sequentially without access to future information or sliding windows. We begin by evaluating the suitability of visual foundation models for this task and find that they can serve as useful initializations and be integrated into tracking pipelines. However, to enable long-term tracking in an online setting, a dedicated design is still required. In particular, maintaining coherence over time in this causal regime requires memory to propagate appearance and context across frames. To address this, we introduce Track-On, a transformer-based model that treats each tracked point as a query and processes video frames one at a time. Track-On sets a new state of the art across seven public benchmarks, demonstrating the feasibility of long-term tracking without future access.

[95] Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Behraj Khan,Tahir Syed

Main category: cs.CV

TL;DR: This paper proposes StaRFM, a new framework that introduces FIP and CMP to address distribution shift and confidence misalignment in foundation models for computer vision and medical image analysis, achieving notable improvements across multiple tasks and datasets.

Details Motivation: Existing solutions are typically domain-specific, while foundation models such as CLIP and SAM face two key challenges in deployment: distribution shift between training and test data, and confidence misalignment that leads to overconfident incorrect predictions. Method: A Fisher information penalty (FIP) and a confidence misalignment penalty (CMP) are proposed, to reduce covariate shift and to calibrate uncertainty in segmentation tasks, respectively. PAC-Bayes bounds are also derived theoretically to explain why these methods work. Result: StaRFM performs strongly on 19 vision datasets (e.g., ImageNet, Office-Home), with +3.5% accuracy and 28% lower ECE; it reaches 84.7% DSC and 4.8 mm HD95 on medical segmentation tasks (e.g., BraTS, ATLAS); and its cross-domain performance gap is 40% lower than that of existing methods. Conclusion: StaRFM is a unified framework that effectively addresses the distribution shift and confidence misalignment that foundation models such as CLIP and SAM face in deployment, and it performs strongly across multiple tasks and datasets. Abstract: Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models is hindered by two key challenges: distribution shift between training and test data, and confidence misalignment that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose StaRFM, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM shows consistent gains: +3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7% DSC and 4.8 mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and a 40% lower cross-domain performance gap compared to prior benchmarking methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at https://anonymous.4open.science/r/StaRFM-C0CD/README.md
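The two calibration quantities named in this entry, ECE and the Brier score, have standard definitions that can be sketched in a few lines. This is not StaRFM's penalty term: it is a minimal illustration that assumes the common equal-width-bin ECE and the top-label form of the Brier score, and the function names and sample values are hypothetical.

```python
def brier_score(confidences, correct):
    """Top-label Brier score: mean squared gap between the predicted
    confidence and the 0/1 correctness indicator."""
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins: the sample-weighted average
    of |mean confidence - accuracy| over the non-empty bins."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(confidences, correct):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes to last bin
        bins[index].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece


# Hypothetical predictions: two overconfident (0.9, half correct) and
# two underconfident (0.6, all correct).
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

A lower value is better for both metrics; the "28% lower ECE" reported above refers to a reduction in the first quantity.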

[96] EgoAnimate: Generating Human Animations from Egocentric top-down Views

G. Kutay Türkoglu,Julian Tanke,Iheb Belgacem,Lev Markhasin

Main category: cs.CV

TL;DR: This study introduces a generative prior-based method using ControlNet and Stable Diffusion to convert a single top-down egocentric image into a realistic frontal view, enabling avatar motion generation and improving telepresence systems.

Details Motivation: The motivation stems from the need for an ideal digital telepresence experience which requires accurate replication of a person's body, clothing, and movements. The egocentric perspective allows for a portable and cost-effective device but introduces challenges like occlusions and distorted body proportions. Method: The method involves a pipeline that uses ControlNet and a Stable Diffusion backbone to generate realistic frontal views from occluded top-down images. It is based on generative prior-based approach and aims to reduce training burden and improve generalizability. Result: The result is a novel approach that converts a single top-down egocentric image into a realistic frontal representation, allowing for the generation of avatar motions from minimal input. Conclusion: The paper concludes that their method successfully converts a single top-down egocentric image into a realistic frontal representation, enabling the generation of avatar motions from minimal input and contributing to more accessible and generalizable telepresence systems. Abstract: An ideal digital telepresence experience requires accurate replication of a person's body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. 
Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems.

[97] PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process

Shiqi Jiang,Xinpeng Li,Xi Mao,Changbo Wang,Chenhui Li

Main category: cs.CV

TL;DR: This paper proposes a new framework for assessing the painting process, comprising PPAD, the first large-scale painting process dataset, and PPJudge, a Transformer-based assessment model, and it significantly improves assessment quality.

Details Motivation: Existing methods focus mainly on static final images and overlook the dynamic, multi-stage nature of artistic creation, so a painting process assessment framework better aligned with human judgment is needed. Method: A Transformer-based model, PPJudge, is proposed, combining temporally-aware positional encoding with a heterogeneous mixture-of-experts architecture, and PPAD, the first large-scale dataset of real and synthetic painting process images, is constructed. Result: Experiments show that the proposed method outperforms existing baselines in accuracy, robustness, and agreement with human judgment. Conclusion: By introducing the PPAD dataset and the PPJudge model, this work improves the accuracy, robustness, and human alignment of painting process assessment, offering new perspectives for computational creativity and art education. Abstract: Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.

[98] AGCD-Net: Attention Guided Context Debiasing Network for Emotion Recognition

Varsha Devi,Amine Bohi,Pardeep Kumar

Main category: cs.CV

TL;DR: AGCD-Net mitigates context bias by introducing causal theory and attention mechanisms, improving the accuracy of emotion recognition in real-world scenarios.

Details Motivation: Traditional context-aware emotion recognition (CAER) methods suffer from spurious correlations between background context and emotion labels (e.g., associating "garden" with "happy"); this context bias needs to be reduced to strengthen affective computing. Method: A new convolutional encoder, Hybrid ConvNeXt, is proposed, integrating a Spatial Transformer Network and Squeeze-and-Excitation layers for feature recalibration; an Attention Guided Causal Intervention Module (AG-CIM) perturbs context features, isolates spurious correlations, and corrects the bias. Result: Experiments on the CAER-S dataset show that AGCD-Net achieves state-of-the-art performance, demonstrating the importance of causal debiasing for robust emotion recognition in complex environments. Conclusion: AGCD-Net effectively mitigates context bias in CAER, providing a new approach to more reliable affective computing in real-world scenarios. Abstract: Context-aware emotion recognition (CAER) enhances affective computing in real-world scenarios, but traditional methods often suffer from context bias: spurious correlation between background context and emotion labels (e.g., associating "garden" with "happy"). In this paper, we propose AGCD-Net, an Attention Guided Context Debiasing model that introduces Hybrid ConvNeXt, a novel convolutional encoder that extends the ConvNeXt backbone by integrating Spatial Transformer Network and Squeeze-and-Excitation layers for enhanced feature recalibration. At the core of AGCD-Net is the Attention Guided - Causal Intervention Module (AG-CIM), which applies causal theory, perturbs context features, isolates spurious correlations, and performs an attention-driven correction guided by face features to mitigate context bias. Experimental results on the CAER-S dataset demonstrate the effectiveness of AGCD-Net, achieving state-of-the-art performance and highlighting the importance of causal debiasing for robust emotion recognition in complex settings.

[99] Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching

Junyu Chen,Yihua Gao,Mingyuan Ge,Mingyong Li

Main category: cs.CV

TL;DR: This work proposes AAHR, a new image-text matching framework that resolves semantic ambiguity and exploits high-order relations, achieving better performance on multiple datasets.

Details Motivation: Existing methods struggle with high-order associations and semantic ambiguities among similar instances, and they fail to fully exploit the neighborhood relationships among semantically similar instances within training batches. Method: A framework is proposed that comprises dynamic clustering prototype contrastive learning, global and local feature extraction mechanisms, an adaptive aggregation network, a graph neural network, and momentum contrastive learning. Result: Experiments show that AAHR outperforms existing state-of-the-art methods on the Flickr30K, MSCOCO, and ECCV Caption datasets, significantly improving the accuracy and efficiency of image-text matching. Conclusion: AAHR effectively addresses the soft positive sample problem and semantic ambiguity in image-text matching, improving the model's discriminative ability. Abstract: Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model's ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype contrastive learning, effectively mitigating the soft positive sample problem. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Additionally, AAHR employs intra-modal and inter-modal correlation matrices to investigate neighborhood relationships among sample instances thoroughly. It incorporates GNN to enhance semantic interactions between instances. Furthermore, AAHR integrates momentum contrastive learning to expand the negative sample set. These combined strategies significantly improve the model's ability to discriminate between features. 
Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on Flickr30K, MSCOCO, and ECCV Caption datasets, considerably improving the accuracy and efficiency of image-text matching. The code and model checkpoints for this research are available at https://github.com/Image-Text-Matching/AAHR .

[100] SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation

JianHe Low,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

TL;DR: This paper proposes an efficient sign language translation framework that uses segment-aware visual tokenization and a cross-modal alignment strategy to improve performance while reducing computational demands.

Details Motivation: Existing gloss-free sign language translation methods typically come with increased model complexity and higher computational demands, which limits scalability, so a more efficient solution is needed. Method: A visual tokenization method based on sign segmentation and a cross-modal contrastive alignment objective are introduced, together with a dual-level supervision strategy. Result: The approach notably exceeds the performance of existing methods on the PHOENIX14T benchmark while shortening the input sequence and lowering memory usage. Conclusion: The paper proposes a segment-aware visual tokenization framework for sign language translation that does not rely on gloss annotations, and experiments confirm that it outperforms existing methods. Abstract: Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence-lengths, validating the potential of our tokenization and alignment strategies.
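A contrastive alignment objective of the kind mentioned above is often written as an InfoNCE loss. The sketch below shows that generic form only, under loud assumptions: visual token i is treated as the positive match for text token i, similarity is cosine, and the temperature of 0.07 is a conventional default. The function names are hypothetical, and the paper's actual token pairing and dual-level supervision are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(visual_tokens, text_tokens, temperature=0.07):
    """One-directional InfoNCE: for each visual token, the text token at
    the same index is the positive and all others are negatives."""
    losses = []
    for i, visual in enumerate(visual_tokens):
        logits = [cosine_similarity(visual, text) / temperature for text in text_tokens]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-dimensional embeddings: aligned pairs give a loss near
# zero, swapped pairs give a much larger loss.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(visual, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(visual, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each visual token toward its paired text token and pushes it away from the others, which is the sense in which such an objective bridges the visual and linguistic modalities.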

[101] Cross Knowledge Distillation between Artificial and Spiking Neural Networks

Shuhan Ye,Yuanbin Qian,Chong Wang,Sunqi Lin,Jiazhen Xu,Jiangbo Qian,Yuqi Li

Main category: cs.CV

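TL;DR: This paper proposes a new method called cross knowledge distillation (CKD) that effectively improves the performance of spiking neural networks (SNNs) on dynamic vision sensor (DVS) data.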

Details Motivation: Spiking neural networks (SNNs) are attractive for computer vision because of their high biological plausibility, event-driven nature, and energy efficiency, but limited annotated event-based datasets and immature architecture designs leave their performance behind that of artificial neural networks (ANNs). A method is therefore needed to improve SNN performance on DVS data. Method: Uses semantic similarity and sliding replacement to mitigate the cross-modality challenge, and indirect phased knowledge distillation to mitigate the cross-architecture challenge. Result: Experiments show that the proposed CKD method outperforms current state-of-the-art methods on mainstream neuromorphic datasets such as N-Caltech101 and CEP-DVS. Conclusion: The paper proposes cross knowledge distillation (CKD) to improve the performance of SNNs on DVS data and validates its advantage on several mainstream neuromorphic datasets. Abstract: Recently, Spiking Neural Networks (SNNs) have demonstrated rich potential in computer vision domain due to their high biological plausibility, event-driven characteristic and energy-saving efficiency. Still, limited annotated event-based datasets and immature SNN architectures result in their performance inferior to that of Artificial Neural Networks (ANNs). To enhance the performance of SNNs on their optimal data format, DVS data, we explore using RGB data and well-performing ANNs to implement knowledge distillation. In this case, solving cross-modality and cross-architecture challenges is necessary. In this paper, we propose cross knowledge distillation (CKD), which not only leverages semantic similarity and sliding replacement to mitigate the cross-modality challenge, but also uses an indirect phased knowledge distillation to mitigate the cross-architecture challenge. We validated our method on main-stream neuromorphic datasets, including N-Caltech101 and CEP-DVS. The experimental results show that our method outperforms current State-of-the-Art methods. The code will be available at https://github.com/ShawnYE618/CKD

[102] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz,Elizabeth Laura Janes,Xing Shen,Tal Arbel

Main category: cs.CV

TL;DR: This paper proposes Prompt4Trust, an RL-based prompt augmentation method to calibrate confidence in MLLMs for healthcare applications, improving both reliability and performance.

Details Motivation: MLLMs have great potential in healthcare but face two critical limitations: sensitivity to prompt design and overconfidence in incorrect predictions. This work addresses these issues to ensure reliable and safe clinical decision-making. Method: The authors introduced Prompt4Trust, a reinforcement learning framework for prompt augmentation that trains a lightweight LLM to generate auxiliary prompts. These prompts guide a downstream MLLM to produce responses where confidence better aligns with predictive accuracy. Result: Prompt4Trust improved both confidence calibration and task accuracy, achieving state-of-the-art results on the PMC-VQA benchmark. It also showed zero-shot generalization to larger MLLMs, indicating scalability without additional computational costs. Conclusion: Prompt4Trust demonstrates the potential of automated prompt engineering to improve the trustworthiness of MLLMs in safety-critical settings, particularly by aligning model confidence with accuracy and enhancing task performance. Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. 
Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/vccrl-llm.
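
To make "confidence aligned with accuracy" concrete, the sketch below computes the expected calibration error (ECE), a conventional calibration metric. This is not the clinically motivated objective that Prompt4Trust optimizes, which the abstract does not spell out; the function name, the equal-width binning, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Conventional ECE: weighted mean gap between confidence and accuracy per bin.

    confidences: stated confidences in [0, 1], one per answer (hypothetical inputs).
    correct: booleans, True where the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers, all stated at 90% confidence, three of them correct.
gap = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A model that says 90% and is right 75% of the time is overconfident by 0.15 in that bin, which is the kind of gap a calibration method aims to shrink.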

[103] Generative Latent Kernel Modeling for Blind Motion Deblurring

Chenhao Ding,Jiangtao Zhang,Zongsheng Yue,Hui Wang,Qian Zhao,Deyu Meng

Main category: cs.CV

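TL;DR: This work introduces a GAN-based kernel generator and a kernel initializer that effectively address the sensitivity of blind motion deblurring to the initial blur kernel, achieving leading performance.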

Details Motivation: To address the extreme sensitivity to the initial blur kernel that traditional deep prior-based methods exhibit in BMD because the underlying optimization process is highly non-convex. Method: Pre-trains a kernel generator based on a generative adversarial network (GAN) and a kernel initializer, and combines them with existing BMD methods in a plug-and-play manner. Result: The proposed method achieves state-of-the-art blind non-uniform motion deblurring without additional priors, validated on challenging benchmark datasets. Conclusion: The paper proposes a new blind motion deblurring (BMD) framework that uses a deep generative model to encode the kernel prior and provide a better blur kernel initialization. Abstract: Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel's prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at https://github.com/dch0319/GLKM-Deblur.

[104] Supercharging Floorplan Localization with Semantic Rays

Yuval Grader,Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: This paper proposes a semantic-aware localization framework for floorplans that improves accuracy by integrating both depth and semantic information, outperforming current methods and allowing for the use of metadata for further enhancements.

Details Motivation: Current floorplan localization techniques focus on depth-based structural cues while neglecting rich semantic information. The authors aim to improve localization accuracy by incorporating semantics into the localization process. Method: The authors introduced a framework that jointly estimates depth and semantic rays to create a structural-semantic probability volume. This volume is built in a coarse-to-fine manner by initially sampling a small set of rays and then refining probabilities through denser sampling in high-probability regions to predict 2D location and orientation angle. Result: The proposed approach substantially outperforms existing methods on two standard benchmarks, particularly in recall metrics. It also demonstrates the ability to utilize additional metadata like room labels for further accuracy and efficiency improvements. Conclusion: The paper concludes that their semantic-aware localization framework outperforms state-of-the-art methods in floorplan localization, with significant improvements in recall metrics and the ability to incorporate additional metadata for better performance. Abstract: Floorplans provide a compact representation of the building's structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. 
We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we show that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency.
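
The coarse-to-fine idea described above (sample sparsely, then sample densely only where the coarse scores are high) can be sketched on a toy one-dimensional search. The paper builds a probability volume over 2D location and orientation from depth and semantic rays; the scalar `score` function, the interval, and all parameter names here are hypothetical stand-ins.

```python
def coarse_to_fine_argmax(score, lo, hi, coarse_steps=8, fine_steps=8, top_k=2):
    """Find a high-scoring point in [lo, hi] with a coarse pass and a local fine pass."""
    step = (hi - lo) / coarse_steps
    # Coarse pass: evaluate the score once at the centre of each coarse cell.
    centres = [lo + step * (i + 0.5) for i in range(coarse_steps)]
    best_cells = sorted(centres, key=score, reverse=True)[:top_k]
    # Fine pass: resample densely, but only inside the top-k coarse cells.
    fine_step = step / fine_steps
    candidates = []
    for centre in best_cells:
        start = centre - step / 2
        candidates.extend(start + fine_step * (j + 0.5) for j in range(fine_steps))
    return max(candidates, key=score)


# Toy score peaked at x = 3.3 on the interval [0, 8].
best = coarse_to_fine_argmax(lambda x: -(x - 3.3) ** 2, 0.0, 8.0)
```

With these settings the score is evaluated 8 + 2 * 8 = 24 times rather than the 64 a uniformly fine grid would need, which is the efficiency argument behind refining only high-probability regions.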

[105] Geo-RepNet: Geometry-Aware Representation Learning for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Rui Tang,Haochen Yin,Guankun Wang,Long Bai,An Wang,Huxin Gao,Jiazheng Wang,Hongliang Ren

Main category: cs.CV

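TL;DR: This paper proposes Geo-RepNet, a geometry-aware convolutional framework that combines depth information with RGB images for surgical phase recognition and achieves state-of-the-art performance.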

Details Motivation: To address the high visual similarity across phases and the lack of structural cues in RGB images by using depth information, which provides valuable geometric cues. Method: Builds the Geo-RepNet model on the fusion of depth information and RGB images, including the DGPG and GEMA modules. Result: Extensive experiments on the proposed nine-phase ESD dataset validate the effectiveness of Geo-RepNet. Conclusion: Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency in complex, low-texture surgical environments. Abstract: Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across different phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers valuable geometric cues that can complement appearance features by providing insights into spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB image and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates the Depth-Guided Geometric Prior Generation (DGPG) module that extracts geometry priors from raw depth maps, and the Geometry-Enhanced Multi-scale Attention (GEMA) to inject spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency under complex and low-texture surgical environments.

[106] ViT-ProtoNet for Few-Shot Image Classification: A Multi-Benchmark Evaluation

Abdulvahap Mutlu,Şengül Doğan,Türker Tuncer

Main category: cs.CV

TL;DR: ViT-ProtoNet enhances few-shot image classification by integrating Vision Transformers into a prototypical network framework, delivering superior performance and robustness.

Details Motivation: Despite the strong representational capabilities of Vision Transformers (ViTs), their potential remains underutilized in few-shot image classification tasks. This work aims to bridge that gap. Method: The study introduces ViT-ProtoNet, which combines the Vision Transformer (ViT-Small backbone) with the Prototypical Network framework. Class prototypes are generated by averaging class conditional token embeddings from limited support examples. Result: ViT-ProtoNet consistently surpasses CNN-based prototypical networks across four benchmarks (Mini-ImageNet, FC100, CUB-200, and CIFAR-FS), achieving up to a 3.2% improvement in 5-shot accuracy. It also demonstrates better feature separability in latent space and performs competitively against other transformer-based approaches. Conclusion: ViT-ProtoNet proves to be a powerful and flexible approach for few-shot classification, outperforming CNN-based models and setting a new benchmark for transformer-based meta-learners. Abstract: The remarkable representational power of Vision Transformers (ViTs) remains underutilized in few-shot image classification. In this work, we introduce ViT-ProtoNet, which integrates a ViT-Small backbone into the Prototypical Network framework. By averaging class conditional token embeddings from a handful of support examples, ViT-ProtoNet constructs robust prototypes that generalize to novel categories under 5-shot settings. We conduct an extensive empirical evaluation on four standard benchmarks: Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, including overlapped support variants to assess robustness. Across all splits, ViT-ProtoNet consistently outperforms CNN-based prototypical counterparts, achieving up to a 3.2% improvement in 5-shot accuracy and demonstrating superior feature separability in latent space. Furthermore, it outperforms or is competitive with transformer-based competitors using a more lightweight backbone. 
Comprehensive ablations examine the impact of transformer depth, patch size, and fine-tuning strategy. To foster reproducibility, we release code and pretrained weights. Our results establish ViT-ProtoNet as a powerful, flexible approach for few-shot classification and set a new baseline for transformer-based meta-learners.
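
The prototype step described above (average the support embeddings of each class, then assign a query to the nearest prototype) can be sketched as follows. In ViT-ProtoNet the embeddings would come from the ViT-Small backbone; here they are toy 2-D vectors, and the function names and the choice of Euclidean distance are assumptions for illustration rather than details taken from the released code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(embeddings) for label, embeddings in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy 2-shot episode with hand-written embeddings (a real episode would use 5 shots).
support = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = build_prototypes(support)
label = classify([1.0, 1.0], prototypes)
```

Because the classifier has no per-class parameters beyond the averaged embeddings, new categories can be added at test time simply by supplying a few support examples.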

[107] DAA*: Deep Angular A Star for Image-based Path Planning

Zhiwei Xu

Main category: cs.CV

TL;DR: This paper introduces deep angular A* (DAA*) for path imitation learning, improving path similarity and optimality through adaptive path smoothness control, showing significant performance gains over existing methods.

Details Motivation: Path smoothness is often overlooked in path imitation learning. The motivation is to enhance path similarity by adaptively adjusting path smoothness using move angles in path node expansion. Method: The paper proposes deep angular A* (DAA*), incorporating the path angular freedom (PAF) into A*, optimizing path shortening and smoothing to align predicted paths with reference paths. Result: DAA* outperforms existing methods like neural A* and TransPath, achieving improvements of 9.0% SPR, 6.9% ASIM, and 3.9% PSIM over neural A*, and 6.7% SPR, 6.5% PSIM, and 3.7% ASIM over TransPath. Conclusion: The paper concludes that DAA* significantly improves path similarity and optimality in path imitation learning, demonstrating its effectiveness through comprehensive evaluations on multiple datasets. Abstract: Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. 
Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.7% SPR, 6.5% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable.

[108] AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning

Zile Wang,Hao Yu,Jiabo Zhan,Chun Yuan

Main category: cs.CV

TL;DR: This paper proposes ALPHA and ALPHAVAE to address problems in RGBA image generation.

Details Motivation: Because large-scale benchmarks are lacking, the generation of transparent or layered content (RGBA images) remains largely unexplored. Method: Introduces ALPHAVAE, a unified end-to-end RGBA VAE that incorporates a dedicated alpha channel and is trained with a composite objective. Result: Compared with LayerDiffuse, PSNR improves by 4.9 dB and SSIM by 3.2% while training on only 8K images. Conclusion: ALPHAVAE performs well on RGBA image generation and advances transparent image generation. Abstract: Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on https://github.com/o0o0o00o0/AlphaVAE for reproducibility.
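
The benchmark idea of adapting an RGB metric to RGBA by alpha blending over fixed backgrounds can be sketched as below. The standard compositing rule `alpha * foreground + (1 - alpha) * background` is well established; the particular background set (black and white), the averaging across backgrounds, and all function names are assumptions, since the abstract does not list which canonical backgrounds ALPHA uses.

```python
import math


def alpha_blend(rgba, background):
    """Composite one RGBA pixel (values in [0, 1]) over an RGB background colour."""
    r, g, b, a = rgba
    return tuple(a * c + (1.0 - a) * bg for c, bg in zip((r, g, b), background))


def psnr(pixels_a, pixels_b, max_value=1.0):
    """Peak signal-to-noise ratio between two equally sized lists of RGB pixels."""
    squared = [(x - y) ** 2 for pa, pb in zip(pixels_a, pixels_b) for x, y in zip(pa, pb)]
    mse = sum(squared) / len(squared)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical background set; the benchmark's actual choice may differ.
BACKGROUNDS = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]


def blended_psnr(pred_rgba, target_rgba, backgrounds=BACKGROUNDS):
    """Average PSNR of prediction vs. target after compositing both over each background."""
    scores = []
    for bg in backgrounds:
        pred = [alpha_blend(p, bg) for p in pred_rgba]
        target = [alpha_blend(p, bg) for p in target_rgba]
        scores.append(psnr(pred, target))
    return sum(scores) / len(scores)


# One fully opaque red pixel, reconstructed slightly too dark.
score = blended_psnr([(0.9, 0.0, 0.0, 1.0)], [(1.0, 0.0, 0.0, 1.0)])
```

Compositing before scoring means an error in the alpha channel shows up as a visible colour error on at least one background, so a plain RGB metric can still penalize it.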

[109] ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

Yueqian Wang,Xiaojun Meng,Yifan Wang,Huishuai Zhang,Dongyan Zhao

Main category: cs.CV

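TL;DR: This paper proposes PAUC, a new metric for evaluating the proactive interaction ability of multimodal dialogue systems, and demonstrates its advantages over traditional methods.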

Details Motivation: As multimodal dialogue systems advance, users expect them to interact proactively, and traditional evaluation methods cannot meet this need. Method: Proposes the ProactiveBench benchmark and the PAUC metric, and validates them through tests of baseline systems and a user study. Result: PAUC evaluates system performance in proactive interaction scenarios better than traditional metrics and agrees more closely with human preferences. Conclusion: PAUC is a more accurate metric for assessing user experience in proactive interaction scenarios and reflects human preferences better than traditional methods. Abstract: With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to be more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveBench, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveBench and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveBench

[110] Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

Kaixuan Cong,Yifan Wang,Rongkun Xue,Yuyang Jiang,Yiming Feng,Jing Yang

Main category: cs.CV

TL;DR: This paper proposes DICCAE, an audio-video pre-training encoder that enhances the model's ability to distinguish between similar activities by dynamically addressing inter-class confusion, achieving strong results on the VGGSound dataset.

Details Motivation: Existing audio-video pre-training paradigms focus only on overall modality alignment without reinforcing the distinction of easily confused classes. This work aims to enhance models' ability to distinguish between similar activities using cognitive induction and contrast. Method: The paper proposes DICCAE, which aligns audio-video representations at a fine-grained, category-level by dynamically adjusting confusion loss based on inter-class confusion degrees. It also introduces a new training framework incorporating audio and video modalities and their fusion, along with a cluster-guided self-supervised pre-training strategy. Result: DICCAE achieves a top-1 accuracy of 65.5% on the VGGSound dataset and extensive ablation studies validate the necessity of each module. Conclusion: The paper concludes that DICCAE, a novel encoder for audio-video pre-training, achieves near state-of-the-art performance on the VGGSound dataset and demonstrates the necessity of its modules through ablation studies. Abstract: Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms only focus on the alignment of the overall audio-video modalities, without considering the reinforcement of distinguishing easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained, category-level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. 
To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.

[111] Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding

Wencan Huang,Daizong Liu,Wei Hu

Main category: cs.CV

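TL;DR: Fast3D is a visual token pruning framework for accelerating inference of 3D multi-modal large language models (MLLMs); it uses global attention prediction and sample-adaptive pruning to improve computational efficiency without modifying the target model's parameters.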

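Details Motivation: 3D MLLMs face a key bottleneck of computational inefficiency in practical deployment, mainly from the overhead of processing a large number of object-centric visual tokens. Although visual token pruning has shown speedups in 2D MLLMs, its applicability to 3D remains underexplored because of differences in token structure. This paper aims to reveal the nature of visual token redundancy in 3D scenes and to propose an efficient pruning method. Method: Builds on two findings: (1) significant redundancy exists in object-level 3D token representations; (2) global attention patterns have strong predictive power for identifying non-essential tokens in 3D contexts. The authors propose the Fast3D framework with two technical innovations: (1) Global Attention Prediction (GAP), which uses a lightweight neural network to predict the target model's global attention distribution and thereby estimate token importance efficiently; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment and automatically adjusts layer-wise pruning ratios. Result: Extensive evaluations on five benchmarks validate the effectiveness of Fast3D, which maintains good performance even at high visual token pruning ratios. Conclusion: Fast3D provides a plug-and-play solution that effectively addresses the computational efficiency problem of 3D MLLMs and offers important technical support for deploying future 3D vision-language models. Abstract: While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model.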
Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D
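
The core pruning operation (keep only the visual tokens with the highest estimated importance) can be sketched in a few lines. In Fast3D the importance scores would be predicted by the GAP network and the budget chosen per sample by SAP; here both are passed in as plain hypothetical inputs, and the function name is an assumption.

```python
import math


def prune_tokens(tokens, importance, keep_ratio):
    """Keep the most important tokens, preserving their original order.

    tokens: sequence of visual tokens (any objects).
    importance: one score per token; in Fast3D this would be predicted attention.
    keep_ratio: fraction of tokens to keep, in (0, 1]; stands in for the SAP budget.
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by score, then restore sequence order for the survivors.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


# Four object tokens with made-up importance scores; keep half of them.
kept = prune_tokens(["chair", "table", "lamp", "sofa"], [0.1, 0.6, 0.05, 0.25], 0.5)
```

Since the selection happens on the token list before it reaches the language model, a step like this needs no change to the target model's weights, consistent with the plug-and-play claim.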

[112] Simplifying Traffic Anomaly Detection with Video Foundation Models

Svetlana Orlova,Tommie Kerssies,Brunó B. Englert,Gijs Dubbelman

Main category: cs.CV

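TL;DR: This paper proposes a simple encoder-only approach based on pre-trained Video Vision Transformers that achieves strong performance in traffic anomaly detection while reducing architectural complexity.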

Details Motivation: Existing ego-centric traffic anomaly detection (TAD) methods often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether this complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Method: The study adopts a simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and investigates how pre-training enables strong TAD performance. Result: The study finds that: (i) strong pre-training enables simple encoder-only models to match or even surpass specialized state-of-the-art TAD methods while being more efficient; (ii) although weakly and fully supervised pre-training are advantageous on standard benchmarks, they are less effective for TAD, whereas self-supervised Masked Video Modeling (MVM) provides the strongest signal; (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance without requiring anomalous examples. Conclusion: The study highlights the importance of pre-training and shows that efficient, scalable TAD models can be built with minimal architectural complexity. Abstract: Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.

[113] Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture

Sourish Suri,Yifei Shao

Main category: cs.CV

TL;DR: This study proposes a CNN-based image classification system for automated detection and classification of crop diseases using leaf imagery.

Details Motivation: Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. Method: The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras' Sequential API. Result: The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Conclusion: This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. Abstract: Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras' Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. 
Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.

[114] GreenCrossingAI: A Camera Trap/Computer Vision Pipeline for Environmental Science Research Groups

Bernie Boscoe,Shawn Johnson,Andrea Osborn,Chandler Campbell,Karen Mager

Main category: cs.CV

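TL;DR: This paper presents a low-resource pipeline for processing camera trap data, aimed at small research groups with limited resources and computational expertise, that integrates tailored ML/AI capabilities.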

Details Motivation: Although tools and techniques for field data collection have advanced, developing, processing, and managing the data (especially adopting ML/AI tools) remains challenging. The challenges include large data volumes, the need for accurate labeling, variable environmental conditions that affect data quality, and integration with existing workflows. Method: The paper provides a practical solution for data transmission, inference, and evaluation to help researchers discover meaningful insights from their growing camera trap datasets. Result: By focusing on practical solutions, the pipeline offers accessible approaches for data transmission, inference, and evaluation, enabling researchers to discover meaningful insights from their growing camera trap datasets. Conclusion: The paper proposes a low-resource pipeline suited to small research teams with limited resources for processing field camera trap data, integrating tailored ML/AI capabilities. Abstract: Camera traps have long been used by wildlife researchers to monitor and study animal behavior, population dynamics, habitat use, and species diversity in a non-invasive and efficient manner. While data collection from the field has increased with new tools and capabilities, methods to develop, process, and manage the data, especially the adoption of ML/AI tools, remain challenging. These challenges include the sheer volume of data generated, the need for accurate labeling and annotation, variability in environmental conditions affecting data quality, and the integration of ML/AI tools into existing workflows that often require domain-specific customization and computational resources. This paper provides a guide to a low-resource pipeline to process camera trap data on-premise, incorporating ML/AI capabilities tailored for small research groups with limited resources and computational expertise. By focusing on practical solutions, the pipeline offers accessible approaches for data transmission, inference, and evaluation, enabling researchers to discover meaningful insights from their ever-increasing camera trap datasets.

[115] Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data

Timothy Chase Jr,Karthik Dantu

Main category: cs.CV

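TL;DR: This paper proposes a new real-time landmark tracking system for autonomous spaceflight applications that uses lightweight neural network architectures and improved domain adaptation methods to improve the detection and description of terrain features.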

Details Motivation: Traditional photoclinometry-based pipelines rely on extensive a priori imaging and offline processing, are constrained by the computational limits of radiation-hardened systems, and typically increase mission cost and duration while offering low processing rates and limited generalization. Method: Using lightweight, computationally efficient neural network architectures, the paper proposes improved domain adaptation methods and a new attention alignment formulation. Result: The new method demonstrates superior performance compared to existing state-of-the-art techniques and maintains correspondence despite significant landmark viewpoint variations. Conclusion: The paper presents a new in-situ landmark tracking approach, via detection and description, that achieves superior performance and is suited to real-time execution on current-generation spacecraft flight processors. Abstract: The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations.
Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.

[116] Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions

Yuanhong Zheng,Ruixuan Yu,Jian Sun

Main category: cs.CV

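TL;DR: This paper proposes an efficient multi-person motion prediction model that reduces computational cost through a dual-branch structure and a cross-level interaction block, and performs strongly on multiple datasets.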

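Details Motivation: 3D multi-person motion prediction is highly complex, mainly because it depends on both each individual's past movements and the interactions between individuals. Modeling these interactions effectively also tends to incur substantial computational cost. Method: The authors design lightweight dual branches that separately learn local and global representations for individuals and groups, and introduce a new cross-level interaction block to integrate the spatial and temporal representations from both branches. They also explicitly incorporate an inter-person spatial distance embedding to further enhance interaction modeling. Result: The model achieves state-of-the-art performance on multiple metrics across standard datasets including CMU-Mocap, MuPoTS-3D, and 3DPW, while significantly reducing computational cost. Conclusion: The paper proposes a computationally efficient multi-person motion prediction model that, by simplifying spatial and temporal interactions, achieves state-of-the-art performance on several standard datasets while significantly reducing computational cost. Abstract: 3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate the spatial inter-person distance embedding. With the above efficient temporal and spatial design, we achieve state-of-the-art performance for multiple metrics on standard datasets of CMU-Mocap, MuPoTS-3D, and 3DPW, while significantly reducing the computational cost. Code is available at https://github.com/Yuanhong-Zheng/EMPMP.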

[117] SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation

Zhihan Kang,Boyu Wang

Main category: cs.CV

TL;DR: SegVec3D is a novel, minimally supervised framework for 3D point cloud instance segmentation that integrates attention mechanisms, embedding learning, and cross-modal alignment to enable unsupervised instance segmentation and zero-shot retrieval.

Details Motivation: To develop a minimally supervised and practically deployable framework that unifies instance segmentation and multimodal understanding. Method: Building a hierarchical feature extractor to enhance geometric structure modeling and enabling unsupervised instance segmentation through contrastive clustering while aligning 3D data with natural language queries in a shared semantic space. Result: SegVec3D outperforms recent methods like Mask3D and ULIP by offering zero-shot retrieval and integrating multimodal understanding with minimal supervision. Conclusion: SegVec3D provides a new method for 3D point cloud instance segmentation that combines attention mechanisms, embedding learning, and cross-modal alignment. Abstract: We propose SegVec3D, a novel framework for 3D point cloud instance segmentation that integrates attention mechanisms, embedding learning, and cross-modal alignment. The approach builds a hierarchical feature extractor to enhance geometric structure modeling and enables unsupervised instance segmentation via contrastive clustering. It further aligns 3D data with natural language queries in a shared semantic space, supporting zero-shot retrieval. Compared to recent methods like Mask3D and ULIP, our method uniquely unifies instance segmentation and multimodal understanding with minimal supervision and practical deployability.

[118] CKAA: Cross-subspace Knowledge Alignment and Aggregation for Robust Continual Learning

Lingfeng He,De Cheng,Zhiheng Ma,Huaijie Wang,Dingwen Zhang,Nannan Wang,Xinbo Gao

Main category: cs.CV

TL;DR: The paper proposes CKAA, a new framework for continual learning that improves model robustness against misleading task identifiers by aligning knowledge across subspaces and adaptively aggregating task-specific knowledge.

Details Motivation: To address the issue of ambiguous decisions under misleading task-ids due to feature subspace misalignment from independently trained sub-modules in parameter-efficient fine-tuning-based continual learning methods. Method: Cross-subspace Knowledge Alignment and Aggregation (CKAA), including Dual-level Knowledge Alignment (DKA) and Task-Confidence-guided Mixture of Adapters (TC-MoA). Result: Extensive experiments demonstrate that CKAA outperforms existing PEFT-based CL methods. Conclusion: CKAA is a novel framework that enhances model robustness against misleading task-ids through Dual-level Knowledge Alignment and Task-Confidence-guided Mixture of Adapters, outperforming existing PEFT-based CL methods. Abstract: Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recently, parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. They typically allocate a unique sub-module for learning each task, with a task recognizer to select the appropriate sub-modules for testing images. However, due to the feature subspace misalignment from independently trained sub-modules, these methods tend to produce ambiguous decisions under misleading task-ids. To address this, we propose Cross-subspace Knowledge Alignment and Aggregation (CKAA), a novel framework that enhances model robustness against misleading task-ids through two key innovations: (1) Dual-level Knowledge Alignment (DKA): By aligning intra-class feature distributions across different subspaces and learning a robust global classifier through a feature simulation process, DKA enables the model to distinguish features from both correct and incorrect subspaces during training. 
(2) Task-Confidence-guided Mixture of Adapters (TC-MoA): A robust inference scheme that adaptively aggregates task-specific knowledge from relevant sub-modules based on task-confidence scores, avoiding overconfidence in misleading task-id predictions. Extensive experiments demonstrate that CKAA outperforms existing PEFT-based CL methods.
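
The confidence-guided aggregation described for TC-MoA can be illustrated with a small sketch. This is not the authors' implementation: the softmax weighting, the `temperature` parameter, and the function names `softmax` and `aggregate_adapters` are assumptions chosen only to show how outputs from several task-specific sub-modules might be blended instead of trusting a single predicted task-id.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw task scores into confidence weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(adapter_outputs, task_scores, temperature=1.0):
    """Blend per-task adapter outputs using task-confidence weights.

    adapter_outputs: one feature vector per task-specific adapter.
    task_scores: one raw score per task from a (hypothetical) task recognizer.
    """
    weights = softmax(task_scores, temperature)
    dim = len(adapter_outputs[0])
    blended = [0.0] * dim
    for weight, output in zip(weights, adapter_outputs):
        for i in range(dim):
            blended[i] += weight * output[i]
    return weights, blended


# Two tasks score almost equally, so neither adapter dominates the result.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 1.9, -3.0]
weights, blended = aggregate_adapters(outputs, scores)
```

When the recognizer is nearly undecided between two tasks, as in the example, both adapters contribute almost equally, which is the behaviour the abstract describes as avoiding overconfidence in a misleading task-id prediction.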

[119] HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space

Changli Wang,Fang Yin,Jiafeng Liu,Rui Wu

Main category: cs.CV

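TL;DR: This paper proposes HMID-Net, a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation in hyperbolic space to better capture and leverage the visual-semantic hierarchy, and performs strongly on a range of downstream tasks.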

Details Motivation: Although MERU successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, how to train a model more efficiently to capture and leverage this hierarchy remains a key question. Method: The paper proposes HMID-Net, which combines Masked Image Modeling (MIM) and knowledge distillation in hyperbolic space, and introduces a specially designed distillation loss function to facilitate effective knowledge transfer. Result: Experiments show that MIM and knowledge distillation in hyperbolic space can achieve the same remarkable success as in Euclidean space, and extensive evaluations show that the method significantly outperforms existing models such as MERU and CLIP on downstream tasks including image classification and retrieval. Conclusion: By combining MIM and knowledge distillation, HMID-Net successfully exploits hierarchical structure in hyperbolic space and significantly outperforms existing models on downstream tasks. Abstract: Visual and semantic concepts are often structured in a hierarchical manner. For instance, the textual concept 'cat' entails all images of cats. A recent study, MERU, successfully adapts multimodal learning techniques from Euclidean space to hyperbolic space, effectively capturing the visual-semantic hierarchy. However, a critical question remains: how can we more efficiently train a model to capture and leverage this hierarchy? In this paper, we propose the Hyperbolic Masked Image and Distillation Network (HMID-Net), a novel and efficient method that integrates Masked Image Modeling (MIM) and knowledge distillation techniques within hyperbolic space. To the best of our knowledge, this is the first approach to leverage MIM and knowledge distillation in hyperbolic space to train highly efficient models. In addition, we introduce a distillation loss function specifically designed to facilitate effective knowledge transfer in hyperbolic space. Our experiments demonstrate that MIM and knowledge distillation techniques in hyperbolic space can achieve the same remarkable success as in Euclidean space. Extensive evaluations show that our method excels across a wide range of downstream tasks, significantly outperforming existing models like MERU and CLIP in both image classification and retrieval.

[120] GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

Yiyang Zhou,Linjie Li,Shi Qiu,Zhengyuan Yang,Yuyang Zhao,Siwei Han,Yangfan He,Kangqi Li,Haonian Ji,Zihao Zhao,Haibo Tong,Lijuan Wang,Huaxiu Yao

Main category: cs.CV

TL;DR: GLIMPSE is a new benchmark designed to evaluate whether large vision-language models (LVLMs) can genuinely understand and reason with videos, rather than relying on superficial frame-level analysis. It presents complex questions requiring full video context understanding.

Details Motivation: Existing video benchmarks often resemble image-based ones, allowing models to answer questions by scanning a few key frames without deep temporal reasoning. This limits the ability to assess whether LVLMs can genuinely think with videos. Method: The GLIMPSE benchmark was introduced, consisting of 3,269 videos and over 4,342 visual-centric questions across 11 categories, crafted by human annotators. These questions require watching the entire video and reasoning over the full context. Result: In human evaluations, GLIMPSE achieved 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reached only 66.43% accuracy. Conclusion: GLIMPSE highlights that current LVLMs struggle to move beyond surface-level reasoning and truly think with videos. Abstract: Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context; this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone.
In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.

[121] SDTN and TRN: Adaptive Spectral-Spatial Feature Extraction for Hyperspectral Image Classification

Fuyin Ye,Erwen Yao,Jianyong Chen,Fengmei He,Junxiang Zhang,Lihao Ni

Main category: cs.CV

TL;DR: This paper proposes SDTN and TRN for hyperspectral image classification, improving accuracy while reducing computational complexity for efficient real-time use.

Details Motivation: Traditional hyperspectral image classification methods face challenges with high-dimensional data, spectral-spatial redundancy, and limited labeled samples, leading to suboptimal performance. This work aims to overcome these limitations. Method: The Self-Adaptive Tensor-Regularized Network (SDTN) combines tensor decomposition with regularization to dynamically adjust tensor ranks. This is followed by the Tensor-Regularized Network (TRN), which integrates features from SDTN into a lightweight network to capture spectral-spatial features at multiple scales. Result: Experiments on the PaviaU dataset show significant improvements in classification accuracy and a reduction in model parameters compared to state-of-the-art methods. Conclusion: The proposed TRN framework, built upon SDTN, achieves high classification accuracy while significantly reducing computational complexity, making it ideal for real-time deployment in resource-constrained environments. Abstract: Hyperspectral image classification plays a pivotal role in precision agriculture, providing accurate insights into crop health monitoring, disease detection, and soil analysis. However, traditional methods struggle with high-dimensional data, spectral-spatial redundancy, and the scarcity of labeled samples, often leading to suboptimal performance. To address these challenges, we propose the Self-Adaptive Tensor-Regularized Network (SDTN), which combines tensor decomposition with regularization mechanisms to dynamically adjust tensor ranks, ensuring optimal feature representation tailored to the complexity of the data. Building upon SDTN, we propose the Tensor-Regularized Network (TRN), which integrates the features extracted by SDTN into a lightweight network capable of capturing spectral-spatial features at multiple scales.
This approach not only maintains high classification accuracy but also significantly reduces computational complexity, making the framework highly suitable for real-time deployment in resource-constrained environments. Experiments on the PaviaU dataset demonstrate significant improvements in accuracy and reduced model parameters compared to state-of-the-art methods.

[122] Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations

Yiwen Liang,Hui Chen,Yizhe Xiong,Zihan Zhou,Mengyao Lyu,Zijia Lin,Shuaicheng Niu,Sicheng Zhao,Jungong Han,Guiguang Ding

Main category: cs.CV

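TL;DR: This paper proposes ReTA, which uses two strategies, CER and DDC, to improve the test-time adaptation of VLMs under distribution shift. It addresses unreliable entropy and inflexible decision boundaries, and experiments show that it outperforms existing methods.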

Details Motivation: VLMs perform poorly on downstream tasks under distribution shift, and existing cache-based methods are limited by unreliable entropy and inflexible decision boundaries, so a more reliable TTA method is needed. Method: The paper proposes ReTA, which comprises two strategies: CER weights entropy with consistency constraints to improve cache reliability, and DDC models text embeddings as Gaussian distributions to obtain adaptive decision boundaries. Result: Experiments show that ReTA performs strongly across a variety of distribution-shift scenarios, especially on real-world tasks. Conclusion: ReTA effectively addresses the reliability problems of VLMs under distribution shift and significantly outperforms existing methods, particularly under real-world distribution shifts. Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content.
Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under challenging real-world distribution shifts.
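
To make the idea behind consistency-aware reweighting concrete, the sketch below compares plain entropy with a consistency-penalized score. The abstract does not give the actual CER formula, so the additive penalty, the use of augmented-view predictions, and the names `consistency` and `reweighted_entropy` are all assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def consistency(view_predictions, predicted_class):
    """Fraction of augmented views that agree with the main prediction."""
    agree = sum(1 for c in view_predictions if c == predicted_class)
    return agree / len(view_predictions)


def reweighted_entropy(probs, view_predictions):
    """Hypothetical score: entropy plus a penalty for inconsistent views.

    The penalty is scaled by log(K), the maximum possible entropy for K
    classes, so a fully inconsistent sample looks as bad as a uniform one.
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    agreement = consistency(view_predictions, predicted)
    return entropy(probs) + (1.0 - agreement) * math.log(len(probs))


# Sample A: moderately confident, and every augmented view agrees.
probs_a, views_a = [0.90, 0.05, 0.05], [0, 0, 0, 0]
# Sample B: very confident, but the augmented views mostly disagree.
probs_b, views_b = [0.97, 0.02, 0.01], [0, 1, 2, 1]

score_a = reweighted_entropy(probs_a, views_a)
score_b = reweighted_entropy(probs_b, views_b)
```

Under plain entropy, sample B would be preferred for the cache because its prediction is sharper. Once the disagreement across views is taken into account, sample A receives the lower (better) score, which is the kind of reordering the abstract attributes to CER.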

[123] Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention

Pengyu Liu,Kun Li,Fei Wang,Yanyan Wei,Junhui She,Dan Guo

Main category: cs.CV

TL;DR: This paper proposes HFUT-VUT, a novel solution for online micro-gesture recognition that leverages data augmentation and spatial-temporal attention to achieve precise detection and classification, ranking first in the IJCAI 2025 MiGA Challenge.

Details Motivation: Micro-gesture Online Recognition is a challenging task requiring precise temporal localization and categorization of subtle human actions, which demands improved model capabilities for distinguishing spontaneous variations. Method: Hand-crafted data augmentation and spatial-temporal attention mechanisms are introduced to enhance micro-gesture classification and localization accuracy. Result: The method achieves an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. Conclusion: The proposed solution HFUT-VUT ranks first in the Micro-gesture Online Recognition track, showing superior performance compared to previous methods. Abstract: In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model's ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.

[124] QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models

Tien-Yu Chi,Hung-Yueh Chiang,Diana Marculescu,Kai-Chiang Wu

Main category: cs.CV

TL;DR: This paper proposes QuarterMap, a post-training activation pruning method that addresses spatial redundancy in VMamba and improves throughput without retraining.

Details Motivation: VMamba is a strong SSM-based vision backbone, but it is bottlenecked by spatial redundancy in its four-directional scan. Method: The paper proposes QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Result: On ImageNet-1K, QuarterMap achieves up to 11% speedup with less than 0.9% accuracy drop, and it yields similar gains on ADE20K segmentation. Conclusion: QuarterMap is a plug-and-play tool for deployment-time efficiency that requires no retraining, and it improves throughput while preserving accuracy across multiple medical imaging tasks. Abstract: State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
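
The prune-then-restore step described above can be sketched on a toy 2D feature map. This is an illustration under assumptions rather than the paper's code: it assumes pruning keeps every second row and column (one quarter of the activations), that the map has even height and width, and it omits the scan that would run on the pruned map; `prune_quarter` and `upsample_nearest` are hypothetical names.

```python
def prune_quarter(feature_map, stride=2):
    """Keep every `stride`-th row and column (1/4 of the map for stride 2)."""
    return [row[::stride] for row in feature_map[::stride]]


def upsample_nearest(feature_map, factor=2):
    """Restore the original size by repeating each value `factor` times
    along both axes (nearest-neighbor upsampling)."""
    restored = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            restored.append(list(wide_row))
    return restored


# A 4x4 toy activation map with values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]

pruned = prune_quarter(fmap)         # 2x2 map; the scan would operate here
restored = upsample_nearest(pruned)  # back to 4x4 for the following layers
```

The expensive scan would see only a quarter of the spatial positions, while the layers after it still receive a tensor of the original shape, which is why such a step can be applied after training without changing the rest of the network.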

[125] MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Haozhe Zhao,Zefan Cai,Shuzheng Si,Liang Chen,Jiuxiang Gu,Wen Xiao,Junjie Hu

Main category: cs.CV

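TL;DR: This paper proposes MENTOR, a new autoregressive framework for efficient multimodal image generation that improves generation quality and controllability while reducing reliance on additional modules.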

Details Motivation: Existing text-to-image models remain limited in precise visual control, in balancing multimodal inputs, and in the extensive training required for complex multimodal image generation. Method: MENTOR combines an autoregressive image generator with a two-stage training paradigm: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment; (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Result: Despite its modest model size, suboptimal base components, and limited training resources, MENTOR performs strongly on the DreamBench++ benchmark, outperforming competitive baselines, and surpasses diffusion-based methods in image reconstruction fidelity, broad task adaptability, and training efficiency. Conclusion: MENTOR is a novel autoregressive framework for efficient multimodal-conditioned tuning that achieves fine-grained, token-level alignment between multimodal inputs and image outputs. Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

[126] When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training

Yunwei Lan,Zhigao Cui,Xin Luo,Chang Liu,Nian Wang,Menglin Zhang,Yanzhao Su,Dong Liu

Main category: cs.CV

TL;DR: This paper proposes DehazeSB, an unpaired dehazing framework based on the Schrödinger Bridge and optimal transport theory, which improves the quality of dehazed images while preserving structural details.

Details Motivation: Unpaired dehazing methods using GANs have shown promise but are limited by the generator's transport mapping capability. This limitation motivates the need for a more effective framework like DehazeSB. Method: The proposed method, DehazeSB, uses the Schrödinger Bridge to bridge the distribution between hazy and clear images, utilizes detail-preserving regularization for pixel-level alignment, and incorporates prompt learning with pre-trained CLIP models to distinguish haze from clear images. Result: Experiments on multiple real-world datasets show that DehazeSB outperforms existing methods in terms of performance and effectiveness in unpaired dehazing tasks. Conclusion: DehazeSB demonstrates superiority in unpaired dehazing by leveraging the Schrödinger Bridge and optimal transport theory, resulting in high-quality image restoration. Abstract: Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment.
Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Code: https://github.com/ywxjm/DehazeSB.

[127] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models

Yongheng Zhang,Xu Liu,Ruihan Tao,Qiguang Chen,Hao Fei,Wanxiang Che,Libo Qin

Main category: cs.CV

TL;DR: This paper introduces Video-Text Interleaved CoT (ViTCoT), a novel paradigm for video reasoning, which leverages both visual and textual modalities. The proposed method outperforms traditional text-only CoT approaches in video understanding tasks.

Details Motivation: Current approaches for video reasoning primarily rely on textual information, overlooking the visual modality. Humans naturally re-examine visual content during reasoning, which inspired the introduction of a new video reasoning paradigm called Video-Text Interleaved CoT (ViTCoT). Method: Constructed Video-Text Interleaved Benchmark (ViTIB) using MLLMs for key-video selection, manually verified the benchmark, and explored the potential of ViTCoT in video understanding. Result: The study demonstrated that ViTCoT improves performance in video understanding tasks and effectively activates more neuron values in MLLMs. Conclusion: ViTCoT significantly enhances video understanding performance compared to traditional text-only CoT paradigm and activates more neuron values in MLLMs. Abstract: Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field.
Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.

[128] VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization

Son Nguyen,Giang Nguyen,Hung Dao,Thao Do,Daeyoung Kim

Main category: cs.CV

TL;DR: This paper proposes VDInstruct, a novel MLLM for Key Information Extraction (KIE) that improves document understanding by reducing redundancy through content-aware tokenization and layout modeling.

Details Motivation: Existing MLLMs perform poorly on dense documents and suffer from redundant computation due to vision tokenization methods that scale with image size. Method: The paper introduces VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction using a content-aware tokenization strategy and a three-stage training paradigm. Result: VDInstruct achieves state-of-the-art results on KIE benchmarks, reduces image tokens by roughly 3.6x, and outperforms baselines like DocOwl 1.5 by +5.5 F1 points in zero-shot evaluations. Conclusion: The study concludes that content-aware tokenization combined with explicit layout modeling significantly enhances document understanding robustness and efficiency. Abstract: Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE benchmarks, matching or exceeding the accuracy of leading approaches while reducing the number of image tokens by roughly 3.6x. In zero-shot evaluations, VDInstruct surpasses strong baselines, such as DocOwl 1.5, by +5.5 F1 points, highlighting its robustness to unseen documents.
These findings show that content-aware tokenization combined with explicit layout modeling offers a promising direction forward for document understanding. Data, source code, and model weights will be made publicly available.

[129] Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect

Tom Kouwenhoven,Kiana Shahrasbi,Tessa Verhoef

Main category: cs.CV

TL;DR: This paper investigates whether modern vision-and-language models demonstrate the bouba-kiki effect—a cognitive phenomenon where humans link certain sounds to specific shapes. Using methods modeled after human experiments, the study finds that while some models show partial preferences, they generally fall short of replicating the integrated cross-modal behavior observed in human cognition.

Details Motivation: Recent advances in multimodal models have led to questions about whether VLMs integrate cross-modal information similarly to human cognition. The bouba-kiki effect, where humans associate certain pseudowords with specific shapes, served as a test case to explore this. Method: The researchers conducted a prompt-based evaluation using probabilities as model preference and employed Grad-CAM to interpret visual attention in shape-word matching tasks for two variants of CLIP: ResNet and Vision Transformer (ViT). Result: While ResNet showed a preference for round shapes, both models overall lacked the expected associations in the bouba-kiki effect. Additionally, when compared directly with prior human data, the models' responses did not match the robust, modality-integrated behavior seen in humans. Conclusion: The study concludes that current vision-and-language models (VLMs) do not consistently exhibit the bouba-kiki effect, indicating limitations in their understanding of cross-modal concepts compared to human cognition. Abstract: Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like "bouba" with round shapes and "kiki" with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as model preference, and we use Grad-CAM as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. 
While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models' responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
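The prompt-based evaluation described above treats model probabilities as a preference between pseudowords. A minimal sketch of that idea follows; the similarity scores, the function names, and the two-label setup are illustrative assumptions, not the authors' code or data.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudoword_preference(similarities):
    """Map each pseudoword label to a probability.

    `similarities` is a dict of label -> image-text similarity score,
    as a CLIP-style model might produce for one image (hypothetical input).
    """
    labels = list(similarities)
    probs = softmax([similarities[label] for label in labels])
    return dict(zip(labels, probs))


# Hypothetical scores for a single round shape; not taken from the paper.
round_shape_scores = {"bouba": 2.0, "kiki": 1.0}
prefs = pseudoword_preference(round_shape_scores)
preferred = max(prefs, key=prefs.get)
```

A human-like bouba-kiki effect would show up as a consistently higher probability for "bouba" on round shapes and for "kiki" on jagged ones, averaged over many images and prompts.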

[130] DRPCA-Net: Make Robust PCA Great Again for Infrared Small Target Detection

Zihao Xiong,Fei Zhou,Fengyi Wu,Shuai Yuan,Maixia Fu,Zhenming Peng,Jian Yang,Yimian Dai

Main category: cs.CV

TL;DR: The paper proposes Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture with a dynamic unfolding mechanism via a lightweight hypernetwork.

Details Motivation: Many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets--an essential cue that can be explicitly modeled for both performance and efficiency gains. Method: Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Result: Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Conclusion: DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Abstract: Infrared small target detection plays a vital role in remote sensing, industrial monitoring, and various civilian applications. Despite recent progress powered by deep learning, many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets--an essential cue that can be explicitly modeled for both performance and efficiency gains. To address this, we revisit the model-based paradigm of Robust Principal Component Analysis (RPCA) and propose Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Unlike conventional deep unfolding methods that rely on static, globally learned parameters, DRPCA-Net introduces a dynamic unfolding mechanism via a lightweight hypernetwork. This design enables the model to adaptively generate iteration-wise parameters conditioned on the input scene, thereby enhancing its robustness and generalization across diverse backgrounds. 
Furthermore, we design a Dynamic Residual Group (DRG) module to better capture contextual variations within the background, leading to more accurate low-rank estimation and improved separation of small targets. Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Code is available at https://github.com/GrokCV/DRPCA-Net.
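RPCA models an image as a low-rank background plus a sparse target component, and classical iterative solvers typically update the sparse part with a soft-thresholding step (the proximal operator of the L1 norm). The sketch below shows only that generic operator; the abstract does not spell out DRPCA-Net's update rule, so the function name, the threshold value, and the sample residual are assumptions for illustration.

```python
def soft_threshold(values, tau):
    """Shrink each value toward zero by tau; values within [-tau, tau] become 0."""
    out = []
    for v in values:
        magnitude = abs(v) - tau
        if magnitude <= 0:
            out.append(0.0)
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out


# Hypothetical residual (image minus background estimate), flattened to 1-D.
residual = [0.05, -0.02, 0.9, -0.6, 0.01]

# In a static unfolding network tau would be one learned constant per iteration;
# per the abstract, DRPCA-Net instead generates such parameters from the input scene.
sparse = soft_threshold(residual, tau=0.1)
```

Small background fluctuations are zeroed out while the two large entries survive, which is the sparsity prior the abstract refers to.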

[131] FaceLLM: A Multimodal Large Language Model for Face Understanding

Hatef Otroshi Shahreza,Sébastien Marcel

Main category: cs.CV

TL;DR: This paper introduces FaceLLM, a multimodal large language model trained specifically for facial image understanding.

Details Motivation: Existing MLLMs are trained primarily on generic datasets, which limits their ability to reason about domain-specific visual cues. Method: A novel weakly supervised pipeline uses ChatGPT to generate high-quality question-answer pairs based on images from the FairFace dataset. Result: FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. Conclusion: FaceLLM demonstrates the potential of synthetic supervision via language models and sets a precedent for building trustworthy, human-centric multimodal AI systems. Abstract: Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM models are publicly available on the project page.

[132] SeqCSIST: Sequential Closely-Spaced Infrared Small Target Unmixing

Ximeng Zhai,Bohan Xu,Yaohong Chen,Hao Wang,Kehua Guo,Yimian Dai

Main category: cs.CV

TL;DR: This paper introduces a new task called Sequential CSIST Unmixing for detecting sub-pixel level targets from densely spaced infrared images, along with a new deep learning framework and an open-source dataset and toolkit.

Details Motivation: Distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in infrared images due to limitations in optical lens focal length and infrared detector resolution. Precise detection of these targets is challenging, and the lack of high-quality public datasets has restricted research progress. Method: The authors introduced a Temporal Deformable Feature Alignment (TDFA) module that enables adaptive inter-frame information aggregation. They also contributed an open-source ecosystem, including a sequential benchmark dataset named SeqCSIST and a toolkit with objective evaluation metrics and implementations of 23 relevant methods. Result: Experiments on the SeqCSIST dataset showed that the proposed method outperforms state-of-the-art approaches, with a mean Average Precision (mAP) improvement of 5.3%. Conclusion: This paper proposes a novel task called Sequential CSIST Unmixing, which aims to detect all targets in the form of sub-pixel localization from a highly dense CSIST group using a model-driven deep learning framework called Deformable Refinement Network (DeRefNet). Abstract: Due to the limitation of the optical lens focal length and the resolution of the infrared detector, distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in the infrared image. In this paper, we propose a novel task, Sequential CSIST Unmixing, namely detecting all targets in the form of sub-pixel localization from a highly dense CSIST group. However, achieving such precise detection is an extremely difficult challenge. In addition, the lack of high-quality public datasets has also restricted the research progress. To this end, firstly, we contribute an open-source ecosystem, including SeqCSIST, a sequential benchmark dataset, and a toolkit that provides objective evaluation metrics for this special task, along with the implementation of 23 relevant methods. 
Furthermore, we propose the Deformable Refinement Network (DeRefNet), a model-driven deep learning framework that introduces a Temporal Deformable Feature Alignment (TDFA) module enabling adaptive inter-frame information aggregation. To the best of our knowledge, this work is the first endeavor to address the CSIST Unmixing task within a multi-frame paradigm. Experiments on the SeqCSIST dataset demonstrate that our method outperforms the state-of-the-art approaches with mean Average Precision (mAP) metric improved by 5.3%. Our dataset and toolkit are available from https://github.com/GrokCV/SeqCSIST.

[133] Devanagari Handwritten Character Recognition using Convolutional Neural Network

Diksha Mehta,Prateek Mehta

Main category: cs.CV

TL;DR: This paper proposes a deep convolutional neural network-based technique for recognizing handwritten Devanagari characters, achieving high accuracy rates through automated processing.

Details Motivation: The motivation is to digitize handwritten Devanagari script efficiently due to its historical significance and lack of proper digitization tools, enabling better technological applications. Method: The paper employs a deep convolutional neural network approach with two layers for recognizing handwritten Devanagari characters, utilizing the Devanagari handwritten character dataset (DHCD) for training and testing. Result: The method achieves promising results with 96.36% accuracy during testing and 99.55% during training. Conclusion: The paper concludes that the proposed methodology effectively enhances the recognition rate of handwritten Devanagari characters using a deep convolutional neural network. Abstract: Handwritten character recognition is getting popular among researchers because of its possible applications in facilitating technological search engines, social media, recommender systems, etc. The Devanagari script is one of the oldest language scripts in India that does not have proper digitization tools. With the advancement of computing and technology, the task of this research is to extract handwritten Hindi characters from an image of Devanagari script with an automated approach to save time and obsolete data. In this paper, we present a technique to recognize handwritten Devanagari characters using two deep convolutional neural network layers. This work employs a methodology that is useful to enhance the recognition rate and configures a convolutional neural network for effective Devanagari handwritten text recognition (DHTR). This approach uses the Devanagari handwritten character dataset (DHCD), an open dataset with 36 classes of Devanagari characters. Each of these classes has 1700 images for training and testing purposes. This approach obtains promising results in terms of accuracy by achieving 96.36% accuracy in testing and 99.55% in training time.

[134] EHPE: A Segmented Architecture for Enhanced Hand Pose Estimation

Bolun Zheng,Xinjie Liu,Qianyu Zhang,Canjin Wang,Fangni Chen,Mingen Xu

Main category: cs.CV

TL;DR: This paper proposes EHPE, a novel two-stage architecture for 3D hand pose estimation that improves accuracy by focusing on TIP and wrist joints, achieving state-of-the-art results.

Details Motivation: Existing methods neglect the importance of TIP and wrist joints, leading to error accumulation and poor pose estimation quality. This work aims to improve accuracy by focusing on these critical joints. Method: EHPE uses a two-stage approach: (1) TIP and Wrist Joints Extraction stage (TW-stage) for initial accurate joint configuration, and (2) Prior Guided Joints Estimation stage (PG-stage) with a dual-branch interaction network to refine remaining joints. Result: Extensive experiments on two widely used benchmarks show that EHPE outperforms existing methods and achieves state-of-the-art performance. Conclusion: The proposed EHPE method achieves state-of-the-art performance in 3D hand pose estimation by addressing error accumulation issues through a two-stage architecture focusing on TIP and wrist joint prediction. Abstract: 3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. The accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of Distal Phalanx Tip (TIP) and Wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of TIP and wrist, thus alleviating the effect of error accumulation on TIP prediction and further reduce the predictive errors for all joints on this basis. 
EHPE consists of two key stages. In the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration. In the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance. Code is available at https://github.com/SereinNout/EHPE.

[135] Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources

Daniele Rege Cambrin,Lorenzo Vaiani,Giuseppe Gallipoli,Luca Cagliero,Paolo Garza

Main category: cs.CV

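TL;DR: To address the limitation that most text-to-image retrieval systems handle only RGB data, this paper proposes two frameworks, CLOSP and GeoCLOSP: the former uses text as a bridge to align unpaired optical and SAR images, and the latter integrates geographic coordinates to improve retrieval of location-dependent crisis events and rare geographic features.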

Details Motivation: Most text-to-image retrieval systems are limited to RGB data and fail to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. Method: The paper introduces CLOSP, a framework that uses text as a bridge to align unpaired optical and SAR images, and GeoCLOSP, which integrates geographic coordinates into the model. Result: Experiments show that CLOSP improves retrieval nDGC by 54% over existing models, and that GeoCLOSP creates a powerful trade-off between generality and specificity. Conclusion: Integrating diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives. Abstract: Retrieving relevant imagery from vast satellite archives is crucial for applications like disaster response and long-term climate monitoring. However, most text-to-image retrieval systems are limited to RGB data, failing to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. To bridge this gap, we introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1 SAR and Sentinel-2 multispectral images paired with structured textual annotations for land cover, land use, and crisis events harmonized from authoritative land cover systems (CORINE and Dynamic World) and crisis-specific sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining), a novel framework that uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. Our experiments show that CLOSP achieves a new state-of-the-art, improving retrieval nDGC by 54% over existing models. Additionally, we find that the unified training strategy overcomes the inherent difficulty of interpreting SAR imagery by transferring rich semantic knowledge from the optical domain with indirect interaction. Furthermore, GeoCLOSP, which integrates geographic coordinates into our framework, creates a powerful trade-off between generality and specificity: while the CLOSP excels at general semantic tasks, the GeoCLOSP becomes a specialized expert for retrieving location-dependent crisis events and rare geographic features. 
This work highlights that the integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives.
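The retrieval metric reported above is written "nDGC" in the abstract. Assuming it refers to normalized discounted cumulative gain (usually abbreviated nDCG), a minimal sketch of that metric looks as follows. The linear-gain variant is used here; an exponential-gain variant (2^rel - 1) is also common, and the abstract does not say which one the authors use. The relevance lists are invented for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` lists graded relevance in the order the system returned items;
    `k` optionally truncates both rankings to the top k results.
    """
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance grades for two retrieved rankings of the same items.
perfect_order = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

A ranking that places the most relevant images first scores 1.0, and any other ordering scores lower, so a relative gain on this metric means relevant images move toward the top of the result list.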

[136] Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges

Yidong Jiang

Main category: cs.CV

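TL;DR: This paper presents a comprehensive survey of prompt engineering techniques for the Segment Anything Model (SAM) and its variants, tracing how they developed and how widely they are applied across domains.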

Details Motivation: Although the Segment Anything Model (SAM) has revolutionized image segmentation through its innovative prompt-based approach, the critical role of prompt engineering in its success remains underexplored. This survey aims to fill that gap in the literature. Method: The paper systematically organizes and analyzes prompt engineering techniques for SAM and its variants, covering fundamental methodologies, practical applications, and key challenges. Result: The review shows that prompt engineering has evolved from simple geometric inputs to sophisticated multimodal approaches, enabling SAM's adaptation to domains such as medical imaging and remote sensing. The author also identifies unique challenges in prompt optimization and discusses promising research directions. Conclusion: Prompt engineering plays an important role in SAM and its variants, and the systematic survey traces its development from simple geometric inputs to sophisticated multimodal approaches. Abstract: The Segment Anything Model (SAM) has revolutionized image segmentation through its innovative prompt-based approach, yet the critical role of prompt engineering in its success remains underexplored. This paper presents the first comprehensive survey focusing specifically on prompt engineering techniques for SAM and its variants. We systematically organize and analyze the rapidly growing body of work in this emerging field, covering fundamental methodologies, practical applications, and key challenges. Our review reveals how prompt engineering has evolved from simple geometric inputs to sophisticated multimodal approaches, enabling SAM's adaptation across diverse domains including medical imaging and remote sensing. We identify unique challenges in prompt optimization and discuss promising research directions. This survey fills an important gap in the literature by providing a structured framework for understanding and advancing prompt engineering in foundation models for segmentation.

[137] EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Mingxian Lin,Wei Huang,Yitang Li,Chengjie Jiang,Kui Wu,Fangwei Zhong,Shengju Qian,Xin Wang,Xiaojuan Qi

Main category: cs.CV

TL;DR: This paper introduces EmbRACE-3K, a dataset for evaluating embodied reasoning in VLMs, showing that current models struggle in interactive environments but can improve through targeted training.

Details Motivation: Current vision-language models (VLMs) perform well on offline visual understanding tasks but show limitations in embodied settings requiring active interaction, spatial reasoning, and long-horizon planning. Method: The authors introduced EmbRACE-3K, a dataset for embodied reasoning tasks, and evaluated state-of-the-art models in zero-shot settings. They then applied supervised learning followed by reinforcement learning on Qwen2.5-VL-7B to assess improvement. Result: All tested models achieved success rates below 20% in zero-shot settings, indicating the difficulty of embodied reasoning tasks. Fine-tuning Qwen2.5-VL-7B using EmbRACE-3K led to significant improvements across all challenge categories. Conclusion: The study concludes that current VLMs struggle with embodied reasoning tasks in interactive environments and demonstrates that fine-tuning on EmbRACE-3K significantly improves performance. Abstract: Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. 
Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

[138] WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending

Zhe Wang,Jingbo Zhang,Tianyi Wei,Wanchao Su,Can Wang

Main category: cs.CV

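TL;DR: WordCraft is an interactive artistic typography system built on diffusion models. A regional attention mechanism and a noise blending technique enable high-quality multi-character, multilingual typography generation, and a large language model supports flexible prompt parsing, improving both the interactivity and the creative range of artistic typography.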

Details Motivation: Traditional approaches rely on manual design, while existing generative models fall short in localized editing, iterative refinement, multi-character composition, and open-ended prompt interpretation. Method: A training-free regional attention mechanism enables precise multi-region generation, and a noise blending technique supports continuous refinement; a large language model is integrated to parse and structure user prompts. Result: WordCraft generates high-quality artistic typography, supports single- and multi-character inputs, works across multiple languages, and offers diverse user-centered workflows. Conclusion: WordCraft significantly improves interactivity in artistic typography synthesis, opening up more creative possibilities for artists and designers. Abstract: Artistic typography aims to stylize input characters with visual effects that are both creative and legible. Traditional approaches rely heavily on manual design, while recent generative models, particularly diffusion-based methods, have enabled automated character stylization. However, existing solutions remain limited in interactivity, lacking support for localized edits, iterative refinement, multi-character composition, and open-ended prompt interpretation. We introduce WordCraft, an interactive artistic typography system that integrates diffusion models to address these limitations. WordCraft features a training-free regional attention mechanism for precise, multi-region generation and a noise blending technique that supports continuous refinement without compromising visual quality. To support flexible, intent-driven generation, we incorporate a large language model to parse and structure both concrete and abstract user prompts. These components allow our framework to synthesize high-quality, stylized typography across single- and multi-character inputs across multiple languages, supporting diverse user-centered workflows. Our system significantly enhances interactivity in artistic typography synthesis, opening up creative possibilities for artists and designers.

[139] Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation

Ming Yin,Fu Wang,Xujiong Ye,Yanda Meng,Zeyu Fu

Main category: cs.CV

TL;DR: This paper proposes MA-SAM2, a training-free video object segmentation method for surgical videos that improves robustness against occlusions and complex movements, achieving better performance than SAM2 on surgical datasets.

Details Motivation: The inherent limitations of SAM2's greedy selection memory design perform poorly on surgical videos due to rapid instrument movement, frequent occlusion, and complex interactions. This motivates the development of a more robust segmentation approach tailored to surgical scenarios. Method: The proposed MA-SAM2 introduces a training-free video object segmentation strategy with context-aware and occlusion-resilient memory models. It uses multi-target, single-loop, one-prompt inference to enhance tracking efficiency without additional parameters or training. Result: MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, while maintaining accuracy and efficiency without introducing additional parameters or training. Conclusion: MA-SAM2 demonstrates significant improvements over SAM2 in surgical video segmentation, particularly for complex and long videos with rapid instrument movement, frequent occlusion, and complex tissue interaction. Abstract: Surgical video segmentation is a critical task in computer-assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2's greedy selection memory design are amplified by the unique properties of surgical videos-rapid instrument movement, frequent occlusion, and complex instrument-tissue interaction-resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)-SAM2, a training-free video object segmentation strategy, featuring novel context-aware and occlusion-resilient memory models. MA-SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. 
Employing a multi-target, single-loop, one-prompt inference further enhances the efficiency of the tracking process in multi-instrument videos. Without introducing any additional parameters or requiring further training, MA-SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications.

[140] Demystifying Flux Architecture

Or Greenberg

Main category: cs.CV

TL;DR: This paper reverse-engineers FLUX.1, a leading text-to-image generation model, to provide an unofficial technical overview in the absence of official documentation.

Details Motivation: FLUX.1 is a state-of-the-art model for text-to-image generation but lacks official technical documentation, limiting its adoption in research and development. Method: Reverse-engineering directly from FLUX.1's source code. Result: An extensive reverse-engineering effort successfully demystified FLUX.1's architecture, providing insights into its design and implementation. Conclusion: This report serves as an unofficial technical guide for FLUX.1, enhancing its accessibility for future research despite the lack of official documentation. Abstract: FLUX.1 is a diffusion-based text-to-image generation model developed by Black Forest Labs, designed to achieve faithful text-image alignment while maintaining high image quality and diversity. FLUX is considered state-of-the-art in text-to-image generation, outperforming popular models such as Midjourney, DALL-E 3, Stable Diffusion 3 (SD3), and SDXL. Although publicly available as open source, the authors have not released official technical documentation detailing the model's architecture or training setup. This report summarizes an extensive reverse-engineering effort aimed at demystifying FLUX's architecture directly from its source code, to support its adoption as a backbone for future research and development. This document is an unofficial technical report and is not published or endorsed by the original developers or their affiliated institutions.

[141] Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive

You Huang,Lichao Chen,Jiayi Ji,Liujuan Cao,Shengchuan Zhang,Rongrong Ji

Main category: cs.CV

TL;DR: Inter2Former introduces the DPE, DHA, HMoE, and DLU modules to achieve efficient, high-precision interactive image segmentation on CPUs.

Details Motivation: Current methods face a trade-off between the accuracy of dense-token methods and the speed of sparse-token methods; Inter2Former aims to improve processing efficiency while maintaining high-quality segmentation. Method: Four modules are proposed to improve the efficiency and performance of interactive segmentation: Dynamic Prompt Embedding (DPE), Dynamic Hybrid Attention (DHA), Hybrid Mixture of Experts (HMoE), and Dynamic Local Upsampling (DLU). Result: Inter2Former achieves state-of-the-art performance on high-precision interactive segmentation benchmarks with high efficiency on CPU devices. Conclusion: Inter2Former introduces four key enhancements that optimize computation allocation in dense-token processing, achieving efficient, high-precision interactive segmentation on CPU devices. Abstract: Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N^2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
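The routing idea behind DHA, sending boundary tokens to full attention and the rest to a cheaper path, can be illustrated with a small sketch. The abstract does not say how boundary regions are derived from the previous mask, so the 4-neighbourhood rule, the function names, and the toy mask below are all assumptions made purely for illustration.

```python
def boundary_tokens(mask):
    """Return (row, col) positions that have a 4-neighbour with a different mask value."""
    rows, cols = len(mask), len(mask[0])
    boundary = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] != mask[r][c]:
                    boundary.add((r, c))
                    break
    return boundary


def route_tokens(mask):
    """Split token positions into a full-attention group and a linear-attention group."""
    boundary = boundary_tokens(mask)
    all_tokens = {(r, c) for r in range(len(mask)) for c in range(len(mask[0]))}
    return sorted(boundary), sorted(all_tokens - boundary)


# Hypothetical previous segmentation mask on a 4x4 token grid.
prev_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
full_attn, linear_attn = route_tokens(prev_mask)
```

On a realistic token grid the boundary band is a small fraction of all tokens, which is where the saving from routing most tokens through the O(N) path would come from; the tiny grid here is mostly boundary only because of its size.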

[142] Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score

Eman Ali,Sathira Silva,Chetan Arora,Muhammad Haris Khan

Main category: cs.CV

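TL;DR: This paper proposes FAIR, a new method that dynamically aligns localized image features with language embeddings to improve vision-language models in fine-grained unsupervised adaptation.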

Details Motivation: Existing methods either rely on fixed alignment scores that cannot capture subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. Method: The paper introduces Fine-grained Alignment and Interaction Refinement (FAIR), which dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA), and defines a Learned Alignment Score (LAS) that uses CDA as an adaptive classifier to improve pseudo-label generation. A self-training weighting mechanism is also designed to refine pseudo-labels in the presence of inter-class ambiguities. Result: FAIR achieves an overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods. Conclusion: Modeling fine-grained cross-modal interactions produces more accurate, class-discriminative pseudo-labels and substantially improves unsupervised adaptation performance. Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. Our approach, FAIR, delivers a substantial performance boost in fine-grained unsupervised adaptation, achieving a notable overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.

[143] Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection

Yilin Lu,Jianghang Lin,Linhuang Xie,Kai Zhao,Yansong Qu,Shengchuan Zhang,Liujuan Cao,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces GAA, a novel few-shot framework for generating realistic and semantically aligned anomaly image-mask pairs, effectively addressing the challenge of scarce anomaly samples in industrial manufacturing tasks.

Details Motivation: Anomaly inspection is crucial in industrial manufacturing, but existing methods are limited by the scarcity of real anomaly samples. Current anomaly synthesis approaches face challenges such as low realism, inaccurate mask alignment, and poor generalization. Method: The paper proposes GAA (Generate Aligned Anomaly), a region-guided, few-shot anomaly image-mask pair generation framework. It uses a pretrained latent diffusion model and incorporates techniques such as Localized Concept Decomposition, Adaptive Multi-Round Anomaly Clustering, region-guided mask generation, and low-quality sample filtering. Result: Extensive experiments on the MVTec AD and LOCO datasets show that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks like localization and classification. Conclusion: GAA demonstrates superior performance in anomaly synthesis quality and downstream tasks like localization and classification, offering a promising solution to the scarcity of real anomaly samples in industrial manufacturing. Abstract: Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. 
It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.

[144] Brain Stroke Detection and Classification Using CT Imaging with Transformer Models and Explainable AI

Shomukh Qari,Maha A. Thafar

Main category: cs.CV

TL;DR: This study developed an AI framework using advanced Vision Transformers and Explainable AI techniques to accurately classify stroke types from CT scans, aiming to improve early diagnosis and clinical outcomes in emergency settings.

Details Motivation: Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes. The goal was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. Method: The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants. Data augmentation techniques were applied to enhance model generalization and address class imbalance. Result: The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and baseline methods. Conclusion: This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments. Abstract: Stroke is one of the leading causes of death globally, making early and accurate diagnosis essential for improving patient outcomes, particularly in emergency settings where timely intervention is critical. CT scans are the key imaging modality because of their speed, accessibility, and cost-effectiveness. This study proposed an artificial intelligence framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from a dataset provided by the Republic of Turkey's Ministry of Health. The proposed method adopted MaxViT, a state-of-the-art Vision Transformer, as the primary deep learning model for image-based stroke classification, with additional transformer variants (vision transformer, transformer-in-transformer, and ConvNext). To enhance model generalization and address class imbalance, we applied data augmentation techniques, including synthetic image generation. 
The MaxViT model trained with augmentation achieved the best performance, reaching an accuracy and F1-score of 98.00%, outperforming all other evaluated models and the baseline methods. The primary goal of this study was to distinguish between stroke types with high accuracy while addressing crucial issues of transparency and trust in artificial intelligence models. To achieve this, Explainable Artificial Intelligence (XAI) was integrated into the framework, particularly Grad-CAM++. It provides visual explanations of the model's decisions by highlighting relevant stroke regions in the CT scans and establishing an accurate, interpretable, and clinically applicable solution for early stroke detection. This research contributed to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and enhancing access to timely and optimal stroke diagnosis in emergency departments, thereby saving more lives.

[145] Disentanglement and Assessment of Shortcuts in Ophthalmological Retinal Imaging Exams

Leonor Fernandes,Tiago Gonçalves,João Matos,Luis Filipe Nakayama,Jaime S. Cardoso

Main category: cs.CV

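TL;DR: This paper studies the fairness and performance of AI models in diabetic retinopathy prediction and explores disentangling sensitive attributes as a way to mitigate bias.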

Details Motivation: Diabetic retinopathy is a leading cause of vision loss in the working-age population; traditional screening is costly and hard to access, so scalable diagnostic solutions are needed, along with attention to their fairness and generalization. Method: Three models (ConvNeXt V2, DINOv2, and Swin V2) were evaluated on the mBRSET fundus dataset for fairness and performance in diabetic retinopathy prediction, and disentanglement was applied to mitigate bias. Result: All models achieved high DR prediction performance (up to 94% AUROC) but varied in sensitive-attribute prediction; disentanglement improved DINOv2 but reduced performance for the other models. Conclusion: The paper emphasizes the importance of fairness in medical imaging AI, notes that disentangling sensitive attributes affects models differently, and highlights the complexity of disentangling fine-grained features in retinal images. Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. While screening reduces the risk of blindness, traditional imaging is often costly and inaccessible. Artificial intelligence (AI) algorithms present a scalable diagnostic solution, but concerns regarding fairness and generalization persist. This work evaluates the fairness and performance of image-trained models in DR prediction, as well as the impact of disentanglement as a bias mitigation technique, using the diverse mBRSET fundus dataset. Three models, ConvNeXt V2, DINOv2, and Swin V2, were trained on macula images to predict DR and sensitive attributes (SAs) (e.g., age and gender/sex). Fairness was assessed between subgroups of SAs, and disentanglement was applied to reduce bias. All models achieved high DR prediction performance in diagnosing (up to 94% AUROC) and could reasonably predict age and gender/sex (91% and 77% AUROC, respectively). Fairness assessment suggests disparities, such as a 10% AUROC gap between age groups in DINOv2. Disentangling SAs from DR prediction had varying results, depending on the model selected. Disentanglement improved DINOv2 performance (2% AUROC gain), but led to performance drops in ConvNeXt V2 and Swin V2 (7% and 3%, respectively). These findings highlight the complexity of disentangling fine-grained features in fundus imaging and emphasize the importance of fairness in medical imaging AI to ensure equitable and reliable healthcare solutions.
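
The subgroup fairness assessment described above (an AUROC gap between groups of a sensitive attribute) can be illustrated with a minimal sketch. The function names, the toy labels and scores, and the "young"/"old" grouping are hypothetical and are not taken from the paper; AUROC is computed here with the standard pairwise-ranking definition.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_gap(labels, scores, groups):
    """Return per-group AUROC and the gap between the best and worst group."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = auroc([labels[i] for i in idx],
                             [scores[i] for i in idx])
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap


# Toy data (hypothetical): 4 samples per age group.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.9, 0.3]
groups = ["young"] * 4 + ["old"] * 4

per_group, gap = auroc_gap(labels, scores, groups)
print(per_group, gap)
```

In this toy example the model ranks every "young" pair correctly but misorders one of the four "old" pairs, which is the kind of disparity a subgroup AUROC gap is meant to surface.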

[146] EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR

Zhengyuan Peng,Jianqing Xu,Shen Li,Jiazhen Ji,Yuge Huang,Jingyun Zhang,Jinmin Li,Shouhong Ding,Rizen Guo,Xin Tan,Lizhuang Ma

Main category: cs.CV

TL;DR: EyeSeg is a new eye segmentation framework for AR/VR that improves gaze estimation by modeling uncertainty through Bayesian learning, effectively handling issues like motion blur, eyelid occlusion, and domain gaps.

Details Motivation: Accurate and efficient gaze estimation through AR and VR requires robust eye segmentation, which existing methods struggle with due to challenges like motion blur, eyelid occlusion, and domain gaps. Method: EyeSeg uses Bayesian uncertainty learning to model uncertainties in eye segmentation, explicitly addressing motion blur, eyelid occlusion, and domain differences between training and testing data. Result: EyeSeg outperforms existing methods in segmentation accuracy, as evidenced by improvements in MIoU, E1, F1, and ACC metrics. It also provides an uncertainty score for robust gaze estimation, enhancing performance in challenging scenarios. Conclusion: EyeSeg is a novel eye segmentation framework that effectively addresses the challenges of motion blur, eyelid occlusion, and train-test domain gaps in AR/VR environments by incorporating Bayesian uncertainty learning. This approach enhances segmentation accuracy and robustness, particularly under challenging conditions. Abstract: Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. 
Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels and empirically outperforms existing methods in downstream tasks, such as gaze estimation. EyeSeg outputs an uncertainty score and the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves to be effective especially under motion blur, eyelid occlusion and cross-domain challenges. Moreover, empirical results suggest that EyeSeg achieves segmentation improvements of MIoU, E1, F1, and ACC surpassing previous approaches. The code is publicly available at https://github.com/JethroPeng/EyeSeg.

[147] VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation

Xinyu Zhang,Zhonghao Ye,Jingwei Zhang,Xiang Tian,Zhisheng Liang,Shipeng Yu

Main category: cs.CV

TL;DR: This paper proposes VST-Pose, a deep learning framework for WiFi-based human pose estimation that offers high accuracy and privacy in smart home care scenarios.

Details Motivation: WiFi-based human pose estimation is a promising non-visual approach due to its penetrability and privacy advantages. The paper aims to provide accurate and continuous pose estimation using WiFi channel state information. Method: The method introduces ViSTA-Former, a spatiotemporal attention backbone with a dual-stream architecture that separately captures temporal dependencies and structural relationships among body joints. It also integrates a velocity modeling branch to enhance sensitivity to subtle human motions. Result: The proposed framework achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. Conclusion: VST-Pose provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Abstract: WiFi-based human pose estimation has emerged as a promising non-visual alternative approach due to its penetrability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation using WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone that adopts a dual-stream architecture to separately capture temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework, which learns short-term keypoint displacement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and demonstrate that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. 
Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our code is available at https://github.com/CarmenQing/VST-Pose.
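
For readers unfamiliar with the PCK@50 metric quoted above, the following is a minimal sketch of how Percentage of Correct Keypoints is commonly computed: a predicted keypoint counts as correct when its distance to the ground truth is at most a fraction (here 0.5) of a per-frame normalization length. The normalization length used in the paper is not stated in the abstract, so `norm_lengths` and the toy coordinates below are assumptions for illustration only.

```python
import math


def pck(pred, gt, norm_lengths, threshold=0.5):
    """Percentage of Correct Keypoints over a set of frames.

    pred, gt: lists of frames, each a list of (x, y) keypoints.
    norm_lengths: one normalization length per frame (assumed here;
        commonly a torso or bounding-box size).
    threshold: fraction of the normalization length; 0.5 gives PCK@50.
    """
    correct = 0
    total = 0
    for frame_pred, frame_gt, norm in zip(pred, gt, norm_lengths):
        for (px, py), (gx, gy) in zip(frame_pred, frame_gt):
            dist = math.hypot(px - gx, py - gy)
            correct += dist <= threshold * norm
            total += 1
    return correct / total if total else 0.0


# Toy frame (hypothetical): four keypoints, normalization length 10.
gt = [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]]
pred = [[(1.0, 0.0), (10.0, 4.0), (10.0, 16.0), (0.0, 10.0)]]

score = pck(pred, gt, norm_lengths=[10.0])
print(score)
```

Here the keypoint errors are 1, 4, 6 and 0 against a tolerance of 5, so three of the four keypoints count as correct.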

[148] Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model

Osher Rafaeli,Tal Svoray,Ariel Nahlieli

Main category: cs.CV

TL;DR: This paper introduces a novel deep learning framework for generating ultra-high-resolution Digital Elevation Models (DEMs), significantly outperforming existing methods in both resolution and accuracy.

Details Motivation: High-resolution elevation data is crucial across various domains such as hydrology, urban morphology, and ecosystem monitoring, but existing techniques face limitations in resolution and context. Method: The study fine-tunes a vision transformer encoder using LiDAR-derived DEMs and employs a prompting strategy to enable tasks like DEM estimation, void filling, and updating. Result: The method achieves a 100x resolution gain (from 30-m to 30-cm) with less than 5 m MAE compared to LiDAR, showing robust performance across diverse landscapes. Conclusion: The proposed framework for high-resolution DEM estimation provides significant improvements in resolution and accuracy, demonstrating its potential for broad applications in environmental studies. Abstract: High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, making its conversion to a seamless DEM restricted. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). 
The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: https://osherr1996.github.io/prompt2dem_propage/.
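
The two headline numbers above, the 100x resolution gain and the mean absolute error (MAE) relative to LiDAR, reduce to simple arithmetic. The sketch below is illustrative only: the function names and the sample elevation values are hypothetical, and the 30 m and 30 cm ground sampling distances are the ones quoted in the abstract.

```python
def resolution_gain(coarse_cm, fine_cm):
    """Linear resolution gain between two ground sampling distances (in cm)."""
    return coarse_cm / fine_cm


def mean_absolute_error(pred, ref):
    """MAE between predicted and reference elevations (same units, e.g. metres)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


# 30 m prompts (3000 cm) versus 30 cm output cells.
gain = resolution_gain(3000, 30)
cells_per_prompt_cell = int(gain) ** 2  # each coarse cell covers gain^2 fine cells

# Hypothetical elevations in metres: model output versus LiDAR reference.
pred = [102.0, 98.5, 110.0]
lidar = [100.0, 100.0, 107.0]
mae = mean_absolute_error(pred, lidar)
print(gain, cells_per_prompt_cell, mae)
```

A 100x linear gain means each 30 m prompt cell corresponds to 10,000 output cells, which is why the coarse elevation can only provide global context rather than fine detail.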

[149] ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments

Jiali Chen,Yujie Jia,Zihan Wu,Jinyu Yang,Jianpeng Chen,Xusen Hei,Jiayuan Xie,Yi Cai,Qing Li

Main category: cs.CV

TL;DR: 本文提出了ExpStar模型及ExpInstruct数据集,用于自动生成多学科科学实验评注,并展示了其优于现有大型多模态模型的表现。

Details Motivation: 人工生成实验评注费时且依赖专业知识,因此需要自动化的解决方案。 Method: 提出了ExpStar模型,利用检索增强机制来适应性地访问、评估和利用外部知识。 Result: ExpStar在实验评注生成任务上显著优于14种领先的LMMs。 Conclusion: ExpStar展现了在科学实验教学中应用AI的潜力,能够有效辅助教师生成实验评注。 Abstract: Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) We construct \textit{ExpInstruct}, the first dataset tailored for experiment commentary generation, featuring over 7\textit{K} step-level commentaries across 21 scientific subjects from 3 core disciplines (\ie, science, healthcare and engineering). Each sample includes procedural descriptions along with potential scientific principles (\eg, chemical equations and physical laws) and safety guidelines. (ii) We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Extensive experiments show that our ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model. We believe that ExpStar holds great potential for advancing AI-assisted scientific experiment instruction.

[150] Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI

Phat Nguyen,Ngai-Man Cheung

Main category: cs.CV

TL;DR: This paper provides a comprehensive analysis of token compression techniques for Vision Transformers, revealing that while these methods work well for standard models, they are less effective for compact architectures used in edge devices.

Details Motivation: Two critical gaps exist in current research on token compression techniques for Vision Transformers: a lack of unified surveys comparing different approaches and deployment settings, and limited evaluation on structurally compressed transformers suitable for resource-constrained edge devices. Method: The authors conducted a systematic taxonomy and comparative study of token compression methods, evaluating representative techniques on both standard and compact ViT architectures. Result: Experiments showed that while token compression methods offer good performance on standard ViTs (e.g., ViT-B, ViT-L), they often do not perform as well when applied directly to compact transformer designs. Conclusion: Token compression methods are effective for general-purpose ViTs but underperform when directly applied to compact designs. This highlights the need for future research on adapting these techniques for compact transformer-based networks, especially for edge AI and AI agent applications. Abstract: Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these methods aim to remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy-efficiency trade-offs on large-scale ViTs, two critical gaps remain. First, there is a lack of unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine-tuning vs. plug-in). 
Second, most benchmarks are limited to standard ViT models (e.g., ViT-B, ViT-L), leaving open the question of whether such methods remain effective when applied to structurally compressed transformers, which are increasingly deployed on resource-constrained edge devices. To address these gaps, we present the first systematic taxonomy and comparative study of token compression methods, and we evaluate representative techniques on both standard and compact ViT architectures. Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs. These findings not only provide practical insights but also pave the way for future research on adapting token optimization techniques to compact transformer-based networks for edge AI and AI agent applications.
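
To make the pruning strategy mentioned above concrete, here is a minimal sketch of score-based token pruning: keep the highest-scoring fraction of tokens before an attention layer, so that the quadratic attention cost shrinks accordingly. The importance scores, the `keep_ratio` parameter, and the function names are hypothetical; the survey covers many scoring and merging schemes, and this sketch does not reproduce any specific one.

```python
import math


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by importance score.

    The surviving tokens are returned in their original order, together
    with their original indices.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


def attention_cost(n_tokens):
    """Pairwise attention cost, proportional to the square of the token count."""
    return n_tokens * n_tokens


# Toy sequence (hypothetical): 8 tokens with made-up importance scores.
tokens = [f"t{i}" for i in range(8)]
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2]

kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
speedup = attention_cost(len(tokens)) / attention_cost(len(kept_tokens))
print(kept_tokens, speedup)
```

Halving the token count cuts the pairwise attention cost by a factor of four, which hints at why the gains are smaller on compact models that already operate on few tokens or narrow layers.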

[151] Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

Yu Lei,Bingde Liu,Qingsong Xie,Haonan Lu,Zhijie Deng

Main category: cs.CV

TL;DR: This paper proposes $L^2$-VSD, a new method for text-to-3D generation that addresses the slow convergence and training instability of existing methods.

Details Motivation: In practice, VSD suffers from slow convergence and unstable training, so the optimization process needs to be improved to raise generation quality. Method: The paper proposes $L^2$-VSD, which uses a linearized model for score distillation and is computed efficiently with forward-mode automatic differentiation. Result: Experiments show that $L^2$-VSD outperforms existing score distillation methods on text-to-3D generation and can be seamlessly integrated into other VSD-based frameworks. Conclusion: $L^2$-VSD addresses the slow convergence and unstable training of VSD in practice, achieving efficient score distillation through a linearized model. Abstract: Text-to-3D generation based on score distillation of pre-trained 2D diffusion models has gained increasing interest, with variational score distillation (VSD) as a remarkable example. VSD proves that vanilla score distillation can be improved by introducing an extra score-based model, which characterizes the distribution of images rendered from 3D models, to correct the distillation gradient. Despite the theoretical foundations, VSD, in practice, is likely to suffer from slow and sometimes ill-posed convergence. In this paper, we perform an in-depth investigation of the interplay between the introduced score model and the 3D model, and find that there exists a mismatching problem between LoRA and 3D distributions in practical implementation. We can simply adjust their optimization order to improve the generation quality. By doing so, the score model looks ahead to the current 3D state and hence yields more reasonable corrections. Nevertheless, naive lookahead VSD may suffer from unstable training in practice due to the potential over-fitting. To address this, we propose to use a linearized variant of the model for score distillation, giving rise to the Linearized Lookahead Variational Score Distillation ($L^2$-VSD). $L^2$-VSD can be realized efficiently with forward-mode autodiff functionalities of existing deep learning libraries. Extensive experiments validate the efficacy of $L^2$-VSD, revealing its clear superiority over prior score distillation-based methods. We also show that our method can be seamlessly incorporated into any other VSD-based text-to-3D framework.

[152] Pairwise Alignment & Compatibility for Arbitrarily Irregular Image Fragments

Ofir Itzhak Shahar,Gur Elkin,Ohad Ben-Shahar

Main category: cs.CV

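TL;DR: This paper proposes an efficient hybrid method that handles fragments of arbitrary shape and performs well in archaeological puzzle reconstruction.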

Details Motivation: Most existing methods fail on fragments with realistic geometric properties, or are not designed to handle the fragment complexity of real-life puzzles. Method: A new image fragment dataset and a formal erosion model that mimics real-world archaeological erosion are introduced, and the proposed compatibility method is embedded into an archaeological puzzle-solving framework. Result: State-of-the-art neighborhood-level precision and recall are achieved on the RePAIR 2D dataset, directly reflecting the improvement in compatibility performance. Conclusion: The paper proposes an efficient hybrid (geometric and pictorial) method for computing the optimal alignment between fragments, independent of their shapes, dimensions, or pictorial content. Abstract: Pairwise compatibility calculation is at the core of most fragments-reconstruction algorithms, in particular those designed to solve different types of the jigsaw puzzle problem. However, most existing approaches fail, or aren't designed to deal with fragments of realistic geometric properties one encounters in real-life puzzles. And in all other cases, compatibility methods rely strongly on the restricted shapes of the fragments. In this paper, we propose an efficient hybrid (geometric and pictorial) approach for computing the optimal alignment for pairs of fragments, without any assumptions about their shapes, dimensions, or pictorial content. We introduce a new image fragments dataset generated via a novel method for image fragmentation and a formal erosion model that mimics real-world archaeological erosion, along with evaluation metrics for the compatibility task. We then embed our proposed compatibility into an archaeological puzzle-solving framework and demonstrate state-of-the-art neighborhood-level precision and recall on the RePAIR 2D dataset, directly reflecting compatibility performance improvements.

[153] NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Amirhossein Ansari,Ke Wang,Pulei Xiong

Main category: cs.CV

TL;DR: This paper proposes NegRefine, a framework that enhances zero-shot OOD detection by refining negative labels and using a dynamic scoring mechanism to better distinguish between in-distribution and out-of-distribution samples.

Details Motivation: Existing negative label-based OOD detection methods often misclassify in-distribution samples as OOD due to overlapping or ambiguous negative labels. This limits their effectiveness in real-world applications. Method: NegRefine filters out subcategory labels and proper nouns from the negative label set and uses a dynamic scoring function to handle multiple label matches for each image. Result: NegRefine achieves robust performance on large-scale benchmarks like ImageNet-1K, demonstrating improved accuracy in distinguishing OOD samples compared to existing methods. Conclusion: NegRefine improves zero-shot OOD detection by refining negative labels and introducing a multi-matching-aware scoring function, leading to better separation of in-distribution and OOD samples. Abstract: Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. Source code is available at https://github.com/ah-ansari/NegRefine.
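
As background for the negative-label idea described above, the sketch below shows one common way such a score can be formed: take the image's similarities to in-distribution (ID) labels and to negative labels, and measure how much of the softmax mass falls on the ID labels. This is an illustrative assumption about the general family of methods, not the NegRefine scoring function, whose filtering step and multi-matching-aware adjustment are not specified in the abstract. The function name, temperature, and similarity values are hypothetical.

```python
import math


def id_mass_score(sim_id, sim_neg, temperature=0.01):
    """Fraction of softmax mass assigned to in-distribution labels.

    sim_id: similarities between the image and each ID label.
    sim_neg: similarities between the image and each negative label.
    A score near 1 suggests an ID sample; a score near 0 suggests OOD.
    """
    if not sim_id or not sim_neg:
        raise ValueError("both label sets must be non-empty")
    m = max(sim_id + sim_neg)  # subtract the max for numerical stability
    exp_id = sum(math.exp((s - m) / temperature) for s in sim_id)
    exp_neg = sum(math.exp((s - m) / temperature) for s in sim_neg)
    return exp_id / (exp_id + exp_neg)


# Hypothetical cosine similarities for two images.
id_like = id_mass_score(sim_id=[0.30, 0.10], sim_neg=[0.12, 0.08])
ood_like = id_mass_score(sim_id=[0.10, 0.12], sim_neg=[0.31, 0.09])
print(id_like, ood_like)
```

The failure mode described in the abstract follows directly from this form: if a negative label is a subcategory of an ID label, an ID image will also match it strongly, pulling the score down and causing a false OOD detection.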

[154] VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding

Younggun Kim,Ahmed S. Abdelrahman,Mohamed Abdel-Aty

Main category: cs.CV

TL;DR: This paper introduces VRU-Accident, a new benchmark designed to evaluate multimodal large language models' ability to reason through safety-critical scenarios involving vulnerable road users like pedestrians and cyclists. Despite reasonable performance on visual attributes, current models struggle with higher-level reasoning around accident causes and prevention.

Details Motivation: Ensuring the safety of vulnerable road users (VRUs) is crucial for autonomous driving systems due to the severe consequences crashes often involve. However, there is no standardized benchmark to quantitatively evaluate multimodal large language models' (MLLMs) reasoning abilities in complex, safety-critical scenarios involving VRUs. Method: The study presents VRU-Accident, a large-scale vision-language benchmark with 1K real-world dashcam accident videos annotated with 6K multiple-choice question-answer pairs and 1K dense scene descriptions. The benchmark evaluates MLLMs across six safety-critical categories and assesses their performance through a comprehensive evaluation of 17 state-of-the-art models on multiple-choice VQA and dense captioning tasks. Result: A comprehensive evaluation of 17 state-of-the-art models on the VRU-Accident benchmark reveals that while MLLMs perform reasonably well on visually grounded attributes, they encounter significant difficulties in reasoning about and describing accident causes, types, and preventability. Conclusion: MLLMs face significant challenges in reasoning and describing accident causes, types, and preventability despite performing reasonably well on visually grounded attributes. Abstract: Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal consequences. While multimodal large language models (MLLMs) have shown promise in enhancing scene understanding and decision making in autonomous vehicles, there is currently no standardized benchmark to quantitatively evaluate their reasoning abilities in complex, safety-critical scenarios involving VRUs. To address this gap, we present VRU-Accident, a large-scale vision-language benchmark designed to evaluate MLLMs in high-risk traffic scenarios involving VRUs. 
VRU-Accident comprises 1K real-world dashcam accident videos, annotated with 6K multiple-choice question-answer pairs across six safety-critical categories (with 24K candidate options and 3.4K unique answer choices), as well as 1K dense scene descriptions. Unlike prior works, our benchmark focuses explicitly on VRU-vehicle accidents, providing rich, fine-grained annotations that capture both spatial-temporal dynamics and causal semantics of accidents. To assess the current landscape of MLLMs, we conduct a comprehensive evaluation of 17 state-of-the-art models on the multiple-choice VQA task and on the dense captioning task. Our findings reveal that while MLLMs perform reasonably well on visually grounded attributes, they face significant challenges in reasoning and describing accident causes, types, and preventability.

[155] Hierarchical Abstraction Enables Human-Like 3D Object Recognition in Deep Learning Models

Shuhao Fu,Philip J. Kellman,Hongjing Lu

Main category: cs.CV

TL;DR: This study compares human and deep learning model performance in recognizing 3D objects, showing that point transformer models better mimic human-like shape abstraction.

Details Motivation: To understand whether deep learning models develop human-like 3D shape representations for object recognition. Method: Two human experiments were conducted, manipulating point density, object orientation, and local geometric structure. Two deep learning models (DGCNN and point transformer) were evaluated against human performance. Result: Humans consistently performed well across all conditions. The point transformer model outperformed the convolutional model in accounting for human performance, primarily due to its ability to hierarchically abstract 3D shapes. Conclusion: The point transformer model better aligns with human performance in recognizing 3D shapes compared to convolution-based models, owing to its hierarchical abstraction mechanism. Abstract: Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of 3D objects (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from 3D shapes, it remains unclear whether these models develop 3D shape representations similar to those used by human vision for object recognition. We hypothesize that training with 3D shapes enables models to form representations of local geometric structures in 3D shapes. However, their representations of global 3D object shapes may be limited. We conducted two human experiments systematically manipulating point density and object orientation (Experiment 1), and local geometric structure (Experiment 2). Humans consistently performed well across all experimental conditions. We compared two types of deep learning models, one based on a convolutional neural network (DGCNN) and the other on visual transformers (point transformer), with human performance. We found that the point transformer model provided a better account of human performance than the convolution-based model. 
The advantage mainly results from the mechanism in the point transformer model that supports hierarchical abstraction of 3D shapes.

Yihao Ding,Siwen Luo,Yue Dai,Yanbei Jiang,Zechuan Li,Geoffrey Martin,Yifan Peng

Main category: cs.CV

TL;DR: The paper discusses the use of Multimodal Large Language Models (MLLMs) in Visually-Rich Document Understanding (VRDU), analyzing methods, training paradigms, and datasets while proposing future directions for improving VRDU systems.

Details Motivation: Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. Method: This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components: (1) methods for encoding and fusing textual, visual, and layout features; (2) training paradigms, including pretraining strategies, instruction-response tuning, and the trainability of different model modules; and (3) datasets utilized for pretraining, instruction-tuning, and supervised fine-tuning. Result: Recently, Multimodal Large Language Models (MLLMs) have shown remarkable potential in this domain, leveraging both Optical Character Recognition (OCR)-dependent and OCR-free frameworks to extract and interpret information in document images. Conclusion: Finally, we discuss the challenges and opportunities in this evolving field and propose future directions to advance the efficiency, generalizability, and robustness of VRDU systems. Abstract: Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. Recently, Multimodal Large Language Models (MLLMs) have shown remarkable potential in this domain, leveraging both Optical Character Recognition (OCR)-dependent and OCR-free frameworks to extract and interpret information in document images. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components: (1) methods for encoding and fusing textual, visual, and layout features; (2) training paradigms, including pretraining strategies, instruction-response tuning, and the trainability of different model modules; and (3) datasets utilized for pretraining, instruction-tuning, and supervised fine-tuning. 
Finally, we discuss the challenges and opportunities in this evolving field and propose future directions to advance the efficiency, generalizability, and robustness of VRDU systems.

[157] SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang,Zhaoyang Li,Duomin Wang,Jiahe Zhang,Deyu Zhou,Zixin Yin,Xili Dai,Gang Yu,Xiu Li

Main category: cs.CV

TL;DR: This paper presents SpeakerVid-5M, the first large-scale dataset for audio-visual dyadic interactive virtual humans, featuring diverse interaction types and data quality levels, along with a video chat baseline and benchmark.

Details Motivation: With advancements in large-scale models, the next challenge in digital humans is audio-visual dyadic interaction. The lack of suitable datasets has hindered progress, prompting the need for SpeakerVid-5M to support research in this emerging field. Method: The authors constructed the SpeakerVid-5M dataset by collecting and categorizing over 5.2 million video clips. They stratified the dataset into subsets for pre-training and supervised fine-tuning, and developed an autoregressive video chat baseline alongside evaluation metrics. Result: SpeakerVid-5M contains over 8,743 hours of video data across multiple interaction types. The dataset includes a pre-training subset and a curated SFT subset. An AR-based video chat baseline and VidChatBench benchmark were also introduced. Conclusion: The paper introduces the SpeakerVid-5M dataset, a large-scale, high-quality dataset for audio-visual dyadic interactive virtual human generation. It provides diverse data categorized by interaction type and quality, along with a video chat baseline and benchmark for future research. Abstract: The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. 
First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/

[158] OpenHuman4D: Open-Vocabulary 4D Human Parsing

Keito Suzuki,Bang Du,Runfa Blark Li,Kunyao Chen,Lei Wang,Peng Liu,Ning Bi,Truong Nguyen

Main category: cs.CV

TL;DR: This paper introduces the first 4D human parsing framework that reduces inference time and introduces open-vocabulary capabilities, making dynamic 3D human representation more efficient and flexible for virtual and extended reality applications.

Details Motivation: The motivation behind this paper is the increasing importance of dynamic 3D human representation in virtual and extended reality applications, and the limitations of current human part segmentation methods due to their reliance on closed-set datasets and long inference times. Method: The method involves three key innovations: mask-based video object tracking to establish spatial and temporal correspondences efficiently, a Mask Validation module for managing new target identification and mitigating tracking failures, and a 4D Mask Fusion module that integrates memory-conditioned attention and logits equalization for robust embedding fusion. Result: The results show that the proposed method achieves up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes, demonstrating its effectiveness and flexibility in 4D human-centric parsing tasks. Conclusion: The paper concludes that the proposed 4D human parsing framework successfully addresses the limitations of existing methods by reducing inference time and introducing open-vocabulary capabilities, making it highly effective and flexible for 4D human-centric parsing tasks. Abstract: Understanding dynamic 3D human representation has become increasingly critical in virtual and extended reality applications. However, existing human part segmentation methods are constrained by reliance on closed-set datasets and prolonged inference times, which significantly restrict their applicability. In this paper, we introduce the first 4D human parsing framework that simultaneously addresses these challenges by reducing the inference time and introducing open-vocabulary capabilities. 
Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends the support to 4D human-centric video with three key innovations: 1) We adopt mask-based video object tracking to efficiently establish spatial and temporal correspondences, avoiding the necessity of segmenting all frames. 2) A novel Mask Validation module is designed to manage new target identification and mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating memory-conditioned attention and logits equalization for robust embedding fusion. Extensive experiments demonstrate the effectiveness and flexibility of the proposed method on 4D human-centric parsing tasks, achieving up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes.

[159] Counterfactual Visual Explanation via Causally-Guided Adversarial Steering

Yiran Qiao,Disheng Liu,Yiren Lu,Yu Yin,Mengnan Du,Jing Ma

Main category: cs.CV

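TL;DR: This paper proposes CECAS, a causally guided adversarial method for generating more accurate counterfactual visual explanations, which performs well across multiple metrics.

Details Motivation: Existing counterfactual visual explanation methods neglect the causal relationships and spurious correlations in the image generation process, which leads to unintended changes in the counterfactual images and limits explanation quality. Method: A novel framework, CECAS, is introduced that combines a causal perspective with an adversarial method to avoid unwanted perturbations on spurious factors. Result: Experiments show that the proposed method outperforms existing state-of-the-art approaches on multiple benchmark datasets and achieves a good balance among validity, sparsity, proximity, and realism. Conclusion: Through its causally guided adversarial method, CECAS effectively generates high-quality counterfactual explanations, achieving a balanced trade-off on multiple benchmark datasets. Abstract: Recent work on counterfactual visual explanations has contributed to making artificial intelligence models more explainable by providing visual perturbation to flip the prediction. However, these approaches neglect the causal relationships and the spurious correlations behind the image generation process, which often leads to unintended alterations in the counterfactual images and renders the explanations with limited quality. To address this challenge, we introduce a novel framework CECAS, which first leverages a causally-guided adversarial method to generate counterfactual explanations. It innovatively integrates a causal perspective to avoid unwanted perturbations on spurious factors in the counterfactuals. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches across multiple benchmark datasets and ultimately achieves a balanced trade-off among various aspects of validity, sparsity, proximity, and realism.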

[160] MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention

Zhanjiang Yang,Lijun Sun,Jiawei Dong,Xiaoxin An,Yang Liu,Meng Li

Main category: cs.CV

TL;DR: This paper proposes MCGA, a two-stage method for reconstructing hyperspectral images from RGB images, leveraging a Mixture of Codebooks and novel attention mechanisms to achieve superior performance.

Details Motivation: Most existing hyperspectral reconstruction methods neglect the inherent challenge of transitioning from low-dimensional RGB images to high-dimensional HSI. The proposed method aims to address this limitation by incorporating prior knowledge and designing physically motivated attention mechanisms for more effective and efficient reconstruction. Method: The MCGA approach consists of two stages: (1) learning spectral patterns using a multi-scale VQ-VAE to extract a Mixture of Codebooks (MoC), and (2) refining the RGB-to-HSI mapping by querying features from the MoC, along with the introduction of Grayscale-Aware Attention, Quantized Self-Attention, and an entropy-based Test-Time Adaptation strategy. Result: Extensive experiments demonstrate that the MCGA method achieves state-of-the-art performance in hyperspectral image reconstruction. Conclusion: MCGA is a two-stage method that achieves state-of-the-art performance in hyperspectral image reconstruction from RGB images by incorporating prior knowledge through a Mixture of Codebooks and physically motivated attention mechanisms. Abstract: Reconstructing hyperspectral images (HSI) from RGB images is a cost-effective solution for various vision-based applications. However, most existing learning-based hyperspectral reconstruction methods directly learn the RGB-to-HSI mapping using complex attention mechanisms, neglecting the inherent challenge of transitioning from low-dimensional to high-dimensional information. To address this limitation, we propose a two-stage approach, MCGA, which first learns spectral patterns before estimating the mapping. In the first stage, a multi-scale VQ-VAE learns representations from heterogeneous HSI datasets, extracting a Mixture of Codebooks (MoC). In the second stage, the RGB-to-HSI mapping is refined by querying features from the MoC to replace latent HSI representations, incorporating prior knowledge rather than forcing a direct high-dimensional transformation. 
To further enhance reconstruction quality, we introduce Grayscale-Aware Attention and Quantized Self-Attention, which adaptively adjust feature map intensities to meet hyperspectral reconstruction requirements. This physically motivated attention mechanism ensures lightweight and efficient HSI recovery. Moreover, we propose an entropy-based Test-Time Adaptation strategy to improve robustness in real-world scenarios. Extensive experiments demonstrate that our method, MCGA, achieves state-of-the-art performance. The code and models will be released at https://github.com/Fibonaccirabbit/MCGA

[161] Measuring the Impact of Rotation Equivariance on Aerial Object Detection

Xiuyu Wu,Xinhao Wang,Xiubin Zhu,Lan Yang,Jiyuan Liu,Xingchen Hu

Main category: cs.CV

TL;DR: This paper introduces MessDet, a rotation-equivariant detector that achieves superior performance on aerial image datasets with minimal parameters by leveraging strict rotation equivariance and a multi-branch head design.

Details Motivation: Rotation equivariance is crucial for aerial object detection, but current methods achieve only approximate rotation equivariance. This paper investigates whether strict rotation equivariance improves detection performance. Method: The study constructs a strictly rotation-equivariant backbone and neck network, compares it with approximately rotation-equivariant networks, and proposes a multi-branch head network to reduce parameters while improving accuracy. Result: The proposed MessDet detector outperforms existing methods on DOTA-v1.0, DOTA-v1.5, and DIOR-R datasets while using fewer parameters. Conclusion: MessDet achieves state-of-the-art performance on aerial image datasets with a low parameter count by implementing a strictly rotation-equivariant backbone and proposing a multi-branch head network. Abstract: Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. 
Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count.

[162] IGD: Instructional Graphic Design with Multimodal Layer Generation

Yadong Qu,Shancheng Fang,Yuxin Wang,Xiaorui Wang,Zhineng Chen,Hongtao Xie,Yongdong Zhang

Main category: cs.CV

TL;DR: This paper proposes IGD, a new method that swiftly generates multimodal layers with editable flexibility from natural language instructions, offering a new solution for graphic design.

Details Motivation: Existing two-stage methods lack creativity and intelligence, while diffusion-based methods generate non-editable files with poor legibility in visual text rendering, so neither achieves satisfactory automated graphic design. Method: IGD adopts a new paradigm of parametric rendering and image asset generation, combining multimodal understanding and reasoning capabilities with a diffusion model that generates image content. Result: Experimental results show that IGD performs well in graphic design and can swiftly generate multimodal layers with editable flexibility from natural language instructions. Conclusion: IGD offers a new solution for graphic design, with scalability and extensibility suited to complex graphic design tasks. Abstract: Graphic design visually conveys information and data by creating and combining text, images and graphics. Two-stage methods that rely primarily on layout generation lack creativity and intelligence, making graphic design still labor-intensive. Existing diffusion-based methods generate non-editable graphic design files at image level with poor legibility in visual text rendering, which prevents them from achieving satisfactory and practical automated graphic design. In this paper, we propose Instructional Graphic Designer (IGD) to swiftly generate multimodal layers with editable flexibility with only natural language instructions. IGD adopts a new paradigm that leverages parametric rendering and image asset generation. First, we develop a design platform and establish a standardized format for multi-scenario design files, thus laying the foundation for scaling up data. Second, IGD utilizes the multimodal understanding and reasoning capabilities of MLLM to accomplish attribute prediction, sequencing and layout of layers. It also employs a diffusion model to generate image content for assets. By enabling end-to-end training, IGD architecturally supports scalability and extensibility in complex graphic design tasks. The superior experimental results demonstrate that IGD offers a new solution for graphic design.

[163] Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios

Siyue Yao,Mingjie Sun,Eng Gee Lim,Ran Yi,Baojiang Zhong,Moncef Gabbouj

Main category: cs.CV

TL;DR: This paper proposes Crucial-Diff, a new data augmentation framework that addresses data scarcity by synthesizing crucial samples to improve detection and segmentation performance.

Details Motivation: In fields such as medicine, industry, and autonomous driving, data scarcity leads to model overfitting and dataset imbalance, hindering effective detection and segmentation performance. Samples produced by existing generative methods are repetitive or overly simple, fail to provide "crucial information" that targets the downstream model's weaknesses, and typically require separate training for different objects, which is computationally inefficient. Method: Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples, is proposed. It integrates two key modules: the Scene Agnostic Feature Extractor (SAFE) uses a unified feature extractor to capture target information; the Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the downstream model's detection results and fuses them with the output of the SAFE module. Result: On the MVTec dataset, Crucial-Diff achieves a pixel-level AP of 83.63% and an F1-MAX of 78.12%. On the colon polyp dataset, it reaches an mIoU of 81.64% and an mDice of 87.69%. Conclusion: The Crucial-Diff framework generates diverse, high-quality training data, effectively addressing data scarcity and improving detection and segmentation performance. Abstract: The scarcity of data in various scenarios, such as medical, industry and autonomous driving, leads to model overfitting and dataset imbalance, thus hindering effective detection and segmentation performance. Existing studies employ the generative models to synthesize more training samples to mitigate data scarcity. However, these synthetic samples are repetitive or simplistic and fail to provide "crucial information" that targets the downstream model's weaknesses. Additionally, these methods typically require separate training for different objects, leading to computational inefficiencies. To address these issues, we propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our method integrates two key modules. The Scene Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to capture target information. The Weakness Aware Sample Miner (WASM) generates hard-to-detect samples using feedback from the detection results of downstream model, which is then fused with the output of SAFE module. Together, our Crucial-Diff framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset, Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be released after acceptance.
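The mIoU and mDice figures quoted above are overlap metrics between predicted and ground-truth segmentation masks. The following is a minimal sketch of how such scores are commonly computed for binary masks, not the authors' evaluation code. The function names `iou_and_dice` and `mean_iou_and_dice` are hypothetical, and averaging per image pair is an assumption; the abstract does not say how the means are taken.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally sized binary masks (nested lists of 0/1)."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; scoring this as perfect agreement is a convention.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


def mean_iou_and_dice(pairs):
    """Average IoU and Dice over (pred, target) mask pairs (assumed per-image mean)."""
    scores = [iou_and_dice(pred, target) for pred, target in pairs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou_and_dice(pred, target))
```

For a single mask pair Dice is never lower than IoU, which is consistent with the reported mDice exceeding the reported mIoU.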

[164] Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis

Shubham Shukla,Kunal Sonalkar

Main category: cs.CV

TL;DR: This study evaluates Gemini 2.0 Flash and GPT-4o-mini for zero-shot fashion attribute recognition, highlighting Gemini's superior performance and the need for domain-specific tuning.

Details Motivation: To address the under-explored performance of large language models in fine-grained fashion attribute recognition, which is critical for enhancing customer experience and product discovery on retail websites. Method: A zero-shot evaluation of LLMs (GPT-4o-mini and Gemini 2.0 Flash) was conducted using the DeepFashion-MultiModal dataset across 18 fashion attribute categories, employing images as the sole input. Result: Gemini 2.0 Flash achieved a macro F1 score of 56.79%, while GPT-4o-mini scored 43.28%, demonstrating varying levels of effectiveness in multimodal fashion attribute recognition. Conclusion: Gemini 2.0 Flash outperforms GPT-4o-mini in zero-shot fashion attribute recognition, indicating the need for domain-specific fine-tuning and further research in fashion AI. Abstract: The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the 'discovery experience' of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models in the attribution tasks of fashion products. Our study evaluates these models across 18 categories of fashion attributes, offering insight into where these models excel. 
We only use images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
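For readers unfamiliar with the metric, here is a minimal sketch of a macro F1 computation: F1 is computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. This is an illustration only; the function name `macro_f1` and the toy labels are hypothetical, and the abstract does not specify how the paper aggregates scores across its 18 attribute categories.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy labels, not drawn from DeepFashion-MultiModal.
    y_true = ["cotton", "cotton", "denim", "denim"]
    y_pred = ["cotton", "denim", "denim", "denim"]
    print(round(macro_f1(y_true, y_pred), 4))
```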

[165] 4D-MISR: A unified model for low-dose super-resolution imaging via feature fusion

Zifei Wang,Zian Mao,Xiaoya He,Xi Huang,Haoran Zhang,Chun Cheng,Shufen Chu,Tingzheng Hou,Xiaoqin Zeng,Yujun Xie

Main category: cs.CV

TL;DR: This paper introduces a novel atomic-scale imaging technique using MISR principles and a CNN to enable ultra-low-dose electron microscopy for radiation-sensitive materials.

Details Motivation: Radiation damage limits the use of conventional electron microscopy on beam-sensitive materials such as proteins and 2D materials. The authors aim to overcome this limitation by developing a technique that works under ultra-low-dose conditions. Method: The method involves adapting principles from multi-image super-resolution (MISR), fusing multiple low-resolution, sub-pixel-shifted views, and enhancing the reconstruction using a dual-path, attention-guided convolutional neural network (CNN) trained on synthetic, multi-angle observations. Result: The developed method achieves atomic-scale super-resolution comparable to conventional ptychography under ultra-low-dose conditions and enables robust atomic-scale visualization across various types of beam-sensitive specimens. Conclusion: The paper concludes that their proposed method expands the capabilities of 4D-STEM and provides a new, generalizable approach for structural analysis of radiation-vulnerable materials with atomic-scale resolution under ultra-low-dose conditions. Abstract: While electron microscopy offers crucial atomic-resolution insights into structure-property relationships, radiation damage severely limits its use on beam-sensitive materials like proteins and 2D materials. To overcome this challenge, we push beyond the electron dose limits of conventional electron microscopy by adapting principles from multi-image super-resolution (MISR) that have been widely used in remote sensing. Our method fuses multiple low-resolution, sub-pixel-shifted views and enhances the reconstruction with a convolutional neural network (CNN) that integrates features from synthetic, multi-angle observations. We developed a dual-path, attention-guided network for 4D-STEM that achieves atomic-scale super-resolution from ultra-low-dose data. This provides robust atomic-scale visualization across amorphous, semi-crystalline, and crystalline beam-sensitive specimens. 
Systematic evaluations on representative materials demonstrate comparable spatial resolution to conventional ptychography under ultra-low-dose conditions. Our work expands the capabilities of 4D-STEM, offering a new and generalizable method for the structural analysis of radiation-vulnerable materials.

[166] Uncertainty Quantification for Incomplete Multi-View Data Using Divergence Measures

Zhipeng Xue,Yan Zhang,Ming Li,Chun Li,Yue Liu,Fei Yu

Main category: cs.CV

TL;DR: KPHD-Net improves multi-view classification and clustering by leveraging Proper Hölder divergence and integrating Dempster-Shafer evidence theory with the Kalman filter for enhanced uncertainty estimation and reliable fusion results.

Details Motivation: The motivation is to ensure the reliability of multi-view integration and final decisions particularly when dealing with noisy or corrupted data, by addressing the ignored domain gaps between different modalities using Proper Hölder divergence. Method: KPHD-Net uses a variational Dirichlet distribution to represent class probability distributions, models evidence from different views, and integrates it with Dempster-Shafer evidence theory combined with the Kalman filter. Result: Extensive experiments show that KPHD-Net outperforms current state-of-the-art methods in classification and clustering tasks regarding accuracy, robustness, and reliability. Conclusion: KPHD-Net offers theoretical guarantees and enhances the reliability of final fusion results by combining Dempster-Shafer evidence theory with the Kalman filter for future state estimations. Abstract: Existing multi-view classification and clustering methods typically improve task accuracy by leveraging and fusing information from different views. However, ensuring the reliability of multi-view integration and final decisions is crucial, particularly when dealing with noisy or corrupted data. Current methods often rely on Kullback-Leibler (KL) divergence to estimate uncertainty of network predictions, ignoring domain gaps between different modalities. To address this issue, KPHD-Net, based on Hölder divergence, is proposed for multi-view classification and clustering tasks. Generally, our KPHD-Net employs a variational Dirichlet distribution to represent class probability distributions, models evidences from different views, and then integrates it with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation effects. Our theoretical analysis demonstrates that Proper Hölder divergence offers a more effective measure of distribution discrepancies, ensuring enhanced performance in multi-view learning. 
Moreover, Dempster-Shafer evidence theory, recognized for its superior performance in multi-view fusion tasks, is introduced and combined with the Kalman filter to provide future state estimations. This integration further enhances the reliability of the final fusion results. Extensive experiments show that the proposed KPHD-Net outperforms the current state-of-the-art methods in both classification and clustering tasks regarding accuracy, robustness, and reliability, with theoretical guarantees.
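As background for the fusion step mentioned above, the sketch below shows the classical Dempster's rule of combination for two mass functions. It is not KPHD-Net's fusion procedure, which the abstract describes as operating on Dirichlet-based evidence together with a Kalman filter. The function name `dempster_combine` and the two-class example are hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset (a subset of the frame of
    discernment) to its belief mass; the masses in each should sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                # Mass assigned to incompatible hypotheses counts as conflict.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalise by the non-conflicting mass.
    return {subset: mass / (1.0 - conflict) for subset, mass in combined.items()}


if __name__ == "__main__":
    # Two views voting over the classes {"x", "y"}; the full set encodes uncertainty.
    view1 = {frozenset({"x"}): 0.6, frozenset({"x", "y"}): 0.4}
    view2 = {frozenset({"y"}): 0.5, frozenset({"x", "y"}): 0.5}
    for subset, mass in dempster_combine(view1, view2).items():
        print(sorted(subset), round(mass, 4))
```

Mass placed on the full set acts as an explicit uncertainty term, which is what makes this family of rules a natural fit for fusing views of differing reliability.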

[167] Latent Diffusion Models with Masked AutoEncoders

Junho Lee,Jeongwoo Shin,Hyungwook Choi,Joonseok Lee

Main category: cs.CV

TL;DR: 本文分析了LDM中自动编码器的三个关键属性,提出了VMAEs和LDMAEs,实现了图像生成质量和计算效率的显著提升。

Details Motivation: 尽管LDM在图像生成中有很大潜力,但对自动编码器的理想属性和最佳设计探索不足。 Method: 分析LDM中自动编码器的三个关键属性,并提出VMAEs和LDMAEs来优化这些属性。 Result: 现有自动编码器无法同时满足所有三个属性,而VMAEs和LDMAEs能够显著提升图像生成质量和计算效率。 Conclusion: 本文提出了VMAEs和LDMAEs,通过综合实验证明了其在图像生成质量和计算效率方面的显著提升。 Abstract: In spite of remarkable potential of the Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Through comprehensive experiments, we demonstrate significantly enhanced image generation quality and computational efficiency.

[168] 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving

Yixun Zhang,Lizhi Wang,Junjun Zhao,Wending Zhao,Feng Zhou,Yonghao Dang,Jianqin Yin

Main category: cs.CV

TL;DR: This paper proposes 3DGAA, an adversarial attack method for camera-based object detection systems in autonomous driving, which balances physical realism and attack robustness better than existing methods.

Details Motivation: Existing 2D and 3D physical attacks struggle to balance physical realism and attack robustness. Method: The authors proposed 3DGAA, which leverages the full 14-dimensional parameterization of 3D Gaussian Splatting to jointly optimize geometry and appearance. They also introduced a physical filtering module and a physical augmentation module. Result: 3DGAA reduced detection mAP from 87.21% to 7.38%, outperforming existing 3D physical attacks. It also maintained high transferability across different physical conditions. Conclusion: 3DGAA is a practical attack framework that can evaluate the safety of perception systems in autonomous driving. Abstract: Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. While existing 2D and 3D physical attacks typically optimize texture, they often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module to preserve geometric fidelity, and a physical augmentation module to simulate complex physical scenarios, thus enhancing attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA reduces the detection mAP from 87.21% to 7.38%, significantly outperforming existing 3D physical attacks. 
Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks. These results validate 3DGAA as a practical attack framework for evaluating the safety of perception systems in autonomous driving.

[169] Leveraging Swin Transformer for enhanced diagnosis of Alzheimer's disease using multi-shell diffusion MRI

Quentin Dessain,Nicolas Delinte,Bernard Hanseeuw,Laurence Dricot,Benoît Macq

Main category: cs.CV

TL;DR: This paper proposes a deep learning method using Swin Transformer and multi-shell dMRI data for early diagnosis of Alzheimer's disease and amyloid detection, achieving high accuracy and identifying important brain regions related to the disease.

Details Motivation: This study aims to support early diagnosis of Alzheimer's disease and detection of amyloid accumulation by leveraging the microstructural information available in multi-shell diffusion MRI (dMRI) data, using a vision transformer-based deep learning framework. Method: We present a classification pipeline that employs the Swin Transformer, a hierarchical vision transformer model, on multi-shell dMRI data for the classification of Alzheimer's disease and amyloid presence. Key metrics from DTI and NODDI were extracted and projected onto 2D planes to enable transfer learning with ImageNet-pretrained models. To efficiently adapt the transformer to limited labeled neuroimaging data, we integrated Low-Rank Adaptation. Result: The framework achieved competitive classification results within the scope of multi-shell dMRI-based features, with the best balanced accuracy of 95.2% for distinguishing cognitively normal individuals from those with Alzheimer's disease dementia using NODDI metrics. For amyloid detection, it reached 77.2% balanced accuracy in distinguishing amyloid-positive mild cognitive impairment/Alzheimer's disease dementia subjects from amyloid-negative cognitively normal subjects, and 67.9% for identifying amyloid-positive individuals among cognitively normal subjects. Grad-CAM-based explainability analysis identified clinically relevant brain regions, including the parahippocampal gyrus and hippocampus, as key contributors to model predictions. Conclusion: This study demonstrates the promise of diffusion MRI and transformer-based architectures for early detection of Alzheimer's disease and amyloid pathology, supporting biomarker-driven diagnostics in data-limited biomedical settings.
Abstract: Objective: This study aims to support early diagnosis of Alzheimer's disease and detection of amyloid accumulation by leveraging the microstructural information available in multi-shell diffusion MRI (dMRI) data, using a vision transformer-based deep learning framework. Methods: We present a classification pipeline that employs the Swin Transformer, a hierarchical vision transformer model, on multi-shell dMRI data for the classification of Alzheimer's disease and amyloid presence. Key metrics from DTI and NODDI were extracted and projected onto 2D planes to enable transfer learning with ImageNet-pretrained models. To efficiently adapt the transformer to limited labeled neuroimaging data, we integrated Low-Rank Adaptation. We assessed the framework on diagnostic group prediction (cognitively normal, mild cognitive impairment, Alzheimer's disease dementia) and amyloid status classification. Results: The framework achieved competitive classification results within the scope of multi-shell dMRI-based features, with the best balanced accuracy of 95.2% for distinguishing cognitively normal individuals from those with Alzheimer's disease dementia using NODDI metrics. For amyloid detection, it reached 77.2% balanced accuracy in distinguishing amyloid-positive mild cognitive impairment/Alzheimer's disease dementia subjects from amyloid-negative cognitively normal subjects, and 67.9% for identifying amyloid-positive individuals among cognitively normal subjects. Grad-CAM-based explainability analysis identified clinically relevant brain regions, including the parahippocampal gyrus and hippocampus, as key contributors to model predictions. Conclusion: This study demonstrates the promise of diffusion MRI and transformer-based architectures for early detection of Alzheimer's disease and amyloid pathology, supporting biomarker-driven diagnostics in data-limited biomedical settings.
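The results above are reported as balanced accuracy, which averages per-class recall so that a majority class cannot dominate the score. A minimal sketch of that metric follows; the toy labels (0 for cognitively normal, 1 for Alzheimer's disease dementia) are made up for illustration and are not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy set: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two class-1 samples is missed

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
```

On this toy set plain accuracy is 0.9 while balanced accuracy is 0.75, which is why the balanced variant is the more informative number when diagnostic groups differ in size.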

[170] Vision-Based Anti Unmanned Aerial Technology: Opportunities and Challenges

Guanghai Ding,Yihua Ren,Yuting Liu,Qijun Zhao,Shuiwang Li

Main category: cs.CV

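TL;DR: This paper surveys the current state and challenges of anti-UAV detection and tracking, compiles relevant datasets, analyzes mainstream algorithms, and proposes future research directions.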

Details Motivation: With the rapid development of UAV technology and its wide application in fields such as military reconnaissance, environmental monitoring, and logistics, efficient and accurate anti-UAV tracking has become essential, especially in complex environments such as public safety, border patrol, search and rescue, and agricultural monitoring. Method: The paper reviews the characteristics and challenges of existing anti-UAV technologies, surveys and compiles several publicly available datasets, analyzes the major vision-based and multi-sensor-fusion algorithms of recent years, and proposes future research directions. Result: The paper provides a comprehensive review of existing anti-UAV detection and tracking technologies, compiles the available datasets, and analyzes the major algorithms, offering guidance for future research. Conclusion: The paper summarizes current vision-based and multi-sensor-fusion anti-UAV detection and tracking technologies and proposes future research directions to advance the field. Abstract: With the rapid advancement of UAV technology and its extensive application in various fields such as military reconnaissance, environmental monitoring, and logistics, achieving efficient and accurate Anti-UAV tracking has become essential. The importance of Anti-UAV tracking is increasingly prominent, especially in scenarios such as public safety, border patrol, search and rescue, and agricultural monitoring, where operations in complex environments can provide enhanced security. Current mainstream Anti-UAV tracking technologies are primarily centered around computer vision techniques, particularly those that integrate multi-sensor data fusion with advanced detection and tracking algorithms. This paper first reviews the characteristics and current challenges of Anti-UAV detection and tracking technologies. Next, it investigates and compiles several publicly available datasets, providing accessible links to support researchers in efficiently addressing related challenges. Furthermore, the paper analyzes the major vision-based and vision-fusion-based Anti-UAV detection and tracking algorithms proposed in recent years. Finally, based on the above research, this paper outlines future research directions, aiming to provide valuable insights for advancing the field.

[171] Binomial Self-Compensation: Mechanism and Suppression of Motion Error in Phase-Shifting Profilometry

Geyou Zhang,Kai Liu,Ce Zhu

Main category: cs.CV

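TL;DR: The paper proposes I-BSC, a dynamic 3D scanning method that addresses motion error by computing a weighted sum of homogeneous fringe images, maintaining high accuracy while significantly reducing computational complexity.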

Details Motivation: Traditional phase-shifting profilometry (PSP) assumes a static object and therefore struggles with motion error in dynamic measurement; a more efficient method is needed to improve accuracy and speed. Method: The paper proposes image-sequential binomial self-compensation (I-BSC), which applies a binomially weighted sum to the image sequence instead of the phase-frame processing of P-BSC, and computes the arctangent function only once to reduce complexity. Result: 1) I-BSC outperforms existing methods in reducing motion error and outputs depth maps at a quasi-single-shot frame rate; 2) compared to P-BSC, I-BSC reduces the computational complexity by one polynomial order, accelerating computation and speeding up motion error convergence. Conclusion: I-BSC is an effective new method for handling motion error in dynamic measurement, significantly improving computational efficiency and practicality while preserving high accuracy. Abstract: Phase shifting profilometry (PSP) is widely used in high-precision 3D scanning due to its high accuracy, robustness, and pixel-wise handling. However, a fundamental assumption of PSP, that the object remains static, does not hold in dynamic measurement, making PSP susceptible to object motion. To address this challenge, our proposed solution, phase-sequential binomial self-compensation (P-BSC), sums successive motion-affected phase frames weighted by binomial coefficients. This approach exponentially reduces the motion error in a pixel-wise and frame-wise loopable manner. Despite its efficacy, P-BSC suffers from high computational overhead and error accumulation due to its reliance on multi-frame phase calculations and weighted summations. Inspired by P-BSC, we propose an image-sequential binomial self-compensation (I-BSC) that computes a weighted sum of the homogeneous fringe images instead of successive phase frames, which generalizes the BSC concept from phase sequences to image sequences. I-BSC computes the arctangent function only once, resolving both limitations in P-BSC. Extensive analysis, simulations, and experiments show that 1) the proposed BSC outperforms existing methods in reducing motion error while achieving a quasi-single-shot frame rate, i.e., the depth map frame rate equals the camera's acquisition rate, enabling 3D reconstruction with high pixel-depth-temporal resolution; 2) compared to P-BSC, our I-BSC reduces the computational complexity by one polynomial order, thereby accelerating the computational frame rate by several to dozens of times, while also reaching faster motion error convergence.
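The binomial weighting idea can be illustrated on scalar phase values for a single pixel. The sketch below is a toy model, not the authors' implementation: it assumes (hypothetically) that the motion error flips sign between successive phase frames while its amplitude drifts slowly, and it weights phase values as P-BSC is described to do. I-BSC applies such weights to fringe images instead, which is not reproduced here.

```python
from math import comb


def binomial_weights(order):
    """Normalized binomial coefficients C(order, k) / 2**order."""
    return [comb(order, k) / 2 ** order for k in range(order + 1)]


def binomial_compensate(frames, order):
    """Binomially weighted sum over each window of order + 1 successive values."""
    weights = binomial_weights(order)
    return [
        sum(w * frames[i + k] for k, w in enumerate(weights))
        for i in range(len(frames) - order)
    ]


# Toy model (assumption): true phase 1.0 plus a sign-alternating error whose
# amplitude grows linearly from frame to frame.
true_phase = 1.0
frames = [true_phase + (0.2 + 0.01 * k) * (-1) ** k for k in range(8)]

residual_order1 = max(abs(v - true_phase) for v in binomial_compensate(frames, 1))
residual_order2 = max(abs(v - true_phase) for v in binomial_compensate(frames, 2))
```

In this toy model the order-1 sum leaves a residual of about 0.005, and the order-2 sum removes the error up to floating-point rounding, which mirrors the claim that a higher binomial order suppresses the motion error further.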

[172] (Almost) Free Modality Stitching of Foundation Models

Jaisidh Singh,Diganta Misra,Boris Knyazev,Antonio Orvieto

Main category: cs.CV

TL;DR: This paper proposes Hyma, a hypernetwork-based method that efficiently selects optimal uni-modal models and trains connectors, achieving significant computational savings without sacrificing performance.

Details Motivation: The motivation stems from the computational challenges in aligning multiple pretrained uni-modal models using connector modules, especially when dealing with large datasets and a growing number of available models. Method: The study introduces Hypernetwork Model Alignment (Hyma), which uses hypernetworks to jointly train connector modules for $N \times M$ combinations of uni-modal models, enabling efficient model selection and alignment. Result: Hyma reduces the cost of searching for optimal uni-modal model pairs by $10\times$ on average across experiments, while preserving ranking accuracy and connector performance compared to traditional grid search approaches. Conclusion: Hyma provides an efficient solution for selecting optimal uni-modal model pairs and training connector modules, significantly reducing search costs while maintaining performance comparable to grid search methods. Abstract: Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with an autoregressive text model. This stitching process is performed by training a connector module that aims to align the representation-representation or representation-input spaces of these uni-modal models. However, given the complexity of training such connectors on large-scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for $N \times M$ combinations of uni-modal models.
In our experiments, Hyma reduces the optimal uni-modal model pair search cost by $10\times$ (averaged across all experiments), while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.

[173] Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies

Seokeon Choi,Sunghyun Park,Hyoungwoo Park,Jeongho Kim,Sungrack Yun

Main category: cs.CV

TL;DR: This paper proposes a selective optimization framework combining BP-low and ZO-high methods to enable efficient personalization of diffusion models on edge devices with limited resources.

Details Motivation: To enable memory-efficient personalization of text-to-image diffusion models while preserving privacy and operating within the constraints of edge devices. Method: A selective optimization framework dynamically selects between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high), based on the diffusion process characteristics and timestep-aware probabilistic function. Result: Experimental results show competitive performance with significantly reduced memory consumption, allowing scalable and high-quality on-device personalization without increasing inference latency. Conclusion: The proposed selective optimization framework effectively combines BP-low and ZO-high methods, achieving memory-efficient and high-quality fine-tuning for text-to-image diffusion models. Abstract: Memory-efficient personalization is critical for adapting text-to-image diffusion models while preserving user privacy and operating within the limited computational resources of edge devices. To this end, we propose a selective optimization framework that adaptively chooses between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high), guided by the characteristics of the diffusion process. As observed in our experiments, BP-low efficiently adapts the model to target-specific features, but suffers from structural distortions due to resolution mismatch. Conversely, ZO-high refines high-resolution details with minimal memory overhead but faces slow convergence when applied without prior adaptation. By complementing both methods, our framework leverages BP-low for effective personalization while using ZO-high to maintain structural consistency, achieving memory-efficient and high-quality fine-tuning. 
To maximize the efficacy of both BP-low and ZO-high, we introduce a timestep-aware probabilistic function that dynamically selects the appropriate optimization strategy based on diffusion timesteps. This function mitigates the overfitting from BP-low at high timesteps, where structural information is critical, while ensuring ZO-high is applied more effectively as training progresses. Experimental results demonstrate that our method achieves competitive performance while significantly reducing memory consumption, enabling scalable, high-quality on-device personalization without increasing inference latency.
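The two ingredients described above, a gradient-free update and a timestep-aware choice between strategies, can be sketched as follows. This is only an illustration under stated assumptions: `prob_zo_high`, its sigmoid form, and the `sharpness` parameter are hypothetical, since the abstract does not give the actual probabilistic function, and the two-point estimator is a standard zeroth-order technique rather than the paper's confirmed variant.

```python
import math
import random


def zo_gradient(loss_fn, params, direction, mu=1e-3):
    """Two-point zeroth-order gradient estimate along one probe direction.

    Needs only two forward evaluations of loss_fn and no backpropagation.
    """
    plus = [p + mu * d for p, d in zip(params, direction)]
    minus = [p - mu * d for p, d in zip(params, direction)]
    slope = (loss_fn(plus) - loss_fn(minus)) / (2 * mu)
    return [slope * d for d in direction]


def prob_zo_high(t, num_timesteps, progress, sharpness=6.0):
    """Hypothetical selection probability for ZO-high.

    Rises with the diffusion timestep t and with training progress in [0, 1].
    """
    x = t / num_timesteps + progress - 1.0
    return 1.0 / (1.0 + math.exp(-sharpness * x))


def choose_strategy(t, num_timesteps, progress, rng):
    """Sample 'ZO-high' or 'BP-low' for one training step."""
    if rng.random() < prob_zo_high(t, num_timesteps, progress):
        return "ZO-high"
    return "BP-low"


# Toy check on a quadratic loss: the estimate along (1, 0) is about (2, 0).
grad = zo_gradient(lambda p: sum(v * v for v in p), [1.0, -2.0], [1.0, 0.0])
strategy = choose_strategy(t=900, num_timesteps=1000, progress=0.5, rng=random.Random(0))
```

The memory saving comes from `zo_gradient` never storing activations for a backward pass; the price is a noisier estimate, which is consistent with the slow convergence the abstract attributes to ZO-high without prior adaptation.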

[174] LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning

Xianghong Zou,Jianping Li,Zhe Chen,Zhen Cao,Zhen Dong,Qiegen Liu,Bisheng Yang

Main category: cs.CV

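TL;DR: This paper proposes LifelongPR, a continual learning framework for point cloud place recognition that effectively mitigates catastrophic forgetting and achieves significant improvements on several metrics.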

Details Motivation: Existing point cloud place recognition models tend to forget previously learned knowledge when adapting to new environments, which limits their scalability and practicality. Method: The paper proposes a continual learning framework based on replay sample selection and prompt learning, combining dynamic sample allocation with a domain adaptation strategy. Result: Compared with state-of-the-art methods, the method improves mIR@1 by 6.50% and mR@1 by 7.96%, and reduces F by 8.95%. Conclusion: LifelongPR is a new continual learning framework that effectively addresses catastrophic forgetting in point cloud place recognition models and performs well on large-scale datasets. Abstract: Point cloud place recognition (PCPR) plays a crucial role in photogrammetry and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset's information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F.
The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
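The replay sample selection step (allocating the replay budget per dataset, then picking spatially diverse samples) might look roughly like the sketch below. It is not taken from the linked repository: the information scores are hypothetical inputs, proportional allocation is an assumed rule, and farthest point sampling is a stand-in technique for "spatially diverse" selection.

```python
import math


def allocate_budget(info_scores, total):
    """Split a replay budget across datasets in proportion to their scores."""
    score_sum = sum(info_scores)
    raw = [total * s / score_sum for s in info_scores]
    sizes = [int(r) for r in raw]
    # Hand the rounding remainder to the largest fractional parts.
    leftover = total - sum(sizes)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in by_fraction[:leftover]:
        sizes[i] += 1
    return sizes


def farthest_point_sampling(positions, k):
    """Greedily pick k indices (k <= len(positions)) that are far apart."""
    chosen = [0]
    dist = [math.dist(p, positions[0]) for p in positions]
    while len(chosen) < k:
        nxt = max(range(len(positions)), key=lambda i: dist[i])
        chosen.append(nxt)
        # Track each point's distance to its nearest already-chosen sample.
        dist = [min(d, math.dist(p, positions[nxt])) for d, p in zip(dist, positions)]
    return chosen


# Hypothetical example: two datasets with scores 1 and 3 share 8 replay slots,
# then 3 scans are picked from 4 candidate positions (metres, made up).
sizes = allocate_budget([1.0, 3.0], total=8)
picked = farthest_point_sampling([(0, 0), (1, 0), (10, 0), (0, 10)], k=3)
```

Here the nearby scan at (1, 0) is skipped in favour of the two distant ones, which is the behaviour one would want from a replay memory meant to cover a scene rather than oversample one street corner.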

[175] CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books

Marc Serra Ortega,Emanuele Vivoli,Artemis Llabrés,Dimosthenis Karatzas

Main category: cs.CV

TL;DR: This paper proposes CoSMo, a novel multimodal Transformer for page stream segmentation in comic books that outperforms existing methods.

Details Motivation: Page stream segmentation of comic books is a key step for automated content understanding and is essential for downstream tasks such as character analysis, story indexing, or metadata enrichment. Method: The authors develop CoSMo, a novel multimodal Transformer, and evaluate it in vision-only and multimodal variants. Result: CoSMo consistently outperforms traditional baselines and larger general-purpose vision-language models on F1-Macro, Panoptic Quality, and stream-level metrics. Conclusion: With its vision-only and multimodal approaches, CoSMo achieves excellent performance on comic book page stream segmentation, paving the way for scalable comic book analysis. Abstract: This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.

[176] Lightweight Model for Poultry Disease Detection from Fecal Images Using Multi-Color Space Feature Optimization and Machine Learning

A. K. M. Shoriful Islam,Md. Rakib Hassan,Macbah Uddin,Md. Shahidur Rahman

Main category: cs.CV

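TL;DR: This paper presents a lightweight machine learning approach that detects diseases by analyzing poultry fecal images, aiming to provide a low-cost, interpretable, and scalable alternative to deep learning.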

Details Motivation: Poultry farming is a vital component of the global food supply chain, yet it remains highly vulnerable to infectious diseases such as coccidiosis, salmonellosis, and Newcastle disease. Method: The study uses multi-color space feature extraction (RGB, HSV, LAB) and a wide range of color, texture, and shape-based descriptors, including color histograms, local binary patterns (LBP), wavelet transforms, and edge detectors. Through a systematic ablation study and dimensionality reduction using PCA and XGBoost feature selection, it identifies a compact global feature set that balances accuracy and computational efficiency. Result: An artificial neural network (ANN) classifier trained on these features achieved 95.85% accuracy while requiring no GPU and only 638 seconds of execution time in Google Colab. Compared to deep learning models such as Xception and MobileNetV3, the proposed model offers comparable accuracy with drastically lower resource usage. Conclusion: The paper proposes a lightweight machine learning approach for detecting diseases from poultry fecal images. In low-resource agricultural settings, the approach shows accuracy comparable to deep learning models with lower resource consumption. Abstract: Poultry farming is a vital component of the global food supply chain, yet it remains highly vulnerable to infectious diseases such as coccidiosis, salmonellosis, and Newcastle disease. This study proposes a lightweight machine learning-based approach to detect these diseases by analyzing poultry fecal images. We utilize multi-color space feature extraction (RGB, HSV, LAB) and explore a wide range of color, texture, and shape-based descriptors, including color histograms, local binary patterns (LBP), wavelet transforms, and edge detectors. Through a systematic ablation study and dimensionality reduction using PCA and XGBoost feature selection, we identify a compact global feature set that balances accuracy and computational efficiency. An artificial neural network (ANN) classifier trained on these features achieved 95.85% accuracy while requiring no GPU and only 638 seconds of execution time in Google Colab. Compared to deep learning models such as Xception and MobileNetV3, our proposed model offers comparable accuracy with drastically lower resource usage. This work demonstrates a cost-effective, interpretable, and scalable alternative to deep learning for real-time poultry disease detection in low-resource agricultural settings.

[177] MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Chenguo Lin,Yuchen Lin,Panwang Pan,Yifan Yu,Honglei Yan,Katerina Fragkiadaki,Yadong Mu

Main category: cs.CV

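TL;DR: MoVieS is an efficient dynamic novel view synthesis model that models appearance, geometry, and motion in a unified framework and supports a variety of zero-shot applications.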

Details Motivation: To achieve fast and accurate 4D dynamic novel view synthesis from monocular videos while reducing dependence on task-specific supervision. Method: MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives and explicitly supervises their time-varying motion. Result: MoVieS achieves competitive performance across multiple tasks while offering speedups of several orders of magnitude, and supports a variety of zero-shot applications such as scene flow estimation and moving object segmentation. Conclusion: MoVieS is a novel feed-forward model that models the appearance, geometry, and motion of dynamic scenes within a unified learning framework while supporting a variety of zero-shot applications. Abstract: We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

[178] Frequency Regulation for Exposure Bias Mitigation in Diffusion Models

Meng Yu,Kun Zhan

Main category: cs.CV

TL;DR: This paper introduces a training-free, plug-and-play method using wavelet transforms to address exposure bias in diffusion models by regulating low- and high-frequency subbands, significantly improving generative performance.

Details Motivation: Exposure bias negatively affects the generative capabilities of diffusion models. The authors aim to provide a robust, training-free solution to this issue by analyzing energy patterns in noisy images during the diffusion process. Method: The authors use wavelet transforms to separately regulate low- and high-frequency subbands during the diffusion process, based on observed energy reduction patterns and their impact on amplitude variations. Result: The proposed method significantly improves the generative quality of various diffusion models and offers a robust solution to exposure bias across different architectures. Conclusion: The paper concludes that exposure bias in diffusion models can be effectively addressed using a training-free, plug-and-play method based on frequency-domain regulation with wavelet transforms, significantly improving generative quality. Abstract: Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of the predicted noisy images decreases during the diffusion process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) This energy reduction results in amplitude variations between the network-reconstructed clean data and the real clean data. Based on the first finding, we introduce a frequency-domain regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we provide a more accurate analysis of exposure bias in the two subbands. Our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and providing a robust solution to exposure bias across different model architectures. The source code is available at https://github.com/kunzhan/wpp.
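The idea of adjusting low- and high-frequency subbands separately can be shown with a one-level Haar transform. The sketch below is a simplified 1-D illustration, not the code from the linked repository: the paper works on images inside the sampling loop, and the weights `w_low` and `w_high` here are hypothetical placeholders for whatever schedule the method actually uses.

```python
import math


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out


def regulate(signal, w_low, w_high):
    """Scale the low- and high-frequency subbands independently, then rebuild."""
    low, high = haar_forward(signal)
    return haar_inverse([w_low * v for v in low], [w_high * v for v in high])


x = [1.0, 3.0, 5.0, 9.0]
unchanged = regulate(x, w_low=1.0, w_high=1.0)   # identity weights
smoothed = regulate(x, w_low=1.0, w_high=0.0)    # drop all detail
```

With unit weights the signal is reconstructed unchanged, and zeroing the high band leaves pairwise averages, so the two weights give independent control over the energy in each subband without touching the network's parameters.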

[179] A Transfer Learning-Based Method for Water Body Segmentation in Remote Sensing Imagery: A Case Study of the Zhada Tulin Area

Haonan Chen,Xin Tong

Main category: cs.CV

TL;DR: This study introduces a two-stage transfer learning method for improved water body segmentation in remote sensing images, especially effective in challenging domains like the Zhada Tulin area in Tibet.

Details Motivation: To overcome prevalent challenges of domain shift and small sample sizes in remote sensing image water body segmentation, particularly in regions with complex topography and spectral features. Method: A two-stage transfer learning approach was employed using the SegFormer model, involving initial training on a diverse source domain followed by fine-tuning on target domain data. Result: IoU for the water body segmentation task improved from 25.50% (direct transfer) to 64.84% with the proposed strategy. Conclusion: The two-stage transfer learning strategy based on the SegFormer model effectively addresses domain shift and small sample size challenges in remote sensing image water body segmentation, significantly improving performance. Abstract: To address the prevalent challenges of domain shift and small sample sizes in remote sensing image water body segmentation, this study proposes and validates a two-stage transfer learning strategy based on the SegFormer model. The approach begins by training a foundational segmentation model on a diverse source domain, where it achieves an Intersection over Union (IoU) of 68.80% on its validation set, followed by fine-tuning on data from the distinct target domain. Focusing on the Zhada Tulin area in Tibet -- a region characterized by highly complex topography and spectral features -- the experimental results demonstrate that this strategy significantly boosts the IoU for the water body segmentation task from 25.50% (for direct transfer) to 64.84%. This not only effectively resolves the model performance degradation caused by domain discrepancy but also provides an effective technical paradigm for high-precision thematic information extraction in data-scarce and environmentally unique remote sensing scenarios.
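For reference, the Intersection over Union figures quoted above measure the overlap between predicted and ground-truth water pixels. A minimal sketch on binary masks follows; the tiny masks are made up for illustration, and returning 1.0 for two empty masks is a convention chosen here rather than one stated in the paper.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks given as nested 0/1 lists."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Hypothetical 2x3 masks: 1 marks a water pixel.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```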

[180] FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Bingchao Wang,Zhiwei Ning,Jianyu Ding,Xuanang Gao,Yin Li,Dongsheng Jiang,Jie Yang,Wei Liu

Main category: cs.CV

TL;DR: FIX-CLIP enhances CLIP to handle long-text inputs effectively while preserving short-text capabilities, achieving top performance on retrieval benchmarks and showing strong application potential.

Details Motivation: CLIP performs well in zero-shot scenarios for short-text tasks but struggles with long-text inputs due to limitations in input length. The motivation is to enhance CLIP's capability for long-text representation without sacrificing its short-text performance. Method: FIX-CLIP introduces a dual-branch training pipeline, multiple learnable regional prompts with unidirectional masks in Transformer layers, and a hierarchical feature alignment module. Additionally, 30M images and synthesized long-text captions are used for training. Result: FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks and demonstrates promising plug-and-play applicability for diffusion models with long-text input. Conclusion: FIX-CLIP improves the performance of CLIP on long-text tasks while maintaining its effectiveness on short-text tasks, achieving state-of-the-art results. Abstract: CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks.
For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.

[181] Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association

Hamidreza Hashempoor

Main category: cs.CV

TL;DR: A multi-camera tracking framework is proposed that maintains consistent global identity assignment across views by integrating trajectory and appearance cues, with spatial validation through 3D position estimation.

Details Motivation: The motivation is to maintain consistent global identity assignment across different views in multi-camera setups, which is crucial for effective tracking in applications like surveillance and autonomous driving. Method: The method involves a multi-camera multi-target (MCMT) tracking framework that utilizes BoT-SORT-based single-camera tracking, followed by an initial glance phase for global ID initialization through trajectory-feature matching. Later frames use a prioritized global matching strategy to match new tracklets to existing global identities, introducing new IDs only when necessary. 3D positions are estimated using depth maps and calibration for spatial validation. Result: The result is a more accurate and consistent assignment of global identities to targets across multiple cameras, achieved through the integration of trajectory and appearance features as well as spatial validation using 3D position estimation. Conclusion: The proposed MCMT tracking framework ensures consistent global identity assignment across views using trajectory and appearance cues. Abstract: We propose a multi-camera multi-target (MCMT) tracking framework that ensures consistent global identity assignment across views using trajectory and appearance cues. The pipeline starts with BoT-SORT-based single-camera tracking, followed by an initial glance phase to initialize global IDs via trajectory-feature matching. In later frames, new tracklets are matched to existing global identities through a prioritized global matching strategy. New global IDs are only introduced when no sufficiently similar trajectory or feature match is found. 3D positions are estimated using depth maps and calibration for spatial validation.
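The rule of matching a new tracklet to an existing global identity, and creating a new ID only when nothing is similar enough, can be sketched as below. This is a simplified illustration using appearance features only: the cosine similarity measure, the `threshold` value, and the `gallery` dictionary are assumptions, and the trajectory cues, prioritized matching order, and 3D spatial validation described above are not reproduced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_global_id(feature, gallery, threshold=0.7):
    """Return (global_id, is_new) for a tracklet's appearance feature.

    gallery maps global_id -> reference feature and is updated in place
    when a new identity has to be created.
    """
    best_id, best_sim = None, -1.0
    for gid, ref in gallery.items():
        sim = cosine(feature, ref)
        if sim > best_sim:
            best_id, best_sim = gid, sim
    if best_id is not None and best_sim >= threshold:
        return best_id, False
    new_id = max(gallery, default=-1) + 1
    gallery[new_id] = feature
    return new_id, True


# Hypothetical 2-D features standing in for re-identification embeddings.
gallery = {0: [1.0, 0.0], 1: [0.0, 1.0]}
match = assign_global_id([0.9, 0.1], gallery)      # close to identity 0
fresh = assign_global_id([-1.0, -1.0], gallery)    # dissimilar to both
```

Keeping the threshold fairly strict trades a few duplicated identities for fewer wrong merges, which is usually the safer failure mode when identities must stay consistent across cameras.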

[182] DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation

Ivan Martinović,Josip Šarić,Marin Oršić,Matej Kristan,Siniša Šegvić

Main category: cs.CV

TL;DR: DEARLi is a novel semi-supervised panoptic segmentation method leveraging foundation models to enhance recognition and localization, delivering strong performance with minimal labeled data and reduced memory usage.

Details Motivation: Pixel-level annotation is expensive and time-consuming, prompting the need for effective semi-supervised segmentation methods that can operate with few labeled images and a large corpus of unlabeled data. Foundation models offer potential in addressing label scarcity but lack effective exploitation mechanisms. Method: DEARLi combines unsupervised mask-transformer consistency with zero-shot classification of CLIP features for recognition enhancement and uses a class-agnostic decoder warm-up based on SAM pseudo-labels for localization improvement. Result: DEARLi achieves 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images, outperforms the state of the art in semi-supervised semantic segmentation, and requires 8x less GPU memory. Conclusion: The proposed DEARLi approach excels in semi-supervised segmentation scenarios with large taxonomies and limited labeled data, outperforming state-of-the-art methods while being more resource-efficient. Abstract: Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. 
Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.
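For readers unfamiliar with the PQ figure quoted above, the following is a minimal sketch of the standard panoptic quality formula, not code from the DEARLi repository. It assumes segment matching has already been done (in the usual definition a predicted and a ground-truth segment match when their IoU exceeds 0.5); the function name and inputs are hypothetical.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic quality for one class.

    matched_ious: IoU of each matched (true-positive) segment pair.
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # no segments at all for this class
    # Sum of matched IoUs divided by |TP| + 0.5*|FP| + 0.5*|FN|.
    return sum(matched_ious) / denom
```

Benchmark PQ is typically reported as the average of this per-class value over all classes, scaled to a percentage.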

[183] Taming Modern Point Tracking for Speckle Tracking Echocardiography via Impartial Motion

Md Abulkalam Azad,John Nyberg,Håvard Dalen,Bjørnar Grenne,Lasse Lovstakken,Andreas Østvik

Main category: cs.CV

TL;DR: This paper explores the potential of modern point tracking methods for echocardiography motion estimation, identifies challenges like directional motion bias, and proposes improved training strategies and a lightweight network to significantly boost performance and reproducibility.

Details Motivation: Accurate motion estimation is crucial for precise cardiac function measurements in echocardiography. Traditional methods struggle with complex cardiac motion, and modern point tracking approaches remain underexplored in this domain. Method: This work investigates the effectiveness of SOTA point tracking methods in echocardiography by analyzing cardiac motion across heart cycles using real B-mode ultrasound videos. A directional motion bias affecting training strategies was identified, which was mitigated through refined training procedures and tailored augmentations. A lightweight network leveraging multi-scale cost volumes was also proposed. Result: Fine-tuning with the proposed strategies improved model performance over baselines, even for out-of-distribution cases. For example, EchoTracker boosted overall position accuracy by 60.7% and reduced median trajectory error by 61.5%. Clinical evaluation showed improved GLS measurements, aligning closely with expert-validated tools. Conclusion: The study concludes that while state-of-the-art point tracking methods show promise, they have limitations in echocardiography. The proposed lightweight network and training strategies significantly enhance performance and reproducibility for real-world applications. Abstract: Accurate motion estimation for tracking deformable tissues in echocardiography is essential for precise cardiac function measurements. While traditional methods like block matching or optical flow struggle with intricate cardiac motion, modern point tracking approaches remain largely underexplored in this domain. This work investigates the potential of state-of-the-art (SOTA) point tracking methods for ultrasound, with a focus on echocardiography. Although these novel approaches demonstrate strong performance in general videos, their effectiveness and generalizability in echocardiography remain limited. 
By analyzing cardiac motion throughout the heart cycle in real B-mode ultrasound videos, we identify that a directional motion bias across different views is affecting the existing training strategies. To mitigate this, we refine the training procedure and incorporate a set of tailored augmentations to reduce the bias and enhance tracking robustness and generalization through impartial cardiac motion. We also propose a lightweight network leveraging multi-scale cost volumes from spatial context alone to challenge the advanced spatiotemporal point tracking models. Experiments demonstrate that fine-tuning with our strategies significantly improves models' performances over their baselines, even for out-of-distribution (OOD) cases. For instance, EchoTracker boosts overall position accuracy by 60.7% and reduces median trajectory error by 61.5% across heart cycle phases. Interestingly, several point tracking models fail to outperform our proposed simple model in terms of tracking accuracy and generalization, reflecting their limitations when applied to echocardiography. Nevertheless, clinical evaluation reveals that these methods improve GLS measurements, aligning more closely with expert-validated, semi-automated tools and thus demonstrating better reproducibility in real-world applications.

[184] Deep Recurrence for Dynamical Segmentation Models

David Calhas,Arlindo L. Oliveira

Main category: cs.CV

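TL;DR: This study proposes a biologically inspired feedback mechanism that implements predictive-coding ideas within a U-Net architecture; the added feedback loop improves performance under noise and data efficiency.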

Details Motivation: Biological vision systems rely on feedback connections to iteratively refine perception, whereas most artificial neural networks remain purely feedforward; this motivates a biologically inspired feedback mechanism. Method: A predictive-coding-inspired feedback mechanism is implemented within a standard U-Net architecture, with two biologically motivated operations (softmax projection and exponential decay) introduced to keep the feedback loop stable. Result: Controlled experiments on a synthetic segmentation task show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. The feedback model reaches above-random performance with just two training examples, while the feedforward model requires at least four. Conclusion: Feedback enhances robustness and data efficiency, offering a path toward more adaptive and biologically inspired neural architectures. Abstract: While biological vision systems rely heavily on feedback connections to iteratively refine perception, most artificial neural networks remain purely feedforward, processing input in a single static pass. In this work, we propose a predictive coding inspired feedback mechanism that introduces a recurrent loop from output to input, allowing the model to refine its internal state over time. We implement this mechanism within a standard U-Net architecture and introduce two biologically motivated operations, softmax projection and exponential decay, to ensure stability of the feedback loop. Through controlled experiments on a synthetic segmentation task, we show that the feedback model significantly outperforms its feedforward counterpart in noisy conditions and generalizes more effectively with limited supervision. Notably, feedback achieves above random performance with just two training examples, while the feedforward model requires at least four. Our findings demonstrate that feedback enhances robustness and data efficiency, and offer a path toward more adaptive and biologically inspired neural architectures. Code is available at: github.com/DCalhas/feedback_segmentation.

[185] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis

Youngmin Kim,Giyeong Oh,Kwangsoo Youm,Youngjae Yu

Main category: cs.CV

TL;DR: SlumpGuard is an AI-based video analysis system that assesses concrete workability in real time, improving the efficiency and accuracy of quality control.

Details Motivation: Traditional slump testing is manual, time-consuming, and prone to inconsistency, which limits its applicability for real-time on-site monitoring. Method: The paper proposes SlumpGuard, a video-based AI system that analyzes concrete flow in real time. Result: The paper presents the system design, a dedicated dataset, and empirical results from real-world deployment, demonstrating the effectiveness of SlumpGuard. Conclusion: SlumpGuard is a practical solution that improves the accuracy and efficiency of concrete quality control. Abstract: Concrete workability is essential for construction quality, with the slump test being the most common on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and prone to inconsistency, limiting its applicability for real-time monitoring. To address these challenges, we propose SlumpGuard, an AI-powered, video-based system that automatically analyzes concrete flow from the truck chute to assess workability in real time. Our system enables full-batch inspection without manual intervention, improving both the accuracy and efficiency of quality control. We present the system design, the construction of a dedicated dataset, and empirical results from real-world deployment, demonstrating the effectiveness of SlumpGuard as a practical solution for modern concrete quality assurance.

[186] Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval

Shuyu Yang,Yaxiong Wang,Yongrui Li,Li Zhu,Zhedong Zheng

Main category: cs.CV

TL;DR: This paper introduces a novel dual-level domain adaptation approach for text-based person retrieval, effectively bridging the synthetic-to-real domain gap and achieving superior performance on multiple benchmarks.

Details Motivation: Motivated by privacy concerns and annotation costs, synthetic data is widely used for pretraining. However, the domain gap between synthetic and real-world data hinders performance, which this work aims to address. Method: The paper proposes a unified pipeline with Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation to address domain gaps in synthetic-to-real scenarios. Result: Extensive experiments demonstrate that the proposed method achieves state-of-the-art results on three benchmark datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. Conclusion: The study concludes that the proposed dual-level adaptation method effectively bridges the domain gap in text-based person retrieval, achieving state-of-the-art results on multiple datasets. Abstract: In this work, we focus on text-based person retrieval, which aims to identify individuals based on textual descriptions. Given the significant privacy issues and the high cost associated with manual annotation, synthetic data has become a popular choice for pretraining models, leading to notable advancements. However, the considerable domain gap between synthetic pretraining datasets and real-world target datasets, characterized by differences in lighting, color, and viewpoint, remains a critical obstacle that hinders the effectiveness of the pretrain-finetune paradigm. To bridge this gap, we introduce a unified text-based person retrieval pipeline considering domain adaptation at both image and region levels. In particular, it contains two primary components, i.e., Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the name implies, Domain-aware Diffusion is to migrate the distribution of images from the pretraining dataset domain to the target real-world dataset domain, e.g., CUHK-PEDES. 
Subsequently, MRA performs a meticulous region-level alignment by establishing correspondences between visual regions and their descriptive sentences, thereby addressing disparities at a finer granularity. Extensive experiments show that our dual-level adaptation method has achieved state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, outperforming existing methodologies. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/MRA.

[187] A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images

Jaeseong Lee,Yeeun Choi,Heechan Choi,Hanjung Kim,Seonjoo Kim

Main category: cs.CV

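TL;DR: This paper proposes ECP, a new framework that improves MLLM performance on high-resolution images by first extracting a candidate region and then predicting from it, addressing the difficulties MLLMs face with high-resolution inputs.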

Details Motivation: MLLMs perform poorly on high-resolution images, and a new approach is needed to address this. Method: The paper proposes ECP, a two-stage framework that first extracts a candidate region and then makes the prediction. Result: ECP achieves absolute improvements of +21.3%, +5.8%, and +5.2% on 4K GUI grounding, 4K MLLM perception, and 8K MLLM perception, respectively. Conclusion: ECP is an effective framework for improving MLLM performance on high-resolution images. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with fixed image resolution to align with the pre-trained image encoder used in MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although ensuring consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on the candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, +5.2% absolute improvement compared to baseline respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
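The candidate-extraction step can be pictured with a small geometric sketch. This is an illustration under stated assumptions, not the ECP implementation: it assumes the coarse prediction is a single point on the downsampled image and that the candidate region is a fixed-size crop no larger than the full image. The function name and all parameters are hypothetical.

```python
def candidate_box(coarse_xy, down_size, full_size, crop_size):
    """Map a point predicted on the downsampled image to a crop box
    (left, top, right, bottom) on the full-resolution image.

    All sizes are (width, height). Assumes crop_size <= full_size.
    """
    (cx, cy), (dw, dh) = coarse_xy, down_size
    (fw, fh), (cw, ch) = full_size, crop_size
    # Rescale the coarse point to full-resolution coordinates.
    fx = cx * fw / dw
    fy = cy * fh / dh
    # Center the crop on the point, then clamp it inside the image.
    left = int(min(max(fx - cw / 2, 0), fw - cw))
    top = int(min(max(fy - ch / 2, 0), fh - ch))
    return left, top, left + cw, top + ch
```

In a second stage, the crop at native resolution would be passed back to the model for the final prediction, which is how fine detail lost in downsampling could be recovered.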

[188] Improving Multimodal Learning via Imbalanced Learning

Shicai Wei,Chunbo Luo,Yang Luo

Main category: cs.CV

TL;DR: This paper challenges the assumption that balanced learning is best for multimodal models and proposes ARL, an asymmetric learning strategy that leverages modality variances to optimize performance.

Details Motivation: Multimodal learning often underperforms unimodal learning due to imbalanced learning across modalities. Existing methods attempt to balance gradients, but this paper argues that balanced learning is not always optimal. Method: The authors use bias-variance analysis to justify imbalanced optimization and propose the Asymmetric Representation Learning (ARL) strategy. ARL introduces auxiliary regularizers to calculate prediction variances, re-weights modality optimization based on their variances, and jointly optimizes prediction biases with multimodal loss. Result: Extensive experiments show that the proposed ARL strategy is effective and versatile across various datasets, improving multimodal learning performance without introducing extra parameters or structural dependencies. Conclusion: The paper concludes that balanced learning is not optimal for multimodal learning, and the proposed ARL strategy effectively improves performance by optimizing modality dependence based on variance ratios. Abstract: Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. 
Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at https://github.com/shicaiwei123/ICCV2025-ARL.
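The inverse-variance claim has a familiar toy analogue: when combining independent, unbiased predictors with a weighted average, weights proportional to the inverse variances minimize the variance of the combination. The sketch below illustrates only that textbook fact under those independence and unbiasedness assumptions; it is not the ARL training procedure, and the function names are hypothetical.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to 1."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [x / total for x in inv]


def combined_variance(weights, variances):
    """Variance of a weighted sum of independent predictors."""
    return sum(w * w * v for w, v in zip(weights, variances))


variances = [1.0, 4.0]                      # modality 2 is four times noisier
w_opt = inverse_variance_weights(variances)  # roughly [0.8, 0.2]
w_equal = [0.5, 0.5]
var_opt = combined_variance(w_opt, variances)      # roughly 0.8
var_equal = combined_variance(w_equal, variances)  # 1.25
```

In this toy setting the balanced weighting gives a higher variance (1.25) than the imbalanced one (about 0.8), which mirrors the abstract's argument that balanced dependence is not necessarily optimal.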

[189] Is Micro-expression Ethnic Leaning?

Huai-Qian Khor,Yante Li,Xingxun Jiang,Guoying Zhao

Main category: cs.CV

TL;DR: This study investigates the role of ethnicity in emotional expression, arguing against the universality of emotional expressions and proposing an ethnically aware framework for micro-expression recognition.

Details Motivation: This paper explores how ethnicity affects emotional expression, challenging Ekman's assumption of emotion universality across cultures and social contexts. Method: The research constructs a cross-cultural micro-expression database and algorithmically annotates ethnic labels to investigate the influence of ethnicity in a controlled environment. Qualitative analyses are also conducted. Result: The study reveals a certain influence of ethnic bias and develops a framework integrating ethnic context into emotional feature learning for better micro-expression recognition. Conclusion: The study concludes that the emotional universality hypothesis is an overgeneralization, and it proposes an ethnically aware framework for micro-expression recognition. Abstract: How much does ethnicity play its part in emotional expression? Emotional expression and micro-expression research probe into understanding human psychological responses to emotional stimuli, thereby revealing substantial hidden yet authentic emotions that can be useful in the event of diagnosis and interviews. While increased attention had been provided to micro-expression analysis, the studies were done under Ekman's assumption of emotion universality, where emotional expressions are identical across cultures and social contexts. Our computational study uncovers some of the influences of ethnic background in expression analysis, leading to an argument that the emotional universality hypothesis is an overgeneralization from the perspective of manual psychological analysis. In this research, we propose to investigate the level of influence of ethnicity in a simulated micro-expression scenario. We construct a cross-cultural micro-expression database and algorithmically annotate the ethnic labels to facilitate the investigation. 
With the ethnically annotated dataset, we perform a prima facie study to compare mono-ethnicity and stereo-ethnicity in a controlled environment, which experimentally uncovers a certain influence of ethnic bias. Building on this finding, we propose a framework that integrates ethnic context into the emotional feature learning process, yielding an ethnically aware framework that recognises ethnicity differences in micro-expression recognition. For improved understanding, qualitative analyses have been done to solidify the preliminary investigation into this new realm of research. Code is publicly available at https://github.com/IcedDoggie/ICMEW2025_EthnicMER

[190] Boosting Multimodal Learning via Disentangled Gradient Learning

Shicai Wei,Chunbo Luo,Yang Luo

Main category: cs.CV

TL;DR: This paper identifies an optimization conflict in multimodal learning and proposes a disentangled gradient learning framework (DGL) to resolve it, achieving better performance across various tasks and modalities.

Details Motivation: Multimodal learning often underperforms unimodal learning due to optimization conflicts between the modality encoder and modality fusion module, which existing methods fail to adequately address. Method: The authors propose a disentangled gradient learning (DGL) framework that truncates the gradient from the multimodal loss to the modality encoder and replaces it with the gradient from the unimodal loss, while also removing the gradient from the unimodal loss to the modality fusion module. Result: Extensive experiments demonstrate that the DGL framework improves performance across multiple modalities, tasks, and frameworks with dense cross-modal interaction. Conclusion: The proposed DGL framework effectively decouples the optimization of modality encoder and modality fusion module, solving the under-optimized problem in multimodal learning. Abstract: Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. 
Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at https://github.com/shicaiwei123/ICCV2025-GDL.

[191] From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation

Jeongho Kim,Sunghyun Park,Hyoungwoo Park,Sungrack Yun,Jaegul Choo,Seokeon Cho

Main category: cs.CV

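TL;DR: This paper proposes Wardrobe Polyptych LoRA, an efficient method for personalized human image generation that addresses the high computational cost and limited quality of existing methods and achieves high-quality image generation.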

Details Motivation: Existing personalized human image generation methods struggle with precision and consistency, and their high computational cost makes real-time use impractical. Method: Only LoRA layers are trained, and a selective subject region loss is introduced; generation is conditioned on the subject's wardrobe and uses spatial references to reduce information loss. Result: A new dataset and benchmark are constructed, and extensive experiments show that the method significantly outperforms existing techniques in fidelity and consistency. Conclusion: Wardrobe Polyptych LoRA is a new part-level controllable model for personalized human image generation that achieves high-fidelity, consistent generation without additional parameters at the inference stage. Abstract: Recent diffusion models achieve personalization by learning specific subjects, allowing learned attributes to be integrated into generated images. However, personalized human image generation remains challenging due to the need for precise and consistent attribute preservation (e.g., identity, clothing details). Existing subject-driven image generation methods often require either (1) inference-time fine-tuning with few images for each new subject or (2) large-scale dataset training for generalization. Both approaches are computationally expensive and impractical for real-time applications. To address these limitations, we present Wardrobe Polyptych LoRA, a novel part-level controllable model for personalized human image generation. By training only LoRA layers, our method removes the computational burden at inference while ensuring high-fidelity synthesis of unseen subjects. Our key idea is to condition the generation on the subject's wardrobe and leverage spatial references to reduce information loss, thereby improving fidelity and consistency. Additionally, we introduce a selective subject region loss, which encourages the model to disregard some of the reference images during training. Our loss ensures that generated images better align with text prompts while maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no additional parameters at the inference stage and performs generation using a single model trained on a few training samples. We construct a new dataset and benchmark tailored for personalized human image generation. 
Extensive experiments show that our approach significantly outperforms existing techniques in fidelity and consistency, enabling realistic and identity-preserving full-body synthesis.

[192] Straighten Viscous Rectified Flow via Noise Optimization

Jimin Dai,Jiexi Yan,Jian Yang,Lei Luo

Main category: cs.CV

TL;DR: This paper proposes VRFNO as an improvement over Reflow for image generation, introducing a historical velocity term and noise optimization to address distribution gaps and achieve better performance.

Details Motivation: Reflow has critical limitations in rapidly generating high-quality images due to a distribution gap between its constructed deterministic couplings and real images. Method: VRFNO introduces a historical velocity term to enhance trajectory distinction and employs noise optimization through reparameterization to form optimized couplings with real images for training. Result: Comprehensive experiments show that VRFNO significantly mitigates the limitations of Reflow across synthetic data and real datasets with varying resolutions. Conclusion: VRFNO effectively addresses the limitations of Reflow, achieving state-of-the-art performance in one-step and few-step image generation tasks. Abstract: The Reflow operation aims to straighten the inference trajectories of the rectified flow during training by constructing deterministic couplings between noises and images, thereby improving the quality of generated images in single-step or few-step generation. However, we identify critical limitations in Reflow, particularly its inability to rapidly generate high-quality images due to a distribution gap between images in its constructed deterministic couplings and real images. To address these shortcomings, we propose a novel alternative called Straighten Viscous Rectified Flow via Noise Optimization (VRFNO), which is a joint training framework integrating an encoder and a neural velocity field. VRFNO introduces two key innovations: (1) a historical velocity term that enhances trajectory distinction, enabling the model to more accurately predict the velocity of the current trajectory, and (2) the noise optimization through reparameterization to form optimized couplings with real images which are then utilized for training, effectively mitigating errors caused by Reflow's limitations. 
Comprehensive experiments on synthetic data and real datasets with varying resolutions show that VRFNO significantly mitigates the limitations of Reflow, achieving state-of-the-art performance in both one-step and few-step generation tasks.
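For context on what "straightening" means, here is a minimal sketch of the standard rectified-flow setup that Reflow builds on. It assumes the common convention that x0 is noise and x1 is an image, with linear interpolation between them. It does not model VRFNO's historical velocity term or its noise optimization, and the function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to image x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field on a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, image = [0.0, 0.0], [2.0, -1.0]
target = velocity_target(noise, image)
# Stand-in for a trained network whose trajectories are perfectly straight.
ideal_field = lambda x, t: target
one_step = euler_sample(noise, ideal_field, num_steps=1)
```

With a perfectly straight trajectory, a single Euler step already reaches the image, which is why straighter trajectories favor one-step and few-step generation.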

[193] Spatial Lifting for Dense Prediction

Mingzhi Xu,Yizhe Zhang

Main category: cs.CV

TL;DR: Spatial Lifting improves dense prediction tasks by lifting inputs into higher-dimensional spaces, offering efficiency and performance benefits.

Details Motivation: The motivation behind Spatial Lifting is to improve efficiency, accuracy, and reliability of deep networks in dense prediction tasks while reducing model parameters and inference costs. Method: Spatial Lifting works by lifting standard inputs like 2D images into a higher-dimensional space and processing them using networks designed for that dimension, such as a 3D U-Net. Result: The SL framework was validated across 19 benchmark datasets, showing competitive performance in dense prediction while decreasing model parameters by over 98% and lowering inference costs. Conclusion: Spatial Lifting (SL) is a new vision modeling paradigm that provides a promising route towards more efficient, accurate, and reliable deep networks for dense prediction tasks. Abstract: We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher-dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U-Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near-zero-additional-cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U-Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.

[194] ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users

Xiangyu Yin,Boyuan Yang,Weichen Liu,Qiyao Xue,Abrar Alamri,Goeran Fiedler,Wei Gao

Main category: cs.CV

TL;DR: This paper introduces ProGait, a new multi-purpose dataset designed to improve gait analysis for individuals with prosthetic legs, showing better performance in vision-based tasks compared to traditional models.

Details Motivation: The motivation is to address the lack of scalable and non-invasive gait analysis tools for optimizing prosthesis design and alignment, as current machine learning methods struggle with detecting and analyzing prostheses due to their unique features. Method: The authors created the ProGait dataset containing video clips of above-knee amputees walking with prosthetic legs. They conducted benchmark tasks and fine-tuned baseline models to evaluate the dataset's performance. Result: The ProGait dataset supports multiple vision tasks such as Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis, demonstrating enhanced performance in prosthesis-specific applications compared to existing models. Conclusion: The paper concludes that the ProGait dataset is effective for prosthesis-specific vision tasks, offering improved generalizability over pre-trained vision models. Abstract: Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations the ability to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prosthesis, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. 
Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. Our code is available at https://github.com/pittisl/ProGait and dataset at https://huggingface.co/datasets/ericyxy98/ProGait.

[195] Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Jinglun Li,Kaixun Jiang,Zhaoyu Chen,Bo Lin,Yao Tang,Weifeng Ge,Wenqiang Zhang

Main category: cs.CV

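TL;DR: This paper proposes SynOOD, a new method that uses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, improving their performance on the large-scale ImageNet benchmark.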

Details Motivation: Pre-trained vision-language models are strong at detecting out-of-distribution (OOD) samples, but challenging OOD samples that lie close to in-distribution (InD) data in image feature space can still be misclassified. The emergence of foundation models such as diffusion models and multimodal large language models (MLLMs) offers a potential solution. Method: SynOOD uses an iterative in-painting process guided by contextual prompts from MLLMs to generate nuanced, boundary-aligned OOD samples, and adjusts the noise based on gradients from OOD scores so as to sample from the InD/OOD boundary. Result: SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, improving AUROC by 2.80% and reducing FPR95 by 11.13%. Conclusion: By using foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, SynOOD improves performance on the large-scale ImageNet benchmark and significantly surpasses existing methods with minimal increases in parameters and runtime. Abstract: Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%. Code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
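The energy score mentioned in the abstract has a standard closed form: the negative temperature-scaled log-sum-exp of the classifier logits. The sketch below computes only that score. How SynOOD differentiates it to adjust the noise is not shown here, and the function name and `temperature` parameter are illustrative assumptions.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_i exp(logit_i / T)).

    Lower energy means the classifier is more confident that the input
    is in-distribution; higher energy suggests an OOD input.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


confident = energy_score([10.0, 0.0, 0.0])  # one dominant class
uncertain = energy_score([1.0, 1.0, 1.0])   # flat logits
```

Here `confident` is lower than `uncertain`, so thresholding the energy gives a simple InD/OOD decision rule, and samples whose energy sits near that threshold are the near-boundary cases the paper targets.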

[196] Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Despina Konstantinidou,Dimitrios Karageorgiou,Christos Koutlis,Olga Papadopoulou,Emmanouil Schinas,Symeon Papadopoulos

Main category: cs.CV

TL;DR: This paper highlights the limitations of current AI-generated image detection models in real-world scenarios and proposes improvements through systematic analysis of key factors, resulting in notable performance gains.

Details Motivation: The motivation behind this research stems from the rapid advancement of generative technologies, which have reached a level of sophistication where AI-generated images can easily deceive even discerning observers. This poses significant challenges in maintaining social trust and ensuring the integrity of digital information, making the task of AI-generated image detection increasingly critical. Method: The authors conducted a systematic evaluation of existing AID models using a new dataset called ITW-SM, which consists of real and AI-generated images collected from major social media platforms. They analyzed the impact of four key factors on AID performance: backbone architecture, training data composition, pre-processing strategies, and data augmentation combinations. Result: The study revealed that while AID models perform well on controlled benchmark datasets, they struggle with real-world variations. However, after systematically analyzing and optimizing the four identified factors, the authors achieved an average AUC improvement of 26.87% across various AID models under real-world conditions. Conclusion: The paper concludes that current AI-Generated Image Detection (AID) models face significant challenges in real-world scenarios, but by systematically analyzing and modifying key factors like backbone architecture, training data composition, pre-processing strategies, and data augmentation combinations, there is potential for substantial improvement in detection efficacy. Abstract: The rapid advancement of generative technologies presents both unprecedented creative opportunities and significant challenges, particularly in maintaining social trust and ensuring the integrity of digital information. Following these concerns, the challenge of AI-Generated Image Detection (AID) becomes increasingly critical. 
As these technologies become more sophisticated, the quality of AI-generated images has reached a level that can easily deceive even the most discerning observers. Our systematic evaluation highlights a critical weakness in current AI-Generated Image Detection models: while they perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world variations. To assess this, we introduce ITW-SM, a new dataset of real and AI-generated images collected from major social media platforms. In this paper, we identify four key factors that influence AID performance in real-world scenarios: backbone architecture, training data composition, pre-processing strategies and data augmentation combinations. By systematically analyzing these components, we shed light on their impact on detection efficacy. Our modifications result in an average AUC improvement of 26.87% across various AID models under real-world conditions.

[197] Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks

Ben Hamscher,Edgar Heinert,Annika Mütze,Kira Maag,Matthias Rottmann

Main category: cs.CV

TL;DR: This paper shows that using style transfer with randomly styled Voronoi cells in training reduces texture bias and boosts robustness in semantic segmentation models.

Details Motivation: To investigate whether style transfer can reduce texture biases and improve robustness in semantic segmentation, similar to its effects in image classification. Method: The researchers applied style transfer to images by varying styles across artificial areas formed by Voronoi cells, using the resulting data to train deep neural networks (DNNs) for semantic segmentation. Result: Style transfer augmentation was found to decrease texture bias and significantly increase robustness against image corruptions and adversarial attacks in semantic segmentation tasks. Conclusion: The study concludes that applying style transfer augmentation in semantic segmentation reduces texture bias and enhances robustness against common image corruptions and adversarial attacks, applicable across different architectures and datasets. Abstract: Recent research has investigated the shape and texture biases of deep neural networks (DNNs) in image classification which influence their generalization capabilities and robustness. It has been shown that, in comparison to regular DNN training, training with stylized images reduces texture biases in image classification and improves robustness with respect to image corruptions. In an effort to advance this line of research, we examine whether style transfer can likewise deliver these two effects in semantic segmentation. To this end, we perform style transfer with style varying across artificial image areas. Those random areas are formed by a chosen number of Voronoi cells. The resulting style-transferred data is then used to train semantic segmentation DNNs with the objective of reducing their dependence on texture cues while enhancing their reliance on shape-based features. In our experiments, it turns out that in semantic segmentation, style transfer augmentation reduces texture bias and strongly increases robustness with respect to common image corruptions as well as adversarial attacks. 
These observations hold for convolutional neural networks and transformer architectures on the Cityscapes dataset as well as on PASCAL Context, showing the generality of the proposed method.
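
To make the "random areas formed by Voronoi cells" concrete, here is a minimal sketch that partitions an image grid into Voronoi cells by assigning each pixel to its nearest randomly placed site. It is an illustration under assumptions, not the authors' pipeline: the uniform site sampling, the squared Euclidean distance, and the name `voronoi_cell_map` are all hypothetical. In a style-transfer augmentation, each cell index would then select which style is applied to that region.

```python
import random

def voronoi_cell_map(height, width, num_cells, seed=0):
    """Return (sites, cells), where cells[y][x] is the index of the nearest site."""
    rng = random.Random(seed)
    # Assumption: sites are drawn uniformly at random over the pixel grid.
    sites = [(rng.randrange(height), rng.randrange(width)) for _ in range(num_cells)]
    cells = []
    for y in range(height):
        row = []
        for x in range(width):
            dists = [(y - sy) ** 2 + (x - sx) ** 2 for sy, sx in sites]
            row.append(dists.index(min(dists)))  # ties go to the lowest index
        cells.append(row)
    return sites, cells

if __name__ == "__main__":
    _, cells = voronoi_cell_map(6, 12, num_cells=3, seed=1)
    for row in cells:
        print("".join(str(c) for c in row))
```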

[198] Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures

Xinlong Ding,Hongwei Yu,Jiawei Li,Feifan Li,Yu Shang,Bochao Zou,Huimin Ma,Jiansheng Chen

Main category: cs.CV

TL;DR: This paper presents a method called the Kaleidoscopic Background Attack to effectively attack camera pose estimation models by using specially designed backgrounds.

Details Motivation: Camera pose estimation is crucial for computer vision tasks, but its accuracy can be influenced by background textures in object-centric scenarios with sparse inputs. Method: A Kaleidoscopic Background Attack (KBA) was introduced, which uses identical segments to form discs with multi-fold radial symmetry. Additionally, a projected orientation consistency loss was proposed to optimize these segments. Result: Experimental results showed that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models. Conclusion: The study concludes that adversarial kaleidoscopic backgrounds can effectively attack camera pose estimation models. Abstract: Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.

[199] FTCFormer: Fuzzy Token Clustering Transformer for Image Classification

Muyi Bao,Changyu Zeng,Yifan Wang,Zhengni Yang,Zimu Wang,Guangliang Cheng,Jun Qi,Wei Wang

Main category: cs.CV

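TL;DR: This paper proposes FTCFormer, which improves the feature representation and classification performance of vision Transformers through semantics-driven fuzzy clustering.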

Details Motivation: Conventional Transformer models in computer vision tasks ignore the semantic meaning of image regions, leading to suboptimal feature representations. Method: The paper proposes the Fuzzy Token Clustering Transformer (FTCFormer), which comprises a clustering-center determination method based on density peak clustering and a fuzzy K-nearest-neighbor mechanism, a spatial connectivity scoring mechanism, and a channel-wise merging strategy. Result: In experiments on 32 datasets across diverse domains, FTCFormer improves by 1.43% on fine-grained datasets, 1.09% on natural image datasets, 0.97% on medical datasets, and 0.55% on remote sensing datasets. Conclusion: Through semantics-driven fuzzy clustering, FTCFormer outperforms the TCFormer baseline on image classification. Abstract: Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to represent semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline, achieving gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.

[200] Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration

Wenkang Han,Wang Lin,Yiyun Zhou,Qi Liu,Shulei Wang,Chang Yao,Jingyuan Chen

Main category: cs.CV

TL;DR: IP-FVR improves face video restoration by preserving identity details through novel attention mechanisms and learning strategies.

Details Motivation: Traditional FVR methods fail to preserve identity-specific features under severe degradation, producing generic faces. Method: IP-FVR uses a high-quality reference face image as a visual prompt with cross-attention mechanisms, identity-preserving feedback learning, exponential blending strategy, and multi-stream negative prompt. Result: IP-FVR achieves superior performance in restoring high-quality, identity-consistent face videos on synthetic and real-world datasets. Conclusion: IP-FVR outperforms existing methods in both quality and identity preservation for face video restoration. Abstract: Face Video Restoration (FVR) aims to recover high-quality face videos from degraded versions. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. 
Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration.

[201] DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

Jiahe Zhao,Rongkun Zheng,Yi Wang,Helin Wang,Hengshuang Zhao

Main category: cs.CV

TL;DR: This paper introduces DisCo, a novel visual encapsulation method for video Multimodal Large Language Models that improves semantic distinctness and temporal coherence, leading to better performance and efficiency.

Details Motivation: Linear projectors used for visual encapsulation in video Multimodal Large Language Models (video MLLMs) introduce semantic indistinctness and temporal incoherence when applied to videos. Resampler structures show promise in tackling these challenges, but an effective solution remains unexplored. Method: DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Result: DisCo yields semantically distinct and temporally coherent visual tokens for video MLLMs, outperforming previous methods on video understanding benchmarks and achieving higher token efficiency. Conclusion: DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks while also achieving higher token efficiency. Abstract: In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. 
(2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.

[202] Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation

Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

TL;DR: This paper introduces a new dual visual encoder framework for gloss-free sign language translation that performs strongly on a benchmark.

Details Motivation: Early sign language translation systems relied on gloss annotations that are costly and difficult to obtain, prompting the development of a more efficient approach to handling the complexity of continuous signing. Method: The study adopts a two-phase, dual visual encoder framework and uses contrastive visual-language pretraining for sign-to-text translation. Result: On the Phoenix-2014T benchmark, the method consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches. Conclusion: The paper proposes a gloss-free dual encoder architecture for sign language translation that outperforms other gloss-free methods on the Phoenix-2014T benchmark. Abstract: Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as an intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.

[203] Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Yuhan Liu,Jingwen Fu,Yang Wu,Kangyi Wu,Pengna Li,Jiayi Wu,Sanping Zhou,Jingmin Xin

Main category: cs.CV

TL;DR: This paper introduces a new framework called IMD that uses a pre-trained diffusion model to address the misalignment issue in image feature matching, particularly excelling in multi-instance scenarios with a 12% improvement on the proposed IMIM benchmark.

Details Motivation: The motivation stems from the observation that existing methods neglect the misalignment between single-image understanding by foundation models and the cross-image understanding required for feature matching, especially in multi-instance scenarios. Method: The method involves using a pre-trained diffusion model within a new framework (IMD) that captures instance-level details and facilitates cross-image interaction through a novel prompting module. Additionally, a new benchmark named IMIM is proposed to evaluate multi-instance feature matching scenarios. Result: The proposed IMD framework achieves a new state-of-the-art performance on commonly evaluated benchmarks and shows a 12% improvement on the IMIM benchmark, demonstrating its effectiveness in mitigating the misalignment problem. Conclusion: The paper concludes that the proposed IMD framework effectively addresses the misalignment issue in leveraging vision foundation models for image feature matching, achieving state-of-the-art results and showing significant improvement on the newly proposed IMIM benchmark. Abstract: Leveraging the vision foundation models has emerged as a mainstream paradigm that improves the performance of image feature matching. However, previous works have ignored the misalignment when introducing the foundation models into feature matching. The misalignment arises from the discrepancy between the foundation models focusing on single-image understanding and the cross-image understanding requirement of feature matching. Specifically, 1) the embeddings derived from commonly used foundation models exhibit discrepancies with the optimal embeddings required for feature matching; 2) lacking an effective mechanism to leverage the single-image understanding ability into cross-image understanding. A significant consequence of the misalignment is they struggle when addressing multi-instance feature matching problems. 
To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model) with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate the generative-based diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in generative model as a natural tunnel, propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our proposed IMD establishes a new state-of-the-art in commonly evaluated benchmarks, and the superior improvement 12% in IMIM indicates our method efficiently mitigates the misalignment.

[204] Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Hongjae Lee,Myungjun Son,Dongjea Kang,Seung-Won Jung

Main category: cs.CV

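TL;DR: This paper proposes QLIP, which accounts for the influence of text prompts when quantizing diffusion models, reducing computational complexity and improving the quality of generated images.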

Details Motivation: Existing diffusion model quantization methods do not treat input conditions, such as text prompts, as an important source of information for quantization, while the high computational complexity of diffusion models limits their use in resource-constrained environments. Method: A new quantization method named QLIP is proposed; it can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Result: Extensive experiments show that QLIP is effective at reducing computational complexity and improving image generation quality across various datasets. Conclusion: QLIP is an effective diffusion model quantization method that uses text prompts to guide the selection of bit precision for every layer at each time step, reducing computational complexity and improving the quality of generated images. Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
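
To show what "bit precision" controls in network quantization, the sketch below applies plain symmetric uniform quantization to a list of weights at different bit widths and measures the rounding error. This is a generic textbook scheme, not QLIP itself: how QLIP maps a text prompt to a per-layer, per-time-step bit width is not described in this entry, and the function names and example weights here are hypothetical.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform fake-quantization onto a signed `bits`-bit grid (bits >= 2)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / levels
    # Round to the nearest grid point, then map back to floating point.
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute rounding error introduced at the given bit width."""
    quantized = quantize_uniform(values, bits)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)

if __name__ == "__main__":
    weights = [0.9, -0.31, 0.05, -1.0, 0.47]  # illustrative values only
    for bits in (2, 4, 8):
        print(bits, mean_abs_error(weights, bits))
```

Fewer bits reduce compute and memory but increase rounding error, which is why choosing the bit width per layer and time step, as the abstract describes, can trade cost against image quality.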

[205] FGSSNet: Feature-Guided Semantic Segmentation of Real World Floorplans

Hugo Norrby,Gabriel Färm,Kevin Hernandez-Diaz,Fernando Alonso-Fernandez

Main category: cs.CV

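TL;DR: This paper proposes FGSSNet, which uses multi-headed feature extraction with a U-Net backbone to improve the accuracy and generalization of wall segmentation.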

Details Motivation: To address the limited generalization ability of wall segmentation on floorplans. Method: A U-Net segmentation backbone is combined with a multi-headed dedicated feature extractor that extracts domain-specific feature maps, which are injected into the latent space of the U-Net to guide the segmentation process. Result: Experiments show that injecting these features improves performance over the vanilla U-Net, supporting the validity of the proposed approach. Conclusion: FGSSNet introduces a new multi-headed feature-guided semantic segmentation architecture that improves the generalization ability of wall segmentation. Abstract: We introduce FGSSNet, a novel multi-headed feature-guided semantic segmentation (FGSS) architecture designed to improve the generalization ability of wall segmentation on floorplans. FGSSNet features a U-Net segmentation backbone with a multi-headed dedicated feature extractor used to extract domain-specific feature maps which are injected into the latent space of U-Net to guide the segmentation process. This dedicated feature extractor is trained as an encoder-decoder with selected wall patches, representative of the walls present in the input floorplan, to produce a compressed latent representation of wall patches while jointly trained to predict the wall width. In doing so, we expect that the feature extractor encodes texture and width features of wall patches that are useful to guide the wall segmentation process. Our experiments show increased performance by the use of such injected features in comparison to the vanilla U-Net, highlighting the validity of the proposed approach.

[206] Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter

Bo Jiang,Xueyang Ze,Beibei Wang,Xixi Wang,Xixi Wan,Bin Luo

Main category: cs.CV

TL;DR: This paper proposes VRGAdapter, which uses a random graph model to capture textual diversity and inter-class relationships, and UMF, a fusion scheme for ensemble prediction, to enhance the transfer of knowledge from pre-trained Vision-Language Models to downstream tasks.

Details Motivation: Traditional deterministic textual feature adapters fail to capture the rich semantic diversity in textual descriptions and do not exploit inter-class relationships. This limits the performance of Vision-Language Models (VLMs) on downstream tasks. Method: The paper introduces a Vertex Random Graph Adapter (VRGAdapter) that models textual diversity and inter-class relationships using a Vertex Random Knowledge Graph (VRKG) with probabilistic message propagation. It also incorporates a reparameterized sampling function for adapter learning and a Uncertainty-guided Multi-branch Fusion (UMF) scheme for ensemble prediction. Result: Extensive experiments on benchmark datasets demonstrate the effectiveness of VRGAdapter and the UMF scheme in improving performance on downstream visual learning tasks. Conclusion: The proposed VRGAdapter, together with the UMF scheme, provides a more general and robust adapter solution for transferring knowledge from pre-trained VLMs to downstream tasks, outperforming traditional methods by capturing diverse semantic information and inter-class relationships. Abstract: Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ the deterministic textual feature adapter to refine each category textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. Obviously, traditional deterministic adapter model cannot adequately capture this varied semantic information. Also, it is desirable to exploit the inter-class relationships in VLM adapter. 
To address these issues, we propose to exploit random graph model into VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that, VRGAdapter provides a more general adapter solution that encompasses traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
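
The abstract mentions a reparameterized sampling function over per-class distribution representations. The sketch below shows the standard reparameterization trick under the assumption of a diagonal Gaussian: a sample is written as the mean plus the standard deviation times unit noise, so it stays a differentiable function of the mean and standard deviation. The Gaussian form and all names here are assumptions for illustration; the entry does not state which distribution VRGAdapter uses.

```python
import random

def reparameterized_sample(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), elementwise.

    Assumption: a diagonal Gaussian over the class representation.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

if __name__ == "__main__":
    rng = random.Random(0)
    class_mean = [0.2, -0.5, 1.0]  # hypothetical class-node mean
    class_std = [0.1, 0.1, 0.3]    # hypothetical class-node standard deviation
    print(reparameterized_sample(class_mean, class_std, rng))
```

Setting the standard deviation to zero recovers a deterministic representation, which matches the intuition that a deterministic adapter is a special case of a distributional one.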

[207] Fine-Grained Zero-Shot Object Detection

Hongxu Ma,Chenbo Zhang,Lu Zhang,Jiaogen Zhou,Jihong Guan,Shuigeng Zhou

Main category: cs.CV

TL;DR: This paper introduces the Fine-Grained Zero-Shot Object Detection (FG-ZSD) problem and proposes an effective solution named MSHC, evaluated on a new benchmark dataset FGZSD-Birds.

Details Motivation: Existing ZSD works are mainly coarse-grained object detection, where the classes are visually quite different. However, real-life scenarios often require fine-grained object detection, where class distinctions are subtle and not easily discernible. Method: The paper proposes a method called MSHC, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss. Result: The paper presents results showing that the proposed MSHC method performs better than current ZSD models on the newly introduced FGZSD-Birds dataset. Conclusion: The paper concludes that the proposed MSHC method outperforms existing ZSD models in the FG-ZSD task. Abstract: Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works are mainly coarse-grained object detection, where the classes are visually quite different, thus are relatively easy to distinguish. However, in real life we often have to face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished. For example, detecting different kinds of birds, fishes, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera and 1432 species. 
Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.

[208] Test-Time Canonicalization by Foundation Models for Robust Perception

Utkarsh Singhal,Ryan Feng,Stella X. Yu,Atul Prakash

Main category: cs.CV

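TL;DR: This paper proposes FOCAL, which improves the robustness of vision models to a variety of transformations by optimizing candidate transformations at test time, without modifying the model architecture or re-training.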

Details Motivation: Real-world visual perception requires invariance to diverse transformations, but existing methods rely on specialized architectures or training on predefined augmentations, which limits generalization. Method: FOCAL is a test-time, data-driven framework that leverages large-scale visual priors from foundation models to generate and optimize transformations toward "canonical" views, thereby improving perceptual robustness. Result: Experiments show improved robustness of CLIP and SAM to challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations, and highlight potential applications in active vision. Conclusion: FOCAL offers a way to improve the robustness of vision models without re-training or architectural changes, challenging the conventional assumption that transform-specific training is necessary. Abstract: Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, "canonical" views, FOCAL enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance. Our code is available at: https://github.com/sutkarsh/focal.

[209] Improving Remote Sensing Classification using Topological Data Analysis and Convolutional Neural Networks

Aaryam Sharma

Main category: cs.CV

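TL;DR: This paper presents what the authors describe as the first application of topological data analysis to satellite scene classification, improving classification accuracy by combining TDA features with a ResNet18 model.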

Details Motivation: Convolutional neural networks (CNNs) are biased towards texture-based local features, whereas TDA is robust and can effectively describe complex datasets; the authors therefore aim to improve model performance by introducing TDA features. Method: A feature engineering pipeline is proposed that combines topological data analysis (TDA) features with deep learning models, validated experimentally with a ResNet18 model. Result: The method improves the accuracy of a ResNet18 model by 1.44% on the EuroSAT dataset, reaching 99.33%, and by 1.82% on the RESISC45 dataset. Conclusion: TDA features can be integrated with deep learning models and improve performance even on datasets without explicit topological structure. Abstract: Topological data analysis (TDA) is a relatively new field that is gaining rapid adoption due to its robustness and ability to effectively describe complex datasets by quantifying geometric information. In imaging contexts, TDA typically models data as filtered cubical complexes from which we can extract discriminative features using persistent homology. Meanwhile, convolutional neural networks (CNNs) have been shown to be biased towards texture based local features. To address this limitation, we propose a TDA feature engineering pipeline and a simple method to integrate topological features with deep learning models on remote sensing classification. Our method improves the performance of a ResNet18 model on the EuroSAT dataset by 1.44% achieving 99.33% accuracy, which surpasses all previously reported single-model accuracies, including those with larger architectures, such as ResNet50 (2x larger) and XL Vision Transformers (197x larger). We additionally show that our method's accuracy is 1.82% higher than our ResNet18 baseline on the RESISC45 dataset. To our knowledge, this is the first application of TDA features in satellite scene classification with deep learning. This demonstrates that TDA features can be integrated with deep learning models, even on datasets without explicit topological structures, thereby increasing the applicability of TDA. A clean implementation of our method will be made publicly available upon publication.

[210] Numerically Computing Galois Groups of Minimal Problems

Timothy Duff

Main category: cs.CV

TL;DR: The paper discusses efforts over the past five years to measure the complexity and develop practical methods for solving parametric systems of algebraic equations, which are important in algebra, numerical computation, and computer vision.

Details Motivation: The motivation stems from the need to solve multiple instances of a parametric family of systems of algebraic equations, which is relevant both to ISSAC attendees and the computer vision community using robust model-fitting paradigms like RanSaC. Method: The method involves an overview of work done in the last five years focusing on measuring the intrinsic difficulty of solving parametric systems of algebraic equations. Result: The result is progress towards practical solutions for solving such parametric systems, alongside understanding their intrinsic difficulty. Conclusion: This paper concludes that the intrinsic difficulty of solving parametric systems can be measured, and practical solutions can be developed through work done over the last five years. Abstract: I discuss a seemingly unlikely confluence of topics in algebra, numerical computation, and computer vision. The motivating problem is that of solving multiple instances of a parametric family of systems of algebraic (polynomial or rational function) equations. No doubt already of interest to ISSAC attendees, this problem arises in the context of robust model-fitting paradigms currently utilized by the computer vision community (namely "Random Sampling and Consensus", aka "RanSaC"). This talk will give an overview of work in the last 5+ years that aspires to measure the intrinsic difficulty of solving such parametric systems, and makes strides towards practical solutions.

[211] Text-Visual Semantic Constrained AI-Generated Image Quality Assessment

Qiang Li,Qingsen Yan,Haojian Huang,Peng Wu,Haokui Zhang,Yanning Zhang

Main category: cs.CV

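TL;DR: This paper proposes SC-AGIQA, an AI-generated image quality assessment framework based on text-visual semantic constraints; by introducing two core modules, TSAM and FFDPM, it significantly improves the accuracy and comprehensiveness of AGI quality assessment.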

Details Motivation: With the rapid development of AI-generated image (AGI) technology, accurately assessing image quality has become increasingly important. However, existing methods face two major challenges when applied to AGIs, semantic misalignment and missing detail perception, so a more precise and comprehensive quality assessment method is needed. Method: A unified framework called SC-AGIQA is proposed, with two core modules: TSAM (Text-assisted Semantic Alignment Module) and FFDPM (Frequency-domain Fine-Grained Degradation Perception Module). TSAM uses a multimodal large language model to generate an image description and compares it with the original prompt to refine the consistency check; FFDPM uses frequency-domain analysis and perceptual sensitivity weighting to better capture subtle visual distortions and details. Result: Experiments on multiple benchmark datasets show that SC-AGIQA outperforms current state-of-the-art methods in both text-image consistency and visual quality assessment, capturing subtle distortions and detail changes in images more accurately. Conclusion: SC-AGIQA provides a more comprehensive and accurate quality assessment method for AI-generated images; by combining textual and visual semantic constraints, it addresses the shortcomings of existing methods in semantic alignment and detail perception. Abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of their quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing detail perception. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images.
Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.

[212] 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos

Shanshan Zhong,Jiawei Peng,Zehan Zheng,Zhongzhan Huang,Wufei Ma,Guofeng Zhang,Qihao Liu,Alan Yuille,Jieneng Chen

Main category: cs.CV

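TL;DR: This paper proposes 4D-Animal, a new framework for reconstructing animatable 3D animals from videos without manually annotated keypoints, achieving efficient, stable, and accurate results.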

Details Motivation: Existing methods fit parametric models using sparse semantic keypoints, which are costly to obtain and unreliable because keypoint detectors are trained on limited animal data. Method: A dense feature network maps 2D representations to SMAL parameters, and a hierarchical alignment strategy integrates silhouette, part-level, pixel-level, and temporal cues. Result: 4D-Animal achieves efficient and stable fitting across frames and produces high-quality 3D assets suitable for large-scale applications. Conclusion: 4D-Animal is a new framework that reconstructs animatable 3D animals from videos without sparse keypoint annotations, and experiments show it outperforms both model-based and model-free baselines. Abstract: Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at https://github.com/zhongshsh/4D-Animal.

[213] CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding

Hongyong Han,Wei Wang,Gaowei Zhang,Mingjie Li,Yi Wang

Main category: cs.CV

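TL;DR: This paper introduces CoralVQA, the first large-scale visual question answering dataset for coral reef analysis, intended to advance vision-language models for coral conservation.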

Details Motivation: Coral reefs are vital yet vulnerable ecosystems that need continuous monitoring to support conservation, and interpreting coral reef images requires domain expertise; a dedicated dataset is therefore needed to address domain-specific annotations and multidimensional questions. Method: In collaboration with marine biologists, a semi-automatic data construction pipeline was developed to build CoralVQA, which contains 12,805 real-world coral images and 277,653 question-answer pairs. Result: Evaluating several state-of-the-art LVLMs reveals key limitations and opportunities, laying a foundation for future LVLM development, particularly in support of coral conservation. Conclusion: CoralVQA provides a comprehensive benchmark for studying vision-language reasoning on coral reef images and reveals the limitations and opportunities of LVLMs for coral conservation. Abstract: Coral reefs are vital yet vulnerable ecosystems that require continuous monitoring to support conservation. While coral reef images provide essential information in coral monitoring, interpreting such images remains challenging due to the need for domain expertise. Visual Question Answering (VQA), powered by Large Vision-Language Models (LVLMs), has great potential in user-friendly interaction with coral reef images. However, applying VQA to coral imagery demands a dedicated dataset that addresses two key challenges: domain-specific annotations and multidimensional questions. In this work, we introduce CoralVQA, the first large-scale VQA dataset for coral reef analysis. It contains 12,805 real-world coral images from 67 coral genera collected from 3 oceans, along with 277,653 question-answer pairs that comprehensively assess ecological and health-related conditions. To construct this dataset, we develop a semi-automatic data construction pipeline in collaboration with marine biologists to ensure both scalability and professional-grade data quality. CoralVQA presents novel challenges and provides a comprehensive benchmark for studying vision-language reasoning in the context of coral reef images. By evaluating several state-of-the-art LVLMs, we reveal key limitations and opportunities. These insights form a foundation for future LVLM development, with a particular emphasis on supporting coral conservation efforts.

[214] RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening

Tao Tang,Chengxu Yang

Main category: cs.CV

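TL;DR: This paper proposes RAPNet, which addresses pansharpening with content-adaptive convolution, improving the precision of spatial detail extraction and overall performance.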

Details Motivation: Although CNNs are effective for pansharpening, they are constrained by applying convolutional kernels uniformly across all spatial positions, overlooking local content variations. Method: The RAPNet architecture uses content-adaptive convolution: RAPConv produces spatially adaptive kernels that respond to local feature context, and the attention-based PAN-DFF module balances spatial detail enhancement with spectral fidelity. Result: Comprehensive evaluations confirm that RAPNet outperforms existing methods on public datasets, as shown by quantitative metrics and qualitative assessments; ablation analyses further validate the proposed adaptive components. Conclusion: RAPNet addresses the tendency of conventional CNNs to overlook local content variations in pansharpening; with the RAPConv and PAN-DFF modules it improves the precision of spatial detail extraction, and comprehensive evaluation shows it outperforms existing methods. Abstract: Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.

[215] RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction

Zhicun Yin,Junjie Chen,Ming Liu,Zhixin Wang,Fan Li,Renjing Pei,Xiaoming Li,Rynson W. H. Lau,Wangmeng Zuo

Main category: cs.CV

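TL;DR: This paper proposes RefSTAR, a new blind facial image restoration method that effectively incorporates features from high-quality reference images, achieving superior identity preservation and feature transfer quality.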

Details Motivation: Existing blind facial image restoration methods struggle with identity preservation, mainly because of improper feature introduction on detailed textures. This work therefore aims to incorporate appropriate features from high-quality reference images effectively. Method: The paper designs a reference selection (RefSel) module, builds the RefSel-HQ dataset with 10,000 ground truth-reference pairs, proposes a feature fusion paradigm that forces reference image features to be integrated, and introduces a mask-based cycle consistency loss. Result: Experiments show superior performance across various backbone models, with better identity preservation and reference feature transfer quality. Source code, dataset, and pre-trained models are publicly available. Conclusion: The paper proposes RefSTAR, a new blind facial image restoration method that uses reference selection, feature transfer, and reconstruction to incorporate features from high-quality reference images, significantly improving identity preservation and feature transfer quality in the restored output. Abstract: Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotating masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at https://github.com/yinzhicun/RefSTAR.

[216] GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space

David G. Shatwell,Ishan Rajendrakumar Dave,Sirnam Swetha,Mubarak Shah

Main category: cs.CV

TL;DR: The paper introduces GT-Loc, a novel retrieval-based method that effectively jointly predicts the capture time and geo-location of images, surpassing previous timestamp prediction methods and achieving competitive results in geo-localization.

Details Motivation: Timestamp prediction is important for various applications but visual cues for prediction significantly depend on geographic context, linking it to geo-localization. The interdependence necessitates a method to jointly predict capture time and location. Method: The method involves employing separate encoders for images, time, and location while aligning their embeddings within a shared high-dimensional feature space. A temporal metric-learning objective is proposed to provide soft targets by modeling pairwise time differences over a cyclical toroidal surface. Result: GT-Loc outperforms previous time prediction methods according to new benchmarks, even surpassing those using ground-truth geo-location as input. It also achieves competitive results on standard geo-localization tasks and allows for compositional and text-based image retrieval. Conclusion: The paper concludes that GT-Loc, a novel retrieval-based method, successfully jointly predicts the capture time and geo-location of an image surpassing previous methods in timestamp prediction and achieving competitive results in geo-localization. Abstract: Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. 
Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
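
The abstract does not give the exact form of the temporal objective, so the following is only a minimal sketch of the general idea: treat hour-of-day and month as two circular axes (a flat torus), measure wrap-around distance between capture times, and turn distances into soft targets with a softmax. The function names, the axis normalization, and the `temperature` value are illustrative assumptions, not GT-Loc's actual formulation.

```python
import math


def cyclic_diff(a, b, period):
    """Shortest distance between a and b on a circle with the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


def torus_distance(t1, t2):
    """Distance between two (hour, month) pairs on a flat torus.

    Assumption: each axis is normalized by its period, so every per-axis
    difference lies in [0, 0.5] before the two are combined.
    """
    dh = cyclic_diff(t1[0], t2[0], 24.0) / 24.0
    dm = cyclic_diff(t1[1], t2[1], 12.0) / 12.0
    return math.sqrt(dh * dh + dm * dm)


def soft_targets(anchor, candidates, temperature=0.1):
    """Softmax over negative torus distances: closer times get more weight."""
    logits = [-torus_distance(anchor, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# 23:00 in December is close to 01:00 in January once both axes wrap around.
weights = soft_targets((23, 12), [(1, 1), (12, 6)])
```

Unlike hard positives and negatives, the resulting weights degrade smoothly with temporal distance, so an image taken at 23:00 is not treated as maximally different from one taken at 01:00.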

[217] Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation

Seyed Alireza Rahimi Azghadi,Truong-Thanh-Hung Nguyen,Helene Fournier,Monica Wachowicz,Rene Richard,Francis Palma,Hung Cao

Main category: cs.CV

TL;DR: This paper proposes a multi-system framework for fall detection in elderly individuals that combines a semi-supervised federated learning-based system, indoor localization and navigation, and a vision-based recognition system to achieve high reliability and privacy preservation.

Details Motivation: The rapid growth of the aging population has increased the risk of falls among older adults. Timely detection of falls can significantly reduce medical costs and recovery time, while addressing privacy concerns. Method: A framework combining a semi-supervised federated learning-based fall detection system, an indoor localization and navigation system, and a vision-based human fall recognition system was developed. Result: The individual systems achieved high accuracy rates: SF2D had a 99.19% accuracy, the vision-based detection achieved 96.3% accuracy, and the navigation system had a 95% success rate. When combined, the overall accuracy of the framework reached 99.99%. Conclusion: The proposed framework is both safe for older adults and a privacy-preserving solution for detecting falls. Abstract: The aging population is growing rapidly, and so is the danger of falls in older adults. A major cause of injury is falling, and detection in time can greatly save medical expenses and recovery time. However, to provide timely intervention and avoid unnecessary alarms, detection systems must be effective and reliable while addressing privacy concerns regarding the user. In this work, we propose a framework for detecting falls using several complementary systems: a semi-supervised federated learning-based fall detection system (SF2D), an indoor localization and navigation system, and a vision-based human fall recognition system. A wearable device and an edge device identify a fall scenario in the first system. On top of that, the second system uses an indoor localization technique first to localize the fall location and then navigate a robot to inspect the scenario. A vision-based detection system running on an edge device with a mounted camera on a robot is used to recognize fallen people. Each of the systems of this proposed framework achieves different accuracy rates. 
Specifically, the SF2D has a 0.81% failure rate equivalent to 99.19% accuracy, while the vision-based fallen people detection achieves 96.3% accuracy. However, when we combine the accuracy of these two systems with the accuracy of the navigation system (95% success rate), our proposed framework creates a highly reliable performance for fall detection, with an overall accuracy of 99.99%. Not only is the proposed framework safe for older adults, but it is also a privacy-preserving solution for detecting falls.

[218] The Power of Certainty: How Confident Models Lead to Better Segmentation

Tugberk Erol,Tuba Caglikantar,Duygu Sarikaya

Main category: cs.CV

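TL;DR: A confidence-based self-distillation method is proposed that improves polyp segmentation in colonoscopy without increasing compute or memory requirements.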

Details Motivation: Existing deep learning models perform well at automatic polyp detection and precise segmentation, but their complexity makes them prone to overfitting, particularly when training datasets are biased, leading to poor generalization across datasets. Method: A confidence-based self-distillation method is proposed that uses only data stored from the previous iteration during training and needs no extra computation or memory. It computes the loss between the previous and current iterations within a batch using a dynamic confidence coefficient. Result: Comprehensive experiments on datasets collected from multiple clinical centers show that the proposed method outperforms existing state-of-the-art models and generalizes well across datasets. Conclusion: The proposed confidence-based self-distillation method is an effective strategy for mitigating the limitations of large, over-parameterized models while keeping resource requirements low in practice. Abstract: Deep learning models have been proposed for automatic polyp detection and precise segmentation of polyps during colonoscopy procedures. Although these state-of-the-art models achieve high performance, they often require a large number of parameters. Their complexity can make them prone to overfitting, particularly when trained on biased datasets, and can result in poor generalization across diverse datasets. Knowledge distillation and self-distillation are proposed as promising strategies to mitigate the limitations of large, over-parameterized models. These approaches, however, are resource-intensive, often requiring multiple models and significant memory during training. We propose a confidence-based self-distillation approach that outperforms state-of-the-art models by utilizing only previous iteration data storage during training, without requiring extra computation or memory usage during testing. Our approach calculates the loss between the previous and current iterations within a batch using a dynamic confidence coefficient. To evaluate the effectiveness of our approach, we conduct comprehensive experiments on the task of polyp segmentation. Our approach outperforms state-of-the-art models and generalizes well across datasets collected from multiple clinical centers. The code will be released to the public once the paper is accepted.

[219] BenchReAD: A systematic benchmark for retinal anomaly detection

Chenyu Lian,Hong-Yu Zhou,Zhanli Hu,Jing Qin

Main category: cs.CV

TL;DR: This paper introduces a comprehensive benchmark for retinal anomaly detection and proposes NFM-DRA, a novel method combining DRA with a Normal Feature Memory, achieving state-of-the-art results.

Details Motivation: The motivation is to address the lack of a comprehensive and publicly available benchmark for retinal anomaly detection, which has limited progress in the field. Additionally, existing benchmarks focus on one-class supervised approaches while overlooking abundant labeled abnormal and unlabeled data in clinical practice. Method: The authors propose NFM-DRA, which combines disentangled representations of abnormalities (DRA) with a Normal Feature Memory to improve performance on retinal anomaly detection. They evaluate the method using their newly introduced comprehensive benchmark. Result: A fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieved the best performance but showed significant performance drops for unseen anomalies. The proposed NFM-DRA successfully mitigated this issue and set a new state-of-the-art result. Conclusion: The study concludes that NFM-DRA, integrating DRA with a Normal Feature Memory, mitigates performance degradation and establishes a new SOTA for retinal anomaly detection. Abstract: Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. 
To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

[220] Cameras as Relative Positional Encoding

Ruilong Li,Brent Yi,Junchen Liu,Hang Gao,Yi Ma,Angjoo Kanazawa

Main category: cs.CV

TL;DR: This paper introduces PRoPE, a new method for conditioning transformers on camera geometry, which improves 3D perception across multiple tasks and model sizes.

Details Motivation: To improve 3D perception in multi-view computer vision tasks by leveraging geometric relationships between viewpoints through improved camera conditioning techniques. Method: Comparison of techniques for conditioning transformers on cameras, including token-level raymap encodings, attention-level relative pose encodings, and PRoPE. Result: Relative camera conditioning improves performance in novel view synthesis, with additional gains from PRoPE, and benefits extend to other tasks like stereo depth estimation. Conclusion: The proposed PRoPE method enhances the performance of multi-view transformers by effectively capturing camera frustums and generalizes well across various settings and tasks. Abstract: Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.

[221] National level satellite-based crop field inventories in smallholder landscapes

Philippe Rufin,Pauline Lucie Hammer,Leon-Friedrich Thomas,Sá Nogueira Lisboa,Natasha Ribeiro,Almeida Sitoe,Patrick Hostert,Patrick Meyfroidt

Main category: cs.CV

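TL;DR: Using high-resolution remote sensing data and deep transfer learning, this study delineates crop field boundaries in Mozambique's complex agricultural systems at national scale for the first time, revealing that fields are generally small and characterizing their distribution.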

Details Motivation: Designing science-based policies to improve the sustainability of smallholder agriculture requires a better understanding of fundamental system properties, such as the spatial distribution of active cropland and field size. Method: Very high spatial resolution (1.5 m) Earth observation data are integrated with deep transfer learning to delineate crop field boundaries across Mozambique at national scale. Result: The study provides a national-level dataset of 21 million individual fields for Mozambique, separates active cropland from non-agricultural land use with an overall accuracy of 93%, and finds that fields are generally small, with half of them smaller than 0.16 ha. Conclusion: The results show that field size is a key indicator relating to the socio-economic and environmental outcomes of agriculture and their trade-offs. Abstract: The design of science-based policies to improve the sustainability of smallholder agriculture is challenged by a limited understanding of fundamental system properties, such as the spatial distribution of active cropland and field size. We integrate very high spatial resolution (1.5 m) Earth observation data and deep transfer learning to derive crop field delineations in complex agricultural systems at the national scale, while maintaining minimum reference data requirements and enhancing transferability. We provide the first national-level dataset of 21 million individual fields for Mozambique (covering ~800,000 km²) for 2023. Our maps separate active cropland from non-agricultural land use with an overall accuracy of 93% and balanced omission and commission errors. Field-level spatial agreement reached median intersection over union (IoU) scores of 0.81, advancing the state-of-the-art in large-area field delineation in complex smallholder systems. The active cropland maps capture fragmented rural regions with low cropland shares not yet identified in global land cover or cropland maps. These regions are mostly located in agricultural frontier regions which host 7-9% of the Mozambican population. Field size in Mozambique is very low overall, with half of the fields being smaller than 0.16 ha, and 83% smaller than 0.5 ha. Mean field size at aggregate spatial resolution (0.05°) is 0.32 ha, but it varies strongly across gradients of accessibility, population density, and net forest cover change. This variation reflects a diverse set of actors, ranging from semi-subsistence smallholder farms to medium-scale commercial farming, and large-scale farming operations.
Our results highlight that field size is a key indicator relating to socio-economic and environmental outcomes of agriculture (e.g., food production, livelihoods, deforestation, biodiversity), as well as their trade-offs.
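
The field-level agreement above is reported as a median intersection over union (IoU). As a rough illustration of that metric only, the sketch below scores predicted fields against reference fields using axis-aligned boxes as a simplifying stand-in for real field polygons; the box representation, the one-to-one pairing of predictions with references, and the function names are assumptions made here, not the study's evaluation protocol.

```python
from statistics import median


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    overlap_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    overlap_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = overlap_w * overlap_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


def median_iou(predicted, reference):
    """Median IoU over already-matched (predicted, reference) field pairs."""
    return median(box_iou(p, r) for p, r in zip(predicted, reference))


# Three hypothetical fields: a perfect match, a half overlap, and a miss.
predicted = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 1, 1)]
reference = [(0, 0, 2, 2), (1, 0, 3, 2), (5, 5, 6, 6)]
score = median_iou(predicted, reference)
```

Reporting the median rather than the mean keeps a handful of badly delineated fields from dominating the summary score.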

[222] Quantize-then-Rectify: Efficient VQ-VAE Training

Borui Zhang,Qihang Rao,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

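TL;DR: This paper introduces ReVQ, a method that uses a pre-trained VAE and quantization noise control to train VQ-VAEs efficiently, substantially reducing computational cost while maintaining high-quality reconstruction.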

Details Motivation: Training high-compression-rate VQ-VAEs is computationally demanding, requiring thousands of GPU hours, so a more efficient method is needed. Method: A pre-trained VAE is converted into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold; channel multi-group quantization enlarges codebook capacity and a post rectifier mitigates quantization errors. Result: ReVQ completes training on a single NVIDIA 4090 in about 22 hours and compresses ImageNet images into at most 512 tokens while maintaining competitive reconstruction quality (rFID = 1.06). Conclusion: ReVQ substantially reduces computational cost while preserving reconstruction quality, achieving a superior efficiency-reconstruction trade-off. Abstract: Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
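
The abstract does not spell out how channel multi-group quantization works, so the sketch below shows only the generic idea such schemes build on: split a latent vector's channels into groups and quantize each group against its own small codebook. With G groups of K codes each, up to K^G distinct reconstructions become addressable while only G x K code vectors are stored. The function names and toy codebooks are hypothetical and are not taken from the ReVQ code.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def group_quantize(latent, codebooks):
    """Quantize a latent vector group by group.

    latent: flat list of channel values.
    codebooks: one codebook per group; each code has len(latent) / groups values.
    Returns the chosen code index for every group and the quantized vector.
    """
    n_groups = len(codebooks)
    if len(latent) % n_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(latent) // n_groups
    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        chunk = latent[g * size:(g + 1) * size]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Toy example: 4 channels, 2 groups, 2 codes per group (4 combinations in total).
codebooks = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
indices, quantized = group_quantize([0.1, 0.9, 2.2, -0.1], codebooks)
```

The difference between the input and `quantized` is the quantization error that a later correction stage, such as the post rectifier named in the abstract, would aim to reduce.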

[223] Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder

Vladimir Iashin,Horace Lee,Dan Schofield,Andrew Zisserman

Main category: cs.CV

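TL;DR: This study proposes a self-supervised method that learns chimpanzee face features from camera-trap videos without identity labels, improving open-set re-identification performance.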

Details Motivation: Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, manual identification of individual animals remains a major bottleneck. Method: Using the DINOv2 framework, Vision Transformers are trained on automatically mined face crops, giving a fully self-supervised approach to learning chimpanzee face embeddings. Result: The method shows strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou despite using no labelled data during training. Conclusion: The study highlights the potential of self-supervised learning for biodiversity monitoring and paves the way for scalable, non-invasive population studies. Abstract: Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approach to learning robust chimpanzee face embeddings from unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision Transformers on automatically mined face crops, eliminating the need for identity labels. Our method demonstrates strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou, despite utilising no labelled data during training. This work underscores the potential of self-supervised learning in biodiversity monitoring and paves the way for scalable, non-invasive population studies.