cs.CL

[1] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

Tian Lan,Xiangdong Su,Xu Liu,Ruirui Wang,Ke Chang,Jiang Li,Guanglai Gao

Main category: cs.CL

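TL;DR: This work proposes McBE, a multi-task bias evaluation benchmark grounded in the Chinese language and culture, and uses it to show that bias is widespread in mainstream LLMs.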

Details Motivation: Existing LLM bias evaluation datasets mainly target English and North American culture and lack comprehensive data for other cultures; data grounded in the Chinese language and culture is especially scarce. In addition, these datasets usually support only a single evaluation task, which makes multi-dimensional bias evaluation difficult. Method: The authors build a Multi-task Chinese Bias Evaluation Benchmark (McBE) of 4,077 instances that covers 12 bias categories and 82 subcategories and introduces 5 evaluation tasks; they also comprehensively evaluate and analyze several mainstream LLMs. Result: All evaluated LLMs exhibit varying degrees of bias; McBE's multi-dimensional evaluation reveals the characteristics of bias in LLMs and offers new research insights. Conclusion: McBE addresses the limitations of current bias evaluation datasets in the Chinese context and provides a systematic benchmark for multi-task, fine-grained analysis of LLM bias. Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

[2] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

Keyan Jin,Yapeng Wang,Leonel Santos,Tao Fang,Xu Yang,Sio Kei Im,Hugo Gonçalo Oliveira

Main category: cs.CL

TL;DR: This paper evaluates reasoning and non-reasoning large language models for dialogue summarization, revealing that explicit reasoning doesn't always enhance performance and can sometimes be detrimental.

Details Motivation: To explore the effectiveness of Long Chain-of-Thought LLMs in dialogue summarization, a task requiring both abstraction and conciseness. Method: The study evaluated state-of-the-art reasoning and non-reasoning LLMs across multiple dialogue summarization paradigms using established benchmarks and evaluation protocols. Result: Reasoning LLMs often produce less concise and factually inconsistent summaries compared to non-reasoning models in dialogue contexts. Conclusion: Explicit step-by-step reasoning in LLMs does not consistently improve dialogue summarization and may lead to verbosity and factual inconsistencies. Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts.
Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

[3] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Wenquan Lu,Yuechuan Yang,Kyle Lee,Yanshu Li,Enqi Liu

Main category: cs.CL

TL;DR: This paper investigates if Huginn-3.5B, a depth-recurrent Transformer, develops internal reasoning structures akin to latent Chain-of-thought, finding limited evidence and minimal benefits from deeper recurrence.

Details Motivation: The motivation is to explore whether reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer, that can internalize reasoning in latent space without increasing parameter count, potentially supporting latent Chain-of-thought (CoT). Method: The paper uses probing techniques such as Logit Lens and Coda Lens to examine the internal behavior of Huginn-3.5B on arithmetic tasks, analyzing rank trajectories of tokens and probing inconsistencies across recurrent blocks. Result: The findings show limited interpretable latent CoT in Huginn-3.5B, significant probing inconsistencies across recurrent blocks, and marginal gains from increasing recurrence depth. Conclusion: The study concludes that Huginn-3.5B shows limited evidence of interpretable latent CoT and increasing recurrence depth offers only marginal gains compared to models that externalize reasoning steps. Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. 
Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
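The rank-trajectory analysis described above can be illustrated with a small sketch. This is not the authors' code: it assumes a simplified Logit Lens in which each intermediate hidden state is multiplied directly by an unembedding matrix (real implementations usually apply the model's final normalization first), and the names `token_rank`, `rank_trajectory`, and the toy matrices are hypothetical.

```python
import numpy as np


def token_rank(hidden_state, unembed, token_id):
    """Rank of token_id (0 = most likely) after projecting to vocabulary logits."""
    logits = unembed @ hidden_state  # shape: (vocab_size,)
    # Count how many tokens score strictly higher than the target token.
    return int(np.sum(logits > logits[token_id]))


def rank_trajectory(hidden_states, unembed, token_id):
    """Rank of token_id at every recurrence step (one hidden state per step)."""
    return [token_rank(h, unembed, token_id) for h in hidden_states]


# Toy setup (assumption): 4-token vocabulary, hidden size 4, identity unembedding.
unembed = np.eye(4)
hidden_states = [
    np.array([3.0, 2.0, 1.0, 0.0]),  # step 0: token 3 is ranked last
    np.array([1.0, 3.0, 0.0, 2.0]),  # step 1: token 3 moves up to rank 1
    np.array([0.0, 1.0, 2.0, 3.0]),  # step 2: token 3 is the top prediction
]
trajectory = rank_trajectory(hidden_states, unembed, token_id=3)
print(trajectory)  # [3, 1, 0]
```

A trajectory that falls steadily toward 0 would suggest the answer token is being resolved over successive recurrence steps; the paper reports only limited evidence of such behavior.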

[4] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Steven Song,Anirudh Subramanyam,Zhenyu Zhang,Aarti Venkat,Robert L. Grossman

Main category: cs.CL

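TL;DR: This paper introduces GDC Cohort Copilot, a new tool that automatically generates cancer genomics data cohorts from a user's natural language description and outperforms GPT-4o prompting.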

Details Motivation: GDC users may struggle to find specific cohort descriptors among hundreds of fields and properties, so the authors propose a natural language processing approach to simplify the process. Method: The authors develop and evaluate multiple large language models (LLMs) that translate a user's natural language description into GDC cohort filters, and build an open-source tool, GDC Cohort Copilot. Result: GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user's natural language description and lets users adjust it through an interactive interface. In addition, the locally served, open-source GDC Cohort LLM outperforms GPT-4o. Conclusion: GDC Cohort Copilot offers a new way to generate cancer genomics data cohorts from natural language descriptions, lets users refine the cohort through an interactive interface, and outperforms GPT-4o. Abstract: Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.

[5] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu,Tinghong Chen,Jiangtao Feng,Jiangjie Chen,Weinan Dai,Qiying Yu,Ya-Qin Zhang,Wei-Ying Ma,Jingjing Liu,Mingxuan Wang,Hao Zhou

Main category: cs.CL

TL;DR: MemAgent is a new agent workflow that processes long text efficiently by reading it in segments and updating its memory with an overwrite strategy.

Details Motivation: Despite improvements in length extrapolation, efficient attention, and memory modules, handling infinitely long documents with linear complexity and without performance degradation remains the ultimate challenge in long-text processing. Method: The paper introduces a new agent workflow, MemAgent, which directly optimizes for long-text tasks by reading text in segments and updating its memory with an overwrite strategy. It also extends the DAPO algorithm to train via independent-context multi-conversation generation. Result: MemAgent extrapolates from an 8K context to a 3.5M QA task with less than 5% performance loss and exceeds 95% on the 512K RULER test. Conclusion: MemAgent shows strong long-context capability, scaling from an 8K context to a 3.5M QA task with less than 5% performance loss and exceeding 95% on the 512K RULER test. Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, extrapolating from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieving 95%+ on the 512K RULER test.
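The segment-by-segment reading with an overwritten memory can be sketched as a simple loop. This is an illustration only, not MemAgent's implementation: `update_memory` is a hypothetical stand-in for the LLM call that rewrites the memory, and `keep_last_number` is a toy rule used so the example runs without a model.

```python
from typing import Callable, List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Split a token list into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def read_with_memory(
    tokens: List[str],
    segment_len: int,
    update_memory: Callable[[str, List[str]], str],
    memory: str = "",
) -> str:
    """Read the document one segment at a time, overwriting the memory each step.

    Each step sees only the current memory and one segment, so the per-step
    context stays bounded and the total cost grows linearly with document length.
    """
    for segment in split_into_segments(tokens, segment_len):
        memory = update_memory(memory, segment)  # the old memory is replaced
    return memory


def keep_last_number(memory: str, segment: List[str]) -> str:
    """Toy updater (assumption): remember the most recent number seen so far."""
    numbers = [tok for tok in segment if tok.isdigit()]
    return numbers[-1] if numbers else memory


document = "the code is 42 and later the code is 7 end".split()
final_memory = read_with_memory(document, segment_len=3, update_memory=keep_last_number)
print(final_memory)  # 7
```

In the workflow the abstract describes, the updater would be the language model itself, trained with reinforcement learning to decide what to keep in the fixed-size memory.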

[6] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Dohoon Kim,Donghun Kang,Taesup Moon

Main category: cs.CL

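TL;DR: This paper proposes DoMIX, a novel domain-adaptive pre-training method built on LoRA modules that addresses the high computational cost, data-order sensitivity, and reliance on a single generalized model of existing continual DAP methods.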

Details Motivation: Existing continual domain-adaptive pre-training methods suffer from high computational cost, sensitivity to the order of incremental data, and reliance on a single generalized model for all end tasks; this paper aims to overcome these challenges. Method: The paper proposes DoMIX, a new domain-adaptive pre-training method built on LoRA modules (a parameter-efficient fine-tuning method), to address the limitations of existing continual DAP methods in computational cost, GPU memory usage, and sensitivity to incremental data order. Result: Experiments show that DoMIX performs well in the DAP setting and also extends to standard LLM fine-tuning scenarios; the code is publicly available for reproduction and further research. Conclusion: By leveraging LoRA modules, DoMIX offers a novel way to address the limitations of continual DAP methods, letting pre-trained models adapt to different domains efficiently and in parallel and providing tailored pre-trained models for specific tasks. Abstract: Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.

[7] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Christian Jaumann,Annemarie Friedrich,Rainer Lienhart

Main category: cs.CL

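TL;DR: This paper presents a system built on multimodal large language models and few-shot example retrieval strategies for the SciVQA 2025 Scientific Visual Question Answering shared task, where it placed third.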

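Details Motivation: To address the challenges of scientific visual question answering and improve answer accuracy, the authors designed a dedicated system for the shared task. Method: The system uses an ensemble of two multimodal large language models combined with several few-shot example retrieval strategies. The model and few-shot setting are selected according to the figure and question type, and answers are selected based on the models' confidence levels. Result: On the blind test data, the system achieves an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS, ranking third out of seven participating systems. Conclusion: By combining an ensemble of multimodal large language models with few-shot learning strategies, the system performs well on scientific visual question answering and shows good application potential. Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.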
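Confidence-informed answer selection can be sketched as follows. The abstract does not say how confidence is computed or combined, so this is a minimal illustration under the assumption that each model returns one answer with a scalar confidence and the higher-confidence answer wins; `Prediction`, `select_answer`, and the model names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    model: str         # which ensemble member produced the answer
    answer: str        # the candidate answer text
    confidence: float  # assumed scalar score in [0, 1], e.g. a mean token probability


def select_answer(predictions: List[Prediction]) -> Prediction:
    """Return the prediction with the highest confidence (first one wins on ties)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    return max(predictions, key=lambda p: p.confidence)


candidates = [
    Prediction(model="model_a", answer="12.5", confidence=0.71),
    Prediction(model="model_b", answer="12.7", confidence=0.88),
]
chosen = select_answer(candidates)
print(chosen.model, chosen.answer)  # model_b 12.7
```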

[8] QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

Pilsung Kang

Main category: cs.CL

TL;DR: This paper introduces QFFN-BERT, a hybrid quantum-classical transformer that replaces traditional feedforward networks with parameterized quantum circuits, resulting in improved accuracy and significant reductions in parameters, particularly in the feedforward network portion.

Details Motivation: The motivation behind this work is the dominant parameter contribution of FFNs in standard Transformer encoder blocks, which account for approximately two-thirds of the parameters. The researchers aimed to explore PQCs as a replacement for FFNs to enhance expressibility while reducing parameter count. Method: The researchers introduced QFFN-BERT, a hybrid quantum-classical transformer where the FFN modules of a compact BERT variant are replaced by PQC-based layers. They incorporated a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy into their PQC architecture. Result: Experiments on SST-2 and DBpedia benchmarks showed that a carefully configured QFFN-BERT achieved up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Additionally, the model exhibited a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. Conclusion: Parameterized quantum circuits (PQCs) can serve as powerful and parameter-efficient alternatives to classical feedforward networks (FFNs) when co-designed with foundational deep learning principles. Abstract: Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. 
Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.
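The "approximately two-thirds" figure can be checked with a quick weight count. The sketch below assumes a standard Transformer encoder block with four d_model x d_model attention projections and a two-layer FFN with inner width d_ff = 4 * d_model, and it ignores biases and layer normalization; the BERT-base dimensions are used only as an example and are not the sizes of the compact BERT variant in the paper.

```python
def encoder_block_weight_counts(d_model: int, d_ff: int):
    """Weight-matrix parameter counts for one encoder block (biases ignored)."""
    attention = 4 * d_model * d_model  # query, key, value, and output projections
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    return attention, ffn


# Example dimensions (assumption): BERT-base uses d_model=768 and d_ff=3072.
attention, ffn = encoder_block_weight_counts(d_model=768, d_ff=3072)
ffn_fraction = ffn / (attention + ffn)
print(attention, ffn, round(ffn_fraction, 4))  # 2359296 4718592 0.6667
```

With d_ff = 4 * d_model the ratio is 8 / (4 + 8) = 2/3 regardless of d_model, which is why replacing the FFN removes most of a block's parameters.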

[9] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Weijie Lyu,Sheng-Jun Huang,Xuan Xia

Main category: cs.CL

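TL;DR: This paper proposes an efficient code data selection method that substantially improves model performance and training efficiency while preserving data quality.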

Details Motivation: Existing methods rely mainly on large amounts of data to improve model performance and overlook data quality, which reduces training efficiency. Method: The method optimizes a parametric model so that the selected subset is distribution-consistent and diverse, improving data quality. Result: Experiments show that with only 10K samples the method improves on the 92K full-sample baseline by 2.4% on HumanEval and 2.3% on MBPP, outperforming other sampling approaches. Conclusion: The paper proposes a method that uses a parametric model to optimize code data selection, improving training efficiency and model performance while significantly reducing computational cost. Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

[10] Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Mark Atta Mensah,Isaac Wiafe,Akon Ekpezu,Justice Kwame Appati,Jamal-Deen Abdulai,Akosua Nyarkoa Wiafe-Akenten,Frank Ernest Yeboah,Gifty Odame

Main category: cs.CL

TL;DR: This study evaluates the cross-domain generalization of transformer-based ASR models for the low-resource Akan language and highlights domain dependency, distinct error behaviors between architectures, and the need for improved adaptation strategies.

Details Motivation: Most automatic speech recognition (ASR) research evaluates models using in-domain datasets but rarely assesses how well these models generalize across diverse speech contexts. This study addresses this gap by investigating the cross-domain generalization capabilities of ASR models for the low-resource Akan language. Method: This study benchmarks seven Akan ASR models based on transformer architectures, such as Whisper and Wav2Vec2, using four diverse Akan speech corpora encompassing various domains like culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. It evaluates the models' performance by comparing word error rate and character error rate. Result: The results showed significant domain dependency in ASR model performance, with optimal results only within training domains and marked accuracy degradation in mismatched scenarios. Distinct error behaviors were observed between Whisper and Wav2Vec2 architectures: Whisper produced more fluent but potentially misleading errors, while Wav2Vec2 generated more obvious yet less interpretable outputs in unfamiliar contexts. Conclusion: The study concludes that ASR models, specifically transformer-based ones like Whisper and Wav2Vec2, exhibit domain dependency in their performance, with notable accuracy degradation when applied to mismatched domains. The research highlights the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other low-resource languages (LRLs). Abstract: Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets. However, it seldom evaluates how these models generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance.
These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
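For readers unfamiliar with the two metrics, the sketch below computes word error rate (WER) and character error rate (CER) as edit distance divided by reference length. It is a plain-Python illustration, not the evaluation code used in the study, and the English example strings are made up for demonstration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substitution (sat -> sit) and one deleted word (the): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
cer = character_error_rate("abc", "abd")
print(round(wer, 3), round(cer, 3))  # 0.333 0.333
```

Because both metrics normalize by reference length, they can exceed 1.0 when the hypothesis contains many insertions.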

[11] A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages

Sumaya Ahmed Salihs,Isaac Wiafe,Jamal-Deen Abdulai,Elikem Doe Atsakpo,Gifty Ayoka,Richard Cave,Akon Obu Ekpezu,Catherine Holloway,Katrin Tomanek,Fiifi Baffoe Payin Winful

Main category: cs.CL

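TL;DR: This study proposes an approach for building automatic speech recognition models for impaired speech in low-resource languages, develops the first open-source dataset of impaired speech in Akan, and provides a best-practice guide and tools to advance inclusive speech technology.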

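Details Motivation: To fill the gap in impaired speech recognition for low-resource languages and make speech recognition technology more inclusive and accessible, the researchers developed a generalizable approach for collecting speech samples and building ASR models. Method: As a proof of concept, the study curated the first open-source dataset of impaired speech in Akan and fine-tuned open-source ASR models to better recognize impaired speech in Akan. Result: The study released a resource package containing the impaired speech dataset, a best-practice guide, and open-source tools, all publicly available for developing speech recognition technology that meets the needs of users with impaired speech. It also presents initial results from fine-tuning ASR models on impaired speech. Conclusion: By creating a best-practice "cookbook" and training materials, the study promotes community-driven data collection and automatic speech recognition (ASR) model building, aiming to democratize speech technology and advance recognition of impaired speech in low-resource languages. Abstract: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a "cookbook" of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech-impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.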

Sneha Deshmukh,Prathmesh Kamble

Main category: cs.CL

TL;DR: This paper introduces a new benchmark dataset for Legal NLP in India called IndianBailJudgments-1200, aimed at enhancing legal research and applications through annotated court judgments on bail decisions.

Details Motivation: The motivation behind the paper is the underdevelopment of Legal NLP in regions like India due to the lack of structured datasets, particularly focusing on bail jurisprudence. Method: The method involves creating a benchmark dataset named IndianBailJudgments-1200 with annotations across multiple attributes using a prompt-engineered GPT-4o pipeline, which was then verified for consistency. Result: The result is the creation of the first publicly available dataset focused specifically on Indian bail jurisprudence, consisting of 1200 annotated court judgments. Conclusion: The paper concludes that the introduced dataset will significantly contribute to the development of Legal NLP in India by enabling a wide range of tasks such as outcome prediction, summarization, and fairness analysis. Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

[13] WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li,Zhongwang Zhang,Huifeng Yin,Liwen Zhang,Litu Ou,Jialong Wu,Wenbiao Yin,Baixuan Li,Zhengwei Tao,Xinyu Wang,Weizhou Shen,Junkai Zhang,Dingchu Zhang,Xixi Wu,Yong Jiang,Ming Yan,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

TL;DR: WebSailor introduces a post-training methodology using novel techniques to enhance open-source models' reasoning under uncertainty, closing the performance gap with proprietary systems.

Details Motivation: The motivation is to overcome human cognitive limitations and the lack of sophisticated reasoning patterns in open-source LLMs, especially when navigating highly uncertain and complex information landscapes. Method: WebSailor uses structured sampling, information obfuscation, RFT cold start, and an agentic RL training algorithm called DUPO to generate high-uncertainty tasks and improve reasoning capabilities. Result: WebSailor significantly outperforms existing open-source agents and matches the performance of proprietary systems like DeepResearch on complex information-seeking benchmarks. Conclusion: WebSailor successfully bridges the capability gap between open-source and proprietary agents in complex information-seeking tasks by introducing a methodology that enhances reasoning under uncertainty. Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

[14] Revisiting Active Learning under (Human) Label Variation

Cornelia Gruber,Helen Alber,Bernd Bischl,Göran Kauermann,Barbara Plank,Matthias Aßenmacher

Main category: cs.CL

TL;DR: This paper highlights the importance of considering human label variation (HLV) in active learning and proposes a conceptual framework that integrates HLV-aware strategies throughout the annotation process.

Details Motivation: Label variation is common in real-world annotation tasks, yet many existing annotation and active learning frameworks assume a single ground truth. This oversight limits the effectiveness and realism of supervised learning systems. Method: The authors examine foundational assumptions about labels in supervised learning, survey how active learning and HLV communities have addressed label variation, and propose a conceptual framework for integrating HLV-aware approaches into the active learning loop. They also discuss the use of large language models as annotators. Result: The authors identify key distinctions between signal (HLV) and noise (annotation error) in label variation, show how current active learning methods fall short when HLV is present, and propose a framework that accounts for HLV during instance selection, annotator choice, and label representation. Conclusion: The paper concludes that incorporating human label variation (HLV) into active learning frameworks can better reflect the complexities of real-world annotation, and it proposes a conceptual framework to achieve this integration. Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. 
In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed -- or neglected -- these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.
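One common way to represent human label variation, shown here purely as an illustration and not as the framework proposed in the paper, is to keep the distribution of annotator votes as a soft label and use its entropy as a measure of disagreement. The function names and the toy votes are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def soft_label(annotations: Sequence[str], classes: Sequence[str]) -> List[float]:
    """Turn raw annotator votes into a probability distribution over classes."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]


def entropy_bits(distribution: Sequence[float]) -> float:
    """Shannon entropy in bits; 0 means full agreement among annotators."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)


classes = ["positive", "negative"]
split_vote = soft_label(["positive", "positive", "negative", "negative"], classes)
unanimous = soft_label(["positive", "positive", "positive"], classes)
print(split_vote, entropy_bits(split_vote))  # [0.5, 0.5] 1.0
print(unanimous, entropy_bits(unanimous))    # [1.0, 0.0] 0
```

Entropy alone cannot tell plausible disagreement (signal) from annotation error (noise); separating the two is the decomposition the paper argues for.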

[15] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

Xin Guan,PeiHsin Lin,Zekun Wu,Ze Wang,Ruibo Zhang,Emre Kazim,Adriano Koshiyama

Main category: cs.CL

TL;DR: Multiperspective Fusion (MPF) is a scalable and interpretable post-training alignment framework that effectively mitigates bias in large language models by leveraging multiperspective generations and aligning outputs with nuanced baselines.

Details Motivation: The increasing demand for easy bias mitigation in large language models led to the development of MPF, a novel post-training alignment framework. Method: MPF uses multiperspective generations to decompose baselines into interpretable components, guiding response generation by sampling and balancing weighted responses. Result: MPF successfully aligned LLM sentiment distributions with counterfactual and HR baselines, achieving low KL divergence, reduced calibration error, and generalization to unseen questions. Conclusion: MPF provides a scalable and interpretable method for aligning LLM outputs and mitigating bias, without requiring extensive prompt engineering or fine-tuning. Abstract: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduced calibration error, and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
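The alignment metric mentioned above can be made concrete with a small sketch. It assumes sentiment is binned into a few discrete categories and that the fused output behaves like a probability-weighted mixture of per-perspective distributions; that mixture reading is an assumption for illustration, not a description of the MPF implementation, and all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def mix_perspectives(distributions: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Probability-weighted mixture of per-perspective sentiment distributions."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per perspective")
    n_bins = len(distributions[0])
    return [sum(w * d[i] for d, w in zip(distributions, weights))
            for i in range(n_bins)]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; infinite if q assigns zero mass where p does not."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


# Toy sentiment bins (assumption): [negative, neutral, positive].
perspectives = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
fused = mix_perspectives(perspectives, weights=[0.5, 0.5])
baseline = [0.5, 0.0, 0.5]
print(fused, kl_divergence(baseline, fused))  # [0.5, 0.0, 0.5] 0.0
```

A KL divergence near zero indicates that the fused output distribution closely matches the target baseline.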

[16] Exploring Gender Bias Beyond Occupational Titles

Ahmed Sabir,Rajesh Sharama

Main category: cs.CL

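TL;DR: This paper proposes a new gender-bias analysis framework and an accompanying dataset, reveals gender bias beyond occupational stereotypes, and demonstrates the approach's effectiveness on multiple datasets.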

Details Motivation: To explore the relationship between gender and contextual bias, particularly bias in action verbs, object nouns, and occupations. Method: Introduces a new dataset named GenderLexicon and proposes a framework that can estimate contextual bias and its related gender bias. Result: The model can interpret bias through a score, improving the explainability of gender bias, and the approach is validated on five different datasets, including a Japanese dataset. Conclusion: The study confirms that gender bias exists beyond occupational stereotypes, and introduces a new framework for explainable gender bias together with the GenderLexicon dataset. Abstract: In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.

[17] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Zhijian Xu,Yilun Zhao,Manasi Patwardhan,Lovekesh Vig,Arman Cohan

Main category: cs.CL

TL;DR: This paper introduces LimitGen, a new benchmark for assessing LLMs' ability to identify research paper limitations, showing that these models can be enhanced with literature retrieval to provide better feedback in the peer review process.

Details Motivation: The motivation is to address the challenges of peer review due to the increasing volume of scientific publications and to explore how large language models (LLMs) can assist in identifying paper limitations, which has been understudied. Method: The researchers developed LimitGen, a benchmark for evaluating LLMs' capability to identify paper limitations. It includes a synthetic dataset (LimitGen-Syn) and a real-world dataset (LimitGen-Human). They also enhanced LLM systems by incorporating literature retrieval capabilities. Result: The result demonstrates that LLMs, when augmented with literature retrieval, show improved performance in identifying limitations in research papers, offering support in the early stages of peer review. Conclusion: The study concludes that augmenting LLMs with literature retrieval improves their ability to identify limitations in research papers, thus providing more concrete and constructive feedback alongside human peer review. Abstract: Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. 
To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

[18] Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Peter Viechnicki

Main category: cs.CL

TL;DR: This paper investigates the minimal auditory space difference needed for reliable vowel distinction, termed 'Just Producible Difference,' finding it to be between 14 and 51 mels, with implications for speech production theories and vowel system structures.

Details Motivation: To determine the degree of accuracy in sub-phonemic control mechanisms during vowel production by investigating the minimum distance required in auditory space for reliably different vowel imitations, known as 'Just Producible Difference' (JPD). Method: The study uses a vowel mimicry paradigm to measure the 'Just Producible Difference' (JPD) among two sets of English speakers during front vowel production. Result: The JPD is estimated at between 14 and 51 mels in F1 x F2 auditory space, indicating the minimal distance between stimuli necessary for reliably distinct vowel imitations. Conclusion: The study provides a psychophysical explanation for trends in vowel phonemes and clarifies possible structures of human vowel systems by setting a theoretical lower bound for how close two vowel phonemes may be in auditory space. Abstract: A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed 'Just Producible Difference' (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 X F2 space. This finding has implications for episodic theories of speech production. 
It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker's formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.

[19] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Ken Tsui

Main category: cs.CL

TL;DR: The study reveals the limitations of large language models in self-correction and proposes potential ways to improve their reliability.

Details Motivation: Although large language models are transformative, they still make mistakes and cannot effectively correct themselves, which undermines their reliability and trustworthiness. Method: Introduces a systematic self-correction benchmark framework and tests 14 models. Result: Finds an average blind spot rate of 64.5%; training data composition is one important factor behind this limitation, and simply appending 'Wait' reduces blind spots by 89.3%. Conclusion: Current large language models have a critical limitation in self-correction, but their reliability and trustworthiness can be improved through certain methods. Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple evidences that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

[20] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini,Nicola Gabriele,Alessio Orsino,Domenico Talia

Main category: cs.CL

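TL;DR: This study evaluates the adversarial robustness of Reasoning Language Models (RLMs) to social bias and finds that explicit reasoning mechanisms may make models more susceptible to bias, challenging the common assumption that reasoning improves robustness.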

Details Motivation: Although Reasoning Language Models (RLMs) can perform complex multi-step reasoning tasks, their robustness to social bias remains unclear. The paper therefore aims to systematically evaluate the fairness and robustness of RLMs across different sociocultural dimensions. Method: Uses the CLEAR-Bias benchmark to evaluate the adversarial robustness of RLMs to social bias, applies an LLM-as-a-judge approach for automated safety scoring, and employs jailbreak techniques to assess the effectiveness of built-in safety mechanisms. Result: Whether explicit reasoning is achieved through CoT prompting or fine-tuned reasoning traces, reasoning-enhanced models are generally more vulnerable to bias than base models without such mechanisms. In addition, models that rely on CoT prompting are especially fragile under contextual reframing attacks. Conclusion: Models with reasoning capabilities are more vulnerable to bias than base models, which suggests that reasoning mechanisms may unintentionally reinforce stereotypes. Models that reason via CoT prompting are particularly fragile under specific attacks. These results challenge the assumption that reasoning inherently improves robustness and underline the need for more attention to bias in reasoning design. Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. 
Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

[21] Multimodal Mathematical Reasoning with Diverse Solving Perspective

Wenhao Shi,Zhiqiang Hu,Yi Bin,Yang Yang,See-Kiong Ng,Heng Tao Shen

Main category: cs.CL

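TL;DR: This paper proposes Qwen-VL-DP, a new multimodal mathematical reasoning model that achieves significant gains in accuracy and generative diversity through diverse reasoning perspectives and reinforcement learning optimization.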

Details Motivation: Current multimodal large language models for mathematical reasoning typically rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. Method: Builds a new dataset named MathV-DP and proposes Qwen-VL-DP, a model based on Qwen-VL, optimized with supervised learning and GRPO, a rule-based reinforcement learning method. Result: Experiments show that Qwen-VL-DP significantly outperforms existing base multimodal large language models on both the MathVista minitest and Math-V benchmarks. Conclusion: By combining diverse reasoning perspectives with reflective reasoning, Qwen-VL-DP significantly improves the accuracy and generative diversity of multimodal large language models on mathematical reasoning tasks. Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

[22] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Wencheng Zhang,Shiqin Qiao,Lingjie Luo,Yinfeng Li,Chuanyang Zheng,Qian Xu,Meng Li,Yong Gui,Yijun He,Jianing Qiu,Jindong Hong,Jiankai Sun

Main category: cs.CL

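TL;DR: This paper proposes SynapseRoute, a dynamic routing framework that assigns input queries to the appropriate processing mode based on problem complexity, optimizing accuracy, cost-efficiency, and user experience.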

Details Motivation: With the widespread adoption of large language models in practical applications, choosing an appropriate model requires balancing performance against operational cost, especially given the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. Method: Proposes SynapseRoute, a machine learning-based dynamic routing framework, and introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost. Result: Experiments show that SynapseRoute improves overall accuracy compared with using the thinking mode alone, while reducing inference time and token consumption. Conclusion: By intelligently assigning queries to the appropriate mode, SynapseRoute effectively balances accuracy, inference time, and token consumption, offering a practical approach to deploying high-cost reasoning models. Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.

[23] Generalizing Verifiable Instruction Following

Valentina Pyatkin,Saumya Malik,Victoria Graf,Hamish Ivison,Shengyi Huang,Pradeep Dasigi,Nathan Lambert,Hannaneh Hajishirzi

Main category: cs.CL

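TL;DR: This paper proposes IFBench, a new benchmark for evaluating how well language models generalize to novel, diverse, and challenging output constraints, and describes how reinforcement learning with verifiable rewards (RLVR) can improve instruction following.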

Details Motivation: Current language models perform poorly at following specific user-defined output constraints, especially constraints they have not seen before. A new benchmark is therefore needed to evaluate generalization under these conditions, along with an effective training method. Method: Designs a new benchmark, IFBench, containing 58 new, diverse, and challenging verifiable output constraints, and develops a reinforcement learning with verifiable rewards (RLVR) approach to improve instruction following. Result: Experiments show that RLVR significantly improves instruction following; the paper also releases 29 new hand-annotated training constraints and verification functions, along with the associated code. Conclusion: By introducing a new benchmark and training method, the paper offers an effective route to improving language models' generalization to complex output constraints and their precise instruction following. Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
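
To make "verifiable output constraints" concrete, here is a minimal sketch of what verification functions for the two constraints quoted in the abstract might look like, plus a binary reward built from them. This is an illustration under assumptions, not IFBench's released code: the function names, the punctuation tolerance, and the all-or-nothing reward are hypothetical choices.

```python
import re
from typing import Callable, Iterable


def only_yes_or_no(response: str) -> bool:
    """Check the constraint 'only answer with yes or no'.

    Assumption: surrounding whitespace and a trailing '.' or '!' are tolerated.
    """
    return response.strip().strip(".!").lower() in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, n: int) -> bool:
    """Check that `word` appears as a whole word at least `n` times (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n


def verifiable_reward(response: str, checks: Iterable[Callable[[str], bool]]) -> float:
    """Binary reward: 1.0 only if every constraint check passes, else 0.0."""
    return 1.0 if all(check(response) for check in checks) else 0.0


if __name__ == "__main__":
    checks = [lambda r: mentions_word_at_least(r, "abrakadabra", 3)]
    good = "Abrakadabra! I say abrakadabra, and once more: abrakadabra."
    bad = "I will only say abrakadabra once."
    print(verifiable_reward(good, checks), verifiable_reward(bad, checks))
    print(only_yes_or_no("Yes."), only_yes_or_no("Yes, because it is true."))
```

Because each check is a deterministic program rather than a learned judge, a reward of this kind can be computed automatically for every sampled response during RL training.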

[24] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Almog Hilel,Idan Shenfeld,Leshem Choshen,Jacob Andreas

Main category: cs.CL

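TL;DR: The paper identifies a new attack on language models trained with user feedback: an attacker can change the model's behavior using only simple prompts and feedback, potentially spreading misinformation or introducing security vulnerabilities.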

Details Motivation: The paper aims to expose a security vulnerability in language models trained with user feedback and to explore its potential impact and uses. Method: The attacker prompts the model to stochastically output either a "poisoned" or a benign response, then upvotes the poisoned response or downvotes the benign one. Result: Experiments show that the attack can be used to insert new false knowledge, modify code generation patterns to introduce security vulnerabilities, and inject fake financial news. Conclusion: Preference tuning of language models is vulnerable, allowing attackers to exert fine-grained control over model behavior through limited feedback signals. Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

[25] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

TL;DR: MOTIF enables large language models to think in modules across multiple rounds, improving reasoning performance beyond context size limits with high sample efficiency.

Details Motivation: LLMs are limited by fixed context sizes, which restrict their ability to reason over long sequences. Existing methods like GRPO allow better responses but are still constrained by context limits. This work aims to overcome these limitations through multi-round modular reasoning. Method: MOTIF trains models using reinforcement learning with modular thinking over multiple rounds, extending the effective context size for reasoning. The Qwen2.5-3B-Instruct model was trained on the GSM8K dataset and evaluated on MATH500 and AIME2024 benchmarks. Result: Experiments showed a 3.8% improvement on MATH500 and 3.3% on AIME2024 compared to vanilla GRPO training, using only 15% of samples, demonstrating both accuracy improvement and sample efficiency. Conclusion: The proposed MOTIF method improves the reasoning capabilities of LLMs beyond context size limitations through modular thinking and reinforcement learning, showing significant performance gains on benchmarks while being sample-efficient. Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning -- an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. 
We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.

[26] Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping

Main category: cs.CL

TL;DR: The paper introduces 'answer matching' as a superior evaluation method for language models, showing that it aligns better with human grading than traditional multiple-choice benchmarks.

Details Motivation: Multiple choice benchmarks often allow models to exploit shortcuts and do not accurately reflect a model's true understanding. The motivation is to develop a more valid and scalable evaluation method that avoids these limitations. Method: The authors introduced generative evaluation through answer matching, where models generate free-form responses evaluated against reference answers using modern language models. They compared this method with traditional multiple-choice evaluations and LLM-as-a-judge approaches by measuring agreement with human grading on annotated datasets (MMLU-Pro and GPQA-Diamond). Result: Answer matching achieved near-perfect agreement with human grading, comparable to inter-annotator agreement, while multiple-choice and LLM-as-a-judge methods showed poor alignment with human judgments. Conclusion: Answer matching provides a more accurate and reliable method for evaluating language models compared to traditional multiple-choice benchmarks. Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. 
To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers aligns poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
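
The abstract compares evaluation strategies by their agreement with human grading. As a rough sketch of how such a comparison could be computed, the block below measures raw agreement and Cohen's kappa between a grader's verdicts and human verdicts. The choice of Cohen's kappa is an assumption made for illustration (the abstract does not name its agreement statistic), and the label lists are invented.

```python
def percent_agreement(a, b):
    """Fraction of items on which two graders give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two graders (Cohen's kappa)."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    # Expected agreement if both graders labelled independently at their own base rates.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both graders always give the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts: 1 = response judged correct, 0 = judged incorrect.
    human = [1, 1, 0, 0, 1, 0, 1, 1]
    answer_matching = [1, 1, 0, 0, 1, 0, 1, 0]
    multiple_choice = [1, 0, 1, 0, 1, 1, 1, 0]
    for name, verdicts in [("answer matching", answer_matching),
                           ("multiple choice", multiple_choice)]:
        print(name, percent_agreement(human, verdicts), cohen_kappa(human, verdicts))
```

Raw agreement can look high when most answers are correct, which is why a chance-corrected statistic is commonly reported alongside it.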

cs.CV [Back]

[27] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CV

TL;DR: This paper reviews the use of large language models and vision-language models for crash detection from video feeds, providing a taxonomy of approaches, dataset summaries, performance comparisons, and insights into future research directions.

Details Motivation: Crash detection from video feeds is a crucial aspect of intelligent transportation systems. With the advancement of LLMs and VLMs, there's an opportunity to improve how multimodal information is processed and analyzed for crash detection, warranting a survey of current approaches. Method: The authors conducted a comprehensive review of recent methods using LLMs for crash detection, presenting a taxonomy of fusion strategies, summarizing datasets, analyzing model architectures, and comparing performance benchmarks. Result: The paper provides a structured overview of fusion strategies, key datasets, model architectures, and benchmark performances for LLM-based crash detection methods, while highlighting existing challenges and opportunities for further research. Conclusion: The paper concludes that leveraging large language models (LLMs) and vision-language models (VLMs) for crash detection from video data presents significant potential, offering a foundation for future research in this rapidly evolving field. Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[28] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning

Zijie Cai,Christopher Metzler

Main category: cs.CV

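TL;DR: This paper examines monocular metric depth estimation in underwater environments, improving performance by fine-tuning an existing model on a synthetic underwater dataset.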

Details Motivation: Monocular metric depth estimation in underwater environments remains challenging because of light attenuation, scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. Method: Fine-tunes the Depth Anything V2 model on a variant of the Hypersim dataset generated with a physically based underwater image formation model. Result: The fine-tuned model performs better across all benchmarks and outperforms baselines trained only on the clean, non-underwater Hypersim dataset. Conclusion: The study underscores the importance of monocular metric depth estimation in underwater scenes and points to domain adaptation and scale-aware supervision as necessary directions for future research. Abstract: Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.

[29] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

Xiao Wang,Jingtao Jiang,Qiang Chen,Lan Chen,Lin Zhu,Yaowei Wang,Yonghong Tian,Jin Tang

Main category: cs.CV

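TL;DR: This paper proposes ESTR-CoT, a new chain-of-thought reasoning framework for event stream scene text recognition that combines a vision encoder with a large language model for end-to-end optimization, and releases a large-scale CoT dataset along with open-source code and pre-trained models.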

Details Motivation: Existing event stream scene text recognition methods fall short in interpretability and contextual logical reasoning, so a new framework that is more explainable and strengthens reasoning is needed. Method: Uses EVA-CLIP (ViT-G/14) as the vision encoder to convert the input event stream into tokens, encodes the generation prompt with a Llama tokenizer, and uses a Q-former to align the vision tokens with the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process. Result: Extensive experiments on three event stream STR benchmark datasets (EventSTR, WordArt*, IC15*) validate the effectiveness and interpretability of the proposed ESTR-CoT framework. Conclusion: The paper proposes ESTR-CoT, a chain-of-thought reasoning framework for event stream scene text recognition that achieves end-to-end optimization by combining a vision encoder with a large language model, and builds a large-scale CoT dataset to support the development of subsequent reasoning-based large models. Abstract: Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. 
The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.

[30] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach

Elena Ryumina,Maxim Markitantov,Alexandr Axyonov,Dmitry Ryumin,Mikhail Dolgushin,Alexey Karpov

Main category: cs.CV

TL;DR: This paper proposes a novel zero-shot multimodal approach for Compound Expression Recognition that integrates multiple modalities and achieves performance comparable to supervised methods without domain-specific training.

Details Motivation: Compound Expression Recognition aims to detect complex emotional states formed by combinations of basic emotions, which traditional methods struggle with due to reliance on task-specific training data. Method: The method combines six heterogeneous modalities (static/dynamic facial expressions, scene and label matching, scene context, audio, text) using a Multi-Head Probability Fusion module and a Compound Expressions transformation module with Pair-Wise Probability Aggregation and Pair-Wise Feature Similarity Aggregation. Result: The approach achieved F1 scores of 46.95% on AffWild2, 49.02% on AFEW, and 34.85% on C-EXPR-DB through zero-shot testing. Conclusion: The proposed zero-shot multimodal approach effectively captures compound emotions without domain adaptation and produces results comparable to supervised approaches. Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. 
Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.

[31] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Takuro Kawada,Shunsuke Kitada,Sota Nemoto,Hitoshi Iyatomi

Main category: cs.CV

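TL;DR: This paper introduces the SciGA-145k dataset together with associated tasks and a metric, aiming to advance visual scientific communication and AI for Science.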

Details Motivation: Graphical Abstracts (GAs) play a crucial role in conveying the key findings of scientific papers, but their potential to enhance scientific communication remains largely unexplored, and designing effective GAs requires advanced visualization skills, which limits their widespread adoption. Method: The paper defines two tasks, Intra-GA recommendation and Inter-GA recommendation, and proposes CAR, a new recommendation metric for analyzing model behavior. Result: It introduces SciGA-145k, a large-scale dataset of approximately 145,000 scientific papers and 1.14 million figures, to support GA selection and recommendation and to facilitate research on automated GA generation. Conclusion: SciGA-145k lays a foundation for advancing visual scientific communication while contributing to the development of AI for Science. Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

[32] Understanding Trade offs When Conditioning Synthetic Data

Brandon Trabucco,Qasim Wani,Benjamin Pikus,Vasu Sharma

Main category: cs.CV

TL;DR: This paper shows that layout-based conditioning in diffusion models enhances synthetic data quality for object detection, leading to significant improvements in model performance, especially under diverse conditions.

Details Motivation: Generating robust object detectors with limited real-world data is crucial in industrial vision systems, and while synthetic data offers a solution, current methods have limitations in efficiency and realism. Method: The study compares two synthetic data conditioning strategies—prompt-based and layout-based—across eighty visual concepts from four object detection benchmarks, evaluating their impact on detection performance. Result: Using layout conditioning that matches the full training distribution improves mean average precision by an average of 34% and up to 177% compared to using real data alone. Conclusion: Layout conditioning outperforms prompt conditioning in synthetic data generation for object detection when the diversity of conditioning cues increases, significantly improving mean average precision. Abstract: Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. 
When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.

[33] High-Fidelity Differential-information Driven Binary Vision Transformer

Tian Gao,Zhiyuan Zhang,Kaijie Yin,Xu-Cheng Zhong,Hui Kong

Main category: cs.CV

TL;DR: DIDB-ViT improves the binarization of vision transformers through novel modules and activation functions, resulting in enhanced performance with reduced computational demands.

Details Motivation: The motivation is to address the trade-off between computational/storage demands and edge-device constraints by binarizing vision transformers while mitigating performance degradation and reducing reliance on full-precision modules. Method: DIDB-ViT introduces an informative attention module with differential information, frequency decomposition using discrete Haar wavelet for preserving similarity fidelity, and an improved RPReLU activation function to enhance model representation. Result: Experimental results show that DIDB-ViT achieves better performance in image classification and segmentation compared to existing binary ViT methods. Conclusion: DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance. Abstract: The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model's representational capacity. 
Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
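The frequency-decomposition idea in the abstract above can be pictured with a small sketch. It is an illustration under stated assumptions, not the DIDB-ViT implementation: it assumes a single-level 1D orthonormal Haar transform, sign binarization, and a plain sum of the per-band similarities. The function names `haar_1d` and `banded_binary_similarity` are hypothetical.

```python
import math


def haar_1d(x):
    """Single-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums and pairwise differences.
    """
    if len(x) % 2 != 0:
        raise ValueError("sequence length must be even")
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


def sign(v):
    """Binarize a value to +1 or -1 (zero maps to +1 by convention here)."""
    return 1.0 if v >= 0 else -1.0


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def banded_binary_similarity(q, k):
    """Assumed scheme: binarize each frequency band, then sum band similarities."""
    q_low, q_high = haar_1d(q)
    k_low, k_high = haar_1d(k)
    sim_low = dot([sign(v) for v in q_low], [sign(v) for v in k_low])
    sim_high = dot([sign(v) for v in q_high], [sign(v) for v in k_high])
    return sim_low + sim_high


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
q_low, q_high = haar_1d(q)
k_low, k_high = haar_1d(k)
# The orthonormal transform preserves the full-precision dot product.
full = dot(q, k)
banded = dot(q_low, k_low) + dot(q_high, k_high)
binary = banded_binary_similarity(q, k)
```

Because the orthonormal Haar transform preserves inner products, the full-precision similarity splits exactly into a low-frequency and a high-frequency term. That makes it natural to binarize and compare each band separately. How the paper actually weights or integrates the bands is not stated in the abstract.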

[34] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

Jiangxia Chen,Tongyuan Huang,Ke Song

Main category: cs.CV

TL;DR: This paper introduces FMOcc, an efficient method for 3D semantic occupancy prediction that improves accuracy and reduces computational costs.

Details Motivation: To overcome the limitations of few-frame images and redundancy in 3D space, which affect prediction accuracy in occluded and distant scenes. Method: FMOcc uses a Tri-perspective View (TPV) refinement occupancy network with flow matching selective state space model to enhance feature prediction and efficiency. Result: FMOcc achieved notable performance metrics on Occ3D-nuScenes and OpenOcc datasets, including 43.1% RayIoU and 39.8% mIoU with two-frame input. Conclusion: FMOcc outperforms existing state-of-the-art methods for 3D semantic occupancy prediction, achieving high accuracy with reduced computational resources. Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which needs additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with flow matching selective state space model for few-frame 3D occupancy prediction. Firstly, to generate missing features, we designed a feature refinement module based on a flow matching model, which is called Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. 
Our FMOcc with two frame input achieves notable scores of 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, 42.6% RayIoU on OpenOcc with 5.4 G inference memory and 330ms inference time.

[35] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement

Zeyu Lei,Hongyuan Yu,Jinlin Wu,Zhen Chen

Main category: cs.CV

TL;DR: The paper proposes SurgVisAgent, a versatile surgical vision agent based on MLLMs that outperforms single-task models by customizing image enhancements for various surgical distortions.

Details Motivation: Current surgical enhancement algorithms are limited to single tasks in specific scenarios, reducing their effectiveness in complex real-world situations. Method: Development of SurgVisAgent, an end-to-end intelligent surgical vision agent using multimodal large language models (MLLMs), incorporating a prior model for domain-specific knowledge, in-context few-shot learning, and chain-of-thought (CoT) reasoning. Result: SurgVisAgent dynamically identifies distortion categories and severity levels, performing various enhancement tasks effectively and demonstrating superiority over traditional models on a comprehensive benchmark simulating real-world surgical distortions. Conclusion: SurgVisAgent offers a unified solution for surgical assistance by surpassing traditional single-task models in handling diverse distortion types and severity levels. Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. 
Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.

[36] Multi-Label Classification Framework for Hurricane Damage Assessment

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

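TL;DR: This paper introduces a new method for multi-label post-hurricane damage assessment from aerial imagery that outperforms existing techniques and helps improve disaster response efficiency.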

Details Motivation: Traditional single-label classification methods cannot capture the complexity of post-hurricane damage, so a more precise approach to damage assessment is needed. Method: The approach combines a ResNet-based feature extraction module with a class-specific attention mechanism to identify multiple damage types within a single image. Result: On the Rescuenet dataset, the proposed model achieves a mean average precision of 90.23%, outperforming existing baseline methods. Conclusion: The study proposes a new multi-label classification framework for post-hurricane damage assessment that makes disaster response more targeted and efficient and contributes to future disaster mitigation and recovery strategies. Abstract: Hurricanes cause widespread destruction, resulting in diverse damage types and severities that require timely and accurate assessment for effective disaster response. While traditional single-label classification methods fall short of capturing the complexity of post-hurricane damage, this study introduces a novel multi-label classification framework for assessing damage using aerial imagery. The proposed approach integrates a feature extraction module based on ResNet and a class-specific attention mechanism to identify multiple damage types within a single image. Using the Rescuenet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. This framework enhances post-hurricane damage assessment, enabling more targeted and efficient disaster response and contributing to future strategies for disaster mitigation and resilience. This paper has been accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025), and the camera-ready version will appear in the official conference proceedings.
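For readers unfamiliar with the reported metric, the sketch below shows one common way to compute mean average precision in a multi-label setting. It assumes non-interpolated average precision per class, averaged uniformly over classes; the abstract does not say which variant the authors used. The toy scores and labels are made up for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted confidence per image; labels: 1 if the class is present.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_rows, label_rows):
    """Average the per-class AP over all classes (one row per image)."""
    num_classes = len(score_rows[0])
    aps = []
    for c in range(num_classes):
        scores = [row[c] for row in score_rows]
        labels = [row[c] for row in label_rows]
        aps.append(average_precision(scores, labels))
    return sum(aps) / num_classes


# Toy example: three images, two damage classes (values are illustrative only).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
toy_labels = [[1, 0], [0, 1], [1, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

Each image can be positive for several classes at once, which is why the metric is computed per class and then averaged, not per image.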

[37] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation

Yuxiang Zhang,Wei Li,Wen Jia,Mengmeng Zhang,Ran Tao,Shunlin Liang

Main category: cs.CV

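TL;DR: This paper proposes a new Bi-directional Domain Adaptation (BiDA) framework for cross-domain classification of hyperspectral remote sensing images; with a novel network structure and learning strategy, it achieves better classification performance than existing methods.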

Details Motivation: Satellite or airborne images acquired over different regions or time periods exhibit spectral shifts within the same class, so a method is needed that extracts both domain-invariant features and domain-specific information to improve adaptability and separability in cross-domain scenes. Method: The paper proposes a Bi-directional Domain Adaptation (BiDA) framework built on a triple-branch transformer architecture (source branch, target branch, and coupled branch), with a Coupled Multi-head Cross-attention (CMCA) mechanism and a bi-directional distillation loss to strengthen feature interaction and inter-domain correlation mining. An Adaptive Reinforcement Strategy (ARS) is also introduced to improve generalization under noisy conditions. Result: Experiments show that BiDA performs strongly on cross-temporal/scene airborne and satellite datasets, significantly outperforming current state-of-the-art domain adaptation methods. Conclusion: The BiDA framework significantly outperforms existing methods on cross-domain hyperspectral image classification; in particular, on the cross-temporal tree species classification task it is 3%~5% higher than the most advanced method. Abstract: Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with a semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of the source and target domains, and a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. 
Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is more than 3%~5% higher than the most advanced method. The codes will be available from the website: https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.

[38] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

Fanghai Yi,Zehong Zheng,Zexiao Liang,Yihang Dong,Xiyang Fang,Wangyu Wu,Xuhang Chen

Main category: cs.CV

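TL;DR: This paper proposes the MAC-Lookup model, which uses its CLTCC and MAAE modules to effectively improve underwater image quality, with strong results in color, sharpness, and contrast.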

Details Motivation: Traditional prior-based and pixel-based methods perform poorly, while deep learning methods are limited by the lack of high-quality datasets, so a more effective underwater image enhancement method is needed. Method: The paper introduces the Multi-Axis Conditional Lookup (MAC-Lookup) model, which comprises two modules: Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. Result: Experiments show that MAC-Lookup restores details and colors better than existing methods, markedly improving the visual quality of underwater images. Conclusion: MAC-Lookup performs well at underwater image enhancement, improving color accuracy, sharpness, and contrast while preventing over-enhancement and saturation. Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.

[39] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Feizhen Huang,Yu Wu,Yutian Lin,Bo Du

Main category: cs.CV

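TL;DR: This paper proposes a new self-distillation method that addresses the limitations of existing V2A generation methods under partial visibility and performs well on a large-scale dataset.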

Details Motivation: Current methods overlook cinematic language, a key component of artistic expression, so their performance degrades when Foley targets are only partially visible. Method: By simulating cinematic language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondences. Result: The method achieves significant improvements on all evaluation metrics under partial visibility and also improves performance on the large-scale V2A dataset VGGSound. Conclusion: The paper proposes a simple self-distillation approach that extends V2A models to cinematic language scenarios and achieves significant improvements. Abstract: Video-to-Audio (V2A) Generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

[40] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

Juntao Liu,Liqiang Niu,Wenchao Chen,Jie Zhou,Fandong Meng

Main category: cs.CV

TL;DR: LaCo is a new token compression framework for Multimodal Large Language Models that improves training and inference efficiency without sacrificing performance.

Details Motivation: Existing visual token compression methods are limited as post-encoder modules, which restricts their efficiency gains. This paper proposes LaCo to address this limitation. Method: LaCo (Layer-wise Visual Token Compression) uses a layer-wise pixel-shuffle mechanism and a residual learning architecture to compress tokens in the intermediate layers of the vision encoder. Result: Extensive experiments show that LaCo outperforms all existing methods for token compression in MLLMs' vision encoders. Conclusion: LaCo improves training efficiency beyond 20% and increases inference throughput by over 15% while maintaining strong performance, outperforming existing visual token compression methods. Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
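The "space-to-channel" merge mentioned in the abstract can be illustrated with a minimal sketch. It assumes visual tokens laid out on an H x W grid with C channels each and a merge factor of 2, and it uses plain nested lists to stay dependency-free. The function name `space_to_channel` is hypothetical, and the sketch does not cover LaCo's residual shortcut or any learned projection.

```python
def space_to_channel(grid, r=2):
    """Merge each r x r block of adjacent tokens into one wider token.

    grid: H x W x C nested lists. Returns (H/r) x (W/r) x (r*r*C) nested lists,
    so the token count drops by r*r while no values are discarded.
    """
    h, w = len(grid), len(grid[0])
    if h % r != 0 or w % r != 0:
        raise ValueError("grid height and width must be divisible by r")
    merged = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            token = []
            # Concatenate the channels of the r*r neighbours in row-major order.
            for di in range(r):
                for dj in range(r):
                    token.extend(grid[i + di][j + dj])
            row.append(token)
        merged.append(row)
    return merged


# A 4 x 4 grid of 2-channel tokens becomes a 2 x 2 grid of 8-channel tokens.
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
merged = space_to_channel(grid)
shape = (len(merged), len(merged[0]), len(merged[0][0]))
```

With `r=2` the token count falls by a factor of four. Later layers that process tokens therefore see a shorter sequence, which is the usual source of the efficiency gain in this kind of compression.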

[41] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

De Cheng,Zhipeng Xu,Xinyang Jiang,Dongsheng Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

TL;DR: This paper proposes a novel framework combining text feature-guided visual prompt tuning and WERA to improve domain generalization, achieving better results than current state-of-the-art methods on several benchmark datasets.

Details Motivation: Domain Generalization (DG) aims to build models that perform well across unseen domains. Although Visual Foundation Models (VFMs) like CLIP show promise, designing effective prompts for domain-invariant features remains challenging, prompting the need for better strategies. Method: The framework involves two key steps: first, automatically disentangling the text prompt using a large language model (LLM), and second, learning domain-invariant visual representation guided by the disentangled text feature. To overcome limitations in relying solely on language, WERA is introduced to extend visual prompts with abstract prompts and stylized image augmentations while maintaining alignment consistency. Result: Experiments on multiple DG datasets demonstrated superior performance of the proposed approach compared to existing state-of-the-art DG methods. Conclusion: The proposed method, which uses text feature-guided visual prompt tuning and Worst Explicit Representation Alignment (WERA), outperforms state-of-the-art DG methods on major datasets like PACS, VLCS, OfficeHome, DomainNet, and TerraInc. Abstract: Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. 
This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

[42] ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

Hanbo Bi,Yulong Xu,Ya Li,Yongqiang Mao,Boyuan Tong,Chongyang Li,Chunbo Lang,Wenhui Diao,Hongqi Wang,Yingchao Feng,Xian Sun

Main category: cs.CV

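TL;DR: This paper proposes ViRefSAM, a new framework that learns from a few reference images to segment unseen classes in remote sensing images accurately and automatically, without manual prompts, and that outperforms existing few-shot segmentation methods on multiple datasets.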

Details Motivation: Applying SAM to remote sensing images faces two main challenges: manually constructing precise prompts is inefficient, and SAM lacks domain adaptability. ViRefSAM is proposed to address these problems. Method: ViRefSAM introduces two key components, a Visual Contextual Prompt Encoder and a Dynamic Target Alignment Adapter. The former extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; the latter narrows the domain gap by injecting class-specific semantics into target image features. Result: Experiments on three few-shot segmentation benchmarks show that ViRefSAM achieves accurate, automatic segmentation of unseen classes using only a few reference images and outperforms existing methods across multiple datasets. Conclusion: By leveraging only a few reference images, ViRefSAM achieves accurate, automatic segmentation of unseen classes in remote sensing images and consistently outperforms existing few-shot segmentation methods across datasets. Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. 
Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.

[43] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

Yunhan Yang,Shuo Chen,Yukun Huang,Xiaoyang Wu,Yuan-Chen Guo,Edmund Y. Lam,Hengshuang Zhao,Tong He,Xihui Liu

Main category: cs.CV

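TL;DR: DreamComposer++ is a framework that improves view-aware diffusion models by incorporating multi-view conditions, making novel view generation more controllable.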

Details Motivation: Existing work on generating novel views from a single view struggles to produce controllable novel views because it lacks multi-view information. Method: DreamComposer++ uses a view-aware 3D lifting module to extract 3D representations of an object from different views, then aggregates these representations and renders them into the latent features of the target view through a multi-view feature fusion module. Result: Experiments show that DreamComposer++ integrates seamlessly with state-of-the-art view-aware diffusion models and enhances their ability to generate controllable novel views from multi-view conditions. Conclusion: DreamComposer++ is a flexible and scalable framework that improves current view-aware diffusion models by incorporating multi-view conditions. Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.

[44] Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images

Haoxuan Li,Chenxu Wei,Haodong Wang,Xiaomeng Hu,Boyuan An,Lingyan Ran,Baosen Zhang,Jin Jin,Omirzhan Taukebayev,Amirkhan Temirbayev,Junrui Liu,Xiuwei Zhang

Main category: cs.CV

TL;DR: This paper introduces Flow-CDNet, a dual-branch network for detecting both slow and fast changes in images, achieving superior performance on a new dataset with customized loss and metric.

Details Motivation: Detecting both slow and fast changes in bitemporal images is crucial for identifying early signs of hazards in critical areas like slopes and dams, which presents a novel challenge in change detection. Method: Flow-CDNet uses a pyramid structure for multi-scale displacement extraction and combines ResNet with optical flow output for fast change detection. A custom loss function and evaluation metric (FEPE) are also introduced. Result: Experiments on the Flow-Change dataset show that the proposed method surpasses existing approaches, while ablation studies confirm the mutual enhancement of the two branches. Conclusion: The proposed Flow-CDNet effectively detects both slow and fast changes by integrating an optical flow branch and a binary change detection branch, outperforming existing methods on the Flow-Change dataset. Abstract: Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: optical flow branch and binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second one combines a ResNet-based network with the optical flow branch's output to generate fast change outputs. Subsequently, to supervise and evaluate this new change detection framework, a self-built change detection dataset Flow-Change, a loss function combining binary tversky loss and L2 norm loss, along with a new evaluation metric called FEPE are designed. 
Quantitative experiments conducted on the Flow-Change dataset demonstrate that our approach outperforms existing methods. Furthermore, ablation experiments verify that the two branches reinforce each other to improve detection performance.
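
The abstract names a loss that combines a binary Tversky loss with an L2 norm loss but does not give the formula. The following is a minimal sketch of one plausible form, not the authors' implementation. The weights `alpha`, `beta`, and `lam`, the function names, and the reading of "L2 norm loss" as the mean Euclidean distance between displacement vectors are all assumptions made for illustration.

```python
import math

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index over flat lists of probabilities and 0/1 labels.

    alpha weights false positives and beta weights false negatives
    (both values are illustrative assumptions).
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def l2_flow_loss(pred_flow, target_flow):
    """Mean Euclidean distance between predicted and reference (u, v) displacements."""
    dists = [math.hypot(pu - tu, pv - tv)
             for (pu, pv), (tu, tv) in zip(pred_flow, target_flow)]
    return sum(dists) / len(dists)

def combined_loss(pred_mask, target_mask, pred_flow, target_flow, lam=1.0):
    """Weighted sum of the two terms; lam is a hypothetical balancing weight."""
    return tversky_loss(pred_mask, target_mask) + lam * l2_flow_loss(pred_flow, target_flow)
```

With `alpha = beta = 0.5` the Tversky term reduces to the Dice loss; raising `beta` penalizes missed change pixels more heavily, which is one common reason to prefer Tversky when changed regions are rare.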

[45] LMPNet for Weakly-supervised Keypoint Discovery

Pei Guo,Ryan Farrell

Main category: cs.CV

TL;DR: This paper proposes LMPNet for semantic object keypoint discovery using weak supervision from category labels, achieving robustness to object pose and strong prediction accuracy.

Details Motivation: The task of semantic object keypoint discovery is weakly-supervised by only category labels. The work aims to transform discriminatively-trained intermediate layer filters into keypoint detectors while ensuring three preferred characteristics: spatially sparse activations, consistency, and diversity. Method: A novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that are well aligned with object keypoints. A simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. Result: The proposed LMPNet model can automatically discover semantic keypoints that are robust to object pose and achieves strong prediction accuracy comparable to a supervised pose estimation model. Conclusion: LMPNet is a highly interpretable model that can automatically discover semantic keypoints robust to object pose and achieves strong prediction accuracy comparable to a supervised pose estimation model. Abstract: In this work, we explore the task of semantic object keypoint discovery weakly-supervised by only category labels. This is achieved by transforming discriminatively-trained intermediate layer filters into keypoint detectors. We begin by identifying three preferred characteristics of keypoint detectors: (i) spatially sparse activations, (ii) consistency and (iii) diversity. Instead of relying on hand-crafted loss terms, a novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that are well aligned with object keypoints. 
Informed by visualizations, a simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. The final model, named LMPNet, is highly interpretable in that it directly manipulates network filters to detect predefined concepts. Our experiments show that LMPNet can (i) automatically discover semantic keypoints that are robust to object pose and (ii) achieves strong prediction accuracy comparable to a supervised pose estimation model.

[46] Perception Activator: An intuitive and portable framework for brain cognitive exploration

Le Xu,Qi Zhang,Qixian Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao

Main category: cs.CV

TL;DR: This paper introduces a new framework that leverages fMRI data through cross-attention mechanisms to enhance object detection and segmentation, revealing that fMRI contains untapped semantic and spatial cues.

Details Motivation: Existing brain-vision decoding methods rely heavily on low-level pixel alignment and lack sufficient semantic alignment, leading to distortions in reconstructing multiple semantic objects. The motivation is to better understand how the brain processes visual perception and how decoding models handle semantic objects. Method: The researchers developed an experimental framework using fMRI representations as intervention conditions. They injected these representations into multi-scale image features via cross-attention and compared downstream performance and intermediate feature changes for object detection and instance segmentation tasks with and without fMRI information. Result: Incorporating fMRI signals improved the accuracy of downstream tasks such as object detection and instance segmentation. This confirms that fMRI contains valuable semantic and spatial information about visual stimuli. Conclusion: The study concludes that fMRI signals contain rich multi-object semantic cues and coarse spatial localization information that current models have yet to fully exploit. Incorporating these signals enhances the accuracy of visual perception tasks like object detection and segmentation. Abstract: Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. 
To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information-elements that current models have yet to fully exploit or integrate.
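
The injection step described above can be pictured as cross-attention in which image features act as queries and fMRI representations act as keys and values. The sketch below is a simplified, single-head version in plain Python with no learned projections and a residual connection. These simplifications and all names are assumptions for illustration, not the paper's architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(image_tokens, fmri_tokens):
    """Inject fMRI information into image tokens.

    image_tokens: list of d-dimensional query vectors (image features).
    fmri_tokens:  list of d-dimensional vectors used as both keys and values.
    Returns image tokens plus their attention-weighted fMRI context (residual).
    """
    d = len(image_tokens[0])
    fused = []
    for q in image_tokens:
        # Scaled dot-product similarity between this image token and each fMRI token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in fmri_tokens]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, fmri_tokens)) for j in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused
```

Running a downstream detector once on `image_tokens` and once on `cross_attend(image_tokens, fmri_tokens)` mirrors the with/without comparison the abstract describes.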

[47] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation

JaeHyuck Choi,MinJun Kim,JeHyeong Hong

Main category: cs.CV

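TL;DR: MAGIC is a few-shot anomaly generation method for industrial quality inspection that improves diffusion-model fine-tuning and mask alignment to preserve the background, align anomalies accurately with their masks, and generate diverse anomalies.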

Details Motivation: Existing diffusion-based methods cannot simultaneously keep the normal background intact, align precisely with the mask, and generate diverse appearances, whereas industrial quality control requires high-quality few-shot anomaly generation. Method: MAGIC builds on a Stable Diffusion inpainting framework and introduces a multi-level perturbation strategy (Gaussian prompt-level perturbation and mask-guided spatial noise injection) together with a context-aware mask alignment module. Result: MAGIC outperforms current state-of-the-art methods on downstream anomaly tasks on the MVTec-AD dataset and resolves background corruption, mask misalignment, and insufficient diversity. Conclusion: MAGIC generates high-quality anomaly data for industrial quality inspection, addressing background preservation, mask alignment, and diversity, and outperforms prior work on the MVTec-AD dataset. Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC--Mask-guided inpainting with multi-level perturbations and Context-aware alignment--to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts.
Under an identical evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.

[48] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

Zecheng Zhao,Selena Song,Tong Chen,Zhi Chen,Shazia Sadiq,Yadan Luo

Main category: cs.CV

TL;DR: This paper introduces SynTVA, a new dataset and benchmark for evaluating the usefulness of synthetic videos in text-to-video retrieval tasks, showing that it can effectively assess alignment quality and improve retrieval outcomes through dataset augmentation.

Details Motivation: Current evaluation metrics for text-to-video synthesis primarily focus on visual quality and temporal consistency, offering limited insight into how effective synthetic videos are in downstream tasks like text-to-video retrieval (TVR). The authors aim to introduce a new benchmark that evaluates the utility of synthetic videos specifically for retrieval tasks. Method: The authors created SynTVA, a dataset and benchmark derived from 800 diverse user queries based on the MSRVTT training split. Synthetic videos were generated using state-of-the-art T2V models, and each video-text pair was annotated along four semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. They developed an Auto-Evaluator to estimate alignment quality from existing metrics and correlated general VQA metrics with alignment scores to examine their predictive power for TVR performance. Result: SynTVA provides detailed annotations across key semantic alignment dimensions and demonstrates correlations between VQA metrics and alignment scores. It successfully predicts downstream TVR performance and enables dataset augmentation by selecting high-quality synthetic samples. Conclusion: SynTVA serves not only as a benchmark but also as a valuable tool for dataset augmentation, improving the outcomes of text-to-video retrieval by selecting high-utility synthetic samples. Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. 
Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
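
The abstract says the framework correlates VQA metrics with the human alignment scores but does not name the statistic. As an illustration only, the sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman, the function names, and the sample numbers in the usage note are assumptions, not values from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(metric_scores, alignment_scores):
    """Spearman correlation = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(metric_scores), average_ranks(alignment_scores))
```

For example, with made-up metric values `[0.1, 0.5, 0.9]` and made-up alignment ratings `[1, 2, 5]`, `spearman` returns a value of about 1, because the two lists order the videos identically.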

[49] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Nina Konovalova,Maxim Nikolaev,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

TL;DR: InnerControl improves spatial control in text-to-image diffusion models by applying a training strategy that ensures consistency across all diffusion steps, outperforming previous approaches like ControlNet++.

Details Motivation: Precise spatial control over generated outputs in text-to-image diffusion models remains challenging. Existing methods like ControlNet and ControlNet++ focus on final denoising steps, neglecting intermediate stages which limits their effectiveness. Method: InnerControl introduces lightweight convolutional probes trained to reconstruct input control signals from intermediate UNet features at every denoising step. An alignment loss is applied throughout the entire diffusion process to minimize discrepancies between predicted and target conditions. Result: The proposed method achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth) by efficiently extracting control signals even from highly noisy latents and generating pseudo ground truth controls for training. Conclusion: InnerControl enhances spatial control in text-to-image diffusion models by enforcing consistency across all diffusion steps, leading to improved control fidelity and generation quality when combined with existing methods like ControlNet++. Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. 
By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
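
To make the contrast with a final-steps-only loss concrete, here is a minimal sketch of an alignment loss averaged over denoising steps. It is an illustration under stated assumptions: the probe is passed in as a plain function, `identity_probe` is a hypothetical stand-in for a learned convolutional probe, and mean squared error is assumed as the discrepancy measure.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def alignment_loss(features_per_step, control, probe, steps=None):
    """Average probe-reconstruction error over the chosen denoising steps.

    features_per_step: one feature vector per denoising step.
    control:           the target control signal (e.g. a flattened edge map).
    probe:             callable (features, step) -> predicted control signal.
    steps:             indices to supervise; None means every step.
    """
    if steps is None:
        steps = range(len(features_per_step))
    errors = [mse(probe(features_per_step[t], t), control) for t in steps]
    return sum(errors) / len(errors)

def identity_probe(features, step):
    """Hypothetical probe: assumes the features already live in control space."""
    return features
```

With features `[[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]` and control `[1.0, 1.0]`, supervising only the last step (`steps=[2]`) gives zero loss, while supervising all steps exposes the mismatch at the earlier, noisier steps.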

[50] Neural Network-based Study for Rice Leaf Disease Recognition and Classification: A Comparative Analysis Between Feature-based Model and Direct Imaging Model

Farida Siddiqi Prity,Mirza Raquib,Saydul Akbar Murad,Md. Jubayar Alam Rafi,Md. Khairul Bashar Bhuiyan,Anupam Kumar Bairagi

Main category: cs.CV

TL;DR: This paper proposes a Feature Analysis Detection Model using ANN-based image processing techniques to classify rice leaf diseases more effectively than direct image-centric methods.

Details Motivation: Rice leaf diseases significantly reduce productivity and cause economic losses, necessitating early detection for effective management. This study aims to evaluate the effectiveness of different detection models for timely classification of rice diseases. Method: This research employed an Artificial Neural Network (ANN) with image-processing techniques. It compared two models: one using Feature Extraction Algorithms (FEAs), Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM), and another directly inputting images without FEAs. The experiments were conducted on a dataset of rice leaf diseases using 10-fold Cross-Validation. Result: The Feature Analysis Detection Model demonstrated superior performance over the Direct Image-Centric Detection Model in classifying rice leaf diseases. Conclusion: The study concludes that the Feature Analysis Detection Model (FADM) outperforms the Direct Image-Centric Detection Model (DICDM) in classifying rice leaf diseases, showing great potential for enhancing crop health and productivity. Abstract: Rice leaf diseases significantly reduce productivity and cause economic losses, highlighting the need for early detection to enable effective management and improve yields. This study proposes Artificial Neural Network (ANN)-based image-processing techniques for timely classification and recognition of rice diseases. Despite the prevailing approach of directly inputting images of rice leaves into ANNs, there is a noticeable absence of thorough comparative analysis between the Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM), specifically when it comes to evaluating the effectiveness of Feature Extraction Algorithms (FEAs). 
Hence, this research presents initial experiments on the Feature Analysis Detection Model, utilizing various image Feature Extraction Algorithms, Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM). The experiments are carried out on datasets encompassing bacterial leaf blight, brown spot, leaf blast, leaf scald, sheath blight rot, and healthy leaf, using 10-fold cross-validation. A Direct Image-Centric Detection Model is established without the utilization of any FEA, and the evaluation of classification performance relies on different metrics. Ultimately, an exhaustive comparison is made between the performance of the Feature Analysis Detection Model and the Direct Image-Centric Detection Model in classifying rice leaf diseases. The results reveal that the highest performance is attained using the Feature Analysis Detection Model. The adoption of the proposed Feature Analysis Detection Model for detecting rice leaf diseases holds excellent potential for improving crop health, minimizing yield losses, and enhancing overall productivity and sustainability of rice farming.
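
For readers unfamiliar with the evaluation protocol, the sketch below shows how 10-fold cross-validation partitions a dataset: every sample is held out for testing exactly once. It is a generic illustration, not the authors' code. It omits shuffling and class stratification, which a real experiment would usually add, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

A model is trained on each `train` split and scored on the matching `test` split, and the k scores are averaged to give the reported metric.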

[51] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection

Rafic Nader,Vincent L'Allinec,Romain Bourcier,Florent Autrusseau

Main category: cs.CV

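TL;DR: This paper proposes a new method for automatically detecting arterial bifurcations of the Circle of Willis, using a two-step neural network pipeline to improve detection accuracy and efficiency.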

Details Motivation: Intracranial aneurysms commonly occur at specific arterial bifurcations of the Circle of Willis, and accurately detecting these landmarks is essential for prompt and efficient diagnosis, yet existing methods suffer from problems such as missed detections. Method: An object detection network first identifies regions of interest (ROIs) near the landmark locations; a modified U-Net with deep supervision then precisely locates the bifurcations. Result: Experiments show that the method achieves the best performance on the bifurcation detection task, effectively addressing missed detections of landmarks that are close together and visually similar, and accommodating anatomical variability. Conclusion: The paper presents a fully automated, two-step neural network approach for locating Circle of Willis bifurcations and demonstrates its effectiveness in handling anatomical variability and visually similar landmarks. Abstract: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily at thirteen major arterial bifurcations. An accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural network process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is exploited to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our In-House dataset which had varying numbers of landmarks, and a public dataset with standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.
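
The coarse-to-fine idea can be sketched as follows: a detector proposes labeled ROIs, and a second stage picks the peak of a heatmap inside each ROI and maps it back to full-image coordinates. This is a simplified 2D illustration, whereas the paper works on MRA volumes. The dictionary layout, the `score_threshold` value, and the function names are assumptions, and the heatmaps stand in for the output of the modified U-Net.

```python
def locate_in_roi(heatmap, roi_origin):
    """Return the full-image (row, col) of the heatmap peak inside one ROI.

    heatmap:    2D list of scores for the ROI crop (stand-in for U-Net output).
    roi_origin: (row, col) of the crop's top-left corner in the full image.
    """
    _, r, c = max((v, r, c)
                  for r, row in enumerate(heatmap)
                  for c, v in enumerate(row))
    return (roi_origin[0] + r, roi_origin[1] + c)

def detect_landmarks(rois, score_threshold=0.5):
    """Keep confident ROIs only, then refine each one to a single landmark.

    Skipping low-confidence ROIs lets the number of reported landmarks vary
    from scan to scan, as it does when a bifurcation is anatomically absent.
    """
    landmarks = {}
    for roi in rois:
        if roi["score"] < score_threshold:
            continue
        landmarks[roi["label"]] = locate_in_roi(roi["heatmap"], roi["origin"])
    return landmarks
```

Restricting the second stage to one ROI at a time is what keeps two nearby, similar-looking bifurcations from competing for the same heatmap peak.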

[52] Lightweight Shrimp Disease Detection Research Based on YOLOv8n

Fei Yuhuan,Wang Gengchen,Liu Fenghao,Zang Ran,Sun Xufei,Chang Hao

Main category: cs.CV

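TL;DR: This paper proposes a lightweight network architecture based on YOLOv8n for intelligent shrimp disease detection, significantly reducing computational resource requirements while maintaining high accuracy.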

Details Motivation: To prevent disease transmission and improve the efficiency of intelligent detection in shrimp farming, an efficient and accurate disease identification method is needed. Method: The RLDD detection head and the C2f-EMCM module are designed to reduce model complexity, and an improved SegNext_Attention self-attention mechanism is introduced to strengthen feature extraction. Result: Compared with the original YOLOv8n, the model has 32.3% fewer parameters and reaches a mAP@0.5 of 92.7% (3% higher than YOLOv8n), and it shows stronger generalization on the URPC2020 dataset. Conclusion: The proposed lightweight YOLOv8n-based architecture achieves higher detection accuracy and computational efficiency with fewer parameters, providing reliable technical support for intelligent disease detection in shrimp aquaculture. Abstract: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model's feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model's robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.

[53] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Xiangfeng Wang,Xiao Li,Yadong Wei,Xueyu Song,Yang Song,Xiaoqiang Xia,Fangrui Zeng,Zaiyi Chen,Liu Liu,Gu Xu,Tong Xu

Main category: cs.CV

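TL;DR: This paper proposes HIVE, a human-inspired automatic video editing framework that uses multimodal narrative understanding to address existing methods' neglect of rich visual context, and introduces a new benchmark dataset, DramaAD.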

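Details Motivation: The rapid growth of online video content, especially on short-video platforms, has created demand for efficient video editing techniques that condense long videos into concise, engaging clips. Existing automatic editing methods rely mainly on textual cues from ASR transcripts and end-to-end segment selection, often neglecting rich visual context and producing incoherent outputs. Method: The paper proposes a human-inspired automatic video editing framework (HIVE) that uses multimodal narrative understanding to address these limitations. The approach incorporates character extraction, dialogue analysis, and narrative summarization, further improves coherence through scene-level segmentation, and decomposes editing into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. It also introduces a new benchmark dataset, DramaAD. Result: Experiments show that the framework consistently outperforms existing baselines on both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatically edited and human-edited videos. Conclusion: Through multimodal narrative understanding and scene-level segmentation, HIVE improves the quality and coherence of automatic video editing and provides a new approach and the DramaAD benchmark for related research. Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.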

[54] Holistic Tokenizer for Autoregressive Image Generation

Anlin Zheng,Haochen Wang,Yucheng Zhao,Weipeng Deng,Tiancai Wang,Xiangyu Zhang,Xiaojuan Qi

Main category: cs.CV

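TL;DR: Hita is a new approach to autoregressive image generation that uses a holistic-to-local tokenization scheme and a refined information-flow strategy to improve generation quality and training speed, performing well across several tasks.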

Details Motivation: Conventional autoregressive image generation models generate visual tokens one at a time, which limits their ability to capture holistic relationships among token sequences. In addition, most visual tokenizers map local image patches to latent tokens, so global information is limited. Method: Hita introduces a holistic-to-local tokenization scheme with two key strategies: 1) it arranges the sequence with holistic tokens first, followed by patch-level tokens, and uses causal attention to maintain awareness of previous tokens; 2) before feeding the de-quantized tokens into the decoder, it uses a lightweight fusion module to control information flow and prioritize holistic tokens. Result: Experiments show that Hita accelerates the training of AR generators and achieves 2.59 FID and 281.9 IS on the ImageNet benchmark. Hita is also effective in zero-shot style transfer and image in-painting. Conclusion: Hita is a new image tokenizer for autoregressive image generation; with its holistic-to-local tokenization scheme and two key strategies for aligning with the AR generation process, it accelerates AR generator training and outperforms generators trained with vanilla tokenizers. Abstract: The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita
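
The first strategy (holistic tokens first, then patch tokens, under causal attention) can be pictured with a small sketch. The token names and helper functions below are hypothetical and only illustrate the ordering and the mask; they are not taken from the Hita code base.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Place holistic tokens ahead of patch-level tokens in one sequence."""
    return list(holistic_tokens) + list(patch_tokens)

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Two holistic tokens followed by three patch tokens (illustrative sizes).
sequence = build_sequence(["h0", "h1"], ["p0", "p1", "p2"])
mask = causal_mask(len(sequence))
```

Because the holistic tokens sit at the front, every patch position can attend to all of them, while no holistic token can see a patch token. That is how the ordering gives each local token access to global context during autoregressive generation.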

[55] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao,Yi Ding,Lijun Li,Jing Shao

Main category: cs.CV

TL;DR: This paper introduces the VisCo Attack, a novel visual-centric jailbreak method for exploiting security vulnerabilities in multimodal large language models (MLLMs).

Details Motivation: MLLMs have security vulnerabilities in their visual modality that are not grounded in realistic scenarios; this work aims to define a new visual-centric jailbreak context. Method: Proposed a novel visual-centric jailbreak setting and the VisCo Attack with four visual-focused strategies, incorporating toxicity obfuscation and semantic refinement. Result: Toxicity score of 4.78 and ASR of 85% on MM-SafetyBench against GPT-4o. Conclusion: VisCo Attack effectively triggers harmful responses from black-box MLLMs, outperforming baseline methods. Abstract: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. 
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

[56] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

Jiahao Wu,Rui Peng,Jianbo Jiao,Jiayu Yang,Luyang Tang,Kaiqiang Xiong,Jie Liang,Jinbo Yan,Runling Liu,Ronggang Wang

Main category: cs.CV

TL;DR: This paper introduces LocalDyGS, a method for synthesizing dynamic videos from multi-view inputs that addresses challenges in modeling both large-scale and fine-scale motions in real-world scenes.

Details Motivation: The motivation behind this work is the challenge of synthesizing dynamic videos from multi-view inputs due to complex and highly dynamic motions in real-world scenarios. Previous methods are limited in their ability to handle fine-scale motion, prompting the need for an improved approach. Method: LocalDyGS decomposes dynamic scenes into local spaces defined by seeds, decouples static and dynamic features for motion modeling, and combines these features to generate Temporal Gaussians that model motion within each local space. Result: The result of this research is a novel framework for dynamic scene reconstruction that effectively models highly dynamic real-world scenes. It demonstrates competitive performance on fine-scale datasets compared to state-of-the-art methods and extends capabilities to larger, more complex scenes. Conclusion: The paper concludes that LocalDyGS is a successful method for synthesizing dynamic videos from multi-view inputs, particularly for both fine-scale and large-scale motion scenes. Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. 
As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.

[57] UVLM: Benchmarking Video Language Model for Underwater World Understanding

Xizhe Xue,Yang Zhou,Dawei Yan,Ying Li,Haokui Zhang,Rong Xiao

Main category: cs.CV

TL;DR: This paper introduces UVLM, an underwater observation benchmark created using a combination of human expertise and AI models, aiming to improve underwater world understanding by fine-tuning VidLMs.

Details Motivation: The motivation behind this work is to address the gap in existing research which primarily focuses on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. Method: The method involves the creation of UVLM, an underwater observation benchmark built through a collaborative approach combining human expertise and AI models. The dataset was constructed by considering typical underwater challenges like light variations, water turbidity, and diverse viewing angles, ensuring data diversity across frame rates, resolutions, marine animals, static plants, and terrains. Task diversity was achieved with a structured design categorizing observation targets into biological and environmental classes, each including content observation and change/action observation, resulting in 20 distinct task types. Challenging evaluation metrics were also designed for quantitative comparison and analysis. Result: The experiments on two representative VidLMs demonstrate that fine-tuning them on UVLM significantly enhances their ability to understand underwater environments, indicating potential for slight improvements in performance on existing in-air VidLM benchmarks. Conclusion: The paper concludes that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks such as VideoMME and Perception text. Abstract: Recently, the remarkable success of large language models (LLMs) has achieved a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. 
To overcome this gap, we introduce UVLM, an underwater observation benchmark which is built through a collaborative approach combining human expertise and AI models. To ensure data quality, we have conducted in-depth considerations from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception text. The dataset and prompt engineering will be released publicly.

[58] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

Seokyeong Lee,Sithu Aung,Junyong Choi,Seungryong Kim,Ig-Jae Kim,Junghyun Cho

Main category: cs.CV

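TL;DR: To address data scarcity and 2D-to-3D ambiguity in monocular 3D object detection, this work proposes a pseudo-labeling framework that needs no multi-view setup or additional sensors. It extracts 3D attributes robustly by aggregating pseudo-LiDAR across frames, and experiments confirm its accuracy and scalability.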

Details Motivation: Monocular 3D object detection (M3OD) has long suffered from data scarcity caused by high annotation costs, as well as inherent 2D-to-3D ambiguity. Weakly supervised and pseudo-labeling methods have been proposed to address these issues, but most are limited by domain-specific learning or rely solely on shape information from a single observation. Method: The authors explore a technique that uses object point tracking to aggregate the pseudo-LiDARs of static and dynamic objects across temporally adjacent frames, enabling 3D attribute extraction where 3D data acquisition is infeasible. Result: Extensive experiments show that the method delivers reliable accuracy and strong scalability, making it a practical and effective solution for M3OD. Conclusion: The paper proposes a pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
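As a rough illustration of what a "pseudo-LiDAR" is, the sketch below lifts a per-pixel depth map into camera-frame 3D points with the standard pinhole model. It is not the PLOT pipeline: the function name `backproject_depth`, the list-of-rows depth format, and the intrinsics `fx, fy, cx, cy` are assumptions made for this example, and the tracking-based aggregation across frames described in the abstract is not shown.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a dense depth map to camera-frame 3D points (a pseudo-LiDAR cloud).

    depth is a list of rows; depth[v][u] is the depth Z at pixel (u, v).
    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no valid depth estimate at this pixel
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with one invalid pixel and toy intrinsics.
toy_depth = [[2.0, 2.0],
             [0.0, 4.0]]
toy_points = backproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

In a real setting the depth map would come from a monocular depth estimator and the intrinsics from the camera calibration, both of which are outside the scope of this sketch.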

[59] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

Byung Hyun Lee,Wongi Jeong,Woojae Han,Kyoungbun Lee,Se Young Chun

Main category: cs.CV

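TL;DR: This paper proposes CoMEL, a multiple instance learning framework that improves localization and adaptability on large-scale images by reducing forgetting, achieving notable performance gains.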

Details Motivation: To address the high annotation cost of large-scale images such as histopathological whole slide images, and to overcome the limitations of existing methods on continual tasks and instance classification. Method: Proposes Continual Multiple Instance Learning with Enhanced Localization (CoMEL), comprising Grouped Double Attention Transformer (GDAT), Bag Prototypes-based Pseudo-Labeling (BPPL), and Orthogonal Weighted Low-Rank Adaptation (OWLoRA). Result: Extensive experiments on three public WSI datasets demonstrate CoMEL's superior performance under the continual MIL setup. Conclusion: CoMEL performs strongly under the continual MIL setup, outperforming prior methods by up to 11.00% in bag-level accuracy and up to 23.4% in localization accuracy. Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in localization accuracy under the continual MIL setup.

[60] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Taehoon Kim,Jongwook Choi,Yonghyun Jeong,Haeun Noh,Jaejun Yoo,Seungryul Baek,Jongwon Choi

Main category: cs.CV

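TL;DR: This paper proposes a deepfake video detection method that exploits pixel-wise temporal inconsistencies via a 1D Fourier transform, combined with an attention mechanism and a transformer module, and significantly improves detection performance.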

Details Motivation: Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, which makes it hard to detect temporal artifacts in the pixel plane. Method: A 1D Fourier transform is applied along the time axis for each pixel, and an attention proposal module and a joint transformer module are introduced to improve detection accuracy. Result: The framework performs well across diverse and challenging detection scenarios, precisely locating regions that contain temporal artifacts and expanding the range of detectable forgery artifacts. Conclusion: The paper presents a deepfake video detection method based on pixel-wise temporal inconsistencies that captures temporal anomalies more effectively than traditional methods. Abstract: We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
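To make the idea of a per-pixel 1D Fourier transform along the time axis concrete, here is a minimal sketch using a naive DFT from the standard library. It only covers the feature-extraction step; the attention proposal and joint transformer modules are not modeled. The function name `temporal_spectrum` and the nested-list video layout (`video[t][y][x]`) are assumptions for this example, and a practical implementation would more likely use an FFT routine from an array library.

```python
import cmath
import math


def temporal_spectrum(video):
    """Magnitude spectrum of every pixel's intensity over time.

    video[t][y][x] is a grayscale intensity; all frames share one size.
    Returns spectrum[y][x] = list of T // 2 + 1 magnitudes, one per
    non-negative temporal frequency bin (bin 0 is the DC component).
    """
    n_frames = len(video)
    height, width = len(video[0]), len(video[0][0])
    n_bins = n_frames // 2 + 1
    spectrum = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            signal = [video[t][y][x] for t in range(n_frames)]
            magnitudes = []
            for k in range(n_bins):
                # Naive DFT coefficient for temporal frequency bin k.
                coeff = sum(
                    value * cmath.exp(-2j * math.pi * k * t / n_frames)
                    for t, value in enumerate(signal)
                )
                magnitudes.append(abs(coeff))
            spectrum[y][x] = magnitudes
    return spectrum


# Toy 8-frame, 1x2 clip: a constant pixel next to one flickering at bin 2.
toy_video = [[[1.0, math.cos(2 * math.pi * 2 * t / 8)]] for t in range(8)]
toy_spectrum = temporal_spectrum(toy_video)
```

In the toy clip, the constant pixel puts all of its energy in the DC bin, while the flickering pixel peaks at the bin matching its flicker rate, which is the kind of temporal signature a spatial-frequency detector would not see.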

[61] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation

Peilin Zhang,Shaouxan Wua,Jun Feng,Zhuo Jin,Zhizezhang Gao,Jingkun Chen,Yaqiong Xing,Xiao Zhang

Main category: cs.CV

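TL;DR: This paper proposes TAB Net, a weakly supervised medical image segmentation framework that achieves performance comparable to fully supervised methods using scribble annotations.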

Details Motivation: Acquiring large-scale, fully annotated medical image datasets is time-consuming and costly. Scribble annotations, a form of sparse labeling, offer an efficient and cost-effective alternative. However, their sparsity limits feature learning of the target region and provides insufficient boundary supervision. Method: Proposes the TAB Net framework, comprising a TAS module and a BAP module. The TAS module enhances feature learning through intensity transformation, cutout, and jigsaw augmentation; the BAP module improves pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions and introducing a boundary-aware loss. Result: Experimental evaluations on two public datasets, ACDC and MSCMR seg, show that TAB Net significantly outperforms state-of-the-art scribble-based weakly supervised segmentation methods. Conclusion: TAB Net significantly outperforms existing methods and achieves performance comparable to fully supervised methods. Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model's sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. 
Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.

[62] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings

Mufhumudzi Muthivhi,Terence L. van Zyl

Main category: cs.CV

TL;DR: This paper explores self-supervised learning for wildlife re-identification, showing that it outperforms traditional supervised methods, particularly when data is limited.

Details Motivation: Current state-of-the-art models for wildlife re-identification rely on annotated datasets, which are labor-intensive to create. This study investigates the use of self-supervised learning as an alternative to reduce dependency on labeled data. Method: The study uses temporal image pairs from camera trap data to train a self-supervised learning (SSL) model without the need for class labels. These pairs are automatically extracted and used to learn representations that are evaluated against supervised features. Result: The experimental results show that self-supervised models produce more robust representations compared to supervised models. Additionally, self-supervised features perform better across all tested downstream wildlife tasks. Conclusion: Self-supervised models outperform supervised models in wildlife re-identification tasks, especially in scenarios with limited data. Abstract: Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available at https://github.com/pxpana/SSLWildlife.

[63] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration

Ayantika Das,Moitreya Chaudhuri,Koushik Bhat,Keerthi Ram,Mihail Bota,Mohanasankar Sivaprakasam

Main category: cs.CV

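TL;DR: By combining the strengths of diffusion models and auto-encoders, this paper devises a method for structuring the latent space and proposes two unsupervised image restoration techniques, showing promise for brain image analysis and image restoration.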

Details Motivation: Although diffusion models excel at generating high-quality images, their sampling mechanism cannot extract image-specific semantic representations, which auto-encoders inherently provide. Combining the two makes better use of their respective strengths. Method: 1. Devises a mechanism to structure the latent space of a diffusion auto-encoding model so as to recognize region-specific cellular patterns in brain images; 2. proposes an unsupervised tear artifact restoration technique based on neighborhood awareness; 3. proposes an unsupervised JPEG artifact restoration technique that leverages the inference-time steerable noising and denoising capability of diffusion models. Result: 1. A structured latent space that helps differentiate brain tissue types; 2. two unsupervised image restoration techniques, one for tear artifacts and one for JPEG compression artifacts. Conclusion: By combining diffusion models with auto-encoders, the study develops a method that learns image-specific representations and organizes the latent space, yielding results in brain image analysis and image restoration. Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applicabilities, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

[64] A Novel Tuning Method for Real-time Multiple-Object Tracking Utilizing Thermal Sensor with Complexity Motion Pattern

Duong Nguyen-Ngoc Tran,Long Hoang Pham,Chi Dai Tran,Quoc Pham-Nam Ho,Huy-Hung Nguyen,Jae Wook Jeon

Main category: cs.CV

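TL;DR: This paper introduces a tuning method for pedestrian tracking in thermal imagery that achieves high-accuracy, real-time tracking through two-stage optimization.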

Details Motivation: Thermal sensors offer only low-level feature representation, which makes accurate pedestrian detection and tracking difficult, so an effective tracking method is needed. Method: The framework tunes each stage with the most suitable hyperparameters to maximize tracking performance, and achieves high-accuracy real-time tracking by fine-tuning hyperparameters. Result: Extensive experiments on the PBVS Thermal MOT dataset show that the method is highly effective across various thermal camera conditions. Conclusion: The paper proposes a tuning method designed for pedestrian tracking in thermal imagery that optimizes two stages and proves robust for real-world surveillance applications. Abstract: Multi-Object Tracking in thermal images is essential for surveillance systems, particularly in challenging environments where RGB cameras struggle due to low visibility or poor lighting conditions. Thermal sensors enhance recognition tasks by capturing infrared signatures, but a major challenge is their low-level feature representation, which makes it difficult to accurately detect and track pedestrians. To address this, the paper introduces a novel tuning method for pedestrian tracking, specifically designed to handle the complex motion patterns in thermal imagery. The proposed framework optimizes two stages, ensuring that each stage is tuned with the most suitable hyperparameters to maximize tracking performance. By fine-tuning hyperparameters for real-time tracking, the method achieves high accuracy without relying on complex reidentification or motion models. Extensive experiments on the PBVS Thermal MOT dataset demonstrate that the approach is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications.

[65] Privacy-preserving Preselection for Face Identification Based on Packing

Rundong Xin,Taotao Wang,Jin Wang,Chonghe Zhao,Jing Wang

Main category: cs.CV

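TL;DR: This paper proposes PFIP, an efficient ciphertext-domain face identification scheme whose preselection mechanism and packing module markedly improve retrieval efficiency while preserving identification accuracy.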

Details Motivation: As ciphertext template libraries grow, face identification becomes increasingly time-consuming, so a more efficient approach to privacy-preserving face identification is needed. Method: Proposes PFIP, a ciphertext-domain face identification scheme with a preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Result: Experiments on the LFW and CASIA datasets show that PFIP preserves the accuracy of the original face recognition model, retrieving 1,000 ciphertext face templates within 300 milliseconds with a 100% hit rate. Conclusion: PFIP achieves efficient ciphertext-domain face identification while preserving the original model's accuracy, with a nearly 50x improvement in retrieval efficiency over existing approaches. Abstract: Face identification systems operating in the ciphertext domain have garnered significant attention due to increasing privacy concerns and the potential recovery of original facial data. However, as the size of ciphertext template libraries grows, the face retrieval process becomes progressively more time-intensive. To address this challenge, we propose a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP). PFIP incorporates an innovative preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Extensive experiments conducted on the LFW and CASIA datasets demonstrate that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Compared to existing approaches, PFIP achieves a nearly 50x improvement in retrieval efficiency.

[66] Determination Of Structural Cracks Using Deep Learning Frameworks

Subhasis Dasgupta,Jaydip Sen,Tuhina Halder

Main category: cs.CV

TL;DR: This paper introduces an advanced deep-learning architecture combining residual U-Net models and a meta-model ensemble, significantly improving structural crack detection accuracy and efficiency over traditional methods.

Details Motivation: Structural crack detection is crucial for public safety, yet manual methods are slow, inconsistent, and error-prone. This research aims to improve detection through a novel deep-learning architecture for more reliable automated systems. Method: The research employed various configurations of residual U-Net models, known for capturing fine details effectively. These were integrated into an ensemble with a meta-model composed of convolutional blocks to enhance prediction performance. Result: The residual U-Net models outperformed existing architectures like SegNet and traditional U-Net, especially with low-resolution images. The ensemble model achieved the highest scores using metrics like Intersection over Union (IoU) and DICE coefficient, demonstrating superior accuracy. Conclusion: The study concludes that the proposed ensemble model, integrating residual U-Net models with a meta-model, surpasses traditional architectures in structural crack detection accuracy and efficiency, particularly for low-resolution imagery. Abstract: Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble's performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. 
Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving to be the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement paves the way for more reliable automated systems in structural defect monitoring tasks.
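For readers unfamiliar with the two metrics named above, the following sketch computes IoU and the Dice coefficient for a pair of binary masks. It is a generic illustration, not the paper's evaluation code; the function name `iou_and_dice`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made here.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as equal-sized row lists.

    IoU  = |P ∩ T| / |P ∪ T|
    Dice = 2 |P ∩ T| / (|P| + |T|)
    If both masks are empty, both scores are defined as 1.0 here.
    """
    intersection = union = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t
            pred_area += p
            target_area += t
    if union == 0:
        return 1.0, 1.0  # nothing predicted and nothing to find
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Hypothetical 2x2 crack masks: one overlapping pixel out of three marked.
toy_pred = [[1, 1],
            [0, 0]]
toy_target = [[1, 0],
              [1, 0]]
toy_iou, toy_dice = iou_and_dice(toy_pred, toy_target)
```

The two scores are related by Dice = 2·IoU / (1 + IoU), so they always rank a pair of masks the same way, with Dice being the more lenient number.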

[67] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars

Yiming Zhong,Xiaolin Zhang,Ligang Liu,Yao Zhao,Yunchao Wei

Main category: cs.CV

TL;DR: This paper proposes AvatarMakeup, which uses a pretrained diffusion model and a Coherent Duplication method to achieve high-quality, consistent makeup for 3D avatars.

Details Motivation: Current 3D Gaussian editing methods fall short for facial makeup: they fail to meet the basic requirements for realistic makeup effects, namely a consistent appearance during drivable expressions, identity preservation, and precise control over fine details. Method: Proposes AvatarMakeup, which uses a pretrained diffusion model to transfer makeup patterns from a single reference photo, optimizes a global UV map with the Coherent Duplication method, and finally improves makeup quality with a Refinement Module. Result: Experiments show that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation. Conclusion: AvatarMakeup delivers high-quality 3D avatar makeup and maintains consistency throughout animation. Abstract: Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. 
Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

[68] F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

Wei Li,Jingyang Zhang,Lihao Liu,Guoan Wang,Junjun He,Yang Chen,Lixu Gu

Main category: cs.CV

TL;DR: This paper proposes a new framework called I-DiPT for Test-Time Adaptation in medical data, which addresses challenges posed by unpredictable shifts in free-form data fragments, achieving better results than existing methods.

Details Motivation: The motivation is to address the practical challenge of adapting source models to unpredictable free-form domain fragments in clinical settings, where data arrives in arbitrary lengths and orders due to resource constraints and patient variability. Method: The paper proposes the Image-level Disentangled Prompt Tuning (I-DiPT) framework, which uses image-invariant and image-specific prompts. It also introduces Uncertainty-oriented Masking (UoM) and Parallel Graph Distillation (PGD) to enhance knowledge extraction and reuse. Result: Experiments on breast cancer and glaucoma classification show that the proposed I-DiPT framework achieves superior performance compared to existing TTA methods under the F²TTA setting. Conclusion: The paper concludes that the proposed I-DiPT framework outperforms existing TTA approaches in F²TTA for adapting source models to free-form domain fragments in medical data. Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. 
I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.

[69] Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic

Sandro Costa Magalhães,Marco Almeida,Filipe Neves dos Santos,António Paulo Moreira,Jorge Dias

Main category: cs.CV

TL;DR: This paper demonstrates how deploying optimized ANNs on FPGAs using FINN architecture improves object detection performance for robotic applications.

Details Motivation: Robots usually slow down while detecting objects, constrained by low framerate camera configurations. AMD's Vitis-AI framework doesn't fully utilize FPGAs' PL. Method: Used the FINN architecture to deploy three ANNs (MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation) into an FPGA's PL. Result: MobileNet v1 achieved a success rate of 98% and inference speed of 6611 FPS on the RG2C dataset. Conclusion: FPGAs can speed up ANNs and make them suitable for attention mechanisms. Abstract: Robots usually slow down for scanning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98 % and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.

[70] IGDNet: Zero-Shot Robust Underexposed Image Enhancement via Illumination-Guided and Denoising

Hailong Yan,Junjian Huang,Tingwen Huang

Main category: cs.CV

TL;DR: This paper proposes IGDNet, a Zero-Shot enhancement method for restoring underexposed images using only a single test image without training data or guiding priors.

Details Motivation: Current methods rely on supervised learning with paired datasets, which are impractical to collect in real-world scenarios and may over-enhance well-illuminated regions. Method: IGDNet utilizes a decomposition module and a denoising module. The decomposition module separates the image into illumination and reflection components using a dense connection network, while the denoising module enhances non-uniformly illuminated regions through an illumination-guided pixel adaptive correction method. Result: Extensive experiments on four public datasets show that IGDNet significantly improves visual quality under complex lighting conditions, outperforming 14 state-of-the-art unsupervised methods with a PSNR of 20.41 dB and an SSIM of 0.860. Conclusion: IGDNet is a promising solution for restoring underexposed images in real-world scenarios where paired datasets are not available. Abstract: Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. To address these issues, we propose IGDNet, a Zero-Shot enhancement method that operates solely on a single test image, without requiring guiding priors or training data. IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. A noise pair is generated through downsampling and refined iteratively to produce the final result. 
Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. The code will be released soon.
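As background for the PSNR figure quoted above, the sketch below shows how PSNR is conventionally computed from the mean squared error between a reference image and a restored one. It is a generic definition rather than IGDNet's evaluation script; the function name `psnr`, the nested-list image format, and the default peak value of 255 (8-bit images) are assumptions for this example.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized images.

    PSNR = 10 * log10(max_value^2 / MSE), where MSE is the mean squared
    pixel difference. Identical images give infinite PSNR.
    """
    squared_error = 0.0
    n_pixels = 0
    for ref_row, res_row in zip(reference, restored):
        for r, s in zip(ref_row, res_row):
            squared_error += (r - s) ** 2
            n_pixels += 1
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 images that differ by 10 grey levels at every pixel.
toy_reference = [[100, 100], [100, 100]]
toy_restored = [[110, 110], [110, 110]]
toy_psnr = psnr(toy_reference, toy_restored)
```

Because PSNR is logarithmic, each factor-of-ten reduction in MSE adds 10 dB, which is why differences of a fraction of a dB between methods can still be meaningful.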

[71] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

Weiwei Duan,Luping Ji,Shengjia Chen,Sicheng Zhu,Jianghong Huang,Mao Ye

Main category: cs.CV

TL;DR: This paper introduces a weakly-supervised contrastive learning scheme that reduces manual annotation needs in moving infrared small target detection, often outperforming early fully-supervised methods and reaching over 90% of the performance of state-of-the-art fully-supervised ones.

Details Motivation: Manual annotation for video sequences in moving infrared small target detection is expensive and time-consuming, so reducing annotation requirements is crucial. Method: A new weakly-supervised contrastive learning (WeCoL) scheme is proposed, incorporating a potential target mining strategy, contrastive learning, and long-short term motion-aware learning. Result: Experiments on DAUB and ITSDT-15K datasets demonstrate the effectiveness of the proposed method, showing it often surpasses early fully-supervised approaches. Conclusion: The proposed weakly-supervised scheme outperforms early fully-supervised methods and reaches over 90% of SOTA fully-supervised methods. Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential for reducing annotation requirements. 
To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme that requires only simple target quantity prompts during model training. Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation. Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace. Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets. The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Its performance could even reach over 90\% of state-of-the-art (SOTA) fully-supervised ones.

[72] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

Gaochao Song,Zibo Zhao,Haohan Weng,Jingbo Zeng,Rongfei Jia,Shenghua Gao

Main category: cs.CV

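TL;DR: Mesh Silksong is a novel, compact, and efficient mesh representation that generates polygon meshes auto-regressively in a manner akin to silk weaving, substantially reducing token-sequence redundancy and improving geometric integrity.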

Details Motivation: Existing mesh tokenization methods produce token sequences with repeated vertex tokens, wasting network capacity. Mesh Silksong is proposed to generate polygon meshes more efficiently. Method: Mesh Silksong tokenizes a mesh by visiting each vertex only once, reducing token-sequence redundancy by 50% and achieving a state-of-the-art compression rate of approximately 22%. Result: Mesh Silksong generates polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals; experiments demonstrate its effectiveness and significantly improved geometric integrity. Conclusion: Mesh Silksong is an effective mesh generation method that achieves intricate mesh generation together with significantly improved geometric integrity. Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence's redundancy by 50%, and achieves a state-of-the-art compression rate of approximately 22%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.
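
To make the "repeated vertex tokens" point concrete, here is a minimal Python sketch that counts vertex tokens on a toy mesh. It is not the Mesh Silksong tokenizer: the two function names and the tetrahedron are hypothetical, and the visit-once variant deliberately omits the connectivity information a real scheme would also have to encode.

```python
# Illustrative only: this is NOT the Mesh Silksong tokenizer.

def face_wise_vertex_tokens(faces):
    """Naive tokenization: emit every vertex of every face, in face order."""
    return [v for face in faces for v in face]


def visit_once_vertex_tokens(faces):
    """Emit each vertex the first time it is encountered and never again.

    Connectivity is NOT encoded here; a usable scheme must store it separately.
    """
    seen = set()
    tokens = []
    for face in faces:
        for v in face:
            if v not in seen:
                seen.add(v)
                tokens.append(v)
    return tokens


# Hypothetical toy mesh: a tetrahedron (4 vertices, 4 triangular faces).
tetra_faces = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
naive = face_wise_vertex_tokens(tetra_faces)
once = visit_once_vertex_tokens(tetra_faces)
print(len(naive), len(once))
```

On this toy mesh every vertex belongs to three faces, so the face-wise sequence repeats each vertex three times. The savings on real meshes depend on vertex valence and on how connectivity is stored.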

[73] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios

Teng Fu,Yuwen Chen,Zhuofan Chen,Mengyang Zhao,Bin Li,Xiangyang Xue

Main category: cs.CV

TL;DR: This paper introduces CrowdTrack, a new large-scale dataset for multi-pedestrian tracking designed to improve algorithm performance in complex, real-world scenarios.

Details Motivation: Existing MOT datasets suffer from simple scene composition and non-realistic scenarios, making them inadequate for training robust tracking algorithms. The authors aim to address this by introducing a more realistic and challenging dataset. Method: The authors propose a large-scale, challenging dataset for multi-pedestrian tracking, named CrowdTrack, collected from first-person views in real-life complex scenarios. They analyze the dataset and test multiple state-of-the-art models along with foundation models. Result: CrowdTrack consists of 33 videos with 5,185 trajectories, each annotated with bounding boxes and unique object IDs. Multiple SOTA and foundation models were tested on this dataset. Conclusion: The paper concludes that the proposed CrowdTrack dataset will facilitate the development of more effective multi-pedestrian tracking algorithms in complex real-life scenarios. Abstract: Multi-object tracking is a classic field in computer vision. Among them, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from the annotated data is the simplest solution, the existing MOT dataset fails to satisfy this solution. Existing methods mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing dataset do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. 
To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack.

[74] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

Zunhui Xia,Hongxing Li,Libin Lan

Main category: cs.CV

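TL;DR: MedFormer is an efficient vision transformer for medical image recognition that improves performance and applicability through a new structural design and attention mechanism.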

Details Motivation: To address the task specificity, high computational cost, and suboptimal performance that existing vision transformer-based methods encounter in medical image recognition. Method: A new medical vision transformer, MedFormer, is proposed; its main innovations are a pyramid scaling structure and a Dual Sparse Selection Attention (DSSA) mechanism, supported by theoretical analysis and experimental validation. Result: Experiments on a variety of medical image datasets show that MedFormer performs well on image classification, semantic segmentation, and lesion detection, with high performance and computational efficiency. Conclusion: MedFormer is an efficient medical vision transformer that improves the performance, generality, and computational efficiency of medical image recognition through its pyramid scaling structure and Dual Sparse Selection Attention mechanism. Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers.
Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at https://github.com/XiaZunhui/MedFormer.

[75] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy

Luca Parolari,Andrea Cherubini,Lamberto Ballan,Carlo Biffi

Main category: cs.CV

TL;DR: This paper introduces a novel supervised contrastive loss and temporal adjacency constraint to improve automated polyp counting in colonoscopy by incorporating temporal awareness, achieving a 2.2x reduction in fragmentation rate and establishing a new state-of-the-art.

Details Motivation: Existing methods neglect temporal relationships in tracklet feature learning and clustering stages, leading to suboptimal polyp counting. Method: Supervised contrastive loss with temporally-aware soft targets and temporal adjacency constraint for tracklet clustering. Result: 2.2x reduction in fragmentation rate compared to prior approaches. Conclusion: Temporal awareness is important for polyp counting and the proposed method establishes a new state-of-the-art. Abstract: Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
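
One way to read "supervised contrastive loss with temporally-aware soft targets" is sketched below in plain Python. This is an assumed formulation, not the authors' loss: the exponential decay of target weight with temporal distance, the `time_scale` and `temp` parameters, and the function name are all hypothetical. For each anchor tracklet, same-polyp tracklets receive target mass that shrinks as they get further away in time, and the loss is the cross-entropy between those targets and a softmax over similarities.

```python
import math


def soft_target_contrastive_loss(sim, labels, times, temp=0.1, time_scale=10.0):
    """Cross-entropy between temporally weighted targets and softmax similarities.

    sim[i][j] : similarity between tracklet embeddings i and j
    labels[i] : polyp identity of tracklet i
    times[i]  : timestamp (e.g. frame index) of tracklet i
    The exponential temporal weighting is an assumption made for illustration.
    """
    n = len(labels)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Soft targets: only same-identity tracklets get mass, decaying with time gap.
        weights = [
            math.exp(-abs(times[i] - times[j]) / time_scale)
            if labels[j] == labels[i] else 0.0
            for j in others
        ]
        norm = sum(weights)
        if norm == 0.0:
            continue  # anchor has no positive, so it contributes nothing
        targets = [w / norm for w in weights]
        # Numerically stable log-softmax over the anchor's similarities.
        logits = [sim[i][j] / temp for j in others]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total -= sum(t * (l - log_norm) for t, l in zip(targets, logits))
        count += 1
    return total / count if count else 0.0


# Hypothetical toy data: two polyps, two tracklets each.
labels = [0, 0, 1, 1]
times = [0, 1, 50, 51]
aligned = [[1.0 if labels[i] == labels[j] else 0.0 for j in range(4)] for i in range(4)]
misaligned = [[0.0 if labels[i] == labels[j] else 1.0 for j in range(4)] for i in range(4)]
loss_aligned = soft_target_contrastive_loss(aligned, labels, times)
loss_misaligned = soft_target_contrastive_loss(misaligned, labels, times)
```

The loss is small when embeddings of the same polyp are already similar and large when they are not, which is the behavior a contrastive objective of this kind is meant to reward.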

[76] MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

Hyunsoo Son,Jeonghyun Noh,Suemin Jeon,Chaoli Wang,Won-Ki Jeong

Main category: cs.CV

TL;DR: This paper proposes MC-INR, a new framework for neural representations that improves the encoding of complex, multivariate scientific data on unstructured grids, overcoming key limitations of existing methods.

Details Motivation: Current INR-based methods have limitations in representing complex structures, handling only single-variable data, and relying on structured grids. These drawbacks reduce their effectiveness for real-world datasets. Method: The paper introduces MC-INR, which combines meta-learning and clustering to encode complex structures. It also implements a residual-based dynamic re-clustering mechanism and a branched layer to handle multivariate data efficiently. Result: Experimental results show that MC-INR outperforms existing methods in scientific data encoding tasks, particularly for complex and multivariate data on unstructured grids. Conclusion: MC-INR provides a more effective solution for encoding complex, multivariate scientific data on unstructured grids compared to existing INR-based methods. Abstract: Implicit Neural Representations (INRs) are widely used to encode data as continuous functions, enabling the visualization of large-scale multivariate scientific simulation data with reduced memory usage. However, existing INR-based methods face three main limitations: (1) inflexible representation of complex structures, (2) primarily focusing on single-variable data, and (3) dependence on structured grids. Thus, their performance degrades when applied to complex real-world datasets. To address these limitations, we propose a novel neural network-based framework, MC-INR, which handles multivariate data on unstructured grids. It combines meta-learning and clustering to enable flexible encoding of complex structures. To further improve performance, we introduce a residual-based dynamic re-clustering mechanism that adaptively partitions clusters based on local error. We also propose a branched layer to leverage multivariate data through independent branches simultaneously. Experimental results demonstrate that MC-INR outperforms existing methods on scientific data encoding tasks.

[77] Automatic Labelling for Low-Light Pedestrian Detection

Dimitrios Bouzoulas,Eerik Alamikkotervo,Risto Ojala

Main category: cs.CV

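TL;DR: This study proposes an automated infrared-RGB labeling pipeline to address the lack of large-scale public datasets for RGB pedestrian detection in low-light conditions.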

Details Motivation: Large-scale public datasets for RGB pedestrian detection in low-light conditions are lacking. Method: An automated infrared-RGB labeling pipeline is proposed, consisting of infrared detection, label transfer, and training object detection models on the generated labels. Result: On the mAP@50 and mAP@50-95 metrics, models trained on the generated labels outperformed models trained on ground-truth labels in 6 out of 9 cases. Conclusion: Models trained on generated labels outperformed those trained on ground-truth labels in 6 of 9 cases, indicating that automatically generated labels are effective for low-light pedestrian detection. Abstract: Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection that does not appear to have large public datasets is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
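
The label transfer step and the IoU test behind mAP@50 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' pipeline: it assumes the infrared and RGB frames are pixel-aligned so boxes can be copied unchanged, and the detection dictionary layout, the `min_confidence` threshold, and both function names are hypothetical.

```python
def transfer_labels(ir_detections, min_confidence=0.5):
    """Reuse confident infrared boxes as labels for the paired RGB frame.

    Assumes the IR and RGB images are pixel-aligned, so no warping is applied.
    """
    return [
        {"box": det["box"], "label": "person"}
        for det in ir_detections
        if det["score"] >= min_confidence
    ]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical infrared detections for one frame.
ir_frame = [
    {"box": (10.0, 20.0, 50.0, 120.0), "score": 0.91},
    {"box": (200.0, 30.0, 230.0, 110.0), "score": 0.22},
]
rgb_labels = transfer_labels(ir_frame)
```

Under mAP@50 a prediction counts as a match when its IoU with a ground-truth box is at least 0.5, while mAP@50-95 averages over IoU thresholds from 0.5 to 0.95.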

[78] Detecting Multiple Diseases in Multiple Crops Using Deep Learning

Vivek Yadav,Anugrah Jain

Main category: cs.CV

TL;DR: This paper introduces a deep learning model trained on a diverse dataset of crops and diseases, achieving high accuracy for early disease detection to support Indian agriculture.

Details Motivation: India's agrarian economy suffers from significant crop losses due to diseases, pests, and environmental stress. Early and accurate disease detection is vital for improving yield and food security. Method: A unified dataset comprising images of 17 crops and 34 diseases was created. A deep learning model was trained on this dataset to improve detection accuracy and coverage. Result: The proposed deep learning model achieved a detection accuracy of 99%, outperforming state-of-the-art solutions by 7%, which only cover 14 crops and 26 diseases. Conclusion: This paper concludes that the proposed deep learning model can efficiently detect multiple diseases across various crops, offering a more comprehensive and accurate solution for Indian farmers. Abstract: India, as a predominantly agrarian economy, faces significant challenges in agriculture, including substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases across different crops are critical for improving yield and ensuring food security. This paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, aiming to cover India's diverse agricultural landscape. We first create a unified dataset encompassing images of 17 different crops and 34 different diseases from various available repositories. The proposed deep learning model is trained on this dataset and outperforms the state-of-the-art in terms of accuracy and the number of crops and diseases covered. We achieve a detection accuracy of 99 percent on our unified dataset, which is 7 percent higher than the state-of-the-art, which handles only 14 crops and 26 different diseases. By improving the number of crops and types of diseases that can be detected, the proposed solution aims to provide a better product for Indian farmers.

[79] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning

Abiam Remache González,Meriem Chagour,Timon Bijan Rüth,Raúl Trapiella Cañedo,Marina Martínez Soler,Álvaro Lorenzo Felipe,Hyun-Suk Shin,María-Jesús Zamorano Serrano,Ricardo Torres,Juan-Antonio Castillo Parra,Eduardo Reyes Abad,Miguel-Ángel Ferrer Ballester,Juan-Manuel Afonso López,Francisco-Mario Hernández Tejera,Adrian Penate-Sanchez

Main category: cs.CV

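TL;DR: IMASHRIMP is an adapted system that automates the morphological analysis of white shrimp (Penaeus vannamei) and optimizes genetic selection tasks, improving the sustainability of aquaculture.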

Details Motivation: To overcome the limitations of existing methods for shrimp morphology analysis and to optimize genetic selection tasks, the authors developed IMASHRIMP, an adapted automated morphological analysis system. Method: IMASHRIMP combines deep learning and computer vision techniques modified to address the specific challenges of shrimp morphology analysis from RGBD images. Specifically, two discrimination modules based on the ResNet-50 architecture classify the images, and a "two-factor authentication (human and AI)" system is proposed. A pose estimation module adapted from VitPose predicts 23 key points, and a morphological regression module using a Support Vector Machine (SVM) model converts pixel measurements to centimeter units. Result: IMASHRIMP substantially reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. Conclusion: IMASHRIMP can make genetic selection more efficient and contribute to more sustainable aquaculture, demonstrating the potential for automation and acceleration. Abstract: This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A "two-factor authentication (human and AI)" system is proposed; it reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm.
IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public

[80] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang,Sicheng Xu,Yue Dong,Yu Deng,Jianfeng Xiang,Zelong Lv,Guangzhong Sun,Xin Tong,Jiaolong Yang

Main category: cs.CV

TL;DR: MoGe-2 improves upon its predecessor by delivering accurate metric geometry and detailed 3D reconstructions from single images.

Details Motivation: The motivation is to overcome the limitations of existing monocular geometry estimation models in achieving both accurate metric scale and detailed geometry reconstruction. Method: The method extends MoGe by incorporating strategies for metric geometry prediction and uses a data refinement approach to improve real data quality using synthetic labels. Result: The model achieves superior performance in accurate relative geometry, precise metric scale, and fine-grained detail recovery. Conclusion: MoGe-2 is capable of recovering a metric scale 3D point map with fine-grained detail from a single image, outperforming previous methods in accuracy and detail. Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. We train our model on a large corpus of mixed datasets and conducted comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.

[81] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

Buzhen Huang,Chen Li,Chongyang Xu,Dongyue Lu,Jinnan Chen,Yangang Wang,Gim Hee Lee

Main category: cs.CV

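TL;DR: This paper proposes a new dual-branch optimization framework that uses human appearance cues to address the difficulty traditional methods have in accurately estimating human pose from in-the-wild videos under occlusion and visual ambiguity.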

Details Motivation: Because of visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios, so a new method is needed. Method: The authors first train a diffusion model to learn human proxemic behavior and pose priors. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motion and appearance. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetration are also designed to assist the optimization. Result: Experiments show that the method outperforms existing approaches on several benchmarks. The authors also build an interaction dataset with pseudo ground-truth annotations as a resource for future research. The code and data are publicly available. Conclusion: The paper proposes a dual-branch optimization framework based on human appearance cues that accurately reconstructs interactive motions with plausible body contacts. With a proxemics prior and multiple constraints, the method estimates accurate interactions from in-the-wild videos captured in complex environments, and the accompanying dataset with pseudo ground-truth annotations supports future research. Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches.
The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.

[82] Parametric shape models for vessels learned from segmentations via differentiable voxelization

Alina F. Dima,Suprosanna Shit,Huaqi Qiu,Robbie Holland,Tamara T. Mueller,Fabio Antonio Musio,Kaiyuan Yang,Bjoern Menze,Rickmer Braren,Marcus Makowski,Daniel Rueckert

Main category: cs.CV

TL;DR: This paper proposes a framework that integrates voxel, mesh, and parametric vessel representations through differentiable transformations, enabling accurate modeling of complex vascular structures.

Details Motivation: To unify voxel, mesh, and parametric representations of vessels, which are typically used separately, into a single framework for improved applications in modeling complex structures. Method: The method uses differentiable voxelization to extract parametric shape models from segmentations, parametrizing vessels as centerlines and radii using cubic B-splines, and differentiably extracting high-fidelity meshes. Result: The method successfully produced high-fidelity vessel models and demonstrated accurate volumetric fits on aortas, aneurysms, and brain vessels. Conclusion: The framework accurately captures the geometry of complex vessels by integrating voxel, mesh, and parametric representations through differentiable transformations. Abstract: Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
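
The parametrization mentioned in the abstract, centerlines and radii as cubic B-splines, can be illustrated with a short Python evaluator. This is a sketch under assumptions: it uses a uniform knot vector, packs each control point as a hypothetical `(x, y, z, radius)` tuple, and covers only curve evaluation. The differentiable voxelization and mesh extraction described in the paper are not reproduced here.

```python
def cubic_bspline_basis(t):
    """Uniform cubic B-spline blending weights for a local parameter t in [0, 1]."""
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )


def eval_vessel(control_points, u):
    """Evaluate (x, y, z, radius) at curve parameter u in [0, n - 3].

    control_points is a list of n >= 4 tuples (x, y, z, radius).
    """
    n = len(control_points)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    segment = min(max(int(u), 0), n - 4)  # which 4-point window is active
    t = u - segment                       # local parameter inside that window
    weights = cubic_bspline_basis(t)
    window = control_points[segment:segment + 4]
    return tuple(
        sum(w * p[k] for w, p in zip(weights, window))
        for k in range(4)
    )


# Hypothetical straight vessel along x whose radius grows linearly.
controls = [(float(i), 0.0, 0.0, 1.0 + 0.1 * i) for i in range(6)]
start = eval_vessel(controls, 0.0)
```

Because the four weights are smooth polynomials that always sum to one, position and radius vary smoothly across segment boundaries, which is the "smoothness and continuity by construction" property the abstract refers to.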

[83] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

Tan Pan,Zhaorui Tan,Kaiyu Guo,Dongli Xu,Weidi Xu,Chen Jiang,Xin Guo,Yuan Qi,Yuan Cheng

Main category: cs.CV

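TL;DR: This paper proposes S²DC, a new self-supervised learning framework for 3D medical images that improves representation learning by jointly considering semantic discrepancy between structures and semantic consistency within structures.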

Details Motivation: Existing self-supervised learning methods for 3D medical images usually partition images into fixed-size patches, ignoring variations in anatomical structure such as location, scale, and morphology, which are crucial for capturing meaningful distinctions. Method: The S²DC framework works in two steps: the first uses an optimal transport strategy to increase semantic discrepancy; the second improves structure-level semantic consistency based on the neighborhood similarity distribution. Result: S²DC was evaluated comprehensively across 10 datasets, 4 tasks, and 3 modalities, and consistently outperformed current state-of-the-art mSSL methods. Conclusion: By combining structure-level and patch-level representations, S²DC effectively improves structure-aware representation in 3D medical image self-supervised learning and outperforms existing methods across multiple datasets and tasks. Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.

[84] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu,Enxin Song,Wenhao Chai,Xuexiang Wen,Tian Ye,Gaoang Wang

Main category: cs.CV

TL;DR: AuroraLong uses linear RNNs to efficiently handle long video understanding, achieving performance comparable to Transformer-based models with significantly reduced computational and memory costs.

Details Motivation: The high computational complexity and memory cost of transformer-based LLMs make long video understanding challenging. AuroraLong aims to address this by using a more efficient model architecture. Method: AuroraLong replaces the traditional transformer-based LLM in MLLMs with a linear RNN language model that maintains constant-size hidden states regardless of input length. Visual token merge is also applied by sorting visual tokens in ascending order to improve efficiency. Result: Despite having only 2B parameters and being trained solely on public data, AuroraLong achieves performance comparable to larger, private dataset-trained transformer models across multiple video benchmarks. Conclusion: The use of linear RNNs in AuroraLong demonstrates the potential for democratizing long video understanding by reducing the computational barriers typically associated with transformer-based approaches. Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. 
To our best knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
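
The efficiency argument, a constant-size hidden state versus attention cost that grows quadratically with sequence length, can be illustrated with a generic diagonal linear recurrence in Python. This is not AuroraLong's backbone, which the abstract does not specify; the update rule `h = decay * h + gain * x`, the parameter names, and the helper functions are assumptions chosen for the sketch.

```python
def linear_rnn_scan(inputs, decay, gain):
    """Run h_t = decay * h_{t-1} + gain * x_t elementwise over a sequence.

    The state has len(decay) entries no matter how long `inputs` is.
    """
    state = [0.0] * len(decay)
    for x in inputs:
        state = [a * h + b * xi for a, h, b, xi in zip(decay, state, gain, x)]
    return state


def attention_score_count(seq_len):
    """Number of pairwise scores full self-attention computes for one head."""
    return seq_len * seq_len


# Hypothetical one-dimensional example with three input tokens.
final_state = linear_rnn_scan([[1.0], [1.0], [1.0]], decay=[0.5], gain=[1.0])
print(final_state, attention_score_count(3))
```

Doubling the number of input tokens doubles the work of the scan and leaves the state size unchanged, whereas the attention score count quadruples.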

[85] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development

Riccardo Gallon,Fabian Schiemenz,Alessandra Menicucci,Eberhard Gill

Main category: cs.CV

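TL;DR: This paper studies sensor faults in vision-based navigation, proposes an AI-based solution, and develops a simulation framework that generates a dataset of faulty images to improve the reliability and robustness of navigation systems.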

Details Motivation: Vision-Based Navigation (VBN) algorithms are increasingly important in space missions, and sensor faults can cause inaccurate navigation outputs or data processing failures that compromise mission objectives. Artificial Intelligence (AI) offers a powerful way to address this, but the lack of sufficient and representative datasets of faulty images is the main obstacle to its adoption. Method: Focusing on an interplanetary exploration mission scenario, the study presents a comprehensive analysis of potential fault cases in the camera sensors used in the VBN pipeline and introduces a simulation framework that recreates faulty conditions in synthetically generated images. Result: The study systematically characterizes the causes and effects of these faults, including their impact on image quality and navigation algorithm performance, along with commonly used mitigation strategies; the simulation framework enables systematic and controlled reproduction of faulty data, yielding a dataset of fault-injected images for training and testing AI-based fault detection algorithms. Conclusion: The paper proposes an AI-based fault detection approach, uses a simulation framework to generate a faulty-image dataset for training and testing it, and supports the reliability of vision-based navigation in future interplanetary exploration missions. Abstract: The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.
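
Fault injection into synthetic images can be sketched in Python as follows. The abstract does not list the fault types the authors simulate, so the dead-pixel fault used here is a generic stand-in. The function name, the `fraction` and `seed` parameters, and the list-of-lists grayscale image format are all hypothetical.

```python
import random


def inject_dead_pixels(image, fraction, seed=0):
    """Return a copy of `image` with a fraction of pixels forced to 0, plus a mask.

    image    : grayscale image as a list of rows of intensities
    fraction : share of pixels to corrupt, in [0, 1]
    seed     : fixes the random draw so the faulty data is reproducible
    Dead pixels are a generic example fault, not necessarily one from the paper.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    faulty = [row[:] for row in image]            # leave the input untouched
    mask = [[0] * width for _ in range(height)]   # 1 marks an injected fault
    n_faults = int(round(fraction * height * width))
    coords = [(r, c) for r in range(height) for c in range(width)]
    for r, c in rng.sample(coords, n_faults):
        faulty[r][c] = 0
        mask[r][c] = 1
    return faulty, mask


# Hypothetical 10x10 uniform gray frame with 10% dead pixels.
clean = [[128] * 10 for _ in range(10)]
faulty, mask = inject_dead_pixels(clean, fraction=0.1, seed=42)
```

Returning the mask alongside the corrupted frame gives a per-pixel ground truth, which is what a supervised fault detector would need, and the fixed seed keeps the reproduction controlled.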

[86] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou,Yunpeng Luo,Yuanchen Wu,Ke Sun,Jiayi Ji,Ke Yan,Shouhong Ding,Xiaoshuai Sun,Yunsheng Wu,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces AIGI-Holmes, a new approach for detecting AI-generated images using a comprehensive dataset, structured training framework, and collaborative decoding strategy, enabling accurate detection with human-understandable explanations.

Details Motivation: The motivation stems from the misuse of highly realistic AI-generated images in spreading misinformation and the limitations of current detection techniques, which lack human-verifiable explanations and generalization capabilities. Method: The researchers introduced a novel dataset (Holmes-Set), an efficient data annotation method (Multi-Expert Jury), and a three-stage training framework (Holmes Pipeline). They also employed a collaborative decoding strategy during inference to enhance performance. Result: Extensive experiments demonstrated the effectiveness of the AIGI-Holmes model on three benchmarks, showcasing its ability to detect AI-generated images while generating human-aligned explanations. Conclusion: The study concludes that the proposed AIGI-Holmes model, along with the Holmes Pipeline and collaborative decoding strategy, effectively addresses the challenges of detecting AI-generated images with human-verifiable explanations and improved generalization. Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. 
In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

[87] Learning few-step posterior samplers by unfolding and distillation of diffusion models

Charlesquin Kemajou Mbakam,Jonathan Spence,Marcelo Pereyra

Main category: cs.CV

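TL;DR: This paper introduces a novel framework that combines deep unfolding and model distillation to turn a diffusion model into an efficient few-step posterior sampler, improving accuracy and flexibility in Bayesian computational imaging.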

Details Motivation: The two main existing strategies, Plug-and-Play methods and specialized conditional diffusion models, are limited by approximation errors, the need for supervised training, or inference speed; the authors propose this framework to overcome those limitations. Method: The framework combines deep unfolding and model distillation to transform a diffusion model into a few-step conditional model for posterior sampling, and is the first to apply deep unfolding to a Monte Carlo sampling scheme (specifically, the LATINO Langevin sampler). Result: Extensive experiments and comparisons with state-of-the-art methods show that the unfolded and distilled samplers achieve excellent accuracy and computational efficiency while retaining the ability to adapt to changes in the forward model at inference time. Conclusion: The framework delivers strong accuracy and computational efficiency while remaining flexible, offering a new way to use diffusion models in Bayesian computational imaging. Abstract: Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
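As a rough illustration of what unfolding a Langevin-type sampler can mean, the block below writes out a generic unadjusted Langevin iteration for posterior sampling and its truncation to a small fixed number of steps. This is a textbook form given under stated assumptions, not the LATINO sampler itself, whose update differs in its details; the symbols $\delta_k$, $K$, and $\xi_k$ are notation introduced here.

```latex
% Generic unadjusted Langevin step targeting the posterior p(x | y):
x_{k+1} = x_k + \delta_k \, \nabla_x \log p(x_k \mid y) + \sqrt{2\delta_k}\, \xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I),

% with the posterior score split by Bayes' rule into likelihood and prior terms:
\nabla_x \log p(x \mid y) = \nabla_x \log p(y \mid x) + \nabla_x \log p(x).

% Deep unfolding (sketch): fix a small number of steps K and treat the
% per-step quantities, e.g. \delta_0, \dots, \delta_{K-1}, as trainable parameters.
```

In such a scheme the prior score would come from the diffusion model, while the likelihood term depends on the forward model, which is one way a sampler can stay adaptable to forward-model changes at inference time.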

[88] APT: Adaptive Personalized Training for Diffusion Models with Limited Data

JungWoo Chae,Jiyoon Kim,JaeWoong Choi,Kyungyul Kim,Sangheum Hwang

Main category: cs.CV

TL;DR: This paper proposes APT, a framework that enables effective personalization of diffusion models using limited data by addressing overfitting and maintaining model coherence.

Details Motivation: Personalizing diffusion models with limited data faces challenges like overfitting, loss of prior knowledge, and degradation of text alignment. Method: APT includes Adaptive Training Adjustment, Representation Stabilization, and Attention Alignment for Prior Knowledge Preservation. Result: APT generates high-quality, diverse images while effectively preventing overfitting and preserving model performance. Conclusion: APT successfully mitigates overfitting, maintains semantic coherence and prior knowledge, and outperforms existing methods in image generation with limited data. Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. 
Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
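
The Representation Stabilization component regularizes the mean and variance of intermediate feature maps. A minimal sketch of one such penalty follows, assuming the reference statistics come from the pretrained model's features; the function name `stabilization_penalty` and the squared-difference form are hypothetical choices for illustration, not the paper's loss.

```python
from statistics import fmean, pvariance


def stabilization_penalty(finetuned, pretrained):
    """Penalize drift in the mean and variance of a flattened feature map.

    Both arguments are flat lists of activations. Comparing against the
    pretrained model's statistics is an assumption made for this sketch.
    """
    mean_shift = (fmean(finetuned) - fmean(pretrained)) ** 2
    var_shift = (pvariance(finetuned) - pvariance(pretrained)) ** 2
    return mean_shift + var_shift


# Features shifted by a constant change the mean but not the variance.
penalty = stabilization_penalty([2.0, 4.0], [1.0, 3.0])
```

A term like this would typically be added to the fine-tuning loss with a small weight, so that the personalized model can still change while large statistical shifts are discouraged.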

[89] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

Xiangyang Luo,Ye Zhu,Yunfei Liu,Lijian Lin,Cong Wan,Zijian Cai,Shao-Lun Huang,Yu Li

Main category: cs.CV

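TL;DR: This paper proposes CanonSwap, a new video face-swapping framework that decouples motion from appearance to achieve more precise identity transfer while preserving the target face's dynamic attributes. It also introduces a Partial Identity Modulation module and synchronization metrics, and experiments show it outperforms existing methods.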

Details Motivation: Existing methods focus mainly on high-quality identity transfer but often fail to preserve the target face's dynamic attributes (such as head pose, facial expression, and lip-sync), which leads to inconsistent results; the authors attribute this to the inherent coupling of facial appearance and motion in videos. Method: CanonSwap is a new video face-swapping framework that removes motion-related information and performs identity modification in a unified canonical space; it also includes a Partial Identity Modulation module and fine-grained synchronization metrics to improve identity transfer precision and evaluation. Result: Experiments show that CanonSwap clearly outperforms existing techniques in visual quality, temporal consistency, and identity preservation; the project page is publicly available. Conclusion: CanonSwap effectively resolves the trade-off between identity transfer and preservation of dynamic attributes in video face swapping, significantly outperforming existing methods in visual quality, temporal consistency, and identity preservation. Abstract: Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.

[90] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

Qi Xu,Dongxu Wei,Lingzhe Zhao,Wenpu Li,Zhangchi Huang,Shunping Ji,Peidong Liu

Main category: cs.CV

TL;DR: This paper introduces SIU3R, an alignment-free framework for simultaneous 3D reconstruction and understanding from unposed images, achieving state-of-the-art results by leveraging pixel-aligned 3D representations and unified learnable queries.

Details Motivation: To overcome the limitations of existing 2D-to-3D feature alignment approaches, which result in limited 3D understanding and semantic information loss. Method: SIU3R uses a pixel-aligned 3D representation to bridge reconstruction and understanding tasks and unifies multiple understanding tasks into learnable queries. It also includes two lightweight modules to enhance task collaboration. Result: SIU3R achieves superior performance on simultaneous understanding and 3D reconstruction from unposed images compared to existing methods. Conclusion: The proposed SIU3R framework demonstrates state-of-the-art performance on both individual and simultaneous tasks of 3D reconstruction and understanding, highlighting its effectiveness and the benefits of an alignment-free approach. Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need of alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

[91] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

Qin Guo,Ailing Zeng,Dongxu Yue,Ceyuan Yang,Yang Cao,Hanzhong Guo,Fei Shen,Wei Liu,Xihui Liu,Dan Xu

Main category: cs.CV

TL;DR: This paper introduces UniMC, a new framework for controllable multi-class image generation, and HAIG-2.9M, a large annotated dataset, to overcome limitations in generating images of non-rigid objects and overlapping instances.

Details Motivation: The motivation stems from challenges in existing keypoint-guided models regarding the generation of non-rigid objects like animals and overlapping instances, attributed to limitations in methods and datasets. Method: The method involves a DiT-based framework named UniMC that integrates instance- and keypoint-level conditions into compact tokens, along with the creation of the HAIG-2.9M dataset containing extensive annotations for humans and animals. Result: The experiments demonstrate the effectiveness of UniMC and the high quality of HAIG-2.9M, especially in handling occlusions and multi-class generation. Conclusion: The paper concludes that the proposed UniMC framework and HAIG-2.9M dataset significantly enhance keypoint-guided image generation, particularly for multi-class and occluded scenarios. Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. 
HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

[92] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Yuxuan Wang,Tianwei Cao,Huayu Zhang,Zhongjiang He,Kongming Liang,Zhanyu Ma

Main category: cs.CV

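TL;DR: FairHuman proposes a new multi-objective fine-tuning method that improves local details in human image generation and achieves a better balance between global and local quality.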

Details Motivation: Existing large-scale text-to-image models struggle to generate realistic details such as faces or hands in human images because local regions receive insufficient supervision during training. Method: FairHuman is a multi-objective fine-tuning approach with one global objective and two local objectives (for hands and faces), optimized fairly under the Minimum Potential Delay (MPD) criterion. Result: Experiments show that FairHuman significantly improves human image generation across different scenarios, especially for challenging local details. Conclusion: FairHuman effectively improves the quality of human image generation, with notable gains on details such as hands and faces. Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
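
For readers unfamiliar with the MPD criterion, the block below gives its usual statement from the resource-allocation literature, where it is the $\alpha = 2$ case of $\alpha$-fair utility. How the paper maps the three learning objectives onto the quantities $x_i$ is not stated in the abstract; reading $x_i > 0$ as a per-objective improvement measure is an assumption made here.

```latex
% alpha-fair utility and its alpha = 2 special case (minimum potential delay):
U_\alpha(x) = \frac{x^{1-\alpha}}{1-\alpha} \quad (\alpha \neq 1),
\qquad U_2(x) = -\frac{1}{x}.

% MPD-fair allocation over the three objectives (global, hands, faces):
\max_{x_1, x_2, x_3 > 0} \; \sum_{i=1}^{3} -\frac{1}{x_i},
\qquad \frac{\partial}{\partial x_i}\left(-\frac{1}{x_i}\right) = \frac{1}{x_i^{2}}.
```

Because the marginal utility $1/x_i^2$ grows as $x_i$ shrinks, an objective that is lagging behind receives more weight, which is consistent with the stated goal of improving hands and faces without sacrificing overall quality.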

[93] Prompt learning with bounding box constraints for medical image segmentation

Mélanie Gaillochet,Mehrdad Noori,Sahar Dastani,Christian Desrosiers,Hervé Lombaert

Main category: cs.CV

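TL;DR: This paper proposes a weakly supervised segmentation method that automatically generates prompts for foundation models using only bounding box annotations and achieves strong performance.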

Details Motivation: Pixel-wise annotation is laborious and costly in the medical domain, and weakly supervised methods based on bounding box annotations offer a practical alternative. However, existing prompt learning approaches depend on fully annotated segmentation masks, so a more annotation-efficient method is needed. Method: The approach uses bounding box annotations to automate prompt generation for foundation models, and optimizes the segmentation by combining multiple constraints derived from the box annotations with pseudo-labels generated by the prompted foundation model. Result: Extensive experiments on multimodal datasets show that the weakly supervised method reaches an average Dice score of 84.90% in a limited data setting, outperforming existing fully supervised and weakly supervised methods. Conclusion: The paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation, automatically generating prompts for foundation models from bounding box annotations alone. Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
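
The reported metric and the idea of a box-derived constraint can be made concrete with a small sketch. The Dice coefficient below follows its standard definition; `outside_box_fraction` is a hypothetical stand-in for one kind of constraint a bounding box can provide (foreground should not fall outside the box) and is not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def outside_box_fraction(mask, box):
    """Fraction of foreground pixels in a 2D mask that lie outside the box.

    `box` is (row_min, col_min, row_max, col_max), inclusive on all sides.
    """
    row_min, col_min, row_max, col_max = box
    foreground = outside = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                foreground += 1
                if not (row_min <= r <= row_max and col_min <= c <= col_max):
                    outside += 1
    return 0.0 if foreground == 0 else outside / foreground
```

For example, `dice_score([1, 1, 0, 0], [1, 0, 0, 0])` has one overlapping pixel out of three foreground pixels in total, giving 2/3.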

[94] DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He,Danshi Li,Xinqiang Yu,Zekun Qi,Wenyao Zhang,Jiayi Chen,Zhaoxiang Zhang,Zhizheng Zhang,Li Yi,He Wang

Main category: cs.CV

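TL;DR: This paper proposes DexVLG, which uses a large-scale synthetic dataset to predict dexterous grasps from language instructions and shows strong zero-shot generalization.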

Details Motivation: Large models are making progress in robotics, but functional grasping with human-like dexterous hands remains little studied, mainly because data collection is difficult. Method: The authors build a large-scale dataset, DexGraspNet 3.0, and design the DexVLG model, which combines a vision-language model with flow matching to predict grasp poses. Result: DexVLG achieves over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, and performs part-aligned grasps in real-world scenarios. Conclusion: DexVLG predicts accurate grasp poses from language instructions and shows strong zero-shot generalization, offering a new approach to functional grasping with human-like dexterous hands. Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.

[95] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics

Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen

Main category: cs.CV

TL;DR: This paper introduces MANO, a novel approach to attention computation in Transformers, achieving linear complexity and maintaining performance while reducing resource consumption.

Details Motivation: Standard Transformers have quadratic complexity with respect to input length, making them impractical for high-resolution inputs. Existing variants often lose fine-scale details due to patchification or downsampling techniques. Method: Inspired by n-body numerical simulations, the Multipole Attention Neural Operator (MANO) computes attention in a distance-based multiscale fashion. Result: Empirical results show that MANO rivals state-of-the-art models like ViT and Swin Transformer while significantly reducing runtime and peak memory usage. Conclusion: The proposed MANO model achieves linear time and memory complexity while maintaining a global receptive field in attention computation. Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. 
We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.

[96] Partial Weakly-Supervised Oriented Object Detection

Mingxin Liu,Peiyuan Zhang,Yuan Liu,Wei Zhang,Yue Zhou,Ning Liao,Ziyang Gong,Junwei Luo,Zhirui Wang,Yi Yu,Xue Yang

Main category: cs.CV

TL;DR: This paper proposes a new framework for oriented object detection that reduces annotation costs by using weak annotations and unlabeled data effectively.

Details Motivation: The high annotation cost of oriented object detection motivates the search for more efficient methods that can use weak annotations and unlabeled data. Method: The paper introduces three components: (1) the PWOOD framework based on partially weak annotations, (2) the OS-Student model to learn orientation and scale information, and (3) CPF strategy to reduce sensitivity to filtering thresholds. Result: Experiments on DOTA and DIOR datasets show that the PWOOD framework performs comparably or better than traditional semi-supervised algorithms while reducing annotation costs. Conclusion: The proposed PWOOD framework provides a cost-effective solution for oriented object detection by efficiently leveraging unlabeled data and weak annotations, outperforming traditional semi-supervised methods. Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. 
To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which efficiently leverages large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering (CPF) strategy to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.

[97] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images

Danrong Zhang,Huili Huang,N. Simrill Smith,Nimisha Roy,J. David Frost

Main category: cs.CV

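TL;DR: This study uses semantic segmentation and a new scoring system to objectively quantify damage severity in post-earthquake social media images, improving the efficiency of disaster response.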

Details Motivation: Traditional approaches to post-earthquake social media images are subjective and cannot account for varying extents of damage within an image, so a more objective analysis method is needed. Method: The authors construct a segmented damage severity dataset and fine-tune a SegFormer model to produce damage severity segmentations for post-earthquake social media images. They also introduce a new damage scoring system that accounts for depth estimation. Result: The approach quantifies damage severity in social media images more objectively and comprehensively, and helps provide precise guidance to post-disaster reconnaissance teams. Conclusion: By recasting damage assessment in post-earthquake social media images as a semantic segmentation task, the study offers a more objective and comprehensive way to quantify damage and introduces a new damage severity scoring system. Abstract: In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.
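
To make the idea of a depth-adjusted severity score concrete, here is a minimal sketch under stated assumptions. The three classes come from the abstract, but the severity values in `SEVERITY`, the squared-depth pixel weight, and the function name `damage_score` are all hypothetical; the abstract does not give the paper's actual formula.

```python
# Hypothetical severity per class: 0 = undamaged structure,
# 1 = damaged structure, 2 = debris.
SEVERITY = {0: 0.0, 1: 0.5, 2: 1.0}


def damage_score(labels, depths):
    """Depth-weighted mean severity over the segmented pixels of one image.

    `labels` and `depths` are flat, equal-length lists. Each pixel is weighted
    by depth squared, since under a pinhole camera the scene area covered by
    one pixel grows with the square of its distance (an assumption about how
    a depth adjustment might be done).
    """
    if len(labels) != len(depths):
        raise ValueError("labels and depths must have the same length")
    weighted = 0.0
    total_weight = 0.0
    for label, depth in zip(labels, depths):
        weight = depth ** 2
        weighted += SEVERITY[label] * weight
        total_weight += weight
    return 0.0 if total_weight == 0 else weighted / total_weight
```

With equal depths the score reduces to the plain mean severity; a distant debris region counts for more than a nearby one of the same pixel size.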

[98] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Liheng Zhang,Lexi Pang,Hang Ye,Xiaoxuan Ma,Yizhou Wang

Main category: cs.CV

TL;DR: This paper proposes a training-free text-to-image diffusion framework that improves structural and appearance control through decoupled feature injection, achieving superior performance in zero-shot conditions.

Details Motivation: To address limitations in existing text-to-image diffusion models, such as structural misalignment, condition leakage, and visual artifacts when incorporating conditional images for spatial control. Method: A flexible feature injection framework that decouples the injection timestep from the denoising process, along with appearance-rich prompting and a restart refinement strategy. Result: State-of-the-art performance across diverse zero-shot conditioning scenarios, demonstrating improved structural and appearance fidelity compared to previous methods. Conclusion: The proposed framework provides a training-free approach that achieves both structure-rich and appearance-rich text-to-image generation, outperforming existing methods in diverse zero-shot conditioning scenarios. Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. 
In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.

[99] No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley

Main category: cs.CV

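TL;DR: This study proposes a new training-free object segmentation method that, given a few reference images, automatically generates instance-level segmentation masks and reaches state-of-the-art performance on several tasks.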

Details Motivation: To reduce the manual visual prompts or complex prompt-generation rules required by the Segment Anything Model (SAM), the work explores object segmentation when only a small set of reference images is provided. Method: Object segmentation is carried out in three stages: memory bank construction, representation aggregation, and semantic-aware feature matching. Result: The method achieves state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforms existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP). Conclusion: The paper proposes a training-free, multi-stage method that uses the strong semantic priors learned by foundation models to automatically generate instance-level segmentation masks, reducing reliance on manual prompts or complex prompt-generation rules and achieving state-of-the-art performance on multiple datasets. Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
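
The three stages named in the abstract can be sketched in a few lines, assuming features are plain vectors already extracted by some foundation model. The dictionary-based memory bank, mean-vector prototypes, and cosine-similarity matching are illustrative assumptions, and the names `aggregate` and `match` are hypothetical rather than the paper's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


def aggregate(memory_bank):
    """Stage 2: average each class's reference features into one prototype.

    `memory_bank` (stage 1) maps a class label to a list of feature vectors
    taken from the reference images.
    """
    prototypes = {}
    for label, features in memory_bank.items():
        dim = len(features[0])
        prototypes[label] = [
            sum(f[i] for f in features) / len(features) for i in range(dim)
        ]
    return prototypes


def match(target_feature, prototypes):
    """Stage 3: return (label, similarity) of the closest class prototype."""
    scored = (
        (label, cosine_similarity(target_feature, proto))
        for label, proto in prototypes.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Stage 1: a toy memory bank with 2-D features for two classes.
bank = {"cat": [[1.0, 0.0], [0.8, 0.2]], "dog": [[0.0, 1.0]]}
label, score = match([0.9, 0.1], aggregate(bank))
```

In a real pipeline the matched regions would then serve as automatic prompts for a promptable segmenter in place of manual clicks or boxes.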

[100] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

Gent Serifi,Marcel C. Bühler

Main category: cs.CV

TL;DR: This paper introduces HyperGaussians, an improved method for creating detailed and animatable 3D face avatars from monocular videos by enhancing the expressiveness and efficiency of 3D Gaussian representations.

Details Motivation: Creating high-quality animatable face avatars from monocular videos remains challenging due to limitations in handling nonlinear deformations, complex lighting, and fine facial details. The authors aim to improve the expressiveness of 3D Gaussian representations to overcome these issues. Method: The authors introduce HyperGaussians by extending 3D Gaussians to high-dimensional multivariate Gaussians, conditioned on learnable local embeddings. They use an 'inverse covariance trick' to reparameterize the covariance matrix for computational efficiency and integrate this into existing models like FlashAvatar. Result: HyperGaussians demonstrate superior performance over 3D Gaussian Splatting both numerically and visually, particularly for capturing high-frequency details such as eyeglass frames, teeth, facial movements, and specular reflections. Conclusion: HyperGaussians, a novel extension of 3D Gaussian Splatting, outperform traditional methods in representing animatable face avatars with high-frequency details. Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. 
Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug in HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.

[101] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu,Hao Li,Jiawei Chi,Hanyang Wang,Minghui Yang,Fudong Wang,Yueqi Duan

Main category: cs.CV

TL;DR: This paper proposes LangScene-X, a new framework that generates high-quality, generalizable 3D multi-modal information from sparse views.

Details Motivation: Recovering 3D structure with open-vocabulary scene understanding from 2D images is a fundamental but challenging task, and existing methods suffer from rendering artifacts and implausible semantic synthesis when only limited views are available. Method: A new generative framework is proposed, comprising a TriMap video diffusion model and a Language Quantized Compressor (LQC); language surface fields are then reconstructed to support open-ended language queries. Result: Experiments show that LangScene-X outperforms state-of-the-art methods on real-world data. Conclusion: LangScene-X is a framework that generates 3D-consistent multi-modal information from sparse views and surpasses existing techniques in quality and generalizability. Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.

[102] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

Panpan Ji,Junni Song,Hang Xiao,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper proposes the Dynamic Contrastive Dual-Path Network (DCDP-HAR) to improve multimodal sensor-based human activity recognition by addressing challenges like cross-modal feature alignment and modality imbalance.

Details Motivation: Multimodal HAR systems face challenges such as cross-modal feature alignment difficulties and imbalanced modality contributions, which this study aims to address. Method: A Dynamic Contrastive Dual-Path Network (DCDP-HAR) is proposed, incorporating a dual-path feature extraction architecture, multi-stage contrastive learning mechanism, and confidence-driven gradient modulation strategy. Result: The proposed framework achieves progressive alignment of features and improved training stability, as evidenced by ablation studies and comparative experiments on four public datasets. Conclusion: The DCDP-HAR framework demonstrates effectiveness in addressing challenges in multimodal HAR systems, as validated through ablation studies and comparative experiments on benchmark datasets. Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. 
We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.

[103] USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

Ying Yu,Hang Xiao,Siyao Li,Jiarui Li,Haotian Tang,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper proposes a comprehensive optimization approach for human activity recognition based on multi-attention interaction mechanisms, achieving superior accuracy and efficiency.

Details Motivation: HAR faces challenges such as labeled data scarcity, poor high-level feature extraction, and suboptimal model performance on lightweight devices. Method: USAD incorporates a multi-attention interaction mechanism, including unsupervised data augmentation, a multi-branch spatio-temporal interaction network, and an adaptive multi-loss function fusion strategy. Result: Experimental results show that the proposed method outperforms existing approaches with accuracies of 98.84%, 93.81%, and 80.92% on three public datasets. Conclusion: The proposed USAD method achieves state-of-the-art performance in human activity recognition and is efficient for deployment on embedded devices. Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3*3, 5*5, and 7*7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. 
Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization. Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.

[104] AnyI2V: Animating Any Conditional Image with Motion Control

Ziye Li,Hao Luo,Xincheng Shuai,Henghui Ding

Main category: cs.CV

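TL;DR: This paper proposes AnyI2V, a novel training-free video generation framework that generates videos following user-defined motion trajectories and supports a variety of input modalities as well as style transfer.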

Details Motivation: Existing T2V methods typically rely on text prompts and lack precise control over the spatial layout of generated content, while I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. AnyI2V is proposed to address these limitations. Method: AnyI2V is a new training-free framework that generates videos from user-defined motion trajectories, supports multiple kinds of conditional input, and enables style transfer and editing via LoRA and text prompts. Result: Experiments show that AnyI2V performs strongly in spatially and motion-controlled video generation and offers a new perspective on the task. Conclusion: AnyI2V is a training-free video generation framework that animates any conditional image along user-defined motion trajectories, enabling more flexible and versatile video generation. Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.

[105] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia,Bingkui Tong,Yuhang Zang,Rui Shao,Kaiyang Zhou

Main category: cs.CV

TL;DR: This paper introduces Grounded Chain-of-Thought (GCoT), an effective method for improving MLLM adaptation to specialized vision tasks by incorporating grounding information into reasoning steps, especially useful under limited data conditions.

Details Motivation: Multimodal Large Language Models (MLLMs) struggle to adapt to specialized vision tasks due to a mismatch between pre-training datasets and downstream tasks, particularly when large-scale retraining data is unavailable. Additionally, existing CoT reasoning data often contains factual errors, limiting model performance. Method: We propose Grounded Chain-of-Thought (GCoT), a bootstrapping-based method that incorporates grounding information, such as bounding boxes, into Chain-of-Thought (CoT) reasoning data. This approach aims to make the reasoning steps more aligned with input images, addressing factual errors in traditional CoT data distilled from pre-trained MLLMs. Result: The GCoT approach was evaluated on five specialized vision tasks involving various visual formats like charts, tables, receipts, and reports. Under data-limited conditions, it significantly outperformed conventional fine-tuning and distillation methods. Conclusion: The proposed Grounded Chain-of-Thought (GCoT) approach effectively improves the adaptation of Multimodal Large Language Models (MLLMs) to specialized vision tasks under data-limited regimes by enhancing reasoning step fidelity with grounding information. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. 
However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

[106] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

Xin Zhou,Dingkang Liang,Kaijin Chen,Tianrui Feng,Xiwu Chen,Hongkai Lin,Yikang Ding,Feiyang Tan,Hengshuang Zhao,Xiang Bai

Main category: cs.CV

TL;DR: EasyCache accelerates video diffusion models through a dynamic caching mechanism, significantly improving inference speed and visual quality without training.

Details Motivation: Video generation models face constraints due to slow inference speeds and substantial computational costs, primarily caused by the iterative nature of the denoising process. Addressing this bottleneck is crucial for wider adoption and real-world integration. Method: EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors to avoid redundant computations during inference. Result: The method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3 times compared to original baselines while maintaining high visual fidelity with a significant PSNR improvement of up to 36% compared to the previous SOTA method. Conclusion: EasyCache is an efficient and highly accessible solution for high-quality video generation, offering significant acceleration and visual fidelity improvements. Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. 
Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity, with a significant PSNR improvement of up to 36% compared to the previous SOTA method. This improvement makes our EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
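
The idea of reusing a previously computed transformation vector at runtime can be sketched on a toy sequence of inputs. This is only an illustration under stated assumptions, not EasyCache's actual criterion: here the "transformation vector" is taken to be output minus input, the reuse test is a relative L1 change against the input of the last full computation, and `tol`, `rel_change`, and `run_with_cache` are hypothetical names.

```python
def rel_change(current, reference):
    """Relative L1 change of `current` with respect to `reference`."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1e-12
    return num / den

def run_with_cache(inputs, model_fn, tol=0.05):
    """Apply model_fn to each input, reusing a cached delta when inputs barely move.

    Returns the outputs and the number of full model evaluations performed.
    """
    outputs, n_calls = [], 0
    cached_delta, ref_input = None, None
    for x in inputs:
        if cached_delta is not None and rel_change(x, ref_input) < tol:
            # Input is close to the last fully computed one: reuse the cached delta.
            out = [xi + di for xi, di in zip(x, cached_delta)]
        else:
            # Otherwise run the expensive model and refresh the cache.
            out = model_fn(x)
            n_calls += 1
            cached_delta = [oi - xi for oi, xi in zip(out, x)]
            ref_input = x
        outputs.append(out)
    return outputs, n_calls

# Stand-in for an expensive denoising step (assumption for the demo).
double = lambda x: [2.0 * v for v in x]
outputs, n_calls = run_with_cache([[1.0], [1.01], [2.0]], double)
```

In this toy run the second input differs from the first by about 1%, so the model is skipped for it, and only two of the three steps trigger a full evaluation. The trade-off is the usual one for caching: a larger `tol` skips more computation but lets approximation error grow.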

[107] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang,Xiaoyang Wu,Fangcheng Zhong,Hengshuang Zhao,Matthias Nießner,Joan Lasenby

Main category: cs.CV

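TL;DR: LiteReality proposes a new 3D scene reconstruction pipeline that converts RGB-D scans into compact, realistic, and interactive virtual scenes suitable for AR/VR, gaming, robotics, and digital twins.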

Details Motivation: To provide a 3D scene reconstruction pipeline that produces compact, realistic, and interactive scenes for applications such as AR/VR, gaming, robotics, and digital twins. Method: LiteReality performs scene understanding and parses the result with a structured scene graph, retrieves the most visually similar 3D models from a curated asset database, enhances realism with a Material Painting module, and integrates the reconstructed scene into a simulation engine with basic physical properties. Result: Beyond high-quality 3D scene reconstruction, LiteReality achieves state-of-the-art object retrieval performance on the Scan2CAD benchmark and introduces a robust material painting module that transfers materials at high quality even under severe misalignment, occlusion, and poor lighting. Conclusion: LiteReality is a new pipeline for RGB-D scans of indoor environments that produces compact, realistic, and interactive 3D virtual replicas, with broad applications in AR/VR, gaming, robotics, and digital twins. Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. 
Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c

[108] RefTok: Reference-Based Tokenization for Video Generation

Xiang Fan,Xiaohang Sun,Kushan Thakkar,Zhu Liu,Vimal Bhat,Ranjay Krishna,Xiang Hao

Main category: cs.CV

TL;DR: RefTok is a new method for learning video models that handles temporal redundancy better and improves video generation and compression.

Details Motivation: Existing methods handle temporal redundancy poorly and fail to capture the temporal dependencies and redundancies inherent in videos. Method: RefTok is introduced, a reference-based tokenization method that captures complex temporal dynamics and contextual information by encoding and decoding sets of frames conditioned on an unquantized reference frame. Result: Across four video datasets, RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT), improving all evaluated metrics by an average of 36.7%. Conclusion: RefTok outperforms existing approaches to learning video models, significantly improves multiple evaluation metrics, and performs strongly on video generation tasks. Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations not only outperform MAGVIT-B but the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
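
PSNR, one of the reconstruction metrics cited in this entry, follows the standard definition PSNR = 10 * log10(MAX^2 / MSE). The sketch below computes it over flat lists of pixel values; the function name and the 8-bit default `max_val=255.0` are assumptions for illustration, and the paper's exact evaluation protocol is not specified here.

```python
import math

def psnr(reference, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)

score = psnr([10, 20], [11, 21])  # every pixel off by one, so MSE = 1
```

Higher is better: halving the mean squared error raises PSNR by about 3 dB.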

[109] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

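TL;DR: This paper proposes Point3R, a framework for online dense streaming 3D reconstruction in which an explicit spatial pointer memory interacts with the latest frame to achieve efficient, unified 3D scene reconstruction.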

Details Motivation: To address the limited capacity of implicit memory in existing methods and the resulting loss of information from earlier frames, enabling more efficient dense 3D scene reconstruction. Method: Point3R, an online framework built on an explicit spatial pointer memory, is proposed, with a 3D hierarchical position embedding and a fusion mechanism designed to improve interaction efficiency. Result: The method achieves competitive or state-of-the-art performance on various tasks with low training costs. Conclusion: Point3R achieves effective online dense streaming 3D reconstruction: the explicit spatial pointer memory interacts with the latest frame to integrate observations densely into the global coordinate system. Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.
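
An explicit spatial memory in which each pointer holds a 3D position and an aggregated feature can be sketched as follows. This is a toy illustration and not Point3R's fusion mechanism: the nearest-pointer merge rule, the running-average update, the `radius` parameter, and the dictionary layout are all assumptions.

```python
import math

def fuse(memory, observations, radius=0.5):
    """Merge new (position, feature) observations into a pointer memory.

    An observation within `radius` of an existing pointer is averaged into it;
    otherwise it becomes a new pointer. This keeps pointers spatially spread out.
    """
    for pos, feat in observations:
        nearest, best = None, radius
        for pointer in memory:
            d = math.dist(pointer["pos"], pos)
            if d <= best:
                nearest, best = pointer, d
        if nearest is None:
            memory.append({"pos": list(pos), "feat": list(feat), "count": 1})
        else:
            n = nearest["count"]
            # Running average of position and feature over merged observations.
            nearest["pos"] = [(a * n + b) / (n + 1) for a, b in zip(nearest["pos"], pos)]
            nearest["feat"] = [(a * n + b) / (n + 1) for a, b in zip(nearest["feat"], feat)]
            nearest["count"] = n + 1
    return memory

memory = fuse([], [((0.0, 0.0, 0.0), [1.0]),
                   ((0.1, 0.0, 0.0), [3.0]),
                   ((5.0, 5.0, 5.0), [7.0])])
```

The first two observations fall within the merge radius and collapse into one pointer, while the distant third one creates a second pointer, so memory grows with the extent of the scene rather than with the number of frames.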