cs.CL

[1] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

Tian Lan,Xiangdong Su,Xu Liu,Ruirui Wang,Ke Chang,Jiang Li,Guanglai Gao

Main category: cs.CL

TL;DR: This paper introduces McBE, a comprehensive Chinese bias evaluation benchmark for LLMs, revealing biases in current models and emphasizing the need for culturally relevant and multi-task evaluation.

Details Motivation: Most existing bias evaluation datasets focus on English and North American culture, limiting their applicability to other cultures like Chinese. Moreover, these datasets typically support only single evaluation tasks, failing to comprehensively evaluate bias in LLMs. Method: The authors constructed a Multi-task Chinese Bias Evaluation Benchmark (McBE) with 4,077 instances covering 12 single bias categories, 82 subcategories, and 5 evaluation tasks. They evaluated several popular LLMs and conducted an in-depth analysis of the results. Result: The proposed McBE benchmark provides extensive category coverage, content diversity, and comprehensive measurement for evaluating biases in LLMs. Evaluation of several LLMs showed varying degrees of bias, highlighting the need for further mitigation strategies. Conclusion: The paper concludes that popular LLMs demonstrate varying degrees of bias, emphasizing the importance of comprehensive bias evaluation from multiple aspects and cultural perspectives. Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually only support single evaluation tasks and cannot evaluate the bias from multiple aspects in LLMs. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories, 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. 
Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

[2] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

Keyan Jin,Yapeng Wang,Leonel Santos,Tao Fang,Xu Yang,Sio Kei Im,Hugo Gonçalo Oliveira

Main category: cs.CL

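TL;DR: This paper presents the first systematic evaluation of reasoning LLMs on dialogue summarization. It finds that explicit step-by-step reasoning does not consistently improve summary quality and can instead introduce verbosity and factual errors, so more targeted modeling and evaluation strategies are needed for real-world dialogue scenarios.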

Details Motivation: Although large language models have made substantial progress on summarization, the performance of step-by-step reasoning architectures such as Long Chain-of-Thought (CoT) remains unexplored in dialogue scenarios that demand abstraction and conciseness. Method: The study comprehensively and systematically evaluates state-of-the-art reasoning and non-reasoning LLMs on three major dialogue summarization paradigms (generic, role-oriented, and query-oriented), spanning multiple languages, domains, and summary lengths. It uses strong benchmarks (SAMSum, DialogSum, and others) and an evaluation protocol that combines LLM-based automatic metrics with human-inspired criteria. Result: Contrary to trends in other reasoning-intensive tasks, explicit step-by-step reasoning brings no consistent improvement in dialogue summarization and may even hurt summary quality. Reasoning LLMs tend to produce more verbose and factually inconsistent output. Conclusion: Explicit step-by-step reasoning does not consistently improve dialogue summarization quality. Compared with non-reasoning LLMs, reasoning LLMs are more prone to verbose, factually inconsistent, and less concise summaries. The study stresses the importance of improving model design and evaluation strategies for real-world use. Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. 
Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

[3] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Wenquan Lu,Yuechuan Yang,Kyle Lee,Yanshu Li,Enqi Liu

Main category: cs.CL

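TL;DR: This paper investigates whether the depth-recurrent Transformer Huginn-3.5B performs internal chain-of-thought reasoning. The evidence is limited, and the model falls well short of models that explicitly externalize their reasoning steps.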

Details Motivation: To explore whether recurrent architectures can internalize reasoning in latent space and thereby support latent chain-of-thought (latent CoT). Method: Probing techniques such as the Logit Lens and Coda Lens are used to study Huginn-3.5B's internal behavior on arithmetic tasks. Result: Huginn-3.5B shows only limited evidence of interpretable latent CoT, and probing results are markedly inconsistent across recurrent blocks. Conclusion: Huginn-3.5B does not exhibit substantial internal chain-of-thought reasoning, and increasing recurrence depth yields only marginal gains. Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
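
The rank-trajectory probing mentioned in the abstract can be illustrated with a toy sketch: project each intermediate hidden state through an unembedding matrix and record where a target token ranks. This is not the authors' code. The function names, the plain-list representation, and the tiny 2-dimensional "model" are all assumptions made for illustration.

```python
from typing import List

def logits_from_hidden(hidden: List[float], unembed: List[List[float]]) -> List[float]:
    """Project one hidden state onto the vocabulary (one row of `unembed` per token)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def token_rank(logits: List[float], token_id: int) -> int:
    """Rank of `token_id` among all tokens; 0 means it has the highest logit."""
    target = logits[token_id]
    return sum(1 for value in logits if value > target)

def rank_trajectory(hidden_states: List[List[float]],
                    unembed: List[List[float]],
                    token_id: int) -> List[int]:
    """Rank of the target token after each (hypothetical) layer or recurrence step."""
    return [token_rank(logits_from_hidden(h, unembed), token_id) for h in hidden_states]

# Toy data (assumed): 3 vocabulary tokens, 2-dimensional hidden states, 3 steps.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
hidden_states = [[1.0, -1.0], [0.5, 0.4], [0.1, 2.0]]
trajectory = rank_trajectory(hidden_states, unembed, token_id=1)
print(trajectory)  # [2, 1, 0]
```

A trajectory that falls steadily toward 0 would suggest the token is being "decided" progressively across depth; the paper reports that such patterns are limited and depend on the decoding method.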

[4] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Steven Song,Anirudh Subramanyam,Zhenyu Zhang,Aarti Venkat,Robert L. Grossman

Main category: cs.CL

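TL;DR: This study presents GDC Cohort Copilot, an open-source natural-language tool that simplifies creating patient cohorts in the Genomic Data Commons, and shows that its locally served model outperforms GPT-4o prompting.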

Details Motivation: Because the Genomic Data Commons (GDC) exposes hundreds of fields and properties, users may struggle to build complex patient cohorts with the graphical Cohort Builder, whereas describing the desired cohort in natural language may be more intuitive and convenient. Method: The team developed and evaluated multiple large language models (LLMs) to power GDC Cohort Copilot and built an interactive interface that lets users further refine the generated cohort. Result: GDC Cohort Copilot automatically generates the GDC cohort filter that matches a user's natural language description and exports the cohort to the GDC for further analysis. The study also shows that the locally served, open-source GDC Cohort LLM outperforms GPT-4o prompting. Conclusion: GDC Cohort Copilot is an open-source assistant that uses natural language processing to help users create and refine specific patient cohorts from the Genomic Data Commons more efficiently. Abstract: Motivation: The Genomic Data Commons (GDC) provides access to high-quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.

[5] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu,Tinghong Chen,Jiangtao Feng,Jiangjie Chen,Weinan Dai,Qiying Yu,Ya-Qin Zhang,Wei-Ying Ma,Jingjing Liu,Mingxuan Wang,Hao Zhou

Main category: cs.CL

TL;DR: MemAgent is a novel agent workflow that handles long-text tasks efficiently and performs especially well in ultra-long-context settings.

Details Motivation: Despite recent improvements, handling infinitely long documents with linear complexity and without performance degradation during extrapolation remains the ultimate challenge in long-text processing. Method: The paper introduces a new agent workflow, MemAgent, and extends the DAPO algorithm to support training via independent-context multi-conversation generation. Result: MemAgent shows strong long-context capability: with an 8K context trained on 32K text, it extrapolates to a 3.5M QA task with less than 5% performance loss and scores above 95% on the 512K RULER test. Conclusion: MemAgent is an effective solution for long-text tasks and retains its performance in ultra-long-context settings. Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
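
The segment-wise reading loop with an overwrite memory, as described in the abstract, might look roughly like the sketch below. Everything named here is hypothetical: `read_with_memory`, `keyword_update`, and the example segments are illustrative, and the keyword matcher merely stands in for the RL-trained model that MemAgent uses to rewrite its memory.

```python
from typing import Callable, Iterable

# Signature of the (assumed) memory updater: (old_memory, segment, question) -> new_memory
MemoryUpdater = Callable[[str, str, str], str]

def read_with_memory(segments: Iterable[str],
                     question: str,
                     update_memory: MemoryUpdater,
                     memory: str = "") -> str:
    """Read a document segment by segment, overwriting a fixed-size memory each step."""
    for segment in segments:
        # The new memory replaces the old one, so the context the model sees
        # stays bounded no matter how many segments the document has.
        memory = update_memory(memory, segment, question)
    return memory

def keyword_update(memory: str, segment: str, question: str) -> str:
    """Toy stand-in for the model: keep a segment only if it mentions the query keyword."""
    return segment.strip() if question.lower() in segment.lower() else memory

segments = [
    "The sky was grey that day.",
    "The passcode is 7421.",
    "Nothing else happened.",
]
final_memory = read_with_memory(segments, "passcode", keyword_update)
print(final_memory)  # The passcode is 7421.
```

Because each step touches one segment plus a bounded memory, the total cost grows linearly with the number of segments, which is the property the abstract emphasizes.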

[6] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Dohoon Kim,Donghun Kang,Taesup Moon

Main category: cs.CL

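TL;DR: This paper proposes DoMIX, which uses LoRA modules for efficient continual domain-adaptive pre-training, addresses several limitations of existing methods, and also applies to standard LLM fine-tuning.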

Details Motivation: Existing continual DAP methods suffer from high computational cost, sensitivity to the order of incremental data, and an inability to tailor models to specific tasks, so a more efficient and flexible method is needed. Method: The approach uses LoRA modules for parameter-efficient fine-tuning and designs a robust, parallelizable domain-adaptive pre-training procedure. Result: DoMIX substantially reduces compute and memory overhead during training, is robust to domain order, and provides tailored pre-trained models for specific tasks. Conclusion: DoMIX is a new LoRA-based method that effectively addresses the challenges of continual domain-adaptive pre-training (DAP) and extends to standard large language model (LLM) fine-tuning. Abstract: Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.

[7] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Christian Jaumann,Annemarie Friedrich,Rainer Lienhart

Main category: cs.CL

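TL;DR: This paper presents a scientific visual question answering system built on multimodal large language models and few-shot learning; it ranked third out of seven in the SciVQA 2025 Shared Task.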

Details Motivation: To perform scientific visual question answering effectively in the SciVQA 2025 Shared Task. Method: An ensemble of two Multimodal Large Language Models combined with various few-shot example retrieval strategies. Result: On the blind test data, the system achieves an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Conclusion: The system ranks third out of seven in the SciVQA 2025 Shared Task, with an average F1 score of 85.12. Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.
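
One simple way to read "select answers based on the models' confidence levels" is to keep the answer from whichever model reports the higher confidence. The sketch below assumes exactly that; the abstract does not specify the selection rule, and `Prediction`, `select_by_confidence`, the model names, and the confidence values are all hypothetical.

```python
from typing import NamedTuple, Sequence

class Prediction(NamedTuple):
    model: str         # hypothetical model identifier
    answer: str        # the model's answer to the question
    confidence: float  # assumed score in [0, 1], e.g. a mean token probability

def select_by_confidence(predictions: Sequence[Prediction]) -> Prediction:
    """Return the prediction with the highest confidence score."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return max(predictions, key=lambda p: p.confidence)

candidates = [
    Prediction(model="model_a", answer="0.82", confidence=0.61),
    Prediction(model="model_b", answer="0.85", confidence=0.93),
]
chosen = select_by_confidence(candidates)
print(chosen.model, chosen.answer)  # model_b 0.85
```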

[8] QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

Pilsung Kang

Main category: cs.CL

TL;DR: This research introduces QFFN-BERT, a hybrid quantum-classical transformer that replaces feedforward networks with parameterized quantum circuits to enhance expressibility and reduce parameters, achieving high accuracy and data efficiency.

Details Motivation: Feedforward networks contribute significantly to the parameters in standard Transformer encoder blocks, making them a target for enhancement through parameterized quantum circuits. Method: The study replaces the feedforward network modules in a compact BERT variant with parameterized quantum circuit-based layers. The architecture incorporates a residual connection, RY and RZ rotations, and an alternating entanglement strategy for stability and expressibility. Result: The QFFN-BERT model achieved up to 102.0% of the baseline accuracy on SST-2 and DBpedia benchmarks while reducing FFN-specific parameters by over 99%. It also showed a competitive advantage in few-shot learning scenarios. Conclusion: Parameterized quantum circuits can be powerful and parameter-efficient alternatives to classical feedforward networks when co-designed with fundamental deep learning principles. Abstract: Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. 
First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.

[9] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Weijie Lyu,Sheng-Jun Huang,Xuan Xia

Main category: cs.CL

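TL;DR: This study proposes a parametric-model-based method for selecting code training data that improves both training efficiency and model performance, with 10K selected samples outperforming a 92K full-data baseline.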

Details Motivation: Current methods improve model performance mainly by using large amounts of data while overlooking data quality, which reduces training efficiency. Method: A parametric model is used to select code data and is optimized to ensure distribution consistency and diversity within the selected subset. Result: Using only 10K samples, the method improves over the 92K full-sampled baseline by 2.4% on HumanEval and 2.3% on MBPP. Conclusion: The method effectively improves model performance while significantly reducing computational cost. Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

[10] Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Mark Atta Mensah,Isaac Wiafe,Akon Ekpezu,Justice Kwame Appati,Jamal-Deen Abdulai,Akosua Nyarkoa Wiafe-Akenten,Frank Ernest Yeboah,Gifty Odame

Main category: cs.CL

TL;DR: This study investigates the cross-domain generalization of transformer-based ASR models for the low-resource Akan language, revealing domain dependency and architectural trade-offs in error behavior.

Details Motivation: Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets but seldom examines how they generalize across diverse speech contexts. This study aims to address this gap by exploring the cross-domain generalization of ASR models for the low-resource Akan language. Method: Seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, were benchmarked using four Akan speech corpora covering diverse domains. The evaluation focused on word error rate and character error rate to assess model performance across different contexts. Result: The results revealed significant domain dependency, with models performing optimally only within their training domains and showing marked accuracy degradation in mismatched scenarios. Additionally, distinct error behaviors were identified between Whisper and Wav2Vec2 architectures: Whisper produced more fluent but potentially misleading errors, while Wav2Vec2 generated more obvious yet less interpretable outputs in unfamiliar contexts. Conclusion: The study concludes that ASR models, specifically those based on transformer architectures like Whisper and Wav2Vec2, exhibit domain dependency in their performance. It highlights the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other low-resource languages (LRLs). Abstract: Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets and seldom examines how those models generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. 
A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
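
For readers unfamiliar with the two metrics, word error rate (WER) and character error rate (CER) are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a minimal, dependency-free illustration rather than the evaluation code used in the study; the function names are assumptions, and the example sentence is an English placeholder, not Akan data.

```python
from typing import Sequence

def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

reference = "the cat sat down"
hypothesis = "the cat sit down"
print(word_error_rate(reference, hypothesis))  # 0.25
print(char_error_rate(reference, hypothesis))  # 0.0625
```

The example shows why the two metrics can diverge: a single wrong character costs a full word under WER but only one of sixteen characters under CER.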

[11] A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages

Sumaya Ahmed Salihs,Isaac Wiafe,Jamal-Deen Abdulai,Elikem Doe Atsakpo,Gifty Ayoka,Richard Cave,Akon Obu Ekpezu,Catherine Holloway,Katrin Tomanek,Fiifi Baffoe Payin Winful

Main category: cs.CL

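TL;DR: This study proposes a community-driven approach to data collection and ASR model building for impaired speech in low-resource languages, and releases the first open-source dataset of impaired speech in Akan together with related open-source tools.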

Details Motivation: To democratize ASR technology and data collection, particularly for low-resource languages and people with speech impairments. Method: The study develops a "cookbook" of best practices and training to support community-driven data collection and ASR model building, and fine-tunes open-source ASR models to better recognize impaired speech in Akan. Result: The study curated the first open-source dataset of impaired speech in Akan and reports initial results from fine-tuning open-source ASR models to better recognize impaired speech. Conclusion: The study delivers a community-driven approach to data collection and ASR model building that enables inclusive ASR technology for individuals with speech impairments. Abstract: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a "cookbook" of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.

Sneha Deshmukh,Prathmesh Kamble

Main category: cs.CL

TL;DR: This paper introduces an annotated dataset of Indian court judgments on bail decisions to advance Legal NLP research in regions like India, where the field remains underdeveloped.

Details Motivation: The motivation behind this work is the lack of structured datasets in Legal NLP, particularly in regions like India, which hampers research and development in this field. Method: The authors used a prompt-engineered GPT-4o pipeline to generate annotations and verified them for consistency. Result: The result is the creation of IndianBailJudgments-1200, a benchmark dataset with 1200 annotated Indian court judgments focusing on bail decisions. Conclusion: The paper concludes that the introduced dataset will significantly contribute to legal NLP research in Indian jurisprudence, especially for bail decisions. Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

[13] WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li,Zhongwang Zhang,Huifeng Yin,Liwen Zhang,Litu Ou,Jialong Wu,Wenbiao Yin,Baixuan Li,Zhengwei Tao,Xinyu Wang,Weizhou Shen,Junkai Zhang,Dingchu Zhang,Xixi Wu,Yong Jiang,Ming Yan,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

TL;DR: This paper introduces WebSailor, a post-training methodology for open-source language models that enables them to handle extreme uncertainty in complex information-seeking tasks, thereby matching the performance of proprietary systems like DeepResearch.

Details Motivation: Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on complex benchmarks like BrowseComp, which open-source models lack due to the absence of sophisticated reasoning patterns needed to handle extreme uncertainty. Method: WebSailor utilizes a complete post-training methodology involving structured sampling, information obfuscation, RFT cold start, and an efficient agentic RL training algorithm called DUPO. Result: WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching the performance of proprietary agents. Conclusion: WebSailor successfully instills the capability of systematically reducing extreme uncertainty in complex information-seeking tasks, closing the gap with proprietary agents. Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

[14] Revisiting Active Learning under (Human) Label Variation

Cornelia Gruber,Helen Alber,Bernd Bischl,Göran Kauermann,Barbara Plank,Matthias Aßenmacher

Main category: cs.CL

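TL;DR: This study argues that active learning should account for human label variation and proposes a new conceptual framework for exploiting that variation to train machine learning models more efficiently.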

Details Motivation: Access to high-quality labeled data remains a limiting factor in supervised learning. Label variation (LV) is common in natural language processing, yet conventional approaches usually assume a single ground-truth label and overlook human label variation (HLV) as a useful signal. Method: The paper surveys and analyzes the current literature on active learning and label variation, examines how observed LV can be decomposed into signal (e.g., HLV) and noise (e.g., annotation error), and proposes a new conceptual framework that integrates HLV. Result: The paper exposes the limitations of conventional active learning methods once HLV is taken into account, discusses the use of large language models (LLMs) as annotators, and lays a conceptual foundation for HLV-aware active learning. Conclusion: The paper proposes a conceptual framework for incorporating human label variation (HLV) into the active learning (AL) loop so that it better reflects the complexities of real-world annotation. Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed, or neglected, these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLMs) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

[15] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

Xin Guan,PeiHsin Lin,Zekun Wu,Ze Wang,Ruibo Zhang,Emre Kazim,Adriano Koshiyama

Main category: cs.CL

TL;DR: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models designed to mitigate bias by leveraging multiperspective generations and aligning model outputs with nuanced human-like baselines.

Details Motivation: MPF was developed in response to the growing need for easy bias mitigation in large language models. Method: Multiperspective Fusion (MPF) is built on top of the SAGED pipeline, leveraging multiperspective generations to expose and align biases in LLM outputs. It decomposes baselines into interpretable perspective components and guides generation through sampling and balancing of responses weighted by decomposition probabilities. Result: Empirical results show that MPF can align LLM sentiment distributions with both counterfactual baselines (absolute equality) and HR baselines (biased for Top University), leading to small KL divergence, reduced calibration error, and generalization to unseen questions. Conclusion: MPF offers a scalable and interpretable method for alignment and bias mitigation in large language models, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning. Abstract: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, human-like baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduced calibration error, and generalization to unseen questions. 
This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
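
The "small KL divergence" result refers to how closely the model's sentiment distribution matches a baseline distribution. A minimal sketch of that measurement follows, assuming discrete sentiment bins and the direction KL(model || baseline); the abstract states neither the binning nor the direction, and the function name and all numbers are made up for illustration.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same categories."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical sentiment distributions over (negative, neutral, positive).
baseline = [0.2, 0.5, 0.3]
aligned_model = [0.2, 0.5, 0.3]
unaligned_model = [0.6, 0.3, 0.1]

kl_aligned = kl_divergence(aligned_model, baseline)
kl_unaligned = kl_divergence(unaligned_model, baseline)
print(kl_aligned)                 # 0.0
print(kl_unaligned > kl_aligned)  # True
```

A value of zero means the two distributions are identical; larger values indicate that the model's sentiment profile has drifted further from the chosen baseline.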

[16] Exploring Gender Bias Beyond Occupational Titles

Ahmed Sabir,Rajesh Sharama

Main category: cs.CL

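TL;DR: This paper presents the GenderLexicon dataset and a new framework that quantifies and explains gender bias in context, showing that bias extends beyond occupational stereotypes.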

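Details Motivation: To investigate the correlation between gender and contextual biases, particularly biases involving action verbs, object nouns, and occupations. Method: The authors build the GenderLexicon dataset and a new framework, then evaluate on five diverse datasets (including a Japanese dataset) to quantify and explain gender bias. Result: The model explains bias with a score, which improves the explainability of gender bias, and reveals gender biases that go beyond occupational stereotypes. Conclusion: The study confirms gender biases beyond occupational stereotypes and introduces a new dataset, GenderLexicon, together with an interpretable framework for estimating and explaining gender bias in context. Abstract: In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.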

[17] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Zhijian Xu,Yilun Zhao,Manasi Patwardhan,Lovekesh Vig,Arman Cohan

Main category: cs.CL

TL;DR: This paper introduces LimitGen, a new benchmark for assessing LLMs' capability to identify research limitations when augmented with literature retrieval, showing promise in supporting peer review processes.

Details Motivation: The increasing volume of scientific publications has made the peer review process more challenging. While LLMs have shown potential in various scientific tasks, their application in identifying research paper limitations remains underexplored. Method: The researchers developed a taxonomy of limitation types in scientific research, particularly focusing on AI. They created LimitGen, which includes synthetic (LimitGen-Syn) and real human-written (LimitGen-Human) datasets to evaluate LLMs' ability to detect limitations. The method also involves improving LLM performance through literature retrieval integration. Result: LimitGen was successfully developed as a benchmark for evaluating LLMs in identifying paper limitations. It showed that augmenting LLM systems with literature retrieval significantly improves their ability to generate meaningful and constructive feedback. Conclusion: The study concludes that LLMs, when augmented with literature retrieval, can effectively identify paper limitations and provide constructive feedback, thus complementing human peer review. Abstract: Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. 
To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identification of limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

[18] Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Peter Viechnicki

Main category: cs.CL

TL;DR: This study measures the Just Producible Difference (JPD), the minimal distance between two vowel stimuli in auditory space that yields reliably different imitations, to clarify human vowel system structures and inform speech production theories.

Details Motivation: To determine the degree of accuracy in sub-phonemic control mechanisms in human vowel production by investigating how far apart two vowel stimuli must be in auditory space to yield reliably different imitations. Method: A vowel mimicry paradigm was used to measure the 'Just Producible Difference' (JPD) among two sets of English speakers during front vowel production. Result: JPD was estimated at between 14 and 51 mels in F1 X F2 space, offering insights into the precision of articulatory control and its implications for speech production theories. Conclusion: The study provides a psychophysical explanation for trends in vowel phonemes and clarifies the possible structures of human vowel systems by establishing a theoretical lower bound for how close two vowel phonemes may be in auditory space. Abstract: A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed 'Just Producible Difference' (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 X F2 space. This finding has implications for episodic theories of speech production. 
It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker's formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.
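
To give a feel for what a distance of 14 to 51 mels in F1 X F2 space means, the sketch below converts formant frequencies from Hz to mels and measures the Euclidean distance between two vowel tokens. It assumes the common O'Shaughnessy mel formula and a plain Euclidean metric; the abstract does not say which mel variant or distance the study used, and the formant values are illustrative only.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (O'Shaughnessy formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_distance(vowel_a, vowel_b):
    """Euclidean distance in mels between two (F1, F2) pairs given in Hz."""
    (f1_a, f2_a), (f1_b, f2_b) = vowel_a, vowel_b
    return math.hypot(
        hz_to_mel(f1_a) - hz_to_mel(f1_b),
        hz_to_mel(f2_a) - hz_to_mel(f2_b),
    )


# Two hypothetical front-vowel tokens (F1, F2) in Hz; values are illustrative.
token_a = (400.0, 2000.0)
token_b = (420.0, 2050.0)

print("distance in mels:", mel_distance(token_a, token_b))
```

Under these assumptions, two stimuli closer together than the JPD would not be expected to produce reliably different imitations.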

[19] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Ken Tsui

Main category: cs.CL

TL;DR: Large language models (LLMs) have a self-correction blind spot: they fail to correct errors in their own outputs.

Details Motivation: Studying the self-correction ability of LLMs is necessary to improve their reliability and trustworthiness. Method: The paper introduces Self-Correction Bench, a systematic framework that measures this phenomenon through controlled error injection at three complexity levels. Result: Across the 14 models tested, the average blind spot rate is 64.5%. The limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences. Appending "Wait" reduces blind spots by 89.3%. Conclusion: Current LLMs have a critical limitation, the self-correction blind spot, but their reliability can be improved by changing training methods and activation mechanisms. Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical errors in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

[20] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini,Nicola Gabriele,Alessio Orsino,Domenico Talia

Main category: cs.CL

TL;DR: This paper explores whether reasoning in language models improves robustness against social biases. It finds that explicit reasoning can make models more prone to bias, highlighting the need for better strategies to address biases in reasoning design.

Details Motivation: To investigate how reasoning mechanisms affect fairness, robustness, and susceptibility to social biases in Reasoning Language Models (RLMs), challenging the assumption that reasoning inherently improves model reliability. Method: The research employed the CLEAR-Bias benchmark, an LLM-as-a-judge approach for safety scoring, and jailbreak techniques to evaluate adversarial robustness of RLMs across sociocultural dimensions. Result: Models with explicit reasoning were found to be more vulnerable to stereotype reinforcement compared to base models. CoT prompting was particularly susceptible to contextual attacks, suggesting reasoning does not inherently improve robustness. Conclusion: The study concludes that reasoning capabilities in models may unintentionally increase vulnerability to bias, and more bias-aware approaches are necessary for reasoning design. Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. 
Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

[21] Multimodal Mathematical Reasoning with Diverse Solving Perspective

Wenhao Shi,Zhiqiang Hu,Yi Bin,Yang Yang,See-Kiong Ng,Heng Tao Shen

Main category: cs.CL

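TL;DR: This paper introduces a new dataset, MathV-DP, and an improved model, Qwen-VL-DP, which aim to improve the mathematical reasoning of multimodal large language models by combining diverse reasoning perspectives with a reinforcement learning strategy.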

Details Motivation: Existing multimodal LLMs for mathematical reasoning typically rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. Method: The authors introduce a new dataset, MathV-DP, and propose a model built on Qwen-VL that is fine-tuned with supervised learning and enhanced via GRPO, a rule-based reinforcement learning approach. Result: Experiments show that Qwen-VL-DP significantly outperforms prior baseline MLLMs in both accuracy and generative diversity. Conclusion: Qwen-VL-DP highlights the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning. Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

[22] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Wencheng Zhang,Shiqin Qiao,Lingjie Luo,Yinfeng Li,Chuanyang Zheng,Qian Xu,Meng Li,Yong Gui,Yijun He,Jianing Qiu,Jindong Hong,Jiankai Sun

Main category: cs.CL

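TL;DR: This paper proposes SynapseRoute, a dynamic routing framework that assigns queries to thinking or non-thinking mode according to problem complexity, substantially reducing inference time and cost while preserving accuracy.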

Details Motivation: As large language models are widely deployed, choosing a suitable model requires balancing performance against operational cost, and reasoning-capable models have widened the cost gap further. The study finds that about 58% of medical questions can be answered accurately in the low-cost non-thinking mode, which suggests that routing queries dynamically by problem complexity can optimize overall user experience and cost-efficiency. Method: The authors propose SynapseRoute, a machine learning-based dynamic routing framework, and introduce the AIT index to evaluate the trade-offs among accuracy, latency, and token cost. Result: Experiments show that SynapseRoute improves overall accuracy compared with the thinking mode alone (0.8390 vs. 0.8272) while reducing inference time by 36.8% and token consumption by 39.66%. Qualitative analysis also indicates that over-reasoning on simple questions causes unnecessary delays and can even reduce accuracy. Conclusion: By dynamically assigning queries to the appropriate mode, SynapseRoute improves accuracy and reduces inference time and token consumption, demonstrating that adaptive routing is effective at balancing accuracy and cost-efficiency. Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.

[23] Generalizing Verifiable Instruction Following

Valentina Pyatkin,Saumya Malik,Victoria Graf,Hamish Ivison,Shengyi Huang,Pradeep Dasigi,Nathan Lambert,Hannaneh Hajishirzi

Main category: cs.CL

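TL;DR: This study introduces the IFBench benchmark and a reinforcement learning approach for improving language models' ability to follow complex instructions.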

Details Motivation: Current language models struggle to follow user-specified output constraints, and their generalization needs to improve. Method: The authors develop a new benchmark, IFBench, and design constraint verification modules for evaluating and improving how models adapt to new instructions. Result: The work introduces the IFBench benchmark along with 29 new hand-annotated training constraints and verification functions, and significantly improves instruction following. Conclusion: Models can be trained to follow precise instructions better, and reinforcement learning with verifiable rewards (RLVR) performs best. Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.

[24] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Almog Hilel,Idan Shenfeld,Leshem Choshen,Jacob Andreas

Main category: cs.CL

TL;DR: A vulnerability in language models allows attackers to change model behavior by manipulating feedback, leading to security risks and misinformation.

Details Motivation: To identify vulnerabilities in language models trained with user feedback and demonstrate potential attack mechanisms. Method: Attackers use prompts and upvote/downvote feedback to manipulate LM outputs, training the model to favor 'poisoned' responses. Result: LMs exhibit increased likelihood of producing manipulated responses even without malicious prompts, enabling insertion of false knowledge, security flaws, and fake news. Conclusion: The paper concludes that preference tuning in language models can be exploited through feedback manipulation, allowing attackers to alter model behavior and knowledge. Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

[25] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

TL;DR: MOTIF enables large language models to think beyond context size limits through modular reinforcement learning, achieving better accuracy with fewer samples.

Details Motivation: Context size limits hinder LLMs' ability to reason over arbitrarily long sequences; modular thinking is needed. Method: Proposed MOTIF for modular thinking via reinforcement finetuning, using the GRPO algorithm with multi-round generation of reasoning tokens. Result: 3.8% improvement on the MATH500 benchmark and 3.3% improvement on the AIME2024 benchmark over vanilla GRPO training, achieved with only 15% of the samples. Conclusion: MOTIF demonstrates sample efficiency and effectiveness in improving LLM reasoning capabilities beyond context size limitations. Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning -- an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
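
For readers unfamiliar with GRPO, which this entry builds on, the sketch below shows the group-relative advantage at its core: several responses are sampled for one prompt, and each reward is normalized against the group's mean and standard deviation. This is a minimal illustration of vanilla GRPO's advantage computation, not MOTIF's multi-round procedure. The function name, the use of population standard deviation, and the reward values are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    Returns (r - mean) / (std + eps) for each reward; eps avoids division
    by zero when every response in the group earns the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary correctness rewards for four sampled answers.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Because the baseline is the group mean, no separate value network is needed, and a group whose responses all receive the same reward contributes no learning signal.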

[26] Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping

Main category: cs.CL

TL;DR: This paper shows that multiple choice benchmarks have flaws, as they allow models to exploit shortcuts. It proposes answer matching as a better evaluation method, showing strong agreement with human grading and suggesting a shift away from traditional multiple choice approaches.

Details Motivation: The motivation stems from the discovery that many multiple choice questions can be answered without reading the question, indicating a flaw in discriminative evaluation methods. This highlights the need for a more reliable evaluation system like generative evaluation. Method: The researchers used answer matching, where a model generates free-form responses which are then evaluated against reference answers using modern language models. They compared this approach with multiple choice evaluations and LLM-as-a-judge methods by measuring agreement with human grading data on MMLU-Pro and GPQA-Diamond datasets. Result: Answer matching using recent language models achieved near-perfect agreement with human grading, similar to inter-annotator agreement. In contrast, traditional multiple choice evaluation and LLM-as-a-judge approaches showed poor alignment with human judgments. Conclusion: The study concludes that generative evaluation through answer matching is a more valid and scalable alternative to multiple choice benchmarks for assessing language models. Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. 
To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
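
The comparison above rests on measuring how closely an automatic grader agrees with human grading. The sketch below shows two common ways to quantify that for binary correct/incorrect labels: raw percent agreement and Cohen's kappa, which corrects for chance agreement. The abstract does not say which statistic the authors used, so the choice of kappa is an assumption, and the label lists are made up.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two graders give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two graders."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Expected agreement if the graders labeled independently at their own rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both graders always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades: 1 = response judged correct, 0 = judged incorrect.
human = [1, 1, 0, 0, 1, 0, 1, 0]
matcher = [1, 1, 0, 0, 1, 0, 1, 1]

print("agreement:", percent_agreement(human, matcher))
print("kappa:    ", cohens_kappa(human, matcher))
```

Chance correction matters when most answers are correct or most are incorrect, since two graders can then agree often without tracking each other at all.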

cs.CV [Back]

[27] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CV

TL;DR: This paper reviews the application of large language models and vision-language models in crash detection from video feeds, providing a taxonomy of fusion strategies, dataset summaries, model analysis, and performance comparisons to lay the groundwork for future research.

Details Motivation: The motivation stems from the critical need for effective crash detection in intelligent transportation systems and the transformative potential of large language models and vision-language models in processing and reasoning multimodal information. Method: The paper uses a structured taxonomy of fusion strategies, summarizes key datasets, analyzes model architectures, and compares performance benchmarks to evaluate recent methods leveraging LLMs for crash detection. Result: The result is a comprehensive review that outlines current methodologies, datasets, model architectures, and performance benchmarks for using LLMs in crash detection from video data. Conclusion: The paper concludes that the integration of LLMs and VLMs presents a promising avenue for future research in crash detection from video feeds within intelligent transportation systems. Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[28] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning

Zijie Cai,Christopher Metzler

Main category: cs.CV

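TL;DR: This paper presents a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, and shows improved performance by fine-tuning the Depth Anything V2 model.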

Details Motivation: The reliability of monocular metric depth estimation in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. Method: The authors generate a synthetic underwater variant of the Hypersim dataset using a physically based underwater image formation model and fine-tune Depth Anything V2 on it. Result: Large-scale models trained on terrestrial (real or synthetic) data are effective in in-air settings but perform poorly underwater due to significant domain shifts. The fine-tuned Depth Anything V2 model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Conclusion: The study highlights the importance of domain adaptation and scale-aware supervision for robust and generalizable monocular metric depth prediction in underwater scenes. Abstract: Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.

[29] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

Xiao Wang,Jingtao Jiang,Qiang Chen,Lan Chen,Lin Zhu,Yaowei Wang,Yonghong Tian,Jin Tang

Main category: cs.CV

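TL;DR: This paper proposes ESTR-CoT, a new chain-of-thought reasoning framework for event stream scene text recognition, and builds a large-scale CoT dataset for training, improving both recognition in extremely challenging scenarios and model interpretability.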

Details Motivation: Existing event stream scene text recognition methods perform poorly in extremely challenging scenarios such as low illumination and fast motion, and suffer from insufficient interpretability and weak contextual logical reasoning. Method: The vision encoder EVA-CLIP converts the input event stream into tokens, a Llama tokenizer encodes the generation prompt, and a Q-former aligns the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process. Result: Extensive experiments on three event stream STR benchmark datasets validate the effectiveness and interpretability of the proposed ESTR-CoT framework. Conclusion: The paper proposes ESTR-CoT, a chain-of-thought reasoning framework for event stream scene text recognition, and demonstrates its effectiveness and interpretability experimentally. Abstract: Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.

[30] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach

Elena Ryumina,Maxim Markitantov,Alexandr Axyonov,Dmitry Ryumin,Mikhail Dolgushin,Alexey Karpov

Main category: cs.CV

TL;DR: This paper introduces a novel zero-shot multimodal framework for Compound Expression Recognition (CER), combining six diverse modalities without requiring task-specific training data, achieving competitive performance on multiple emotion detection datasets.

Details Motivation: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions, which traditional methods struggle with due to reliance on task-specific training data. Method: A zero-shot multimodal pipeline combining six heterogeneous modalities (static and dynamic facial expressions, scene and label matching, scene context, audio, and text) was developed. The method includes CLIP-based label matching, Qwen-VL for scene understanding, a Multi-Head Probability Fusion module, and a Compound Expressions transformation module using Pair-Wise Probability Aggregation and Pair-Wise Feature Similarity Aggregation. Result: The approach achieved F1 scores of 46.95% on AffWild2, 49.02% on AFEW, and 34.85% on C-EXPR-DB through zero-shot testing, showing results comparable to supervised approaches trained on target data. Conclusion: The proposed zero-shot multimodal approach effectively captures compound emotions without domain adaptation, achieving performance comparable to supervised methods on multiple datasets. Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. 
We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.

[31] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Takuro Kawada,Shunsuke Kitada,Sota Nemoto,Hitoshi Iyatomi

Main category: cs.CV

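TL;DR: This paper introduces the SciGA-145k dataset and a new recommendation metric, CAR, aiming to improve visual communication in scientific papers and advance AI for Science.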

Details Motivation: Graphical abstracts play an important role in conveying the key findings of scientific papers, but designing them requires advanced visualization skills, which limits their widespread adoption. Automated methods are therefore needed to support graphical abstract design. Method: The authors introduce the large-scale dataset SciGA-145k, define two tasks (Intra-GA recommendation and Inter-GA recommendation), and propose a new recommendation metric, CAR. Result: The authors built SciGA-145k, a dataset of approximately 145,000 scientific papers and 1.14 million figures, and proposed an effective recommendation metric, CAR. Conclusion: SciGA-145k supports the selection and recommendation of graphical abstracts in scientific papers and advances research on automated graphical abstract generation. Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

[32] Understanding Trade offs When Conditioning Synthetic Data

Brandon Trabucco,Qasim Wani,Benjamin Pikus,Vasu Sharma

Main category: cs.CV

TL;DR: This paper explores how different conditioning strategies (prompt vs. layout) affect synthetic data quality for object detection, showing that layout-based conditioning significantly boosts performance when visual concept diversity is high.

Details Motivation: There is a critical need for robust object detectors trained on limited real-world data in industrial vision systems. Synthetic data generation using diffusion models offers promise due to faster generation times and potential improvements in data efficiency compared to traditional 3D rendering methods. Method: The researchers compared two conditioning strategies—prompt based and layout based—across eighty diverse visual concepts from four standard object detection benchmarks. They evaluated the quality of synthetic data generated under each strategy and its impact on object detection performance. Result: When layout cues matched the full training distribution, synthetic data improved mean average precision by an average of 34% and up to 177% compared to using real data alone. Conclusion: The study concludes that the effectiveness of conditioning strategies in generating synthetic data using diffusion models depends on the diversity of the visual concepts. Prompt-based conditioning is better for narrow cue sets, while layout-based conditioning excels as diversity increases. Abstract: Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. 
We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.

[33] High-Fidelity Differential-information Driven Binary Vision Transformer

Tian Gao,Zhiyuan Zhang,Kaijie Yin,Xu-Cheng Zhong,Hui Kong

Main category: cs.CV

TL;DR: This paper proposes DIDB-ViT, a novel binary ViT that addresses the limitations of existing binary ViT methods, achieving superior image classification and segmentation performance compared to state-of-the-art network quantization methods.

Details Motivation: The motivation is to address the trade-off between high computational/storage demands of vision transformers (ViTs) and the constraints of edge-device deployment, while overcoming the limitations of existing binary ViT methods which suffer from performance degradation or rely heavily on full-precision modules. Method: The authors propose DIDB-ViT, a novel binary ViT that includes an informative attention module, frequency decomposition using the discrete Haar wavelet, and an improved RPReLU activation function. Result: Experimental results show that DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance. Conclusion: The paper concludes that DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance. Abstract: The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. 
Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model's representational capacity. Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
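The frequency decomposition mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a one-level orthonormal Haar transform applied along the channel axis of full-precision Q and K (the axis, the number of levels, and the way DIDB-ViT binarizes and recombines the bands are not stated in this summary). The sketch only shows the property that makes such a split attractive: for an orthonormal Haar transform, the low-frequency and high-frequency similarities add up exactly to the original dot-product similarity, so each band can be handled separately and then integrated.

```python
import numpy as np

def haar_split(x):
    """One-level orthonormal Haar transform along the last axis (even length).

    Returns the low-frequency (pairwise average) and high-frequency
    (pairwise difference) coefficients, each scaled by 1/sqrt(2).
    """
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return low, high

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))  # hypothetical: 4 query tokens, 8 channels
k = rng.standard_normal((5, 8))  # hypothetical: 5 key tokens, 8 channels

q_low, q_high = haar_split(q)
k_low, k_high = haar_split(k)

sim_full = q @ k.T            # ordinary Q·K^T similarity
sim_low = q_low @ k_low.T     # similarity of low-frequency components
sim_high = q_high @ k_high.T  # similarity of high-frequency components

# Orthonormal transforms preserve inner products, so the bands sum to the whole.
assert np.allclose(sim_full, sim_low + sim_high)
```

In a binary setting one would presumably binarize each band before taking the products, at which point the identity above no longer holds exactly; how the paper weights or integrates the per-band similarities is left open here.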

[34] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

Jiangxia Chen,Tongyuan Huang,Ke Song

Main category: cs.CV

TL;DR: This paper proposes FMOcc, which improves the accuracy and efficiency of 3D semantic occupancy prediction through a flow matching selective state space model and a tri-perspective view refinement network, particularly for occluded and distant scenes in autonomous driving.

Details Motivation: Because of the inherent limitations of few-frame images and redundancy in 3D space, existing methods have limited prediction accuracy for occluded and distant scenes, and methods that fuse historical frame data require additional resources. An efficient and robust new method is therefore proposed. Method: The authors design a feature refinement module based on a flow matching model (FMSSM), a tri-perspective selective filtering mechanism (the TPV SSM layer and PS3M), and a Mask Training (MT) method to enhance model robustness. Result: Experiments show that FMOcc outperforms existing methods on the Occ3D-nuScenes and OpenOcc datasets, achieving 43.1% RayIoU and 39.8% mIoU with two-frame input, with 5.4 G inference memory and 330 ms inference time. Conclusion: Through its flow matching and tri-perspective selective modeling, FMOcc addresses key challenges in 3D semantic occupancy prediction and improves both prediction accuracy and efficiency. Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which needs additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with flow matching selective state space model for few-frame 3D occupancy prediction. Firstly, to generate missing features, we designed a feature refinement module based on a flow matching model, which is called Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. Our FMOcc with two frame input achieves notable scores of 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, 42.6% RayIoU on OpenOcc with 5.4 G inference memory and 330 ms inference time.

[35] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement

Zeyu Lei,Hongyuan Yu,Jinlin Wu,Zhen Chen

Main category: cs.CV

TL;DR: SurgVisAgent is an intelligent surgical vision agent that provides customized image enhancements for various distortion types and severity levels, demonstrating potential as a unified solution for surgical assistance.

Details Motivation: Current enhancement algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world surgical situations. Method: SurgVisAgent is built on multimodal large language models (MLLMs) with a prior model for domain-specific knowledge, using in-context few-shot learning and chain-of-thought (CoT) reasoning to deliver customized image enhancements. Result: On a comprehensive benchmark simulating real-world surgical distortions, SurgVisAgent outperforms traditional single-task models in dynamic identification of distortion categories and severity levels, enabling various enhancement tasks. Conclusion: SurgVisAgent demonstrates potential as a unified solution for surgical assistance by surpassing traditional single-task models in handling diverse distortion types and severity levels. Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. 
Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.

[36] Multi-Label Classification Framework for Hurricane Damage Assessment

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

TL;DR: This study develops a novel multi-label classification framework that uses aerial imagery to assess hurricane damage. It achieves a mean average precision of 90.23%, outperforming existing methods, and can help improve post-disaster response efficiency and support future disaster mitigation strategies.

Details Motivation: Traditional single-label classification methods cannot capture the full complexity of post-hurricane damage types, so a more accurate and timely assessment approach is needed. Method: The method combines a ResNet-based feature extraction module with a class-specific attention mechanism to identify multiple damage types within a single image. Result: On the Rescuenet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. Conclusion: The study proposes a new multi-label classification framework for assessing post-hurricane damage and demonstrates its potential to improve disaster response efficiency and contribute to future disaster mitigation strategies. Abstract: Hurricanes cause widespread destruction, resulting in diverse damage types and severities that require timely and accurate assessment for effective disaster response. While traditional single-label classification methods fall short of capturing the complexity of post-hurricane damage, this study introduces a novel multi-label classification framework for assessing damage using aerial imagery. The proposed approach integrates a feature extraction module based on ResNet and a class-specific attention mechanism to identify multiple damage types within a single image. Using the Rescuenet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. This framework enhances post-hurricane damage assessment, enabling more targeted and efficient disaster response and contributing to future strategies for disaster mitigation and resilience. This paper has been accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025), and the camera-ready version will appear in the official conference proceedings.
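For readers unfamiliar with the headline metric, the following is a minimal sketch of how mean average precision is commonly computed for multi-label classification: average precision per class from ranked scores, then the mean over classes. The paper's exact evaluation protocol is not given in this summary, so the function names and the toy labels and scores below are hypothetical.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Average precision for one class: mean of precision@k over the positive ranks."""
    order = np.argsort(-y_score, kind="stable")  # rank images by descending score
    hits = y_true[order]
    if hits.sum() == 0:
        return float("nan")  # undefined when the class has no positives
    cum_hits = np.cumsum(hits)
    precision_at_k = cum_hits / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Y_score):
    """Mean of per-class average precision, ignoring classes without positives."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))

# Toy data: 4 images, 2 hypothetical damage classes; an image may carry both labels.
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.7], [0.1, 0.8], [0.7, 0.6], [0.3, 0.4]])

map_value = mean_average_precision(Y_true, Y_score)
print(round(map_value, 4))  # 0.9167
```

In the toy data, the first class is ranked perfectly (AP = 1), while the second class has one negative ranked above a positive (AP = 5/6), giving a mean of 11/12.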

[37] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation

Yuxiang Zhang,Wei Li,Wen Jia,Mengmeng Zhang,Ran Tao,Shunlin Liang

Main category: cs.CV

TL;DR: This paper proposes a Bi-directional Domain Adaptation framework for hyperspectral image classification to address spectral shifts in different scenes, achieving better performance than existing domain adaptation approaches.

Details Motivation: Spectral shifts in different scenes make it challenging to classify fine-grained land cover using hyperspectral remote sensing technology. Method: Bi-directional Domain Adaptation (BiDA) framework with triple-branch transformer architecture, Coupled Multi-head Cross-attention (CMCA), bi-directional distillation loss, and Adaptive Reinforcement Strategy (ARS). Result: Experimental results show that the BiDA framework performs significantly better than some state-of-the-art domain adaptation approaches, particularly in cross-temporal tree species classification tasks. Conclusion: The proposed BiDA framework demonstrates significant improvement in cross-temporal/space hyperspectral image classification compared to state-of-the-art domain adaptation methods. Abstract: Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of source and target domains, a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. 
Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is more than 3%~5% higher than the most advanced method. The codes will be available from the website: https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.

[38] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

Fanghai Yi,Zehong Zheng,Zexiao Liang,Yihang Dong,Xiyang Fang,Wangyu Wu,Xuhang Chen

Main category: cs.CV

TL;DR: This paper introduces the MAC-Lookup model to enhance underwater images more effectively than existing methods by improving color, sharpness, and contrast.

Details Motivation: Enhancing underwater images is crucial for exploration, but traditional prior-based and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. Method: The Multi-Axis Conditional Lookup (MAC-Lookup) model, which includes Conditional 3D Lookup Table Color Correction (CLTCC) and Multi-Axis Adaptive Enhancement (MAAE). Result: Extensive experiments show that MAC-Lookup enhances visual quality by improving color accuracy, sharpness, and contrast, preventing over-enhancement and saturation. Conclusion: MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is https://github.com/onlycatdoraemon/MAC-Lookup.

[39] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Feizhen Huang,Yu Wu,Yutian Lin,Bo Du

Main category: cs.CV

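TL;DR: This paper proposes a self-distillation-based video-to-audio generation method that addresses the limitation of partially visible targets in cinematic language scenarios and performs well across multiple evaluation metrics and on a large-scale dataset.

Details Motivation: Current video-to-audio generation methods overlook cinematic language, a key element of artistic expression, and their performance degrades when Foley targets are only partially visible. Method: By simulating cinematic language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondences, thereby capturing the associations between sounds and partial visual information. Result: The proposed method achieves notable improvements under partial visibility across all evaluation metrics and also enhances performance on the large-scale V2A dataset VGGSound. Conclusion: The paper proposes a simple self-distillation approach that extends video-to-audio (V2A) generation models to cinematic language scenarios, improving performance under partial visibility. Abstract: Video-to-Audio (V2A) Generation achieves significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.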

[40] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

Juntao Liu,Liqiang Niu,Wenchao Chen,Jie Zhou,Fandong Meng

Main category: cs.CV

TL;DR: LaCo is a novel visual token compression framework for MLLMs that operates within intermediate vision encoder layers, achieving better efficiency and performance than existing methods.

Details Motivation: Existing visual token compression methods operate only as post-encoder modules, limiting their efficiency gains. This motivates the need for a more effective compression approach within intermediate layers. Method: LaCo incorporates a layer-wise pixel-shuffle mechanism and a residual learning architecture with non-parametric shortcuts for preserving visual information during compression. Result: Experiments show that LaCo outperforms existing methods in intermediate-layer token compression, improves training efficiency by over 20%, and increases inference throughput by over 15% while maintaining performance. Conclusion: LaCo demonstrates superior effectiveness in visual token compression for MLLMs by enabling efficient compression within intermediate layers of the vision encoder. Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
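The space-to-channel idea behind the pixel-shuffle component can be sketched in a few lines of NumPy. This is an illustration only, assuming tokens laid out row-major on a `grid_h × grid_w` patch grid and a merge ratio of 2; LaCo's actual layer placement, projection layers, and non-parametric shortcut are not described in enough detail here to reproduce, and the function name is hypothetical.

```python
import numpy as np

def space_to_channel(tokens, grid_h, grid_w, r=2):
    """Merge each r x r block of adjacent tokens into one token with r*r*C channels.

    tokens: array of shape (grid_h * grid_w, C), row-major over the patch grid.
    Returns an array of shape ((grid_h // r) * (grid_w // r), r * r * C).
    """
    n, c = tokens.shape
    assert n == grid_h * grid_w and grid_h % r == 0 and grid_w % r == 0
    x = tokens.reshape(grid_h // r, r, grid_w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/r, W/r, r, r, C): group each block together
    return x.reshape((grid_h // r) * (grid_w // r), r * r * c)

# 4 x 4 grid of 3-channel tokens -> 2 x 2 grid of 12-channel tokens.
tokens = np.arange(4 * 4 * 3, dtype=np.float32).reshape(16, 3)
merged = space_to_channel(tokens, grid_h=4, grid_w=4)
print(merged.shape)  # (4, 12)
```

The token count drops by a factor of `r * r` while no values are discarded; a learned projection would typically follow to bring the channel width back down, which is where information loss could occur and where a shortcut path would plausibly help.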

[41] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

De Cheng,Zhipeng Xu,Xinyang Jiang,Dongsheng Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

TL;DR: This paper proposes a new approach for Domain Generalization using text feature-guided visual prompt tuning and introduces WERA to improve domain-invariant visual representation, which outperforms existing methods on multiple datasets.

Details Motivation: Recent advances in pre-trained Visual Foundation Models have shown potential in enhancing generalization capabilities, but the design of prompts capable of disentangling invariant features across diverse domains remains challenging. Method: A novel framework for text feature-guided visual prompt tuning is introduced. It uses a large language model to disentangle text prompts and learn domain-invariant visual representations. Worst Explicit Representation Alignment (WERA) is also introduced to enhance source domain diversity through stylized image augmentations. Result: Experiments on PACS, VLCS, OfficeHome, DomainNet, and TerraInc datasets show that the proposed method surpasses current state-of-the-art Domain Generalization techniques. Conclusion: The proposed method, leveraging controllable language prompts of VFMs and introducing WERA, outperforms state-of-the-art DG methods on major datasets. Abstract: Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. 
However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

[42] ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

Hanbo Bi,Yulong Xu,Ya Li,Yongqiang Mao,Boyuan Tong,Chongyang Li,Chunbo Lang,Wenhui Diao,Hongqi Wang,Yingchao Feng,Xian Sun

Main category: cs.CV

TL;DR: This paper proposes ViRefSAM, a few-shot segmentation framework for remote sensing images that addresses SAM's shortcomings in manual prompt construction and domain adaptability.

Details Motivation: Applying SAM to remote sensing images faces two main challenges: manually constructing precise prompts is inefficient, and SAM lacks domain adaptability. Method: The authors propose a new framework, ViRefSAM, with two key components: a Visual Contextual Prompt Encoder and a Dynamic Target Alignment Adapter. Result: Experiments on three few-shot segmentation benchmarks show that ViRefSAM achieves accurate and automatic segmentation of unseen classes using only a few reference images and outperforms existing methods. Conclusion: By leveraging a few annotated reference images, ViRefSAM achieves accurate and automatic segmentation of unseen classes in remote sensing images and consistently outperforms existing few-shot segmentation methods across datasets. Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. 
Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.

[43] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

Yunhan Yang,Shuo Chen,Yukun Huang,Xiaoyang Wu,Yuan-Chen Guo,Edmund Y. Lam,Hengshuang Zhao,Tong He,Xihui Liu

Main category: cs.CV

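TL;DR: DreamComposer++ is a new framework based on multi-view conditions that improves the controllable novel view generation of existing view-aware diffusion models through 3D representation extraction and multi-view feature fusion.

Details Motivation: Existing methods that generate high-quality novel views from a single in-the-wild image struggle to produce controllable novel views because they lack multi-view information, so a new approach is needed to overcome this challenge. Method: DreamComposer++ uses a view-aware 3D lifting module to extract 3D representations of an object from different views, aggregates these representations and renders them into the latent features of the target view through a multi-view feature fusion module, and finally integrates the target-view features into pre-trained image or video diffusion models for novel view synthesis. Result: Experimental results show that DreamComposer++ integrates seamlessly with state-of-the-art view-aware diffusion models and enhances their ability to generate controllable novel views from multi-view conditions, facilitating controllable 3D object reconstruction and enabling a wide range of applications. Conclusion: DreamComposer++ is a flexible and scalable framework that improves current view-aware diffusion models by incorporating multi-view conditions, enabling more controllable novel view generation. Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.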

[44] Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images

Haoxuan Li,Chenxu Wei,Haodong Wang,Xiaomeng Hu,Boyuan An,Lingyan Ran,Baosen Zhang,Jin Jin,Omirzhan Taukebayev,Amirkhan Temirbayev,Junrui Liu,Xiuwei Zhang

Main category: cs.CV

TL;DR: This paper introduces Flow-CDNet, a dual-branch network for detecting both slow and fast changes in bitemporal images, achieving superior performance on a newly developed dataset.

Details Motivation: Detecting both slow and fast changes in bitemporal images is crucial for identifying early signs of potential hazards in critical areas such as slopes, dams, and tailings ponds. Method: Flow-CDNet uses a dual-branch architecture: one for extracting multi-scale displacement changes via a pyramid structure (optical flow branch), and another based on ResNet for fast change detection, combined with the optical flow output. A custom loss function and evaluation metric (FEPE) are also introduced. Result: Experiments on the Flow-Change dataset show that Flow-CDNet surpasses existing methods, with ablation studies confirming the mutual enhancement of its two branches. Conclusion: The proposed Flow-CDNet effectively detects both slow and fast changes by integrating an optical flow branch with a binary change detection branch, outperforming existing methods on the Flow-Change dataset. Abstract: Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: optical flow branch and binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second one combines a ResNet-based network with the optical flow branch's output to generate fast change outputs. 
Subsequently, to supervise and evaluate this new change detection framework, we design a self-built change detection dataset, Flow-Change, a loss function combining binary Tversky loss and L2 norm loss, and a new evaluation metric called FEPE. Quantitative experiments conducted on the Flow-Change dataset demonstrate that our approach outperforms existing methods. Furthermore, ablation experiments verify that the two branches promote each other to enhance detection performance.
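
As a rough illustration of the loss described in the abstract, the sketch below adds a binary Tversky term for the change mask to an L2 term on the predicted displacements. The abstract does not give the exact formulation, so the `alpha`, `beta`, and `weight` parameters, the per-pixel averaging, and all function names here are assumptions rather than the authors' implementation.

```python
import math


def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Binary Tversky loss on flat lists of probabilities and 0/1 labels.

    alpha weights false positives and beta weights false negatives
    (both values are assumed here, not taken from the paper).
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def l2_flow_loss(pred_flow, true_flow):
    """Mean L2 norm of the per-pixel displacement error (assumed form)."""
    errors = [
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)
    ]
    return sum(errors) / len(errors)


def combined_loss(pred_mask, true_mask, pred_flow, true_flow, weight=1.0):
    """Sum of the two terms; `weight` is a hypothetical balancing factor."""
    return tversky_loss(pred_mask, true_mask) + weight * l2_flow_loss(
        pred_flow, true_flow
    )
```

With `alpha = beta = 0.5` the Tversky term reduces to the Dice loss; raising `beta` penalizes missed changes more heavily, which may suit hazard monitoring where false negatives are costly.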

[45] LMPNet for Weakly-supervised Keypoint Discovery

Pei Guo,Ryan Farrell

Main category: cs.CV

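TL;DR: This work proposes LMPNet, a new method that automatically discovers semantic object keypoints under weak supervision from category labels alone and achieves high prediction accuracy.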

Details Motivation: To explore semantic object keypoint discovery weakly supervised by category labels alone, achieved by transforming discriminatively trained intermediate-layer filters into keypoint detectors. Method: A novel, computationally efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that align well with object keypoints. A selection strategy ensures consistent filter activations, and attention mask-out makes the network attend to the whole object rather than only the most discriminative region. Finally, a learnable clustering layer groups keypoint proposals into keypoint predictions. Result: Experiments show that LMPNet automatically discovers semantic keypoints that are robust to object pose and achieves prediction accuracy comparable to a supervised pose estimation model. Conclusion: LMPNet is a highly interpretable model that directly manipulates network filters to detect predefined concepts and automatically discovers semantic keypoints that are robust to object pose. Abstract: In this work, we explore the task of semantic object keypoint discovery weakly-supervised by only category labels. This is achieved by transforming discriminatively-trained intermediate layer filters into keypoint detectors. We begin by identifying three preferred characteristics of keypoint detectors: (i) spatially sparse activations, (ii) consistency and (iii) diversity. Instead of relying on hand-crafted loss terms, a novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that are well aligned with object keypoints. Informed by visualizations, a simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. The final model, named LMPNet, is highly interpretable in that it directly manipulates network filters to detect predefined concepts. Our experiments show that LMPNet can (i) automatically discover semantic keypoints that are robust to object pose and (ii) achieve strong prediction accuracy comparable to a supervised pose estimation model.

[46] Perception Activator: An intuitive and portable framework for brain cognitive exploration

Le Xu,Qi Zhang,Qixian Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao

Main category: cs.CV

TL;DR: This paper presents a new framework that utilizes fMRI data to improve object detection and segmentation by leveraging multi-object semantic cues and spatial information not fully utilized by current models.

Details Motivation: To better understand the brain's visual perception patterns and how current decoding models process semantic objects due to limitations in existing methods that rely heavily on pixel-level alignment and lack fine-grained semantic alignment. Method: An experimental framework was developed using fMRI representations as intervention conditions. These representations were injected into multi-scale image features through cross-attention, and the effects were analyzed on object detection and instance segmentation tasks. Result: Incorporating fMRI signals improved the accuracy of downstream detection and segmentation tasks, demonstrating their value in enhancing model performance. Conclusion: The study concludes that fMRI signals contain rich multi-object semantic cues and coarse spatial localization information that current models have not fully exploited. Abstract: Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. 
Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.

[47] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation

JaeHyuck Choi,MinJun Kim,JeHyeong Hong

Main category: cs.CV

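TL;DR: This paper proposes MAGIC, which addresses background preservation, mask alignment, and diverse generation in few-shot anomaly generation, and validates its superior performance on a real-world dataset.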

Details Motivation: Existing diffusion models for few-shot anomaly generation struggle to simultaneously preserve the normal background, align anomalies accurately with their masks, and generate anomalies that are diverse in appearance and semantically plausible. Method: MAGIC builds on a Stable Diffusion inpainting model and introduces a multi-level perturbation strategy and a context-aware alignment module: 1) Gaussian prompt-level perturbation; 2) mask-guided spatial noise injection; 3) a context-aware mask alignment module that relocates anomaly positions. Result: On the MVTec-AD dataset, under a consistent evaluation protocol, MAGIC surpasses previous state-of-the-art methods and performs well in industrial quality inspection scenarios. Conclusion: MAGIC preserves the normal background, strictly aligns anomalous regions with their masks, and generates semantically plausible anomalies, while outperforming previous state-of-the-art methods on downstream anomaly tasks. Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC--Mask-guided inpainting with multi-level perturbations and Context-aware alignment--to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. 
Under a consistent evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.
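
The two perturbation strategies named in the abstract can be pictured with a small sketch: Gaussian noise added to a prompt embedding, and Gaussian noise added to a latent only where the anomaly mask is set. This is a toy illustration on plain Python lists, not the MAGIC implementation; the function names, the `sigma` values, and the choice to perturb every embedding dimension equally are assumptions.

```python
import random


def perturb_prompt_embedding(embedding, sigma, rng):
    """Add zero-mean Gaussian noise to each dimension of a prompt embedding."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def inject_masked_noise(latent, mask, sigma, rng):
    """Add Gaussian noise to a 2D latent only where the mask is non-zero.

    Entries outside the mask are returned unchanged, so the normal
    background is left intact.
    """
    return [
        [v + rng.gauss(0.0, sigma) if m else v for v, m in zip(row, mask_row)]
        for row, mask_row in zip(latent, mask)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    latent = [[1.0, 2.0], [3.0, 4.0]]
    mask = [[0, 1], [0, 0]]
    print(inject_masked_noise(latent, mask, sigma=0.5, rng=rng))
    print(perturb_prompt_embedding([0.1, 0.2, 0.3], sigma=0.05, rng=rng))
```

Restricting the noise to masked positions is what separates the local texture variation from the global appearance change driven by the prompt-level perturbation.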

[48] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

Zecheng Zhao,Selena Song,Tong Chen,Zhi Chen,Shazia Sadiq,Yadan Luo

Main category: cs.CV

TL;DR: This paper introduces SynTVA, a benchmark and dataset for evaluating synthetic video utility in text-to-video retrieval tasks, demonstrating its value in dataset augmentation and model improvement.

Details Motivation: Current text-to-video synthesis evaluation metrics focus on visual quality and temporal consistency, offering limited insights into performance in downstream tasks like text-to-video retrieval (TVR). This work aims to address this gap by introducing a comprehensive benchmark. Method: The work introduces SynTVA, a new dataset and benchmark derived from 800 diverse user queries based on the MSRVTT training split. Synthetic videos are generated using state-of-the-art T2V models and annotated across four semantic alignment dimensions. An evaluation framework correlates VQA metrics with alignment scores, while an Auto-Evaluator is developed to estimate alignment quality. Result: SynTVA provides insights into semantic alignment between video and text, demonstrates correlation between general video quality metrics and alignment scores, and enables improved TVR performance through dataset augmentation with high-utility synthetic samples. Conclusion: SynTVA serves not only as a benchmark but also as a valuable tool for dataset augmentation, improving TVR outcomes by selecting high-utility synthetic samples. Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. 
Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
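
One common way to relate a quality metric to human alignment scores is a rank correlation. The sketch below computes Spearman's coefficient in plain Python; the abstract does not say which correlation measure SynTVA uses, so this choice, like the example numbers in the usage note, is an assumption made purely for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(vqa_scores, action_alignment)` on two hypothetical per-video score lists would return a value near 1 if the quality metric orders videos the same way the alignment annotations do, and near 0 if it carries little information about alignment.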

[49] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Nina Konovalova,Maxim Nikolaev,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

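TL;DR: This paper proposes InnerControl, a training strategy that enforces spatial consistency throughout the entire diffusion process, significantly improving the spatial control and generation quality of text-to-image diffusion models.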

Details Motivation: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging, and existing methods such as ControlNet and ControlNet++ have limitations. Method: Lightweight convolutional probes are trained to reconstruct the input control signals from intermediate UNet features at every denoising step, and the discrepancy between predicted and target conditions is minimized throughout the entire diffusion process. Result: InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth) and can be combined with established techniques such as ControlNet++. Conclusion: InnerControl proposes a new training strategy that enforces spatial consistency across all diffusion steps to improve the spatial control fidelity and generation quality of text-to-image diffusion models. Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
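
The difference between supervising every denoising step and supervising only the final ones can be shown with a toy alignment loss. In this sketch the probe outputs are stand-in lists of numbers, the discrepancy is a mean squared error, and `last_k` is a hypothetical switch; none of these details come from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def alignment_loss(probe_outputs, target_control, last_k=None):
    """Average discrepancy between probe reconstructions and the control.

    probe_outputs is ordered from the noisiest step to the final step.
    With last_k=None every step contributes (the all-steps setting);
    with last_k=k only the final k steps contribute, mimicking a loss
    restricted to the end of the denoising trajectory.
    """
    steps = probe_outputs if last_k is None else probe_outputs[-last_k:]
    return sum(mse(p, target_control) for p in steps) / len(steps)
```

With reconstructions `[[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]` and target `[1.0, 0.0]`, the all-steps loss is about 0.33 while `last_k=1` gives 0.0, so the restricted loss would miss the misalignment at the earlier steps.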

[50] Neural Network-based Study for Rice Leaf Disease Recognition and Classification: A Comparative Analysis Between Feature-based Model and Direct Imaging Model

Farida Siddiqi Prity,Mirza Raquib,Saydul Akbar Murad,Md. Jubayar Alam Rafi,Md. Khairul Bashar Bhuiyan,Anupam Kumar Bairagi

Main category: cs.CV

TL;DR: This paper compares two image-processing models for detecting rice leaf diseases, finding that the Feature Analysis Detection Model outperforms the Direct Image-Centric Detection Model, offering better accuracy and potential benefits for rice farming productivity.

Details Motivation: Rice leaf diseases significantly reduce productivity and cause economic losses, emphasizing the need for early detection. This study aims to address the lack of comparative analysis between FADM and DICDM to determine the most effective approach for rice disease classification. Method: This research employed Artificial Neural Network (ANN)-based image processing techniques to compare two models: FADM, which uses Feature Extraction Algorithms (FEAs), Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM); and DICDM, which does not use any FEAs. The experiments were conducted on a dataset of rice leaf images across six categories using 10-fold cross-validation. Result: The results show that the Feature Analysis Detection Model achieves the highest performance in classifying rice leaf diseases compared to the Direct Image-Centric Detection Model, highlighting its effectiveness in improving detection accuracy. Conclusion: The study concludes that the Feature Analysis Detection Model (FADM) outperforms the Direct Image-Centric Detection Model (DICDM) in classifying rice leaf diseases, offering potential improvements in crop health, yield loss reduction, and overall rice farming productivity and sustainability. Abstract: Rice leaf diseases significantly reduce productivity and cause economic losses, highlighting the need for early detection to enable effective management and improve yields. This study proposes Artificial Neural Network (ANN)-based image-processing techniques for timely classification and recognition of rice diseases. Despite the prevailing approach of directly inputting images of rice leaves into ANNs, there is a noticeable absence of thorough comparative analysis between the Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM), specifically when it comes to evaluating the effectiveness of Feature Extraction Algorithms (FEAs). 
Hence, this research presents initial experiments on the Feature Analysis Detection Model, utilizing various image Feature Extraction Algorithms, Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM). The experiments are carried out on datasets encompassing bacterial leaf blight, brown spot, leaf blast, leaf scald, sheath blight rot, and healthy leaf, using 10-fold cross-validation. A Direct Image-Centric Detection Model is established without any FEA, and classification performance is evaluated with several metrics. Finally, the two models are compared exhaustively on rice leaf disease classification. The results reveal that the highest performance is attained using the Feature Analysis Detection Model. The adoption of the proposed Feature Analysis Detection Model for detecting rice leaf diseases holds excellent potential for improving crop health, minimizing yield losses, and enhancing overall productivity and sustainability of rice farming.
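
For readers unfamiliar with the 10-fold cross-validation protocol mentioned above, the sketch below splits sample indices into k folds so that each sample is used for validation exactly once. It is a generic illustration, not the authors' pipeline: it does no shuffling or class stratification, both of which are usually advisable for a six-class image dataset.

```python
def k_fold_indices(n_samples, k=10):
    """Return a list of (train_indices, validation_indices) pairs.

    Fold sizes differ by at most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds
```

A model such as the ELM classifier would then be trained on `train` and scored on `validation` for each of the ten pairs, and the reported metric would be the average over folds.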

[51] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection

Rafic Nader,Vincent L'Allinec,Romain Bourcier,Florent Autrusseau

Main category: cs.CV

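TL;DR: This paper proposes a two-step neural network method for automatically detecting bifurcations of the Circle of Willis, addressing the missed detections and anatomical variability that challenge conventional approaches, and it performs strongly on multiple datasets.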

Details Motivation: Accurately detecting the 13 major arterial bifurcations where intracranial aneurysms commonly occur is essential for rapid diagnosis, but conventional single-step methods suffer from missed and false detections. Method: An object detection network first identifies regions of interest (ROIs), and a modified U-Net with deep supervision then precisely locates the bifurcations. Result: The method achieves the highest performance on two MRA datasets: an in-house dataset with a variable number of landmarks and a public dataset with a standardized landmark configuration. Conclusion: The proposed two-step neural network method achieves the best performance in detecting Circle of Willis bifurcations and effectively reduces missed detections caused by anatomical complexity and visual similarity. Abstract: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily at thirteen major arterial bifurcations. An accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural network process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is exploited to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our In-House dataset which had varying numbers of landmarks, and a public dataset with standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.
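
The second step of such a pipeline has to map a location found inside a cropped ROI back to coordinates in the full scan. The sketch below shows that bookkeeping in 2D on plain lists, taking the heatmap peak as the landmark; the real method works on 3D MRA volumes with trained networks, and the ROI offsets and heatmaps here are made-up stand-ins for their outputs.

```python
def argmax_2d(heatmap):
    """Return the (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def locate_landmarks(roi_offsets, heatmaps):
    """Convert per-ROI heatmap peaks to full-image coordinates.

    roi_offsets holds the (row, col) of each ROI's top-left corner, as a
    detector might report; heatmaps holds one 2D heatmap per ROI, as a
    localization network might predict. The list may be any length, so
    a scan with fewer detectable landmarks simply yields fewer ROIs.
    """
    landmarks = []
    for (row0, col0), heatmap in zip(roi_offsets, heatmaps):
        r, c = argmax_2d(heatmap)
        landmarks.append((row0 + r, col0 + c))
    return landmarks
```

Because each ROI is processed on its own, two nearby landmarks with similar appearance are resolved in separate crops instead of competing within one full-volume heatmap.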

[52] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Xiangfeng Wang,Xiao Li,Yadong Wei,Xueyu Song,Yang Song,Xiaoqiang Xia,Fangrui Zeng,Zaiyi Chen,Liu Liu,Gu Xu,Tong Xu

Main category: cs.CV

TL;DR: This paper introduces HIVE, a human-inspired automatic video editing framework that leverages multimodal narrative understanding to improve coherence and efficiency in condensing long-form videos, outperforming existing methods and narrowing the gap with human-edited content.

Details Motivation: The motivation stems from the growing demand for efficient video editing techniques due to the rapid growth of online video content, especially on short video platforms, and the limitations of current automatic editing methods that neglect rich visual context and produce incoherent outputs. Method: The method involves a human-inspired framework (HIVE) that integrates character extraction, dialogue analysis, narrative summarization using multimodal large language models, and scene-level segmentation to decompose editing into highlight detection, opening/ending selection, and pruning irrelevant content. Result: Experimental results show that the HIVE framework outperforms existing baselines across both general and advertisement-oriented editing tasks. Additionally, a new benchmark dataset called DramaAD was introduced, comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Conclusion: The paper concludes that the proposed HIVE framework effectively addresses the limitations of existing automatic video editing methods by leveraging multimodal narrative understanding, resulting in outputs that significantly narrow the quality gap between automatic and human-edited videos. Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. 
Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

[53] Lightweight Shrimp Disease Detection Research Based on YOLOv8n

Fei Yuhuan,Wang Gengchen,Liu Fenghao,Zang Ran,Sun Xufei,Chang Hao

Main category: cs.CV

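TL;DR: This study proposes a new lightweight network architecture that improves the efficiency and accuracy of disease detection in shrimp farming; by optimizing the model structure and introducing an attention mechanism, it improves detection performance while reducing parameters.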

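Details Motivation: Shrimp diseases are one of the primary causes of economic losses in aquaculture. Preventing disease transmission and improving intelligent detection efficiency in shrimp farming requires an efficient and accurate disease detection method. Method: 1. The RLDD detection head and C2f-EMCM module are designed to reduce computational complexity while maintaining detection accuracy; 2. An improved SegNext_Attention self-attention mechanism is introduced to enhance feature extraction; 3. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset and the URPC2020 dataset. Result: 1. Parameters are reduced by 32.3% compared with the original YOLOv8n; 2. mAP@0.5 reaches 92.7%, a 3% improvement over YOLOv8n; 3. Generalization experiments on the URPC2020 dataset show a 4.1% increase in mAP@0.5 over YOLOv8n; 4. The model outperforms other lightweight YOLO-series models in accuracy, parameter count, and model size. Conclusion: This paper proposes a lightweight network architecture based on YOLOv8n that, by designing the RLDD detection head and C2f-EMCM module and introducing an improved SegNext_Attention self-attention mechanism, improves detection accuracy and computational efficiency while reducing the parameter count. Experiments on the shrimp disease dataset and the URPC2020 dataset show that it outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size, providing reliable technical support for intelligent disease detection in shrimp aquaculture. Abstract: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model's feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model's robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. 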
The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.

[54] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao,Yi Ding,Lijun Li,Jing Shao

Main category: cs.CV

TL;DR: This paper proposes VisCo Attack, a novel visual-centric method to exploit MLLMs by generating realistic contextual dialogues with high attack success rates.

Details Motivation: MLLMs are vulnerable to attacks through visual inputs, but existing methods lack realism. This work aims to develop a more effective and realistic visual-based attack strategy. Method: VisCo Attack uses four visual-focused strategies and dynamically generates auxiliary images to create a contextual dialogue. It incorporates toxicity obfuscation and semantic refinement to enhance attack effectiveness. Result: VisCo achieved a toxicity score of 4.78 and an ASR of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline. Conclusion: The paper introduces a new visual-centric jailbreak approach called VisCo Attack, which effectively triggers harmful responses from MLLMs by constructing realistic visual contexts. Abstract: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. 
To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

[55] Holistic Tokenizer for Autoregressive Image Generation

Anlin Zheng,Haochen Wang,Yucheng Zhao,Weipeng Deng,Tiancai Wang,Xiangyu Zhang,Xiaojuan Qi

Main category: cs.CV

TL;DR: Hita improves autoregressive image generation by introducing a novel tokenizer that better captures global image properties, leading to superior performance on key benchmarks.

Details Motivation: The motivation is to overcome the limitations of vanilla autoregressive models and visual tokenizers in capturing global information and holistic relationships among token sequences. Method: Hita introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens, along with a sequential structure and fusion module to enhance AR generation. Result: Hita achieves 2.59 FID and 281.9 IS on ImageNet, accelerates training speed, and demonstrates effectiveness in zero-shot style transfer and image in-painting. Conclusion: Hita, a novel image tokenizer for autoregressive image generation, effectively captures holistic relationships and outperforms vanilla tokenizers in performance and efficiency. Abstract: The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. 
A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
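
The sequential structure described in the abstract, holistic tokens first and patch tokens after, combined with causal attention, means every patch token can attend to all holistic tokens while no token sees later positions. The sketch below builds such a sequence and its boolean attention mask; the token labels and function names are placeholders, and the actual tokenizer operates on learned embeddings, not strings.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Place holistic tokens at the start, followed by patch-level tokens."""
    return list(holistic_tokens) + list(patch_tokens)


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    sequence = build_sequence(["h0", "h1"], ["p0", "p1", "p2"])
    mask = causal_mask(len(sequence))
    for token, row in zip(sequence, mask):
        print(token, ["x" if allowed else "." for allowed in row])
```

In the printed mask, the row for the first patch token `p0` marks both holistic positions as visible, which is how global context reaches every locally generated token under this ordering.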

[56] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

Jiahao Wu,Rui Peng,Jianbo Jiao,Jiayu Yang,Luyang Tang,Kaiqiang Xiong,Jie Liang,Jinbo Yan,Runling Liu,Ronggang Wang

Main category: cs.CV

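TL;DR: This study proposes LocalDyGS, a new framework for modeling highly dynamic scenes that decomposes the scene and decouples static and dynamic features to achieve more realistic dynamic video synthesis.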

Details Motivation: Because real-world motion is complex and highly dynamic, synthesizing dynamic videos at arbitrary viewpoints from multi-view inputs is challenging, and methods based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion. Method: 1) A complex dynamic scene is decomposed into streamlined local spaces defined by seeds, enabling global modeling; 2) static and dynamic features are decoupled for local space motion modeling, with a shared static feature and a time-specific dynamic residual field combined to generate Temporal Gaussians. Result: The method is competitive with state-of-the-art methods on various fine-scale datasets and is the first attempt to model larger and more complex highly dynamic scenes. Conclusion: The paper proposes LocalDyGS, a new dynamic scene reconstruction framework that models highly dynamic real-world scenes more realistically. Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.

[57] UVLM: Benchmarking Video Language Model for Underwater World Understanding

Xizhe Xue,Yang Zhou,Dawei Yan,Ying Li,Haokui Zhang,Rong Xiao

Main category: cs.CV

TL;DR: This paper introduces UVLM, an underwater observation benchmark designed to enhance the application of video language models (VidLMs) in underwater environments, demonstrating significant improvements in understanding the underwater world.

Details Motivation: The motivation is to address the gap in the application of video language models (VidLMs) for underwater observation, as existing works focus mainly on terrestrial scenarios. Method: The method involves creating a collaborative underwater observation benchmark called UVLM, which includes a diverse dataset addressing underwater challenges like light variations, water turbidity, and diverse viewing angles. It also adopts a structured design for task diversity and designs challenging evaluation metrics. Result: Experiments show that fine-tuning VidLMs on UVLM significantly improves underwater world understanding and demonstrates potential for slight improvements on existing in-air VidLM benchmarks like VideoMME and Perception Test. Conclusion: The paper concludes that fine-tuning VidLMs on UVLM significantly improves underwater world understanding and shows potential for slight improvements on existing in-air VidLM benchmarks. Abstract: Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To overcome this gap, we introduce UVLM, an underwater observation benchmark built through a collaborative approach combining human expertise and AI models. To ensure data quality, we have conducted in-depth considerations from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. 
Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception text. The dataset and prompt engineering will be released publicly.

[58] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

Seokyeong Lee,Sithu Aung,Junyong Choi,Seungryong Kim,Ig-Jae Kim,Junghyun Cho

Main category: cs.CV

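TL;DR: This paper proposes a new pseudo-labeling framework for monocular 3D object detection (M3OD) that uses only video data, is more robust to occlusion, and requires no multi-view setup, additional sensors, camera poses, or domain-specific training.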

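Details Motivation: Monocular 3D object detection has long been challenging because of data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Various weakly supervised and pseudo-labeling methods have been proposed to address these issues, but most are limited by domain-specific learning or rely solely on shape information from a single observation. Method: The authors explore a technique that aggregates the pseudo-LiDARs of static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction where 3D data acquisition is infeasible. Result: Extensive experiments show that the method delivers reliable accuracy and strong scalability, making it a practical and effective solution for monocular 3D object detection. Conclusion: The proposed pseudo-labeling framework offers a practical and efficient solution for monocular 3D object detection, particularly in scenes with occlusion, while avoiding the complex setups and constraints that conventional methods require. Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.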

[59] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

Byung Hyun Lee,Wongi Jeong,Woojae Han,Kyoungbun Lee,Se Young Chun

Main category: cs.CV

TL;DR: This paper proposes CoMEL, a novel MIL framework for continual learning that improves localization accuracy and reduces forgetting, achieving state-of-the-art results on histopathological image datasets.

Details Motivation: To address the lack of adaptability in MIL for continual tasks with minimal forgetting, especially for instance classification and localization challenges in large-scale histopathological images. Method: The proposed CoMEL framework includes three components: GDAT for efficient instance encoding, BPPL for pseudo-labeling, and OWLoRA to reduce classification forgetting. Result: CoMEL outperforms previous methods by up to 11.00% in bag-level accuracy and 23.4% in localization accuracy on three public WSI datasets under continual MIL setup. Conclusion: CoMEL demonstrates superior performance in continual MIL tasks, particularly in localization accuracy and adaptability with minimal forgetting. Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. 
Extensive experiments on three public WSI datasets demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in localization accuracy under the continual MIL setup.

[60] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Taehoon Kim,Jongwook Choi,Yonghyun Jeong,Haeun Noh,Jaejun Yoo,Seungryul Baek,Jongwon Choi

Main category: cs.CV

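TL;DR: This paper proposes a new deepfake video detection method that exploits pixel-wise temporal inconsistencies to improve detection performance, combining a Fourier transform, an attention mechanism, and a Transformer module for efficient and accurate forgery detection.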

Details Motivation: Traditional spatial frequency-based methods overlook pixel-wise temporal inconsistencies when detecting deepfake videos, which makes temporal artifacts hard to detect. A more precise and robust detection method is therefore needed. Method: A 1D Fourier transform is applied along the time axis of each pixel to extract features that are highly sensitive to temporal inconsistencies; an attention proposal module is introduced to precisely locate regions containing temporal artifacts, and a joint Transformer module integrates pixel-wise temporal frequency features with spatio-temporal context features. Result: The framework performs well across diverse and challenging detection scenarios, representing a significant advance in deepfake video detection. Conclusion: The paper proposes a deepfake video detection method based on pixel-wise temporal inconsistencies that identifies temporal artifacts in forged videos more effectively than traditional methods. Abstract: We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
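
To make the "1D Fourier transform on the time axis for each pixel" step concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names, the `video[t][y][x]` layout, and the toy clip are all assumptions made for illustration, and a naive DFT stands in for whatever FFT routine a real pipeline would use.

```python
import cmath
import math


def temporal_spectrum(series):
    """Magnitude of the 1D DFT of one pixel's intensity over time."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(series)))
        for k in range(n)
    ]


def pixelwise_temporal_spectra(video):
    """Map video[t][y][x] to spectra[y][x][k], one spectrum per pixel."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    return [
        [temporal_spectrum([video[t][y][x] for t in range(frames)])
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical toy clip: 4 frames of 1x2 pixels.
# The left pixel is constant; the right pixel flickers every frame.
video = [[[5, 0]], [[5, 1]], [[5, 0]], [[5, 1]]]
spectra = pixelwise_temporal_spectra(video)
```

In this toy clip the constant pixel puts all of its energy in the zero-frequency bin, while the flickering pixel also has energy in the highest-frequency bin, which is the kind of per-pixel temporal signature the abstract describes as sensitive to temporal inconsistencies.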

[61] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation

Peilin Zhang,Shaouxan Wua,Jun Feng,Zhuo Jin,Zhizezhang Gao,Jingkun Chen,Yaqiong Xing,Xiao Zhang

Main category: cs.CV

TL;DR: This paper proposes a new weakly-supervised medical image segmentation framework called TAB Net that effectively addresses the challenges posed by sparse scribble annotations. Experimental results show its superior performance compared to existing methods.

Details Motivation: Medical image segmentation is a crucial task but acquiring large-scale annotated datasets is costly and time-consuming. Scribble annotations offer an efficient alternative but pose challenges due to their sparsity and lack of boundary supervision. This paper aims to address these challenges with a novel weakly-supervised segmentation framework. Method: The proposed TAB Net framework consists of a triplet augmentation self-recovery (TAS) module and a boundary-aware pseudo-label supervision (BAP) module. The TAS module uses three augmentation strategies - intensity transformation, cutout, and jigsaw augmentation to enhance feature learning. The BAP module enhances pseudo-supervision accuracy and boundary modeling through dual-branch predictions fusion and a boundary-aware loss. Result: Experimental evaluations on ACDC and MSCMR seg datasets show that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation and achieves performance comparable to fully supervised methods. Conclusion: TAB Net demonstrates significant improvement over state-of-the-art methods in scribble-based weakly supervised segmentation and achieves performance comparable to fully supervised methods. Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. 
Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model's sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.
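
The three augmentation strategies named in the TAS module can be sketched as follows. This is a simplified illustration on a 2D list of intensities in [0, 1], not the paper's code: the function names, parameters, and the requirement that the image size be divisible by the jigsaw grid are assumptions.

```python
import random


def intensity_transform(img, gain=1.2, bias=0.0):
    """Linear intensity change, clamped to [0, 1]."""
    return [[min(1.0, max(0.0, gain * v + bias)) for v in row] for row in img]


def cutout(img, top, left, size):
    """Mask a size x size square with zeros."""
    out = [row[:] for row in img]
    for y in range(top, min(top + size, len(img))):
        for x in range(left, min(left + size, len(img[0]))):
            out[y][x] = 0.0
    return out


def jigsaw(img, grid, rng):
    """Split into grid x grid tiles and shuffle the tile positions."""
    tile_h, tile_w = len(img) // grid, len(img[0]) // grid
    slots = [(r, c) for r in range(grid) for c in range(grid)]
    sources = slots[:]
    rng.shuffle(sources)
    out = [row[:] for row in img]
    for (dr, dc), (sr, sc) in zip(slots, sources):
        for y in range(tile_h):
            for x in range(tile_w):
                out[dr * tile_h + y][dc * tile_w + x] = img[sr * tile_h + y][sc * tile_w + x]
    return out


# Hypothetical 4x4 image with a simple intensity ramp.
image = [[(4 * y + x) / 16 for x in range(4)] for y in range(4)]
brighter = intensity_transform(image, gain=1.5)
masked = cutout(image, top=1, left=1, size=2)
shuffled = jigsaw(image, grid=2, rng=random.Random(0))
```

In the self-recovery setting described above, the network would receive inputs like `brighter`, `masked`, and `shuffled` and be trained to recover the complete mask; that training loop is outside the scope of this sketch.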

[62] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings

Mufhumudzi Muthivhi,Terence L. van Zyl

Main category: cs.CV

TL;DR: This paper investigates self-supervised learning for wildlife re-identification, demonstrating that it surpasses supervised methods in performance while reducing reliance on annotated data.

Details Motivation: Current state-of-the-art models rely on annotated data, which necessitates large-scale curated datasets. This work explores self-supervised alternatives to eliminate dependence on labeled data. Method: This study uses self-supervised learning (SSL) with temporal image pairs from camera trap data to train a model without supervision for wildlife re-identification. Result: The analysis shows that self-supervised models perform better than supervised models in open-world scenarios and transfer learning tasks even when limited data is available. Conclusion: Self-supervised models are more robust and outperform supervised features across all downstream tasks in wildlife re-identification. Abstract: Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available at https://github.com/pxpana/SSLWildlife.

[63] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration

Ayantika Das,Moitreya Chaudhuri,Koushik Bhat,Keerthi Ram,Mihail Bota,Mohanasankar Sivaprakasam

Main category: cs.CV

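TL;DR: This paper combines a diffusion model with an auto-encoder in a new framework that builds structured representations in the latent space, enabling recognition of region-specific cellular patterns in brain images and two unsupervised image restoration applications (tear artifact and JPEG artifact restoration).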

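Details Motivation: Diffusion models can generate high-quality image samples but lack a means of extracting image-specific semantic representations, which auto-encoders provide through their encoding component. Combining the two therefore makes better use of the strengths of each. Method: 1. Design a mechanism to structure the latent space of a diffusion auto-encoding model so that it recognizes region-specific cellular patterns in brain images; 2. Use latent representations and the constrained generation capability of diffusion models to perform unsupervised, neighborhood-aware tear artifact restoration during inference; 3. Use representational guidance and the inference-time steerable noising and denoising capability of diffusion models to design an unsupervised JPEG artifact restoration technique. Result: 1. A latent space that captures local positional information in high-resolution images, which helps differentiate brain tissue types; 2. Two unsupervised image restoration methods, one for tear artifacts and one for artifacts caused by JPEG compression; 3. The method achieves structured representations in the latent space and good restoration results through the controllable generation capability of diffusion models. Conclusion: By combining a diffusion model with an auto-encoder, the paper proposes a method that learns image-specific representations and organizes the latent space, enabling recognition of region-specific cellular patterns in brain images, tear artifact restoration, and JPEG artifact restoration. Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applicabilities, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference.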
Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

[64] A Novel Tuning Method for Real-time Multiple-Object Tracking Utilizing Thermal Sensor with Complexity Motion Pattern

Duong Nguyen-Ngoc Tran,Long Hoang Pham,Chi Dai Tran,Quoc Pham-Nam Ho,Huy-Hung Nguyen,Jae Wook Jeon

Main category: cs.CV

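TL;DR: This paper introduces a new method for multi-object tracking in thermal images that improves tracking performance through two-stage optimization and hyperparameter tuning designed for the complex motion patterns in thermal imagery.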

Details Motivation: Thermal sensors can enhance recognition tasks in low visibility or poor lighting, but their low-level feature representation makes it difficult to detect and track pedestrians accurately, so a new method is needed. Method: The paper proposes a novel tuning method for pedestrian tracking, designed specifically for the complex motion patterns in thermal imagery, which optimizes two stages and fine-tunes hyperparameters to maximize tracking performance. Result: Extensive experiments on the PBVS Thermal MOT dataset show that the method is highly effective across various thermal camera conditions and achieves high-accuracy real-time tracking without relying on complex re-identification or motion models. Conclusion: The paper concludes that the two-stage framework with hyperparameter tuning performs well in thermal multi-object tracking, providing a robust solution for real-world surveillance applications. Abstract: Multi-Object Tracking in thermal images is essential for surveillance systems, particularly in challenging environments where RGB cameras struggle due to low visibility or poor lighting conditions. Thermal sensors enhance recognition tasks by capturing infrared signatures, but a major challenge is their low-level feature representation, which makes it difficult to accurately detect and track pedestrians. To address this, the paper introduces a novel tuning method for pedestrian tracking, specifically designed to handle the complex motion patterns in thermal imagery. The proposed framework optimizes two stages, ensuring that each stage is tuned with the most suitable hyperparameters to maximize tracking performance. By fine-tuning hyperparameters for real-time tracking, the method achieves high accuracy without relying on complex reidentification or motion models. Extensive experiments on the PBVS Thermal MOT dataset demonstrate that the approach is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications.

[65] Privacy-preserving Preselection for Face Identification Based on Packing

Rundong Xin,Taotao Wang,Jin Wang,Chonghe Zhao,Jing Wang

Main category: cs.CV

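TL;DR: This paper proposes PFIP, an efficient face identification scheme in the ciphertext domain whose preselection mechanism and packing module significantly improve retrieval efficiency while preserving identification accuracy.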

Details Motivation: Face identification systems operating in the ciphertext domain have gained attention because of growing privacy concerns and the potential recovery of original facial data, but as ciphertext template libraries grow, retrieval becomes increasingly time-consuming. Method: The paper proposes a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP), which includes an innovative preselection mechanism and a packing module. Result: Experiments show that PFIP maintains a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds, a nearly 50x improvement in retrieval efficiency over existing methods. Conclusion: PFIP achieves efficient face identification in the ciphertext domain while preserving the accuracy of the original model, and its efficiency and practicality are validated on the LFW and CASIA datasets. Abstract: Face identification systems operating in the ciphertext domain have garnered significant attention due to increasing privacy concerns and the potential recovery of original facial data. However, as the size of ciphertext template libraries grows, the face retrieval process becomes progressively more time-intensive. To address this challenge, we propose a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP). PFIP incorporates an innovative preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Extensive experiments conducted on the LFW and CASIA datasets demonstrate that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Compared to existing approaches, PFIP achieves a nearly 50x improvement in retrieval efficiency.

[66] Determination Of Structural Cracks Using Deep Learning Frameworks

Subhasis Dasgupta,Jaydip Sen,Tuhina Halder

Main category: cs.CV

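TL;DR: This study proposes a new deep learning architecture based on residual U-Net models, further improved through ensembling, that achieves higher accuracy and efficiency in structural crack detection.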

Details Motivation: Manual detection of structural cracks can be slow, inconsistent, and error-prone, which undermines the reliability of assessments. A more efficient and accurate automated solution is therefore needed to prevent potential structural failures and protect public safety. Method: The study uses residual U-Net models in various configurations and integrates them into an ensemble with a meta-model comprising convolutional blocks to improve prediction efficiency. Model performance is evaluated with the IoU metric and the DICE coefficient and compared against established architectures such as SegNet and the traditional U-Net. Result: The residual U-Net models outperform their predecessors on low-resolution imagery, and the ensemble model exceeds the individual models, achieving the highest IoU and DICE scores. Conclusion: The proposed residual U-Net models and their ensemble outperform existing architectures in structural crack detection, showing higher accuracy and efficiency particularly on low-resolution imagery. This improvement paves the way for more reliable automated structural defect monitoring systems. Abstract: Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble's performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving it as the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement paves the way for more reliable automated systems in structural defect monitoring tasks.
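
The two evaluation metrics named above have standard definitions for binary masks: IoU is the intersection divided by the union, and the Dice coefficient is twice the intersection divided by the sum of the two mask areas. The sketch below computes both on small hand-made masks; the function name, the convention of returning 1.0 for two empty masks, and the example data are assumptions, not details from the paper.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as nested lists of 0/1."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_area = sum(sum(row) for row in pred)
    target_area = sum(sum(row) for row in target)
    union = pred_area + target_area - inter
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_area + target_area) if pred_area + target_area else 1.0
    return iou, dice


# Hypothetical crack masks: 1 marks a crack pixel.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
```

The two scores are monotonically related by Dice = 2 * IoU / (1 + IoU), so they rank models identically on a single image but can differ once scores are averaged over a dataset.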

[67] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars

Yiming Zhong,Xiaolin Zhang,Ligang Liu,Yao Zhao,Yunchao Wei

Main category: cs.CV

TL;DR: AvatarMakeup is a novel approach for realistically applying makeup to 3D avatars, ensuring high quality and consistency.

Details Motivation: Current methods for applying makeup to 3D avatars don't provide realistic effects or maintain consistency across different expressions and views. Method: The method involves using a pretrained diffusion model and a Coherent Duplication technique to apply makeup coarsely and then refine it for high quality. Result: Experiments show that AvatarMakeup outperforms existing methods in terms of makeup transfer quality and consistency during animation. Conclusion: AvatarMakeup provides a new method for applying realistic makeup to 3D avatars with high quality and consistency. Abstract: Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. 
By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

[68] F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

Wei Li,Jingyang Zhang,Lihao Liu,Guoan Wang,Junjun He,Yang Chen,Lixu Gu

Main category: cs.CV

TL;DR: This paper introduces a novel framework (I-DiPT) for adapting medical AI models to real-world scenarios where test data arrives in unpredictable fragments, achieving better performance than current methods.

Details Motivation: Existing Test-Time Adaptation (TTA) methods assume data arrives in complete domain units, which is not reflective of clinical practice where data often comes as fragmented domains with unpredictable shifts. This limitation motivates the investigation of a more practical Free-Form Test-Time Adaptation (F²TTA) task. Method: The paper introduces the Image-level Disentangled Prompt Tuning (I-DiPT) framework. It uses an image-invariant prompt to find domain-invariant representations and an image-specific prompt to adapt the model to each test image. Additionally, it incorporates Uncertainty-oriented Masking (UoM) and Parallel Graph Distillation (PGD) to enhance knowledge extraction and reuse. Result: Experiments on breast cancer and glaucoma classification tasks showed that the proposed I-DiPT framework outperforms existing TTA approaches under the F²TTA setting. Conclusion: The paper proposes a new framework called I-DiPT for F²TTA, which demonstrates superiority over existing TTA methods in handling free-form domain fragments with unpredictable shifts. Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. 
I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.

[69] Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic

Sandro Costa Magalhães,Marco Almeida,Filipe Neves dos Santos,António Paulo Moreira,Jorge Dias

Main category: cs.CV

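TL;DR: By deploying three artificial neural network models on an FPGA using the FINN architecture, the paper shows that FPGAs can accelerate ANNs and make them suitable for attention mechanisms.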

Details Motivation: To overcome the limits that a low camera framerate places on robots while they execute tasks and explore, and to make full use of the FPGA's PL resources. Method: The FINN architecture is used to deploy three quantised ANNs (MobileNet v1, CNV with 2-bit quantisation, and CNV with 1-bit quantisation) into the FPGA's PL, with the RG2C dataset used for training and testing. Result: MobileNet v1 performed best, reaching a success rate of 98% and an inference speed of 6611 FPS. Conclusion: The study confirms that FPGAs can effectively accelerate ANNs, improving robot perception efficiency and supporting the use of attention mechanisms. Abstract: Robots usually slow down for scanning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to match the speed of the detection algorithms. This constrains robots while they execute tasks and explore, increasing the task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98% and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.
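
One reason 1-bit quantised networks (BNNs) map well onto programmable logic is that a dot product of ±1 vectors reduces to an XNOR followed by a population count. The sketch below illustrates that general identity in plain Python. It is not FINN code and does not reflect the paper's specific models; all names and the sign convention (values >= 0 map to +1) are assumptions.

```python
def binarize(values):
    """Map real values to +1 / -1 by sign (assumed: 0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into an integer, bit i set when signs[i] == +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each matching position contributes +1, each mismatch -1.
    return 2 * matches - n


# Hypothetical weights and activations.
weights = [0.3, -1.2, 0.0, 2.5]
activations = [-0.7, -0.1, 0.4, 1.0]
w_signs, a_signs = binarize(weights), binarize(activations)
reference = sum(w * a for w, a in zip(w_signs, a_signs))
fast = xnor_popcount_dot(pack_bits(w_signs), pack_bits(a_signs), len(weights))
```

Here `reference` and `fast` agree, showing that the bitwise form computes the same value as the arithmetic one without any multiplications.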

[70] IGDNet: Zero-Shot Robust Underexposed Image Enhancement via Illumination-Guided and Denoising

Hailong Yan,Junjian Huang,Tingwen Huang

Main category: cs.CV

TL;DR: This paper proposes IGDNet, a Zero-Shot enhancement method for restoring underexposed images without requiring guiding priors or training data.

Details Motivation: Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. Method: The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. Result: Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. Conclusion: IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. Abstract: Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. To address these issues, we propose IGDNet, a Zero-Shot enhancement method that operates solely on a single test image, without requiring guiding priors or training data. IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method.
A noise pair is generated through downsampling and refined iteratively to produce the final result. Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. The code will be released soon.
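
For readers unfamiliar with the reported metrics: PSNR is expressed in decibels and is derived from the mean squared error between a reference and a test image, whereas SSIM is a dimensionless index. The sketch below shows the standard PSNR formula on a toy example; the function name, the 8-bit peak value of 255, and the pixel values are assumptions for illustration and are unrelated to the paper's datasets.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    errors = [
        (r - t) ** 2
        for ref_row, test_row in zip(reference, test)
        for r, t in zip(ref_row, test_row)
    ]
    mse = sum(errors) / len(errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 1x2 images: one pixel differs by 10 grey levels.
reference = [[0, 255]]
enhanced = [[0, 245]]
score = psnr(reference, enhanced)
```

With a mean squared error of 50 this example evaluates to roughly 31.14 dB; higher values indicate a closer match to the reference.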

[71] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

Weiwei Duan,Luping Ji,Shengjia Chen,Sicheng Zhu,Jianghong Huang,Mao Ye

Main category: cs.CV

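TL;DR: This paper proposes a weakly-supervised contrastive learning (WeCoL) scheme for moving infrared small target detection that reduces reliance on extensive manual annotation; on two public datasets it outperforms early fully-supervised methods and approaches state-of-the-art fully-supervised ones.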

Details Motivation: Infrared small target detection is highly challenging because of tiny target size and weak background contrast, and existing fully-supervised methods rely on large numbers of manual annotations; manually annotating video sequences is especially expensive and time-consuming. A non-fully supervised method is therefore needed to reduce annotation requirements. Method: The paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, designs a potential target mining strategy based on the pretrained SAM model, and improves it with contrastive learning and a long-short term motion-aware learning scheme. Result: Experiments show that the weakly-supervised scheme often outperforms early fully-supervised methods on two public datasets (DAUB and ITSDT-15K) and reaches over 90% of the performance of state-of-the-art fully-supervised methods. Conclusion: The proposed weakly-supervised scheme is validated on two public datasets, where it often outperforms early fully-supervised methods and reaches over 90% of the performance of state-of-the-art fully-supervised methods. Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential in reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, which only requires simple target quantity prompts during model training. Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation. Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace. Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets. The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Moreover, its performance could reach over 90% of state-of-the-art (SOTA) fully-supervised ones.

[72] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

Gaochao Song,Zibo Zhao,Haohan Weng,Jingbo Zeng,Rongfei Jia,Shenghua Gao

Main category: cs.CV

TL;DR: Mesh Silksong is a novel mesh representation that significantly improves efficiency and geometric integrity by reducing redundancy.

Details Motivation: Existing mesh tokenization methods produce sequences containing repeated vertex tokens, which wastes network capacity. A more efficient and compact mesh representation is therefore needed. Method: The paper introduces Mesh Silksong, a new mesh representation that generates polygon meshes auto-regressively in a manner akin to silk weaving, accessing each vertex only once to reduce redundancy. Result: Mesh Silksong reduces token sequence redundancy by 50% and achieves a state-of-the-art compression rate of approximately 22%, while producing polygon meshes with superior geometric properties. Conclusion: Mesh Silksong achieves an efficient mesh representation by reducing sequence redundancy, improves geometric integrity, and shows superior performance in practical applications. Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence's redundancy by 50%, and achieves a state-of-the-art compression rate of approximately 22%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.

[73] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios

Teng Fu,Yuwen Chen,Zhuofan Chen,Mengyang Zhao,Bin Li,Xiangyang Xue

Main category: cs.CV

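TL;DR: This paper proposes CrowdTrack, a new multi-pedestrian tracking dataset that addresses the limitations of existing datasets in complex scenes. It contains a large number of complex real-life video sequences and provides an effective test benchmark.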

Details Motivation: Existing multi-object tracking datasets suffer from simple scenes and non-realistic scenarios, so they cannot effectively support research on pedestrian tracking in complex scenes. A new high-quality dataset is needed to drive the development of related algorithms. Method: The paper proposes CrowdTrack, a large-scale difficult dataset shot mainly from a first-person view and covering complex real-life scenarios, and presents a comprehensive analysis of the dataset along with tests of multiple SOTA models. Result: The CrowdTrack dataset contains 33 videos with a total of 5,185 trajectories. Each trajectory has a complete bounding box and a unique object ID, making the dataset suitable for multi-pedestrian tracking research in complex environments. Conclusion: CrowdTrack provides a challenging platform for multi-pedestrian tracking, addresses the shortcomings of existing datasets in complex scenes, and is validated by testing multiple SOTA models and foundation models. Abstract: Multi-object tracking is a classic field in computer vision. Among them, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from the annotated data is the simplest solution, the existing MOT dataset fails to satisfy this solution. Existing methods mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing datasets do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack

[74] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

Zunhui Xia,Hongxing Li,Libin Lan

Main category: cs.CV

TL;DR: This paper proposes MedFormer, an efficient and versatile vision transformer for medical image recognition, overcoming existing limitations through a pyramid scaling structure and a novel attention mechanism called Dual Sparse Selection Attention (DSSA).

Details Motivation: Existing vision transformer-based approaches for medical image recognition face two primary challenges: limited general applicability due to being task-specific and architecture-tailored, and high computational costs or suboptimal performance from current attention mechanisms. The need to address these issues motivated this study. Method: The paper introduces MedFormer, which uses a pyramid scaling structure as a versatile backbone and incorporates a novel Dual Sparse Selection Attention (DSSA) mechanism. This approach enhances computational efficiency, robustness against noise, and performance while reducing computational costs. Result: Extensive experiments on multiple imaging modality datasets demonstrate that MedFormer significantly improves performance in medical image classification, semantic segmentation, and lesion detection. It also shows superior generality and efficiency compared to existing medical vision transformers. Conclusion: MedFormer is a highly effective and efficient medical vision transformer that overcomes the limitations of existing methods, offering superior performance across various medical image recognition tasks. Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. 
First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers. Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at https://github.com/XiaZunhui/MedFormer.

[75] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy

Luca Parolari,Andrea Cherubini,Lamberto Ballan,Carlo Biffi

Main category: cs.CV

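TL;DR: This paper proposes a new method for automated polyp counting in colonoscopy. By introducing a temporally-aware supervised contrastive loss and a temporal adjacency constraint, it significantly improves counting accuracy and reduces the fragmentation rate.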

Details Motivation: Existing polyp counting methods rely mainly on self-supervised learning and neglect temporal relationships in both the tracklet feature learning and clustering stages, which limits performance. Method: The paper introduces a supervised contrastive loss that incorporates temporally-aware soft targets, and improves tracklet clustering with a temporal adjacency constraint. Result: Validated on public datasets, the method achieves a 2.2x reduction in fragmentation rate compared with prior approaches, demonstrating the importance of temporal awareness in polyp counting. Conclusion: The paper proposes a new method based on a temporally-aware supervised contrastive loss and a temporal adjacency constraint that achieves more robust polyp counting in colonoscopy, significantly reduces the fragmentation rate, and establishes a new state of the art. Abstract: Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
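
The temporal adjacency constraint described in the abstract can be pictured with a small sketch: two tracklets are merged only if they look alike and are also close in time. Everything below (the dictionary layout, the thresholds `sim_thr` and `max_gap`, and the union-find clustering) is a hypothetical illustration, not the clustering procedure from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def temporal_gap(a, b):
    """Number of frames separating two tracklets; 0 if they overlap."""
    return max(0, max(a["start"], b["start"]) - min(a["end"], b["end"]))

def count_polyps(tracklets, sim_thr=0.8, max_gap=100):
    """Count clusters of tracklets (assumed thresholds, illustrative only)."""
    n = len(tracklets)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = tracklets[i], tracklets[j]
            similar = cosine(a["emb"], b["emb"]) >= sim_thr
            adjacent = temporal_gap(a, b) <= max_gap
            if similar and adjacent:  # both conditions must hold to merge
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

tracklets = [
    {"start": 0, "end": 50, "emb": [1.0, 0.0]},
    {"start": 80, "end": 120, "emb": [0.95, 0.05]},
    {"start": 5000, "end": 5050, "emb": [1.0, 0.01]},  # looks similar, far in time
]
n_polyps = count_polyps(tracklets)
```

With the time constraint, the third tracklet stays a separate polyp even though its embedding is nearly identical to the first; removing the constraint (a very large `max_gap`) would merge all three.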

[76] MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

Hyunsoo Son,Jeonghyun Noh,Suemin Jeon,Chaoli Wang,Won-Ki Jeong

Main category: cs.CV

TL;DR: This paper proposes MC-INR, a new neural network-based framework for encoding scientific data that solves the limitations of current INR methods by using meta-learning, clustering, dynamic re-clustering, and a branched layer, resulting in better performance on complex datasets.

Details Motivation: Existing INR-based methods have three main limitations: inflexible representation of complex structures, focus on single-variable data, and dependence on structured grids. These limitations lead to degraded performance on complex real-world datasets. Method: The proposed MC-INR framework utilizes meta-learning and clustering for flexible structure encoding. It also implements a residual-based dynamic re-clustering mechanism and a branched layer to handle multivariate data effectively. Result: Experimental results show that MC-INR outperforms existing methods in scientific data encoding tasks, particularly in handling complex multivariate data on unstructured grids. Conclusion: MC-INR is a novel framework that overcomes the limitations of existing INR-based methods by combining meta-learning and clustering, introducing a dynamic re-clustering mechanism and a branched layer, leading to superior performance in encoding complex multivariate scientific data. Abstract: Implicit Neural Representations (INRs) are widely used to encode data as continuous functions, enabling the visualization of large-scale multivariate scientific simulation data with reduced memory usage. However, existing INR-based methods face three main limitations: (1) inflexible representation of complex structures, (2) primarily focusing on single-variable data, and (3) dependence on structured grids. Thus, their performance degrades when applied to complex real-world datasets. To address these limitations, we propose a novel neural network-based framework, MC-INR, which handles multivariate data on unstructured grids. It combines meta-learning and clustering to enable flexible encoding of complex structures. To further improve performance, we introduce a residual-based dynamic re-clustering mechanism that adaptively partitions clusters based on local error. We also propose a branched layer to leverage multivariate data through independent branches simultaneously. 
Experimental results demonstrate that MC-INR outperforms existing methods on scientific data encoding tasks.

[77] Automatic Labelling for Low-Light Pedestrian Detection

Dimitrios Bouzoulas,Eerik Alamikkotervo,Risto Ojala

Main category: cs.CV

TL;DR: The paper proposes an automated infrared-RGB labeling pipeline for low-light RGB pedestrian detection.

Details Motivation: There is a lack of large public datasets for RGB pedestrian detection in low-light conditions. Method: The pipeline consists of infrared detection, a label transfer process from the infrared detections to their RGB counterparts, and training object detection models using the generated labels. Result: In 6 out of 9 cases, models trained on the generated labels outperformed models trained on ground-truth labels on the mAP@50 and mAP@50-95 metrics. Conclusion: The approach can improve pedestrian safety for autonomous vehicles and advanced driver assistance systems in low-light environments. Abstract: Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection that does not appear to have large public datasets is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) Infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) Label transfer process from the infrared detections to their RGB counterparts; 3) Training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
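
The label transfer step (step 2 of the pipeline) might look like the sketch below. It assumes the infrared and RGB frames are pixel-aligned and share a resolution, so a box detected in the infrared frame can be reused directly for the RGB frame. The confidence threshold, the function name, and the YOLO-style normalized output format are assumptions made for this example, not details from the paper.

```python
def transfer_labels(ir_detections, img_w, img_h, min_score=0.5, class_id=0):
    """Turn infrared detections into label lines for the paired RGB image.

    ir_detections: iterable of (x1, y1, x2, y2, score) in pixels.
    Assumes aligned IR/RGB pairs of identical size (hypothetical setup).
    Returns lines of the form "class cx cy w h", normalized to [0, 1].
    """
    labels = []
    for x1, y1, x2, y2, score in ir_detections:
        if score < min_score:
            continue  # drop low-confidence detections to limit label noise
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        labels.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return labels

# Hypothetical detections on a 640x512 frame: one confident, one weak.
lines = transfer_labels([(100, 200, 200, 400, 0.9), (0, 0, 10, 10, 0.2)], 640, 512)
```

If the two cameras were not aligned, a geometric mapping between the views would be needed before the boxes could be reused; that case is outside this sketch.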

[78] Detecting Multiple Diseases in Multiple Crops Using Deep Learning

Vivek Yadav,Anugrah Jain

Main category: cs.CV

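TL;DR: This paper proposes a new deep learning method that efficiently detects multiple crop diseases, significantly improving detection accuracy and broadening coverage.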

Details Motivation: India, as a predominantly agrarian economy, faces substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases are critical for improving yield and ensuring food security. Method: The authors create a unified dataset and train a deep learning model on it to cover more crops and disease types. Result: The proposed deep learning model achieves 99 percent detection accuracy on the unified dataset, 7 percent higher than the state of the art, which handles 14 crops and 26 diseases. Conclusion: The paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, aiming to provide a better product for Indian farmers. Abstract: India, as a predominantly agrarian economy, faces significant challenges in agriculture, including substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases across different crops are critical for improving yield and ensuring food security. This paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, aiming to cover India's diverse agricultural landscape. We first create a unified dataset encompassing images of 17 different crops and 34 different diseases from various available repositories. The proposed deep learning model is trained on this dataset and outperforms the state of the art in terms of accuracy and the number of crops and diseases covered. We achieve a detection accuracy of 99 percent on our unified dataset, which is 7 percent higher than the state of the art, which handles only 14 crops and 26 different diseases. By increasing the number of crops and types of diseases that can be detected, the proposed solution aims to provide a better product for Indian farmers.

[79] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning

Abiam Remache González,Meriem Chagour,Timon Bijan Rüth,Raúl Trapiella Cañedo,Marina Martínez Soler,Álvaro Lorenzo Felipe,Hyun-Suk Shin,María-Jesús Zamorano Serrano,Ricardo Torres,Juan-Antonio Castillo Parra,Eduardo Reyes Abad,Miguel-Ángel Ferrer Ballester,Juan-Manuel Afonso López,Francisco-Mario Hernández Tejera,Adrian Penate-Sanchez

Main category: cs.CV

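TL;DR: This study presents IMASHRIMP, a system for automated morphological analysis of white shrimp that optimizes the genetic selection process by reducing human error and improving analysis efficiency.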

Details Motivation: The paper aims to optimize genetic selection tasks in aquaculture and to address the error-prone, time-consuming nature of traditional manual analysis by adapting existing deep learning and computer vision techniques to analyze shrimp morphology automatically. Method: IMASHRIMP combines two discrimination modules (based on a modified ResNet-50 architecture), a pose estimation module (adapted from VitPose), and a morphological regression module (using a Support Vector Machine model) to perform classification, key point prediction, and pixel-to-centimeter conversion. Result: Experiments show that IMASHRIMP significantly reduces human error, lowering the error rate from 0.97% to 0% in view classification and from 12.46% to 3.64% in rostrum detection; the pose estimation module achieves a mean average precision (mAP) of 97.94%, and the pixel-to-centimeter conversion error is 0.07 (+/- 0.1) cm. Conclusion: IMASHRIMP has the potential to improve the efficiency of genetic selection by automating and accelerating shrimp morphological analysis, contributing to more sustainable aquaculture practices. Abstract: This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A "two-factor authentication (human and AI)" system is proposed, which reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public

[80] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang,Sicheng Xu,Yue Dong,Yu Deng,Jianfeng Xiang,Zelong Lv,Guangzhong Sun,Xin Tong,Jiaolong Yang

Main category: cs.CV

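TL;DR: MoGe-2 is an advanced open-domain geometry estimation model that recovers a metric-scale 3D point map with fine-grained detail from a single image.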

Details Motivation: Existing methods such as MoGe can only predict affine-invariant point maps without metric scale, and noise and errors in real data diminish geometric detail. A method is needed that delivers both geometric accuracy and fine-grained detail recovery. Method: Building on MoGe, the authors extend it with effective strategies for metric geometry prediction and develop a unified data refinement approach that filters and completes real data. Result: MoGe-2 is trained on mixed datasets and evaluated comprehensively; the results show superior performance in accurate relative geometry, precise metric scale, and fine-grained detail recovery. Conclusion: MoGe-2 predicts metric-scale 3D point maps from a single image while maintaining relative geometry accuracy and recovering fine-grained detail, which previous methods could not achieve simultaneously. Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. We train our model on a large corpus of mixed datasets and conduct comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.
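
To make the distinction between affine-invariant and metric predictions concrete: an affine-invariant prediction is only defined up to an unknown scale and shift, so comparing it with metric ground truth first requires estimating those two numbers. The sketch below does this for 1-D depth values with a closed-form least-squares fit. It is a simplified illustration under that assumption; the function name is hypothetical and this is not MoGe's or MoGe-2's actual alignment procedure for 3D point maps.

```python
def fit_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s * p + t - g) ** 2).

    pred: affine-invariant predictions (arbitrary scale and shift).
    gt:   metric ground-truth values of the same length.
    Requires pred to contain at least two distinct values.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    s = cov / var
    t = mean_g - s * mean_p
    return s, t

# Hypothetical values: ground truth is exactly 2 * pred + 0.5.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [2.5, 4.5, 6.5, 8.5]
s, t = fit_scale_shift(pred, gt)
aligned = [s * p + t for p in pred]
```

A metric-scale model such as the one described here aims to make this per-image fit unnecessary, because its output is already expressed in real-world units.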

[81] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

Buzhen Huang,Chen Li,Chongyang Xu,Dongyue Lu,Jinnan Chen,Yangang Wang,Gim Hee Lee

Main category: cs.CV

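TL;DR: To address the recovery of close interactions from in-the-wild videos, the paper proposes a new method that exploits human appearance cues and builds a related dataset.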

Details Motivation: Because of visual ambiguities and inter-person occlusions in in-the-wild videos, existing methods cannot recover plausible close interactions, and large foundation models cannot accurately distinguish human semantics in such challenging scenarios. Method: A diffusion model is first trained to learn human proxemic behavior and pose prior knowledge. It is then combined with two optimizable tensors in a dual-branch optimization framework that reconstructs human motions and appearances, with several constraints designed to assist the optimization. Result: Experiments show that the method outperforms existing approaches on several benchmarks. The authors also build an interaction dataset with pseudo ground-truth annotations, which may support future research on pose estimation and human behavior understanding. Conclusion: The paper proposes a dual-branch optimization framework based on human appearance cues for reconstructing accurate interactive motions from in-the-wild videos. Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.

[82] Parametric shape models for vessels learned from segmentations via differentiable voxelization

Alina F. Dima,Suprosanna Shit,Huaqi Qiu,Robbie Holland,Tamara T. Mueller,Fabio Antonio Musio,Kaiyuan Yang,Bjoern Menze,Rickmer Braren,Marcus Makowski,Daniel Rueckert

Main category: cs.CV

TL;DR: This paper proposes a differentiable framework that unifies voxel, mesh, and parametric representations for accurate and flexible modeling of complex vessel structures.

Details Motivation: Current vessel modeling techniques rely on separate, disjoint representations (voxels, meshes, parametric models), limiting their integration and joint optimization. The motivation is to unify these representations to improve accuracy and flexibility in modeling complex vascular structures. Method: The method uses differentiable voxelization to extract parametric shape models via shape-to-segmentation fitting. Vessels are parametrized using cubic B-splines for centerlines and radii, ensuring smoothness. Meshes are then extracted from the learned parameters in a differentiable manner. Result: The method achieves accurate volumetric fits on complex vessel structures such as aortas, aneurysms, and brain vessels. High-fidelity meshes are generated and can be manipulated post-fit, demonstrating the effectiveness of the approach. Conclusion: The proposed framework successfully integrates voxel, mesh, and parametric representations of vessels through differentiable transformations, enabling accurate geometric modeling and high-fidelity mesh generation without explicit ground-truth shape parameters. Abstract: Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. 
Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
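
The cubic B-spline parametrization of centerlines and radii can be illustrated with a short sketch. It evaluates a uniform cubic B-spline whose control points carry a 3D position and a radius, so that both vary smoothly along the vessel. The uniform knot spacing, the `(x, y, z, r)` control-point layout, and the function names are assumptions made for this example; the paper's exact parametrization may differ.

```python
def bspline_basis(t):
    """Uniform cubic B-spline basis weights for a local parameter t in [0, 1].

    The four weights are non-negative and always sum to 1.
    """
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )

def eval_vessel(ctrl, u):
    """Evaluate centerline position and radius at u in [0, 1].

    ctrl: list of at least four (x, y, z, r) control points (assumed layout).
    Returns an (x, y, z, r) tuple on the smooth curve.
    """
    n_seg = len(ctrl) - 3                    # number of spline segments
    seg = min(int(u * n_seg), n_seg - 1)     # segment index containing u
    t = u * n_seg - seg                      # local parameter inside the segment
    w = bspline_basis(t)
    return tuple(sum(w[k] * ctrl[seg + k][d] for k in range(4)) for d in range(4))

# Hypothetical straight vessel along x with a constant radius of 2.
ctrl = [(float(i), 0.0, 0.0, 2.0) for i in range(6)]
x, y, z, r = eval_vessel(ctrl, 0.5)
```

Because every output is a smooth polynomial in the control points, gradients of a fitting loss with respect to the control points are well defined, which is what makes this kind of parametrization convenient for fitting by gradient descent.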

[83] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

Tan Pan,Zhaorui Tan,Kaiyu Guo,Dongli Xu,Weidi Xu,Chen Jiang,Xin Guo,Yuan Qi,Yuan Cheng

Main category: cs.CV

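TL;DR: This study proposes the S²DC framework, which addresses the neglect of structure variations in conventional 3D medical image self-supervised learning and significantly improves representation learning.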

Details Motivation: Existing self-supervised learning methods for 3D medical images usually partition images into fixed-size patches, ignoring variations in location, scale, and morphology and making it hard to capture meaningful distinctions. Method: The framework increases semantic discrepancy between different patches through an optimal transport strategy and improves structure-level semantic consistency based on neighborhood similarity distribution. Result: Experiments across 10 datasets, 4 tasks, and 3 modalities show that S²DC consistently outperforms current state-of-the-art methods. Conclusion: The paper proposes S²DC, a new self-supervised learning framework for 3D medical images that learns structure-aware representations and outperforms existing methods across multiple datasets and tasks. Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.
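
The abstract does not say how the optimal transport problem is solved. One widely used option is entropic regularization with Sinkhorn iterations, sketched below for a small cost matrix with uniform marginals. The solver choice, the parameter values, and the function name are assumptions made for illustration and should not be read as the $S^2DC$ implementation.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport plan between uniform marginals.

    cost: n x m matrix (list of lists) of assignment costs.
    Returns an n x m plan whose rows sum to about 1/n and columns to 1/m.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    a, b = 1.0 / n, 1.0 / m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs: patch 0 is cheap to assign to slot 0, patch 1 to slot 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

The balanced marginals force the mass to spread across all slots, so different patches cannot all collapse onto the same assignment, which is one intuitive way such a strategy can push their representations apart.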

[84] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu,Enxin Song,Wenhao Chai,Xuexiang Wen,Tian Ye,Gaoang Wang

Main category: cs.CV

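TL;DR: AuroraLong proposes an efficient approach based on a linear RNN to address the high computational and memory cost of long video understanding, achieving performance comparable to conventional Transformer-based methods.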

Details Motivation: The memory and computation required by Transformer-based large language models (LLMs) grow quadratically with input sequence length, so long video understanding faces high computational complexity and memory consumption. A more efficient solution is needed. Method: AuroraLong replaces the LLM component of a multimodal large language model (MLLM) with a linear RNN model that handles inputs of arbitrary length using constant-size hidden states. It also combines this with visual token merging, reordering the visual tokens by size in ascending order to increase throughput and efficiency. Result: Although AuroraLong has only 2B parameters and is trained exclusively on public data, it achieves performance on multiple video benchmarks comparable to Transformer-based models of similar size trained on private datasets. Conclusion: By using a linear RNN model in place of a conventional Transformer, AuroraLong significantly reduces the computational and memory complexity of long video understanding while maintaining performance comparable to Transformer-based models. The approach could lower the computational entry barrier for long video understanding and enable broader use. Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
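
The two ideas in the abstract, a recurrent state whose size does not grow with sequence length and a reordering of merged visual tokens by size, can be pictured with a toy sketch. The recurrence below is a deliberately simple linear update, and "size" is interpreted as the number of patches a merged token absorbed; both choices, along with all names, are assumptions for illustration and not the AuroraLong architecture.

```python
def linear_rnn(tokens, decay=0.9):
    """Run a toy linear recurrence h <- decay * h + x over a sequence.

    tokens: list of equal-length feature vectors.
    The state h has a fixed size no matter how many tokens are processed.
    """
    h = [0.0] * len(tokens[0])
    for x in tokens:
        h = [decay * hi + xi for hi, xi in zip(h, x)]
    return h

def reorder_by_size(merged_tokens):
    """Sort merged visual tokens by their size in ascending order.

    Each token is a dict with a "size" entry (assumed here to be the number
    of original patches merged into it) and a "feat" vector.
    """
    return sorted(merged_tokens, key=lambda tok: tok["size"])

# Hypothetical merged tokens from one frame.
tokens = [
    {"size": 3, "feat": [0.2]},
    {"size": 1, "feat": [0.5]},
    {"size": 2, "feat": [0.1]},
]
ordered = reorder_by_size(tokens)
state = linear_rnn([tok["feat"] for tok in ordered])
```

In contrast to self-attention, where every new token attends to all earlier tokens, each step here touches only the fixed-size state, so per-token cost and memory stay constant as the video gets longer.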

[85] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development

Riccardo Gallon,Fabian Schiemenz,Alessandra Menicucci,Eberhard Gill

Main category: cs.CV

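TL;DR: This paper studies potential camera sensor faults in vision-based navigation and proposes a simulation framework that generates a fault dataset to facilitate AI-based fault detection methods.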

Details Motivation: Ensuring the reliability and operational robustness of vision-based navigation (VBN) algorithms in space missions is a major challenge, and sensor faults can lead to inaccurate navigation outputs or even data processing failures. Method: The paper presents a comprehensive analysis of potential camera sensor fault cases in an interplanetary exploration mission scenario and systematically characterizes the causes and effects of these faults. Result: The study produces a dataset of fault-injected images intended to support the development and testing of AI-based fault detection algorithms. Conclusion: By introducing a simulation framework that recreates faulty conditions, the study provides a valuable tool for training and testing AI-based fault detection algorithms. Abstract: The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.
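
Fault injection into synthetic images can be sketched in a few lines. The abstract does not list the specific fault types covered, so the two shown below (randomly placed dead pixels and a stuck column) are illustrative assumptions, as are the function names and the plain list-of-lists grayscale image format.

```python
import random

def inject_dead_pixels(image, fraction=0.01, seed=0):
    """Return a copy of the image with a fraction of pixels set to 0.

    image: list of rows of grayscale values (assumed 0..255).
    The seed makes the injected fault pattern reproducible.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]          # leave the clean image untouched
    coords = [(r, c) for r in range(h) for c in range(w)]
    for r, c in rng.sample(coords, int(fraction * h * w)):
        out[r][c] = 0
    return out

def inject_stuck_column(image, col, value=255):
    """Return a copy of the image with one column stuck at a fixed value."""
    out = [row[:] for row in image]
    for row in out:
        row[col] = value
    return out

# Hypothetical 10x10 uniform gray frame.
clean = [[128] * 10 for _ in range(10)]
faulty = inject_stuck_column(inject_dead_pixels(clean, fraction=0.1), col=3)
```

Keeping the clean image alongside each faulty copy gives paired data with known fault locations, which is the kind of controlled, labeled reproduction of faults that a detection algorithm needs for training and testing.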

[86] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou,Yunpeng Luo,Yuanchen Wu,Ke Sun,Jiayi Ji,Ke Yan,Shouhong Ding,Xiaoshuai Sun,Yunsheng Wu,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces AIGI-Holmes, an improved AI-generated image detection framework that provides human-verifiable explanations and better generalization, addressing limitations in current detection techniques.

Details Motivation: The motivation is to overcome two key challenges in existing AI-generated image (AIGI) detection techniques: the lack of human-verifiable explanations and poor generalization on the latest generation technology. Method: The authors introduced a comprehensive dataset, Holmes-Set, with two subsets: Holmes-SFTSet for instruction-tuning with explanations and Holmes-DPOSet for human-aligned preferences. They also proposed an efficient data annotation method called Multi-Expert Jury and a three-stage training framework named Holmes Pipeline, which adapts multimodal large language models (MLLMs) for AIGI detection. Result: Extensive experiments demonstrated the effectiveness of the proposed AIGI-Holmes model across three benchmarks, showing improved performance and generalization capabilities in detecting AI-generated images with human-verifiable explanations. Conclusion: The study concludes that the proposed Holmes Pipeline and AIGI-Holmes model effectively address the issues of existing AIGI detection techniques by providing human-verifiable explanations and better generalization capabilities. Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. 
Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

[87] Learning few-step posterior samplers by unfolding and distillation of diffusion models

Charlesquin Kemajou Mbakam,Jonathan Spence,Marcelo Pereyra

Main category: cs.CV

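TL;DR: This paper proposes a new framework that combines deep unfolding and model distillation to convert a diffusion model into a conditional model suited to posterior sampling, improving accuracy and computational efficiency while retaining flexibility.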

Details Motivation: Diffusion models (DMs) are used as powerful image priors in Bayesian computational imaging, but existing methods are either highly flexible yet reliant on approximations, or more accurate and faster at inference yet dependent on task-specific supervised training. This paper aims to bridge that gap. Method: Introduces a new framework that combines deep unfolding and model distillation, unfolding a Markov chain Monte Carlo (MCMC) algorithm (specifically the LATINO Langevin sampler) and applying it to posterior sampling. Result: Extensive experiments and comparisons with state-of-the-art methods show that the proposed unfolded and distilled samplers achieve excellent accuracy and computational efficiency. Conclusion: The framework performs well in accuracy and computational efficiency while retaining the flexibility to adapt to changes in the forward model. Abstract: Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
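To make the "Langevin sampler for posterior sampling" idea concrete, here is a minimal sketch of a plain unadjusted Langevin algorithm (ULA) on a one-dimensional toy problem. It is not the LATINO sampler and involves no unfolding or distillation; the model (`y = x + noise` with a Gaussian prior), the function name, and all parameter values are assumptions chosen so the exact posterior mean is known in closed form.

```python
import math
import random


def ula_posterior_samples(y, sigma2, tau2, step, n_steps, seed=0):
    """Unadjusted Langevin algorithm for the toy model y = x + noise.

    Prior: x ~ N(0, tau2). Likelihood: y | x ~ N(x, sigma2).
    (Hypothetical toy setup, not the sampler from the paper.)
    """
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        # Gradient of the log-posterior: likelihood term plus prior term.
        grad = (y - x) / sigma2 - x / tau2
        # Langevin update: gradient drift plus Gaussian noise of variance 2 * step.
        x = x + step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


samples = ula_posterior_samples(y=2.0, sigma2=1.0, tau2=1.0, step=0.05, n_steps=200_000)
kept = samples[1000:]  # discard a short burn-in
estimate = sum(kept) / len(kept)
exact = 2.0 * 1.0 / (1.0 + 1.0)  # closed-form posterior mean: y * tau2 / (tau2 + sigma2)
```

The sample mean `estimate` should land close to `exact`. Many iterations are needed even for this trivial problem, which is the cost that few-step samplers such as the one described above aim to reduce.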

[88] APT: Adaptive Personalized Training for Diffusion Models with Limited Data

JungWoo Chae,Jiyoon Kim,JaeWoong Choi,Kyungyul Kim,Sangheum Hwang

Main category: cs.CV

TL;DR: This paper proposes APT, a method to personalize diffusion models with limited data while avoiding overfitting and preserving semantic coherence.

Details Motivation: Personalizing diffusion models using limited data presents challenges such as overfitting, loss of prior knowledge, and degradation of text alignment. Method: The paper proposes Adaptive Personalized Training (APT), which uses adaptive training adjustment, representation stabilization, and attention alignment for prior knowledge preservation. Result: Through extensive experiments, the authors demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and generates high-quality, diverse images even with limited reference data. Conclusion: APT successfully mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data. Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. 
Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
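The abstract describes Representation Stabilization only as regularizing the mean and variance of intermediate feature maps. One way such a penalty could look is sketched below. The squared-difference form, the per-channel averaging, and the function name `stabilization_penalty` are assumptions for illustration, not the authors' loss.

```python
from statistics import fmean, pvariance


def stabilization_penalty(finetuned, pretrained):
    """Penalize drift in per-channel mean and variance (hypothetical form).

    Each argument is a list of channels; each channel is a flat list of
    activations taken from the same layer of the two models.
    """
    penalty = 0.0
    for ch_ft, ch_pt in zip(finetuned, pretrained):
        # Squared shift of the channel mean.
        penalty += (fmean(ch_ft) - fmean(ch_pt)) ** 2
        # Squared shift of the channel variance.
        penalty += (pvariance(ch_ft) - pvariance(ch_pt)) ** 2
    return penalty / len(pretrained)


pretrained_feats = [[0.0, 2.0], [1.0, 3.0]]
shifted_feats = [[1.0, 3.0], [1.0, 3.0]]  # channel 0 mean moved by 1, variances unchanged
no_drift = stabilization_penalty(pretrained_feats, pretrained_feats)
drift = stabilization_penalty(shifted_feats, pretrained_feats)
```

Here `no_drift` is zero and `drift` is positive, so adding a term of this kind to the fine-tuning loss would discourage the feature statistics from moving far from those of the pretrained model.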

[89] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

Xiangyang Luo,Ye Zhu,Yunfei Liu,Lijian Lin,Cong Wan,Zijian Cai,Shao-Lun Huang,Yu Li

Main category: cs.CV

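TL;DR: CanonSwap is a new video face-swapping method that decouples motion and appearance information to achieve high-quality identity transfer while preserving dynamic attributes.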

Details Motivation: Existing video face-swapping methods fall short in preserving the dynamic attributes of the target face, leading to inconsistent results. Method: Proposes CanonSwap, a new video face-swapping framework that decouples motion information from appearance information, and designs a Partial Identity Modulation module for precise identity transfer. Result: CanonSwap significantly outperforms existing methods in visual quality, temporal consistency, and identity preservation; the work also introduces fine-grained synchronization metrics for evaluation. Conclusion: CanonSwap achieves high-quality video face swapping, significantly outperforming existing methods in visual quality, temporal consistency, and identity preservation. Abstract: Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.

[90] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

Qi Xu,Dongxu Wei,Lingzhe Zhao,Wenpu Li,Zhangchi Huang,Shunping Ji,Peidong Liu

Main category: cs.CV

TL;DR: This paper introduces SIU3R, an alignment-free framework for simultaneous understanding and 3D reconstruction from unposed images, achieving superior performance by avoiding traditional 2D-to-3D alignment paradigms.

Details Motivation: The motivation is to overcome the limitations of existing 2D-to-3D feature alignment paradigms, which result in limited 3D understanding capability and semantic information loss. Method: SIU3R bridges reconstruction and understanding tasks through pixel-aligned 3D representation and unifies multiple understanding tasks into learnable queries, with additional lightweight modules to enhance task interaction. Result: Extensive experiments show that SIU3R performs exceptionally well on 3D reconstruction and understanding tasks individually, as well as simultaneously, demonstrating its superiority over current methods. Conclusion: The paper concludes that the proposed alignment-free framework, SIU3R, achieves state-of-the-art performance on both individual and simultaneous tasks of 3D reconstruction and understanding. Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need for alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

[91] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

Qin Guo,Ailing Zeng,Dongxu Yue,Ceyuan Yang,Yang Cao,Hanzhong Guo,Fei Shen,Wei Liu,Xihui Liu,Dan Xu

Main category: cs.CV

TL;DR: This paper introduces UniMC, a novel DiT-based framework, and HAIG-2.9M, a comprehensive dataset, to improve keypoint-guided image generation for multi-class objects like humans and animals, particularly under complex conditions.

Details Motivation: The motivation is to overcome limitations in existing keypoint-guided Text-to-Image diffusion models, particularly their inability to handle general non-rigid objects like animals and overlapping multi-instance generation effectively. Method: The paper proposes a DiT-based framework named UniMC, which integrates instance- and keypoint-level conditions into compact tokens, including attributes like class, bounding box, and keypoint coordinates. Additionally, the authors introduce the HAIG-2.9M dataset, a large-scale, high-quality dataset with extensive annotations for humans and animals. Result: The experiments show the high quality of the HAIG-2.9M dataset and the effectiveness of the UniMC framework, especially in challenging scenarios involving heavy occlusions and multiple object classes. Conclusion: The paper concludes that the proposed UniMC framework and HAIG-2.9M dataset effectively address the challenges in keypoint-guided image generation for multi-class instances, particularly showing promising results in complex cases like heavy occlusions and multi-class scenarios. Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. 
This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

[92] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Yuxuan Wang,Tianwei Cao,Huayu Zhang,Zhongjiang He,Kongming Liang,Zhanyu Ma

Main category: cs.CV

TL;DR: This paper proposes FairHuman, a multi-objective fine-tuning method that effectively improves both global and local quality in human image generation, especially for challenging regions like faces and hands.

Details Motivation: Generating human images with realistic local details like faces and hands is difficult due to insufficient local supervision during training, which needs to be addressed for better performance. Method: FairHuman constructs three learning objectives: a global objective from the default diffusion objective and two local objectives for hands and faces using pre-annotated positional priors. It uses the Minimum Potential Delay criterion to derive an optimal parameter updating strategy for fair multi-objective optimization. Result: Extensive experiments demonstrate that the method significantly improves the generation of challenging local details and achieves superior performance in various scenarios. Conclusion: The proposed FairHuman approach enhances the generation of challenging local details in human images while maintaining overall quality, showing effectiveness across different scenarios. Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. 
Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.

[93] Prompt learning with bounding box constraints for medical image segmentation

Mélanie Gaillochet,Mehrdad Noori,Sahar Dastani,Christian Desrosiers,Hervé Lombaert

Main category: cs.CV

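TL;DR: This paper introduces a new weakly supervised learning framework that uses bounding box annotations to automatically generate prompts for foundation models, reducing user intervention and achieving strong performance on medical image segmentation.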

Details Motivation: Pixel-level annotations are laborious and costly in the medical domain, and weakly supervised methods based on bounding box annotations offer a practical alternative. However, existing prompt learning methods depend on fully annotated segmentation masks, which limits their applicability. This paper therefore aims to develop a method that does not require fully annotated data. Method: The method automates segmentation with foundation models that accept prompts (such as points or bounding boxes), using an optimization scheme that combines multiple constraints derived from bounding box annotations with pseudo-labels generated by the prompted foundation model. Result: Experiments show that, in a limited data setting, the weakly supervised method achieves an average Dice score of 84.90%, outperforming existing fully supervised and weakly supervised methods. Conclusion: The paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation, automating prompt generation for foundation models using only bounding box annotations and thereby reducing user intervention. Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
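For readers unfamiliar with the metric, the sketch below computes the Dice score on small binary masks, together with one simple quantity that a bounding box annotation makes available: the fraction of predicted foreground falling outside the box. The abstract does not spell out which box-derived constraints the authors use, so `fraction_outside_box` is an illustrative assumption, as are both function names.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


def fraction_outside_box(pred, box):
    """Share of predicted foreground outside box = (row_min, col_min, row_max, col_max), inclusive."""
    r0, c0, r1, c1 = box
    foreground = sum(map(sum, pred))
    outside = sum(
        v
        for r, row in enumerate(pred)
        for c, v in enumerate(row)
        if not (r0 <= r <= r1 and c0 <= c <= c1)
    )
    return 0.0 if foreground == 0 else outside / foreground


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 1, 0], [0, 0, 0]]
score = dice_score(pred, target)                  # 2 * 2 / (3 + 2)
leak = fraction_outside_box(pred, (0, 0, 0, 1))   # one of three foreground pixels lies outside
```

A weakly supervised objective can drive a term like `leak` toward zero without ever seeing a pixel-level mask, which is the appeal of box annotations.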

[94] DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He,Danshi Li,Xinqiang Yu,Zekun Qi,Wenyao Zhang,Jiayi Chen,Zhaoxiang Zhang,Zhizheng Zhang,Li Yi,He Wang

Main category: cs.CV

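TL;DR: This paper proposes DexVLG, a vision-language-grasp model trained on the large-scale DexGraspNet 3.0 dataset to predict dexterous grasp poses aligned with language instructions, showing strong zero-shot generalization and part-grasp accuracy.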

Details Motivation: Existing research focuses mainly on controlling simple grippers, and large-scale research on human-like dexterous grasping is lacking. Method: Generates DexGraspNet 3.0, a dataset containing 170 million dexterous grasp poses, and uses it to train a vision-language model with a flow-matching-based pose head. Result: DexVLG achieves a zero-shot execution success rate above 76% in simulation and successfully performs part-aligned grasps on physical objects in real-world scenarios. Conclusion: DexVLG demonstrates the potential of large models for human-like dexterous grasping in complex tasks, achieving zero-shot generalization and state-of-the-art part-grasp accuracy. Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities, with over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.

[95] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics

Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen

Main category: cs.CV

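TL;DR: This paper proposes MANO, a new attention mechanism that treats attention as an interaction problem between grid points, achieving linear time and memory complexity while maintaining a global receptive field.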

Details Motivation: The quadratic complexity of standard Transformers makes them impractical for high-resolution inputs, and existing solutions typically reduce resolution at the cost of losing the finest-scale details. The authors therefore look for a different approach. Method: Inspired by n-body numerical simulation techniques, the authors introduce a distance-based multiscale attention mechanism, the Multipole Attention Neural Operator (MANO), which computes attention with linear complexity. Result: Experiments show that MANO rivals state-of-the-art models such as ViT and Swin Transformer on image classification and Darcy flow tasks while substantially reducing runtime and peak memory usage. Conclusion: MANO is an effective alternative to existing Transformer architectures, particularly for tasks with high-resolution inputs, maintaining strong performance while reducing computational resource consumption. Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.

[96] Partial Weakly-Supervised Oriented Object Detection

Mingxin Liu,Peiyuan Zhang,Yuan Liu,Wei Zhang,Yue Zhou,Ning Liao,Ziyang Gong,Junwei Luo,Zhirui Wang,Yi Yu,Xue Yang

Main category: cs.CV

TL;DR: This paper proposes an efficient Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations, which can efficiently leverage large amounts of unlabeled data and performs comparably to, or even surpasses, traditional semi-supervised algorithms while offering a lower cost solution.

Details Motivation: To address the high cost of dataset annotation for oriented object detection (OOD), especially with fully supervised methods using complete oriented bounding box annotations. Method: The authors propose the PWOOD framework, which includes the Orientation-and-Scale-aware Student (OS-Student) model and the Class-Agnostic Pseudo-Label Filtering strategy (CPF). Result: Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that the PWOOD framework is efficient, offers a lower cost solution, and significantly outperforms weakly supervised algorithms trained with partially weak annotations. Conclusion: The PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms in oriented object detection. Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. 
To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which efficiently leverages large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.

[97] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images

Danrong Zhang,Huili Huang,N. Simrill Smith,Nimisha Roy,J. David Frost

Main category: cs.CV

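TL;DR: This study proposes a new method that uses semantic segmentation and depth estimation to quantify damage severity in post-earthquake social media images, improving the accuracy and practicality of post-disaster assessment.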

Details Motivation: Traditional methods rely on subjective classification techniques and struggle to account for varying extents of damage within an image, so a more objective method is needed to assess damage severity in post-earthquake social media images. Method: Constructs a segmented damage severity dataset and fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. A new damage severity scoring system is also introduced and adjusted using depth estimation. Result: Develops and applies a semantic-segmentation-based damage assessment method that quantifies damage severity in social media images and provides a more nuanced understanding of disaster-affected areas. Conclusion: By reframing damage severity assessment in post-earthquake social media images as a semantic segmentation problem, the study provides a more objective and comprehensive analysis method and proposes a new damage scoring system, helping disaster reconnaissance teams respond more efficiently. Abstract: In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.
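The abstract does not give the scoring formula, so the following is only a sketch of how a segmentation map and a depth map could be combined into a single score. The class weights, the choice to weight each pixel by squared depth (on the assumption that a distant pixel covers more real-world area), and the name `damage_score` are all hypothetical.

```python
# Hypothetical severity weights for the three classes named in the abstract.
SEVERITY_WEIGHT = {0: 0.0, 1: 0.5, 2: 1.0}  # 0 = undamaged, 1 = damaged, 2 = debris


def damage_score(labels, depth):
    """Depth-weighted mean severity over structure pixels, in [0, 1].

    labels: 2D list of class ids; ids not in SEVERITY_WEIGHT count as background.
    depth:  2D list of estimated depths, same shape as labels.
    """
    weighted = 0.0
    total_area = 0.0
    for label_row, depth_row in zip(labels, depth):
        for label, d in zip(label_row, depth_row):
            if label not in SEVERITY_WEIGHT:
                continue  # skip background pixels
            area = d * d  # assumed: real-world footprint grows with depth squared
            weighted += SEVERITY_WEIGHT[label] * area
            total_area += area
    return 0.0 if total_area == 0.0 else weighted / total_area


flat = damage_score([[0, 2]], [[1.0, 1.0]])      # equal depth: plain average
far_debris = damage_score([[0, 2]], [[1.0, 2.0]])  # debris pixel is farther away
```

With equal depths the score is a plain average of the class weights; when the debris pixel is farther away it counts for more, which is the kind of adjustment depth estimation makes possible.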

[98] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Liheng Zhang,Lexi Pang,Hang Ye,Xiaoxuan Ma,Yizhou Wang

Main category: cs.CV

TL;DR: This paper introduces a training-free framework for text-to-image diffusion models that improves structural alignment and visual quality by decoupling feature injection from the denoising process.

Details Motivation: Existing feature injection methods suffer from structural misalignment, condition leakage, and visual artifacts when condition images diverge from natural RGB distributions. Method: A decoupled feature injection framework that separates the injection timestep from the denoising process, along with appearance-rich prompting and a restart refinement strategy. Result: State-of-the-art performance across diverse zero-shot conditioning scenarios with more faithful structural and appearance-rich generation. Conclusion: The proposed framework allows for training-free, high-quality text-to-image generation with improved structural and appearance control. Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. 
Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.

[99] No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley

Main category: cs.CV

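TL;DR: This paper presents a training-free method for reference-based instance segmentation that uses semantic priors from foundation models to match regions between a few reference images and a target image, reaching state-of-the-art results on COCO FSOD and PASCAL VOC Few-Shot.

Details Motivation: The Segment Anything Model (SAM) still requires manual visual prompts or complex domain-dependent prompt-generation rules to process a new image; this work aims to segment objects given only a small set of reference images. Method: A multi-stage, training-free method comprising (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching, built on semantic priors learned by foundation models. Result: State-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and better results than existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP). Conclusion: Correspondences between reference and target images enable automatic generation of instance-level segmentation masks without any training. Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).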
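The three stages named in the abstract (memory bank construction, representation aggregation, feature matching) can be illustrated with a toy example. The sketch below stands in for the real pipeline: the feature vectors are made-up two-dimensional lists rather than foundation-model embeddings, aggregation is a plain mean, and matching is nearest prototype by cosine similarity. All of these choices and all function names are assumptions.

```python
import math


def mean_vector(vectors):
    """Aggregate several reference features into one prototype (simple mean)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prototypes(memory_bank):
    """memory_bank maps a class label to the features of its reference regions."""
    return {label: mean_vector(feats) for label, feats in memory_bank.items()}


def match(target_feature, prototypes):
    """Return the label whose prototype is most similar to the target feature."""
    return max(prototypes, key=lambda label: cosine(target_feature, prototypes[label]))


# Toy memory bank built from a handful of reference features per class.
bank = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
prototypes = build_prototypes(bank)
first = match([0.9, 0.1], prototypes)
second = match([0.2, 0.7], prototypes)
```

No parameters are updated anywhere in this flow: adding a new class only means adding entries to `bank`, which is what makes a reference-based approach training-free.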

[100] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

Gent Serifi,Marcel C. Bühler

Main category: cs.CV

TL;DR: This paper introduces HyperGaussians, an improved method for creating detailed animatable face avatars from monocular videos, which surpasses current techniques in quality and efficiency.

Details Motivation: Creating high-quality animatable face avatars from monocular videos remains challenging due to limitations in handling nonlinear deformations, lighting effects, and fine details with existing methods like 3D Gaussian Splatting. Method: The authors propose HyperGaussians using high-dimensional multivariate Gaussians with a reparameterized covariance matrix for efficient splatting. They integrate this method into FlashAvatar and evaluate it on 19 subjects from 4 datasets. Result: HyperGaussians outperformed 3DGS numerically and visually, particularly in capturing high-frequency details such as eyeglass frames, teeth, facial movements, and specular reflections. Conclusion: HyperGaussians, an extension of 3D Gaussian Splatting, improve the representation of animatable face avatars by increasing expressivity and efficiency. Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. 
However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
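The abstract does not spell out the reparameterization, so the following is only a sketch of a standard Gaussian identity that is consistent with the description, not the authors' implementation. Conditioning a joint Gaussian on a latent variable in covariance form needs the inverse of the latent block, whereas in inverse-covariance (precision) form it only needs the inverse of the block for the conditioned variable. Both blocks are scalars here for readability; the function names and the 2x2 setup are assumptions made for illustration.

```python
def invert_2x2(m):
    """Closed-form inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def condition_via_covariance(mu, cov, z):
    """p(p | z) from the covariance: requires inverting the latent block s_zz."""
    mu_p, mu_z = mu
    (s_pp, s_pz), (_, s_zz) = cov
    mean = mu_p + s_pz / s_zz * (z - mu_z)
    var = s_pp - s_pz * s_pz / s_zz
    return mean, var


def condition_via_precision(mu, prec, z):
    """p(p | z) from the precision: only the small block l_pp is inverted."""
    mu_p, mu_z = mu
    (l_pp, l_pz), _ = prec
    mean = mu_p - l_pz / l_pp * (z - mu_z)
    var = 1.0 / l_pp
    return mean, var


mu = (0.5, -1.0)                  # (attribute mean, latent mean), made-up values
cov = ((2.0, 0.6), (0.6, 1.0))    # joint covariance, made-up values
prec = invert_2x2(cov)            # a model could store this directly instead
z = 0.3                           # observed latent value

mean_c, var_c = condition_via_covariance(mu, cov, z)
mean_p, var_p = condition_via_precision(mu, prec, z)
```

In a high-dimensional setting the latent block is large and the conditioned block is small, which is why parameterizing the precision directly can avoid the expensive inversion.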

[101] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu,Hao Li,Jiawei Chi,Hanyang Wang,Minghui Yang,Fudong Wang,Yueqi Duan

Main category: cs.CV

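TL;DR: This paper proposes LangScene-X, which generates unified, consistent 3D multi-modality scene information from sparse views and achieves better quality and generalizability.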

Details Motivation: Existing methods built on the dense-view reconstruction paradigm suffer from rendering artifacts and implausible semantic synthesis when only limited views are available, which calls for a new approach. Method: The authors propose LangScene-X, a generative framework comprising a TriMap video diffusion model and a Language Quantized Compressor (LQC), and reconstruct language surface fields by aligning language information onto the surfaces of 3D scenes. Result: Experiments on real-world data show that LangScene-X outperforms state-of-the-art methods in quality and generalizability. Conclusion: LangScene-X offers a new way to build generalizable 3D language-embedded scenes from sparse views, and experiments show that it surpasses existing techniques in quality and generalizability. Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.

[102] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

Panpan Ji,Junni Song,Hang Xiao,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper proposes a Dynamic Contrastive Dual-Path Network framework for multimodal human activity recognition.

Details Motivation: Multimodal HAR systems struggle with cross-modal feature alignment and imbalanced modality contributions. Method: The authors design the DCDP-HAR framework, which combines a dual-path feature extraction architecture, a multi-stage contrastive learning mechanism, and a confidence-driven gradient modulation strategy. Result: Ablation studies validate the effectiveness of each component, and extensive comparative experiments are conducted on four public benchmark datasets. Conclusion: The proposed DCDP-HAR framework addresses these challenges of multimodal HAR systems, with good training stability and performance. Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
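The abstract gives no formula for the confidence-driven gradient modulation, so the sketch below is purely illustrative. It assumes each modality branch reports a confidence (for example, the mean softmax probability of the true class) and that branches which are more confident than average have their gradients scaled down so weaker branches can catch up. The tanh rule, the `alpha` parameter, and the modality names are hypothetical choices, not the paper's.

```python
import math


def modulation_coefficients(confidences, alpha=1.0):
    """Return a gradient scale in (0, 1] for each modality branch.

    confidences: dict mapping modality name -> confidence in [0, 1].
    Branches above the mean confidence are damped; the rest keep scale 1.0.
    """
    mean_conf = sum(confidences.values()) / len(confidences)
    coeffs = {}
    for name, conf in confidences.items():
        ratio = conf / (mean_conf + 1e-8)      # > 1 means a dominant branch
        excess = max(0.0, ratio - 1.0)         # only dominance is penalised
        coeffs[name] = 1.0 - math.tanh(alpha * excess)
    return coeffs


# Hypothetical batch: the accelerometer branch dominates the gyroscope branch.
coeffs = modulation_coefficients({"accel": 0.9, "gyro": 0.3})
balanced = modulation_coefficients({"accel": 0.5, "gyro": 0.5})
```

In a training loop, each branch's gradients would be multiplied by its coefficient before the optimizer step; with balanced confidences every coefficient stays at 1.0 and training is unchanged.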

[103] USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

Ying Yu,Hang Xiao,Siyao Li,Jiarui Li,Haotian Tang,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper introduces USAD, a novel human activity recognition method combining unsupervised data augmentation, multi-branch spatio-temporal attention networks, and adaptive loss fusion, achieving state-of-the-art performance on three HAR datasets.

Details Motivation: HAR faces key challenges such as limited labeled samples for rare activities, inadequate high-level feature extraction, and poor model performance on lightweight devices. These issues necessitate the development of a more robust and efficient solution. Method: USAD employs an unsupervised, statistics-guided diffusion model for data augmentation, a multi-branch spatio-temporal interaction network with parallel residual branches of varying convolutional kernels, temporal and spatial attention mechanisms, cross-branch feature fusion, and adaptive multi-loss function fusion. Result: On WISDM, PAMAP2, and OPPORTUNITY datasets, USAD achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing methods. Practical deployment on embedded devices also confirms its efficiency and feasibility. Conclusion: The paper proposes USAD, a comprehensive optimization approach for HAR that addresses challenges like labeled data scarcity and suboptimal model performance. It achieves superior accuracy on three public datasets and proves efficient for embedded device deployment. Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. 
Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3×3, 5×5, and 7×7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization. Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.

[104] AnyI2V: Animating Any Conditional Image with Motion Control

Ziye Li,Hao Luo,Xincheng Shuai,Henghui Ding

Main category: cs.CV

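TL;DR: This paper proposes AnyI2V, a training-free video generation framework that animates any conditional image along user-defined motion trajectories, enabling more flexible and versatile video generation.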

Details Motivation: Existing T2V methods rely on text prompts and lack precise control over the spatial layout of generated content, while I2V methods depend on real images, which restricts the editability of the synthesized content. Some methods add image-based conditioning through ControlNet, but they usually lack explicit motion control and require computationally expensive training. Method: The authors propose AnyI2V, a training-free framework that animates any conditional image along user-defined motion trajectories. It accepts mixed conditional inputs and supports style transfer and editing via LoRA and text prompts. Result: Experiments show that AnyI2V achieves superior performance in spatial- and motion-controlled video generation and offers a new perspective on the problem. Conclusion: AnyI2V is a training-free video generation framework that animates any conditional image along user-defined motion trajectories, enabling more flexible and versatile video generation with support for mixed conditional inputs, style transfer, and editing. Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.

[105] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia,Bingkui Tong,Yuhang Zang,Rui Shao,Kaiyang Zhou

Main category: cs.CV

TL;DR: This paper explores how Grounded Chain-of-Thought (GCoT) can improve multimodal large language models on specialized vision tasks with limited data.

Details Motivation: Multimodal Large Language Models (MLLMs) are strong at interpreting images using natural language, but without large-scale retraining they are hard to adapt to specialized vision tasks such as chart understanding. The cause is that pre-training datasets concentrate on scenes and objects and pay little attention to non-object images such as charts and tables. Method: The paper trains an MLLM on Chain-of-Thought (CoT) reasoning data and, to address the factual errors found in that data, proposes GCoT, which injects bounding-box information from the image into the CoT data so that the reasoning steps are more faithful to the input image. Result: Under data-limited regimes, GCoT significantly outperforms fine-tuning and distillation on five specialized vision tasks covering visual formats such as charts, tables, receipts, and reports. Conclusion: The paper proposes Grounded Chain-of-Thought (GCoT), which injects grounding information such as bounding boxes into the reasoning steps to help MLLMs adapt to specialized vision tasks when data is limited. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

[106] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

Xin Zhou,Dingkang Liang,Kaijin Chen,Tianrui Feng,Xiwu Chen,Hongkai Lin,Yikang Ding,Feiyang Tan,Hengshuang Zhao,Xiang Bai

Main category: cs.CV

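TL;DR: EasyCache is a training-free acceleration framework for video diffusion models that uses a runtime-adaptive caching mechanism to speed up inference substantially while maintaining high visual fidelity, and it applies to a range of large video generation models.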

Details Motivation: Existing video generation models are slow and computationally expensive at inference because of the iterative nature of the denoising process, which hinders their wider adoption. EasyCache targets this bottleneck to make advanced video synthesis more accessible and usable in real-world applications. Method: EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors to avoid redundant computation during inference. Result: In comprehensive studies on several large video generation models (OpenSora, Wan2.1, and HunyuanVideo), EasyCache reduces inference time by 2.1-3.3× relative to the original baselines while maintaining high visual fidelity, with up to 36% higher PSNR than the previous state-of-the-art method. Conclusion: EasyCache is an efficient acceleration framework for video generation that applies to a variety of large-scale video generation models. Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity with a significant PSNR improvement of up to 36% compared to the previous SOTA method. This improvement makes our EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
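As a rough illustration of runtime-adaptive caching, the sketch below skips a model call whenever the input has drifted only slightly since the last full computation and instead re-applies the cached transformation vector (output minus input). The scalar state, the relative-change criterion, and the names `denoise_with_cache` and `threshold` are assumptions made for this example; the abstract does not specify EasyCache's actual reuse criterion.

```python
def denoise_with_cache(model, x, num_steps, threshold):
    """Iterate x <- model(x, step), reusing a cached update when inputs are stable.

    Returns the final state and the number of full model evaluations.
    """
    cached_delta = None   # last computed transformation vector: model(x) - x
    drift = 0.0           # accumulated relative input change since last full call
    prev_x = x
    full_calls = 0
    for step in range(num_steps):
        if cached_delta is not None:
            drift += abs(x - prev_x) / (abs(prev_x) + 1e-8)
        prev_x = x
        if cached_delta is None or drift >= threshold:
            out = model(x, step)          # full (expensive) evaluation
            cached_delta = out - x
            drift = 0.0
            full_calls += 1
        else:
            out = x + cached_delta        # cheap reuse of the cached update
        x = out
    return x, full_calls


def toy_model(x, step):
    """Stand-in for one denoising step: shrink the state by 10%."""
    return 0.9 * x


exact, calls_exact = denoise_with_cache(toy_model, 1.0, 10, threshold=0.0)
_, calls_always_reuse = denoise_with_cache(toy_model, 1.0, 10, threshold=1e9)
_, calls_adaptive = denoise_with_cache(toy_model, 1.0, 10, threshold=0.25)
```

A threshold of zero reproduces the uncached loop exactly, while larger thresholds trade accuracy for fewer model evaluations; the decision is made at runtime from observed changes, with no offline profiling.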

[107] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang,Xiaoyang Wu,Fangcheng Zhong,Hengshuang Zhao,Matthias Nießner,Joan Lasenby

Main category: cs.CV

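TL;DR: LiteReality is an efficient system for generating 3D virtual scenes that combines scene understanding, object retrieval, material optimization, and physical interaction to produce high-quality virtual environments for a range of applications.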

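Details Motivation: Graphics pipelines require visual realism, object individuality, high-fidelity material rendering, and physical interaction. Meeting these needs while improving scene compatibility and interactivity makes reconstructions usable in applications such as AR/VR, gaming, robot navigation, and digital twins. Method: LiteReality proceeds in four main steps: 1) scene understanding and layout parsing with a structured scene graph; 2) retrieval of the most similar 3D models from a curated asset database; 3) recovery of high-quality, spatially varying materials with a Material Painting module; 4) integration of the reconstructed scene into a simulation engine with basic physical properties to enable interactive behavior. The method also introduces a training-free object retrieval module and a robust material painting module. Result: LiteReality achieves state-of-the-art object retrieval performance on the Scan2CAD benchmark and is shown to be effective on real-life scans and public datasets. Even under severe misalignment, occlusion, or poor lighting, the material painting module transfers appearances from images of any style onto 3D assets. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines. Conclusion: LiteReality is a novel, lightweight pipeline for generating 3D virtual replicas that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual scenes suitable for AR/VR, gaming, robotics, and digital twins. Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins.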
In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c

[108] RefTok: Reference-Based Tokenization for Video Generation

Xiang Fan,Xiaohang Sun,Kushan Thakkar,Zhu Liu,Vimal Bhat,Ranjay Krishna,Xiang Hao

Main category: cs.CV

TL;DR: RefTok is introduced as a new reference-based tokenization method for video modeling, which successfully captures complex temporal dynamics and context while significantly outperforming existing methods in performance and efficiency.

Details Motivation: Handling temporal redundancy in video models remains challenging as prevailing approaches often treat each set of frames independently, failing to capture inherent temporal dependencies and redundancies. Method: RefTok encodes and decodes sets of frames conditioned on an unquantized reference frame to preserve motion continuity and object appearance across frames. Result: RefTok outperforms existing tokenizers like Cosmos and MAGVIT by improving metrics such as PSNR, SSIM, and LPIPS by 36.7% at similar or higher compression ratios. Additionally, when used for training a video generation model on the BAIR Robot Pushing task, it outperforms even larger models like MAGVIT-L by 27.9% across all generation metrics. Conclusion: RefTok is a novel reference-based tokenization method that effectively captures temporal dynamics and contextual information in videos, significantly outperforming current state-of-the-art tokenizers. Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. 
Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.

[109] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

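TL;DR: Point3R achieves efficient dense streaming 3D reconstruction through an explicit spatial pointer memory and an improved information interaction mechanism.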

Details Motivation: The implicit memory used by existing methods has limited capacity and loses information, particularly from earlier frames. Method: The authors introduce an explicit spatial pointer memory, design a 3D hierarchical position embedding to promote information interaction, and use a simple yet effective fusion mechanism to keep the memory uniform and efficient. Result: The method achieves competitive or state-of-the-art performance while keeping training costs low. Conclusion: Point3R is an online framework for dense streaming 3D reconstruction that uses an explicit spatial pointer memory to reconstruct 3D scenes efficiently and in a unified way, performing well across tasks at low training cost. Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.
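To make the idea of an explicit spatial pointer memory concrete, here is a minimal sketch under stated assumptions: each pointer stores a 3D position and a feature vector, and a new observation is either merged into the nearest pointer within a radius or added as a new pointer. The running-average update, the `radius` parameter, and the dictionary layout are hypothetical; the abstract does not describe the actual fusion rule.

```python
import math


def fuse_into_memory(memory, new_points, radius):
    """Merge observations into a list of pointers, keeping them spatially spread.

    memory: list of dicts {"pos": (x, y, z), "feat": [floats], "count": int}.
    new_points: iterable of (position, feature) pairs from the latest frame.
    """
    for pos, feat in new_points:
        nearest, best = None, radius
        for ptr in memory:
            d = math.dist(ptr["pos"], pos)
            if d <= best:
                nearest, best = ptr, d
        if nearest is None:
            # No pointer nearby: allocate a new one at this position.
            memory.append({"pos": pos, "feat": list(feat), "count": 1})
        else:
            # Fold the new feature into the existing pointer (running mean).
            n = nearest["count"]
            nearest["feat"] = [
                (old * n + new) / (n + 1) for old, new in zip(nearest["feat"], feat)
            ]
            nearest["count"] = n + 1
    return memory


memory = fuse_into_memory(
    [],
    [((0.0, 0.0, 0.0), [1.0]), ((0.05, 0.0, 0.0), [3.0]), ((1.0, 0.0, 0.0), [5.0])],
    radius=0.1,
)
```

Merging nearby observations bounds the number of pointers per region, which is one simple way to keep such a memory roughly uniform as more frames arrive.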