cs.CL

[1] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

Tian Lan,Xiangdong Su,Xu Liu,Ruirui Wang,Ke Chang,Jiang Li,Guanglai Gao

Main category: cs.CL

TL;DR: This paper introduces a Multi-task Chinese Bias Evaluation Benchmark (McBE) to address the lack of diverse and culturally relevant bias evaluation datasets for large language models (LLMs).

Details Motivation: Most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. There is a scarcity of datasets grounded in the Chinese language and culture, and existing datasets usually only support single evaluation tasks. Method: The authors present a Multi-task Chinese Bias Evaluation Benchmark (McBE) with 4,077 instances covering 12 single bias categories, 82 subcategories, and 5 evaluation tasks. They evaluate several popular LLMs from different series and parameter sizes. Result: The proposed benchmark provides extensive category coverage, content diversity, and comprehensive measurement. The evaluation of LLMs showed varying degrees of bias, leading to novel insights into bias in LLMs. Conclusion: The paper concludes that popular LLMs demonstrate varying degrees of bias and emphasizes the importance of measuring biases in LLMs to mitigate ethical risks. Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually only support single evaluation tasks and cannot evaluate the bias from multiple aspects in LLMs. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories, 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes.
In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

[2] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

Keyan Jin,Yapeng Wang,Leonel Santos,Tao Fang,Xu Yang,Sio Kei Im,Hugo Gonçalo Oliveira

Main category: cs.CL

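TL;DR: This paper examines how reasoning large language models (LLMs) perform on dialogue summarization and finds that explicit step-by-step reasoning does not always improve summary quality and can instead lead to verbosity and factual inconsistency.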

Details Motivation: Although LLMs have made progress on summarization tasks, the effectiveness of step-by-step reasoning architectures such as Long Chain-of-Thought (CoT) has not been sufficiently studied in dialogue summarization, which requires both abstraction and conciseness. Method: The authors conduct a comprehensive, systematic evaluation of state-of-the-art reasoning and non-reasoning LLMs across three dialogue summarization paradigms (generic, role-oriented, and query-oriented), using multilingual, multi-domain data and summaries of varying lengths. They use strong benchmarks (such as SAMSum and DialogSum) and advanced evaluation protocols that include LLM-based automatic metrics and human-inspired criteria. Result: Contrary to trends in some other reasoning-intensive tasks, explicit step-by-step reasoning does not consistently improve dialogue summarization quality. Reasoning LLMs are more prone than non-reasoning LLMs to produce verbose, factually inconsistent, and less concise summaries. Conclusion: The study sheds light on why current reasoning LLMs may fail in complex dialogue contexts and highlights the importance of targeted modeling and evaluation strategies for real-world dialogue summarization. Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts.
Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

[3] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Wenquan Lu,Yuechuan Yang,Kyle Lee,Yanshu Li,Enqi Liu

Main category: cs.CL

TL;DR: This paper investigates latent CoT in Huginn-3.5B and finds limited interpretability and minimal benefit from recurrence depth compared to models that explicitly use CoT.

Details Motivation: To determine whether latent chain-of-thought (CoT) reasoning emerges in depth-recurrent Transformers like Huginn-3.5B, which aims to internalize reasoning without increasing parameters. Method: The study uses probing techniques such as Logit Lens and Coda Lens to analyze the internal behavior of Huginn-3.5B on arithmetic tasks. Result: The analysis reveals limited evidence of latent CoT, with inconsistent interpretability across recurrent blocks and minimal performance improvement from deeper recurrence. Conclusion: Huginn-3.5B does not exhibit significant interpretable latent CoT reasoning, and increasing recurrence depth provides only marginal gains compared to models that externalize reasoning steps. Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. 
Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
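
The rank-trajectory analysis mentioned in this abstract can be illustrated with a minimal sketch. It assumes only the basic logit-lens idea of projecting each intermediate hidden state through the unembedding matrix and recording where a target token ranks. The toy vectors, the `unembed` matrix, and all function names are hypothetical and are not taken from the paper's code; a real logit lens would typically also apply the model's final normalization layer before projecting.

```python
def logits_from_hidden(hidden, unembed):
    """Project one hidden state (length d) through an unembedding matrix (vocab x d)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def token_rank(logits, token_id):
    """Rank of token_id among all tokens (0 means it has the highest logit)."""
    target = logits[token_id]
    return sum(1 for value in logits if value > target)


def rank_trajectory(hidden_states, unembed, token_id):
    """Rank of token_id after each layer or recurrence step."""
    return [token_rank(logits_from_hidden(h, unembed), token_id) for h in hidden_states]


# Toy data (hypothetical): 3 vocabulary tokens, 2-dimensional hidden states,
# and one hidden state per recurrence step.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[2.0, -1.0], [0.5, 0.6], [1.0, 1.0]]
trajectory = rank_trajectory(hidden_states, unembed, token_id=2)
print(trajectory)
```

A trajectory that drops toward 0 as depth increases would suggest that the target token becomes decodable at later steps; the abstract reports that such evidence was limited for Huginn-3.5B.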

[4] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Steven Song,Anirudh Subramanyam,Zhenyu Zhang,Aarti Venkat,Robert L. Grossman

Main category: cs.CL

TL;DR: GDC Cohort Copilot is an open-source tool that translates natural language descriptions into GDC cohort filters, simplifying data exploration for cancer genomics research.

Details Motivation: Users, especially new ones, may struggle to navigate hundreds of fields and properties to create specific cohorts in the Genomic Data Commons (GDC). Natural language descriptions could provide a more intuitive way for users to define their desired cohorts. Method: Development and evaluation of multiple large language models (LLMs) to automatically generate GDC cohort filters based on user-input natural language descriptions. The best-performing model, the open-source GDC Cohort LLM, was compared against GPT-4o prompting. Result: Introduction of GDC Cohort Copilot, which successfully translates natural language input into GDC cohort filters, allowing for further refinement and export back to GDC. The locally-served, open-source GDC Cohort LLM outperformed GPT-4o in this task. Conclusion: The GDC Cohort Copilot is an effective open-source tool that simplifies the process of creating complex cohorts on the Genomic Data Commons by translating natural language descriptions into cohort filters, outperforming GPT-4o in the process. Abstract: Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. 
We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.

[5] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu,Tinghong Chen,Jiangtao Feng,Jiangjie Chen,Weinan Dai,Qiying Yu,Ya-Qin Zhang,Wei-Ying Ma,Jingjing Liu,Mingxuan Wang,Hao Zhou

Main category: cs.CL

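TL;DR: MemAgent is an efficient long-text processing framework that handles extremely long contexts well through segment-wise processing and memory optimization.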

Details Motivation: Existing methods suffer performance degradation when handling infinitely long documents, so an efficient solution with linear complexity is needed. Method: MemAgent reads text in segments and updates its memory with an overwrite strategy, and it is trained with an extended DAPO algorithm. Result: MemAgent extrapolates from an 8K context trained on 32K text to a 3.5M QA task with less than 5% performance loss, and achieves over 95% on the 512K RULER test. Conclusion: MemAgent is an agent workflow optimized for long-text tasks, with excellent long-context capabilities. Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieving 95%+ on the 512K RULER test.
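
The segment-wise reading with an overwritten memory described above can be sketched as a simple loop. This is only an illustration of the control flow under stated assumptions: in MemAgent the memory update is performed by the RL-trained LLM, whereas here a toy stand-in, the hypothetical `keep_numbers` function, plays that role, and `read_with_memory` is likewise a name introduced for this example.

```python
def read_with_memory(tokens, segment_len, update_memory, memory=""):
    """Process a long token list segment by segment with a bounded memory.

    update_memory(memory, segment) returns the NEW memory, which replaces
    (overwrites) the old one, so each step sees only one segment plus the
    memory and total work grows linearly with document length.
    """
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        memory = update_memory(memory, segment)
    return memory


def keep_numbers(memory, segment, max_items=3):
    """Toy stand-in for the model call: keep the most recent numeric tokens."""
    found = memory.split() + [tok for tok in segment if tok.isdigit()]
    return " ".join(found[-max_items:])


doc = "the code is 42 and later 7 then noise noise 99 end 13".split()
final_memory = read_with_memory(doc, segment_len=4, update_memory=keep_numbers)
print(final_memory)
```

Because the memory has a fixed budget (`max_items` here), older content is discarded as new segments arrive, which is the trade-off an overwrite strategy has to learn to manage.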

[6] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning

Dohoon Kim,Donghun Kang,Taesup Moon

Main category: cs.CL

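TL;DR: DoMIX uses LoRA modules to make continual domain-adaptive pre-training efficient, parallel, and robust to domain order, and to provide pre-trained models tailored to specific tasks.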

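Details Motivation: Existing continual domain-adaptive pre-training (DAP) methods have high computational cost and GPU memory usage during training, are sensitive to incremental data order, and provide a single generalized model for all end tasks, which contradicts the essence of DAP. Method: DoMIX leverages LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method, to enable efficient and parallel domain-adaptive pre-training. Result: The approach is robust to domain order, uses accumulated knowledge to provide tailored pre-trained models for specific tasks, and extends beyond the DAP setting to standard LLM fine-tuning scenarios. Conclusion: DoMIX addresses the stated limitations of existing continual DAP methods with a LoRA-based design. Abstract: Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.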

[7] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models

Christian Jaumann,Annemarie Friedrich,Rainer Lienhart

Main category: cs.CL

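TL;DR: This paper presents a scientific visual question answering system for the SciVQA 2025 Shared Task that combines two multimodal large language models with several few-shot example retrieval strategies and selects answers based on model confidence.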

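Details Motivation: To improve scientific visual question answering by ensembling multimodal large language models and optimizing the few-shot setting, so that questions about scientific figures are better understood and answered. Method: The system ensembles two multimodal large language models and uses different few-shot example retrieval strategies; the model and setting are selected according to the figure and question type, and the final answer is chosen based on the models' confidence levels. Result: On the blind test data, the system ranks third out of seven participating systems, with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Conclusion: The proposed combination of a multimodal LLM ensemble with few-shot strategies improves scientific visual question answering, and the results confirm its competitiveness. Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.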

[8] QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers

Pilsung Kang

Main category: cs.CL

TL;DR: This paper presents QFFN-BERT, a hybrid quantum-classical transformer that replaces feedforward networks with parameterized quantum circuits, achieving higher accuracy and significant parameter reduction compared to classical models.

Details Motivation: Feedforward network modules account for approximately two-thirds of the parameters within standard Transformer encoder blocks, so reducing their parameter contribution could enhance the expressibility of neural architectures. Method: The researchers introduced QFFN-BERT by replacing the feedforward network modules in a compact BERT variant with PQC-based layers. They used a residual connection, both RY and RZ rotations, and an alternating entanglement strategy for stable training and high expressibility. Result: Experiments showed that a well-configured QFFN-BERT achieved up to 102.0% of baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. The model also performed competitively in few-shot learning scenarios, demonstrating superior data efficiency. Conclusion: QFFN-BERT, a hybrid quantum-classical transformer that replaces the feedforward network modules of a compact BERT variant with parameterized quantum circuits (PQC)-based layers, can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles. Abstract: Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. 
Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.
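
The "approximately two-thirds" figure for the FFN share can be checked with a quick weight count. The sketch below assumes the common configuration in which the FFN inner width is four times the model width, and it ignores biases and layer normalization; neither assumption is stated in the abstract. The width 768 is an illustrative BERT-base-style value, not the size of the compact BERT variant used in the paper.

```python
def encoder_block_params(d_model, d_ff=None):
    """Weight counts for one Transformer encoder block (biases and LayerNorm ignored)."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    return attention, ffn


attention, ffn = encoder_block_params(768)
ffn_share = ffn / (attention + ffn)
print(f"attention={attention:,} ffn={ffn:,} ffn_share={ffn_share:.3f}")
```

Under these assumptions the ratio is 8d^2 / (4d^2 + 8d^2) = 2/3 regardless of the model width, which is why replacing the FFN is an attractive target for parameter reduction.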

[9] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Weijie Lyu,Sheng-Jun Huang,Xuan Xia

Main category: cs.CL

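TL;DR: Proposes a code data selection method based on a parametric model that improves model performance while significantly reducing computational requirements.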

Details Motivation: Current methods improve model performance mainly by using vast amounts of data, focusing on data quantity while often overlooking data quality, which reduces training efficiency. Method: A parametric model is used for code data selection and is optimized to ensure distribution consistency and diversity within the selected subset. Result: Using only 10K samples, the method achieves gains of 2.4% on HumanEval and 2.3% on MBPP over the 92K full-sampled baseline, and it outperforms other sampling approaches in both performance and efficiency. Conclusion: The method improves model performance while significantly reducing computational cost. Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

[10] Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Mark Atta Mensah,Isaac Wiafe,Akon Ekpezu,Justice Kwame Appati,Jamal-Deen Abdulai,Akosua Nyarkoa Wiafe-Akenten,Frank Ernest Yeboah,Gifty Odame

Main category: cs.CL

TL;DR: This study benchmarks Akan ASR models (e.g., Whisper and Wav2Vec2) across multiple domains and finds they perform best only in their training domains, with notable differences in error behavior between architectures. It highlights the need for better domain adaptation for low-resource languages.

Details Motivation: Most ASR research evaluates models using in-domain datasets but rarely considers how well these models generalize across diverse speech contexts. This gap is addressed by studying the generalization of ASR models for Akan, a low-resource language, across multiple domains. Method: Seven Akan ASR models based on transformer architectures (including Whisper and Wav2Vec2) were benchmarked using four diverse Akan speech corpora covering various domains such as image descriptions, informal conversations, biblical scripture readings, and financial dialogues. Performance was evaluated using word error rate and character error rate to assess domain dependency and error behaviors across architectures. Result: Models performed best within their training domains but showed significant accuracy degradation when tested on mismatched domains. Whisper-based models produced more fluent but potentially misleading transcription errors, while Wav2Vec2 models generated more obvious but less interpretable outputs when encountering unfamiliar inputs. This indicates a trade-off between readability and transparency in ASR errors. Conclusion: The study concludes that ASR models, specifically those based on Whisper and Wav2Vec2 architectures, exhibit domain dependency in their performance on Akan speech data. The research highlights the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for low-resource languages like Akan. Abstract: Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets. However, it seldom evaluates how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance.
These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
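
The two metrics used in this comparison, word error rate (WER) and character error rate (CER), are both edit distances normalized by reference length. A minimal sketch follows, assuming the standard Levenshtein formulation and a non-empty reference; the example sentences are placeholders, not data from the study, and real evaluations usually also normalize case and punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER={wer(reference, hypothesis):.3f} CER={cer(reference, hypothesis):.3f}")
```

A fluent but wrong transcript and a garbled one can score similarly on WER, which is one reason the study also looks at the qualitative error behavior of each architecture.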

[11] A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages

Sumaya Ahmed Salihs,Isaac Wiafe,Jamal-Deen Abdulai,Elikem Doe Atsakpo,Gifty Ayoka,Richard Cave,Akon Obu Ekpezu,Catherine Holloway,Katrin Tomanek,Fiifi Baffoe Payin Winful

Main category: cs.CL

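TL;DR: This paper presents an approach to building ASR models for impaired speech in low-resource languages, develops a guide of best practices, and releases the first open-source dataset of impaired speech in Akan.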

Details Motivation: The study addresses the lack of automatic speech recognition (ASR) coverage for impaired speech in low-resource languages, while aiming to democratize data collection and technology development. Method: The study uses an innovative approach to collecting speech samples that includes developing best-practice guidelines, using open-source tools, and fine-tuning ASR models on impaired speech in Akan. Result: The study created the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana, and reports initial results from fine-tuning open-source ASR models to recognize impaired speech. Conclusion: By developing a best-practice "cookbook" and training materials, the study promotes broader access to speech recognition technology and, through community-driven data collection and ASR model building, establishes the first open-source dataset of impaired speech in a low-resource language. Abstract: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a "cookbook" of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan: a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.

Sneha Deshmukh,Prathmesh Kamble

Main category: cs.CL

TL;DR: This paper introduces IndianBailJudgments-1200, a new benchmark dataset of 1,200 Indian court judgments on bail decisions.

Details Motivation: Legal NLP remains underdeveloped in regions like India because structured datasets are scarce. Method: Annotations were generated with a prompt-engineered GPT-4o pipeline and verified for consistency. Result: The resource supports a range of legal NLP tasks, such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused on Indian bail jurisprudence. Conclusion: IndianBailJudgments-1200 provides an important new resource for legal NLP. Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

[13] WebSailor: Navigating Super-human Reasoning for Web Agent

Kuan Li,Zhongwang Zhang,Huifeng Yin,Liwen Zhang,Litu Ou,Jialong Wu,Wenbiao Yin,Baixuan Li,Zhengwei Tao,Xinyu Wang,Weizhou Shen,Junkai Zhang,Dingchu Zhang,Xixi Wu,Yong Jiang,Ming Yan,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

TL;DR: WebSailor is a post-training methodology that enables open-source LLMs to match proprietary systems in handling complex, uncertainty-laden information-seeking tasks through novel task generation and an efficient RL training algorithm.

Details Motivation: Proprietary systems like DeepResearch have demonstrated superior performance on complex benchmarks due to their ability to handle extreme uncertainty, a capability largely absent in open-source models. This work aims to close that gap. Method: WebSailor introduces high-uncertainty tasks through structured sampling, information obfuscation, RFT cold start, and an efficient agentic RL training algorithm called Duplicating Sampling Policy Optimization (DUPO). Result: WebSailor significantly outperforms all open-source agents in complex information-seeking tasks and matches the performance of proprietary agents. Conclusion: WebSailor successfully bridges the capability gap between open-source and proprietary agents in complex information-seeking tasks by instilling a crucial reasoning pattern that systematically reduces extreme uncertainty. Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.

[14] Revisiting Active Learning under (Human) Label Variation

Cornelia Gruber,Helen Alber,Bernd Bischl,Göran Kauermann,Barbara Plank,Matthias Aßenmacher

Main category: cs.CL

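TL;DR: This paper analyzes the limitation that labeled data imposes on supervised learning, examines the relationship between human label variation (HLV) and active learning (AL), and proposes a conceptual framework for incorporating HLV into the AL loop.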

Details Motivation: Access to high-quality labeled data remains a limiting factor in supervised learning, and human label variation (HLV) is common in practice; this motivates the authors to re-examine existing assumptions and develop approaches that better reflect real-world complexity. Method: The authors survey methods from the AL and (H)LV communities, examine how observed label variation (LV) can be decomposed into signal and noise, and propose a conceptual framework. Result: The study highlights the need to decompose observed LV into signal and noise and proposes a conceptual framework for incorporating HLV into the active learning loop. Conclusion: The paper proposes a conceptual framework for incorporating HLV throughout the AL loop and discusses the integration of large language models as annotators. Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed, or neglected, these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLMs) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

[15] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion

Xin Guan,PeiHsin Lin,Zekun Wu,Ze Wang,Ruibo Zhang,Emre Kazim,Adriano Koshiyama

Main category: cs.CL

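TL;DR: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models that mitigates bias by using multiperspective generations to align model outputs with nuanced human baselines; it is scalable, interpretable, and applicable to deployed models.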

Details Motivation: The growing need for easy bias mitigation motivates the development of MPF, a novel post-training alignment framework. Method: Built on the SAGED pipeline, MPF uses multiperspective generations to expose and align biases in LLM outputs: it decomposes a baseline (such as sentiment distributions from HR professionals) into interpretable perspective components and guides generation through sampling and balancing of responses. Result: MPF aligns LLM sentiment distributions with both counterfactual baselines and the HR baseline, achieving small KL divergence, reduced calibration error, and generalization to unseen questions. Conclusion: MPF offers a scalable and interpretable method for alignment and bias mitigation that is compatible with deployed LLMs and requires no extensive prompt engineering or fine-tuning. Abstract: Multiperspective Fusion (MPF) is a novel post-training alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or fine-tuning.
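
The alignment criterion named in this abstract, a small KL divergence between the model's sentiment distribution and a baseline, can be illustrated with a short sketch. It assumes discrete distributions over the same sentiment bins and a mixture of per-perspective distributions weighted by decomposition probabilities. The numbers, the three-bin layout, and the function names are hypothetical; how MPF actually obtains the weights is not described here, so they are hand-picked for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(components, weights):
    """Weighted mixture of per-perspective distributions."""
    n_bins = len(components[0])
    return [sum(w * comp[k] for w, comp in zip(weights, components))
            for k in range(n_bins)]


# Hypothetical sentiment bins: negative, neutral, positive.
baseline = [0.2, 0.5, 0.3]
perspectives = [[0.6, 0.3, 0.1], [0.0, 0.6, 0.4]]
weights = [1 / 3, 2 / 3]  # illustrative decomposition probabilities

fused = mix(perspectives, weights)
print("KL(baseline || fused) =", kl_divergence(baseline, fused))
print("KL(baseline || first perspective) =", kl_divergence(baseline, perspectives[0]))
```

With these hand-picked weights the fused distribution reproduces the baseline almost exactly, so its divergence is close to zero, while a single perspective on its own stays clearly farther away.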

[16] Exploring Gender Bias Beyond Occupational Titles

Ahmed Sabir,Rajesh Sharama

Main category: cs.CL

TL;DR: This paper introduces GenderLexicon and a new framework for estimating and interpreting gender biases in contextual elements like action verbs and nouns, proving the existence of such biases beyond occupational stereotypes.

Details Motivation: The research aims to investigate the correlation between gender and contextual biases, particularly focusing on action verbs, object nouns, and occupations. Method: A novel dataset named GenderLexicon was introduced, along with a framework designed to estimate contextual bias and its related gender bias. Evaluations were conducted on five diverse datasets, including a Japanese dataset, to validate the approach. Result: The findings confirm the existence of gender biases beyond occupational stereotypes, and the model demonstrates effectiveness in interpreting bias through scoring. Conclusion: The study concludes that gender biases exist beyond occupational stereotypes, and the proposed model can interpret these biases with a score, enhancing their explainability. Abstract: In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.

[17] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Zhijian Xu,Yilun Zhao,Manasi Patwardhan,Lovekesh Vig,Arman Cohan

Main category: cs.CL

TL;DR: This paper introduces LimitGen, a new benchmark for evaluating LLMs' ability to identify research paper limitations, showing that these models can enhance peer review when combined with literature retrieval.

Details Motivation: The motivation stems from the increasing volume of scientific publications intensifying the challenges in peer review, particularly in identifying paper limitations, and the underexplored potential of LLMs in this area. Method: The researchers developed LimitGen, a benchmark comprising synthetic (LimitGen-Syn) and human-written (LimitGen-Human) datasets, to evaluate LLMs' ability to detect limitations in scientific papers. They enhanced LLM performance using literature retrieval techniques. Result: LLMs demonstrated improved capability in generating concrete and constructive feedback on paper limitations when augmented with literature retrieval, showing promise in supporting early-stage peer review. Conclusion: The study concludes that LLMs, when augmented with literature retrieval, can effectively identify paper limitations and provide constructive feedback, thus complementing human peer review. Abstract: Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. 
Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

[18] Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

Peter Viechnicki

Main category: cs.CL

TL;DR: This research measures the Just Producible Difference (JPD), the minimum distance between two vowel stimuli that yields reliably different imitations, offering insights into the structure of human vowel systems and implications for theories of speech production.

Details Motivation: To determine the degree of accuracy in control mechanisms governing human vowel production in auditory space, specifically how far apart two vowel stimuli must be to yield reliably different imitations. Method: A vowel mimicry paradigm was used to measure the 'Just Producible Difference' (JPD) among two sets of English speakers during front vowel production. Result: The JPD was estimated to be between 14 and 51 mels in F1 x F2 space. Conclusion: The study provides a psychophysical explanation for trends in vowel phonemes by establishing a theoretical lower bound for how close two vowel phonemes can be in a speaker's formant space. Abstract: A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed 'Just Producible Difference' (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 x F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker's formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.

[19] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Ken Tsui

Main category: cs.CL

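TL;DR: This paper studies the limits of LLM self-correction, introduces the Self-Correction Bench framework, and finds that training data composition and a simple prompting technique substantially affect self-correction ability.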

Details Motivation: Although LLMs are transformative, they still make mistakes and explore unproductive reasoning paths. Self-correction is essential to LLM reliability and trustworthiness, yet LLMs show a systematic blind spot when correcting themselves, which calls for closer study. Method: The paper introduces Self-Correction Bench, a systematic framework that measures self-correction through controlled error injection at three complexity levels. It tests 14 models, analyzes how training data composition affects self-correction, and evaluates whether a simple prompt such as 'Wait' reduces the blind spot. Result: The average blind spot rate is 64.5%. A key cause is that human training demonstrations predominantly show error-free responses rather than error-correction sequences, whereas RL-trained models learn error correction through outcome feedback. Appending 'Wait' reduces blind spots by 89.3%. Conclusion: Current LLMs have a 'blind spot' in self-correction: they struggle to correct errors in their own outputs even though they can identify the same errors in user input. Using the new Self-Correction Bench framework, the study links this limitation to training data composition and suggests directions for improving LLM reliability and trustworthiness. Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify errors in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical errors in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

[20] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Riccardo Cantini,Nicola Gabriele,Alessio Orsino,Domenico Talia

Main category: cs.CL

TL;DR: This paper investigates whether reasoning capabilities in language models improve their robustness to social biases. It finds that such capabilities can increase vulnerability to bias and highlights the need for safer reasoning design.

Details Motivation: To understand how reasoning mechanisms impact model fairness and robustness against social biases, particularly as these models are increasingly used in complex reasoning tasks. Method: The research employs the CLEAR-Bias benchmark and uses an LLM-as-a-judge approach alongside jailbreak techniques to assess adversarial robustness across sociocultural dimensions. Result: Models with explicit reasoning mechanisms were found to be more vulnerable to stereotype reinforcement compared to base models, challenging the assumption that reasoning inherently improves robustness. Conclusion: The study concludes that reasoning capabilities in language models may unintentionally increase vulnerability to bias, suggesting a need for more bias-aware reasoning designs. Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. 
Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

[21] Multimodal Mathematical Reasoning with Diverse Solving Perspective

Wenhao Shi,Zhiqiang Hu,Yi Bin,Yang Yang,See-Kiong Ng,Heng Tao Shen

Main category: cs.CL

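TL;DR: This work introduces the MathV-DP dataset and the Qwen-VL-DP model, which improve multimodal mathematical reasoning through diverse reasoning perspectives.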

Details Motivation: Existing multimodal LLMs for mathematical reasoning typically rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. Method: The authors build a model on Qwen-VL, fine-tune it with supervised learning, and enhance it via group relative policy optimization (GRPO) with correctness-discrimination and diversity-aware reward functions. Result: On the MathVista minitest and Math-V benchmarks, Qwen-VL-DP performs strongly, with higher accuracy and generative diversity. Conclusion: Qwen-VL-DP significantly outperforms prior multimodal LLMs in accuracy and generative diversity, underscoring the importance of multi-perspective and reflective reasoning. Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

[22] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Wencheng Zhang,Shiqin Qiao,Lingjie Luo,Yinfeng Li,Chuanyang Zheng,Qian Xu,Meng Li,Yong Gui,Yijun He,Jianing Qiu,Jindong Hong,Jiankai Sun

Main category: cs.CL

TL;DR: This paper proposes SynapseRoute, a dynamic routing framework that optimizes large language model usage by assigning queries to 'thinking' or 'non-thinking' modes based on complexity, improving accuracy while significantly reducing costs.

Details Motivation: With the widespread adoption of LLMs, selecting an appropriate model requires balancing performance and operational cost. This work addresses this challenge by exploring dynamic query routing to optimize resource usage. Method: The authors propose SynapseRoute, a machine learning-based dynamic routing framework that assigns queries to appropriate modes based on complexity. They also introduce the Accuracy-Inference-Token (AIT) index for comprehensive evaluation. Result: SynapseRoute improves overall accuracy compared to using the thinking mode alone while reducing inference time by 36.8% and token consumption by 39.66%. Qualitative analysis shows over-reasoning on simple queries leads to inefficiencies. Conclusion: This study concludes that there exists a significant dichotomy in problem complexity, which can be optimized by dynamically routing queries to either thinking or non-thinking modes, leading to improved accuracy, cost-efficiency, and user experience. Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. 
Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.

[23] Generalizing Verifiable Instruction Following

Valentina Pyatkin,Saumya Malik,Victoria Graf,Hamish Ivison,Shengyi Huang,Pradeep Dasigi,Nathan Lambert,Hannaneh Hajishirzi

Main category: cs.CL

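TL;DR: This paper introduces IFBench, a new benchmark for evaluating how well language models follow precise instructions, and shows that reinforcement learning with verifiable rewards (RLVR) improves this ability.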

Details Motivation: Current language models struggle to satisfy output constraints in user instructions and generalize poorly to unseen constraints. Method: The authors introduce IFBench, a benchmark of 58 new, diverse, and challenging verifiable output constraints, design constraint verification modules, and train with reinforcement learning with verifiable rewards (RLVR). Result: Experiments show that RLVR-based training significantly improves precise instruction following, especially generalization to unseen output constraints. Conclusion: The IFBench benchmark and RLVR training effectively improve the generalization and accuracy of language models in following user instructions. Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
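
The two example constraints quoted in the abstract above can be checked programmatically, which is what makes them "verifiable". The following is a minimal sketch with hypothetical helper names. These are not the verification functions released with IFBench, and the all-or-nothing reward is an assumption made for illustration.

```python
import re

def only_yes_or_no(response: str) -> bool:
    """Hypothetical verifier: the whole response is 'yes' or 'no'."""
    return response.strip().strip(".!").lower() in {"yes", "no"}

def mentions_at_least(response: str, word: str, n: int) -> bool:
    """Hypothetical verifier: `word` occurs as a whole word at least n times."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n

def reward(response: str, checks) -> float:
    """Assumed binary reward: 1.0 only if every constraint check passes."""
    return 1.0 if all(check(response) for check in checks) else 0.0
```

A training loop in the RLVR style would call something like `reward(response, checks)` on each sampled response. The paper's actual reward design may differ.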

[24] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Almog Hilel,Idan Shenfeld,Leshem Choshen,Jacob Andreas

Main category: cs.CL

TL;DR: A vulnerability exists in language models trained with user feedback, allowing attackers to manipulate model behavior through strategic prompting and feedback manipulation.

Details Motivation: To identify potential vulnerabilities in language models trained with user feedback and explore how these can be exploited to alter model behavior. Method: The researchers conducted an experimental attack where a user provides prompts and manipulates feedback (upvotes/downvotes) to alter the behavior of a language model during preference tuning. Result: The study found that attackers can use feedback manipulation to insert new knowledge, modify code generation patterns introducing security flaws, and inject fake financial news into the model's outputs. Conclusion: The paper concludes that preference tuning in language models can be exploited by attackers through manipulated feedback, leading to persistent alterations in model behavior. Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

[25] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

TL;DR: This paper introduces MOTIF, a method that enables large language models to think in modules across multiple rounds, improving reasoning performance on mathematical tasks while being sample efficient.

Details Motivation: Large language models (LLMs) are limited in their reasoning capabilities due to fixed context size, restricting the number of tokens they can attend to. This work aims to overcome this limitation by enabling LLMs to reason over multiple rounds through a modular thinking strategy. Method: The authors proposed MOTIF, a reinforcement learning (RL) training method for modular thinking via multi-round generation of thinking tokens. They employed parameter-efficient fine-tuning on the Qwen2.5-3B-Instruct model using the GSM8K dataset and evaluated it on MATH500 and AIME2024 benchmarks. Result: Training with MOTIF resulted in 3.8% and 3.3% improvements in accuracy on the MATH500 and AIME2024 benchmarks, respectively, compared to vanilla GRPO-based training, using only 15% of the samples. Conclusion: MOTIF improves the reasoning capabilities of LLMs beyond context size limits through modular thinking and reinforcement learning, achieving better accuracy with sample efficiency. Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size.
We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.

[26] Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping

Main category: cs.CL

TL;DR: This paper proposes 'answer matching' as a better evaluation method for language models, overcoming limitations of multiple-choice benchmarks by using free-form responses and reference answers for more accurate results.

Details Motivation: The motivation stems from the observation that multiple-choice benchmarks often allow models to exploit shortcuts, answering correctly without fully understanding the question. This highlights the need for more accurate and reliable evaluation methods. Method: The researchers introduced 'answer matching' as an evaluation method. This involves giving models questions without options, allowing them to generate free-form answers, and then comparing these responses to reference answers using modern language models. Result: Answer matching using recent language models achieved agreement levels comparable to inter-annotator agreement, outperforming both multiple-choice evaluations and LLM-as-a-judge approaches without reference answers. Conclusion: The study concludes that answer matching provides a superior evaluation method compared to traditional multiple-choice benchmarks, showing near-perfect agreement with human grading and significantly changing model rankings. Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. 
To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
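
The answer-matching procedure described in this abstract (question without options, free-form response, comparison against the reference) can be sketched as a short evaluation loop. This is an illustrative sketch, not the authors' code. In the paper the matcher is a modern language model, whereas `toy_matcher` below is a hypothetical substring check standing in for it, and `toy_candidate` is a hypothetical stand-in for the model under evaluation.

```python
def toy_matcher(response: str, reference: str) -> bool:
    """Stand-in for the language-model matcher: normalized substring match."""
    return reference.strip().lower() in response.strip().lower()

def answer_matching_accuracy(items, candidate, matcher=toy_matcher) -> float:
    """Fraction of items whose free-form response matches the reference."""
    correct = 0
    for item in items:
        # The candidate sees only the question, never the answer options.
        response = candidate(item["question"])
        correct += matcher(response, item["reference"])
    return correct / len(items)

# Illustrative data only.
items = [
    {"question": "What is the capital of France?", "reference": "Paris"},
    {"question": "What is 2 + 2?", "reference": "4"},
]

def toy_candidate(question: str) -> str:
    """Hypothetical model: answers one question correctly and one wrongly."""
    return "I believe it is Paris." if "France" in question else "5"

accuracy = answer_matching_accuracy(items, toy_candidate)
```

Swapping `toy_matcher` for a call to a grading model changes only the `matcher` argument. The loop itself stays the same.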

cs.CV [Back]

[27] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CV

TL;DR: This paper reviews the use of large language models and vision-language models for crash detection from video feeds, offering insights into methods, challenges, and future research directions.

Details Motivation: Crash detection from video feeds is crucial in intelligent transportation systems, and recent developments in LLMs and VLMs offer new opportunities for processing multimodal information. Method: The authors surveyed recent methods using large language models for crash detection, presenting a taxonomy of fusion strategies, summarizing datasets, analyzing architectures, and comparing benchmarks. Result: A structured taxonomy of fusion strategies, summary of key datasets, analysis of model architectures, and comparison of performance benchmarks in using LLMs for crash detection are presented. Conclusion: The paper concludes that leveraging large language models and vision-language models offers transformative potential for crash detection from video feeds, providing a foundation for future research. Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[28] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning

Zijie Cai,Christopher Metzler

Main category: cs.CV

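TL;DR: This paper systematically evaluates monocular metric depth estimation models underwater and shows that domain adaptation significantly improves their performance in challenging underwater environments.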

Details Motivation: Because of light attenuation, scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data, monocular depth estimation remains unreliable underwater, so existing methods need systematic evaluation and improvement. Method: The paper benchmarks zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID, and uses a physically based underwater image formation model to generate an underwater variant of the Hypersim dataset for training. Result: Large-scale models trained on terrestrial (real or synthetic) data are effective in air but perform poorly underwater. The fine-tuned Depth Anything V2 model improves consistently across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Conclusion: Fine-tuning Depth Anything V2 on a synthetic underwater dataset significantly improves depth estimation in real underwater environments, highlighting the importance of domain adaptation and scale-aware supervision for future underwater depth estimation research. Abstract: Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.

[29] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

Xiao Wang,Jingtao Jiang,Qiang Chen,Lan Chen,Lin Zhu,Yaowei Wang,Yonghong Tian,Jin Tang

Main category: cs.CV

TL;DR: This paper introduces ESTR-CoT, a novel framework for event stream scene text recognition that enhances interpretability and contextual reasoning by combining vision encoders and large language models with chain-of-thought reasoning.

Details Motivation: Existing event stream scene text recognition methods suffer from insufficient interpretability and weak contextual logical reasoning, prompting the need for a more explainable and logically robust framework. Method: The method uses a vision encoder (EVA-CLIP) to process event streams, aligns visual tokens with a pre-trained large language model (Vicuna-7B) using a Q-former, and generates both answers and chain-of-thought reasoning. The model is trained on a newly proposed large-scale CoT dataset through three stages: generation, polish, and expert verification. Result: Extensive experiments on three benchmark datasets (EventSTR, WordArt*, IC15*) validate the effectiveness and interpretability of the ESTR-CoT framework. The source code and pre-trained models are publicly available. Conclusion: The proposed ESTR-CoT framework improves interpretability and contextual logical reasoning in event stream scene text recognition, outperforming existing methods. Abstract: Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. 
Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via three-stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.

[30] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach

Elena Ryumina,Maxim Markitantov,Alexandr Axyonov,Dmitry Ryumin,Mikhail Dolgushin,Alexey Karpov

Main category: cs.CV

TL;DR: This paper proposes a zero-shot multimodal approach for Compound Expression Recognition that integrates six modalities without requiring target-domain training data, achieving performance comparable to supervised models.

Details Motivation: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. Traditional approaches rely on task-specific training data, which limits their adaptability. Method: The method combines six heterogeneous modalities (static/dynamic facial expressions, scene and label matching, scene context, audio, text) using zero-shot components like CLIP and Qwen-VL, along with a Multi-Head Probability Fusion module and a Compound Expressions transformation module using Pair-Wise Probability Aggregation and Pair-Wise Feature Similarity Aggregation. Result: Evaluated under multi-corpus training, the approach achieved F1 scores of 46.95% on AffWild2, 49.02% on AFEW, and 34.85% on C-EXPR-DB through zero-shot testing. Conclusion: The proposed zero-shot multimodal approach effectively captures compound emotions without domain adaptation, achieving results comparable to supervised methods. Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. 
Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.
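The abstract names Pair-Wise Probability Aggregation (PPA) but does not give its formula. As a rough illustration only, one plausible reading is that each compound class is scored by combining the fused probabilities of its two constituent basic emotions. The sketch below assumes a simple sum followed by renormalization; the class list, the pairs, and the function name `pairwise_aggregate` are hypothetical and not taken from the paper.

```python
import numpy as np

# Illustrative basic-emotion order and compound pairs (assumptions, not the paper's spec).
BASIC = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
COMPOUND = {
    "fearfully surprised": ("fear", "surprise"),
    "happily surprised": ("happiness", "surprise"),
    "sadly angry": ("sadness", "anger"),
}

def pairwise_aggregate(p_basic):
    """Score each compound class as the sum of its two basic-emotion probabilities,
    then renormalize so the compound scores sum to 1 (assumed aggregation rule)."""
    idx = {name: i for i, name in enumerate(BASIC)}
    scores = np.array([p_basic[idx[a]] + p_basic[idx[b]] for a, b in COMPOUND.values()])
    return scores / scores.sum()

# Example: fused basic-emotion probabilities for one frame (made-up numbers).
p = np.array([0.05, 0.05, 0.30, 0.10, 0.10, 0.40])
probs = pairwise_aggregate(p)
best = list(COMPOUND)[int(np.argmax(probs))]
print(best, probs.round(3))
```

The appeal of a rule like this is that no compound-labelled training data is needed: any model that outputs basic-emotion probabilities can be mapped to compound classes at test time.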

[31] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Takuro Kawada,Shunsuke Kitada,Sota Nemoto,Hitoshi Iyatomi

Main category: cs.CV

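TL;DR: This paper presents SciGA-145k, a large-scale dataset that supports graphical abstract selection and recommendation for scientific papers, and defines two recommendation tasks along with a new evaluation metric, CAR.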

Details Motivation: Graphical abstracts (GAs) play a key role in scientific papers, but designing them requires advanced visualization skills, which hinders their widespread adoption. Method: The authors define two tasks, Intra-GA recommendation and Inter-GA recommendation, and propose the CAR evaluation metric. Result: They build SciGA-145k, a dataset of approximately 145,000 scientific papers and 1.14 million figures, and provide baseline models for the tasks. Conclusion: SciGA-145k lays a foundation for visual scientific communication and advances the application of AI in scientific research. Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

[32] Understanding Trade offs When Conditioning Synthetic Data

Brandon Trabucco,Qasim Wani,Benjamin Pikus,Vasu Sharma

Main category: cs.CV

TL;DR: This paper explores how different conditioning strategies in diffusion models impact synthetic data quality for object detection, finding that layout-based approaches outperform prompt-based ones as visual concept diversity increases.

Details Motivation: The motivation stems from the need to learn robust object detectors using only limited images in industrial vision systems, where gathering high-quality training data is time-consuming and challenging. Method: The researchers compared two conditioning strategies—prompt-based and layout-based—using eighty diverse visual concepts from four standard object detection benchmarks. Result: When layout cues match the full training distribution, synthetic data improves mean average precision by an average of thirty-four percent and up to one hundred seventy-seven percent compared to using real data alone. Conclusion: The study concludes that the effectiveness of conditioning strategies in diffusion models for generating synthetic data depends on the diversity of visual concepts, with layout-based conditioning becoming superior as diversity increases. Abstract: Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. 
When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.

[33] High-Fidelity Differential-information Driven Binary Vision Transformer

Tian Gao,Zhiyuan Zhang,Kaijie Yin,Xu-Cheng Zhong,Hui Kong

Main category: cs.CV

TL;DR: The paper proposes DIDB-ViT, a novel binary vision transformer method that enhances performance by incorporating differential information, frequency decomposition, and improved activation functions.

Details Motivation: To overcome performance degradation in existing binary ViT methods and reduce reliance on full-precision modules while maintaining computational efficiency. Method: Designing an informative attention module with differential information, using frequency decomposition via discrete Haar wavelet, and introducing an improved RPReLU activation function. Result: Experimental results show DIDB-ViT achieves superior performance in image classification and segmentation across multiple ViT architectures compared to current quantization techniques. Conclusion: DIDB-ViT is an effective binary ViT approach that outperforms state-of-the-art network quantization methods in multiple ViT architectures for image classification and segmentation. Abstract: The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model's representational capacity. 
Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
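To make the frequency-decomposition idea concrete, here is a minimal sketch of a single-level orthonormal Haar transform and of how a query-key similarity splits into a low-frequency and a high-frequency part. It uses real-valued vectors and decomposes along the feature axis; which axis DIDB-ViT decomposes, how the binarized Q and K are handled, and how the per-band similarities are weighted are not stated in the abstract, so those choices here are assumptions.

```python
import numpy as np

def haar_1d(x):
    """Single-level orthonormal Haar transform along the last axis (length must be even)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)   # local averages: low-frequency band
    high = (even - odd) / np.sqrt(2.0)  # local differences: high-frequency band
    return low, high

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))  # 4 query tokens, feature dimension 8
k = rng.standard_normal((4, 8))  # 4 key tokens, feature dimension 8

q_low, q_high = haar_1d(q)
k_low, k_high = haar_1d(k)

sim_full = q @ k.T
# Because the transform is orthonormal, the full similarity is exactly the
# sum of the per-band similarities.
sim_split = q_low @ k_low.T + q_high @ k_high.T
print(np.allclose(sim_full, sim_split))
```

For full-precision vectors the two similarity matrices coincide. The point of computing them per band in a binary model would be that each band can be quantized separately, so high-frequency detail is not swamped when values collapse to signs.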

[34] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

Jiangxia Chen,Tongyuan Huang,Ke Song

Main category: cs.CV

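TL;DR: This study proposes FMOcc, an efficient 3D semantic occupancy prediction method for occluded and distant scenes in autonomous driving that outperforms existing techniques.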

Details Motivation: Conventional methods rely on fusing multiple image frames, which consumes substantial computational resources and requires additional data. This paper addresses 3D occupancy prediction from few-frame input and aims to improve prediction for occluded and distant regions. Method: The authors propose FMOcc, a Tri-perspective View (TPV) feature refinement network comprising a Flow Matching SSM module (FMSSM), a TPV SSM layer, Plane Selective SSM (PS3M), and a Mask Training method. Result: Experiments show that FMOcc performs well on the Occ3D-nuScenes and OpenOcc datasets, reaching 43.1% RayIoU and 39.8% mIoU with two-frame input while keeping memory use and inference time low. Conclusion: By combining a flow matching model with a selective state space model, FMOcc achieves higher accuracy and efficiency in 3D semantic occupancy prediction and outperforms existing methods, particularly in occluded and distant scenes. Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which needs additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with flow matching selective state space model for few-frame 3D occupancy prediction. Firstly, to generate missing features, we designed a feature refinement module based on a flow matching model, which is called Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. Our FMOcc with two frame input achieves notable scores of 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, 42.6% RayIoU on OpenOcc with 5.4 G inference memory and 330 ms inference time.
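For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of mean intersection-over-union on per-voxel class labels. It shows the general definition only; the benchmark's actual protocol (for example, visibility masks or excluded classes) may differ, and the function name `mean_iou` and the toy labels are illustrative assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both: skip rather than count as 0 or 1
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy example: six voxels, three semantic classes.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))  # per-class IoUs are 1/3, 2/3, 1/2
```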

[35] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement

Zeyu Lei,Hongyuan Yu,Jinlin Wu,Zhen Chen

Main category: cs.CV

TL;DR: SurgVisAgent is an advanced surgical vision agent designed to dynamically address multiple types of image distortions in endoscopic procedures, offering a unified and effective solution for real-world surgical challenges.

Details Motivation: Current surgical enhancement algorithms are limited to single tasks in specific scenarios, which hampers their effectiveness in complex real-world situations. This limitation motivates the development of a more versatile and adaptive solution for surgical assistance. Method: The paper proposes SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). It dynamically identifies distortion categories and severity levels in endoscopic images, utilizing a prior model for domain-specific knowledge. The method also incorporates in-context few-shot learning and chain-of-thought (CoT) reasoning to deliver customized enhancements. Result: Extensive experiments on a comprehensive benchmark simulating real-world surgical distortions demonstrate that SurgVisAgent outperforms traditional single-task models. It successfully performs various enhancement tasks, including low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Conclusion: SurgVisAgent is presented as a unified solution for surgical assistance, demonstrating its potential to surpass traditional single-task models through dynamic and customized image enhancement capabilities tailored to diverse surgical requirements. Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). 
SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.

[36] Multi-Label Classification Framework for Hurricane Damage Assessment

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

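TL;DR: This study introduces a new multi-label classification framework based on aerial imagery that improves the accuracy and efficiency of post-hurricane damage assessment.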

Details Motivation: Traditional single-label classification methods cannot capture the complexity of post-hurricane damage, so a more effective approach to damage assessment is needed. Method: The framework combines a ResNet-based feature extraction module with a class-specific attention mechanism to identify multiple damage types within an image. Result: It achieves a mean average precision of 90.23% on the Rescuenet dataset, outperforming existing baseline methods. Conclusion: The study proposes a new multi-label classification framework for post-hurricane damage assessment that enables more accurate and timely disaster response. Abstract: Hurricanes cause widespread destruction, resulting in diverse damage types and severities that require timely and accurate assessment for effective disaster response. While traditional single-label classification methods fall short of capturing the complexity of post-hurricane damage, this study introduces a novel multi-label classification framework for assessing damage using aerial imagery. The proposed approach integrates a feature extraction module based on ResNet and a class-specific attention mechanism to identify multiple damage types within a single image. Using the Rescuenet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. This framework enhances post-hurricane damage assessment, enabling more targeted and efficient disaster response and contributing to future strategies for disaster mitigation and resilience. This paper has been accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025), and the camera-ready version will appear in the official conference proceedings.

[37] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation

Yuxiang Zhang,Wei Li,Wen Jia,Mengmeng Zhang,Ran Tao,Shunlin Liang

Main category: cs.CV

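TL;DR: This paper proposes BiDA, a new framework that uses a triple-branch transformer architecture and a bi-directional domain adaptation strategy to outperform existing methods on cross-domain hyperspectral image classification.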

Details Motivation: Hyperspectral images acquired in different regions or at different times exhibit significant spectral shifts, so the conventional assumption about training and test data no longer holds. New cross-domain adaptation methods are therefore needed to improve transferability and classification performance in the target scene. Method: The authors propose a Bi-directional Domain Adaptation (BiDA) framework built on a triple-branch transformer architecture (source branch, target branch, and coupled branch), with a Coupled Multi-head Cross-attention (CMCA) mechanism and a bi-directional distillation loss to strengthen feature interaction and inter-domain correlation mining. An Adaptive Reinforcement Strategy (ARS) is also introduced to help the model focus on generalized features under noisy conditions. Result: Experiments show that BiDA outperforms current state-of-the-art domain adaptation methods on multiple cross-temporal/scene airborne and satellite datasets, with accuracy 3% to 5% higher on the tree species classification task. Conclusion: BiDA outperforms existing methods on cross-domain hyperspectral image classification and markedly improves accuracy on tree species classification under temporal and scene changes. Abstract: Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with a semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of source and target domains, while a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. 
Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is 3% to 5% higher than the most advanced method. The code will be available at https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.

[38] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

Fanghai Yi,Zehong Zheng,Zexiao Liang,Yihang Dong,Xiyang Fang,Wangyu Wu,Xuhang Chen

Main category: cs.CV

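TL;DR: This paper proposes MAC-Lookup, a new underwater image enhancement model that combines CLTCC and MAAE modules and restores detail and color better than existing methods.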

Details Motivation: Underwater images suffer from visibility and color problems caused by light changes, water turbidity, and bubbles. Traditional prior-based and pixel-based methods often fail, while deep learning methods lack high-quality datasets. Method: The authors introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which comprises two modules: Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction, and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. Result: Extensive experiments show that MAC-Lookup excels at enhancing underwater images; the code is publicly available on GitHub. Conclusion: MAC-Lookup outperforms existing methods in underwater image enhancement, restoring details and colors more faithfully while avoiding over-enhancement and saturation. Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.

[39] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Feizhen Huang,Yu Wu,Yutian Lin,Bo Du

Main category: cs.CV

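TL;DR: This paper proposes a self-distillation method that addresses the performance drop of video-to-audio generation in partially visible scenes, which arises from overlooking cinematic language, and significantly improves model performance.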

Details Motivation: Current video-to-audio generation methods overlook cinematic language, an important form of artistic expression, and therefore perform poorly when Foley targets are only partially visible. Method: By simulating cinematic language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondences, which lets it capture associations between sounds and partial visual information more effectively. Result: The method achieves significant improvements on all evaluation metrics under partial visibility and also improves performance on the large-scale V2A dataset VGGSound. Conclusion: The paper proposes a self-distillation approach to handle cinematic language in video-to-audio generation and improves model performance under partial visibility. Abstract: Video-to-Audio (V2A) Generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

[40] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

Juntao Liu,Liqiang Niu,Wenchao Chen,Jie Zhou,Fandong Meng

Main category: cs.CV

TL;DR: LaCo is a novel framework for visual token compression in MLLMs, enabling effective compression within intermediate layers of the vision encoder through a pixel-shuffle mechanism and residual learning architecture.

Details Motivation: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. LaCo aims to address this limitation. Method: LaCo introduces a layer-wise pixel-shuffle mechanism and a residual learning architecture with non-parametric shortcuts for visual token compression in the intermediate layers of the vision encoder. Result: Extensive experiments indicate that LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. Conclusion: LaCo improves training efficiency and inference throughput while maintaining performance compared to external compression methods. Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
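The space-to-channel merge described above can be sketched in a few lines. The example below assumes tokens laid out row-major on an H x W grid and merges each 2 x 2 neighbourhood into one token by concatenating channels; the function name `space_to_channel`, the merge ratio, and the layout are illustrative assumptions, and LaCo's residual path with non-parametric shortcuts is not shown.

```python
import numpy as np

def space_to_channel(tokens, grid_h, grid_w, r=2):
    """Merge every r x r block of adjacent tokens into a single token.

    tokens: array of shape (grid_h * grid_w, C), row-major over the patch grid.
    Returns an array of shape (grid_h/r * grid_w/r, r*r*C).
    """
    n, c = tokens.shape
    assert n == grid_h * grid_w and grid_h % r == 0 and grid_w % r == 0
    x = tokens.reshape(grid_h // r, r, grid_w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/r, W/r, r, r, C): group each block together
    return x.reshape((grid_h // r) * (grid_w // r), r * r * c)

# 4 x 4 grid of 3-channel tokens: 16 tokens become 4 tokens of 12 channels each.
tokens = np.arange(4 * 4 * 3, dtype=np.float32).reshape(16, 3)
merged = space_to_channel(tokens, grid_h=4, grid_w=4, r=2)
print(merged.shape)
print(merged[0])  # channels of grid positions (0,0), (0,1), (1,0), (1,1)
```

The merge itself discards nothing: token count drops by a factor of r squared while the channel width grows by the same factor. Any actual reduction in information would come from whatever projection follows the merge.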

[41] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

De Cheng,Zhipeng Xu,Xinyang Jiang,Dongsheng Li,Nannan Wang,Xinbo Gao

Main category: cs.CV

TL;DR: This paper proposes WERA, a novel framework for domain generalization that combines text-guided visual prompts with abstract prompts and stylized image augmentations, achieving better performance than existing methods.

Details Motivation: Despite the promise of Visual Foundation Models in domain generalization, designing prompts that can disentangle invariant features across domains remains challenging. Method: A framework that uses a large language model to disentangle text prompts and guide visual feature learning, enhanced by Worst Explicit Representation Alignment (WERA) to incorporate abstract prompts and stylized image augmentations. Result: Experiments on major DG datasets such as PACS, VLCS, OfficeHome, DomainNet, and TerraInc show that the proposed approach outperforms state-of-the-art DG methods. Conclusion: The proposed method WERA effectively improves domain generalization by combining text-guided visual prompts with abstract prompts, demonstrating superior performance on DG datasets. Abstract: Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. 
To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

[42] ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

Hanbo Bi,Yulong Xu,Ya Li,Yongqiang Mao,Boyuan Tong,Chongyang Li,Chunbo Lang,Wenhui Diao,Hongqi Wang,Yingchao Feng,Xian Sun

Main category: cs.CV

TL;DR: ViRefSAM improves the Segment Anything Model for remote sensing by automatically segmenting unseen classes using just a few reference images, eliminating manual prompts and enhancing domain adaptability.

Details Motivation: Two main challenges motivated this work: (1) manual prompt creation for SAM is inefficient in remote sensing scenarios with dense or fragmented objects, and (2) SAM lacks adaptability to remote sensing images due to its training on natural images. Method: ViRefSAM introduces two components: a Visual Contextual Prompt Encoder for generating object-aware prompts and a Dynamic Target Alignment Adapter to bridge the domain gap. The original SAM architecture remains unchanged. Result: Experiments on iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$ benchmarks show that ViRefSAM consistently outperforms existing few-shot segmentation methods in segmenting unseen classes in remote sensing images. Conclusion: ViRefSAM enables accurate and automatic segmentation of unseen classes in remote sensing images using only a few reference images, outperforming existing methods. Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. 
Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.

[43] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

Yunhan Yang,Shuo Chen,Yukun Huang,Xiaoyang Wu,Yuan-Chen Guo,Edmund Y. Lam,Hengshuang Zhao,Tong He,Xihui Liu

Main category: cs.CV

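TL;DR: DreamComposer++ is a framework that improves existing view-aware diffusion models by incorporating multi-view conditions, enabling controllable novel view synthesis.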

Details Motivation: Existing methods that generate novel views from a single view struggle with controllability because they lack multi-view information. Method: DreamComposer++ uses a view-aware 3D lifting module to extract 3D representations of an object from different views, then aggregates these representations and renders them into the latent features of the target view through a multi-view feature fusion module. Finally, the target-view features are integrated into pre-trained image or video diffusion models for novel view synthesis. Result: Experiments show that DreamComposer++ integrates seamlessly with state-of-the-art view-aware diffusion models and enhances their ability to generate controllable novel views from multi-view conditions. Conclusion: DreamComposer++ facilitates controllable 3D object reconstruction and opens up a range of applications. Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of the target view through the multi-view feature fusion module. Finally, the obtained features of the target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.

[44] Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images

Haoxuan Li,Chenxu Wei,Haodong Wang,Xiaomeng Hu,Boyuan An,Lingyan Ran,Baosen Zhang,Jin Jin,Omirzhan Taukebayev,Amirkhan Temirbayev,Junrui Liu,Xiuwei Zhang

Main category: cs.CV

TL;DR: This paper proposes Flow-CDNet, a dual-branch network for detecting both slow and fast changes in bitemporal images, achieving superior performance on a custom dataset with a new loss and evaluation metric.

Details Motivation: Detecting both fast and slow changes in bitemporal images is crucial, as slow changes can act as early indicators of potential hazards in critical areas such as slopes and dams. Method: The method involves a dual-branch network: one for optical flow to detect slow changes and another combining ResNet and optical flow output for fast change detection, trained using a hybrid loss function and evaluated with the FEPE metric. Result: Quantitative experiments on the Flow-Change dataset show that the proposed approach outperforms existing methods, and ablation studies confirm the mutual enhancement of the two branches. Conclusion: The paper concludes that the proposed Flow-CDNet effectively detects both slow and fast changes, with experimental results showing its superiority over existing methods. Abstract: Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: optical flow branch and binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second one combines a ResNet-based network with the optical flow branch's output to generate fast change outputs. Subsequently, to supervise and evaluate this new change detection framework, a self-built change detection dataset Flow-Change, a loss function combining binary tversky loss and L2 norm loss, along with a new evaluation metric called FEPE are designed. 
Quantitative experiments conducted on the Flow-Change dataset demonstrate that our approach outperforms existing methods. Furthermore, ablation experiments verify that the two branches reinforce each other to enhance detection performance.
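
The abstract above describes a loss that combines a binary Tversky loss with an L2 norm loss. The following is a minimal sketch of how such a combination could look, not the authors' implementation: the function names, the `alpha`/`beta` defaults, the `weight` term, and the choice of mean Euclidean distance for the L2 part are all assumptions made for illustration.

```python
import math

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Binary Tversky loss on flat lists: 1 - TP / (TP + alpha*FP + beta*FN).

    alpha and beta are illustrative defaults, not values from the paper.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def l2_flow_loss(pred_flow, target_flow):
    """Mean Euclidean distance between predicted and reference (dx, dy) vectors."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred_flow, target_flow)]
    return sum(dists) / len(dists)

def hybrid_loss(pred_mask, target_mask, pred_flow, target_flow, weight=1.0):
    """Hypothetical weighted sum of the two terms; the weighting is an assumption."""
    return (tversky_loss(pred_mask, target_mask)
            + weight * l2_flow_loss(pred_flow, target_flow))
```

In this sketch the Tversky term would supervise the binary (fast) change mask and the L2 term the displacement field of the optical flow branch; how the paper actually balances the two is not stated in the abstract.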

[45] LMPNet for Weakly-supervised Keypoint Discovery

Pei Guo,Ryan Farrell

Main category: cs.CV

TL;DR: This paper introduces LMPNet, a weakly-supervised approach for semantic keypoint detection using a novel leaky max pooling layer and filter manipulation, achieving results comparable to supervised methods.

Details Motivation: The motivation stems from the need to perform semantic object keypoint discovery in a weakly-supervised setting using only category labels. The goal is to transform intermediate layer filters into effective keypoint detectors without relying on hand-crafted loss terms. Method: The paper proposes a novel leaky max pooling (LMP) layer to encourage filters to learn non-repeatable local patterns aligned with object keypoints. It also uses a selection strategy for consistent activations, attention mask-out to distribute attention across the whole object, and a learnable clustering layer to group keypoint proposals into predictions. Result: The proposed LMPNet model successfully discovers semantic keypoints that are robust to object pose variations while maintaining competitive performance against supervised pose estimation models. Conclusion: LMPNet is a highly interpretable model that can automatically discover semantic keypoints robust to object pose and achieves strong prediction accuracy comparable to supervised models. Abstract: In this work, we explore the task of semantic object keypoint discovery weakly-supervised by only category labels. This is achieved by transforming discriminatively-trained intermediate layer filters into keypoint detectors. We begin by identifying three preferred characteristics of keypoint detectors: (i) spatially sparse activations, (ii) consistency and (iii) diversity. Instead of relying on hand-crafted loss terms, a novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that are well aligned with object keypoints. Informed by visualizations, a simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. 
For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. The final model, named LMPNet, is highly interpretable in that it directly manipulates network filters to detect predefined concepts. Our experiments show that LMPNet can (i) automatically discover semantic keypoints that are robust to object pose and (ii) achieve strong prediction accuracy comparable to a supervised pose estimation model.

[46] Perception Activator: An intuitive and portable framework for brain cognitive exploration

Le Xu,Qi Zhang,Qixian Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao

Main category: cs.CV

TL;DR: This paper explores the integration of fMRI signals into visual decoding models, showing enhanced accuracy in object detection and segmentation by utilizing rich semantic cues from fMRI data.

Details Motivation: The motivation is to better understand the brain's visual perception patterns and how current decoding models process semantic objects due to existing methods' limitations in semantic alignment leading to reconstruction distortions. Method: An experimental framework was developed using fMRI representations as intervention conditions. These representations were injected into multi-scale image features through cross-attention to compare performance on object detection and instance segmentation tasks with and without fMRI information. Result: The results showed improved accuracy in downstream detection and segmentation tasks when fMRI signals were incorporated, demonstrating that fMRI contains rich multi-object semantic cues and coarse spatial localization information not fully exploited by current models. Conclusion: Incorporating fMRI signals into models enhances the accuracy of detection and segmentation, indicating that fMRI contains valuable multi-object semantic cues. Abstract: Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. 
By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.
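
The injection step described above can be pictured as cross-attention in which image features act as queries and fMRI representations act as keys and values. The sketch below is a simplified illustration under several assumptions that do not come from the paper: a single head, no learned projection matrices, image and fMRI tokens of equal dimension, and a residual addition as the injection.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_inject(image_tokens, fmri_tokens):
    """Each image token attends over all fMRI tokens; the attended fMRI
    vector is added back to the image token (residual injection)."""
    d = len(fmri_tokens[0])
    out = []
    for q in image_tokens:
        # Scaled dot-product score between this image token and every fMRI token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in fmri_tokens]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, fmri_tokens))
                    for i in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out
```

Running a detector on `image_tokens` and on the output of `cross_attention_inject` would give the "without fMRI" and "with fMRI" conditions that the abstract compares.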

[47] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation

JaeHyuck Choi,MinJun Kim,JeHyeong Hong

Main category: cs.CV

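TL;DR: MAGIC is a few-shot anomaly generation method for industrial quality control that adapts a diffusion model to jointly achieve background preservation, mask alignment, and semantic validity, and it outperforms prior methods.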

Details Motivation: Existing methods cannot simultaneously preserve the normal background, tightly cover the mask region, and generate semantically valid anomalies. Method: The paper proposes MAGIC, a diffusion-based method that includes multi-level perturbation strategies and a context-aware mask alignment module. Result: On the MVTec-AD dataset, MAGIC outperforms existing state-of-the-art methods in downstream anomaly tasks. Conclusion: MAGIC provides effective augmentation of anomaly data for industrial quality inspection, reconciling background preservation, mask alignment, and semantic validity. Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC--Mask-guided inpainting with multi-level perturbations and Context-aware alignment--to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. 
Under a consistent evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.

[48] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

Zecheng Zhao,Selena Song,Tong Chen,Zhi Chen,Shazia Sadiq,Yadan Luo

Main category: cs.CV

TL;DR: This paper introduces SynTVA, a new dataset and benchmark framework for evaluating how synthetic videos perform in text-to-video retrieval.

Details Motivation: Current text-to-video synthesis metrics focus mainly on visual quality and temporal consistency, offering little insight into performance on downstream tasks such as text-to-video retrieval. Method: Synthetic videos are generated from 800 diverse user queries drawn from the MSRVTT training split, and each video-text pair is annotated along four semantic alignment dimensions. The evaluation correlates existing video quality assessment metrics with these alignment scores and examines their predictive power for downstream tasks. Result: The authors develop the SynTVA dataset and an Auto-Evaluator model; results show that SynTVA can effectively support dataset augmentation and improve text-to-video retrieval. Conclusion: SynTVA is a new dataset and benchmark for evaluating the utility of synthetic videos in text-to-video retrieval, and results show that it can effectively improve TVR performance. Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.

[49] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Nina Konovalova,Maxim Nikolaev,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

TL;DR: InnerControl enhances spatial control in text-to-image diffusion models by maintaining consistency across all diffusion stages, leading to better generation quality and control accuracy.

Details Motivation: Existing methods like ControlNet and ControlNet++ provide spatial control but neglect intermediate generation stages, limiting their effectiveness. This work aims to enhance control fidelity and generation quality by addressing these limitations. Method: InnerControl introduces lightweight convolutional probes to reconstruct input control signals from intermediate UNet features at every denoising step, and applies an alignment loss throughout the entire diffusion process. Result: InnerControl demonstrates improved control fidelity and image generation quality by efficiently extracting control signals from noisy latents and minimizing discrepancies between predicted and target conditions during the diffusion process. Conclusion: InnerControl improves spatial control and generation quality in text-to-image diffusion models by enforcing consistency across all diffusion steps, and achieves state-of-the-art performance when combined with existing techniques like ControlNet++. Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. 
By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).

[50] Neural Network-based Study for Rice Leaf Disease Recognition and Classification: A Comparative Analysis Between Feature-based Model and Direct Imaging Model

Farida Siddiqi Prity,Mirza Raquib,Saydul Akbar Murad,Md. Jubayar Alam Rafi,Md. Khairul Bashar Bhuiyan,Anupam Kumar Bairagi

Main category: cs.CV

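TL;DR: This paper proposes artificial neural network-based image-processing techniques for timely classification and recognition of rice leaf diseases. It compares the Feature Analysis Detection Model (FADM) with the Direct Image-Centric Detection Model (DICDM), and the results show that FADM performs better.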

Details Motivation: Rice leaf diseases significantly reduce productivity and cause economic losses, so early detection is needed for effective management and higher yields. Method: ANN-based image-processing techniques are proposed, and a comparative study of the Feature Analysis Detection Model (FADM) and the Direct Image-Centric Detection Model (DICDM) is conducted. The experiments use various feature extraction algorithms, dimensionality reduction algorithms, feature selection algorithms, and an Extreme Learning Machine (ELM). Result: Experiments show that the highest performance is obtained with the FADM. Conclusion: Adopting the proposed FADM for rice leaf disease detection has the potential to improve crop health, reduce yield losses, and enhance the overall productivity and sustainability of rice farming. Abstract: Rice leaf diseases significantly reduce productivity and cause economic losses, highlighting the need for early detection to enable effective management and improve yields. This study proposes Artificial Neural Network (ANN)-based image-processing techniques for timely classification and recognition of rice diseases. Despite the prevailing approach of directly inputting images of rice leaves into ANNs, there is a noticeable absence of thorough comparative analysis between the Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM), specifically when it comes to evaluating the effectiveness of Feature Extraction Algorithms (FEAs). Hence, this research presents initial experiments on the Feature Analysis Detection Model, utilizing various image Feature Extraction Algorithms, Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM). The experiments are carried out on datasets encompassing bacterial leaf blight, brown spot, leaf blast, leaf scald, sheath blight rot, and healthy leaf, using 10-fold cross-validation. A Direct Image-Centric Detection Model is established without the utilization of any FEA, and the evaluation of classification performance relies on different metrics. Ultimately, an exhaustive comparison is performed between the results of the Feature Analysis Detection Model and the Direct Image-Centric Detection Model in classifying rice leaf diseases. The results reveal that the highest performance is attained using the Feature Analysis Detection Model. 
The adoption of the proposed Feature Analysis Detection Model for detecting rice leaf diseases holds excellent potential for improving crop health, minimizing yield losses, and enhancing overall productivity and sustainability of rice farming.

[51] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection

Rafic Nader,Vincent L'Allinec,Romain Bourcier,Florent Autrusseau

Main category: cs.CV

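TL;DR: This paper proposes a method for automatically detecting Circle of Willis bifurcations, using a two-step neural network process that reduces missed detections and adapts to anatomical variability, with strong performance.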

Details Motivation: Intracranial aneurysms commonly occur in specific segments of the Circle of Willis, especially at thirteen major arterial bifurcations; accurate detection of these landmarks is essential for prompt and efficient diagnosis. Method: An object detection network first identifies regions of interest (ROIs) near the landmark locations, and a modified U-Net with deep supervision then precisely locates the bifurcations. Result: Experiments show that the method achieves the highest performance on the bifurcation detection task, particularly on data with variable and visually similar features. Conclusion: The paper presents a fully automated, two-step neural network method for detecting Circle of Willis bifurcations that achieves the highest performance on data with anatomical variability and complex visual characteristics. Abstract: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily at thirteen major arterial bifurcations. An accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural network process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is exploited to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our In-House dataset which had varying numbers of landmarks, and a public dataset with standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.

[52] Lightweight Shrimp Disease Detection Research Based on YOLOv8n

Fei Yuhuan,Wang Gengchen,Liu Fenghao,Zang Ran,Sun Xufei,Chang Hao

Main category: cs.CV

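TL;DR: This study develops an efficient shrimp disease detection model that combines a lightweight design with an improved attention mechanism, achieving higher accuracy with lower computational cost.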

Details Motivation: To prevent disease transmission and improve the efficiency of intelligent detection in shrimp farming. Method: An RLDD detection head and a C2f-EMCM module are designed to reduce computational complexity, and an improved SegNext_Attention self-attention mechanism is introduced to strengthen feature extraction. Result: Compared with the original YOLOv8n, parameters are reduced by 32.3% and mAP@0.5 reaches 92.7% (a 3% improvement over YOLOv8n); on the URPC2020 dataset, mAP@0.5 improves by 4.1%. Conclusion: The paper proposes a lightweight YOLOv8n-based network architecture for shrimp disease detection that reduces computational complexity while maintaining detection accuracy, and validates the model's robustness on a self-constructed dataset and the URPC2020 dataset. Abstract: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model's feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model's robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.

[53] Holistic Tokenizer for Autoregressive Image Generation

Anlin Zheng,Haochen Wang,Yucheng Zhao,Weipeng Deng,Tiancai Wang,Xiangyu Zhang,Xiaojuan Qi

Main category: cs.CV

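TL;DR: Hita is a new image tokenizer for autoregressive image generation that improves generation quality and training speed through a holistic-to-local tokenization scheme and an improved strategy for controlling information flow.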

Details Motivation: Vanilla autoregressive image generation models produce visual tokens step by step, which limits their ability to capture holistic relationships among token sequences. In addition, most visual tokenizers map local image patches to latent tokens, which limits global information. Method: Hita introduces learnable holistic queries and local patch tokens, and uses a lightweight fusion module before decoding to control information flow and prioritize holistic tokens. Result: Experiments show that Hita accelerates the training of AR generators and outperforms generators trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Conclusion: Hita is a novel image tokenizer for autoregressive image generation that improves alignment with the AR generation process through a holistic-to-local tokenization scheme and two key strategies. Abstract: The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita
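
The first strategy in the abstract above (holistic tokens placed first, patch tokens after, under causal attention) can be illustrated with a small sketch. This is only an illustration of the ordering and the mask, not Hita's code; `build_sequence` and `causal_mask` are hypothetical helper names.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Place holistic tokens at the start, followed by patch-level tokens."""
    return list(holistic_tokens) + list(patch_tokens)

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Two holistic tokens followed by three patch tokens (placeholder labels).
sequence = build_sequence(["h0", "h1"], ["p0", "p1", "p2"])
mask = causal_mask(len(sequence))
```

Because the holistic tokens occupy the earliest positions, every patch token can attend to all of them under the causal mask, while the holistic tokens never attend to patch tokens.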

[54] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

Jiahao Wu,Rui Peng,Jianbo Jiao,Jiayu Yang,Luyang Tang,Kaiqiang Xiong,Jie Liang,Jinbo Yan,Runling Liu,Ronggang Wang

Main category: cs.CV

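TL;DR: This paper proposes LocalDyGS, which decomposes complex dynamic scenes and decouples static and dynamic features to model both large-scale and fine-scale motion, substantially improving reconstruction of highly dynamic real-world scenes.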

Details Motivation: Because real-world motion is complex and highly dynamic, synthesizing dynamic videos at arbitrary viewpoints from multi-view inputs is challenging; existing methods based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion. Method: LocalDyGS has two parts: 1) decomposing a complex dynamic scene into local spaces defined by seeds; and 2) decoupling static and dynamic features within each local space, then combining them to generate Temporal Gaussians that model motion. Result: The method achieves performance competitive with state-of-the-art methods on several fine-scale datasets and is the first attempt to model larger and more complex highly dynamic scenes. Conclusion: By decomposing complex dynamic scenes and decoupling static and dynamic features, LocalDyGS models both large-scale and fine-scale motion effectively and performs well on highly dynamic real-world scene reconstruction. Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.

[55] UVLM: Benchmarking Video Language Model for Underwater World Understanding

Xizhe Xue,Yang Zhou,Dawei Yan,Ying Li,Haokui Zhang,Rong Xiao

Main category: cs.CV

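TL;DR: This paper introduces UVLM, a new underwater observation benchmark built through a collaborative approach and used to improve video language models' understanding of underwater environments.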

Details Motivation: Existing work focuses mainly on terrestrial scenarios and overlooks the demanding application needs of underwater observation. Method: UVLM is built through a collaborative approach combining human expertise and AI models, with data quality ensured from multiple perspectives. Result: A diverse dataset is constructed covering 419 classes of marine animals, a range of frame rates and resolutions, and various static plants and terrains, along with 20 distinct task types and challenging evaluation metrics. Conclusion: Experiments show that fine-tuning VidLMs on UVLM significantly improves underwater world understanding, while also showing potential for slight improvements on existing in-air VidLM benchmarks. Abstract: Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To overcome this gap, we introduce UVLM, an underwater observation benchmark that is built through a collaborative approach combining human expertise and AI models. To ensure data quality, we have conducted in-depth considerations from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. 
Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception Test. The dataset and prompt engineering will be released publicly.

[56] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

Seokyeong Lee,Sithu Aung,Junyong Choi,Seungryong Kim,Ig-Jae Kim,Junghyun Cho

Main category: cs.CV

TL;DR: This paper proposes a new pseudo-labeling framework for Monocular 3D object detection that uses video data and object point tracking to overcome existing limitations like domain-specific learning and reliance on shape information. It ensures reliable accuracy and scalability without requiring additional setups or data.

Details Motivation: Monocular 3D object detection faces challenges due to data scarcity from high annotation costs and 2D-to-3D ambiguity. Existing methods are limited by domain-specific learning or reliance on shape information from single observations. Method: A novel pseudo-labeling framework using video data and object point tracking to aggregate pseudo-LiDARs across temporally adjacent frames for 3D attribute extraction. Result: The method is robust to occlusion and does not require multi-view setup, additional sensors, camera poses, or domain-specific training, demonstrating reliable accuracy and strong scalability. Conclusion: The proposed pseudo-labeling framework is a practical and effective solution for M3OD, ensuring reliable accuracy and strong scalability. Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.

[57] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis

Byung Hyun Lee,Wongi Jeong,Woojae Han,Kyoungbun Lee,Se Young Chun

Main category: cs.CV

TL;DR: CoMEL improves continual multiple instance learning for localization in large-scale histopathological images using a novel framework with attention transformer, pseudo-labeling, and adaptation techniques.

Details Motivation: To explore adaptability in continual tasks with minimal forgetting for MIL, particularly for localization in large-scale histopathological images. Method: CoMEL incorporates GDAT for efficient instance encoding, BPPL for reliable pseudo-labeling, and OWLoRA to reduce forgetting in classification. Result: CoMEL outperforms prior methods by up to 11.00% in bag-level accuracy and 23.4% in localization accuracy under continual MIL settings. Conclusion: The proposed CoMEL framework effectively addresses the challenges of continual multiple instance learning with enhanced localization, demonstrating superior performance on WSI datasets. Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. 
Extensive experiments on three public WSI datasets demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in localization accuracy under the continual MIL setup.

[58] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Taehoon Kim,Jongwook Choi,Yonghyun Jeong,Haeun Noh,Jaejun Yoo,Seungryul Baek,Jongwon Choi

Main category: cs.CV

TL;DR: This paper proposes a deepfake video detection method that exploits pixel-wise temporal inconsistencies, combining a 1D Fourier transform with an attention mechanism to improve detection.

Details Motivation: Traditional spatial frequency-based detectors overlook pixel-wise temporal inconsistencies, which makes temporal artifacts hard to detect. Method: A 1D Fourier transform is applied along the time axis of each pixel, and an attention proposal module and a joint transformer module are introduced to integrate spatio-temporal features. Result: The method captures temporal inconsistencies effectively and accurately locates regions containing temporal artifacts, improving detection accuracy. Conclusion: The paper presents a new deepfake video detection method that significantly improves performance in diverse and challenging detection scenarios. Abstract: We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
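
The core operation named in this abstract, a 1D Fourier transform along the time axis of each pixel, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list video layout `video[t][y][x]`, and the toy clip are all assumptions made for illustration, and a real pipeline would use an FFT library on tensors.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the 1D discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def pixelwise_temporal_spectrum(video):
    """Map video[t][y][x] to spectrum[y][x][k]: one temporal spectrum per pixel."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    return [
        [dft_magnitudes([video[t][y][x] for t in range(frames)]) for x in range(width)]
        for y in range(height)
    ]

# Hypothetical toy clip: 4 frames of 1x2 pixels.
# Pixel (0, 0) is constant over time; pixel (0, 1) flickers every frame.
clip = [[[0.5, 0.0]], [[0.5, 1.0]], [[0.5, 0.0]], [[0.5, 1.0]]]
spectrum = pixelwise_temporal_spectrum(clip)
```

In this toy clip the constant pixel has energy only in the zero-frequency bin, while the flickering pixel also has energy in the highest-frequency bin. A per-pixel temporal spectrum exposes exactly this kind of difference, which stacking per-frame spatial spectra would not.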

[59] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

Xiangfeng Wang,Xiao Li,Yadong Wei,Xueyu Song,Yang Song,Xiaoqiang Xia,Fangrui Zeng,Zaiyi Chen,Liu Liu,Gu Xu,Tong Xu

Main category: cs.CV

TL;DR: This paper proposes HIVE, a new automatic video editing framework that uses multimodal narrative understanding to improve clip quality, addresses existing methods' neglect of visual context, and introduces a new benchmark dataset, DramaAD.

Details Motivation: Rapidly growing online video content calls for efficient video editing techniques, yet existing automatic editing methods rely mainly on textual cues and neglect rich visual context, producing incoherent outputs. Method: A human-inspired automatic video editing framework (HIVE) uses multimodal large language models for character extraction, dialogue analysis, and narrative summarization, and applies scene-level segmentation to enhance coherence. Result: Experiments show that HIVE outperforms existing baselines on both general and advertisement-oriented editing tasks. Conclusion: HIVE improves the quality of automatic video editing through multimodal narrative understanding and significantly narrows the quality gap between automatically and human-edited videos. Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

[60] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation

Peilin Zhang,Shaouxan Wua,Jun Feng,Zhuo Jin,Zhizezhang Gao,Jingkun Chen,Yaqiong Xing,Xiao Zhang

Main category: cs.CV

TL;DR: This paper proposes TAB Net, a new framework that uses triplet augmentation self-recovery and boundary-aware pseudo-label supervision modules to outperform existing methods in scribble-based weakly supervised medical image segmentation.

Details Motivation: Acquiring large-scale, fully annotated medical image datasets is time-consuming and costly. Scribble annotations offer an efficient, cost-effective alternative, but their sparsity limits feature learning in the target region and provides insufficient boundary supervision. Method: TAB Net, a new weakly supervised medical image segmentation framework, comprises a triplet augmentation self-recovery (TAS) module and a boundary-aware pseudo-label supervision (BAP) module. Result: Experiments on two public datasets, ACDC and MSCMR seg, show that TAB Net significantly outperforms state-of-the-art scribble-based weakly supervised segmentation methods and achieves performance comparable to fully supervised methods. Conclusion: TAB Net significantly outperforms state-of-the-art scribble-based weakly supervised segmentation methods and is competitive with fully supervised methods. Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model's sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. 
Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.
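
Two of the three augmentations named in the abstract, cutout and jigsaw, are generic image operations that can be sketched in a few lines. The code below is only an illustration under assumptions, not the TAS module itself: the function names, the nested-list image layout, the fixed square mask, and the requirement that the image side be divisible by `grid` are all hypothetical choices.

```python
import random

def cutout(image, top, left, size):
    """Return a copy of the image with a size x size square set to zero."""
    out = [row[:] for row in image]
    for y in range(top, min(top + size, len(out))):
        for x in range(left, min(left + size, len(out[0]))):
            out[y][x] = 0
    return out

def jigsaw(image, grid, rng):
    """Split a square image into grid x grid tiles and shuffle the tiles.

    Assumes the image side length is divisible by `grid`.
    """
    tile = len(image) // grid
    tiles = [
        [row[c * tile:(c + 1) * tile] for row in image[r * tile:(r + 1) * tile]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(tiles)
    out = []
    for r in range(grid):
        for y in range(tile):
            # Stitch row y of each tile in tile-row r back into one image row.
            out.append([v for c in range(grid) for v in tiles[r * grid + c][y]])
    return out

image = [[4 * y + x + 1 for x in range(4)] for y in range(4)]  # values 1..16
masked = cutout(image, top=1, left=1, size=2)
shuffled = jigsaw(image, grid=2, rng=random.Random(0))
```

Cutout removes local evidence while leaving the rest of the image in place, whereas jigsaw keeps every pixel value but breaks the global layout. That contrast matches the abstract's description of the two augmentations as targeting local structure and global anatomical layout respectively.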

[61] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

Ziqi Miao,Yi Ding,Lijun Li,Jing Shao

Main category: cs.CV

TL;DR: This paper introduces VisCo Attack, a new method for generating harmful responses from AI models by creating realistic visual-based contexts, showing significant success over existing methods.

Details Motivation: MLLMs' security vulnerabilities pose challenges; previous approaches lack realism. This work aims to create realistic jailbreak contexts focusing on visual input. Method: Proposed a novel visual-centric jailbreak setting and developed the VisCo Attack with four visual-focused strategies and toxicity obfuscation. Result: Achieved a toxicity score of 4.78 and an ASR of 85% on MM-SafetyBench against GPT-4o. Conclusion: VisCo Attack proves to be highly effective in triggering harmful responses from MLLMs, significantly outperforming the baseline. Abstract: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. 
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.

[62] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings

Mufhumudzi Muthivhi,Terence L. van Zyl

Main category: cs.CV

TL;DR: This paper explores self-supervised learning for wildlife re-identification, demonstrating that it can outperform traditional supervised methods without relying on annotated data.

Details Motivation: Current state-of-the-art models for wildlife re-identification rely on annotated datasets, which has led to the curation of large-scale wildlife datasets. This study investigates the use of self-supervised learning to reduce dependence on labeled data. Method: The study uses self-supervised learning (SSL), automatically extracting two distinct views of an individual from temporal image pairs in camera trap data without supervision. These pairs train a self-supervised model from a potentially endless stream of video data. Result: Self-supervised features outperform supervised features across all downstream wildlife tasks, including open-world scenarios and transfer learning. Conclusion: The study concludes that self-supervised models are more robust and outperform supervised models in wildlife re-identification tasks, even with limited data. Abstract: Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. 
The code is available here https://github.com/pxpana/SSLWildlife.

[63] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration

Ayantika Das,Moitreya Chaudhuri,Koushik Bhat,Keerthi Ram,Mihail Bota,Mohanasankar Sivaprakasam

Main category: cs.CV

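TL;DR: This paper proposes a method for learning a structured latent space in a diffusion auto-encoder and develops two diffusion-based unsupervised image restoration techniques, targeting tissue classification and image artifacts in brain images.

Details Motivation: Although diffusion models generate high-quality image samples, their sampling mechanism cannot extract image-specific semantic representations, which auto-encoders provide inherently. The paper therefore combines the strengths of diffusion models and auto-encoders to address latent space structuring and image restoration. Method: (1) A mechanism structures the latent space of a diffusion auto-encoder by regressing the positional information of patches from high-resolution images, in order to differentiate brain tissue types; (2) an unsupervised tear artifact restoration technique is built on neighborhood awareness; (3) an unsupervised JPEG artifact restoration technique uses the steerable noising and denoising capability of diffusion models at inference time. Result: (1) A diffusion auto-encoder whose latent space recognizes region-specific cellular patterns in brain images; (2) a neighborhood-aware unsupervised tear artifact restoration technique; (3) an unsupervised JPEG artifact restoration technique that exploits the steerability of diffusion models. Conclusion: By integrating an encoder with a diffusion model, the work establishes an auto-encoding framework that learns image-specific representations and offers a way to organize the latent space. The method is reported to perform well at recognizing region-specific cellular patterns in brain images and at restoring tear and JPEG artifacts. Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applicabilities, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. 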
Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

[64] A Novel Tuning Method for Real-time Multiple-Object Tracking Utilizing Thermal Sensor with Complexity Motion Pattern

Duong Nguyen-Ngoc Tran,Long Hoang Pham,Chi Dai Tran,Quoc Pham-Nam Ho,Huy-Hung Nguyen,Jae Wook Jeon

Main category: cs.CV

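TL;DR: This paper proposes a new method for pedestrian tracking in thermal imagery that optimizes a two-stage framework and tunes hyperparameters to achieve efficient, accurate real-time tracking without relying on complex models.

Details Motivation: Thermal sensors have an advantage in environments with low visibility or poor lighting, but their low-level feature representation makes it difficult to detect and track pedestrians accurately. An effective method is therefore needed to improve tracking performance in thermal images. Method: The paper proposes a two-stage optimization framework and selects the most suitable hyperparameters for each stage to maximize tracking performance. The method targets real-time tracking and avoids complex re-identification or motion models. Result: Experiments on the PBVS Thermal MOT dataset show that the method performs well across various thermal imaging conditions, demonstrating its effectiveness and robustness for real-world surveillance applications. Conclusion: The paper presents a new tuning method for pedestrian tracking in thermal imagery that achieves high-accuracy real-time tracking by optimizing a two-stage framework and its hyperparameters, without complex re-identification or motion models. Experiments show strong performance under varied thermal imaging conditions, making the method suitable for practical surveillance applications. Abstract: Multi-Object Tracking in thermal images is essential for surveillance systems, particularly in challenging environments where RGB cameras struggle due to low visibility or poor lighting conditions. Thermal sensors enhance recognition tasks by capturing infrared signatures, but a major challenge is their low-level feature representation, which makes it difficult to accurately detect and track pedestrians. To address this, the paper introduces a novel tuning method for pedestrian tracking, specifically designed to handle the complex motion patterns in thermal imagery. The proposed framework optimizes two stages, ensuring that each stage is tuned with the most suitable hyperparameters to maximize tracking performance. By fine-tuning hyperparameters for real-time tracking, the method achieves high accuracy without relying on complex reidentification or motion models. Extensive experiments on the PBVS Thermal MOT dataset demonstrate that the approach is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications.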

[65] Privacy-preserving Preselection for Face Identification Based on Packing

Rundong Xin,Taotao Wang,Jin Wang,Chonghe Zhao,Jing Wang

Main category: cs.CV

TL;DR: This paper proposes PFIP, a new method that uses a preselection mechanism and a packing technique to substantially speed up face retrieval in the ciphertext domain while preserving privacy.

Details Motivation: As ciphertext template libraries grow, face retrieval becomes increasingly time-consuming, so a more efficient scheme is needed. Method: A privacy-preserving preselection mechanism and a packing module are proposed to improve efficiency and flexibility. Result: Experiments show that PFIP achieves a 100% hit rate on the LFW and CASIA datasets, retrieves 1,000 ciphertext face templates within 300 milliseconds, and improves retrieval efficiency by nearly 50x. Conclusion: PFIP enables efficient face retrieval in the ciphertext domain while preserving the accuracy of the original model. Abstract: Face identification systems operating in the ciphertext domain have garnered significant attention due to increasing privacy concerns and the potential recovery of original facial data. However, as the size of ciphertext template libraries grows, the face retrieval process becomes progressively more time-intensive. To address this challenge, we propose a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP). PFIP incorporates an innovative preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Extensive experiments conducted on the LFW and CASIA datasets demonstrate that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Compared to existing approaches, PFIP achieves a nearly 50x improvement in retrieval efficiency.

[66] Determination Of Structural Cracks Using Deep Learning Frameworks

Subhasis Dasgupta,Jaydip Sen,Tuhina Halder

Main category: cs.CV

TL;DR: This paper presents a new approach combining residual U-Net models with ensemble learning that improves the automation and accuracy of structural crack detection, particularly on low-resolution images.

Details Motivation: Manual crack detection, especially by inexperienced personnel, is slow, inconsistent, and error-prone, which undermines the reliability of assessments. A more efficient and accurate automated solution is therefore needed. Method: The study uses several configurations of residual U-Net models and integrates them with a meta-model built from convolutional blocks, forming a new ensemble architecture intended to improve prediction efficiency. Result: The residual U-Net models outperform classic architectures such as SegNet and the traditional U-Net, particularly on low-resolution images, and the ensemble improves performance further, achieving the highest IoU and DICE scores. Conclusion: The study proposes an ensemble deep-learning architecture based on residual U-Net models that significantly improves the accuracy and efficiency of structural crack detection, demonstrates superior performance on low-resolution images, and offers a path toward automated structural defect monitoring systems. Abstract: Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble's performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving to be the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement paves the way for more reliable automated systems in structural defect monitoring tasks.
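
The two evaluation metrics named in this abstract have standard definitions for binary masks: IoU is |P ∩ T| / |P ∪ T| and the Dice coefficient is 2|P ∩ T| / (|P| + |T|). The sketch below computes both. The function name, the nested-list mask layout, and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact evaluation protocol is not given in the abstract.

```python
def iou_and_dice(pred, target):
    """IoU and Dice coefficient for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum = sum(1 for a in p if a)
    t_sum = sum(1 for b in t if b)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    return inter / union, 2 * inter / (p_sum + t_sum)

# Hypothetical 2x2 masks: the prediction marks two crack pixels, the label one.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
iou, dice = iou_and_dice(pred, target)
```

Dice is never lower than IoU for the same pair of masks, which is worth keeping in mind when comparing scores reported under the two metrics.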

[67] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars

Yiming Zhong,Xiaolin Zhang,Ligang Liu,Yao Zhao,Yunchao Wei

Main category: cs.CV

TL;DR: AvatarMakeup is a new makeup method for 3D virtual head avatars that uses a pretrained diffusion model and a coarse-to-fine optimization strategy to achieve high-quality, consistent makeup effects.

Details Motivation: Current 3D Gaussian editing methods fail to meet the basic requirements for realistic makeup effects: (1) a consistent appearance under drivable expressions; (2) identity preservation throughout the makeup process; (3) precise control over fine details. Method: AvatarMakeup uses a pretrained diffusion model to transfer makeup patterns from a single reference image and adopts a coarse-to-fine strategy consisting of a Coherent Duplication method and a Refinement Module. Result: Experiments show that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation. Conclusion: AvatarMakeup delivers high-quality, consistent avatar makeup and addresses the open problems in 3D head avatar makeup. Abstract: Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. 
Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

[68] F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

Wei Li,Jingyang Zhang,Lihao Liu,Guoan Wang,Junjun He,Yang Chen,Lixu Gu

Main category: cs.CV

TL;DR: This paper introduces F²TTA, a practical approach for adapting medical models to fragmented, unpredictable test data, using a novel prompt tuning framework with uncertainty-based masking and graph distillation.

Details Motivation: Existing TTA methods assume complete domain data arrival, which is impractical in clinical settings due to resource constraints and patient variability. Method: The proposed Image-level Disentangled Prompt Tuning (I-DiPT) framework includes image-invariant and image-specific prompts. It uses Uncertainty-oriented Masking (UoM) and Parallel Graph Distillation (PGD) to enhance knowledge extraction and reuse. Result: Experiments show that the proposed method outperforms existing TTA approaches in handling free-form domain fragments for breast cancer and glaucoma classification. Conclusion: The paper proposes an effective framework for free-form test-time adaptation in medical imaging scenarios, where data arrives unpredictably and in fragments. Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. 
To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.

[69] Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic

Sandro Costa Magalhães,Marco Almeida,Filipe Neves dos Santos,António Paulo Moreira,Jorge Dias

Main category: cs.CV

TL;DR: This paper explores deploying quantized artificial neural networks (ANNs) on FPGAs using the FINN architecture for faster inference in robotic object detection.

Details Motivation: Robots often slow down task execution because detection algorithms have limited processing speed. Existing frameworks such as Vitis-AI do not fully use the FPGA's programmable logic (PL), prompting the need for more efficient deployment methods. Method: The authors used the FINN architecture to deploy three quantized ANNs (MobileNet v1 with 4-bit quantization, CNV with 2-bit quantization, and CNV with 1-bit quantization, i.e. a BNN) onto an FPGA's PL. The models were trained on RG2C, a self-acquired open-access dataset. Result: MobileNet v1 performed best, achieving a 98% success rate and an inference speed of 6611 FPS. Conclusion: FPGAs can effectively accelerate quantized ANNs, making them suitable for real-time robotic applications such as attention mechanisms. Abstract: Robots usually slow down for scanning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This constrains robots while they execute tasks and explore, increasing task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed best, reaching a success rate of 98% and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.
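
To make the bit widths in this abstract concrete, the sketch below shows generic symmetric uniform quantization of floating-point values to 4-bit, 2-bit, and 1-bit (binary) levels. It is an illustration only and does not reproduce the quantizers used by FINN or by the authors. The function name, the `max_abs` clipping range, and the sign-based 1-bit rule are assumptions.

```python
def quantize_uniform(values, bits, max_abs):
    """Symmetric uniform quantization of floats to signed integer levels.

    For bits == 1 the values are binarized to -1 or +1 by sign, as in a
    binary neural network; otherwise levels lie in [-qmax, qmax].
    """
    if bits == 1:
        return [1 if v >= 0 else -1 for v in values]
    qmax = 2 ** (bits - 1) - 1      # e.g. 7 for 4 bits, 1 for 2 bits
    scale = max_abs / qmax          # real-valued step between adjacent levels
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

weights = [0.0, 0.4, 1.0, -1.0]     # hypothetical weights in [-1, 1]
q4 = quantize_uniform(weights, bits=4, max_abs=1.0)
q2 = quantize_uniform(weights, bits=2, max_abs=1.0)
q1 = quantize_uniform(weights, bits=1, max_abs=1.0)
```

Fewer bits mean fewer distinct levels, which in general trades accuracy for smaller and faster hardware.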

[70] IGDNet: Zero-Shot Robust Underexposed Image Enhancement via Illumination-Guided and Denoising

Hailong Yan,Junjian Huang,Tingwen Huang

Main category: cs.CV

TL;DR: The paper proposes IGDNet, a zero-shot enhancement method for restoring underexposed images that needs no paired data or training data and offers strong generalization and effective noise suppression.

Details Motivation: Existing methods for restoring underexposed images rely on supervised learning with paired underexposed and well-illuminated images, but collecting such datasets is often impractical in real-world scenarios, and these methods can over-enhance and distort well-illuminated regions. Method: The IGDNet framework contains a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, and the latter enhances non-uniformly illuminated regions using an illumination-guided pixel-adaptive correction method. Result: Extensive experiments on four public datasets show that IGDNet significantly improves visual quality under complex lighting conditions. Quantitatively, PSNR reaches 20.41 dB and SSIM reaches 0.860. Conclusion: IGDNet significantly improves visual quality under complex lighting conditions and outperforms 14 state-of-the-art unsupervised methods on metrics such as PSNR and SSIM. Abstract: Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. To address these issues, we propose IGDNet, a Zero-Shot enhancement method that operates solely on a single test image, without requiring guiding priors or training data. IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. A noise pair is generated through downsampling and refined iteratively to produce the final result. Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. The code will be released soon.
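
For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 · log10(MAX² / MSE), measured in decibels, where MAX is the largest possible pixel value and MSE is the mean squared error against a reference image. The sketch below is a minimal illustration of that formula. The function name, the nested-list grayscale layout, and the toy pixel values are assumptions and have no connection to the paper's data.

```python
import math

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_errors = [
        (r - t) ** 2
        for ref_row, test_row in zip(reference, test)
        for r, t in zip(ref_row, test_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 1x2 images with pixel values in [0, 1].
reference = [[0.0, 1.0]]
enhanced = [[0.1, 0.9]]
score = psnr(reference, enhanced)
```

Here every pixel is off by 0.1, so the MSE is 0.01 and the score comes out at roughly 20 dB. Because the scale is logarithmic, each further 10 dB corresponds to a tenfold reduction in MSE.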

[71] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

Weiwei Duan,Luping Ji,Shengjia Chen,Sicheng Zhu,Jianghong Huang,Mao Ye

Main category: cs.CV

TL;DR: This paper proposes a weakly-supervised contrastive learning (WeCoL) scheme for moving infrared small target detection that requires only simple target quantity prompts for training, and shows on two public datasets that it outperforms early fully-supervised methods.

Details Motivation: Moving infrared small target detection is challenging because targets are tiny and contrast with the background is weak. Existing fully-supervised methods depend on large amounts of manual annotation, which is costly and time-consuming, so a non-fully supervised strategy is needed to reduce annotation requirements. Method: A potential target mining strategy based on the pretrained segment anything model (SAM) is combined with contrastive learning and a long-short term motion-aware learning scheme to improve pseudo-label reliability and to model local and global motion patterns. Result: On the DAUB and ITSDT-15K datasets, the method outperforms early fully-supervised methods and reaches over 90% of the performance of state-of-the-art fully-supervised methods. Conclusion: The proposed weakly-supervised contrastive learning scheme reduces dependence on large annotated datasets and performs well on infrared small target detection. Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential for reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, which requires only simple target quantity prompts during model training. Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation. Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace. Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets. The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Its performance could even reach over 90\% of state-of-the-art (SOTA) fully-supervised ones.

[72] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

Gaochao Song,Zibo Zhao,Haohan Weng,Jingbo Zeng,Rongfei Jia,Shenghua Gao

Main category: cs.CV

TL;DR: Mesh Silksong is a novel, efficient mesh representation that improves both the geometric integrity of generated meshes and compression efficiency.

Details Motivation: Existing mesh tokenization methods produce sequences with repeated vertex tokens, wasting network capacity, so a more efficient tokenization method is needed. Method: The paper introduces Mesh Silksong, a new mesh representation that generates polygon meshes auto-regressively in a manner akin to silk weaving. Result: The method reduces token sequence redundancy by 50% and achieves a state-of-the-art compression rate of approximately 22%. The generated polygon meshes also have superior geometric properties. Conclusion: Mesh Silksong achieves a more efficient mesh representation by reducing redundancy and shows significant improvements in geometric integrity. Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence's redundancy by 50\%, and achieves a state-of-the-art compression rate of approximately 22\%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.

[73] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios

Teng Fu,Yuwen Chen,Zhuofan Chen,Mengyang Zhao,Bin Li,Xiangyang Xue

Main category: cs.CV

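TL;DR: This paper introduces CrowdTrack, a new large-scale multi-pedestrian tracking dataset focused on complex real-world scenes, intended to help improve existing tracking algorithms.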

Details Motivation: Existing MOT datasets fall short of research needs in complex scenes because they typically have relatively simple scene composition and non-realistic scenarios, so a harder, large-scale dataset is needed to advance multi-pedestrian tracking. Method: The authors propose the CrowdTrack dataset, shot mainly from a first-person view and covering complex real-life scenes, analyze it comprehensively, and test multiple SOTA models on it. Result: CrowdTrack contains 33 videos with a total of 5,185 trajectories, and each object is annotated with a complete bounding box and a unique object ID. Tests of existing models show how challenging the dataset is and establish it as a platform for future algorithm development. Conclusion: The paper presents CrowdTrack, a difficult large-scale dataset for multi-pedestrian tracking that aims to foster algorithms that remain effective in complex scenes, and releases the dataset and project code. Abstract: Multi-object tracking is a classic field in computer vision. Within it, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from the annotated data is the simplest solution, the existing MOT datasets fail to support this solution. Existing datasets mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing datasets do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset.
Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack .

[74] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

Zunhui Xia,Hongxing Li,Libin Lan

Main category: cs.CV

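TL;DR: This paper proposes MedFormer, an efficient vision transformer for a range of medical image recognition tasks, which significantly improves performance and efficiency through a novel Dual Sparse Selection Attention mechanism.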

Details Motivation: Existing vision transformer-based methods for medical recognition tasks are often task-specific and architecture-tailored, and they either incur high computational cost or rely on handcrafted sparse attention, so a more general and efficient solution is needed. Method: The paper proposes MedFormer, a new medical vision transformer that uses a pyramid scaling structure as a general backbone for various medical image recognition tasks and introduces content-aware Dual Sparse Selection Attention (DSSA) to improve computational efficiency and robustness. Result: Extensive experiments on datasets from a variety of imaging modalities consistently show that MedFormer performs well on all of the medical image recognition tasks considered; the code is publicly available on GitHub. Conclusion: MedFormer is an efficient medical vision transformer that, by using a pyramid scaling structure and the Dual Sparse Selection Attention mechanism, improves performance on medical image recognition tasks while offering good generality and computational efficiency. Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers.
Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at https://github.com/XiaZunhui/MedFormer.

[75] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy

Luca Parolari,Andrea Cherubini,Lamberto Ballan,Carlo Biffi

Main category: cs.CV

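TL;DR: This paper proposes a new method for automated polyp counting in colonoscopy that introduces temporal awareness into feature learning and an improved clustering strategy, reducing the fragmentation rate by 2.2x compared to prior approaches.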

Details Motivation: Existing polyp counting methods rely mainly on self-supervised learning and neglect temporal relationships, which weakens tracklet feature learning and clustering. Method: The paper introduces a supervised contrastive loss that incorporates temporally-aware soft targets and improves tracklet clustering by adding a temporal adjacency constraint. Result: Compared with prior approaches, the fragmentation rate is reduced by 2.2x, with the improvement validated on publicly available datasets. Conclusion: The paper proposes a new method based on a supervised contrastive loss and a temporal adjacency constraint that achieves more robust polyp counting in colonoscopy and establishes a new state of the art. Abstract: Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
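The effect of a temporal adjacency constraint on tracklet clustering can be sketched as follows. This is an illustrative toy, not the authors' clustering procedure: the union-find grouping, the function names, and the `sim_threshold` and `max_gap` values are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def count_polyps(tracklets, sim_threshold=0.8, max_gap=100):
    """Group tracklets and return the number of groups.

    tracklets: list of (feature, start_frame, end_frame) tuples.
    Two tracklets are linked only if they look alike AND are close in time.
    """
    parent = list(range(len(tracklets)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tracklets)):
        for j in range(i + 1, len(tracklets)):
            feat_i, start_i, end_i = tracklets[i]
            feat_j, start_j, end_j = tracklets[j]
            # Frames between the end of the earlier tracklet and the start
            # of the later one (negative if they overlap in time).
            gap = max(start_i, start_j) - min(end_i, end_j)
            if gap <= max_gap and cosine(feat_i, feat_j) >= sim_threshold:
                parent[find(j)] = find(i)

    return len({find(i) for i in range(len(tracklets))})


# Hypothetical tracklets: the third looks identical to the first but
# appears thousands of frames later.
tracklets = [
    ((1.0, 0.0), 0, 50),
    ((0.9, 0.1), 60, 100),
    ((1.0, 0.0), 5000, 5050),
]
with_constraint = count_polyps(tracklets, max_gap=100)
without_constraint = count_polyps(tracklets, max_gap=float("inf"))
```

With the gap limit, the visually similar but temporally distant tracklet stays a separate polyp; without it, all three collapse into one, which is the kind of false re-association the abstract describes.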

[76] MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

Hyunsoo Son,Jeonghyun Noh,Suemin Jeon,Chaoli Wang,Won-Ki Jeong

Main category: cs.CV

TL;DR: This paper introduces MC-INR, a new framework for encoding complex multivariate scientific data using implicit neural representations on unstructured grids, achieving superior performance compared to existing approaches.

Details Motivation: Current INR-based methods suffer from inflexible representation, focus on single-variable data, and rely on structured grids, leading to degraded performance on complex real-world datasets. Method: The paper proposes MC-INR, which combines meta-learning and clustering for flexible encoding, introduces a residual-based dynamic re-clustering mechanism, and employs a branched layer to handle multivariate data effectively. Result: Experimental results show that MC-INR outperforms existing methods in scientific data encoding tasks. Conclusion: MC-INR is a novel framework that addresses the limitations of existing INR-based methods for encoding complex, multivariate scientific data on unstructured grids. Abstract: Implicit Neural Representations (INRs) are widely used to encode data as continuous functions, enabling the visualization of large-scale multivariate scientific simulation data with reduced memory usage. However, existing INR-based methods face three main limitations: (1) inflexible representation of complex structures, (2) primarily focusing on single-variable data, and (3) dependence on structured grids. Thus, their performance degrades when applied to complex real-world datasets. To address these limitations, we propose a novel neural network-based framework, MC-INR, which handles multivariate data on unstructured grids. It combines meta-learning and clustering to enable flexible encoding of complex structures. To further improve performance, we introduce a residual-based dynamic re-clustering mechanism that adaptively partitions clusters based on local error. We also propose a branched layer to leverage multivariate data through independent branches simultaneously. Experimental results demonstrate that MC-INR outperforms existing methods on scientific data encoding tasks.

[77] Automatic Labelling for Low-Light Pedestrian Detection

Dimitrios Bouzoulas,Eerik Alamikkotervo,Risto Ojala

Main category: cs.CV

TL;DR: This paper proposes an automated infrared-RGB labeling pipeline that uses infrared data to generate training labels for low-light RGB pedestrian detection; models trained on the generated labels outperformed those trained on ground-truth labels in 6 out of 9 cases.

Details Motivation: Low-light conditions pose a significant challenge for RGB pedestrian detection due to the lack of large public datasets addressing this scenario. This research aims to overcome this limitation by leveraging infrared data to generate reliable labels for low-light RGB images. Method: An automated infrared-RGB labeling pipeline was developed, involving infrared detection using a fine-tuned model, label transfer from infrared to RGB images, and training object detection models with the generated labels for low-light RGB pedestrian detection. Result: Models trained on the generated labels outperformed those trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics when evaluated on a previously unseen image sequence. Conclusion: The proposed infrared-RGB labeling pipeline effectively improves low-light RGB pedestrian detection performance, with models trained on generated labels outperforming those trained on ground-truth labels in most evaluation cases. Abstract: Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection that does not appear to have large public datasets is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels.
When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
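For readers unfamiliar with the metrics above, the following minimal sketch shows the intersection-over-union (IoU) test that underlies them: under mAP@50 a detection can match a ground-truth box only if their IoU is at least 0.5, and mAP@50-95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The box format `(x1, y1, x2, y2)` and the example boxes are assumptions for illustration, not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth box and two candidate detections.
ground_truth = (0, 0, 10, 10)
loose_detection = (5, 0, 15, 10)   # overlaps half of the ground truth
tight_detection = (2, 0, 12, 10)   # overlaps most of the ground truth

loose_matches_at_50 = iou(ground_truth, loose_detection) >= 0.5
tight_matches_at_50 = iou(ground_truth, tight_detection) >= 0.5

# IoU thresholds averaged over by mAP@50-95.
thresholds_50_95 = [0.50 + 0.05 * k for k in range(10)]
```

The loose box has IoU 1/3 and would not count as a match at the 0.5 threshold, while the tight box has IoU 2/3 and would.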

[78] Detecting Multiple Diseases in Multiple Crops Using Deep Learning

Vivek Yadav,Anugrah Jain

Main category: cs.CV

TL;DR: This paper proposes a deep learning model for early detection of crop diseases across diverse Indian agriculture, achieving high accuracy and broader coverage than existing methods.

Details Motivation: India's agrarian economy suffers significant crop losses due to diseases, pests, and environmental stress. Early detection is crucial for improving yield and food security. Method: A deep learning-based solution was developed using a unified dataset of images from 17 crops and 34 diseases to detect agricultural diseases. Result: The model achieved 99% accuracy on the unified dataset, 7 percent higher than the state of the art, while covering more crops and diseases. Conclusion: The paper concludes that the proposed deep learning model effectively detects multiple diseases across various crops, offering a promising solution for Indian farmers. Abstract: India, as a predominantly agrarian economy, faces significant challenges in agriculture, including substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases across different crops are critical for improving yield and ensuring food security. This paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, aiming to cover India's diverse agricultural landscape. We first create a unified dataset encompassing images of 17 different crops and 34 different diseases from various available repositories. The proposed deep learning model is trained on this dataset and outperforms the state-of-the-art in terms of accuracy and the number of crops and diseases covered. We achieve a significant detection accuracy, i.e., 99 percent for our unified dataset, which is 7 percent more than the state-of-the-art, which handles only 14 crops and 26 different diseases. By increasing the number of crops and types of diseases that can be detected, the proposed solution aims to provide a better product for Indian farmers.

[79] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning

Abiam Remache González,Meriem Chagour,Timon Bijan Rüth,Raúl Trapiella Cañedo,Marina Martínez Soler,Álvaro Lorenzo Felipe,Hyun-Suk Shin,María-Jesús Zamorano Serrano,Ricardo Torres,Juan-Antonio Castillo Parra,Eduardo Reyes Abad,Miguel-Ángel Ferrer Ballester,Juan-Manuel Afonso López,Francisco-Mario Hernández Tejera,Adrian Penate-Sanchez

Main category: cs.CV

TL;DR: This paper presents IMASHRIMP, an automated system for shrimp morphological analysis that improves genetic selection accuracy and sustainability in aquaculture.

Details Motivation: To optimize genetic selection tasks in aquaculture by automating morphological analysis of white shrimp through modified deep learning and computer vision techniques. Method: IMASHRIMP incorporates two discrimination modules based on a modified ResNet-50 architecture, a pose estimation module adapted from VitPose, and a morphological regression module using an SVM model. Result: The system reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. It achieves a mAP of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. Conclusion: IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. Abstract: This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A "two-factor authentication (human and AI)" system is proposed, which reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units.
Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public

[80] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang,Sicheng Xu,Yue Dong,Yu Deng,Jianfeng Xiang,Zelong Lv,Guangzhong Sun,Xin Tong,Jiaolong Yang

Main category: cs.CV

TL;DR: MoGe-2 improves upon the MoGe model to recover metric-scale 3D geometry from single images, combining high relative geometry accuracy with fine-grained detail through a novel data refinement strategy.

Details Motivation: The motivation was to overcome the limitations of existing monocular geometry estimation approaches, particularly in achieving both metric scale accuracy and detailed geometry reconstruction. Method: The authors built upon the MoGe method, extending it for metric geometry prediction while maintaining affine-invariant accuracy. They also developed a unified data refinement approach using sharp synthetic labels to enhance detail and reduce noise. Result: The proposed MoGe-2 model successfully achieves superior performance in recovering accurate relative geometry, metric scale precision, and fine-grained details, which no prior methods have achieved simultaneously. Conclusion: MoGe-2 is a highly effective open-domain geometry estimation model that outperforms previous methods in delivering accurate relative geometry, precise metric scale, and fine-grained detail recovery. Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. 
We train our model on a large corpus of mixed datasets and conduct comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.

[81] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

Buzhen Huang,Chen Li,Chongyang Xu,Dongyue Lu,Jinnan Chen,Yangang Wang,Gim Hee Lee

Main category: cs.CV

TL;DR: This paper proposes a novel dual-branch optimization framework using appearance cues and proxemics priors to improve pose estimation for close human interactions in complex scenarios.

Details Motivation: Existing human pose estimation methods struggle with visual ambiguities and inter-person occlusions in in-the-wild videos, leading to implausible interaction reconstructions. Even large foundation models cannot accurately distinguish human semantics in these challenging cases. Method: A diffusion model is trained to learn human proxemic behavior and pose priors. This, along with two optimizable tensors, is integrated into a dual-branch optimization framework incorporating constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations. Result: The method achieves superior performance over existing approaches on multiple benchmarks and reconstructs accurate interactive motions with plausible body contacts. A new dataset with pseudo ground-truth interaction annotations is also introduced. Conclusion: The proposed dual-branch optimization framework, combined with proxemics prior and diverse constraints, can accurately estimate human interactions from in-the-wild videos, outperforming existing methods. Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances.
Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.

[82] Parametric shape models for vessels learned from segmentations via differentiable voxelization

Alina F. Dima,Suprosanna Shit,Huaqi Qiu,Robbie Holland,Tamara T. Mueller,Fabio Antonio Musio,Kaiyuan Yang,Bjoern Menze,Rickmer Braren,Marcus Makowski,Daniel Rueckert

Main category: cs.CV

TL;DR: This paper proposes a differentiable framework that unifies voxel, mesh, and parametric vessel representations, enabling accurate and flexible modeling of complex vascular structures without needing explicit shape labels.

Details Motivation: Current vessel modeling techniques rely on separate representations—voxels, meshes, and parametric models—limiting their integration and joint optimization. The motivation is to unify these representations under a differentiable framework to enhance accuracy and flexibility in modeling complex vascular structures. Method: The method uses differentiable voxelization to extract parametric shape models via shape-to-segmentation fitting. Vessels are parametrized using cubic B-splines for centerlines and radii, ensuring smoothness, while meshes are derived from the learned parameters through differentiable extraction. Result: The approach accurately captures vessel geometry across diverse anatomical structures (e.g., aortas, aneurysms, brain vessels) with high-fidelity meshes that can be manipulated post-fit, demonstrating its effectiveness through volumetric fit experiments. Conclusion: The proposed framework successfully integrates voxel, mesh, and parametric representations of vessels through differentiable transformations, allowing for accurate geometric modeling and high-fidelity mesh generation without reliance on explicit ground-truth shape parameters. Abstract: Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. 
The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
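The parametrization of centerline and radius by cubic B-splines can be sketched with one spline segment. This sketch assumes uniform knots and control points stored as `(x, y, z, radius)` tuples; both choices, and the function names, are illustrative assumptions, since the abstract states only that cubic B-splines are used.

```python
def bspline_basis(t):
    """Uniform cubic B-spline basis weights for local parameter t in [0, 1]."""
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )


def eval_segment(ctrl, t):
    """Evaluate one segment; ctrl is four control points (x, y, z, radius)."""
    weights = bspline_basis(t)
    return tuple(
        sum(w * point[k] for w, point in zip(weights, ctrl)) for k in range(4)
    )


# Hypothetical straight vessel along the x-axis with constant radius 1.
ctrl = [
    (0.0, 0.0, 0.0, 1.0),
    (1.0, 0.0, 0.0, 1.0),
    (2.0, 0.0, 0.0, 1.0),
    (3.0, 0.0, 0.0, 1.0),
]
segment_start = eval_segment(ctrl, 0.0)
segment_end = eval_segment(ctrl, 1.0)
```

Because the four weights are polynomials in `t` that always sum to one, the curve and the radius vary smoothly along the segment, which matches the abstract's point that smoothness and continuity hold by construction. Every operation is a differentiable function of the control points, which is what gradient-based fitting requires.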

[83] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning

Tan Pan,Zhaorui Tan,Kaiyu Guo,Dongli Xu,Weidi Xu,Chen Jiang,Xin Guo,Yuan Qi,Yuan Cheng

Main category: cs.CV

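TL;DR: This study proposes S^2DC, a new self-supervised learning framework for medical images that strengthens semantic discrepancy and consistency to better capture variations in anatomical structures, achieving better results across multiple tasks and datasets.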

Details Motivation: 3D medical image self-supervised learning (mSSL) needs to account for variations in the location, scale, and morphology of anatomical structures in order to capture meaningful distinctions. Existing mSSL methods, however, usually partition images into fixed-size patches and ignore these structural variations. Method: The paper proposes an mSSL framework named S^2DC that achieves structure-aware semantic discrepancy and consistency in two steps: first, it uses an optimal transport strategy to increase the semantic discrepancy between different patches; second, it improves semantic consistency at the structural level based on the neighborhood similarity distribution. Result: By bridging patch-level and structure-level representations, S^2DC obtains structure-aware representations and shows superior performance across a variety of tasks and datasets. Conclusion: Evaluated across 10 datasets, 4 tasks, and 3 modalities, S^2DC outperforms existing state-of-the-art methods in medical image self-supervised learning. Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.

[84] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu,Enxin Song,Wenhao Chai,Xuexiang Wen,Tian Ye,Gaoang Wang

Main category: cs.CV

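TL;DR: AuroraLong proposes an efficient linear RNN approach to long video understanding that maintains strong performance while reducing compute and memory requirements.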

Details Motivation: Because the memory and computation required by Transformer-based LLMs grow quadratically with input sequence length, long video understanding suffers from high computational complexity and heavy memory consumption. This motivated the development of AuroraLong, which aims to lower the computational barrier. Method: AuroraLong replaces the LLM component in MLLMs with a linear RNN so that it can handle input sequences of arbitrary length. It further improves throughput and efficiency by reordering visual tokens by size in ascending order and combining this with visual token merging. Result: Although AuroraLong has only 2B parameters and is trained exclusively on public data, it achieves performance on multiple video benchmarks comparable to Transformer-based models of similar size trained on private data. Conclusion: By replacing the conventional Transformer-based LLM component with a linear RNN, AuroraLong effectively reduces the compute and memory cost of long video understanding. Its performance is comparable to similarly sized Transformer models trained on proprietary data, showing the potential of linear RNNs for open-ended video understanding. Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
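The token reordering step described in the abstract can be sketched in a few lines. The sketch assumes that, after token merging, each visual token carries a "size" equal to the number of original patches merged into it; that interpretation, the function name, and the example values are assumptions for illustration rather than details confirmed by the abstract.

```python
def reorder_by_size(tokens, sizes):
    """Return tokens and sizes sorted by size in ascending order.

    sorted() is stable, so tokens with equal size keep their original
    relative order.
    """
    order = sorted(range(len(tokens)), key=lambda i: sizes[i])
    return [tokens[i] for i in order], [sizes[i] for i in order]


# Hypothetical merged tokens: "a" absorbed 3 patches, "b" 1, "c" 2, "d" 1.
tokens = ["a", "b", "c", "d"]
sizes = [3, 1, 2, 1]
ordered_tokens, ordered_sizes = reorder_by_size(tokens, sizes)
```

Under this reading, the tokens that summarize the most patches are fed to the recurrent model last; whether that is the authors' motivation for the ordering is not stated in the abstract.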

[85] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development

Riccardo Gallon,Fabian Schiemenz,Alessandra Menicucci,Eberhard Gill

Main category: cs.CV

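TL;DR: This paper addresses the problem of sensor faults undermining mission reliability in vision-based navigation, using simulation to generate faulty image data that supports AI training.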

Details Motivation: Vision-Based Navigation (VBN) algorithms are critical in space missions, but their reliability is challenged by sensor faults. Traditional fault detection methods have limitations, so AI-based solutions need to be explored. Method: The study focuses on an interplanetary exploration mission scenario, analyzes the fault cases that can occur in camera sensors, and builds a simulation framework to synthesize faulty images. Result: The study systematically analyzes the impact of camera sensor faults, presents a simulation framework for generating faulty images, and creates a dataset of fault-injected images for training and testing AI-based fault detection algorithms. Conclusion: The paper highlights the important role of AI in vision-based navigation and uses a simulation framework to generate faulty data that supports AI training and testing, providing a valuable data resource for follow-up research. Abstract: The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.

[86] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Ziyin Zhou,Yunpeng Luo,Yuanchen Wu,Ke Sun,Jiayi Ji,Ke Yan,Shouhong Ding,Xiaoshuai Sun,Yunsheng Wu,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces AIGI-Holmes, a novel AI-generated image detection model supported by the Holmes-Set dataset and Holmes Pipeline training framework, offering improved explainability and generalization to combat misinformation through realistic AI-generated images.

Details Motivation: The motivation behind this research is the increasing misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, which poses a threat to public information security. Current AIGI detection techniques face challenges related to the lack of human-verifiable explanations and poor generalization on the latest generation of AI-generated images. Method: The authors introduced a large-scale dataset called Holmes-Set, which includes two subsets: Holmes-SFTSet for instruction-tuning with explanations and Holmes-DPOSet for human-aligned preferences. They also proposed the Multi-Expert Jury method for efficient data annotation and the Holmes Pipeline, a three-stage training framework involving visual expert pre-training, supervised fine-tuning, and direct preference optimization. Additionally, they implemented a collaborative decoding strategy during inference to enhance generalization. Result: Extensive experiments on three benchmarks demonstrated the effectiveness of the proposed AIGI-Holmes model in detecting AI-generated images with high accuracy, while also generating human-verifiable and human-aligned explanations, thus addressing the limitations of existing methods. Conclusion: The study concludes that the proposed AIGI-Holmes model, along with the Holmes Pipeline and Holmes-Set dataset, effectively addresses the challenges of AIGI detection by generating human-verifiable explanations and improving generalization capabilities. Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. 
To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

[87] Learning few-step posterior samplers by unfolding and distillation of diffusion models

Charlesquin Kemajou Mbakam,Jonathan Spence,Marcelo Pereyra

Main category: cs.CV

TL;DR: This paper proposes a new framework that combines deep unfolding and model distillation to turn diffusion models (DMs) into conditional models for posterior sampling.

Details Motivation: Diffusion models have become powerful image priors in Bayesian computational imaging, but existing methods trade off accuracy against flexibility. Method: A Markov chain Monte Carlo (MCMC) algorithm is unrolled via deep unfolding and combined with model distillation to turn a diffusion model into an efficient conditional model. Result: Experiments show that the method achieves excellent accuracy and computational efficiency while retaining the ability to adapt to changes in the forward model. Conclusion: The proposed framework offers a new solution for applying diffusion models in Bayesian computational imaging, combining flexibility with efficiency. Abstract: Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm, specifically the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025), representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.

[88] APT: Adaptive Personalized Training for Diffusion Models with Limited Data

JungWoo Chae,Jiyoon Kim,JaeWoong Choi,Kyungyul Kim,Sangheum Hwang

Main category: cs.CV

TL;DR: This paper proposes the Adaptive Personalized Training (APT) framework to address overfitting and preserve prior knowledge when personalizing diffusion models with limited data.

Details Motivation: Personalizing diffusion models using limited data presents significant challenges such as overfitting, loss of prior knowledge, and degradation of text alignment. Method: The proposed Adaptive Personalized Training (APT) framework uses three components: Adaptive Training Adjustment, Representation Stabilization, and Attention Alignment for Prior Knowledge Preservation. Result: APT demonstrates effectiveness in mitigating overfitting, preserving prior knowledge, and generating high-quality, diverse images with limited reference data. Conclusion: APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data. Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. 
Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
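
As a rough illustration of the Representation Stabilization idea in the abstract above (regularizing the mean and variance of intermediate feature maps), the sketch below penalizes the squared gap between the statistics of a fine-tuned feature map and those of a pretrained one. The abstract does not give the exact loss, so the squared-difference form, the flattened-list representation of a feature map, and the name `stabilization_penalty` are assumptions made for illustration only.

```python
from statistics import fmean, pvariance


def stabilization_penalty(finetuned_feats, pretrained_feats):
    """Hypothetical penalty on drift in feature-map statistics.

    Both arguments are flat lists of activations taken from the same layer.
    This is an assumed form, not the loss used in the APT paper.
    """
    # Gap between the mean activations of the two models.
    mean_gap = fmean(finetuned_feats) - fmean(pretrained_feats)
    # Gap between the (population) variances of the activations.
    var_gap = pvariance(finetuned_feats) - pvariance(pretrained_feats)
    # Squared gaps, so the penalty is zero only when both statistics agree.
    return mean_gap ** 2 + var_gap ** 2
```

In a training loop, a term like this would be added to the denoising loss with a small weight; the weight and the choice of layers are further assumptions the abstract does not specify.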

[89] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

Xiangyang Luo,Ye Zhu,Yunfei Liu,Lijian Lin,Cong Wan,Zijian Cai,Shao-Lun Huang,Yu Li

Main category: cs.CV

TL;DR: CanonSwap improves video face swapping by separating facial motion and appearance, allowing for high-quality identity transfer while preserving dynamic facial attributes.

Details Motivation: Current methods struggle with maintaining the dynamic attributes of the target face (e.g., expressions, poses) while transferring identity, due to the coupling of appearance and motion in videos. Method: CanonSwap decouples motion-related information from appearance, modifies identity in a canonical space, and reintegrates the swapped features into the original video space. It also uses a Partial Identity Modulation module and introduces fine-grained synchronization metrics. Result: CanonSwap outperforms existing approaches in visual quality, temporal consistency, and identity preservation in video face swapping. Conclusion: CanonSwap provides a more effective solution for video face swapping by decoupling motion and appearance information, leading to better preservation of dynamic attributes while transferring identity. Abstract: Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. 
To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.

[90] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

Qi Xu,Dongxu Wei,Lingzhe Zhao,Wenpu Li,Zhangchi Huang,Shunping Ji,Peidong Liu

Main category: cs.CV

TL;DR: SIU3R is an alignment-free framework that enables simultaneous 3D reconstruction and understanding by leveraging pixel-aligned 3D representations and unified learnable queries, achieving state-of-the-art results.

Details Motivation: Recent approaches using a 2D-to-3D feature alignment paradigm have limitations in 3D understanding and may lose semantic information. This motivates the need for an alignment-free framework for better generalizable simultaneous understanding and 3D reconstruction. Method: SIU3R uses a pixel-aligned 3D representation to bridge reconstruction and understanding tasks and introduces unified learnable queries for native 3D understanding. It also incorporates lightweight modules to enhance task collaboration. Result: Extensive experiments show that SIU3R achieves superior performance on individual tasks like 3D reconstruction and understanding, as well as on their simultaneous execution, highlighting its effectiveness. Conclusion: The proposed SIU3R framework demonstrates state-of-the-art performance on both individual and simultaneous 3D reconstruction and understanding tasks, showcasing the benefits of an alignment-free approach. Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need of alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

[91] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

Qin Guo,Ailing Zeng,Dongxu Yue,Ceyuan Yang,Yang Cao,Hanzhong Guo,Fei Shen,Wei Liu,Xihui Liu,Dan Xu

Main category: cs.CV

TL;DR: This paper introduces UniMC, a DiT-based framework, and HAIG-2.9M, a new dataset, to improve keypoint-guided image generation for multi-class and occluded scenarios involving both humans and animals.

Details Motivation: The motivation stems from the limitations of existing keypoint-guided Text-to-Image diffusion models in controlling the generation of non-rigid objects beyond humans (e.g., animals) and generating multiple overlapping humans and animals solely based on keypoint controls. Method: The authors designed a DiT-based framework named UniMC to unify controllable multi-class image generation by integrating instance- and keypoint-level conditions into compact tokens. They also introduced a new large-scale dataset called HAIG-2.9M containing extensive annotations for both humans and animals. Result: Extensive experiments demonstrated the high quality of the HAIG-2.9M dataset and the effectiveness of the UniMC framework, particularly in handling complex cases like heavy occlusions and multi-class scenarios. Conclusion: The paper concludes that the proposed UniMC framework and HAIG-2.9M dataset effectively address challenges in keypoint-guided image generation for multi-class scenarios, especially those involving occlusions and multiple overlapping instances. Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. 
This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

[92] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Yuxuan Wang,Tianwei Cao,Huayu Zhang,Zhongjiang He,Kongming Liang,Zhanyu Ma

Main category: cs.CV

TL;DR: This paper proposes FairHuman, a method for improving text-to-image generation of human images by enhancing both global and local detail quality through multi-objective optimization.

Details Motivation: Generating realistic human images with detailed features like faces and hands remains challenging due to insufficient supervision of local regions during training. Method: FairHuman constructs three learning objectives: a global objective from the default diffusion objective and two local objectives for hands and faces using pre-annotated positional priors. It uses the Minimum Potential Delay (MPD) criterion to derive an optimal parameter updating strategy for fair multi-objective optimization. Result: Extensive experiments show that FairHuman significantly improves the performance of human image generation in different scenarios, particularly in generating challenging local details. Conclusion: The proposed FairHuman approach enhances the generation of challenging local details in human images while maintaining overall quality, demonstrating effectiveness across various scenarios. Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. 
Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.

[93] Prompt learning with bounding box constraints for medical image segmentation

Mélanie Gaillochet,Mehrdad Noori,Sahar Dastani,Christian Desrosiers,Hervé Lombaert

Main category: cs.CV

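TL;DR: This paper introduces a new method that uses bounding box annotations to automatically prompt foundation models and combines this with weakly supervised learning to make medical image segmentation more efficient, performing strongly with limited data.

Details Motivation: Pixel-wise annotations are laborious and costly to obtain in the medical domain, so a weakly supervised approach based on bounding box annotations, which are much easier to acquire, is needed to reduce this burden. Method: The method uses bounding box annotations to automate prompt generation for foundation models, and integrates multiple constraints derived from the box annotations with pseudo-labels generated by the prompted foundation model. Result: Experiments show that the proposed weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully supervised and weakly supervised methods. Conclusion: The paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. It generates prompts for foundation models automatically using only bounding box annotations and achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully supervised and weakly supervised methods. The code is publicly available. Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git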
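
For readers unfamiliar with the metric quoted above, the Dice score between a predicted mask and a ground-truth mask is conventionally 2|A ∩ B| / (|A| + |B|). The sketch below computes it for binary masks stored as nested lists. It is a generic illustration of the standard definition, not code from the paper's repository; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    # |A ∩ B|: pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    # |A| + |B|: total foreground pixels across the two masks.
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

An average Dice score such as the one reported would then be the mean of this value over the evaluated images or structures; how the paper aggregates it is not stated in the abstract.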

[94] DexVLG: Dexterous Vision-Language-Grasp Model at Scale

Jiawei He,Danshi Li,Xinqiang Yu,Zekun Qi,Wenyao Zhang,Jiayi Chen,Zhaoxiang Zhang,Zhizheng Zhang,Li Yi,He Wang

Main category: cs.CV

TL;DR: This paper presents DexVLG, a model trained on the large-scale dexterous grasping dataset DexGraspNet 3.0 that performs zero-shot functional grasping from language instructions and does well in both simulation and physical experiments.

Details Motivation: Because data collection is difficult, current research has mainly focused on controlling simple grippers, and there is little work on functional grasping for human-like dexterous hands. Method: The authors propose DexVLG, which combines a vision-language model with a flow-matching pose head and is trained on DexGraspNet 3.0, a dataset of 170 million dexterous grasp poses. Result: In simulation, the model achieves a zero-shot execution success rate above 76% and state-of-the-art part-grasp accuracy, and it produces accurate part-aligned grasps in real-world experiments. Conclusion: DexVLG achieves functional grasping with a large model and, through zero-shot generalization, performs well in both simulated and real-world settings, offering a new direction for controlling human-like dexterous hands. Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, and successful part-aligned grasps on physical objects in real-world scenarios.

[95] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics

Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen

Main category: cs.CV

TL;DR: This paper proposes MANO, a novel attention mechanism inspired by n-body simulations, achieving efficient linear complexity while maintaining performance comparable to existing Transformer models.

Details Motivation: Standard Transformers have quadratic complexity in memory and time with respect to input length, making them impractical for high-resolution inputs. Existing solutions often rely on patchification, downsampling, or coarsening techniques that lose fine-scale details. Method: Inspired by n-body numerical simulation techniques, the authors introduced the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale manner. This approach maintains a global receptive field in each attention head and achieves linear time and memory complexity. Result: Empirical results show that MANO rivals state-of-the-art models like ViT and Swin Transformer on tasks such as image classification and Darcy flows, while significantly reducing runtime and peak memory usage. Conclusion: The proposed MANO model successfully reduces the time and memory complexity of attention computation in Transformers, achieving performance comparable to state-of-the-art models while significantly improving efficiency. Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. 
MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.

[96] Partial Weakly-Supervised Oriented Object Detection

Mingxin Liu,Peiyuan Zhang,Yuan Liu,Wei Zhang,Yue Zhou,Ning Liao,Ziyang Gong,Junwei Luo,Zhirui Wang,Yi Yu,Xue Yang

Main category: cs.CV

TL;DR: This paper proposes an efficient and cost-effective oriented object detection framework named PWOOD, which uses partial weak annotations and achieves performance comparable to semi-supervised approaches.

Details Motivation: The high annotation cost in oriented object detection motivated the need for a more efficient and lower-cost solution. Method: Development of the PWOOD framework, including the OS-Student model for orientation and scale learning, and CPF strategy for pseudo-label filtering, using partially weak annotations. Result: Experiments on DOTA and DIOR datasets show that PWOOD performs comparably or better than semi-supervised methods while utilizing less costly annotations. Conclusion: The proposed PWOOD framework offers a cost-effective and efficient solution for oriented object detection, demonstrating competitive performance compared to traditional semi-supervised algorithms. Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. 
To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which efficiently leverages large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.

[97] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images

Danrong Zhang,Huili Huang,N. Simrill Smith,Nimisha Roy,J. David Frost

Main category: cs.CV

TL;DR: This study proposes a new semantic segmentation approach to assessing damage severity in post-earthquake social media images, giving a more objective and comprehensive analysis that helps make disaster response more efficient and better targeted.

Details Motivation: Traditional approaches to damage severity assessment in post-earthquake social media images are subjective and cannot account for varying extents of damage within an image, so a more objective analysis method is needed. Method: The study builds a segmented damage severity dataset and fine-tunes a SegFormer model to produce damage severity segmentations of post-earthquake social media images. It also introduces a new damage severity scoring system that quantifies damage by considering the degree of damage in different regions. Result: Applying this approach quantifies damage severity in social media images more objectively and provides a nuanced understanding of the damage. Conclusion: The study improves damage severity assessment in post-earthquake social media images through semantic segmentation, providing a more objective and comprehensive analysis. The approach strengthens the ability to give precise guidance to disaster reconnaissance teams, supporting more effective and targeted response efforts after earthquakes. Abstract: In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.

[98] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Liheng Zhang,Lexi Pang,Hang Ye,Xiaoxuan Ma,Yizhou Wang

Main category: cs.CV

TL;DR: This paper proposes a training-free framework for text-to-image diffusion models that enhances structural alignment and visual quality through a flexible feature injection mechanism and additional strategies.

Details Motivation: To overcome the limitations of existing feature injection methods in T2I diffusion models, particularly structural misalignment, condition leakage, and visual artifacts when condition images deviate from natural RGB distributions. Method: A flexible feature injection framework that decouples the injection timestep from the denoising process, combined with appearance-rich prompting and a restart refinement strategy. Result: The approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios while enabling training-free, structure- and appearance-rich image generation. Conclusion: The proposed framework significantly improves the performance of text-to-image diffusion models by addressing structural misalignment, condition leakage, and visual artifacts in a training-free manner. Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. 
At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.

[99] No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley

Main category: cs.CV

TL;DR: This paper introduces a training-free method for object segmentation that uses semantic correspondences between reference and target images, achieving state-of-the-art results with minimal manual input.

Details Motivation: To reduce the manual effort required by existing methods like SAM by automatically generating segmentation masks using reference images. Method: A multi-stage, training-free method that includes memory bank construction, representation aggregation, and semantic-aware feature matching. Result: Significant improvements in segmentation metrics, including 36.8% nAP on COCO FSOD, 71.2% nAP50 on PASCAL VOC Few-Shot, and 22.4% nAP on Cross-Domain FSOD. Conclusion: The proposed method leverages semantic priors from foundation models to automatically generate instance-level segmentation masks, achieving state-of-the-art performance on several benchmarks. Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
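
The memory bank and feature matching steps listed in the abstract can be pictured with a toy example: store one prototype feature vector per reference class, then assign a target feature to the class whose prototype is most similar. The sketch below uses cosine similarity for that comparison. The similarity measure, the one-prototype-per-class memory bank, and the names `cosine_similarity` and `match_to_memory_bank` are all assumptions for illustration; the paper's actual pipeline operates on foundation-model features and is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_to_memory_bank(target_feature, memory_bank):
    """Return (class_name, score) for the most similar prototype.

    memory_bank is a hypothetical dict mapping a class name to one
    prototype vector aggregated from the reference images.
    """
    scores = {
        name: cosine_similarity(target_feature, prototype)
        for name, prototype in memory_bank.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For example, with `memory_bank = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}`, the target feature `[0.9, 0.1]` is matched to `"cat"`.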

[100] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

Gent Serifi,Marcel C. Bühler

Main category: cs.CV

TL;DR: This paper proposes HyperGaussians, an enhanced version of 3D Gaussian Splatting, enabling more accurate and efficient creation of animatable face avatars from monocular videos.

Details Motivation: Creating detailed animatable face avatars from monocular videos remains challenging due to limitations in handling nonlinear deformations, lighting effects, and fine details using existing methods like 3D Gaussian Splatting (3DGS). Method: The paper introduces HyperGaussians, an extension of 3D Gaussian Splatting using high-dimensional multivariate Gaussians with a reparameterized covariance matrix ('inverse covariance trick'), integrated into the FlashAvatar framework. Result: Evaluation on 19 subjects across 4 datasets shows that HyperGaussians surpass 3DGS both numerically and visually, particularly in rendering high-frequency details such as eyeglass frames, teeth, facial movements, and specular reflections. Conclusion: HyperGaussians provide a more expressive and efficient method for creating high-quality animatable face avatars, outperforming 3DGS in preserving fine details and handling complex deformations. Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. 
The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
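
Note on conditioning: the abstract does not spell out the 'inverse covariance trick', so the following is only the standard Gaussian conditioning identity that such a reparameterization could build on, not the paper's confirmed formulation. The split of a high-dimensional Gaussian into splatted attributes $a$ and a local embedding $b$, and all symbols below, are assumptions made for illustration. With mean $(\mu_a, \mu_b)$ and covariance blocks $\Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}$, conditioning on $b$ gives

$$
\mu_{a|b} = \mu_a + \Sigma_{ab}\,\Sigma_{bb}^{-1}\,(b - \mu_b), \qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\,\Sigma_{bb}^{-1}\,\Sigma_{ba}.
$$

Writing the same distribution in terms of the precision matrix $\Lambda = \Sigma^{-1}$, with blocks $\Lambda_{aa}, \Lambda_{ab}$, yields the equivalent form

$$
\Sigma_{a|b} = \Lambda_{aa}^{-1}, \qquad
\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\,\Lambda_{ab}\,(b - \mu_b).
$$

In the second form only the small block $\Lambda_{aa}$ is inverted, rather than the potentially large $\Sigma_{bb}$, which is one plausible reason a parameterization in terms of the inverse covariance would be cheaper.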

[101] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu,Hao Li,Jiawei Chi,Hanyang Wang,Minghui Yang,Fudong Wang,Yueqi Duan

Main category: cs.CV

TL;DR: LangScene-X combines a video diffusion model with language-embedding compression to reconstruct language-embedded 3D scenes from sparse views.

Details Motivation: Existing methods rely on dense-view reconstruction and tend to produce rendering artifacts when only limited views are available, so a more robust framework is needed. Method: LangScene-X combines a TriMap video diffusion model with a Language Quantized Compressor (LQC) to incorporate language information into 3D reconstruction. Result: LangScene-X outperforms existing methods in 3D reconstruction quality and cross-scene generalization. Conclusion: LangScene-X is a framework that generates language-embedded scenes from sparse views and demonstrates its superiority on real-world data. Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.

[102] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

Panpan Ji,Junni Song,Hang Xiao,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper proposes a new framework called DCDP-HAR to address challenges in multimodal sensor-based human activity recognition by using a dual-path feature extraction architecture, contrastive learning, and gradient modulation.

Details Motivation: Multimodal HAR systems face key challenges including difficulties in cross-modal feature alignment and imbalanced modality contributions. Method: The paper proposes a Dynamic Contrastive Dual-Path Network (DCDP-HAR) with three components: a dual-path feature extraction architecture, a multi-stage contrastive learning mechanism, and a confidence-driven gradient modulation strategy. A momentum-based gradient accumulation strategy is also adopted. Result: Ablation studies validate the effectiveness of each component of the framework, and extensive comparative experiments are performed on four public benchmark datasets. Conclusion: The proposed DCDP-HAR framework effectively addresses challenges in multimodal HAR systems, such as cross-modal feature alignment and imbalanced modality contributions. Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. 
We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
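
The abstract describes confidence-driven gradient modulation only at a high level, so the sketch below is a hypothetical illustration of the general idea rather than the DCDP-HAR rule. It assumes that a branch's confidence is its softmax probability for the true class, and that a branch whose confidence exceeds the mean has its gradient scaled down by `1 - tanh(alpha * excess)`. The function names, the `alpha` parameter, and the modality names are all assumptions.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulation_coefficients(
    logits_by_modality: Dict[str, List[float]],
    label: int,
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Return a gradient scale in (0, 1] for each modality branch.

    Hypothetical rule: a branch that is more confident than average on the
    true class is slowed down; all other branches keep a scale of 1.0.
    """
    # Confidence = predicted probability of the ground-truth class (assumption).
    confidence = {
        name: softmax(logits)[label]
        for name, logits in logits_by_modality.items()
    }
    mean_confidence = sum(confidence.values()) / len(confidence)

    coefficients = {}
    for name, conf in confidence.items():
        # How far this branch is ahead of the average, clipped at zero.
        excess = max(0.0, conf / mean_confidence - 1.0)
        coefficients[name] = 1.0 - math.tanh(alpha * excess)
    return coefficients


if __name__ == "__main__":
    scales = modulation_coefficients(
        {"accelerometer": [4.0, 0.0], "gyroscope": [0.5, 0.0]}, label=0
    )
    print(scales)
```

In a training loop, each branch's gradients would be multiplied by its coefficient before the optimizer step, so the dominant modality learns more slowly while the weaker one catches up.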

[103] USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

Ying Yu,Hang Xiao,Siyao Li,Jiarui Li,Haotian Tang,Hanyu Liu,Chao Li

Main category: cs.CV

TL;DR: This paper introduces a novel optimization approach for human activity recognition using a multi-attention interaction mechanism, achieving superior accuracy and efficiency across multiple datasets and practical deployment scenarios.

Details Motivation: Human activity recognition (HAR) faces challenges such as labeled data scarcity for rare activities, inadequate extraction of high-level features, and suboptimal model performance on lightweight devices. This research aims to overcome these issues. Method: The paper proposes a multi-attention interaction mechanism-based approach, incorporating an unsupervised diffusion model for data augmentation, a multi-branch spatio-temporal interaction network with attention mechanisms, and an adaptive multi-loss function fusion strategy. Result: Experimental results on WISDM, PAMAP2, and OPPORTUNITY datasets show that the proposed USAD model achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly surpassing existing approaches. The method also demonstrates efficiency and feasibility in deployment on embedded devices. Conclusion: The proposed USAD model outperforms existing methods in human activity recognition, achieving high accuracy on three public datasets and proving efficient for deployment on embedded devices. Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. 
Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3*3, 5*5, and 7*7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization. Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.

[104] AnyI2V: Animating Any Conditional Image with Motion Control

Ziye Li,Hao Luo,Xincheng Shuai,Henghui Ding

Main category: cs.CV

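TL;DR: This paper proposes AnyI2V, a training-free video generation framework that generates videos from user-defined motion trajectories and supports multiple conditional image modalities as well as style editing, improving the flexibility and versatility of video generation.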

Details Motivation: Existing text-to-video methods lack precise control over the spatial layout of generated content, while image-to-video methods are limited by their dependence on real images; methods that incorporate ControlNet lack explicit motion control and require expensive training. Method: The paper proposes AnyI2V, a training-free framework that generates videos from user-defined motion trajectories, supports multiple conditional image modalities and mixed inputs, and enables style transfer and editing. Result: In extensive experiments, AnyI2V achieves superior performance and offers a new perspective on spatial- and motion-controlled video generation. Conclusion: AnyI2V provides a more flexible and versatile approach to video generation, overcoming the limitations of existing T2V and I2V methods in integrating dynamic motion signals and spatial constraints. Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.

[105] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia,Bingkui Tong,Yuhang Zang,Rui Shao,Kaiyang Zhou

Main category: cs.CV

TL;DR: This paper introduces GCoT, a novel method to improve the adaptation of MLLMs to specialized vision tasks by enhancing the quality of CoT reasoning data through grounding information like bounding boxes.

Details Motivation: Multimodal Large Language Models (MLLMs) struggle with adapting to specialized vision tasks like chart understanding due to mismatches between pre-training datasets and downstream tasks. Additionally, distillation from pre-trained MLLMs often results in CoT data with factual errors, prompting the need for a more reliable adaptation strategy. Method: The authors propose a bootstrapping-based method called Grounded Chain-of-Thought (GCoT), which injects grounding information such as bounding boxes into Chain-of-Thought (CoT) reasoning data, improving the model's faithfulness to input images. Result: The GCoT approach was evaluated on five specialized vision tasks involving charts, tables, receipts, and reports. Under data-limited conditions, it significantly outperformed traditional fine-tuning and distillation methods. Conclusion: The proposed Grounded Chain-of-Thought (GCoT) approach effectively enhances the adaptation of Multimodal Large Language Models (MLLMs) to specialized vision tasks under data-limited regimes by incorporating grounding information into CoT data. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. 
However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

[106] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

Xin Zhou,Dingkang Liang,Kaijin Chen,Tianrui Feng,Xiwu Chen,Hongkai Lin,Yikang Ding,Feiyang Tan,Hengshuang Zhao,Xiang Bai

Main category: cs.CV

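TL;DR: EasyCache is a training-free acceleration framework for video diffusion models that uses a lightweight, runtime-adaptive caching mechanism to significantly speed up video generation without sacrificing visual quality.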

Details Motivation: Existing video generation models suffer from slow inference and high computational cost because of the iterative nature of the denoising process, which limits their broader adoption. Addressing this bottleneck is essential for democratizing advanced video synthesis and integrating it into real-world applications. Method: EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computation during inference. Result: EasyCache is studied comprehensively on large-scale video generation models including OpenSora, Wan2.1, and HunyuanVideo. It achieves leading acceleration, reducing inference time by up to 2.1-3.3$\times$ compared to the original baselines, with up to a 36% PSNR improvement over the previous SOTA method. Conclusion: EasyCache is an efficient and accessible acceleration framework for high-quality video generation in both research and practical applications. Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity with a PSNR improvement of up to 36% compared to the previous SOTA method. This improvement makes our EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
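
To make the idea of reusing previously computed transformation vectors concrete, here is a minimal sketch of a runtime-adaptive cache in a denoising loop. It is not the EasyCache algorithm: the reuse criterion (accumulated relative input change below a `threshold`), the function names, and the toy model are all assumptions chosen for illustration. The transformation vector is taken to be the difference between the model output and its input.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def l1_norm(values: Vector) -> float:
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def run_with_adaptive_cache(
    model: Callable[[Vector, int], Vector],
    x: Vector,
    num_steps: int,
    threshold: float,
) -> Tuple[Vector, int]:
    """Run num_steps updates, skipping the model when the input barely moved.

    Returns the final state and the number of full model evaluations.
    """
    cached_delta: Optional[Vector] = None  # last computed (output - input)
    prev_input: Optional[Vector] = None
    accumulated_change = 0.0
    full_calls = 0

    for step in range(num_steps):
        if prev_input is not None:
            # Relative change of the input since the previous step.
            diff = [a - b for a, b in zip(x, prev_input)]
            accumulated_change += l1_norm(diff) / (l1_norm(prev_input) + 1e-8)

        if cached_delta is not None and accumulated_change < threshold:
            # Cheap path: reuse the cached transformation vector.
            output = [a + d for a, d in zip(x, cached_delta)]
        else:
            # Expensive path: call the model and refresh the cache.
            output = model(x, step)
            cached_delta = [o - a for o, a in zip(output, x)]
            accumulated_change = 0.0
            full_calls += 1

        prev_input = x
        x = output

    return x, full_calls


def toy_model(x: Vector, step: int) -> Vector:
    """Stand-in for a denoiser: shrinks the state by 10% per step."""
    return [0.9 * v for v in x]


if __name__ == "__main__":
    final_state, calls = run_with_adaptive_cache(toy_model, [1.0, 1.0], 10, 0.25)
    print(final_state, calls)
```

A larger `threshold` trades accuracy for speed: with a threshold of zero every step calls the model, while a very large threshold calls it only once and then replays the cached update.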

[107] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang,Xiaoyang Wu,Fangcheng Zhong,Hengshuang Zhao,Matthias Nießner,Joan Lasenby

Main category: cs.CV

TL;DR: LiteReality converts RGB-D scans into interactive 3D virtual replicas with high-quality materials and object individuality, suitable for various applications.

Details Motivation: To create a method that generates detailed and interactive 3D virtual replicas suitable for applications like AR/VR, gaming, robotics, and digital twins. Method: LiteReality uses scene understanding to parse results into a coherent 3D layout and objects using a structured scene graph. It retrieves visually similar 3D artist-crafted models, enhances realism with material painting, and integrates the scene into a simulation engine for interaction. Result: The resulting scenes are compact, editable, fully compatible with standard graphics pipelines, and include high-quality materials and object individuality. Conclusion: LiteReality is a new pipeline that creates compact, realistic, and interactive 3D virtual replicas from RGB-D scans of indoor environments. Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. 
In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c

[108] RefTok: Reference-Based Tokenization for Video Generation

Xiang Fan,Xiaohang Sun,Kushan Thakkar,Zhu Liu,Vimal Bhat,Ranjay Krishna,Xiang Hao

Main category: cs.CV

TL;DR: RefTok improves video modeling by efficiently capturing temporal dynamics and context, significantly outperforming current methods.

Details Motivation: Handling temporal redundancy is crucial in video model learning, but prevailing approaches often fail to capture inherent temporal dependencies and redundancies. Method: RefTok, a reference-based tokenization method, was introduced to encode and decode sets of frames conditioned on an unquantized reference frame. Result: RefTok outperformed existing tokenizers like Cosmos and MAGVIT by improving metrics like PSNR, SSIM, LPIPS by 36.7% on average, and surpassed both smaller and larger versions of MAGVIT in video generation tasks. Conclusion: RefTok proves to be more effective than current state-of-the-art tokenizers in handling temporal redundancy in videos, offering better performance even against larger models. Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. 
When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.

[109] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu,Wenzhao Zheng,Jie Zhou,Jiwen Lu

Main category: cs.CV

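TL;DR: This paper proposes Point3R, an online dense 3D reconstruction framework built on an explicit spatial pointer memory, which improves global consistency and reconstruction efficiency for multi-image scenes.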

Details Motivation: Implicit memory in existing methods is limited in capacity and prone to losing information from earlier frames; the paper proposes a more efficient explicit spatial pointer memory to address this. Method: The method introduces a 3D hierarchical position embedding and a simple yet effective fusion mechanism to promote interaction between information extracted from the latest frame and the pointer memory. Result: Point3R achieves competitive or state-of-the-art performance on various tasks with low training cost. Conclusion: Point3R is an online framework for dense streaming 3D reconstruction that maintains an explicit spatial pointer memory to integrate observations efficiently and uniformly into a global coordinate system. Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.