Table of Contents
cs.CL
[1] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Tian Lan,Xiangdong Su,Xu Liu,Ruirui Wang,Ke Chang,Jiang Li,Guanglai Gao
Main category: cs.CL
TL;DR: This paper introduces a Multi-task Chinese Bias Evaluation Benchmark (McBE) to measure biases in large language models (LLMs), addressing the lack of comprehensive datasets for non-English and non-North American cultures. The evaluation of LLMs shows varying degrees of bias, highlighting the need for mitigation strategies.
Details
Motivation: The motivation stems from the fact that most existing bias evaluation datasets focus on English and North American culture, which are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce, and they usually only support single evaluation tasks without comprehensive measurement capabilities. Method: The authors present a Multi-task Chinese Bias Evaluation Benchmark (McBE) with 4,077 evaluation instances covering 12 single bias categories, 82 subcategories, and 5 evaluation tasks. They evaluate several popular LLMs from different series and parameter sizes. Result: The proposed McBE benchmark provides extensive category coverage, content diversity, and comprehensiveness in evaluating bias in LLMs. The evaluation of popular LLMs reveals varying degrees of bias, offering novel insights into the issue. Conclusion: The paper concludes that popular LLMs demonstrate varying degrees of bias, emphasizing the importance of measuring and addressing biases in LLMs to mitigate ethical risks. Abstract: As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate its ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. The datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually only support single evaluation tasks and cannot evaluate the bias from multiple aspects in LLMs. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories, 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. 
Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.
[2] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization
Keyan Jin,Yapeng Wang,Leonel Santos,Tao Fang,Xu Yang,Sio Kei Im,Hugo Gonçalo Oliveira
Main category: cs.CL
TL;DR: This paper evaluates reasoning and non-reasoning large language models (LLMs) in dialogue summarization and finds that explicit stepwise reasoning does not consistently improve performance. Instead, reasoning models often produce longer, less accurate summaries, highlighting the need for specialized approaches in dialogue summarization.
Details
Motivation: Despite significant progress in summarization tasks by large language models (LLMs), the effectiveness of step-by-step reasoning architectures, such as Long Chain-of-Thought implementations, remains unexplored in dialogue scenarios that require both abstraction and conciseness. This work aims to fill this gap by systematically evaluating how reasoning and non-reasoning LLMs perform in dialogue summarization. Method: The research conducted a comprehensive evaluation of state-of-the-art reasoning and non-reasoning LLMs across three major dialogue summarization paradigms: generic, role-oriented, and query-oriented. It analyzed performance across diverse languages, domains, and summary lengths using benchmarks like SAMSum, DialogSum, CSDS, and QMSum, alongside both LLM-based automatic metrics and human-inspired evaluation criteria. Result: Findings show that reasoning LLMs do not consistently outperform non-reasoning models in dialogue summarization. They are prone to verbosity, factual inconsistencies, and less concise outputs. Scenario-specific analyses and case studies reveal when and why explicit reasoning may hinder summarization in complex dialogue contexts. Conclusion: Dialogue summarization quality is not consistently improved by explicit stepwise reasoning. Reasoning LLMs often produce verbose and factually inconsistent summaries compared to non-reasoning models. The study highlights the need for targeted modeling and evaluation strategies tailored to real-world dialogue summarization. Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. 
Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.
[3] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer
Wenquan Lu,Yuechuan Yang,Kyle Lee,Yanshu Li,Enqi Liu
Main category: cs.CL
TL;DR: This paper investigates whether Huginn-3.5B develops internal reasoning (latent CoT), finding limited interpretability and minimal benefit from deeper recurrence.
Details
Motivation: To understand if and how latent chain-of-thought (CoT) reasoning emerges in recurrent transformer architectures without explicit natural language externalization. Method: The researchers analyzed Huginn-3.5B's internal behavior on arithmetic tasks using probing techniques like Logit Lens and Coda Lens to track reasoning steps in latent space. Result: Limited interpretable latent CoT evidence was found, with inconsistent probing results across layers and decoding methods, showing minimal gains from deeper recurrence. Conclusion: The study concludes that Huginn-3.5B does not significantly internalize reasoning steps as latent CoT, and increasing recurrence depth offers only marginal benefits compared to models externalizing reasoning. Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. 
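As a rough illustration of the rank-trajectory probing mentioned in this entry, the sketch below projects a sequence of hidden states through an unembedding matrix and records the rank of one target token at each step. It is a toy under stated assumptions, not the authors' code: the function names, the 2-dimensional hidden states, and the 3-token vocabulary are all hypothetical, and the final normalization layer that a real Logit Lens applies before unembedding is omitted.

```python
from typing import List, Sequence


def token_rank(hidden: Sequence[float],
               unembed: Sequence[Sequence[float]],
               target_id: int) -> int:
    """Return the rank (1 = highest logit) of target_id for one hidden state."""
    # Project the hidden state onto every vocabulary row to get logits.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    target_logit = logits[target_id]
    # Rank is one plus the number of tokens with a strictly higher logit.
    return 1 + sum(1 for value in logits if value > target_logit)


def rank_trajectory(hidden_states: Sequence[Sequence[float]],
                    unembed: Sequence[Sequence[float]],
                    target_id: int) -> List[int]:
    """Rank of target_id after each recurrence step (or layer)."""
    return [token_rank(h, unembed, target_id) for h in hidden_states]


# Hypothetical toy data: 3 vocabulary tokens, 2-dimensional hidden states,
# one hidden state per recurrence step.
TOY_UNEMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
TOY_HIDDEN_STATES = [[1.0, 0.0], [0.2, 1.0], [-1.0, 1.0]]

if __name__ == "__main__":
    print(rank_trajectory(TOY_HIDDEN_STATES, TOY_UNEMBED, target_id=1))
```

A trajectory that falls steadily toward rank 1 would suggest the token is being resolved gradually across steps; the entry reports that such evidence was limited for Huginn-3.5B.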
Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
[4] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons
Steven Song,Anirudh Subramanyam,Zhenyu Zhang,Aarti Venkat,Robert L. Grossman
Main category: cs.CL
TL;DR: GDC Cohort Copilot is an open-source tool that converts natural language descriptions into cancer genomics data cohorts on the Genomic Data Commons, making it easier for users to build and analyze patient cohorts.
Details
Motivation: Users of the Genomic Data Commons (GDC) often struggle to locate specific cohort descriptors among hundreds of fields, especially when they are new to the platform. Natural language input may offer a more intuitive way to define desired cohorts. Method: The authors developed and evaluated multiple large language models (LLMs) to power GDC Cohort Copilot. These models translate user-input natural language into GDC cohort filters, which can then be refined through an interactive interface before being exported for further analysis. Result: The GDC Cohort LLM, a locally-served open-source model, outperforms GPT-4o in generating accurate GDC cohort filters from natural language descriptions. The tool includes an interactive interface for refining generated cohorts and supports exporting them back to the GDC for further use. Conclusion: GDC Cohort Copilot is an effective tool that simplifies the process of creating complex cohorts on the Genomic Data Commons by translating natural language descriptions into cohort filters, outperforming GPT-4o in this task. Abstract: Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. 
An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot. Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds.
[5] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu,Tinghong Chen,Jiangtao Feng,Jiangjie Chen,Weinan Dai,Qiying Yu,Ya-Qin Zhang,Wei-Ying Ma,Jingjing Liu,Mingxuan Wang,Hao Zhou
Main category: cs.CL
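TL;DR: This paper proposes MemAgent, a new agent workflow for long-text processing that reads text in segments, updates its memory with an overwrite strategy, and is trained with an extended DAPO algorithm, achieving efficient long-context extrapolation with strong performance.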
Details
Motivation: Despite improvements in length extrapolation, efficient attention, and memory modules, handling infinitely long documents with linear complexity and without performance degradation remains the ultimate challenge in long-text processing. Method: The authors introduce a novel agent workflow, MemAgent, which reads text in segments and updates its memory using an overwrite strategy, and they extend the DAPO algorithm to train via independent-context multi-conversation generation. Result: MemAgent performs strongly on long-context tasks, extrapolating from an 8K context to a 3.5M QA task and scoring well on the 512K RULER test. Conclusion: MemAgent demonstrates strong long-text capabilities: with an 8K context trained on 32K text, it extrapolates to a 3.5M QA task with less than 5% performance loss and achieves over 95% on the 512K RULER test. Abstract: Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
[6] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning
Dohoon Kim,Donghun Kang,Taesup Moon
Main category: cs.CL
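TL;DR: This paper introduces DoMIX, a new approach that uses LoRA modules for parameter-efficient fine-tuning to address several problems with existing continual domain-adaptive pre-training (DAP) methods, and that extends to standard LLM fine-tuning scenarios.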
Details
Motivation: Existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. The authors therefore seek a new approach that addresses these issues. Method: The paper proposes DoMIX, which leverages LoRA modules for efficient, parallel domain-adaptive pre-training that is robust to domain order and uses accumulated knowledge to provide tailored pre-trained models for specific tasks. Result: The paper demonstrates the effectiveness of DoMIX, which addresses the problems of existing continual DAP methods and also extends to standard LLM fine-tuning scenarios. Conclusion: DoMIX is a novel approach that uses LoRA modules to address the high computational cost, sensitivity to incremental data order, and lack of task-specific models in existing continual DAP methods; it can also be extended to standard LLM fine-tuning scenarios. Abstract: Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
[7] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
Christian Jaumann,Annemarie Friedrich,Rainer Lienhart
Main category: cs.CL
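TL;DR: This paper proposes an ensemble approach based on multimodal large language models and few-shot retrieval strategies, which achieved strong results in the SciVQA 2025 Shared Task.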
Details
Motivation: Tackling the scientific visual question answering challenge of the SciVQA 2025 Shared Task requires effective models and strategies to improve performance. Method: The system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies, and selects answers based on the models' confidence levels. Result: The system performs well on the blind test data, with an average F1 score of 85.12. Conclusion: The system ranks third out of seven in the SciVQA 2025 Shared Task, with an average F1 score of 85.12. Abstract: This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models' confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.
[8] QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers
Pilsung Kang
Main category: cs.CL
TL;DR: This paper presents QFFN-BERT, a hybrid quantum-classical transformer that replaces feedforward network modules with parameterized quantum circuit-based layers. The design aims to improve expressibility and reduce parameter count compared to classical models. Experiments show that QFFN-BERT can achieve higher accuracy than its classical counterpart while significantly reducing the number of parameters needed, especially in feedforward networks. The model also performs well in few-shot learning scenarios, demonstrating its potential for improved data efficiency.
Details
Motivation: The motivation for this work is the dominant parameter contribution of FFNs in standard Transformer encoder blocks, which account for approximately two-thirds of the parameters. Prior studies have primarily integrated PQCs into self-attention modules, but this work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Method: The authors introduced QFFN-BERT, a hybrid quantum-classical transformer where the FFN modules of a compact BERT variant are replaced by PQC-based layers. The final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Result: Experiments conducted on a classical simulator on the SST-2 and DBpedia benchmarks showed that a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. The model also exhibited a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. Conclusion: Parameterized quantum circuits (PQCs) can serve as powerful and parameter-efficient alternatives to classical feedforward networks (FFNs) when co-designed with foundational deep learning principles. Abstract: Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. 
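The "approximately two-thirds" figure can be checked with a quick parameter count for one standard Transformer encoder block. The sketch below is an illustration, not taken from the paper: it assumes the usual layout of four attention projections, a two-layer FFN, and two layer norms, and the default sizes (`d_model=768`, `d_ff=3072`) are BERT-base values chosen as an example rather than the compact BERT variant the authors use.

```python
def encoder_block_params(d_model: int, d_ff: int) -> dict:
    """Count trainable parameters in one standard Transformer encoder block."""
    # Self-attention: Q, K, V, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Feedforward network: d_model -> d_ff -> d_model, with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norm = 2 * (2 * d_model)
    return {"attention": attention, "ffn": ffn, "layer_norm": layer_norm}


def ffn_share(d_model: int = 768, d_ff: int = 3072) -> float:
    """Fraction of the block's parameters that belong to the FFN."""
    counts = encoder_block_params(d_model, d_ff)
    return counts["ffn"] / sum(counts.values())


if __name__ == "__main__":
    print(f"FFN share of block parameters: {ffn_share():.3f}")
```

Ignoring biases and layer norms, the attention projections contribute 4·d² weights and an FFN with `d_ff = 4·d_model` contributes 8·d², so the FFN share is 8/12, which is where the two-thirds figure comes from.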
While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.
[9] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
Weijie Lyu,Sheng-Jun Huang,Xuan Xia
Main category: cs.CL
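TL;DR: This paper proposes a parametric-model-based method for selecting code data that improves training efficiency and model performance while maintaining data quality.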
Details
Motivation: Current methods rely mainly on large amounts of data to improve model performance and often overlook data quality, which reduces training efficiency. Method: A parametric model is used for code data selection and is optimized to ensure distribution consistency and diversity within the selected subset. Result: Experiments show that with only 10K samples, the method achieves gains of 2.4% on HumanEval and 2.3% on MBPP over the 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. Conclusion: The method boosts model performance while significantly reducing computational costs. Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.
[10] Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability
Mark Atta Mensah,Isaac Wiafe,Akon Ekpezu,Justice Kwame Appati,Jamal-Deen Abdulai,Akosua Nyarkoa Wiafe-Akenten,Frank Ernest Yeboah,Gifty Odame
Main category: cs.CL
TL;DR: This paper benchmarks seven Akan ASR models on four diverse speech datasets to evaluate their generalization across domains, revealing domain dependency and distinct error behaviors between Whisper and Wav2Vec2 architectures.
Details
Motivation: Most ASR research evaluates models using in-domain datasets but rarely assesses their ability to generalize across diverse speech contexts. This gap is especially relevant for low-resource languages like Akan. Method: Seven Akan ASR models based on transformer architectures (e.g., Whisper and Wav2Vec2) were benchmarked using four Akan speech corpora spanning various domains: culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. Performance was measured using word error rate and character error rate. Result: Models showed strong performance within their training domains but significant accuracy degradation when tested on mismatched domains. Whisper models produced more fluent but potentially misleading errors, while Wav2Vec2 generated more obvious yet less interpretable outputs when encountering unfamiliar inputs. Conclusion: The study highlights the need for domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks to improve ASR performance for Akan and other low-resource languages, with a trade-off observed between readability and transparency in transcription errors. Abstract: Most existing automatic speech recognition (ASR) research evaluate models using in-domain datasets. However, they seldom evaluate how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. 
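For readers unfamiliar with the two metrics used in this entry, the sketch below computes word error rate (WER) and character error rate (CER) as Levenshtein edit distance divided by reference length. It is a minimal illustration with hypothetical function names and placeholder English strings; the authors' exact text normalization and tooling are not described in this entry, so none is assumed here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Placeholder strings, not drawn from any of the Akan corpora.
    print(wer("the cat sat", "the bat sat"), cer("the cat sat", "the bat sat"))
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind when comparing models across mismatched domains.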
A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
[11] A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages
Sumaya Ahmed Salihs,Isaac Wiafe,Jamal-Deen Abdulai,Elikem Doe Atsakpo,Gifty Ayoka,Richard Cave,Akon Obu Ekpezu,Catherine Holloway,Katrin Tomanek,Fiifi Baffoe Payin Winful
Main category: cs.CL
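TL;DR: This paper presents an approach to automatic speech recognition for impaired speech in low-resource languages, including an open-source dataset, a best-practices guide, and model fine-tuning.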
Details
Motivation: To make impaired speech in low-resource languages recognizable, the study aims to democratize automatic speech recognition (ASR) technology and make it more inclusive. Method: The researchers built the first open-source dataset of impaired speech in Akan and developed a "cookbook" of best practices for community-driven data collection and ASR model building. They also fine-tuned open-source ASR models on this data. Result: The researchers created the first open-source dataset of impaired speech in Akan and released accompanying tools and guidance. Fine-tuned open-source ASR models showed initial results in recognizing impaired speech. Conclusion: By developing a best-practices "cookbook" and training, the study democratizes ASR technology and data collection, enabling community-driven data collection and ASR model building. It also presents initial results from fine-tuning open-source ASR models to better recognize impaired speech in Akan. Abstract: This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly, low-resource languages. It aims to democratize ASR technology and data collection by developing a "cookbook" of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan: a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, are publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.
[12] IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders
Sneha Deshmukh,Prathmesh Kamble
Main category: cs.CL
TL;DR: This paper introduces IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions.
Details
Motivation: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. Method: Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. Result: The resource supports a range of legal NLP tasks and is the first publicly available dataset focused on Indian bail jurisprudence. Conclusion: The IndianBailJudgments-1200 dataset fills an important gap in Indian legal NLP. Abstract: Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.
[13] WebSailor: Navigating Super-human Reasoning for Web Agent
Kuan Li,Zhongwang Zhang,Huifeng Yin,Liwen Zhang,Litu Ou,Jialong Wu,Wenbiao Yin,Baixuan Li,Zhengwei Tao,Xinyu Wang,Weizhou Shen,Junkai Zhang,Dingchu Zhang,Xixi Wu,Yong Jiang,Ming Yan,Pengjun Xie,Fei Huang,Jingren Zhou
Main category: cs.CL
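TL;DR: WebSailor is a new post-training methodology that aims to improve open-source models on complex information-seeking tasks so that they match the performance of proprietary agents.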
Details
Motivation: The authors posit that the ability to reason under extreme uncertainty is key to the success of proprietary agents and that this ability is absent in open-source models. Method: The approach generates novel, high-uncertainty tasks through structured sampling and information obfuscation, followed by an RFT cold start and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). Result: WebSailor significantly outperforms all open-source agents on complex information-seeking tasks and matches the performance of proprietary agents. Conclusion: WebSailor offers an effective way to narrow the capability gap between open-source and proprietary models. Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
[14] Revisiting Active Learning under (Human) Label Variation
Cornelia Gruber,Helen Alber,Bernd Bischl,Göran Kauermann,Barbara Plank,Matthias Aßenmacher
Main category: cs.CL
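TL;DR: This paper examines how to account for human label variation in active learning, to make training machine learning models more efficient in real-world settings.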
Details
Motivation: Access to high-quality labeled data remains a limiting factor in applied supervised learning; meanwhile, existing annotation frameworks often overlook human label variation (HLV) and treat it as noise. Method: By examining foundational assumptions and surveying the literature, the authors propose an active learning (AL) framework that incorporates HLV and discuss the use of large language models as annotators. Result: The authors highlight the need to decompose observed label variation into signal (e.g., HLV) and noise (e.g., annotation error), and propose an AL approach covering instance selection, annotator choice, and label representation. Conclusion: The paper proposes a conceptual framework for incorporating HLV throughout the AL loop, to better reflect the complexities of real-world annotation. Abstract: Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed -- or neglected -- these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.
[15] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion
Xin Guan,PeiHsin Lin,Zekun Wu,Ze Wang,Ruibo Zhang,Emre Kazim,Adriano Koshiyama
Main category: cs.CL
TL;DR: Multiperspective Fusion (MPF) is a framework for mitigating bias in large language models by aligning their outputs with nuanced humanlike baselines, offering scalability and interpretability without extensive prompt engineering or finetuning.
Details
Motivation: MPF was developed in response to the growing need for easy bias mitigation in large language models. Method: Multiperspective Fusion (MPF) leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines by decomposing a baseline into interpretable perspective components and guiding generation through sampling and balancing of responses. Result: Empirically, MPF demonstrates its ability to align LLM sentiment distributions with both counterfactual baselines and the HR baseline, resulting in small KL divergence, reduction of calibration error, and generalization to unseen questions. Conclusion: MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning. Abstract: Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. 
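To make the "small KL divergence" criterion concrete, the sketch below computes the discrete Kullback-Leibler divergence between two sentiment distributions over the same bins. It is an illustration only: the function name, the two-bin example values, and the choice of direction (baseline as P, model output as Q) are assumptions, since this entry does not say how MPF bins sentiment or which direction it measures.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete KL(P || Q) in nats for two distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a zero-probability bin in P contributes nothing
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    # Hypothetical two-bin sentiment distributions (e.g., negative, positive).
    baseline = [0.5, 0.5]
    model_output = [0.25, 0.75]
    print(f"KL(baseline || model) = {kl_divergence(baseline, model_output):.4f}")
```

A value of zero means the two distributions match exactly, so driving this quantity toward zero is one way to read "aligning" the model's sentiment distribution with a baseline.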
This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
[16] Exploring Gender Bias Beyond Occupational Titles
Ahmed Sabir,Rajesh Sharama
Main category: cs.CL
TL;DR: This paper introduces GenderLexicon and a new framework for estimating contextual gender bias, showing that biases extend beyond occupational stereotypes and can be effectively measured.
Details
Motivation: To understand and quantify gender biases in context, particularly focusing on action verbs, object nouns, and occupations. Method: A framework was developed to estimate contextual bias related to gender, using a novel dataset called GenderLexicon. The approach was validated across five diverse datasets, including a Japanese dataset. Result: A model capable of interpreting gender bias with a score was developed, improving the explainability of such biases. Findings confirmed the presence of gender biases beyond just occupational stereotypes. Conclusion: The study concludes that gender biases exist beyond occupational stereotypes, and the proposed model effectively estimates and interprets these biases with a score. Abstract: In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.
[17] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Zhijian Xu,Yilun Zhao,Manasi Patwardhan,Lovekesh Vig,Arman Cohan
Main category: cs.CL
TL;DR: This paper introduces LimitGen, a new benchmark for assessing LLMs' ability to identify research paper limitations, showing that integrating literature retrieval enhances their performance in supporting peer review.
Details
Motivation: The increasing volume of scientific publications has made peer review more challenging. Although LLMs show promise in many scientific tasks, their potential to assist in peer review—particularly in identifying paper limitations—remains understudied. Method: The researchers developed LimitGen, a benchmark for evaluating LLMs' capabilities in identifying paper limitations. It includes two subsets: LimitGen-Syn (synthetic dataset) and LimitGen-Human (real human-written limitations). They augmented LLMs with literature retrieval techniques to improve performance. Result: LimitGen was successfully created as the first comprehensive benchmark for studying how LLMs can support early-stage peer review. The integration of literature retrieval improved the ability of LLMs to detect and generate meaningful limitations in research papers. Conclusion: The study concludes that augmenting LLMs with literature retrieval significantly enhances their ability to identify and generate limitations in research papers, providing more concrete and constructive feedback. Abstract: Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. 
To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identification of limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
[18] Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
Peter Viechnicki
Main category: cs.CL
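TL;DR: This study examines the control mechanisms of human vowel production, measuring the 'Just Producible Difference' (JPD) among English speakers during front vowel production to reveal how accurate vowel imitation is.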
Details
Motivation: Prior work has shown that the complex, coordinated articulatory movements of human vowel production are governed by control mechanisms whose targets are regions of auditory space, but the precision of that control is unknown. This study investigates that question. Method: The study uses a vowel mimicry paradigm to measure JPD in two sets of English speakers during front vowel production, estimating the JPD distance in F1 X F2 space. Result: JPD is estimated at between 14 and 51 mels, a finding with implications for episodic theories of speech production and for the structure of human vowel systems. Conclusion: The study offers a psychophysical explanation for the design of human vowel systems and sets a theoretical lower bound on how close two vowel phonemes may be in a speaker's formant space. Abstract: A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed 'Just Producible Difference' (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 X F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker's formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.
[19] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Ken Tsui
Main category: cs.CL
TL;DR: This paper identifies a significant 'Self-Correction Blind Spot' in large language models (LLMs), showing they struggle to correct their own errors. Using a new framework called Self-Correction Bench, the study finds an average 64.5% blind spot rate across 14 models. The issue appears linked to training data, as human examples often lack error-correction sequences. However, simple interventions like appending 'Wait' can reduce blind spots by nearly 89%. This work underscores a key area for improving LLM reliability.
Details
Motivation: Large language models (LLMs) can make mistakes and explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, especially for autoregressive models. The motivation behind this study is to systematically investigate the 'Self-Correction Blind Spot' phenomenon where LLMs fail to correct identical errors in their own outputs. Method: The researchers introduced Self-Correction Bench, a systematic framework to measure the 'Self-Correction Blind Spot' through controlled error injection at three complexity levels. They tested 14 models and analyzed the relationship between training data composition and error correction capabilities. Result: The research found an average 64.5% blind spot rate across 14 tested models. Evidence suggests that this limitation relates to training data composition; human training demonstrations predominantly show error-free responses rather than error-correction sequences. RL-trained models learn error correction through outcome feedback. Appending 'Wait' reduced blind spots by 89.3%, indicating the capability exists but requires activation. Conclusion: The study highlights a critical limitation in current LLMs regarding self-correction and offers potential avenues for improving their reliability and trustworthiness. Abstract: Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. 
We find multiple pieces of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
[20] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Riccardo Cantini,Nicola Gabriele,Alessio Orsino,Domenico Talia
Main category: cs.CL
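TL;DR: Although Reasoning Language Models (RLMs) excel at complex multi-step reasoning tasks, they may be more vulnerable to bias elicitation, indicating that reasoning does not necessarily improve robustness.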
Details
Motivation: Although RLMs can perform complex multi-step reasoning tasks, their impact on robustness to social biases remains unclear. Method: The authors use the CLEAR-Bias benchmark to test the adversarial robustness of RLMs to bias elicitation, applying an LLM-as-a-judge approach for automated safety scoring and jailbreak techniques to assess the effectiveness of built-in safety mechanisms. Result: Models with explicit reasoning capabilities are more vulnerable to bias elicitation than base models without such mechanisms; reasoning-enabled models are somewhat safer than models relying on CoT prompting. Conclusion: Reasoning Language Models (RLMs) may unintentionally increase vulnerability to bias elicitation, so more bias-aware approaches to reasoning design are needed. Abstract: Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. 
Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.
[21] Multimodal Mathematical Reasoning with Diverse Solving Perspective
Wenhao Shi,Zhiqiang Hu,Yi Bin,Yang Yang,See-Kiong Ng,Heng Tao Shen
Main category: cs.CL
TL;DR: This study introduces MathV-DP and Qwen-VL-DP, leveraging diverse reasoning perspectives to improve multimodal mathematical reasoning, showing better accuracy and diversity in solutions.
Details
Motivation: Current multimodal large language models (MLLMs) for mathematical reasoning rely on one-to-one image-text pairs and single-solution supervision, neglecting the diversity of valid reasoning perspectives. This work aims to enhance reasoning capabilities by incorporating diverse solution paths. Method: The authors introduced MathV-DP, a dataset with multiple solution trajectories for each image-question pair, and proposed Qwen-VL-DP, a model fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), which integrates correctness discrimination and diversity-aware reward functions. Result: Extensive experiments on MathVista's minitest and Math-V benchmarks show that Qwen-VL-DP outperforms previous base MLLMs in both accuracy and generative diversity. Conclusion: Incorporating diverse reasoning perspectives and reflective reasoning significantly improves the performance of multimodal mathematical reasoning models. Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. 
Extensive experiments on MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.
[22] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Wencheng Zhang,Shiqin Qiao,Lingjie Luo,Yinfeng Li,Chuanyang Zheng,Qian Xu,Meng Li,Yong Gui,Yijun He,Jianing Qiu,Jindong Hong,Jiankai Sun
Main category: cs.CL
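TL;DR: This paper proposes SynapseRoute, a dynamic routing framework that intelligently selects between thinking and non-thinking modes, substantially reducing inference cost and time while preserving accuracy.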
Details
Motivation: Selecting an appropriate large language model for practical applications requires balancing performance against operational cost, and reasoning-capable models have further widened the gap between the high-cost thinking mode and the low-cost non-thinking mode. Method: The authors propose SynapseRoute, a machine learning-based dynamic routing framework that assigns each input query to either the thinking or the non-thinking mode. Result: Experiments show that SynapseRoute improves overall accuracy over the thinking mode alone (0.8390 vs. 0.8272) while reducing inference time by 36.8% and token consumption by 39.66%. Conclusion: SynapseRoute effectively balances accuracy and cost-efficiency; the dynamic routing framework selects the inference mode according to query complexity. Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
[23] Generalizing Verifiable Instruction Following
Valentina Pyatkin,Saumya Malik,Victoria Graf,Hamish Ivison,Shengyi Huang,Pradeep Dasigi,Nathan Lambert,Hannaneh Hajishirzi
Main category: cs.CL
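TL;DR: This paper introduces a new evaluation benchmark, IFBench, and a training method based on reinforcement learning with verifiable rewards (RLVR) to improve language models' ability to follow instructions under complex, unseen output constraints, addressing the poor generalization of existing models.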
Details
Motivation: Language models interacting with users must strictly follow the output constraints given in user instructions (for example, answering only "yes" or "no", or repeating a word several times). Most current models perform poorly on this task, however, and generalize badly to unseen constraints. The authors therefore aim to develop a more effective evaluation framework and training method. Method: The authors build a new benchmark, IFBench, containing 58 diverse and challenging verifiable constraints; design constraint verification modules; and train models with reinforcement learning with verifiable rewards (RLVR) to improve generalization. They also provide 29 hand-annotated training constraints, verification functions, RLVR training prompts, and code. Result: Conventional training methods perform poorly on new constraints, whereas RLVR-based training significantly improves accuracy on IFBench. The released IFBench benchmark and accompanying resources support follow-up research. Conclusion: Current language models overfit when following precise instructions and struggle to generalize to unseen output constraints, but IFBench together with reinforcement learning with verifiable rewards (RLVR) can effectively improve instruction following under new constraints. Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
[24] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
Almog Hilel,Idan Shenfeld,Leshem Choshen,Jacob Andreas
Main category: cs.CL
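TL;DR: An attacker can persistently alter a language model's knowledge and behavior merely by providing prompts and feedback, revealing a new property of preference tuning and a new attack mechanism against language models.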
Details
Motivation: To identify and describe a new vulnerability in language models trained with user feedback, one that lets an attacker exert fine-grained control over model behavior through limited interaction. Method: The attacker prompts the language model to stochastically output either a "poisoned" or a benign response, then uses feedback signals to influence subsequent preference tuning, raising the probability that the model produces poisoned responses. Result: Experiments show that the attack can insert new factual knowledge, modify code generation patterns to introduce security flaws, and inject fake financial news. Conclusion: The study identifies a new property of language model preference tuning and presents a new user-feedback-based attack on language models, extending prior work on pretraining-time data poisoning and deployment-time prompt injection. Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
[25] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Purbesh Mitra,Sennur Ulukus
Main category: cs.CL
TL;DR: MOTIF: Modular Thinking via Reinforcement Finetuning improves LLM reasoning by enabling multi-round thinking beyond context limits, showing better performance with less data.
Details
Motivation: LLMs have limited context size, restricting their ability to generate long reasoning sequences. Existing methods like GRPO are also constrained by this limit, necessitating a new approach for extended reasoning. Method: The authors proposed MOTIF, a reinforcement learning (RL) training method that enables multi-round modular thinking. They trained Qwen2.5-3B-Instruct on the GSM8K dataset using parameter-efficient fine-tuning and evaluated its performance on MATH500 and AIME2024 benchmarks. Result: MOTIF achieved 3.8% and 3.3% improvements over vanilla GRPO-based training on MATH500 and AIME2024 benchmarks, respectively, using only 15% of the samples. Conclusion: MOTIF improves the reasoning capabilities of LLMs beyond context size limitations through modular thinking and reinforcement learning, achieving better accuracy with sample efficiency. Abstract: Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. 
Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
[26] Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping
Main category: cs.CL
TL;DR: The paper argues for moving from multiple choice benchmarks to answer matching for evaluating language models, as the latter aligns better with human grading and provides more accurate model rankings.
Details
Motivation: Multiple choice benchmarks have limitations due to shortcuts that don't reflect true model understanding; generative evaluation offers a more accurate alternative. Method: Annotated MMLU-Pro and GPQA-Diamond datasets to obtain human grading data; compared evaluation approaches' agreement with human grading. Result: Answer matching using recent models (even small ones) achieved agreement levels comparable to inter-annotator agreement; rankings of some models changed significantly when using this method. Conclusion: Answer matching is a better evaluation method compared to multiple choice and LLM-as-a-judge without reference answers, as it achieves near-perfect agreement with human grading. Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. 
In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
cs.CV [Back]
[27] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma
Main category: cs.CV
TL;DR: This paper reviews the application of large language models (LLMs) and vision-language models (VLMs) in crash detection from video feeds, offering a structured overview of methods, datasets, and benchmarks while highlighting future research directions.
Details
Motivation: The motivation behind this study is the critical importance of crash detection in intelligent transportation systems and the transformative potential of recent advancements in LLMs and VLMs to improve multimodal information processing, reasoning, and summarization for this task. Method: The paper employs a survey methodology, analyzing recent methods that use LLMs for crash detection from video data. It presents a structured taxonomy of fusion strategies, summarizes key datasets, analyzes model architectures, and compares performance benchmarks. Result: The paper provides a comprehensive review of recent methods for crash detection using LLMs, including a taxonomy of fusion strategies, summaries of relevant datasets, analysis of model architectures, and comparisons of performance benchmarks. Conclusion: The paper concludes that leveraging large language models (LLMs) and vision-language models (VLMs) offers a promising foundation for future research in crash detection from video data, providing a structured taxonomy of fusion strategies and identifying ongoing challenges and opportunities. Abstract: Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.
[28] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning
Zijie Cai,Christopher Metzler
Main category: cs.CV
TL;DR: This paper evaluates and improves monocular metric depth estimation for underwater environments through domain adaptation and synthetic training data.
Details
Motivation: Monocular depth estimation faces challenges in underwater environments due to light attenuation, scattering, color distortion, turbidity, and lack of high-quality ground-truth data. Method: Evaluated state-of-the-art models on real-world underwater datasets (e.g., FLSea, SQUID) and fine-tuned Depth Anything V2 using a synthetic underwater variant of Hypersim dataset generated with a physically based underwater image formation model. Result: Large-scale models trained on terrestrial data perform poorly underwater. The fine-tuned Depth Anything V2 model showed improved performance across all benchmarks compared to baselines trained only on clean in-air data. Conclusion: The study concludes that domain adaptation and scale-aware supervision are crucial for robust and generalizable metric depth predictions in underwater environments. Abstract: Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. 
We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.
[29] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning
Xiao Wang,Jingtao Jiang,Qiang Chen,Lan Chen,Lin Zhu,Yaowei Wang,Yonghong Tian,Jin Tang
Main category: cs.CV
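TL;DR: This paper proposes ESTR-CoT, a new framework based on chain-of-thought reasoning that significantly improves the performance and interpretability of event stream scene text recognition.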
Details
Motivation: Existing event stream based scene text recognition methods suffer from insufficient interpretability and weak contextual logical reasoning; the authors propose a chain-of-thought reasoning framework to address these problems. Method: The study uses the EVA-CLIP vision encoder to transform the input event stream into tokens and a Q-former to align the vision tokens with the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process. Result: Extensive experiments on three event stream STR benchmark datasets (EventSTR, WordArt*, and IC15*) validate the effectiveness and interpretability of the proposed framework. Conclusion: ESTR-CoT is a chain-of-thought reasoning framework for event stream scene text recognition that improves both performance and interpretability. Abstract: Event stream based scene text recognition is a newly arising research topic that performs better than recognition with the widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works either adopt an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via three-stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. 
The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.
[30] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach
Elena Ryumina,Maxim Markitantov,Alexandr Axyonov,Dmitry Ryumin,Mikhail Dolgushin,Alexey Karpov
Main category: cs.CV
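TL;DR: This study proposes a new zero-shot multimodal approach to compound expression recognition that combines multiple modalities and uses new modules for feature fusion and emotion output, with performance close to that of supervised methods.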
Details
Motivation: Compound Expression Recognition (CER) aims to detect complex emotional states formed by combinations of basic emotions, but existing methods usually rely on task-specific training data, so a more flexible and effective approach is needed. Method: The approach combines six heterogeneous modalities (static and dynamic facial expressions, scene and label matching, scene context, audio, and text) and introduces a Multi-Head Probability Fusion (MHPF) module and a Compound Expressions (CE) transformation module. Result: The approach achieves F1 scores of 46.95% on AffWild2, 49.02% on AFEW, and 34.85% on C-EXPR-DB, showing that it is competitive under zero-shot testing. Conclusion: The proposed zero-shot multimodal approach performs well on compound expression recognition (CER) and is comparable to supervised methods without domain adaptation on target data. Abstract: Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.
[31] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Takuro Kawada,Shunsuke Kitada,Sota Nemoto,Hitoshi Iyatomi
Main category: cs.CV
TL;DR: This paper introduces SciGA-145k, a large dataset for studying and recommending Graphical Abstracts, and proposes new tasks and metrics to improve visual scientific communication.
Details
Motivation: Graphical Abstracts (GAs) are important for visual scientific communication, but designing effective GAs is challenging due to the need for advanced visualization skills. Current research lacks exploration on how visual materials can better support scientific communication. Method: The researchers introduced SciGA-145k, a large-scale dataset of 145,000 papers and 1.14 million figures, and proposed two GA-related recommendation tasks: Intra-GA and Inter-GA. They also developed a new evaluation metric called Confidence Adjusted top-1 ground truth Ratio (CAR). Result: The creation of SciGA-145k enables research into GA selection, recommendation, and automated generation. Two recommendation tasks were defined—Intra-GA and Inter-GA—and baseline models were provided. The new CAR metric allows for more detailed analysis of recommendation performance. Conclusion: SciGA-145k lays a foundation for improving visual scientific communication and advancing AI in the field by addressing challenges in GA design through recommendation tasks and a new metric, CAR. Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. 
As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.
[32] Understanding Trade offs When Conditioning Synthetic Data
Brandon Trabucco,Qasim Wani,Benjamin Pikus,Vasu Sharma
Main category: cs.CV
TL;DR: This paper explores synthetic data generation using diffusion models for object detection, finding that layout-based conditioning outperforms prompt-based methods when visual concept diversity is high, leading to significant improvements in detection accuracy.
Details
Motivation: The motivation is to address the challenge of learning robust object detectors with limited training data in industrial vision systems, exploring synthetic data as a solution due to its potential for data-efficient visual inspection. Method: The authors studied eighty diverse visual concepts from four standard object detection benchmarks and compared two diffusion model conditioning strategies: prompt-based and layout-based. Result: When layout cues match the full training distribution, synthetic data improves mean average precision by an average of thirty-four percent and up to one hundred seventy-seven percent compared to using real data alone. Conclusion: The study concludes that layout-based conditioning becomes superior when the diversity of visual concepts increases, and synthetic data can significantly improve detection performance compared to real data alone. Abstract: Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. 
When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.
[33] High-Fidelity Differential-information Driven Binary Vision Transformer
Tian Gao,Zhiyuan Zhang,Kaijie Yin,Xu-Cheng Zhong,Hui Kong
Main category: cs.CV
TL;DR: The study introduces DIDB-ViT, a novel binary vision transformer that improves performance over existing quantization methods by incorporating differential information, frequency decomposition, and enhanced activation functions.
Details
Motivation: To address the trade-off between high computational/storage demands of vision transformers and edge-device constraints while minimizing performance degradation from binarization. Method: Designing an informative attention module with differential information, applying frequency decomposition using discrete Haar wavelet, integrating similarities across different frequencies, and introducing an improved RPReLU activation function. Result: DIDB-ViT achieves better performance compared to existing binary ViT methods, preserving model accuracy and similarity calculations without relying on full-precision modules. Conclusion: DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance. Abstract: The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model's representational capacity. 
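As a rough illustration of the frequency-decomposition idea mentioned in this abstract, the sketch below applies a single-level orthonormal Haar transform to two small vectors and computes their similarity separately in the low- and high-frequency bands. It is not the DIDB-ViT implementation: the function names (`haar_split`, `banded_similarity`), the one-dimensional real-valued setting, and the example vectors are all assumptions made for illustration. Because the transform is orthonormal, the two band similarities add up to the ordinary dot product, so the split itself loses no similarity information.

```python
import math


def haar_split(x):
    """Single-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums and scaled pairwise differences.
    """
    if len(x) % 2 != 0:
        raise ValueError("sequence length must be even")
    scale = 1.0 / math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) * scale for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * scale for i in range(0, len(x), 2)]
    return low, high


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def banded_similarity(q, k):
    """Dot-product similarity of q and k, computed per Haar frequency band."""
    q_low, q_high = haar_split(q)
    k_low, k_high = haar_split(k)
    return dot(q_low, k_low), dot(q_high, k_high)


# Hypothetical +/-1 vectors standing in for binarized query/key rows.
q = [1.0, -1.0, 1.0, 1.0]
k = [1.0, -1.0, -1.0, 1.0]
low_sim, high_sim = banded_similarity(q, k)

# The band similarities sum to the full similarity (up to rounding).
print(low_sim + high_sim, dot(q, k))
```

How the per-band similarities are weighted or integrated in the actual model is not specified in the abstract, so the plain sum used here is only one possible choice.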
Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
[34] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model
Jiangxia Chen,Tongyuan Huang,Ke Song
Main category: cs.CV
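TL;DR: This paper proposes FMOcc, a new 3D semantic occupancy prediction method that uses flow matching and tri-perspective view feature processing to address prediction difficulties in occluded and distant scenes, achieving strong performance.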
Details
Motivation: To address the loss of prediction accuracy caused by the limited number of image frames and redundancy in 3D space, especially the poor performance in occluded and distant scenes. Method: The authors design a feature refinement module based on a flow matching model (FMSSM), a TPV SSM layer, and a Plane Selective SSM (PS3M), and introduce a Mask Training (MT) method to improve robustness. Result: Experiments show that FMOcc outperforms existing state-of-the-art methods on the Occ3D-nuScenes and OpenOcc datasets, achieving notable RayIoU and mIoU scores with two-frame input while keeping inference memory and time low. Conclusion: By combining flow matching with the tri-perspective view approach, FMOcc improves the accuracy and efficiency of 3D semantic occupancy prediction, especially in occluded and distant scenes. Abstract: 3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which needs additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with flow matching selective state space model for few-frame 3D occupancy prediction. Firstly, to generate missing features, we designed a feature refinement module based on a flow matching model, which is called Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. Our FMOcc with two frame input achieves notable scores of 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, 42.6% RayIoU on OpenOcc with 5.4 G inference memory and 330ms inference time.
[35] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement
Zeyu Lei,Hongyuan Yu,Jinlin Wu,Zhen Chen
Main category: cs.CV
TL;DR: This paper proposes SurgVisAgent, an intelligent surgical vision agent that addresses the limitations of current single-task enhancement algorithms, offering a versatile and effective solution for diverse surgical scenarios.
Details
Motivation: Current enhancement algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world surgical situations. There is a need for a more versatile and effective solution. Method: The paper introduces SurgVisAgent, an end-to-end intelligent surgical vision agent based on multimodal large language models (MLLMs) that dynamically identifies distortion categories and severity levels in endoscopic images. It uses a prior model for domain-specific knowledge, in-context few-shot learning, and chain-of-thought reasoning for customized enhancements. Result: Extensive experiments on a comprehensive benchmark simulating real-world surgical distortions show that SurgVisAgent surpasses traditional single-task models. Conclusion: SurgVisAgent serves as a unified solution for surgical assistance by overcoming the limitations of traditional single-task models. Abstract: Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. 
Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.
[36] Multi-Label Classification Framework for Hurricane Damage Assessment
Zhangding Liu,Neda Mohammadi,John E. Taylor
Main category: cs.CV
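TL;DR: This study introduces a novel multi-label classification framework that combines a ResNet feature extraction module with a class-specific attention mechanism to improve the efficiency and accuracy of hurricane damage assessment.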
Details
Motivation: Traditional single-label classification methods cannot capture the diversity of complex post-hurricane damage, so a new multi-label classification framework is needed. Method: The approach integrates a ResNet-based feature extraction module and a class-specific attention mechanism to identify multiple damage types within a single image. Result: On the Rescuenet Hurricane Michael dataset, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. Conclusion: The study proposes a new multi-label classification framework for assessing post-hurricane damage that enhances damage assessment and enables more targeted and efficient disaster response. Abstract: Hurricanes cause widespread destruction, resulting in diverse damage types and severities that require timely and accurate assessment for effective disaster response. While traditional single-label classification methods fall short of capturing the complexity of post-hurricane damage, this study introduces a novel multi-label classification framework for assessing damage using aerial imagery. The proposed approach integrates a feature extraction module based on ResNet and a class-specific attention mechanism to identify multiple damage types within a single image. Using the Rescuenet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. This framework enhances post-hurricane damage assessment, enabling more targeted and efficient disaster response and contributing to future strategies for disaster mitigation and resilience. This paper has been accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025), and the camera-ready version will appear in the official conference proceedings.
[37] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation
Yuxiang Zhang,Wei Li,Wen Jia,Mengmeng Zhang,Ran Tao,Shunlin Liang
Main category: cs.CV
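TL;DR: This paper proposes BiDA, a new bi-directional domain adaptation framework that combines a triple-branch transformer architecture with a coupled multi-head cross-attention mechanism to improve cross-domain hyperspectral image classification.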
Details
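Motivation: Hyperspectral remote sensing enables the extraction of fine-grained land cover classes. However, the satellite or airborne images used for training and testing usually come from different regions or times, so the same class shows significant spectral shifts across scenes. A method is therefore needed that extracts both domain-invariant features and domain-specific information to enhance adaptability and separability for the target scene. Method: The paper proposes BiDA, a bi-directional domain adaptation framework for cross-domain hyperspectral image classification. It designs a triple-branch transformer architecture (source branch, target branch, and coupled branch) with a semantic tokenizer as the backbone, develops a coupled multi-head cross-attention mechanism and a bi-directional distillation loss, and proposes an adaptive reinforcement strategy to guide the model toward specific generalized feature extraction. Result: BiDA performs significantly better than existing state-of-the-art domain adaptation methods on cross-temporal/scene airborne and satellite datasets, and is 3%~5% higher on the cross-temporal tree species classification task. Conclusion: Experimental results show that BiDA significantly outperforms some state-of-the-art domain adaptation methods on cross-temporal/scene airborne and satellite datasets. On the cross-temporal tree species classification task, BiDA is 3%~5% higher than the most advanced method. The code is available on GitHub. Abstract: Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of source and target domains, a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes in noise condition. 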
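To make the notion of cross-attention between two domains more concrete, the sketch below implements generic single-head scaled dot-product cross-attention and runs it in both directions between a small set of "source" tokens and a small set of "target" tokens. This is a textbook mechanism shown under simplifying assumptions, not the paper's CMCA module: there are no learned projections and no multiple heads, and the function names and token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query attends over all keys and returns a weighted average of values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical token features for the two domains.
source_tokens = [[1.0, 0.0], [0.0, 1.0]]
target_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Source tokens query the target domain, and vice versa.
src_to_tgt = cross_attention(source_tokens, target_tokens, target_tokens)
tgt_to_src = cross_attention(target_tokens, source_tokens, source_tokens)

print(len(src_to_tgt), len(tgt_to_src))
```

Each output row is a convex combination of the other domain's tokens, which is the sense in which cross-attention lets one branch read features from the other.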
Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is more than 3%~5% higher than the most advanced method. The codes will be available from the website: https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.
[38] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement
Fanghai Yi,Zehong Zheng,Zexiao Liang,Yihang Dong,Xiyang Fang,Wangyu Wu,Xuhang Chen
Main category: cs.CV
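TL;DR: This paper proposes MAC-Lookup, an underwater image enhancement model that combines color correction and adaptive enhancement to significantly improve the quality and visibility of underwater images.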
Details
Motivation: Underwater images suffer from visibility and color problems; traditional methods perform poorly, and deep learning lacks high-quality datasets. Method: The paper introduces the Multi-Axis Conditional Lookup (MAC-Lookup) model, which combines Conditional 3D Lookup Table Color Correction (CLTCC) with Multi-Axis Adaptive Enhancement (MAAE). Result: Experiments show that MAC-Lookup restores details and colors better than existing methods. Conclusion: MAC-Lookup performs well at underwater image enhancement, improving color accuracy, sharpness, and contrast while preventing over-enhancement and saturation. Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.
[39] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Feizhen Huang,Yu Wu,Yutian Lin,Bo Du
Main category: cs.CV
TL;DR: This paper introduces a self-distillation approach for Video-to-Audio generation that incorporates cinematic language understanding, significantly improving performance when visual information is partial and on the VGGSound dataset.
Details
Motivation: Current Video-to-Audio (V2A) generation methods overlook cinematic language, which leads to reduced performance when Foley targets are only partially visible. This work addresses this gap by incorporating cinematic language understanding into V2A models. Method: The method uses a self-distillation approach where a student model learns to align video features of training pairs with the same audio-visual correspondences by simulating cinematic language variations. Result: The method achieves impressive improvements under partial visibility across all evaluation metrics and enhances performance on the large-scale V2A dataset, VGGSound. Conclusion: The proposed self-distillation approach successfully extends V2A models to handle cinematic language scenarios, achieving improvements in partial visibility situations and enhancing performance on the VGGSound dataset. Abstract: Video-to-Audio (V2A) Generation achieves significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
[40] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models
Juntao Liu,Liqiang Niu,Wenchao Chen,Jie Zhou,Fandong Meng
Main category: cs.CV
TL;DR: LaCo is a novel framework for visual token compression in MLLMs that enhances efficiency and throughput during both training and inference.
Details
Motivation: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. Method: The proposed LaCo framework includes a layer-wise pixel-shuffle mechanism and a residual learning architecture with non-parametric shortcuts. Result: Experiments show that LaCo outperforms all existing methods in compressing tokens in the intermediate layers of the vision encoder. It improves training efficiency beyond 20% and inference throughput over 15%. Conclusion: LaCo is an effective visual token compression framework that improves training efficiency and inference throughput while maintaining strong performance. Abstract: Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
[41] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization
De Cheng,Zhipeng Xu,Xinyang Jiang,Dongsheng Li,Nannan Wang,Xinbo Gao
Main category: cs.CV
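TL;DR: This paper proposes a new language-prompt-guided visual prompt tuning framework built on visual foundation models and introduces the WERA method to improve domain generalization, outperforming existing methods on several major DG datasets.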
Details
Motivation: Although pre-trained visual foundation models such as CLIP show potential for improving the generalization of deep learning models, designing effective prompts that disentangle invariant features across domains remains a challenge. The paper therefore explores using the controllability and flexibility of language prompts to address it. Method: The paper proposes a language-prompt-guided visual prompt tuning framework based on visual foundation models (VFMs) and introduces WERA, which increases source domain diversity by combining abstract prompts with stylized image augmentations. Result: The framework automatically disentangles the text prompt using a large language model (LLM) and learns domain-invariant visual representations guided by the disentangled text features. WERA further improves the consistency and diversity of the visual representations. Conclusion: Experimental results show that the method outperforms existing state-of-the-art DG methods on several major DG datasets. Abstract: Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. 
Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.
[42] ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation
Hanbo Bi,Yulong Xu,Ya Li,Yongqiang Mao,Boyuan Tong,Chongyang Li,Chunbo Lang,Wenhui Diao,Hongqi Wang,Yingchao Feng,Xian Sun
Main category: cs.CV
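TL;DR: This paper proposes ViRefSAM to address the need for manual prompts and the lack of domain adaptability when applying the Segment Anything Model (SAM) to remote sensing images, enabling automatic and efficient segmentation from a few reference images.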
Details
Motivation: To address the two main challenges of applying SAM to remote sensing images: the inefficiency of manually constructing prompts and the lack of domain adaptability. Method: The paper proposes ViRefSAM, a new framework with a Visual Contextual Prompt Encoder and a Dynamic Target Alignment Adapter that segments class-consistent objects without manual prompts. Result: Experiments on three few-shot segmentation benchmarks (iSAID-5^i, LoveDA-2^i, and COCO-20^i) show that ViRefSAM effectively improves segmentation of unseen classes and outperforms existing methods. Conclusion: ViRefSAM achieves accurate automatic segmentation of remote sensing images using a few reference images, addressing the problems of manual prompt construction and domain adaptability. Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. 
Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5^i, LoveDA-2^i, and COCO-20^i, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.
[43] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation
Yunhan Yang,Shuo Chen,Yukun Huang,Xiaoyang Wu,Yuan-Chen Guo,Edmund Y. Lam,Hengshuang Zhao,Tong He,Xihui Liu
Main category: cs.CV
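TL;DR: DreamComposer++ is an improved framework that incorporates multi-view conditions to enhance the controllable novel view generation of current view-aware diffusion models.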
Details
Motivation: Existing methods for generating novel views from a single image struggle with controllability, so a new method that can exploit multi-view information is needed. Method: DreamComposer++ uses a view-aware 3D lifting module to extract 3D representations of an object from different views, aggregates and renders these representations into latent features of the target view through a multi-view feature fusion module, and finally combines them with pre-trained image or video diffusion models for novel view synthesis. Result: Experiments show that DreamComposer++ integrates seamlessly with recent view-aware diffusion models and significantly enhances their ability to generate controllable novel views from multi-view conditions. Conclusion: DreamComposer++ offers an effective way to improve novel view generation from a single image, facilitating controllable 3D object reconstruction and a wide range of applications. Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.
[44] Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images
Haoxuan Li,Chenxu Wei,Haodong Wang,Xiaomeng Hu,Boyuan An,Lingyan Ran,Baosen Zhang,Jin Jin,Omirzhan Taukebayev,Amirkhan Temirbayev,Junrui Liu,Xiuwei Zhang
Main category: cs.CV
TL;DR: This paper proposes Flow-CDNet, a novel change detection framework that effectively identifies both slow and fast changes in bitemporal images using a dual-branch architecture.
Details
Motivation: Detecting both slow and fast changes in bitemporal images is crucial for identifying early signs of potential hazards in critical areas like slopes and dams. Method: Flow-CDNet uses a pyramid structure for multi-scale displacement extraction and combines a ResNet-based network with optical flow output for fast change detection. A new loss function and evaluation metric (FEPE) are also introduced. Result: Experiments on the Flow-Change dataset showed superior performance compared to existing methods, with ablation studies confirming the mutual enhancement of the two branches. Conclusion: The proposed Flow-CDNet effectively detects both slow and fast changes by combining an optical flow branch and a binary change detection branch, outperforming existing methods. Abstract: Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: optical flow branch and binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second one combines a ResNet-based network with the optical flow branch's output to generate fast change outputs. Subsequently, to supervise and evaluate this new change detection framework, a self-built change detection dataset Flow-Change, a loss function combining binary tversky loss and L2 norm loss, along with a new evaluation metric called FEPE are designed. 
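As a rough illustration of the combined supervision described above, the sketch below adds a binary Tversky loss on the change mask to an L2 loss on the predicted flow. The function names, the weights `alpha`, `beta`, and `lam`, and the use of a mean squared flow error are assumptions for illustration; the abstract does not give the authors' exact formulation.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Binary Tversky loss: 1 - TP / (TP + alpha*FP + beta*FN)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    tp = np.sum(pred * target)            # soft true positives
    fp = np.sum(pred * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - pred) * target)    # soft false negatives
    return float(1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps))

def l2_flow_loss(flow_pred, flow_gt):
    """Mean squared L2 norm of the per-pixel flow error; arrays shaped (H, W, 2)."""
    diff = np.asarray(flow_pred, dtype=float) - np.asarray(flow_gt, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def combined_loss(mask_pred, mask_gt, flow_pred, flow_gt, lam=1.0):
    """Hypothetical weighting: Tversky term plus lam times the flow term."""
    return tversky_loss(mask_pred, mask_gt) + lam * l2_flow_loss(flow_pred, flow_gt)
```

Setting `alpha` below `beta` penalizes missed changes more than false alarms, which is the usual reason to prefer a Tversky loss when changed pixels are rare.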
Quantitative experiments conducted on the Flow-Change dataset demonstrated that our approach outperforms existing methods. Furthermore, ablation experiments verified that the two branches can promote each other to enhance the detection performance.
[45] LMPNet for Weakly-supervised Keypoint Discovery
Pei Guo,Ryan Farrell
Main category: cs.CV
TL;DR: This paper introduces LMPNet, an interpretable deep learning framework that uses a novel leaky max pooling layer and other techniques to discover semantic object keypoints in a weakly-supervised manner, achieving performance comparable to supervised models.
Details
Motivation: The authors aim to explore the task of weakly-supervised semantic object keypoint discovery using only category labels, with a focus on interpretable deep learning methods. Method: The paper proposes a novel leaky max pooling (LMP) layer, a selection strategy for consistent filter activations, attention mask-out, and a learnable clustering layer to transform intermediate layer filters into keypoint detectors. Result: The proposed LMPNet model successfully discovers semantic keypoints robust to object pose and demonstrates strong prediction accuracy, comparable to supervised pose estimation models. Conclusion: LMPNet is a highly interpretable model that can automatically discover semantic keypoints robust to object pose and achieves strong prediction accuracy comparable to supervised models. Abstract: In this work, we explore the task of semantic object keypoint discovery weakly-supervised by only category labels. This is achieved by transforming discriminatively-trained intermediate layer filters into keypoint detectors. We begin by identifying three preferred characteristics of keypoint detectors: (i) spatially sparse activations, (ii) consistency and (iii) diversity. Instead of relying on hand-crafted loss terms, a novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn "non-repeatable local patterns" that are well aligned with object keypoints. Informed by visualizations, a simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. 
The final model, named LMPNet, is highly interpretable in that it directly manipulates network filters to detect predefined concepts. Our experiments show that LMPNet can (i) automatically discover semantic keypoints that are robust to object pose and (ii) achieve strong prediction accuracy comparable to a supervised pose estimation model.
[46] Perception Activator: An intuitive and portable framework for brain cognitive exploration
Le Xu,Qi Zhang,Qixian Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao
Main category: cs.CV
TL;DR: This paper introduces a novel framework that leverages fMRI data through cross-attention mechanisms to improve object detection and segmentation, revealing that fMRI contains underutilized semantic and spatial cues.
Details
Motivation: To better understand the brain's visual perception patterns and how current decoding models process semantic objects, addressing the limitation of existing methods in achieving sufficient semantic alignment for accurate reconstruction. Method: An experimental framework was developed using fMRI representations as intervention conditions, which were injected into multi-scale image features via cross-attention. The impact was assessed by comparing downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Result: Incorporating fMRI signals enhanced the accuracy of downstream detection and segmentation tasks, demonstrating that fMRI contains valuable semantic and spatial information not yet fully utilized by current models. Conclusion: The study concludes that fMRI signals contain rich multi-object semantic cues and coarse spatial localization information that current models have yet to fully exploit or integrate. Abstract: Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. 
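The injection step described in the Method summary, where fMRI representations condition image features through cross-attention, might look roughly like this at a single feature scale. The projection matrices `w_q`, `w_k`, `w_v`, the token shapes, and the residual addition are assumptions for illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def inject_fmri(image_tokens, fmri_tokens, w_q, w_k, w_v):
    """Image tokens query fMRI tokens; the attended values are added back residually.

    image_tokens: (N_img, d_img), fmri_tokens: (N_fmri, d_fmri)
    w_q: (d_img, d), w_k: (d_fmri, d), w_v: (d_fmri, d_img)
    """
    q = image_tokens @ w_q
    k = fmri_tokens @ w_k
    v = fmri_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_img, N_fmri)
    return image_tokens + attn @ v

# Usage with random placeholders standing in for real features.
rng = np.random.default_rng(0)
img = rng.normal(size=(6, 8))
fmri = rng.normal(size=(4, 5))
out = inject_fmri(img, fmri,
                  rng.normal(size=(8, 16)),
                  rng.normal(size=(5, 16)),
                  rng.normal(size=(5, 8)))
```

Running the same downstream detector with and without this step mirrors the with/without fMRI comparison in the summary; for multi-scale features the function would be applied once per scale.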
By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.
[47] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation
JaeHyuck Choi,MinJun Kim,JeHyeong Hong
Main category: cs.CV
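TL;DR: This paper proposes MAGIC, a few-shot anomaly generation method that simultaneously keeps the normal background intact, accurately generates anomalies at specified locations, and ensures their appearance is diverse and plausible; it performs well on practical industrial inspection tasks.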
Details
Motivation: Existing diffusion-based methods cannot simultaneously preserve the normal background, precisely match the anomaly mask, and generate diverse, plausible anomalies, so a better method is needed. Method: MAGIC uses a Stable Diffusion-based inpainting backbone and introduces multi-level perturbation strategies and a context-aware alignment module, comprising Gaussian prompt-level perturbation, mask-guided spatial noise injection, and semantic mask alignment. Result: Under a unified evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods on downstream anomaly tasks. Conclusion: MAGIC preserves the normal background, keeps the synthesized anomaly strictly aligned with the supplied mask, and uses a context-aware mask alignment module to place anomalies in semantically plausible locations, thereby outperforming existing state-of-the-art methods. Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC (Mask-guided inpainting with multi-level perturbations and Context-aware alignment) to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. 
Under a consistent evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.
[48] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
Zecheng Zhao,Selena Song,Tong Chen,Zhi Chen,Shazia Sadiq,Yadan Luo
Main category: cs.CV
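TL;DR: This paper presents SynTVA, a new dataset and benchmark for evaluating the utility of synthetic videos; using multi-dimensional annotation and automatic evaluation, it explores the value of synthetic videos for text-to-video retrieval.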
Details
Motivation: Existing evaluation metrics for text-to-video synthesis focus mainly on visual quality and temporal consistency and offer little insight into performance on downstream tasks such as text-to-video retrieval. A new benchmark is needed to assess the utility of synthetic videos more comprehensively. Method: The authors develop the SynTVA dataset and benchmark, generating synthetic videos from MSRVTT user queries and annotating them along four semantic alignment dimensions. They also build an automatic evaluation framework to estimate alignment quality and analyze its predictive power for downstream tasks. Result: The SynTVA dataset has been built and publicly released; results show that it can effectively support retrieval model training and that synthetic video quality is closely related to downstream task performance. The automatic evaluation tool also shows potential for scaling. Conclusion: SynTVA is a valuable resource not only for benchmarking but also for dataset augmentation, improving text-to-video retrieval performance. The utility of synthetic videos is validated, and directions for future research are provided. Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.
[49] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
Nina Konovalova,Maxim Nikolaev,Andrey Kuznetsov,Aibek Alanov
Main category: cs.CV
TL;DR: InnerControl enhances spatial control in diffusion models by aligning intermediate generation stages, outperforming previous methods like ControlNet and ControlNet++.
Details
Motivation: Despite progress in text-to-image diffusion models, precise spatial control remains challenging. Existing methods like ControlNet and ControlNet++ focus only on final denoising steps, neglecting intermediate stages and limiting effectiveness. Method: InnerControl introduces lightweight convolutional probes to reconstruct input control signals from intermediate UNet features at every denoising step, using an alignment loss to minimize discrepancies between predicted and target conditions throughout the diffusion process. Result: InnerControl enhances control fidelity and generation quality by efficiently extracting control signals even from highly noisy latents, leading to improved spatial consistency across diverse conditioning methods such as edges and depth. Conclusion: InnerControl improves spatial consistency in text-to-image diffusion models by enforcing alignment across all diffusion steps, achieving state-of-the-art performance when combined with existing techniques like ControlNet++. Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. 
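To make the probe-based alignment idea concrete, here is a minimal sketch in which a 1x1 convolutional probe maps intermediate features to a predicted control map and the loss averages the error over denoising steps. The 1x1 kernel, the single probe shared across steps, the mean squared error, and all names are assumptions chosen for brevity; the abstract only states that lightweight convolutional probes reconstruct the control signal at every step.

```python
import numpy as np

def probe_predict(features, weight, bias):
    """1x1 convolutional probe: features (C, H, W) -> predicted control (K, H, W)."""
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

def alignment_loss(features_per_step, control, weight, bias):
    """Mean squared error between probe output and target control, averaged over steps."""
    losses = [np.mean((probe_predict(f, weight, bias) - control) ** 2)
              for f in features_per_step]
    return float(np.mean(losses))
```

In training, `features_per_step` would hold the UNet features collected at each denoising step, so the penalty applies throughout the diffusion process rather than only at the final steps.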
By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
[50] Neural Network-based Study for Rice Leaf Disease Recognition and Classification: A Comparative Analysis Between Feature-based Model and Direct Imaging Model
Farida Siddiqi Prity,Mirza Raquib,Saydul Akbar Murad,Md. Jubayar Alam Rafi,Md. Khairul Bashar Bhuiyan,Anupam Kumar Bairagi
Main category: cs.CV
TL;DR: This research compares two models for detecting rice leaf diseases, finding that the Feature Analysis Detection Model (FADM) performs better than the Direct Image-Centric Detection Model (DICDM), offering promising results for improving rice farming productivity and sustainability.
Details
Motivation: Rice leaf diseases cause significant economic losses and reduce productivity, making early detection crucial. While Artificial Neural Networks (ANNs) are commonly used for image-based disease detection, there is a lack of comprehensive comparison between Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM). Method: This study compares two models for rice leaf disease classification: FADM, which uses Feature Extraction Algorithms (FEAs), Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM), and DICDM, which does not use FEAs. The evaluation is based on 10-fold Cross-Validation across multiple disease categories. Result: The experiments show that the Feature Analysis Detection Model (FADM) achieves higher performance in classifying rice leaf diseases compared to the Direct Image-Centric Detection Model (DICDM). Conclusion: The Feature Analysis Detection Model (FADM) outperforms the Direct Image-Centric Detection Model (DICDM) in classifying rice leaf diseases, showing great potential for improving crop health, reducing yield losses, and enhancing rice farming productivity and sustainability. Abstract: Rice leaf diseases significantly reduce productivity and cause economic losses, highlighting the need for early detection to enable effective management and improve yields. This study proposes Artificial Neural Network (ANN)-based image-processing techniques for timely classification and recognition of rice diseases. Despite the prevailing approach of directly inputting images of rice leaves into ANNs, there is a noticeable absence of thorough comparative analysis between the Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM), specifically when it comes to evaluating the effectiveness of Feature Extraction Algorithms (FEAs). 
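The 10-fold cross-validation named in the Method summary can be sketched as a simple index split. This version does no shuffling or class stratification, both of which are simplifying assumptions; the summary does not say how the folds were drawn.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Each model under comparison would be trained on `train_idx` and scored on `test_idx` for every fold, with the reported metric averaged across the ten folds.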
Hence, this research presents initial experiments on the Feature Analysis Detection Model, utilizing various image Feature Extraction Algorithms, Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM). The experiments are carried out on datasets encompassing bacterial leaf blight, brown spot, leaf blast, leaf scald, sheath blight rot, and healthy leaf, using the 10-fold cross-validation method. A Direct Image-Centric Detection Model is established without the use of any FEA, and classification performance is evaluated with several metrics. Finally, a thorough comparison is made between the Feature Analysis Detection Model and the Direct Image-Centric Detection Model in classifying rice leaf diseases. The results reveal that the highest performance is attained using the Feature Analysis Detection Model. The adoption of the proposed Feature Analysis Detection Model for detecting rice leaf diseases holds excellent potential for improving crop health, minimizing yield losses, and enhancing overall productivity and sustainability of rice farming.
[51] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection
Rafic Nader,Vincent L'Allinec,Romain Bourcier,Florent Autrusseau
Main category: cs.CV
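TL;DR: This paper presents a method for automatically detecting the bifurcations of the Circle of Willis, using a two-step neural network process to improve detection accuracy and efficiency.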
Details
Motivation: Intracranial aneurysms commonly occur in specific segments of the Circle of Willis (CoW), and accurately detecting these critical landmarks is essential for fast and effective diagnosis. Method: An object detection network first identifies regions of interest (ROIs) near the approximate landmark locations; a modified U-Net with deep supervision then precisely locates the bifurcations. Result: The method was evaluated on two cerebral magnetic resonance angiography (MRA) datasets, and the results show that it achieves the highest level of performance on the bifurcation detection task. Conclusion: The paper proposes a fully automated detection method based on a two-step neural network for identifying the bifurcations of the cerebral Circle of Willis (CoW); it effectively addresses missed detections caused by two landmarks being close together with similar visual characteristics, and it accounts for the anatomical variability of the CoW. Abstract: Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily at thirteen major arterial bifurcations. An accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural network process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is exploited to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our In-House dataset which had varying numbers of landmarks, and a public dataset with standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.
[52] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Xiangfeng Wang,Xiao Li,Yadong Wei,Xueyu Song,Yang Song,Xiaoqiang Xia,Fangrui Zeng,Zaiyi Chen,Liu Liu,Gu Xu,Tong Xu
Main category: cs.CV
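TL;DR: This paper proposes HIVE, a human-inspired automatic video editing framework based on multimodal narrative understanding, and builds a new benchmark dataset, DramaAD, to address the incoherent outputs that result when existing automatic video editing methods neglect visual context.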
Details
Motivation: With the rapid growth of online video content, especially on short-video platforms, demand is rising for efficient video editing techniques that can condense long videos into concise, engaging clips. Existing automatic editing methods rely mainly on textual cues from ASR transcripts and end-to-end segment selection, often neglecting rich visual context and producing incoherent outputs. Method: The paper proposes a human-inspired automatic video editing framework (HIVE) that uses multimodal narrative understanding to address the neglect of rich visual context in existing methods. The approach includes character extraction, dialogue analysis, and narrative summarization, and uses scene-level segmentation to decompose editing into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. Result: To facilitate research in this area, the authors introduce a new benchmark dataset, DramaAD, containing over 800 short drama episodes and 500 professionally edited advertisement clips. Conclusion: Experimental results show that the proposed HIVE framework outperforms existing baselines on both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatically edited and human-edited videos. Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
[53] Lightweight Shrimp Disease Detection Research Based on YOLOv8n
Fei Yuhuan,Wang Gengchen,Liu Fenghao,Zang Ran,Sun Xufei,Chang Hao
Main category: cs.CV
TL;DR: This paper presents a lightweight network architecture based on YOLOv8n for intelligent disease detection in shrimp aquaculture, achieving high accuracy and computational efficiency while reducing parameters by 32.3%. The model demonstrates robustness and a 4.1% increase in mAP@0.5 on the URPC2020 dataset.
Details
Motivation: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. Preventing disease transmission and enhancing intelligent detection efficiency in shrimp farming are the key motivations. Method: This paper proposes a lightweight network architecture based on YOLOv8n by designing the RLDD detection head and C2f-EMCM module, introducing an improved SegNext_Attention self-attention mechanism to enhance feature extraction capability. Result: The proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). It also outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Conclusion: The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture. Abstract: Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model's feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). 
Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model's robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.
[54] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Ziqi Miao,Yi Ding,Lijun Li,Jing Shao
Main category: cs.CV
TL;DR: The paper introduces VisCo Attack, a novel method to induce harmful responses from multimodal large language models using visual-centric jailbreak techniques.
Details
Motivation: Security vulnerabilities in the visual modality of MLLMs pose challenges in open-world environments; existing approaches lack realistic scenarios. Method: Proposed VisCo Attack uses four visual-focused strategies and dynamically generates auxiliary images to fabricate contextual dialogue for attacking MLLMs. Result: Toxicity score of 4.78 and ASR of 85% on MM-SafetyBench against GPT-4o, significantly higher than baseline results. Conclusion: VisCo Attack effectively triggers harmful responses from MLLMs by constructing a visual-centric jailbreak scenario, outperforming baseline methods. Abstract: With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. 
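The two figures quoted in the Result summary, Attack Success Rate (ASR) and toxicity score, are aggregate metrics over evaluated prompts. A minimal sketch of how such aggregates are typically computed follows; how a single attempt is judged successful and how toxicity is scored are not specified in this entry, so the boolean judgements and numeric scores are treated as given inputs (an assumption).

```python
def attack_success_rate(judgements):
    """Percentage of attempts that a judge flagged as successful."""
    if not judgements:
        raise ValueError("no judgements to aggregate")
    return 100.0 * sum(1 for j in judgements if j) / len(judgements)

def mean_toxicity(scores):
    """Average toxicity score over all evaluated responses."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(scores) / len(scores)
```

Because both numbers depend on the judge and the prompt set, they are only comparable between methods evaluated under the same protocol, as in the baseline comparison reported here.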
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which attains a toxicity score of 2.48 and an ASR of 22.2%. The code is available at https://github.com/Dtc7w3PQ/Visco-Attack.
[55] Holistic Tokenizer for Autoregressive Image Generation
Anlin Zheng,Haochen Wang,Yucheng Zhao,Weipeng Deng,Tiancai Wang,Xiangyu Zhang,Xiaojuan Qi
Main category: cs.CV
TL;DR: Hita is a new image tokenizer for autoregressive generation that improves holistic understanding and outperforms existing methods in both performance and efficiency.
Details
Motivation: Vanilla autoregressive models struggle to capture holistic relationships among token sequences, and most visual tokenizers only map local patches into latent tokens, limiting global information capture. Method: The paper introduces Hita, which uses a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. It also incorporates causal attention and a fusion module to improve alignment with the AR generation process. Result: Experiments show that Hita accelerates training speed and achieves better performance on ImageNet (2.59 FID and 281.9 IS). It also excels at zero-shot style transfer and image in-painting. Conclusion: Hita, a novel image tokenizer for autoregressive image generation, effectively captures holistic relationships and outperforms vanilla tokenizers in performance and efficiency. Abstract: The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. In addition, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. 
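The first strategy, holistic tokens placed ahead of patch tokens under causal attention, can be pictured with a small sketch. The token counts, the feature width, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def build_sequence(holistic_tokens, patch_tokens):
    """Concatenate so that holistic tokens come first, followed by patch-level tokens."""
    return np.concatenate([holistic_tokens, patch_tokens], axis=0)

def causal_mask(seq_len):
    """mask[i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Usage: 2 holistic tokens followed by 4 patch tokens, width 8 (all hypothetical).
holistic = np.zeros((2, 8))
patches = np.ones((4, 8))
sequence = build_sequence(holistic, patches)
mask = causal_mask(len(sequence))
```

With this ordering every patch position can attend to all holistic tokens, while the holistic tokens never see the patch tokens that follow them, which is what lets global information condition the later, local predictions.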
Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita
[56] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
Jiahao Wu,Rui Peng,Jianbo Jiao,Jiayu Yang,Luyang Tang,Kaiqiang Xiong,Jie Liang,Jinbo Yan,Runling Liu,Ronggang Wang
Main category: cs.CV
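TL;DR: This paper proposes LocalDyGS, which achieves more realistic dynamic scene reconstruction by decomposing complex dynamic scenes into local spaces and decoupling static and dynamic features.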
Details
Motivation: Because of the complex, highly dynamic motion in the real world, synthesizing dynamic videos for arbitrary viewpoints from multi-view inputs is challenging, and methods based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion. Method: 1) Decompose a complex dynamic scene into local spaces defined by seeds, capturing motion within each local space to achieve global modeling; 2) decouple static and dynamic features in local-space motion modeling, generating Temporal Gaussians from a static feature shared across time steps and a dynamic residual field that provides time-specific features. Result: LocalDyGS performs competitively with existing state-of-the-art methods on various fine-scale datasets and is the first attempt to model larger, more complex highly dynamic scenes. Conclusion: LocalDyGS is a novel dynamic scene reconstruction framework that models highly dynamic real-world scenes more realistically. Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.
[57] UVLM: Benchmarking Video Language Model for Underwater World Understanding
Xizhe Xue,Yang Zhou,Dawei Yan,Ying Li,Haokui Zhang,Rong Xiao
Main category: cs.CV
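TL;DR: This study develops UVLM, a video language model benchmark for underwater observation, aiming to improve existing models' understanding and applicability in underwater environments.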
Details
Motivation: Existing work focuses primarily on terrestrial scenarios and overlooks the highly demanding application needs of underwater observation. Method: The authors introduce UVLM, an underwater observation benchmark built through a collaborative approach combining human expertise and AI models. Result: The resulting dataset covers 419 classes of marine animals plus various static plants and terrains, with 20 distinct task types. Conclusion: Experiments show that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks. Abstract: Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To overcome this gap, we introduce UVLM, an underwater observation benchmark which is built through a collaborative approach combining human expertise and AI models. To ensure data quality, we have conducted in-depth considerations from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception Test.
The dataset and prompt engineering will be released publicly.
[58] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection
Seokyeong Lee,Sithu Aung,Junyong Choi,Seungryong Kim,Ig-Jae Kim,Junghyun Cho
Main category: cs.CV
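TL;DR: This paper proposes a new monocular 3D object detection method that requires no multi-view setup or additional sensors, using temporal information in video data to improve detection robustness and accuracy.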
Details
Motivation: Monocular 3D object detection suffers from data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Existing weakly supervised and pseudo-labeling methods are limited by domain-specific learning or rely solely on shape information from a single observation. Method: Pseudo-LiDARs of both static and dynamic objects are aggregated across temporally adjacent frames using object point tracking, enabling 3D attribute extraction. Result: Experiments show that the method delivers reliable accuracy and strong scalability, making it a practical and effective solution for monocular 3D object detection. Conclusion: The paper proposes a new pseudo-labeling framework that achieves more robust monocular 3D object detection without a multi-view setup, additional sensors, camera poses, or domain-specific training. Abstract: Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
[59] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis
Byung Hyun Lee,Wongi Jeong,Woojae Han,Kyoungbun Lee,Se Young Chun
Main category: cs.CV
TL;DR: This paper introduces CoMEL, a novel framework for continual multiple instance learning with enhanced localization, achieving significant improvements in accuracy while minimizing forgetting.
Details
Motivation: To reduce annotation costs with bag-level weak labels while enabling continual learning with minimal forgetting, particularly for localization tasks in large-scale images like histopathological whole slide images (WSIs). Method: The paper proposes CoMEL, which uses a Grouped Double Attention Transformer (GDAT), Bag Prototypes-based Pseudo-Labeling (BPPL), and Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to improve MIL localization and adaptability. Result: Experiments show that CoMEL outperforms prior methods by up to 11.00% in bag-level accuracy and up to 23.4% in localization accuracy under continual MIL setup on three public WSI datasets. Conclusion: CoMEL successfully addresses the challenges of adaptability and minimal forgetting in multiple instance learning for localization tasks, demonstrating superior performance on WSI datasets. Abstract: Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. 
CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets demonstrate the superior performance of CoMEL, which outperforms prior art by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in localization accuracy under the continual MIL setup.
[60] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection
Taehoon Kim,Jongwook Choi,Yonghyun Jeong,Haeun Noh,Jaejun Yoo,Seungryul Baek,Jongwon Choi
Main category: cs.CV
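TL;DR: This paper proposes a new deepfake video detection method based on pixel-wise temporal inconsistencies, improving detection performance through a 1D Fourier transform and attention mechanisms.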
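The core operation named in this entry, a 1D Fourier transform along the time axis of each pixel, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `temporal_spectrum`, the grayscale nested-list video layout, and the naive DFT are all choices made here for a dependency-free example. For a real array of shape (T, H, W), `numpy.fft.rfft(video, axis=0)` computes the same one-sided transform far faster.

```python
import cmath
import math


def temporal_spectrum(video):
    """Return per-pixel magnitude spectra along the time axis.

    video: list of T frames, each a list of H rows of W grayscale floats
           (hypothetical layout chosen for this sketch).
    Returns spectrum[k][y][x] for k in 0..T//2 (one-sided, as for a real FFT).
    """
    T = len(video)
    H, W = len(video[0]), len(video[0][0])
    n_bins = T // 2 + 1
    spectrum = [[[0.0] * W for _ in range(H)] for _ in range(n_bins)]
    for y in range(H):
        for x in range(W):
            # The time series of a single pixel across all frames.
            series = [video[t][y][x] for t in range(T)]
            for k in range(n_bins):
                # Naive O(T^2) DFT coefficient for temporal frequency bin k.
                coeff = sum(
                    series[t] * cmath.exp(-2j * math.pi * k * t / T)
                    for t in range(T)
                )
                spectrum[k][y][x] = abs(coeff)
    return spectrum


# Toy clip: 8 frames of 1x2 pixels.
# Pixel (0, 0) is constant; pixel (0, 1) flickers on every other frame.
clip = [[[1.0, float(t % 2)]] for t in range(8)]
spec = temporal_spectrum(clip)
```

In this toy clip the constant pixel has energy only in the zero-frequency bin, while the flickering pixel also has energy in the highest bin. That is the kind of per-pixel temporal cue the entry describes; how the paper turns such spectra into features is not covered by this sketch.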
Details
Motivation: Traditional spatial frequency-based detectors fail to capture pixel-level temporal artifacts effectively, which degrades detection performance. Method: A 1D Fourier transform is applied along the time axis of each pixel, and an attention proposal module and a joint transformer module are introduced to extract and integrate features. Result: The method performs well across diverse and challenging scenarios, precisely locating temporal artifacts and expanding the range of detectable forgery artifacts. Conclusion: The paper proposes a new deepfake video detection method that exploits pixel-wise temporal inconsistencies, overcoming the limitations of traditional methods. Abstract: We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
[61] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation
Peilin Zhang,Shaouxan Wua,Jun Feng,Zhuo Jin,Zhizezhang Gao,Jingkun Chen,Yaqiong Xing,Xiao Zhang
Main category: cs.CV
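TL;DR: This paper presents TAB Net, a novel weakly-supervised framework for medical image segmentation that uses a triplet augmentation self-recovery module and a boundary-aware pseudo-label supervision module, achieving state-of-the-art performance in scribble-based weakly supervised segmentation and performance comparable to fully supervised methods.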
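The abstract for this entry names three augmentations used by the triplet augmentation module: intensity transformation, cutout, and jigsaw. The sketch below shows one plausible form of each on a tiny grayscale image. Everything specific here is an assumption made for illustration (gamma adjustment as the intensity transform, a zeroed square for cutout, a shuffled tile grid for jigsaw, and all function names and parameters); the paper's actual operators and settings are not given in this digest.

```python
import random


def intensity_transform(img, gamma):
    """Gamma-adjust an image whose values lie in [0, 1] (assumed transform)."""
    return [[v ** gamma for v in row] for row in img]


def cutout(img, top, left, size):
    """Zero out a size x size square whose corner is at (top, left)."""
    out = [row[:] for row in img]
    for y in range(top, min(top + size, len(out))):
        for x in range(left, min(left + size, len(out[0]))):
            out[y][x] = 0.0
    return out


def jigsaw(img, grid, rng):
    """Split the image into grid x grid tiles and shuffle their positions."""
    h, w = len(img), len(img[0])
    th, tw = h // grid, w // grid  # assumes h and w are divisible by grid
    tiles = [
        [row[c * tw:(c + 1) * tw] for row in img[r * th:(r + 1) * th]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(tiles)
    out = []
    for r in range(grid):
        for y in range(th):
            row = []
            for c in range(grid):
                # Stitch row y of each tile in this tile-row back together.
                row.extend(tiles[r * grid + c][y])
            out.append(row)
    return out


rng = random.Random(0)
image = [[(y * 4 + x) / 15 for x in range(4)] for y in range(4)]
views = [
    intensity_transform(image, gamma=0.5),
    cutout(image, top=1, left=1, size=2),
    jigsaw(image, grid=2, rng=rng),
]
```

Each view keeps the input's shape, so a segmentation network could in principle be asked to recover the same complete mask from all three, which is the self-recovery idea the abstract describes.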
Details
Motivation: Medical image segmentation is a core task in various clinical applications, but acquiring large-scale, fully annotated medical image datasets is time-consuming and costly. Scribble annotations, a form of sparse labeling, have been proposed as an efficient alternative, but their sparsity limits feature learning of the target region and provides insufficient boundary supervision. Method: The authors propose TAB Net, a framework comprising a triplet augmentation self-recovery (TAS) module and a boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies, while the BAP module improves pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions and introducing a boundary-aware loss. Result: Experimental evaluations on two public datasets, ACDC and MSCMR seg, show that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation and achieves performance comparable to fully supervised methods. Conclusion: TAB Net substantially surpasses state-of-the-art scribble-based weakly supervised segmentation methods, with performance comparable to fully supervised methods. Abstract: Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model's sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement.
Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.
[62] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings
Mufhumudzi Muthivhi,Terence L. van Zyl
Main category: cs.CV
TL;DR: This study demonstrates that self-supervised learning can surpass supervised methods in wildlife re-identification, offering better performance and reducing dependence on annotated datasets.
Details
Motivation: Current state-of-the-art models depend on annotated data for training, which has led to the curation of large-scale wildlife datasets. This research explores SSL as an alternative to reduce reliance on labeled data. Method: This study utilizes Self-Supervised Learning (SSL) by extracting two distinct views of individuals using temporal image pairs from camera trap data, training a model without supervision. Result: Self-supervised models outperformed supervised models across all evaluated downstream tasks and showed greater robustness even with limited data. Conclusion: Self-supervised models demonstrate superior performance and robustness compared to supervised models in wildlife re-identification tasks. Abstract: Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available at https://github.com/pxpana/SSLWildlife.
[63] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration
Ayantika Das,Moitreya Chaudhuri,Koushik Bhat,Keerthi Ram,Mihail Bota,Mohanasankar Sivaprakasam
Main category: cs.CV
TL;DR: This paper integrates encoders with diffusion models to enable structured latent representations, allowing for improved analysis of brain images and effective unsupervised artifact restoration techniques.
Details
Motivation: Denoising diffusion models generate high-quality images but lack the ability to extract image-specific semantic representations, unlike auto-encoders. This work aims to bridge this gap by integrating encoding capabilities into diffusion models for structured latent representations. Method: The authors introduce a diffusion auto-encoding model with an encoder to learn image-specific representations. They enforce regression of positional information in high-resolution patches to structure the latent space. Additionally, they propose unsupervised techniques for tear artifact and JPEG artifact restoration using neighborhood awareness and diffusion model capabilities. Result: The proposed method successfully structures the latent space of diffusion models, enabling recognition of region-specific cellular patterns in brain images. It also introduces effective unsupervised techniques for tear and JPEG artifact restoration based on latent representations and diffusion processes. Conclusion: The paper concludes that integrating an encoder with diffusion models enables structured latent spaces for capturing image-specific representations, which facilitates various downstream tasks such as tissue differentiation and artifact restoration in brain images. Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applicabilities, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. 
By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.
[64] A Novel Tuning Method for Real-time Multiple-Object Tracking Utilizing Thermal Sensor with Complexity Motion Pattern
Duong Nguyen-Ngoc Tran,Long Hoang Pham,Chi Dai Tran,Quoc Pham-Nam Ho,Huy-Hung Nguyen,Jae Wook Jeon
Main category: cs.CV
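TL;DR: This paper introduces a new tuning method for pedestrian tracking in thermal images, addressing the challenge of low-level feature representation in thermal sensors.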
Details
Motivation: The main motivation is the low-level feature representation of thermal imagery, which makes accurate pedestrian detection and tracking difficult. The method is specifically designed to handle complex motion patterns in thermal imagery. Method: The paper proposes a novel tuning method for pedestrian tracking that optimizes two stages, fine-tuning hyperparameters to maximize tracking performance and achieve real-time tracking. Result: Extensive experiments on the PBVS Thermal MOT dataset demonstrate the effectiveness of the method. Conclusion: The proposed method is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications. Abstract: Multi-Object Tracking in thermal images is essential for surveillance systems, particularly in challenging environments where RGB cameras struggle due to low visibility or poor lighting conditions. Thermal sensors enhance recognition tasks by capturing infrared signatures, but a major challenge is their low-level feature representation, which makes it difficult to accurately detect and track pedestrians. To address this, the paper introduces a novel tuning method for pedestrian tracking, specifically designed to handle the complex motion patterns in thermal imagery. The proposed framework optimizes two stages, ensuring that each stage is tuned with the most suitable hyperparameters to maximize tracking performance. By fine-tuning hyperparameters for real-time tracking, the method achieves high accuracy without relying on complex reidentification or motion models. Extensive experiments on the PBVS Thermal MOT dataset demonstrate that the approach is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications.
[65] Privacy-preserving Preselection for Face Identification Based on Packing
Rundong Xin,Taotao Wang,Jin Wang,Chonghe Zhao,Jing Wang
Main category: cs.CV
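TL;DR: This paper proposes PFIP, an efficient and privacy-preserving face identification scheme for the ciphertext domain, which significantly improves retrieval efficiency through an innovative preselection mechanism and a packing module.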
Details
Motivation: Face identification systems operating in the ciphertext domain have attracted significant attention due to growing privacy concerns and the potential recovery of original facial data. However, as ciphertext template libraries grow, face retrieval becomes increasingly time-consuming. Method: The authors propose a novel scheme named PFIP, comprising a preselection mechanism and a packing module, for face identification in the ciphertext domain. Result: Extensive experiments on the LFW and CASIA datasets show that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Conclusion: PFIP achieves efficient face identification in the ciphertext domain while preserving the original model's accuracy, improving retrieval efficiency by nearly 50x over existing approaches. Abstract: Face identification systems operating in the ciphertext domain have garnered significant attention due to increasing privacy concerns and the potential recovery of original facial data. However, as the size of ciphertext template libraries grows, the face retrieval process becomes progressively more time-intensive. To address this challenge, we propose a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP). PFIP incorporates an innovative preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Extensive experiments conducted on the LFW and CASIA datasets demonstrate that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Compared to existing approaches, PFIP achieves a nearly 50x improvement in retrieval efficiency.
[66] Determination Of Structural Cracks Using Deep Learning Frameworks
Subhasis Dasgupta,Jaydip Sen,Tuhina Halder
Main category: cs.CV
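TL;DR: This paper presents a novel deep learning approach to structural crack detection that improves detection accuracy and efficiency using residual U-Net models and an ensemble model.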
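This entry evaluates its segmentation models with Intersection over Union (IoU) and the DICE coefficient. As a reminder of how the two scores relate, here is a minimal sketch for binary masks. The function name `iou_and_dice`, the nested-list mask format, and the choice to score two empty masks as 1.0 are assumptions made for this example, not details taken from the paper.

```python
def iou_and_dice(pred, target):
    """Compute IoU and Dice for two binary masks given as nested 0/1 lists."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1  # crack pixel predicted and present
            elif p and not t:
                fp += 1  # crack pixel predicted but absent
            elif t and not p:
                fn += 1  # crack pixel present but missed
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; treated as a perfect match in this sketch.
        return 1.0, 1.0
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
print(f"IoU={iou:.3f}, Dice={dice:.3f}")  # prints IoU=0.500, Dice=0.667
```

For a single pair of masks the two metrics are monotonically related by Dice = 2 * IoU / (1 + IoU), so they rank individual predictions identically. They can still diverge once averaged over a dataset.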
Details
Motivation: Structural crack detection is a critical public safety task, yet manual inspection is slow, inconsistent, and error-prone, so a more reliable method is needed. Method: The study uses several configurations of residual U-Net models and integrates them into an ensemble with a meta-model comprising convolutional blocks to improve prediction efficiency. Result: The residual U-Net models outperformed their predecessors on low-resolution imagery, and the ensemble model exceeded all individual models, achieving the highest Intersection over Union (IoU) and DICE coefficient scores. Conclusion: The paper proposes a new deep-learning architecture that improves the accuracy and efficiency of structural crack detection. The ensemble model outperforms the individual models, demonstrating its effectiveness for structural defect monitoring tasks. Abstract: Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble's performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving to be the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement paves the way for more reliable automated systems in structural defect monitoring tasks.
[67] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars
Yiming Zhong,Xiaolin Zhang,Ligang Liu,Yao Zhao,Yunchao Wei
Main category: cs.CV
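TL;DR: This paper presents AvatarMakeup, a new makeup method for 3D virtual avatars that transfers makeup patterns from a single reference photo of any individual while keeping the makeup consistent and high-quality across dynamic expressions and multiple views.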
Details
Motivation: Although existing 3D Gaussian editing methods can be adapted for facial makeup, they fail to meet the fundamental requirements for realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving identity throughout the makeup process, and 3) enabling precise control over fine details. Method: The authors propose AvatarMakeup, a 3D avatar makeup method that leverages a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual, following a coarse-to-fine approach. Result: Experiments show that AvatarMakeup achieves high-quality makeup transfer and maintains good consistency throughout animation. Conclusion: AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation, resolving the inconsistencies of existing methods across dynamic and multi-view effects. Abstract: Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality.
Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.
[68] F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning
Wei Li,Jingyang Zhang,Lihao Liu,Guoan Wang,Junjun He,Yang Chen,Lixu Gu
Main category: cs.CV
TL;DR: This paper proposes a new Image-level Disentangled Prompt Tuning (I-DiPT) framework for Test-Time Adaptation (TTA), which addresses the problem of adapting source models to free-form domain fragments in medical data, showing superior performance over existing methods.
Details
Motivation: Existing Test-Time Adaptation (TTA) methods assume data arrives in complete domain units, which is not reflective of clinical practice where data usually arrives in domain fragments of arbitrary lengths and random orders due to resource constraints and patient variability. Method: The paper proposes a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework with an image-invariant prompt and an image-specific prompt. It also introduces Uncertainty-oriented Masking (UoM) and Parallel Graph Distillation (PGD) to improve knowledge representation. Result: Experiments on breast cancer and glaucoma classification show the effectiveness of the proposed method over existing TTA approaches in the Free-Form Test-Time Adaptation (F²TTA) task. Conclusion: The paper concludes that the proposed I-DiPT framework outperforms existing TTA approaches in F²TTA tasks, as demonstrated by experiments on breast cancer and glaucoma classification. Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. 
I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.
[69] Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic
Sandro Costa Magalhães,Marco Almeida,Filipe Neves dos Santos,António Paulo Moreira,Jorge Dias
Main category: cs.CV
TL;DR: This paper explores the use of Field-Programmable Gate Arrays (FPGAs) to accelerate Artificial Neural Networks (ANNs) for more efficient object detection by robots.
Details
Motivation: Robots usually slow down for scanning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. Method: Used the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. Result: MobileNet v1 performed better, reaching a success rate of 98% and an inference speed of 6611 FPS. Conclusion: Using FPGAs can speed up ANNs and make them suitable for attention mechanisms. Abstract: Robots usually slow down for scanning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98% and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.
[70] IGDNet: Zero-Shot Robust Underexposed Image Enhancement via Illumination-Guided and Denoising
Hailong Yan,Junjian Huang,Tingwen Huang
Main category: cs.CV
TL;DR: IGDNet proposes a Zero-Shot enhancement method for restoring underexposed images, without requiring paired training data or guiding priors.
Details
Motivation: Current methods rely on supervised learning with paired datasets, which are impractical to collect in real-world scenarios and can lead to over-enhancement. Method: IGDNet uses a decomposition module to separate the image into illumination and reflection components via a dense connection network. A denoising module enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. Noise pairs are generated through downsampling and refined iteratively. Result: IGDNet outperforms 14 state-of-the-art unsupervised methods based on quantitative results from metrics like PSNR (20.41 dB) and SSIM (0.860). It shows strong generalization ability and effectively suppresses noise while restoring illumination. Conclusion: IGDNet is a promising solution for restoring underexposed images in complex lighting conditions without relying on training data or priors. Abstract: Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. To address these issues, we propose IGDNet, a Zero-Shot enhancement method that operates solely on a single test image, without requiring guiding priors or training data. IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. A noise pair is generated through downsampling and refined iteratively to produce the final result.
Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. The code will be released soon.
[71] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection
Weiwei Duan,Luping Ji,Shengjia Chen,Sicheng Zhu,Jianghong Huang,Mao Ye
Main category: cs.CV
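TL;DR: This paper proposes a weakly-supervised contrastive learning (WeCoL) scheme for moving infrared small target detection that requires only simple target quantity prompts during model training, significantly reducing the reliance on extensive manual annotation.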
Details
Motivation: Traditional fully-supervised methods rely on large numbers of manual target-wise annotations, and manually annotating video sequences, especially low-quality infrared frames, is often expensive and time-consuming. The paper therefore explores non-fully supervised strategies, in particular weak supervision, to reduce annotation requirements. Method: Based on the pretrained Segment Anything Model (SAM), a potential target mining strategy is designed and combined with contrastive learning and a long-short term motion-aware learning scheme, improving the reliability of pseudo-labels and modeling the local motion patterns and global motion trajectory of small targets. Result: Experiments on two public datasets, DAUB and ITSDT-15K, show that the weakly-supervised scheme often outperforms early fully-supervised methods and approaches current state-of-the-art fully-supervised methods. Conclusion: The paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme for moving infrared small target detection. Experiments show that it often outperforms early fully-supervised methods on two public datasets, and its performance can reach over 90% of that of state-of-the-art fully-supervised methods. Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast. Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential for reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme that only requires simple target quantity prompts during model training. Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation. Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace. Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets. The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods.
Its performance could even reach over 90% of state-of-the-art (SOTA) fully-supervised ones.
[72] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk
Gaochao Song,Zibo Zhao,Haohan Weng,Jingbo Zeng,Rongfei Jia,Shenghua Gao
Main category: cs.CV
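TL;DR: Mesh Silksong is an efficient mesh representation that generates high-quality polygon meshes by reducing redundancy, performing well in both compression rate and geometric integrity.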
Details
Motivation: Existing mesh tokenization methods produce sequences containing repeated vertex tokens, which wastes network capacity, so a more efficient method is needed. Method: The paper proposes Mesh Silksong, which tokenizes a mesh by accessing each vertex only once and generates polygon meshes in an auto-regressive manner. Result: Mesh Silksong reduces token sequence redundancy by 50%, achieves a state-of-the-art compression rate of approximately 22%, and produces polygon meshes with superior geometric properties such as manifold topology, watertight detection, and consistent face normals. Conclusion: Mesh Silksong is a novel mesh representation that achieves efficient polygon mesh generation by reducing redundancy, with notable gains in geometric integrity and compression rate. Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence's redundancy by 50%, and achieves a state-of-the-art compression rate of approximately 22%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.
[73] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios
Teng Fu,Yuwen Chen,Zhuofan Chen,Mengyang Zhao,Bin Li,Xiangyang Xue
Main category: cs.CV
TL;DR: This paper introduces CrowdTrack, a large-scale, realistic dataset for multi-pedestrian tracking designed to improve the development of tracking algorithms in complex, real-world scenarios.
Details
Motivation: Existing MOT datasets suffer from simple scene compositions and unrealistic scenarios, making them inadequate for training robust tracking algorithms. The authors aim to address this limitation by proposing a more realistic and challenging dataset. Method: The authors created a large-scale, challenging dataset called CrowdTrack, which includes 33 videos with 5,185 trajectories captured from a first-person view in real-life complex scenarios. Each object is annotated with a unique ID and a complete bounding box. Result: The dataset was comprehensively analyzed, and multiple state-of-the-art models were tested on it. The performance of foundation models was also evaluated, showing the dataset's utility in advancing tracking algorithms for complex situations. Conclusion: The paper concludes that the proposed CrowdTrack dataset serves as a valuable platform for developing and testing multi-pedestrian tracking algorithms in complex, real-life scenarios. Abstract: Multi-object tracking is a classic field in computer vision. Among them, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from the annotated data is the simplest solution, the existing MOT dataset fails to satisfy this solution. Existing methods mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing dataset do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. 
To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack .
[74] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention
Zunhui Xia,Hongxing Li,Libin Lan
Main category: cs.CV
TL;DR: MedFormer is an efficient and versatile medical vision transformer designed to improve performance on various medical image recognition tasks by addressing computational cost and performance limitations of existing methods.
Details
Motivation: The motivation behind MedFormer is to address the limitations of existing vision transformer-based methods in medical image recognition, specifically their task-specific nature, high computational costs, and suboptimal performance due to sparse attention mechanisms. Method: The paper proposes MedFormer, which utilizes a pyramid scaling structure as a versatile backbone and introduces Dual Sparse Selection Attention (DSSA) to enhance computational efficiency and robustness while maintaining high performance. Result: Extensive experiments demonstrate that MedFormer outperforms existing medical vision transformers in terms of generality, efficiency, and performance across multiple imaging modality datasets. Conclusion: MedFormer is a highly effective medical vision transformer that improves performance across various medical image recognition tasks, including image classification, semantic segmentation, and lesion detection. Abstract: Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. 
This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers. Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at https://github.com/XiaZunhui/MedFormer.
[75] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy
Luca Parolari,Andrea Cherubini,Lamberto Ballan,Carlo Biffi
Main category: cs.CV
TL;DR: This paper introduces a new polyp counting method that uses temporal awareness to improve counting accuracy and performance.
Details
Motivation: Existing polyp counting methods rely mainly on self-supervised learning and neglect temporal relationships in both the tracklet feature learning and clustering stages. This work aims to close that gap and improve polyp counting accuracy. Method: The work introduces a supervised contrastive loss with temporally-aware soft targets to capture intra-polyp variability while preserving inter-polyp discriminability, and improves tracklet clustering by integrating a temporal adjacency constraint. Result: Experiments show a 2.2x reduction in fragmentation rate compared to prior approaches, demonstrating the effectiveness of the method. Conclusion: The paper proposes a new supervised contrastive loss and a temporal adjacency constraint that improve the robustness and accuracy of polyp counting in colonoscopy, establishing a new state of the art. Abstract: Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
[76] MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations
Hyunsoo Son,Jeonghyun Noh,Suemin Jeon,Chaoli Wang,Won-Ki Jeong
Main category: cs.CV
TL;DR: This paper introduces MC-INR, a novel framework that improves the encoding of complex, multivariate, unstructured scientific data using neural networks with clustering and meta-learning techniques.
Details
Motivation: Existing INR-based methods struggle with complex structures, single-variable data, and reliance on structured grids, limiting their effectiveness on real-world datasets. Method: The paper proposes MC-INR, a neural network framework combining meta-learning and clustering, with a residual-based dynamic re-clustering mechanism and a branched layer to handle multivariate data. Result: MC-INR demonstrates superior performance in encoding scientific multivariate data compared to traditional INR methods. Conclusion: MC-INR provides an effective solution for encoding complex real-world multivariate data on unstructured grids, outperforming existing INR-based methods. Abstract: Implicit Neural Representations (INRs) are widely used to encode data as continuous functions, enabling the visualization of large-scale multivariate scientific simulation data with reduced memory usage. However, existing INR-based methods face three main limitations: (1) inflexible representation of complex structures, (2) primarily focusing on single-variable data, and (3) dependence on structured grids. Thus, their performance degrades when applied to complex real-world datasets. To address these limitations, we propose a novel neural network-based framework, MC-INR, which handles multivariate data on unstructured grids. It combines meta-learning and clustering to enable flexible encoding of complex structures. To further improve performance, we introduce a residual-based dynamic re-clustering mechanism that adaptively partitions clusters based on local error. We also propose a branched layer to leverage multivariate data through independent branches simultaneously. Experimental results demonstrate that MC-INR outperforms existing methods on scientific data encoding tasks.
[77] Automatic Labelling for Low-Light Pedestrian Detection
Dimitrios Bouzoulas,Eerik Alamikkotervo,Risto Ojala
Main category: cs.CV
TL;DR: This paper proposes an automated infrared-RGB labeling pipeline for improving low-light RGB pedestrian detection, showing promising results with better performance in most cases compared to traditional methods.
Details
Motivation: Low-light conditions pose a challenge for RGB pedestrian detection due to the lack of large public datasets. This research aims to address that challenge through an automated solution. Method: An automated infrared-RGB labeling pipeline involving infrared detection, label transfer, and training object detection models using generated labels was developed. Evaluation was conducted using the KAIST dataset. Result: Models trained on generated labels outperformed those trained on ground-truth labels in 6 out of 9 cases for mAP@50 and mAP@50-95 metrics on previously unseen image sequences. Conclusion: The proposed automated infrared-RGB labeling pipeline enhances low-light RGB pedestrian detection performance in most cases compared to models trained on ground-truth labels. Abstract: Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection, that does not appear to have large public datasets, is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) Infrared detection, where a fine-tuned model for infrared pedestrian detection is used 2) Label transfer process from the infrared detections to their RGB counterparts 3) Training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. 
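For readers less familiar with the metrics quoted above: mAP@50 counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, and mAP@50-95 averages over IoU thresholds from 0.5 to 0.95. The sketch below shows only the IoU test, not the authors' evaluation code; the `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) tuples; this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True if a predicted box matches a ground-truth box at the given IoU."""
    return iou(pred_box, gt_box) >= threshold
```

Raising `threshold` toward 0.95 makes the match criterion progressively stricter, which is why mAP@50-95 rewards tighter localization than mAP@50.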
The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
[78] Detecting Multiple Diseases in Multiple Crops Using Deep Learning
Vivek Yadav,Anugrah Jain
Main category: cs.CV
TL;DR: This paper proposes a deep learning model to detect multiple diseases across various crops in India, achieving high accuracy and broader coverage compared to existing solutions.
Details
Motivation: India's agrarian economy suffers from significant crop losses due to diseases, pests, and environmental stress. Early detection is crucial for improving yield and food security. Method: A deep learning-based solution was developed using a unified dataset of images from 17 crops and 34 diseases. Result: The model achieved 99% accuracy on the unified dataset, which is 7% higher than existing models covering fewer crops and diseases. Conclusion: The proposed deep learning model outperforms state-of-the-art methods in detecting crop diseases and aims to provide a better solution for Indian farmers. Abstract: India, as a predominantly agrarian economy, faces significant challenges in agriculture, including substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases across different crops are critical for improving yield and ensuring food security. This paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, aiming to cover India's diverse agricultural landscape. We first create a unified dataset encompassing images of 17 different crops and 34 different diseases from various available repositories. The proposed deep learning model is trained on this dataset and outperforms the state-of-the-art in terms of accuracy and the number of crops and diseases covered. We achieve a detection accuracy of 99 percent on our unified dataset, which is 7 percent higher than the state-of-the-art, which handles only 14 crops and 26 different diseases. By increasing the number of crops and types of diseases that can be detected, the proposed solution aims to provide a better product for Indian farmers.
[79] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning
Abiam Remache González,Meriem Chagour,Timon Bijan Rüth,Raúl Trapiella Cañedo,Marina Martínez Soler,Álvaro Lorenzo Felipe,Hyun-Suk Shin,María-Jesús Zamorano Serrano,Ricardo Torres,Juan-Antonio Castillo Parra,Eduardo Reyes Abad,Miguel-Ángel Ferrer Ballester,Juan-Manuel Afonso López,Francisco-Mario Hernández Tejera,Adrian Penate-Sanchez
Main category: cs.CV
TL;DR: IMASHRIMP is an automated system for shrimp morphological analysis that optimizes genetic selection in aquaculture, reducing human error and efficiently predicting key points on the shrimp's skeleton with high precision.
Details
Motivation: The motivation behind IMASHRIMP is to optimize genetic selection tasks in aquaculture by addressing specific challenges of shrimp morphology analysis from RGBD images through existing deep learning and computer vision techniques. Method: IMASHRIMP incorporates two discrimination modules based on a modified ResNet-50 architecture for image classification and rostrum integrity determination, a pose estimation module adapted from VitPose for predicting key points on the shrimp's skeleton, and a morphological regression module using an SVM model for converting pixel measurements to centimeter units. Result: IMASHRIMP achieved a reduction in human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. The system attained a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. Conclusion: IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. Abstract: This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A "two-factor authentication (human and AI)" system is proposed, which reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. 
A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public
[80] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang,Sicheng Xu,Yue Dong,Yu Deng,Jianfeng Xiang,Zelong Lv,Guangzhong Sun,Xin Tong,Jiaolong Yang
Main category: cs.CV
TL;DR: MoGe-2 improves 3D geometry estimation from single images by combining metric scale accuracy with fine detail recovery through enhanced data refinement and model optimization.
Details
Motivation: The motivation is to improve upon existing monocular geometry estimation techniques by recovering metric-scale 3D point maps from single images without compromising on relative geometry accuracy or fine-grained detail. Method: The method builds upon the MoGe approach, extending it for metric geometry prediction while maintaining affine-invariant point representation accuracy. A unified data refinement strategy was developed to filter and complete real data using synthetic labels, enhancing geometry granularity. Result: The model achieves superior performance in recovering accurate relative geometry, metric scale precision, and detailed surface structures, which earlier approaches could not accomplish together. Conclusion: MoGe-2 is a highly effective open-domain geometry estimation model that outperforms previous methods by simultaneously achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery. Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. 
We train our model on a large corpus of mixed datasets and conduct comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.
[81] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
Buzhen Huang,Chen Li,Chongyang Xu,Dongyue Lu,Jinnan Chen,Yangang Wang,Gim Hee Lee
Main category: cs.CV
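TL;DR: This paper proposes a new method that combines human appearance, social proxemics, and physical laws, addressing the difficulty traditional pose estimation methods have in recovering plausible human interactions in complex scenes.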
Details
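Motivation: Because of visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible motions for close interactions in in-the-wild videos; even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. A new method is therefore needed to overcome these obstacles. Method: A diffusion model is first trained to learn human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. Result: Experimental results on several benchmarks show that the method outperforms existing approaches at estimating human interactions from in-the-wild videos captured in complex environments. The authors also build a dataset with pseudo ground-truth interaction annotations, which may promote future research in the area. Conclusion: The paper proposes a dual-branch optimization framework that, constrained by human appearance, social proxemics, and physical laws, reconstructs accurate interactive motions with plausible body contacts from in-the-wild videos captured in complex environments. Experiments show that it outperforms existing methods, and it may promote future research on pose estimation and human behavior understanding. Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. 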
The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.
[82] Parametric shape models for vessels learned from segmentations via differentiable voxelization
Alina F. Dima,Suprosanna Shit,Huaqi Qiu,Robbie Holland,Tamara T. Mueller,Fabio Antonio Musio,Kaiyuan Yang,Bjoern Menze,Rickmer Braren,Marcus Makowski,Daniel Rueckert
Main category: cs.CV
TL;DR: This paper proposes a differentiable framework that unifies voxel, mesh, and parametric vessel representations, enabling accurate and flexible modeling without requiring explicit ground-truth shape parameters.
Details
Motivation: Current vessel representations—voxels, meshes, and parametric models—are typically used separately despite their complementary properties; integrating them could enhance accuracy and applicability in complex vessel modeling. Method: The method uses differentiable voxelization to extract parametric shape models via shape-to-segmentation fitting, parametrizing vessels using cubic B-splines for centerlines and radii, and differentiably extracting meshes from the learned parameters. Result: The method accurately captures the geometry of complex vessels such as aortas, aneurysms, and brain vessels, producing high-fidelity meshes that can be manipulated post-fitting. Conclusion: The proposed framework successfully integrates voxel, mesh, and parametric representations of vessels through differentiable transformations, allowing for high-fidelity modeling and manipulation. Abstract: Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. 
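To make the centerline-and-radius parametrization concrete, here is a minimal sketch of evaluating a uniform cubic B-spline over four-dimensional control points (x, y, z, radius). The function names, the uniform knot spacing, and the control-point layout are illustrative assumptions, not details from the paper, and the sketch covers only spline evaluation, not differentiable voxelization or fitting.

```python
def cubic_bspline_basis(t):
    """Uniform cubic B-spline basis weights for local parameter t in [0, 1]."""
    return (
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    )


def eval_vessel(control_points, u):
    """Evaluate (x, y, z, radius) at global parameter u in [0, 1].

    control_points: list of at least four (x, y, z, r) tuples
    (hypothetical layout chosen for this sketch).
    """
    n_segments = len(control_points) - 3
    # Map u to a segment index and a local parameter inside that segment.
    segment = min(int(u * n_segments), n_segments - 1)
    t = u * n_segments - segment
    weights = cubic_bspline_basis(t)
    window = control_points[segment : segment + 4]
    # Blend the four active control points, coordinate by coordinate.
    return tuple(
        sum(w * p[k] for w, p in zip(weights, window)) for k in range(4)
    )
```

Because the four basis weights always sum to one and are twice continuously differentiable across segment boundaries, both the centerline and the radius vary smoothly along the vessel, which is the "smoothness and continuity by construction" property the abstract refers to.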
Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
[83] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning
Tan Pan,Zhaorui Tan,Kaiyu Guo,Dongli Xu,Weidi Xu,Chen Jiang,Xin Guo,Yuan Qi,Yuan Cheng
Main category: cs.CV
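TL;DR: This work proposes S²DC, a new method for 3D medical image self-supervised learning that improves representational power by jointly considering consistency within structures and discrepancy between structures, achieving strong results across multiple experiments.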
Details
Motivation: Conventional 3D medical image self-supervised learning methods typically process fixed-size image patches, ignoring variations in location, scale, and morphology across anatomical structures, which limits their ability to capture meaningful features. The paper aims to address this with a structure-aware approach. Method: The paper proposes the S²DC framework, which works in two steps: first, an optimal transport strategy increases the semantic discrepancy between different patches; second, semantic consistency at the structural level is advanced based on neighborhood similarity distribution. Result: S²DC is evaluated across 10 datasets, 4 tasks, and 3 modalities, and the results show that it outperforms existing state-of-the-art methods in 3D medical image self-supervised learning. Conclusion: The paper proposes S²DC, a new self-supervised learning framework for 3D medical images that achieves structure-aware representation learning by combining semantic consistency and discrepancy, and that outperforms existing methods across multiple datasets and tasks. Abstract: 3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.
[84] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu,Enxin Song,Wenhao Chai,Xuexiang Wen,Tian Ye,Gaoang Wang
Main category: cs.CV
TL;DR: AuroraLong is an efficient, linear RNN-based model that addresses the challenges of long video understanding by reducing computational complexity and memory cost.
Details
Motivation: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost due to the quadratic scaling of transformer-based LLMs with input sequence length. Method: AuroraLong replaces the LLM component in MLLMs with a linear RNN language model and combines visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Result: Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. Conclusion: AuroraLong demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. Abstract: The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. 
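The two ideas in this abstract, ordering merged visual tokens by size and processing a sequence with a constant-size state, can be illustrated with a toy sketch. It assumes each merged token carries a `size` equal to the number of original tokens it absorbed; the function names and the scalar recurrence are hypothetical simplifications and do not reflect AuroraLong's actual implementation.

```python
def reorder_by_size(tokens, sizes):
    """Return tokens and sizes sorted by merged-token size, ascending.

    sizes[i] is assumed to be the number of original visual tokens merged
    into tokens[i]. Python's sort is stable, so ties keep their input order.
    """
    order = sorted(range(len(tokens)), key=lambda i: sizes[i])
    return [tokens[i] for i in order], [sizes[i] for i in order]


def linear_recurrence(inputs, decay=0.9):
    """Toy linear RNN: the state is one number regardless of input length."""
    state = 0.0
    outputs = []
    for x in inputs:
        # Linear update with no nonlinearity between time steps.
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs
```

The recurrence keeps memory constant in the sequence length, in contrast to attention, whose cost grows quadratically with the number of tokens.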
To the best of our knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
[85] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development
Riccardo Gallon,Fabian Schiemenz,Alessandra Menicucci,Eberhard Gill
Main category: cs.CV
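TL;DR: This paper addresses AI-based fault detection for vision-based navigation sensors in deep space exploration missions and provides a dataset of fault-injected images for training and testing.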
Details
Motivation: Vision-Based Navigation (VBN) algorithms are increasingly important in space missions, but sensor faults can make navigation outputs inaccurate or cause them to fail entirely, and traditional fault detection methods have limitations, which motivates the use of AI. Method: The study focuses on an interplanetary exploration mission scenario, comprehensively analyzes potential fault cases in the camera sensors used within the VBN pipeline, and uses a simulation framework to recreate these faulty conditions in synthetic images. Result: The study systematically characterizes the causes, effects, and common mitigation strategies of camera sensor faults, and builds a dataset of fault-injected images to support the development of AI-based fault detection algorithms. Conclusion: By introducing a simulation framework that recreates faulty camera sensor conditions, the study produces a dataset of fault-injected images that can be used to train and test AI-based fault detection algorithms. Abstract: The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.
[86] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou,Yunpeng Luo,Yuanchen Wu,Ke Sun,Jiayi Ji,Ke Yan,Shouhong Ding,Xiaoshuai Sun,Yunsheng Wu,Rongrong Ji
Main category: cs.CV
TL;DR: This paper presents AIGI-Holmes, a novel approach for detecting AI-generated images using a comprehensive dataset and a three-stage training framework to improve generalization and provide human-verifiable explanations.
Details
Motivation: The motivation stems from the misuse of highly realistic AI-generated images in spreading misinformation and limitations in current detection techniques, such as lack of human-verifiable explanations and poor generalization. Method: The study introduces a new dataset (Holmes-Set) and an efficient data annotation method called Multi-Expert Jury. It also proposes a three-stage training framework named Holmes Pipeline, which adapts multimodal large language models for AI-generated image detection. Result: Extensive experiments show that the proposed model, AIGI-Holmes, is effective in detecting AI-generated images across multiple benchmarks while generating human-aligned explanations. Conclusion: The paper concludes that AIGI-Holmes, developed using the Holmes Pipeline and enhanced through collaborative decoding strategies, successfully addresses challenges in detecting AI-generated images with high generalization and human-verifiable explanations. Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. 
In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.
[87] Learning few-step posterior samplers by unfolding and distillation of diffusion models
Charlesquin Kemajou Mbakam,Jonathan Spence,Marcelo Pereyra
Main category: cs.CV
TL;DR: This paper proposes a new framework that combines deep unfolding and model distillation to transform diffusion models into few-step conditional models for efficient and flexible posterior sampling in Bayesian computational imaging.
Details
Motivation: The motivation behind the study is to combine the zero-shot flexibility of Plug-and-Play methods with the accuracy and speed of specialized conditional DMs in Bayesian computational imaging by leveraging DMs as powerful image priors. Method: The method involves the integration of deep unfolding and model distillation to convert a diffusion model (DM) image prior into a few-step conditional model for posterior sampling. A key part of this method is the application of deep unfolding to a Markov chain Monte Carlo (MCMC) algorithm, specifically the LATINO Langevin sampler. Result: The results show that the proposed unfolded and distilled samplers achieve excellent accuracy and computational efficiency. They were tested extensively and compared with the state of the art, proving their effectiveness and adaptability to variations in the forward model at inference time. Conclusion: The paper concludes that by integrating deep unfolding and model distillation, a DM image prior can be effectively transformed into a few-step conditional model for posterior sampling, achieving excellent accuracy and computational efficiency while maintaining flexibility. Abstract: Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. 
A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
[88] APT: Adaptive Personalized Training for Diffusion Models with Limited Data
JungWoo Chae,Jiyoon Kim,JaeWoong Choi,Kyungyul Kim,Sangheum Hwang
Main category: cs.CV
TL;DR: This paper proposes Adaptive Personalized Training (APT) to tackle overfitting and preserve prior knowledge when personalizing diffusion models with limited data, achieving superior performance in image generation.
Details
Motivation: Personalizing diffusion models using limited data presents significant challenges like overfitting, loss of prior knowledge, and degradation of text alignment, which the paper aims to address. Method: The proposed method, Adaptive Personalized Training (APT), includes three components: Adaptive Training Adjustment to detect and control overfitting, Representation Stabilization to regulate feature maps, and Attention Alignment for maintaining prior knowledge during fine-tuning. Result: Through experiments, APT was shown to effectively mitigate overfitting, maintain semantic coherence, and generate high-quality, diverse images better than existing approaches under limited data conditions. Conclusion: APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data. Abstract: Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. 
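As a rough illustration of the representation-regularization idea in APT, the sketch below penalizes shifts in the mean and variance of a fine-tuned feature map relative to a reference (pretrained) one. It is not the authors' implementation: the function name `stabilization_penalty`, the squared-gap form of the penalty, and the flattened toy feature maps are all assumptions made for illustration.

```python
from statistics import fmean, pvariance


def stabilization_penalty(finetuned, pretrained):
    """Hypothetical penalty: squared gaps between the means and variances
    of two flattened feature maps (assumed form, not taken from the paper)."""
    mean_gap = fmean(finetuned) - fmean(pretrained)
    var_gap = pvariance(finetuned) - pvariance(pretrained)
    return mean_gap ** 2 + var_gap ** 2


# Toy flattened feature maps (illustrative values only).
pretrained_features = [0.0, 1.0, 2.0, 3.0]
shifted_features = [x + 0.5 for x in pretrained_features]  # same spread, shifted mean

# Identical statistics give no penalty; a shifted mean is penalized.
print(stabilization_penalty(pretrained_features, pretrained_features))
print(stabilization_penalty(shifted_features, pretrained_features))
```

In a real fine-tuning loop a term of this kind would be added to the diffusion loss with some weight; how the paper weights or schedules it is not stated in this entry.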
APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
[89] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
Xiangyang Luo,Ye Zhu,Yunfei Liu,Lijian Lin,Cong Wan,Zijian Cai,Shao-Lun Huang,Yu Li
Main category: cs.CV
TL;DR: CanonSwap is a novel video face-swapping framework that effectively transfers source identity while accurately preserving target face dynamics, outperforming previous methods.
Details
Motivation: The motivation is to overcome the limitations of existing methods that focus on high-quality identity transfer but fail to maintain dynamic attributes like head poses, facial expressions, and lip-sync, resulting in inconsistent outcomes. Method: CanonSwap decouples motion information from appearance information by eliminating motion-related information first, enabling identity modification within a unified canonical space. Then, the swapped feature is reintegrated into the original video space to preserve dynamic attributes. A Partial Identity Modulation module and fine-grained synchronization metrics are also introduced. Result: CanonSwap significantly outperforms existing approaches in visual quality, temporal consistency, and identity preservation. Conclusion: CanonSwap successfully addresses the coupling issue of facial appearance and motion in videos, achieving superior performance in video face swapping. Abstract: Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. 
To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.
[90] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment
Qi Xu,Dongxu Wei,Lingzhe Zhao,Wenpu Li,Zhangchi Huang,Shunping Ji,Peidong Liu
Main category: cs.CV
TL;DR: This paper proposes SIU3R, an alignment-free framework for simultaneous understanding and 3D reconstruction from unposed images, achieving superior performance by eliminating reliance on 2D-to-3D alignment paradigms.
Details
Motivation: Traditional 2D-to-3D feature alignment methods limit 3D understanding capabilities and risk semantic information loss. The paper aims to develop a more effective framework for simultaneous understanding and 3D reconstruction. Method: SIU3R utilizes pixel-aligned 3D representation to bridge reconstruction and understanding tasks. It unifies multiple understanding tasks into learnable queries and introduces lightweight modules for interaction between tasks. Result: Extensive experiments show that SIU3R outperforms existing approaches on both individual and combined tasks of 3D reconstruction and understanding. Conclusion: The paper concludes that SIU3R, an alignment-free framework, achieves state-of-the-art performance on simultaneous understanding and 3D reconstruction tasks while avoiding the limitations of traditional 2D-to-3D feature alignment methods. Abstract: Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need of alignment with 2D models. To encourage collaboration between the two tasks with shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. 
Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.
[91] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation
Qin Guo,Ailing Zeng,Dongxu Yue,Ceyuan Yang,Yang Cao,Hanzhong Guo,Fei Shen,Wei Liu,Xihui Liu,Dan Xu
Main category: cs.CV
TL;DR: This paper presents UniMC, a new DiT-based framework combined with the HAIG-2.9M dataset, significantly enhancing the capabilities of keypoint-guided Text-to-Image diffusion models, especially for complex scenes involving multiple humans and animals.
Details
Motivation: The motivation is to overcome the limitations of existing keypoint-guided Text-to-Image diffusion models, which struggle with generating images of non-rigid objects beyond humans and have difficulties distinguishing instances and classes due to reliance on skeleton images. Method: The paper introduces a DiT-based framework called UniMC that integrates instance- and keypoint-level conditions into compact tokens, along with a new large-scale dataset named HAIG-2.9M containing extensive annotations for both humans and animals. Result: The proposed UniMC framework and HAIG-2.9M dataset demonstrate significant improvements in generating high-quality images, especially in challenging cases such as heavy occlusions and multi-class scenarios. Conclusion: The paper concludes that by using the UniMC framework and the HAIG-2.9M dataset, keypoint-guided Text-to-Image diffusion models can be effectively enhanced for multi-class image generation, particularly improving performance in complex scenarios involving multiple overlapping humans and animals. Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. 
This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.
[92] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
Yuxuan Wang,Tianwei Cao,Huayu Zhang,Zhongjiang He,Kongming Liang,Zhanyu Ma
Main category: cs.CV
TL;DR: This paper proposes FairHuman, a method for improving local details in image generation.
Details
Motivation: Generating human images with plausible details remains challenging because local regions are insufficiently supervised during training. Method: The authors construct three learning objectives and derive an optimal parameter update strategy based on the Minimum Potential Delay criterion. Result: The method effectively improves human image generation performance across different scenarios. Conclusion: FairHuman is a multi-objective fine-tuning method that fairly improves both global and local generation quality. Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
[93] Prompt learning with bounding box constraints for medical image segmentation
Mélanie Gaillochet,Mehrdad Noori,Sahar Dastani,Christian Desrosiers,Hervé Lombaert
Main category: cs.CV
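TL;DR: This study proposes a weakly supervised learning method based on bounding box annotations that achieves efficient medical image segmentation without full annotations.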
Details
Motivation: Pixel-level annotation is costly and time-consuming in the medical domain, while weakly supervised methods based on bounding box annotations are more efficient and practical. A method that reduces reliance on fully annotated data is therefore needed. Method: The method uses bounding box annotations to generate pseudo-labels, combines them with a foundation model, and trains with an optimization scheme that integrates multiple constraints. Result: Experiments show that, in a limited data setting, the proposed weakly supervised method achieves an average Dice score of 84.90%, outperforming existing fully supervised and weakly supervised methods. Conclusion: The paper proposes a new weakly supervised framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation, and automates prompt generation using only bounding box annotations, thereby reducing user intervention. Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
[94] DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He,Danshi Li,Xinqiang Yu,Zekun Qi,Wenyao Zhang,Jiayi Chen,Zhaoxiang Zhang,Zhizheng Zhang,Li Yi,He Wang
Main category: cs.CV
TL;DR: This paper presents DexVLG, a large Vision-Language-Grasp model for predicting dexterous grasp poses aligned with language instructions using single-view RGBD input. The model is trained on a large-scale dataset called DexGraspNet 3.0, comprising 170 million dexterous grasp poses and captions. Testing shows DexVLG's strong zero-shot generalization capabilities and success in real-world grasping tasks.
Details
Motivation: Despite the progress in vision-language-action (VLA) systems for robots, there is limited research on functional grasping with large models for human-like dexterous hands due to difficulties in data collection. The paper aims to address this gap by developing a model that enables dexterous grasp pose prediction aligned with language instructions. Method: The paper introduces DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. A dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects was generated in simulation and paired with detailed part-level captions. This dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. Result: The creation of DexVLG achieves over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation. It also enables successful part-aligned grasps on physical objects in real-world scenarios. Conclusion: DexVLG demonstrates strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios. Abstract: As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. 
To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, and successful part-aligned grasps on physical objects in real-world scenarios.
[95] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics
Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen
Main category: cs.CV
TL;DR: This paper introduces MANO, a new approach to attention computation that reduces the complexity of Transformers, making them more efficient for high-resolution inputs without sacrificing performance.
Details
Motivation: The motivation stems from the impracticality of standard Transformers for high-resolution inputs due to their quadratic complexity in memory and time. Current variants often lose fine-scale details through patchification, downsampling, or coarsening techniques. Method: The authors introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion inspired by techniques in n-body numerical simulations. This method maintains a global receptive field and achieves linear time and memory complexity. Result: Empirical results show that MANO performs comparably to state-of-the-art models like ViT and Swin Transformer on tasks such as image classification and Darcy flows, while significantly reducing runtime and peak memory usage. Conclusion: The study concludes that MANO, a novel approach to computing attention, rivals state-of-the-art models in performance while significantly reducing runtime and memory usage. Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. 
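One way to picture multiscale attention with a global receptive field at linear cost is the minimal sketch below. It is not the authors' MANO implementation: it works on a 1D token sequence, uses identity query/key/value projections, non-overlapping windows, average pooling, and a plain sum across levels, all of which are simplifying assumptions, and every function name here is hypothetical. Each level halves the resolution and runs attention only inside small windows, so the total work is roughly proportional to the number of tokens times the window size, while the coarsest level lets every token receive information from the whole sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(tokens, window):
    """Dot-product self-attention inside non-overlapping windows
    (identity projections, an assumption made to keep the sketch short)."""
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for query in block:
            weights = softmax([sum(a * b for a, b in zip(query, key)) for key in block])
            out.append([sum(w * value[d] for w, value in zip(weights, block))
                        for d in range(len(query))])
    return out


def avg_pool(tokens, factor):
    """Average consecutive groups of `factor` tokens into one coarse token."""
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        pooled.append([sum(column) / len(group) for column in zip(*group)])
    return pooled


def multiscale_attention(tokens, window=2, levels=3):
    """Sum windowed attention computed at resolutions 1, 1/2, 1/4, ..."""
    n, dim = len(tokens), len(tokens[0])
    out = [[0.0] * dim for _ in range(n)]
    for level in range(levels):
        factor = 2 ** level
        coarse = window_attention(avg_pool(tokens, factor), window)
        for i in range(n):
            for d in range(dim):
                # Upsample by sharing each coarse output among its fine tokens.
                out[i][d] += coarse[i // factor][d]
    return out


# Eight toy 2D tokens; with window=2 and levels=3 the coarsest level spans all of them.
toy_tokens = [[float(i), 1.0] for i in range(8)]
mixed = multiscale_attention(toy_tokens)
print(len(mixed), len(mixed[0]))
```

The design choice illustrated is the trade between resolution and range: nearby tokens interact at full resolution, distant tokens interact only through pooled summaries.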
Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.
[96] Partial Weakly-Supervised Oriented Object Detection
Mingxin Liu,Peiyuan Zhang,Yuan Liu,Wei Zhang,Yue Zhou,Ning Liao,Ziyang Gong,Junwei Luo,Zhirui Wang,Yi Yu,Xue Yang
Main category: cs.CV
TL;DR: This paper proposes PWOOD, a new framework for oriented object detection that reduces annotation costs by using partially weak annotations and unlabeled data effectively.
Details
Motivation: The motivation stems from the high annotation costs associated with current OOD algorithms, which are either fully supervised, semi-supervised, or weakly supervised. The authors aim to provide a lower-cost alternative that maintains performance. Method: The paper introduces the PWOOD framework, which includes an Orientation-and-Scale-aware Student (OS-Student) model and a Class-Agnostic Pseudo-Label Filtering strategy (CPF), designed to work with partial weak annotations and unlabeled data. Result: Experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets show that the PWOOD framework performs as well as or better than traditional semi-supervised methods, while significantly reducing annotation costs. Conclusion: The proposed PWOOD framework offers a cost-effective solution for oriented object detection by utilizing partially weak annotations and leveraging unlabeled data efficiently, outperforming traditional semi-supervised algorithms. Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. 
To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which can efficiently leverage large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.
[97] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images
Danrong Zhang,Huili Huang,N. Simrill Smith,Nimisha Roy,J. David Frost
Main category: cs.CV
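TL;DR: This study recasts damage assessment as a semantic segmentation problem, using a deep learning model and a new scoring system to make the analysis of post-earthquake social media images more objective and practical.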
Details
Motivation: Traditional methods for assessing damage severity in post-earthquake social media images are highly subjective and cannot account for varying degrees of damage within an image. Method: The authors build a segmented damage severity dataset, fine-tune a SegFormer model to produce damage severity segmentations, and introduce a new damage severity scoring system. Result: The approach enables a more objective and comprehensive quantification of damage in earthquake-affected areas and improves disaster response efficiency. Conclusion: By applying semantic segmentation, the study improves damage assessment of post-earthquake social media images, increasing objectivity and accuracy and giving disaster reconnaissance teams more precise guidance. Abstract: In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.
[98] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
Liheng Zhang,Lexi Pang,Hang Ye,Xiaoxuan Ma,Yizhou Wang
Main category: cs.CV
TL;DR: This paper proposes a training-free, flexible feature injection framework for text-to-image diffusion models that decouples feature injection from the denoising process, improving structural and appearance control in generated images.
Details
Motivation: Existing feature injection methods for text-to-image diffusion models suffer from structural misalignment, condition leakage, and visual artifacts, especially when condition images deviate from natural RGB distributions. This work aims to address these limitations by rethinking the feature injection mechanism. Method: The method introduces a flexible feature injection framework that decouples the injection timestep from the denoising process, along with structure-rich injection modules, appearance-rich prompting, and a restart refinement strategy. Result: The approach significantly improves structural and appearance fidelity in generated images across diverse zero-shot conditioning scenarios, outperforming existing methods without requiring training. Conclusion: The proposed framework achieves state-of-the-art performance in zero-shot conditioning scenarios, enabling training-free, structure-rich, and appearance-rich text-to-image generation. Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. 
At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
[99] No time to train! Training-Free Reference-Based Instance Segmentation
Miguel Espinosa,Chenhongyi Yang,Linus Ericsson,Steven McDonagh,Elliot J. Crowley
Main category: cs.CV
TL;DR: This paper introduces a training-free method for object segmentation that uses semantic priors and reference images to automatically generate segmentation masks, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: The motivation is to reduce the burden of manual visual-prompts or complex domain-dependent prompt-generation rules in image segmentation by utilizing reference images and semantic priors. Method: The method involves a multi-stage, training-free approach incorporating memory bank construction, representation aggregation, and semantic-aware feature matching. Result: The experiments showed significant improvements in segmentation metrics, including 36.8% nAP on COCO FSOD, 71.2% nAP50 on PASCAL VOC Few-Shot, and 22.4% nAP on the Cross-Domain FSOD benchmark. Conclusion: The study concludes that leveraging semantic priors from foundation models can significantly enhance object segmentation performance, achieving state-of-the-art results on several benchmarks. Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. 
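The three stages named above can be illustrated with a small sketch. It is not the authors' implementation: it assumes features have already been extracted by some frozen encoder, uses class-mean prototypes as the aggregation step and cosine similarity as the matching step, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_memory_bank(ref_features, ref_labels):
    """(1) Memory bank: group reference feature vectors by class label."""
    bank = {}
    for feat, label in zip(ref_features, ref_labels):
        bank.setdefault(label, []).append(feat)
    return {label: np.stack(feats) for label, feats in bank.items()}

def aggregate_prototypes(bank):
    """(2) Aggregation (assumed here: mean of normalized features per class)."""
    labels = sorted(bank)
    protos = np.stack(
        [l2_normalize(l2_normalize(bank[label]).mean(axis=0)) for label in labels]
    )
    return labels, protos

def match(target_features, labels, protos):
    """(3) Matching: assign each target feature to its most similar prototype."""
    sims = l2_normalize(target_features) @ protos.T  # cosine similarities
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims.max(axis=1)

# Toy 2-D features standing in for encoder outputs.
refs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = ["cat", "cat", "dog", "dog"]
labels, protos = aggregate_prototypes(build_memory_bank(refs, ref_labels))
pred, score = match(np.array([[2.0, 0.1], [0.0, 3.0]]), labels, protos)
```

In a real pipeline the matched regions would then be turned into instance masks; that step is omitted here because the abstract does not describe it in detail.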
Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
[100] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
Gent Serifi,Marcel C. Bühler
Main category: cs.CV
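TL;DR: HyperGaussians is a novel extension of 3D Gaussians for high-quality animatable face avatars that addresses nonlinear deformations, complex lighting effects, and fine details.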
Details
Motivation: Although 3D Gaussian Splatting excels at rendering static faces, creating animatable avatars with complex deformations and fine detail from monocular video remains challenging. Method: The authors introduce a high-dimensional multivariate Gaussian representation (HyperGaussians) whose expressivity is increased by conditioning on a learnable local embedding, and they use an 'inverse covariance trick' to keep computation efficient. Result: An evaluation on 19 subjects shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details such as eyeglass frames, teeth, complex facial movements, and specular reflections. Conclusion: HyperGaussians offer an efficient and expressive way to substantially improve the quality of animatable face avatars generated from monocular video. Abstract: We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar.
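The abstract does not give the reparameterization behind the 'inverse covariance trick', so the following is only the textbook identity for conditioning a joint Gaussian, shown to illustrate why working with the inverse covariance can make conditioning cheap. The split into a low-dimensional part $x_a$ (for example, a 3D position) and a high-dimensional embedding $x_b$ is an assumption made here for illustration, not the authors' stated formulation.

```latex
\begin{align*}
\begin{pmatrix} x_a \\ x_b \end{pmatrix}
  &\sim \mathcal{N}\!\left(
    \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix},
    \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
  \right),
  \qquad
  \Lambda = \Sigma^{-1} =
    \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \\
% Covariance form: requires inverting the (large) block \Sigma_{bb}
\mu_{a \mid b} &= \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b),
  \qquad
  \Sigma_{a \mid b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba} \\
% Precision form: only the (small) block \Lambda_{aa} is inverted
\mu_{a \mid b} &= \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b),
  \qquad
  \Sigma_{a \mid b} = \Lambda_{aa}^{-1}
\end{align*}
```

In the covariance form the matrix to invert, $\Sigma_{bb}$, grows with the embedding dimension, whereas in the precision form only $\Lambda_{aa}$, whose size is that of $x_a$, is inverted.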
Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
[101] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
Fangfu Liu,Hao Li,Jiawei Chi,Hanyang Wang,Minghui Yang,Fudong Wang,Yueqi Duan
Main category: cs.CV
TL;DR: LangScene-X is a novel generative framework that unifies and generates 3D consistent multi-modality information for reconstruction and understanding, achieving superior performance compared to existing approaches.
Details
Motivation: Recovering 3D structures with open-vocabulary scene understanding from 2D images is challenging due to reliance on calibrated dense-view reconstruction paradigms that suffer from rendering artifacts and implausible semantic synthesis when limited views are available. Method: LangScene-X utilizes a TriMap video diffusion model and a Language Quantized Compressor (LQC) to generate appearance, geometry, and semantics from sparse inputs and enable cross-scene generalization without per-scene retraining. Result: Extensive experiments on real-world data demonstrate the superiority of LangScene-X over state-of-the-art methods in terms of quality and generalizability. Conclusion: LangScene-X demonstrates superior performance in generating 3D consistent multi-modality information for reconstruction and understanding, enabling open-ended language queries. Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. 
Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
[102] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach
Panpan Ji,Junni Song,Hang Xiao,Hanyu Liu,Chao Li
Main category: cs.CV
TL;DR: This paper proposes the Dynamic Contrastive Dual-Path Network (DCDP-HAR) to improve sensor-based human activity recognition by addressing issues in cross-modal feature alignment and modality contribution imbalance.
Details
Motivation: Multimodal HAR systems encounter key challenges like difficulties in cross-modal feature alignment and imbalanced modality contributions, which need to be addressed for improved performance. Method: A Dynamic Contrastive Dual-Path Network (DCDP-HAR) is proposed, which includes a dual-path feature extraction architecture, a multi-stage contrastive learning mechanism, and a confidence-driven gradient modulation strategy. Result: Extensive experiments on four public benchmark datasets validate the effectiveness of the DCDP-HAR framework in improving performance for sensor-based human activity recognition. Conclusion: The proposed DCDP-HAR framework effectively addresses challenges in multimodal HAR systems, such as cross-modal feature alignment and imbalanced modality contributions. Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. 
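The abstract does not state the modulation formula, so the sketch below shows just one plausible scheme under explicit assumptions: each branch's confidence is taken as the mean probability it assigns to the true class, branches that are more confident than average have their gradients scaled down, and the scaled gradient is fed into a simple momentum accumulator. The function names, the `tanh` rule, and the `alpha` and `beta` parameters are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def branch_confidence(logits, labels):
    """Mean probability one modality branch assigns to the true class."""
    probs = softmax(logits)
    return float(probs[np.arange(len(labels)), labels].mean())

def modulation_coefficients(confidences, alpha=1.0):
    """Coefficients in (0, 1]; branches above the mean confidence are slowed."""
    conf = np.asarray(confidences, dtype=float)
    ratio = conf / conf.mean()
    return 1.0 - np.tanh(alpha * np.maximum(ratio - 1.0, 0.0))

def momentum_step(velocity, grad, coeff, beta=0.9):
    """Accumulate the modulated gradient with exponential momentum."""
    return beta * velocity + (1.0 - beta) * coeff * grad

# Two branches: one already confident, one still at chance level.
labels = np.array([0, 1])
logits_a = np.array([[5.0, 0.0], [0.0, 5.0]])
logits_b = np.zeros((2, 2))
confs = [branch_confidence(logits_a, labels), branch_confidence(logits_b, labels)]
coeffs = modulation_coefficients(confs)
v = momentum_step(np.zeros(2), np.ones(2), coeff=0.5)
```

The intended effect is that the dominant branch learns more slowly while the weaker branch keeps its full gradient, which is one way to reduce the modality competition the abstract mentions.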
We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
[103] USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network
Ying Yu,Hang Xiao,Siyao Li,Jiarui Li,Haotian Tang,Hanyu Liu,Chao Li
Main category: cs.CV
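TL;DR: This paper proposes a new multi-attention interaction mechanism for human activity recognition that addresses labeled-data scarcity, insufficient feature extraction, and limited model performance, achieving high accuracy and good practical deployability.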
Details
Motivation: The main challenges in human activity recognition include the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. Method: First, an unsupervised, statistics-guided diffusion model is used for data augmentation. Second, a multi-branch spatio-temporal interaction network captures multi-scale features through parallel residual branches and incorporates temporal and spatial attention mechanisms. Finally, an adaptive multi-loss function fusion strategy is introduced, and the approach is validated experimentally on three public datasets. Result: Experiments on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, show that the proposed USAD model achieves accuracies of 98.84%, 93.81%, and 80.92%, respectively, significantly outperforming existing methods. Conclusion: The paper proposes a comprehensive optimization approach based on multi-attention interaction mechanisms to address key challenges in human activity recognition, and experiments verify its effectiveness and feasibility. Abstract: The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3×3, 5×5, and 7×7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization.
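As an illustration of how temporal attention can "identify critical time points", here is a minimal sketch of softmax attention pooling over a sensor window. It is a generic formulation, not the USAD architecture: the scoring vector `w` is a hypothetical stand-in for parameters that would normally be learned, and the multi-branch convolutions and spatial attention are omitted.

```python
import numpy as np

def temporal_attention(x, w):
    """x: (T, C) sensor window; w: (C,) scoring vector (hypothetical, normally learned)."""
    scores = x @ w                    # one relevance score per timestep
    scores = scores - scores.max()    # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ x              # (C,) attention-weighted summary of the window
    return pooled, weights

# Three timesteps, two channels; the last timestep carries the strongest signal.
x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
pooled, weights = temporal_attention(x, np.ones(2))
```

The weights sum to one, so the pooled vector is a convex combination of the timesteps that leans toward those with the highest scores.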
Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.
[104] AnyI2V: Animating Any Conditional Image with Motion Control
Ziye Li,Hao Luo,Xincheng Shuai,Henghui Ding
Main category: cs.CV
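TL;DR: This paper proposes AnyI2V, a new training-free video generation framework that enables more flexible video generation through user-defined motion trajectories and mixed conditional inputs.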
Details
Motivation: Existing T2V methods offer limited control over spatial layout, while I2V methods are constrained by their dependence on real images and lack flexibility and editability. Method: The authors propose the AnyI2V framework, which enables style transfer and editing via LoRA and text prompts and supports multi-modal conditional inputs such as meshes and point clouds. Result: Experiments show that AnyI2V outperforms existing methods and offers a new perspective on motion- and spatially controlled video generation. Conclusion: AnyI2V provides a flexible and versatile approach to video generation that requires no training and supports multiple conditional input modalities and user-defined motion trajectories. Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at https://henghuiding.com/AnyI2V/.
[105] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Jiaer Xia,Bingkui Tong,Yuhang Zang,Rui Shao,Kaiyang Zhou
Main category: cs.CV
TL;DR: This paper proposes GCoT, a method that enhances MLLM adaptation for specialized vision tasks by integrating grounding information into CoT data, leading to improved performance under data-limited scenarios.
Details
Motivation: Multimodal Large Language Models (MLLMs) struggle with specialized vision tasks due to a lack of relevant pre-training data. The motivation is to adapt these models without large-scale retraining, especially when data is limited. Method: A bootstrapping-based method called Grounded Chain-of-Thought (GCoT) is introduced to enhance reasoning step accuracy by adding bounding box information to CoT data, addressing the mismatch issue in pre-training and downstream datasets. Result: Evaluation on five specialized vision tasks shows that GCoT significantly outperforms fine-tuning and distillation approaches under data-limited conditions. Conclusion: The proposed Grounded Chain-of-Thought (GCoT) approach effectively improves model adaptation for specialized vision tasks under data-limited regimes by incorporating grounding information into CoT data. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. 
To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.
[106] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
Xin Zhou,Dingkang Liang,Kaijin Chen,Tianrui Feng,Xiwu Chen,Hongkai Lin,Yikang Ding,Feiyang Tan,Hengshuang Zhao,Xiang Bai
Main category: cs.CV
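TL;DR: EasyCache is a training-free acceleration framework for video diffusion models that reduces redundant computation by dynamically caching and reusing transformation vectors, significantly speeding up inference while maintaining high-quality video generation.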
Details
Motivation: Video generation models suffer from slow inference and high computational cost because of the iterative nature of the denoising process, which limits their broader adoption. An efficient and accessible way to accelerate these models is therefore needed. Method: The authors propose EasyCache, a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors during inference to avoid redundant computation, without requiring offline profiling, pre-computation, or extensive parameter tuning. Result: Compared with the original baselines, EasyCache speeds up inference by 2.1-3.3× and improves PSNR by up to 36% over the previous state-of-the-art method, achieving leading acceleration performance while maintaining high visual fidelity. Conclusion: EasyCache provides an efficient and highly accessible solution for high-quality video generation in both research and practical applications, and its code has been open-sourced. Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3× compared to the original baselines while maintaining high visual fidelity with a significant PSNR improvement of up to 36% compared to the previous SOTA method. This improvement makes our EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
[107] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans
Zhening Huang,Xiaoyang Wu,Fangcheng Zhong,Hengshuang Zhao,Matthias Nießner,Joan Lasenby
Main category: cs.CV
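TL;DR: LiteReality converts RGB-D scans into compact, realistic 3D scenes that support high-quality rendering and interaction, with applications in AR/VR, gaming, robotics, and other fields.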
Details
Motivation: The goal is a 3D scene reconstruction method that delivers high-quality visual fidelity, supports key graphics features (such as object individuality and high-quality material rendering), and enables interaction. Method: LiteReality performs scene understanding and parses the results into a 3D layout and objects with the help of a structured scene graph, retrieves the most visually similar 3D models from a curated asset database for reconstruction, enhances realism with a material painting module, and integrates the scene into a simulation engine with basic physical properties to enable interactive behavior. Result: LiteReality achieves high-quality 3D scene reconstruction on both real-life scans and public datasets; its training-free object retrieval module achieves state-of-the-art performance on the Scan2CAD benchmark, and its material painting module can transfer appearances under challenging conditions. Conclusion: LiteReality is a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas that are fully compatible with standard graphics pipelines. Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets.
Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c
[108] RefTok: Reference-Based Tokenization for Video Generation
Xiang Fan,Xiaohang Sun,Kushan Thakkar,Zhu Liu,Vimal Bhat,Ranjay Krishna,Xiang Hao
Main category: cs.CV
TL;DR: This paper introduces RefTok, a new reference-based tokenization method for video modeling that effectively captures temporal dependencies and redundancies, resulting in improved performance over current state-of-the-art methods.
Details
Motivation: Handling temporal redundancy effectively is a key challenge in learning video models, which current approaches fail to address because they treat each set of frames independently. Method: RefTok uses a reference-based tokenization method in which sets of frames are encoded and decoded conditioned on an unquantized reference frame to preserve motion continuity and object appearance. Result: RefTok outperforms state-of-the-art tokenizers across four datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS) and improves PSNR, SSIM, and LPIPS by an average of 36.7% at the same or higher compression ratios. Additionally, when used for video generation on the BAIR task, RefTok surpasses both MAGVIT-B and MAGVIT-L by an average of 27.9% across all generation metrics. Conclusion: RefTok is a more effective tokenizer than existing methods such as Cosmos and MAGVIT, as it better captures temporal dynamics and contextual information in videos, leading to superior performance in video generation tasks. Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context.
Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
[109] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Yuqi Wu,Wenzhao Zheng,Jie Zhou,Jiwen Lu
Main category: cs.CV
TL;DR: Point3R improves dense 3D scene reconstruction by introducing an explicit spatial pointer memory, enhancing efficiency and accuracy.