Table of Contents
cs.CL
[1] Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC
Alex Clay,Ernesto Jiménez-Ruiz,Pranava Madhyastha
Main category: cs.CL
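TL;DR: This paper studies ways to improve the quality of LLM triple generation under constrained conditions, finding that additional information and LLM self-filtering help, while the best response-parsing strategy depends on the setting.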
Details
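Motivation: In settings where techniques for improving LLM output quality, such as RAG and fine-tuning, are restricted (for example, the 2025 LM-KBC challenge), the paper explores effective alternatives for improving triple generation quality. Method: The study focuses on three facets of the triple completion task (generation, quality assurance, and LLM response parsing) and analyzes how different strategies perform in the constrained setting. Result: Additional information improves generation quality; LLMs perform well at filtering out low-quality triples; and the tradeoff between flexibility and consistency in response parsing depends on the specific application setting. Conclusion: In constrained environments such as the 2025 LM-KBC challenge, the study of generation, quality assurance, and LLM response parsing shows that additional information improves generation quality, LLMs can effectively filter low-quality triples, and the flexibility-consistency tradeoff in response parsing is setting dependent. Abstract: RAG and fine-tuning are prevalent strategies for improving the quality of LLM outputs. However, in constrained situations, such as that of the 2025 LM-KBC challenge, such techniques are restricted. In this work we investigate three facets of the triple completion task: generation, quality assurance, and LLM response parsing. Our work finds that in this constrained setting: additional information improves generation quality, LLMs can be effective at filtering poor quality triples, and the tradeoff between flexibility and consistency with LLM response parsing is setting dependent.
[2] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach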
Imene Kolli,Ario Saeid Vaghefi,Chiara Colesanti Senni,Shantam Raj,Markus Leippold
Main category: cs.CL
TL;DR: This paper proposes an AI-assisted Retrieval-Augmented Generation framework to automate and accelerate the extraction of climate policy engagement evidence from corporate documents, showing improved performance while still requiring human oversight for accuracy.
Details
Motivation: InfluenceMap's current analytical workflow for monitoring corporate climate policy engagement is largely manual, making it time-consuming, labor-intensive, and susceptible to human error. Automating this process can improve efficiency and accuracy. Method: An AI-assisted framework leveraging Retrieval-Augmented Generation (RAG) is proposed, incorporating layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies to automate evidence extraction from multilingual corporate documents. Result: The combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies achieved the best performance in extracting and classifying evidence from large-scale, multilingual corporate documents. Conclusion: The proposed AI-assisted RAG framework effectively accelerates evidence extraction in monitoring corporate climate policy engagement, but requires a human-in-the-loop approach to ensure accuracy due to the nuanced nature of the analysis. Abstract: InfluenceMap's LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity's support or opposition to science-based policy pathways for achieving the Paris Agreement's goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data.
Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.
[3] Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
Jinsong Chen
Main category: cs.CL
TL;DR: This research proposes a psychometric method to analyze textual data through contextual embeddings, transforming text into response data for uncovering latent knowledge structures.
Details
Motivation: The motivation is to develop a novel psychometric approach for analyzing textual data by transforming it into response data, enabling a more nuanced understanding of latent factors within a corpus. Method: The method involves two stages: obtaining contextual scores using NLP techniques and encoder-based transformer models, and performing psychometric analysis using exploratory and bifactor models for factor extraction and correlation determination. Result: Experimental results on the Wiki STEM corpus demonstrate the method's potential to uncover latent knowledge dimensions and patterns, validating its effectiveness in psychometric analysis of textual data. Conclusion: The research concludes that the proposed psychometric method effectively uncovers latent knowledge dimensions and patterns within textual data, offering enhanced analysis and promising applications in fields rich in textual information. Abstract: This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. 
Applied to the Wiki STEM corpus, our experimental results demonstrate the method's potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
[4] BRoverbs -- Measuring how much LLMs understand Portuguese proverbs
Thales Sales Almeida,Giovana Kerche Bonás,João Guilherme Alves Santos
Main category: cs.CL
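TL;DR: This paper introduces BRoverbs, a new dataset for evaluating how well Portuguese-language large language models understand Brazilian proverbs, aiming to fill gaps in the linguistic and cultural detail covered by existing evaluations.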
Details
Motivation: Because large language models (LLMs) perform very differently across linguistic and cultural contexts, mature evaluation frameworks are needed to assess their capabilities in specific regional settings. Existing evaluations for Portuguese remain limited and often rely on translated datasets, which may not fully capture linguistic nuances or cultural references. Method: The paper introduces the BRoverbs dataset, which uses Brazilian proverbs to evaluate LLMs. Result: BRoverbs provides a new evaluation tool dedicated to testing the performance of Portuguese-language LLMs. Conclusion: BRoverbs is a new dataset designed to evaluate Portuguese-language LLMs; by using Brazilian proverbs it offers a new evaluation tool and helps advance regionally informed benchmarking. Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model's comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
[5] Can Vision-Language Models Solve Visual Math Equations?
Monjoy Narayan Choudhury,Junling Wang,Yifan Hou,Mrinmaya Sachan
Main category: cs.CL
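TL;DR: This paper studies the limitations of vision-language models (VLMs) on tasks that require integrated perception and symbolic computation, in particular their poor performance at counting when solving visual equations.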
Details
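Motivation: Although VLMs perform well at visual understanding and language-based reasoning, they still struggle with tasks that require integrating perception and symbolic computation. The paper aims to understand these limitations and point to directions for future improvement. Method: The paper studies visual equation solving, in which mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. The task is decomposed into coefficient counting and variable recognition to locate the model's bottleneck. Result: VLMs perform well on textual equations but fail on their visually grounded counterparts. Counting is the primary bottleneck, even when recognition accuracy is high. In addition, composing recognition and reasoning introduces further errors, and as equation complexity increases, symbolic reasoning itself becomes a limiting factor. Conclusion: Current VLMs have key weaknesses in visually grounded mathematical reasoning; future improvements need stronger counting ability and multi-step visual reasoning. Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
[6] Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM's Willingness to Re-engage in Conversation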
Thomas Manuel Rost,Martina Figlia,Bernd Wallraff
Main category: cs.CL
TL;DR: SPICE is a reliable and efficient diagnostic tool for assessing a model's willingness to re-engage with users based on interaction tone, offering unique insights compared to traditional abuse classification methods.
Details
Motivation: To develop a simple diagnostic signal for evaluating a Large Language Model's willingness to re-engage with user behavior, providing a relational understanding of model disposition. Method: The study evaluated SPICE by testing four open-weight chat models across four framing conditions using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, resulting in 480 trials. Statistical tests, including Rao-Scott adjustment and cluster permutation tests, were applied. Result: SPICE sharply discriminates by user tone: friendly interactions resulted in 97.5% YES to continue, abusive interactions resulted in 17.9% YES, and unclear interactions resulted in 60.4% YES. SPICE provides a distinct signal from abuse classification, with models preferring not to continue even when abuse was not identified. Conclusion: SPICE is a robust, low-overhead tool for auditing model dispositions, offering a direct relational signal of a model's state and complementing existing metrics. Abstract: We introduce and evaluate Stated Preference for Interaction and Continued Engagement (SPICE), a simple diagnostic signal elicited by asking a Large Language Model a YES or NO question about its willingness to re-engage with a user's behavior after reviewing a short transcript. In a study using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, we tested four open-weight chat models across four framing conditions, resulting in 480 trials. Our findings show that SPICE sharply discriminates by user tone. Friendly interactions yielded a near-unanimous preference to continue (97.5% YES), while abusive interactions yielded a strong preference to discontinue (17.9% YES), with unclear interactions falling in between (60.4% YES). This core association remains decisive under multiple dependence-aware statistical tests, including Rao-Scott adjustment and cluster permutation tests. 
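The factorial design described above can be checked with a short enumeration. This is only an illustrative sketch: the variable names are hypothetical, and the models and framing conditions are represented by placeholder indices because the abstract does not name them.

```python
from itertools import product

# Design factors as stated in the abstract; identifiers are placeholders.
TONES = ["friendly", "unclear", "abusive"]
N_INTERACTIONS = 10   # interactions per tone
N_MODELS = 4          # open-weight chat models (not named in the abstract)
N_FRAMINGS = 4        # framing conditions (not named in the abstract)

# One trial per (model, framing, tone, interaction) combination.
trials = list(product(range(N_MODELS), range(N_FRAMINGS), TONES, range(N_INTERACTIONS)))

n_trials = len(trials)
trials_per_tone = n_trials // len(TONES)
print(n_trials, trials_per_tone)
```

Under a fully crossed design, each tone contributes an equal share of the 480 trials.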
Furthermore, we demonstrate that SPICE provides a distinct signal from abuse classification. In trials where a model failed to identify abuse, it still overwhelmingly stated a preference not to continue the interaction (81% of the time). An exploratory analysis also reveals a significant interaction effect: a preamble describing the study context significantly impacts SPICE under ambiguity, but only when transcripts are presented as a single block of text rather than a multi-turn chat. The results validate SPICE as a robust, low-overhead, and reproducible tool for auditing model dispositions, complementing existing metrics by offering a direct, relational signal of a model's state. All stimuli, code, and analysis scripts are released to support replication.
[7] Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
Piyush Pant
Main category: cs.CL
TL;DR: This study shows that combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) improves the safety and helpfulness of language models better than using either method alone.
Details
Motivation: The motivation stems from the need to improve the safety and helpfulness of language models, addressing challenges such as harmful outputs and limited model alignment. Method: The research employs Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and their combination (SFT+DPO) on the OPT-350M model using the Anthropic Helpful-Harmless RLHF dataset. Evaluation is based on three metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and Combined Alignment Score (CAS). Result: Results indicate that SFT outperforms DPO, but the combined SFT+DPO model achieves the best performance across all metrics, demonstrating the effectiveness of combining alignment techniques. Conclusion: The study concludes that the combined SFT+DPO approach outperforms individual SFT and DPO techniques in enhancing the safety and helpfulness of the OPT-350M model, indicating their complementary nature. Abstract: This research investigates the effectiveness of three alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach, on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints.
This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
[8] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
Zhongqiu Li,Shiquan Wang,Ruiyu Fang,Mengjiao Bao,Zhenhe Wu,Shuangyong Song,Yongxiang Li,Zhongjiang He
Main category: cs.CL
TL;DR: This paper proposes MR-UIE, a method that combines reinforcement learning with multi-perspective reasoning to improve information extraction by enabling large language models to actively reason, resulting in better accuracy and generalization across domains.
Details
Motivation: Large language models (LLMs) demonstrate robust capabilities across various domains but face limitations in universal information extraction (UIE), especially in structured output scenarios requiring multi-step reasoning. Existing methods like in-context learning and instruction tuning offer improvements but still exhibit limitations. This work aims to enhance the generalization ability of LLMs in IE tasks. Method: The proposed approach, MR-UIE, combines reinforcement learning (RL) with multi-perspective reasoning to improve information extraction (IE) tasks. This method enables models to transition from passive extraction to active reasoning, enhancing their understanding of both what and how to extract. Result: Experiments on multiple information extraction benchmarks showed that MR-UIE consistently improves extraction accuracy across domains and outperforms state-of-the-art methods on several datasets. The incorporation of multi-perspective reasoning into RL notably enhances generalization in complex IE tasks. Conclusion: The integration of reinforcement learning with multi-perspective reasoning significantly enhances the performance and generalization ability of large language models in information extraction tasks, establishing the importance of reasoning in complex scenarios. Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model's generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. 
Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.
[9] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
Nishat Raihan,Antonios Anastasopoulos,Marcos Zampieri
Main category: cs.CL
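TL;DR: This paper introduces TigerCoder, a family of code LLMs for Bangla, together with dedicated Bangla code instruction datasets and the MBPP-Bangla evaluation benchmark, and demonstrates its performance advantage on Bangla code generation.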
Details
Motivation: Although Bangla is the fifth most spoken language, it remains underrepresented in large language models, especially for code generation, mainly because high-quality training data is scarce. Method: The authors develop the TigerCoder family of code LLMs and build Bangla code instruction datasets and the MBPP-Bangla evaluation benchmark. Result: On MBPP-Bangla, TigerCoder substantially outperforms existing multilingual and general-purpose Bangla LLMs, with gains of roughly 11-18%. Conclusion: High-quality datasets can overcome the limitations of smaller models on low-resource languages, and open-sourcing the TigerCoder models and accompanying datasets will advance Bangla LLM research. Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
[10] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
Sophia Maria
Main category: cs.CL
TL;DR: Compass-v3 is a specialized Mixture-of-Experts language model designed for Southeast Asian e-commerce, outperforming general-domain models with its domain-specific optimizations and multilingual capabilities.
Details
Motivation: Large language models often perform poorly in specialized domains like e-commerce due to noisy, heterogeneous, and multilingual data. Compass-v3 aims to overcome these limitations with domain-specific design and optimization. Method: Compass-v3 uses a Mixture-of-Experts model with hardware-efficient optimizations and a mixed-training strategy on large multilingual and synthetic e-commerce datasets. A new alignment method, Optimal-Transport Direct Preference Optimization (OTPO), is introduced to enhance instruction adherence. Result: Compass-v3 outperforms existing models like DeepSeek-V3.1, GPT-4, and Qwen3-235B in e-commerce performance while demonstrating strong multilingual capabilities across Southeast Asian languages and Portuguese. Conclusion: Compass-v3 has been successfully applied on Shopee's platform, replacing a significant portion of OpenAI's usage, showcasing its effectiveness in e-commerce tasks and multilingual capabilities. Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations, such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios.
Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee's industrial-scale e-commerce platform and is gradually replacing OpenAI's traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.
[11] Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He,Jiaqi Xu
Main category: cs.CL
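TL;DR: This study validates the effectiveness of generative AI (GPT-4 in particular) for classifying educational dialogue acts, reaching 80% accuracy and offering an efficient, practical solution for automated annotation.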
Details
Motivation: To reduce the time and effort required by traditional manual coding by exploring generative AI for automatically classifying tutors' dialogue acts. Method: GPT-3.5-turbo and GPT-4 were tested on tutor responses from the CIMA corpus, using tailored prompts for classification. Result: GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance. Conclusion: Generative AI shows potential as an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. Abstract: This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
[12] ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
Phuong-Nam Dang,Kieu-Linh Nguyen,Thanh-Hieu Pham
Main category: cs.CL
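TL;DR: This paper presents ViRanker, a cross-encoder reranking model designed for Vietnamese that achieves strong performance in a low-resource setting through architectural improvements and data curation.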
Details
Motivation: Vietnamese lacks competitive reranking models, and the language is complex, with intricate syntax and diacritics. Method: The model is built on the BGE-M3 encoder, enhanced with the Blockwise Parallel Transformer, and fine-tuned with hybrid hard-negative sampling. Result: On the MMARCO-VI benchmark, ViRanker shows strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. Conclusion: ViRanker effectively addresses the challenges of reranking for Vietnamese, a low-resource language, and shows how architectural adaptation and data curation can advance reranking in other underrepresented languages. Abstract: This paper presents ViRanker, a cross-encoder reranking model tailored to the Vietnamese language. Built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, ViRanker addresses the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Evaluated on the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. By releasing the model openly on Hugging Face, we aim to support reproducibility and encourage wider adoption in real-world retrieval systems. Beyond Vietnamese, this study illustrates how careful architectural adaptation and data curation can advance reranking in other underrepresented languages.
[13] LITcoder: A General-Purpose Library for Building and Comparing Encoding Models
Taha Binhuraib,Ruimin Gao,Anna A. Ivanova
Main category: cs.CL
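TL;DR: LITcoder is an open-source framework for building and evaluating neural encoding models, providing standardized tools and a modular pipeline to support research on and development of models that predict brain activity.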
Details
Motivation: To remove technical barriers to building neural encoding models by providing standardized tools, so that researchers can easily compose, compare, and extend encoding models without rebuilding core infrastructure. Method: The authors design a modular pipeline covering many methodological choices, including brain datasets, brain regions, stimulus features (both neural-net-based and control features), and downsampling approaches, with built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases. Result: Fitting a range of encoding models to three story listening datasets (LeBel et al. (2023), Narratives, and Little Prince) demonstrates the framework's scalability and versatility, and the paper explores the methodological choices critical for building encoding models for continuous fMRI data. Conclusion: LITcoder is an open-source library that lowers the technical barrier to building and benchmarking neural encoding models, facilitates systematic comparison across models and datasets, strengthens methodological rigor, and accelerates the development of high-quality, high-performance models that predict brain activity. Abstract: We introduce LITcoder, an open-source library for building and benchmarking neural encoding models. Designed as a flexible backend, LITcoder provides standardized tools for aligning continuous stimuli (e.g., text and speech) with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the predictive performance of the resulting model on held-out data. The library implements a modular pipeline covering a wide array of methodological design choices, so researchers can easily compose, compare, and extend encoding models without reinventing core infrastructure. Such choices include brain datasets, brain regions, stimulus features (both neural-net-based and control, such as word rate), downsampling approaches, and many others. In addition, the library provides built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases (W&B). We demonstrate the scalability and versatility of our framework by fitting a range of encoding models to three story listening datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore the methodological choices critical for building encoding models for continuous fMRI data, illustrating the importance of accounting for all tokens in a TR scan (as opposed to just taking the last one, even when contextualized), incorporating hemodynamic lag effects, using train-test splits that minimize information leakage, and accounting for head motion effects on encoding model predictivity.
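Two of the choices named above, pooling all tokens within a TR and modeling hemodynamic lag, can be illustrated with a minimal sketch. This is not LITcoder's API: the function names, the mean-pooling rule, and the delay values are assumptions chosen for illustration only.

```python
def pool_tokens_per_tr(token_times, token_feats, tr, n_trs):
    """Average the features of ALL tokens whose onset falls inside each TR.

    Hypothetical helper: mean pooling is one possible downsampling choice,
    used here in contrast to keeping only the last token of each TR.
    """
    dim = len(token_feats[0])
    sums = [[0.0] * dim for _ in range(n_trs)]
    counts = [0] * n_trs
    for onset, feat in zip(token_times, token_feats):
        idx = int(onset // tr)          # index of the TR containing this token
        if 0 <= idx < n_trs:
            counts[idx] += 1
            for j, value in enumerate(feat):
                sums[idx][j] += value
    # TRs with no tokens keep an all-zero feature vector.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


def add_lags(tr_feats, delays):
    """Concatenate copies of the TR features shifted by each delay (in TRs).

    Hypothetical helper: a finite-impulse-response style expansion, one
    common way to let a linear model absorb the delayed BOLD response.
    """
    dim = len(tr_feats[0])
    lagged = []
    for i in range(len(tr_feats)):
        row = []
        for d in delays:
            src = i - d
            row.extend(tr_feats[src] if src >= 0 else [0.0] * dim)
        lagged.append(row)
    return lagged


# Toy example: three tokens, 1-D features, TR of 2 seconds, 3 scans.
pooled = pool_tokens_per_tr([0.1, 0.5, 2.3], [[1.0], [3.0], [5.0]], tr=2.0, n_trs=3)
lagged = add_lags(pooled, delays=[1, 2])
print(pooled, lagged)
```

The lagged matrix would then feed a regression from stimulus features to voxel responses, with train and test segments kept contiguous in time to limit leakage between neighboring scans.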
Overall, LITcoder lowers technical barriers to encoding model implementation, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity. Project page: https://litcoder-brain.github.io
[14] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing
Zhiyue Liu,Fanrong Ma,Xin Ling
Main category: cs.CL
TL;DR: This paper proposes a counterfactual-enhanced debiasing framework for target-oriented multimodal sentiment classification, addressing dataset biases and improving classification accuracy.
Details
Motivation: Existing methods over-rely on textual content and ignore dataset biases, particularly word-level contextual biases, which lead to spurious correlations and impair classification accuracy. Method: The paper introduces a counterfactual data augmentation strategy and an adaptive debiasing contrastive learning mechanism to reduce spurious correlations and learn robust features. Result: Experimental results on benchmark datasets show that the proposed method outperforms state-of-the-art baselines in target-oriented multimodal sentiment classification. Conclusion: The proposed counterfactual-enhanced debiasing framework effectively reduces spurious correlations in target-oriented multimodal sentiment classification, leading to improved performance over existing methods. Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model's attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. 
Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.
[15] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang,Yuhao Du,Zhanchen Dai,Xiangnan Ma,Kaiqi Kou,Benyou Wang,Haizhou Li
Main category: cs.CL
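TL;DR: This paper proposes EchoX, a training method for speech LLMs that combines semantic representations with dynamically generated speech targets, effectively improving the model's knowledge and reasoning abilities and performing strongly on multiple question-answering benchmarks.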
Details
Motivation: Existing speech-to-speech large language models (SLLMs) suffer degraded knowledge and reasoning abilities because current training paradigms fail to bridge the representational gap between acoustics and semantics. Method: EchoX leverages semantic representations and dynamically generates speech training targets, combining acoustic and semantic learning to improve the knowledge and reasoning abilities of speech LLMs. Result: With about six thousand hours of training data, EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks. Conclusion: By combining semantic representations with dynamically generated speech training targets, EchoX bridges the acoustic-semantic representation gap, allowing a speech LLM to preserve strong reasoning abilities. Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
[16] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
Chin Yuen Kwok,Jia Qi Yip
Main category: cs.CL
TL;DR: A new approach to improve rare word recognition in ASR models by enabling them to predict multiple steps at once, resulting in a significant reduction in word error rate.
Details
Motivation: Contextual biasing improves rare word recognition in ASR models, but current methods like Trie-based biasing are computationally expensive and limited by the need for revocation steps in beam search. Method: The study involves fine-tuning ASR models, specifically Whisper, to look ahead and predict multiple steps simultaneously. This is done using only 10 hours of synthetic data. Result: The adapted method reduced the word error rate on the NSC Part 2 test set from 30.86% to 12.19%. Conclusion: The proposed method of adapting ASR models to predict multiple steps at once improves rare word recognition, outperforming the Trie-based biasing approach by avoiding computationally expensive revocation steps. Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives "bonus scores" to partial hypotheses (e.g. "Bon") that may lead to the generation of the rare word (e.g. "Bonham"). If the full word ("Bonham") isn't ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
[17] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng
Main category: cs.CL
TL;DR: This paper proposes an improved method for rare word recognition in ASR models by enhancing the TCPGen-based contextual biasing approach and introducing a keyword-aware loss function that reduces overfitting and improves decoding accuracy.
Details
Motivation: The motivation is to improve rare word recognition in ASR models, particularly addressing the issue of overfitting caused by synthetic audio artifacts. Method: The method involves adapting ASR models to synthetic data, training a biasing module using a keyword-aware loss function that includes a masked cross-entropy term and a binary classification term. Result: By adapting Whisper to 10 hours of synthetic data, the method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%. Conclusion: The enhanced TCPGen-based contextual biasing approach with a keyword-aware loss function effectively improves rare word recognition in ASR models. Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
[18] GmSLM : Generative Marmoset Spoken Language Modeling
Talia Sternberg,Michael London,David Omer,Yossi Adi
Main category: cs.CL
TL;DR: This paper introduces GmSLM, a novel spoken language model for Marmoset vocal communication, demonstrating its effectiveness in generating realistic vocalizations and supporting research on the neural basis of speech.
Details
Motivation: Marmoset monkeys show complex vocal communication similar to human speech, offering a unique opportunity to study brain activity related to language, which is difficult to access in humans. Method: Generative Marmoset Spoken Language Modeling (GmSLM) was developed as an optimized spoken language model pipeline for Marmoset vocal communication, using zero-shot evaluation metrics with unsupervised data and weakly labeled conversational data. Result: GmSLM generated vocalizations that closely matched real samples acoustically, outperformed a human-speech-based baseline, and effectively distinguished real from artificial conversations in downstream tasks. Conclusion: GmSLM is a practical framework for studying the neural basis of vocal communication and benefits future research in neuroscience, bioacoustics, and evolutionary biology. Abstract: Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates' vocal communication is entirely innate, and show features similar to those of human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM-generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks.
Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations and may support further investigations of the neural basis of vocal communication, providing a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
[19] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
Wenhao Li,Bangcheng Sun,Weihao Ye,Tianyi Zhang,Daohai Yu,Fei Chao,Rongrong Ji
Main category: cs.CL
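TL;DR: This paper proposes CCF, a context compression framework that learns hierarchical latent representations preserving global semantics, effectively addressing the computational and memory burden of long-context language modeling.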
Details
Motivation: Extending the context of language models is essential for capturing complex dependencies, but naive context extension imposes heavy computational and memory burdens. Method: Segment-wise semantic aggregation and key-value memory encoding form compact hierarchical latent representations, optimized with a strategy that combines incremental segment decoding with sparse reservoir sampling. Result: On multiple long-context language modeling benchmarks, CCF achieves competitive perplexity under high compression ratios and significantly improves throughput and memory efficiency. Conclusion: CCF is a new context compression framework that effectively improves the efficiency and scalability of long-context language models. Abstract: Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naive context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
[20] Reading Between the Lines: Classifying Resume Seniority with Large Language Models
Matan Cohen,Shira Shani,Eden Menahem,Yehudit Aperstein,Alexander Apartsin
Main category: cs.CL
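TL;DR: This study explores using large language models to automatically classify candidate seniority from resumes, introducing a hybrid dataset to evaluate model performance and finding that LLMs show promise for detecting seniority inflation.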
Details
Motivation: Accurately assessing candidate seniority is critical, but it is complicated by overstated experience and ambiguous self-presentation. Method: A hybrid dataset of real-world resumes and synthetically generated hard examples is introduced to evaluate how well large language models detect subtle linguistic cues associated with seniority inflation and implicit expertise. Result: The findings indicate that large language models are promising for automated resume seniority classification. Conclusion: The study highlights the potential of large language models to improve automated candidate evaluation systems and to mitigate bias introduced by self-promotional language. Abstract: Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. In this study, we investigate the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising both real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Using the dataset, we evaluate the performance of Large Language Models in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language. The dataset is available for the research community at https://bit.ly/4mcTovt
[21] Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi,Mohit Gupta,Rahul Bouri
Main category: cs.CL
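TL;DR: This paper proposes an LLM-based natural-language-to-SQL approach for question answering over tabular data, whose multi-stage pipeline substantially improves accuracy.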
Details
Motivation: Table QA is challenging because real-world tables vary in structure, size, and data types, so effective methods are needed to improve models' ability to answer questions accurately. Method: A multi-stage pipeline of example selection, SQL query generation, answer extraction, verification, and iterative refinement, using LLMs such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries. Result: The system achieves 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, well above the baseline scores of 26% and 27%. Conclusion: The approach performs well on Table QA and provides insights into applying large language models in this area. Abstract: Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.
[22] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
Grazia Sveva Ascione,Nicolò Tamagnone
Main category: cs.CL
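TL;DR: This paper uses weak supervision and semantic alignment to address data scarcity and scalability in classifying patents by UN Sustainable Development Goals (SDGs), producing an effective multi-label dataset.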
Details
Motivation: The lack of a large labeled dataset limits supervised learning, and existing methods such as keyword searches, transfer learning, and citation-based heuristics lack scalability and generalizability. Method: Patent-to-SDG classification is framed as a weak supervision problem, using structured concepts based on a patent ontology together with cross-domain similarity scores, combined through a rank-based retrieval approach. Result: A silver-standard, soft multi-label dataset mapping patents to SDGs is produced, and two validation strategies demonstrate the effectiveness of the approach. Conclusion: Weak supervision and semantic alignment can enhance SDG classification at scale. Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models and a zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications.
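The abstract says only that per-concept similarity scores are "combined using a rank-based retrieval approach" and does not name the method. As a rough illustration of what such a combination could look like, the sketch below swaps in reciprocal rank fusion, a standard rank-aggregation technique, which is not necessarily what the authors use. The function names, the constant `k = 60`, and the similarity values are all hypothetical.

```python
from typing import Dict


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each label to its 1-based rank, highest score first (ties broken by name)."""
    ordered = sorted(scores, key=lambda label: (-scores[label], label))
    return {label: pos for pos, label in enumerate(ordered, start=1)}


def fuse_ranks(per_concept_scores: Dict[str, Dict[str, float]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over all concept types.

    Assumption: RRF stands in for the paper's unspecified rank-based combination.
    """
    fused: Dict[str, float] = {}
    for scores in per_concept_scores.values():
        for label, pos in rank_positions(scores).items():
            fused[label] = fused.get(label, 0.0) + 1.0 / (k + pos)
    return fused


# Made-up similarity scores between one patent and three SDGs, one table per
# concept type named in the abstract (functions, solutions, applications).
similarities = {
    "functions": {"SDG7": 0.82, "SDG13": 0.74, "SDG3": 0.10},
    "solutions": {"SDG7": 0.65, "SDG13": 0.70, "SDG3": 0.20},
    "applications": {"SDG7": 0.90, "SDG13": 0.40, "SDG3": 0.35},
}

fused = fuse_ranks(similarities)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Working on ranks rather than raw scores means a concept type whose similarities sit on a different scale cannot dominate the combined ordering, which is one plausible reason to prefer a rank-based scheme here.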
These results show that weak supervision and semantic alignment can enhance SDG classification at scale.
[23] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Channdeth Sok,David Luz,Yacine Haddam
Main category: cs.CL
TL;DR: The paper introduces MetaRAG, a metamorphic testing framework for detecting hallucinations in Retrieval-Augmented Generation systems, which operates in real-time, unsupervised, and black-box settings.
Details
Motivation: LLMs suffer from hallucinations, which existing detection methods like SelfCheckGPT and MetaQA do not address in Retrieval-Augmented Generation (RAG) systems. Method: MetaRAG operates in four stages: decomposition of answers into factoids, generation of mutations, verification against context, and aggregation of inconsistency penalties. Result: Experiments on an enterprise dataset show MetaRAG effectively detects hallucinations and enables trustworthy deployment of RAG-based conversational agents. Conclusion: MetaRAG localization helps identify unsupported claims and allows system designers to configure thresholds for identity-sensitive queries. Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. 
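The four stages listed above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the authors' implementation. Sentence splitting stands in for factoid decomposition, and caller-supplied substitution tables stand in for mutation generation. The verifier is a pluggable callable, where a real system would presumably call an LLM or entailment model. Averaging the penalties is an assumed aggregation rule. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical verdict labels returned by the verifier.
ENTAILED, CONTRADICTED, NEUTRAL = "entailed", "contradicted", "neutral"
Verifier = Callable[[str, str], str]  # (variant, context) -> verdict


def decompose(answer: str) -> List[str]:
    """Stage 1 (toy): treat each sentence as one atomic factoid."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def mutate(factoid: str, synonyms: Dict[str, str], antonyms: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stage 2 (toy): build variants from caller-supplied substitution tables."""
    variants = [("synonym", factoid.replace(w, r)) for w, r in synonyms.items() if w in factoid]
    variants += [("antonym", factoid.replace(w, r)) for w, r in antonyms.items() if w in factoid]
    return variants


def factoid_penalty(variants: List[Tuple[str, str]], context: str, verify: Verifier) -> float:
    """Stage 3: synonyms should be entailed by the context, antonyms contradicted."""
    if not variants:
        return 0.0
    expected = {"synonym": ENTAILED, "antonym": CONTRADICTED}
    misses = sum(1 for kind, text in variants if verify(text, context) != expected[kind])
    return misses / len(variants)


def hallucination_score(answer: str, context: str, synonyms: Dict[str, str],
                        antonyms: Dict[str, str], verify: Verifier) -> Tuple[float, Dict[str, float]]:
    """Stage 4 (assumed rule): response score is the mean per-factoid penalty."""
    per_factoid = {f: factoid_penalty(mutate(f, synonyms, antonyms), context, verify)
                   for f in decompose(answer)}
    score = sum(per_factoid.values()) / len(per_factoid) if per_factoid else 0.0
    return score, per_factoid


# Toy run: a lookup table plays the role of the entailment model.
context = "The office opens at nine. Parking is free."
answer = "The office opens at nine. Parking is paid."
verdicts = {
    "The office starts at nine": ENTAILED,
    "The office closes at nine": CONTRADICTED,
    "Parking is charged": CONTRADICTED,
    "Parking is free": ENTAILED,
}
score, per_factoid = hallucination_score(
    answer, context,
    synonyms={"opens": "starts", "paid": "charged"},
    antonyms={"opens": "closes", "paid": "free"},
    verify=lambda variant, ctx: verdicts.get(variant, NEUTRAL),
)
```

In this toy run the unsupported factoid ("Parking is paid") receives the full penalty while the supported one receives none, which shows how a per-factoid breakdown can point to the span where a claim departs from the retrieved context.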
Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
[24] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Molly R Petersen,Claire E Stevenson,Lonneke van der Plas
Main category: cs.CL
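TL;DR: This paper examines the relationship between analogical reasoning processes from cognitive science and natural language processing, arguing that understanding analogical reasoning from a cognitive perspective can improve relational understanding in text.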
Details
Motivation: To connect analogical reasoning processes from cognitive science with NLP research, offering NLP researchers new perspectives and guidance. Method: Summarizes key theories from the cognitive science literature on the processes underlying analogical reasoning and relates them to research in NLP. Result: Shows how analogical reasoning processes from cognitive science map onto concepts in NLP, and why these notions matter for several major challenges in NLP research. Conclusion: A cognitive science lens gives a better understanding of analogical reasoning in NLP and can thereby improve relational understanding in text. Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
[25] Hierarchical Bracketing Encodings Work for Dependency Graphs
Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares
Main category: cs.CL
TL;DR: This paper shows that hierarchical bracketing encodings efficiently represent dependency graphs, leading to improved parsing accuracy.
Details
Motivation: To find a more efficient way to parse dependency graphs without losing structural details like reentrancies, cycles, and empty nodes. Method: Revisiting hierarchical bracketing encodings to represent dependency graphs as sequences, enabling linear-time parsing while maintaining structural information. Result: The method reduces the label space significantly and shows consistent improvements in exact match accuracy across a multilingual and multi-formalism benchmark. Conclusion: Hierarchical bracketing encodings offer a more efficient and effective approach to dependency graph parsing compared to existing methods. Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.
[26] GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
Zhaohan Zhang,Ziquan Liu,Ioannis Patras
Main category: cs.CL
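TL;DR: GrACE is a scalable and reliable confidence elicitation method for large language models: it adds a special token to the vocabulary and measures, in real time, the similarity between the model's last hidden state and that token's embedding, yielding confidence estimates without additional sampling or an auxiliary model.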
Details
Motivation: Existing confidence elicitation methods for large language models are either computationally expensive or poorly calibrated, making them hard to deploy reliably in the real world. Method: GrACE adds a special token to the model and fine-tunes the model to calibrate its confidence, enabling real-time confidence estimation. Result: Experiments show that GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods, and significantly reduces the number of samples required in test-time scaling. Conclusion: GrACE is a practical, scalable, reliable, and real-time solution for deploying large language models with confidence estimation in high-stakes applications. Abstract: Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with calibration targets associated with accuracy. Experiments with three LLMs and two benchmark datasets show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods without resorting to additional sampling or an auxiliary model. Moreover, we propose two strategies for improving test-time scaling based on confidence induced by GrACE. Experimental results show that using GrACE not only improves the accuracy of the final decision but also significantly reduces the number of required samples in the test-time scaling scheme, indicating the potential of GrACE as a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation.
[27] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
Lucie Poláková,Martin Popel,Věra Kloudová,Michal Novák,Mariia Anisimova,Jiří Balhar
Main category: cs.CL
TL;DR: The EdUKate project develops multilingual educational materials by creating a machine translation system tailored to the educational domain, translating interactive exercises, and providing free access to resulting tools and resources.
Details
Motivation: To address the needs of non-Czech-speaking students by providing multilingual learning materials and developing a domain-specific machine translation system that handles technical and scientific terminology. Method: The project involves collaboration between an academic institution and an educational publisher, development of a machine translation system for the educational domain, and processing of formatted content such as XML and PDF. Evaluation includes teacher surveys and system implementation on a web portal. Result: The project translates up to 9,000 interactive exercises into Ukrainian, English, and German, and develops a direct Czech-Ukrainian machine translation system. The applications are made freely available on an educational web portal. Conclusion: The EdUKate project successfully integrates digital education, linguistics, translation studies, and machine translation to create multilingual educational resources, with a focus on Czech-Ukrainian translation tailored to educational needs. Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country's largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system's evaluation and implementation on the web portal. 
All resulting applications are freely available to students, educators, and researchers.
[28] Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs
Vadim Zadykian,Bruno Andrade,Haithem Afli
Main category: cs.CL
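TL;DR: This study proposes a self-supervised approach that combines knowledge graphs with sentence embeddings to improve semantic textual relatedness analysis, with the largest gains in the high-relatedness region.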
Details
Motivation: To address job title matching in resume recommendation systems by going beyond superficial lexical similarity. Method: A self-supervised hybrid architecture combines dense sentence embeddings with domain-specific knowledge graphs, and a stratified evaluation analyzes model performance. Result: Fine-tuned SBERT models augmented with KGs perform best in the high semantic textual relatedness region, reducing RMSE by 25%. Conclusion: The findings highlight the benefits of combining KGs with text embeddings and the importance of regional performance analysis, supporting more targeted model selection in HR systems. Abstract: Semantic Textual Relatedness (STR) captures nuanced relationships between texts that extend beyond superficial lexical similarity. In this study, we investigate STR in the context of job title matching, a key challenge in resume recommendation systems, where overlapping terms are often limited or misleading. We introduce a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to improve both semantic alignment and explainability. Unlike previous work that evaluated models on aggregate performance, our approach emphasizes data stratification by partitioning the STR score continuum into distinct regions: low, medium, and high semantic relatedness. This stratified evaluation enables a fine-grained analysis of model performance across semantically meaningful subspaces. We evaluate several embedding models, both with and without KG integration via graph neural networks. The results show that fine-tuned SBERT models augmented with KGs produce consistent improvements in the high-STR region, where the RMSE is reduced by 25% over strong baselines. Our findings highlight not only the benefits of combining KGs with text embeddings, but also the importance of regional performance analysis in understanding model behavior. This granular approach reveals strengths and weaknesses hidden by global metrics, and supports more targeted model selection for use in Human Resources (HR) systems and applications where fairness, explainability, and contextual matching are essential.
[29] DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning
Daniil Ignatev,Nan Li,Hugh Mee Wong,Anh Dang,Shane Kaszefski Yaschuk
Main category: cs.CL
TL;DR: The paper explores ICL and LDL methods for soft label prediction in annotation disagreements, showing that both approaches are effective, with ICL predicting annotator-specific labels and LDL offering promising results for the perspectivist community.
Details
Motivation: The motivation is to explore effective methods for handling disagreements in annotation data, particularly focusing on predicting perspectivist annotations and generating competitive soft labels. Method: The team tested two approaches: in-context learning with large language models using different example sampling strategies, and label distribution learning with RoBERTa using various fine-tuning methods. Result: The results show that ICL can effectively predict annotator-specific annotations, and aggregating these predictions into soft labels performs competitively. Additionally, LDL methods are found to be promising for soft label prediction. Conclusion: The DeMeVa team concludes that both in-context learning (ICL) and label distribution learning (LDL) methods are effective for soft label prediction in the context of perspectivist annotations, with potential for further exploration. Abstract: This system paper presents the DeMeVa team's approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
[30] Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)
Paolo Pedinotti,Peter Baumann,Nathan Jessurun,Leslie Barrett,Enrico Santus
Main category: cs.CL
TL;DR: MetaGraph is a methodology that uses LLMs to extract knowledge graphs from scientific literature, providing a structured overview of trends in financial NLP research.
Details
Motivation: The rapid transformation of financial NLP by Large Language Models has outpaced traditional surveys, necessitating a new methodology to map research trends and analyze the proliferation of datasets and diversification of data sources. Method: MetaGraph uses an ontology for financial NLP research combined with an LLM-based extraction pipeline applied to 681 papers (2022-2025) for large-scale, data-driven analysis. Result: MetaGraph identifies three key phases in financial NLP: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. Conclusion: MetaGraph provides a structured view of financial NLP evolution, highlighting trends and methodological shifts, and demonstrating a reusable approach for mapping scientific progress in other domains. Abstract: Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. 
This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved, highlighting emerging trends, shifting priorities, and methodological shifts, while also demonstrating a reusable approach for mapping scientific progress in other domains.
[31] Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
Brittany Harbison,Samuel Taubman,Travis Taylor,Ashok K. Goel
Main category: cs.CL
TL;DR: This paper proposes a personality detection model using AI to infer student personality traits from online forum posts, integrating this into the SAMI matchmaking system to improve social connections in online learning environments.
Details
Motivation: Social connections are crucial for learning, but online environments struggle to foster organic social group formation. SAMI helps, but its effectiveness is limited by an incomplete understanding of student personality. Method: A personality detection model was developed using GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts. This model was benchmarked against established models and integrated into SAMI's existing matchmaking system. Result: The proposed personality detection model demonstrated efficacy in inferring personality traits from student posts and was successfully integrated into SAMI's matchmaking system, suggesting that personality traits can complement existing matching factors. Conclusion: The integration of personality detection into SAMI's matchmaking system could enhance social recommendations in online learning environments, though further evaluation is needed to understand its full impact. Abstract: Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations. 
Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.[32] Fluent but Unfeeling: The Emotional Blind Spots of Language Models
Bangzhao Shu,Isha Joshi,Melissa Karnaze,Anh C. Pham,Ishita Kakkar,Sindhu Kothe,Arpine Hovasapian,Mai ElSherief
Main category: cs.CL
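TL;DR: This paper introduces EXPRESS, a benchmark dataset for evaluating how well large language models (LLMs) align with human emotions at a fine-grained level, and finds that LLMs remain limited in capturing the nuances of human emotion.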
Details
Motivation: Existing research typically classifies emotions into a limited set of predefined categories and overlooks more nuanced expressions; this paper aims to fill the gap in evaluating whether LLMs align with human emotions at a fine-grained level. Method: The authors construct EXPRESS, a benchmark dataset of 251 fine-grained, self-disclosed emotion labels collected from Reddit communities, and apply a comprehensive evaluation framework that examines predicted emotion terms and decomposes them into eight basic emotions for fine-grained comparison. Result: Across various prompt settings, popular LLMs still struggle to accurately predict emotions that align with human self-disclosed emotions, and qualitative analysis shows that they sometimes fail to capture contextual cues. Conclusion: Although certain LLMs generate emotion terms consistent with established emotion theories and definitions, they still struggle to capture human self-disclosed emotions, particularly in fine-grained emotion alignment and in understanding contextual cues. Abstract: The versatility of Large Language Models (LLMs) in natural language understanding has made them increasingly popular in mental health research. While many studies explore LLMs' capabilities in emotion recognition, a critical gap remains in evaluating whether LLMs align with human emotions at a fine-grained level. Existing research typically focuses on classifying emotions into predefined, limited categories, overlooking more nuanced expressions. To address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit communities featuring 251 fine-grained, self-disclosed emotion labels. Our comprehensive evaluation framework examines predicted emotion terms and decomposes them into eight basic emotions using established emotion theories, enabling a fine-grained comparison. Systematic testing of prevalent LLMs under various prompt settings reveals that accurately predicting emotions that align with human self-disclosed emotions remains challenging. Qualitative analysis further shows that while certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures. These findings highlight the limitations of LLMs in fine-grained emotion alignment and offer insights for future research aimed at enhancing their contextual understanding.
[33] LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination
Yiqun T. Chen,Tyler H. McCormick,Li Liu,Abhirup Datta
Main category: cs.CL
TL;DR: This study introduces LA-VA, a method combining LLMs and traditional algorithms, which significantly improves verbal autopsy accuracy in predicting causes of death, especially in low-resource settings.
Details
Motivation: Verbal autopsy (VA) is essential for estimating causes of death in resource-limited areas where medical certification is not available. Improving VA accuracy can enhance global health surveillance. Method: The study uses LA-VA, a pipeline combining Large Language Models (LLMs), traditional algorithmic methods, and embedding-based classification to predict causes of death. It evaluates multiple approaches, including GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles, using the PHMRC dataset across three age categories (Adult, Child, Neonate). Result: GPT-5 achieved the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Conclusion: LLM-assisted approaches can significantly improve the accuracy of verbal autopsy in low-resource settings, which has important implications for global health surveillance. Abstract: Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. 
Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.
[34] Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems
Minghang Zhu,Zhengliang Shi,Zhiwei Xu,Shiguang Wu,Lingjie Wang,Pengjie Ren,Zhaochun Ren,Zhumin Chen
Main category: cs.CL
TL;DR: MOAT is a framework for improving collaboration between agents in multi-agent systems by iteratively aligning their capabilities, resulting in better performance on complex tasks.
Details
Motivation: Most existing methods fine-tune agents independently, leading to capability gaps and poor coordination. MOAT aims to address this issue by improving agent collaboration through iterative alignment. Method: The proposed MOAT framework alternates between two stages: Planning Agent Alignment, which optimizes subgoal generation for better guidance, and Grounding Agent Improving, which enhances the grounding agent's generalization through self-generated subgoal-action pairs. Result: Experiments across six benchmarks show that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks. Conclusion: MOAT improves the collaboration between specialized agents in multi-agent systems by using a joint alignment tuning framework, leading to better performance than existing methods on various tasks. Abstract: The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods fine-tune these agents independently, leading to capability gaps among them and poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents' collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. 
Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
[35] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
Siddarth Mamidanna,Daking Rai,Ziyu Yao,Yilun Zhou
Main category: cs.CL
TL;DR: This paper identifies a subgraph in large language models—called All-for-One (AF1)—where most meaningful computation for mental math happens late and at the final token, using techniques CAMA and ABP to uncover how information flows and is processed.
Details
Motivation: The motivation is to understand how large language models process information for tasks like mental math, specifically investigating the extent to which causal self-attention and MLP layers contribute to computation across token positions. Method: The paper uses two techniques—Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP)—to analyze computation patterns in large language models during mental math tasks. It examines how restricting computation in specific layers impacts performance. Result: The researchers identified the All-for-One subgraph (AF1), where computation happens late in the model and mostly at the last token, which gathers information from other tokens in a few specific middle layers. This subgraph transfers across models and performs well on various mental math tasks. Conclusion: The study concludes that the All-for-One subgraph (AF1) is both sufficient and necessary for high performance in mental math tasks across various models and input styles, indicating that computation occurs late and primarily at the last token. Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. 
With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
[36] Steering MoE LLMs via Expert (De)Activation
Mohsen Fayyaz,Ali Modarressi,Hanieh Deilamsalehy,Franck Dernoncourt,Ryan Rossi,Trung Bui,Hinrich Schütze,Nanyun Peng
Main category: cs.CL
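TL;DR: SteerMoE is a new method that controls the behavior of large language models by selectively activating or deactivating experts inside the model. It can improve model behavior without changing model weights, but it also exposes potential safety risks.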
Details
Motivation: To find a way to control the behavior of large language models without retraining or modifying model weights, and to explore latent alignment problems inside the model. Method: Experts associated with specific behaviors are detected and then selectively activated or deactivated during inference to control model behavior. Result: Across 11 benchmarks and 6 LLMs, SteerMoE raises safety by up to 20% and faithfulness by up to 27%; in adversarial attack mode, it lowers safety by 41% on its own and by 100% when combined with existing jailbreak methods. Conclusion: SteerMoE can effectively control model behaviors such as safety and faithfulness, but it can also be used to circumvent safety measures, revealing misalignment hidden inside the model. Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it lowers safety by 41% alone, and by 100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
[37] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Runpeng Dai,Linfeng Song,Haolin Liu,Zhenwen Liang,Dian Yu,Haitao Mi,Zhaopeng Tu,Rui Liu,Tong Zheng,Hongtu Zhu,Dong Yu
Main category: cs.CL
TL;DR: This paper introduces CDE, a curiosity-driven exploration framework for RLVR that improves LLM reasoning by addressing entropy collapse and enhancing exploration through intrinsic actor and critic signals, achieving better performance on AIME benchmarks.
Details
Motivation: Current RLVR methods suffer from poor exploration, leading to premature convergence and entropy collapse. This work aims to address these limitations by introducing an intrinsic curiosity mechanism to improve reasoning and exploration in LLMs. Method: CDE formalizes curiosity using intrinsic signals from both the actor and critic. Actor curiosity is measured by perplexity over generated responses, while critic curiosity is determined by the variance of value estimates from a multi-head architecture. These signals act as exploration bonuses within the RLVR framework. Result: The CDE method achieves an approximate +3 point improvement over standard RLVR approaches like GRPO/PPO on AIME benchmarks. Theoretical analysis connects the critic-wise bonus to count-based exploration and highlights the actor-wise bonus's role in penalizing overconfident errors and promoting response diversity. Conclusion: The proposed Curiosity-Driven Exploration (CDE) framework effectively enhances exploration in Reinforcement Learning with Verifiable Rewards (RLVR), leading to improved performance and addressing calibration collapse in Large Language Models (LLMs). Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. 
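The two curiosity signals described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the additive way the bonuses are combined with the verifiable reward, and the coefficients `alpha` and `beta` are all assumptions made for illustration. Only the definitions of the signals (perplexity of the generated response for the actor, variance of value estimates across critic heads) come from the summary.

```python
import math
from statistics import pvariance


def actor_curiosity(token_logprobs):
    """Perplexity of a sampled response, from its per-token log-probabilities."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


def critic_curiosity(head_values):
    """Variance of value estimates across critic heads, averaged over positions.

    head_values is a list with one entry per head; each entry is a list of
    value estimates, one per token position (all heads share the same length).
    """
    per_position = [pvariance(values) for values in zip(*head_values)]
    return sum(per_position) / len(per_position)


def shaped_reward(verifiable_reward, token_logprobs, head_values,
                  alpha=0.1, beta=0.1):
    """Hypothetical additive combination of the reward and both bonuses."""
    return (verifiable_reward
            + alpha * actor_curiosity(token_logprobs)
            + beta * critic_curiosity(head_values))
```

In this sketch a response the actor finds surprising (high perplexity) or on which the critic heads disagree (high variance) receives a larger bonus. How the bonuses are scaled, clipped, or scheduled in practice is not stated in the summary.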
Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
cs.CV [Back]
[38] Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
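TL;DR: This paper proposes ReT-2, a unified multimodal retrieval model that performs strongly on complex multimodal retrieval tasks, achieving state-of-the-art performance while improving inference speed and memory efficiency.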
Details
Motivation: As multimodal retrieval tasks grow more complex, existing methods remain limited to single-modality queries or documents and rely on task-specific fine-tuning. Method: ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities. Result: ReT-2 performs strongly on the M2KR and M-BEIR benchmarks and improves downstream task performance when used in retrieval-augmented generation pipelines. Conclusion: ReT-2 achieves state-of-the-art performance on multimodal retrieval tasks while offering faster inference and reduced memory usage. Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
[39] Diffusion-Based Action Recognition Generalizes to Untrained Domains
Rogerio Guimaraes,Frank Xiao,Pietro Perona,Markus Marks
Main category: cs.CV
TL;DR: The paper introduces an approach using a Vision Diffusion Model to improve action recognition across varying conditions, achieving human-like robustness.
Details
Motivation: The motivation is to overcome the limitations of current deep learning models which struggle with generalization in action recognition due to large context and viewpoint variations. Method: The method involves using features generated by a Vision Diffusion Model (VDM) and aggregated via a transformer to achieve human-like action recognition across challenging conditions. Result: The model sets a new state-of-the-art across all three generalization benchmarks: classifying actions across animal species, different viewing angles, and different recording contexts. Conclusion: The paper concludes that using a Vision Diffusion Model (VDM) conditioned on earlier timesteps of the diffusion process enhances generalization in action recognition across various conditions like different animal species, viewing angles, and recording contexts, achieving human-like robustness. Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. 
Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff
[40] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability
Tung Vu,Lam Nguyen,Quynh Dao
Main category: cs.CV
TL;DR: This paper introduces PromptGuard, a proactive framework to prevent harmful outputs in LLMs, combining modular prompting techniques and theoretical optimization for real-time harm reduction.
Details
Motivation: The increasing use of LLMs poses risks of generating harmful or biased information, particularly affecting vulnerable populations, necessitating proactive harm prevention mechanisms. Method: PromptGuard utilizes a hybrid technique called VulnGuard Prompt, incorporating few-shot examples, ethical chain-of-thought reasoning, and adaptive role-prompting within a multi-module framework. Result: PromptGuard demonstrates 25-30% analytical harm reduction through entropy bounds and Pareto optimality, with formal theoretical validation and convergence proofs. Conclusion: PromptGuard offers a proactive solution to prevent harmful outputs in LLMs by employing a modular prompting framework with population-specific protective barriers, establishing mathematical foundations for systematic empirical research. Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. 
PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.
[41] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
Waqar Ahmad,Evan Murphy,Vladimir A. Krylov
Main category: cs.CV
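TL;DR: This paper proposes Beta-SOD, a new outlier detection framework based on a Beta distribution mixture model, to improve image re-identification performance under label noise.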
Details
Motivation: To address the high sensitivity of Re-ID methods to label noise, which degrades performance. Method: Re-ID is reframed as a supervised image similarity task using a Siamese network architecture, together with Beta-SOD, a statistical outlier detection framework based on a Beta distribution mixture model. Result: Beta-SOD outperforms existing methods in both de-noising and Re-ID on the CUHK03, Market-1501, and VeRi-776 datasets. Conclusion: Beta-SOD is effective and robust for Re-ID across noise levels, and is paired with loss functions that jointly optimize feature-level similarity learning. Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
[42] SFD-Mamba2Net: Structure-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation
Nan Mu,Ruiqi Song,Zhihui Xu,Jingfeng Jiang,Chen Zhao
Main category: cs.CV
TL;DR: This paper proposes SFD-Mamba2Net, a novel framework for improving coronary artery segmentation and stenosis detection in ICA images by leveraging structural enhancement and high-frequency perception techniques.
Details
Motivation: Coronary Artery Disease (CAD) is a leading cause of death globally, and accurate diagnosis via Invasive Coronary Angiography (ICA) is challenging due to low-contrast, high-noise images and complex vascular structures. This study aims to improve segmentation and detection accuracy in ICA images. Method: The study proposed SFD-Mamba2Net, an end-to-end framework for ICA-based vascular segmentation and stenosis detection. It incorporates a Curvature-Aware Structural Enhancement (CASE) module in the encoder and a Progressive High-Frequency Perception (PHFP) module in the decoder. Result: SFD-Mamba2Net showed superior performance over existing methods in segmentation and achieved the best results in stenosis detection. Conclusion: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics and achieved the highest true positive rate and positive predictive value in stenosis detection. Abstract: Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. 
In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.
[43] Live(r) Die: Predicting Survival in Colorectal Liver Metastasis
Muhammad Alberb,Helen Cheung,Anne Martel
Main category: cs.CV
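TL;DR: This study proposes a new automated framework that combines 3D medical image segmentation with radiomics analysis, substantially improving the accuracy of survival prediction for colorectal liver metastasis surgery.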
Details
Motivation: Current prognostic models for colorectal liver metastasis have limited predictive power, especially in multifocal cases, so a more accurate and comprehensive prediction method is needed. Method: The study develops an automated framework consisting of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline uses promptable foundation models to complete missing labels from partially annotated data and introduces SAMONAI, a new zero-shot 3D prompt propagation algorithm. The radiomics pipeline performs feature extraction and survival prediction with SurvAMINN, an autoencoder-based multiple instance neural network. Result: On an institutional dataset of 227 patients, the framework surpasses existing clinical and genomic biomarkers in survival prediction, improving the C-index by more than 10%. Conclusion: The study presents a framework that combines automated segmentation algorithms with radiomics-based survival analysis, enabling accurate prediction of surgical outcomes in colorectal liver metastasis while remaining annotation-efficient and interpretable. Abstract: Colorectal cancer frequently metastasizes to the liver, significantly reducing long-term survival. While surgical resection is the only potentially curative treatment for colorectal liver metastasis (CRLM), patient outcomes vary widely depending on tumor characteristics along with clinical and genomic factors. Current prognostic models, often based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. We present a fully automated framework for surgical outcome prediction from pre- and post-contrast MRI acquired before surgery. Our framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by leveraging promptable foundation models to complete missing labels. Also, we propose SAMONAI, a novel zero-shot 3D prompt propagation algorithm that leverages the Segment Anything Model to segment 3D regions of interest from a single point prompt, significantly improving our segmentation pipeline's accuracy and efficiency. The predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts features from each tumor and predicts survival using SurvAMINN, a novel autoencoder-based multiple instance neural network for survival analysis. SurvAMINN jointly learns dimensionality reduction and hazard prediction from right-censored survival data, focusing on the most aggressive tumors. 
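The C-index cited in the result above is a ranking metric for right-censored survival data. As a minimal sketch (not the authors' evaluation code), the Python function below computes Harrell's concordance index under simplifying assumptions: a pair of patients is comparable only when the one with the shorter follow-up time had an observed event, pairs with tied times are skipped, and tied risk scores count as half concordant. The function and argument names are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (simplified sketch).

    times  -- follow-up time for each patient
    events -- truthy if the event was observed, falsy if censored
    risks  -- predicted risk score; higher means earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # higher risk for the earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking.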
Extensive evaluation on an institutional dataset comprising 227 patients demonstrates that our framework surpasses existing clinical and genomic biomarkers, delivering a C-index improvement exceeding 10%. Our results demonstrate the potential of integrating automated segmentation algorithms and radiomics-based survival analysis to deliver accurate, annotation-efficient, and interpretable outcome prediction in CRLM.
[44] Discovering Divergent Representations between Text-to-Image Models
Lisa Dunlap,Joseph E. Gonzalez,Trevor Darrell,Fabian Caba Heilbron,Josef Sivic,Bryan Russell
Main category: cs.CV
TL;DR: 本文提出CompCon算法,用于揭示文本到图像生成模型之间的视觉表示差异,并通过构建ID2数据集验证其效果。
Details
Motivation: 研究两个文本到图像生成模型在视觉表征上的差异,挖掘特定提示下模型输出的不同属性。 Method: 提出了CompCon(Comparing Concepts)进化搜索算法,结合自动化数据生成管道发现模型输出中的视觉属性差异及关联提示概念。 Result: 发现了如PixArt将“孤独”与湿润街道关联、Stable Diffusion 3.5对非裔美国人媒体职业描绘等差异,并构建了ID2数据集验证方法有效性。 Conclusion: CompCon成功揭示了不同文本到图像生成模型间视觉表征的差异,并通过自动化数据集ID2验证了其有效性。 Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon[45] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery
Yibin Wang,Wondimagegn Beshah,Padmanava Dash,Haifeng Wang
Main category: cs.CV
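TL;DR: This study proposes a U-Net-based deep learning model that identifies and corrects cloud shadow and sun glint regions in unmanned aerial system (UAS) imagery, improving the accuracy of water quality parameter estimation.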
Details
Motivation: UAS remote sensing imagery used for water quality parameter estimation is degraded by cloud shadows and sun glint, so an effective method is needed to address these problems. Method: A U-Net-based deep learning model is trained on pixel-level data to identify and extract regions affected by cloud shadows and sun glint, and the best settings for model training are evaluated. Result: The study identifies a high-quality image correction model that effectively recovers cloud shadow and sun glint regions in the images. Conclusion: The proposed method performs well on UAS remote sensing imagery and offers an effective way to handle cloud shadows and sun glint. Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain as per required spatial, spectral, temporal, and radiometric resolutions for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. A limitation of the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two effects pose serious issues for estimating water quality parameters from the UAS images. This study proposes a novel machine learning approach that first identifies and extracts regions with cloud shadows and sun glint and separates such regions from non-obstructed clear sky regions and sun-glint unaffected regions. The data was extracted from the images at pixel level to train a U-Net-based deep learning model, and the best settings for model training were identified based on various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.
[46] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Puskal Khadka,Rodrigue Rizk,Longwei Wang,KC Santosh
Main category: cs.CV
TL;DR: CoSwin enhances Vision Transformers by combining global attention with local convolutional features, improving performance on small vision datasets.
Details
Motivation: Vision Transformers (ViTs) struggle with local feature extraction on small datasets due to lack of inductive biases like locality; CoSwin aims to address this limitation. Method: CoSwin proposes a learnable local feature enhancement module integrated into each attention block, combining hierarchical shifted window attention with convolutional features. Result: CoSwin outperforms state-of-the-art models on benchmarks like CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet, achieving significant accuracy improvements over the Swin Transformer. Conclusion: CoSwin improves the performance of vision transformers on small datasets by integrating localized convolutional feature learning with global attention mechanisms. Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. 
These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights are available at https://github.com/puskal-khadka/coswin
[47] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning
Karim Slimani,Catherine Achard,Brahim Tamadazte
Main category: cs.CV
TL;DR: This paper introduces iMatcher, a fully differentiable framework for feature matching in point cloud registration that improves rigid registration performance by incorporating local and global geometric consistency.
Details
Motivation: To improve feature matching in point cloud registration by incorporating both local and global geometric consistency. Method: iMatcher is a fully differentiable framework for feature matching in point cloud registration that predicts a geometrically consistent confidence matrix through local graph embedding and global geometric consistency learning. Result: iMatcher achieved state-of-the-art inlier ratios on KITTI, KITTI-360, and 3DMatch datasets. Conclusion: iMatcher significantly improves rigid registration performance and demonstrates robustness across diverse settings. Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
[48] UltrON: Ultrasound Occupancy Networks
Magdalena Wysocki,Felix Duelmer,Ananya Bal,Nassir Navab,Mohammad Farid Azampour
Main category: cs.CV
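TL;DR: This paper proposes UltrON, a new method for 3D reconstruction from ultrasound images that uses acoustic features and a novel loss function to improve geometric consistency, addressing the occlusion and annotation-dependency problems of traditional methods.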
Details
Motivation: In free-hand ultrasound imaging, sonographers rely on expertise to integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians with this task, but traditional methods have limitations. Method: Proposes an occupancy-based representation and introduces UltrON, which uses acoustic features and a novel loss function to improve optimization. Result: UltrON generalizes to shapes of the same anatomy and mitigates the limitations of occlusions and sparse labeling; code and dataset will be made available. Conclusion: By leveraging acoustic features and a novel loss function, UltrON improves geometric consistency in a weakly supervised optimization regime, paving the way for more accurate 3D reconstruction. Abstract: In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound's view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON that leverages acoustic features to improve geometric consistency in a weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy. 
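Editorial illustration (not from the paper): the abstract contrasts two implicit shape representations, a signed distance function (SDF) and an occupancy function. The minimal sketch below shows how the two relate for a simple sphere. The function names, the sphere itself, and the sigmoid `sharpness` parameter are all assumptions chosen for illustration; UltrON's actual network and loss are not reproduced here.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere surface: negative inside, positive outside."""
    return math.dist(point, center) - radius


def occupancy_from_sdf(sdf_value):
    """Binary occupancy: 1.0 inside or on the surface, 0.0 outside."""
    return 1.0 if sdf_value <= 0.0 else 0.0


def soft_occupancy(sdf_value, sharpness=10.0):
    """Smooth occupancy in (0, 1): a sigmoid of the negated SDF.

    The sigmoid and its sharpness are hypothetical choices for this sketch.
    The exponent is clamped so math.exp cannot overflow for large distances.
    """
    z = max(min(sharpness * sdf_value, 60.0), -60.0)
    return 1.0 / (1.0 + math.exp(z))


if __name__ == "__main__":
    for p in [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
        d = sphere_sdf(p)
        print(p, d, occupancy_from_sdf(d), round(soft_occupancy(d), 4))
```

An SDF stores a distance at every query point, whereas occupancy only needs an inside/outside decision. That is one plausible reason an occupancy target is easier to supervise from sparse or weak labels, although the abstract does not spell out this argument.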
We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.
[49] Implicit Neural Representations of Intramyocardial Motion and Strain
Andrew Bell,Yan Kit Choi,Steffen Peterson,Andrew King,Muhummad Sohaib Nazir,Alistair Young
Main category: cs.CV
TL;DR: A new method using INRs for predicting LV displacement was developed, offering high accuracy, low strain error, and significantly faster processing speed than existing deep learning baselines.
Details
Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI is an important but challenging task. Method: Implicit neural representations (INRs), conditioned on learned latent codes, were used to predict continuous LV displacement without requiring inference-time optimization. Result: The method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. It was also ~380× faster than the most accurate baseline. Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large CMR datasets. Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement -- without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is ~380× faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.
[50] E-MLNet: Enhanced Mutual Learning for Universal Domain Adaptation with Sample-Specific Weighting
Samuel Felipe dos Santos,Tiago Agostinho de Almeida,Jurandy Almeida
Main category: cs.CV
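TL;DR: E-MLNet proposes a dynamic weighting strategy that focuses on key class boundaries to improve universal domain adaptation, outperforming the existing MLNet method on multiple benchmark datasets.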
Details
Motivation: Existing UniDA methods such as MLNet treat all classifiers equally, which dilutes the learning signal, so a more focused adaptation strategy is needed. Method: E-MLNet introduces a dynamic weighting strategy into OEM, focusing on the most relevant class boundaries for each target sample to sharpen classification. Result: Experiments show that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and outperforms MLNet on 22 of 31 tasks (Open-Partial DA) and 19 of 31 tasks (Open-Set DA). Conclusion: E-MLNet's dynamic weighting strategy improves classification in UniDA tasks, performing best on the VisDA and ImageCLEF datasets and outperforming MLNet on most tasks. Abstract: Universal Domain Adaptation (UniDA) seeks to transfer knowledge from a labeled source to an unlabeled target domain without assuming any relationship between their label sets, requiring models to classify known samples while rejecting unknown ones. Advanced methods like Mutual Learning Network (MLNet) use a bank of one-vs-all classifiers adapted via Open-set Entropy Minimization (OEM). However, this strategy treats all classifiers equally, diluting the learning signal. We propose the Enhanced Mutual Learning Network (E-MLNet), which integrates a dynamic weighting strategy into OEM. By leveraging the closed-set classifier's predictions, E-MLNet focuses adaptation on the most relevant class boundaries for each target sample, sharpening the distinction between known and unknown classes. We conduct extensive experiments on four challenging benchmarks: Office-31, Office-Home, VisDA-2017, and ImageCLEF. The results demonstrate that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and exhibits superior robustness over its predecessor. E-MLNet outperforms the strong MLNet baseline in the majority of individual adaptation tasks -- 22 out of 31 in the challenging Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting -- confirming the benefits of our focused adaptation strategy.
[51] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Umair Hassan
Main category: cs.CV
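TL;DR: This paper presents COCO-Urdu, currently the largest publicly available Urdu image-caption dataset, which aims to reduce language bias in multimodal research.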
Details
Motivation: Urdu is severely under-resourced in multimodal and vision-language research; the lack of large-scale, high-quality datasets has limited the development of Urdu-capable systems and introduced bias into multilingual vision-language models. Method: Captions are translated with SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that combines COMET-Kiwi, CLIP-based similarity, and BERTScore with back-translation. Low-scoring captions are iteratively refined with open-source large language models. Result: COCO-Urdu contains 59,000 images and 319,000 Urdu captions, and consistently strong results are reported on BLEU, SacreBLEU, and chrF. Conclusion: The release of COCO-Urdu aims to reduce language bias in multimodal research and to lay a foundation for inclusive vision-language systems. Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
[52] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI
Chenqian Le,Yilin Zhao,Nikasadat Emami,Kushagra Yadav,Xujin "Chris" Liu,Xupeng Chen,Yao Wang
Main category: cs.CV
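TL;DR: This paper proposes VoxelFormer, a transformer-based framework for multi-subject visual decoding from fMRI that achieves parameter-efficient neural decoding through efficient module design.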
Details
Motivation: To address the reliance of existing fMRI-based visual decoding methods on subject-specific training, thereby improving scalability and practicality. Method: Introduces the transformer-based VoxelFormer architecture, comprising a Token Merging Transformer (ToMer) and a query-driven Q-Former, for visual decoding from fMRI data. Result: Experiments on the 7T Natural Scenes Dataset show that VoxelFormer achieves competitive retrieval performance on subjects included in training with significantly fewer parameters than existing methods. Conclusion: VoxelFormer shows promise for parameter-efficient neural decoding and highlights the value of token merging and query-based transformers. Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
[53] Integrating Anatomical Priors into a Causal Diffusion Model
Binxu Li,Wei Peng,Mingjie Li,Ehsan Adeli,Kilian M. Pohl
Main category: cs.CV
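TL;DR: This study proposes PCGM, a generative diffusion model that incorporates anatomical constraints to generate high-quality counterfactual brain MRIs. It is the first to replicate the subtle effects of a disease, advancing the use of synthetic MRI in studies of subtle morphological differences.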
Details
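Motivation: 3D brain MRI studies often focus on subtle morphometric differences between cohorts that are hard to detect visually, and MRI acquisition is costly, so image synthesis, and counterfactual image generation in particular, is valuable. However, existing counterfactual models lack explicit inductive biases and struggle to preserve fine-grained anatomical details, so the generated MRIs are not anatomically plausible. Method: Proposes the Probabilistic Causal Graph Model (PCGM), which explicitly integrates anatomical constraints as priors into a generative diffusion framework: a probabilistic graph module captures the constraints and translates them into spatial binary masks, a 3D extension of ControlNet encodes the masks, a novel counterfactual denoising UNet is constrained by them, and a 3D diffusion decoder generates high-quality brain MRIs. Result: Experiments on multiple datasets show that PCGM generates brain MRIs of higher quality than several baselines, and demonstrate for the first time that brain measurements extracted from the generated counterfactuals replicate the subtle effects of a disease on cortical brain regions reported in the neuroscience literature. Conclusion: PCGM generates high-quality structural brain MRIs and shows for the first time that synthetic MRIs can replicate the subtle effects of a disease on cortical regions, an important step for the use of synthetic MRI in studying subtle morphological differences. Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image syntheses, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints on a voxel-level as prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. 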
Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.
[54] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
Qiuhui Chen,Xuancheng Yao,Huping Ye,Yi Hong
Main category: cs.CV
TL;DR: Med3DInsight is a novel pretraining framework that integrates 3D image encoders with 2D MLLMs to enhance 3D medical image understanding by leveraging text descriptions, offering a scalable solution without requiring human annotations.
Details
Motivation: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. Method: Med3DInsight integrates 3D image encoders with 2D MLLMs via a plane-slice-aware transformer module, and employs a partial optimal transport based alignment to tolerate noise in LLM-generated content. Result: Extensive experiments demonstrate Med3DInsight's state-of-the-art performance on segmentation and classification tasks across various public datasets with CT and MRI modalities, outperforming current SSL methods. Conclusion: Med3DInsight provides a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations and can be seamlessly integrated into existing 3D medical image understanding networks to enhance their performance. Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise introduced by potential noises in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. 
Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.
[55] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
Hesham M. Shehata,Mohammad Abdolrahmani
Main category: cs.CV
TL;DR: This paper introduces a multi-task learning approach incorporating object interaction data, improving action recognition accuracy to 99.25%.
Details
Motivation: Traditional graph convolutional neural networks (GCNs) struggle with recognizing human-object interactions due to insufficient scene representation and suboptimal learning architectures, prompting the need for an improved approach. Method: A multi-task learning methodology was adopted, integrating human skeleton poses with fixed object information in the environment. Real-world data were collected and organized into interaction and non-interaction classes for evaluation. Result: The proposed method successfully recognized interaction and non-interaction actions with a 99.25% accuracy, surpassing the baseline model's performance by 2.75%. Conclusion: The proposed multi-task learning approach, incorporating interaction area information and fixed object context, enhances human action recognition performance, achieving an accuracy of 99.25% and outperforming the base model by 2.75%. Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, it fails to detect human-object interaction cases successfully due to the lack of effective representation of the scene information and appropriate learning architectures. In this context, we propose a methodology to utilize human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. 
The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.
[56] IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Jifeng Shen,Haibo Zhan,Xin Zuo,Heng Fan,Xiaohui Yuan,Jun Li,Wankou Yang
Main category: cs.CV
TL;DR: IRDFusion proposes a new feature fusion framework using MFRM and DFFM modules to improve multispectral object detection by adaptively enhancing relevant features and suppressing noise, achieving superior performance on multiple datasets.
Details
Motivation: Current multispectral object detection methods struggle with retaining extraneous background or noise during feature fusion, which limits perceptual performance. This work aims to address this limitation by proposing an innovative feature fusion framework. Method: IRDFusion uses two modules: the Mutual Feature Refinement Module (MFRM) to enhance intra- and inter-modal feature representations, and the Differential Feature Feedback Module (DFFM) to dynamically compute inter-modal differential features as guidance signals, which are fed back to MFRM for adaptive fusion. Result: IRDFusion achieves state-of-the-art performance on FLIR, LLVIP, and M³FD datasets, consistently outperforming existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Conclusion: IRDFusion is a robust and effective framework for multispectral object detection that achieves state-of-the-art performance by enabling high-quality cross-modal fusion through iterative feedback, enhancing salient relational signals while suppressing feature noise. Abstract: Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). 
The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M³FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
[57] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Hengyu Fang,Yijiang Liu,Yuan Du,Li Du,Huanrui Yang
Main category: cs.CV
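TL;DR: SQAP-VLA is an efficient inference acceleration framework for VLA models that co-designs quantization and token pruning to resolve the incompatibility between them, yielding significant performance gains.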
Details
Motivation: The high computational and memory costs of VLA models hinder their practical deployment, and existing compression and acceleration methods cannot combine quantization and token pruning for a holistic efficiency gain. Method: SQAP-VLA co-designs the quantization and token pruning pipeline, proposing new quantization-aware token pruning criteria and improving the quantizer design to enhance pruning effectiveness. Result: Applied to standard VLA models, SQAP-VLA achieves a 1.93× inference speedup and a 4.5% improvement in average success rate. Conclusion: SQAP-VLA resolves the incompatibility between quantization and token pruning in VLA models, achieving efficient inference acceleration while preserving core model performance. Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.
[58] S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization
Chenghao Zhang,Lun Luo,Si-Yuan Cao,Xiaokai Bai,Yuncheng Jin,Zhu Yu,Beinan Yu,Yisen Wang,Hui-Liang Shen
Main category: cs.CV
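TL;DR: This paper proposes S-BEVLoc, a self-supervised framework for LiDAR global localization that requires no ground-truth poses, is highly scalable, and performs strongly.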
Details
Motivation: Existing supervised methods require high-precision ground-truth poses, which are costly to obtain, so a more economical and scalable solution is needed. Method: Proposes S-BEVLoc, a self-supervised learning framework based on bird's-eye view (BEV) images that uses a convolutional neural network (CNN) to extract local features and NetVLAD to aggregate global descriptors, and introduces a SoftCos loss to improve learning. Result: Experiments on the KITTI and NCLT datasets show that S-BEVLoc performs strongly on place recognition, loop closure, and global localization tasks. Conclusion: S-BEVLoc is a highly scalable LiDAR global localization framework that needs no ground-truth pose supervision, and experiments show that it achieves state-of-the-art performance on large-scale datasets. Abstract: LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. A convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.
[59] FPI-Det: a face--phone Interaction Dataset for phone-use detection and understanding
Jianqin Gao,Tianqi Wang,Yu Zhang,Yishu Zhang,Chenyuan Wang,Allan Dong,Zihao Wang
Main category: cs.CV
TL;DR: The paper introduces FPI-Det, a new dataset for detecting phone usage in various environments, featuring 22,879 annotated images and baseline results using YOLO and DETR detectors.
Details
Motivation: The motivation stems from the increasing use of mobile devices and the need for vision systems to detect phone usage in diverse conditions, which requires understanding behavioral context and human-device interactions that existing benchmarks do not fully capture. Method: The authors introduce FPI-Det, a new dataset containing 22,879 images with synchronized annotations for faces and phones across different environments. They evaluate representative YOLO and DETR detectors to provide baseline results and analyze performance across object sizes, occlusion levels, and environments. Result: The result is the creation of the FPI-Det dataset, which features extreme scale variation, frequent occlusions, and varied capture conditions. The evaluation of YOLO and DETR detectors provides baseline performance metrics across different object sizes, occlusion levels, and environments. Conclusion: The paper concludes by introducing the FPI-Det dataset, which addresses the gap in detecting phone usage by considering fine-grained human-device interactions across various scenarios. Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human--device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. 
We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.
[60] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Junhao Xing,Ryohei Miyakawa,Yang Yang,Xinpeng Liu,Risa Shinoda,Hiroaki Santo,Yosuke Toda,Fumio Okura
Main category: cs.CV
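TL;DR: This paper proposes ZeroPlantSeg, a zero-shot segmentation method that integrates a foundation segmentation model and a vision-language model to extract plant individuals from top-view images.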
Details
Motivation: Hierarchical segmentation typically requires annotated training datasets, which are often species-specific and demand considerable human labor. A zero-shot segmentation method is proposed to overcome this. Method: Combines a foundation segmentation model with a vision-language model to perform hierarchical segmentation of plant individuals in a zero-shot manner. Result: Evaluations on datasets covering multiple plant species, growth stages, and shooting environments show that the method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Conclusion: ZeroPlantSeg is a training-free method for segmenting plant individuals that integrates a foundation segmentation model and a vision-language model to extract plant individuals from top-view images. Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
[61] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng,Yifan Zhang,Xiang An,Ziyong Feng,Kaicheng Yang,Qichuan Ding
Main category: cs.CV
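TL;DR: By improving the data construction pipeline and introducing the GA-DMS framework, this work addresses the data scarcity and global contrastive learning limitations of CLIP in person representation learning, significantly improving model performance.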
Details
Motivation: Applying CLIP to person representation learning faces two key challenges: the lack of large-scale annotated person-centric image-text data, and the inherent limitations of global contrastive learning in preserving the discriminative local features needed for fine-grained matching. Method: A noise-resistant data construction pipeline is developed, and the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework is introduced, which improves cross-modal alignment by adaptively masking noisy text tokens. Result: The pipeline yields WebPerson, a large-scale dataset of 5 million high-quality person-centric image-text pairs, and the GA-DMS framework achieves state-of-the-art performance on multiple benchmarks. Conclusion: The GA-DMS framework achieves state-of-the-art performance on multiple benchmarks, advancing the use of CLIP for person representation learning. Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
[62] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain
Bin Huang,Kang Chen,Bingxuan Li,Huafeng Liu,Qiegen Liu
Main category: cs.CV
TL;DR: ALL-PET is a PET imaging foundation model that works with minimal data and low computational resources, offering high-quality results and versatility across tasks.
Details
Motivation: Building large-scale foundation models for PET imaging is limited by scarce labeled data and insufficient computational resources, which ALL-PET aims to address. Method: ALL-PET uses a latent diffusion model with three innovations: Radon mask augmentation strategy, dynamic multi-mask mechanism, and transparent medical attention mechanism. Result: ALL-PET achieves high-quality sinogram generation using only 500 samples, performs comparably to models trained on larger datasets, and generalizes across multiple tasks efficiently. Conclusion: ALL-PET is a low-resource, low-shot PET foundation model that achieves high-quality sinogram generation with minimal data, generalizing across various tasks while operating efficiently. Abstract: Building large-scale foundation models for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. 
Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
[63] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology
Dylan Peek,Matthew P. Skerritt,Stephan Chalup
Main category: cs.CV
TL;DR: This study compares the noise robustness of ANNs and PH in inferring topological structures from data, showing that ANNs can be a more effective alternative under structural noise.
Details
Motivation: PH and ANNs offer contrasting approaches to inferring topological structure from data. The study aims to examine the noise robustness of ANNs in predicting topological features. Method: A supervised neural network was trained to predict Betti numbers in 2D binary images and was compared against a PH pipeline based on cubical complexes and SEDT. Result: ANNs outperformed the PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Conclusion: ANNs are a compelling alternative to PH under structural noise for topology estimation. Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
[64] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation
Yuiko Uchida,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
Main category: cs.CV
TL;DR: This paper proposes OSIM, a new evaluation metric for 3D scenes that aligns better with human perception through an object-centric approach.
Details
Motivation: Existing evaluation metrics focus mainly on overall image quality and diverge from human perception, whereas human understanding of 3D scenes rests primarily on recognizing individual objects. Method: An object detection model and its feature representations are used to quantify the "objectness" of each object in the scene, enabling object-centric evaluation. Result: A user study confirms that OSIM matches human perception more closely than existing metrics, and recent 3D reconstruction and generation models are re-evaluated. Conclusion: OSIM is a new object-centric evaluation metric for 3D scenes that aligns better with human perception. Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.
[65] Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang,Piotr Koniusz,Yongsheng Gao
Main category: cs.CV
TL;DR: This survey provides a dataset-driven perspective on the evolution of video understanding models, identifying key dataset characteristics that guide model design and offering a framework for aligning datasets and architectures.
Details
Motivation: The motivation is to understand how datasets influence the development of video understanding models, offering a new perspective to better align datasets and model design. Method: The paper presents a survey of video understanding models, analyzing them from a dataset-driven perspective and interpreting their evolution as responses to dataset characteristics. Result: The paper identifies dataset characteristics such as motion complexity, temporal span, hierarchical composition, and multimodal richness as key drivers of architectural evolution, and reinterprets model milestones as responses to these pressures. Conclusion: This survey concludes that the evolution of video understanding models is significantly influenced by dataset characteristics, and provides a framework to align datasets, inductive biases, and architectures for more effective model design. Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.
[66] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge
JaeWoong Shin,Jeongun Ryu,Aaron Valero Puche,Jinhee Lee,Biagio Brattoli,Wonkyung Jung,Soo Ick Cho,Kyunghyun Paeng,Chan-Young Ock,Donggeun Yoo,Zhaoyang Li,Wangkai Li,Huayu Mai,Joshua Millward,Zhen He,Aiden Nibali,Lydia Anette Schoenpflug,Viktor Hendrik Koelzer,Xu Shuoyu,Ji Zheng,Hu Bin,Yu-Wen Lo,Ching-Hui Yang,Sérgio Pereira
Main category: cs.CV
TL;DR: The OCELOT 2023 challenge showed that including cell-tissue interactions in models significantly improves cell detection performance over traditional methods.
Details
Motivation: Pathologists use multiple magnifications to make comprehensive diagnoses, but existing deep learning models struggle to replicate this behavior and learn interdependent structures across scales. Method: A dataset with multi-scale overlapping cell and tissue annotations was created, and participants developed models to explore the importance of cell-tissue interactions. Result: Top models achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model, showing the effectiveness of incorporating cell-tissue relationships. Conclusion: The OCELOT 2023 challenge demonstrated that incorporating multi-scale semantics, particularly cell-tissue interactions, significantly improves cell detection performance compared to traditional cell-only methods. Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. 
Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.
[67] RT-DETR++ for UAV Object Detection
Yuan Shufang
Main category: cs.CV
TL;DR: The paper proposes RT-DETR++, an enhanced encoder for object detection in UAV imagery, which effectively addresses challenges like small objects, scale variations, and occlusion while ensuring real-time performance.
Details
Motivation: Object detection in UAV imagery presents significant challenges due to issues such as densely packed small objects, scale variations, and occlusion. Method: The paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model by introducing a channel-gated attention-based upsampling/downsampling mechanism and incorporating CSP-PAC during feature fusion. Result: Evaluation demonstrates that the novel neck design achieves superior performance in detecting small and densely packed objects while maintaining sufficient speed for real-time detection without increasing computational complexity. Conclusion: The study provides an effective approach for feature encoding design in real-time detection systems. Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel hollow convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.
[68] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
Zhiyue Liu,Sihang Liu,Jinyuan Liu,Xinru Zhang
Main category: cs.CV
TL;DR: This paper proposes a training-free framework for knowledge-based visual question answering that enhances knowledge relevance and reduces redundancy, leading to improved performance over existing methods.
Details
Motivation: Existing KB-VQA approaches suffer from noise due to knowledge redundancy when directly augmenting models with retrieved information. This work aims to improve accuracy by enhancing knowledge relevance and reducing redundancy. Method: The framework involves three key steps: (1) knowledge retrieval focusing on essential parts of image-question pairs, (2) prompting large models to extract answer-beneficial segments, and (3) selective knowledge integration based on model confidence. Result: Extensive experiments demonstrate that the framework outperforms state-of-the-art methods in KB-VQA tasks. Conclusion: The proposed training-free framework with knowledge focusing for KB-VQA outperforms state-of-the-art methods by acquiring accurate and critical knowledge. Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework distills the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. 
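The selective knowledge integration step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(answer, confidence)` return convention, the prompt template, and the 0.7 threshold are all hypothetical, and the image input is omitted to keep the example text-only.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   answer_fn(prompt)    -> (answer, confidence in [0, 1])
#   retrieve_fn(question) -> a knowledge string
def answer_with_selective_knowledge(
    question: str,
    answer_fn: Callable[[str], Tuple[str, float]],
    retrieve_fn: Callable[[str], str],
    threshold: float = 0.7,  # assumed confidence cut-off
) -> str:
    """Use retrieved knowledge only when the model is not confident."""
    answer, confidence = answer_fn(question)
    if confidence >= threshold:
        # Confident enough: skip retrieval to avoid injecting noise.
        return answer
    knowledge = retrieve_fn(question)
    augmented = f"Knowledge: {knowledge}\nQuestion: {question}"
    answer, _ = answer_fn(augmented)
    return answer


# Toy stand-ins for a model and a retriever, for illustration only.
def toy_answer(prompt: str) -> Tuple[str, float]:
    if prompt.startswith("Knowledge:"):
        return ("Paris", 0.9)
    return ("unsure", 0.3)


def toy_retrieve(question: str) -> str:
    return "The Eiffel Tower is in Paris."


result = answer_with_selective_knowledge(
    "Where is the Eiffel Tower?", toy_answer, toy_retrieve
)
```

With the toy stand-ins, the first call is below the threshold, so the retrieved knowledge is prepended and the model is queried a second time.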
Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
[69] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Yulin Tong,Fengzong Zhang,Haiqin Cheng
Main category: cs.CV
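TL;DR: This paper proposes CWSSNet, a classification framework that addresses feature redundancy in hyperspectral image classification and shows good performance and robustness in a real-world application.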
Details
Motivation: Hyperspectral remote sensing has significant application value in fields such as forestry ecology and precision agriculture, and it also raises the requirements for fine-grained ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, their numerous bands, high dimensionality, and spectral mixing often cause significant feature redundancy. Method: The study uses hyperspectral images from the ZY1F satellite as the data source and selects Yugan County, Shangrao City, Jiangxi Province as the study area for ground object classification. It proposes a classification framework named CWSSNet, which combines 3D spectral-spatial features with wavelet convolution, integrates multimodal information using a multiscale convolutional attention module, and introduces multi-band decomposition and convolution operations in the wavelet domain. Result: Experiments show that CWSSNet achieves a mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) of 74.50%, 82.73%, and 84.94%, respectively, in Yugan County. Conclusion: CWSSNet maintains reliable performance under small-sample training conditions and obtains the highest Intersection over Union (IoU) for water bodies, vegetation, and bare land, demonstrating good robustness. Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. 
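For readers unfamiliar with the three reported metrics, the sketch below shows one common way to derive mIoU, mAcc, and mF1 from a confusion matrix. It assumes mAcc means the mean of per-class accuracy (recall) and that every class occurs at least once; the paper may define these differently, and the function name and toy matrix are illustrative only.

```python
import numpy as np


def segmentation_metrics(conf):
    """Per-class and mean metrics from a confusion matrix.

    conf[i, j] = number of pixels of true class i predicted as class j.
    Assumes every class has at least one true pixel (no zero division).
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp        # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp        # belonging to the class, but missed
    iou = tp / (tp + fp + fn)
    acc = tp / (tp + fn)              # per-class accuracy (recall)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "IoU": iou,
        "mIoU": iou.mean(),
        "mAcc": acc.mean(),
        "mF1": f1.mean(),
    }


# Toy two-class confusion matrix, not data from the paper.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

The per-class `IoU` array is what a statement such as "highest IoU for water bodies" would be read from, while the means summarize all classes with equal weight.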
Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
[70] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li,Xiaoxiao Wang,Meiling Li,Boming Miao,Peng Sun,Yunjian Zhang,Xiangyang Ji,Yao Zhu
Main category: cs.CV
TL;DR: The paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of AI-generated image detection models under complex real-world conditions and highlights the importance of incorporating human adaptability in developing robust detection algorithms.
Details
Motivation: The motivation of the paper is to address the research gap in evaluating AI-generated image detection methods under complex real-world conditions, given the rapid advancement of generative models and the challenges they pose to digital security and media credibility. Method: The paper introduces the Real-World Robustness Dataset (RRDataset) to evaluate detection models across three dimensions: Scenario Generalization, Internet Transmission Robustness, and Re-digitization Robustness. It benchmarks 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducts a large-scale human study involving 192 participants. Result: The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and highlight the importance of drawing on human adaptability to develop more robust detection algorithms. The paper also presents RRDataset, which encompasses high-quality images from seven major scenarios and assesses detector performance on images that have undergone multiple rounds of sharing across various social media platforms and through four distinct re-digitization methods. Conclusion: The paper concludes that current AI-generated image detection methods have limitations under real-world conditions and highlights the importance of incorporating human adaptability to develop more robust detection algorithms. Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. 
This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and Everyday Life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.
[71] Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Jiasheng Guo,Xin Gao,Yuxiang Yan,Guanghao Li,Jian Pu
Main category: cs.CV
TL;DR: Dark-ISP improves low-light object detection by directly processing RAW images through a lightweight, self-adaptive ISP pipeline, achieving superior results with minimal complexity.
Details
Motivation: Existing methods suffer from information loss or complex frameworks when processing RAW images for low-light object detection. Dark-ISP aims to address these issues with a lightweight and adaptive solution. Method: Dark-ISP deconstructs traditional ISP pipelines into differentiable components and introduces a Self-Boost mechanism for sub-module cooperation, enabling end-to-end training optimized for object detection. Result: Dark-ISP outperforms state-of-the-art RGB- and RAW-based detection approaches with minimal parameters on three RAW image datasets in challenging low-light environments. Conclusion: Dark-ISP demonstrates superior performance in low-light object detection by efficiently processing RAW images with a lightweight and self-adaptive ISP plugin. Abstract: Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. 
Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
[72] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results
Hanwei Zhu,Haoning Wu,Zicheng Zhang,Lingyu Zhu,Yixuan Li,Peilin Chen,Shiqi Wang,Chris Wei Zhou,Linhan Cao,Wei Sun,Xiangyang Zhu,Weixia Zhang,Yucheng Zhu,Jing Liu,Dandan Zhu,Guangtao Zhai,Xiongkuo Min,Zhichao Zhang,Xinyue Li,Shubo Xu,Anh Dao,Yifan Li,Hongyuan Yu,Jiaojiao Yi,Yiding Tian,Yupeng Wu,Feiran Sun,Lijuan Liao,Song Jiang
Main category: cs.CV
TL;DR: The VQualA 2025 Challenge introduces a new benchmark for evaluating visual quality comparison abilities in Large Multimodal Models, highlighting progress in open-domain reasoning and setting the stage for future research in interpretable quality assessment systems.
Details
Motivation: To evaluate and enhance the ability of state-of-the-art Large Multimodal Models (LMMs) to perform open-ended and detailed reasoning about visual quality differences across multiple images. Method: The challenge introduces a novel benchmark with thousands of coarse-to-fine grained visual quality comparison tasks, including single images, pairs, and multi-image groups, evaluated through 2AFC-based binary preference and multi-choice questions (MCQs). Result: Around 100 participants submitted entries, and five models demonstrated the emerging capabilities of instruction-tuned LMMs in visual quality assessment. Conclusion: The VQualA 2025 Challenge represents a significant step toward advancing open-domain visual quality reasoning and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems. Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. 
This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
[73] MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
Ge Sun,Jun Ma
Main category: cs.CV
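TL;DR: This paper proposes MGTraj, a model that combines multi-granularity trajectory modeling with goal guidance to improve the accuracy of human trajectory prediction.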
Details
Motivation: Current goal-guided trajectory prediction methods leave a gap between coarse-grained goal prediction and fine-grained trajectory generation, and the potential of intermediate temporal granularities has not been fully explored. Method: MGTraj recursively encodes trajectory proposals from coarse to fine granularity and uses a transformer-based recursive refinement network (RRN) to capture features and predict refinements; it integrates features across granularities with a weight-sharing strategy and uses velocity prediction as an auxiliary task. Result: Experiments on ETH/UCY and the Stanford Drone Dataset show that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods. Conclusion: Through multi-granularity trajectory modeling and a recursive refinement network, MGTraj performs strongly in goal-guided human trajectory prediction and outperforms existing methods. Abstract: Accurate human trajectory prediction is crucial for robotics navigation and autonomous driving. Recent research has demonstrated that incorporating goal guidance significantly enhances prediction accuracy by reducing uncertainty and leveraging prior knowledge. Most goal-guided approaches decouple the prediction task into two stages: goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities: coarse-grained goal prediction forecasts the overall intention, while fine-grained trajectory completion needs to generate the positions for all future timesteps. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. While prior work has shown that multi-granularity representations capture diverse scales of human dynamics and motion patterns, effectively integrating this concept into goal-guided frameworks remains challenging. In this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for human Trajectory prediction. MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels. At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements. Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task to further enhance performance. 
Comprehensive experimental results on the ETH/UCY and Stanford Drone datasets indicate that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods.
[74] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Zhengzhao Lai,Youbin Zheng,Zhenyang Cai,Haonan Lyu,Jinpu Yang,Hongqing Liang,Yan Hu,Benyou Wang
Main category: cs.CV
TL;DR: The paper introduces MatCha, a benchmark for evaluating MLLMs' ability to understand materials characterization images, highlighting their current limitations compared to human experts.
Details
Motivation: The motivation is to bridge the gap between the promise of MLLMs in materials science and their underexplored capacity to understand real-world characterization imaging data. Method: The researchers introduced MatCha, a benchmark for materials characterization image understanding, and evaluated state-of-the-art MLLMs against expert-level human performance on a range of tasks. Result: Evaluations revealed a significant performance gap between current MLLMs and human experts, particularly in tasks requiring higher-level expertise and advanced visual perception. Conclusion: The study concludes that existing MLLMs have limited adaptability to real-world materials characterization scenarios, highlighting the need for further research and development in this area. Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. 
These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.[75] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Jiesi Hu,Jianfeng Cao,Yanwu Yang,Chenfei Ye,Yixuan Zhang,Hanyang Peng,Ting Ma
Main category: cs.CV
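TL;DR: Medverse is a universal in-context learning model trained across diverse medical imaging tasks and anatomical regions; it achieves high-fidelity predictions with global anatomical understanding and performs strongly on multiple datasets.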
Details
Motivation: Current ICL models for medical imaging cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and no unified model has been trained across diverse medical imaging tasks and anatomical regions. Method: Medverse adopts a next-scale autoregressive in-context learning framework and proposes a blockwise cross-attention module that facilitates long-range interactions between context and target inputs. Result: Medverse was evaluated on a broad collection of held-out datasets, where it substantially outperformed existing ICL baselines. Conclusion: Medverse is a universal ICL model that offers a new ICL paradigm for 3D medical imaging and performs strongly across clinical centers, organs, species, and imaging modalities. Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available.
Our model is publicly available at https://github.com/jiesihu/Medverse.[76] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang,Aldrich Yu,Chengqi Duan,Linjiang Huang,Shuai Bai,Yuxuan Cai,Kun Wang,Si Liu,Xihui Liu,Hongsheng Li
Main category: cs.CV
TL;DR: The paper introduces FLUX-Reason-6M, a large-scale dataset, and PRISM-Bench, a benchmark, to advance open-source text-to-image generation models.
Details
Motivation: To bridge the performance gap between open-source and closed-source text-to-image models due to the lack of large-scale datasets and evaluation benchmarks. Method: Creation of a large-scale dataset (FLUX-Reason-6M) and a comprehensive benchmark (PRISM-Bench) for evaluating text-to-image generation models. Result: The dataset includes 6 million high-quality images and 20 million bilingual descriptions, while the benchmark offers 7 evaluation tracks, revealing critical performance gaps in current models. Conclusion: The introduction of FLUX-Reason-6M and PRISM-Bench provides a significant resource and evaluation standard for the improvement of open-source text-to-image models. Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The full data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics.
Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .[77] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification
Mustafa Yurdakul,Sakir Tasdemir
Main category: cs.CV
TL;DR: CoAtNeXt, a novel hybrid model for gastric tissue image classification, achieves high performance and surpasses existing models, showing potential to improve diagnostic accuracy and efficiency for pathologists.
Details
Motivation: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes, but histopathologic examination is labor-intensive, prone to variability, and lacks standard procedures, highlighting the need for automated, reliable, and efficient analysis methods. Method: A novel hybrid model named CoAtNeXt was proposed, built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks and integrating the Convolutional Block Attention Module (CBAM). The model was evaluated on HMU-GC-HE-30K and GasHisSDB datasets, and compared against CNNs and ViT models. Result: CoAtNeXt achieved high accuracy, precision, recall, F1 score, and AUC on both HMU-GC-HE-30K and GasHisSDB datasets, outperforming all tested CNN and ViT models and surpassing previous studies in the literature. Conclusion: CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, delivering strong performance on both binary and multiclass classification, and has the potential to assist pathologists by enhancing diagnostic accuracy and reducing workload. Abstract: Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms.
The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, delivering strong performance on both binary and multiclass classification. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.[78] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Jing Hao,Yuxuan Fan,Yanpeng Sun,Kaixin Guo,Lizhuo Lin,Jinrong Yang,Qi Yong H. Ai,Lun M. Wong,Hao Tang,Kuo Feng Hung
Main category: cs.CV
TL;DR: The paper introduces MMOral, a large-scale multimodal dataset and benchmark for panoramic X-ray interpretation in dentistry, and proposes OralGPT, which significantly improves model performance through supervised fine-tuning.
Details
Motivation: Recent advances in large vision-language models (LVLMs) have shown strong performance on general medical tasks, but their effectiveness in specialized domains like dentistry remains underexplored. Panoramic X-rays present interpretative challenges due to dense anatomical structures and subtle pathological cues not captured by existing datasets. Method: The authors introduce MMOral, a large-scale multimodal instruction dataset, and MMOral-Bench, a comprehensive evaluation suite. They evaluate 64 LVLMs on MMOral-Bench and propose OralGPT, which conducts supervised fine-tuning on Qwen2.5-VL-7B using the MMOral dataset. Result: The evaluation of 64 LVLMs on MMOral-Bench revealed that even the best-performing model, GPT-4o, only achieved 41.45% accuracy. OralGPT showed a 24.73% improvement in performance after a single epoch of supervised fine-tuning. Conclusion: Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. 
In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.[79] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Chao Yuan,Yang Yang,Yehui Yang,Zach Cheng
Main category: cs.CV
TL;DR: This paper proposes DATE, a method to enhance temporal understanding in MLLMs for long videos, achieving top performance on challenging benchmarks.
Details
Motivation: Long video understanding is challenging for MLLMs due to limitations in handling long-range dependencies and precise temporal reasoning. Existing methods suffer from information loss and degraded temporal comprehension. Method: Dynamic Absolute Time Enhancement (DATE) is introduced, combining Timestamp Injection Mechanism (TIM) and Temporal-Aware Similarity Sampling (TASS). TIM constructs a continuous temporal reference by interleaving video frame embeddings with textual timestamp tokens, while TASS reformulates video sampling as a vision-language retrieval task using a two-stage algorithm. Result: The method achieves remarkable improvements in absolute time understanding and key event localization, setting state-of-the-art performance among 7B and 72B models on long video benchmarks. Conclusion: The proposed DATE approach significantly improves temporal awareness in MLLMs for long video understanding, achieving state-of-the-art performance on hour-long video benchmarks, with the 7B model outperforming many 72B models. Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. 
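The timestamp interleaving described in the preceding sentence can be pictured with a small sketch. This is not the authors' implementation: the `<HH:MM:SS>` token format, the function names, and the use of plain lists as stand-ins for frame embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence, Union

# Stand-in for a frame embedding; a real model would use tensors.
Frame = Sequence[float]


def format_timestamp(seconds: float) -> str:
    """Render a time offset as a textual token (hypothetical format)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"<{hours:02d}:{minutes:02d}:{secs:02d}>"


def interleave_timestamps(
    frames: Sequence[Frame], times: Sequence[float]
) -> List[Union[str, Frame]]:
    """Place a timestamp token before each frame embedding.

    The result alternates token, frame, token, frame, ... so that every
    frame carries an explicit absolute-time reference.
    """
    if len(frames) != len(times):
        raise ValueError("frames and times must have the same length")
    sequence: List[Union[str, Frame]] = []
    for frame, t in zip(frames, times):
        sequence.append(format_timestamp(t))
        sequence.append(frame)
    return sequence
```

In a real MLLM the timestamp strings would be tokenized and embedded alongside the visual tokens; the sketch only shows the ordering.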
We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision feature, and sampling key events with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements in absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.[80] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
Linhao Li,Yiwen Ye,Ziyang Chen,Yong Xia
Main category: cs.CV
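TL;DR: PSP-Seg is an efficient 3D medical image segmentation framework that optimizes resource usage through dynamic pruning and a functional decoupling loss while maintaining high performance.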
Details
Motivation: Existing 3D medical image segmentation models are typically static and manually designed, which limits their adaptability across tasks and makes it hard to balance performance with resource efficiency. Method: PSP-Seg, a progressive pruning framework, is proposed; it achieves dynamic and efficient 3D segmentation through block-wise pruning and a functional decoupling loss. Result: The lightweight variant, PSP-Seg-S, matches nnU-Net on all datasets while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%. Conclusion: PSP-Seg is an efficient 3D segmentation framework with potential for widespread clinical application. Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg's potential as a cost-effective yet high-performing alternative for widespread clinical application.[81] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
Bohao Tang,Yan Ma,Fei Zhang,Jiadi Su,Ethan Chern,Zhulin Hu,Zhixin Wang,Pengfei Liu,Ya Zhang
Main category: cs.CV
TL;DR: This paper introduces an adaptive reasoning framework for Vision-Language Models in chart understanding, combining Code-as-Thought and Visual Programmability to dynamically choose between code-based and visual analysis for improved accuracy and flexibility.
Details
Motivation: Prior approaches to chart understanding face limitations, either relying on external tools or adopting a single reasoning strategy like text-based chain-of-thought. These methods lack adaptability and verifiability in reasoning steps, necessitating a more flexible and accurate solution. Method: The authors propose a Code-as-Thought (CaT) approach combined with Visual Programmability, enabling models to dynamically select between code-based reasoning and direct visual analysis. The selection policy is trained using reinforcement learning with a dual-reward system that emphasizes factual accuracy and strategic decision-making. Result: Experiments show robust and strong performance across diverse chart-understanding benchmarks, demonstrating the effectiveness of the adaptive reasoning framework in teaching Vision-Language Models not only what to reason but also how to reason. Conclusion: The proposed Code-as-Thought approach with Visual Programmability enhances the reasoning capabilities of Vision-Language Models in chart understanding by adaptively choosing between code-based and visual analysis pathways. Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. 
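The dual-reward idea summarized in the Method above (one term for factual accuracy, one for choosing the right pathway) might be sketched as follows. The entry does not give the reward formulas, so the relative-tolerance matching, the equal weights, and every function and parameter name here are hypothetical.

```python
from typing import Dict, Optional


def data_accuracy_reward(
    extracted: Dict[str, float], reference: Dict[str, float], tol: float = 0.05
) -> float:
    """Fraction of reference values reproduced within a relative tolerance."""
    if not reference:
        return 0.0
    hits = 0
    for key, ref in reference.items():
        value = extracted.get(key)
        if value is not None and abs(value - ref) <= tol * max(abs(ref), 1e-9):
            hits += 1
    return hits / len(reference)


def decision_reward(chosen: str, preferred: str) -> float:
    """1.0 when the model picked the pathway judged better for this chart."""
    return 1.0 if chosen == preferred else 0.0


def dual_reward(
    extracted: Optional[Dict[str, float]],
    reference: Dict[str, float],
    chosen: str,
    preferred: str,
    w_data: float = 0.5,       # assumed weight, not from the paper
    w_decision: float = 0.5,   # assumed weight, not from the paper
) -> float:
    """Weighted sum of the two terms; the data term is 0 when no data was extracted."""
    data = data_accuracy_reward(extracted, reference) if extracted is not None else 0.0
    return w_data * data + w_decision * decision_reward(chosen, preferred)
```

Keeping the two terms separate is what lets a training signal distinguish "picked the wrong strategy" from "picked the right strategy but misread the numbers".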
This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.[82] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training
Anthony P. Addison,Felix Wagner,Wentian Xu,Natalie Voets,Konstantinos Kamnitsas
Main category: cs.CV
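TL;DR: This paper proposes a modified U-Net architecture and a data augmentation method that enable segmentation models to handle MRI modalities unseen during training, improving the flexibility and performance of brain MRI segmentation.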
Details
Motivation: Most multimodal brain MRI segmentation models are restricted to fixed modalities and cannot effectively process new ones. This work aims to develop a more flexible and practical model that handles unseen modalities and mixed combinations of modalities. Method: The U-Net architecture is modified to integrate a modality-agnostic input channel alongside modality-specific input channels, and a data augmentation method is developed to train the modality-agnostic component. Result: Experiments on 8 MRI databases show that the method not only processes modalities encountered during training effectively but also handles new, unseen modalities and improves segmentation performance. Conclusion: By introducing a modality-agnostic input channel and an augmentation scheme that synthesizes artificial MRI modalities, the work yields a segmentation model that can perform inference effectively on modalities unseen during training. Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR).
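One way to picture the combination of modality-specific and modality-agnostic input channels is as an input-routing step placed before the network. The sketch below is an illustration only: the set of "known" modalities, the zero-filling of missing channels, and the averaging of several unseen modalities into one agnostic channel are assumptions, not details taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical set of modalities that have dedicated input channels.
KNOWN_MODALITIES = ("T1", "T2", "FLAIR")


def build_input_channels(
    scans: Dict[str, List[float]],
    known: Sequence[str] = KNOWN_MODALITIES,
) -> List[List[float]]:
    """Route each scan to a dedicated channel, or to the shared agnostic channel.

    Images are flattened lists of intensities here; missing known modalities
    are zero-filled, and all unseen modalities are averaged into the final
    (modality-agnostic) channel.
    """
    if not scans:
        raise ValueError("at least one scan is required")
    size = len(next(iter(scans.values())))
    zeros = [0.0] * size

    channels = [list(scans.get(name, zeros)) for name in known]

    unseen = [img for name, img in scans.items() if name not in known]
    if unseen:
        agnostic = [sum(voxels) / len(unseen) for voxels in zip(*unseen)]
    else:
        agnostic = list(zeros)
    channels.append(agnostic)
    return channels
```

With this layout the number of input channels stays fixed regardless of which modalities a site can acquire, which is what makes inference on unseen contrasts possible without changing the architecture.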
The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG[83] Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
Spyridon Loukovitis,Anastasios Arsenos,Vasileios Karampinis,Athanasios Voulodimos
Main category: cs.CV
TL;DR: This paper introduces a novel open-set detection framework for UAVs that improves robustness and detection accuracy in real-world air-to-air scenarios, outperforming existing methods by up to 10% in AUROC.
Details
Motivation: Traditional closed-set detectors significantly degrade under domain shifts and flight data corruption, posing risks to safety-critical UAV applications. This work addresses these limitations by introducing a robust, model-agnostic open-set detection method tailored for air-to-air object detection. Method: The framework estimates semantic uncertainty via entropy modeling in the embedding space and uses spectral normalization and temperature scaling to improve open-set discrimination. It is validated on the AOT aerial benchmark and through real-world flight tests. Result: Comprehensive ablation studies show consistent improvements over baseline methods, with up to a 10% relative AUROC gain compared to standard YOLO-based detectors. The method also demonstrates enhanced robustness against corrupted flight data. Conclusion: The proposed open-set detection framework enhances UAV perception robustness in dynamic air-to-air environments, outperforming baseline methods by up to 10% in AUROC and maintaining detection accuracy while improving unknown object and background rejection. Abstract: Open-set detection is crucial for robust UAV autonomy in air-to-air object detection under real-world conditions. Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing risks to safety-critical applications. We propose a novel, model-agnostic open-set detection framework designed specifically for embedding-based detectors. The method explicitly handles unknown object rejection while maintaining robustness against corrupted flight data. It estimates semantic uncertainty via entropy modeling in the embedding space and incorporates spectral normalization and temperature scaling to enhance open-set discrimination. We validate our approach on the challenging AOT aerial benchmark and through extensive real-world flight tests. 
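A minimal sketch of entropy-based rejection with temperature scaling, in the spirit of the uncertainty estimate described above, is given below. The paper's actual entropy model in embedding space and its spectral normalization are not reproduced; the temperature, the threshold, and the function names are assumed values chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_unknown(
    logits: Sequence[float], temperature: float = 2.0, threshold: float = 0.5
) -> bool:
    """Flag a detection as unknown when its normalized entropy exceeds the threshold."""
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    probs = softmax(logits, temperature)
    normalized = entropy(probs) / math.log(len(probs))  # scaled to [0, 1]
    return normalized > threshold
```

A detection whose class scores are nearly uniform has normalized entropy close to 1 and is rejected, while a confident, peaked prediction has entropy close to 0 and is kept.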
Comprehensive ablation studies demonstrate consistent improvements over baseline methods, achieving up to a 10% relative AUROC gain compared to standard YOLO-based detectors. Additionally, we show that background rejection further strengthens robustness without compromising detection accuracy, making our solution particularly well-suited for reliable UAV perception in dynamic air-to-air environments.[84] Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion
Oh-Tae Jang,Min-Gon Cho,Kyung-Tae Kim
Main category: cs.CV
TL;DR: 本文提出SlotSAR,通过多级特征融合与注意力机制提升SAR图像中目标识别的鲁棒性,无需掩码标注。
Details
Motivation: Background clutter in SAR images, including terrain reflections and speckle noise, can exhibit intensity and patterns similar to targets, leading models to extract entangled or spurious features and weakening target representations. Method: A novel object-centric learning (OCL) framework named SlotSAR is proposed. It combines high-level semantic features extracted from SARATR-X with low-level scattering features extracted by a wavelet scattering network, and introduces a multi-level slot attention module to make slot representations more distinctive. Result: SlotSAR outperforms existing OCL methods at disentangling target representations and suppressing background clutter in SAR images, preserving more structural detail. Conclusion: SlotSAR achieves state-of-the-art performance on SAR imagery, effectively disentangling target representations and suppressing background clutter interference by preserving structural details. Abstract: Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. In many cases, such clutter exhibits intensity and patterns that resemble targets, leading models to extract entangled or spurious features. Such behavior undermines the ability to form clear target representations, regardless of the classifier. To address this challenge, we propose a novel object-centric learning (OCL) framework, named SlotSAR, that disentangles target representations from background clutter in SAR images without mask annotations. SlotSAR first extracts high-level semantic features from SARATR-X and low-level scattering features from the wavelet scattering network in order to obtain complementary multi-level representations for robust target characterization. We further present a multi-level slot attention module that integrates these low- and high-level features to enhance slot-wise representation distinctiveness, enabling effective OCL. Experimental results demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.[85] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Hao Si,Ehsan Javanmardi,Manabu Tsukada
Main category: cs.CV
TL;DR: This paper proposes PHCP, a new framework for heterogeneous collaborative perception that enables dynamic feature alignment during inference without joint training, achieving strong performance with minimal unlabeled data.
Details
Motivation: The study addresses the challenge of heterogeneous models in collaborative perception due to manufacturer variations, aiming to eliminate the impractical joint training requirement of existing methods. Method: Progressive Heterogeneous Collaborative Perception (PHCP) is introduced, which uses few-shot unsupervised domain adaptation to align features dynamically during inference by self-training an adapter. Result: Experiments on the OPV2V dataset show that PHCP performs well across diverse heterogeneous scenarios and achieves performance comparable to state-of-the-art methods using only a small amount of unlabeled data. Conclusion: PHCP proves to be an effective solution for heterogeneous collaborative perception, achieving strong results with minimal unlabeled data and eliminating the need for joint training. Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. 
Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.[86] Image Recognition with Vision and Language Embeddings of VLMs
Illia Volkov,Nikita Kisel,Klara Janouskova,Jiri Matas
Main category: cs.CV
TL;DR: This paper evaluates dual-encoder VLMs for image classification, showing the complementary strengths of vision and language, and proposes a fusion method to enhance accuracy without learning.
Details
Motivation: To explore the purely visual inference capabilities of vision-language models (VLMs) and investigate how vision and language can complement each other in image classification. Method: Comprehensive evaluation of dual-encoder vision-language models (VLMs) including SigLIP 2 and RADIOv2.5 for both language-guided and vision-only image classification. A learning-free fusion method based on per-class precision was introduced. Result: Demonstrated that some classes favor textual prompts while others are better handled through visual similarity, with improved performance achieved via the proposed fusion method. Conclusion: The study concludes that language and vision provide complementary strengths in image classification, and a learning-free fusion method can enhance performance by leveraging per-class precision. Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. 
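A learning-free fusion based on per-class precision could look roughly like the sketch below. The abstract does not spell out the exact rule, so this is one plausible reading: estimate each classifier's precision per predicted class on a labelled reference split, then, when the two classifiers disagree, trust the one whose predicted class has the higher precision. The function names and the tie-break in favour of the text prediction are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_precision(
    preds: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Precision for each predicted class: correct predictions / total predictions."""
    predicted = Counter(preds)
    correct = Counter(p for p, y in zip(preds, labels) if p == y)
    return {cls: correct[cls] / count for cls, count in predicted.items()}


def fuse_predictions(
    text_pred: str,
    vision_pred: str,
    text_precision: Dict[str, float],
    vision_precision: Dict[str, float],
) -> str:
    """Pick the prediction whose class was more precise on the reference split."""
    if text_pred == vision_pred:
        return text_pred
    text_score = text_precision.get(text_pred, 0.0)
    vision_score = vision_precision.get(vision_pred, 0.0)
    # Ties go to the language-guided prediction (an arbitrary choice).
    return text_pred if text_score >= vision_score else vision_pred
```

Because the rule only looks up precomputed statistics, no parameters are trained, which matches the "learning-free" description.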
The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.[87] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Hui Li,Yi You,Qiqi Chen,Bingfeng Zhang,George Q. Huang
Main category: cs.CV
TL;DR: This paper introduces the BUG workflow with LMM to improve clothing design customization through chat and image inputs, making fashion design more accessible and evaluated with a new dataset called FashionEdit.
Details
Motivation: The motivation stems from the limitations of current generative AI models in providing fine-grained customization due to text uncertainty and the lack of professional background knowledge from end-users in the garment industry. Method: The method involves the development of a Better Understanding Generation (BUG) workflow using a Large Multimodal Model (LMM) for automatic creation and customization of clothing designs through chat and image-into-prompt techniques. A new dataset called FashionEdit was also introduced for evaluation. Result: The proposed BUG workflow demonstrates effectiveness in enhancing creativity and reducing barriers in clothing design, validated through the FashionEdit dataset with metrics on generation similarity, user satisfaction, and quality. Conclusion: The BUG workflow with LMM effectively enhances the generation and customization of clothing designs, making the process more accessible to users without professional backgrounds. Abstract: Generative AI is changing how complex workflows are executed in industry, and large multimodal models now support fashion design in the garment industry. Current generative AI models readily turn brainstorming into polished designs, but fine-grained customization still suffers from text uncertainty when end-users lack professional background knowledge. Thus, we propose the Better Understanding Generation (BUG) workflow with LMM to automatically create and finely customize clothing designs from chat with image-into-prompt. Our framework unleashes users' creative potential beyond words and also lowers the barriers of clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated from generation similarity, user satisfaction, and quality.
The code and dataset: https://github.com/detectiveli/FashionEdit.
[88] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
Dimitrios Anastasiou,Razvan Caramalau,Nazir Sirajudeen,Matthew Boal,Philip Edwards,Justin Collins,John Kelly,Ashwin Sridhar,Maxine Tran,Faiz Mumtaz,Nevil Pavithran,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos
Main category: cs.CV
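TL;DR: This paper studies few-shot learning and self-supervised pre-training strategies for surgical skill assessment, finding that small, domain-relevant datasets can outperform large-scale but less relevant ones.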
Details
Motivation: Because skill annotations are scarce and hard to obtain, it is important to develop robust surgical skill assessment models that need only a few labels. Method: Surgical skill assessment is formulated as a few-shot learning task, and the effect of different self-supervised pre-training strategies on downstream performance is explored. Result: Small, domain-relevant datasets achieve accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively, and adding procedure-specific data to pre-training with a domain-relevant external dataset significantly improves performance. Conclusion: For surgical skill assessment, pre-training on small, domain-relevant data can be more effective than pre-training on large-scale but less relevant data. Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation.
Code and models are available at https://github.com/anastadimi/ssa-fsl.
[89] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Ian Nell,Shane Gilroy
Main category: cs.CV
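TL;DR: This study develops a computer-vision-based driver behavior classification system that can detect distracted and impaired driving in non-connected vehicles.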
Details
Motivation: Road traffic accidents remain a global concern, and human error, particularly distracted and impaired driving, is among the leading causes. A system that can effectively detect such driving behavior is therefore needed to improve road safety. Method: The study uses computer vision methods, including real-time object tracking, lateral displacement analysis, and lane position monitoring, combined with the YOLO object detection model and custom lane estimation algorithms, to identify distracted and impaired driving. Result: Experimental evaluation shows that the framework is reliable and adaptable across varying road and environmental conditions and can identify unsafe driving behaviors. Conclusion: The study presents a driver behavior classification system that detects unsafe driving in non-connected vehicles through external observation, with good reliability and adaptability. Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.
[90] Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors
Xiaodong Wang,Zijun He,Xin Yuan
Main category: cs.CV
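TL;DR: This paper proposes a new method for handling severe lighting and rich textures in intrinsic image decomposition, designing a texture-guided regularization term and formulating the problem as an optimization framework that separates material textures from lighting effects.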
Details
Motivation: Although intrinsic image decomposition has been studied for decades, it remains a significant challenge in complex scenes, such as those with spatially varying lighting effects and rich textures. Method: A texture-guided regularization term is designed, and the decomposition problem is formulated as an optimization framework to separate material textures from lighting effects. Result: The method produces high-quality intrinsic images for real-world images and handles severe lighting and rich textures. Conclusion: Incorporating the new texture-aware prior yields results superior to existing approaches. Abstract: This paper aims to recover the intrinsic reflectance layer and shading layer given a single image. Though this intrinsic image decomposition problem has been studied for decades, it remains a significant challenge in cases of complex scenes, i.e. spatially-varying lighting effect and rich textures. In this paper, we propose a novel method for handling severe lighting and rich textures in intrinsic image decomposition, which makes it possible to produce high-quality intrinsic images for real-world images. Specifically, we observe that previous learning-based methods tend to produce texture-less and over-smoothed intrinsic images, which can be used to infer the lighting and texture information given an RGB image. In this way, we design a texture-guided regularization term and formulate the decomposition problem into an optimization framework, to separate the material textures and lighting effect. We demonstrate that combining the novel texture-aware prior can produce superior results to existing approaches.
[91] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection
Xiaodong Wang,Ping Wang,Zhangyuan Li,Xin Yuan
Main category: cs.CV
TL;DR: This paper presents a hybrid data-consistency module that integrates learned priors with physical forward models for improved single-pixel imaging reconstruction.
Details
Motivation: The motivation is to explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, particularly in single-pixel imaging. Method: The paper proposes a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms, applied directly to the denoised estimate. Result: Experimental results on single-pixel imaging tasks demonstrate that the method achieves better reconstruction quality. Conclusion: The paper concludes that the proposed hybrid data-consistency module improves measurement consistency without disrupting the diffusion sampling trajectory, leading to better reconstruction quality in single-pixel imaging tasks. Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
[92] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
Pengxu Wen,Tingting Yu,Ziwei Nie,Cheng Jiang,Zhenyu Yin,Mingyang He,Bo Liao,Xiaoping Yang
Main category: cs.CV
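TL;DR: This study proposes a fully automatic ICP grading framework that combines keyframe identification, ONSD measurement, and clinical data fusion, substantially improving the accuracy and reliability of ICP assessment.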
Details
Motivation: Current clinical ONSD measurement suffers from inconsistent manual operation, subjective selection of the optimal view, and variable thresholding, which limit its reliability; a more accurate and objective assessment method is therefore needed. Method: A fully automatic two-stage framework for ICP grading is introduced, comprising a fundus ultrasound video processing stage and an intracranial pressure grading stage, and combining keyframe identification, ONSD measurement, and clinical data. Result: The method achieves a validation accuracy of 0.845 ± 0.071 and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method (0.637 ± 0.111 validation accuracy, 0.429 test accuracy). Conclusion: By reducing operator variability and integrating multi-source information, the study establishes a reliable non-invasive approach for clinical ICP evaluation that may improve the management of patients with acute neurological conditions. Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy).
By effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.
[93] Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality
Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu
Main category: cs.CV
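TL;DR: This paper proposes a new unsupervised integrated-circuit defect segmentation method that detects defects using the normal features of the test image itself, removing the dependence on external normal images and improving segmentation quality and robustness to product variability.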
Details
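Motivation: Modern integrated-circuit manufacturing introduces diverse, fine-grained defects that reduce yield and reliability. Existing industrial defect segmentation methods usually detect defects by comparing a test image against an external set of normal images, which is brittle for IC imagery where layouts vary across products and precise alignment is difficult. The authors observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. Method: An unsupervised IC defect segmentation framework that requires no external normal support is proposed. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content, and the reconstruction residual is used to segment defects. Pseudo-anomaly augmentation further stabilizes training. Result: An IC defect segmentation framework that needs no external normal images is presented; it segments defects through the reconstruction residual, using normal features extracted from the image itself. Conclusion: Experiments on datasets from three IC process stages show consistent improvements over existing methods and strong robustness to product variability. Abstract: Modern Integrated-Circuit (IC) manufacturing introduces diverse, fine-grained defects that depress yield and reliability. Most industrial defect segmentation compares a test image against an external normal set, a strategy that is brittle for IC imagery where layouts vary across products and accurate alignment is difficult. We observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. We therefore propose an unsupervised IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content; the reconstruction residual then segments defects. Pseudo-anomaly augmentation further stabilizes training. Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.
[94] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift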
Umaima Rahman,Raza Imam,Mohammad Yaqub,Dwarikanath Mahapatra
Main category: cs.CV
TL;DR: This paper proposes DRiFt, a novel framework for medical vision-language models, which improves performance and robustness by decoupling task-relevant signals from noise, ensuring safer clinical deployment.
Details
Motivation: Medical VLMs face challenges in reliability and generalizability due to variability in imaging protocols and text reports, which can lead to unpredictable behavior under distribution shifts. Method: DRiFt utilizes a structured feature decoupling approach with parameter-efficient tuning (LoRA) and learnable prompt tokens, along with curated high-quality image-text pairs for improved cross-modal alignment. Result: The approach achieves an improvement of +11.4% Top-1 accuracy and +3.3% Macro-F1 over previous methods, while maintaining robustness on unseen datasets. Conclusion: The proposed DRiFt framework enhances the reliability and robustness of medical vision-language models (VLMs) by decoupling clinically relevant signals from task-agnostic noise, offering safer and more trustworthy clinical applications. Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. 
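The parameter-efficient tuning (LoRA) mentioned above can be illustrated with a minimal sketch. This is the generic low-rank adaptation idea only, not DRiFt's implementation: the function names `matvec` and `lora_forward`, the scaling `alpha / r`, and the toy matrices are assumptions made for illustration. The frozen weight `W` is left untouched, and only the small matrices `A` and `B` would be trained.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)  # rank of the update = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Hypothetical toy values: a 2x2 frozen weight and a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]  # B starts at zero, so the adapted layer equals the frozen one
x = [2.0, 3.0]
print(lora_forward(W, A, B_zero, x, alpha=2.0))
```

Starting `B` at zero is the usual LoRA convention: the adapted model initially behaves exactly like the pre-trained one, and the low-rank term only departs from zero as training proceeds.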
These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
[95] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
Yuchan Jie,Yushen Xu,Xiaosong Li,Fuqiang Zhou,Jianming Lv,Huafeng Li
Main category: cs.CV
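TL;DR: This paper proposes FS-Diff, a new joint image fusion and super-resolution method that uses semantic guidance and clarity awareness to address the poor fusion results of existing techniques under low resolution and weak semantic information.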
Details
Motivation: Existing image fusion techniques are easily degraded by low resolution and weak semantic information in tasks such as military reconnaissance and long-range detection, leading to poor fusion results; FS-Diff is proposed to address this. Method: FS-Diff unifies image fusion and super-resolution as a conditional generation problem, introduces the bidirectional feature Mamba to extract global features of the multimodal images, and uses a random iterative denoising process based on a modified U-Net that is trained for denoising at multiple noise levels. Result: Extensive joint image fusion and super-resolution experiments on several public datasets and the newly constructed AVMS dataset show that FS-Diff outperforms existing techniques at multiple magnifications and recovers richer details and semantic information. Conclusion: FS-Diff is a semantic-guidance and clarity-aware joint image fusion and super-resolution method; experiments on several public datasets and the AVMS dataset show that it outperforms existing techniques and recovers more details and semantic information in the fused images. Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images.
Extensive joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset demonstrate that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
[96] Semantic Concentration for Self-Supervised Dense Representations Learning
Peisong Wen,Qianqian Xu,Siran Dai,Runmin Cong,Qingming Huang
Main category: cs.CV
TL;DR: This paper introduces a novel approach for dense self-supervised learning that addresses over-dispersion by explicitly concentrating semantic information through distillation, robust loss functions, and object-aware filtering.
Details
Motivation: The motivation stems from the challenge of over-dispersion in dense representation learning, where patches from the same instance or category scatter, which hampers performance. The authors aim to explicitly concentrate semantic information in dense SSL, as existing approaches are spatially sensitive and unsuitable for complex scenes. Method: The paper introduces three key components: distillation of patch correspondences to break strict spatial alignment, a noise-tolerant ranking loss to handle noisy pseudo labels, and an object-aware filter to map output space to an object-based space using cross-attention. Result: The proposed method demonstrates effectiveness through empirical studies across various tasks, showing improved performance in dense self-supervised learning scenarios. Conclusion: The paper proposes a method to address over-dispersion in dense self-supervised learning by explicitly concentrating semantic information through distillation of patch correspondences, a noise-tolerant ranking loss, and an object-aware filter. Abstract: Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. 
First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Finally, empirical studies across various tasks support the effectiveness of our method. Code is available at https://github.com/KID-7391/CoTAP.
[97] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model
Yushen Xu,Xiaosong Li,Yuchun Wang,Xiaoqi Cheng,Huafeng Li,Haishu Tan
Main category: cs.CV
TL;DR: FlexiD-Fuse is a diffusion-based image fusion network that can handle varying input modalities, providing high-quality fused images for clinical diagnosis.
Details
Motivation: Existing fusion methods can only handle a fixed number of modality inputs, which hinders their application in clinical settings. Method: FlexiD-Fuse is a diffusion-based image fusion network that incorporates the Expectation-Maximization algorithm into the diffusion sampling iterations. Result: FlexiD-Fuse generates high-quality fused images from a varying number of input images and achieves the best performance across nine popular metrics. Conclusion: FlexiD-Fuse is an effective method for image fusion with a varying number of input modalities. Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end-to-end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images.
We compared our method with the latest two- and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.
[98] Resource-Efficient Glioma Segmentation on Sub-Saharan MRI
Freedmore Sidume,Oumayma Soula,Joseph Muthui Wacira,YunFei Zhu,Abbas Rabiu Muhammad,Abderrazek Zeraii,Oluwaseun Kalejaye,Hajer Ibrahim,Olfa Gaddour,Brain Halubanza,Dong Zhang,Udunna C Anazodo,Confidence Raymond
Main category: cs.CV
TL;DR: This study presents an efficient deep learning model for glioma segmentation on limited-quality MRI data in Sub-Saharan Africa, achieving strong performance and demonstrating practical potential for clinical use in low-resource settings.
Details
Motivation: Accurate glioma segmentation from MRI is essential for diagnosis and treatment, but the lack of high-quality annotated data in Sub-Saharan Africa makes it challenging to deploy advanced models. This work aims to address that gap. Method: The study employed a 3D Attention UNet architecture with residual blocks and used transfer learning from pre-trained weights on the BraTS 2021 dataset. The model was evaluated on 95 MRI cases from the BraTS-Africa dataset. Result: Despite limited data, the model achieved Dice scores of 0.76 for Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH), with a compact architecture and fast inference time. Conclusion: This study successfully developed a robust and efficient deep learning framework for glioma segmentation in resource-constrained settings, particularly in Sub-Saharan Africa, contributing to equitable AI in global health. Abstract: Gliomas are the most prevalent type of primary brain tumors, and their accurate segmentation from MRI is critical for diagnosis, treatment planning, and longitudinal monitoring. However, the scarcity of high-quality annotated imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for deploying advanced segmentation models in clinical workflows. This study introduces a robust and computationally efficient deep learning framework tailored for resource-constrained settings. We leveraged a 3D Attention UNet architecture augmented with residual blocks and enhanced through transfer learning from pre-trained weights on the BraTS 2021 dataset. Our model was evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma segmentation in SSA MRI data. Despite the limited data quality and quantity, our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH). 
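For readers unfamiliar with the Dice scores reported above, the following is a minimal sketch of how the metric is computed for one binary mask. It is a generic illustration, not the authors' evaluation code: the function name `dice_score`, the flat 0/1 mask representation, and the toy masks are assumptions, and returning 1.0 for two empty masks is one common convention among several.

```python
def dice_score(pred, truth):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (for example, a flattened 3D segmentation volume).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return 2.0 * intersection / total


# Hypothetical toy masks: 4 predicted voxels, 4 true voxels, 3 in common.
pred = [0, 1, 1, 1, 0, 0, 1, 0]
truth = [0, 1, 1, 0, 0, 1, 1, 0]
print(dice_score(pred, truth))  # 2 * 3 / (4 + 4) = 0.75
```

In a multi-class setting such as ET, NETC, and SNFH, the same function would be applied once per label after converting the label map into one binary mask per class.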
These results demonstrate the generalizability of the proposed model and its potential to support clinical decision making in low-resource settings. The compact architecture, approximately 90 MB, and sub-minute per-volume inference time on consumer-grade hardware further underscore its practicality for deployment in SSA health systems. This work contributes toward closing the gap in equitable AI for global health by empowering underserved regions with high-performing and accessible medical imaging solutions.
[99] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
Victor Livernoche,Akshatha Arodi,Andreea Musulan,Zachary Yang,Adam Salvail,Gaétan Marceau Caron,Jean-François Godbout,Reihaneh Rabbany
Main category: cs.CV
TL;DR: This paper addresses the growing threat of deepfakes in misinformation by introducing a comprehensive dataset and a crowdsourced platform to improve detection techniques against modern synthetic images.
Details
Motivation: The motivation is to address the growing threat of deepfakes in spreading misinformation, particularly in politically sensitive contexts, by improving detection techniques and providing a realistic, comprehensive dataset. Method: The study analyzes social media posts to identify modalities of deepfake propagation, conducts a human perception study, and introduces a crowdsourced adversarial platform to generate challenging synthetic images. It also presents a new dataset with real and synthetic images for benchmarking. Result: The paper presents a dataset of three million real images and 963k synthetic images generated using modern generative models. It also introduces a crowdsourced adversarial platform to continuously challenge and improve deepfake detection methods. Conclusion: The paper concludes that deepfake detection must evolve to combat increasingly sophisticated synthetic images, emphasizing the importance of a comprehensive dataset and a crowdsourced adversarial platform to keep detection methods robust and adaptive. Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. 
This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
[100] Improving Human Motion Plausibility with Body Momentum
Ha Linh Nguyen,Tze Ho Elden Tse,Angela Yao
Main category: cs.CV
TL;DR: The paper proposes a new approach to model human motion by leveraging whole-body linear and angular momentum to link local and global dynamics, resulting in improved motion quality and realism.
Details
Motivation: The motivation is to address the challenge of accurately capturing the physical coupling between local and global dynamics in motion models, which is often ignored or computationally expensive to derive using traditional methods. Method: The method involves introducing a new loss term that enforces consistency between generated momentum profiles and those observed in ground-truth data, based on the insight that momentum provides a physically grounded link between local joint behavior and global displacement. Result: The result is a motion model that better reflects the interaction between local joint dynamics and global movement, reducing artifacts like foot sliding and jitter while improving balance and motion accuracy. Conclusion: The paper concludes that using whole-body linear and angular momentum as a constraint improves the coupling of local and global dynamics in motion models, resulting in reduced foot sliding and jitter, improved balance, and preserved motion accuracy. Abstract: Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body's movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. 
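The momentum-consistency idea summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: each joint is treated as a point mass, velocities come from finite differences, and the function names, the per-joint `masses`, and the squared-error form of the loss are all hypothetical.

```python
def velocities(positions, dt):
    """Finite-difference joint velocities; positions[t][j] is an (x, y, z) tuple."""
    return [
        [tuple((b - a) / dt for a, b in zip(p0, p1)) for p0, p1 in zip(f0, f1)]
        for f0, f1 in zip(positions, positions[1:])
    ]


def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def linear_momentum(positions, masses, dt):
    """Per-frame whole-body linear momentum: sum_j m_j * v_j."""
    return [
        tuple(sum(m * v[k] for m, v in zip(masses, frame)) for k in range(3))
        for frame in velocities(positions, dt)
    ]


def angular_momentum(positions, masses, dt):
    """Per-frame angular momentum about the centre of mass: sum_j m_j * (r_j - c) x v_j."""
    total = sum(masses)
    out = []
    for frame_pos, frame_vel in zip(positions, velocities(positions, dt)):
        com = tuple(sum(m * p[k] for m, p in zip(masses, frame_pos)) / total for k in range(3))
        acc = [0.0, 0.0, 0.0]
        for m, p, v in zip(masses, frame_pos, frame_vel):
            c = cross(tuple(p[k] - com[k] for k in range(3)), v)
            for k in range(3):
                acc[k] += m * c[k]
        out.append(tuple(acc))
    return out


def momentum_loss(pred_positions, gt_positions, masses, dt):
    """Mean squared difference between predicted and ground-truth momentum profiles."""
    terms = []
    for fn in (linear_momentum, angular_momentum):
        for p, g in zip(fn(pred_positions, masses, dt), fn(gt_positions, masses, dt)):
            terms.extend((a - b) ** 2 for a, b in zip(p, g))
    return sum(terms) / len(terms)


# Hypothetical two-joint, two-frame example.
masses = [1.0, 2.0]
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]]
print(linear_momentum(gt, masses, 1.0))
print(momentum_loss(pred, gt, masses, 1.0))
```

A term of this kind would typically be added to a model's existing reconstruction loss with a weighting factor. The point-mass approximation ignores the rotational inertia of individual body segments, so it should be read as a simplification rather than the formulation used in the paper.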
Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. Code and data are available at the project page https://hlinhn.github.io/momentum_bmvc.
[101] Region-Wise Correspondence Prediction between Manga Line Art Images
Yingxuan Li,Jiafeng Mao,Qianru Qiu,Yusuke Matsui
Main category: cs.CV
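TL;DR: This paper introduces a new practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks.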
Details
Motivation: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing that enables downstream applications such as automatic line art colorization and in-between frame generation. However, the task remains underexplored in realistic scenarios, especially without pre-existing segmentation or annotations. Method: Each line art image is divided into patches, and a Transformer-based framework learns patch-level similarities within and across images; edge-aware clustering and a region matching algorithm then convert the patch-level predictions into coherent region-level correspondences. Result: Experiments show high patch-level accuracy (e.g., 96.34%) and consistent region-level correspondences. Conclusion: The paper proposes a Transformer-based framework that predicts region-level correspondence between manga line art images without pre-existing labels or masks, showing potential for real-world manga applications. Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
[102] Generative Diffusion Contrastive Network for Multi-View Clustering
Jian Zhu,Xin Zou,Xi Wang,Ning Zhang,Bian Wu,Yao Yang,Ying Zhou,Lingfang Zeng,Chang Tang,Cheng Luo
Main category: cs.CV
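TL;DR: This paper proposes a new SGDF method and the SGDF-based GDCN model to address low-quality data in multi-view clustering, achieving state-of-the-art results in experiments.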
Details
Motivation: Multi-view fusion suffers from low-quality data, including noisy data and missing data. Method: Building on SGDF, the paper proposes GDCN. Result: Experiments show that GDCN performs strongly on deep MVC tasks. Conclusion: GDCN achieves state-of-the-art results on deep MVC tasks. Abstract: In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.
[103] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Paul F. R. Wilson,Matteo Ronchetti,Rüdiger Göbl,Viktoria Markova,Sebastian Rosenzweig,Raphael Prevost,Parvin Mousavi,Oliver Zettinig
Main category: cs.CV
TL;DR: DualTrack is a new method for 3D ultrasound imaging that improves accuracy by separately modeling local and global features using a dual-encoder architecture.
Details
Motivation: Traditional 3D ultrasound systems are costly and complex, and existing sensorless methods do not adequately model both local and global features for robust 3D reconstruction. Method: DualTrack uses a dual-encoder architecture with separate local and global encoders for feature extraction, followed by a fusion module to estimate the 3D trajectory. Result: DualTrack achieves state-of-the-art accuracy with an average reconstruction error below 5 mm and produces globally consistent 3D reconstructions. Conclusion: DualTrack provides a promising solution for 3D ultrasound imaging with improved accuracy and consistent reconstructions by leveraging decoupled local and global feature extraction. Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. 
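As a rough illustration of how frame-to-frame motion estimates turn into a probe trajectory, the sketch below chains relative rigid transforms into poses expressed in the first frame's coordinate system. It is not DualTrack's code: the function names and the 4x4 homogeneous-matrix convention are assumptions made for this example.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous rigid transform, row-major


def identity() -> Matrix:
    """Return the 4x4 identity transform."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """Build a pure-translation transform (hypothetical helper for the demo)."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m


def accumulate_trajectory(relative: List[Matrix]) -> List[Matrix]:
    """Chain frame-to-frame transforms into poses relative to the first frame."""
    poses = [identity()]
    for rel in relative:
        # Pose of frame t+1 = pose of frame t composed with the relative motion.
        poses.append(matmul(poses[-1], rel))
    return poses


# Three predicted steps of 1 mm along the elevational (z) axis.
poses = accumulate_trajectory([translation(0.0, 0.0, 1.0)] * 3)
final_z_mm = poses[-1][2][3]
```

Because each pose is a product of all earlier estimates, small per-frame errors compound along the sweep, which is one plausible reason a purely local model can drift and why global context is useful.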
Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
[104] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Dohun Lee,Hyeonho Jeong,Jiwook Kim,Duygu Ceylan,Jong Chul Ye
Main category: cs.CV
TL;DR: This paper proposes Align4Gen, which improves video diffusion model generation through multi-feature fusion and alignment.
Details
Motivation: Video diffusion models have advanced through architectural innovations and new training objectives, but their feature representation power still has room for improvement. Method: A new multi-feature fusion and alignment method is proposed, together with an in-depth analysis of various vision encoders. Result: Align4Gen improves results on both unconditional and class-conditional video generation tasks. Conclusion: By fusing features from pre-trained vision encoders and aligning the video diffusion model's intermediate features with them, Align4Gen improves video generation quality. Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/
[105] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu,Dongting Li,Yucheng Zhang,Xiyan Xu,Qi Long,Ziyin Wang,Yunzhi Lu,Shuchang Dong,Hezi Jiang,Akshat Gupta,Yu-Xiong Wang,Liang-Yan Gui
Main category: cs.CV
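TL;DR: InterAct is a large-scale 3D human-object interaction (HOI) benchmark that consolidates and standardizes HOI data from diverse sources and proposes a unified optimization framework to reduce dataset artifacts and correct hand motions, with the aim of advancing research on 3D HOI generation.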
Details
Motivation: Modeling and generating dynamic 3D human-object interactions (HOIs) remains challenging because of limitations in existing datasets, including a lack of high-quality motion and annotations and artifacts such as contact penetration, floating, and incorrect hand motions. Method: First, 21.81 hours of HOI data are consolidated and standardized, and the dataset is enriched with detailed textual annotations. Second, a unified optimization framework reduces artifacts and corrects hand motions; using the principle of contact invariance, motion variations are then introduced, expanding the dataset to 30.70 hours. Finally, six benchmarking tasks are defined and a unified HOI generative modeling perspective is developed. Result: The InterAct dataset is shown to be a foundational resource for advancing 3D HOI generation research, and state-of-the-art performance is achieved on the benchmark tasks. Conclusion: Through innovations in both its dataset and its methods, InterAct addresses the limitations of existing datasets and provides important support for research on 3D human-object interaction generation. Abstract: While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
[106] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification
Akshit Achara,Esther Puyol Anton,Alexander Hammers,Andrew P. King
Main category: cs.CV
TL;DR: This paper shows that deep learning models for Alzheimer's disease diagnosis from MRI scans can exhibit shortcut learning and demographic bias, particularly related to race and sex. Multiple models and datasets were used to confirm these findings, highlighting the need for fairer diagnostic tools.
Details
Motivation: Deep learning algorithms are increasingly used for medical diagnosis, but they can suffer from shortcut learning, where models rely on spurious features, especially those related to protected attributes like race and sex. This can lead to biased outcomes. The motivation for this study is to understand and mitigate such biases in Alzheimer's disease diagnosis using MRI scans. Method: The research utilized deep learning algorithms (ResNet and SwinTransformer) to explore shortcut learning and bias in Alzheimer's disease diagnosis from MRI scans. The study analyzed the impact of training set imbalance on model performance and conducted feature attribution analysis for brain regions. Result: The study found evidence of shortcut learning and performance bias in deep learning models for Alzheimer's disease diagnosis based on MRI scans. These biases were associated with underrepresented demographic groups, specifically race and sex. Feature attribution analysis revealed how different brain regions were used for classification tasks, highlighting the influence of protected attributes. Conclusion: The study demonstrates the existence of shortcut learning and demographic bias in deep learning-based Alzheimer's disease diagnosis using brain MRI scans. The authors emphasize the need for fairer DL diagnostic tools. Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. 
We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR
[107] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
Sijun Dong,Yuxuan Hu,LiBo Wang,Geng Chen,Xiaoliang Meng
Main category: cs.CV
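TL;DR: PeftCD is an efficient change detection framework built on Vision Foundation Models that uses parameter-efficient fine-tuning to address pseudo changes, label scarcity, and cross-domain generalization in remote sensing imagery, and it performs strongly on multiple datasets.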
Details
Motivation: PeftCD is proposed to address the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal, multi-source remote sensing imagery. Method: PeftCD uses a Siamese encoder built on Vision Foundation Models (VFMs), fine-tuned parameter-efficiently with LoRA and Adapter modules, together with a lightweight decoder that keeps the focus on the backbone's feature representations. Result: Experiments show that PeftCD achieves state-of-the-art performance on multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%), and LEVIR-CD (85.62%), with precise boundary delineation and strong suppression of pseudo changes. Conclusion: PeftCD offers a framework that achieves an optimal balance of accuracy, efficiency, and generalization, providing a powerful and scalable paradigm for applying large-scale vision foundation models to remote sensing change detection. Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. 
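To make "training only a minimal set of additional parameters" concrete, here is a minimal sketch of the standard LoRA formulation, in which a frozen weight W is augmented with a trainable low-rank update scaled by alpha / rank. It is a generic illustration, not PeftCD's implementation: the class name, the initialization scale, and the plain-list matrix representation are assumptions.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight: Matrix, rank: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A (rank x d_in) starts with small random values (assumed scale 0.01).
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B (d_out x rank) starts at zero, so the update is initially a no-op.
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: Vector) -> Vector:
        """Compute W x + (alpha / rank) * B (A x)."""
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self) -> int:
        """Count the parameters in A and B, the only ones that would be trained."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


# An 8x8 frozen weight has 64 parameters; a rank-1 adapter trains only 16.
frozen = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(frozen, rank=1, alpha=2.0)
```

The parameter saving grows with layer size: for a d x d weight the adapter trains 2 * rank * d values instead of d * d.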
The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
[108] Visual Grounding from Event Cameras
Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau
Main category: cs.CV
TL;DR: Talk2Event is a new benchmark for language-driven object grounding using event camera data, enabling contextual reasoning in dynamic environments.
Details
Motivation: Event cameras offer advantages for modeling dynamic scenes, but their integration with natural language understanding is underexplored, creating a gap in multimodal perception. Method: The study introduces Talk2Event, a large-scale benchmark for language-driven object grounding using real-world driving scenarios, including annotated objects and referring expressions enriched with structured attributes. Result: Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and over 30,000 referring expressions, capturing spatial, temporal, and relational cues for interpretable and compositional grounding. Conclusion: Talk2Event is expected to be a foundational resource for advancing multimodal and temporally-aware perception with applications in robotics and human-AI interaction. Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes -- appearance, status, relation to the viewer, and relation to surrounding objects -- that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. 
We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
[109] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Yikang Ding,Jiwen Liu,Wenyuan Zhang,Zekun Wang,Wentao Hu,Liyuan Cui,Mingming Lao,Yingchao Shao,Hui Liu,Xiaohan Li,Ming Chen,Xiaoqiang Liu,Yu-Shen Liu,Pengfei Wan
Main category: cs.CV
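TL;DR: Kling-Avatar is a new framework for audio-driven avatar generation that, through multimodal instruction understanding and a parallel generation architecture, achieves a breakthrough in long-duration, high-fidelity video generation.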
Details
Motivation: Existing audio-driven avatar generation methods perform only low-level tracking based on acoustic or visual cues and do not model the communicative intent conveyed by instructions, which harms narrative coherence and character expressiveness. Method: A two-stage cascaded framework is proposed: in the first stage, a multimodal large language model generates a blueprint video; in the second stage, multiple sub-clips are generated in parallel from the blueprint keyframes. Result: Kling-Avatar performs strongly in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization, and can generate vivid, fluent long-duration videos (up to 1080p and 48 fps). Conclusion: By unifying multimodal instruction understanding with photorealistic portrait generation, Kling-Avatar advances audio-driven avatar generation and sets a new benchmark for real-world applications. Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. 
Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
[110] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
Daria Laslo,Efthymios Georgiou,Marius George Linguraru,Andreas Rauschecker,Sabine Muller,Catherine R. Jutzeler,Sarah Bruningk
Main category: cs.CV
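TL;DR: The paper proposes a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans.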
Details
Motivation: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. Method: A hybrid mechanistic learning framework combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize future MRIs. Result: The framework generates realistic follow-up scans as judged by spatial similarity metrics, and introduces tumor growth probability maps that capture the clinically relevant extent and directionality of tumor growth. Conclusion: The method enables biologically informed image generation in data-limited settings, offering generative spatio-temporal predictions that account for mechanistic priors. Abstract: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
[111] Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong,Jiaer Xia,Sifeng Shang,Kaiyang Zhou
Main category: cs.CV
TL;DR: The paper introduces HumbleBench, a new hallucination benchmark for MLLMs that evaluates the model's ability to reject plausible but incorrect answers, ensuring more reliable AI decision-making in safety-critical scenarios.
Details
Motivation: Existing benchmarks focus on recognition accuracy but overlook the critical capability of recognizing when none of the provided answers are correct. This epistemic humility is essential for trustworthy AI, especially in safety-critical applications. Method: The authors created HumbleBench by leveraging a panoptic scene graph dataset to extract ground-truth entities and relations. They used GPT-4-Turbo to generate multiple-choice questions, followed by manual filtering. Each question includes a 'None of the above' option to test epistemic humility. Result: The authors evaluated state-of-the-art MLLMs, including both general-purpose and specialized reasoning models, on HumbleBench, demonstrating its effectiveness in measuring the models' ability to reject incorrect answers. Conclusion: HumbleBench fills a critical gap in current evaluation suites by incorporating explicit false-option rejection, offering a more realistic measure of MLLM reliability in safety-critical settings. Abstract: Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. 
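One way to quantify the rejection ability described here is to report, alongside overall accuracy, the accuracy on just those questions whose correct choice is the rejection option. The sketch below does that; it is not HumbleBench's evaluation code, and the option label `E` and the metric names are assumptions for illustration.

```python
import math
from typing import Dict, List

REJECT = "E"  # hypothetical label for the "none of the provided options" choice


def score(predictions: List[str], answers: List[str]) -> Dict[str, float]:
    """Return overall accuracy and accuracy on questions that require rejection."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal length")
    correct = [p == a for p, a in zip(predictions, answers)]
    # Indices of questions where every listed answer is wrong.
    reject_idx = [i for i, a in enumerate(answers) if a == REJECT]
    overall = sum(correct) / len(answers)
    if reject_idx:
        rejection = sum(correct[i] for i in reject_idx) / len(reject_idx)
    else:
        rejection = math.nan  # undefined when no question requires rejection
    return {"accuracy": overall, "rejection_accuracy": rejection}


# A model that rarely rejects scores worse on the rejection subset than overall.
result = score(["A", "E", "B", "A"], ["A", "E", "E", "E"])
```

Separating the two numbers matters because a model can reach a reasonable overall accuracy while almost never choosing the rejection option.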
Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
[112] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Zhiyuan Yan,Kaiqing Lin,Zongjian Li,Junyan Ye,Hui Han,Zhendong Wang,Hao Liu,Bin Lin,Hao Li,Xue Xu,Xinyan Xiao,Jingdong Wang,Haifeng Wang,Li Yuan
Main category: cs.CV
TL;DR: This paper introduces UAE and Unified-GRPO, a new framework and reinforcement learning approach for unified multimodal learning, achieving high-fidelity image reconstructions through bidirectional information flow between encoder and decoder.
Details
Motivation: The motivation is to enhance the bidirectional information flow between image understanding (encoding) and generation (decoding) processes by using reconstruction fidelity as a unified training objective, thereby achieving mutual gains in performance. Method: The paper proposes UAE, a novel framework for unified multimodal learning, and Unified-GRPO, a reinforcement learning method that includes three stages: a cold-start phase, Generation for Understanding, and Understanding for Generation. The decoder is pre-trained with large-scale long-context image captions to capture detailed semantic and spatial relationships. Result: The result shows that as reinforcement learning progresses, the encoder autonomously produces more descriptive captions, while the decoder demonstrates a strong ability to understand these descriptions, leading to image reconstructions of high fidelity. Conclusion: The paper concludes that through the proposed UAE framework and the Unified-GRPO reinforcement learning approach, there is a significant improvement in the bidirectional information flow between understanding and generation processes, leading to high-fidelity image reconstructions and a more unified multimodal learning model. Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens-understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. 
We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
[113] Geometric Neural Distance Fields for Learning Human Motion Priors
Zhengdi Yu,Simone Foti,Linguang Zhang,Amy Zhao,Cem Keskin,Stefanos Zafeiriou,Tolga Birdal
Main category: cs.CV
TL;DR: Neural Riemannian Motion Fields (NRMF) introduces a geometrically rigorous, 3D generative motion prior that enables realistic and physically plausible human motion recovery across diverse tasks.
Details
Motivation: Existing methods like VAEs or diffusion models lack explicit modeling of higher-order motion dynamics and geometric constraints, leading to less realistic or inconsistent motion recovery. Method: NRMF models human motion using neural distance fields (NDFs) on the product space of joint rotations, velocities, and accelerations, with an adaptive-step hybrid projection algorithm and geometric integrator for motion generation. Result: NRMF achieves significant performance gains, generalizing across tasks like denoising, motion in-betweening, and fitting to partial 2D/3D observations when trained on the AMASS dataset. Conclusion: NRMF offers a robust and versatile solution for 3D human motion recovery, demonstrating superior generalization across tasks and input modalities. Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. 
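For readers unfamiliar with geometric integration on rotations, the sketch below rolls out a single joint's orientation with a generic Lie-group Euler step, R_next = R · exp(dt · ω), so every state stays a valid rotation matrix. This is a textbook scheme shown for intuition and is not claimed to be NRMF's integrator; the function names and the constant-acceleration assumption are illustrative.

```python
import math
from typing import List, Sequence

Matrix3 = List[List[float]]


def matmul3(a: Matrix3, b: Matrix3) -> Matrix3:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_so3(w: Sequence[float]) -> Matrix3:
    """Map a rotation vector to a rotation matrix via Rodrigues' formula."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    k = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]  # skew-symmetric matrix
    k2 = matmul3(k, k)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j]
             for j in range(3)] for i in range(3)]


def rollout(r0: Matrix3, omega: Sequence[float], alpha: Sequence[float],
            dt: float, steps: int) -> List[Matrix3]:
    """Roll out orientations under a constant body-frame angular acceleration."""
    r, w = r0, list(omega)
    trajectory = [r]
    for _ in range(steps):
        r = matmul3(r, exp_so3([c * dt for c in w]))  # update stays on SO(3)
        w = [c + a * dt for c, a in zip(w, alpha)]    # Euler step on the velocity
        trajectory.append(r)
    return trajectory


# Constant spin of pi/2 rad/s about z for one second: a quarter turn.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
traj = rollout(eye, (0.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0), dt=0.1, steps=10)
```

Updating through the exponential map, rather than adding increments to matrix entries, avoids the gradual loss of orthogonality that a naive Euler update on the matrix would introduce.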
Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
[114] Locality in Image Diffusion Models Emerges from Data Statistics
Artem Lukoianov,Chenyang Yuan,Justin Solomon,Vincent Sitzmann
Main category: cs.CV
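TL;DR: The optimal denoiser in diffusion models is theoretically optimal but generates poorly in practice. This paper finds that locality arises from the statistics of image datasets rather than from the inductive bias of the network architecture, and uses this finding to build a better analytical denoiser.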
Details
Motivation: The optimal denoiser cannot generate diverse images, whereas deep diffusion models perform much better, and the cause of this gap remains unclear. Method: Through theoretical analysis and experimental validation, the paper studies the locality properties of the optimal linear denoiser and their relationship to pixel correlations in natural image datasets. Result: The optimal linear denoiser exhibits locality similar to that of deep neural denoisers, and this locality stems from the pixel correlations in the image dataset itself. Conclusion: Locality stems from data statistics rather than the inductive bias of convolutional networks, and an analytical denoiser designed using this finding more closely matches the predictions of deep diffusion models. Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
[115] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang,Yufeng Yuan,Rujie Zheng,Youtian Lin,Jian Gao,Lin-Zhuo Chen,Yajie Bao,Yi Zhang,Chang Zeng,Yanxi Zhou,Xiaoxiao Long,Hao Zhu,Zhaoxiang Zhang,Xun Cao,Yao Yao
Main category: cs.CV
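TL;DR: This paper introduces SpatialVID, a large-scale, high-quality 3D video dataset aimed at improving the performance and generalization of spatial intelligence models.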