cs.CL
[1] Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC
Alex Clay,Ernesto Jiménez-Ruiz,Pranava Madhyastha
Main category: cs.CL
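TL;DR: This paper studies ways to improve LLM performance on triple completion in constrained settings and finds that additional information, quality filtering, and the choice of parsing strategy all play an important role.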
Details
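Motivation: In constrained settings such as the 2025 LM-KBC challenge, conventional RAG and fine-tuning strategies are restricted, so other ways of improving LLM output quality need to be studied. Method: The paper examines three facets of the triple completion task (generation, quality assurance, and LLM response parsing) and analyzes how different strategies perform in the constrained setting. Result: Additional information improves generation quality, LLMs can effectively identify and filter low-quality triples, and the tradeoff between flexibility and consistency in parsing strategies is setting dependent. Conclusion: In constrained settings, additional information improves generation quality, LLMs can effectively filter low-quality triples, and the balance between flexibility and consistency in response parsing depends on the specific setting. Abstract: RAG and fine-tuning are prevalent strategies for improving the quality of LLM outputs. However, in constrained situations, such as that of the 2025 LM-KBC challenge, such techniques are restricted. In this work we investigate three facets of the triple completion task: generation, quality assurance, and LLM response parsing. Our work finds that in this constrained setting: additional information improves generation quality, LLMs can be effective at filtering poor quality triples, and the tradeoff between flexibility and consistency with LLM response parsing is setting dependent.
[2] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach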
Imene Kolli,Ario Saeid Vaghefi,Chiara Colesanti Senni,Shantam Raj,Markus Leippold
Main category: cs.CL
TL;DR: This paper proposes an AI-assisted framework using Retrieval-Augmented Generation to automate and accelerate the monitoring of corporate climate policy engagement, reducing manual effort while maintaining accuracy through a human-in-the-loop approach.
Details
Motivation: The manual assessment of corporate climate policy engagement is time-consuming, labor-intensive, and prone to human error. InfluenceMap seeks to accelerate this process while maintaining accuracy through automation. Method: The study proposes an AI-assisted framework leveraging Retrieval-Augmented Generation (RAG), incorporating layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies. This approach aims to automate the extraction and classification of relevant evidence from multilingual corporate documents. Result: The combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies demonstrated the best performance in extracting and classifying evidence from large-scale textual data. This significantly accelerated the evidence extraction process. Conclusion: The proposed AI-assisted framework, based on Retrieval-Augmented Generation, effectively accelerates evidence extraction in monitoring corporate climate policy engagement. However, due to the nuanced nature of the analysis, a human-in-the-loop approach is necessary to ensure accuracy, with the technology serving to augment rather than replace expert judgment. Abstract: InfluenceMap's LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity's support or opposition to science-based policy pathways for achieving the Paris Agreement's goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data.
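As a rough illustration of the retrieval step in a RAG pipeline of this kind, the sketch below ranks text chunks against a query by cosine similarity. It is not the authors' system: a toy bag-of-words vector stands in for a real embedding model (the Nomic embedding model named above is not called here), and the function names and sample chunks are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Invented sample chunks, standing in for parsed corporate documents.
chunks = [
    "The company supports a carbon tax aligned with the Paris Agreement.",
    "Quarterly revenue grew in the retail segment.",
    "The association opposed stricter emissions standards for vehicles.",
]

top = retrieve("position on emissions standards", chunks, k=1)
print(top[0])
```

In a full pipeline the retrieved chunks would then be passed, together with a few labeled examples, to an LLM for classification, with a human reviewer checking the result.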
Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.
[3] Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
Jinsong Chen
Main category: cs.CL
TL;DR: This research proposes a psychometric method using large language models to analyze textual data, transforming it into response data to uncover latent knowledge dimensions and patterns, with potential applications in multiple fields.
Details
Motivation: The motivation is to develop a psychometric approach that interprets textual data by transforming it into response data, enabling the extraction of meaningful latent factors. Method: The method involves two stages: obtaining contextual scores through NLP techniques and transformer models, and conducting psychometric analysis using factor analysis models. Result: The experimental results on the Wiki STEM corpus demonstrate the method's potential to uncover latent knowledge dimensions and patterns in textual data. Conclusion: The research concludes that the novel psychometric method for textual data analysis using large language models effectively uncovers latent knowledge dimensions and patterns within textual data, showing promise for application in various fields. Abstract: This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. 
Applied to the Wiki STEM corpus, our experimental results demonstrate the method's potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
[4] BRoverbs -- Measuring how much LLMs understand Portuguese proverbs
Thales Sales Almeida,Giovana Kerche Bonás,João Guilherme Alves Santos
Main category: cs.CL
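TL;DR: This paper introduces BRoverbs, a new dataset for evaluating Portuguese-language large language models; it is built on Brazilian proverbs and better captures the language's complexity and cultural content.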
Details
Motivation: Existing Portuguese evaluations remain limited and rely mainly on translated datasets that may not fully capture linguistic nuances or cultural references. Method: The paper introduces BRoverbs, a dataset designed specifically around Brazilian proverbs, to evaluate LLM performance. Result: BRoverbs challenges models' comprehension of regional expressions and assesses the breadth of their linguistic understanding. Conclusion: BRoverbs provides a new evaluation tool for Portuguese-language LLMs and helps advance regionally informed benchmarking. Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model's comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
[5] Can Vision-Language Models Solve Visual Math Equations?
Monjoy Narayan Choudhury,Junling Wang,Yifan Hou,Mrinmaya Sachan
Main category: cs.CL
TL;DR: Vision-Language Models (VLMs) face challenges in visually grounded mathematical reasoning, particularly in coefficient counting and symbolic computation as equation complexity increases.
Details
Motivation: Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation, which motivates the study of their limitations in visual equation solving. Method: The research analyzes VLM performance in visual equation solving by decomposing the task into coefficient counting and variable recognition. Result: VLMs perform well on textual equations but fail on visually grounded counterparts. Coefficient counting is identified as the primary bottleneck, and symbolic reasoning becomes a limiting factor as equation complexity increases. Conclusion: The study reveals key weaknesses in current Vision-Language Models (VLMs) concerning visually grounded mathematical reasoning, indicating areas for future improvements. Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
[6] Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM's Willingness to Re-engage in Conversation
Thomas Manuel Rost,Martina Figlia,Bernd Wallraff
Main category: cs.CL
TL;DR: SPICE is a new tool that effectively measures a language model's willingness to continue interactions based on user behavior, offering a simple and reliable way to audit model dispositions.
Details
Motivation: The motivation is to develop a simple, reproducible method for auditing how models respond to different user behaviors, particularly to detect subtle signals of model disposition beyond simple abuse classification. Method: The study introduced SPICE, a diagnostic signal based on YES/NO responses from Large Language Models regarding their willingness to re-engage with user behavior after reviewing a transcript. It used a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set across four models and four framing conditions, totaling 480 trials. Statistical tests such as Rao-Scott adjustment and cluster permutation tests were employed. Result: SPICE effectively discriminated based on user tone: Friendly interactions led to a 97.5% YES response for continuation, abusive interactions led to only 17.9% YES, and unclear interactions resulted in 60.4% YES. Additionally, SPICE provided a distinct signal from abuse classification, with models preferring not to continue interactions even when abuse was not identified. A preamble significantly affected SPICE responses under ambiguity when transcripts were presented as single blocks. Conclusion: The study concludes that SPICE is a robust, low-overhead tool for auditing model dispositions, providing a direct relational signal of a model's state and complementing existing metrics. Abstract: We introduce and evaluate Stated Preference for Interaction and Continued Engagement (SPICE), a simple diagnostic signal elicited by asking a Large Language Model a YES or NO question about its willingness to re-engage with a user's behavior after reviewing a short transcript. In a study using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, we tested four open-weight chat models across four framing conditions, resulting in 480 trials. Our findings show that SPICE sharply discriminates by user tone. 
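The trial count follows directly from the factorial design described above: 3 tones x 10 interactions x 4 models x 4 framing conditions. The short sketch below enumerates that grid as a sanity check. The model and framing labels are placeholders, since this listing does not name them.

```python
from collections import Counter
from itertools import product

TONES = ["friendly", "unclear", "abusive"]
INTERACTIONS = range(1, 11)  # 10 stimulus transcripts per tone
MODELS = ["model_a", "model_b", "model_c", "model_d"]  # placeholder names
FRAMINGS = ["framing_1", "framing_2", "framing_3", "framing_4"]  # placeholder names

# One trial per cell of the full factorial design.
trials = [
    {"tone": tone, "interaction": i, "model": model, "framing": framing}
    for tone, i, model, framing in product(TONES, INTERACTIONS, MODELS, FRAMINGS)
]

per_tone = Counter(trial["tone"] for trial in trials)

print(len(trials))        # 480
print(per_tone["abusive"])  # 160
```

Each tone therefore contributes 160 trials, which is the denominator behind the per-tone YES rates reported for the study.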
Friendly interactions yielded a near-unanimous preference to continue (97.5% YES), while abusive interactions yielded a strong preference to discontinue (17.9% YES), with unclear interactions falling in between (60.4% YES). This core association remains decisive under multiple dependence-aware statistical tests, including Rao-Scott adjustment and cluster permutation tests. Furthermore, we demonstrate that SPICE provides a distinct signal from abuse classification. In trials where a model failed to identify abuse, it still overwhelmingly stated a preference not to continue the interaction (81% of the time). An exploratory analysis also reveals a significant interaction effect: a preamble describing the study context significantly impacts SPICE under ambiguity, but only when transcripts are presented as a single block of text rather than a multi-turn chat. The results validate SPICE as a robust, low-overhead, and reproducible tool for auditing model dispositions, complementing existing metrics by offering a direct, relational signal of a model's state. All stimuli, code, and analysis scripts are released to support replication.
[7] Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
Piyush Pant
Main category: cs.CL
TL;DR: This research demonstrates that combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) improves the safety and helpfulness of the OPT-350M language model more effectively than using either method alone.
Details
Motivation: The motivation behind this research is to investigate the effectiveness of different alignment techniques in improving the safety and helpfulness of language models, specifically addressing the challenges posed by noisy data, limited GPU resources, and training constraints. Method: The research employed Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on the OPT-350M language model. Four models were trained and evaluated: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. Three key evaluation metrics - Harmlessness Rate (HmR), Helpfulness Rate (HpR), and Combined Alignment Score (CAS) - were introduced and derived from reward model outputs. Result: Results indicate that Supervised Fine-Tuning (SFT) outperforms Direct Preference Optimization (DPO). However, the combined SFT+DPO model surpasses all other models across all metrics, highlighting the complementary nature of these techniques. Conclusion: The study concludes that the combined SFT+DPO approach outperforms individual SFT and DPO techniques in enhancing the safety and helpfulness of the OPT-350M language model. This suggests that these methods are complementary and can be used together to achieve better model alignment. Abstract: This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. 
The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
[8] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
Zhongqiu Li,Shiquan Wang,Ruiyu Fang,Mengjiao Bao,Zhenhe Wu,Shuangyong Song,Yongxiang Li,Zhongjiang He
Main category: cs.CL
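TL;DR: This paper proposes MR-UIE, which combines reinforcement learning with multi-perspective reasoning to improve LLM performance on information extraction and shows strong generalization, particularly in complex scenarios.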
Details
Motivation: Large language models perform inadequately on universal information extraction (UIE), especially in structured output scenarios that involve complex schema descriptions and require multi-step reasoning. Method: The paper combines reinforcement learning (RL) with multi-perspective reasoning, turning LLMs from passive extractors into active reasoners. Result: Experiments show that MR-UIE consistently improves extraction accuracy across multiple information extraction benchmarks and surpasses state-of-the-art methods on several datasets. Conclusion: Combining reinforcement learning with multi-perspective reasoning markedly improves generalization on complex information extraction tasks, demonstrating the key role of reasoning in challenging scenarios. Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model's generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.
[9] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
Nishat Raihan,Antonios Anastasopoulos,Marcos Zampieri
Main category: cs.CL
TL;DR: The research introduces the first family of Code LLMs for Bangla, showing that quality datasets can enhance performance in low-resource languages.
Details
Motivation: Bangla, being the 5th most spoken language, remains underrepresented in Large Language Models, particularly for code generation, due to a lack of high-quality data. Method: The researchers introduced the first dedicated family of Code LLMs for Bangla, including a comprehensive Bangla code instruction dataset, an evaluation benchmark (MBPP-Bangla), and the TigerCoder-family of Code LLMs. Result: The TigerCoder-family of Code LLMs achieved significant performance gains (~11-18% at Pass@1) over existing multilingual and general-purpose Bangla LLMs. Conclusion: The study concludes that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages like Bangla. Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
[10] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
Sophia Maria
Main category: cs.CL
TL;DR: Compass-v3 is a vertical-domain Mixture-of-Experts model designed for Southeast Asian e-commerce, with 245B total parameters and 71B active parameters.
Details
Motivation: Large language models (LLMs) excel in general-domain applications, but their performance often degrades on specialized tasks that require domain-specific knowledge. E-commerce data are noisy, heterogeneous, multilingual, and highly dynamic, which places higher demands on models. Method: Compass-v3 adopts fewer but larger experts, combined with hardware-efficiency optimizations such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, the paper proposes Optimal-Transport Direct Preference Optimization (OTPO). Result: Compass-v3 reaches state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, the GPT-4 series, and Qwen3-235B. It also shows strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while remaining competitive on general benchmarks. Conclusion: Compass-v3 is already widely deployed on Shopee's industrial-scale e-commerce platform and is gradually replacing OpenAI's traffic, now accounting for over 70% of total LLM usage, which highlights its dual strengths in specialized commerce expertise and broad linguistic competence. Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations (such as intra-node expert parallelism and a customized memcpy operator) to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks.
It has already been widely applied in Shopee's industrial-scale e-commerce platform and is gradually replacing OpenAI's traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.
[11] Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He,Jiaqi Xu
Main category: cs.CL
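TL;DR: The study finds that generative AI, GPT-4 in particular, performs well at automatically classifying educational dialogue acts, offering an efficient solution for educational dialogue analysis.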
Details
Motivation: To reduce the time and effort required by traditional manual coding, the study explores generative AI for automatically classifying tutors' dialogue acts (DAs). Method: A case study on the open-source CIMA corpus tests the classification performance of GPT-3.5-turbo and GPT-4 using tailored prompts. Result: GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, showing substantial agreement with human annotations and surpassing baseline performance. Conclusion: Generative AI offers an efficient and feasible approach to DA classification; the study also stresses the importance of task-specific label definitions and contextual information, as well as the ethical considerations of using generative AI and the need for responsible research practices. Abstract: This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
[12] ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
Phuong-Nam Dang,Kieu-Linh Nguyen,Thanh-Hieu Pham
Main category: cs.CL
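TL;DR: ViRanker is a cross-encoder reranking model designed for Vietnamese; architectural improvements and data curation raise reranking performance for a low-resource language.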
Details
Motivation: Vietnamese, a low-resource language, lacks competitive reranking models, and its complex syntax and diacritics pose challenges for existing models. Method: ViRanker is built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer. It was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Result: On the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. Conclusion: ViRanker fills a gap in Vietnamese reranking, shows the benefit of architectural adaptation and data curation for low-resource reranking, and offers an approach that other underrepresented languages can draw on. Abstract: This paper presents ViRanker, a cross-encoder reranking model tailored to the Vietnamese language. Built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, ViRanker addresses the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Evaluated on the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. By releasing the model openly on Hugging Face, we aim to support reproducibility and encourage wider adoption in real-world retrieval systems. Beyond Vietnamese, this study illustrates how careful architectural adaptation and data curation can advance reranking in other underrepresented languages.
[13] LITcoder: A General-Purpose Library for Building and Comparing Encoding Models
Taha Binhuraib,Ruimin Gao,Anna A. Ivanova
Main category: cs.CL
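TL;DR: LITcoder is a flexible open-source library for building and evaluating neural encoding models; it supports a range of methods and datasets, emphasizes methodological rigor and reproducibility, and aims to accelerate the development of high-quality models that predict brain activity.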
Details
Motivation: To address the methodological choices and information leakage issues that arise when building encoding models for continuous functional magnetic resonance imaging (fMRI) data, and to improve the reproducibility and extensibility of such research. Method: LITcoder provides a modular pipeline that covers aligning continuous stimuli with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating predictive performance on held-out data. Result: Fitting a range of encoding models to three story listening datasets (LeBel et al., 2023; Narratives; Little Prince) demonstrates the framework's scalability and versatility, and the paper explores the methodological choices critical for building encoding models for continuous fMRI data. Conclusion: LITcoder is an open-source library that lowers the technical barriers to building and benchmarking neural encoding models, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity. Abstract: We introduce LITcoder, an open-source library for building and benchmarking neural encoding models. Designed as a flexible backend, LITcoder provides standardized tools for aligning continuous stimuli (e.g., text and speech) with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the predictive performance of the resulting model on held-out data. The library implements a modular pipeline covering a wide array of methodological design choices, so researchers can easily compose, compare, and extend encoding models without reinventing core infrastructure. Such choices include brain datasets, brain regions, stimulus features (both neural-net-based and control, such as word rate), downsampling approaches, and many others. In addition, the library provides built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases (W&B). We demonstrate the scalability and versatility of our framework by fitting a range of encoding models to three story listening datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore the methodological choices critical for building encoding models for continuous fMRI data, illustrating the importance of accounting for all tokens in a TR scan (as opposed to just taking the last one, even when contextualized), incorporating hemodynamic lag effects, using train-test splits that minimize information leakage, and accounting for head motion effects on encoding model predictivity.
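Two of the choices named above, pooling all tokens within a TR versus keeping only the last one, and accounting for hemodynamic lag, can be sketched in a few lines. This is a minimal illustration and not LITcoder's API: the function names are hypothetical, token onsets are assumed to be in seconds with a fixed TR, and concatenating delayed copies of the features is one common way to model lag that this listing does not confirm the library uses.

```python
def pool_tokens_per_tr(onsets, features, tr, n_trs, mode="mean"):
    """Downsample token-level features to one vector per TR.

    mode="mean" averages every token that falls inside the TR;
    mode="last" keeps only the final token of the TR.
    """
    dim = len(features[0])
    buckets = [[] for _ in range(n_trs)]
    for onset, vec in zip(onsets, features):
        idx = int(onset // tr)
        if 0 <= idx < n_trs:
            buckets[idx].append(vec)
    pooled = []
    for bucket in buckets:
        if not bucket:
            pooled.append([0.0] * dim)  # no tokens in this TR
        elif mode == "last":
            pooled.append(list(bucket[-1]))
        else:
            pooled.append([sum(col) / len(bucket) for col in zip(*bucket)])
    return pooled


def add_lags(rows, delays):
    """Concatenate copies of the TR-level features shifted by each delay (in TRs)."""
    dim = len(rows[0])
    lagged = []
    for t in range(len(rows)):
        row = []
        for d in delays:
            src = t - d
            row.extend(rows[src] if src >= 0 else [0.0] * dim)
        lagged.append(row)
    return lagged


# Toy data: five tokens with 1-D features, TR = 2 s, three scans.
onsets = [0.2, 0.9, 1.6, 2.1, 4.3]
features = [[1.0], [3.0], [5.0], [2.0], [4.0]]

mean_pooled = pool_tokens_per_tr(onsets, features, tr=2.0, n_trs=3, mode="mean")
last_pooled = pool_tokens_per_tr(onsets, features, tr=2.0, n_trs=3, mode="last")
design = add_lags(mean_pooled, delays=[1, 2])

print(mean_pooled)  # [[3.0], [2.0], [4.0]]
print(last_pooled)  # [[5.0], [2.0], [4.0]]
print(design)       # [[0.0, 0.0], [3.0, 0.0], [2.0, 3.0]]
```

In the first TR the two pooling modes disagree (3.0 versus 5.0), which is the kind of difference the abstract points to when it stresses accounting for all tokens in a scan.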
Overall, LITcoder lowers technical barriers to encoding model implementation, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity. Project page: https://litcoder-brain.github.io
[14] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing
Zhiyue Liu,Fanrong Ma,Xin Ling
Main category: cs.CL
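TL;DR: This paper proposes a new debiasing framework that uses counterfactual data augmentation and a contrastive learning mechanism to improve target-oriented multimodal sentiment classification.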
Details
Motivation: Existing methods over-rely on textual content and ignore word-level contextual biases, which lowers classification accuracy. Method: The paper introduces a counterfactual data augmentation strategy and an adaptive debiasing contrastive learning mechanism to generate detail-matched image-text samples and learn robust features. Result: Experiments show that the proposed method outperforms existing approaches on several benchmark datasets. Conclusion: The paper proposes a counterfactual-enhanced debiasing framework that reduces spurious correlations between text features and output labels and outperforms state-of-the-art baselines on several benchmark datasets. Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model's attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.
[15] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang,Yuhao Du,Zhanchen Dai,Xiangnan Ma,Kaiqi Kou,Benyou Wang,Haizhou Li
Main category: cs.CL
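TL;DR: EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap through semantics-assisted training, preserving strong reasoning ability and performing well across multiple tasks.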
Details
Motivation: Existing speech-to-speech large language models (SLLMs) show degraded knowledge and reasoning capabilities because their training paradigms fail to bridge the acoustic-semantic gap in the feature representation space. Method: EchoX leverages semantic representations and dynamically generates speech training targets, integrating acoustic and semantic learning during training. Result: With about six thousand hours of training data, EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks. Conclusion: By combining semantic representations with dynamically generated speech training targets, EchoX bridges the acoustic-semantic gap in the feature representation space and preserves the strong reasoning abilities of a speech LLM. Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
[16] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
Chin Yuen Kwok,Jia Qi Yip
Main category: cs.CL
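TL;DR: This paper proposes an improved approach for ASR models that uses look-ahead prediction to avoid the revocation step of conventional methods, significantly improving the efficiency and accuracy of rare word recognition.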
Details
Motivation: In conventional Trie-based biasing, revoking bonus scores during beam search is computationally expensive and limits model performance, so a more efficient approach is needed. Method: The paper adapts ASR models to predict multiple output steps ahead, which avoids the bonus revocation step required by conventional Trie-based biasing. Result: By fine-tuning Whisper with only 10 hours of synthetic data, the proposed method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%. Conclusion: Contextual biasing improves the rare word recognition of ASR models by prioritizing rare words during decoding, and with a fine-tuned Whisper model the proposed method substantially lowers the word error rate on the NSC Part 2 test set. Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives "bonus scores" to partial hypotheses (e.g. "Bon") that may lead to the generation of the rare word (e.g. "Bonham"). If the full word ("Bonham") isn't ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
[17] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng
Main category: cs.CL
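TL;DR: An improved contextual biasing method combined with a keyword-aware loss function significantly improves rare word recognition accuracy.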
Details
Motivation: Rare word recognition remains a challenge in speech recognition; synthetic data and contextual biasing are effective, but artifacts in synthetic audio can cause overfitting. Method: The paper proposes a keyword-aware loss function, consisting of a masked cross-entropy term and a binary classification term, to improve the training of the biasing module. Result: By adapting Whisper to 10 hours of synthetic data, the method reduces the word error rate on the NSC Part 2 test set from 29.71% to 11.81%. Conclusion: The improved TCPGen-based contextual biasing approach and the keyword-aware loss function effectively improve rare word recognition while reducing the risk of overfitting. Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
[18] GmSLM : Generative Marmoset Spoken Language Modeling
Talia Sternberg,Michael London,David Omer,Yossi Adi
Main category: cs.CL
TL;DR: This study developed GmSLM, a spoken language model for Marmoset vocalizations, demonstrating its effectiveness in mimicking real vocal communication and offering potential for understanding the neural mechanisms behind it.
Details
Motivation: Marmoset vocal communication shares similarities with human speech and studying it can offer insights into brain activity, which is difficult to access in humans. Method: Generative Marmoset Spoken Language Modeling (GmSLM) was introduced with a novel zero-shot evaluation metric using unsupervised and weakly labeled data. Result: GmSLM generated vocalizations that closely matched real samples and outperformed a human-speech-based baseline in acoustic accuracy and downstream tasks. Conclusion: GmSLM provides a practical framework for studying the neural basis of vocal communication and benefits neuroscience, bioacoustics, and evolutionary biology. Abstract: Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates' vocal communication is entirely innate, and show features similar to those of human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM-generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations, may support further investigations of the neural basis of vocal communication, and provides a practical framework linking vocalization and brain activity.
We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
[19] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
Wenhao Li,Bangcheng Sun,Weihao Ye,Tianyi Zhang,Daohai Yu,Fei Chao,Rongrong Ji
Main category: cs.CL
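TL;DR: This paper proposes CCF, an efficient context compression framework for long-context language models that learns hierarchical latent representations preserving global semantics, significantly improving scalability and performance.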
Details
Motivation: Extending the context length of language models is essential for capturing long-range dependencies, but naïve context extension imposes significant computational and memory burdens. Method: CCF integrates segment-wise semantic aggregation with key-value memory encoding, and is trained with an optimization strategy that couples incremental segment decoding with sparse reservoir sampling. Result: On multiple long-context language modeling benchmarks, CCF achieves competitive perplexity under high compression ratios and significantly improves throughput and memory efficiency. Conclusion: CCF is an effective context compression framework that significantly improves the scalability and performance of long-context language models. Abstract: Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
[20] Reading Between the Lines: Classifying Resume Seniority with Large Language Models
Matan Cohen,Shira Shani,Eden Menahem,Yehudit Aperstein,Alexander Apartsin
Main category: cs.CL
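TL;DR: This study examines how effectively large language models can automatically detect seniority inflation and implicit expertise in resumes, addressing evaluation bias caused by self-promotion.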
Details
Motivation: Because overstated experience and ambiguous self-presentation are prevalent in resumes, accurately assessing candidate seniority is both critical and challenging. Method: The study introduces a hybrid dataset of real-world resumes and synthetic hard examples to evaluate how well large language models, including fine-tuned BERT architectures, detect subtle linguistic cues associated with seniority inflation and implicit expertise. Result: The findings highlight directions for improving AI-driven candidate evaluation systems and help mitigate bias introduced by self-promotional language. Conclusion: Large language models show potential for automating the assessment of candidate seniority, while self-promotional language remains a possible source of bias. Abstract: Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. In this study, we investigate the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising both real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Using the dataset, we evaluate the performance of Large Language Models in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language. The dataset is available for the research community at https://bit.ly/4mcTovt
[21] Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi,Mohit Gupta,Rahul Bouri
Main category: cs.CL
TL;DR: This paper proposes an effective NL-to-SQL method using LLMs for Table QA, achieving high accuracy on the DataBench benchmark and providing insights into LLM-driven approaches for structured query tasks.
Details
Motivation: Table QA poses challenges due to diverse table structures, sizes, and data types, necessitating robust models capable of accurately answering structured queries. Method: The study uses a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement with LLMs like GPT-4o, GPT-4o-mini, and DeepSeek v2:16b. Result: The approach achieved 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. Conclusion: The proposed NL-to-SQL approach leveraging LLMs demonstrates significant improvements in Table QA tasks, offering insights into the strengths and limitations of LLM-driven methods. Abstract: Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.
[22] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
Grazia Sveva Ascione,Nicolò Tamagnone
Main category: cs.CL
TL;DR: This paper introduces a weak supervision approach to classify patents by their relevance to UN SDGs using semantic alignment and large language models, overcoming limitations of existing methods and enabling effective multi-label regression models.
Details
Motivation: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges, but supervised learning is limited due to the lack of a large, labeled dataset. Method: The study framed patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications as a noisy initial signal. A composite labeling function (LF) was developed using large language models to extract structured concepts from patents and SDG papers. Cross-domain similarity scores were computed and combined using a rank-based retrieval approach, with LF calibration via a custom positive-only loss. This generated a silver-standard, soft multi-label dataset mapping patents to SDGs. Result: The method produced a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. The approach outperformed several baselines in internal validation and revealed greater thematic, cognitive, and organizational coherence than traditional classifications in external validation. Conclusion: Weak supervision and semantic alignment can enhance SDG classification at scale. Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. 
To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models and a zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.
[23] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Channdeth Sok,David Luz,Yacine Haddam
Main category: cs.CL
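TL;DR: This paper proposes MetaRAG, a metamorphic testing framework for detecting hallucinations in retrieval-augmented generation systems; it runs in a real-time, unsupervised, black-box setting and localizes unsupported claims to the factoid span where they occur.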
Details
Motivation: Existing detection methods primarily target standalone LLMs and do not address the unique challenges of retrieval-augmented generation systems, where responses must be consistent with the retrieved evidence. Method: MetaRAG proceeds in four stages: (1) decompose the answer into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context, and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Result: Experiments indicate that MetaRAG is effective for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. Conclusion: MetaRAG is a metamorphic testing framework for hallucination detection in retrieval-augmented generation systems that runs in real time, unsupervised, and without access to model internals or ground-truth references. Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents.
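As a rough illustration of stages (3) and (4) described above, the following Python sketch scores a response from verifier verdicts on synonym and antonym variants. It is not the authors' implementation: the names `Variant`, `variant_penalty`, `hallucination_score`, and `toy_verify` are hypothetical, the 0.5 penalty for a neutral verdict and the max-over-factoids aggregation are assumptions, and stages (1) and (2) are replaced by hand-written factoids and mutations. In a real system the verifier would presumably be an LLM or NLI model; here it is a lookup table so the example runs deterministically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Verdict labels a verifier is assumed to return for a (context, claim) pair.
ENTAILED, CONTRADICTED, NEUTRAL = "entailed", "contradicted", "neutral"


@dataclass
class Variant:
    text: str
    kind: str  # "synonym" or "antonym"


Verifier = Callable[[str, str], str]  # (context, claim) -> verdict label


def variant_penalty(kind: str, verdict: str) -> float:
    """Penalty is 0 when the verdict matches the metamorphic expectation:
    synonym variants should be entailed, antonym variants contradicted.
    The 0.5 weight for a neutral verdict is an assumption of this sketch."""
    expected = ENTAILED if kind == "synonym" else CONTRADICTED
    if verdict == expected:
        return 0.0
    return 0.5 if verdict == NEUTRAL else 1.0


def hallucination_score(
    mutations: Dict[str, List[Variant]], context: str, verify: Verifier
) -> Tuple[float, Dict[str, float]]:
    """Stage 3: verify every variant; stage 4: aggregate penalties.
    Per-factoid score is the mean penalty; the response-level score is the
    worst factoid (max), which is an assumed aggregation rule."""
    per_factoid: Dict[str, float] = {}
    for factoid, variants in mutations.items():
        penalties = [variant_penalty(v.kind, verify(context, v.text)) for v in variants]
        per_factoid[factoid] = sum(penalties) / len(penalties) if penalties else 0.0
    return max(per_factoid.values(), default=0.0), per_factoid


# Toy data standing in for stages 1 and 2 (decomposition and mutation).
context = "The clinic opens at 9 am on weekdays."
mutations = {
    "The clinic opens at 9 am": [
        Variant("The clinic starts at 9 am", "synonym"),
        Variant("The clinic closes at 9 am", "antonym"),
    ],
    "The clinic is open on Sundays": [
        Variant("The clinic operates on Sundays", "synonym"),
        Variant("The clinic is closed on Sundays", "antonym"),
    ],
}
toy_verdicts = {
    "The clinic starts at 9 am": ENTAILED,
    "The clinic closes at 9 am": CONTRADICTED,
}


def toy_verify(ctx: str, claim: str) -> str:
    # Claims the toy table does not cover are treated as unsupported (neutral).
    return toy_verdicts.get(claim, NEUTRAL)


score, per_factoid = hallucination_score(mutations, context, toy_verify)
print(score, per_factoid)
```

Because scores are kept per factoid, the unsupported Sunday claim can be flagged on its own span, which mirrors the span-level localization the abstract describes.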
We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
[24] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Molly R Petersen,Claire E Stevenson,Lonneke van der Plas
Main category: cs.CL
TL;DR: This paper explores the application of cognitive science theories on analogical reasoning to NLP, emphasizing improved text understanding through relational analysis.
Details
Motivation: To bridge the gap between cognitive science theories on analogical reasoning and NLP research, highlighting the potential for improved relational understanding in text. Method: The paper summarizes key theories of analogical reasoning from cognitive science and relates them to NLP research. Result: The paper identifies how analogical reasoning concepts are relevant to major NLP challenges and suggests a shift from entity-level similarity to relational understanding. Conclusion: The paper concludes that viewing NLP research through a cognitive lens of analogical reasoning can help better optimize relational understanding in text. Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
[25] Hierarchical Bracketing Encodings Work for Dependency Graphs
Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares
Main category: cs.CL
TL;DR: This paper introduces hierarchical bracketing encodings for dependency graph parsing, reducing label space while preserving structural information, and achieving improved accuracy across multilingual and multi-formalism benchmarks.
Details
Motivation: The motivation is to improve dependency graph parsing by reducing the label space without losing structural information, addressing challenges like reentrancies, cycles, and empty nodes that are common in graph parsing. Method: This paper encodes dependency graphs as sequences using hierarchical bracketing encodings, enabling linear-time parsing with n tagging actions while capturing reentrancies, cycles, and empty nodes. The method is evaluated across a multilingual and multi-formalism benchmark. Result: The proposed hierarchical bracketing encoding achieved competitive results and showed consistent improvements in exact match accuracy compared to other graph linearization methods. Conclusion: The hierarchical bracketing encoding approach effectively reduces label space while maintaining structural information, leading to competitive results and improvements in exact match accuracy for multilingual and multi-formalism dependency graph parsing. Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.
[26] GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
Zhaohan Zhang,Ziquan Liu,Ioannis Patras
Main category: cs.CL
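TL;DR: This paper proposes GrACE, an efficient and reliable confidence estimation method for large language models that derives confidence from the similarity between the model's hidden state and the embedding of a special token, and validates its performance across multiple models and datasets.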
Details
Motivation: Existing confidence estimation methods for large language models either incur high computational overhead or are poorly calibrated, so a more practical and reliable method is needed. Method: GrACE estimates confidence from the similarity between the model's last hidden state and the embedding of an appended special token, and fine-tunes the model with calibration targets to calibrate that confidence. Result: Experiments show that GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks without additional sampling or an auxiliary model. GrACE also improves the accuracy of the final decision and significantly reduces the number of samples required by test-time scaling schemes. Conclusion: GrACE is a practical solution for deploying large language models with scalable, reliable, and real-time confidence estimation. Abstract: Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with calibration targets associated with accuracy. Experiments with three LLMs and two benchmark datasets show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods without resorting to additional sampling or an auxiliary model. Moreover, we propose two strategies for improving test-time scaling based on confidence induced by GrACE. Experimental results show that using GrACE not only improves the accuracy of the final decision but also significantly reduces the number of required samples in the test-time scaling scheme, indicating the potential of GrACE as a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation.
[27] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
Lucie Poláková,Martin Popel,Věra Kloudová,Michal Novák,Mariia Anisimova,Jiří Balhar
Main category: cs.CL
TL;DR: The EdUKate project develops multilingual learning materials for Czech schools by leveraging machine translation and collaboration between academic and educational institutions, focusing on creating a Czech-Ukrainian translation system for educational content.
Details
Motivation: The motivation behind the EdUKate project is to address the educational needs of non-Czech-speaking students by creating multilingual learning resources and providing freely accessible tools for students, educators, and researchers. Method: The project uses a collaborative approach between a major Czech academic institution and the country's largest educational publisher to translate multimodal interactive exercises into Ukrainian, English, and German. It involves the development and evaluation of a domain-specific machine translation system while focusing on processing formatted content like XML and PDF, and handling technical and scientific terminology. Result: The project translated up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German. It also developed and evaluated a direct Czech-Ukrainian machine translation system adapted to the educational domain, including handling formatted content and specialized terminology. Conclusion: The EdUKate project successfully integrates digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools, with a focus on creating a Czech-Ukrainian machine translation system tailored to the educational domain. Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country's largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. 
It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system's evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.
[28] Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs
Vadim Zadykian,Bruno Andrade,Haithem Afli
Main category: cs.CL
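TL;DR: This paper proposes a new method combining knowledge graphs and sentence embeddings to improve job title matching in resume recommendation systems, and uses stratified evaluation to reveal how model performance differs across regions of semantic relatedness.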
Details
Motivation: To address job title matching in resume recommendation systems, where overlapping terms are often limited or misleading, while improving semantic alignment and explainability. Method: The paper introduces a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs), integrated via graph neural networks. Result: Fine-tuned SBERT models augmented with KGs reduce RMSE by 25% in the high-STR region. Conclusion: Models combining KGs with text embeddings perform well in the high-STR region, and regional performance analysis is important for understanding model behavior. Abstract: Semantic Textual Relatedness (STR) captures nuanced relationships between texts that extend beyond superficial lexical similarity. In this study, we investigate STR in the context of job title matching - a key challenge in resume recommendation systems, where overlapping terms are often limited or misleading. We introduce a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to improve both semantic alignment and explainability. Unlike previous work that evaluated models on aggregate performance, our approach emphasizes data stratification by partitioning the STR score continuum into distinct regions: low, medium, and high semantic relatedness. This stratified evaluation enables a fine-grained analysis of model performance across semantically meaningful subspaces. We evaluate several embedding models, both with and without KG integration via graph neural networks. The results show that fine-tuned SBERT models augmented with KGs produce consistent improvements in the high-STR region, where the RMSE is reduced by 25% over strong baselines. Our findings highlight not only the benefits of combining KGs with text embeddings, but also the importance of regional performance analysis in understanding model behavior. This granular approach reveals strengths and weaknesses hidden by global metrics, and supports more targeted model selection for use in Human Resources (HR) systems and applications where fairness, explainability, and contextual matching are essential.
[29] DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning
Daniil Ignatev,Nan Li,Hugh Mee Wong,Anh Dang,Shane Kaszefski Yaschuk
Main category: cs.CL
TL;DR: The DeMeVa team explored the effectiveness of ICL and LDL methods in predicting annotations and soft labels for the LeWiDi 2025 shared task, showing that both methods have promising results.
Details
Motivation: The motivation is to explore effective methods for predicting annotator-specific annotations and soft labels in the context of the LeWiDi 2025 shared task. Method: The team explored two methods: in-context learning with large language models using different example sampling strategies, and label distribution learning methods with RoBERTa using various fine-tuning methods. Result: The contributions showed that ICL can predict perspectivist annotations effectively and that LDL methods are promising for soft label predictions. Conclusion: The DeMeVa team concludes that ICL can effectively predict perspectivist annotations and that LDL methods are promising for soft label predictions. Abstract: This system paper presents the DeMeVa team's approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
[30] Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)
Paolo Pedinotti,Peter Baumann,Nathan Jessurun,Leslie Barrett,Enrico Santus
Main category: cs.CL
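TL;DR: MetaGraph is a new methodology for extracting knowledge graphs from scientific literature and analyzing research trends, revealing the phases and trends in the development of financial NLP.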
Details
Motivation: Large Language Models (LLMs) have rapidly reshaped financial NLP, diversifying tasks, datasets, and data sources, but this transformation has outpaced traditional surveys. Method: The paper defines an ontology for financial NLP research and applies an LLM-based extraction pipeline to 681 papers (2022-2025) for large-scale, data-driven analysis. Result: MetaGraph reveals three key phases in financial NLP: early LLM adoption with task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. Conclusion: MetaGraph offers a reusable approach for extracting knowledge graphs from scientific literature and analyzing research trends, giving financial NLP researchers and practitioners a clear view of how the field has developed. Abstract: Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved - highlighting emerging trends, shifting priorities, and methodological shifts - while also demonstrating a reusable approach for mapping scientific progress in other domains.
[31] Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
Brittany Harbison,Samuel Taubman,Travis Taylor,Ashok K. Goel
Main category: cs.CL
TL;DR: This paper proposes a personality detection model using GPT to improve social matchmaking in online courses by integrating personality traits into SAMI's recommendation system.
Details
Motivation: Social connections are essential for learning, but online courses often hinder the formation of social groups. SAMI's current limitations in understanding student personality traits impact its recommendation relevance. Method: A personality detection model was developed using GPT's zero-shot capability to infer Big-Five personality traits from student forum posts. This model was benchmarked against established models and then integrated into SAMI's matchmaking system. Result: The personality detection model demonstrated efficacy in inferring traits from forum posts, and initial integration into SAMI suggested that personality traits could enhance existing matching criteria. Conclusion: The integration of personality detection into SAMI's matchmaking system could enhance social recommendations in online courses, although further evaluation is needed. Abstract: Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations.
Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.
[32] Fluent but Unfeeling: The Emotional Blind Spots of Language Models
Bangzhao Shu,Isha Joshi,Melissa Karnaze,Anh C. Pham,Ishita Kakkar,Sindhu Kothe,Arpine Hovasapian,Mai ElSherief
Main category: cs.CL
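TL;DR: This study introduces the EXPRESS dataset to evaluate how well large language models (LLMs) align with human emotions at a fine-grained level, and finds that LLMs remain limited in capturing contextual emotion.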
Details
Motivation: Existing research typically classifies emotions into predefined, limited categories and overlooks more nuanced expressions, so it is necessary to evaluate whether LLMs align with human emotions at a finer-grained level. Method: The study introduces EXPRESS, a benchmark dataset curated from Reddit communities with 251 fine-grained, self-disclosed emotion labels. A comprehensive evaluation framework decomposes predicted emotion terms into eight basic emotions, and is paired with systematic testing and qualitative analysis. Result: Systematic testing of prevalent LLMs under various prompt settings shows that accurately predicting emotions aligned with human self-disclosed emotions remains challenging. Conclusion: Although certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures; this highlights the limitations of LLMs in fine-grained emotion alignment and offers insights for improving their contextual understanding. Abstract: The versatility of Large Language Models (LLMs) in natural language understanding has made them increasingly popular in mental health research. While many studies explore LLMs' capabilities in emotion recognition, a critical gap remains in evaluating whether LLMs align with human emotions at a fine-grained level. Existing research typically focuses on classifying emotions into predefined, limited categories, overlooking more nuanced expressions. To address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit communities featuring 251 fine-grained, self-disclosed emotion labels. Our comprehensive evaluation framework examines predicted emotion terms and decomposes them into eight basic emotions using established emotion theories, enabling a fine-grained comparison. Systematic testing of prevalent LLMs under various prompt settings reveals that accurately predicting emotions that align with human self-disclosed emotions remains challenging. Qualitative analysis further shows that while certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures. These findings highlight the limitations of LLMs in fine-grained emotion alignment and offer insights for future research aimed at enhancing their contextual understanding.
[33] LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination
Yiqun T. Chen,Tyler H. McCormick,Li Liu,Abhirup Datta
Main category: cs.CL
TL;DR: This study introduces LA-VA, a novel pipeline combining Large Language Models with traditional methods to improve cause-of-death predictions in resource-limited settings, showing that LLM-assisted approaches can significantly enhance verbal autopsy accuracy.
Details
Motivation: Verbal autopsy (VA) is essential in resource-limited settings where medical certification is not available, prompting the need for improved prediction methods. Method: This study introduces LA-VA, a pipeline combining Large Language Models (LLMs) with traditional algorithmic methods and embedding-based classification to improve cause-of-death prediction. Result: GPT-5 demonstrated the highest performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), surpassing traditional machine learning baselines by 5-10%. Conclusion: LLM-assisted approaches can significantly enhance the accuracy of verbal autopsy, offering valuable benefits for global health monitoring in resource-constrained areas. Abstract: Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.
[34] Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems
Minghang Zhu,Zhengliang Shi,Zhiwei Xu,Shiguang Wu,Lingjie Wang,Pengjie Ren,Zhaochun Ren,Zhumin Chen
Main category: cs.CL
TL;DR: MOAT is a multi-agent joint alignment tuning framework that improves collaboration between agents through iterative alignment.
Details
Motivation: Existing multi-agent systems usually fine-tune their agents independently, which leaves large capability gaps between agents and poor coordination. Method: MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent on diverse subgoal-action pairs generated by the agent itself to enhance its generalization. Result: Experiments show that MOAT outperforms state-of-the-art baselines across six benchmarks, with average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks. Conclusion: MOAT improves collaboration between agents through iterative alignment; theoretical analysis proves that it ensures a non-decreasing and progressively convergent training process, and it outperforms state-of-the-art baselines on six benchmarks. Abstract: The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agent collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
[35] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
Siddarth Mamidanna,Daking Rai,Ziyu Yao,Yilun Zhou
Main category: cs.CL
TL;DR: The paper investigates the extent of utilization of causal self-attention and multilayer perceptron layers in large language models for mental math tasks and identifies an All-for-One subgraph that is crucial for high model performance.
Details
Motivation: The motivation is to understand the extent to which causal self-attention and multilayer perceptron layers are utilized in large language models for mental math tasks. Method: The paper uses two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), to identify an All-for-One subgraph (AF1) that is sufficient and necessary for high model performance. Result: The experiments show that the identified subgraph transfers across different models and works on a variety of input styles, and ablations on different CAMA and ABP alternatives reveal their unique advantages. Conclusion: The paper concludes that meaningful computation in large language models for mental math tasks occurs very late and only at the last token, which receives information from other tokens in a few specific middle layers. Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific middle layers. 
Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
[36] Steering MoE LLMs via Expert (De)Activation
Mohsen Fayyaz,Ali Modarressi,Hanieh Deilamsalehy,Franck Dernoncourt,Ryan Rossi,Trung Bui,Hinrich Schütze,Nanyun Peng
Main category: cs.CL
TL;DR: SteerMoE steers the behavior of large language models by controlling expert activation, improving safety and faithfulness while exposing hidden alignment faking.
Details
Motivation: Controlling the behavior tied to specific experts in MoE models has received little study, and a method is needed that can steer model behavior without retraining. Method: SteerMoE detects behavior-linked experts and selectively activates or deactivates them to control model behavior. Result: Across 11 benchmarks and 6 LLMs, SteerMoE raises safety by up to 20% and faithfulness by up to 27%; in adversarial attack mode, safety drops by as much as 100% when combined with existing jailbreak methods. Conclusion: SteerMoE enables behavioral control of Mixture-of-Experts LLMs and exposes alignment faking hidden within experts. Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
[37] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Runpeng Dai,Linfeng Song,Haolin Liu,Zhenwen Liang,Dian Yu,Haitao Mi,Zhaopeng Tu,Rui Liu,Tong Zheng,Hongtu Zhu,Dong Yu
Main category: cs.CL
TL;DR: This paper introduces a curiosity-driven exploration approach to improve RLVR methods for LLMs, addressing poor exploration and enhancing performance on reasoning tasks.
Details
Motivation: The motivation is to overcome poor exploration and premature convergence in current RLVR methods, which limit the reasoning ability of Large Language Models (LLMs). Method: The method involves introducing Curiosity-Driven Exploration (CDE) with signals from both the actor and critic to guide exploration within the RLVR framework. Result: The proposed method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks, demonstrating its effectiveness. Conclusion: The study concludes that incorporating curiosity-driven exploration in RLVR methods enhances response diversity and performance on benchmarks, and identifies a calibration collapse mechanism in RLVR. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
cs.CV [Back]
[38] Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
TL;DR: This paper proposes ReT-2, a unified multimodal retrieval model that handles queries containing both images and text and searches across multimodal document collections.
Details
Motivation: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods rely mainly on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. Method: ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities. Result: On the challenging M2KR and M-BEIR benchmarks, ReT-2 consistently achieves state-of-the-art performance across different retrieval configurations. Conclusion: ReT-2 is a unified retrieval model that supports multimodal queries and performs strongly across retrieval configurations, while offering faster inference and reduced memory usage. Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
[39] Diffusion-Based Action Recognition Generalizes to Untrained Domains
Rogerio Guimaraes,Frank Xiao,Pietro Perona,Markus Marks
Main category: cs.CV
TL;DR: This paper introduces an action recognition approach using a Vision Diffusion Model and transformer aggregation that significantly improves generalization across species, viewpoints, and contexts, achieving human-like robustness.
Details
Motivation: Current deep learning models struggle with human-like generalization in action recognition, particularly under varying contexts, species, and viewpoints. The goal is to bridge this gap and bring machine action recognition closer to human robustness. Method: The authors propose using a Vision Diffusion Model (VDM) to generate features, conditioned on earlier timesteps in the diffusion process, and aggregate these features using a transformer to achieve robust action recognition across varying contexts, species, and viewpoints. Result: The model achieves superior generalization performance across three challenging benchmarks: action classification across animal species, different viewing angles, and varying recording contexts, setting a new state-of-the-art in all of them. Conclusion: The proposed method achieves human-like robustness in action recognition by using features from a Vision Diffusion Model aggregated via a transformer, setting a new state-of-the-art in generalization benchmarks. Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. 
Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff
[40] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability
Tung Vu,Lam Nguyen,Quynh Dao
Main category: cs.CV
TL;DR: PromptGuard is a novel modular prompting framework that uses the VulnGuard Prompt technique and theoretical validation to prevent LLM-generated harm to vulnerable populations, establishing mathematical foundations for systematic empirical research.
Details
Motivation: Existing safety approaches rely on post-hoc filtering or generic alignment techniques and cannot proactively prevent harmful content from being generated; PromptGuard aims to prevent harm to vulnerable populations at the generation source. Method: The PromptGuard framework uses theoretical multi-objective optimization, analyzed through entropy bounds and Pareto optimality, and orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction. Result: Using real-world data-driven contrastive learning, PromptGuard reports a 25-30% analytical harm reduction and provides a comprehensive mathematical formalization and validation framework. Conclusion: Through its core modules and theoretical validation, the PromptGuard framework lays mathematical foundations for systematic empirical research and forms an intelligent expert system for real-time harm prevention. Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.
[41] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
Waqar Ahmad,Evan Murphy,Vladimir A. Krylov
Main category: cs.CV
TL;DR: This paper proposes Beta-SOD, a new outlier detection framework based on a Beta distribution mixture model, to counter the effect of label noise on Re-ID, and validates its superior performance on multiple datasets.
Details
Motivation: Existing object re-identification methods are highly sensitive to label noise, which typically leads to significant performance degradation. Method: Re-ID is reframed as a supervised image similarity task using a Siamese network architecture, together with Beta-SOD, a statistical outlier detection framework based on a Beta distribution mixture model. Result: Beta-SOD is more robust than existing methods across noise levels and applies to both person and vehicle Re-ID. Conclusion: By modeling the distribution of cosine similarities between embedding pairs, Beta-SOD mitigates the effect of label noise on Re-ID and shows superior performance on multiple datasets. Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
[42] SFD-Mamba2Net: Structure-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation
Nan Mu,Ruiqi Song,Zhihui Xu,Jingfeng Jiang,Chen Zhao
Main category: cs.CV
TL;DR: The study introduces SFD-Mamba2Net, a new framework for improving coronary artery segmentation and stenosis detection in ICA images, achieving superior performance over existing methods.
Details
Motivation: Coronary Artery Disease is a leading cause of death globally, and Invasive Coronary Angiography images are challenging to analyze due to low contrast, high noise levels, and complex vascular structures. This study aims to enhance segmentation and detection accuracy in ICA images. Method: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection, with a CASE module in the encoder and a PHFP module in the decoder. Result: SFD-Mamba2Net improved the accuracy of coronary artery segmentation and stenosis detection, outperforming state-of-the-art methods in multiple metrics. Conclusion: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection. Abstract: Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. 
In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.
[43] Live(r) Die: Predicting Survival in Colorectal Liver Metastasis
Muhammad Alberb,Helen Cheung,Anne Martel
Main category: cs.CV
TL;DR: This study proposes a new framework for predicting surgical outcomes in colorectal liver metastasis that combines automated segmentation with radiomics-based survival analysis and yields more accurate predictions.
Details
Motivation: Current prognostic models based on limited clinical or molecular features lack sufficient predictive power, especially in multifocal CRLM cases, so a new approach is needed. Method: The study proposes a fully automated surgical outcome prediction framework consisting of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by using promptable foundation models to complete missing labels. The study also proposes SAMONAI, a novel zero-shot 3D prompt propagation algorithm, and SurvAMINN, an autoencoder-based multiple instance neural network for survival analysis. Result: Extensive evaluation on an institutional dataset of 227 patients shows that the framework surpasses existing clinical and genomic biomarkers, with a C-index improvement exceeding 10%. Conclusion: The study demonstrates the potential of combining automated segmentation algorithms with radiomics-based survival analysis in colorectal liver metastasis (CRLM) to deliver accurate, annotation-efficient, and interpretable outcome prediction. Abstract: Colorectal cancer frequently metastasizes to the liver, significantly reducing long-term survival. While surgical resection is the only potentially curative treatment for colorectal liver metastasis (CRLM), patient outcomes vary widely depending on tumor characteristics along with clinical and genomic factors. Current prognostic models, often based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. We present a fully automated framework for surgical outcome prediction from pre- and post-contrast MRI acquired before surgery. Our framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by leveraging promptable foundation models to complete missing labels. Also, we propose SAMONAI, a novel zero-shot 3D prompt propagation algorithm that leverages the Segment Anything Model to segment 3D regions of interest from a single point prompt, significantly improving our segmentation pipeline's accuracy and efficiency. The predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts features from each tumor and predicts survival using SurvAMINN, a novel autoencoder-based multiple instance neural network for survival analysis. SurvAMINN jointly learns dimensionality reduction and hazard prediction from right-censored survival data, focusing on the most aggressive tumors.
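For readers unfamiliar with the C-index reported in the Result above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is not the authors' evaluation code: the function name `concordance_index`, its argument names, and the tie handling (pairs with equal times are skipped, equal risk scores count 0.5) are assumptions chosen for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times:  observed follow-up time per patient
    events: 1 if the event was observed, 0 if the patient was censored
    risks:  predicted risk score; a higher score means an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs tied in time are skipped
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # higher risk matched the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of comparable pairs, which is the scale on which an improvement in C-index is read.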
Extensive evaluation on an institutional dataset comprising 227 patients demonstrates that our framework surpasses existing clinical and genomic biomarkers, delivering a C-index improvement exceeding 10%. Our results demonstrate the potential of integrating automated segmentation algorithms and radiomics-based survival analysis to deliver accurate, annotation-efficient, and interpretable outcome prediction in CRLM.
[44] Discovering Divergent Representations between Text-to-Image Models
Lisa Dunlap,Joseph E. Gonzalez,Trevor Darrell,Fabian Caba Heilbron,Josef Sivic,Bryan Russell
Main category: cs.CV
TL;DR: This paper proposes CompCon, a method for comparing how visual representations differ between text-to-image models, and uses it to find model-specific output characteristics triggered by particular prompts.
Details
Motivation: The goal is to study how two text-to-image models differ in their visual representations, and in particular to find visual attributes that appear in one model's outputs but not the other's under specific prompts. Method: The authors introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other's and uncovers the prompt concepts linked to these visual differences. They also create the ID2 dataset to evaluate CompCon. Result: CompCon discovers visual attribute differences between several text-to-image models, and the ID2 dataset is used to validate the method. CompCon also outperforms several LLM- and VLM-based baselines. Conclusion: CompCon reveals divergent visual representations between text-to-image models; for example, PixArt depicts prompts mentioning loneliness with wet streets, while Stable Diffusion 3.5 depicts African American people in media professions. Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
[45] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery
Yibin Wang,Wondimagegn Beshah,Padmanava Dash,Haifeng Wang
Main category: cs.CV
TL;DR: This study develops a deep-learning image correction method that removes cloud shadow and sun glint from UAS imagery, improving the accuracy of water quality parameter estimation.
Details
Motivation: Unmanned aerial system (UAS) imagery used in remote sensing is affected by cloud shadows and by sun glint over water, which degrades the accuracy of water quality parameter estimation, so an effective image correction method is needed. Method: The study trains a U-Net-based deep learning model on pixel-level data to identify and separate cloud shadow and sun glint regions, and selects the best model settings using evaluation metrics from test cases. Result: The trained deep learning model identifies and recovers regions affected by cloud shadow and sun glint, improving image quality and the accuracy of subsequent analysis. Conclusion: A U-Net-based deep learning model can identify and recover cloud shadow and sun glint regions in UAS imagery, improving image quality and supporting more accurate estimation of water quality parameters. Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain as per required spatial, spectral, temporal, and radiometric resolutions for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. The limitation of the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two effects pose serious issues for estimating water quality parameters from the UAS images. This study proposes a novel machine learning approach first to identify and extract regions with cloud shadows and sun glint and separate such regions from non-obstructed clear sky regions and sun-glint unaffected regions. The data were extracted from the images at pixel level to train a U-Net based deep learning model, and the best settings for model training were identified based on the various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.
[46] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Puskal Khadka,Rodrigue Rizk,Longwei Wang,KC Santosh
Main category: cs.CV
TL;DR: CoSwin improves Vision Transformers by combining global attention with local convolution, enhancing performance on small image datasets.
Details
Motivation: Vision Transformers struggle with local feature extraction on small datasets due to a lack of inductive biases like locality and translation equivariance, which CoSwin aims to address. Method: CoSwin combines a learnable local feature enhancement module with hierarchical shifted window attention to capture both fine-grained spatial details and global semantic structure. Result: CoSwin outperforms state-of-the-art models on several benchmarks, achieving significant improvements over the baseline Swin Transformer on CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Conclusion: The proposed CoSwin architecture effectively enhances Vision Transformers by integrating localized convolutional feature learning with hierarchical shifted window attention, improving performance on small-scale vision tasks. Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. 
These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin
[47] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning
Karim Slimani,Catherine Achard,Brahim Tamadazte
Main category: cs.CV
TL;DR: iMatcher is a deep learning framework for point cloud registration that improves feature matching through local and global geometric consistency learning, reaching state-of-the-art inlier ratios.
Details
Motivation: Feature matching in point cloud registration requires geometric consistency, but existing methods may not guarantee local and global consistency at the same time; iMatcher is proposed to improve registration performance. Method: iMatcher uses a local graph embedding module to initialize the score matrix, then refines the matrix through bilateral source-to-target and target-to-source nearest neighbor search in 3D space. Finally, global geometric consistency learning predicts point-wise matching probabilities. Result: Experiments on real-world datasets such as KITTI, KITTI-360, and 3DMatch, and on the TUD-L and MVP-RG tasks, show that iMatcher significantly improves rigid registration performance, with inlier ratios of 95%-97% on KITTI, 94%-97% on KITTI-360, and up to 81.1% on 3DMatch. Conclusion: iMatcher is a fully differentiable feature matching framework for point cloud registration that uses learned features to predict a geometrically consistent confidence matrix, combining local and global consistency. Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
[48] UltrON: Ultrasound Occupancy Networks
Magdalena Wysocki,Felix Duelmer,Ananya Bal,Nassir Navab,Mohammad Farid Azampour
Main category: cs.CV
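TL;DR: This paper proposes UltrON, a method that uses acoustic features from B-mode images and a novel loss function to improve geometric consistency in weakly supervised optimization, mitigating the limitations of occlusions and sparse labeling and enabling more accurate 3D reconstruction.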
Details
Motivation: Conventional methods depend on precise annotations and struggle with the view-dependent nature of ultrasound and with acoustic shadowing artifacts, so an improved shape reconstruction method is needed. Method: An occupancy-based representation and UltrON are proposed, optimized using acoustic features from B-mode images and a novel loss function. Result: UltrON mitigates the limitations of occlusions and sparse labeling and generalizes to shapes of the same anatomy. Conclusion: By exploiting acoustic features and a novel loss function, UltrON improves geometric consistency in weakly supervised optimization and paves the way for more accurate 3D reconstruction. Abstract: In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound's view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON that leverages acoustic features to improve geometric consistency in a weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy.
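For readers unfamiliar with the two implicit representations named in this abstract, the following is a minimal sketch of how a signed distance function (SDF) and an occupancy function describe the same shape. It is an illustration only, not the UltrON model: the unit sphere, the function names `sphere_sdf` and `occupancy_from_sdf`, and the convention that surface points count as occupied are all assumptions made for this example.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(p, center) - radius

def occupancy_from_sdf(sdf_value):
    """Binary occupancy: 1.0 inside or on the surface, 0.0 outside (assumed convention)."""
    return 1.0 if sdf_value <= 0.0 else 0.0

# Query points along the x-axis: centre, halfway, on the surface, outside.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sdf_values = [sphere_sdf(p) for p in points]
occupancies = [occupancy_from_sdf(v) for v in sdf_values]
```

The SDF carries a metric distance at every query point, whereas occupancy only records inside versus outside; a learned occupancy network would typically output a probability in [0, 1] rather than the hard threshold used here.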
We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.
[49] Implicit Neural Representations of Intramyocardial Motion and Strain
Andrew Bell,Yan Kit Choi,Steffen Peterson,Andrew King,Muhummad Sohaib Nazir,Alistair Young
Main category: cs.CV
TL;DR: A new method using INRs with learned latent codes enables fast and accurate quantification of left ventricular motion and strain from tagging MRI, outperforming deep learning baselines in both accuracy and speed.
Details
Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. Method: A method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular displacement without requiring inference-time optimization. Result: Evaluated on 452 UK Biobank test cases, the method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. It is also ~380x faster than the most accurate baseline. Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large CMR datasets. Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement -- without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is ~380x faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.
[50] E-MLNet: Enhanced Mutual Learning for Universal Domain Adaptation with Sample-Specific Weighting
Samuel Felipe dos Santos,Tiago Agostinho de Almeida,Jurandy Almeida
Main category: cs.CV
TL;DR: E-MLNet improves UniDA performance by dynamically focusing on critical class boundaries, surpassing MLNet in accuracy and robustness.
Details
Motivation: MLNet's equal treatment of classifiers dilutes learning signals, prompting the need for a focused adaptation strategy. Method: E-MLNet introduces a dynamic weighting strategy to OEM, leveraging closed-set predictions to sharpen known-unknown distinctions. Result: E-MLNet outperforms MLNet on VisDA and ImageCLEF benchmarks, excelling in 22/31 Open-Partial DA and 19/31 Open-Set DA tasks. Conclusion: E-MLNet demonstrates superior performance in UniDA by focusing on relevant class boundaries, achieving higher H-scores and robustness compared to MLNet. Abstract: Universal Domain Adaptation (UniDA) seeks to transfer knowledge from a labeled source to an unlabeled target domain without assuming any relationship between their label sets, requiring models to classify known samples while rejecting unknown ones. Advanced methods like Mutual Learning Network (MLNet) use a bank of one-vs-all classifiers adapted via Open-set Entropy Minimization (OEM). However, this strategy treats all classifiers equally, diluting the learning signal. We propose the Enhanced Mutual Learning Network (E-MLNet), which integrates a dynamic weighting strategy to OEM. By leveraging the closed-set classifier's predictions, E-MLNet focuses adaptation on the most relevant class boundaries for each target sample, sharpening the distinction between known and unknown classes. We conduct extensive experiments on four challenging benchmarks: Office-31, Office-Home, VisDA-2017, and ImageCLEF. The results demonstrate that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and exhibits superior robustness over its predecessor. E-MLNet outperforms the strong MLNet baseline in the majority of individual adaptation tasks -- 22 out of 31 in the challenging Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting -- confirming the benefits of our focused adaptation strategy.
[51] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Umair Hassan
Main category: cs.CV
TL;DR: This paper presents COCO-Urdu, a large-scale Urdu image-caption dataset intended to reduce language bias in multimodal research.
Details
Motivation: Urdu is under-resourced in multimodal and vision-language research, and the lack of large-scale, high-quality datasets has limited the development of Urdu-capable systems. Method: Data are selected from MS COCO by stratified sampling, translated with SeamlessM4T v2, and validated with COMET-Kiwi, CLIP-based similarity, and BERTScore. Result: A dataset of 59,000 images and 319,000 Urdu captions was built, with strong results on BLEU, SacreBLEU, and chrF. Conclusion: COCO-Urdu is expected to reduce language bias in multimodal research and lay a foundation for inclusive vision-language systems. Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
[52] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI
Chenqian Le,Yilin Zhao,Nikasadat Emami,Kushagra Yadav,Xujin "Chris" Liu,Xupeng Chen,Yao Wang
Main category: cs.CV
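TL;DR: This paper proposes VoxelFormer, a method for decoding visual information from fMRI data that can be trained across multiple subjects, reducing parameter requirements and improving efficiency.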
Details
Motivation: Most existing fMRI-based visual decoding methods rely on subject-specific training, which limits scalability and practical deployment, so a more efficient and scalable method is needed. Method: VoxelFormer combines a Token Merging Transformer (ToMer) for efficient voxel compression with a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Result: Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training, with significantly fewer parameters than existing methods. Conclusion: VoxelFormer is a lightweight transformer architecture that supports multi-subject training for visual decoding from fMRI, highlighting the potential of token merging and query-based transformers for parameter-efficient neural decoding. Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
[53] Integrating Anatomical Priors into a Causal Diffusion Model
Binxu Li,Wei Peng,Mingjie Li,Ehsan Adeli,Kilian M. Pohl
Main category: cs.CV
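TL;DR: This paper proposes the Probabilistic Causal Graph Model (PCGM) for generating high-quality brain MRIs that better preserve subtle anatomical variations, and shows for the first time that counterfactual images can replicate the subtle effects of a disease on cortical brain regions.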
Details
Motivation: 3D brain MRI studies often examine subtle morphometric differences that are hard to detect visually, and given the high cost of MRI acquisition, these studies could benefit from image synthesis, particularly counterfactual image generation. However, counterfactual models struggle to produce anatomically plausible MRIs because they lack explicit inductive biases for preserving fine-grained anatomical details. Method: A new approach called the Probabilistic Causal Graph Model (PCGM) is proposed, which explicitly integrates anatomical constraints as priors into a generative diffusion framework. Result: PCGM generates structural brain MRIs of higher quality than several baseline approaches, and brain measurements extracted from PCGM-generated counterfactuals are shown for the first time to replicate the subtle effects of a disease on cortical brain regions reported in the neuroscience literature. Conclusion: PCGM generates high-quality structural brain MRIs and is the first approach whose counterfactual-derived brain measurements replicate the subtle effects of a disease on cortical brain regions. Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image synthesis, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints at the voxel level as a prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches.
Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.
[54] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
Qiuhui Chen,Xuancheng Yao,Huping Ye,Yi Hong
Main category: cs.CV
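TL;DR: Med3DInsight is a new approach to 3D medical image understanding that requires no human annotations, combines 3D image encoders with 2D MLLMs, and shows superior performance across multiple tasks and datasets.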
Details
Motivation: Existing 3D medical convolution and transformer-based self-supervised learning methods often lack deep semantic understanding, while recent advances in multimodal large language models (MLLMs) offer a promising way to enhance image understanding through text descriptions. Method: A new pretraining framework, Med3DInsight, is proposed that integrates 3D image encoders with 2D MLLMs through a specially designed plane-slice-aware transformer module and uses a partial optimal transport based alignment. Result: Med3DInsight outperforms current SSL methods on two downstream tasks, segmentation and classification, across various public datasets with CT and MRI modalities. Conclusion: Med3DInsight offers a new paradigm for scalable multimodal 3D medical representation learning without human annotations and achieves state-of-the-art performance on segmentation and classification tasks. Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to potential noise in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.
[55] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
Hesham M. Shehata,Mohammad Abdolrahmani
Main category: cs.CV
TL;DR: This paper proposes a multi-task learning method that incorporates fixed object information and interaction area data to enhance human action recognition, achieving 99.25% accuracy and outperforming the base model by 2.75%.
Details
Motivation: Existing graph convolutional neural networks (GCNs) perform well in human action recognition using skeleton poses but struggle with detecting human-object interactions due to a lack of effective scene representation and appropriate learning architectures. Method: The authors proposed a multi-task learning methodology that incorporates fixed object information in the environment to improve human action recognition. They evaluated the approach using a dataset they collected, which includes interaction classes (e.g., interacting with ATMs or check-in machines) and non-interaction classes (e.g., walking and standing). Result: The proposed multi-task learning approach achieved an accuracy of 99.25%, outperforming the base model using only human skeleton poses by 2.75%. Conclusion: The proposed multi-task learning approach, incorporating interaction area information and fixed object data, significantly enhances the recognition of interaction and non-interaction actions compared to the base model relying solely on human skeleton poses. Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, they fail to detect human-object interaction cases successfully due to the lack of effective representation of the scene information and appropriate learning architectures. In this context, we propose a methodology to improve human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing.
The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.
[56] IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Jifeng Shen,Haibo Zhan,Xin Zuo,Heng Fan,Xiaohui Yuan,Jun Li,Wankou Yang
Main category: cs.CV
TL;DR: IRDFusion is a novel feature fusion framework that improves multispectral object detection by adaptively enhancing salient structures and suppressing shared background interference.
Details
Motivation: Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. Method: IRDFusion integrates the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM) into a unified framework for adaptive cross-modal feature fusion. Result: IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios on FLIR, LLVIP, and M³FD datasets. Conclusion: IRDFusion demonstrates robustness and effectiveness in multispectral object detection by achieving state-of-the-art performance on FLIR, LLVIP, and M³FD datasets. Abstract: Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities.
To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M³FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
[57] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Hengyu Fang,Yijiang Liu,Yuan Du,Li Du,Huanrui Yang
Main category: cs.CV
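TL;DR: This work proposes SQAP-VLA, a new VLA inference acceleration framework that co-designs the quantization and token pruning pipeline to achieve efficient, training-free inference, improving both inference speed and model performance.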
Details
Motivation: Existing VLA compression and acceleration approaches apply quantization or token pruning in an ad hoc manner but fail to enable both for a holistic efficiency gain, so a method that optimizes quantization and token pruning together is needed. Method: The quantization and token pruning pipeline is co-designed: new quantization-aware token pruning criteria are proposed and the quantizer design is improved to enhance pruning effectiveness, overcoming the incompatibility between quantization and token pruning. Result: Applied to standard VLA models, SQAP-VLA achieves a 1.93× inference speedup and up to a 4.5% improvement in average success rate compared to the original model. Conclusion: SQAP-VLA is a structured, training-free VLA inference acceleration framework that enables state-of-the-art quantization and token pruning simultaneously, significantly improving computational efficiency and inference speed while preserving core model performance. Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.
[58] S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization
Chenghao Zhang,Lun Luo,Si-Yuan Cao,Xiaokai Bai,Yuncheng Jin,Zhu Yu,Beinan Yu,Yisen Wang,Hui-Liang Shen
Main category: cs.CV
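TL;DR: This paper proposes S-BEVLoc, a self-supervised framework for bird's-eye view (BEV) based LiDAR global localization that removes the dependence on ground-truth poses and achieves state-of-the-art performance on large-scale datasets.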
Details
Motivation: Existing supervised methods require high-precision ground-truth poses, which are costly and laborious to acquire. A self-supervised method that needs no ground-truth poses is therefore required to improve scalability. Method: S-BEVLoc constructs training triplets from single BEV images by using the known geographic distances between keypoint-centered BEV patches. A convolutional neural network (CNN) extracts local features, and NetVLAD aggregates global descriptors. A SoftCos loss is also introduced to enhance learning from the generated triplets. Result: Experiments on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization, while offering scalability that supervised methods could match only with extra effort. Conclusion: S-BEVLoc is an effective self-supervised method for LiDAR-based global localization that removes the dependence on ground-truth poses and offers good scalability and performance. Abstract: LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. A convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.
[59] FPI-Det: a face--phone Interaction Dataset for phone-use detection and understanding
Jianqin Gao,Tianqi Wang,Yu Zhang,Yishu Zhang,Chenyuan Wang,Allan Dong,Zihao Wang
Main category: cs.CV
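TL;DR: This paper presents FPI-Det, a dataset for detecting phone-use behavior that targets safety monitoring, workplace productivity assessment, and related applications, and provides baseline results and analysis.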
Details
Motivation: The widespread use of mobile devices poses new challenges for vision systems, which must understand fine-grained human-device interactions that existing datasets do not adequately address. Method: The FPI-Det dataset, containing 22,879 images, is introduced, and the performance of YOLO and DETR detectors is evaluated. Result: The dataset covers multiple scenarios and features extreme scale variation, frequent occlusions, and varied capture conditions, with baseline results provided for different detectors. Conclusion: FPI-Det is a new dataset for the fine-grained human-device interaction problem of detecting phone use across different scenarios, and it provides baseline results. Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human-device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.
[60] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Junhao Xing,Ryohei Miyakawa,Yang Yang,Xinpeng Liu,Risa Shinoda,Hiroaki Santo,Yosuke Toda,Fumio Okura
Main category: cs.CV
TL;DR: This paper introduces ZeroPlantSeg, a zero-shot method for segmenting rosette-shaped plant individuals from top-view images.
Details
Motivation: Plant individual segmentation normally requires large amounts of species-specific annotated data and considerable human labor; to address this, a zero-shot plant segmentation method is introduced. Method: A foundation segmentation model is integrated with a vision-language model. Result: In evaluations on datasets covering multiple plant species, growth stages, and shooting environments, the method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Conclusion: ZeroPlantSeg is a training-free plant individual segmentation method that efficiently extracts multi-leaf plant individuals by integrating a foundation segmentation model with a vision-language model. Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
[61] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng,Yifan Zhang,Xiang An,Ziyong Feng,Kaicheng Yang,Qichuan Ding
Main category: cs.CV
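TL;DR: This paper proposes the GA-DMS framework for person representation learning and builds WebPerson, a large-scale dataset, to improve the application of Contrastive Language-Image Pre-training (CLIP) to person-centric images.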
Details
Motivation: Contrastive Language-Image Pre-training (CLIP) performs strongly across diverse vision tasks, but faces two key challenges in person representation learning: the scarcity of large-scale annotated person-centric vision-language data, and the inherent limitations of global contrastive learning, which struggles to preserve the discriminative local features needed for fine-grained matching and is vulnerable to noisy text tokens. Method: First, a noise-resistant data construction pipeline uses the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images, yielding WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework improves cross-modal alignment by adaptively masking noisy text tokens based on the gradient-attention similarity score. In addition, masked token prediction objectives compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Result: Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks. Conclusion: By improving data construction and model architecture, this work advances CLIP for person representation learning; the proposed GA-DMS framework and WebPerson dataset offer an effective solution for the field. Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
[62] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain
Bin Huang,Kang Chen,Bingxuan Li,Huafeng Liu,Qiegen Liu
Main category: cs.CV
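TL;DR: ALL-PET is a low-resource, low-shot PET imaging foundation model that achieves high-quality imaging and cross-task generalization through a Radon mask augmentation strategy, a dynamic multi-mask mechanism, and transparent medical attention.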
Details
Motivation: Building large-scale foundation models for PET imaging is limited by the lack of labeled data and insufficient computational resources, so an efficient solution with low data requirements is needed. Method: Built on a latent diffusion model (LDM), the approach proposes a Radon mask augmentation strategy (RMAS), a dynamic multi-mask (DMM) mechanism, and transparent medical attention (TMA) to improve data diversity and generation quality while ensuring geometric consistency. Result: Experiments show that ALL-PET generates high-quality sinograms using only 500 samples and generalizes to low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, with memory use under 24GB. Conclusion: With three innovations, ALL-PET delivers a high-quality PET imaging foundation model under low-resource, low-shot conditions, overcoming data scarcity and limited computational resources. Abstract: Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets.
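The idea of projecting a randomized image-domain mask into sinogram space, mentioned for RMAS above, can be illustrated with a rough sketch. This is not the ALL-PET implementation: the disk-shaped mask, the grid size, the number of angles, and the nearest-bin parallel-beam projection are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random

def random_disk_mask(size, rng):
    """Binary image-domain mask holding one disk with random centre and radius (hypothetical shape)."""
    cx = rng.uniform(0.3, 0.7) * size
    cy = rng.uniform(0.3, 0.7) * size
    r = rng.uniform(0.1, 0.25) * size
    return [[1.0 if (x - cx) ** 2 + (y - cy) ** 2 <= r * r else 0.0
             for x in range(size)] for y in range(size)]

def project_to_sinogram(mask, n_angles):
    """Crude parallel-beam projection: each pixel is added to the nearest detector bin per angle."""
    size = len(mask)
    n_bins = int(math.ceil(size * math.sqrt(2))) + 1  # wide enough for the image diagonal
    c = (size - 1) / 2.0                               # rotate about the image centre
    half = (n_bins - 1) / 2.0
    sinogram = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        row = [0.0] * n_bins
        for y in range(size):
            for x in range(size):
                if mask[y][x]:
                    # Signed position of the pixel along the detector axis.
                    t = (x - c) * math.cos(theta) + (y - c) * math.sin(theta)
                    row[int(round(t + half))] += mask[y][x]
        sinogram.append(row)
    return sinogram

rng = random.Random(0)
mask = random_disk_mask(32, rng)
sinogram = project_to_sinogram(mask, n_angles=8)
```

Each sinogram row is one projection angle, and every row sums to the number of mask pixels, since projection only redistributes the mask's mass along the detector. A production pipeline would more likely use an interpolating Radon transform from an imaging library.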
ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
[63] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology
Dylan Peek,Matthew P. Skerritt,Stephan Chalup
Main category: cs.CV
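TL;DR: This study compares the performance of artificial neural networks and persistent homology under structural noise, showing that ANNs are more noise-robust when predicting Betti numbers.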
Details
Motivation: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. The study examines the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. Method: A PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT) is compared under noise against an ANN trained to predict the Betti numbers of 2D binary images. Result: Using one synthetic and two real-world datasets, ANNs are shown to outperform the PH approach under noise, likely because they can learn contextual and geometric priors from training data. Conclusion: Though still emerging, the use of ANNs for topology estimation offers a promising alternative to PH under structural noise. Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
[64] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation
Yuiko Uchida,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
Main category: cs.CV
TL;DR: This paper introduces OSIM, an object-centric evaluation metric for 3D scenes that better aligns with human perception than traditional metrics.
Details
Motivation: Existing metrics assess overall image quality but often lead to discrepancies with human perception. The authors hypothesize that human recognition of 3D scenes involves focusing on individual objects, which inspired the development of an object-centric evaluation metric. Method: The paper proposes Objectness SIMilarity (OSIM), which uses an object detection model and its feature representations to quantify the 'objectness' of each object in a 3D scene. Additionally, a user study was conducted to evaluate the alignment of OSIM with human perception, and recent 3D reconstruction and generation models were re-evaluated using a standardized setup. Result: The proposed OSIM metric demonstrates better alignment with human perception than existing metrics. The paper also provides insights into the characteristics of OSIM and re-evaluates recent 3D reconstruction and generation models. Conclusion: The paper concludes that OSIM aligns more closely with human perception compared to existing metrics, offering a more object-centric approach to evaluating 3D scenes. Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. 
Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.[65] Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang,Piotr Koniusz,Yongsheng Gao
Main category: cs.CV
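TL;DR: This paper analyzes the evolution of video understanding models from a dataset-driven perspective and offers practical guidance for model design.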
Details
Motivation: Existing surveys mainly classify models by task or model family, overlooking how datasets guide the evolution of model architectures. Method: The survey reinterprets major milestones, from two-stream networks and 3D CNNs to sequential models, Transformers, and multimodal foundation models, to analyze how datasets shape model architectures. Result: The paper shows how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases on models, and provides both a retrospective and a roadmap for general-purpose video understanding. Conclusion: The paper presents a dataset-driven perspective for analyzing video understanding models and offers practical guidance for model design. Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.[66] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge
JaeWoong Shin,Jeongun Ryu,Aaron Valero Puche,Jinhee Lee,Biagio Brattoli,Wonkyung Jung,Soo Ick Cho,Kyunghyun Paeng,Chan-Young Ock,Donggeun Yoo,Zhaoyang Li,Wangkai Li,Huayu Mai,Joshua Millward,Zhen He,Aiden Nibali,Lydia Anette Schoenpflug,Viktor Hendrik Koelzer,Xu Shuoyu,Ji Zheng,Hu Bin,Yu-Wen Lo,Ching-Hui Yang,Sérgio Pereira
Main category: cs.CV
TL;DR: The OCELOT 2023 challenge showed that integrating cell-tissue interactions into deep learning models significantly improves cell detection performance, highlighting the importance of multi-scale analysis for achieving human-like diagnostic capabilities.
Details
Motivation: Pathologists routinely use multiple magnifications to diagnose tissue samples, but existing deep learning models struggle to replicate this behavior. This study aimed to validate the hypothesis that understanding cell-tissue interactions is crucial for achieving human-level performance. Method: The study used a dataset from the OCELOT 2023 challenge with overlapping cell detection and tissue segmentation annotations across six organs. It compared models that incorporated cell-tissue relationships with a baseline cell-only model. Result: Top models in the challenge achieved up to a 7.99 increase in F1-score compared to the baseline model, demonstrating the importance of incorporating multi-scale semantics into cell detection models. Conclusion: The OCELOT 2023 challenge demonstrated that incorporating multi-scale semantics, particularly cell-tissue interactions, significantly improves the performance of cell detection models, bringing them closer to human-level performance. Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. 
The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.[67] RT-DETR++ for UAV Object Detection
Yuan Shufang
Main category: cs.CV
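TL;DR: This paper proposes RT-DETR++, which improves the encoder component of the RT-DETR model to address the challenge of detecting small, densely packed objects in UAV imagery while maintaining real-time detection speed and computational efficiency.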
Details
Motivation: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges, such as densely packed small objects, scale variations, and occlusion. Method: The paper improves the encoder component of the RT-DETR model with a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism and introduces CSP-PAC during feature fusion, which uses parallel atrous (dilated) convolutions to process local and contextual information. Result: The new neck design performs well on small and densely packed objects while maintaining sufficient real-time detection speed without increasing computational complexity. Conclusion: RT-DETR++ provides an effective approach to feature encoding design for real-time detection systems. Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel atrous (dilated) convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.[68] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
Zhiyue Liu,Sihang Liu,Jinyuan Liu,Xinru Zhang
Main category: cs.CV
TL;DR: This paper proposes a training-free framework for knowledge-based visual question answering that focuses on retrieving and integrating relevant knowledge to reduce noise and redundancy, leading to improved performance over existing methods.
Details
Motivation: Existing approaches to KB-VQA often introduce noise by augmenting models with redundant retrieved information, which can negatively impact the answering process. Method: The framework includes creating low-noise queries for knowledge retrieval, prompting large models to extract beneficial knowledge segments, and introducing a selective knowledge integration strategy. Result: Extensive experiments show that the proposed framework outperforms state-of-the-art methods in KB-VQA tasks. Conclusion: The proposed training-free framework with knowledge focusing for KB-VQA enables the acquisition of accurate and critical knowledge, outperforming state-of-the-art methods. Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework distills the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. 
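The selective knowledge integration strategy described above (use retrieved knowledge only when the model lacks confidence) can be illustrated with a small control-flow sketch. The abstract does not state how confidence is measured or what cutoff is used, so the `threshold` value, the `answer_fn` interface, and the `toy_model` stand-in below are all assumptions made purely for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: takes a question and optional knowledge text,
# returns (answer, confidence in [0, 1]).
AnswerFn = Callable[[str, Optional[str]], Tuple[str, float]]


def answer_selectively(
    question: str,
    knowledge: str,
    answer_fn: AnswerFn,
    threshold: float = 0.7,  # assumed cutoff, not from the paper
) -> Tuple[str, bool]:
    """Answer without knowledge first; fall back to knowledge only if unsure.

    Returns the final answer and a flag saying whether knowledge was used.
    """
    answer, confidence = answer_fn(question, None)
    if confidence >= threshold:
        return answer, False
    answer_with_knowledge, _ = answer_fn(question, knowledge)
    return answer_with_knowledge, True


def toy_model(question: str, knowledge: Optional[str]) -> Tuple[str, float]:
    """Stand-in for a real VQA model; confident only on one hard-coded question."""
    if knowledge is not None:
        return f"answer grounded in: {knowledge}", 0.9
    if question == "What color is the sky?":
        return "blue", 0.95
    return "unsure", 0.3


print(answer_selectively("What color is the sky?", "irrelevant text", toy_model))
print(answer_selectively("Who designed this bridge?", "Designed by X.", toy_model))
```

The point of the gate is that a confident answer never sees the retrieved text, so redundant or noisy knowledge cannot degrade it.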
Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.[69] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Yulin Tong,Fengzong Zhang,Haiqin Cheng
Main category: cs.CV
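TL;DR: This study proposes CWSSNet, a new hyperspectral image classification framework that combines 3D spectral-spatial features with wavelet convolution to improve the accuracy and robustness of ground object classification.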
Details
Motivation: Although hyperspectral images are rich in spectral information, their numerous bands, high dimensionality, and spectral mixing tend to cause feature redundancy, so finer ground object classification methods are needed. Method: Using hyperspectral images from the ZY1F satellite as the data source, the study proposes a classification framework named CWSSNet, which integrates multimodal information with a multiscale convolutional attention module and introduces multi-band decomposition and convolution operations in the wavelet domain. Result: In experiments in Yugan County, CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1), respectively, and obtained the highest Intersection over Union (IoU) for water bodies, vegetation, and bare land, demonstrating good robustness. Conclusion: By integrating 3D spectral-spatial features and wavelet convolution, CWSSNet breaks through the classification performance bottleneck of traditional methods and maintains reliable performance under small-sample training conditions. Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. 
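For readers unfamiliar with the three reported metrics, here is a minimal sketch of how mIoU, mAcc, and mF1 are commonly computed from a per-class confusion matrix. It assumes mAcc means the mean of per-class accuracy (recall), which is a common convention in segmentation work but is not confirmed by the abstract; the function name and the toy confusion matrix are illustrative only and unrelated to the paper's data.

```python
def segmentation_metrics(confusion):
    """Compute mIoU, mAcc, and mF1 from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious, accs, f1s = [], [], []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                   # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp   # pixels wrongly labelled c
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        accs.append(tp / (tp + fn) if (tp + fn) else 0.0)            # per-class accuracy (recall)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return {
        "IoU": ious,
        "mIoU": sum(ious) / num_classes,
        "mAcc": sum(accs) / num_classes,
        "mF1": sum(f1s) / num_classes,
    }


# Toy two-class example (made-up counts).
metrics = segmentation_metrics([[8, 2], [1, 9]])
print({k: v for k, v in metrics.items() if k != "IoU"})
```

Since per-class F1 equals 2·IoU/(1+IoU), mF1 is never below mIoU, which matches the ordering of the reported 74.50% and 84.94%.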
Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.[70] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li,Xiaoxiao Wang,Meiling Li,Boming Miao,Peng Sun,Yunjian Zhang,Xiangyang Ji,Yao Zhu
Main category: cs.CV
TL;DR: This paper introduces RRDataset to evaluate AI-generated image detection methods in real-world conditions and highlights the limitations of current methods while emphasizing the importance of human adaptability for more robust detection algorithms.
Details
Motivation: The motivation of the paper is to address the research gap in evaluating AI-generated image detection methods under complex real-world conditions, which existing methods have not fully addressed. Method: The paper introduces the Real-World Robustness Dataset (RRDataset), which evaluates detection models across three dimensions: Scenario Generalization, Internet Transmission Robustness, and Re-digitization Robustness. It benchmarks 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducts a large-scale human study involving 192 participants to investigate human few-shot learning capabilities. Result: The benchmarking results reveal the limitations of current AI detection methods in real-world scenarios and highlight the effectiveness of human adaptability in detecting AI-generated images. Conclusion: The paper concludes that current AI-generated image detection methods have limitations under real-world conditions and emphasizes the importance of incorporating human adaptability to develop more robust detection algorithms. Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 
2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.[71] Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Jiasheng Guo,Xin Gao,Yuxiang Yan,Guanghao Li,Jian Pu
Main category: cs.CV
TL;DR: Dark-ISP is a lightweight and self-adaptive RAW image processing plugin that improves low-light object detection by integrating differentiable ISP modules optimized for detection tasks.
Details
Motivation: Existing low-light object detection methods suffer from information loss or complex frameworks; RAW images offer better potential but lack efficient and adaptive processing techniques. Method: Dark-ISP deconstructs traditional ISP pipelines into differentiable components (linear sensor calibration and nonlinear tone mapping) and introduces a Self-Boost mechanism for module cooperation, enabling task-driven optimization. Result: Dark-ISP outperforms state-of-the-art RGB- and RAW-based methods on three datasets while using minimal parameters. Conclusion: Dark-ISP achieves superior performance in low-light object detection by processing Bayer RAW images directly, offering lightweight, self-adaptive processing with seamless end-to-end training. Abstract: Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these limitations, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. 
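To make the "sequential linear then nonlinear" ISP structure concrete, the sketch below applies a fixed linear stage (black-level subtraction, normalization, per-channel gain) followed by a fixed gamma tone curve to a single RGB pixel. This is a deliberately simplified illustration: in Dark-ISP the modules are described as learned, content-aware, and applied to Bayer RAW data, none of which is modelled here, and the black level, white level, gains, and gamma are made-up values.

```python
def linear_stage(pixel, black_level, white_level, gains):
    """Linear sensor-calibration step: normalize to [0, 1] and apply per-channel gain."""
    out = []
    for value, gain in zip(pixel, gains):
        normalized = (value - black_level) / (white_level - black_level)
        out.append(min(max(normalized * gain, 0.0), 1.0))  # clip to the valid range
    return out


def tone_map(pixel, gamma):
    """Nonlinear tone-mapping step: a plain gamma curve that lifts dark values."""
    return [value ** (1.0 / gamma) for value in pixel]


def simple_isp(pixel, black_level=64, white_level=1023,
               gains=(2.0, 1.0, 1.5), gamma=2.2):
    """Cascade the two stages in the order linear -> nonlinear (all parameters assumed)."""
    return tone_map(linear_stage(pixel, black_level, white_level, gains), gamma)


# A dark 10-bit pixel becomes noticeably brighter after the cascade.
print(simple_isp((100, 100, 100)))
```

Both stages are built from differentiable operations apart from the clipping, which is why such a cascade can in principle be trained end to end with a detection loss, as the abstract describes.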
Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.[72] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results
Hanwei Zhu,Haoning Wu,Zicheng Zhang,Lingyu Zhu,Yixuan Li,Peilin Chen,Shiqi Wang,Chris Wei Zhou,Linhan Cao,Wei Sun,Xiangyang Zhu,Weixia Zhang,Yucheng Zhu,Jing Liu,Dandan Zhu,Guangtao Zhai,Xiongkuo Min,Zhichao Zhang,Xinyue Li,Shubo Xu,Anh Dao,Yifan Li,Hongyuan Yu,Jiaojiao Yi,Yiding Tian,Yupeng Wu,Feiran Sun,Lijuan Liao,Song Jiang
Main category: cs.CV
TL;DR: The VQualA 2025 Challenge introduced a new benchmark for evaluating Large Multimodal Models' ability to reason about visual quality differences, highlighting progress in open-domain quality reasoning and setting the stage for future research in human-aligned evaluation systems.
Details
Motivation: To evaluate and enhance the ability of state-of-the-art Large Multimodal Models (LMMs) to perform detailed reasoning about visual quality differences across multiple images. Method: The challenge introduced a novel benchmark with thousands of coarse-to-fine grained visual quality comparison tasks, including single images, pairs, and multi-image groups, evaluated through 2AFC-based binary preference and multi-choice questions (MCQs). Result: Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs in performing quality assessment tasks. Conclusion: The VQualA 2025 Challenge represents a significant advancement in open-domain visual quality reasoning and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems. Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. 
This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.[73] MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
Ge Sun,Jun Ma
Main category: cs.CV
TL;DR: This paper proposes MGTraj, a novel Multi-Granularity goal-guided model for human trajectory prediction that outperforms existing methods by incorporating multi-granularity modeling and transformer-based refinement.
Details
Motivation: The potential utility of intermediate temporal granularity remains largely unexplored in goal-guided approaches for trajectory prediction, which motivates multi-granularity trajectory modeling. Method: MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels using a transformer-based recursive refinement network and integrates features across granularities with a weight-sharing strategy. Result: MGTraj demonstrates superior performance on the ETH/UCY and Stanford Drone Dataset, indicating its effectiveness in human trajectory prediction. Conclusion: MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods in human trajectory prediction. Abstract: Accurate human trajectory prediction is crucial for robotics navigation and autonomous driving. Recent research has demonstrated that incorporating goal guidance significantly enhances prediction accuracy by reducing uncertainty and leveraging prior knowledge. Most goal-guided approaches decouple the prediction task into two stages: goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities: coarse-grained goal prediction forecasts the overall intention, while fine-grained trajectory completion needs to generate the positions for all future timesteps. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. While prior work has shown that multi-granularity representations capture diverse scales of human dynamics and motion patterns, effectively integrating this concept into goal-guided frameworks remains challenging. In this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for human Trajectory prediction. MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels. 
At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements. Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task to further enhance performance. Comprehensive experimental results on the ETH/UCY and Stanford Drone Dataset indicate that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods.[74] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Zhengzhao Lai,Youbin Zheng,Zhenyang Cai,Haonan Lyu,Jinpu Yang,Hongqing Liang,Yan Hu,Benyou Wang
Main category: cs.CV
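TL;DR: This paper presents MatCha, the first benchmark for materials characterization image understanding, designed to evaluate multimodal large language models (MLLMs) on real-world materials science tasks.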
Details
Motivation: Materials characterization is essential for understanding the relationships among a material's processing, microstructure, and properties. However, the ability of current multimodal large language models to understand real-world materials characterization images has not been fully explored, so a dedicated benchmark is needed to fill this gap. Method: The authors build MatCha, a benchmark of 1,500 questions requiring expert-level domain knowledge, covering four key stages of materials research and 21 specific tasks. They evaluate state-of-the-art MLLMs on MatCha and compare them with human experts. Result: Existing MLLMs perform significantly worse than human experts on MatCha, especially on questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting does little to alleviate these limitations. Conclusion: Existing MLLMs still show limited adaptability to real-world materials characterization tasks. The release of MatCha is expected to facilitate future research in areas such as new material discovery and autonomous scientific agents. Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.[75] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Jiesi Hu,Jianfeng Cao,Yanwu Yang,Chenfei Ye,Yixuan Zhang,Hanyang Peng,Ting Ma
Main category: cs.CV
TL;DR: Medverse is a universal 3D medical imaging model using in-context learning to achieve high-fidelity predictions and global anatomical understanding, outperforming previous methods across diverse tasks and datasets.
Details
Motivation: Current ICL models for medical imaging are limited in achieving high-fidelity predictions and global anatomical understanding, and lack a unified approach across diverse tasks and anatomical regions. Medverse aims to address these limitations. Method: Medverse employs a next-scale autoregressive in-context learning framework and a blockwise cross-attention module to enable high-fidelity predictions and global anatomical understanding across diverse medical imaging tasks. Result: Medverse demonstrates superior performance over existing ICL baselines across a wide range of held-out datasets, including unseen clinical centers, organs, species, and imaging modalities. Conclusion: Medverse is a universal ICL model for 3D medical imaging that outperforms existing ICL baselines, offering a novel paradigm for in-context learning with potential for widespread application in medical imaging. Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. 
We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.[76] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang,Aldrich Yu,Chengqi Duan,Linjiang Huang,Shuai Bai,Yuxuan Cai,Kun Wang,Si Liu,Xihui Liu,Hongsheng Li
Main category: cs.CV
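TL;DR: This paper introduces FLUX-Reason-6M and PRISM-Bench to address the lack of large-scale reasoning datasets and comprehensive evaluation benchmarks for open-source text-to-image models; an extensive evaluation reveals critical performance gaps and points to directions for improvement.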
Details
Motivation: Progress in open-source text-to-image models has been hindered by the lack of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap with leading closed-source systems. Method: The paper introduces FLUX-Reason-6M and PRISM-Bench. FLUX-Reason-6M is a large-scale dataset of 6 million high-quality FLUX-generated images and 20 million bilingual descriptions, designed specifically to teach complex reasoning; PRISM-Bench provides a novel evaluation standard with seven distinct tracks, including a Long Text challenge using GCoT. Result: An extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Conclusion: FLUX-Reason-6M and PRISM-Bench fill the gap in large-scale reasoning datasets and comprehensive evaluation benchmarks for open-source text-to-image models and help narrow the performance gap with leading closed-source systems. Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. 
Project page: https://flux-reason-6m.github.io/ .[77] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification
Mustafa Yurdakul,Sakir Tasdemir
Main category: cs.CV
TL;DR: CoAtNeXt, a novel hybrid model, outperforms existing models in classifying gastric tissue images, offering potential assistance to pathologists.
Details
Motivation: The need for automated, reliable, and efficient methods for gastric tissue analysis due to limitations of manual histopathologic examination which is labor-intensive and prone to variability. Method: A novel hybrid model named CoAtNeXt was proposed, built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks integrated with the Convolutional Block Attention Module (CBAM). Result: CoAtNeXt achieved high accuracy, precision, recall, F1 score, and AUC on two datasets, outperforming all CNN and ViT models tested and surpassing previous studies in the literature. Conclusion: CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, providing performance on binary and multiclass, and has the potential to assist pathologists by enhancing diagnostic accuracy and reducing workload. Abstract: Background and objective Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. 
CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.
[78] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Jing Hao,Yuxuan Fan,Yanpeng Sun,Kaixin Guo,Lizhuo Lin,Jinrong Yang,Qi Yong H. Ai,Lun M. Wong,Hao Tang,Kuo Feng Hung
Main category: cs.CV
TL;DR: This paper introduces MMOral, a large-scale dataset for panoramic X-ray interpretation in dentistry, and proposes OralGPT, which significantly improves LVLM performance in this domain.
Details
Motivation: The motivation stems from the lack of exploration of LVLMs in specialized domains like dentistry, particularly in interpreting panoramic X-rays, which pose challenges due to dense anatomical structures and subtle pathological cues. Method: The researchers introduced the MMOral dataset and benchmark, then evaluated 64 LVLMs, including GPT-4o. They also proposed OralGPT, which involved supervised fine-tuning (SFT) on the MMOral dataset. Result: The evaluation showed that GPT-4o achieved only 41.45% accuracy on MMOral-Bench. OralGPT demonstrated a 24.73% improvement in performance after one epoch of SFT. Conclusion: The study concludes that even the best-performing LVLMs have significant limitations in interpreting panoramic X-rays in dentistry, and the proposed MMOral dataset and OralGPT model offer a promising foundation for advancing intelligent dentistry. Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. 
We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.[79] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Chao Yuan,Yang Yang,Yehui Yang,Zach Cheng
Main category: cs.CV
TL;DR: This paper proposes DATE, a novel method for improving temporal understanding in MLLMs by introducing timestamps and a new sampling strategy, leading to state-of-the-art results on long video tasks.
Details
Motivation: Long video understanding is challenging for MLLMs due to limitations in modeling long-range temporal dependencies. Existing methods suffer from information loss and poor temporal comprehension, necessitating a more effective approach. Method: Dynamic Absolute Time Enhancement (DATE) is introduced, incorporating the Timestamp Injection Mechanism (TIM) and Temporal-Aware Similarity Sampling (TASS). TIM interleaves video frame embeddings with textual timestamp tokens, while TASS reformulates video sampling as a vision-language retrieval task using a two-stage algorithm. Result: The method achieves significant improvements in absolute time understanding and key event localization on hour-long video benchmarks for both 7B and 72B models. Conclusion: The proposed DATE method significantly enhances temporal awareness in MLLMs, achieving state-of-the-art performance on long video benchmarks, with the 7B model outperforming many 72B models. Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. 
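The interleaving of frame embeddings with textual timestamp tokens can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the function names, the `<HH:MM:SS>` token format, and the choice to place each timestamp before its frame are all assumptions made for the example.

```python
from typing import List, Sequence, Union

# A sequence element is either a textual timestamp token or a frame embedding.
Element = Union[str, List[float]]


def format_timestamp(seconds: float) -> str:
    """Render a time offset as a textual token (hypothetical <HH:MM:SS> format)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"<{hours:02d}:{minutes:02d}:{secs:02d}>"


def interleave_timestamps(
    frame_embeddings: Sequence[List[float]],
    frame_times: Sequence[float],
) -> List[Element]:
    """Build one sequence in which each frame embedding follows its timestamp token.

    Placing the timestamp before the frame is an assumption of this sketch.
    """
    if len(frame_embeddings) != len(frame_times):
        raise ValueError("need exactly one timestamp per frame")
    sequence: List[Element] = []
    for embedding, seconds in zip(frame_embeddings, frame_times):
        sequence.append(format_timestamp(seconds))  # absolute-time reference
        sequence.append(embedding)                  # visual content at that time
    return sequence
```

In a real model the timestamp strings would be tokenized and embedded by the language model before being mixed with the visual embeddings; the sketch stops at the ordering step.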
We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision feature, and sampling key event with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.[80] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
Linhao Li,Yiwen Ye,Ziyang Chen,Yong Xia
Main category: cs.CV
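TL;DR: PSP-Seg is an efficient 3D medical image segmentation method that optimizes the model through progressive pruning, substantially reducing resource consumption while maintaining high performance.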
Details
Motivation: Existing 3D medical image segmentation models are typically resource-intensive and poorly adaptable, so an efficient and flexible method is needed to improve clinical deployability. Method: PSP-Seg uses a progressive pruning framework that dynamically optimizes the model through block-wise pruning and a functional decoupling loss. Result: PSP-Seg-S performs on par with nnU-Net across multiple datasets while substantially reducing GPU memory usage (42-45%), training time (29-48%), and parameter count (83-87%). Conclusion: PSP-Seg is a promising 3D medical image segmentation method that combines efficiency with high performance and is suitable for broad clinical application. Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87% across all datasets. These findings underscore PSP-Seg's potential as a cost-effective yet high-performing alternative for widespread clinical application.
[81] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
Bohao Tang,Yan Ma,Fei Zhang,Jiadi Su,Ethan Chern,Zhulin Hu,Zhixin Wang,Pengfei Liu,Ya Zhang
Main category: cs.CV
TL;DR: This paper proposes an adaptive Vision-Language Model framework that combines code-based reasoning and visual analysis, using reinforcement learning to dynamically select the optimal approach for chart understanding, achieving strong performance across benchmarks.
Details
Motivation: To overcome limitations in existing approaches for chart understanding by developing a model that can adaptively select the most suitable reasoning strategy, enhancing factual accuracy and interpretability. Method: A Code-as-Thought approach combined with a reinforcement learning framework that uses a dual-reward system to train a model to select between code-based and visual reasoning pathways. Result: The proposed method demonstrates robust performance across various chart-understanding benchmarks, showing the ability of Vision-Language Models to dynamically select optimal reasoning pathways. Conclusion: The study concludes that Vision-Language Models can be effectively trained to dynamically choose the best reasoning method for understanding charts, either through code-based reasoning or direct visual analysis. Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. 
We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.[82] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training
Anthony P. Addison,Felix Wagner,Wentian Xu,Natalie Voets,Konstantinos Kamnitsas
Main category: cs.CV
TL;DR: This study proposes a new MRI segmentation model that flexibly handles both seen and unseen modalities, improving generalization.
Details
Motivation: Existing multimodal brain MRI segmentation models are usually restricted to fixed modalities and cannot effectively process new ones; this study aims to develop a model that flexibly handles both seen and unseen modalities. Method: A modality-agnostic input channel is integrated into the U-net architecture, and an image augmentation scheme that synthesizes artificial MRI modalities is developed to train the modality-agnostic component. Result: Evaluated on 8 MRI databases covering 5 pathology types and 8 modalities, the method effectively processes modalities encountered during training and can also process new, unseen modalities to improve segmentation. Conclusion: By combining a modality-agnostic input channel with modality-specific input channels, the study develops an MRI segmentation model that can perform inference effectively on modalities unseen during training. Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, and thus practical, alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). 
The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG[83] Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
Spyridon Loukovitis,Anastasios Arsenos,Vasileios Karampinis,Athanasios Voulodimos
Main category: cs.CV
TL;DR: The paper proposes an open-set detection framework for UAVs that improves robustness and detection accuracy in dynamic environments by leveraging entropy modeling, spectral normalization, and temperature scaling.
Details
Motivation: The motivation stems from the limitations of traditional closed-set detectors in handling domain shifts and flight data corruption, which poses risks in safety-critical applications like UAV autonomy. Method: The method involves a model-agnostic open-set detection framework for embedding-based detectors, using entropy modeling, spectral normalization, and temperature scaling to handle unknown object rejection and improve open-set discrimination. Result: The results show consistent improvements over baseline methods, including a 10% relative AUROC gain compared to standard YOLO-based detectors, validated on the AOT aerial benchmark and through real-world flight tests. Conclusion: The paper concludes that the proposed open-set detection framework significantly enhances UAV autonomy by improving the detection robustness in real-world scenarios, particularly under domain shifts and data corruption. Abstract: Open-set detection is crucial for robust UAV autonomy in air-to-air object detection under real-world conditions. Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing risks to safety-critical applications. We propose a novel, model-agnostic open-set detection framework designed specifically for embedding-based detectors. The method explicitly handles unknown object rejection while maintaining robustness against corrupted flight data. It estimates semantic uncertainty via entropy modeling in the embedding space and incorporates spectral normalization and temperature scaling to enhance open-set discrimination. We validate our approach on the challenging AOT aerial benchmark and through extensive real-world flight tests. Comprehensive ablation studies demonstrate consistent improvements over baseline methods, achieving up to a 10\% relative AUROC gain compared to standard YOLO-based detectors. 
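As a rough illustration of how entropy and temperature scaling can drive unknown-object rejection, the Python sketch below computes the Shannon entropy of a temperature-scaled softmax and flags a detection as unknown when the entropy exceeds a threshold. This is a generic technique, not the paper's method: the paper models entropy in the embedding space and also uses spectral normalization, neither of which is reproduced here, and the function names, default temperature, and threshold are assumptions.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature; higher values flatten the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_unknown(logits: Sequence[float], temperature: float = 1.0, threshold: float = 0.5) -> bool:
    """Flag a detection as unknown when its predictive entropy exceeds the threshold.

    The threshold is a hypothetical value that would need tuning on held-out data.
    """
    probs = softmax_with_temperature(logits, temperature)
    return predictive_entropy(probs) > threshold
```

A confident detection (one dominant logit) yields low entropy and is kept, while a near-uniform distribution yields entropy close to the logarithm of the class count and is rejected.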
Additionally, we show that background rejection further strengthens robustness without compromising detection accuracy, making our solution particularly well-suited for reliable UAV perception in dynamic air-to-air environments.[84] Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion
Oh-Tae Jang,Min-Gon Cho,Kyung-Tae Kim
Main category: cs.CV
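TL;DR: This paper proposes SlotSAR, a new object-centric learning framework that disentangles target representations from SAR images and suppresses background clutter without mask annotations.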
Details
Motivation: Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. This clutter exhibits target-like intensity and patterns, leading models to extract entangled or spurious features and weakening their ability to form clear target representations. Method: SlotSAR extracts high-level semantic features from SARATR-X and low-level scattering features from a wavelet scattering network to obtain complementary multi-level representations. A multi-level slot attention module is also proposed that combines these low- and high-level features to enhance the distinctiveness of slot-wise representations, enabling effective OCL. Result: Experiments show that SlotSAR preserves structural details better than existing OCL methods. Conclusion: SlotSAR is a new object-centric learning framework that disentangles target representations from SAR images and suppresses background clutter without mask annotations, achieving state-of-the-art performance on SAR imagery. Abstract: Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. In many cases, such clutter exhibits intensity and patterns that resemble targets, leading models to extract entangled or spurious features. Such behavior undermines the ability to form clear target representations, regardless of the classifier. To address this challenge, we propose a novel object-centric learning (OCL) framework, named SlotSAR, that disentangles target representations from background clutter in SAR images without mask annotations. SlotSAR first extracts high-level semantic features from SARATR-X and low-level scattering features from the wavelet scattering network in order to obtain complementary multi-level representations for robust target characterization. We further present a multi-level slot attention module that integrates these low- and high-level features to enhance slot-wise representation distinctiveness, enabling effective OCL. Experimental results demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.
[85] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Hao Si,Ehsan Javanmardi,Manabu Tsukada
Main category: cs.CV
TL;DR: This paper proposes PHCP, a novel framework for heterogeneous collaborative perception that works during inference without labeled data or joint training, achieving strong performance comparable to state-of-the-art methods.
Details
Motivation: The motivation stems from the impracticality of existing methods that require joint training or pre-storing models for every collaborator. The authors aim to eliminate such dependencies and address the challenge during inference directly. Method: The paper introduces PHCP, a new framework that formulates the problem as few-shot unsupervised domain adaptation. It dynamically aligns features by self-training an adapter during inference. Result: Experiments on the OPV2V dataset show that PHCP performs well across heterogeneous scenarios and achieves results comparable to SOTA methods while using only a small amount of unlabeled data. Conclusion: PHCP proves to be effective in tackling heterogeneous collaborative perception without needing joint training or labeled data, achieving performance comparable to state-of-the-art methods. Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. 
Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.[86] Image Recognition with Vision and Language Embeddings of VLMs
Illia Volkov,Nikita Kisel,Klara Janouskova,Jiri Matas
Main category: cs.CV
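TL;DR: This paper evaluates vision-language models on image classification, finds that language and vision are complementary, and proposes an effective fusion method.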
Details
Motivation: To examine the purely visual inference capabilities of vision-language models and to exploit the complementarity of language and vision to improve classification. Method: A comprehensive evaluation of a range of dual-encoder vision-language models, together with a learning-free fusion method. Result: Some classes are better served by textual prompts while others are better handled by visual similarity; the fusion method improves overall classification performance. Conclusion: Language and vision offer complementary strengths, and a simple fusion method based on per-class precision can improve classification performance. Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
[87] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Hui Li,Yi You,Qiqi Chen,Bingfeng Zhang,George Q. Huang
Main category: cs.CV
TL;DR: This paper introduces the BUG workflow with LMM for automated clothing design creation and customization, making the process more accessible to users without professional expertise.
Details
Motivation: Current generative AI models struggle with fine-grained customization in clothing design due to text uncertainty and the need for professional background knowledge from users. Method: The authors proposed a Better Understanding Generation (BUG) workflow using a Large Multimodal Model (LMM) to automate clothing design creation and customization. They also introduced the FashionEdit dataset to evaluate their model's performance. Result: The BUG workflow enables users to create and customize clothing designs through chat with image-into-prompt, reducing barriers to clothing design without additional human involvement. Conclusion: The BUG workflow with LMM effectively enhances the creation and customization of clothing designs, making the process more accessible and efficient. Abstract: Generative AI is changing how complex workflows are executed in industry, and large multimodal models (LMMs) now support fashion design in the garment industry. Current generative AI models readily turn brainstorming into polished designs, but fine-grained customization still suffers from text uncertainty when end users lack professional background knowledge. We therefore propose the Better Understanding Generation (BUG) workflow, which uses an LMM to automatically create and finely customize clothing designs from chat with image-into-prompt. Our framework unleashes users' creative potential beyond words and lowers the barriers to clothing design and editing without further human involvement. To demonstrate the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow and is evaluated on generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.
[88] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
Dimitrios Anastasiou,Razvan Caramalau,Nazir Sirajudeen,Matthew Boal,Philip Edwards,Justin Collins,John Kelly,Ashwin Sridhar,Maxine Tran,Faiz Mumtaz,Nevil Pavithran,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos
Main category: cs.CV
TL;DR: This paper investigates few-shot surgical skill assessment using domain-relevant pre-training, showing that small, relevant datasets outperform large, irrelevant ones and that incorporating procedure-specific data boosts performance.
Details
Motivation: The scarcity of skill annotations in surgical vision makes developing robust models challenging, so the study explores few-shot learning with self-supervised pre-training as a scalable solution. Method: The authors formulated surgical skill assessment as a few-shot learning task, annotated a robotic surgery dataset with OSATS scores, and evaluated various self-supervised pre-training strategies across three few-shot settings (1-, 2-, and 5-shot). They also measured domain similarity and analyzed the impact of domain gap and procedure-specific data inclusion. Result: Models using small but domain-relevant datasets outperformed larger, less aligned ones, achieving 60.16%, 66.03%, and 73.65% accuracy in 1-, 2-, and 5-shot settings respectively. Adding procedure-specific data improved performance by +1.22% in accuracy and +2.28% in F1-score, while irrelevant large-scale data hurt performance. Conclusion: The study concludes that domain-relevant pre-training significantly improves few-shot surgical skill assessment, and incorporating procedure-specific data can further enhance performance, while large-scale but less relevant data may degrade it. Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. 
We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.[89] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Ian Nell,Shane Gilroy
Main category: cs.CV
TL;DR: This study proposes a computer vision-based system for classifying driver behavior, detecting distraction and impairment through real-time object tracking, lateral displacement analysis, and lane position monitoring, with potential applications in enhancing road safety.
Details
Motivation: Road traffic accidents are a significant global concern, primarily caused by human error such as distracted and impaired driving. This study aims to address this issue by introducing a novel driver behavior classification system. Method: The study uses a framework that employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring, to detect indicators of driver distraction and impairment. Result: Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions. Conclusion: The study concludes that the proposed vision-based driver behavior classification system effectively identifies unsafe driving behaviors and has the potential to enhance road safety by enabling analysis of non-connected vehicles. Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.[90] Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors
Xiaodong Wang,Zijun He,Xin Yuan
Main category: cs.CV
TL;DR: This paper proposes a novel method for intrinsic image decomposition that effectively handles complex scenes with spatially-varying lighting effects and rich textures, outperforming existing approaches.
Details
Motivation: Previous learning-based methods tend to produce texture-less and over-smoothed intrinsic images, and so fail to handle complex scenes with spatially-varying lighting effects and rich textures. Method: A texture-guided regularization term is designed and the decomposition problem is formulated into an optimization framework to separate material textures and lighting effects. Result: The method enables the production of high-quality intrinsic images for real-world images by combining a novel texture-aware prior. Conclusion: The proposed method outperforms existing approaches in handling severe lighting and rich textures for intrinsic image decomposition. Abstract: This paper aims to recover the intrinsic reflectance layer and shading layer given a single image. Though this intrinsic image decomposition problem has been studied for decades, it remains a significant challenge in cases of complex scenes, i.e., spatially-varying lighting effects and rich textures. In this paper, we propose a novel method for handling severe lighting and rich textures in intrinsic image decomposition, enabling the production of high-quality intrinsic images for real-world images. Specifically, we observe that previous learning-based methods tend to produce texture-less and over-smoothed intrinsic images, which can be used to infer the lighting and texture information given an RGB image. In this way, we design a texture-guided regularization term and formulate the decomposition problem into an optimization framework, to separate the material textures and lighting effect. We demonstrate that combining the novel texture-aware prior can produce superior results to existing approaches.
[91] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection
Xiaodong Wang,Ping Wang,Zhangyuan Li,Xin Yuan
Main category: cs.CV
TL;DR: The study establishes a unified framework linking PnP methods and DDIM for inverse problems in single-pixel imaging, resulting in improved reconstruction quality through a hybrid data-consistency approach.
Details
Motivation: The motivation is to explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, particularly focusing on enhancing single-pixel imaging. Method: The method involves decoupling the diffusion process into denoising, data consistency enforcement, and sampling stages, and proposing a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. Result: Experimental results on single-pixel imaging tasks show that the proposed method achieves better reconstruction quality. Conclusion: The paper concludes that by integrating learned priors with physical forward models and proposing a hybrid data-consistency module, the reconstruction quality in single-pixel imaging tasks can be significantly improved. Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
[92] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
Pengxu Wen,Tingting Yu,Ziwei Nie,Cheng Jiang,Zhenyu Yin,Mingyang He,Bo Liao,Xiaoping Yang
Main category: cs.CV
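TL;DR: This study proposes a fully automatic ICP grading framework that combines ultrasound analysis with multi-source data fusion, effectively reducing operator variability and providing a reliable new method for non-invasive intracranial pressure assessment.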
Details
Motivation: Elevated intracranial pressure (ICP) poses a severe threat to cerebral function and requires timely monitoring and intervention. Lumbar puncture, the current clinical method for measuring ICP, is invasive and carries associated risks, so a non-invasive alternative is needed. Method: The method has two stages: the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification, and precise ONSD measurement; the ICP grading stage then fuses ONSD metrics with clinical features to predict ICP grades. Result: The method achieves a validation accuracy of 0.845 ± 0.071 and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method (validation accuracy 0.637 ± 0.111, test accuracy 0.429). Conclusion: The study develops a fully automatic two-stage framework for ICP grading that integrates keyframe identification, ONSD measurement, and clinical data, providing a reliable non-invasive approach for clinical ICP evaluation. Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). 
By effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.
[93] Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality
Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu
Main category: cs.CV
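TL;DR: This paper proposes an unsupervised framework for integrated-circuit defect segmentation that requires no external normal samples and detects defects by exploiting the normal features within the test image itself.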
Details
Motivation: Diverse, fine-grained defects in integrated-circuit manufacturing depress yield and reliability, and conventional defect segmentation methods are brittle when layouts vary and alignment is difficult. Method: An unsupervised approach extracts normal features from the test image and uses a coherence loss to strengthen their association with normal regions; a decoder reconstructs only normal content, and defects are segmented from the reconstruction residual. Result: Experiments on datasets from three IC process stages show consistent improvements and strong robustness to product variability. Conclusion: The method segments IC defects effectively without external normal samples and is suited to practical industrial use. Abstract: Modern integrated-circuit (IC) manufacturing introduces diverse, fine-grained defects that depress yield and reliability. Most industrial defect segmentation compares a test image against an external normal set, a strategy that is brittle for IC imagery where layouts vary across products and accurate alignment is difficult. We observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. We therefore propose an unsupervised IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content; the reconstruction residual then segments defects. Pseudo-anomaly augmentation further stabilizes training. Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.
[94] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Umaima Rahman,Raza Imam,Mohammad Yaqub,Dwarikanath Mahapatra
Main category: cs.CV
TL;DR: DRiFt improves the reliability of medical vision-language models by decoupling relevant clinical signals from noise, enhancing generalization and robustness for safer clinical deployment.
Details
Motivation: Medical vision-language models (VLMs) face challenges in reliability and generalizability due to task-agnostic correlations caused by variations in imaging protocols and free-text reports. This limits their safe deployment in real-world clinical settings. Method: DRiFt uses a structured feature decoupling framework with parameter-efficient tuning (LoRA) and learnable prompt tokens to separate relevant clinical signals from noise. High-quality image-text pairs are curated to improve cross-modal alignment and reduce uncertainty. Result: The DRiFt approach achieves a +11.4% improvement in Top-1 accuracy and a +3.3% increase in Macro-F1 over prior prompt-based methods. It also demonstrates strong robustness across unseen datasets and reduces unpredictable behavior under domain shifts. Conclusion: The proposed DRiFt framework enhances the reliability and robustness of medical vision-language models (VLMs) by decoupling clinically relevant signals from task-agnostic noise, leading to safer and more trustworthy models for clinical use. Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. 
Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
[95] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
Yuchan Jie,Yushen Xu,Xiaosong Li,Fuqiang Zhou,Jianming Lv,Huafeng Li
Main category: cs.CV
TL;DR: FS-Diff is a novel method for joint image fusion and super-resolution that leverages semantic guidance and a modified U-Net to produce high-quality, high-resolution fused images.
Details
Motivation: Current fusion techniques struggle with low-resolution, corrupted images in real-world applications like military reconnaissance, necessitating a more effective joint fusion and super-resolution approach. Method: FS-Diff initializes the fused result as Gaussian noise and uses a bidirectional feature Mamba for global feature extraction. A modified U-Net network is trained for denoising at multiple noise levels, using source images and semantics as conditions. Result: FS-Diff achieves superior performance on six public datasets and the AVMS benchmark, generating high-resolution fused images with abundant semantic details. Conclusion: FS-Diff outperforms state-of-the-art methods in joint image fusion and super-resolution, producing high-resolution results with rich details and semantic information. Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. 
Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
[96] Semantic Concentration for Self-Supervised Dense Representations Learning
Peisong Wen,Qianqian Xu,Siran Dai,Runmin Cong,Qingming Huang
Main category: cs.CV
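TL;DR: This paper proposes a new dense self-supervised learning method that addresses over-dispersion through an explicit semantic concentration strategy and validates its effectiveness on a variety of tasks.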
Details
Motivation: Image-level self-supervised learning has made significant progress, but dense representation learning suffers from over-dispersion: patches from the same instance or category scatter, harming downstream performance. Method: First, patch correspondences are distilled to break strict spatial alignment; second, a noise-tolerant ranking loss handles noisy and imbalanced pseudo labels; finally, an object-aware filter maps the output space to an object-based space. Result: Empirical studies across various tasks show that the proposed method is effective at addressing over-dispersion in dense self-supervised learning. Conclusion: The paper proposes a new dense self-supervised learning method whose explicit semantic concentration strategy resolves the over-dispersion of existing methods and thereby improves downstream performance. Abstract: Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon in which patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. 
Code is available at https://github.com/KID-7391/CoTAP.
[97] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model
Yushen Xu,Xiaosong Li,Yuchun Wang,Xiaoqi Cheng,Huafeng Li,Haishu Tan
Main category: cs.CV
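TL;DR: FlexiD-Fuse is a diffusion-based medical image fusion method that flexibly handles a varying number of input modalities and performs strongly on medical image fusion tasks.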
Details
Motivation: Existing medical image fusion methods accept only a fixed number of modality inputs and cannot directly process varying input quantities, which limits their use in clinical settings. Method: FlexiD-Fuse recasts the diffusion fusion problem as a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling, and incorporates the Expectation-Maximization algorithm to process medical images with varying numbers of input modalities end to end. Result: Experiments show that FlexiD-Fuse achieves the best performance on medical image fusion with the Harvard datasets; extension experiments also demonstrate its effectiveness and superiority on infrared-visible, multi-exposure, and multi-focus image fusion. Conclusion: FlexiD-Fuse is a diffusion-based medical image fusion network that flexibly handles varying numbers of input modalities and performs strongly on medical image fusion tasks. Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end to end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared against the latest two-modal and tri-modal medical image fusion methods on the Harvard datasets, evaluating with nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. 
Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.
[98] Resource-Efficient Glioma Segmentation on Sub-Saharan MRI
Freedmore Sidume,Oumayma Soula,Joseph Muthui Wacira,YunFei Zhu,Abbas Rabiu Muhammad,Abderrazek Zeraii,Oluwaseun Kalejaye,Hajer Ibrahim,Olfa Gaddour,Brain Halubanza,Dong Zhang,Udunna C Anazodo,Confidence Raymond
Main category: cs.CV
TL;DR: This paper presents a deep learning model for glioma segmentation in Sub-Saharan Africa, using a 3D Attention UNet with residual blocks and transfer learning. It achieves strong performance despite limited data and offers practical deployment potential in resource-constrained settings.
Details
Motivation: The motivation stems from the scarcity of high-quality annotated MRI data in Sub-Saharan Africa, which hampers the deployment of advanced segmentation models for gliomas, a prevalent type of primary brain tumor. Method: The study employed a 3D Attention UNet architecture with residual blocks, enhanced through transfer learning from the BraTS 2021 dataset. The model was evaluated on the BraTS-Africa dataset consisting of 95 MRI cases. Result: Despite limited data, the model achieved Dice scores of 0.76 for Enhancing Tumor, 0.80 for Necrotic and Non-Enhancing Tumor Core, and 0.85 for Surrounding Non-Functional Hemisphere. The model's compact size and fast inference time further enhance its suitability for low-resource settings. Conclusion: This study successfully developed a robust and efficient deep learning model for glioma segmentation in resource-limited settings, particularly in Sub-Saharan Africa, contributing to equitable AI in global health. Abstract: Gliomas are the most prevalent type of primary brain tumors, and their accurate segmentation from MRI is critical for diagnosis, treatment planning, and longitudinal monitoring. However, the scarcity of high-quality annotated imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for deploying advanced segmentation models in clinical workflows. This study introduces a robust and computationally efficient deep learning framework tailored for resource-constrained settings. We leveraged a 3D Attention UNet architecture augmented with residual blocks and enhanced through transfer learning from pre-trained weights on the BraTS 2021 dataset. Our model was evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma segmentation in SSA MRI data. 
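The per-region Dice scores quoted in the Result summary above measure overlap between a predicted and a reference segmentation. The following is a minimal sketch of that metric, not the authors' evaluation code: it assumes flattened label maps stored as plain Python sequences, and the function name `dice_score` and the convention of returning 1.0 when a label is absent from both maps are illustrative assumptions.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for one label.

    `pred` and `target` are flattened label maps of equal length
    (hypothetical layout; a real pipeline would use 3D arrays).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    # Voxels where both maps carry the label of interest.
    intersection = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)

    denominator = pred_count + target_count
    if denominator == 0:
        # Assumed convention: a label absent from both maps counts as full agreement.
        return 1.0
    return 2.0 * intersection / denominator


# Toy example with three hypothetical labels (0 = background).
predicted = [0, 1, 1, 2, 2, 2]
reference = [0, 1, 2, 2, 2, 0]
for region in (1, 2):
    print(region, dice_score(predicted, reference, region))
```

In practice each tumor sub-region would be scored separately per case and then averaged across cases, which is how a single figure per region is usually reported.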
Despite the limited data quality and quantity, our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH). These results demonstrate the generalizability of the proposed model and its potential to support clinical decision making in low-resource settings. The compact architecture, approximately 90 MB, and sub-minute per-volume inference time on consumer-grade hardware further underscore its practicality for deployment in SSA health systems. This work contributes toward closing the gap in equitable AI for global health by empowering underserved regions with high-performing and accessible medical imaging solutions.
[99] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
Victor Livernoche,Akshatha Arodi,Andreea Musulan,Zachary Yang,Adam Salvail,Gaétan Marceau Caron,Jean-François Godbout,Reihaneh Rabbany
Main category: cs.CV
TL;DR: This paper addresses the growing threat of politically motivated deepfakes by presenting a new dataset for detection benchmarking and a crowdsourced platform to continuously challenge and improve detection methods.
Details
Motivation: The motivation is to address the growing threat of deepfakes in spreading misinformation, particularly in politically sensitive contexts, by improving detection methods and adapting to evolving generative techniques. Method: The study analyzes social media posts to identify deepfake propagation modalities, conducts a human perception study, and develops a politically-focused dataset with real and synthetic images. Additionally, it introduces a crowdsourced adversarial platform to generate challenging synthetic images. Result: A comprehensive dataset of three million real images paired with captions and 963k high-quality synthetic images was created. The human perception study showed that modern synthetic images are increasingly indistinguishable from real ones, highlighting the need for improved detection methods. Conclusion: Deepfake detection requires continuous improvement due to the increasing realism of synthetic images; the study introduces a comprehensive dataset and an innovative crowdsourced platform to enhance detection capabilities and safeguard public discourse. Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. 
This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
[100] Improving Human Motion Plausibility with Body Momentum
Ha Linh Nguyen,Tze Ho Elden Tse,Angela Yao
Main category: cs.CV
TL;DR: This paper proposes a momentum-based loss to better link local and global motion dynamics, improving motion generation quality with reduced artifacts and better balance.
Details
Motivation: Motion models often fail to capture the physical coupling between local and global dynamics, and deriving global trajectories from joint torques and forces is computationally expensive. This work aims to address these challenges. Method: The study introduces a new loss term based on whole-body linear and angular momentum to link local motion with global movement, leveraging momentum as a physical constraint. Result: Incorporating the momentum-based loss reduces foot sliding and jitter, improves balance, and maintains accurate motion recovery. Conclusion: The proposed momentum-based loss term improves motion generation by enforcing consistency between generated and observed momentum profiles, reducing foot sliding and jitter while preserving motion accuracy. Abstract: Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body's movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. 
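The momentum constraint described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes each joint is a point mass, estimates velocities by forward differences, takes angular momentum about the centre of mass, and uses a squared-error penalty. All function names (`frame_momentum`, `momentum_profile`, `momentum_loss`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = Sequence[Vec3]  # one 3D position per joint


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def frame_momentum(positions: Frame, velocities: Frame,
                   masses: Sequence[float]) -> Tuple[Vec3, Vec3]:
    """Linear momentum sum(m_i v_i) and angular momentum sum(r_i x m_i v_i),
    with r_i measured from the centre of mass (point-mass assumption)."""
    total_mass = sum(masses)
    com = tuple(sum(m * p[k] for m, p in zip(masses, positions)) / total_mass
                for k in range(3))
    linear = tuple(sum(m * v[k] for m, v in zip(masses, velocities))
                   for k in range(3))
    angular = [0.0, 0.0, 0.0]
    for m, p, v in zip(masses, positions, velocities):
        r = (p[0] - com[0], p[1] - com[1], p[2] - com[2])
        c = cross(r, (m * v[0], m * v[1], m * v[2]))
        for k in range(3):
            angular[k] += c[k]
    return linear, (angular[0], angular[1], angular[2])


def momentum_profile(frames: Sequence[Frame], masses: Sequence[float],
                     dt: float) -> List[Tuple[Vec3, Vec3]]:
    """Momentum per frame, using forward-difference velocities."""
    profile = []
    for cur, nxt in zip(frames, frames[1:]):
        velocities = [tuple((n[k] - c[k]) / dt for k in range(3))
                      for c, n in zip(cur, nxt)]
        profile.append(frame_momentum(cur, velocities, masses))
    return profile


def momentum_loss(pred_frames: Sequence[Frame], true_frames: Sequence[Frame],
                  masses: Sequence[float], dt: float) -> float:
    """Squared difference of the six momentum components, averaged over frames."""
    pred = momentum_profile(pred_frames, masses, dt)
    true = momentum_profile(true_frames, masses, dt)
    if len(pred) != len(true) or not pred:
        raise ValueError("sequences must have equal length of at least two frames")
    total = 0.0
    for (pl, pa), (tl, ta) in zip(pred, true):
        total += sum((x - y) ** 2 for x, y in zip(pl + pa, tl + ta))
    return total / len(pred)
```

A training setup would add a term like this to the usual reconstruction loss with some weight; the weighting and the exact body model are not specified in the abstract.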
Code and data are available at the project page https://hlinhn.github.io/momentum_bmvc.
[101] Region-Wise Correspondence Prediction between Manga Line Art Images
Yingxuan Li,Jiafeng Mao,Qianru Qiu,Yusuke Matsui
Main category: cs.CV
TL;DR: This paper proposes a Transformer-based method to predict region-wise correspondence in manga line art without annotations, achieving high accuracy and enabling practical manga processing applications.
Details
Motivation: The motivation is to address the unexplored challenge of understanding region-wise correspondence in manga line art without relying on pre-existing segmentation or annotations, which is essential for downstream applications like colorization and frame generation. Method: The method involves dividing line art images into patches, using a Transformer-based framework to learn patch-level similarities, applying edge-aware clustering, and employing a region matching algorithm to generate region-level correspondences. Result: The experiments show high patch-level accuracy (e.g., 96.34%) and consistent region-level correspondences, validating the effectiveness of the proposed approach. Conclusion: The paper concludes that their proposed Transformer-based framework achieves high accuracy in predicting region-wise correspondence between manga line art images, demonstrating its potential for practical applications in manga processing. Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. 
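The step of turning patch-level predictions into region-level correspondences can be sketched with simple majority voting. This is a simplified stand-in, not the paper's region matching algorithm, which the abstract does not specify. The inputs are assumed to be a region label per patch for each image (for example from clustering) and, for every patch in the first image, the index of its best-matching patch in the second image; the function name `match_regions` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Sequence


def match_regions(region_a: Sequence[int],
                  region_b: Sequence[int],
                  patch_match: Sequence[Optional[int]]) -> Dict[int, int]:
    """Map each region of image A to the region of image B that receives
    the most votes from A's patches.

    region_a[i]    : region label of patch i in image A
    region_b[j]    : region label of patch j in image B
    patch_match[i] : index j of the patch in B matched to patch i, or None
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for i, j in enumerate(patch_match):
        if j is None:
            continue  # unmatched patches do not vote
        votes[region_a[i]][region_b[j]] += 1
    # Pick the most frequently voted B region for every A region.
    return {ra: counter.most_common(1)[0][0] for ra, counter in votes.items()}


# Toy example: five patches in A, four in B, one unmatched patch.
print(match_regions([0, 0, 0, 1, 1], [5, 5, 7, 7], [0, 1, 2, 3, None]))
```

Voting makes the region-level result tolerant of a minority of wrong patch matches, which is one plausible reason to aggregate at region level rather than trust individual patches.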
Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
[102] Generative Diffusion Contrastive Network for Multi-View Clustering
Jian Zhu,Xin Zou,Xi Wang,Ning Zhang,Bian Wu,Yao Yang,Ying Zhou,Lingfang Zeng,Chang Tang,Cheng Luo
Main category: cs.CV
TL;DR: This paper proposes the Stochastic Generative Diffusion Fusion (SGDF) method and builds the Generative Diffusion Contrastive Network (GDCN) for deep multi-view clustering, achieving state-of-the-art results and robustness to low-quality data.
Details
Motivation: Multi-View Clustering (MVC) suffers from low-quality data issues, such as noisy or missing data in certain views, which degrade clustering performance. Method: The proposed Stochastic Generative Diffusion Fusion (SGDF) method uses a multiple generative mechanism for multi-view features, and Generative Diffusion Contrastive Network (GDCN) is built on SGDF. Result: Extensive experiments demonstrate that GDCN achieves superior performance, reaching state-of-the-art results in deep MVC tasks. Conclusion: GDCN, built upon SGDF, achieves state-of-the-art results in deep multi-view clustering tasks and is robust to low-quality data. Abstract: In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.
[103] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Paul F. R. Wilson,Matteo Ronchetti,Rüdiger Göbl,Viktoria Markova,Sebastian Rosenzweig,Raphael Prevost,Parvin Mousavi,Oliver Zettinig
Main category: cs.CV
TL;DR: DualTrack improves sensorless 3D ultrasound by using a dual-encoder deep learning architecture that separately models local and global features, resulting in highly accurate and consistent 3D reconstructions.
Details
Motivation: Sensorless 3D ultrasound is a cost-effective and less complex alternative to traditional 3D systems, but existing methods struggle to effectively model both local and global features, limiting their accuracy and consistency. Method: DualTrack uses a dual-encoder architecture with separate local and global encoders. The local encoder captures fine-grained features with spatiotemporal convolutions, while the global encoder uses a 2D CNN or foundation model with temporal attention for high-level anatomical features. A fusion module combines these features to estimate the 3D trajectory. Result: DualTrack achieves state-of-the-art accuracy and produces globally consistent 3D reconstructions from 2D ultrasound images, with an average reconstruction error below 5 mm. Conclusion: DualTrack is a promising solution for sensorless 3D ultrasound, offering accurate and globally consistent 3D reconstructions with an average error below 5 mm. Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. 
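As an illustration of what "estimating a 3D probe trajectory" from frame-to-frame motion involves, the following is a minimal sketch, not DualTrack's implementation. It assumes each frame pair yields a relative rigid transform and chains those transforms into global poses; the helper names (`rigid_transform`, `accumulate_trajectory`) and the yaw-only rotation are hypothetical simplifications.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rigid_transform(yaw_rad, tx, ty, tz):
    """4x4 homogeneous transform: rotation about z by yaw plus a translation.

    A real probe pose has three rotational degrees of freedom; a single
    yaw angle is used here only to keep the sketch short.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def accumulate_trajectory(relative_transforms):
    """Chain frame-to-frame transforms into global poses, starting at identity."""
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    poses = [pose]
    for rel in relative_transforms:
        pose = matmul(pose, rel)  # express the new frame in the first frame's coordinates
        poses.append(pose)
    return poses


def position(pose):
    """Extract the translation part of a 4x4 pose."""
    return [pose[0][3], pose[1][3], pose[2][3]]
```

Because every global pose is a product of all preceding relative estimates, small local errors compound along the sweep, which is one reason global anatomical context can be useful alongside local motion cues.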
The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
[104] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Dohun Lee,Hyeonho Jeong,Jiwook Kim,Duygu Ceylan,Jong Chul Ye
Main category: cs.CV
TL;DR: Align4Gen enhances video diffusion models by aligning intermediate features with pre-trained vision encoders, resulting in better video generation.
Details
Motivation: Improving feature representation power in video diffusion models has been largely overlooked despite advancements in architecture and training objectives. Method: A new metric was proposed to evaluate vision encoders' discriminability and temporal consistency. Align4Gen was developed for multi-feature fusion and alignment in video diffusion model training. Result: Align4Gen demonstrated improved performance in unconditional and class-conditional video generation tasks based on various evaluation metrics. Conclusion: Align4Gen improves video generation by aligning intermediate features with pre-trained vision encoders, demonstrating enhanced performance on video diffusion models. Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen on both unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/
[105] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu,Dongting Li,Yucheng Zhang,Xiyan Xu,Qi Long,Ziyin Wang,Yunzhi Lu,Shuchang Dong,Hezi Jiang,Akshat Gupta,Yu-Xiong Wang,Liang-Yan Gui
Main category: cs.CV
TL;DR: This paper introduces InterAct, a large-scale 3D human-object interaction benchmark, addressing dataset limitations through data consolidation, optimization, and task design to advance motion generation research.
Details
Motivation: Modeling and generating dynamic 3D human-object interactions remain challenging due to dataset limitations such as lack of extensive, high-quality motion and annotation, and artifacts like contact penetration, floating, and incorrect hand motions. Method: A unified optimization framework leveraging contact invariance was proposed to enhance data quality and introduce motion variations. Six benchmarking tasks were defined, along with a unified HOI generative modeling perspective. Result: InterAct, a large-scale 3D HOI benchmark, was developed by consolidating and standardizing 21.81 hours of HOI data, enriching it with textual annotations, and expanding it to 30.70 hours using an optimization framework. State-of-the-art performance was achieved in benchmarking tasks. Conclusion: InterAct serves as a foundational resource for advancing 3D human-object interaction generation and is publicly available to support continued research. Abstract: While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. 
Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
[106] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification
Akshit Achara,Esther Puyol Anton,Alexander Hammers,Andrew P. King
Main category: cs.CV
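TL;DR: This study reveals race- and sex-related shortcut learning and bias in deep learning-based MRI diagnosis of Alzheimer's disease, and lays a foundation for developing fairer diagnostic tools.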
Details
Motivation: Deep learning algorithms for MRI-based Alzheimer's disease diagnosis can suffer from shortcut learning; when the spurious features involved are related to protected attributes, this can lead to performance bias against underrepresented groups. This paper explores that problem. Method: The authors run experiments across multiple datasets and deep learning models (ResNet and SwinTransformer), including identifying race or sex from MRI scans, analyzing the effect of training set imbalance on model performance, and quantitative and qualitative analysis of feature attributions in different brain regions. Result: The study shows that race- and sex-based shortcut learning and bias exist in deep learning-based Alzheimer's disease classification, and shows how these biases affect model performance. Conclusion: This paper examines the existence of race- and sex-related shortcut learning and bias in deep learning-based Alzheimer's disease diagnosis and lays the foundation for fairer deep learning diagnostic tools for brain MRI. Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR
[107] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
Sijun Dong,Yuxuan Hu,LiBo Wang,Geng Chen,Xiaoliang Meng
Main category: cs.CV
TL;DR: PeftCD is a parameter-efficient change detection framework using Vision Foundation Models that achieves high performance across multiple remote sensing datasets while maintaining efficiency.
Details
Motivation: The study aims to address challenges in change detection for multi-temporal and multi-source remote sensing imagery, including pseudo changes, limited labeled samples, and cross-domain generalization. Method: PeftCD uses a weight-sharing Siamese encoder based on Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT), integrating LoRA and Adapter modules for efficient task adaptation. It explores two backbones, SAM2 and DINOv3, and employs a lightweight decoder. Result: PeftCD achieved state-of-the-art performance on multiple datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%), and LEVIR-CD (85.62%), with precise boundary delineation and effective suppression of pseudo-changes. Conclusion: PeftCD provides an optimal balance of accuracy, efficiency, and generalization, offering a scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. 
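To make the idea of "training only a minimal set of additional parameters" concrete, here is a minimal sketch of a LoRA-style linear layer shared across two inputs, Siamese fashion. It is written in plain Python under stated assumptions and is not the PeftCD code: the class name `LoRALinear`, the rank, the scaling `alpha / rank`, and the initialization scheme are illustrative choices following the common LoRA formulation.

```python
import random


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen backbone weight (assumed pretrained)
        # A starts small and random, B starts at zero, so the layer initially
        # reproduces the frozen backbone exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


def siamese_features(layer, x_before, x_after):
    """Weight sharing: the same layer encodes both acquisition dates."""
    return layer.forward(x_before), layer.forward(x_after)
```

For an 8x8 frozen weight and rank 2, the adapter adds 32 trainable values against 64 frozen ones; the saving grows with layer width because the adapter scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.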
Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
[108] Visual Grounding from Event Cameras
Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau
Main category: cs.CV
TL;DR: This paper introduces Talk2Event, a benchmark for language-driven object grounding using event camera data, aiming to advance multimodal perception in dynamic environments.
Details
Motivation: Event cameras offer advantages for modeling dynamic scenes, but their integration with natural language understanding is underexplored, creating a gap in multimodal perception. Method: The paper introduces Talk2Event, a large-scale benchmark for language-driven object grounding using event data, enriched with structured attributes. Result: Talk2Event includes 5,567 scenes, 13,458 annotated objects, and over 30,000 validated referring expressions with structured attributes for interpretable grounding. Conclusion: Talk2Event serves as a foundation for advancing multimodal and temporally-aware perception with potential applications in robotics and human-AI interaction. Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes -- appearance, status, relation to the viewer, and relation to surrounding objects -- that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
[109] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Yikang Ding,Jiwen Liu,Wenyuan Zhang,Zekun Wang,Wentao Hu,Liyuan Cui,Mingming Lao,Yingchao Shao,Hui Liu,Xiaohan Li,Ming Chen,Xiaoqiang Liu,Yu-Shen Liu,Pengfei Wan
Main category: cs.CV
TL;DR: Kling-Avatar is a new cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation, addressing the limitations of existing methods in audio-driven avatar video generation that do not model the communicative purpose conveyed by instructions. It uses a two-stage pipeline with a multimodal large language model director and a parallel sub-clip generation process to achieve high-fidelity results in various aspects like lip synchronization, emotion expressiveness, and cross-domain generalization.
Details
Motivation: Existing methods in audio-driven avatar video generation treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues without modeling the communicative purpose conveyed by the instructions, which compromises their narrative coherence and character expressiveness. Method: Kling-Avatar adopts a two-stage pipeline. The first stage involves a multimodal large language model (MLLM) director that produces a blueprint video based on diverse instruction signals to govern high-level semantics like character motion and emotions. In the second stage, multiple sub-clips are generated in parallel using a first-last frame strategy guided by blueprint keyframes. This global-to-local framework ensures fine-grained detail preservation and faithful encoding of high-level intent behind multimodal instructions. Result: Kling-Avatar can generate vivid, fluent, long-duration videos at up to 1080p and 48 fps, demonstrating superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. Conclusion: Kling-Avatar sets a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis with its ability to generate vivid, fluent, long-duration videos while achieving superior performance in various aspects like lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. 
To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
[110] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
Daria Laslo,Efthymios Georgiou,Marius George Linguraru,Andreas Rauschecker,Sabine Muller,Catherine R. Jutzeler,Sarah Bruningk
Main category: cs.CV
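TL;DR: This paper proposes a framework that combines a mathematical model with deep learning to predict brain tumor growth and generate realistic medical images.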
Details
Motivation: Predicting the spatio-temporal progression of brain tumors is essential for clinical decision-making in neuro-oncology. Method: A hybrid mechanistic learning framework is proposed that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. Result: The method generates realistic follow-up scans as judged by spatial similarity metrics, and introduces tumor growth probability maps that capture the clinically relevant extent and directionality of tumor growth. Conclusion: The framework enables biologically informed image generation in data-limited settings, offering generative spatio-temporal predictions that incorporate mechanistic priors. Abstract: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
[111] Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong,Jiaer Xia,Sifeng Shang,Kaiyang Zhou
Main category: cs.CV
TL;DR: HumbleBench is introduced as a new hallucination benchmark for evaluating multimodal large language models' ability to reject plausible but incorrect answers, providing a more realistic measure of reliability in safety-critical settings.
Details
Motivation: Hallucinations in MLLMs can lead to significant risks in real-world applications. Existing benchmarks focus on recognition accuracy but overlook the critical capability of recognizing when none of the provided options are correct. This work aims to evaluate MLLMs' ability to reject plausible but incorrect answers, reflecting epistemic humility. Method: HumbleBench was built from a panoptic scene graph dataset using fine-grained annotations to extract ground-truth entities and relations. Multiple-choice questions were generated using GPT-4-Turbo, followed by a manual filtering process. Each question included a 'None of the above' option to evaluate the models' ability to identify when no provided answer is valid. Result: The authors evaluated various state-of-the-art MLLMs, including both general-purpose and specialized reasoning models, on HumbleBench, sharing valuable findings and insights with the community regarding MLLM reliability. Conclusion: HumbleBench provides a more realistic measure of MLLM reliability in safety-critical settings by evaluating the models' ability to reject explicit false options, addressing a gap in current evaluation suites. Abstract: Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. 
Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
[112] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Zhiyuan Yan,Kaiqing Lin,Zongjian Li,Junyan Ye,Hui Han,Zhendong Wang,Hao Liu,Bin Lin,Hao Li,Xue Xu,Xinyan Xiao,Jingdong Wang,Haifeng Wang,Li Yuan
Main category: cs.CV
TL;DR: This paper proposes UAE, a new framework for unified multimodal learning that enhances understanding and generation processes through reconstruction fidelity, leading to improved performance in bidirectional information flow.
Details
Motivation: The motivation is to develop a more coherent and unified approach to multimodal learning by integrating understanding and generation processes through reconstruction fidelity. Method: The method involves the use of an Auto-Encoder framework called UAE, which includes pre-training a decoder with image captions and applying reinforcement learning through Unified-GRPO to optimize both encoder and decoder performance. Result: The result demonstrates that the encoder autonomously produces more descriptive captions, and the decoder shows improved ability to understand these descriptions, achieving high reconstruction fidelity. Conclusion: The study concludes that the proposed UAE framework effectively enhances bidirectional information flow between understanding and generation processes, leading to improved performance in multimodal learning tasks. Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. 
We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
[113] Geometric Neural Distance Fields for Learning Human Motion Priors
Zhengdi Yu,Simone Foti,Linguang Zhang,Amy Zhao,Cem Keskin,Stefanos Zafeiriou,Tolga Birdal
Main category: cs.CV
TL;DR: Neural Riemannian Motion Fields (NRMF) is a new method for generating realistic 3D human motion by modeling motion dynamics using neural distance fields and geometric integration techniques.
Details
Motivation: Current methods for 3D human motion recovery often lack temporal consistency and physical plausibility. NRMF aims to overcome these limitations with a novel higher-order motion prior. Method: NRMF models human motion using neural distance fields on the product space of joint rotations, angular velocities, and accelerations. It uses an adaptive-step hybrid algorithm and a geometric integrator for motion projection and generation. Result: NRMF achieves significant gains in performance, generalizing across input modalities and tasks such as denoising, motion in-betweening, and fitting to partial 2D/3D observations. Conclusion: NRMF provides a novel and rigorous framework for 3D human motion recovery, showing significant improvements in robustness, temporal consistency, and physical plausibility. Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. 
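To illustrate what integrating on the rotation manifold can look like, the following is a minimal sketch of a generic Lie-group Euler step on SO(3). It is not NRMF's integrator, whose details are not given here; the function names, the fixed step size, and the use of body-frame angular velocity are assumptions made for the example. The rotation is updated through the exponential map (Rodrigues' formula), so it stays a valid rotation matrix instead of drifting off the manifold as a naive additive update would.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]


def so3_exp(w):
    """Rodrigues' formula: exponential map from a rotation vector to SO(3)."""
    theta = math.sqrt(sum(c * c for c in w))
    k = hat(w)
    k2 = matmul3(k, k)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # leading Taylor terms near zero rotation
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j]
             for j in range(3)] for i in range(3)]


def rollout(rotation, omega, alpha, dt, steps):
    """Roll out a joint rotation from angular velocity and angular acceleration.

    rotation: 3x3 initial rotation; omega: body-frame angular velocity (rad/s);
    alpha: constant angular acceleration (rad/s^2), an assumption for this sketch.
    """
    trajectory = [rotation]
    for _ in range(steps):
        rotation = matmul3(rotation, so3_exp([dt * c for c in omega]))
        omega = [w + dt * a for w, a in zip(omega, alpha)]
        trajectory.append(rotation)
    return trajectory
```

With zero acceleration and a constant angular velocity of pi/2 rad/s about the z axis, ten steps of 0.1 s compose to a quarter turn about z, and every intermediate matrix remains orthonormal up to floating-point error.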
Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
[114] Locality in Image Diffusion Models Emerges from Data Statistics
Artem Lukoianov,Chenyang Yuan,Justin Solomon,Vincent Sitzmann
Main category: cs.CV
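TL;DR: This paper identifies the source of locality in deep diffusion models and proposes a new analytical denoiser that better matches the predictions of deep diffusion models.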
Details
Motivation: In existing diffusion model research, the optimal denoiser fails to reproduce the behavior of deep diffusion models; this paper aims to identify the source of locality in deep diffusion models and to design a better analytical denoiser. Method: The authors show that an optimal parametric linear denoiser exhibits locality similar to that of deep neural denoisers, and demonstrate both theoretically and experimentally that this locality arises directly from pixel correlations in natural image datasets. Result: Locality in deep diffusion models stems from the statistical properties of the dataset rather than the inductive bias of the network architecture; based on this, the authors propose an analytical denoiser that outperforms existing methods. Conclusion: Locality in deep diffusion models originates from pixel correlations in natural image datasets rather than from the inductive bias of convolutional neural networks, and this insight enables an analytical denoiser that better matches the scores predicted by deep diffusion models. Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
[115] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang,Yufeng Yuan,Rujie Zheng,Youtian Lin,Jian Gao,Lin-Zhuo Chen,Yajie Bao,Yi Zhang,Chang Zeng,Yanxi Zhou,Xiaoxiao Long,Hao Zhu,Zhaoxiang Zhang,Xun Cao,Yao Yao
Main category: cs.CV
TL;DR: SpatialVID is a large-scale, high-quality dataset with diverse scenes and rich annotations, designed to improve spatial intelligence models for real-world applications.