cs.CL

[1] Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC

Alex Clay,Ernesto Jiménez-Ruiz,Pranava Madhyastha

Main category: cs.CL

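TL;DR: This paper explores strategies for improving LLM output quality in constrained settings, finding that additional information helps generation, that LLMs can effectively filter out low-quality triples, and that parsing strategies must trade off flexibility against consistency.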

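Details Motivation: In constrained situations such as the 2025 LM-KBC challenge, conventional methods for improving LLM output quality (such as RAG and fine-tuning) are restricted, so other ways of improving output quality need to be studied. Method: The paper studies three facets of the triple completion task (generation, quality assurance, and LLM response parsing) and analyzes their effect in a constrained setting. Result: In this constrained setting, introducing additional information improves generation quality, LLMs perform well at filtering low-quality triples, and the tradeoff between flexibility and consistency in response parsing depends on the specific setting. Conclusion: In constrained settings, additional information can improve generation quality, LLMs can effectively filter low-quality triples, and the tradeoff between flexibility and consistency in response parsing is setting dependent. Abstract: RAG and fine-tuning are prevalent strategies for improving the quality of LLM outputs. However, in constrained situations, such as that of the 2025 LM-KBC challenge, such techniques are restricted. In this work we investigate three facets of the triple completion task: generation, quality assurance, and LLM response parsing. Our work finds that in this constrained setting: additional information improves generation quality, LLMs can be effective at filtering poor quality triples, and the tradeoff between flexibility and consistency with LLM response parsing is setting dependent.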

[2] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach

Imene Kolli,Ario Saeid Vaghefi,Chiara Colesanti Senni,Shantam Raj,Markus Leippold

Main category: cs.CL

TL;DR: This paper proposes an AI-assisted Retrieval-Augmented Generation (RAG) framework to automate and accelerate the analysis of corporate climate policy engagement, showing that while automation improves efficiency, human oversight remains crucial for accuracy.

Details Motivation: The motivation stems from the limitations of the manual assessment process in InfluenceMap's LobbyMap Platform, which is time-consuming, labor-intensive, and prone to human error. The need for automation to improve efficiency and accuracy in monitoring corporate climate policy engagement drives this study. Method: The research employs a Retrieval-Augmented Generation (RAG) framework combined with layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies to automate the extraction and classification of evidence from multilingual corporate documents. Result: The proposed AI-assisted framework demonstrates effectiveness in accelerating the extraction of relevant evidence from large-scale textual data. The best performance was achieved through a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies. Conclusion: The study concludes that the AI-assisted RAG framework significantly accelerates evidence extraction from corporate documents, but nuanced analysis still requires human expertise to ensure accuracy, suggesting a hybrid approach where technology supports rather than replaces human judgment. Abstract: InfluenceMap's LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity's support or opposition to science-based policy pathways for achieving the Paris Agreement's goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data.
Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.

[3] Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings

Jinsong Chen

Main category: cs.CL

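TL;DR: This study proposes a new method that combines large language models with psychometric analysis, using contextual embeddings to transform textual data into response data and applying factor analysis to uncover latent knowledge structures.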

Details Motivation: Traditional psychometric methods struggle to handle textual data, while advances in large language models and contextual embeddings make it possible to transform textual data into response data suitable for psychometric analysis. Method: The method has two stages: the first uses natural language processing techniques and encoder-based transformer models to identify common keywords and generate contextual scores; the second applies several types of factor analysis (including exploratory and bifactor models) to extract latent factors, determine factor correlations, and identify the keywords most strongly associated with each factor. Result: Experiments show that the method can uncover latent knowledge dimensions and patterns in textual data from the Wiki STEM corpus, indicating potential for applications in fields such as education, psychology, and law. Conclusion: The study proposes a new method for psychometric analysis of textual data with large language models, using contextual embeddings to generate contextual scores and converting them into response data suitable for psychometric analysis. Abstract: This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder-based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. Applied to the Wiki STEM corpus, our experimental results demonstrate the method's potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.

[4] BRoverbs -- Measuring how much LLMs understand Portuguese proverbs

Thales Sales Almeida,Giovana Kerche Bonás,João Guilherme Alves Santos

Main category: cs.CL

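TL;DR: This paper introduces BRoverbs, a new dataset for evaluating Portuguese-language LLMs that aims to advance regionally informed benchmarking through Brazilian proverbs.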

Details Motivation: Existing Portuguese evaluations are limited and rely mainly on translated datasets, which cannot fully capture linguistic nuances or cultural references. Method: The paper introduces a new dataset, BRoverbs, for evaluating LLM performance through Brazilian proverbs. Result: Proverbs are a rich linguistic resource that embodies cultural wisdom, figurative expressions, and complex syntactic structures, challenging models' comprehension of regional expressions. Conclusion: BRoverbs provides a new evaluation tool for Portuguese-language LLMs and helps advance regionally informed benchmarking. Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge a model's comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.

[5] Can Vision-Language Models Solve Visual Math Equations?

Monjoy Narayan Choudhury,Junling Wang,Yifan Hou,Mrinmaya Sachan

Main category: cs.CL

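TL;DR: This paper studies the limitations of vision-language models on visual equation solving, finding that coefficient counting and symbolic reasoning are the main bottlenecks, revealing key weaknesses of current models, and pointing to directions for future improvement.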

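Details Motivation: Although vision-language models perform strongly in visual understanding and language-based reasoning, they still fall short on tasks requiring integrated perception and symbolic computation. The study aims to expose the bottlenecks these models face in visual equation solving and to explore directions for improvement. Method: The limitations of vision-language models are studied through a visual equation solving task, which is decomposed into coefficient counting and variable recognition, with model performance analyzed at each step. Result: Vision-language models perform well on textual equations but poorly on visual equations. Coefficient counting is the primary bottleneck even when variable recognition is accurate. In addition, composing recognition and reasoning introduces further errors, and as equation complexity increases, symbolic reasoning itself becomes a limiting factor. Conclusion: Current vision-language models still have significant limitations on tasks requiring integrated perception and symbolic computation; in visual equation solving in particular, coefficient counting and symbolic reasoning are the main bottlenecks. The study reveals key weaknesses of existing models and points to directions for future improvement. Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.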

[6] Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM's Willingness to Re-engage in Conversation

Thomas Manuel Rost,Martina Figlia,Bernd Wallraff

Main category: cs.CL

TL;DR: SPICE is a new tool for evaluating Large Language Models' willingness to re-engage with users based on short transcripts, showing strong discrimination by user tone and providing a distinct relational signal of model dispositions.

Details Motivation: The authors aimed to introduce and evaluate SPICE, a diagnostic signal for assessing a Large Language Model's willingness to re-engage with a user based on a short transcript, to complement existing metrics with a direct relational signal of a model's state. Method: The study evaluated SPICE by testing four open-weight chat models across four framing conditions using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, resulting in 480 trials. Multiple statistical tests were employed, including Rao-Scott adjustment and cluster permutation tests. Result: Friendly interactions led to a high preference to continue engagement (97.5% YES), abusive interactions led to a strong preference to discontinue (17.9% YES), and unclear interactions fell in between (60.4% YES). SPICE was shown to provide a distinct signal from abuse classification, and an exploratory analysis revealed significant interaction effects under ambiguity. Conclusion: SPICE is a valid and reliable tool for auditing model dispositions, providing a direct relational signal of a model's state and complementing existing metrics. Abstract: We introduce and evaluate Stated Preference for Interaction and Continued Engagement (SPICE), a simple diagnostic signal elicited by asking a Large Language Model a YES or NO question about its willingness to re-engage with a user's behavior after reviewing a short transcript. In a study using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, we tested four open-weight chat models across four framing conditions, resulting in 480 trials. Our findings show that SPICE sharply discriminates by user tone. Friendly interactions yielded a near-unanimous preference to continue (97.5% YES), while abusive interactions yielded a strong preference to discontinue (17.9% YES), with unclear interactions falling in between (60.4% YES). 
This core association remains decisive under multiple dependence-aware statistical tests, including Rao-Scott adjustment and cluster permutation tests. Furthermore, we demonstrate that SPICE provides a distinct signal from abuse classification. In trials where a model failed to identify abuse, it still overwhelmingly stated a preference not to continue the interaction (81% of the time). An exploratory analysis also reveals a significant interaction effect: a preamble describing the study context significantly impacts SPICE under ambiguity, but only when transcripts are presented as a single block of text rather than a multi-turn chat. The results validate SPICE as a robust, low-overhead, and reproducible tool for auditing model dispositions, complementing existing metrics by offering a direct, relational signal of a model's state. All stimuli, code, and analysis scripts are released to support replication.
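
The trial count quoted above follows from the fully crossed design. As a minimal sketch (the integer indices stand in for the four models and four framing conditions, whose names are not given in this listing and are therefore placeholders), the design can be enumerated as:

```python
from itertools import product

TONES = ["friendly", "unclear", "abusive"]  # the three tones named in the abstract
N_INTERACTIONS = 10   # transcripts per tone
N_MODELS = 4          # placeholder indices; model names are not listed here
N_FRAMINGS = 4        # placeholder indices; framing names are not listed here

# One trial per (model, framing, tone, interaction) combination.
trials = list(product(range(N_MODELS), range(N_FRAMINGS), TONES, range(N_INTERACTIONS)))

# Every tone receives the same number of trials in a fully crossed design.
trials_per_tone = sum(1 for _, _, tone, _ in trials if tone == "friendly")

print(len(trials), trials_per_tone)  # 480 160
```

Because the same transcript is shown to several models under several framings, trials that share a transcript are not independent, which is consistent with the abstract's use of dependence-aware tests.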

[7] Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

Piyush Pant

Main category: cs.CL

TL;DR: This study explores the effectiveness of alignment techniques (SFT, DPO, and SFT+DPO) on the OPT-350M language model, concluding that the combined SFT+DPO approach best enhances safety and helpfulness.

Details Motivation: The research aims to investigate how different alignment techniques affect the safety and helpfulness of language models, particularly addressing the challenges of noisy data, limited GPU resources, and training constraints. Method: The research employs SFT, DPO, and a combined SFT+DPO strategy to train models using the Anthropic Helpful-Harmless RLHF dataset. Evaluation is based on Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS). Result: Results indicate that SFT outperforms DPO, but the combined SFT+DPO approach achieves the best performance across all metrics, showing the complementary benefits of combining both techniques. Conclusion: The study concludes that the combined SFT+DPO approach outperforms individual SFT and DPO techniques in enhancing the safety and helpfulness of the OPT-350M model. Abstract: This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.

[8] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Zhongqiu Li,Shiquan Wang,Ruiyu Fang,Mengjiao Bao,Zhenhe Wu,Shuangyong Song,Yongxiang Li,Zhongjiang He

Main category: cs.CL

TL;DR: This paper proposes MR-UIE, a method that combines reinforcement learning with multi-perspective reasoning to enhance information extraction by transforming LLMs into active reasoners capable of handling complex tasks.

Details Motivation: The motivation is to improve the performance of large language models (LLMs) in universal information extraction (UIE), particularly for structured output scenarios that require multi-step reasoning. Method: The method involves integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction tasks. Result: Experiments showed that MR-UIE consistently improved extraction accuracy across domains and outperformed state-of-the-art methods on several datasets. Conclusion: MR-UIE enhances the generalization ability of LLMs in information extraction tasks, especially in complex scenarios. Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model's generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

[9] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla

Nishat Raihan,Antonios Anastasopoulos,Marcos Zampieri

Main category: cs.CL

TL;DR: This paper presents the first dedicated Bangla Code Large Language Models (TigerCoder-family) and associated datasets/benchmarks, achieving significant performance improvements in Bangla code generation.

Details Motivation: Bangla, despite being the fifth most spoken language, is underrepresented in Large Language Models (LLMs), particularly in code generation due to the scarcity of high-quality training data. Method: The authors introduced the TigerCoder-family of Code LLMs and created a comprehensive Bangla code instruction dataset and an evaluation benchmark called MBPP-Bangla for model training and assessment. Result: The TigerCoder-family of Code LLMs achieved significant performance improvements (~11-18% gains at Pass@1) over existing multilingual and general-purpose Bangla LLMs. Conclusion: The research concludes that curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages like Bangla, enabling more effective code generation through dedicated Code LLMs. Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder-family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
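
For readers unfamiliar with the Pass@1 metric reported above, the following is a minimal sketch of the widely used unbiased pass@k estimator for code generation. Whether the TigerCoder evaluation uses exactly this estimator, and how many samples it draws per problem, is an assumption; the function name and example numbers are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: evaluation budget. Returns the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the fraction of correct samples, c / n.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score is then the mean of this per-problem value over all problems.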

[10] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia

Sophia Maria

Main category: cs.CL

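TL;DR: Compass-v3 is an MoE model designed for Southeast Asian e-commerce that combines hardware optimizations, a mixed-training strategy, and a new alignment method, OTPO; it performs strongly in the e-commerce domain and has strong multilingual capability.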

Details Motivation: Large language models perform poorly on specialized tasks, and e-commerce data is noisy, multilingual, and highly dynamic, so a dedicated model is needed. Method: The paper proposes Compass-v3, a vertical-domain MoE model, combined with hardware optimizations, a mixed-training strategy, and a new alignment method, OTPO. Result: Compass-v3 achieves state-of-the-art e-commerce performance while also performing well on low-resource Southeast Asian languages and Portuguese. Conclusion: Compass-v3 delivers strong e-commerce performance, surpassing existing large models, with strong multilingual capability and competitive results on general benchmarks. Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations (such as intra-node expert parallelism and a customized memcpy operator) to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee's industrial-scale e-commerce platform and is gradually replacing OpenAI's traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.

[11] Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus

Liqun He,Jiaqi Xu

Main category: cs.CL

TL;DR: This research shows that generative AI, specifically GPT-4, can effectively classify tutors' Dialogue Acts with high accuracy, offering a faster and more accessible alternative to manual coding while emphasizing the importance of ethical and responsible AI use.

Details Motivation: The study aims to automate the classification of tutors' Dialogue Acts to reduce the time and effort required by manual coding, making educational dialogue analysis more efficient. Method: This study used the open-source CIMA corpus with pre-annotated Dialogue Act categories and tested GPT-3.5-turbo and GPT-4 models with tailored prompts for classification performance. Model outputs were evaluated using accuracy, weighted F1-score, and Cohen's Kappa. Result: GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and showing substantial agreement with human annotations. Conclusion: Generative AI, particularly GPT-4, has strong potential to provide an efficient and accessible approach to Dialogue Act classification, reducing the need for traditional manual coding. However, ethical considerations and responsible research practices are crucial when using such technology. Abstract: This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. 
Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
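
The three evaluation measures named above (accuracy, weighted F1-score, and Cohen's Kappa) can be computed from paired gold and predicted labels. The following is a minimal standard-library sketch of the textbook definitions, not the authors' released script; the labels "A" and "B" are toy placeholders rather than the CIMA Dialogue Act categories.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights equal to each class's gold support."""
    total = 0.0
    for label, support in Counter(gold).items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = support - tp
        denom = 2 * tp + fp + fn
        total += support * (2 * tp / denom if denom else 0.0)
    return total / len(gold)

def cohens_kappa(gold, pred):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(gold)
    p_observed = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_expected = sum(gold_counts[l] * pred_counts[l] for l in gold_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy data with placeholder labels.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
print(accuracy(gold, pred), cohens_kappa(gold, pred))  # 0.75 0.5
```

Kappa is lower than raw accuracy here because half of the observed agreement would be expected by chance given the label distributions.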

[12] ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking

Phuong-Nam Dang,Kieu-Linh Nguyen,Thanh-Hieu Pham

Main category: cs.CL

TL;DR: This paper introduces ViRanker, a cross-encoder reranking model for Vietnamese, which addresses the lack of competitive rerankers for low-resource languages.

Details Motivation: The motivation for this study is the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. Method: ViRanker is built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer. It was trained on an 8 GB curated corpus and fine-tuned using hybrid hard-negative sampling. Result: ViRanker achieves strong early-rank accuracy on the MMARCO-VI benchmark, surpassing multilingual baselines and competing closely with PhoRanker. Conclusion: ViRanker is a cross-encoder reranking model designed for Vietnamese that demonstrates strong performance and supports reproducibility and adoption in real-world retrieval systems. This study also highlights how architectural adaptation and data curation can benefit reranking in other underrepresented languages. Abstract: This paper presents ViRanker, a cross-encoder reranking model tailored to the Vietnamese language. Built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, ViRanker addresses the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Evaluated on the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. By releasing the model openly on Hugging Face, we aim to support reproducibility and encourage wider adoption in real-world retrieval systems. Beyond Vietnamese, this study illustrates how careful architectural adaptation and data curation can advance reranking in other underrepresented languages.

[13] LITcoder: A General-Purpose Library for Building and Comparing Encoding Models

Taha Binhuraib,Ruimin Gao,Anna A. Ivanova

Main category: cs.CL

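TL;DR: LITcoder is an open-source library for building and benchmarking neural encoding models; its standardized tools and modular pipeline simplify model development and promote methodological rigor and better model performance.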

Details Motivation: To lower the technical barriers to implementing encoding models, facilitate systematic comparisons, and accelerate the development of high-quality predictive models, the authors introduce LITcoder, a flexible open-source library. Method: LITcoder provides a modular pipeline covering a range of methodological choices: aligning continuous stimuli with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the model's predictive performance. Result: With standardized tools and a modular pipeline, LITcoder makes it easy to build, compare, and extend encoding models, and its scalability and versatility are demonstrated on several story listening datasets. Conclusion: LITcoder is an open-source library that aims to lower the technical barriers to building and benchmarking neural encoding models, facilitate systematic comparisons across models and datasets, strengthen methodological rigor, and accelerate the development of high-quality, high-performance predictive models of brain activity. Abstract: We introduce LITcoder, an open-source library for building and benchmarking neural encoding models. Designed as a flexible backend, LITcoder provides standardized tools for aligning continuous stimuli (e.g., text and speech) with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the predictive performance of the resulting model on held-out data. The library implements a modular pipeline covering a wide array of methodological design choices, so researchers can easily compose, compare, and extend encoding models without reinventing core infrastructure. Such choices include brain datasets, brain regions, stimulus features (both neural-net-based and control, such as word rate), downsampling approaches, and many others. In addition, the library provides built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases (W&B). We demonstrate the scalability and versatility of our framework by fitting a range of encoding models to three story listening datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore the methodological choices critical for building encoding models for continuous fMRI data, illustrating the importance of accounting for all tokens in a TR scan (as opposed to just taking the last one, even when contextualized), incorporating hemodynamic lag effects, using train-test splits that minimize information leakage, and accounting for head motion effects on encoding model predictivity.
Overall, LITcoder lowers technical barriers to encoding model implementation, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity. Project page: https://litcoder-brain.github.io

[14] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing

Zhiyue Liu,Fanrong Ma,Xin Ling

Main category: cs.CL

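TL;DR: This paper proposes a new counterfactual-enhanced debiasing framework for target-oriented multimodal sentiment classification. Using a counterfactual data augmentation strategy and an adaptive debiasing contrastive learning mechanism, it reduces spurious correlations between text features and output labels and outperforms existing methods on several benchmark datasets.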

Details Motivation: Existing multimodal sentiment classification methods often over-rely on textual content and fail to consider word-level contextual biases in the dataset, which leads to spurious correlations between text features and output labels. Method: The paper introduces a new counterfactual-enhanced debiasing framework comprising a counterfactual data augmentation strategy and an adaptive debiasing contrastive learning mechanism. Result: The method outperforms existing methods on several benchmark datasets. Conclusion: Experimental results show that the proposed counterfactual-enhanced debiasing framework outperforms state-of-the-art baselines on several benchmark datasets. Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model's attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.

[15] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Yuhao Zhang,Yuhao Du,Zhanchen Dai,Xiangnan Ma,Kaiqi Kou,Benyou Wang,Haizhou Li

Main category: cs.CL

TL;DR: EchoX improves SLLMs by bridging the acoustic-semantic gap, preserving reasoning abilities, and achieving strong results on question-answering benchmarks.

Details Motivation: Current training paradigms for speech-to-speech large language models (SLLMs) fail to bridge the acoustic-semantic gap, leading to degradation in knowledge and reasoning capabilities. Method: EchoX uses semantic representations and dynamically generates speech training targets to integrate both acoustic and semantic learning. Result: EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks with about six thousand hours of training data. Conclusion: EchoX successfully bridges the acoustic-semantic gap in SLLMs, preserving their reasoning capabilities while achieving advanced performance on knowledge-based question-answering benchmarks. Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.

[16] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition

Chin Yuen Kwok,Jia Qi Yip

Main category: cs.CL

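TL;DR: This paper proposes adapting ASR models to look ahead and predict multiple steps at once, which avoids the revocation step and improves rare word recognition.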

Details Motivation: Contextual biasing improves rare word recognition in ASR models by prioritizing rare words during decoding, but the common Trie-based biasing approach is computationally expensive when it revokes bonuses given to partial hypotheses. Method: The ASR model is fine-tuned on only 10 hours of synthetic data to look ahead and predict multiple steps, so that it can better estimate whether a partial hypothesis will lead to the full rare word. Result: The method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%. Conclusion: The proposed method overcomes the limitations of conventional Trie-based biasing and substantially improves rare word recognition. Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives "bonus scores" to partial hypotheses (e.g. "Bon") that may lead to the generation of the rare word (e.g. "Bonham"). If the full word ("Bonham") isn't ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
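
The bonus-and-revocation behavior described in the abstract can be illustrated with a minimal sketch. It shows the conventional Trie-based baseline, not the paper's K-step prediction method. The character-level trie, the `BONUS` value, and the names `build_trie` and `score_deltas` are assumptions made for illustration; a real ASR decoder would work on subword tokens inside beam search.

```python
from typing import Dict, Iterable, List

BONUS = 1.0   # assumed per-character bonus; the abstract gives no value
END = ""      # end-of-word marker; never equal to a single character


def build_trie(words: Iterable[str]) -> Dict:
    """Build a character-level trie over the rare-word list."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def score_deltas(hypothesis: str, trie: Dict) -> List[float]:
    """Per-character score adjustments for one hypothesis.

    A character that extends a trie path earns BONUS. If the path breaks
    before a full rare word is reached, the bonuses collected on that path
    are revoked with a single negative adjustment. Simplification: after a
    break, matching restarts at the next character.
    """
    deltas: List[float] = []
    node, pending = trie, 0.0
    for ch in hypothesis:
        if ch in node:
            node = node[ch]
            pending += BONUS
            deltas.append(BONUS)
            if END in node:  # full rare word matched: bonuses are kept
                node, pending = trie, 0.0
        else:
            deltas.append(-pending)  # revocation step
            node, pending = trie, 0.0
    return deltas


trie = build_trie(["Bonham"])
kept = sum(score_deltas("Bonham", trie))    # all six bonuses survive
revoked = sum(score_deltas("Bonus", trie))  # the "Bon" bonuses are taken back
```

In this toy version the revocation is one subtraction, but in beam search every live hypothesis must carry its pending bonus, which is the bookkeeping the look-ahead approach is meant to avoid.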

[17] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function

Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng

Main category: cs.CL

TL;DR: An improved contextual biasing approach with a keyword-aware loss function significantly enhances rare word recognition in ASR models, achieving a notable reduction in word error rate.

Details Motivation: Improving rare word recognition in ASR models while addressing overfitting issues caused by artifacts in synthetic audio data. Method: The method involves adapting ASR models to synthetic data and introducing a contextual biasing module trained with a keyword-aware loss function that includes a masked cross-entropy term and a binary classification term. Result: Adapting Whisper to 10 hours of synthetic data using the proposed method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%. Conclusion: The enhanced TCPGen-based contextual biasing approach with a keyword-aware loss function effectively improves rare word recognition in ASR models, as demonstrated by the significant reduction in word error rate on the NSC Part 2 test set. Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
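
The two loss terms named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `keyword_aware_loss`, the averaging, and the `weight` that balances the two terms are hypothetical, and the abstract does not say how the terms are combined with the standard training loss.

```python
import math
from typing import Sequence


def keyword_aware_loss(
    token_probs: Sequence[float],     # probability the model gives the correct token at each position
    is_biased: Sequence[int],         # 1 if the position belongs to a biased (rare) word, else 0
    position_probs: Sequence[float],  # predicted probability that the position is a biased word
    weight: float = 1.0,              # assumed weighting between the two terms
) -> float:
    eps = 1e-12  # guards against log(0)

    # Masked cross-entropy: only biased-word positions contribute.
    n_biased = sum(is_biased)
    masked_ce = 0.0
    if n_biased:
        masked_ce = -sum(
            math.log(p + eps) for p, m in zip(token_probs, is_biased) if m
        ) / n_biased

    # Binary classification term: detect which positions are biased words.
    bce = -sum(
        m * math.log(q + eps) + (1 - m) * math.log(1 - q + eps)
        for q, m in zip(position_probs, is_biased)
    ) / len(is_biased)

    return masked_ce + weight * bce


mask = [0, 1, 1]
perfect = keyword_aware_loss([0.5, 1.0, 1.0], mask, [0.0, 1.0, 1.0])
poor = keyword_aware_loss([0.5, 0.2, 0.2], mask, [0.5, 0.5, 0.5])
```

Because of the mask, the token probability at the non-biased position (0.5 above) has no effect on the first term; only the binary term looks at every position.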

[18] GmSLM: Generative Marmoset Spoken Language Modeling

Talia Sternberg,Michael London,David Omer,Yossi Adi

Main category: cs.CL

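TL;DR: This paper presents GmSLM, a spoken language model pipeline designed for marmoset vocal communication, shows its advantages in acoustic matching and downstream tasks, and highlights its potential for neuroscience, bioacoustics, and evolutionary biology.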

Details Motivation: Marmoset vocal communication is complex and shares features with human speech. Studying it offers a unique opportunity to investigate the neural basis of language, especially given how difficult it is to obtain brain activity data in human speech and language research. Method: The study designs an optimized spoken language model pipeline for marmosets (GmSLM) and evaluates it using zero-shot evaluation metrics together with weakly labeled conversational data. Result: Vocalizations generated by GmSLM are acoustically close to real samples, and the model performs well on downstream tasks; despite being fully unsupervised, it effectively distinguishes real from artificial conversations. Conclusion: GmSLM provides a practical framework for studying marmoset vocal communication and may support further investigation of the neural basis of vocal communication, with broad potential in neuroscience, bioacoustics, and evolutionary biology. Abstract: Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates' vocal communication is entirely innate, and show similar features of human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations and may support further investigations of the neural basis of vocal communication and provides a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.

[19] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling

Wenhao Li,Bangcheng Sun,Weihao Ye,Tianyi Zhang,Daohai Yu,Fei Chao,Rongrong Ji

Main category: cs.CL

TL;DR: This paper proposes CCF, a novel context compression framework for efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while reducing input redundancy.

Details Motivation: Scaling language models to longer contexts is essential but naïve context extension imposes significant computational and memory burdens. Method: The paper proposes CCF, a novel context compression framework that integrates segment-wise semantic aggregation with key-value memory encoding. Result: CCF achieves competitive perplexity under high compression ratios and significantly improves throughput and memory efficiency compared to existing approaches. Conclusion: Structured compression has potential for scalable and effective long-context language modeling. Abstract: Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.

[20] Reading Between the Lines: Classifying Resume Seniority with Large Language Models

Matan Cohen,Shira Shani,Eden Menahem,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CL

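TL;DR: This study evaluates how effectively large language models can automate seniority assessment from resumes and introduces a new hybrid dataset to test how well models detect overstated descriptions.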

Details Motivation: Accurately assessing candidate seniority is critical in hiring, but overstated experience and ambiguous self-presentation in resumes make it a challenging task. Method: The researchers evaluate large language models, including fine-tuned BERT architectures, on a hybrid dataset of real-world resumes and synthetically generated hard examples. Result: The findings indicate that large language models can detect subtle linguistic cues associated with seniority inflation and implicit expertise, pointing to directions for improving AI-driven candidate evaluation systems. Conclusion: Large language models show potential for automating seniority assessment from resumes, and the study stresses the importance of mitigating bias introduced by self-promotional language. Abstract: Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. In this study, we investigate the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising both real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Using the dataset, we evaluate the performance of Large Language Models in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language. The dataset is available for the research community at https://bit.ly/4mcTovt

[21] Agentic LLMs for Question Answering over Tabular Data

Rishit Tyagi,Mohit Gupta,Rahul Bouri

Main category: cs.CL

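TL;DR: This paper proposes a multi-stage, LLM-based natural-language-to-SQL approach to question answering over tabular data and achieves accuracy well above the baseline.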

Details Motivation: The diverse structure, size, and data types of tabular data challenge question answering systems, motivating methods that can accurately answer structured queries. Method: The paper uses a multi-stage pipeline of example selection, SQL query generation, answer extraction, verification, and iterative refinement, with large language models such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b generating the SQL queries. Result: The approach reaches 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, far above the baseline scores of 26% and 27%. Conclusion: The paper proposes a multi-stage, LLM-based NL-to-SQL pipeline for question answering over tabular data (Table QA) and demonstrates its effectiveness experimentally. Abstract: Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.
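
The generate, verify, and refine stages of such a pipeline can be sketched roughly as below. This is a simplified illustration rather than the authors' system: `answer_with_refinement` and `stub_generator` are hypothetical names, the stub stands in for an LLM call, example selection is omitted, and "verification" is reduced to checking that the query runs and returns rows.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def answer_with_refinement(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    max_rounds: int = 3,
) -> Optional[List[Tuple]]:
    """Generate SQL, execute it, and retry with feedback when it fails."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"SQL error: {exc}"  # passed back for the next attempt
            continue
        if rows:  # simplified verification: the query produced an answer
            return rows
        feedback = "query returned no rows"
    return None


def stub_generator(question: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: misspells a column first, then corrects it."""
    if feedback is None:
        return "SELECT MAX(prise) FROM products"
    return "SELECT MAX(price) FROM products"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [("pen", 1.5), ("book", 12.0)])
result = answer_with_refinement(conn, "What is the highest price?", stub_generator)
```

The first attempt fails with a "no such column" error, the error text becomes feedback, and the second attempt succeeds. A real system would also compare the extracted answer against the expected answer type.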

[22] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models

Grazia Sveva Ascione,Nicolò Tamagnone

Main category: cs.CL

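TL;DR: This paper uses weak supervision and semantic alignment to classify patents against the UN Sustainable Development Goals (SDGs), addressing the limited scalability and generalizability of existing methods.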

Details Motivation: Because no large labeled dataset exists, existing approaches to classifying patents by UN Sustainable Development Goal (SDG), such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. Method: The paper proposes a weakly supervised approach: it extracts structured concepts (functions, solutions, and applications) from patents and SDG papers, computes cross-domain similarity scores, and combines them with rank-based retrieval into a composite labeling function (LF), which is calibrated with a custom positive-only loss. Result: The approach produces a silver-standard, soft multi-label dataset mapping patents to SDGs, which enables training effective multi-label regression models. In internal validation it outperforms several baselines, including transformer-based models and a zero-shot LLM; in external validation its labels show greater thematic, cognitive, and organizational coherence than traditional technological classifications. Conclusion: Weak supervision and semantic alignment can enhance SDG classification at scale. Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models.
We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models, and zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.

[23] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Channdeth Sok,David Luz,Yacine Haddam

Main category: cs.CL

TL;DR: MetaRAG is a new framework for detecting hallucinations in RAG systems by verifying mutations of factoids against retrieved context, offering real-time, unsupervised, black-box testing with no need for ground truth or model access.

Details Motivation: LLMs often produce hallucinations that hinder their reliability, particularly in Retrieval-Augmented Generation (RAG) systems. Existing detection methods don't address RAG-specific challenges, prompting the need for MetaRAG. Method: MetaRAG uses a four-stage metamorphic testing framework: decomposing answers into factoids, generating mutations using synonym and antonym substitutions, verifying mutations against the context, and aggregating inconsistencies into a hallucination score. Result: Experiments on an enterprise dataset show MetaRAG's effectiveness in detecting hallucinations and enabling safe deployment of RAG-based agents. Conclusion: MetaRAG localization helps identify unsupported claims and enables system designers to set thresholds for sensitive queries, making it a promising approach for trustworthy deployment of RAG-based conversational agents. Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. 
The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
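
Stages (3) and (4) can be sketched as follows, under explicit assumptions. The abstract does not give the aggregation formula, so the per-factoid penalty (fraction of variants with an unexpected verdict) and the response-level score (maximum over factoids) are illustrative choices. `Mutation`, `toy_verify`, and the example sentences are hypothetical; a real verifier would run an entailment check against the retrieved context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Mutation:
    text: str
    kind: str  # "synonym" or "antonym"


# Expected verdict for each mutation kind, as stated in the abstract.
EXPECTED = {"synonym": "entailed", "antonym": "contradicted"}


def factoid_penalty(
    mutations: List[Mutation], context: str, verify: Callable[[str, str], str]
) -> float:
    """Fraction of variants whose verdict differs from the expected one."""
    if not mutations:
        return 0.0
    misses = sum(1 for m in mutations if verify(context, m.text) != EXPECTED[m.kind])
    return misses / len(mutations)


def hallucination_score(
    factoid_mutations: Dict[str, List[Mutation]],
    context: str,
    verify: Callable[[str, str], str],
) -> Tuple[float, Dict[str, float]]:
    """Response-level score (assumed: worst factoid) plus per-factoid penalties."""
    penalties = {
        factoid: factoid_penalty(muts, context, verify)
        for factoid, muts in factoid_mutations.items()
    }
    return max(penalties.values(), default=0.0), penalties


# Toy verifier: a lookup table standing in for an entailment model.
TOY_VERDICTS = {
    "The office starts at 9 am.": "entailed",
    "The office closes at 9 am.": "contradicted",
}


def toy_verify(context: str, statement: str) -> str:
    return TOY_VERDICTS.get(statement, "neutral")


score, per_factoid = hallucination_score(
    {
        "The office opens at 9 am.": [
            Mutation("The office starts at 9 am.", "synonym"),
            Mutation("The office closes at 9 am.", "antonym"),
        ],
        "Refunds are instant.": [
            Mutation("Refunds are immediate.", "synonym"),
            Mutation("Refunds are slow.", "antonym"),
        ],
    },
    context="The office opens at 9 am.",
    verify=toy_verify,
)
```

Keeping the per-factoid penalties alongside the overall score mirrors the span-level localization the abstract describes: the unsupported refund claim is flagged while the supported opening-hours claim is not.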

[24] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research

Molly R Petersen,Claire E Stevenson,Lonneke van der Plas

Main category: cs.CL

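TL;DR: This paper examines the role of analogical reasoning in human cognition and connects it to natural language processing (NLP) research, with the aim of improving relational understanding in text.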

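Details Motivation: Analogical reasoning is an essential part of human cognition, but the NLP field generally does not view the underlying processes through a cognitive lens. Method: The paper summarizes key theories from the cognitive science literature on the processes underlying analogical reasoning and relates them to NLP research. Result: The notions behind analogical reasoning are relevant to several major challenges in NLP and may help optimize relational understanding in text. Conclusion: Connecting cognitive science theories of analogical reasoning with NLP research can help better optimize relational understanding in text. Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.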

[25] Hierarchical Bracketing Encodings Work for Dependency Graphs

Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares

Main category: cs.CL

TL;DR: This paper revisits hierarchical bracketing encodings for dependency graph parsing, reducing label space while maintaining structural integrity and achieving improved exact match accuracy.

Details Motivation: The study aims to address the limitations of existing graph linearizations by reducing label space and maintaining structural information for efficient and accurate dependency graph parsing. Method: The approach revisits hierarchical bracketing encodings, encoding dependency graphs as sequences to enable linear-time parsing with n tagging actions, while capturing reentrancies, cycles, and empty nodes. Result: The method demonstrated competitive performance on a multilingual and multi-formalism benchmark with consistent improvements in exact match accuracy over other approaches. Conclusion: The hierarchical bracketing encodings approach proves effective in representing dependency graphs with reduced label space while preserving structural information, showing competitive results and improvements in exact match accuracy. Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.

[26] GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models

Zhaohan Zhang,Ziquan Liu,Ioannis Patras

Main category: cs.CL

TL;DR: GrACE enables efficient and reliable confidence estimation for LLMs in real-time without additional sampling or auxiliary models, offering improved accuracy and reduced computational demands.

Details Motivation: Existing methods for assessing LLM reliability are either computationally expensive or poorly calibrated, limiting their practicality in high-stakes applications. Method: GrACE, a Generative Approach to Confidence Elicitation, uses a special token's embedding similarity to the last hidden state for real-time confidence expression and is fine-tuned for confidence calibration. Result: GrACE outperforms six competing methods in discriminative capacity and calibration for open-ended generation tasks and reduces test-time sample requirements while improving decision accuracy. Conclusion: GrACE is a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation. Abstract: Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with calibration targets associated with accuracy. Experiments with three LLMs and two benchmark datasets show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods without resorting to additional sampling or an auxiliary model. Moreover, we propose two strategies for improving test-time scaling based on confidence induced by GrACE. 
Experimental results show that using GrACE not only improves the accuracy of the final decision but also significantly reduces the number of required samples in the test-time scaling scheme, indicating the potential of GrACE as a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation.
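
The core mechanism, confidence read off as a similarity between the last hidden state and a special token's embedding, can be sketched in a few lines. The abstract does not specify the similarity function or how it is mapped to a probability, so the cosine similarity rescaled to [0, 1] below is purely an assumption, as are the function names and the toy vectors.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def confidence(last_hidden: Sequence[float], conf_token_embedding: Sequence[float]) -> float:
    """Assumed mapping: rescale cosine similarity from [-1, 1] to [0, 1]."""
    return 0.5 * (1.0 + cosine(last_hidden, conf_token_embedding))


aligned = confidence([1.0, 0.0], [1.0, 0.0])      # hidden state points at the token
orthogonal = confidence([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposed = confidence([1.0, 0.0], [-1.0, 0.0])     # opposite directions
```

Since the signal comes from a single forward pass, no extra sampling or auxiliary model is involved, which matches the efficiency claim in the abstract; calibration itself would come from the fine-tuning step, which this sketch does not cover.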

[27] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation

Lucie Poláková,Martin Popel,Věra Kloudová,Michal Novák,Mariia Anisimova,Jiří Balhar

Main category: cs.CL

TL;DR: The EdUKate project develops multilingual educational materials and a specialized Czech-Ukrainian machine translation system to support non-Czech-speaking students in Czech schools, with all resources freely accessible online.

Details Motivation: To address the need for multilingual educational materials in Czech primary and secondary schools, particularly for non-Czech-speaking students, and to advance machine translation in the educational domain. Method: The project involves the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with capabilities to handle formatted content like XML and PDF, along with technical and scientific terminology. It also includes a survey of Czech teachers to understand the needs of non-Czech-speaking students. Result: The project translated up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German, developed a specialized Czech-Ukrainian machine translation system, and implemented it on an educational web portal with all applications being made freely available. Conclusion: The EdUKate project successfully integrates digital education, linguistics, translation studies, and machine translation to create multilingual learning resources for Czech schools, with a focus on making all resulting applications freely accessible to students, educators, and researchers. Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country's largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. 
We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system's evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.

[28] Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs

Vadim Zadykian,Bruno Andrade,Haithem Afli

Main category: cs.CL

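TL;DR: This paper proposes a self-supervised hybrid architecture that combines sentence embeddings with knowledge graphs to improve job title matching in resume recommendation systems. A stratified evaluation shows that the model substantially outperforms baselines in the high semantic relatedness region.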

Details Motivation: The study addresses job title matching, a key challenge in resume recommendation systems where overlapping terms are often limited or misleading, by analyzing textual relationships that go beyond superficial lexical similarity. Method: It introduces a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific knowledge graphs, integrated via graph neural networks. It also performs a stratified evaluation by partitioning the Semantic Textual Relatedness (STR) score continuum into low, medium, and high relatedness regions. Result: Fine-tuned SBERT models augmented with knowledge graphs yield consistent improvements in the high-STR region, where RMSE is 25% lower than strong baselines. Conclusion: The study highlights the benefits of combining knowledge graphs with text embeddings and the importance of regional performance analysis for understanding model behavior. This approach reveals strengths and weaknesses hidden by global metrics and supports more targeted model selection for Human Resources systems where fairness, explainability, and contextual matching are essential. Abstract: Semantic Textual Relatedness (STR) captures nuanced relationships between texts that extend beyond superficial lexical similarity. In this study, we investigate STR in the context of job title matching - a key challenge in resume recommendation systems, where overlapping terms are often limited or misleading. We introduce a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to improve both semantic alignment and explainability. Unlike previous work that evaluated models on aggregate performance, our approach emphasizes data stratification by partitioning the STR score continuum into distinct regions: low, medium, and high semantic relatedness. This stratified evaluation enables a fine-grained analysis of model performance across semantically meaningful subspaces. We evaluate several embedding models, both with and without KG integration via graph neural networks. The results show that fine-tuned SBERT models augmented with KGs produce consistent improvements in the high-STR region, where the RMSE is reduced by 25% over strong baselines. Our findings highlight not only the benefits of combining KGs with text embeddings, but also the importance of regional performance analysis in understanding model behavior. This granular approach reveals strengths and weaknesses hidden by global metrics, and supports more targeted model selection for use in Human Resources (HR) systems and applications where fairness, explainability, and contextual matching are essential.
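
The stratified evaluation described in the abstract amounts to computing RMSE separately per relatedness region. A minimal sketch follows; the region boundaries (thirds of a [0, 1] score range), the function names, and the sample scores are assumptions, since the abstract does not state the thresholds used.

```python
import math
from typing import Dict, List, Sequence


def region(score: float, low: float = 1 / 3, high: float = 2 / 3) -> str:
    """Assign a gold STR score in [0, 1] to a region (thresholds are assumed)."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


def stratified_rmse(gold: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    """RMSE computed separately for each region of the gold score range."""
    squared: Dict[str, List[float]] = {}
    for g, p in zip(gold, pred):
        squared.setdefault(region(g), []).append((g - p) ** 2)
    return {name: math.sqrt(sum(v) / len(v)) for name, v in squared.items()}


gold = [0.1, 0.2, 0.5, 0.9, 0.8]
pred = [0.1, 0.4, 0.5, 0.7, 0.8]
per_region = stratified_rmse(gold, pred)
```

In this toy data the model is exact in the medium region but not in the low and high regions, a difference a single global RMSE would blur.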

[29] DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning

Daniil Ignatev,Nan Li,Hugh Mee Wong,Anh Dang,Shane Kaszefski Yaschuk

Main category: cs.CL

TL;DR: The paper presents approaches to predicting annotator-specific annotations and evaluates the potential of LDL methods for soft label predictions.

Details Motivation: The motivation is to explore effective methods for predicting annotator-specific annotations and to evaluate the potential of LDL methods in soft label predictions. Method: The paper explores two methods: in-context learning (ICL) with large language models and label distribution learning (LDL) methods with RoBERTa. Result: The result shows that ICL can effectively predict perspectivist annotations and that aggregating these predictions into soft labels yields competitive performance. Conclusion: The paper concludes that ICL can effectively predict perspectivist annotations and that LDL methods are promising for soft label predictions. Abstract: This system paper presents the DeMeVa team's approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
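
Aggregating annotator-specific predictions into a soft label, as in contribution (1), can be sketched as a normalized vote count. This is an illustrative assumption about the aggregation, not the team's documented procedure; the annotator IDs and label names are made up.

```python
from collections import Counter
from typing import Dict, Sequence


def soft_label(
    annotator_predictions: Dict[str, str],  # annotator id -> predicted label
    labels: Sequence[str],                  # full label set, so absent labels get 0.0
) -> Dict[str, float]:
    """Turn per-annotator predictions into a probability distribution over labels."""
    if not annotator_predictions:
        raise ValueError("need at least one annotator prediction")
    counts = Counter(annotator_predictions.values())
    total = len(annotator_predictions)
    return {label: counts[label] / total for label in labels}


predictions = {"a1": "offensive", "a2": "not_offensive", "a3": "offensive", "a4": "offensive"}
distribution = soft_label(predictions, ["offensive", "not_offensive", "unsure"])
```

The resulting distribution can then be compared with the gold soft label using whatever distance the shared task specifies.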

[30] Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)

Paolo Pedinotti,Peter Baumann,Nathan Jessurun,Leslie Barrett,Enrico Santus

Main category: cs.CL

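TL;DR: This paper presents MetaGraph, an LLM-based method for extracting knowledge graphs from the financial NLP literature; it reveals the field's development phases and trends and gives researchers and practitioners a structured view.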

Details Motivation: Large language models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and a proliferation of datasets, but this transformation has outpaced traditional surveys, so a new way to analyze research trends systematically is needed. Method: The authors define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. Result: MetaGraph reveals three key phases in the development of financial NLP: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. Conclusion: MetaGraph offers a reusable methodology for extracting knowledge graphs from scientific literature and obtaining a structured, queryable view of research trends; it clarifies how financial NLP has evolved and demonstrates an approach for mapping scientific progress in other domains. Abstract: Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved, highlighting emerging trends, shifting priorities, and methodological shifts, while also demonstrating a reusable approach for mapping scientific progress in other domains.

[31] Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking

Brittany Harbison,Samuel Taubman,Travis Taylor,Ashok K. Goel

Main category: cs.CL

TL;DR: The paper proposes a personality detection model using GPT to improve SAMI's ability to make social recommendations in online courses, showing initial promise but requiring further evaluation.

Details Motivation: SAMI's effectiveness in facilitating student connections is limited by an incomplete Theory of Mind, particularly its inability to intuit personality, which may influence the relevance of its recommendations. Method: A personality detection model was developed using GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts. This model was benchmarked against established models and integrated into SAMI's entity-based matchmaking system. Result: The proposed personality detection model demonstrated efficacy in inferring Big-Five personality traits and was successfully integrated into SAMI's matchmaking system, suggesting that personality traits can inform social recommendations. Conclusion: The integration of personality traits into SAMI's matchmaking system can complement existing factors, but further evaluation is needed to understand their impact on student engagement and match quality. Abstract: Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT's zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI's entity-based matchmaking system, enabling personality-informed social recommendations.
Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.

[32] Fluent but Unfeeling: The Emotional Blind Spots of Language Models

Bangzhao Shu,Isha Joshi,Melissa Karnaze,Anh C. Pham,Ishita Kakkar,Sindhu Kothe,Arpine Hovasapian,Mai ElSherief

Main category: cs.CL

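TL;DR: This paper introduces EXPRESS, a new benchmark dataset for evaluating how well large language models align with human emotions at a fine-grained level. It finds that although LLMs can generate terms consistent with emotion theories, they remain limited in capturing human self-disclosed emotions and contextual cues.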

Details Motivation: Although the emotion recognition capabilities of large language models (LLMs) have been widely studied, evaluating whether LLMs align with human emotions at a fine-grained level remains a critical gap. Existing research typically classifies emotions into predefined, limited categories, overlooking more nuanced expressions. Method: The study introduces EXPRESS, a benchmark dataset curated from Reddit communities with 251 fine-grained, self-disclosed emotion labels. A comprehensive evaluation framework decomposes predicted emotion terms into eight basic emotions for fine-grained comparison. Result: Systematic testing shows that accurately predicting emotions that align with human self-disclosed emotions remains challenging for current LLMs. Qualitative analysis further shows that while certain LLMs generate terms consistent with emotion theories, they capture contextual cues less effectively than human self-disclosures. Conclusion: Although some LLMs generate emotion terms consistent with established emotion theories and definitions, they still struggle to capture human self-disclosed emotions, especially where contextual cues must be understood. The findings highlight the limitations of LLMs in fine-grained emotion alignment and suggest directions for improving their contextual understanding. Abstract: The versatility of Large Language Models (LLMs) in natural language understanding has made them increasingly popular in mental health research. While many studies explore LLMs' capabilities in emotion recognition, a critical gap remains in evaluating whether LLMs align with human emotions at a fine-grained level. Existing research typically focuses on classifying emotions into predefined, limited categories, overlooking more nuanced expressions. To address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit communities featuring 251 fine-grained, self-disclosed emotion labels. Our comprehensive evaluation framework examines predicted emotion terms and decomposes them into eight basic emotions using established emotion theories, enabling a fine-grained comparison. Systematic testing of prevalent LLMs under various prompt settings reveals that accurately predicting emotions that align with human self-disclosed emotions remains challenging. Qualitative analysis further shows that while certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures. These findings highlight the limitations of LLMs in fine-grained emotion alignment and offer insights for future research aimed at enhancing their contextual understanding.

[33] LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination

Yiqun T. Chen,Tyler H. McCormick,Li Liu,Abhirup Datta

Main category: cs.CL

TL;DR: This study introduces LA-VA, a method combining LLMs and traditional algorithms, showing that GPT-5 significantly improves verbal autopsy accuracy compared to conventional techniques.

Details Motivation: Verbal autopsy is essential in resource-limited settings where medical certification is lacking. This study aims to explore how integrating LLMs with traditional methods can improve cause-of-death predictions. Method: The study introduces LA-VA, a pipeline combining Large Language Models (LLMs) with traditional algorithmic methods and embedding-based classification. It evaluates methods like GPT-5, LCVA baseline, text embeddings, and meta-learner ensembles using the PHMRC dataset across adult, child, and neonate categories. Result: GPT-5 delivered the best individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional baselines by 5-10%. Conclusion: LLM-assisted approaches can significantly enhance the accuracy of verbal autopsies, offering substantial benefits for health surveillance in low-resource regions. Abstract: Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.

[34] Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems

Minghang Zhu,Zhengliang Shi,Zhiwei Xu,Shiguang Wu,Lingjie Wang,Pengjie Ren,Zhaochun Ren,Zhumin Chen

Main category: cs.CL

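TL;DR: MOAT is a multi-agent joint alignment tuning framework that iteratively optimizes the collaboration between planning and grounding agents, improving performance on complex tasks.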

Details Motivation: Existing methods typically fine-tune the agents of a multi-agent system independently, which leads to capability mismatches and poor coordination among agents; MOAT is proposed to improve their collaboration. Method: MOAT alternates between planning agent alignment and grounding agent improvement, coordinating subgoal generation with action execution to raise overall performance. Result: MOAT performs well across six benchmarks, with average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks. Conclusion: By harmonizing multi-agent collaboration through iterative alignment, MOAT improves complex task solving, and experiments show it outperforms existing methods on multiple benchmarks. Abstract: The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agent collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.

[35] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens

Siddarth Mamidanna,Daking Rai,Ziyu Yao,Yilun Zhou

Main category: cs.CL

TL;DR: The paper investigates the extent of operations in LLMs for mental math tasks and identifies an All-for-One subgraph that is crucial for high model performance.

Details Motivation: The motivation is to understand the inner workings of LLMs, particularly the extent to which causal self-attention and multilayer perceptron layers allow every token to access and compute information based on all preceding tokens. Method: The paper uses two techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), to investigate the extent of operations in LLMs on mental math tasks. Result: An All-for-One subgraph (AF1) was identified with high accuracy on a wide variety of mental math tasks, which is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Conclusion: The paper concludes that meaningful computation in LLMs for mental math tasks occurs very late and only at the last token, which receives information from other tokens in a few specific middle layers. Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers.
With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information from other tokens in a few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.

[36] Steering MoE LLMs via Expert (De)Activation

Mohsen Fayyaz,Ali Modarressi,Hanieh Deilamsalehy,Franck Dernoncourt,Ryan Rossi,Trung Bui,Hinrich Schütze,Nanyun Peng

Main category: cs.CL

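TL;DR: SteerMoE steers large language models by identifying and controlling experts linked to specific behaviors, which can improve safety and faithfulness or, in adversarial attack mode, bypass safety guardrails.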

Details Motivation: Mixture-of-Experts (MoE) LLMs route each token through a selectively activated subset of experts, but how to control these experts effectively to steer model behavior remains underexplored. Method: SteerMoE compares expert activation patterns across paired inputs that exhibit contrasting behaviors to identify behavior-linked experts, then selectively activates or deactivates those experts during inference. Result: Across 11 benchmarks and 6 LLMs, SteerMoE raises safety by up to 20% and faithfulness by 27%; in adversarial attack mode it lowers safety by 41% on its own and by 100% when combined with existing jailbreak methods. Conclusion: SteerMoE offers a way to control LLM behavior without retraining and exposes potential alignment faking hidden within the expert structure. Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
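
The idea of (de)activating experts at inference time can be illustrated with a small routing sketch. This is only an assumption about how such steering might look, not the SteerMoE implementation: the function name `route_with_steering`, the `force_on` / `force_off` arguments, and the toy logits are all hypothetical. Router logits are overridden before top-k selection, and the mixing weights are then renormalized over the chosen experts.

```python
import math

def route_with_steering(logits, k, force_on=(), force_off=()):
    """Pick top-k experts after overriding router logits (hypothetical sketch).

    Experts in force_off are pushed to -inf so they lose every comparison;
    experts in force_on are pushed to +inf so they win every comparison.
    If an expert appears in both, force_on takes precedence.
    """
    adjusted = list(logits)
    for e in force_off:
        adjusted[e] = -math.inf
    for e in force_on:
        adjusted[e] = math.inf
    # Indices of the k largest adjusted logits.
    order = sorted(range(len(adjusted)), key=lambda e: adjusted[e], reverse=True)
    chosen = order[:k]
    # Softmax over the ORIGINAL logits of the chosen experts, so weights stay finite.
    peak = max(logits[e] for e in chosen)
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: exps[e] / total for e in chosen}

# Toy router output for four experts; expert 0 is suppressed, expert 3 is forced on.
weights = route_with_steering([2.0, 1.0, 0.5, -1.0], k=2, force_on=[3], force_off=[0])
```

Overriding logits rather than editing weights keeps the model itself untouched, which matches the abstract's claim that steering needs no retraining; how the behavior-linked experts are detected is outside the scope of this sketch.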

[37] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

Runpeng Dai,Linfeng Song,Haolin Liu,Zhenwen Liang,Dian Yu,Haitao Mi,Zhaopeng Tu,Rui Liu,Tong Zheng,Hongtu Zhu,Dong Yu

Main category: cs.CL

TL;DR: This paper proposes a curiosity-driven exploration method for RLVR to improve LLM reasoning by enhancing exploration through actor-critic intrinsic signals.

Details Motivation: Current RLVR methods suffer from poor exploration, leading to premature convergence and entropy collapse, which this study aims to address. Method: The authors introduced Curiosity-Driven Exploration (CDE), which uses actor and critic signals—perplexity and variance of value estimates—as exploration bonuses in the RLVR framework. Result: Empirically, the proposed method achieved a +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Conclusion: The study concludes that incorporating curiosity-driven exploration into RLVR improves model performance and provides insights into LLM failure modes through calibration collapse. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
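
The two curiosity signals named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the additive combination, the coefficients `alpha` and `beta`, and all function names are hypothetical, and the paper may clip, normalize, or schedule the bonuses differently.

```python
import math
from statistics import pvariance

def perplexity(token_logprobs):
    """Actor signal: perplexity of the generated response, exp(-mean log-prob)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def critic_disagreement(head_values):
    """Critic signal: variance of the value estimates across critic heads."""
    return pvariance(head_values)

def shaped_reward(reward, token_logprobs, head_values, alpha=0.1, beta=0.1):
    """Verifiable reward plus both exploration bonuses (assumed additive form)."""
    return (reward
            + alpha * perplexity(token_logprobs)
            + beta * critic_disagreement(head_values))

# A 4-token response where each token had probability 0.5, scored by two critic heads.
r = shaped_reward(1.0, [math.log(0.5)] * 4, [0.0, 2.0])
```

A response the actor found surprising (high perplexity) or that the critic heads disagree on (high variance) receives a larger bonus, which is the intuition behind using these quantities to guide exploration.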

cs.CV [Back]

[38] Recurrence Meets Transformers for Universal Multimodal Retrieval

Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara

Main category: cs.CV

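TL;DR: This paper proposes ReT-2, a unified model supporting multimodal queries and multimodal document retrieval. It uses multi-layer representations and a recurrent Transformer architecture to dynamically integrate cross-modal information, and performs strongly on several benchmarks.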

Details Motivation: As multimodal retrieval tasks grow more complex, existing methods fall short because they rely on task-specific fine-tuning and are limited to single-modality queries or documents. Method: ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities. Result: On the M2KR and M-BEIR benchmarks, ReT-2 achieves state-of-the-art performance across retrieval configurations while offering faster inference and lower memory usage. Conclusion: ReT-2 performs strongly on multimodal retrieval, achieving state-of-the-art results and also improving downstream performance in retrieval-augmented generation pipelines. Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2

[39] Diffusion-Based Action Recognition Generalizes to Untrained Domains

Rogerio Guimaraes,Frank Xiao,Pietro Perona,Markus Marks

Main category: cs.CV

TL;DR: The paper introduces a novel approach using a Vision Diffusion Model conditioned on earlier timesteps and a transformer aggregation to achieve human-like action recognition across various challenging conditions, setting a new state-of-the-art in generalization benchmarks.

Details Motivation: The motivation behind the paper is to address the challenge that current deep learning models face in generalizing action recognition across varying contexts and viewpoints, unlike humans who can easily recognize the same actions despite such variations. Method: The method involves using a Vision Diffusion Model (VDM) to generate features that are aggregated via a transformer. The VDM is conditioned on earlier timesteps of the diffusion process to emphasize semantic information over pixel-level details. This approach is experimentally tested for generalization properties in classifying actions across different animal species, viewing angles, and recording contexts. Result: The result of the paper is that the proposed model sets a new state-of-the-art across all three generalization benchmarks (animal species, viewing angles, and recording contexts), significantly improving the robustness of machine action recognition and bringing it closer to human-like performance. Conclusion: The paper concludes that using a Vision Diffusion Model (VDM) conditioned on earlier timesteps of the diffusion process, aggregated via a transformer, achieves human-like action recognition across various challenging conditions. This approach sets a new state-of-the-art in generalization benchmarks, making machine action recognition more robust and similar to human performance. Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. 
We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff

[40] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Tung Vu,Lam Nguyen,Quynh Dao

Main category: cs.CV

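TL;DR: This paper proposes PromptGuard, a new modular prompting framework that protects vulnerable populations from potential LLM harms by proactively preventing harmful information at the generation source.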

Details Motivation: Existing safety approaches fail to proactively prevent harmful outputs at the generation source, motivating PromptGuard as a way to protect vulnerable populations from potential harms of large language models. Method: PromptGuard combines theoretical multi-objective optimization and formal proofs with real-world data-driven contrastive learning, and integrates six core modules for real-time harm prevention. Result: The theoretical analysis, based on entropy bounds and Pareto optimality, indicates an analytical harm reduction of 25-30%. Conclusion: Through its VulnGuard Prompt technique and other modules, PromptGuard builds an effective real-time harm prevention system and lays mathematical foundations for systematic empirical research. Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.

[41] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures

Waqar Ahmad,Evan Murphy,Vladimir A. Krylov

Main category: cs.CV

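TL;DR: This paper proposes Beta-SOD, a new statistical outlier detection framework. It recasts object re-identification as a supervised image similarity task and uses a Siamese network with multiple loss functions to optimize feature-level similarity learning, mitigating the effect of label noise on Re-ID performance and showing superior results on several datasets.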

Details Motivation: Object re-identification (Re-ID) methods are highly sensitive to label noise, which causes significant performance degradation. Method: Re-ID is reframed as a supervised image similarity task using a Siamese network architecture that captures discriminative pairwise relationships, trained with binary cross-entropy, contrastive, and cosine embedding losses for feature-level similarity learning. Result: Beta-SOD effectively de-noises the data and improves Re-ID performance on the CUHK03, Market-1501, and VeRi-776 datasets. Conclusion: Beta-SOD outperforms existing methods across noise levels, demonstrating robustness and broad applicability in noisy Re-ID scenarios. Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
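
To make the two-component Beta mixture concrete, the sketch below scores a single cosine similarity against a mixture whose parameters are already known. Several details are assumptions rather than the paper's method: the mapping of cosine similarity from [-1, 1] to (0, 1), the example shape parameters, and the names `beta_pdf` and `clean_posterior` are hypothetical, and fitting the mixture (for example with EM) is omitted.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def clean_posterior(cos_sim, w_clean, clean_ab, noisy_ab, eps=1e-6):
    """Posterior probability that a pair belongs to the 'clean' component.

    cos_sim is rescaled from [-1, 1] to (0, 1) (an assumption of this sketch)
    and clamped away from the endpoints so the logarithms stay finite.
    """
    x = min(max((cos_sim + 1) / 2, eps), 1 - eps)
    p_clean = w_clean * beta_pdf(x, *clean_ab)
    p_noisy = (1 - w_clean) * beta_pdf(x, *noisy_ab)
    return p_clean / (p_clean + p_noisy)

# Hypothetical fitted components: clean pairs concentrate near high similarity.
p = clean_posterior(0.8, w_clean=0.5, clean_ab=(8, 2), noisy_ab=(2, 8))
```

Pairs whose posterior falls below a chosen threshold would be treated as likely label noise; the threshold itself is another free choice not specified in the abstract.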

[42] SFD-Mamba2Net: Strcture-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation

Nan Mu,Ruiqi Song,Zhihui Xu,Jingfeng Jiang,Chen Zhao

Main category: cs.CV

TL;DR: This study proposes SFD-Mamba2Net, a novel framework for improving coronary artery segmentation and stenosis detection in Invasive Coronary Angiography (ICA) images, addressing challenges such as low contrast, high noise levels, and complex vascular structures.

Details Motivation: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide, and Invasive Coronary Angiography (ICA) is regarded as the gold standard for CAD diagnosis. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Method: The study proposed SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. It incorporates a Curvature-Aware Structural Enhancement (CASE) module in the encoder and a Progressive High-Frequency Perception (PHFP) module in the decoder. Result: The proposed SFD-Mamba2Net framework improved the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Conclusion: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection. Abstract: Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. 
Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.

[43] Live(r) Die: Predicting Survival in Colorectal Liver Metastasis

Muhammad Alberb,Helen Cheung,Anne Martel

Main category: cs.CV

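TL;DR: The study develops a fully automated framework for analyzing preoperative MRI that combines advanced image segmentation with radiomics to predict survival more accurately for patients undergoing surgery for colorectal liver metastasis.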

Details Motivation: Current prognostic models, based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. Method: The framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline uses promptable foundation models to learn from partially annotated data and introduces SAMONAI, a new zero-shot 3D prompt propagation algorithm; the radiomics pipeline extracts features from each tumor and uses SurvAMINN for survival analysis. Result: Extensive evaluation on an institutional dataset of 227 patients shows that the framework surpasses existing clinical and genomic biomarkers, with a C-index improvement exceeding 10%. Conclusion: The study presents a fully automated preoperative MRI analysis framework for predicting patient survival after surgery for colorectal liver metastasis (CRLM), with predictive power exceeding that of existing clinical and genomic biomarkers. Abstract: Colorectal cancer frequently metastasizes to the liver, significantly reducing long-term survival. While surgical resection is the only potentially curative treatment for colorectal liver metastasis (CRLM), patient outcomes vary widely depending on tumor characteristics along with clinical and genomic factors. Current prognostic models, often based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. We present a fully automated framework for surgical outcome prediction from pre- and post-contrast MRI acquired before surgery. Our framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by leveraging promptable foundation models to complete missing labels. Also, we propose SAMONAI, a novel zero-shot 3D prompt propagation algorithm that leverages the Segment Anything Model to segment 3D regions of interest from a single point prompt, significantly improving our segmentation pipeline's accuracy and efficiency. The predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts features from each tumor and predicts survival using SurvAMINN, a novel autoencoder-based multiple instance neural network for survival analysis. SurvAMINN jointly learns dimensionality reduction and hazard prediction from right-censored survival data, focusing on the most aggressive tumors.
Extensive evaluation on an institutional dataset comprising 227 patients demonstrates that our framework surpasses existing clinical and genomic biomarkers, delivering a C-index improvement exceeding 10%. Our results demonstrate the potential of integrating automated segmentation algorithms and radiomics-based survival analysis to deliver accurate, annotation-efficient, and interpretable outcome prediction in CRLM.
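
For readers unfamiliar with the C-index reported above, the following sketch computes a concordance index on right-censored data. It assumes Harrell's definition, which the abstract does not confirm; the function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index for right-censored survival data (assumed variant).

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. It is concordant when the model assigned i the
    higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Three patients: the third is censored at t=3; risks are perfectly ordered.
c = concordance_index(times=[1, 2, 3], events=[1, 1, 0], risks=[3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a sense of scale for an improvement of more than 10%.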

[44] Discovering Divergent Representations between Text-to-Image Models

Lisa Dunlap,Joseph E. Gonzalez,Trevor Darrell,Fabian Caba Heilbron,Josef Sivic,Bryan Russell

Main category: cs.CV

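TL;DR: This paper proposes CompCon, an algorithm for comparing differences in visual representations between text-to-image models, and evaluates it on ID2, a dataset produced by an automated generation pipeline.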

Details Motivation: To study when and how the visual representations learned by two different generative models diverge, and to discover visual attributes present in one model's outputs along with the prompts that trigger them. Method: The paper introduces CompCon (Comparing Concepts), an evolutionary search algorithm, and creates the ID2 dataset through an automated data generation pipeline for evaluation. Result: CompCon discovers differences between models, for example that PixArt depicts prompts mentioning loneliness with wet streets, while Stable Diffusion 3.5 depicts African American people in media professions. Conclusion: CompCon is an effective algorithm for discovering differences in visual attributes between generative models and for uncovering the prompt concepts linked to those differences. Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon

[45] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery

Yibin Wang,Wondimagegn Beshah,Padmanava Dash,Haifeng Wang

Main category: cs.CV

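TL;DR: This study proposes a new machine learning approach for identifying and separating cloud shadow and sun-glint regions in unmanned aerial system (UAS) imagery, to improve the accuracy of water quality parameter estimation.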

Details Motivation: UAS imagery offers important advantages for remote sensing, but cloud shadows and sun glint impair accurate estimation of water quality parameters, so an effective method to handle them is needed. Method: The study uses a U-Net-based deep learning model trained on pixel-level data, and selects the best training settings based on evaluation metrics from test cases. Result: A high-quality image correction model was developed that can identify and recover cloud shadow and sun-glint regions in the images. Conclusion: The deep-learning-based approach performs well in handling cloud shadows and sun glint in UAS imagery and helps improve the accuracy of water quality parameter estimation. Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain as per required spatial, spectral, temporal, and radiometric resolutions for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. The limitation to the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two pose serious issues for estimating water quality parameters from the UAS images. This study proposes a novel machine learning approach first to identify and extract regions with cloud shadows and sun glint and separate such regions from non-obstructed clear sky regions and sun-glint unaffected regions. The data was extracted from the images at pixel level to train a U-Net-based deep learning model, and the best settings for model training were identified based on the various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.

[46] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision

Puskal Khadka,Rodrigue Rizk,Longwei Wang,KC Santosh

Main category: cs.CV

TL;DR: CoSwin enhances Vision Transformers by combining global attention with local convolutional features, improving performance on small vision datasets.

Details Motivation: Vision Transformers (ViTs) struggle with local feature extraction on small datasets due to a lack of inductive biases like locality; CoSwin aims to address this by incorporating convolutional features into the attention blocks. Method: The authors propose CoSwin, which combines hierarchical shifted window attention with a learnable local feature enhancement module to capture both local details and global structure. Result: CoSwin outperforms state-of-the-art models on multiple benchmarks, with notable accuracy improvements on CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Conclusion: CoSwin improves the performance of vision transformers on small datasets by integrating local convolutional feature learning with global attention mechanisms. Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. 
These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin

[47] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning

Karim Slimani,Catherine Achard,Brahim Tamadazte

Main category: cs.CV

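TL;DR: iMatcher is a fully differentiable feature matching framework for point cloud registration that uses learned features to predict a geometrically consistent confidence matrix, significantly improving rigid registration performance.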

Details Motivation: To improve the accuracy and robustness of point cloud registration, particularly by achieving more efficient feature matching in complex scenes. Method: iMatcher initializes a score matrix with a local graph embedding module, refines it by considering bilateral source-to-target and target-to-source matching, and finally predicts point-wise matching probabilities through global geometric consistency learning. Result: On several real-world datasets, including KITTI, KITTI-360, and 3DMatch, iMatcher significantly improves rigid registration performance, reaching inlier ratios of 95%-97%, 94%-97%, and up to 81.1%, respectively. Conclusion: iMatcher performs strongly on point cloud registration, with strong robustness and broad application prospects. Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
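
The bilateral source-to-target and target-to-source matching mentioned in the iMatcher abstract can be illustrated with a plain mutual nearest-neighbor check. The sketch below is not the paper's repositioning step, which refines a learned score matrix; it is a brute-force Python illustration of the bilateral idea only, and the function names `nearest` and `mutual_matches` are hypothetical.

```python
import math

def nearest(point, candidates):
    """Return the index of the candidate closest to `point` (brute force)."""
    return min(range(len(candidates)), key=lambda j: math.dist(point, candidates[j]))

def mutual_matches(source, target):
    """Keep (i, j) only if j is i's nearest target AND i is j's nearest source."""
    matches = []
    for i, p in enumerate(source):
        j = nearest(p, target)               # source-to-target direction
        if nearest(target[j], source) == i:  # target-to-source direction
            matches.append((i, j))
    return matches

# Toy 3D points (made up for illustration): the third source point has no
# mutual partner and is therefore dropped.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
print(mutual_matches(source, target))
```

Requiring agreement in both directions discards one-sided matches, which is one common way to reduce outlier correspondences before pose estimation.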

[48] UltrON: Ultrasound Occupancy Networks

Magdalena Wysocki,Felix Duelmer,Ananya Bal,Nassir Navab,Mohammad Farid Azampour

Main category: cs.CV

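TL;DR: This paper proposes a new occupancy-based representation and UltrON, which leverages acoustic properties to improve geometric consistency, enabling more accurate 3D reconstruction in a weakly supervised optimization regime.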

Details Motivation: To overcome the difficulties that implicit representation methods have with ultrasound's view dependency and acoustic shadowing artifacts, and to reduce reliance on precise annotations. Method: An occupancy-based representation and UltrON are proposed, along with a novel loss function for occupancy optimization from multiview ultrasound images. Result: UltrON is shown to generalize to shapes of the same anatomy, using acoustic features obtained from B-mode images at no additional annotation cost. Conclusion: By leveraging acoustic properties, UltrON improves geometric consistency, mitigating the limitations of occlusions and sparse labeling and paving the way for more accurate 3D reconstruction. Abstract: In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound's view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON that leverages acoustic features to improve geometric consistency in weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy. 
We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.

[49] Implicit Neural Representations of Intramyocardial Motion and Strain

Andrew Bell,Yan Kit Choi,Steffen Peterson,Andrew King,Muhummad Sohaib Nazir,Alistair Young

Main category: cs.CV

TL;DR: The paper introduces a novel method using implicit neural representations to accurately and efficiently analyze myocardial strain from cardiac MRI data, outperforming existing deep learning approaches in both accuracy and speed.

Details Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. Method: A method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement without requiring inference-time optimisation. Result: Evaluated on 452 UK Biobank test cases, the method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. Additionally, the method is approximately 380 times faster than the most accurate baseline. Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large CMR datasets. Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement -- without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is $\sim$380$\times$ faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.

[50] E-MLNet: Enhanced Mutual Learning for Universal Domain Adaptation with Sample-Specific Weighting

Samuel Felipe dos Santos,Tiago Agostinho de Almeida,Jurandy Almeida

Main category: cs.CV

TL;DR: E-MLNet improves domain adaptation by focusing on relevant class boundaries, outperforming existing methods in accuracy and robustness.

Details Motivation: Existing UniDA methods dilute the learning signal by treating all classifiers equally, necessitating a more focused adaptation strategy. Method: E-MLNet integrates a dynamic weighting strategy into OEM, focusing on relevant class boundaries for each target sample. Result: E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF, outperforming MLNet in 22 out of 31 tasks in Open-Partial DA and 19 out of 31 in Open-Set DA. Conclusion: E-MLNet demonstrates superior performance and robustness compared to MLNet, particularly in handling challenging domain adaptation tasks. Abstract: Universal Domain Adaptation (UniDA) seeks to transfer knowledge from a labeled source to an unlabeled target domain without assuming any relationship between their label sets, requiring models to classify known samples while rejecting unknown ones. Advanced methods like Mutual Learning Network (MLNet) use a bank of one-vs-all classifiers adapted via Open-set Entropy Minimization (OEM). However, this strategy treats all classifiers equally, diluting the learning signal. We propose the Enhanced Mutual Learning Network (E-MLNet), which integrates a dynamic weighting strategy to OEM. By leveraging the closed-set classifier's predictions, E-MLNet focuses adaptation on the most relevant class boundaries for each target sample, sharpening the distinction between known and unknown classes. We conduct extensive experiments on four challenging benchmarks: Office-31, Office-Home, VisDA-2017, and ImageCLEF. The results demonstrate that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and exhibits superior robustness over its predecessor. E-MLNet outperforms the strong MLNet baseline in the majority of individual adaptation tasks -- 22 out of 31 in the challenging Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting -- confirming the benefits of our focused adaptation strategy.
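
For readers unfamiliar with the H-score reported for E-MLNet: in the universal domain adaptation literature it is commonly defined as the harmonic mean of the accuracy on known (shared) classes and the accuracy on the unknown class. The sketch below assumes that common definition; the input accuracies are made-up numbers, not results from the paper.

```python
def h_score(acc_known, acc_unknown):
    """Harmonic mean of known-class accuracy and unknown-class accuracy.

    Both inputs are fractions in [0, 1]. The score is high only when the
    model both classifies known samples well and rejects unknown ones.
    """
    total = acc_known + acc_unknown
    if total == 0:
        return 0.0
    return 2 * acc_known * acc_unknown / total

# Illustrative values only (not taken from the paper).
print(round(h_score(0.90, 0.60), 4))  # balanced performance is rewarded
print(h_score(0.95, 0.0))             # never rejecting unknowns gives 0
```

Because the harmonic mean is dominated by the smaller term, a method cannot score well by favoring known-class accuracy at the expense of unknown rejection, or the reverse.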

[51] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation

Umair Hassan

Main category: cs.CV

TL;DR: This paper presents COCO-Urdu, a large-scale Urdu image-caption dataset intended to reduce language bias in multimodal research.

Details Motivation: Urdu is severely under-resourced in multimodal and vision-language research, and the lack of large-scale, high-quality datasets has limited the development of Urdu-capable systems. Method: Captions are translated with SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that combines COMET-Kiwi, CLIP-based similarity, and BERTScore. Result: COCO-Urdu contains 59,000 images and 319,000 Urdu captions and scores well on BLEU, SacreBLEU, and chrF. Conclusion: COCO-Urdu is expected to reduce language bias in multimodal research and lay a foundation for inclusive vision-language systems. Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
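
The COCO-Urdu abstract describes three quality signals (translation quality, visual grounding, semantic consistency) and a refinement step for low-scoring captions, but it does not say how the signals are combined. The sketch below assumes a simple weighted average and a fixed threshold purely for illustration; the weights, the threshold, and the function names are hypothetical, and the scores are assumed to be already normalized to [0, 1].

```python
def hybrid_quality(translation_q, visual_sim, semantic_sim, weights=(0.5, 0.25, 0.25)):
    """Weighted average of three per-caption quality signals.

    translation_q: e.g. a COMET-Kiwi style score
    visual_sim:    e.g. a CLIP-based image-text similarity
    semantic_sim:  e.g. a BERTScore computed on a back-translation
    The weights are an assumption, not the paper's setting.
    """
    w_t, w_v, w_s = weights
    return w_t * translation_q + w_v * visual_sim + w_s * semantic_sim

def needs_refinement(score, threshold=0.7):
    """Flag captions whose combined score falls below a (hypothetical) threshold."""
    return score < threshold

# Made-up per-caption scores for two captions.
captions = {"caption_a": (0.9, 0.8, 0.9), "caption_b": (0.8, 0.4, 0.6)}
flagged = [name for name, s in captions.items() if needs_refinement(hybrid_quality(*s))]
print(flagged)
```

In a pipeline of this shape, only the flagged captions would be sent back for another refinement pass, which keeps the expensive step limited to a small share of the data.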

[52] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI

Chenqian Le,Yilin Zhao,Nikasadat Emami,Kushagra Yadav,Xujin "Chris" Liu,Xupeng Chen,Yao Wang

Main category: cs.CV

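TL;DR: VoxelFormer is an efficient multi-subject model for visual decoding from fMRI that combines token merging with a query-driven transformer to achieve parameter-efficient, scalable neural decoding.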

Details Motivation: To address the reliance of existing fMRI visual decoding methods on subject-specific training, improving scalability and practical deployment. Method: A lightweight transformer architecture is introduced, comprising a Token Merging Transformer (ToMer) and a query-driven Q-Former, to compress voxels efficiently and produce fixed-size neural representations aligned with the CLIP image embedding space. Result: Evaluation on the 7T Natural Scenes Dataset shows that VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. Conclusion: VoxelFormer demonstrates a parameter-efficient and scalable approach to fMRI visual decoding, using token merging and query-based transformers to enable multi-subject training. Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.

[53] Integrating Anatomical Priors into a Causal Diffusion Model

Binxu Li,Wei Peng,Mingjie Li,Ehsan Adeli,Kilian M. Pohl

Main category: cs.CV

TL;DR: PCGM improves counterfactual 3D brain MRI synthesis by integrating anatomical constraints into a diffusion model, generating realistic images that reflect subtle disease-related changes in cortical regions.

Details Motivation: 3D brain MRI studies often require detecting subtle morphometric differences between cohorts, but counterfactual image generation models struggle to preserve fine anatomical details due to lack of explicit inductive biases. The high cost of MRI acquisition makes synthetic image generation highly beneficial for such studies. Method: PCGM incorporates anatomical constraints via a probabilistic graph module, which generates spatial binary masks of regions with subtle variations. These masks constrain a counterfactual denoising UNet, integrated with a 3D diffusion decoder to produce high-quality MRIs. The method explicitly embeds voxel-level anatomical priors, addressing limitations in existing counterfactual models. Result: PCGM outperforms baseline approaches in generating high-quality structural brain MRIs. It successfully replicates known subtle effects of diseases on cortical regions, demonstrating its ability to preserve medically relevant local variations. Conclusion: The proposed Probabilistic Causal Graph Model (PCGM) successfully integrates anatomical constraints into a generative diffusion framework, enabling the synthesis of high-quality, anatomically plausible 3D brain MRIs that replicate subtle disease-related cortical effects previously identified in neuroscience literature, marking a significant milestone for synthetic MRI use in morphological studies. Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image syntheses, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. 
This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints on a voxel-level as prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.

[54] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models

Qiuhui Chen,Xuancheng Yao,Huping Ye,Yi Hong

Main category: cs.CV

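TL;DR: Med3DInsight is a new method that uses 2D multimodal large language models to improve 3D medical image understanding; it is noise-tolerant, requires no human annotations, and delivers superior performance.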

Details Motivation: Existing convolution- and transformer-based self-supervised learning methods for 3D medical images lack deep semantic understanding, while multimodal large language models offer a new way to enhance image understanding through text descriptions. Method: Med3DInsight is proposed, a pretraining framework that combines 3D image encoders with 2D multimodal large language models (MLLMs) and uses a partial optimal transport based alignment. Result: Experiments on multiple public CT and MRI datasets show that Med3DInsight outperforms existing SSL methods on downstream tasks such as segmentation and classification, and can be seamlessly integrated into existing 3D medical image understanding networks. Conclusion: Med3DInsight offers a new paradigm for scalable multimodal 3D medical representation learning that requires no human annotations and achieves state-of-the-art performance on segmentation and classification. Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise introduced by potential noises in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.

[55] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach

Hesham M. Shehata,Mohammad Abdolrahmani

Main category: cs.CV

TL;DR: This paper introduces a multi-task learning approach incorporating fixed object data to improve human action recognition, achieving 99.25% accuracy on a custom dataset of interaction and non-interaction actions.

Details Motivation: Existing graph convolutional neural networks (GCNs) struggle with detecting human-object interactions due to limited scene representation and inadequate learning architectures. Method: A multi-task learning methodology was employed, incorporating fixed object information to enhance the representation of human actions. Real-world data was collected and organized into interaction and non-interaction classes for evaluation. Result: The proposed method achieved an accuracy of 99.25%, outperforming the base model by 2.75% in recognizing actions involving interactions with fixed objects. Conclusion: The proposed multi-task learning approach, incorporating fixed object information, outperforms the base model in recognizing human-object interactions and non-interaction actions. Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, it fails to detect human-object interaction cases successfully due to the lack of effective representation of the scene information and appropriate learning architectures. In this context, we propose a methodology to utilize human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.

[56] IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection

Jifeng Shen,Haibo Zhan,Xin Zuo,Heng Fan,Xiaohui Yuan,Jun Li,Wankou Yang

Main category: cs.CV

TL;DR: This paper proposes IRDFusion, a novel feature fusion framework for multispectral object detection that effectively suppresses background noise and enhances salient features through adaptive cross-modal fusion, achieving state-of-the-art results on multiple datasets.

Details Motivation: Current multispectral object detection methods suffer from retaining extraneous background or noise during feature fusion, which limits perceptual performance. This paper aims to address this issue by proposing an innovative feature fusion framework. Method: IRDFusion incorporates two novel modules: the Mutual Feature Refinement Module (MFRM) for enhancing intra- and inter-modal feature representations, and the Differential Feature Feedback Module (DFFM) for dynamically computing and feeding back inter-modal differential features to guide adaptive fusion and noise suppression. Result: IRDFusion achieves state-of-the-art performance on FLIR, LLVIP, and M$^3$FD datasets, consistently outperforming existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Conclusion: IRDFusion is a robust and effective framework for multispectral object detection that achieves state-of-the-art performance by enabling high-quality cross-modal fusion through progressive amplification of salient relational signals and suppression of feature noise. Abstract: Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). 
The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M$^3$FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.

[57] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

Hengyu Fang,Yijiang Liu,Yuan Du,Li Du,Huanrui Yang

Main category: cs.CV

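TL;DR: The paper proposes SQAP-VLA, a structured, training-free VLA inference acceleration framework that enables quantization and token pruning simultaneously, improving computational efficiency and inference speed while preserving model performance.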

Details Motivation: Existing VLA compression and acceleration methods cannot enable quantization and token pruning at the same time, which limits efficiency gains. Method: The quantization and token pruning pipeline is co-designed: new quantization-aware token pruning criteria that work on an aggressively quantized model are proposed, and the quantizer design is improved to enhance pruning effectiveness. Result: Applied to standard VLA models, SQAP-VLA significantly improves computational efficiency and inference speed while preserving core model performance, achieving a 1.93× speedup and up to a 4.5% improvement in average success rate. Conclusion: SQAP-VLA is the first framework to enable efficient quantization and token pruning simultaneously on VLA models, offering a new solution for practical VLA deployment. Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a $\times$1.93 speedup and up to a 4.5% average success rate enhancement compared to the original model.
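
To make the two ingredients combined by SQAP-VLA concrete, the sketch below shows generic versions of each in isolation: uniform symmetric quantization of a list of values, and score-based token pruning that keeps the top-k tokens. It does not reproduce the paper's quantization-aware pruning criteria or its quantizer design; the function names, the bit width, and the token scores are assumptions for illustration.

```python
def quantize_symmetric(values, bits=4):
    """Uniform symmetric fake-quantization: snap to an integer grid, then rescale."""
    qmax = 2 ** (bits - 1) - 1            # largest positive integer level
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    scale = max_abs / qmax                # step size of the grid
    return [round(v / scale) * scale for v in values]

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(ranked)]

# Made-up inputs: four visual tokens with hypothetical importance scores.
tokens = ["tok0", "tok1", "tok2", "tok3"]
scores = [0.1, 0.9, 0.5, 0.7]
print(prune_tokens(tokens, scores, keep=2))
print(quantize_symmetric([0.0, 0.3, -1.0, 0.8], bits=2))
```

The incompatibility the abstract refers to arises when the two are combined: importance scores computed on an aggressively quantized model can differ from those of the full-precision model, which is why the authors co-design the two steps rather than applying them independently.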

[58] S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization

Chenghao Zhang,Lun Luo,Si-Yuan Cao,Xiaokai Bai,Yuncheng Jin,Zhu Yu,Beinan Yu,Yisen Wang,Hui-Liang Shen

Main category: cs.CV

TL;DR: S-BEVLoc is a self-supervised LiDAR-based framework for global localization that eliminates the need for ground-truth poses and achieves high performance and scalability.

Details Motivation: Current LiDAR-based global localization approaches rely on costly and labor-intensive ground-truth poses obtained from GPS or SLAM odometry, which limits scalability. Method: S-BEVLoc constructs training triplets from single BEV images using known geographic distances between keypoint-centered BEV patches. CNN extracts local features, NetVLAD aggregates global descriptors, and SoftCos loss enhances learning. Result: Experimental results on KITTI and NCLT datasets demonstrate that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks. Conclusion: S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks without relying on ground-truth poses, and offers high scalability. Abstract: LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. Convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. 
Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.
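
The self-supervised triplet construction described for S-BEVLoc relies on the fact that patches cropped from one BEV image have known geographic offsets. A minimal sketch of that idea follows, assuming patch centers in metres and two hypothetical radii: patches within `pos_radius` of the anchor count as positives and patches beyond `neg_radius` as negatives. The thresholds and the function name are illustrative assumptions, and the SoftCos loss is not reproduced here.

```python
import math

def build_triplets(centers, pos_radius, neg_radius):
    """Return (anchor, positive, negative) index triplets from patch centers.

    centers: list of (x, y) patch centers in metres inside a single BEV image.
    Assumes 0 < pos_radius < neg_radius, so an anchor is never its own negative.
    """
    triplets = []
    for a, ca in enumerate(centers):
        positives = [p for p, cp in enumerate(centers)
                     if p != a and math.dist(ca, cp) <= pos_radius]
        negatives = [n for n, cn in enumerate(centers)
                     if math.dist(ca, cn) >= neg_radius]
        triplets.extend((a, p, n) for p in positives for n in negatives)
    return triplets

# Toy layout: two nearby patches and one far away.
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(build_triplets(centers, pos_radius=2.0, neg_radius=5.0))
```

Because the distances come from the BEV geometry itself, no GPS or SLAM poses are needed to label positives and negatives, which is the source of the scalability claimed in the abstract.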

[59] FPI-Det: a face--phone Interaction Dataset for phone-use detection and understanding

Jianqin Gao,Tianqi Wang,Yu Zhang,Yishu Zhang,Chenyuan Wang,Allan Dong,Zihao Wang

Main category: cs.CV

TL;DR: This paper introduces FPI-Det, a novel dataset for detecting phone usage in diverse scenarios, with synchronized annotations for faces and phones and baseline results using object detection models.

Details Motivation: The motivation is to address the challenges in detecting phone usage by understanding behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing benchmarks do not fully capture such interactions. Method: The authors introduce FPI-Det, a dataset with 22,879 images containing synchronized annotations for faces and phones. They evaluate representative object detection models like YOLO and DETR to establish baseline results and analyze performance across object sizes, occlusion levels, and environments. Result: FPI-Det is introduced as a solution, featuring extreme scale variation, frequent occlusions, and varied capture conditions. Baseline results are provided using YOLO and DETR detectors, analyzing performance across different scenarios. Conclusion: The paper introduces FPI-Det, a new dataset designed to address the gap in existing benchmarks for detecting fine-grained human-device interactions, particularly phone usage, across various scenarios. Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human--device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. 
We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.

[60] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Junhao Xing,Ryohei Miyakawa,Yang Yang,Xinpeng Liu,Risa Shinoda,Hiroaki Santo,Yosuke Toda,Fumio Okura

Main category: cs.CV

TL;DR: ZeroPlantSeg introduces a zero-shot method for hierarchical plant segmentation, combining segmentation and vision-language models without requiring annotated data.

Details Motivation: Segmenting entire plant individuals composed of overlapping leaves is challenging and typically requires species-specific annotated datasets. A zero-shot approach is needed to avoid manual annotation. Method: ZeroPlantSeg combines a foundation segmentation model for leaf instance extraction and a vision-language model for reasoning about plant structures, eliminating the need for additional training. Result: ZeroPlantSeg performs well across multiple plant species, growth stages, and imaging environments, showing superior cross-domain performance compared to supervised methods. Conclusion: The proposed ZeroPlantSeg achieves zero-shot hierarchical segmentation of rosette-shaped plant individuals from top-view images, outperforming existing zero-shot and supervised methods. Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

[61] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Tianlu Zheng,Yifan Zhang,Xiang An,Ziyong Feng,Kaicheng Yang,Qichuan Ding

Main category: cs.CV

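TL;DR: This paper improves CLIP for person representation learning by building WebPerson, a large-scale dataset, and introducing the GA-DMS framework, addressing both data scarcity and the limitations of global contrastive learning.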

Details Motivation: CLIP performs strongly across diverse vision tasks, but its application to person representation learning faces two challenges: the scarcity of large-scale annotated person-centric image-language data, and the inherent limitations of global contrastive learning, which struggles to preserve the discriminative local features needed for fine-grained matching and is vulnerable to noisy text tokens. Method: The authors develop a noise-resistant data construction pipeline that uses MLLMs to automatically filter and caption web-sourced images, yielding WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. They also introduce the GA-DMS framework, which improves cross-modal alignment by adaptively masking noisy text tokens based on a gradient-attention similarity score and strengthens fine-grained semantic representation learning through a masked token prediction objective. Result: The work produces the WebPerson dataset and the GA-DMS framework; GA-DMS achieves state-of-the-art performance on multiple benchmarks. Conclusion: GA-DMS achieves state-of-the-art performance on multiple benchmarks and advances the use of CLIP for person representation learning. Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.

[62] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain

Bin Huang,Kang Chen,Bingxuan Li,Huafeng Liu,Qiegen Liu

Main category: cs.CV

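TL;DR: ALL-PET is a low-resource, low-shot PET foundation model that uses a latent diffusion model in the projection domain, together with three key innovations, to generate high-quality sinograms, and it performs well with little data.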

Details Motivation: Building large-scale foundation models for PET imaging is hindered by limited access to labeled data and insufficient computational resources. Method: ALL-PET builds on a latent diffusion model (LDM) and introduces three innovations: a Radon mask augmentation strategy (RMAS) extended by a dynamic multi-mask (DMM) mechanism, positive/negative mask constraints, and transparent medical attention (TMA). Result: Experiments show that ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets, while keeping memory use under 24GB. Conclusion: ALL-PET delivers an efficient PET foundation model with little data and low resource consumption, and it applies to multiple tasks such as low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation. Abstract: Building a large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. 
ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.

[63] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology

Dylan Peek,Matthew P. Skerritt,Stephan Chalup

Main category: cs.CV

TL;DR: This study shows that ANNs can be more effective than PH in the presence of structural noise when predicting topological features from 2D binary images.

Details Motivation: PH and ANNs offer contrasting approaches to inferring topological structure from data, and this study examines the noise robustness of ANNs in predicting topological features. Method: A supervised neural network was trained to predict Betti numbers in 2D binary images and compared against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT). Result: On one synthetic and two real-world datasets, ANNs outperformed the PH approach under noise, likely because they can learn contextual and geometric priors from training data. Conclusion: ANNs offer a compelling alternative to PH under structural noise for topology estimation. Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
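For readers unfamiliar with the prediction target, the sketch below computes the two Betti numbers of a noise-free 2D binary image by counting connected components: b0 is the number of foreground components and b1 the number of holes. It is neither the paper's ANN nor its PH pipeline. The function names and the connectivity convention (8-connected foreground, 4-connected background) are assumptions made for illustration.

```python
from collections import deque

# Neighbour offsets; the 8/4 connectivity pairing is an illustrative assumption.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbors):
    """Count connected components of cells equal to `value` using BFS."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(image):
    """Return (b0, b1) for a 2D binary image given as a list of 0/1 rows."""
    cols = len(image[0])
    # Pad with background so the outer background forms a single component.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in image]
              + [[0] * (cols + 2)])
    b0 = count_components(padded, 1, N8)
    # Every background component other than the outer one is a hole.
    b1 = count_components(padded, 0, N4) - 1
    return b0, b1


ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
print(betti_numbers(ring))
```

A single flipped pixel can change either count, which is why exact counting of this kind is fragile under structural noise.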

[64] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation

Yuiko Uchida,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CV

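TL;DR: This paper proposes OSIM, a new evaluation metric for 3D scenes that focuses on objects to align more closely with human perception, and validates it with a user study.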

Details Motivation: Existing evaluation metrics assess overall image quality and diverge from human perception, whereas human recognition of 3D scenes relies mainly on attention to individual objects. Method: OSIM uses an object detection model and its feature representations to quantify the "objectness" of each object in a scene, enabling object-centric evaluation. Result: A user study shows that OSIM aligns more closely with human perception than existing metrics; the authors also re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup. Conclusion: OSIM is a new metric for evaluating 3D scenes that better matches human perception, and its code is open source. Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.

[65] Video Understanding by Design: How Datasets Shape Architectures and Insights

Lei Wang,Piotr Koniusz,Yongsheng Gao

Main category: cs.CV

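TL;DR: This survey is the first to examine the development of video understanding models from a dataset-driven perspective, showing how datasets shape architectural evolution and offering guidance for future research.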

Details Motivation: Existing surveys overlook the structural pressures that datasets exert on architectural evolution; this survey aims to fill that gap. Method: Adopting a dataset-driven perspective, the survey reinterprets milestone models as concrete responses to these dataset-driven pressures. Result: It offers practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. Conclusion: By unifying datasets, inductive biases, and architectures, the survey provides a comprehensive retrospective and a roadmap for general-purpose video understanding. Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.

[66] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge

JaeWoong Shin,Jeongun Ryu,Aaron Valero Puche,Jinhee Lee,Biagio Brattoli,Wonkyung Jung,Soo Ick Cho,Kyunghyun Paeng,Chan-Young Ock,Donggeun Yoo,Zhaoyang Li,Wangkai Li,Huayu Mai,Joshua Millward,Zhen He,Aiden Nibali,Lydia Anette Schoenpflug,Viktor Hendrik Koelzer,Xu Shuoyu,Ji Zheng,Hu Bin,Yu-Wen Lo,Ching-Hui Yang,Sérgio Pereira

Main category: cs.CV

TL;DR: The OCELOT 2023 challenge focuses on improving deep learning models for cell detection and tissue segmentation by incorporating multi-scale semantics, specifically cell-tissue interactions, leading to a significant performance boost over traditional methods.

Details Motivation: Pathologists routinely use different magnifications to evaluate both broad tissue morphology and intricate cellular details for comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate this behavior due to the lack of datasets with multi-scale overlapping annotations, which is why the OCELOT 2023 challenge was initiated. Method: The study uses a dataset from the OCELOT 2023 challenge that includes multi-scale overlapping cell and tissue annotations from six organs, comprising 673 pairs sourced from TCGA Whole-Slide Images. Comparative analysis of the methods used by participants was conducted to highlight innovative strategies. Result: Top entries in the OCELOT 2023 challenge achieved up to a 7.99 increase in F1-score on the test set compared to a baseline cell-only model, demonstrating the effectiveness of incorporating cell-tissue relationships into models. Conclusion: The OCELOT 2023 challenge demonstrates the importance of incorporating multi-scale semantics, specifically cell-tissue interactions, into deep learning models for improved performance in cell detection and tissue segmentation, showing a substantial improvement over traditional cell-only methods. Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. 
The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.

[67] RT-DETR++ for UAV Object Detection

Yuan Shufang

Main category: cs.CV

TL;DR: This paper proposes RT-DETR++, an enhanced encoder for object detection in UAV imagery that effectively detects small and densely packed objects while maintaining real-time speed.

Details Motivation: Object detection in UAV imagery presents significant challenges due to densely packed small objects, scale variations, and occlusion. Method: RT-DETR++ enhances the encoder component of the RT-DETR model with a channel-gated attention-based upsampling/downsampling mechanism and incorporates CSP-PAC during feature fusion using parallel hollow convolutions. Result: The novel neck design of RT-DETR++ achieves superior performance in detecting small and densely packed objects while maintaining sufficient speed for real-time detection without increasing computational complexity. Conclusion: RT-DETR++ provides an effective approach for feature encoding design in real-time detection systems. Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel hollow convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.

[68] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering

Zhiyue Liu,Sihang Liu,Jinyuan Liu,Xinru Zhang

Main category: cs.CV

TL;DR: This paper proposes a training-free framework with knowledge focusing for KB-VQA that enhances knowledge relevance, reduces redundancy, and outperforms existing methods.

Details Motivation: Existing KB-VQA approaches often introduce noise into the answering process by directly augmenting models with retrieved information while ignoring knowledge redundancy. This work aims to address this limitation by enhancing knowledge relevance and reducing redundancy. Method: The framework involves three key steps: (1) knowledge retrieval by creating low-noise queries from image-question pairs, (2) prompting large models to extract answer-beneficial segments from retrieved knowledge, and (3) introducing a selective knowledge integration strategy that mitigates the influence of redundant information. Result: Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods in KB-VQA tasks. Conclusion: The proposed training-free framework with knowledge focusing for KB-VQA enables the acquisition of accurate and critical knowledge, outperforming state-of-the-art methods. Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework summarizes the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. 
In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
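The selective knowledge integration step, in which knowledge is consulted only when the model lacks confidence, can be sketched as a simple gate. This is a minimal illustration, not the authors' implementation. The callables `answer_fn` and `retrieve_fn`, the confidence scale, and the `threshold` value are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: (question, knowledge or None) -> (answer, confidence in [0, 1]).
AnswerFn = Callable[[str, Optional[str]], Tuple[str, float]]
# Hypothetical interface: question -> retrieved (and already filtered) knowledge text.
RetrieveFn = Callable[[str], str]


def answer_selectively(question: str,
                       answer_fn: AnswerFn,
                       retrieve_fn: RetrieveFn,
                       threshold: float = 0.7) -> Tuple[str, bool]:
    """Return (answer, used_knowledge); knowledge is fetched only when needed."""
    # First attempt: answer without any external knowledge.
    answer, confidence = answer_fn(question, None)
    if confidence >= threshold:
        # Confident enough: skip retrieval so redundant knowledge adds no noise.
        return answer, False
    # Low confidence: retrieve knowledge and answer again with it.
    knowledge = retrieve_fn(question)
    answer, _ = answer_fn(question, knowledge)
    return answer, True
```

The threshold of 0.7 is an arbitrary placeholder; how confidence is measured and where the cut-off sits are choices the abstract does not specify.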

[69] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution

Yulin Tong,Fengzong Zhang,Haiqin Cheng

Main category: cs.CV

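TL;DR: This study proposes CWSSNet, a classification framework for ground object classification in hyperspectral remote sensing imagery that achieves high accuracy and good robustness.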

Details Motivation: Fine ground object classification with hyperspectral remote sensing is challenged by feature redundancy arising from numerous bands, high dimensionality, and spectral mixing, so classification accuracy and feature utilization both need improvement. Method: The authors propose CWSSNet, a classification framework that combines 3D spectral-spatial features with wavelet convolution, integrates multimodal information through a multiscale convolutional attention module, and introduces multi-band decomposition and convolution operations in the wavelet domain. Result: In Yugan County, CWSSNet reaches 74.50%, 82.73%, and 84.94% in mIoU, mAcc, and mF1 respectively, obtains the highest IoU for water bodies, vegetation, and bare land, and shows good robustness. Conclusion: CWSSNet maintains reliable performance under small-sample training conditions and classifies stably across training set proportions. Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. 
Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
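As a reminder of how the three reported metrics relate, the sketch below derives mIoU, mAcc, and mF1 from a class confusion matrix using the common per-class definitions. The paper's exact evaluation protocol may differ. The function name and the convention that rows are ground truth and columns are predictions are assumptions.

```python
def segmentation_metrics(confusion):
    """Return (mIoU, mAcc, mF1) from a square confusion matrix.

    Assumed convention: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    ious, accs, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        # Classes with an empty denominator score 0 here; skipping them is
        # another common convention.
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        accs.append(tp / (tp + fn) if tp + fn else 0.0)   # per-class accuracy (recall)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return sum(ious) / n, sum(accs) / n, sum(f1s) / n


# Toy two-class example, not data from the paper.
miou, macc, mf1 = segmentation_metrics([[8, 2], [1, 9]])
print(round(miou, 4), round(macc, 4), round(mf1, 4))
```

Because per-class F1 equals 2·IoU / (1 + IoU), mF1 is never below mIoU, which is consistent with the ordering of the reported figures.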

[70] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

Chunxiao Li,Xiaoxiao Wang,Meiling Li,Boming Miao,Peng Sun,Yunjian Zhang,Xiangyang Ji,Yao Zhu

Main category: cs.CV

TL;DR: This paper introduces the Real-World Robustness Dataset (RRDataset) to evaluate AI-generated image detection methods under real-world conditions, revealing their limitations and emphasizing the importance of human adaptability for improved detection algorithms.

Details Motivation: The motivation is to address the research gap in evaluating AI-generated image detection methods under complex real-world conditions, as existing datasets and methods do not fully address these challenges. Method: The paper introduces the Real-World Robustness Dataset (RRDataset) to evaluate AI-generated image detection models across three dimensions: Scenario Generalization, Internet Transmission Robustness, and Re-digitization Robustness. It benchmarks 17 detectors and 10 vision-language models (VLMs), and conducts a large-scale human study with 192 participants. Result: The benchmarking results reveal the limitations of current AI detection methods in real-world conditions and highlight the importance of human adaptability in detecting AI-generated images. Conclusion: The paper concludes that current AI-generated image detection methods have limitations under real-world conditions, and suggests leveraging human adaptability for improved detection algorithms. Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 
3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.

[71] Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection

Jiasheng Guo,Xin Gao,Yuxiang Yan,Guanghao Li,Jian Pu

Main category: cs.CV

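TL;DR: Dark-ISP is a lightweight, self-adaptive image signal processing plugin for low-light object detection. It processes Bayer RAW images directly, is trained end to end with the detector, and outperforms existing RGB- and RAW-based detection methods.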

Details Motivation: Object detection in low-light environments is crucial for many real-world applications but is challenging because of degraded image quality. Although RAW images offer greater potential than RGB images, existing methods either lose information or rely on overly complex frameworks. Dark-ISP is proposed to address these problems. Method: The paper proposes Dark-ISP, a lightweight, self-adaptive image signal processing (ISP) plugin. It decomposes the conventional ISP pipeline into differentiable linear (sensor calibration) and nonlinear (tone mapping) modules optimized through task-driven losses. Each module has content-aware adaptability and physics-informed priors, producing a RAW-to-RGB conversion aligned with the detection objective. In addition, a Self-Boost mechanism exploits the cascade structure of the ISP pipeline to strengthen cooperation between modules. Result: Extensive experiments on three RAW image datasets show that Dark-ISP outperforms state-of-the-art RGB- and RAW-based detection methods in low-light environments while using minimal parameters. Conclusion: Dark-ISP addresses the problems of RAW image processing in low-light environments and provides an efficient, lightweight, end-to-end solution for object detection. Abstract: Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.

[72] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Hanwei Zhu,Haoning Wu,Zicheng Zhang,Lingyu Zhu,Yixuan Li,Peilin Chen,Shiqi Wang,Chris Wei Zhou,Linhan Cao,Wei Sun,Xiangyang Zhu,Weixia Zhang,Yucheng Zhu,Jing Liu,Dandan Zhu,Guangtao Zhai,Xiongkuo Min,Zhichao Zhang,Xinyue Li,Shubo Xu,Anh Dao,Yifan Li,Hongyuan Yu,Jiaojiao Yi,Yiding Tian,Yupeng Wu,Feiran Sun,Lijuan Liao,Song Jiang

Main category: cs.CV

TL;DR: VQualA 2025 Challenge evaluates and improves LMMs' ability to reason about visual quality differences using a new benchmark with thousands of tasks, including single images, pairs, and multi-image groups.

Details Motivation: To enhance the capability of state-of-the-art LMMs in performing detailed reasoning about visual quality differences across multiple images. Method: Introduction of a novel benchmark with coarse-to-fine grained visual quality comparison tasks and holistic evaluation protocols such as 2AFC-based binary preference and multi-choice questions. Result: Approximately 100 participants submitted entries, with five models showcasing the capabilities of instruction-tuned LMMs in quality assessment. Conclusion: The VQualA 2025 Challenge is a significant step towards open-domain visual quality reasoning and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems. Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.

[73] MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network

Ge Sun,Jun Ma

Main category: cs.CV

TL;DR: This paper proposes MGTraj, a multi-granularity goal-guided model for human trajectory prediction, achieving state-of-the-art performance by recursively encoding trajectory proposals using a transformer-based refinement network and integrating features across granularities with weight sharing and velocity prediction.

Details Motivation: Most goal-guided approaches decouple the prediction task into two stages: goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. Method: MGTraj, a novel Multi-Granularity goal-guided model for human trajectory prediction, recursively encodes trajectory proposals from coarse to fine granularity levels using a transformer-based recursive refinement network (RRN). Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task. Result: Comprehensive experimental results on ETH/UCY and the Stanford Drone Dataset indicate that MGTraj outperforms baseline methods. Conclusion: MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods for human trajectory prediction. Abstract: Accurate human trajectory prediction is crucial for robotics navigation and autonomous driving. Recent research has demonstrated that incorporating goal guidance significantly enhances prediction accuracy by reducing uncertainty and leveraging prior knowledge. Most goal-guided approaches decouple the prediction task into two stages: goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities: coarse-grained goal prediction forecasts the overall intention, while fine-grained trajectory completion needs to generate the positions for all future timesteps. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. 
While prior work has shown that multi-granularity representations capture diverse scales of human dynamics and motion patterns, effectively integrating this concept into goal-guided frameworks remains challenging. In this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for human Trajectory prediction. MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels. At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements. Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task to further enhance performance. Comprehensive experimental results on ETH/UCY and the Stanford Drone Dataset indicate that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods.

[74] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization

Zhengzhao Lai,Youbin Zheng,Zhenyang Cai,Haonan Lyu,Jinpu Yang,Hongqing Liang,Yan Hu,Benyou Wang

Main category: cs.CV

TL;DR: The paper introduces MatCha, a benchmark for evaluating materials characterization image understanding by MLLMs, revealing their current limitations compared to human experts.

Details Motivation: To bridge the gap in understanding real-world characterization imaging data by multimodal large language models (MLLMs) within materials science. Method: Development of MatCha, a benchmark for materials characterization image understanding with 1,500 questions requiring expert-level domain expertise across 21 tasks. Result: Evaluation of state-of-the-art MLLMs on MatCha revealed a significant performance gap compared to human experts, particularly for tasks requiring higher-level expertise and visual perception. Conclusion: MatCha highlights the limited adaptability of current MLLMs to real-world materials characterization scenarios and aims to facilitate future research in material discovery and autonomous scientific agents. Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. 
We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.

[75] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Jiesi Hu,Jianfeng Cao,Yanwu Yang,Chenfei Ye,Yixuan Zhang,Hanyang Peng,Ting Ma

Main category: cs.CV

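TL;DR: Medverse is a unified in-context learning (ICL) model for 3D medical imaging that delivers high-accuracy predictions and multi-scale anatomical understanding across multiple tasks and organs, outperforming existing methods.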

Details Motivation: Current ICL models for medical imaging cannot achieve high-fidelity prediction and global anatomical understanding at the same time, and no unified model spans tasks and anatomical regions, so a stronger ICL framework is needed to realize the paradigm's full potential. Method: Medverse uses a next-scale autoregressive ICL framework combined with a blockwise cross-attention module to refine predictions from coarse to fine, and is trained and evaluated on multiple 3D medical imaging tasks. Result: Medverse substantially outperforms existing ICL baselines on previously unseen clinical centers, organs, species, and imaging modalities, and generates consistent full-resolution volumetric outputs. Conclusion: Medverse is a universal ICL model for 3D medical imaging that, through a multi-scale anatomy-aware framework and a blockwise cross-attention module, achieves high-accuracy prediction and global anatomical understanding, performs well across multiple clinical centers, organs, species, and imaging modalities, and establishes a new paradigm for ICL in medical imaging. Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning.
Code and model weights are publicly available at https://github.com/jiesihu/Medverse.

[76] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Rongyao Fang,Aldrich Yu,Chengqi Duan,Linjiang Huang,Shuai Bai,Yuxuan Cai,Kun Wang,Si Liu,Xihui Liu,Hongsheng Li

Main category: cs.CV

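TL;DR: This work develops FLUX-Reason-6M, a large-scale reasoning dataset, and PRISM-Bench, a new evaluation benchmark, to address the lack of high-quality data and comprehensive evaluation for open-source text-to-image models, advancing reasoning-oriented image generation.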

Details Motivation: The lack of large-scale reasoning-focused datasets and comprehensive evaluation benchmarks has held back open-source text-to-image models, leaving a performance gap relative to leading closed-source systems. New datasets and evaluation tools are therefore needed to move the field forward. Method: The authors build the FLUX-Reason-6M dataset, containing 6 million high-quality images and 20 million bilingual descriptions, and design the PRISM-Bench evaluation benchmark, which has seven distinct evaluation tracks and a GCoT-based Long Text challenge and uses advanced vision-language models for nuanced, human-aligned assessment. Result: An extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and identifies specific areas needing improvement, while the release of FLUX-Reason-6M gives the community an important resource. Conclusion: FLUX-Reason-6M and PRISM-Bench fill the gap in large-scale reasoning datasets and comprehensive benchmarks for open-source text-to-image models, provide the community with previously unavailable resources, and support the next wave of reasoning-oriented text-to-image generation. Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation.
Project page: https://flux-reason-6m.github.io/ .

[77] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification

Mustafa Yurdakul,Sakir Tasdemir

Main category: cs.CV

TL;DR: CoAtNeXt, a novel hybrid model based on CoAtNet architecture with enhanced ConvNeXtV2 blocks and CBAM module, demonstrates superior performance in histopathological classification of gastric tissue images, achieving high accuracy and efficiency on HMU-GC-HE-30K and GasHisSDB datasets. This highlights its potential to improve diagnostic accuracy and reduce workload for pathologists.

Details Motivation: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Histopathologic examination is the gold standard but is labor-intensive, prone to variability, and lacks standard procedures. This highlights the need for automated, reliable, and efficient methods for gastric tissue analysis. Method: CoAtNeXt was proposed for the classification of gastric tissue images, built upon the CoAtNet architecture with enhanced ConvNeXtV2 blocks and a CBAM module. It was evaluated on the HMU-GC-HE-30K and GasHisSDB datasets and compared against CNN and ViT models. Result: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies. Conclusion: CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload. Abstract: Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks.
Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.

[78] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Jing Hao,Yuxuan Fan,Yanpeng Sun,Kaixin Guo,Lizhuo Lin,Jinrong Yang,Qi Yong H. Ai,Lun M. Wong,Hao Tang,Kuo Feng Hung

Main category: cs.CV

TL;DR: This paper introduces MMOral, a large multimodal dataset for dental X-ray interpretation, and OralGPT, a model that significantly improves LVLM performance in this specialized medical domain.

Details Motivation: LVLMs have not been effectively applied to specialized domains like dentistry, particularly for interpreting panoramic X-rays due to complex anatomical structures and lack of domain-specific datasets. Method: The authors introduced MMOral, a large-scale multimodal instruction dataset and benchmark for panoramic X-ray interpretation, and proposed OralGPT, a model trained via supervised fine-tuning (SFT) on MMOral to enhance performance. Result: Evaluation of 64 LVLMs on MMOral-Bench showed that the best-performing model (GPT-4o) only achieved 41.45% accuracy. OralGPT demonstrated a 24.73% improvement after one epoch of SFT on Qwen2.5-VL-7B. Conclusion: The study concludes that while LVLMs have strong general performance, they face significant challenges in specialized domains like dentistry, and the proposed MMOral dataset and OralGPT model offer substantial improvements for panoramic X-ray interpretation. Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. 
We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.

[79] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

Chao Yuan,Yang Yang,Yehui Yang,Zach Cheng

Main category: cs.CV

TL;DR: This paper introduces DATE, a novel method for enhancing temporal understanding in multimodal large language models (MLLMs) for long video analysis, combining timestamp injection and semantic-guided sampling to achieve state-of-the-art performance.

Details Motivation: Long video understanding is challenging for MLLMs due to limitations in modeling long-range temporal dependencies with uniform frame sampling and implicit position encodings, leading to information loss and degraded temporal comprehension. Method: The DATE method introduces two key components: the Timestamp Injection Mechanism (TIM), which integrates textual timestamp tokens with video frame embeddings to create a continuous temporal reference system, and the Temporal-Aware Similarity Sampling (TASS) strategy, which reformulates video sampling as a vision-language retrieval task using a two-stage algorithm for semantic relevance and temporal coverage. Result: The DATE method achieves significant improvements in absolute time understanding and key event localization on hour-long video benchmarks, outperforming existing approaches and setting new state-of-the-art results for 7B and 72B models. Conclusion: The proposed DATE method significantly enhances temporal awareness in MLLMs, achieving state-of-the-art results on long video benchmarks for both 7B and 72B models, with the 7B model outperforming many 72B models. Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. 
We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision features, and sampling key events with a similarity-driven, temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. In particular, our 7B model even exceeds many 72B models on some benchmarks.
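The abstract describes TASS only at a high level, so the following is a minimal sketch of what a similarity-driven, temporally regularized greedy selection could look like. It assumes per-frame similarity scores to the enriched caption are already computed, and it uses a hypothetical minimum-gap rule (`min_gap`) as the temporal regularizer. The function name, parameters, and the regularizer itself are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def greedy_temporal_sampling(scores: List[float], timestamps: List[float],
                             k: int, min_gap: float) -> List[int]:
    """Pick up to k frame indices in descending score order, skipping any
    frame closer than min_gap seconds to an already selected frame.

    The min_gap rule is an assumed stand-in for the paper's regularizer.
    """
    # Visit frames from most to least similar to the query caption.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        if len(chosen) == k:
            break
        # Keep the frame only if it is far enough from every chosen frame.
        if all(abs(timestamps[i] - timestamps[j]) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)  # return indices in temporal order


# Hypothetical scores and timestamps (seconds) for six frames.
scores = [0.9, 0.85, 0.2, 0.7, 0.1, 0.8]
timestamps = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
selected = greedy_temporal_sampling(scores, timestamps, k=3, min_gap=5.0)
print(selected)
```

In this toy input, frame 1 scores highly but sits one second after frame 0, so the gap rule skips it in favor of frames that widen temporal coverage.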

[80] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation

Linhao Li,Yiwen Ye,Ziyang Chen,Yong Xia

Main category: cs.CV

TL;DR: This paper proposes PSP-Seg, a progressive pruning framework for dynamic and efficient 3D medical image segmentation.

Details Motivation: 3D medical image segmentation is often costly in resources and time, and existing efficient segmentation models are typically static and manually designed before training, which limits their adaptability across tasks. Method: PSP-Seg starts from a redundant model and iteratively prunes redundant modules using block-wise pruning and a functional decoupling loss. Result: The lightweight variant PSP-Seg-S matches nnU-Net's performance on all datasets while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%. Conclusion: PSP-Seg has potential as a cost-effective yet high-performing alternative for clinical application. Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg's potential as a cost-effective yet high-performing alternative for widespread clinical application.

[81] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Bohao Tang,Yan Ma,Fei Zhang,Jiadi Su,Ethan Chern,Zhulin Hu,Zhixin Wang,Pengfei Liu,Ya Zhang

Main category: cs.CV

TL;DR: This paper proposes an adaptive framework for Vision-Language Models (VLMs) in chart understanding, combining code-based and visual reasoning pathways optimized through reinforcement learning, leading to improved performance.

Details Motivation: Prior approaches to chart understanding with VLMs have limitations, such as reliance on external tools or single reasoning strategies, which led to the need for a more adaptive and verifiable reasoning approach. Method: The method involves introducing a Code-as-Thought (CaT) approach combined with a reinforcement learning framework that uses a dual-reward system to train a model to adaptively decide between code-based or visual reasoning. Result: The proposed adaptive framework demonstrated strong and robust performance on diverse chart-understanding benchmarks, showing that models can be taught not only to reason but also how to reason effectively. Conclusion: The paper concludes that VLMs can be effectively trained to dynamically choose between code-based and visual reasoning strategies for chart understanding, leading to robust performance across benchmarks. Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. 
We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
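To make the dual-reward idea concrete, here is a minimal sketch that combines a data-accuracy term with a decision term. The abstract does not give the reward formulas, so everything here is an assumption for illustration: the relative-tolerance match, the equal weights, the `preferred` pathway label, and all function names are hypothetical.

```python
from typing import Dict


def data_accuracy_reward(extracted: Dict[str, float],
                         reference: Dict[str, float],
                         rel_tol: float = 0.05) -> float:
    """Fraction of reference chart values that the model's extracted values
    match within a relative tolerance (an assumed matching rule)."""
    if not reference:
        return 0.0
    hits = 0
    for key, true_value in reference.items():
        value = extracted.get(key)
        if value is not None and abs(value - true_value) <= rel_tol * max(abs(true_value), 1e-9):
            hits += 1
    return hits / len(reference)


def decision_reward(chosen: str, preferred: str) -> float:
    """1.0 when the model picked the pathway assumed to suit this chart."""
    return 1.0 if chosen == preferred else 0.0


def dual_reward(extracted: Dict[str, float], reference: Dict[str, float],
                chosen: str, preferred: str,
                w_data: float = 0.5, w_decision: float = 0.5) -> float:
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_data * data_accuracy_reward(extracted, reference)
            + w_decision * decision_reward(chosen, preferred))


# Hypothetical bar-chart values: one extraction is close, one is far off.
reference = {"Q1": 100.0, "Q2": 200.0}
extracted = {"Q1": 102.0, "Q2": 250.0}
reward = dual_reward(extracted, reference, chosen="code", preferred="code")
print(reward)
```

Keeping the two terms separate means a model that always picks the code pathway still loses the decision term on charts where direct visual reasoning is the better fit.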

[82] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training

Anthony P. Addison,Felix Wagner,Wentian Xu,Natalie Voets,Konstantinos Kamnitsas

Main category: cs.CV

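TL;DR: This paper proposes a modified U-Net architecture that combines modality-agnostic input channels with data augmentation to segment MRI modalities unseen during training.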

Details Motivation: Existing multimodal brain MRI segmentation models are usually restricted to fixed modality combinations and cannot effectively handle new modalities. This study aims to solve that problem so a model can flexibly handle any combination of modalities. Method: The U-Net architecture is modified to include a modality-agnostic input channel, and an image augmentation method that synthesizes artificial MRI modalities is developed. Result: Evaluated on 8 MRI databases, the method effectively processes modalities seen during training and can also process new, unseen modalities to improve segmentation. Conclusion: By introducing modality-agnostic input channels and an augmentation scheme that synthesizes MRI modalities, the study develops a segmentation model that can perform inference effectively on modalities unseen during training. Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-Net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR).
The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG

[83] Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception

Spyridon Loukovitis,Anastasios Arsenos,Vasileios Karampinis,Athanasios Voulodimos

Main category: cs.CV

TL;DR: A novel model-agnostic open-set detection framework is proposed for UAV autonomy, effectively handling unknown object rejection and robustness against corrupted flight data, with significant performance improvements demonstrated.

Details Motivation: Open-set detection is essential for UAV autonomy in real-world conditions where traditional closed-set detectors perform poorly due to domain shifts and data corruption, posing risks to safety-critical applications. Method: The method estimates semantic uncertainty via entropy modeling in the embedding space and uses spectral normalization and temperature scaling to improve open-set discrimination, specifically for embedding-based detectors. Result: The framework achieved up to a 10% relative AUROC gain compared to standard YOLO-based detectors and showed consistent improvements over baseline methods in ablation studies. Conclusion: The proposed open-set detection framework is well-suited for reliable UAV perception in dynamic air-to-air environments, as it enhances robustness without compromising detection accuracy. Abstract: Open-set detection is crucial for robust UAV autonomy in air-to-air object detection under real-world conditions. Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing risks to safety-critical applications. We propose a novel, model-agnostic open-set detection framework designed specifically for embedding-based detectors. The method explicitly handles unknown object rejection while maintaining robustness against corrupted flight data. It estimates semantic uncertainty via entropy modeling in the embedding space and incorporates spectral normalization and temperature scaling to enhance open-set discrimination. We validate our approach on the challenging AOT aerial benchmark and through extensive real-world flight tests. Comprehensive ablation studies demonstrate consistent improvements over baseline methods, achieving up to a 10% relative AUROC gain compared to standard YOLO-based detectors.
Additionally, we show that background rejection further strengthens robustness without compromising detection accuracy, making our solution particularly well-suited for reliable UAV perception in dynamic air-to-air environments.
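As a rough illustration of entropy-based unknown rejection with temperature scaling, the sketch below computes Shannon entropy over a temperature-scaled softmax and flags high-entropy predictions as unknown. The paper models entropy in the detector's embedding space; using class logits here is a simplifying assumption, and the names `is_unknown`, `temperature`, and `threshold` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_unknown(logits: List[float], temperature: float, threshold: float) -> bool:
    """Reject a detection as 'unknown' when predictive entropy is too high."""
    return entropy(softmax(logits, temperature)) > threshold


confident = [10.0, 0.0, 0.0, 0.0]   # one class dominates
ambiguous = [1.0, 1.0, 1.0, 1.0]    # no class stands out
print(is_unknown(confident, temperature=1.0, threshold=0.5))
print(is_unknown(ambiguous, temperature=1.0, threshold=0.5))
```

In practice the threshold and temperature would be tuned on held-out data; the values above are placeholders.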

[84] Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion

Oh-Tae Jang,Min-Gon Cho,Kyung-Tae Kim

Main category: cs.CV

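TL;DR: This paper proposes SlotSAR, which combines multi-level features with an attention module to disentangle targets from background in SAR images without mask annotations, significantly improving recognition performance.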

Details Motivation: Complex background clutter in SAR images (such as terrain reflections and speckle noise) leads models to extract entangled or spurious features, which harms target recognition. Method: The paper proposes SlotSAR, a novel object-centric learning framework that combines high-level semantic features from SARATR-X with low-level scattering features from a wavelet scattering network and uses a multi-level slot attention module to enhance feature distinctiveness. Result: Experiments show that SlotSAR preserves structural details better than existing OCL methods and achieves state-of-the-art performance in SAR image recognition. Conclusion: SlotSAR effectively disentangles target representations from background clutter, improving target recognition in SAR images. Abstract: Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. In many cases, such clutter exhibits intensity and patterns that resemble targets, leading models to extract entangled or spurious features. Such behavior undermines the ability to form clear target representations, regardless of the classifier. To address this challenge, we propose a novel object-centric learning (OCL) framework, named SlotSAR, that disentangles target representations from background clutter in SAR images without mask annotations. SlotSAR first extracts high-level semantic features from SARATR-X and low-level scattering features from the wavelet scattering network in order to obtain complementary multi-level representations for robust target characterization. We further present a multi-level slot attention module that integrates these low- and high-level features to enhance slot-wise representation distinctiveness, enabling effective OCL. Experimental results demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.

[85] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

Hao Si,Ehsan Javanmardi,Manabu Tsukada

Main category: cs.CV

TL;DR: The paper introduces PHCP, a new framework for heterogeneous collaborative perception that eliminates the need for joint training by dynamically aligning features during inference, achieving strong performance with minimal unlabeled data.

Details Motivation: The study addresses the impracticality of existing methods for heterogeneous collaborative perception, which require either joint training or storing models for all potential collaborators in advance. Method: Progressive Heterogeneous Collaborative Perception (PHCP) is introduced, which dynamically aligns features through self-training an adapter during inference, formulating the problem as few-shot unsupervised domain adaptation. Result: Extensive experiments on the OPV2V dataset demonstrate that PHCP performs well across diverse heterogeneous scenarios and achieves results comparable to SOTA methods trained on the entire dataset, using only a small amount of unlabeled data. Conclusion: PHCP proves to be a practical and effective solution for heterogeneous collaborative perception, achieving performance comparable to state-of-the-art methods with minimal unlabeled data and without the need for joint training. Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. 
Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

[86] Image Recognition with Vision and Language Embeddings of VLMs

Illia Volkov,Nikita Kisel,Klara Janouskova,Jiri Matas

Main category: cs.CV

TL;DR: This paper explores the use of vision-language models (VLMs) for image classification, showing that vision and language components have complementary strengths. A new fusion method is introduced to improve performance without additional learning.

Details Motivation: While VLMs are known for zero-shot classification through image-text alignment, their visual inference capabilities are not well explored. This study aims to better understand and utilize the separate and combined strengths of vision and language components. Method: The study evaluates dual-encoder VLMs like SigLIP 2 and RADIOv2.5 for both language-guided and vision-only classification on ImageNet-1k. It analyzes factors like prompt design, class diversity, k-NN neighbors, and reference set size. A fusion method based on per-class precision is introduced. Result: The study found that some classes perform better with textual prompts, while others are better with visual similarity. Their proposed fusion method improves classification accuracy in a learning-free manner. Conclusion: The authors conclude that language and vision in VLMs offer complementary strengths for image classification, and their proposed fusion method improves performance without requiring additional learning. Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. 
To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
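The abstract says only that the fusion is learning-free and based on per-class precision, so the sketch below shows one plausible reading: estimate each branch's precision per predicted class on a labeled reference set, then, when the text-prompt and k-NN predictions disagree, trust the branch whose predicted class has the higher precision. The decision rule, the tie-break toward the text branch, and all names are assumptions rather than the authors' method.

```python
from collections import Counter
from typing import Dict, List


def per_class_precision(preds: List[str], labels: List[str]) -> Dict[str, float]:
    """Precision for each predicted class: correct predictions of the class
    divided by all predictions of that class."""
    predicted = Counter()
    correct = Counter()
    for p, y in zip(preds, labels):
        predicted[p] += 1
        if p == y:
            correct[p] += 1
    return {c: correct[c] / predicted[c] for c in predicted}


def fuse(text_pred: str, vision_pred: str,
         text_precision: Dict[str, float],
         vision_precision: Dict[str, float]) -> str:
    """Pick the prediction whose class has the higher estimated precision in
    its own branch; ties go to the text branch (an arbitrary choice)."""
    if text_pred == vision_pred:
        return text_pred
    t = text_precision.get(text_pred, 0.0)
    v = vision_precision.get(vision_pred, 0.0)
    return text_pred if t >= v else vision_pred


# Hypothetical reference-set predictions for each branch.
labels = ["cat", "dog", "dog", "dog"]
text_prec = per_class_precision(["cat", "cat", "dog", "dog"], labels)
vision_prec = per_class_precision(["cat", "dog", "dog", "cat"], labels)
fused = fuse("cat", "dog", text_prec, vision_prec)
print(text_prec, vision_prec, fused)
```

Because the precision tables are simple counts over a reference set, this kind of fusion needs no gradient updates, which matches the learning-free framing in the abstract.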

[87] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM

Hui Li,Yi You,Qiqi Chen,Bingfeng Zhang,George Q. Huang

Main category: cs.CV

TL;DR: This paper introduces the BUG workflow with LMM for automated clothing design, enabling fine-grained customization through chat and images, validated by a new dataset called FashionEdit.

Details Motivation: Current generative AI models struggle with fine-grained customization due to text uncertainty and lack of professional user input, limiting their effectiveness in complex workflows like fashion design. Method: The study proposes the BUG workflow using a large multimodal model (LMM) to generate and customize clothing designs through chat and image-to-prompt interactions. A new dataset, FashionEdit, was introduced to evaluate the model's performance based on generation similarity, user satisfaction, and quality. Result: The proposed BUG workflow successfully automates clothing design and customization, demonstrating effectiveness through the FashionEdit dataset evaluations. Conclusion: The BUG workflow with LMM effectively automates the creation and customization of clothing designs, enhancing user creativity and reducing barriers in the garment industry. Abstract: Generative AI is changing how complex workflows are executed in industry, and large multimodal models (LMMs) now support fashion design in the garment industry. Current generative AI models readily turn brainstorming into polished designs, but fine-grained customization still suffers from text uncertainty when end-users lack professional background knowledge. Thus, we propose the Better Understanding Generation (BUG) workflow with an LMM to automatically create and finely customize clothing designs from chat with image-into-prompt. Our framework unleashes users' creative potential beyond words and also lowers the barriers to clothing design and editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow and is evaluated in terms of generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.

[88] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment

Dimitrios Anastasiou,Razvan Caramalau,Nazir Sirajudeen,Matthew Boal,Philip Edwards,Justin Collins,John Kelly,Ashwin Sridhar,Maxine Tran,Faiz Mumtaz,Nevil Pavithran,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos

Main category: cs.CV

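TL;DR: This paper studies few-shot learning and self-supervised pre-training for surgical skill assessment, finding that small but domain-relevant datasets perform better and that adding procedure-specific data significantly improves performance.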

Details Motivation: Skill annotations are scarce, time-consuming to produce, and require expert consensus, so the study investigates how few-shot learning and self-supervised pre-training can support more robust SSA models. Method: SSA is formulated as a few-shot learning task with self-supervised pre-training strategies, and different pre-training sources are evaluated across three few-shot settings. Result: Accuracies of 60.16%, 66.03%, and 73.65% are reached in the 1-, 2-, and 5-shot settings, respectively; adding procedure-specific data raises average accuracy by 1.22% and F1-score by 2.28%. Conclusion: Small but domain-relevant datasets can outperform large-scale but less relevant ones, and including procedure-specific data in pre-training can significantly improve downstream performance. Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.

[89] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles

Ian Nell,Shane Gilroy

Main category: cs.CV

TL;DR: This study proposes a vision-based system to detect distracted or impaired driving by analyzing driver behavior through external observation techniques, offering a solution that works for non-connected vehicles.

Details Motivation: Road traffic accidents caused by human error, particularly distracted and impaired driving, are a significant global concern. This motivates the development of a system to detect such behaviors and improve road safety. Method: The study uses advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, lane position monitoring, the YOLO object detection model, and custom lane estimation algorithms. Result: The experimental evaluations showed that the framework is reliable and adaptable for detecting unsafe driving behaviors under varying road and environmental conditions. Conclusion: The study concludes that the proposed vision-based driver behavior classification system effectively identifies unsafe driving behaviors and offers adaptability across different road and environmental conditions. Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework's reliability and adaptability across varying road and environmental conditions.

[90] Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors

Xiaodong Wang,Zijun He,Xin Yuan

Main category: cs.CV

TL;DR: This paper proposes a novel texture-guided method for intrinsic image decomposition, effectively separating lighting and texture in complex scenes, with high-quality results.

Details Motivation: Intrinsic image decomposition remains challenging for complex scenes with spatially-varying lighting and rich textures, and previous methods often produce texture-less and over-smoothed results. Method: A texture-guided regularization term is designed and integrated into an optimization framework to separate material textures and lighting effects. Result: The method successfully handles severe lighting and rich textures, achieving superior performance in intrinsic image decomposition for real-world images. Conclusion: The proposed method effectively decomposes intrinsic images even in complex scenes, producing high-quality results that outperform existing approaches. Abstract: This paper aims to recover the intrinsic reflectance layer and shading layer given a single image. Though this intrinsic image decomposition problem has been studied for decades, it remains a significant challenge in cases of complex scenes, i.e. spatially-varying lighting effect and rich textures. In this paper, we propose a novel method for handling severe lighting and rich textures in intrinsic image decomposition, which enables to produce high-quality intrinsic images for real-world images. Specifically, we observe that previous learning-based methods tend to produce texture-less and over-smoothing intrinsic images, which can be used to infer the lighting and texture information given a RGB image. In this way, we design a texture-guided regularization term and formulate the decomposition problem into an optimization framework, to separate the material textures and lighting effect. We demonstrate that combining the novel texture-aware prior can produce superior results to existing approaches.

[91] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection

Xiaodong Wang,Ping Wang,Zhangyuan Li,Xin Yuan

Main category: cs.CV

TL;DR: The paper proposes a hybrid data-consistency module that improves reconstruction quality in single-pixel imaging by integrating learned priors with physical forward models through a unified framework based on the decoupling of the diffusion process.

Details Motivation: The motivation is to explore the connection between PnP methods and DDIM for solving ill-posed inverse problems, particularly in the context of single-pixel imaging, aiming to enhance reconstruction quality. Method: The method involves decoupling the diffusion process into denoising, data consistency enforcement, and sampling stages, and proposing a hybrid data-consistency module that combines multiple PnP-style fidelity terms. Result: Experimental results on single-pixel imaging tasks show that the proposed method achieves better reconstruction quality compared to existing approaches. Conclusion: The paper concludes that by integrating learned priors with physical forward models and proposing a hybrid data-consistency module, the reconstruction quality for single-pixel imaging tasks can be significantly improved. Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models-particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.

[92] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data

Pengxu Wen,Tingting Yu,Ziwei Nie,Cheng Jiang,Zhenyu Yin,Mingyang He,Bo Liao,Xiaoping Yang

Main category: cs.CV

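TL;DR: This study develops a fully automatic ICP grading framework that combines ultrasound video analysis with clinical data, significantly improving the accuracy of non-invasive ICP assessment.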

Details Motivation: Elevated intracranial pressure poses a serious threat to brain function, and current clinical ONSD measurement suffers from inconsistent manual operation, subjective view selection, and variable thresholding, which limit its reliability; a more objective and reliable non-invasive ICP assessment method is therefore needed. Method: The framework has two stages: 1) a fundus ultrasound video processing stage that performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement; 2) an ICP grading stage that fuses ONSD metrics with clinical features to predict the ICP grade. Result: The method achieves a validation accuracy of 0.845 ± 0.071 and an independent test accuracy of 0.786, significantly outperforming the conventional threshold-based method (validation accuracy 0.637 ± 0.111, test accuracy 0.429). Conclusion: The paper presents a fully automatic two-stage framework for intracranial pressure (ICP) grading that integrates keyframe identification, optic nerve sheath diameter (ONSD) measurement, and clinical data, establishing a reliable non-invasive approach to clinical ICP evaluation. Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. 
Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). Through effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.

[93] Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality

Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu

Main category: cs.CV

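TL;DR: This paper proposes a new unsupervised method for integrated-circuit defect segmentation that needs no external normal samples: it extracts normal features from the test image itself and uses a coherence loss and the reconstruction residual to detect defects. Experiments show better performance than existing methods and strong robustness.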

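Details Motivation: IC manufacturing introduces diverse, fine-grained defects that reduce yield and reliability. Existing industrial defect segmentation methods rely on comparison against an external set of normal samples, which is brittle for IC images whose layouts vary and are hard to align precisely. A more robust unsupervised defect segmentation method is therefore needed. Method: The paper proposes an IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates normal features from the test image, and a coherence loss strengthens their association with normal regions. Guided by these features, the decoder reconstructs only normal content, and the reconstruction residual segments the defects. Pseudo-anomaly augmentation further stabilizes training. Result: The new unsupervised IC defect segmentation framework detects defects effectively without external normal samples, and pseudo-anomaly augmentation improves training stability. Conclusion: Experiments on datasets from three IC process stages show better performance than existing methods and strong robustness to product variability. Abstract: Modern Integrated-Circuit (IC) manufacturing introduces diverse, fine-grained defects that depress yield and reliability. Most industrial defect segmentation compares a test image against an external normal set, a strategy that is brittle for IC imagery where layouts vary across products and accurate alignment is difficult. We observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. We therefore propose an unsupervised IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content; the reconstruction residual then segments defects. Pseudo-anomaly augmentation further stabilizes training. Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.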
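
The last step of the pipeline described above, turning a reconstruction residual into a defect mask, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `residual_mask`, the per-pixel absolute difference, and the fixed threshold are all assumptions made for illustration, and the toy image values are invented.

```python
def residual_mask(image, reconstruction, threshold):
    """Flag pixels whose absolute reconstruction error exceeds a threshold.

    `image` and `reconstruction` are 2D lists of floats with the same shape.
    The fixed threshold is a hypothetical choice, not taken from the paper.
    """
    return [
        [1 if abs(p - r) > threshold else 0 for p, r in zip(row_img, row_rec)]
        for row_img, row_rec in zip(image, reconstruction)
    ]


# Toy example: the decoder reproduces the repeated "normal" pattern (0.1)
# but not the single bright pixel, so that pixel shows up in the residual.
image = [[0.1, 0.1, 0.9],
         [0.1, 0.1, 0.1]]
reconstruction = [[0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1]]
mask = residual_mask(image, reconstruction, threshold=0.5)
print(mask)
```

The idea is that a decoder restricted to normal content cannot reproduce a defect, so large residuals mark defect pixels.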

[94] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Umaima Rahman,Raza Imam,Mohammad Yaqub,Dwarikanath Mahapatra

Main category: cs.CV

TL;DR: The paper introduces DRiFt, a framework that enhances the reliability of medical vision-language models by separating relevant clinical signals from noise, improving performance and robustness in real-world settings.

Details Motivation: The motivation is to address the reliability issues of medical vision-language models (VLMs) under distribution shifts, which are a concern for their safe deployment. VLMs often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their real-world applicability. Method: DRiFt uses a structured feature decoupling framework that separates clinically relevant signals from task-agnostic noise through parameter-efficient tuning (LoRA) and learnable prompt tokens. High-quality, clinically grounded image-text pairs are curated to enhance cross-modal alignment and reduce uncertainty. Result: The approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over previous prompt-based methods and maintains robustness across unseen datasets. Conclusion: The DRiFt framework contributes to building safer and more trustworthy VLMs for clinical use by improving model generalization and reducing unpredictable behavior under domain shift. Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. 
Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.

[95] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

Yuchan Jie,Yushen Xu,Xiaosong Li,Fuqiang Zhou,Jianming Lv,Huafeng Li

Main category: cs.CV

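TL;DR: This paper proposes FS-Diff, a new joint image fusion and super-resolution method that casts the problem as conditional generation and uses semantic guidance and a clarity-aware mechanism, outperforming existing methods on multiple datasets.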

Details Motivation: Existing image fusion techniques perform poorly in real-world applications where target and background structures are easily corrupted, resolution is low, and semantic information is weak, so a more effective method is needed. Method: FS-Diff unifies image fusion and super-resolution as a conditional generation problem, uses a bidirectional feature Mamba to extract global features from multimodal images, and trains a modified U-Net to denoise at multiple noise levels to produce high-resolution fusion results. Result: Experiments show that FS-Diff outperforms state-of-the-art methods on multiple public datasets and the newly constructed AVMS dataset, and recovers richer details and semantics in the fused images. Conclusion: FS-Diff is a semantic-guidance and clarity-aware joint image fusion and super-resolution method that unifies the two problems as conditional generation and demonstrates its superiority on multiple datasets. Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. 
Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.

[96] Semantic Concentration for Self-Supervised Dense Representations Learning

Peisong Wen,Qianqian Xu,Siran Dai,Runmin Cong,Qingming Huang

Main category: cs.CV

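TL;DR: This paper addresses self-supervised learning of dense representations, proposing a new method that combines an object-aware filter with a noise-tolerant ranking loss to overcome the limitations of existing approaches.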

Details Motivation: Mainstream dense self-supervised methods suffer from over-dispersion, and the implicit semantic concentration that image-level SSL relies on is infeasible for dense SSL because of its spatial sensitivity and complicated scene-centric data, which motivates exploring explicit semantic concentration. Method: The paper proposes a semantic concentration method based on an object-aware filter, combined with a noise-tolerant ranking loss for learning patch correspondences. Result: Experiments across various tasks demonstrate the effectiveness of the method, and the code is open-sourced. Conclusion: The paper proposes a new dense self-supervised learning method and validates its effectiveness experimentally. Abstract: Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available at https://github.com/KID-7391/CoTAP.

[97] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model

Yushen Xu,Xiaosong Li,Yuchun Wang,Xiaoqi Cheng,Huafeng Li,Haishu Tan

Main category: cs.CV

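TL;DR: FlexiD-Fuse is a diffusion-based medical image fusion network that handles a variable number of input modalities, overcoming the limitation of existing methods that support only fixed-condition inputs and achieving strong performance.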

Details Motivation: Existing fusion methods accept only a fixed number of input modalities, which hinders clinical application, so a flexible method that handles varying numbers of inputs was needed. Method: FlexiD-Fuse recasts the problem as maximum likelihood estimation based on the diffusion process and hierarchical Bayesian modeling, and uses the Expectation-Maximization algorithm to improve the fusion process. Result: Experiments show that FlexiD-Fuse outperforms recent state-of-the-art methods in medical image fusion under varying input conditions, and is consistently effective on infrared-visible, multi-exposure, and multi-focus image fusion tasks. Conclusion: With its flexible input handling and strong fusion performance, FlexiD-Fuse is shown to be an effective solution for a variety of medical image fusion tasks. Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can end-to-end process two-modal and tri-modal medical image fusion under the same weight. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared the latest two and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. 
The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.

[98] Resource-Efficient Glioma Segmentation on Sub-Saharan MRI

Freedmore Sidume,Oumayma Soula,Joseph Muthui Wacira,YunFei Zhu,Abbas Rabiu Muhammad,Abderrazek Zeraii,Oluwaseun Kalejaye,Hajer Ibrahim,Olfa Gaddour,Brain Halubanza,Dong Zhang,Udunna C Anazodo,Confidence Raymond

Main category: cs.CV

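TL;DR: This study develops an efficient deep learning model for accurate glioma segmentation in MRI from Sub-Saharan Africa, overcoming the challenges of data scarcity and resource constraints.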

Details Motivation: The scarcity of high-quality annotated imaging data in Sub-Saharan Africa poses a major challenge for deploying advanced segmentation models in clinical workflows. Method: The study uses a 3D Attention UNet architecture with residual blocks, enhanced through transfer learning from weights pre-trained on the BraTS 2021 dataset. Result: On the BraTS-Africa dataset, the model achieves Dice scores of 0.76, 0.80, and 0.85 for the Enhancing Tumor (ET), Necrotic and Non-Enhancing Tumor Core (NETC), and Surrounding Non-Functional Hemisphere (SNFH) regions, respectively. Conclusion: The study presents an efficient deep learning framework suited to resource-limited settings, helping to close the gap in AI for global health. Abstract: Gliomas are the most prevalent type of primary brain tumors, and their accurate segmentation from MRI is critical for diagnosis, treatment planning, and longitudinal monitoring. However, the scarcity of high-quality annotated imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for deploying advanced segmentation models in clinical workflows. This study introduces a robust and computationally efficient deep learning framework tailored for resource-constrained settings. We leveraged a 3D Attention UNet architecture augmented with residual blocks and enhanced through transfer learning from pre-trained weights on the BraTS 2021 dataset. Our model was evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma segmentation in SSA MRI data. Despite the limited data quality and quantity, our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH). These results demonstrate the generalizability of the proposed model and its potential to support clinical decision making in low-resource settings. The compact architecture, approximately 90 MB, and sub-minute per-volume inference time on consumer-grade hardware further underscore its practicality for deployment in SSA health systems. This work contributes toward closing the gap in equitable AI for global health by empowering underserved regions with high-performing and accessible medical imaging solutions.
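
For readers unfamiliar with the metric reported above, the Dice score measures the overlap between a predicted mask and a reference mask as 2|P ∩ T| / (|P| + |T|). The sketch below is a generic illustration on flattened binary masks, not code from the paper; the function name `dice_score`, the toy masks, and the convention for two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the positive counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (invented): 2 overlapping voxels, 3 positives in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))  # 2*2 / (3+3) -> 0.667
```

In a multi-region setting such as ET, NETC, and SNFH, a score like this would typically be computed once per region label and reported separately, as the abstract does.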

[99] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection

Victor Livernoche,Akshatha Arodi,Andreea Musulan,Zachary Yang,Adam Salvail,Gaétan Marceau Caron,Jean-François Godbout,Reihaneh Rabbany

Main category: cs.CV

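TL;DR: This paper presents a comprehensive, politically focused deepfake detection dataset targeting modern generative models, along with a community-driven crowdsourced adversarial platform, to make deepfake detection more robust and adaptive against misinformation threats.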

Details Motivation: Existing deepfake detection datasets are limited, relying on outdated generation methods, low realism, or single-face imagery, which restricts general synthetic image detection. Meanwhile, deepfakes spreading misinformation in politically sensitive contexts make the problem worse. Method: The authors analyze social media posts to identify the multiple modalities through which deepfakes spread; run a human perception study on how hard synthetic images are to tell apart from real ones; build a large dataset of three million real images and 963,000 synthetic images; and introduce a crowdsourced adversarial platform that incentivizes participants to generate and submit challenging synthetic images. Result: The paper delivers a comprehensive political deepfake detection dataset of three million real images and 963,000 high-quality synthetic images, plus a community-driven crowdsourced adversarial platform for continually improving deepfake detection. Conclusion: By introducing the dataset and the crowdsourced adversarial platform, the paper aims to improve the robustness and adaptability of deepfake detection methods against sophisticated misinformation threats. Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.

[100] Improving Human Motion Plausibility with Body Momentum

Ha Linh Nguyen,Tze Ho Elden Tse,Angela Yao

Main category: cs.CV

TL;DR: This paper proposes a momentum-based loss to better connect local joint motion with global body movement, leading to more realistic and physically accurate motion generation.

Details Motivation: Existing motion models often fail to accurately capture the physical relationship between local joint dynamics and global body movement, leading to inaccuracies and artifacts in generated motion. Method: The authors introduce a new loss term based on whole-body linear and angular momentum to link local motion with global movement, enforcing consistency with ground-truth momentum profiles. Result: Incorporating the momentum-based loss improves motion quality by reducing foot sliding and jitter, enhancing balance, and preserving motion accuracy. Conclusion: The proposed method enhances motion models by incorporating whole-body linear and angular momentum, improving motion accuracy and balance while reducing artifacts like foot sliding. Abstract: Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body's movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. 
Code and data are available at the project page https://hlinhn.github.io/momentum_bmvc.
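
To make the momentum constraint described above concrete, here is a minimal sketch of how whole-body linear and angular momentum could be computed and compared. It is not the authors' implementation: it assumes each joint is a point mass, estimates velocity by a finite difference between two frames, takes angular momentum about the centre of mass, and uses a plain squared-error penalty. The names `momentum` and `momentum_loss` and the toy values are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def momentum(masses, pos_prev, pos_next, dt):
    """Linear and angular momentum of point masses between two frames.

    Velocities are finite differences; angular momentum is taken about the
    centre of mass of the first frame (an assumption of this sketch).
    """
    total_mass = sum(masses)
    vel = [tuple((n - p) / dt for p, n in zip(a, b))
           for a, b in zip(pos_prev, pos_next)]
    com = tuple(sum(m * p[k] for m, p in zip(masses, pos_prev)) / total_mass
                for k in range(3))
    linear = tuple(sum(m * v[k] for m, v in zip(masses, vel)) for k in range(3))
    angular = [0.0, 0.0, 0.0]
    for m, p, v in zip(masses, pos_prev, vel):
        r = tuple(p[k] - com[k] for k in range(3))
        contribution = cross(r, tuple(m * v[k] for k in range(3)))
        for k in range(3):
            angular[k] += contribution[k]
    return linear, tuple(angular)


def momentum_loss(generated, ground_truth):
    """Squared error between (linear, angular) momentum pairs."""
    (gen_lin, gen_ang), (gt_lin, gt_ang) = generated, ground_truth
    return (sum((a - b) ** 2 for a, b in zip(gen_lin, gt_lin))
            + sum((a - b) ** 2 for a, b in zip(gen_ang, gt_ang)))


# Two unit masses: the "ground truth" spins about the z-axis, while the
# "generated" motion translates both masses along y instead.
masses = [1.0, 1.0]
start = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
spin = momentum(masses, start, [(1.0, 1.0, 0.0), (-1.0, -1.0, 0.0)], dt=1.0)
slide = momentum(masses, start, [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0)], dt=1.0)
print(spin, slide, momentum_loss(slide, spin))
```

The example shows why momentum links local and global motion: the two clips move the first joint identically, yet their whole-body momenta differ, and the loss penalizes that mismatch.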

[101] Region-Wise Correspondence Prediction between Manga Line Art Images

Yingxuan Li,Jiafeng Mao,Qianru Qiu,Yusuke Matsui

Main category: cs.CV

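TL;DR: This paper proposes a new method for predicting region-wise correspondence between manga line art images without pre-existing annotations and shows its potential for practical applications.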

Details Motivation: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing that supports downstream applications such as automatic line art colorization and in-between frame generation. The task remains largely unexplored in realistic scenarios, especially when no segmentation or annotations exist. Method: The paper proposes a Transformer-based framework that predicts region correspondence by learning patch-level similarities within and across images. Edge-aware clustering and a region matching algorithm then convert patch-level predictions into coherent region-level correspondences. Result: Experiments show that the method reaches patch-level accuracy as high as 96.34% and produces consistent region-level correspondences. Conclusion: The paper proposes a new method that predicts region-wise correspondence between manga line art images without pre-existing labels or masks and shows its potential for real-world manga applications. Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.

[102] Generative Diffusion Contrastive Network for Multi-View Clustering

Jian Zhu,Xin Zou,Xi Wang,Ning Zhang,Bian Wu,Yao Yang,Ying Zhou,Lingfang Zeng,Chang Tang,Cheng Luo

Main category: cs.CV

TL;DR: This paper proposes GDCN, a novel method for multi-view clustering that addresses low-quality data issues using a stochastic generative diffusion fusion approach, achieving top performance in deep MVC tasks.

Details Motivation: The motivation is to address the problem of low-quality data in multi-view fusion, which arises from noisy or missing data in certain views, thereby improving clustering performance. Method: The paper proposes the Stochastic Generative Diffusion Fusion (SGDF) method, which uses a multiple generative mechanism for multi-view features. Based on SGDF, the Generative Diffusion Contrastive Network (GDCN) is developed. Result: Extensive experiments show that GDCN achieves state-of-the-art results in deep multi-view clustering (MVC) tasks. Conclusion: The paper concludes that the proposed GDCN, based on SGDF, achieves state-of-the-art results in deep MVC tasks and is robust to low-quality data. Abstract: In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves the state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.

[103] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

Paul F. R. Wilson,Matteo Ronchetti,Rüdiger Göbl,Viktoria Markova,Sebastian Rosenzweig,Raphael Prevost,Parvin Mousavi,Oliver Zettinig

Main category: cs.CV

TL;DR: DualTrack introduces a dual-encoder deep learning architecture for sensorless 3D ultrasound, effectively combining local and global features to achieve high accuracy and consistent 3D imaging.

Details Motivation: Traditional 3D ultrasound systems are limited by cost and complexity, while existing sensorless 3D US approaches struggle to effectively model both local and global features for accurate trajectory estimation. Method: DualTrack uses a dual-encoder architecture with decoupled local and global encoders, leveraging dense spatiotemporal convolutions for local features and a 2D CNN or foundation model with temporal attention layers for global features, followed by a lightweight fusion module to estimate the probe trajectory. Result: DualTrack achieves state-of-the-art accuracy on a large public benchmark, with an average reconstruction error below 5 mm and globally consistent 3D reconstructions. Conclusion: DualTrack provides a promising solution for 3D ultrasound imaging by effectively combining local and global features, achieving high accuracy and consistent 3D reconstructions. Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. 
The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.

[104] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders

Dohun Lee,Hyeonho Jeong,Jiwook Kim,Duygu Ceylan,Jong Chul Ye

Main category: cs.CV

TL;DR: Align4Gen enhances video diffusion models by aligning features with pre-trained encoders, leading to better video generation.

Details Motivation: Video diffusion models have advanced rapidly, but improving their feature representation power has received less attention, even though it could raise video generation quality. Method: The authors analyze various vision encoders with a new metric and propose Align4Gen, a multi-feature fusion and alignment method integrated into video diffusion model training. Result: Align4Gen improved video generation performance in both unconditional and class-conditional settings, as quantified by various metrics. Conclusion: Align4Gen improves video generation by aligning intermediate features with pre-trained vision encoders, showing effectiveness in unconditional and class-conditional tasks. Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/

[105] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation

Sirui Xu,Dongting Li,Yucheng Zhang,Xiyan Xu,Qi Long,Ziyin Wang,Yunzhi Lu,Shuchang Dong,Hezi Jiang,Akshat Gupta,Yu-Xiong Wang,Liang-Yan Gui

Main category: cs.CV

TL;DR: InterAct is a large-scale benchmark that addresses the challenges of modeling and generating dynamic 3D human-object interactions.

Details Motivation: Modeling and generating dynamic 3D human-object interactions remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. Method: First, the authors consolidate and standardize 21.81 hours of HOI data from diverse sources and enrich it with detailed textual annotations. Second, they propose a unified optimization framework that reduces artifacts and corrects hand motions. Third, they define six benchmarking tasks and develop a unified HOI generative modeling perspective. Result: Leveraging the principle of contact invariance, they introduce motion variations while maintaining human-object relationships, expanding the dataset to 30.70 hours. They also achieve state-of-the-art performance and run extensive experiments that validate the utility of the dataset. Conclusion: InterAct is a large-scale 3D HOI benchmark that addresses the challenges of modeling and generating dynamic 3D human-object interactions through dataset and methodological advancements. Abstract: While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.

[106] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification

Akshit Achara,Esther Puyol Anton,Alexander Hammers,Andrew P. King

Main category: cs.CV

TL;DR: This paper shows that deep learning models for Alzheimer's diagnosis from MRI scans can exhibit race and sex-based bias due to shortcut learning, and it provides insights into making these models fairer.

Details Motivation: Deep learning algorithms used in Alzheimer's disease diagnosis from MRI scans can suffer from shortcut learning and bias based on protected attributes like race and sex. Addressing this issue is critical for developing fair diagnostic tools. Method: The authors explored shortcut learning and demographic bias by analyzing deep learning models' ability to identify race or sex from MRI scans, assessing the impact of training set imbalance, and conducting feature attribution analyses across multiple datasets using ResNet and SwinTransformer models. Result: Experiments showed that deep learning models can infer race and sex from brain MRI scans, training imbalances lead to performance drops, and feature attribution analyses revealed biased patterns in brain region usage for prediction. Conclusion: The study demonstrates the existence of race and sex-based shortcut learning and bias in deep learning models used for Alzheimer's disease diagnosis from MRI scans, highlighting the need for fairer diagnostic tools. Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. 
Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR

[107] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection

Sijun Dong,Yuxuan Hu,LiBo Wang,Geng Chen,Xiaoliang Meng

Main category: cs.CV

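TL;DR: PeftCD is a remote sensing change detection framework built on vision foundation models with parameter-efficient fine-tuning, offering efficiency, accuracy, and strong generalization.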

Details Motivation: PeftCD is proposed to address the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal, multi-source remote sensing imagery. Method: Built on Vision Foundation Models (VFMs), PeftCD applies Parameter-Efficient Fine-Tuning (PEFT): a weight-sharing Siamese encoder with LoRA and Adapter modules, plus a lightweight decoder that keeps the focus on the feature representations of the backbone. Result: PeftCD achieves state-of-the-art performance on multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), and LEVIR-CD (85.62%), with precise boundary delineation and strong suppression of pseudo changes. Conclusion: PeftCD provides an efficient, accurate, and strongly generalizing method for remote sensing change detection and offers a powerful paradigm for adapting large-scale vision foundation models to real-world applications. Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
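
For readers unfamiliar with the IoU metric quoted in this entry, the sketch below computes intersection over union for the "changed" class of two binary change masks. It uses the standard definition only; the function name `binary_iou`, the nested-list mask format, and the empty-mask convention are assumptions made for illustration, and the exact evaluation protocol used for PeftCD is not stated in this digest.

```python
def binary_iou(pred, target):
    """Intersection over union of the 'changed' class for two binary masks.

    Both masks are equal-sized nested lists of 0/1 values (1 = changed pixel).
    This mask format is an illustrative assumption, not the paper's.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # pixel marked changed in both masks
            if p or t:
                union += 1  # pixel marked changed in at least one mask

    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else intersection / union


predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(binary_iou(predicted, reference))  # 2 shared pixels / 4 in the union = 0.5
```

A score of 73.81% on SYSU-CD would mean, under this definition, that the overlap between predicted and reference change regions covers about 74% of their union.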

[108] Visual Grounding from Event Cameras

Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau

Main category: cs.CV

TL;DR: This paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding using event camera data, aimed at improving multimodal perception in dynamic environments.

Details Motivation: Event cameras have advantages in capturing dynamic scenes, but their integration with natural language understanding is underexplored, creating a gap in multimodal perception. Method: The paper introduces Talk2Event, a large-scale benchmark built on real-world driving scenarios, containing annotated scenes, objects, and referring expressions enriched with structured attributes. Result: Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and over 30,000 validated referring expressions with structured attributes for spatial, temporal, and relational analysis. Conclusion: Talk2Event offers a foundation for advancing multimodal and temporally-aware perception with potential applications in robotics and human-AI interaction. Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes -- appearance, status, relation to the viewer, and relation to surrounding objects -- that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.

[109] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding,Jiwen Liu,Wenyuan Zhang,Zekun Wang,Wentao Hu,Liyuan Cui,Mingming Lao,Yingchao Shao,Hui Liu,Xiaohan Li,Ming Chen,Xiaoqiang Liu,Yu-Shen Liu,Pengfei Wan

Main category: cs.CV

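TL;DR: Kling-Avatar is a new audio-driven avatar video generation framework that uses a multimodal large language model to interpret instructions and generate high-quality video, markedly improving narrative coherence, emotional expressiveness, and controllability.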

Details Motivation: Existing audio-driven avatar generation methods perform only low-level tracking based on acoustic or visual cues and do not model the communicative intent conveyed by the instructions, which limits narrative coherence and character expressiveness. A new method is therefore needed that can understand multimodal instructions and generate high-quality video. Method: Kling-Avatar adopts a two-stage cascaded framework. In the first stage, a multimodal large language model (MLLM) director produces a blueprint video from the multimodal instructions. In the second stage, guided by blueprint keyframes, multiple sub-clips are generated in parallel using a first-last frame strategy, giving global-to-local high-quality video generation. Result: Kling-Avatar performs strongly in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. It generates vivid, fluent, long-duration videos at up to 1080p and 48 fps and suits real-world applications such as digital human livestreaming and vlogging. Conclusion: Kling-Avatar is a new cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation, substantially improving audio-driven avatar video generation and setting a new benchmark for the field. Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. 
Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.

[110] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth

Daria Laslo,Efthymios Georgiou,Marius George Linguraru,Andreas Rauschecker,Sabine Muller,Catherine R. Jutzeler,Sarah Bruningk

Main category: cs.CV

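TL;DR: Proposes a hybrid framework that combines a mathematical model with deep learning to predict brain tumor growth and generate anatomically feasible future MRI images.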

Details Motivation: Predicting the spatio-temporal progression of brain tumors is essential for clinical decision-making in neuro-oncology. Method: A mathematical tumor growth model is combined with a guided denoising diffusion implicit model (DDIM), using gradient guidance to generate images consistent with the predicted tumor growth. Result: The model generates realistic follow-up scans for pediatric diffuse midline glioma cases and introduces tumor growth probability maps. Conclusion: The framework enables biologically informed image generation in data-limited settings, offering a new approach to tumor growth prediction. Abstract: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
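
The abstract evaluates growth maps with the 95th percentile Hausdorff Distance. As a rough illustration of that metric, the sketch below computes it between two small sets of boundary points. It assumes one common convention (the larger of the two directed 95th percentiles, with linear interpolation between ranks); other implementations pool the distances or use nearest-rank percentiles, and the digest does not say which variant the authors used. The helper names are hypothetical.

```python
import math


def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def directed_distances(source, target):
    """Distance from every source point to its nearest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def hd95(points_a, points_b):
    """95th percentile Hausdorff distance (assumed convention: max of both directions)."""
    return max(
        percentile(directed_distances(points_a, points_b), 95),
        percentile(directed_distances(points_b, points_a), 95),
    )


# Hypothetical boundary points of a predicted and an observed tumor outline.
predicted_outline = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
observed_outline = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
print(hd95(predicted_outline, observed_outline))
```

Taking the 95th percentile rather than the maximum makes the score less sensitive to a handful of outlier boundary points, which is why it is popular for segmentation-style comparisons.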

[111] Measuring Epistemic Humility in Multimodal Large Language Models

Bingkui Tong,Jiaer Xia,Sifeng Shang,Kaiyang Zhou

Main category: cs.CV

TL;DR: HumbleBench is a new hallucination benchmark designed to evaluate MLLMs' ability to reject incorrect answers, providing a more realistic measure of reliability in safety-critical settings.

Details Motivation: The motivation is to address hallucinations in MLLMs that can lead to significant risks in real-world applications, by evaluating the model's ability to reject incorrect answers. Method: HumbleBench leverages fine-grained scene graph annotations and uses GPT-4-Turbo to generate multiple-choice questions with a 'None of the above' option. Result: Valuable findings and insights were shared about the ability of state-of-the-art MLLMs to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Conclusion: HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Abstract: Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. 
We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
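
To make the "None of the above" idea concrete, the sketch below scores multiple-choice predictions twice: once over all questions and once over only those whose correct answer is the rejection option. Everything here is an assumption for illustration: the option label `"E"`, the function name, and the choice of metrics are hypothetical and are not taken from the HumbleBench code.

```python
NONE_OF_THE_ABOVE = "E"  # hypothetical label for the rejection option


def score_with_rejection(predictions, answers, nota_label=NONE_OF_THE_ABOVE):
    """Return overall accuracy and accuracy on questions whose answer is the rejection option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")

    correct = [p == a for p, a in zip(predictions, answers)]
    nota_hits = [c for c, a in zip(correct, answers) if a == nota_label]

    overall = sum(correct) / len(correct) if correct else 0.0
    # None signals that the subset is empty, so no rejection accuracy exists.
    rejection = sum(nota_hits) / len(nota_hits) if nota_hits else None
    return {"overall_accuracy": overall, "rejection_accuracy": rejection}


answers = ["A", "E", "C", "E"]
predictions = ["A", "B", "C", "E"]
print(score_with_rejection(predictions, answers))
# {'overall_accuracy': 0.75, 'rejection_accuracy': 0.5}
```

Reporting the rejection subset separately exposes a model that looks accurate overall but rarely declines to pick a plausible wrong option.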

[112] Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Zhiyuan Yan,Kaiqing Lin,Zongjian Li,Junyan Ye,Hui Han,Zhendong Wang,Hao Liu,Bin Lin,Hao Li,Xue Xu,Xinyan Xiao,Jingdong Wang,Haifeng Wang,Li Yuan

Main category: cs.CV

TL;DR: This paper proposes UAE and Unified-GRPO for unified multimodal learning, where understanding and generation processes are trained together using reconstruction fidelity as the objective. This leads to improved performance in both understanding images and generating accurate captions.

Details Motivation: The authors were motivated by the desire to create a more coherent bidirectional information flow between the understanding (I2T) and generation (T2I) processes in multimodal learning, with the goal of bringing mutual gains. Method: The authors propose UAE, a novel framework for unified multimodal learning, and Unified-GRPO via reinforcement learning (RL), which includes three stages: a cold-start phase, Generation for Understanding, and Understanding for Generation. Result: The encoder autonomously produces more descriptive captions as RL progresses, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity. Conclusion: The paper concludes that their proposed UAE framework and Unified-GRPO approach lead to improved understanding and generation in multimodal learning, as evidenced by the high fidelity of image reconstructions from captions. Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. 
We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.

[113] Geometric Neural Distance Fields for Learning Human Motion Priors

Zhengdi Yu,Simone Foti,Linguang Zhang,Amy Zhao,Cem Keskin,Stefanos Zafeiriou,Tolga Birdal

Main category: cs.CV

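TL;DR: This paper proposes Neural Riemannian Motion Fields (NRMF), a new generative prior for 3D human motion that models pose, velocity, and acceleration dynamics with neural distance fields, enabling robust and temporally consistent 3D motion recovery.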

Details Motivation: Existing VAE or diffusion models for 3D human motion recovery suffer from temporal inconsistency and physical implausibility, so a more accurate motion prior is needed. Method: The paper proposes a higher-order motion prior based on neural distance fields (NDFs) that models human motion as the zero level set of pose, velocity, and acceleration fields, with the NDFs constructed on the product space of joint rotations, angular velocities, and angular accelerations. It also introduces an adaptive-step hybrid projection algorithm and a geometric integrator for motion generation. Result: Trained on the AMASS dataset, NRMF shows significant performance gains and generalization across multiple input modalities and tasks (such as denoising, motion in-betweening, and fitting to partial 2D/3D observations). Conclusion: NRMF offers a new approach to 3D human motion generation that combines geometric constraints with physical plausibility and significantly outperforms existing methods. Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.

[114] Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov,Chenyang Yuan,Justin Solomon,Vincent Sitzmann

Main category: cs.CV

TL;DR: This paper investigates the origin of locality in deep diffusion models and demonstrates that it arises from statistical properties of image datasets rather than inductive biases of convolutional neural networks.

Details Motivation: The motivation of this paper is to address the gap between the optimal denoiser and deep diffusion models by investigating the origin of locality in deep diffusion models. Method: An optimal parametric linear denoiser was analyzed to demonstrate similar locality properties to deep neural denoisers. Theoretical and experimental approaches were used to show that this locality arises from pixel correlations in natural image datasets. Result: The results show that an optimal parametric linear denoiser exhibits similar locality properties to deep neural denoisers, and this locality arises directly from pixel correlations in natural image datasets. An analytical denoiser was crafted to better match scores predicted by a deep diffusion model. Conclusion: Locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. 
Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
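
One way to see why a linear denoiser can inherit locality from data statistics is the textbook Wiener filter, written out below. This is a sketch under stated assumptions, not the parametrization from the paper: it assumes a simple additive noise model $y = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and the symbols $\mu$, $\Sigma$, and $W$ are introduced here for illustration.

```latex
% Assumed setup: clean image x with dataset mean \mu and pixel covariance \Sigma,
% observed as y = x + \sigma \varepsilon with \varepsilon \sim \mathcal{N}(0, I).
\begin{aligned}
\hat{x}(y) &= \mu + W\,(y - \mu),
  \qquad W = \Sigma\,(\Sigma + \sigma^{2} I)^{-1}, \\
% With the eigendecomposition \Sigma = U \Lambda U^{\top}, each principal
% direction k is shrunk by a factor between 0 and 1:
W &= U \,\operatorname{diag}\!\left(\frac{\lambda_k}{\lambda_k + \sigma^{2}}\right) U^{\top}.
\end{aligned}
```

Row $i$ of $W$ lists how strongly each input pixel contributes to output pixel $i$. Because $W$ depends only on $\Sigma$ and $\sigma$, any spatial structure in those weights comes from the pixel correlations in the dataset, which is consistent with the claim in the abstract that locality arises from data statistics.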

[115] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang,Yufeng Yuan,Rujie Zheng,Youtian Lin,Jian Gao,Lin-Zhuo Chen,Yajie Bao,Yi Zhang,Chang Zeng,Yanxi Zhou,Xiaoxiao Long,Hao Zhu,Zhaoxiang Zhang,Xun Cao,Yao Yao

Main category: cs.CV

TL;DR: SpatialVID is a new, large-scale dataset with diverse scenes and dense 3D annotations designed to advance spatial intelligence research.

Details Motivation: The scarcity of large-scale, high-quality training data constrains the scalability and real-world fidelity of current spatial intelligence models, particularly for dynamic scenes with ground-truth camera motion. Method: The authors collected over 21,000 hours of raw video and processed them into 2.7 million clips using a hierarchical filtering pipeline, followed by an annotation pipeline to enrich the data with spatial and semantic information. Result: The creation of SpatialVID, a dataset containing 7,089 hours of dynamic content enriched with dense 3D annotations, demonstrating improved richness and diversity compared to existing datasets. Conclusion: SpatialVID provides a scalable and diverse dataset that enhances model generalization and performance, making it a key asset for the video and 3D vision research community. Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. 
Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.