Table of Contents
cs.CL
[1] Social Bias in Multilingual Language Models: A Survey
Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
Main category: cs.CL
TL;DR: This paper reviews research on social bias in multilingual models, emphasizing the need for inclusive and culturally appropriate bias evaluation and mitigation techniques across languages.
Details
Motivation: The motivation is to understand and address the social bias in multilingual models similar to those observed in English text processing models. Method: The study conducts a systematic review of existing research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. Result: The review highlights gaps in methodological design choices, such as preference for certain languages and a lack of multilingual mitigation experiments, and catalogs common issues and solutions in adapting bias benchmarks across languages and cultures. Conclusion: The paper concludes that there is a need to enhance inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements in multilingual bias research. Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
[2] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models
Mohammad Amini,Babak Ahmadi,Xiaomeng Xiong,Yilin Zhang,Christopher Qiao
Main category: cs.CL
TL;DR: This study demonstrates that structured prompting and fine-tuning can enhance mid-sized language models like Gemma for automated test item generation, offering a scalable and efficient solution for K-12 language assessments.
Details
Motivation: The research aims to address the high cost and inconsistency associated with manually developing language assessment items, seeking to improve the efficiency and effectiveness of automated test generation. Method: The study employed a two-fold approach: comparing a fine-tuned medium-sized model (Gemma, 2B) with a larger untuned model (GPT-3.5, 175B), and evaluating seven structured prompting strategies. Generated test items were assessed using automated metrics, expert scoring, and large-model simulation with GPT-4.1. Result: Structured prompting strategies, especially those combining chain-of-thought and sequential design, significantly improved Gemma's performance. Gemma outperformed GPT-3.5's zero-shot responses in generating construct-aligned and instructionally appropriate items, highlighting the importance of prompt design for mid-sized models. Conclusion: This study concludes that structured prompting and efficient fine-tuning can significantly enhance the performance of mid-sized language models for automatic item generation (AIG) in educational assessments, particularly under limited data conditions. Abstract: This study explores automatic item generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-sized model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance mid-sized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
[3] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach
Andrei Mihai Albu,Giovanni Pollo,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Alessandra Neri,Dario Soldi,Fabio Autieri,Sara Vinco
Main category: cs.CL
TL;DR: This paper proposes an open-source methodology to integrate SystemC TLM models into FMI-based co-simulation workflows by encapsulating them as FMI 3.0 Co Simulation Functional Mock-up Units, addressing challenges like time synchronization and data exchange to enable seamless integration across heterogeneous simulation environments.
Details
Motivation: The motivation stems from the increasing complexity of cyber-physical systems, especially in automotive applications, which demands efficient modeling and cross-domain co-simulation techniques. However, SystemC Transaction-Level Modeling (TLM) faces integration challenges due to its limited interoperability with models from other engineering domains. Method: The methodology involves encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs) and addressing key technical challenges like time synchronization and data exchange. Result: The result is a lightweight open-source toolchain that successfully integrates SystemC TLM models into FMI-based co-simulation workflows, demonstrated through representative case studies showing the feasibility and effectiveness of the approach. Conclusion: This paper concludes that by encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units, it is feasible to achieve seamless and standardized integration of SystemC TLM models into FMI-based co-simulation workflows. Abstract: The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. 
We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.
[4] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Rikuto Kotoge,Mai Nishimura,Jiaxin Ma
Main category: cs.CL
TL;DR: DGPO uses cold-start initialization and continuous teacher guidance to address the sparse rewards and unstable training that compact language models encounter in reinforcement learning, enabling compact models to achieve sophisticated agentic search behaviors in resource-constrained environments.
Details
Motivation: Compact language models show poor reasoning ability in reinforcement learning, which leads to sparse rewards and unstable training, so a new method is needed to address these problems. Method: The paper proposes DGPO, which uses teacher demonstrations for cold-start initialization and applies continuous teacher guidance during policy optimization. Result: Experiments show that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model. Conclusion: DGPO provides a feasible solution for agentic RAG in resource-constrained environments. Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
[5] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Haibo Jin,Ruoxi Chen,Peiyan Zhang,Andy Zhou,Yang Zhang,Haohan Wang
Main category: cs.CL
TL;DR: GUARD is introduced as a method to translate government guidelines into actionable tests for LLM compliance, validated across multiple models, with GUARD-JD extending capabilities to vision-language models.
Details
Motivation: The motivation stems from the societal and regulatory concerns regarding harmful responses from LLMs. While governments have issued ethics guidelines, there is a gap in translating these high-level guidelines into actionable testing procedures to ensure compliance. Method: GUARD generates guideline-violating questions to assess LLM adherence to guidelines. GUARD-JD extends this with jailbreak diagnostics to provoke unethical responses. The methods are validated on seven LLMs and extended to vision-language models. Result: GUARD effectively identifies inconsistencies when responses violate guidelines. GUARD-JD successfully transfers jailbreak diagnostics to vision-language models. The methods were validated on seven LLMs, demonstrating their effectiveness in promoting trustworthy AI. Conclusion: The paper concludes that GUARD and GUARD-JD are effective methods for operationalizing government-issued guidelines into actionable testing strategies for LLM compliance, and they promote reliable LLM-based applications. Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
[6] Joint Enhancement of Relational Reasoning for Long-Context LLMs
Zhirui Chen,Wei Shen,Jiashui Huang,Ling Shao
Main category: cs.CL
TL;DR: JERR improves long-context comprehension and complex reasoning in LLMs through graph-based reasoning and MCTS, resulting in better reliability and transparency.
Details
Motivation: LLMs struggle with long contexts due to memory limitations, lack of transparency, and hallucinations. Method: JERR integrates synopsis extraction, graph construction, and relational reasoning using Monte Carlo Tree Search (MCTS). Result: Experimental results show that JERR outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation. Conclusion: JERR provides a novel solution to enhance long-context comprehension and complex reasoning tasks in LLMs with improved reliability and transparency. Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.
[7] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems
Yuyao Wang,Bowen Liu,Jianheng Tang,Nuo Chen,Yuhan Li,Qifan Zhang,Jia Li
Main category: cs.CL
TL;DR: This paper introduces NPH graph problems as a synthetic training corpus for enhancing Long CoT reasoning in LLMs through a two-stage post-training framework, achieving superior performance on various reasoning tasks.
Details
Motivation: To find scalable alternatives to costly and human-curated datasets for developing Long CoT behaviors in RLLMs. Method: A two-stage post-training framework: Long CoT Supervised Fine-Tuning on NPH graph instances and Reinforcement Learning with a fine-grained reward design. Result: Graph-R1-7B surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency, demonstrating strong generalization across various domains. Conclusion: NPH graph problems serve as an effective and scalable resource for advancing Long CoT reasoning in LLMs. Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. 
Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
[8] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models
Jivnesh Sandhan,Fei Cheng,Tushar Sandhan,Yugo Murawaki
Main category: cs.CL
TL;DR: This paper introduces the Context-Aware Personality Evaluation (CAPE) framework to assess how prior conversations influence Large Language Models (LLMs), revealing that context improves response consistency but can also cause personality shifts, especially in GPT and Llama models.
Details
Motivation: The motivation behind this study is to move beyond the traditional 'Disney World test' approach in evaluating LLMs, which ignores real-world conversational context. The researchers aim to develop a more realistic evaluation framework that incorporates prior interactions to better understand and improve the behavioral consistency of LLMs. Method: The researchers introduced the Context-Aware Personality Evaluation (CAPE) framework and applied it to 7 LLMs. They introduced novel metrics to quantify response consistency and evaluated the influence of conversational history on model behavior. They also tested the models' sensitivity to question ordering and the degree to which their responses were influenced by prior interactions. Result: Experiments showed that conversational history improves response consistency through in-context learning but can also cause personality shifts. GPT-3.5-Turbo and GPT-4-Turbo showed extreme personality deviations, while Gemini-1.5-Flash and Llama-8B were more sensitive to question ordering. GPT models relied on both intrinsic traits and prior context, whereas Gemini-1.5-Flash and Llama-8B were more context-dependent. Applying CAPE to RPAs improved alignment with human judgments. Conclusion: The study concludes that context plays a significant role in shaping the responses of Large Language Models (LLMs), with some models exhibiting personality shifts based on prior interactions. The proposed Context-Aware Personality Evaluation (CAPE) framework enhances response consistency and aligns LLM behavior more closely with human judgments, especially in Role Playing Agents (RPAs). Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. 
We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models' responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE
[9] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
Xu Guo
Main category: cs.CL
TL;DR: This paper explores how monitoring conditional entropy during reasoning steps can help predict the correctness of final answers in large language models, showing that decreasing entropy correlates with correct answers and suggesting ways to improve reasoning efficiency.
Details
Motivation: The motivation is to understand how reasoning utility contributes to the correctness of the final answer in large language models, especially given that generating more context does not always increase confidence in the answer. Method: An oracle study was conducted on the MATH dataset using Qwen2.5-32B and GPT-4o to generate reasoning chains, and Qwen3-8B was used to measure the utility of these chains. Conditional entropy was used to quantify uncertainty at each reasoning step. Result: The results indicate that reasoning chains leading to correct answers typically show decreasing conditional entropy over steps, while flat or increasing entropy is associated with incorrect answers. Additionally, incorrect reasoning paths tend to be longer than correct ones. Conclusion: The study concludes that a decrease in conditional entropy during reasoning steps is strongly correlated with correct answers, suggesting that monitoring entropy can help detect and avoid unproductive reasoning early. Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. 
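The per-step measurement described above can be illustrated with a minimal sketch. It is not the paper's code: the function names and the toy logits are hypothetical, and it assumes the scoring model exposes one logit vector over the vocabulary for each answer-span position at each reasoning step. Entropy is computed as the expected negative log-likelihood under the model's own softmax distribution, then averaged over the span.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def span_entropy(span_logits):
    """Mean entropy over the positions of the answer span Y."""
    return sum(token_entropy(l) for l in span_logits) / len(span_logits)

def entropy_trend(per_step_span_logits):
    """Entropy after each reasoning step, plus last-minus-first change."""
    ents = [span_entropy(s) for s in per_step_span_logits]
    return ents, ents[-1] - ents[0]

# Toy example (hypothetical numbers): a 4-token vocabulary and a
# one-token answer span, scored after each of three reasoning steps.
steps = [
    [[0.0, 0.0, 0.0, 0.0]],  # step 1: uniform, maximal uncertainty
    [[2.0, 0.0, 0.0, 0.0]],  # step 2: one token starts to dominate
    [[6.0, 0.0, 0.0, 0.0]],  # step 3: distribution sharply peaked
]
ents, delta = entropy_trend(steps)
print(ents, delta)  # a negative delta means entropy fell across steps
```

In this sketch a negative change corresponds to the decreasing-entropy pattern, while a change near zero or above corresponds to the flat or increasing pattern.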
Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
[10] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung,Agustin Garcinuno,Spencer Mateega
Main category: cs.CL
TL;DR: UI-Bench is the first large-scale benchmark for evaluating the visual quality of AI-generated websites, providing a reproducible evaluation standard through expert assessment.
Details
Motivation: No public benchmark currently verifies the claims of AI text-to-app tools that they can quickly generate high-quality applications. Method: Expert pairwise comparisons, scored with a TrueSkill-based model, are used to evaluate the visual quality of 10 AI text-to-app tools. Result: UI-Bench comprises 30 prompts, 300 generated sites, and more than 4,000 expert judgments, which are used to rank the systems. Conclusion: UI-Bench provides a reproducible standard for evaluating the design quality of AI-generated websites and advances AI web design through a public leaderboard. Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
[11] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
Hengchuan Zhu,Yihuan Xu,Yichen Li,Zijie Meng,Zuozhu Liu
Main category: cs.CL
TL;DR: This paper introduces DentalBench, a comprehensive bilingual benchmark for evaluating and improving the performance of large language models in the dental domain.
Details
Motivation: Existing large language models and medical language models perform well on general medical benchmarks, but their capabilities in fields such as dentistry, which require deeper domain-specific knowledge, remain underexplored because targeted evaluation resources are lacking. Method: The authors introduce DentalBench, which consists of two main components, DentalQA and DentalCorpus, evaluate 14 LLMs, and run further experiments to verify the effect of domain adaptation on model performance. Result: The results show that domain adaptation significantly improves model performance on knowledge-intensive and terminology-focused tasks, underscoring the importance of domain-specific benchmarks for developing trustworthy and effective LLMs for healthcare applications. Conclusion: DentalBench provides an important resource for developing and evaluating LLMs in dentistry and shows that domain-specific benchmarks are essential for improving LLM performance in healthcare applications. Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
[12] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui,Ngoc Mai Thieu,Van Vinh Nguyen,Json J. Jung,Khac-Hoai Nam Bui
Main category: cs.CL
TL;DR: The KG-CQR framework improves query retrieval in retrieval-augmented generation systems by integrating knowledge graphs with large language models; experiments show gains over existing methods of 4-6% in mAP and 2-3% in Recall@25.
Details
Motivation: Existing methods mainly address corpus-level context loss, whereas this work focuses on enriching the contextual representation of complex input queries through structured relation representations, thereby improving the retrieval phase. Method: The paper proposes the KG-CQR framework, comprising subgraph extraction, completion, and contextual generation modules; by enriching the semantic representation of queries, it provides a model-agnostic retrieval enhancement that requires no additional training. Result: Experiments on the RAGBench and MultiHop-RAG datasets show that KG-CQR improves mAP by 4-6% and Recall@25 by 2-3%, and achieves better retrieval effectiveness on multi-hop question answering. Conclusion: By enhancing query representations with a structured knowledge graph, KG-CQR effectively improves the performance of retrieval-augmented generation systems and applies to large language models of varying sizes. Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR consistently outperforms the existing baseline in terms of retrieval effectiveness.
[13] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance
Feng Zhang,Chengjie Pang,Yuehan Zhang,Chenyu Luo
Main category: cs.CL
TL;DR: This paper proposes an industrial-grade benchmark for civil aviation maintenance that evaluates the performance of large language models (LLMs) in this domain while addressing the shortcomings of current evaluation tools in domain knowledge and complex reasoning.
Details
Motivation: Civil aviation maintenance requires highly specialized knowledge and complex reasoning, yet evaluation tools for large language models dedicated to this domain are lacking. Method: The authors develop an industrial-grade benchmark to measure LLM capabilities in civil aviation maintenance and use it to evaluate existing embedding models and LLMs in this domain. Result: Experiments show that the benchmark effectively assesses model performance in civil aviation maintenance; the benchmark and code are open-sourced. Conclusion: The benchmark provides a foundation for improving LLMs in civil aviation maintenance and moves the field toward more intelligent solutions. Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark
[14] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method
Agung Sukrisna Jaya,Osvari Arsalan,Danny Matthew Saputra
Main category: cs.CL
TL;DR: This paper proposes a method that combines CBR, TF-IDF, and cosine similarity to search practical work titles efficiently, and it shows good matching performance in testing.
Details
Motivation: To search practical work titles more effectively on the basis of prior experience, improving matching accuracy and efficiency. Method: Case-based reasoning (CBR) is combined with TF-IDF for text vectorization, and cosine similarity is used to compute similarity values; testing is carried out in two stages. Result: The second-stage test results show that the system finds the same number of titles with the highest average match score, which validates the method. Conclusion: The system obtained the same number of titles and the highest average match score in the second stage, indicating that CBR combined with TF-IDF and cosine similarity is effective for searching practical work titles. Abstract: Case-Based Reasoning (CBR) is a problem-solving technique that draws on the most similar cases that have occurred before. Here, CBR is used to search for practical work titles. TF-IDF is applied to vectorize the words of each practical work title, and cosine similarity is used to calculate similarity values. The system can search by either titles or keywords. The output of the system is the matching practical work titles and the match value of each title. Testing used 705 practical work titles and was carried out with five titles in two stages. The first stage searches with existing titles, and the second stage randomizes the titles from the first stage. In the second stage, the system found the same number of titles and obtained the highest average match score.
[15] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Zhenting Wang,Qi Chang,Hemani Patel,Shashank Biju,Cheng-En Wu,Quan Liu,Aolin Ding,Alireza Rezazadeh,Ankit Shah,Yujia Bao,Eugene Siow
Main category: cs.CL
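TL;DR: MCP-Bench is a new benchmark for evaluating large language models on realistic multi-step tasks involving tool use, cross-tool coordination, parameter control, and task planning.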
Details
Motivation: Existing API-based benchmarks do not adequately evaluate a model's ability to retrieve relevant tools from fuzzy instructions, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows. Method: Built on the Model Context Protocol (MCP), MCP-Bench connects 28 representative live MCP servers covering 250 tools across domains such as finance, travel, scientific computing, and academic search. Result: Experiments show that MCP-Bench effectively tests multiple capabilities of existing large language models and reveals persistent challenges these models face on multi-step tasks. Conclusion: MCP-Bench is a new benchmark for evaluating large language models on realistic multi-step tasks that require tool use, cross-tool coordination, precise parameter control, and planning/reasoning for task solving. Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.[16] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques
Yucheng Ruan,Xiang Lan,Daniel J. Tan,Hairil Rizal Abdullah,Mengling Feng
Main category: cs.CL
TL;DR: This study introduces a deep learning framework using NLP techniques to integrate multimodal EHRs for predicting mortality and resource utilization in critical care. The model outperforms existing methods and shows resilience to data corruption.
Details
Motivation: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. Method: Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Result: Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusion: The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels. Abstract: Background Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. 
Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Results Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.[17] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
Luke Bates,Max Glockner,Preslav Nakov,Iryna Gurevych
Main category: cs.CL
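TL;DR: The paper introduces the ConspirED dataset, which captures the cognitive traits of conspiratorial content, and uses it to develop computational models that identify these traits. It finds that existing large language models and reasoning models are misaligned by conspiratorial content, with outputs that can be shaped by the reasoning patterns of the input.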
Details
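Motivation: As AI-generated misinformation grows more sophisticated, understanding the rhetorical patterns in conspiratorial content is essential for developing interventions (such as targeted prebunking) and for assessing AI vulnerabilities. ConspirED aims to provide a dataset annotated for the cognitive traits of conspiratorial content, which prior work lacked. Method: The paper introduces ConspirED (CONSPIR Evaluation Dataset), which annotates multi-sentence excerpts (80-120 words) from online conspiracy articles using the CONSPIR cognitive framework to capture the cognitive traits of conspiratorial ideation. Using ConspirED, the authors develop computational models that identify conspiratorial traits and determine the dominant traits in text excerpts, and they evaluate LLM/LRM robustness to conspiratorial inputs. Result: The results include (i) computational models that identify conspiratorial traits in text excerpts and determine the dominant traits; and (ii) the finding that both the computational models and large language/reasoning models (LLM/LRM) are misaligned by conspiratorial content: their outputs mirror the reasoning patterns of the input even when they successfully deflect comparable fact-checked misinformation. Conclusion: The paper concludes that both the computational models and large language/reasoning models (LLM/LRM) are affected by conspiratorial content, producing output that resembles the input's reasoning patterns even when they successfully deflect comparable fact-checked misinformation. Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80-120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.[18] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark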
Chihiro Taguchi,Seng Mai,Keita Kurabe,Yusuke Sakai,Georgina Agyei,Soudabeh Eslami,David Chiang
Main category: cs.CL
TL;DR: This paper identifies critical flaws in the FLORES+ multilingual machine translation benchmark, including cultural bias, domain specificity, and evaluation vulnerabilities, and proposes improved benchmarking practices for real-world translation evaluation.
Details
Motivation: The motivation is to investigate the suitability of the FLORES+ benchmark for truly multilingual evaluation, given its widespread use and claimed high-quality standards. Method: The authors analyzed data from four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) using human assessments and compared the performance of MT models on FLORES+ versus a domain-relevant evaluation set. They also tested simple heuristics, such as copying named entities, to evaluate the robustness of the benchmark. Result: Human assessments revealed that many translations in FLORES+ fall below the claimed 90% quality standard. Source sentences were found to be domain-specific and culturally biased toward English-speaking regions. Additionally, MT models trained on high-quality, naturalistic data performed poorly on FLORES+ but showed better performance on a domain-relevant evaluation set. Conclusion: The paper concludes that current multilingual MT benchmarks, particularly FLORES+, have significant shortcomings in terms of domain specificity, cultural bias, and evaluation protocol vulnerabilities. The authors advocate for more domain-general and culturally neutral benchmarks that better reflect real-world translation challenges. Abstract: Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. 
We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.[19] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
Pengjiang Li,Zaitian Wang,Xinhao Zhang,Ran Zhang,Lu Jiang,Pengfei Wang,Yuanchun Zhou
Main category: cs.CL
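TL;DR: SciTopic is a new method that uses large language models to improve scientific topic identification, outperforming existing techniques on multiple datasets.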
Details
Motivation: Existing topic discovery methods rely mainly on word embeddings to capture semantics, lack a comprehensive understanding of scientific publications, and struggle with complex, high-dimensional text relationships. Method: SciTopic first builds a textual encoder to capture the content of scientific publications, then constructs a space optimization module that combines entropy-based sampling with LLM-guided triplet tasks, and finally fine-tunes the textual encoder by optimizing the contrastive loss of the triplets. Result: Extensive experiments on three real-world datasets of scientific publications show that SciTopic outperforms state-of-the-art scientific topic discovery methods. Conclusion: SciTopic is an improved scientific topic discovery method that uses guidance from large language models (LLMs) to strengthen scientific topic identification. Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics.
Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.[20] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Salvador Lima-López,Eulàlia Farré-Maduell,Martin Krallinger,Natalia Loukachevitch,Vera Davydova,Elena Tutubalina,Georgios Paliouras
Main category: cs.CL
TL;DR: BioASQ 2024, part of CLEF 2024, successfully advanced biomedical NLP through four competitive tasks, drawing 37 teams and over 700 submissions, showing ongoing progress in the field.
Details
Motivation: To encourage progress in large-scale biomedical semantic indexing and question answering by organizing competitive challenges that test state-of-the-art systems. Method: The BioASQ challenge organized four shared tasks, including two established ones (b and Synergy) and two new ones (MultiCardioNER and BIONNE), to evaluate advancements in biomedical natural language processing. Result: A total of 37 teams submitted over 700 entries across the four tasks, with most systems showing competitive performance, indicating continuous improvement in the field. Conclusion: The twelfth edition of the BioASQ challenge successfully promoted advancements in biomedical semantic indexing and question answering with high team participation and competitive system performance. Abstract: This is an overview of the twelfth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and two new tasks: a) MultiCardioNER on the adaptation of clinical entity detection to the cardiology domain in a multilingual setting, and b) BIONNE on nested NER in Russian and English. In this edition of BioASQ, 37 competing teams participated with more than 700 distinct submissions in total for the four different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.[21] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Martin Krallinger,Miguel Rodríguez-Ortega,Eduard Rodriguez-López,Natalia Loukachevitch,Andrey Sakhovskiy,Elena Tutubalina,Dimitris Dimitriadis,Grigorios Tsoumakas,George Giannakoulas,Alexandra Bekiaridou,Athanasios Samaras,Giorgio Maria Di Nunzio,Nicola Ferro,Stefano Marchesin,Marco Martinelli,Gianmaria Silvello,Georgios Paliouras
Main category: cs.CL
TL;DR: The BioASQ 2025 challenge hosted six tasks, attracting 83 teams and showcasing advancements in biomedical semantic indexing and question answering.
Details
Motivation: To promote progress in biomedical semantic indexing and question answering by providing a competitive platform for evaluating state-of-the-art methods. Method: The challenge organized six shared tasks, including established and newly introduced ones, to evaluate advancements through team participation and submissions. Result: 83 teams participated with over 1000 submissions, showing continuous improvement in the field with competitive system performances. Conclusion: The thirteenth edition of the BioASQ challenge successfully advanced biomedical semantic indexing and question answering, with multiple tasks showing competitive system performances. Abstract: This is an overview of the thirteenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2025. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks, b and Synergy, and four new tasks: a) Task MultiClinSum on multilingual clinical summarization. b) Task BioNNE-L on nested named entity linking in Russian and English. c) Task ELCardioCC on clinical coding in cardiology. d) Task GutBrainIE on gut-brain interplay information extraction. In this edition of BioASQ, 83 competing teams participated with more than 1000 distinct submissions in total for the six different shared tasks of the challenge. Similar to previous editions, several participating systems achieved competitive performance, indicating the continuous advancement of the state-of-the-art in the field.[22] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data
Jiahao Xiao,Jiangming Liu
Main category: cs.CL
TL;DR: This paper introduces a new benchmarking framework and proposes the Adaptive Federated Distillation (AdaFD) method to address multi-domain non-IID challenges in federated learning for NLP, showing improved performance over existing techniques.
Details
Motivation: The motivation stems from the limitations of current non-IID scenarios in federated learning, which focus only on label diversity and neglect input (language domain) diversity crucial for NLP tasks. The authors aim to bridge this gap by introducing a more realistic and comprehensive evaluation framework. Method: The authors propose a unified benchmarking framework with multi-domain non-IID scenarios and introduce the Adaptive Federated Distillation (AdaFD) framework, which is evaluated through experiments on diverse data to assess its effectiveness in handling real-world federated learning challenges. Result: Experimental results show that the AdaFD framework successfully captures the diversity of local clients and achieves better performance than existing approaches in addressing multi-domain non-IID challenges. Conclusion: The paper concludes that the proposed Adaptive Federated Distillation (AdaFD) framework effectively addresses multi-domain non-IID challenges in both homogeneous and heterogeneous settings, demonstrating superior performance compared to existing methods. Abstract: The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data are highly different from each other and can not capture the global distribution of the whole data in real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and highly investigated. However, previous experimental non-IID scenarios are primarily identified with the label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. 
The benchmark can be used to evaluate the federated learning framework in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to the existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.[23] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search
Zeyu Xiong,Yixuan Nan,Li Gao,Hengzhu Tang,Shuaiqiang Wang,Junfeng Wang,Dawei Yin
Main category: cs.CL
TL;DR: This paper introduces a novel framework that applies generative models to real-time Query-Driven Text Summarization (QDTS), overcoming the limitations of traditional methods and achieving state-of-the-art performance with high deployment efficiency.
Details
Motivation: The motivation stems from the limitations of traditional extractive summarization models, which suffer from cumulative information loss and lack sufficient semantic understanding, particularly for complex search intents. Method: The method involves integrating large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model into a domain-specialized QDTS expert. Result: The model outperforms the production baseline on multiple industry-relevant metrics, achieves a new state of the art, and demonstrates excellent deployment efficiency by handling ~50,000 queries per second under 55 ms average latency using 334 NVIDIA L20 GPUs. Conclusion: The study concludes that the proposed framework successfully pioneers the use of generative models for real-time QDTS in industrial web search, achieving superior performance and deployment efficiency. Abstract: In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. 
Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.[24] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling
Yangfan Wang,Jie Liu,Chen Tang,Lian Yan,Jingchi Jiang
Main category: cs.CL
TL;DR: Knowledge Composition Sampling (KCS) improves multi-hop question generation by better integrating and sampling knowledge within documents, leading to higher accuracy and diversity.
Details
Motivation: Multi-hop question answering suffers from data sparsity, leading to models learning spurious patterns. Prior methods focus on diversifying questions but neglect essential knowledge integration. Method: KCS models knowledge composition selection as a sentence-level conditional prediction task using a probabilistic contrastive loss and stochastic decoding during inference. Result: KCS achieves a 3.9% improvement in knowledge composition selection accuracy and enhances data augmentation performance on HotpotQA and 2WikiMultihopQA datasets. Conclusion: KCS is an effective framework for improving the diversity and accuracy of multi-hop question generation by integrating knowledge composition sampling. Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. 
Our code is available at: https://github.com/yangfanww/kcs.[25] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models
Soham Petkar,Hari Aakash K,Anirudh Vempati,Akshit Sinha,Ponnurangam Kumarauguru,Chirag Agarwal
Main category: cs.CL
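TL;DR: Current GLM evaluation benchmarks are insufficient to assess multimodal reasoning. The newly introduced CLEGR benchmark shows that current GLMs perform poorly on tasks requiring structural reasoning, highlighting the limitations of their graph reasoning capabilities.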
Details
Motivation: Current GLM evaluation benchmarks are mainly node-level classification datasets and cannot adequately assess multimodal reasoning. Method: The authors analyze current GLM evaluation benchmarks and introduce the new CLEGR benchmark, which uses a synthetic graph generation pipeline paired with questions requiring joint reasoning over structure and textual semantics. Result: Strong performance on existing benchmarks is achievable using unimodal information alone, indicating that they do not require graph-language integration; meanwhile, GLMs show significant performance degradation on tasks requiring structural reasoning. Conclusion: Current GLM evaluation benchmarks are insufficient to assess multimodal reasoning. The CLEGR benchmark shows that current GLMs degrade significantly on tasks requiring structural reasoning, highlighting the limitations of their graph reasoning capabilities. Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.[26] Generative Annotation for ASR Named Entity Correction
Yuanchang Luo,Daimeng Wei,Shaojun Li,Hengchao Shang,Jiaxin Guo,Zongyao Li,Zhanglin Wu,Xiaoyu Chen,Zhiqiang Rao,Jinlong Yang,Hao Yang
Main category: cs.CL
TL;DR: This paper introduces a novel named entity correction method for ASR systems that leverages speech sound features to handle significant word form differences, achieving better entity accuracy.
Details
Motivation: End-to-end ASR systems struggle with domain-specific named entities, leading to downstream failures. Existing NEC models fail when transcribed word forms and ground-truth entities are significantly different. Method: A generative method utilizing speech sound features to retrieve candidate entities and annotate entity errors in ASR transcripts. Result: The proposed method demonstrates significant improvement in entity accuracy on both open-source and self-constructed test sets. Conclusion: The proposed NEC method effectively improves domain-specific named entity transcription accuracy in ASR systems, particularly in scenarios where word forms differ significantly. Abstract: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performances. However, when the forms of the wrongly-transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.[27] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning
Nelson Filipe Costa,Leila Kosseim
Main category: cs.CL
TL;DR: This paper presents HArch, a novel hierarchical model for multi-lingual implicit discourse relation recognition, achieving state-of-the-art results and outperforming large language models through fine-tuning.
Details
Motivation: The motivation is to develop the first multi-lingual and multi-label classification model for implicit discourse relation recognition, aiming to improve performance through hierarchical modeling and cross-lingual transfer. Method: The paper introduces HArch, a hierarchical model that leverages dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. It evaluates different pre-trained encoder backbones like RoBERTa and XLM-RoBERTa and compares fine-tuned models with LLMs using few-shot prompting. Result: RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch excels in multi-lingual settings. The fine-tuned models outperform GPT-4o and Llama-4-Maverick in few-shot prompting. Additionally, the model reports SOTA results on the DiscoGeM 1.0 corpus. Conclusion: HArch, a multi-lingual and multi-label classification model, achieves state-of-the-art results for implicit discourse relation recognition, outperforming large language models through task-specific fine-tuning. Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. 
Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.[28] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
Ruiyi Yan,Yugo Murawaki
Main category: cs.CL
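TL;DR: This study examines tokenization inconsistency (TI) between Alice and Bob in text steganography and watermarking, and proposes two tailored solutions that improve the robustness of steganography and the security of watermarking.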
Details
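Motivation: Large language models have improved the capability and efficiency of text generation, while also highlighting the importance of watermarking as a safeguard against malicious misuse. However, tokenization inconsistency in steganography and watermarking can undermine their robustness, so the problem calls for closer study and dedicated solutions. Method: The study analyzes two key characteristics of the problematic tokens that cause tokenization inconsistency, infrequency and temporariness, and on that basis proposes two solutions: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Result: Experiments show that addressing tokenization inconsistency improves fluency, imperceptibility, and anti-steganalysis capacity in steganography, and enhances detectability and robustness against attacks in watermarking. Conclusion: Directly addressing tokenization inconsistency can effectively improve the performance and security of text steganography and watermarking. Abstract: Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.[29] rStar2-Agent: Agentic Reasoning Technical Report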
Ning Shang,Yifei Liu,Yi Zhu,Li Lyna Zhang,Weijiang Xu,Xinyu Guan,Buze Zhang,Bingcheng Dong,Xudong Zhou,Bowen Zhang,Ying Xin,Ziming Miao,Scarlett Li,Fan Yang,Mao Yang
Main category: cs.CL
TL;DR: rStar2-Agent is a 14B math reasoning model that uses agentic reinforcement learning to achieve advanced problem-solving capabilities and state-of-the-art performance.
Details
Motivation: The motivation is to achieve frontier-level performance in math reasoning and complex problem-solving through agentic reinforcement learning. Method: The method involves three key innovations: an efficient RL infrastructure, the GRPO-RoC algorithm, and an efficient agent training recipe. Result: rStar2-Agent achieves state-of-the-art results with average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing larger models with shorter responses. Conclusion: rStar2-Agent-14B demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. 
Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.[30] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees
Stephen Meisenbacher,Maulik Chevli,Florian Matthes
Main category: cs.CL
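TL;DR: This paper proposes DP-ST, a method for document generation under local differential privacy based on semantic triples, and demonstrates its effectiveness in maintaining text coherence and balancing privacy and utility.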
Details
Motivation: Protecting text privacy under local differential privacy can only be done reasonably with a high privacy budget; this paper aims to address that challenge. Method: DP-ST leverages semantic triples and a divide-and-conquer strategy to generate private documents, and combines this with LLM post-processing to improve text coherence. Result: Experiments show that DP-ST can still generate coherent text at lower privacy budgets, effectively balancing privacy and utility. Conclusion: The method underscores the importance of achieving balanced privatization outputs at reasonable privacy budget levels. Abstract: Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.[31] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets
Vassiliy Cheremetiev,Quang Long Ho Ngo,Chau Ying Kot,Alina Elena Baia,Andrea Cavallaro
Main category: cs.CL
TL;DR: The paper proposes fine-tuning large language model-based embedding models to effectively detect implicit hate speech, achieving state-of-the-art results.
Details
Motivation: Implicit hate speech is challenging to detect because it does not include explicit derogatory or inflammatory words. Task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions, and sentiment data. Method: Fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed, and E5. Result: Experiments on multiple implicit hate speech datasets show F1-macro improvements of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation. Conclusion: Fine-tuning general-purpose embedding models based on large language models can achieve state-of-the-art performance in detecting implicit hate speech. Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show improvements of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation, in terms of F1-macro score.[32] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation
Yuanhao Ding,Esteban Garces Arias,Meimingwei Li,Julian Rodemann,Matthias Aßenmacher,Danlu Chen,Gaojuan Fan,Christian Heumann,Chongsheng Zhang
Main category: cs.CL
TL;DR: GUARD is a self-adaptive decoding method that balances coherence and diversity in LLM outputs using a novel uncertainty-driven framework and a token-count-based penalty, resulting in faster and more efficient text generation.
Details
Motivation: The motivation is to address the challenge of balancing coherence with diversity in LLM outputs, particularly to overcome the limitations of contrastive search-based decoding strategies, such as hyperparameter dependence and high computational costs. Method: GUARD uses a 'Glocal' uncertainty-driven framework combining global entropy estimates and local entropy deviations, along with a token-count-based penalty to reduce computational overhead. Result: GUARD demonstrates a good balance between text diversity and coherence, mitigates abrupt variations in uncertainty, and shows substantial improvements in generation speed. It also received validation for its performance from human and LLM evaluators. Conclusion: The study concludes that GUARD effectively balances text diversity and coherence, offering faster generation speed and improved performance, as validated by both human and LLM evaluators. Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. 
Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.[33] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions
Xiaoyi Wang,Jiwei Zhang,Guangtao Zhang,Honglei Guo
Main category: cs.CL
TL;DR: LLM-generated therapy dialogues lack emotional depth compared to real conversations, underscoring the need for more emotionally accurate synthetic data in mental health applications.
Details
Motivation: The motivation for this study was to determine whether synthetic therapy dialogues generated by large language models accurately capture the emotional dynamics of real counseling sessions, particularly as these synthetic dialogues are increasingly used in mental health NLP applications. Method: The researchers used the Utterance Emotion Dynamics framework to analyze emotional arcs in both real and synthetic Cognitive Behavioral Therapy dialogues, focusing on valence, arousal, and dominance dimensions. The analysis included full dialogues and individual speaker roles (counselor and client) from real sessions and the CACTUS dataset. Result: The analysis found that real therapy sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of emotional reactivity and regulation compared to synthetic dialogues. Emotional arc similarity between real and synthetic speakers was low, especially for clients. Conclusion: The study concludes that while LLM-generated therapy dialogues are fluent and structurally coherent, they lack emotional fidelity when compared to real Cognitive Behavioral Therapy sessions. This highlights the limitations of synthetic data in capturing nuanced emotional dynamics critical for mental health applications. Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. 
Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.[34] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
Harethah Abu Shairah,Hasan Abed Al Kader Hammoud,George Turkiyyah,Bernard Ghanem
Main category: cs.CL
TL;DR: ROSI enhances the safety of Large Language Models by steering internal representations toward refusal-mediating directions, improving safety without sacrificing performance.
Details
Motivation: Existing safety mechanisms in LLMs can be bypassed by manipulating internal representations; thus, a more robust and proactive safety alignment method is needed. Method: ROSI modifies model weights using a rank-one update to amplify safety alignment by steering activations toward a refusal-mediating subspace, derived from harmful and harmless instruction pairs. Result: ROSI improves safety refusal rates evaluated by Llama Guard 3 while maintaining model utility on benchmarks like MMLU, HellaSwag, and Arc, and can also align uncensored models. Conclusion: Targeted, interpretable weight steering through ROSI is an effective and inexpensive method to enhance the safety of LLMs, complementing resource-heavy fine-tuning methods. Abstract: Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. 
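To make the phrase "rank-one weight modification" concrete, here is a minimal sketch rather than the authors' implementation. It assumes the safety direction is the normalized difference between mean activations on harmful and harmless prompts, and that the update has the form W + alpha * s (s^T W); both are illustrative assumptions, and the function names and the `alpha` parameter are hypothetical. Under this assumed form, alpha = -1 removes the component along s (the ablation case the abstract mentions), while alpha > 0 amplifies it.

```python
from math import sqrt

Vector = list[float]
Matrix = list[list[float]]


def mean_vector(rows: Matrix) -> Vector:
    """Column-wise mean of a list of activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def unit_direction(harmful_acts: Matrix, harmless_acts: Matrix) -> Vector:
    """Assumed safety direction: normalized difference of mean activations."""
    harmful_mean = mean_vector(harmful_acts)
    harmless_mean = mean_vector(harmless_acts)
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction")
    return [d / norm for d in diff]


def rank_one_inject(weight: Matrix, direction: Vector, alpha: float) -> Matrix:
    """Return W + alpha * s (s^T W) for a matrix whose rows index output dims.

    The added term is an outer product of two vectors, so it has rank one.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    # s^T W: projection of each weight column onto the direction s.
    projection = [
        sum(direction[i] * weight[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    return [
        [weight[i][j] + alpha * direction[i] * projection[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

In this toy form, the same function covers both directions of the intervention, which is why a single scalar is enough to switch between removing and strengthening the component along the chosen direction.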
Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.[35] Signs of Struggle: Spotting Cognitive Distortions across Language and Register
Abhishek Kuber,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah
Main category: cs.CL
TL;DR: This study explores the application of domain adaptation methods to detect cognitive distortions in Dutch adolescents' forum posts, showing promise for cross-lingual and cross-register generalization despite challenges posed by variations in language and writing style.
Details
Motivation: The rise in mental health issues among youth has created a need for automated detection of early signs of psychological distress, particularly cognitive distortions, which can lead to timely and cost-effective interventions. Method: An in-depth study analyzing forum posts written by Dutch adolescents to assess cross-lingual and cross-register generalization of cognitive distortion detection. Result: Changes in language and writing style significantly affect model performance in detecting cognitive distortions. Conclusion: Domain adaptation methods show promise in detecting cognitive distortions in different languages and writing styles. Abstract: Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.[36] Exploring Machine Learning and Language Models for Multimodal Depression Detection
Javier Si Zhao Hong,Timothy Zoe Delaya,Sherwyn Chan Yin Kit,Pai Chet Ng,Xiaoxiao Miao
Main category: cs.CL
TL;DR: This paper compares XGBoost, transformers, and LLMs for multimodal depression detection, revealing insights into model strengths and effective strategies for mental health prediction.
Details
Motivation: The motivation is to detect depression using machine learning and deep learning models by leveraging multimodal data for better mental health prediction. Method: The paper explores and compares the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Result: The results show the effectiveness of different models in capturing depression-related signals from various modalities, highlighting the potential of multimodal approaches. Conclusion: The paper concludes that different models have distinct strengths and limitations in capturing depression-related signals across modalities, providing insights into effective multimodal representation strategies for mental health prediction. Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.[37] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction
Jie Zhao,Wanting Ning,Yuxiao Fei,Yubo Feng,Lishuang Li
Main category: cs.CL
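TL;DR: This paper proposes GDLLM, a global distance-aware modeling approach based on large language models for event temporal relation extraction (ETRE). It uses a graph attention network and a soft inference mechanism to improve recognition of long-distance dependencies and minority class relations, and achieves state-of-the-art performance on public datasets.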
Details
Motivation: Small language models are limited by their pre-trained knowledge and struggle with minority class relations in imbalanced classification datasets, while large language models rely on manually designed prompts or instructions that may introduce noise and interfere with judging long-distance dependencies between events. Method: The paper proposes the GDLLM framework, which comprises a distance-aware graph structure built on a graph attention network and a temporal feature learning paradigm based on soft inference, to strengthen the model's recognition of long-distance dependencies and minority class relations. Result: Experiments on the TB-Dense and MATRES datasets show that the approach achieves state-of-the-art performance. Conclusion: GDLLM effectively addresses the recognition of long-distance dependencies and minority class relations in event temporal relation extraction and substantially improves the model's overall learning ability. Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relations between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model's judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.[38] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation
Rohan Phanse,Yijie Zhou,Kejian Shi,Wencai Zhang,Yixin Liu,Yilun Zhao,Arman Cohan
Main category: cs.CL
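TL;DR: This paper proposes a framework for evaluating how well retrieval-augmented systems integrate information from multiple sources, and finds that reasoning models outperform standard large language models on multi-source synthesis.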
Details
Motivation: Real-world applications often require integrating information from multiple sources rather than relying on a single source or short answers, so retrieval-augmented systems need more in-depth evaluation. Method: The paper proposes a scalable evaluation framework for constructing benchmarks that challenge retrieval-augmented systems to integrate information across distinct sources, and builds two new benchmarks, MSRS-Story and MSRS-Meet. Result: Experiments show that generation quality depends heavily on retrieval effectiveness, which varies considerably across tasks; multi-source synthesis remains challenging even in an oracle retrieval setting, but reasoning models perform better. Conclusion: The paper concludes that retrieval-augmented systems face challenges in integrating information from multiple sources, and that reasoning models significantly outperform standard large language models at this step. Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.[39] The Uneven Impact of Post-Training Quantization in Machine Translation
Benjamin Marie,Atsushi Fujita
Main category: cs.CL
TL;DR: This paper explores the effects of quantization on multilingual machine translation, revealing that 4-bit quantization preserves quality for high-resource languages, while low-resource languages suffer degradation, particularly at 2-bit precision. GGUF quantization shows the most consistent performance.
Details
Motivation: Quantization is crucial for deploying large language models (LLMs) on hardware with limited resources, but its impact on multilingual tasks is not well understood. Method: The authors conducted a large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. They compared four quantization techniques: AWQ, BitsAndBytes, GGUF, and AutoRound. Result: 4-bit quantization often preserves translation quality for high-resource languages and larger models, but performance degrades significantly for low-resource and typologically diverse languages, especially in 2-bit settings. GGUF variants show the most consistent performance, even at 2-bit precision. Language-matched calibration improves performance primarily in low-bit scenarios. Conclusion: The paper provides actionable insights for deploying multilingual LLMs under quantization constraints, particularly in low-resource settings, highlighting the importance of model size, quantization technique, and language-matched calibration. Abstract: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. 
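For intuition on why 2-bit settings are harsher than 4-bit ones, the sketch below measures the round-trip error of plain round-to-nearest uniform quantization on a toy list of weights. This is a deliberately simple stand-in and not an implementation of AWQ, BitsAndBytes, GGUF, or AutoRound; the function names and the toy weights are hypothetical.

```python
def quantize_dequantize(values: list[float], bits: int) -> list[float]:
    """Snap each value to the nearest of 2**bits evenly spaced levels
    between min(values) and max(values), then map it back to a float."""
    steps = 2 ** bits - 1  # number of intervals between adjacent levels
    low, high = min(values), max(values)
    if high == low:
        return list(values)  # a constant tensor is represented exactly
    scale = (high - low) / steps
    return [low + round((v - low) / scale) * scale for v in values]


def mean_abs_error(values: list[float], bits: int) -> float:
    """Average absolute difference between original and restored values."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


if __name__ == "__main__":
    toy_weights = [i / 100 for i in range(-50, 51)]  # hypothetical weights
    for bits in (8, 4, 2):
        print(bits, "bits:", mean_abs_error(toy_weights, bits))
```

With only four representable levels at 2 bits versus sixteen at 4 bits, the spacing between levels grows and so does the rounding error, which is the basic pressure that the compared techniques try to manage.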
Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.[40] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Yuan Ge,Junxiang Zhang,Xiaoqian Liu,Bei Li,Xiangnan Ma,Chenglong Wang,Kaiyang Ye,Yangfan Du,Linfeng Zhang,Yuxin Huang,Tong Xiao,Zhengtao Yu,JingBo Zhu
Main category: cs.CL
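TL;DR: This paper proposes SageLM, an end-to-end, multi-aspect, and explainable speech large language model for comprehensive evaluation of speech-to-speech large language models.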
Details
Motivation: Evaluating speech-to-speech large language models is a fundamental challenge. Method: SageLM jointly assesses semantic and acoustic dimensions and uses rationale-based supervision to enhance explainability. Result: SageLM reaches an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively. Conclusion: SageLM is a model for comprehensive evaluation of speech-to-speech large language models that achieves high agreement with human evaluators. Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.[41] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench
Venkatesh Mishra,Amir Saeidi,Satyam Raj,Mutsumi Nakamura,Jayanth Srinivasa,Gaowen Liu,Ali Payani,Chitta Baral
Main category: cs.CL
TL;DR: This paper introduces IRMA, a framework that improves the reasoning and decision-making capabilities of large language models in dynamic conversational environments by reformulating user inputs with domain rules and tool suggestions, resulting in significant performance improvements over existing methods.
Details
Motivation: Large language models often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over long horizons in multi-turn conversational environments like τ-bench. This paper aims to address these limitations. Method: The authors conducted a comprehensive manual analysis of common errors in conversation trajectories, experimented with input reformulations for tool-calling agents, and proposed the Input-Reformulation Multi-Agent (IRMA) framework that automatically reformulates user queries with domain rules and tool suggestions. Result: The IRMA framework significantly outperformed ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. Conclusion: The proposed IRMA framework demonstrates superior reliability and consistency compared to existing methods like ReAct, Function Calling, and Self-Reflection in dynamic conversational environments. Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. 
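The pass^5 metric cited in the Result summary is not defined in this entry. The sketch below assumes the usual reading of pass^k for $\tau$-bench: the probability that all k independent trials of the same task succeed, estimated per task from n trials with c successes as C(c, k) / C(n, k) and then averaged over tasks. The function names and the example counts are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the observed ones
    all succeed, for a single task."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("successes must lie between 0 and the trial count")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie between 1 and the trial count")
    # comb(c, k) is 0 when c < k, so a single failure zeroes pass^n.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks given (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical counts: three tasks, five trials each.
    counts = [(5, 5), (5, 4), (5, 2)]
    for k in (1, 3, 5):
        print(f"pass^{k} =", mean_pass_hat_k(counts, k))
```

Unlike pass@k, which rewards at least one success among k attempts, this reading penalizes any inconsistency, which fits the entry's emphasis on reliability and consistency.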
The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.[42] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment
Jiaqian Li,Qisheng Hu,Jing Li,Wenya Wang
Main category: cs.CL
TL;DR: A novel two-stage exemplar selection strategy is proposed to enhance In-Context Learning for structured prediction tasks, demonstrating superior performance compared to existing approaches.
Details
Motivation: Existing ICL selection strategies overlook structural alignment in structured prediction tasks, leading to suboptimal performance and poor generalization. Method: The method involves fine-tuning a BERT-based retriever with structure-aware supervision and enhancing it with a plug-in module to amplify syntactically meaningful information. Result: Experiments on four benchmarks across three semantic parsing tasks show consistent outperformance over existing baselines using multiple recent LLMs. Conclusion: The proposed two-stage exemplar selection strategy effectively improves the performance, generalizability, and efficiency of In-Context Learning for structured prediction tasks like semantic parsing. Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. 
Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.
[43] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
Tianjian Liu,Fanqi Wan,Jiajian Guo,Xiaojun Quan
Main category: cs.CL
TL;DR: This paper proposes ProactiveEval, a framework for evaluating proactive dialogue capabilities in large language models, identifying top-performing models and analyzing the impact of reasoning capabilities.
Details
Motivation: Existing works on proactive dialogue focus on domain-specific or task-oriented scenarios, leading to fragmented evaluations and limiting comprehensive exploration of models' proactive conversation abilities. Method: The study proposes ProactiveEval, a unified framework for evaluating proactive dialogue capabilities of LLMs, decomposing proactive dialogue into target planning and dialogue guidance and enabling automatic generation of evaluation data. Result: The framework enables comprehensive evaluation across 328 environments spanning 6 domains and experiments with 22 types of LLMs. Conclusion: DeepSeek-R1 and Claude-3.7-Sonnet perform exceptionally on target planning and dialogue guidance tasks, respectively. Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.
[44] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution
Chen Chen,Yuchen Sun,Jiaxin Gao,Xueluan Gong,Qian Wang,Ziyao Wang,Yongsen Zheng,Kwok-Yan Lam
Main category: cs.CL
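TL;DR: LETHE is a novel backdoor defense that combines a model-internal mechanism with a prompt-level external mechanism to remove backdoor behaviors from large language models effectively and efficiently.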
Details
Motivation: Existing backdoor defenses lack comprehensiveness and cannot handle complex attack scenarios such as model-editing-based, multi-trigger, and triggerless attacks. Method: The internal mechanism trains a clean model on a lightweight dataset and merges it with the backdoored model; the external mechanism adds benign, semantically relevant evidence to the prompt to distract the model's attention from backdoor features. Result: Experiments on 5 widely used LLMs against 8 backdoor attacks show that LETHE reduces the attack success rate by up to 98% while maintaining model utility. Conclusion: LETHE is a new method that performs knowledge dilution through internal and external mechanisms and effectively eliminates backdoor behaviors from large language models. Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model's parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract the LLM's attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
[45] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs
Mathieu Bourdin,Anas Neumann,Thomas Paviot,Robert Pellerin,Samir Lamouri
Main category: cs.CL
TL;DR: This paper proposes EASI-RAG, a structured and agile method for deploying RAG systems in industrial SMEs, validated through a real-world case study showing rapid implementation, high usability, and reliable performance.
Details
Motivation: The motivation is to overcome the challenges SMEs face in deploying RAG systems due to limited resources and lack of NLP expertise, aiming to make RAG technology more accessible to such organizations. Method: The paper introduces EASI-RAG, a structured, agile method based on method engineering principles with defined roles, activities, and techniques. It validates the method through a real-world case study in an environmental testing laboratory. Result: The RAG system was deployed within a month by a team without prior RAG experience, demonstrated high user adoption, provided accurate answers, and improved data reliability. The system was iteratively improved using user feedback. Conclusion: EASI-RAG is a feasible and effective solution for implementing RAG systems in industrial SMEs, highlighting the potential for broader RAG deployment in similar environments. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators' queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback.
Results demonstrate that EASI-RAG supports fast implementation and high user adoption, delivers accurate answers, and enhances the reliability of the underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future work includes generalization across diverse use cases and further integration with fine-tuned models.
[46] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm
Ramazan Ali Bahrami,Ramin Yahyapour
Main category: cs.CL
TL;DR: This paper proposes a dynamic routing capsule model for sentential relation extraction, showing strong performance on several datasets but challenges with label noise and re-representation on larger datasets.
Details
Motivation: To improve sentential relation extraction performance and understand challenges such as low performance on larger datasets like Wikidata. Method: Dynamic routing in capsules is used for sentential relation extraction. Result: The proposed model outperforms state-of-the-art models on Tacred, Tacredrev, Retacred, and Conll04 datasets, while facing challenges on the Wikidata dataset due to label noise and re-representation issues. Conclusion: The proposed dynamic routing capsule model excels in sentential relation extraction but faces challenges with label noise and re-representation on larger datasets like Wikidata. Abstract: Sentential relation extraction (RE) is an important task in natural language processing (NLP). In this paper we propose to do sentential RE with dynamic routing in capsules. We first show that the proposed approach outperforms the state of the art on the common sentential relation extraction datasets Tacred, Tacredrev, Retacred, and Conll04. We then investigate potential reasons for its good performance on these datasets and its low performance on another similar, yet larger, sentential RE dataset, Wikidata. We identify noise in Wikidata labels as one of the reasons that can hinder performance. Additionally, we show that better performance is associated with better re-representation, a term from neuroscience that refers to a change of representation in the human brain to improve the match at comparison time. For example, given the analogous terms King:Queen::Man:Woman, at comparison time, and as a result of re-representation, the similarity between the related head terms (King, Man) and between the tail terms (Queen, Woman) increases. Our observations show that the proposed model performs re-representation better than the vanilla model it is compared with.
To that end, besides noise in the labels of distantly supervised RE datasets, we propose re-representation as a challenge in sentential RE.
[47] Enabling Equitable Access to Trustworthy Financial Reasoning
William Jurayj,Nils Holzenberger,Benjamin Van Durme
Main category: cs.CL
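TL;DR: The paper combines large language models with a symbolic solver to improve the accuracy and auditability of tax filing, reducing the costs caused by filing errors.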
Details
Motivation: Tax filing requires complex reasoning and precise calculation, and modern large language models fall short on accuracy and auditability, so a more effective approach is needed. Method: The approach combines large language models with a symbolic solver, translating plain-text rules into formal logic programs and using intelligently retrieved exemplars for case representations to improve performance. Result: Evaluation on the StAtutory Reasoning Assessment dataset shows that the method can significantly improve task performance and reduce deployment costs to below real-world averages. Conclusion: Neuro-symbolic architectures show promise for providing reliable tax assistance and are economically feasible. Abstract: According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how combining up-front translation of plain-text rules into formal logic programs with intelligently retrieved exemplars for formal case representations can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.
cs.CV [Back]
[48] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
Alberto Compagnoni,Davide Caffagni,Nicholas Moratelli,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara
Main category: cs.CV
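TL;DR: This paper proposes CHAIR-DPO, a new method that uses the CHAIR metric together with Direct Preference Optimization to effectively reduce hallucinations in multimodal large language models.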
Details
Motivation: Although multimodal large language models (MLLMs) perform well on many tasks, they tend to hallucinate, that is, to generate answers that are not grounded in the visual input. This is a long-standing problem that calls for an effective way to reduce hallucinations. Method: The study treats hallucination as an alignment problem, uses the CHAIR metric to separate preferred from rejected samples among generated answers, and fine-tunes off-the-shelf MLLMs via Direct Preference Optimization (DPO). Result: The method significantly reduces the number of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning MLLMs with a CHAIR-based reward. Conclusion: CHAIR-DPO effectively reduces hallucinations in multimodal large language models, and the source code and trained models are publicly available for use and further research. Abstract: Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
[49] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization
Yang Su,Shunquan Tan,Jiwu Huang
Main category: cs.CV
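TL;DR: This paper proposes a new image forensics framework built on the multimodal architecture of Stable Diffusion V3 that fuses image forgery residuals as an explicit modality, significantly improving forgery localization performance.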
Details
Motivation: Emerging image manipulation technologies such as Stable Diffusion are advancing rapidly, while existing image forgery localization methods rely on labor-intensive and costly annotated data and struggle to keep pace, so a more efficient and accurate forgery localization method is needed. Method: The framework leverages the multimodal processing capabilities of Stable Diffusion V3 (SD3), fusing image forgery residuals into the latent space as an explicit modality to enhance forgery localization performance. Result: Experiments show that the framework achieves up to 12% performance improvement on widely used benchmark datasets and performs well on real-world document forgery images and natural scene forgery images that were unseen during training. Conclusion: The paper presents an image forensics framework based on the multimodal architecture of Stable Diffusion (SD) for forgery localization; experiments show up to 12% improvement over current state-of-the-art models on benchmark datasets and strong performance on real-world document forgery and natural scene forgery images. Abstract: Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. Then, building on this foundation, we specifically leverage the multimodal framework of Stable Diffusion V3 (SD3) to enhance forgery localization performance. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals (high-frequency signals extracted using specific highpass filters) as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image.
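The notion of a forgery residual as a high-frequency signal can be made concrete with a small sketch. The abstract does not say which highpass filters are used, so the 3x3 Laplacian kernel, the edge-replicating padding, and the `highpass_residual` helper below are assumptions chosen purely for illustration.

```python
# 3x3 Laplacian kernel (an assumed stand-in for the paper's unspecified filters).
LAPLACIAN = [
    [0, -1, 0],
    [-1, 4, -1],
    [0, -1, 0],
]

def highpass_residual(image, kernel=LAPLACIAN):
    """Apply a 3x3 kernel to a grayscale image given as a list of rows.

    Borders are handled by replicating the nearest edge pixel, so a flat
    image yields an all-zero residual (the kernel coefficients sum to 0).
    """
    height, width = len(image), len(image[0])
    residual = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy = min(max(y + ky - 1, 0), height - 1)
                    xx = min(max(x + kx - 1, 0), width - 1)
                    acc += kernel[ky][kx] * image[yy][xx]
            residual[y][x] = acc
    return residual

# A flat 5x5 patch, and the same patch with one locally inconsistent pixel.
flat = [[10] * 5 for _ in range(5)]
spiked = [row[:] for row in flat]
spiked[2][2] = 20

flat_residual = highpass_residual(flat)
spiked_residual = highpass_residual(spiked)
```

Smooth content is suppressed while the local inconsistency stands out, which is the kind of signal a residual modality would carry; how it is fused into the SD3 latent space is specific to the paper and not shown here.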
Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and natural scene forgery images, even when such data were entirely unseen during training.
[50] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study
Max Torop,Masih Eskandar,Nicholas Kurtansky,Jinyang Liu,Jochen Weber,Octavia Camps,Veronica Rotemberg,Jennifer Dy,Kivanc Kose
Main category: cs.CV
TL;DR: This paper explores how fine-tuning Multimodal Large Language Models can improve the interpretability of AI predictions for skin cancer diagnosis by grounding them in clinically relevant quantitative attributes.
Details
Motivation: Improving the interpretability of AI models for diagnosing skin diseases, particularly cancer, to make them more useful and trustworthy for clinicians. Method: Fine-tuning Multimodal Large Language Models (MLLMs) to predict quantitative attribute values from images and evaluating the grounded embedding space through attribute-specific content-based image retrieval on the SLICE-3D dataset. Result: The study demonstrates that MLLMs can be fine-tuned to predict lesion-related attributes from images, enabling more interpretable predictions grounded in meaningful clinical concepts. Conclusion: MLLM embedding spaces can be effectively grounded in quantitative attributes related to lesion appearance through fine-tuning, enhancing the interpretability of predictions. Abstract: Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. 
Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.
[51] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels
Hossein Ahmadi,Banafsheh Saffari
Main category: cs.CV
TL;DR: A unified Vision Transformer framework is proposed for Automatic Modulation Recognition, integrating multiple learning objectives to achieve a label-efficient and scalable solution with strong performance on limited labeled data and varying signal-to-noise ratios.
Details
Motivation: The motivation is to overcome the limitations of existing AMR solutions that rely on large labeled datasets or multi-stage training pipelines, thereby improving scalability and generalization. Method: The method combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; it integrates supervised, self-supervised, and reconstruction objectives. Result: On the RML2018.01A dataset, the approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Conclusion: The proposed unified Vision Transformer framework provides a simple, generalizable, and label-efficient solution for AMR. Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.
[52] InfinityHuman: Towards Long-Term Audio-Driven Human
Xiaodi Li,Pan Xie,Yi Ren,Qijun Gan,Chen Zhang,Fangyuan Kong,Xiang Yin,Bingyue Peng,Zehuan Yuan
Main category: cs.CV
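TL;DR: This study proposes InfinityHuman, a coarse-to-fine framework that addresses identity drift and hand-motion problems in audio-driven human animation and generates high-quality, long-duration videos.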
Details
Motivation: Audio-driven human animation has attracted wide attention for its practical applications, but generating high-resolution, long-duration videos still suffers from identity drift, color shifts, scene instability, and poorly modeled hand movements. Method: The paper proposes InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations and then uses a pose-guided refiner to progressively refine them into high-resolution, long-duration videos. Result: Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip synchronization. Conclusion: InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip synchronization; the code will be made public. Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.
[53] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
Mert Cokelek,Halit Ozsoy,Nevrez Imamoglu,Cagri Ozcinar,Inci Ayhan,Erkut Erdem,Aykut Erdem
Main category: cs.CV
TL;DR: 本研究扩展了全向视频(ODVs)中的显著性预测领域,提出新的数据集YT360-EyeTracking和两种显著性预测模型SalViT360及SalViT360-AV,结果显示结合空间音频线索能显著提升360度场景中观众注意力的预测准确性。
Details
Motivation: Because comprehensive datasets for 360-degree audio-visual saliency prediction are lacking, the study curates a new dataset, YT360-EyeTracking, containing 81 ODVs, each observed under varying audio-visual conditions. Method: Two new saliency prediction models are proposed: SalViT360, a vision-transformer-based framework equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Result: Results on several benchmark datasets, including YT360-EyeTracking, show that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Conclusion: Integrating spatial audio cues into the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Abstract: Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.
Code and dataset will be available at https://cyberiada.github.io/SalViT360.
[54] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
Phu-Vinh Nguyen,Tan-Hanh Pham,Chris Ngo,Truong Son Hy
Main category: cs.CV
TL;DR: This paper proposes a pipeline using Vision-Language Models to explain vision models at both sample and dataset levels, aiming to enhance model understanding, identify biases, and integrate explainability into model development.
Details
Motivation: Vision models typically prioritize performance metrics over explainability, and while xAI methods exist for sample-level explanations, there's a lack of methods for understanding general model behavior across datasets. This can lead to biased judgments and missed patterns. Method: The paper proposes a pipeline that utilizes Vision-Language Models to provide explanations at both the sample level and the dataset level, offering insights into model behavior with minimal effort. Result: The proposed pipeline enables the discovery of failure cases and provides a deeper understanding of vision models, effectively bridging the gap between vision model development and xAI analysis. Conclusion: The pipeline successfully enhances the explainability of vision models, supporting more informed model development and improving the reliability and fairness of image analysis. Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. 
The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.
[55] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems
Mohamed Ohamouddou,Said Ohamouddou,Abdellatif El Afia,Rafik Lasri
Main category: cs.CV
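TL;DR: ATMS-KD is a novel knowledge distillation framework that uses adaptive temperature scheduling and mixed-sample augmentation to transfer knowledge from a large model to lightweight models, improving model performance in resource-constrained agricultural environments.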
Details
Motivation: To develop lightweight CNN models suited to resource-constrained agricultural environments and improve the performance of agricultural computer vision applications. Method: The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model to lightweight residual CNN student models. Result: All student models reach validation accuracies above 96.7% with ATMS-KD, with knowledge retention rates above 99%. The compact model reaches 97.11% accuracy, 1.60 percentage points above the second-best method, while maintaining the lowest inference latency of 72.19 ms. Conclusion: The ATMS-KD framework performs well at knowledge distillation, improving model accuracy while keeping inference latency low, which makes it suitable for resource-constrained agricultural environments. Abstract: This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7 M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3 M parameters), Standard (2.4 M parameters), and Enhanced (3.8 M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95-96% with direct training. The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model, a 1.60 percentage point improvement over the second-best approach, while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
[56] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification
Mutahar Safdar,Gentry Wood,Max Zimmermann,Guy Lamouche,Priti Wanjara,Yaoyao Fiona Zhao
Main category: cs.CV
TL;DR: This study proposes a hybrid vision-language framework for material qualification, combining deep learning with expert knowledge to enable zero-shot classification of microstructures without task-specific retraining.
Details
Motivation: The bottleneck in rapid and reliable qualification of advanced materials, particularly for heterogeneous structures from additive manufacturing, motivates the need for a more efficient and interpretable framework that bridges raw data and expert knowledge. Method: The study introduces a framework combining microstructure informatics with expert knowledge using customized hybrid vision-language representations (VLRs). It integrates deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA) to encode visual and textual data into shared representations. Zero-shot classification is achieved through a net similarity scoring approach, incorporating expert-annotated images and descriptions. Z-score normalization improves alignment and classification by adjusting similarity scores based on dataset-driven distributions. Result: The framework successfully distinguishes between acceptable and defective samples in an additively manufactured metal matrix composite dataset. Comparative analysis shows FLAVA provides higher visual sensitivity, while CLIP aligns better with textual criteria. Z-score normalization enhances classification effectiveness in the hybrid framework. Conclusion: The proposed hybrid vision-language framework enhances traceability and interpretability in material qualification pipelines, enabling human-in-the-loop decision-making without retraining models for specific tasks. Abstract: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). 
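The net similarity scoring and Z-score normalization mentioned in the method summary can be sketched as follows. The exact formula is not given in this listing, so the sketch assumes that each sample has one similarity to acceptable references and one to defective references, that each column is Z-scored across the dataset, and that the net score is their difference; the `classify` helper, the zero threshold, and the numbers are all hypothetical.

```python
import statistics

def zscore(values):
    """Standardize a list of scores using the dataset's own mean and spread."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]

def net_similarity(pos_sims, neg_sims):
    """Net score per sample: normalized similarity to acceptable references
    minus normalized similarity to defective references (assumed form)."""
    z_pos = zscore(pos_sims)
    z_neg = zscore(neg_sims)
    return [p - n for p, n in zip(z_pos, z_neg)]

def classify(pos_sims, neg_sims, threshold=0.0):
    """Zero-shot label from the sign of the net score (threshold is assumed)."""
    return [
        "acceptable" if score > threshold else "defective"
        for score in net_similarity(pos_sims, neg_sims)
    ]

# Made-up cosine similarities for four micrographs (illustrative only).
pos_sims = [0.82, 0.80, 0.61, 0.60]   # similarity to acceptable references
neg_sims = [0.58, 0.60, 0.79, 0.81]   # similarity to defective references
labels = classify(pos_sims, neg_sims)
```

Normalizing each column before subtracting keeps one similarity channel from dominating simply because its raw values sit in a narrower or higher range, which is one plausible reading of why the normalization helps.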
By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that the FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
[57] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces
Zhen Xuen Brandon Low,Rory Zhang,Hang Min,William Pham,Lucy Vivash,Jasmine Moses,Miranda Lynch,Karina Dorfman,Cassandra Marotta,Shaun Koh,Jacob Bunyamin,Ella Rowsthorn,Alex Jarema,Himashi Peiris,Zhaolin Chen,Sandy R. Shultz,David K. Wright,Dexiao Kong,Sharon L. Naismith,Terence J. O'Brien,Ying Xia,Meng Law,Benjamin Sinclair
Main category: cs.CV
TL;DR: The study explores the use of MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated segmentation of enlarged perivascular spaces (PVS) in MRI datasets. While the model performs efficiently across diverse T1w and T2w MRI datasets, it does not outperform the nnU-Net, suggesting that attention-based mechanisms providing global context may not be essential for high accuracy in PVS segmentation.
Details
Motivation: Manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. Method: MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, was adapted for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the HCP-Aging dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). Result: MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88+/-0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58+/-0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38+/-0.16 (WM) and 0.35+/-0.12 (BG), and cluster-level Dice scores of 0.61+/-0.19 (WM) and 0.62+/-0.21 (BG). Conclusion: MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. However, it did not outperform the nnU-Net, suggesting that attention-based mechanisms providing global context are not necessary for achieving high accuracy in PVS segmentation. Abstract: Enlarged perivascular spaces (PVS) are increasingly recognized as biomarkers of cerebral small vessel disease, Alzheimer's disease, stroke, and aging-related neurodegeneration. 
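For readers unfamiliar with the metric, the voxel-level Dice scores quoted in this entry measure overlap between a predicted and a reference binary mask. The sketch below shows the standard definition on flattened masks; the toy masks and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy flattened masks (1 = PVS voxel, 0 = background).
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)
print(round(score, 3))
```

Because PVS occupy very few voxels, small boundary disagreements move this ratio substantially, which is one reason cluster-level scores are often reported alongside voxel-level ones.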
However, manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. We adapted MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the Human Connectome Project-Aging (HCP-Aging) dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88+/-0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58+/-0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38+/-0.16 (WM) and 0.35+/-0.12 (BG), and cluster-level Dice scores of 0.61+/-0.19 (WM) and 0.62+/-0.21 (BG). MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. MedNeXt-L-k5 did not outperform the nnU-Net, indicating that the attention-based mechanisms present in transformer-inspired models to provide global context are not required for high accuracy in PVS segmentation.[58] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Zhixiang Chi,Yanan Wu,Li Gu,Huan Liu,Ziqiang Wang,Yang Zhang,Yang Wang,Konstantinos N. Plataniotis
Main category: cs.CV
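TL;DR: This paper proposes a training-free feedback mechanism that uses the model's output to guide intermediate attention, significantly improving CLIP's performance on segmentation tasks.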
Details
Motivation: CLIP excels at visual-textual alignment but performs poorly on open-vocabulary segmentation because of weak localization, and existing methods struggle to propagate spatial coherence consistently to the final output. Method: Key modules, namely attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, feed the output predictions back into the intermediate attention mechanism. Result: The method significantly improves multiple attention types (Q-K, self-self, Proxy) and three backbones (ViT-B, ViT-L, ViT-H) across eight benchmarks. Conclusion: The paper proposes a training-free, feedback-driven self-adaptive framework that effectively improves CLIP on open-vocabulary segmentation and integrates seamlessly into several existing methods. Abstract: CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.[59] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
Zhuoran Yu,Yong Jae Lee
Main category: cs.CV
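TL;DR: By introducing a systematic probing framework, this study reveals the functional roles of individual layers in multimodal large language models and how those roles vary across architectures.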
Details
Motivation: Multimodal large language models perform well on vision-language tasks, but their internal processing mechanisms remain unclear, which calls for a systematic analysis of how they process visual and textual inputs across layers. Method: A probing framework is introduced: linear classifiers are trained to predict fine-grained visual categories from token embeddings extracted at each layer, and the probes are evaluated under three types of controlled prompt variations. Result: A consistent stage-wise structure is identified; although the overall structure stays stable across changes in visual tokenization, instruction tuning data, and pretraining corpus, the allocation of specific layers shifts notably with the base LLM architecture. Conclusion: The layer-wise organization of MLLMs is stage-wise: early layers handle visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. The overall structure is stable, but the specific layer allocation depends strongly on the base LLM architecture. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. 
Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.[60] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)
Zhi Li,Hau Phan,Matthew Emigh,Austin J. Brockmeier
Main category: cs.CV
TL;DR: The paper proposes SLiCS, a supervised dictionary learning approach to disentangle embedding spaces in vision-language models, improving the precision of concept-filtered image retrieval and conditional generation.
Details
Motivation: The paper aims to disentangle the embedding space of vision-language co-embedding networks (like CLIP) to separate information on the content of complex scenes, enabling better performance in tasks like image retrieval and conditional generation. Method: A supervised dictionary learning approach is proposed to estimate a linear synthesis model with sparse, non-negative combinations of groups of vectors (atoms), which are optimized through alternating optimization. The method also exploits text co-embeddings for semantically meaningful descriptions and unsupervised dictionary learning with zero-shot classification. Result: The proposed SLiCS method improves the precision of concept-filtered image retrieval across different embeddings, including CLIP, TiTok, and DINOv2. It also enables conditional generation using image-to-prompt techniques. Conclusion: The disentangled embeddings provided by the proposed SLiCS method enable more precise concept-filtered image retrieval and conditional generation, and it can be applied to compressed autoencoder embeddings as well. Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated to a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. 
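To make the idea of concept-specific components concrete, the sketch below decomposes one embedding into non-negative combinations of grouped atoms, given a fixed dictionary. It is only an illustration under assumptions: it covers the coding step, not the dictionary learning or the alternating optimization; it uses plain projected gradient descent with a non-negativity constraint and no explicit sparsity penalty; and the atoms, group names, and step size are hypothetical (the step size is chosen for the orthonormal toy atoms used here).

```python
def decompose_nonnegative(x, atoms, steps=200, lr=0.5):
    """Projected gradient descent on 0.5 * ||x - sum_k a_k * atoms[k]||^2
    subject to a_k >= 0 for every coefficient."""
    coeffs = [0.0] * len(atoms)
    for _ in range(steps):
        recon = [sum(a * atom[i] for a, atom in zip(coeffs, atoms))
                 for i in range(len(x))]
        residual = [r - xi for r, xi in zip(recon, x)]
        for k, atom in enumerate(atoms):
            grad = sum(r * v for r, v in zip(residual, atom))
            # Gradient step followed by projection onto a_k >= 0.
            coeffs[k] = max(0.0, coeffs[k] - lr * grad)
    return coeffs


def concept_components(coeffs, atoms, groups):
    """Sum the weighted atoms of each group into one component per concept."""
    dim = len(atoms[0])
    return {
        name: [sum(coeffs[k] * atoms[k][i] for k in idx) for i in range(dim)]
        for name, idx in groups.items()
    }


# Hypothetical dictionary: three orthonormal atoms in two concept groups.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
groups = {"dog": [0, 1], "car": [2]}
x = [0.5, -0.2, 0.7]

coeffs = decompose_nonnegative(x, atoms)
parts = concept_components(coeffs, atoms, groups)
print(coeffs)
print(parts)
```

In this toy case the negative coordinate cannot be explained by a non-negative coefficient, so that atom stays inactive; filtering by concept then amounts to keeping only the component of the relevant group.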
Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept's group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.[61] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
Xiao Li,Yanfan Zhu,Ruining Deng,Wei-Qi Wei,Yu Wang,Shilin Zhao,Yaohong Wang,Haichun Yang,Yuankai Huo
Main category: cs.CV
TL;DR: MedFoundationHub is a secure toolkit for deploying medical VLMs, evaluated by experts for use in pathology, highlighting both its potential and current model limitations.
Details
Motivation: To address security concerns in medical VLMs, such as PHI exposure and data leakage, while making advanced models accessible for clinical and research use. Method: Development of MedFoundationHub, a GUI toolkit for deploying medical VLMs with privacy-preserving inference, evaluated through expert assessments on colon and renal pathology cases. Result: Engagement of board-certified pathologists in deploying and assessing five state-of-the-art VLMs revealed limitations in accuracy, reasoning, and terminology consistency. Conclusion: MedFoundationHub provides a secure and accessible platform for deploying medical vision-language models, but challenges remain in improving model accuracy and consistency in clinical contexts. Abstract: Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. 
MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.[62] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
Mang Cao,Sanping Zhou,Yizhe Li,Ye Deng,Wenli Huang,Le Wang
Main category: cs.CV
TL;DR: This paper proposes BIM, a method for multi-task dense prediction that effectively balances interaction completeness and computational efficiency through novel scanning mechanisms.
Details
Motivation: Existing methods face a trade-off between interaction completeness and computational efficiency in multi-task dense prediction. Method: BIM incorporates BI-Scan and MS-Scan mechanisms to improve cross-task interaction while maintaining computational efficiency. Result: BIM outperforms existing methods on NYUD-V2 and PASCAL-Context benchmarks. Conclusion: The proposed Bidirectional Interaction Mamba (BIM) demonstrates superiority over state-of-the-art methods in multi-task dense prediction tasks. Abstract: Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.[63] Audio-Guided Visual Editing with Complex Multi-Modal Prompts
Hyeonyu Kim,Seokhoon Jeong,Seonghee Han,Chanhyuk Choi,Taehwan Kim
Main category: cs.CV
TL;DR: This paper introduces an audio-guided visual editing framework that effectively handles complex editing tasks using multiple text and audio prompts without additional training, overcoming the limitations of text-only methods.
Details
Motivation: The motivation is to address the limitations of existing visual editing methods that rely solely on textual guidance, which cannot adequately describe complex scenarios, thus requiring additional non-text editing prompts like audio. Method: The method involves leveraging a pre-trained multi-modal encoder with zero-shot capabilities, integrating diverse audio into visual editing tasks, and proposing separate noise branching and adaptive patch selection to handle complex, multi-modal prompts. Result: The results show that the proposed framework excels in handling complex visual editing tasks with multiple and multi-modal prompts, outperforming text-only approaches without requiring additional training. Conclusion: The paper concludes that their proposed audio-guided visual editing framework successfully handles complex editing tasks by incorporating rich audio information, where text-only approaches fail. Abstract: Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model's prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. 
Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.[64] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Luong Tran,Thieu Vo,Anh Nguyen,Sang Dinh,Van Nguyen
Main category: cs.CV
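TL;DR: This paper proposes a new loss function, GPR Loss, and a pseudo-labeling technique, DAMP, to address noise in single positive multi-label learning, significantly improving multi-label classification performance.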
Details
Motivation: Fully annotating large-scale datasets is costly, and traditional single positive multi-label learning (SPML) methods handle missing labels in ways that yield inaccuracies and false negatives, so a more robust way of handling noisy pseudo-labels is needed. Method: The Generalized Pseudo-Label Robust Loss (GPR Loss) is proposed and combined with the Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique to build the AEVLP framework for multi-label learning. Result: Extensive experiments on four benchmark datasets show that the framework significantly improves multi-label classification performance and achieves state-of-the-art results. Conclusion: The proposed AEVLP framework effectively addresses pseudo-label noise in SPML and offers a new solution for multi-label image classification. Abstract: Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.[65] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection
Chengjun Zhang,Yuhao Zhang,Jie Yang,Mohamad Sawan
Main category: cs.CV
TL;DR: This paper introduces a delay-spike approach and a novel tdIF neuron architecture for SNNs, enhancing visual detection task performance with minimal latency and energy consumption.
Details
Motivation: Current ANN-SNN conversion methods perform poorly in visual detection tasks despite success in classification tasks with ultra-low time-steps. Method: A delay-spike approach and a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture are proposed to address residual membrane potential and enable dynamic adjustment of neuron behavior based on time-step order. Result: The tdIF method achieves more precise feature representation with fewer time-steps, enabling high performance and ultra-low latency in object and lane line detection. Conclusion: The proposed tdIF neuron method outperforms existing ANN-SNN conversion approaches in visual detection tasks, achieving state-of-the-art performance with ultra-low latency. Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with that of the traditional IF neuron. We demonstrate that our method achieves more precise feature representation with fewer time-steps, enabling high performance and ultra-low latency in visual detection tasks. 
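The residual membrane potential mentioned here can be seen in a plain Integrate-and-Fire neuron simulated over a handful of time-steps. The sketch below is a baseline IF neuron with reset by subtraction, which is a common choice in ANN-SNN conversion; it is not the tdIF neuron or the delay-spike method from the paper, and the input currents and threshold are made-up values.

```python
def simulate_if_neuron(inputs, threshold=1.0):
    """Baseline IF neuron: accumulate input, fire when the membrane
    potential reaches the threshold, then subtract the threshold."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane += current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # reset by subtraction keeps the surplus
        else:
            spikes.append(0)
    return spikes, membrane


# Four time-steps of constant input (illustrative values).
inputs = [0.6, 0.6, 0.6, 0.6]
spikes, residual = simulate_if_neuron(inputs)
print(spikes, residual)

# Charge delivered by spikes versus charge received.
transmitted = sum(spikes) * 1.0
received = sum(inputs)
print(received - transmitted)
```

With only a few time-steps, whatever charge is left in the membrane at the end is never transmitted downstream, so the spike count under-represents the input; that gap is the kind of error a low-latency conversion method has to reduce.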
In this study, we conduct extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).[66] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection
Yuqi Xiong,Wuzhen Shi,Yang Wen,Ruhan Liu
Main category: cs.CV
TL;DR: This paper proposes DUP-MCRNet for salient object detection, improving detail accuracy and edge clarity in complex scenes through dynamic uncertainty propagation and multimodal collaboration.
Details
Motivation: Existing salient object detection methods tend to lose details, blur edges, and inadequately fuse single-modal information in complex scenes. Method: The paper proposes a dynamic uncertainty graph convolution module and a multimodal collaborative fusion strategy to improve detection accuracy and information fusion. Result: Extensive experiments demonstrate that DUP-MCRNet outperforms other SOD methods on benchmark datasets, especially in edge clarity and background robustness. Conclusion: DUP-MCRNet is introduced as an effective solution for salient object detection, particularly excelling in edge clarity and robustness to complex backgrounds. Abstract: Existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficiently fusing single-modal information in complex scenes. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). First, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance; combined with channel-adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Second, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to compute a weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency between cross-modalities, thereby improving the ability to identify salient regions under occlusion, weak texture or background interference. 
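The gated fusion of RGB, depth, and edge attention maps described in this entry can be pictured as a weighted sum whose weights come from learnable gate values. The sketch below is a simplified illustration, not the paper's MCF module: it assumes the gates are normalized with a softmax, treats each attention map as a flat list, and uses fixed gate logits where a real model would learn them or predict them per scene.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention_maps(maps, gate_logits):
    """Weighted sum of per-modality attention maps (assumed softmax gating)."""
    weights = softmax(gate_logits)
    size = len(maps[0])
    fused = [sum(w * m[i] for w, m in zip(weights, maps)) for i in range(size)]
    return fused, weights


# Toy flattened attention maps for two pixels (illustrative values).
rgb_map = [0.9, 0.1]
depth_map = [0.5, 0.5]
edge_map = [0.1, 0.9]

# Equal logits give equal weights; a trained gate would differ per scene.
fused, weights = fuse_attention_maps([rgb_map, depth_map, edge_map], [0.0, 0.0, 0.0])
print(weights, fused)
```

Raising one modality's logit shifts the fused map toward that modality, which is how a gate can down-weight an unreliable depth or edge cue in a given scene.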
Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.[67] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection
Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shota Orihashi,Tomohiro Tanaka,Mana Ihori,Naoki Makishima,Naotaka Kawata
Main category: cs.CV
TL;DR: The paper proposes MSMVD for Multi-View Pedestrian Detection which improves detection performance by utilizing multi-scale image features and outperforms previous methods on the GMVD dataset.
Details
Motivation: End-to-end trainable deep learning methods often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views because they do not exploit multi-scale image features. Method: Multi-Scale Multi-View Detection (MSMVD) that generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Result: MSMVD improves the detection of pedestrians with vastly different scales between views and achieves precise detection of pedestrians with consistently small or large scales in views. Conclusion: MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset. Abstract: Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird's eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. 
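The MODA figure cited for this paper is the Multiple Object Detection Accuracy metric, which penalizes both missed detections and false positives relative to the number of ground-truth pedestrians. The sketch below shows the usual unweighted form; the counts are invented for illustration and are not results from the paper.

```python
def moda(false_negatives, false_positives, num_ground_truth):
    """MODA = 1 - (misses + false positives) / number of ground-truth objects."""
    if num_ground_truth <= 0:
        raise ValueError("MODA is undefined without ground-truth objects")
    return 1.0 - (false_negatives + false_positives) / num_ground_truth


# Hypothetical counts accumulated over a test set.
score = moda(false_negatives=5, false_positives=3, num_ground_truth=100)
print(f"MODA = {100 * score:.1f}%")
```

MODA is typically reported as a percentage, so a gain of 4.5 points corresponds to 0.045 on this 0-to-1 scale; the score can also go negative when errors outnumber ground-truth objects.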
Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset.[68] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection
Libo Lv,Tianyi Wang,Mengxiao Huang,Ruixia Liu,Yinglong Wang
Main category: cs.CV
TL;DR: This paper proposes SFMFNet, a lightweight and efficient architecture for real-time deepfake detection that balances accuracy and computational efficiency.
Details
Motivation: The motivation is to address the computational inefficiency of existing deepfake detection techniques, which limits their real-time deployment despite high accuracy. Method: The paper proposes a lightweight architecture called Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), incorporating a spatial-frequency hybrid aware module, a token-selective cross attention mechanism, and a residual-enhanced blur pooling structure. Result: Experiments demonstrate that SFMFNet achieves high accuracy while maintaining computational efficiency, showing strong generalization and practical value for real-time applications. Conclusion: SFMFNet provides a balanced solution in terms of accuracy and efficiency, making it suitable for real-time deepfake detection applications. Abstract: With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. 
Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.[69] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification
Ayaka Tsutsumi,Guang Li,Ren Togo,Takahiro Ogawa,Satoshi Kondo,Miki Haseyama
Main category: cs.CV
TL;DR: This paper proposes an efficient medical image classification method combining dual-model weight selection and self-knowledge distillation, achieving high performance with reduced computational costs.
Details
Motivation: The motivation stems from the need to overcome computational resource constraints in real-world medical settings, which limit the deployment of large-scale models. Developing lightweight models that retain performance comparable to larger models is essential. Method: The method involves initializing two lightweight models using weights from a large pretrained model (dual-model weight selection), followed by applying self-knowledge distillation and fine-tuning for target classification tasks. Result: Extensive experiments on medical image datasets (chest X-ray, lung CT scans, brain MRI) show that the proposed approach achieves superior performance and robustness compared to existing methods. Conclusion: The proposed method integrating dual-model weight selection with self-knowledge distillation outperforms existing approaches in medical image classification tasks while maintaining computational efficiency. Abstract: We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. 
By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets (chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans) demonstrate the superior performance and robustness of our approach compared to existing methods.
[70] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds
Pengpeng Yu,Haoran Li,Dingquan Li,Runqing Jiang,Jing Wang,Liang Lin,Yulan Guo
Main category: cs.CV
TL;DR: A new method for compressing LiDAR point clouds improves efficiency and achieves real-time performance with high compression ratios.
Details
Motivation: High-precision LiDAR scans cause significant storage and transmission overhead, and existing methods struggle with efficient context modeling due to geometric sparsity. Method: The method involves two modules: Geometry Re-Densification Module and Cross-scale Feature Propagation Module, which together generate compact features for efficient predictive coding. Result: The proposed framework achieves state-of-the-art compression ratios and real-time performance, with 26 FPS encoding and decoding on the KITTI dataset at 12-bit quantization. Conclusion: The proposed method achieves efficient compression of LiDAR point clouds with state-of-the-art performance and real-time processing speed. Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. 
By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.
[71] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Xiaochuan Li,Guoguang Du,Runze Zhang,Liang Jin,Qi Jia,Lihua Lu,Zhenhua Guo,Yaqian Zhao,Haiyang Liu,Tianqi Wang,Changsheng Li,Xiaoli Gong,Rengang Li,Baoyu Fan
Main category: cs.CV
TL;DR: This paper proposes Droplet3D-4M and Droplet3D to leverage video data for 3D asset generation, successfully producing spatially consistent and semantically plausible results with potential for scene-level applications.
Details
Motivation: The motivation is to overcome the data scarcity in the 3D domain by leveraging the rich semantic and spatial consistency priors found in videos, which can serve as an alternative supervisory signal. Method: The paper introduces Droplet3D-4M, a large-scale video dataset with multi-view level annotations, and Droplet3D, a generative model trained on this dataset to support both image and dense text input for 3D generation. Result: The experiments show that the proposed method effectively generates spatially consistent and semantically plausible 3D content, outperforming prevailing 3D solutions in potential for scene-level applications. Conclusion: The paper concludes that commonsense priors from videos can significantly facilitate 3D creation, and their approach has the potential for extension to scene-level applications. Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. 
Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
[72] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation
Jiusi Li,Jackson Jiang,Jinyu Miao,Miao Long,Tuopu Wen,Peijin Jia,Shengxiang Liu,Chunlei Yu,Maolin Liu,Yuzhan Cai,Kun Jiang,Mengmeng Yang,Diange Yang
Main category: cs.CV
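TL;DR: This paper proposes G^2Editor, an efficient framework for object editing in autonomous driving scenes that combines a 3D Gaussian representation with scene layout reconstruction to improve editing precision and visual quality.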
Details
Motivation: Collecting corner cases for autonomous driving systems is often costly and hazardous, and existing object editing methods are limited in visual fidelity or pose control. Method: G^2Editor uses a 3D Gaussian representation as a dense prior and combines it with a scene-level 3D bounding box layout and hierarchical fine-grained features to achieve precise pose control and spatial consistency. Result: Experiments on the Waymo Open Dataset show that G^2Editor outperforms existing methods in pose controllability and visual quality. Conclusion: G^2Editor provides a unified framework for object repositioning, insertion, and deletion in driving videos and effectively supports downstream data-driven tasks. Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.
[73] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization
Marina Grifell i Plana,Vladyslav Zalevskyi,Léa Schmidt,Yvan Gomez,Thomas Sanchez,Vincent Dunet,Mériam Koob,Vanessa Siffredi,Meritxell Bach Cuadra
Main category: cs.CV
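TL;DR: This study proposes a pathology-informed synthetic data generation strategy to address the shortage of fetal brain segmentation data for the rare condition corpus callosum dysgenesis (CCD), improving segmentation accuracy and the reliability of biomarker extraction.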
Details
Motivation: Because CCD cases are rare, there is too little annotated data to train deep learning models, which limits their generalization. Method: The authors propose a pathology-informed domain randomization strategy that simulates diverse brain alterations from healthy fetal brain data to generate synthetic data for segmentation, without requiring pathological annotations. Result: The method was validated on a cohort of 248 healthy fetuses, 26 CCD cases, and 47 cases with other brain pathologies; it substantially improved segmentation on CCD cases while maintaining performance on healthy cases and those with other pathologies. It also reduced the corpus callosum length (LCC) estimation error from 1.89 mm to 0.80 mm in healthy fetuses and from 10.9 mm to 0.7 mm in CCD cases. Conclusion: The study shows that incorporating domain-specific anatomical priors into synthetic data generation pipelines can effectively mitigate data scarcity and improve the analysis of rare but clinically significant malformations. Abstract: Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses.
Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.
[74] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
Jeong Hun Yeo,Hyeongseop Rha,Sungjune Park,Junil Won,Yong Man Ro
Main category: cs.CV
TL;DR: This paper introduces a unified framework for spoken-language text generation that integrates sign language, lip movements, and audio, achieving strong performance across multiple tasks and emphasizing the importance of modeling lip movements in sign language translation.
Details
Motivation: The motivation is to create an inclusive communication system by integrating visual and audio modalities within a unified framework to overcome the limitations of current ASR systems for individuals who are deaf or hard of hearing. Method: The authors built a unified, modality-agnostic architecture to process heterogeneous inputs, such as sign language, lip movements, and audio, and explored the synergy among these modalities. Result: The framework performs as well as or better than state-of-the-art models for SLT, VSR, ASR, and AVSR, with improved SLT performance when lip movements are explicitly modeled. Conclusion: The paper concludes that the proposed unified framework effectively handles diverse combinations of modalities and achieves performance comparable to or better than existing specialized models while emphasizing the importance of lip movements in SLT. Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. 
We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
[75] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Yuan Xie,Tianshui Chen,Zheng Ge,Lionel Ni
Main category: cs.CV
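TL;DR: This paper proposes Video-MTR, a reinforced multi-turn reasoning framework for long video understanding that uses a gated bi-level reward mechanism, supports end-to-end training, and performs strongly on multiple benchmarks.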
Details
Motivation: Existing long video understanding methods rely on static reasoning or external visual-language models (VLMs), which suffer from high complexity and sub-optimal performance. Method: The paper proposes Video-MTR, a reinforced multi-turn reasoning framework with a gated bi-level reward mechanism that optimizes video segment selection and question comprehension. Result: Video-MTR shows higher accuracy and efficiency on benchmarks such as VideoMME, MLVU, and EgoSchema, reaching state-of-the-art performance. Conclusion: With its iterative multi-turn reasoning framework and gated bi-level reward system, Video-MTR outperforms existing methods in long video understanding, supports end-to-end training, and performs strongly on multiple benchmarks. Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
[76] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
Zixuan Hu,Dongxiao Li,Xinzhu Ma,Shixiang Tang,Xiaotong Li,Wenhan Yang,Ling-Yu Duan
Main category: cs.CV
TL;DR: DUO is a novel Test-Time Adaptation framework that jointly minimizes semantic and geometric uncertainties in monocular 3D object detection, enhancing performance under real-world domain shifts.
Details
Motivation: Existing TTA methods fail to address dual uncertainties (semantic and geometric) in monocular 3D object detection, which limits their effectiveness in real-world conditions with domain shifts. Method: DUO employs a convex optimization approach with an unsupervised focal loss variant for uncertainty weighting and a semantic-aware normal field constraint to maintain geometric coherence. Result: DUO outperforms existing methods across multiple datasets and domain shift scenarios, showing superior robustness and generalization in M3OD tasks. Conclusion: The proposed DUO framework effectively reduces both semantic and geometric uncertainties in monocular 3D object detection, leading to improved robustness and performance under domain shifts. Abstract: Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. 
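For readers unfamiliar with the focal loss that the abstract builds on, the following is a minimal sketch of the standard supervised focal loss only. It does not reproduce the convex reformulation or the unsupervised variant that DUO derives, whose exact form is not given here; the function names and the default `gamma` are illustrative assumptions.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for the probability assigned to the true class."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Standard focal loss: cross-entropy scaled by (1 - p)^gamma.

    The modulating factor shrinks the loss of confident (easy) predictions,
    so uncertain (hard) examples dominate the total. gamma=2.0 is only an
    illustrative default, not a value taken from the paper.
    """
    p = min(max(p_true, 1e-12), 1.0)
    return ((1.0 - p) ** gamma) * cross_entropy(p)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, cross_entropy(p), focal_loss(p))
```

With `gamma = 0` the focal loss reduces to plain cross-entropy; larger values of `gamma` down-weight confident predictions more strongly.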
In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.
[77] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information
Seunghyeon Jung,Seoyoung Hong,Jiwoo Jeong,Seungwon Jeong,Jaerim Choi,Hoki Kim,Woojin Lee
Main category: cs.CV
TL;DR: This paper introduces CaddieSet, a novel dataset for golf swing analysis that quantitatively links swing posture to ball trajectory, offering valuable insights for improving golfers' shot precision.
Details
Motivation: The motivation stems from the lack of quantitative establishment of the relationship between swing posture and ball trajectory in existing studies, limiting insights for swing improvement in golfers. Method: The method involves creating a new dataset, CaddieSet, by extracting joint information from swing videos segmented into eight phases using computer vision. Additionally, 15 key metrics based on expert golf domain knowledge are defined to interpret swing outcomes through swing-related features. Result: The experiments demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks, particularly focusing on interpretable models, where swing feedback using joint features was quantitatively consistent with established domain knowledge. Conclusion: This paper concludes that the proposed CaddieSet dataset and defined key metrics enable the quantitative interpretation of swing outcomes and offer new insights into golf swing analysis for both academia and the sports industry. Abstract: Recent advances in deep learning have led to more studies to enhance golfers' shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. 
In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.
[78] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection
Xuanming Cao,Chengyu Tao,Yifeng Cheng,Juan Du
Main category: cs.CV
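TL;DR: This paper proposes a new surface anomaly detection method based on an ensemble of 2D and 3D models that dynamically assesses and reweights the anomaly scores from each source, achieving new state-of-the-art performance.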
Details
Motivation: Although 3D point clouds provide richer geometric cues, surface anomaly detection on them lacks the powerful pretrained foundation models available for 2D images. Method: The paper proposes the Importance-Aware Ensemble Network (IAENet) and introduces an Importance-Aware Fusion (IAF) module that dynamically assesses the importance of each source and reweights their anomaly scores. Result: Extensive experiments on MVTec 3D-AD show that IAENet markedly lowers the false positive rate, demonstrating its practical value for industrial deployment. Conclusion: By dynamically assessing and reweighting the anomaly scores from each source, IAENet successfully combines the strengths of 2D and 3D models and achieves new state-of-the-art performance in surface anomaly detection. Abstract: Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it not only to combine the collective knowledge of the source experts but also to preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.
[79] Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
En Ci,Shanyan Guan,Yanhao Ge,Yilin Zhang,Wei Li,Zhenyu Zhang,Jian Yang,Ying Tai
Main category: cs.CV
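TL;DR: This paper proposes DescriptiveEdit, an image editing framework based on descriptive prompts that generates the edited image from a reference image and a text prompt, addresses problems in existing image editing methods, and shows better editing accuracy and consistency on the Emu Edit benchmark.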
Details
Motivation: Despite progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models are mainly limited by dataset quality and scale. Method: The paper proposes DescriptiveEdit, a descriptive-prompt-based editing framework that reframes 'instruction-based image editing' as 'reference-image-based text-to-image generation', and introduces a Cross-Attentive UNet that injects reference image features into the text-to-image generation process through attention bridges. Result: DescriptiveEdit shows higher editing accuracy and consistency on the Emu Edit benchmark, overcomes the limitations of existing methods regarding instruction dataset quality, and integrates seamlessly with extensions such as ControlNet and IP-Adapter. Conclusion: Owing to its text-to-image nature, DescriptiveEdit addresses problems in existing image editing methods and offers better scalability and performance. Abstract: Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame 'instruction-based image editing' as 'reference-image-based text-to-image generation', which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
[80] DCFS: Continual Test-Time Adaptation via Dual Consistency of Feature and Sample
Wenting Yin,Han Sun,Xinru Meng,Ningzhong Liu,Huiyu Zhou
Main category: cs.CV
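TL;DR: This paper proposes DCFS, a new continual test-time adaptation framework that improves model adaptation through dual-path feature consistency and confidence-aware learning.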
Details
Motivation: Continual test-time adaptation aims to continuously adapt to a stream of target domain data without access to source domain data. Relying only on target data features can cause confusion and learning bias, while pseudo-label quality is hard to guarantee and errors accumulate. Method: The paper proposes DCFS, a new framework that uses dual classifiers to disentangle the features of the target data into semantic-related and domain-related features and learns from the consistency between the sub-features and the whole feature. In addition, an adaptive threshold and a confidence scoring mechanism drive loss-weighted self-supervised learning, reducing pseudo-label noise. Result: Extensive experiments on multiple datasets (such as CIFAR10-C, CIFAR100-C, and ImageNet-C) validate the effectiveness of DCFS in continual test-time adaptation scenarios and show consistently stable performance. Conclusion: DCFS effectively improves model performance in continual test-time adaptation by introducing dual-path feature consistency and confidence-aware sample learning, addressing pseudo-label noise and error accumulation. Abstract: Continual test-time adaptation aims to continuously adapt a pre-trained model to a stream of target domain data without accessing source data. Without access to source domain data, the model focuses solely on the feature characteristics of the target data. Relying exclusively on these features can lead to confusion and introduce learning biases. Currently, many existing methods generate pseudo-labels via model predictions. However, the quality of pseudo-labels cannot be guaranteed and the problem of error accumulation must be solved. To address these challenges, we propose DCFS, a novel CTTA framework that introduces dual-path feature consistency and confidence-aware sample learning. This framework disentangles the whole feature representation of the target data into semantic-related and domain-related features using dual classifiers to learn distinct feature representations. By maintaining consistency between the sub-features and the whole feature, the model can comprehensively capture data features from multiple perspectives. Additionally, to ensure that the whole feature information of the target domain samples is not overlooked, we set an adaptive threshold and calculate a confidence score for each sample to carry out loss-weighted self-supervised learning, effectively reducing the noise of pseudo-labels and alleviating the problem of error accumulation.
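The confidence-aware step described above (an adaptive threshold plus a per-sample confidence score that weights a self-supervised loss) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the DCFS formulation: confidence is taken as the maximum softmax probability, the threshold is an exponential moving average of batch confidence, and `weighted_pseudo_label_loss` and `momentum` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pseudo_label_loss(batch_logits, threshold, momentum=0.9):
    """Return (loss, updated_threshold) for one batch of unlabeled samples.

    Assumed scheme: each sample's pseudo-label is its argmax class, its
    confidence is the max softmax probability, samples below the threshold
    are skipped, and the rest contribute a confidence-weighted cross-entropy.
    """
    weighted_sum, weight_total, confidences = 0.0, 0.0, []
    for logits in batch_logits:
        conf = max(softmax(logits))
        confidences.append(conf)
        if conf >= threshold:
            # Cross-entropy against the argmax pseudo-label equals -log(conf).
            weighted_sum += conf * -math.log(conf)
            weight_total += conf
    loss = weighted_sum / weight_total if weight_total > 0 else 0.0
    # Adaptive threshold: move toward the mean confidence of this batch.
    batch_mean = sum(confidences) / len(confidences)
    new_threshold = momentum * threshold + (1.0 - momentum) * batch_mean
    return loss, new_threshold


if __name__ == "__main__":
    batch = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
    print(weighted_pseudo_label_loss(batch, threshold=0.5))
```

In this sketch the second sample is nearly uniform over classes, so it falls below the threshold and contributes nothing to the loss, which is the intended effect of filtering noisy pseudo-labels.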
The efficacy of our proposed method is validated through extensive experimentation across various datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, demonstrating consistent performance in continual test-time adaptation scenarios.
[81] Adam SLAM - the last mile of camera calibration with 3DGS
Matthieu Gendrin,Stéphane Pateux,Xiaoran Jiang,Théo Ladune,Luce Morin
Main category: cs.CV
TL;DR: This paper proposes fine-tuning camera calibration with a 3DGS model to improve the quality of novel view synthesis.
Details
Motivation: Camera calibration quality is crucial for evaluating novel view synthesis, because a 1-pixel calibration error significantly affects reconstruction quality. Method: Use a 3DGS model to adjust the camera parameters by backpropagating the novel view color loss. Result: The new calibration brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. Conclusion: The paper proposes using a 3DGS model to fine-tune camera parameters by backpropagating the novel view color loss, thereby improving camera calibration quality. Abstract: The quality of the camera calibration is of major importance for evaluating progress in novel view synthesis, as a 1-pixel error on the calibration has a significant impact on the reconstruction quality. While there is no ground truth for real scenes, the quality of the calibration is assessed by the quality of the novel view synthesis. This paper proposes to use a 3DGS model to fine-tune calibration by backpropagation of novel view color loss with respect to the camera parameters. The new calibration alone brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. The fine-tuning may be long and its suitability depends on how critical training time is, but for calibration of reference scenes, such as Mip-NeRF 360, novel view quality matters most.
[82] Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Jingyun Yang,Guoqing Zhang,Jingge Wang,Yang Li
Main category: cs.CV
TL;DR: This paper proposes an active and sequential domain adaptation framework for efficient sample selection in multi-modal medical image segmentation, significantly improving performance over existing methods with fewer labeled samples.
Details
Motivation: Accurate tumor segmentation in multi-modal medical images is essential for radiotherapy planning. However, obtaining labeled data is costly and time-consuming. Existing ADA methods risk negative transfer and lack exploration in multi-modal query strategies. Method: The authors propose a dynamic multi-modal sample selection framework in active domain adaptation (ADA), prioritizing labeling and training based on informativeness and representativeness of samples without requiring extensive access to source domain data. Result: Empirical validation shows that the proposed method achieves superior segmentation performance with fewer labeled samples, significantly outperforming existing ADA techniques in diverse gross tumor volume segmentation tasks. Conclusion: The proposed active and sequential domain adaptation framework outperforms state-of-the-art ADA methods in gross tumor volume segmentation tasks, offering a promising solution for efficient labeling and training on multi-modal medical data. Abstract: Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. 
In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: https://github.com/Hiyoochan/mmActS.
[83] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection
Mingqian Ji,Jian Yang,Shanshan Zhang
Main category: cs.CV
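TL;DR: This paper proposes a new unsupervised 3D object detection method that improves detection performance by integrating RGB images and LiDAR data at an early stage.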
Details
Motivation: Existing LiDAR-based 3D object detection methods rely on manually annotated labels, and high-quality 3D labels are time-consuming and labor-intensive to obtain, so unsupervised 3D object detection needs to be explored. Method: The method uses vision foundation models for instance segmentation and depth estimation, and introduces a bi-directional fusion method, a local and global filtering method, and a data-level fusion based dynamic self-evolution strategy. Result: Experiments on the nuScenes dataset show that the method significantly outperforms previous state-of-the-art methods, reaching 28.4% mAP on the nuScenes validation benchmark. Conclusion: The paper proposes a new data-level fusion framework that integrates RGB images and LiDAR data to improve 3D object detection, significantly outperforming previous methods on the nuScenes dataset. Abstract: Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Yet, such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy.
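The local and global filtering idea mentioned above can be sketched with two generic point cloud filters. This is one plausible reading, not the paper's procedure: the local step is assumed to be a radius outlier filter (keep points with enough neighbors nearby) and the global step a statistical filter on distance to the centroid. The function names and the parameters `radius`, `min_neighbors`, and `num_std` are hypothetical.

```python
import math
from statistics import mean, pstdev


def radius_filter(points, radius, min_neighbors):
    """Local step: keep points with at least `min_neighbors` other points
    within `radius`. Brute force O(n^2), fine for a small illustration."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, num_std=1.0):
    """Global step: drop points whose distance to the centroid exceeds
    mean + num_std * (population standard deviation) of all distances."""
    centroid = tuple(mean(axis) for axis in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    limit = mean(dists) + num_std * pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= limit]


if __name__ == "__main__":
    cloud = [
        (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0),
        (5.0, 5.0, 5.0),  # isolated point, e.g. from a depth error
    ]
    print(radius_filter(cloud, radius=0.5, min_neighbors=2))
    print(statistical_filter(cloud, num_std=1.0))
```

A real pipeline would use a spatial index (for example a k-d tree) instead of the quadratic neighbor search, and would apply the global filter per segmented instance rather than to the whole cloud.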
Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
[84] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset
Frederik Rajiv Manichand,Robin Deuber,Robert Jakob,Steve Swerling,Jamie Rosen,Elgar Fleisch,Patrick Langer
Main category: cs.CV
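TL;DR: We develop a deep learning-based BMI estimation method and build WayBED, a large dataset of 84,963 smartphone images, using an automatic filtering method to improve data quality. The model is validated on multiple datasets, and the full pipeline is deployed on Android devices.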
Details
Motivation: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Method: We use a deep learning model and introduce an automatic filtering method that curates the dataset through posture clustering and person detection, removing low-quality images. We also test the model on different datasets and deploy the full pipeline on Android devices using the CLAID framework. Result: Using full-body images, we achieve a MAPE of 7.9% on the hold-out test set of the WayBED dataset, the lowest value in the published literature. On the completely unseen VisualBodyToBMI dataset we achieve a MAPE of 13%, and after fine-tuning a MAPE of 8.56%, the lowest reported value on that dataset. Conclusion: We develop a deep learning-based BMI estimation method and create a large proprietary dataset named WayBED. Our method achieves the lowest MAPE on the test datasets, is successfully deployed on Android devices, and its code is released as an open-source contribution. Abstract: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework.
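For readers unfamiliar with the metric, the Mean Absolute Percentage Error reported above can be computed as in the following minimal sketch. The function name and the toy BMI values are illustrative assumptions, not taken from the paper or its released code.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumes every reference value is non-zero, which holds for BMI.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Toy example with made-up values: one prediction is off by 10%, the other is exact.
example = mape([20.0, 25.0], [22.0, 25.0])
```

Because each error is divided by the reference value, the metric weights a 2-point BMI error more heavily for a person with a low BMI than for one with a high BMI.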
We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
[85] Domain Adaptation Techniques for Natural and Medical Image Classification
Ahmad Chaddad,Yihang Wu,Reem Kateb,Christian Desrosiers
Main category: cs.CV
TL;DR: The study evaluates seven domain adaptation techniques across multiple datasets, showing that DSAN achieves high accuracy and explainability in both natural and medical image classification tasks.
Details
Motivation: The motivation is to understand the benefits of domain adaptation techniques for both natural and medical images, particularly in scenarios like out-of-distribution, dynamic data streams, and limited training samples. Method: The study conducted 557 simulation experiments using seven popular domain adaptation techniques across five natural and eight medical datasets. Result: The Deep Subdomain Adaptation Network (DSAN) algorithm showed excellent performance, achieving 91.2% accuracy on the COVID-19 dataset and a 6.7% improvement in dynamic data stream scenarios. It also demonstrated high explainability on the COVID-19 and skin cancer datasets. Conclusion: The paper concludes that DSAN is an effective domain adaptation technique for both natural and medical image classification tasks, showing high accuracy and explainability. Abstract: Domain adaptation (DA) techniques have the potential in machine learning to alleviate distribution differences between training and test sets by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples. Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm.
This algorithm achieved feasible classification accuracy (91.2%) in the COVID-19 dataset using ResNet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits a remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.
[86] Contrastive Learning through Auxiliary Branch for Video Object Detection
Lucas Rakotoarivony
Main category: cs.CV
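TL;DR: The paper introduces CLAB, a new video object detection method that improves detection performance through contrastive learning and a dynamic loss weighting strategy without adding computational load at inference.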
Details
Motivation: Video object detection is difficult because videos often suffer from image degradation such as motion blur, occlusion, and deformable shapes. Existing methods improve detection performance but increase computational demands. Method: The paper uses a contrastive auxiliary branch and a dynamic loss weighting strategy to strengthen the feature representation capability of the video object detector. Result: Comprehensive experiments and ablation studies demonstrate the effectiveness of CLAB. On the ImageNet VID dataset it reaches 84.0% and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively. Conclusion: The paper proposes CLAB, a method that improves the robustness of video object detection through contrastive learning without adding computational load at inference. Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector's backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
[87] Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Lorenz Hufe,Constantin Venhoff,Maximilian Dreyer,Sebastian Lapuschkin,Wojciech Samek
Main category: cs.CV
TL;DR: This paper introduces a training-free method to defend CLIP models against typographic attacks by ablating specific attention heads, resulting in more robust models suitable for safety-critical applications.
Details
Motivation: The motivation stems from the vulnerability of multi-modal systems to typographic attacks, which can lead to misclassifications and security breaches, necessitating robust defenses without compromising performance. Method: The researchers analyzed the behavior of CLIP vision encoders under typographic attacks, identifying attention heads responsible for typographic information processing, and then selectively ablated these heads to create a defense mechanism. Result: The introduced method improved performance by up to 19.6% on a typographic variant of ImageNet-100, with less than a 1% reduction in standard ImageNet-100 accuracy, showing competitiveness with state-of-the-art defenses relying on finetuning. Conclusion: The study concludes that typographic attacks can be effectively mitigated in CLIP models by selectively ablating specific attention heads, resulting in dyslexic CLIP models that are robust against such attacks. Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. 
To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
[88] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition
Debasis Maji,Debaditya Barman
Main category: cs.CV
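TL;DR: This paper proposes GLaRE, a new method that improves the performance and interpretability of facial expression recognition through graph neural networks and facial landmark modeling.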
Details
Motivation: Traditional facial expression recognition (FER) systems are hindered by occlusion, expression variability, and lack of interpretability. Method: Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to reduce complexity while preserving spatial structure. Result: The method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Conclusion: GLaRE is a graph-based landmark region embedding network for emotion recognition that improves prediction performance while retaining structural interpretability. Abstract: Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human-computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies have demonstrated that region-level embeddings from quotient graphs have contributed to improved prediction performance.
[89] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models
Zheng Chong,Yanwei Lei,Shiyue Zhang,Zhuandi He,Zhen Wang,Xujie Zhang,Xiao Dong,Yiling Wu,Dongmei Jiang,Xiaodan Liang
Main category: cs.CV
TL;DR: FastFit is a highly efficient multi-reference virtual try-on framework that improves speed by 3.5x and outperforms state-of-the-art methods in fidelity.
Details
Motivation: Virtual try-on technology faces two major challenges: the inability to support multi-reference outfit compositions and inefficiency due to redundant re-computation of reference features in each denoising step. Method: FastFit employs a Semi-Attention mechanism and replaces traditional timestep embeddings with class embeddings for reference items, decoupling reference feature encoding from the denoising process. This enables one-time computation and lossless reuse of reference features across all steps. Result: FastFit achieves an average 3.5x speedup over comparable methods and demonstrates superior performance on key fidelity metrics across multiple datasets, including VITON-HD, DressCode, and DressCode-MR. Conclusion: FastFit is a high-speed multi-reference virtual try-on framework that achieves superior performance on key fidelity metrics while significantly improving inference efficiency. Abstract: Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. 
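The efficiency argument above (compute reference features once, reuse them at every denoising step) can be shown with a toy sketch. Everything here is a hypothetical stand-in: `ReferenceEncoder`, `denoise_step`, and the arithmetic inside them are invented for illustration and are not FastFit's architecture or its Semi-Attention mechanism. The sketch only assumes, as the abstract states, that the reference features do not depend on the timestep.

```python
class ReferenceEncoder:
    """Toy stand-in for an expensive reference-feature encoder (hypothetical)."""

    def __init__(self):
        self.calls = 0  # counts how often the expensive encoding runs

    def encode(self, reference):
        self.calls += 1
        # Timestep-independent features: the same input always gives the same output.
        return [float(len(reference)), float(sum(map(ord, reference)) % 97)]


def denoise_step(latent, ref_features):
    """Toy update that mixes the latent with the reference features."""
    return [0.9 * x + 0.01 * f for x, f in zip(latent, ref_features)]


def run_recompute(encoder, reference, latent, num_steps):
    """Baseline: re-encode the reference inside every denoising step."""
    for _ in range(num_steps):
        latent = denoise_step(latent, encoder.encode(reference))
    return latent


def run_cached(encoder, reference, latent, num_steps):
    """Cached variant: encode once, then reuse the features at every step."""
    ref_features = encoder.encode(reference)
    for _ in range(num_steps):
        latent = denoise_step(latent, ref_features)
    return latent
```

Under that timestep-independence assumption the two loops give identical outputs, while the cached variant calls the encoder once instead of once per step; if the encoder also consumed the timestep, the cache would no longer be lossless.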
Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering a significant advantage in inference efficiency.
[90] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching
Yuqi Han,Songqian Zhang,Weijian Su,Ke Li,Jiayu Yang,Jinli Suo,Qiang Zhang
Main category: cs.CV
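TL;DR: The paper proposes UTA-Sign, an unsupervised thermal-event video augmentation method for traffic signage recognition in low-illumination environments. It combines the strengths of thermal and event cameras, fusing thermal frames and event signals through a dual-boosting mechanism to improve an autonomous driving system's understanding of its environment.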
Details
Motivation: Thermal cameras struggle to capture signage on objects made of similar materials, while event cameras perform well in low-light traffic environments but suffer from non-uniform sampling. Combining the strengths of the two modalities is therefore needed to improve traffic signage recognition accuracy. Method: The paper uses a dual-boosting mechanism in which thermal frames provide accurate motion cues as temporal references for aligning the uneven event signals, while event signals supplement the subtle signage content in the thermal frames. Result: The method is validated on datasets from real-world scenarios, and the results show strong traffic signage sketching quality and detection accuracy at the perceptual level. Conclusion: The paper proposes UTA-Sign, a new unsupervised thermal-event video augmentation method that fuses thermal frames and event signals to improve traffic signage recognition in low-illumination environments, thereby improving an autonomous driving system's understanding of its environment. Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.
[91] Disruptive Attacks on Face Swapping via Low-Frequency Perceptual Perturbations
Mengxiao Huang,Minglei Shu,Shuwang Zhou,Zhaoyang Liu
Main category: cs.CV
TL;DR: This paper proposes an active defense strategy using low-frequency perturbations to disrupt deepfake generation, achieving strong defense while maintaining visual plausibility.
Details
Motivation: Deepfake technology poses serious threats to privacy and societal security. Existing methods focus on passive detection, while an active defense approach is needed to disrupt deepfake generation before it occurs. Method: An active defense mechanism using low-frequency perceptual perturbations was developed. This method leverages discrete wavelet transform (DWT) to extract low-frequency components and integrates an encoder, perturbation generator, and decoder to disrupt facial manipulation models. Result: Experiments on CelebA-HQ and LFW datasets showed a significant reduction in face-swapping effectiveness, improved defense success rates, and preservation of visual quality. Conclusion: The proposed active defense method effectively disrupts deepfake technology by targeting the generative process through low-frequency perceptual perturbations, achieving a balance between defense strength and visual quality. Abstract: Deepfake technology, driven by Generative Adversarial Networks (GANs), poses significant risks to privacy and societal security. Existing detection methods are predominantly passive, focusing on post-event analysis without preventing attacks. To address this, we propose an active defense method based on low-frequency perceptual perturbations to disrupt face swapping manipulation, reducing the performance and naturalness of generated content. Unlike prior approaches that used low-frequency perturbations to impact classification accuracy, our method directly targets the generative process of deepfake techniques. We combine frequency and spatial domain features to strengthen defenses. By introducing artifacts through low-frequency perturbations while preserving high-frequency details, we ensure the output remains visually plausible.
Additionally, we design a complete architecture featuring an encoder, a perturbation generator, and a decoder, leveraging discrete wavelet transform (DWT) to extract low-frequency components and generate perturbations that disrupt facial manipulation models. Experiments on CelebA-HQ and LFW demonstrate significant reductions in face-swapping effectiveness, improved defense success rates, and preservation of visual quality.
[92] Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion
Zheng Qin,Yabing Wang,Minghui Yang,Sanping Zhou,Ming Yang,Le Wang
Main category: cs.CV
TL;DR: This paper proposes Diverse-T2M, a method that introduces uncertainty to generate diverse 3D human motions while preserving semantic consistency with the text.
Details
Motivation: Although recent advances make it possible to generate precise, high-quality human motions from text, the diversity of generated motions remains a significant challenge, which this paper aims to address. Method: The paper uses noise signals as carriers of diversity information and explicitly models uncertainty in transformer-based methods. It also constructs a latent space in which text is projected into a continuous representation, and introduces stochastic sampling through an integrated latent space sampler. Result: Experiments show that the method significantly improves generation diversity on the text-to-motion benchmark datasets (HumanML3D and KIT-ML) while maintaining state-of-the-art text consistency. Conclusion: The paper proposes Diverse-T2M, a new method that introduces uncertainty to generate diverse 3D human motions while preserving semantic consistency with the text. Abstract: Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
[93] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction
Karl-Philippe Beaudet,Sidaty El Hadramy,Philippe C Cattin,Juan Verde,Stéphane Cotin
Main category: cs.CV
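TL;DR: The paper proposes an optimization-based calibration method that uses a 3D-printed phantom to achieve accurate 3D intravascular ultrasound volume reconstruction, improving intraoperative navigation in liver surgery.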
Details
Motivation: Intraoperative ultrasound images are difficult to interpret in liver surgery because of the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. Method: An optimization-based calibration method is proposed and validated experimentally using a 3D-printed phantom. Result: The method was validated on in vivo swine liver images, with a calibration error of 0.88 to 1.80 mm and a registration error of 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Conclusion: The method provides a reliable and accurate means of calibration and volume reconstruction; it can be used to register intraoperative ultrasound images with preoperative CT images in liver surgery and to enhance intraoperative navigation. Abstract: Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.
[94] Physics Informed Generative Models for Magnetic Field Images
Aye Phyu Phyu Aung,Lucas Lum,Zhansen Shi,Wen Qiu,Bernice Zee,JM Chin,Yeow Kheng Lim,J. Senthilnath
Main category: cs.CV
TL;DR: The paper proposes a machine learning-driven approach to generate synthetic Magnetic Field Images for semiconductor defect detection, which could optimize the defect localization process.
Details
Motivation: The motivation behind the research is to address the challenge of limited availability of Magnetic Field Imaging (MFI) datasets due to proprietary concerns, which hinders the training of machine learning models for defect detection in semiconductor manufacturing. Method: The researchers used a machine learning-driven approach leveraging diffusion models with two physical constraints to generate synthetic MFI samples. They integrated specific physical information to generate MFI images for common defect types, specifically power shorts. Result: The result of the research is the successful generation of synthetic MFI images for power shorts, which can be used as training data for ML algorithms. The proposed method was evaluated using qualitative and quantitative metrics, showing promising results in optimizing the defect localization process. Conclusion: The study concludes that the proposed Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) can generate synthetic MFI samples that are useful for training ML algorithms to efficiently localize defect areas in semiconductor manufacturing. Abstract: In semiconductor manufacturing, defect detection and localization are critical to ensuring product quality and yield. While X-ray imaging is a reliable non-destructive testing method, it is memory-intensive and time-consuming for large-scale scanning. Magnetic Field Imaging (MFI) offers a more efficient means to localize regions of interest (ROI) for targeted X-ray scanning. However, the limited availability of MFI datasets due to proprietary concerns presents a significant bottleneck for training machine learning (ML) models using MFI. To address this challenge, we consider an ML-driven approach leveraging diffusion models with two physical constraints. We propose Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples by integrating specific physical information.
We generate MFI images for the most common defect types: power shorts. These synthetic images will serve as training data for ML algorithms designed to localize defect areas efficiently. To evaluate generated MFIs, we compare our model to SOTA generative models from both variational autoencoder (VAE) and diffusion methods. We present a domain expert evaluation to assess the generated samples. In addition, we present qualitative and quantitative evaluation using various metrics used for image generation and signal processing, showing promising results to optimize the defect localization process.
[95] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization
Yixiang Qiu,Yanhan Liu,Hongyao Yu,Hao Fang,Bin Chen,Shu-Tao Xia,Ke Xu
Main category: cs.CV
TL;DR: This paper introduces a GAN-based framework with PFO to enhance the effectiveness of data reconstruction attacks in split inference systems, achieving superior image reconstruction quality compared to existing methods.
Details
Motivation: The motivation is to address the limitations of existing Data Reconstruction Attacks (DRAs), which are typically effective only on shallow models and fail to fully leverage semantic priors, thus limiting their reconstruction quality and generalizability. Method: The method involves a GAN-based framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and uses an L1-ball constraint to stabilize optimization and improve image realism. Result: The experiments show that the proposed method significantly outperforms previous attacks, particularly in high-resolution, out-of-distribution scenarios, and against deeper and more complex neural networks. Conclusion: The proposed GAN-based DRA framework with PFO demonstrates superior performance in reconstructing high-quality images, highlighting its potential to expose privacy vulnerabilities in split inference systems. Abstract: The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. 
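One common way to enforce an L1-ball constraint during iterative optimization is to project the optimized vector back onto the ball after each update. The sketch below uses the standard sort-based Euclidean projection; whether the paper enforces its constraint this way is an assumption, and the function name and parameters are illustrative only.

```python
import math


def project_l1_ball(v, radius=1.0):
    """Euclidean projection of vector `v` onto the L1 ball of the given radius.

    Sort-based method: find the soft-threshold `theta` such that the
    shrunken vector has L1 norm exactly `radius`.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already inside the ball, nothing to do

    magnitudes = sorted((abs(x) for x in v), reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u in enumerate(magnitudes, start=1):
        cumulative += u
        candidate = (cumulative - radius) / k
        if u - candidate > 0:
            theta = candidate  # the k largest entries stay non-zero
        else:
            break
    # Soft-threshold every entry, keeping its original sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]
```

In a reconstruction loop, such a projection would typically be applied after each gradient step on the optimized variable, keeping it within a bounded region so that the optimization does not drift toward unrealistic outputs.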
Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
[96] EmoCAST: Emotional Talking Portrait via Emotive Text Description
Yiguo Jiang,Xiaodong Cun,Yong Zhang,Yudian Zheng,Fan Tang,Chi-Man Pun
Main category: cs.CV
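TL;DR: EmoCAST is a diffusion-based framework for emotional talking head synthesis that addresses the limitations of existing methods through new modules and a new dataset, achieving high-quality video generation.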
Details
Motivation: Existing emotional talking head synthesis methods remain limited in control flexibility, motion naturalness, and expression quality, and currently available datasets are mainly collected in lab settings, which further exacerbates these shortcomings and limits real-world application. Method: The paper proposes the EmoCAST framework, which includes a text-guided decoupled emotive module and an emotive audio attention module; it also constructs a talking head dataset with comprehensive emotive text descriptions and proposes an emotion-aware sampling training strategy and a progressive functional training strategy. Result: EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking head videos, and a new emotional talking head dataset is constructed. Conclusion: EmoCAST is a diffusion-based framework that achieves precise text-driven emotional synthesis through two key modules and reaches state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking head videos. Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework's performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model's ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
Project Page: https://github.com/GVCLab/EmoCAST
[97] Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification
Smriti Joshi,Lidia Garrucho,Richard Osuala,Oliver Diaz,Karim Lekadir
Main category: cs.CV
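TL;DR: This paper proposes a SwinUNETR-based deep learning framework for breast cancer detection and classification in breast MRI; by combining breast region masking, extensive data augmentation, and ensemble learning, it placed second in the ODELIA challenge.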
Details
Motivation: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. MRI is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue. However, traditional methods have limitations, so AI-based solutions are needed to improve diagnostic accuracy and reliability. Method: The paper develops a SwinUNETR-based deep learning framework that combines breast region masking, extensive data augmentation, and an ensemble learning strategy to improve robustness and generalizability. Experiments use the multi-center dataset of 511 studies provided by the ODELIA consortium, in which each study is labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. Result: The method achieved second place on the challenge leaderboard, indicating its potential to support clinical breast MRI interpretation. Conclusion: The SwinUNETR-based deep learning framework performs well on breast cancer detection and classification and is expected to provide effective technical support for automated breast MRI diagnosis. Abstract: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.
[98] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images
Shiqi Xin,Xiaolin Zhang,Yanbin Liu,Peng Zhang,Caifeng Shan
Main category: cs.CV
TL;DR: AvatarBack is a novel framework for improving back-head reconstruction in 3D Gaussian avatars, using a generative prior and spatial alignment strategy.
Details
Motivation: Existing Gaussian Splatting methods for 3D head avatars primarily rely on frontal-view images, resulting in poor reconstruction of the back-head regions and limiting the overall realism of the avatars. Method: AvatarBack introduces a Subject-specific Generator (SSG) to synthesize back-view pseudo-images and an Adaptive Spatial Alignment Strategy (ASA) to align synthetic views with the 3D Gaussian representation. Result: AvatarBack significantly enhances the quality of back-head reconstruction, maintains visual realism, and ensures avatars remain fully animatable under diverse motions. Conclusion: AvatarBack improves the reconstruction quality of the back-head regions while maintaining the realism and fidelity of the front regions in 3D Gaussian avatars. Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. 
To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.[99] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation
Francois Poh,Anjith George,Sébastien Marcel
Main category: cs.CV
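TL;DR: This paper investigates using foundation models to improve facial recognition in artworks; fine-tuning and ensembling markedly improve recognition and address the limitations of traditional methods on paintings.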
Details
Motivation: Traditional facial recognition models perform well on photographs but poorly on paintings, mainly because of domain shift and high intra-class variation, with artistic factors further complicating recognition. Method: The study uses foundation models to improve facial recognition in artworks by fine-tuning them and integrating their embeddings with those of conventional facial recognition networks. Result: The approach shows notable improvements over current state-of-the-art methods, indicating that foundation models can compensate for the shortcomings of traditional methods. Conclusion: By fine-tuning foundation models and integrating their embeddings with conventional facial recognition networks, the work shows notable improvements over current state-of-the-art methods, indicating that foundation models can bridge the gap where traditional methods are ineffective. Abstract: Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at https://www.idiap.ch/paper/artface/[100] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models
Ayan Banerjee,Fernando Vilariño,Josep Lladós
Main category: cs.CV
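TL;DR: CraftGraffiti is an end-to-end text-guided graffiti generation framework that applies style transfer first and identity enhancement second, keeping facial identity recognizable while preserving artistic freedom.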
Details
Motivation: Preserving facial identity under extreme stylistic transformation remains challenging in generative art, especially in graffiti, a high-contrast, abstract medium where small distortions can destroy recognizability. Method: CraftGraffiti adopts a "style-first, identity-after" paradigm: it performs style transfer with a LoRA-fine-tuned pretrained diffusion model and maintains facial consistency through a self-attention mechanism augmented with explicit identity embeddings. Result: CraftGraffiti performs strongly on facial feature consistency while reaching state-of-the-art aesthetic and user preference scores, and a live deployment at the Cruilla Festival confirmed its creative impact. Conclusion: CraftGraffiti succeeds in preserving facial identity under extreme stylistic transformation, advancing identity-respectful AI-assisted art creation. Abstract: Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject's recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the "style-first, identity-after" paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system's real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.[101] Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang,Chenhang Cui,Zihao Zhao,Yiyang Zhou,Weilong Yan,Ying Wei,Huaxiu Yao
Main category: cs.CV
TL;DR: A new self-evaluation method improves the alignment of Large Visual-Language Models by reducing hallucinations and enhancing safety without relying on external data.
Details
Motivation: The motivation is to overcome the limitations of existing alignment methods that rely on external datasets, human annotations, or complex post-processing, which hinder scalability and increase costs. Method: The method involves generating a debiased self-judgment score internally within the model to improve decoding strategies and preference tuning processes. Result: The empirical results demonstrate that the proposed approach significantly outperforms traditional alignment methods in terms of effectiveness and efficiency. Conclusion: The proposed approach effectively improves the alignment of LVLMs by generating a debiased self-judgment score, leading to reduced hallucinations, enhanced safety, and better overall performance without reliance on external resources. Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. 
Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.[102] "Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
Anastasios Skoularikis,Stefanos-Iordanis Papadopoulos,Symeon Papadopoulos,Panagiotis C. Petrantonakis
Main category: cs.CV
TL;DR: This paper introduces S-HArM, a dataset for intent-aware classification of AI-generated images, and explores prompting strategies for synthetic data generation. While some models show improved generalization, performance remains limited, highlighting the need for specialized architectures.
Details
Motivation: Recent advances in multimodal AI have improved detection of synthetic and out-of-context content, but the intent behind AI-generated images is often overlooked. This work aims to address this gap by focusing on intent-aware classification. Method: The paper introduces S-HArM, a multimodal dataset for intent-aware classification, and explores three prompting strategies (image-guided, description-guided, and multimodally-guided) to generate a synthetic training dataset using Stable Diffusion. The study includes comparisons of modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Result: Models trained on image- and multimodally-guided data generalize better to 'in the wild' content due to preserved visual context, but overall performance remains limited, indicating the complexity of intent inference. Conclusion: The study concludes that while models trained on image- and multimodally-guided data show better generalization to real-world content due to preserved visual context, overall performance in inferring intent remains limited, emphasizing the need for more specialized architectures. Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. 
Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.[103] MobileCLIP2: Improving Multi-Modal Reinforced Training
Fartash Faghri,Pavan Kumar Anasosalu Vasu,Cem Koc,Vaishaal Shankar,Alexander Toshev,Oncel Tuzel,Hadi Pouransari
Main category: cs.CV
TL;DR: MobileCLIP2 improves zero-shot image classification accuracy at low latency by enhancing teacher models and training data, achieving state-of-the-art performance on ImageNet-1k.
Details
Motivation: The motivation is to improve the multi-modal reinforced training of MobileCLIP by enhancing the quality of teacher models and training data, aiming to achieve better zero-shot accuracy at low latency compared to existing models. Method: The paper introduces MobileCLIP2, which improves upon MobileCLIP through enhanced CLIP teacher ensembles trained on the DFN dataset, improved captioner teachers fine-tuned on high-quality image-caption datasets, and ablation studies to analyze factors like temperature tuning and caption diversity. The data generation code is also made publicly available for scalable distributed processing. Result: MobileCLIP2 achieves state-of-the-art zero-shot accuracy on ImageNet-1k with a 2.2% improvement over MobileCLIP-B. MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 while being twice as small and outperforms DFN ViT-L/14 at 2.5× lower latency. Conclusion: The paper concludes that through improved multi-modal reinforced training, including better CLIP teacher ensembles and enhanced captioner teachers, MobileCLIP2 achieves state-of-the-art performance in zero-shot ImageNet-1k accuracy while maintaining low latency. The authors also highlight the efficiency and scalability of their approach. Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. 
In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.[104] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering
Paritosh Parmar,Eric Peh,Basura Fernando
Main category: cs.CV
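TL;DR: This paper proposes a new modular framework that explicitly decouples causal reasoning from answer generation, improving interpretability and reasoning in video question answering.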
Details
Motivation: Existing causal video question answering models typically rely on opaque, monolithic architectures that struggle with higher-order reasoning and offer poor interpretability. The paper aims to improve transparency and reasoning by introducing an interpretable intermediate representation, namely natural language causal chains. Method: The paper proposes a two-stage architecture comprising a Causal Chain Extractor (CCE) and a Causal Chain-Driven Answerer (CCDA). The CCE generates causal chains from video-question pairs, and the CCDA produces answers based on these chains. The authors also propose a method for generating high-quality causal chains with large language models and design a new evaluation metric, CauCo. Result: Experiments show that the method outperforms existing state-of-the-art models on three large-scale benchmarks and yields substantial gains in explainability, user trust, and generalization. The proposed CauCo metric also evaluates the quality of causality-oriented captions effectively. Conclusion: By decoupling causal reasoning from answer generation, the proposed modular framework improves interpretability and generalization in video question answering. Experiments show that it outperforms state-of-the-art models on several large-scale benchmarks and performs strongly on causal reasoning. Abstract: Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. 
Project page: https://paritoshparmar.github.io/chainreaction/[105] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
Chenhao Zhang,Wei Gao
Main category: cs.CV
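TL;DR: The paper proposes a dynamic video compression framework for variable bitrate scenarios that addresses the rate-control limitations of learning-based codecs and reports strong experimental results.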
Details
Motivation: Precise rate control remains challenging because of the inherent limitations of learning-based codecs, so a new framework is needed to address it. Method: The paper designs the Dynamic-Route Autoencoder with variable coding routes and introduces a Rate Control Agent and a Joint-Routes Optimization strategy, enabling run-time bitrate adjustment and collaborative training of routes. Result: Experiments on the HEVC and UVG datasets show an average BD-Rate reduction of 14.8% and a BD-PSNR gain of 0.47 dB while maintaining an average bitrate error of 1.66%. Conclusion: The paper proposes a dynamic video compression framework that compresses efficiently across bitrate scenarios, with experiments verifying its advantage in Rate-Distortion-Complexity Optimization (RDCO). Abstract: Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications. Our code is available at https://git.openi.org.cn/OpenAICoding/DynamicDVC.[106] CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network
Reza Akbari Movahed,Abuzar Rezaee,Arezoo Zakeri,Colin Berry,Edmond S. L. Ho,Ali Gooya
Main category: cs.CV
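TL;DR: This paper proposes CardioMorphNet, a Bayesian deep learning framework for 3D cardiac shape-guided deformable registration that accurately estimates cardiac motion from cardiac magnetic resonance images.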
Details
Motivation: Existing methods often struggle to capture cardiac motion accurately because they rely on intensity-based image registration similarity losses, which may overlook cardiac anatomical regions. Method: Using short-axis cardiac magnetic resonance images, CardioMorphNet employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. Result: CardioMorphNet outperforms existing state-of-the-art methods in cardiac motion estimation and yields lower uncertainty values for estimated motion fields in the cardiac region, indicating higher confidence in its predictions. Conclusion: CardioMorphNet is an effective cardiac motion estimation method that focuses on anatomical regions by recursively registering segmentation maps while leveraging sequential short-axis cardiac magnetic resonance images and spatio-temporal features. Abstract: Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to capture heart motion accurately because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank dataset by comparing warped mask shapes with ground truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions.[107] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification
Kaustubh Atey,Sameer Anand Jha,Gouranga Bala,Amit Sethi
Main category: cs.CV
TL;DR: This paper presents a training approach for robust classification of atypical mitotic figures under domain shifts, achieving high accuracy and performance in the MIDOG 2025 challenge.
Details
Motivation: Atypical mitotic figures are important histopathological markers that are challenging to identify consistently, especially under domain shift from scanner, stain, and acquisition differences. Method: The method involves three steps: increasing feature diversity via style perturbations, aligning features across sites using weak domain labels with an auxiliary alignment loss, and stabilizing predictions using an EMA teacher with temperature-scaled KL divergence. Result: The approach achieved a balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499 on the preliminary leaderboard for atypical mitosis classification. Conclusion: The method proposed is a competitive submission for the MIDOG 2025 challenge, showing strong and balanced performance with negligible inference-time overhead. Abstract: Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. 
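As a rough illustration of step (iii) and of the reported metrics, the sketch below shows an EMA teacher update and a temperature-scaled KL distillation loss in plain Python. It is not the authors' code: the function names, the decay and temperature defaults, and the `temperature ** 2` scaling (a common distillation convention) are all assumptions. The last helper only checks that the reported balanced accuracy is the mean of the reported sensitivity and specificity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is an assumed convention, not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


def ema_update(teacher_params, student_params, decay=0.999):
    """Move each teacher parameter a small step toward the student parameter."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy for a binary task: mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2.0


# Consistency check on the reported leaderboard numbers.
print(round(balanced_accuracy(0.8873, 0.8651), 4))
```

With the reported sensitivity (0.8873) and specificity (0.8651), the mean is 0.8762, matching the reported balanced accuracy.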
The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.[108] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang,Zhimin Li,Yuhang Zang,Yujie Zhou,Jiazi Bu,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang
Main category: cs.CV
TL;DR: The paper proposes Pref-GRPO to stabilize T2I generation by mitigating reward hacking through pairwise preference fitting. It also introduces UniGenBench, a comprehensive benchmark for evaluating T2I models with detailed criteria.
Details
Motivation: Current GRPO-based methods using pointwise RMs are susceptible to reward hacking due to amplified score differences after normalization, destabilizing the generation process. Additionally, existing benchmarks lack granularity in evaluation criteria. Method: Pref-GRPO uses pairwise preference reward-based optimization, comparing images in pairs within groups using preference RM and using win rate as the reward signal. UniGenBench evaluates semantic consistency through detailed criteria leveraging MLLM. Result: Extensive experiments show that Pref-GRPO effectively differentiates subtle image quality differences and provides more stable advantages. UniGenBench offers a unified and detailed evaluation framework across 600 prompts, revealing model strengths and weaknesses and validating Pref-GRPO's effectiveness. Conclusion: The proposed Pref-GRPO method enhances the stability of image generation by shifting the optimization objective to preference fitting, effectively mitigating reward hacking. UniGenBench provides a comprehensive evaluation framework for T2I models. Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. 
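The two ideas above, score differences amplified by normalization and win rate as a reward, can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: `group_normalize`, `win_rates`, and the `prefer` callback (a stand-in for a pairwise preference reward model) are hypothetical names, and mean/standard-deviation normalization within a group is assumed.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rates(images, prefer):
    """Win rate of each image against every other image in its group.

    `prefer(a, b)` is a hypothetical pairwise judge that returns True
    when image `a` is preferred over image `b`.
    """
    n = len(images)
    if n < 2:
        raise ValueError("need at least two images per group")
    rates = []
    for i in range(n):
        wins = sum(1 for j in range(n) if j != i and prefer(images[i], images[j]))
        rates.append(wins / (n - 1))
    return rates


# Nearly identical pointwise scores still spread into large advantages.
pointwise_scores = [0.800, 0.801, 0.802, 0.803]
print(group_normalize(pointwise_scores))

# Win rates depend only on pairwise outcomes, not on score magnitudes.
print(win_rates([3, 1, 2], prefer=lambda a, b: a > b))
```

In the first call the scores differ by at most 0.003, yet the standardized values span roughly -1.34 to 1.34, which is the kind of illusory advantage the abstract describes. The toy judge in the second call compares integers purely for illustration.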
Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.[109] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting
Yuxi Hu,Jun Zhang,Kuangyi Chen,Zhe Zhang,Friedrich Fraundorfer
Main category: cs.CV
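TL;DR: This paper proposes C3-GS, a new framework that improves feature fusion through context-aware, cross-dimension, and cross-scale constraints, enabling photorealistic synthesis without additional supervision; benchmark experiments validate its rendering quality and generalization ability.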
Details
Motivation: Existing methods fall short in encoding discriminative, multi-view consistent features for Gaussian prediction, which makes it hard to construct accurate geometry from sparse views. Method: The paper proposes C3-GS, a framework that enhances feature learning by combining context-aware, cross-dimension, and cross-scale constraints. Result: Experiments on benchmark datasets show that C3-GS achieves photorealistic synthesis without additional supervision and generalizes well. Conclusion: C3-GS achieves state-of-the-art rendering quality and generalization ability. Abstract: Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, which struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.[110] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Jiawen Lin,Shiran Bian,Yihang Zhu,Wenbin Tan,Yachao Zhang,Yuan Xie,Yanyun Qu
Main category: cs.CV
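TL;DR: SeqVLM is a new zero-shot 3D visual grounding framework that improves target object reasoning by using multi-view real-world scene images and spatial information, addressing the spatially limited reasoning and contextual omissions of existing methods.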
Details
Motivation: Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG is more promising for real-world applications because it removes the need for scene-specific training. However, existing zero-shot methods suffer from spatially limited reasoning caused by reliance on single-view localization, as well as contextual omissions or detail degradation. Method: SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidates onto real scene image sequences. To mitigate VLM computational overload, a dynamic scheduling mechanism iteratively processes sequence-query prompts, using the VLM's cross-modal reasoning to identify textually specified objects. Result: Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, and advancing 3DVG toward greater generalization and real-world applicability. Conclusion: SeqVLM is a new zero-shot 3DVG framework that uses multi-view real-world scene images with spatial information for target object reasoning, achieving state-of-the-art Acc@0.25 scores of 55.6% and 53.2% on the ScanRefer and Nr3D benchmarks and surpassing previous zero-shot methods. Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion process of 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging VLM's cross-modal reasoning capabilities to identify textually specified objects. 
Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, and advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.[111] Occlusion Robustness of CLIP for Military Vehicle Classification
Jan Erik van Woerden,Gertjan Burghouts,Lotte Nijskens,Alma M. Liezenga,Sabina van Rooij,Frank Ruis,Hugo J. Kuijf
Main category: cs.CV
TL;DR: This paper explores the robustness of CLIP models in military environments with occlusion, finding that Transformer-based models perform best, and training with occlusion-specific augmentations is vital for real-world deployment.
Details
Motivation: Vision-language models like CLIP offer advantages for defense applications with scarce labeled data, but their robustness in challenging military environments with occlusion and low signal-to-noise ratio remains underexplored. Method: A custom dataset of 18 military vehicle classes was used to evaluate CLIP variants' robustness to occlusion. Performance was measured using Normalized Area Under the Curve (NAUC) across different occlusion percentages. Result: Transformer-based CLIP models outperformed CNNs. Fine-grained, dispersed occlusions had a more significant negative impact than larger contiguous occlusions. Linear-probed models experienced a sharp performance drop at around 35% occlusion, while finetuning the backbone delayed this drop to more than 60% occlusion. Conclusion: The study concludes that occlusion-specific augmentations during training are crucial for improving CLIP models' robustness in military environments, and further research is needed in patch-level sensitivity and architectural resilience. Abstract: Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. 
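The abstract does not define NAUC precisely, so the following is only one plausible reading: the trapezoidal area under the accuracy-versus-occlusion curve, divided by the width of the occlusion range so that a model with constant accuracy `a` scores `a`. The function name and this normalization are assumptions, not the paper's definition.

```python
def normalized_auc(occlusion_fractions, accuracies):
    """Trapezoidal area under accuracy vs. occlusion, divided by the x-range.

    Assumes occlusion_fractions is strictly increasing and both sequences
    have the same length (at least two points).
    """
    if len(occlusion_fractions) != len(accuracies) or len(accuracies) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    area = 0.0
    points = list(zip(occlusion_fractions, accuracies))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 <= x0:
            raise ValueError("occlusion fractions must be strictly increasing")
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (occlusion_fractions[-1] - occlusion_fractions[0])


# Made-up accuracy curves for illustration only (not results from the paper).
levels = [0.0, 0.2, 0.4, 0.6, 0.8]
gradual_decay = [0.90, 0.85, 0.75, 0.60, 0.40]
sharp_drop = [0.90, 0.88, 0.40, 0.20, 0.10]
print(normalized_auc(levels, gradual_decay), normalized_auc(levels, sharp_drop))
```

Under this reading, a model whose accuracy collapses early (as described for linear-probed models around 35% occlusion) gets a lower score than one that degrades gradually, even if both start from the same unoccluded accuracy.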
These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.[112] SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer
Fachri Najm Noer Kartiman,Rasim,Yaya Wihardi,Nurul Hasanah,Oskar Natan,Bambang Wahono,Taufik Ibnu Salim
Main category: cs.CV
TL;DR: This paper proposes the SKGE-Swin architecture for autonomous vehicles, enhancing context awareness and achieving better driving performance through Swin Transformer with skip-stage mechanism.
Details
Motivation: To develop an end-to-end autonomous vehicle model with enhanced pixel-to-pixel context awareness for better comprehension of complex environmental patterns. Method: SKGE-Swin architecture utilizes the Swin Transformer with a skip-stage mechanism for global and multi-level feature representation, leveraging SW-MSA for distant pixel information extraction. Result: The SKGE-Swin model achieved a superior Driving Score on the CARLA platform under adversarial scenarios compared to previous methods. Conclusion: SKGE-Swin architecture contributes to the improvement of the autonomous vehicle model's driving score by enhancing its ability to understand complex patterns in the vehicle's surroundings. Abstract: Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer's Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle's surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.[113] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Gowreesh Mago,Pascal Mettes,Stevan Rudinac
Main category: cs.CV
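TL;DR: This paper discusses the importance of understanding abstract concepts in video, notes that advances in foundation models offer new opportunities to tackle the problem, and stresses drawing on past research to avoid duplicated effort.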
Details
Motivation: Machines have made significant progress in understanding video content but, compared with humans, still lack the ability to understand abstract concepts such as justice and freedom. This is an important challenge in video understanding. Method: The paper surveys existing tasks and datasets and reviews how researchers have tried to solve abstract concept recognition over different periods. Result: Although researchers have long attempted to solve abstract concept understanding, the rise of multimodal foundation models makes this a good time to revisit the problem and build on existing experience. Conclusion: The paper argues that foundation models provide an ideal setting for understanding and recognizing abstract concepts in video, and that decades of prior research should be drawn on to avoid duplicated effort. Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid ``re-inventing the wheel'' as we start revisiting it in the era of multi-modal foundation models.[114] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML
Kuniko Paxton,Koorosh Aslansefat,Amila Akagić,Dhavalkumar Thakker,Yiannis Papadopoulos
Main category: cs.CV
TL;DR: This paper introduces a new explainability method for skin lesion classification models that improves diagnostic trustworthiness, reduces misdiagnosis risks, and enhances patient safety.
Details
Motivation: Despite high accuracy in skin lesion classification models, distrust in AI remains due to a lack of explainability. Existing methods like LIME and CAM have limitations in reliability, necessitating a more robust and trustworthy diagnostic approach. Method: The method evaluates activation probability maps probabilistically at a pixel level for all classes, visualizes the diagnostic process in a unified manner, and incorporates SafeML to detect and warn against false diagnoses. Result: The proposed method successfully improves diagnostic reliability and reduces the risk of misdiagnosis by providing unified visualization and enhancing false diagnosis detection using SafeML. Conclusion: The proposed Global Class Activation Probabilistic Map Evaluation method enhances diagnostic reliability and patient safety by improving the explainability and detection of false diagnoses in skin lesion classification models. Abstract: Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists' diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues: LIME-based methods suffer from inconsistency, while CAM-based methods fail to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes' activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. 
We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.[115] Evaluating Compositional Generalisation in VLMs and Diffusion Models
Beth Pearson,Bilal Boulbarss,Michael Wray,Martha Lewis
Main category: cs.CV
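TL;DR: This study evaluates the compositional generalisation of Diffusion Classifier, CLIP, and ViLT in zero-shot and generalised zero-shot learning settings, finding that Diffusion Classifier and ViLT perform well on concept binding tasks but that all models struggle with relational reasoning.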
Details
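Motivation: A fundamental aspect of natural language semantics is that novel meanings can be formed by composing known parts. Although vision-language models have made significant progress in recent years, there is evidence that they cannot perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a model such as CLIP may incorrectly label it as a red cylinder or a blue cube, indicating that it represents the image as a bag of words and fails to capture compositional semantics. Diffusion models have attracted wide attention for their strong generative abilities, and zero-shot classifiers based on them have performed competitively with CLIP on certain compositional tasks. Method: The paper evaluates three models (Diffusion Classifier, CLIP, and ViLT) on binding objects with attributes and relations in zero-shot learning and generalised zero-shot learning settings. Result: Diffusion Classifier and ViLT perform well on concept binding tasks, but all models struggle significantly with the relational generalised zero-shot learning task. Analysis of CLIP embeddings suggests the difficulty may stem from overly similar representations of relational concepts such as left and right. Conclusion: Although Diffusion Classifier and ViLT perform well on concept binding tasks, all models struggle significantly with the relational generalised zero-shot learning task, which underscores the broader challenges vision-language models face with relational reasoning. Analysis of CLIP embeddings suggests the difficulty may stem from overly similar representations of relational concepts such as left and right. Abstract: A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a `bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. 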
Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip[116] Surfel-based 3D Registration with Equivariant SE(3) Features
Xueyang Kang,Hang Zhao,Kourosh Khoshelham,Patrick Vandewalle
Main category: cs.CV
TL;DR: A new surfel-based pose learning regression method is proposed for point cloud registration, which effectively handles noise and aggressive rotations, showing robust performance on real-world datasets.
Details
Motivation: Current point cloud registration methods ignore point orientations and uncertainties, making them susceptible to noisy input and aggressive rotations. This necessitates extensive training with transformation augmentations. Method: The method involves initializing surfels from a Lidar point cloud using virtual perspective camera parameters, learning explicit SE(3) equivariant features through SE(3) equivariant convolutional kernels, and predicting the relative transformation between source and target scans using a model composed of an equivariant convolutional encoder, a cross-attention mechanism, a fully-connected decoder, and a non-linear Huber loss. Result: Experimental results on indoor and outdoor datasets show the model's robust performance and superiority in handling real point-cloud scans. Conclusion: The proposed surfel-based pose learning regression approach demonstrates superiority and robust performance on real point-cloud scans compared to state-of-the-art methods. Abstract: Point cloud registration is crucial for ensuring 3D alignment consistency of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. While various point cloud-based registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties, making the model susceptible to noisy input and aggressive rotations of the input point cloud like orthogonal transformation; thus, it necessitates extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose learning regression approach. Our method can initialize surfels from Lidar point cloud using virtual perspective camera parameters, and learns explicit $\mathbf{SE(3)}$ equivariant features, including both position and rotation through $\mathbf{SE(3)}$ equivariant convolutional kernels to predict relative transformation between source and target scans. 
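The Huber loss named in the method summary above has a standard textbook form: quadratic for small residuals and linear for large ones, which limits the influence of outliers. The sketch below shows only that generic definition. How the authors apply it to the predicted transformation (which pose parameters, what threshold) is not stated in the abstract, so `delta` and the helper `mean_huber` are illustrative assumptions.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss for a single scalar residual."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a             # quadratic region near zero
    return delta * (a - 0.5 * delta)   # linear region for large errors


def mean_huber(predicted, target, delta=1.0):
    """Mean Huber loss over paired values (hypothetical helper),
    e.g. the components of a predicted and a reference translation."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(huber(p - t, delta) for p, t in zip(predicted, target))
    return total / len(predicted)
```

The two branches meet with matching value and slope at `|residual| == delta`, so the loss stays differentiable at the transition while penalizing gross errors less steeply than a squared loss would.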
The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model's superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.[117] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training
Tao Luo,Han Wu,Tong Yang,Dinggang Shen,Zhiming Cui
Main category: cs.CV
TL;DR: DVCTNet is a novel method for accurate dental caries detection using panoramic X-rays, combining global and local views with a Gated Cross-View Attention module, showing superior performance compared to current state-of-the-art methods.
Details
Motivation: Current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. Method: DVCTNet uses a dual-view co-training network with a Gated Cross-View Attention module for integrating global and local views from panoramic X-ray images. Result: DVCTNet outperforms existing state-of-the-art methods on both a public dataset and a newly curated, high-precision dental caries detection dataset. Conclusion: DVCTNet demonstrates superior performance in dental caries detection and has clinical applicability. Abstract: Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. 
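The idea of dynamically fusing two feature views through a gate can be sketched generically. This is not the GCV-Atten module itself: the abstract does not specify its layers, and the sketch omits the cross-view attention entirely. It only shows a per-channel sigmoid gate that blends a global feature vector with a local one. All names and the weight layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(global_feat, local_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors channel by channel.

    gate_weights[i] is a (w_global, w_local) pair and gate_bias[i] a
    scalar; in a real model these would be learned. The gate alpha in
    (0, 1) decides how much of the global feature to keep.
    """
    if not (len(global_feat) == len(local_feat)
            == len(gate_weights) == len(gate_bias)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for g, l, (wg, wl), b in zip(global_feat, local_feat,
                                 gate_weights, gate_bias):
        alpha = sigmoid(wg * g + wl * l + b)
        fused.append(alpha * g + (1.0 - alpha) * l)
    return fused
```

With all gate parameters at zero the gate sits at 0.5 and the output is the plain average of the two views; a large positive bias pushes the output toward the global view, and a large negative bias toward the local view.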
To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet's superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.[118] FusionCounting: Robust visible-infrared image fusion guided by crowd counting via multi-task learning
He Li,Xinyu Liu,Weihang Kong,Xingchen Zhang
Main category: cs.CV
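TL;DR: This paper proposes FusionCounting, a new multi-task learning framework that integrates crowd counting into visible-infrared image fusion, addressing the heavy annotation burden and poor adaptability to complex scenes of existing methods.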
Details
Motivation: Existing VIF methods focus mainly on optimizing fused image quality. Recent studies incorporate downstream tasks such as semantic segmentation and object detection to provide semantic guidance, but these approaches face challenges such as heavy annotation effort, overlapping bounding boxes, and occlusion. Method: The paper proposes FusionCounting, a new multi-task learning framework that integrates crowd counting into the VIF process, and introduces a dynamic loss-weighting strategy and adversarial training. Result: Experiments on public datasets show that FusionCounting both enhances image fusion quality and achieves superior crowd counting performance. Conclusion: FusionCounting improves image fusion quality while also achieving superior crowd counting performance. Abstract: Most visible and infrared image fusion (VIF) methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB-T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi-task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model's stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.[119] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation
Krit Duangprom,Tryphon Lambrou,Binod Bhattarai
Main category: cs.CV
TL;DR: This paper introduces a novel pipeline using LoRA-fine-tuned Vision Language Models (VLMs) for 2D keypoint estimation of surgical tools, showing improved performance over traditional methods in low-resource settings.
Details
Motivation: Traditional CNN or Transformer-based approaches often overfit small-scale medical datasets. The authors aimed to leverage the generalization capabilities of pre-trained VLMs to improve performance in low-resource scenarios. Method: The method involves fine-tuning pre-trained VLMs using LoRA to adapt them for 2D keypoint estimation of surgical tools. Prompts were designed to align visual features with semantic keypoint descriptions. Result: Experimental results showed that the adapted VLM outperformed baseline models with only two epochs of fine-tuning, demonstrating the effectiveness of LoRA in this context. Conclusion: The paper concludes that leveraging Vision Language Models (VLMs) fine-tuned with LoRA is an effective approach for 2D keypoint estimation of surgical tools, particularly in low-resource scenarios. Abstract: This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.[120] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification
Hao Yang,Qianyu Zhou,Haijia Sun,Xiangtai Li,Xuequan Lu,Lizhuang Ma,Shuicheng Yan
Main category: cs.CV
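TL;DR: This paper proposes PointDGRWKV, the first application of the RWKV architecture to domain generalization for point cloud classification; it improves performance efficiently by addressing the spatial distortion caused by token shift and the problem of cross-domain attention drift.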
Details
Motivation: Existing point cloud classification models based on convolutional networks, Transformers, or Mamba suffer on unseen domains from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV offers linear complexity, a global receptive field, and long-range dependency modeling, but its spatial distortion on unstructured point clouds and its sensitivity to cross-domain differences must be addressed. Method: The paper proposes an Adaptive Geometric Token Shift module and a Cross-Domain key feature Distribution Alignment module to address RWKV's spatial distortion and attention drift on unseen domains. Result: PointDGRWKV achieves state-of-the-art performance on multiple benchmarks, effectively improving the domain generalization of RWKV for point cloud classification. Conclusion: By addressing spatial modeling and cross-domain robustness, the PointDGRWKV framework achieves state-of-the-art domain generalization for point cloud classification while retaining RWKV's linear efficiency. Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, and suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. 
Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.[121] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis
Ye Zhang,Yu Zhou,Jingwen Qi,Yongbing Zhang,Simon Puettmann,Finn Wichmann,Larissa Pereira Ferreira,Lara Sichward,Julius Keyl,Sylvia Hartmann,Shuo Zhao,Hongxiao Wang,Xiaowei Xu,Jianxu Chen
Main category: cs.CV
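TL;DR: PathMR is a cell-level multimodal visual reasoning framework that generates expert-level pathological diagnostic explanations and improves the interpretability of AI-assisted pathology diagnosis.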
Details
Motivation: The clinical adoption of deep learning in pathological diagnosis is limited by opaque model decisions and the lack of a traceable reasoning process, so model interpretability and transparency need to improve. Method: PathMR is a multimodal visual reasoning framework that takes a pathological image and a textual query, generates expert-level diagnostic explanations, and predicts cell distribution patterns. Result: PathMR consistently outperforms existing visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. Conclusion: PathMR shows potential for improving interpretability and diagnostic quality in AI-driven pathological diagnosis; the authors state that the code will be made publicly available to support follow-up research and applications. Abstract: Deep learning-based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert-style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI-assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.[122] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis
Dennis Slobodzian,Karissa Tilbury,Amir Kordijazi
Main category: cs.CV
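TL;DR: This study develops and validates a deep learning framework for early detection of pancreatic ductal adenocarcinoma (PDAC) that analyzes dual-modality autofluorescence and second harmonic generation (SHG) imaging and achieves over 90% detection accuracy, indicating potential for clinical use.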
Details
Motivation: Pancreatic ductal adenocarcinoma (PDAC) has a five-year survival rate below 10% because it is detected late, so an effective early detection method is needed. Method: The researchers analyzed 40 unique patient samples to create a specialized neural network that distinguishes normal, fibrotic, and cancerous tissue, and evaluated six deep learning architectures, including traditional Convolutional Neural Networks (CNNs) and modern Vision Transformers (ViTs). The final optimized framework is based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training. Result: The final optimized framework achieved over 90% accuracy in cancer detection, which the authors report as a significant improvement over current manual analysis methods. Conclusion: The study establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities and lays a foundation for future expansion to other cancer types. The method also offers insights for applying deep learning to the limited-size medical imaging datasets common in clinical applications. Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.[123] Understanding and evaluating computer vision models through the lens of counterfactuals
Pushkar Shukla
Main category: cs.CV
TL;DR: This thesis proposes counterfactual-based frameworks to address bias in AI models, offering scalable methods for interpretability, fairness, and causal analysis in both vision classifiers and generative models.
Details
Motivation: The motivation is to enhance interpretability and fairness in AI by using counterfactual reasoning, systematically addressing biases in both discriminative and generative models through causal analysis. Method: The thesis develops frameworks using counterfactual reasoning to analyze and mitigate bias in vision classifiers and generative models, incorporating methods like CAVLI, ASAC, TIBET, BiasConnect, and InterMit. Result: The contributions include frameworks (CAVLI, ASAC, TIBET, BiasConnect, InterMit) that enable bias detection and mitigation in AI models, providing scalable methods for causal auditing and fairness improvement. Conclusion: The thesis concludes that counterfactuals provide a unifying framework for interpretability, fairness, and causality in AI models, offering scalable and principled methods for socially responsible bias evaluation and mitigation. Abstract: Counterfactual reasoning -- the practice of asking ``what if'' by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. 
The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.[124] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software
Loïc Stratil,Felix Fent,Esteban Rivera,Markus Lienkamp
Main category: cs.CV
TL;DR: This survey paper explores unified perception as a promising alternative to traditional modular pipelines in autonomous vehicle perception, offering improved robustness, efficiency, and interpretability. It introduces a taxonomy for categorizing methods and highlights future research directions.
Details
Motivation: The motivation stems from the limitations of traditional modular pipelines in autonomous vehicle perception, such as error accumulation and limited inter-task synergy. The paper aims to explore how unified perception can overcome these issues and provide a more effective and interpretable solution. Method: The paper provides a survey and systematic analysis of existing unified perception methods, introducing a taxonomy based on task integration, tracking formulation, and representation flow. It categorizes approaches into three paradigms: Early, Late, and Full Unified Perception. Result: The paper presents a comprehensive overview of unified perception, establishes a taxonomy for classifying methods, and reviews existing approaches, their architectures, training strategies, datasets, and open-source availability. It identifies future research directions to guide the development of more robust and generalizable perception systems. Conclusion: The paper concludes that unified perception is a promising approach for autonomous vehicle perception, which can improve robustness, contextual reasoning, and efficiency while maintaining interpretable outputs. It emphasizes the need for further research to advance this field. Abstract: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. 
We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.[125] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Yifan Gao,Haoyue Li,Feng Yuan,Xiaosong Wang,Xin Gao
Main category: cs.CV
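TL;DR: Dino U-Net achieves state-of-the-art medical image segmentation by fusing DINOv3 semantic features with spatial details and refining the feature representation with its FAPM module.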
Details
Motivation: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation, but effectively transferring their learned representations to precise clinical applications remains a challenge. Method: Dino U-Net introduces an encoder built on a frozen DINOv3 backbone, uses a specialized adapter to fuse the model's rich semantic features with low-level spatial details, and adds a fidelity-aware projection module (FAPM) that refines and projects features during dimensionality reduction. Result: Dino U-Net achieves state-of-the-art performance on seven diverse public medical image segmentation datasets, consistently outperforming previous methods, and its segmentation accuracy improves as the backbone grows, up to the 7-billion-parameter variant. Conclusion: Dino U-Net is a novel encoder-decoder architecture that exploits the high-fidelity dense features of the DINOv3 vision foundation model to advance the accuracy and parameter efficiency of medical image segmentation. Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. 
The code is available at https://github.com/yifangao112/DinoUNet.
[126] Classifying Mitotic Figures in the MIDOG25 Challenge with Deep Ensemble Learning and Rule Based Refinement
Sara Krauss,Ellena Spieß,Daniel Hieber,Frank Kramer,Johannes Schobel,Dominik Müller
Main category: cs.CV
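TL;DR: This study uses deep ensemble learning combined with a rule-based refinement (RBR) module to automatically distinguish AMFs from NMFs, reaching a balanced accuracy of 84.02% on the MIDOG25 dataset; the RBR module raises specificity but lowers sensitivity and still needs optimization.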
Details
Motivation: Mitotic figures (MFs) are important biomarkers in tumor grading, but manually annotating atypical MFs (AMFs) and normal MFs (NMFs) is time-consuming and subjective, so an automated classification method is needed. Method: An ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module intended to improve classification performance. Result: On the MIDOG25 preliminary test set, the ensemble reached a balanced accuracy of 84.02%; the RBR module increased specificity but reduced sensitivity and overall performance. Conclusion: Deep ensembles perform well for AMF classification, while the RBR module can improve specific metrics but requires further research to optimize its performance. Abstract: Mitotic figures (MFs) are relevant biomarkers in tumor grading. Differentiating atypical MFs (AMFs) from normal MFs (NMFs) remains difficult, as manual annotation is time-consuming and subjective. In this work, an ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. While the RBR increased specificity, it reduced sensitivity and overall performance. The results show that deep ensembles perform well for AMF classification. RBR can increase specific metrics but requires further research.
[127] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans
Enrico Martini,Ho Jin Choi,Nadia Figueroa,Nicola Bombieri
Main category: cs.CV
TL;DR: COMETH is a lightweight real-time multi-view human pose fusion algorithm that improves accuracy and scalability for industrial applications.
Details
Motivation: Monitoring human activity is essential in Industry 5.0 for ergonomic safety and well-being, but existing multi-camera setups face scalability and real-time challenges due to high computational costs and bandwidth requirements. Method: COMETH uses convex optimization-based inverse kinematics for spatial fusion, integrates kinematic and biomechanical constraints, and implements a state observer for temporal consistency. Result: COMETH outperforms state-of-the-art methods in localization, detection, and tracking accuracy on both public and industrial datasets. Conclusion: COMETH enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. Abstract: In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. 
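The state observer mentioned above can be illustrated with a generic fixed-gain observer. The following is a minimal sketch under stated assumptions, not COMETH's actual observer: it applies an alpha-beta filter to a single joint coordinate, and the class name `AlphaBetaObserver`, the helper `smooth`, the gain values, and the 30 Hz frame interval are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlphaBetaObserver:
    """Hypothetical fixed-gain observer for one joint coordinate.

    This is a generic alpha-beta filter, used here only to illustrate
    how a state observer can enforce temporal consistency.
    """
    alpha: float = 0.5      # position correction gain (assumed value)
    beta: float = 0.1       # velocity correction gain (assumed value)
    position: float = 0.0   # current position estimate
    velocity: float = 0.0   # current velocity estimate

    def update(self, measurement: float, dt: float) -> float:
        # Predict the position forward assuming constant velocity.
        predicted = self.position + self.velocity * dt
        # Residual between the new measurement and the prediction.
        residual = measurement - predicted
        # Correct both state components with fixed gains.
        self.position = predicted + self.alpha * residual
        self.velocity = self.velocity + (self.beta / dt) * residual
        return self.position


def smooth(measurements, dt=1.0 / 30.0):
    """Filter a sequence of per-frame measurements (dt assumes 30 fps)."""
    observer = AlphaBetaObserver(position=measurements[0])
    return [observer.update(m, dt) for m in measurements]
```

In this sketch a single-frame outlier is only partially followed (the position gain of 0.5 halves the jump), which is the kind of temporal smoothing an observer provides; a real system would run one observer per joint coordinate and tune the gains to the sensor noise.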
The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
[128] Olive Tree Satellite Image Segmentation Based On SAM and Multi-Phase Refinement
Amir Jmal,Chaima Chtourou,Mahdi Louati,Abdelaziz Kallel,Houda Khmila
Main category: cs.CV
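TL;DR: This paper presents a new method for detecting olive trees using remote sensing and advanced segmentation models, reaching 98% accuracy, with significant implications for agricultural management.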
Details
Motivation: In the context of climate change, maintaining olive biodiversity through early anomaly detection and remote sensing is crucial. Method: The study uses the Segment Anything Model (SAM) for segmentation and corrects the result using learnable constraints based on tree arrangement and on shape and size. Result: The method reaches 98% accuracy, significantly surpassing the initial SAM performance of 82%. Conclusion: The paper proposes an innovative approach that uses foundation models and advanced segmentation techniques to significantly improve the accuracy of olive tree segmentation. Abstract: In the context of proven climate change, maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial, offering effective management solutions. This paper presents an innovative approach to olive tree segmentation from satellite images. By leveraging foundational models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The methodology includes SAM segmentation and corrections based on tree alignment in the field and a learnable constraint on shape and size. Our approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%.
[129] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections
Fang Wang,Huitao Li,Wenhan Chao,Zheng Zhuo,Yiran Ji,Chang Peng,Yupeng Sun
Main category: cs.CV
TL;DR: E-ConvNeXt is a lightweight network that integrates CSPNet with ConvNeXt and applies structural optimizations to balance high accuracy and high efficiency.
Details
Motivation: Many high-performance networks were not designed with lightweight application scenarios in mind, which greatly limits their scope of application. Method: The Cross Stage Partial Network (CSPNet) is combined with ConvNeXt and the network structure is adjusted, the Stem and Block structures are optimized, and Layer Scale is replaced with channel attention. Result: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests further confirm its generalization capability. Conclusion: By integrating the Cross Stage Partial Connections mechanism and a series of optimized designs, E-ConvNeXt significantly reduces the parameter scale and network complexity of ConvNeXt while maintaining high accuracy and good generalization. Abstract: Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model's network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model's feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt's superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.
[130] DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes
Yajiao Xiong,Xiaoyu Zhou,Yongtao Wan,Deqing Sun,Ming-Hsuan Yang
Main category: cs.CV
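TL;DR: DrivingGaussian++ is an efficient framework for realistic reconstruction and controllable editing of autonomous driving scenes; by integrating 3D Gaussian models and large language models, it achieves efficient modeling of dynamic scenes and greater scene diversity.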
Details
Motivation: Realistic reconstruction and controllable editing of autonomous driving scenes require an efficient and effective method that accurately models static backgrounds and moving objects, reconstructs scenes in detail and consistently, and enhances scene diversity. Method: DrivingGaussian++ models the static background with incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, uses a LiDAR prior for detailed scene reconstruction, supports training-free controllable editing, and integrates large language models (LLMs) to generate dynamic object motion trajectories. Result: DrivingGaussian++ outperforms existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis, supports training-free controllable editing (texture modification, weather simulation, and object manipulation), and can generate dynamic object motion trajectories and enhance their realism during optimization. Conclusion: DrivingGaussian++ is an efficient and effective framework for reconstruction and controllable editing of autonomous driving scenes, demonstrating strong capability in generating dynamic multi-view driving scenarios and enhancing scene diversity. Abstract: We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io
[131] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
Chenfan Qu,Yiwu Zhong,Bin Li,Lianwen Jin
Main category: cs.CV
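TL;DR: This paper proposes CAAAv2 and Web-IML, which use web data to alleviate data scarcity in image manipulation localization and significantly improve model performance.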
Details
Motivation: Images altered with editing tools can mislead viewers, and the lack of high-quality annotated datasets currently holds back the field. Method: CAAAv2 and QES are introduced to build the large-scale dataset MIMLv2, and the Web-IML model is developed to use web-scale supervision for image manipulation localization. Result: Web-IML improves significantly on multiple real-world forgery benchmarks, surpassing the previous SOTA TruFor by 24.1 average IoU points. Conclusion: Web-IML makes effective use of web supervision, significantly improving model performance on image manipulation localization and alleviating data scarcity. Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAAv2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. 
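Localization quality in this entry is reported as average IoU, which compares a predicted manipulation mask with the ground-truth mask pixel by pixel. The following is a minimal sketch of that metric, assuming binary masks stored as equally sized nested lists of 0/1; the function names `mask_iou` and `mean_iou` and the convention for two empty masks are illustrative assumptions, not taken from the Web-IML code.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            # A pixel is in the intersection if both masks mark it.
            intersection += 1 if (p and t) else 0
            # A pixel is in the union if either mask marks it.
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs):
    """Average IoU over an iterable of (prediction, ground truth) pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)
```

For example, `mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])` has one shared pixel out of three marked pixels, giving 1/3. Differences quoted in "IoU points" are presumably on the 0 to 100 scale, that is, the mean multiplied by 100.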
With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
[132] ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Patryk Będkowski,Jan Dubiński,Filip Szatkowski,Kamil Deja,Przemysław Rokita,Tomasz Trzciński
Main category: cs.CV
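TL;DR: ExpertSim uses a Mixture-of-Generative-Experts architecture to provide efficient detector response simulation for the Zero Degree Calorimeter in the ALICE experiment, improving accuracy and speed over traditional Monte Carlo methods.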
Details
Motivation: Simulating detector responses in particle collider experiments is crucial for understanding particle collisions, but traditional methods are computationally expensive. Machine learning methods have been proposed to improve efficiency, but the data distribution varies widely, which out-of-the-box methods struggle to handle. Method: ExpertSim uses a Mixture-of-Generative-Experts architecture in which each expert specializes in simulating a different subset of the data, improving precision and efficiency. Result: ExpertSim improves accuracy and provides a significant speedup over traditional Monte Carlo methods, offering an efficient detector simulation solution for particle physics experiments at CERN. Conclusion: ExpertSim is an effective solution for high-efficiency detector response simulation in particle physics experiments at CERN. Abstract: Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN's computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim, a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
[133] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models
Jiaxiang Cheng,Bing Ma,Xuhua Ren,Hongyi Jin,Kai Yu,Peng Zhang,Wenyue Li,Yuan Zhou,Tianxiang Zheng,Qinglin Lu
Main category: cs.CV
TL;DR: POSE is a novel distillation framework for video diffusion models that enables efficient, high-quality video generation in a single step, significantly reducing latency while maintaining performance.
Details
Motivation: Existing video acceleration methods fail to model temporal coherence and provide single-step distillation for large-scale video models, leading to inefficiencies and lower video quality. Method: POSE utilizes a two-phase distillation process, including stability priming and unified adversarial equilibrium, along with conditional adversarial consistency for conditional video generation. Result: POSE reduces sampling steps and latency by 100×, from 1000 seconds to 10 seconds, while improving video quality and performance on VBench-I2V by an average of 7.15%. Conclusion: POSE successfully addresses the bottlenecks in sampling efficiency for video diffusion generation, achieving high-quality, single-step video generation while significantly reducing latency. Abstract: The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models:(i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. 
(ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality, reducing the latency of the pre-trained model by 100×, from 1000 seconds to 10 seconds, while maintaining competitive performance.
[134] Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
Dale Decatur,Thibault Groueix,Wang Yifan,Rana Hanocka,Vladimir Kim,Matheus Gadelha
Main category: cs.CV
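TL;DR: This paper proposes an efficient text-to-image generation method that reduces redundancy by sharing computation across similar prompts in the early stages of the diffusion model, lowering compute cost and improving image quality.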
Details
Motivation: Text-to-image diffusion models generate high-quality images but are computationally expensive. Prior work mainly optimizes per-inference efficiency, whereas this paper explores reducing redundancy across correlated prompts. Method: The method builds on the coarse-to-fine nature of diffusion models, clustering prompts by semantic similarity and sharing computation in early diffusion steps. Result: Experiments show that, for models conditioned on image embeddings, the method significantly reduces compute cost while improving image quality, and that UnClip's text-to-image prior improves diffusion step allocation. Conclusion: The paper proposes a training-free method that shares computation across similar prompts in early diffusion steps, reducing redundancy in text-to-image generation, lowering compute cost, and improving image quality. Abstract: Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
[135] Mitosis detection in domain shift scenarios: a Mamba-based approach
Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Main category: cs.CV
TL;DR: This paper proposes a new Mamba-based approach to address domain shift in mitosis detection.
Details
Motivation: Mitosis detection in histopathology images plays a key role in tumor assessment, but machine learning algorithms suffer a significant performance drop on images from different domains. Method: The paper uses Mamba and the VM-UNet architecture for mitosis detection, with stain augmentation operations to improve model robustness. Result: Preliminary experiments on the MIDOG++ dataset show large room for improvement for the method. Conclusion: The paper proposes a Mamba-based approach to domain shift in mitosis detection, using a VM-UNet architecture and stain augmentation to improve robustness to domain shift. Abstract: Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to Track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.
[136] A multi-task neural network for atypical mitosis recognition under domain shift
Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Main category: cs.CV
TL;DR: This paper proposes a multi-task learning approach to improve the detection of atypical mitotic figures in histopathology images under domain shift, showing promising results across multiple datasets.
Details
Motivation: Recognizing atypical mitotic figures in histopathology images helps physicians assess tumor aggressiveness. However, machine learning models suffer from performance drops under domain shift, which the proposed method aims to address. Method: The approach uses multi-task learning, leveraging auxiliary tasks correlated to the main classification task to improve domain generalization. It was submitted to Track 2 of the MIDOG 2025 challenge. Result: The model achieved promising results in a preliminary evaluation on three datasets: the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, and the preliminary test set of the MIDOG25 challenge. Conclusion: The proposed multi-task learning approach demonstrates promising performance in recognizing atypical mitotic figures under domain shift, suggesting its potential for aiding tumor aggressiveness assessment. Abstract: Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks correlated to the main classification task, the proposed approach, submitted to Track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, as well as the preliminary test set of the MIDOG25 challenge.
[137] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator
Huynh Tong Dang Khoa,Dang Hoai Nam,Vo Nguyen Le Duy
Main category: cs.CV
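TL;DR: This study proposes FW-GAN, a one-shot handwriting synthesis framework that introduces a Wave-MLP, a frequency-guided discriminator, and a frequency distribution loss to improve the quality and consistency of generated handwriting data.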
Details
Motivation: Conventional methods are limited in modeling long-range dependencies and complex stroke patterns, and they overlook the importance of frequency information. Method: The FW-GAN framework comprises a phase-aware Wave-MLP generator, a frequency-guided discriminator, and a frequency distribution loss. Result: Experiments on Vietnamese and English handwriting datasets show that FW-GAN generates high-quality, style-consistent handwritten text. Conclusion: FW-GAN effectively addresses the scarcity of handwriting data by generating high-quality, style-consistent handwritten text, making it a useful tool for data augmentation in low-resource handwriting recognition pipelines. Abstract: Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN
[138] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs
Junpeng Ma,Qizhe Zhang,Ming Lu,Zhibin Wang,Qiang Zhou,Jun Song,Shanghang Zhang
Main category: cs.CV
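TL;DR: MMG-Vid is a training-free visual token pruning framework that significantly improves the efficiency of video understanding by maximizing marginal gains while maintaining strong performance.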
Details
Motivation: Existing methods do not account for the dynamic characteristics and temporal dependencies of video frames, which leaves computational challenges in real-world applications. Method: MMG-Vid is a training-free visual token pruning framework that maximizes marginal gains at both the segment level and the token level. It first divides the video into segments based on frame similarity, then dynamically allocates the token budget for each segment to maximize each segment's marginal gain. It then applies a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity to maximize the marginal gain of each token. Result: Experiments show that MMG-Vid maintains over 99.5% of the original performance while removing 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Conclusion: MMG-Vid significantly improves the efficiency of video understanding while maintaining strong performance, demonstrating its potential for real-world applications. Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively removing 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
[139] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Wei Li,Renshan Zhang,Rui Shao,Jie He,Liqiang Nie
Main category: cs.CV
TL;DR: CogVLA improves the efficiency and performance of Vision-Language-Action models through a 3-stage architecture, achieving high success rates while significantly reducing training costs and inference latency.
Details
Motivation: Existing VLA models require extensive post-training, resulting in high computational overhead that limits scalability and deployment. CogVLA was developed to address this issue by improving efficiency and performance. Method: CogVLA introduces a 3-stage progressive architecture: 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) for instruction-aware latent representation, 2) LLM-FiLM based Pruning Routing (LFP-Routing) for token-level sparsity, and 3) V-L-A Coupled Attention (CAtten) for coherent action generation. Result: CogVLA achieves state-of-the-art performance with success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robotic tasks, reduces training costs by 2.5-fold, and decreases inference latency by 2.8-fold compared to OpenVLA. Conclusion: CogVLA is a more efficient and high-performing alternative to existing VLA models, with significant reductions in training costs and inference latency while maintaining state-of-the-art performance. Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 
3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
[140] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning
Hao Tan,Jun Lan,Zichang Tan,Ajian Liu,Chuanbiao Song,Senyuan Shi,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei
Main category: cs.CV
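TL;DR: This paper introduces the HydraFake dataset and Veritas, a detector based on a multimodal large language model, to address real-world challenges in deepfake detection.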
Details
Motivation: Deepfake detection remains challenging in real-world scenarios because existing academic benchmarks differ severely from industrial practice. Method: The paper introduces the HydraFake dataset and Veritas, a detector based on a multimodal large language model, with a two-stage training pipeline and pattern-aware reasoning. Result: Experiments show that Veritas achieves significant gains across different out-of-distribution scenarios and delivers transparent and faithful detection outputs. Conclusion: HydraFake and Veritas offer a deepfake detection solution closer to practical use, with superior generalization. Abstract: Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with a rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate the human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on the HydraFake dataset reveal that although previous detectors show great generalization on cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.
[141] FakeParts: a New Family of AI-Generated DeepFakes
Gaetan Brison,Soobash Daiboo,Samy Aimeur,Awais Hussain Sani,Xi Wang,Gianni Franchi,Vicky Kalogeiton
Main category: cs.CV
TL;DR: This paper introduces FakeParts, a new type of deepfake, along with resources and methods for detecting it.
Details
Motivation: To identify an urgent vulnerability in current deepfake detection approaches and provide the resources needed to develop more robust methods for partial video manipulations. Method: The paper introduces FakePartsBench, a large-scale benchmark dataset designed to capture the full spectrum of partial deepfakes, and conducts user studies to assess detection accuracy. Result: FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, and state-of-the-art detection models show similar performance degradation. Conclusion: FakeParts is introduced as a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of authentic videos, which makes them difficult to detect. Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.
[142] Multi-View 3D Point Tracking
Frano Rajič,Haofei Xu,Marko Mihajlovic,Siyuan Li,Irem Demir,Emircan Gündoğdu,Lei Ke,Sergey Prokudin,Marc Pollefeys,Siyu Tang
Main category: cs.CV
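TL;DR: This paper proposes a data-driven multi-view 3D point tracker that performs robust and accurate online tracking of arbitrary points in dynamic scenes using a small number of cameras (e.g., four). It addresses both the difficulty existing monocular trackers have with depth ambiguity and occlusion and the reliance of traditional multi-camera methods on many cameras and complex optimization.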
Details
Motivation: Existing monocular trackers struggle with depth ambiguity and occlusion, while traditional multi-camera methods require more than 20 cameras and tedious per-sequence optimization. This paper aims to address these problems by achieving robust and accurate online 3D point tracking with a practical number of cameras. Method: The method fuses multi-view features into a unified point cloud and combines k-nearest-neighbors correlation with a transformer-based update to estimate long-range 3D correspondences. It is trained on 5,000 synthetic multi-view Kubric sequences and evaluated on two real-world benchmarks, Panoptic Studio and DexYCB. Result: The method achieves median trajectory errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB, generalizes well to camera setups of 1-8 views, and handles video lengths of 24-150 frames. Conclusion: The paper presents a practical and efficient multi-view 3D point tracking method that aims to set a new standard for multi-view 3D tracking research and to provide a tool suited to real-world applications. Abstract: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
[143] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
Yuan Gong,Xionghui Wang,Jie Wu,Shiyin Wang,Yitong Wang,Xinglong Wu
Main category: cs.CV
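TL;DR: This paper proposes OneReward, a unified reinforcement learning framework that uses a single reward model to improve generative capabilities across multiple tasks and evaluation criteria. Building on it, the authors develop Seedream 3.0 Fill, a mask-guided generation model that requires no task-specific supervised fine-tuning.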
Details
Motivation: Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Method: Using a single vision-language model (VLM) as the generative reward model, OneReward is applied to mask-guided image generation, with multi-task reinforcement learning performed directly on a pre-trained base model, eliminating the need for task-specific SFT. Result: Experiments show that Seedream 3.0 Fill, developed on OneReward, consistently outperforms commercial and open-source competitors such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro] across multiple evaluation dimensions. Conclusion: OneReward provides an effective unified reinforcement learning framework, and Seedream 3.0 Fill shows strong performance among multi-task generation models. Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
[144] Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang
Main category: cs.CV
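TL;DR: Dress&Dance is a diffusion framework for generating virtual try-on videos. It uses a new conditioning network, CondNet, to unify multi-modal inputs and deliver a high-quality, flexible try-on experience.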
Details
Motivation: To generate high-quality virtual try-on videos in which a user wears desired garments while moving in accordance with a given reference video, while improving garment registration and motion fidelity. Method: The Dress&Dance framework uses a new conditioning network, CondNet, which applies attention to unify multi-modal inputs (text, images, and videos), and is trained in a multistage progressive manner on a combination of limited video data and a larger image dataset. Result: Dress&Dance generates high-quality 5-second, 24 FPS virtual try-on videos at 1152x720 resolution, and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous try-on of tops and bottoms in a single pass. Conclusion: Dress&Dance provides a high-quality, flexible virtual try-on experience and outperforms existing open-source and commercial solutions. Abstract: We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.
[145] First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge
Fahad Shamshad,Tameem Bakr,Yahia Shaaban,Noor Hussein,Karthik Nandakumar,Nils Lukas
Main category: cs.CV
TL;DR: This paper presents a winning solution to the NeurIPS 2024 challenge, demonstrating highly effective watermark removal techniques with minimal impact on image quality.