cs.CL

Lance Calvin Lim Gamboa,Yue Feng,Mark Lee

Main category: cs.CL

TL;DR: This paper reviews research on social bias in multilingual NLP models, highlighting methodological gaps and proposing more inclusive, cross-cultural approaches aligned with current NLP advancements.

Details

Motivation: The motivation is to understand and address the social biases present in multilingual and non-English contexts within NLP, as these biases mirror those seen in English-language models. Method: The study conducts a systematic review of existing research on multilingual bias evaluation and mitigation, analyzing methodological choices, linguistic diversity, cultural awareness, evaluation metrics, and mitigation techniques. Result: The review identifies gaps in methodological design such as language preference and limited multilingual mitigation experiments, along with common issues and solutions in adapting bias benchmarks across languages and cultures. Conclusion: The study concludes that there is a need for more inclusive and cross-culturally appropriate methodologies in multilingual bias research, aligned with advancements in NLP. Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

[2] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

Mohammad Amini,Babak Ahmadi,Xiaomeng Xiong,Yilin Zhang,Christopher Qiao

Main category: cs.CL

TL;DR: This study demonstrates how structured prompting and efficient fine-tuning can enhance mid-sized language models for automatic test question generation, offering a scalable and practical solution for K-12 language assessments.

Details

Motivation: The motivation of this study is to reduce the cost and inconsistency associated with manual test development by exploring the use of language models for automatic generation of multiple-choice questions (MCQs) for morphological assessment. Method: The study employed a two-fold approach: comparing a fine-tuned medium model (Gemma, 2B) with a larger untuned model (GPT-3.5, 175B), and evaluating seven structured prompting strategies. Generated items were assessed using both automated metrics and expert scoring across five dimensions. Additionally, GPT-4.1 was trained on expert-rated samples to simulate human scoring at scale. Result: Structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's output quality. Gemma produced more construct-aligned and instructionally appropriate items compared to GPT-3.5's zero-shot responses, with prompt design being crucial for mid-size model performance. Conclusion: This study concludes that structured prompting and efficient fine-tuning can enhance mid-sized models for automatic item generation (AIG) in language assessment, particularly under limited data conditions. The integration of automated metrics, expert judgment, and large-model simulation ensures alignment with assessment goals, offering a scalable and practical workflow for K-12 language assessment item development and validation. Abstract: This study explores automatic item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations.
Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.

[3] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach

Andrei Mihai Albu,Giovanni Pollo,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Alessandra Neri,Dario Soldi,Fabio Autieri,Sara Vinco

Main category: cs.CL

TL;DR: This paper proposes an open-source methodology to integrate SystemC TLM models into FMI-based co-simulation workflows by encapsulating them as FMI 3.0 Co Simulation FMUs, enabling seamless cross-domain integration.

Details

Motivation: The motivation is the growing complexity of cyber-physical systems in automotive applications, which demands efficient modeling and cross-domain co-simulation techniques. SystemC TLM's limited interoperability with other domains creates integration challenges. Method: The methodology involves encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs) and using a lightweight open-source toolchain to address challenges like time synchronization and data exchange. Result: The result is a fully open-source methodology that successfully integrates SystemC TLM models into FMI-based co-simulation workflows, demonstrated through representative case studies to be feasible and effective. Conclusion: The paper concludes that by encapsulating SystemC TLM components as FMI 3.0 Co Simulation FMUs, a fully open-source methodology enables efficient integration of SystemC TLM models into FMI-based co-simulation workflows. Abstract: The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.

[4] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge,Mai Nishimura,Jiaxin Ma

Main category: cs.CL

TL;DR: DGPO makes agentic RAG behaviors achievable for compact language models in compute-constrained environments.

Details

Motivation: Small language models (e.g., 0.5B parameters) have poor reasoning ability, which leads to sparse rewards and unstable training and makes it hard for them to acquire agentic RAG behaviors through reinforcement learning. Method: The paper proposes DGPO, which uses cold-start initialization from teacher demonstrations and provides continuous teacher guidance during policy optimization. Result: Experiments show that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model. Conclusion: DGPO offers a feasible approach to agentic RAG in compute-constrained environments. Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

[5] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin,Ruoxi Chen,Peiyan Zhang,Andy Zhou,Yang Zhang,Haohan Wang

Main category: cs.CL

TL;DR: GUARD translates AI ethics guidelines into actionable tests to identify guideline violations and jailbreak scenarios in LLMs, validated across multiple models and adaptable to vision-language models.

Details

Motivation: There is a gap in translating high-level government ethics guidelines for AI into actionable testing questions to verify compliance in LLMs. Method: GUARD generates guideline-violating questions and integrates jailbreak diagnostics (GUARD-JD) to provoke unethical responses, assessing LLM compliance with government guidelines. Result: GUARD was validated on seven LLMs, effectively identifying guideline violations and jailbreak scenarios, and was shown to transfer to vision-language models. Conclusion: GUARD is an effective method for operationalizing government guidelines into actionable testing questions, identifying guideline violations and potential jailbreak scenarios in LLMs. Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies.
Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

[6] Joint Enhancement of Relational Reasoning for Long-Context LLMs

Zhirui Chen,Wei Shen,Jiashui Huang,Ling Shao

Main category: cs.CL

TL;DR: JERR improves large language models' ability to handle long contexts and complex reasoning tasks through graph-based reasoning.

Details

Motivation: LLMs struggle with long contexts due to memory limitations, lack of transparency, and hallucinations, which JERR aims to address. Method: JERR integrates synopsis extraction, graph construction, and relational reasoning using Monte Carlo Tree Search (MCTS). Result: JERR outperforms all baselines on ROUGE and F1 metrics and achieves the highest scores on LLM-Rater evaluation. Conclusion: JERR provides a solution to enhance long-context comprehension and complex reasoning tasks for LLMs, improving reliability and transparency. Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.

[7] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems

Yuyao Wang,Bowen Liu,Jianheng Tang,Nuo Chen,Yuhan Li,Qifan Zhang,Jia Li

Main category: cs.CL

TL;DR: This paper introduces NP-hard graph problems as a scalable synthetic training corpus for enhancing Long CoT reasoning in LLMs, presenting a two-stage post-training framework and demonstrating the strong generalization of their model across various domains.

Details

Motivation: The development of Long CoT behaviors in RLLMs relies heavily on post-training with high-quality datasets, which are typically costly and human-curated. This work explores scalable alternatives by introducing NP-hard (NPH) graph problems as a novel synthetic training corpus. Method: A two-stage post-training framework was developed: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances and (ii) Reinforcement Learning (RL) with a fine-grained reward design. Result: The Graph-R1-7B model demonstrated strong generalization across mathematics, coding, STEM, and logic, surpassing QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. Conclusion: NPH graph problems are identified as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. 
Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
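
One reason NP-hard graph problems can serve as a synthetic corpus is that a proposed solution is cheap to check even when finding one is hard, which makes rejection sampling and reward computation straightforward. The sketch below illustrates that idea only. The abstract does not name the specific problems or the reward formula, so minimum vertex cover, the brute-force reference solver, and the `reward` function are all hypothetical stand-ins.

```python
from itertools import combinations

def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in the proposed cover."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)

def min_cover_size(nodes, edges):
    """Brute-force optimum; only feasible for the tiny toy graphs used here."""
    for k in range(len(nodes) + 1):
        for candidate in combinations(nodes, k):
            if is_vertex_cover(edges, candidate):
                return k
    return len(nodes)

def reward(nodes, edges, proposed):
    """Hypothetical graded reward: 0 for an invalid answer, otherwise the
    ratio of the optimal cover size to the proposed cover size."""
    proposed = set(proposed)
    if not proposed <= set(nodes) or not is_vertex_cover(edges, proposed):
        return 0.0
    if not proposed:
        return 1.0  # an empty cover is only valid when the graph has no edges
    return min_cover_size(nodes, edges) / len(proposed)

# Toy instance: a path graph 1-2-3-4, whose minimum vertex cover has size 2.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
optimal = reward(nodes, edges, [2, 3])         # valid and optimal
suboptimal = reward(nodes, edges, [1, 2, 3])   # valid but larger than needed
invalid = reward(nodes, edges, [1, 4])         # leaves edge (2, 3) uncovered
```

A graded score of this kind distinguishes valid but wasteful answers from optimal ones, which is one plausible reading of a "fine-grained" reward; the paper's actual design may differ.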

[8] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

Jivnesh Sandhan,Fei Cheng,Tushar Sandhan,Yugo Murawaki

Main category: cs.CL

TL;DR: This paper introduces the Context-Aware Personality Evaluation (CAPE) framework for Large Language Models (LLMs), showing that conversational history significantly influences their personality and response consistency. Traditional methods ignore context, leading to artificial evaluations. The study finds that GPT models are more robust and have intrinsic traits, while Gemini and Llama heavily depend on context. Applying CAPE to RPAs improves performance and human alignment.

Details

Motivation: The motivation behind this research is to address the limitations of traditional psychometric testing methods for Large Language Models (LLMs), which typically ignore the influence of conversational context. The authors aim to bridge the gap between artificial testing environments and real-world applications where context shapes responses. Method: The researchers introduced the Context-Aware Personality Evaluation (CAPE) framework, which incorporates prior conversational interactions to assess LLMs. They conducted extensive experiments on 7 LLMs, using novel metrics to quantify response consistency and analyze personality shifts. The study also evaluated the impact of question ordering and the extent to which different models rely on intrinsic traits versus prior interactions. Result: Experiments revealed that conversational history enhances response consistency through in-context learning but also causes personality shifts. GPT-3.5-Turbo and GPT-4-Turbo exhibited extreme deviations, while Gemini-1.5-Flash and Llama-8B showed significant sensitivity to prior interactions. The study also found that applying the CAPE framework to Role Playing Agents (RPAs) improved response consistency and alignment with human judgments. Conclusion: The study concludes that context plays a significant role in shaping the personality and response consistency of Large Language Models (LLMs), and that the proposed Context-Aware Personality Evaluation (CAPE) framework effectively captures these effects. The findings highlight the importance of context in evaluating LLMs, particularly for applications such as Role Playing Agents (RPAs), where alignment with human judgments is crucial. Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. 
We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models' responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE

[9] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

Xu Guo

Main category: cs.CL

TL;DR: This paper explores how monitoring conditional entropy during reasoning steps can predict the correctness of the final answer in large language models, showing that decreasing entropy is associated with correct answers and that longer reasoning paths do not always lead to better outcomes.

Details

Motivation: The motivation is to understand how reasoning utility contributes to the final answer's correctness in large language models, as generating more context does not guarantee increased confidence in the answer. Method: An oracle study was conducted on the MATH dataset using Qwen2.5-32B and GPT-4o to generate reasoning chains. A separate model (Qwen3-8B) was used to quantify the utility of these chains by measuring the model's uncertainty on the answer span Y at each step using conditional entropy. Result: The results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. Additionally, incorrect reasoning paths tend to be longer than correct ones. Conclusion: The study concludes that monitoring conditional entropy during reasoning steps can indicate the correctness of the final answer, and that incorrect reasoning paths tend to be longer, suggesting that more reasoning steps do not necessarily lead to better outcomes. Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy.
Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
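
The measurement described above can be sketched in a few lines. This is a minimal illustration rather than the paper's pipeline: it assumes the scoring model's next-token distributions over the answer span are already available as plain probability lists, and the toy numbers, the averaging over answer positions, and the first-versus-last trend rule are hypothetical choices.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def answer_entropy(step_distributions):
    """Mean entropy over the answer-span token positions for one reasoning step."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def entropy_trend(entropies, tol=1e-9):
    """Label a trajectory by comparing its last entropy with its first."""
    delta = entropies[-1] - entropies[0]
    if delta < -tol:
        return "decreasing"
    if delta > tol:
        return "increasing"
    return "flat"

# Toy data: one answer token, a 4-word vocabulary, three reasoning steps.
# Each step holds the (hypothetical) distribution over the answer token
# after conditioning on the reasoning generated so far.
steps = [
    [[0.25, 0.25, 0.25, 0.25]],  # no reasoning yet: maximally uncertain
    [[0.70, 0.10, 0.10, 0.10]],  # after step 1
    [[0.97, 0.01, 0.01, 0.01]],  # after step 2
]
trajectory = [answer_entropy(s) for s in steps]
label = entropy_trend(trajectory)
```

In the abstract's terms, a trajectory labeled "decreasing" is the pattern associated with correct answers, while "flat" or "increasing" trajectories would be candidates for early stopping or pruning.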

[10] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

Sam Jung,Agustin Garcinuno,Spencer Mateega

Main category: cs.CL

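TL;DR: This paper introduces UI-Bench, a large-scale benchmark for evaluating the visual quality of AI-generated applications and websites, aiming to provide a reproducible standard and advance AI-driven web design.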

Details

Motivation: AI text-to-app tools claim to generate high-quality applications and websites within minutes, but no public benchmark currently verifies these claims. Method: The benchmark uses expert pairwise comparisons across 10 tools, 30 prompts, and 300 generated sites, drawing on more than 4,000 expert judgments, and ranks systems with a TrueSkill-derived model. Result: UI-Bench is the first large-scale benchmark for evaluating the visual quality of different AI text-to-app tools; the authors release the complete prompt set, an open-source evaluation framework, and a public leaderboard. Conclusion: UI-Bench provides a reproducible standard for evaluating the visual quality of AI text-to-app tools and introduces a TrueSkill-based ranking model. Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
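
To make the idea of ranking from pairwise judgments concrete, the sketch below fits a Bradley-Terry model, a simpler relative of TrueSkill. It is not the TrueSkill-derived model the benchmark uses and it produces no confidence intervals; the tool names and win counts are invented for illustration.

```python
def bradley_terry(wins, items, iters=200):
    """Fit Bradley-Terry strengths with the classic iterative (MM) update.

    wins[(a, b)] is the number of times a was preferred over b.
    Assumes every item has at least one win and one loss.
    """
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        # Rescale so the strengths keep an average of 1.0.
        norm = sum(updated.values())
        strength = {i: s * len(items) / norm for i, s in updated.items()}
    return strength

# Hypothetical expert judgments between three tools (10 comparisons per pair).
items = ["A", "B", "C"]
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strength = bradley_terry(wins, items)
ranking = sorted(items, key=lambda i: strength[i], reverse=True)
```

A Bayesian rating system such as TrueSkill additionally tracks uncertainty per system, which is what allows a leaderboard to report intervals rather than point estimates.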

[11] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

Hengchuan Zhu,Yihuan Xu,Yichen Li,Zijie Meng,Zuozhu Liu

Main category: cs.CL

TL;DR: This paper introduces DentalBench, a dental-specific benchmark for evaluating LLMs, highlighting the importance of domain adaptation and targeted evaluation in healthcare applications.

Details

Motivation: The motivation stems from the lack of targeted evaluation resources for LLMs in specialized medical fields like dentistry, despite their success in general medical benchmarks. Method: The authors introduced DentalBench, a bilingual benchmark for the dental domain, consisting of DentalQA and DentalCorpus. They evaluated 14 LLMs and conducted experiments on domain adaptation techniques. Result: The evaluation revealed significant performance gaps across task types and languages. Domain adaptation improved performance, especially for knowledge-intensive tasks. Conclusion: The study concludes that domain-specific benchmarks like DentalBench are crucial for advancing and evaluating LLMs in specialized healthcare fields like dentistry, and that domain adaptation significantly enhances model performance. Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages.
Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.

[12] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval

Chi Minh Bui,Ngoc Mai Thieu,Van Vinh Nguyen,Json J. Jung,Khac-Hoai Nam Bui

Main category: cs.CL

TL;DR: KG-CQR significantly improves the performance of retrieval-augmented generation systems by using knowledge graphs to enrich query context.

Details

Motivation: Existing methods mainly address corpus-level context loss, whereas this work focuses on enriching queries through structured relation representations to improve the retrieval phase. Method: KG-CQR uses knowledge-graph-based subgraph extraction, completion, and contextual generation modules to enrich the contextual representation of queries. Result: Experiments show that KG-CQR achieves a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 on the RAGBench and MultiHop-RAG datasets. Conclusion: KG-CQR is a novel framework that improves retrieval effectiveness and is particularly suited to complex queries and multi-hop question answering. Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that, by incorporating KG-CQR, the performance consistently outperforms the existing baseline in terms of retrieval effectiveness.
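
For readers unfamiliar with the two retrieval metrics reported above, the following sketch shows one common way to compute them. The document IDs are made up, and the choice to normalize average precision by the total number of relevant documents is an assumption, since definitions vary between evaluation toolkits and the abstract does not specify one.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """mAP: average precision averaged over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy query: relevant documents d1 and d2 are retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
ap = average_precision(ranked, relevant)      # (1/2 + 2/4) / 2 = 0.5
r_at_2 = recall_at_k(ranked, relevant, k=2)   # 1 of 2 relevant docs found
r_at_4 = recall_at_k(ranked, relevant, k=4)   # both relevant docs found
```

Recall@25 in the abstract corresponds to `recall_at_k` with `k=25`; a query-enrichment method improves these scores when it moves relevant passages higher in the ranked list.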

[13] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance

Feng Zhang,Chengjie Pang,Yuehan Zhang,Chenyu Luo

Main category: cs.CL

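TL;DR: This paper proposes an industrial-grade benchmark for civil aviation maintenance that evaluates large language models (LLMs) in this domain, focusing on gaps in domain knowledge and complex reasoning in order to guide improvement.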

Details

Motivation: Civil aviation maintenance is governed by stringent standards, and its maintenance procedures and troubleshooting are knowledge-intensive tasks that require advanced reasoning. Evaluation tools for LLMs in this domain are currently lacking, so a dedicated benchmark is needed. Method: The authors develop an industrial-grade benchmark for evaluating LLMs in civil aviation maintenance and use it to analyze the performance of existing vector embedding models and LLMs. Result: Experimental exploration and analysis demonstrate the benchmark's effectiveness in assessing model performance, and the benchmark and code are open-sourced to foster further research and development. Conclusion: The paper fills a gap in LLM evaluation for civil aviation maintenance and lays a foundation for improving intelligent solutions in this domain. Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark

[14] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method

Agung Sukrisna Jaya,Osvari Arsalan,Danny Matthew Saputra

Main category: cs.CL

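TL;DR: This study develops a case-based reasoning system for searching practical work titles, using TF-IDF and cosine similarity to compute the similarity between titles.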

Details

Motivation: To provide a practical way of finding practical work titles based on their similarity to past cases. Method: TF-IDF is used to vectorize practical work titles, and cosine similarity is used to compute similarity values. Result: The system can search for practical work titles by title or by keyword and returns a match value for each result. Conclusion: In the second stage of testing, the system found the same number of titles and obtained the highest average match score. Abstract: Case Base Reasoning (CBR) is a case-solving technique that draws on the most similar cases encountered before. CBR is used to search for practical work titles. TF-IDF is applied to vectorize the words of each practical work title, and cosine similarity is used to calculate similarity values. The system accepts either titles or keywords as queries. The output of the system is the title of practical work and the match value of each title. The system was tested on 705 practical work titles, using five query titles in two stages. The first stage searches with existing titles and the second stage randomizes the title from the first stage. In the second stage, the system found the same number of titles and obtained the highest average match score.
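As a rough illustration of the TF-IDF plus cosine similarity retrieval described above, the following is a minimal standard-library sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names (`build_tfidf`, `cosine`, `search`) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's preprocessing is not specified.
    return text.lower().split()


def build_tfidf(titles):
    """Return (idf, vectors) where each vector maps a word to its TF-IDF weight."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    # Smoothed inverse document frequency (one common variant, chosen here for illustration).
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, titles, top_k=3):
    """Rank stored titles by cosine similarity to a title or keyword query."""
    idf, vectors = build_tfidf(titles)
    # Words that never occur in the stored titles are ignored.
    q = {w: tf * idf[w] for w, tf in Counter(tokenize(query)).items() if w in idf}
    scored = [(cosine(q, vec), title) for vec, title in zip(vectors, titles)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because the representation is a bag of words, a query containing the same words as a stored title in a different order receives the same match value as the original title.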

[15] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Zhenting Wang,Qi Chang,Hemani Patel,Shashank Biju,Cheng-En Wu,Quan Liu,Aolin Ding,Alireza Rezazadeh,Ankit Shah,Yujia Bao,Eugene Siow

Main category: cs.CL

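TL;DR: MCP-Bench is a new benchmark for evaluating the tool-use and planning abilities of large language models on complex multi-step tasks; results show that existing models still struggle in these areas.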

Details

Motivation: Existing API-based benchmarks cannot fully assess LLM capabilities in tool use, cross-domain coordination, and planning for complex tasks, so a new benchmark is needed to fill this gap. Method: Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Result: Experiments show that MCP-Bench poses persistent challenges for 20 advanced LLMs, indicating that these models still struggle with multi-step task solving and tool coordination. Conclusion: MCP-Bench is a new benchmark for evaluating LLMs on multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning, and it targets capabilities that existing benchmarks do not adequately evaluate. Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.

[16] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques

Yucheng Ruan,Xiang Lan,Daniel J. Tan,Hairil Rizal Abdullah,Mengling Feng

Main category: cs.CL

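TL;DR: This study proposes a deep learning framework that uses natural language processing techniques to integrate multimodal electronic health records for predicting mortality and resource utilization in critical care.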

Details

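Motivation: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in the intensive care unit (ICU). Existing approaches focus mainly on structured EHRs and often ignore the valuable clinical insights in free-text notes. In addition, the potential of textual information within structured data is not fully leveraged. Method: Using two real-world EHR datasets, the authors develop and evaluate their model on three clinical tasks and compare it with leading existing methods. They also run an ablation study on three key components of the framework: medical prompts, free text, and the pre-trained sentence encoder. Finally, they assess the model's robustness to corruption in structured EHRs. Result: Experiments on two real-world datasets show that the proposed model improves on the best existing methods by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for length-of-stay (LOS) prediction, and 10.9%/11.0% on RMSE/MAE for surgical duration estimation. It consistently outperforms other baselines across the three tasks at different corruption rates. Conclusion: The framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of prompt learning with a transformer encoder for analyzing multimodal EHRs. Notably, the model shows strong resilience to data corruption in structured data, especially at high corruption levels. Abstract: Background Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in the intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks against leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Results Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods.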
It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.
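For readers unfamiliar with the metrics quoted in the results, the sketch below gives the standard textbook definitions of balanced accuracy (BACC), RMSE, and MAE. It is only an illustration: the paper's evaluation code is not shown in the abstract, the function names are chosen for this example, and AUROC is omitted for brevity.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; robust to class imbalance (e.g. few deaths, many survivors)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (e.g. days or minutes)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

A classifier that always predicts the majority class can reach high plain accuracy on imbalanced data, but its balanced accuracy on a two-class problem is only 0.5, which is why BACC is a common choice for mortality prediction.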

[17] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety

Luke Bates,Max Glockner,Preslav Nakov,Iryna Gurevych

Main category: cs.CL

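TL;DR: This paper presents the ConspirED dataset for analyzing the cognitive traits of conspiratorial content and shows the limitations of computational models and AI systems in handling such content.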

Details

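Motivation: Conspiracy theories erode public trust in science and institutions and keep evolving by absorbing counter-evidence, which makes debunking difficult. As AI-generated misinformation grows more sophisticated, studying rhetorical patterns in conspiratorial content is important for developing interventions (such as targeted prebunking) and assessing AI vulnerabilities. Method: The authors introduce ConspirED (CONSPIR Evaluation Dataset), annotating multi-sentence excerpts from online conspiracy articles with the CONSPIR cognitive framework. They use the dataset to develop computational models that identify conspiratorial traits and to evaluate the robustness of large language/reasoning models to conspiratorial inputs. Result: ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using it, the authors develop computational models that identify conspiratorial traits in text excerpts, and find that large language/reasoning models become misaligned when given conspiratorial content, producing output that mirrors the input's reasoning patterns. Conclusion: The analysis shows that both computational models and large language/reasoning models are misaligned by conspiratorial content: their output mirrors input reasoning patterns even when they successfully deflect comparable fact-checked misinformation. Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80-120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.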

[18] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark

Chihiro Taguchi,Seng Mai,Keita Kurabe,Yusuke Sakai,Georgina Agyei,Soudabeh Eslami,David Chiang

Main category: cs.CL

TL;DR: The paper analyzes the FLORES+ benchmark for multilingual machine translation, finding that its data quality and domain specificity undermine its suitability for real-world translation evaluation.

Details

Motivation: The motivation is to assess the suitability of the FLORES+ benchmark for truly multilingual evaluation due to potential shortcomings in its quality and domain specificity. Method: The authors studied data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and conducted human assessments and analysis of simple heuristics to evaluate the FLORES+ benchmark. Result: Human assessments reveal that many translations fall below the claimed 90% quality standard, and the study shows that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on the domain-relevant evaluation set. Conclusion: The paper concludes that multilingual MT benchmarks should use domain-general and culturally neutral source texts that rely less on named entities to better reflect real-world translation challenges. Abstract: Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set.
Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.

[19] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

Pengjiang Li,Zaitian Wang,Xinhao Zhang,Ran Zhang,Lu Jiang,Pengfei Wang,Yuanchun Zhou

Main category: cs.CL

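TL;DR: This paper proposes SciTopic, a method that uses large language models to improve scientific topic identification, and validates its superior performance on three real-world datasets.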

Details

Motivation: Most existing topic discovery methods rely on word embeddings to capture semantics, lack a comprehensive understanding of scientific publications, and struggle with complex, high-dimensional text relationships. Method: SciTopic first builds a textual encoder to capture the content of scientific publications, including metadata, title, and abstract. It then constructs a space optimization module that combines entropy-based sampling with LLM-guided triplet tasks, sharpening the focus on thematic relevance and on contextual intricacies between ambiguous instances. Finally, it fine-tunes the textual encoder under LLM guidance by optimizing a triplet contrastive loss, so that the encoder better discriminates instances of different topics. Result: Experiments show that SciTopic outperforms state-of-the-art scientific topic discovery methods on three real-world datasets of scientific publications, enabling researchers to gain deeper and faster insights. Conclusion: By integrating LLM guidance, SciTopic offers an improved approach to scientific topic identification and outperforms existing state-of-the-art scientific topic discovery methods on three real-world datasets of scientific publications. Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics.
Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
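The abstract names entropy-based sampling and a triplet contrastive loss without giving formulas. The sketch below shows one plausible reading using standard ingredients: Shannon entropy over soft cluster assignments to pick ambiguous instances, and the usual margin-based triplet loss. The distance-based softmax, the margin value, and all function names are assumptions for illustration, not SciTopic's actual design; in the described pipeline an LLM would decide which candidate is the positive and which is the negative, a step this sketch does not model.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assignment_entropy(embedding, centroids):
    """Entropy of a soft topic assignment; high values mean the instance is ambiguous."""
    # Assumption: assignment probabilities come from a softmax over negative squared distances.
    probs = softmax([-sq_dist(embedding, c) for c in centroids])
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_ambiguous(embeddings, centroids, k):
    """Indices of the k instances with the highest assignment entropy."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: assignment_entropy(embeddings[i], centroids),
        reverse=True,
    )
    return order[:k]


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on Euclidean distances."""
    d_pos = math.sqrt(sq_dist(anchor, positive))
    d_neg = math.sqrt(sq_dist(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so gradient updates concentrate on triplets the encoder still confuses.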

[20] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Salvador Lima-López,Eulàlia Farré-Maduell,Martin Krallinger,Natalia Loukachevitch,Vera Davydova,Elena Tutubalina,Georgios Paliouras

Main category: cs.CL

TL;DR: The BioASQ 2024 challenge advanced biomedical semantic indexing and question answering, with 37 teams submitting over 700 entries and achieving strong results.

Details

Motivation: To present the outcomes and developments from the twelfth edition of the BioASQ challenge, highlighting its role in advancing biomedical semantic indexing and question answering. Method: The paper provides an overview of the BioASQ challenge, detailing its tasks, participant performance, and contributions to the field. Result: Thirty-seven teams participated in BioASQ 2024, submitting over 700 entries across four tasks, with most systems achieving competitive performance. Conclusion: The twelfth edition of the BioASQ challenge successfully promoted advancements in biomedical semantic indexing and question answering, with significant team participation and competitive system performances. Abstract: This is an overview of the twelfth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and two new tasks: a) MultiCardioNER on the adaptation of clinical entity detection to the cardiology domain in a multilingual setting, and b) BIONNE on nested NER in Russian and English. In this edition of BioASQ, 37 competing teams participated with more than 700 distinct submissions in total for the four different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.

[21] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Martin Krallinger,Miguel Rodríguez-Ortega,Eduard Rodriguez-López,Natalia Loukachevitch,Andrey Sakhovskiy,Elena Tutubalina,Dimitris Dimitriadis,Grigorios Tsoumakas,George Giannakoulas,Alexandra Bekiaridou,Athanasios Samaras,Giorgio Maria Di Nunzio,Nicola Ferro,Stefano Marchesin,Marco Martinelli,Gianmaria Silvello,Georgios Paliouras

Main category: cs.CL

TL;DR: The BioASQ challenge continues to drive progress in biomedical semantic indexing and question answering through diverse tasks and widespread participation.

Details

Motivation: To promote advances in large-scale biomedical semantic indexing and question answering by organizing a series of international challenges. Method: The challenge featured six shared tasks, including new editions of established tasks and four new tasks focusing on multilingual clinical summarization, nested named entity linking, clinical coding in cardiology, and gut-brain interplay information extraction. Result: Several participating systems achieved competitive performance, highlighting the progress in the state-of-the-art in the field. Conclusion: The thirteenth edition of the BioASQ challenge witnessed continuous advancement in biomedical semantic indexing and question answering with the participation of 83 competing teams and over 1000 submissions. Abstract: This is an overview of the thirteenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2025. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks, b and Synergy, and four new tasks: a) Task MultiClinSum on multilingual clinical summarization. b) Task BioNNE-L on nested named entity linking in Russian and English. c) Task ELCardioCC on clinical coding in cardiology. d) Task GutBrainIE on gut-brain interplay information extraction. In this edition of BioASQ, 83 competing teams participated with more than 1000 distinct submissions in total for the six different shared tasks of the challenge. Similar to previous editions, several participating systems achieved competitive performance, indicating the continuous advancement of the state-of-the-art in the field.

[22] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data

Jiahao Xiao,Jiangming Liu

Main category: cs.CL

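TL;DR: This paper proposes AdaFD, a new federated learning framework that addresses the non-IID problem in multi-domain language data and shows superior performance in realistic settings.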

Details

Motivation: Existing federated learning research focuses mainly on label diversity and overlooks the diversity of language domains, which is crucial in natural language processing. To address this, the paper proposes multi-domain non-IID scenarios and an adaptive federated distillation framework. Method: The paper introduces a comprehensive set of multi-domain non-IID scenarios and proposes a unified benchmarking framework that includes diverse data, for evaluating federated learning in real environments. Result: Experiments show that the proposed AdaFD framework captures the diversity of local clients and outperforms existing methods. Conclusion: The paper proposes an Adaptive Federated Distillation (AdaFD) framework for multi-domain non-IID scenarios that addresses multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Abstract: The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data differ greatly from each other and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and extensively investigated. However, previous experimental non-IID scenarios are primarily identified with the label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate the federated learning framework in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to the existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.

[23] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search

Zeyu Xiong,Yixuan Nan,Li Gao,Hengzhu Tang,Shuaiqiang Wang,Junfeng Wang,Dawei Yin

Main category: cs.CL

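TL;DR: This paper proposes a new generative-model-based framework for query-driven text summarization that overcomes the limitations of traditional extractive models and delivers efficient, accurate real-time summaries.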

Details

Motivation: Traditional extractive summarization models in industrial applications suffer from cumulative information loss, architectural bottlenecks, and insufficient semantic understanding of user queries and documents, which calls for a more effective generative solution. Method: The authors propose a novel framework that combines large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to turn a lightweight model into a domain-specialized expert for real-time query-driven text summarization. Result: The model outperforms the production baseline on multiple industry-relevant metrics and achieves a new state of the art. It is also efficient to deploy, requiring only 334 NVIDIA L20 GPUs to handle about 50,000 queries per second at an average latency of 55 ms per query. Conclusion: Combining large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding can effectively turn a lightweight model into a domain expert and enable industrial-grade real-time query-driven text summarization. Abstract: In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.

[24] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling

Yangfan Wang,Jie Liu,Chen Tang,Lian Yan,Jingchi Jiang

Main category: cs.CL

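TL;DR: This paper proposes Knowledge Composition Sampling (KCS), a framework that generates diverse multi-hop questions by sampling varied knowledge compositions within a given context; it improves the overall accuracy of knowledge composition selection and performs well when used for data augmentation.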

Details

Motivation: Multi-hop question answering suffers from data sparsity, which makes models prone to learning spurious patterns, while existing approaches emphasize generating simple questions and neglect the integration of essential knowledge. Method: KCS models knowledge composition selection as a sentence-level conditional prediction task and uses a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, it employs a stochastic decoding strategy to balance accuracy and diversity. Result: Compared to competitive baselines, KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its use for data augmentation yields improvements on the HotpotQA and 2WikiMultihopQA datasets. Conclusion: The KCS framework performs well at generating diverse multi-hop questions; by integrating relevant knowledge and balancing accuracy with diversity, it helps address data sparsity in multi-hop question answering. Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.

[25] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models

Soham Petkar,Hari Aakash K,Anirudh Vempati,Akshit Sinha,Ponnurangam Kumarauguru,Chirag Agarwal

Main category: cs.CL

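TL;DR: Current evaluation benchmarks for graph-language models (GLMs) are insufficient for testing multimodal reasoning. This paper proposes a new benchmark, CLEGR, to assess GLMs' multimodal reasoning more accurately, and finds that some LLM-only baselines perform on par with methods that include a full graph neural network (GNN).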

Details

Motivation: Existing GLM evaluation benchmarks are mostly based on node classification tasks and do not genuinely test joint reasoning over graph structure and language, so a sounder evaluation method is needed. Method: The authors design the CLEGR benchmark, which pairs a synthetic graph generation pipeline with questions that require joint reasoning over graph structure and semantics, and carry out a thorough evaluation of existing GLM architectures. Result: Soft-prompted LLM baselines that use only textual information perform on par with GNN-equipped GLMs on CLEGR, and GLMs degrade significantly on tasks that require structural reasoning. Conclusion: Current GLMs are limited in graph structural reasoning, and CLEGR offers a new direction for future research on multimodal graph-language reasoning. Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.

[26] Generative Annotation for ASR Named Entity Correction

Yuanchang Luo,Daimeng Wei,Shaojun Li,Hengchao Shang,Jiaxin Guo,Zongyao Li,Zhanglin Wu,Xiaoyu Chen,Zhiqiang Rao,Jinlong Yang,Hao Yang

Main category: cs.CL

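TL;DR: This paper proposes a new named entity correction method based on speech sound features that significantly improves entity accuracy when word forms differ, and the authors plan to open-source the associated test set and training data.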

Details

Motivation: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, and existing named entity correction models based on phonetic-level edit distance often cannot locate the wrongly transcribed words when the word forms differ substantially, which limits their use. Method: The approach retrieves candidate entities using speech sound features and uses a generative method to annotate entity errors and replace them. Result: Tested on open-source and self-constructed test sets, the method significantly improves entity accuracy; the authors also plan to open-source their self-constructed test set and training data. Conclusion: The paper proposes a new named entity correction method that retrieves candidate entities with speech sound features and uses a generative method to annotate entity errors in ASR transcripts and replace them with the correct entities, which effectively improves entity accuracy when word forms differ. Abstract: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performances. However, when the forms of the wrongly-transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.
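To make the baseline concrete, here is a minimal sketch of the kind of edit-distance matching that the abstract attributes to prior NEC models: a Levenshtein distance over symbol sequences and a sliding window that locates the hypothesis span closest to a known entity. It illustrates the prior approach, not the paper's generative method; the window strategy, the length normalization, and the function names are assumptions for this example, and real systems would operate on phoneme sequences produced by a pronunciation lexicon or grapheme-to-phoneme model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def best_span(hypothesis, entity):
    """Slide an entity-length window over the hypothesis and return
    (normalized_distance, start, end) for the closest window."""
    n = len(entity)
    best = None
    for start in range(max(1, len(hypothesis) - n + 1)):
        window = hypothesis[start:start + n]
        score = edit_distance(window, entity) / max(len(window), n)
        if best is None or score < best[0]:
            best = (score, start, start + len(window))
    return best
```

When the misrecognized words differ strongly in form from the true entity, every window has a large normalized distance and no span stands out, which is the failure mode the abstract describes.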

[27] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning

Nelson Filipe Costa,Leila Kosseim

Main category: cs.CL

TL;DR: This paper proposes HArch, a hierarchical multi-lingual model for implicit discourse relation recognition (IDRR), which outperforms large language models (LLMs) through task-specific fine-tuning and achieves state-of-the-art results on the DiscoGeM corpora.

Details

Motivation: The motivation for this work stems from the lack of effective multi-lingual and multi-label classification models for implicit discourse relation recognition (IDRR). The authors aim to leverage hierarchical dependencies between discourse senses and improve performance in multi-lingual settings, surpassing the capabilities of existing large language models (LLMs) through task-specific fine-tuning. Method: The authors developed HArch, a multi-lingual and multi-label classification model for IDRR, which leverages hierarchical dependencies between discourse senses. They evaluated it on the DiscoGeM 2.0 corpus and compared several pre-trained encoder backbones like RoBERTa and XLM-RoBERTa. They also compared their fine-tuned models against LLMs using few-shot prompting across multiple language configurations. Result: RoBERTa-HArch achieved the best performance in English, while XLM-RoBERTa-HArch performed best in the multi-lingual setting. The fine-tuned models consistently outperformed LLMs like GPT-4o and Llama-4-Maverick in IDRR tasks. The hierarchical approach was further validated by achieving SOTA results on the DiscoGeM 1.0 corpus. Conclusion: The paper concludes that task-specific fine-tuning of models like HArch outperforms few-shot prompting of large language models (LLMs) such as GPT-4o and Llama-4-Maverick in the context of implicit discourse relation recognition (IDRR). Additionally, the hierarchical approach used in HArch proves effective, as demonstrated by SOTA results on the DiscoGeM 1.0 corpus. Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. 
We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.

[28] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models

Ruiyi Yan,Yugo Murawaki

Main category: cs.CL

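TL;DR: This study examines tokenization inconsistency (TI) between Alice and Bob in text steganography and watermarking, and proposes two solutions to the problem that improve the robustness of steganography and watermarking.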

Details

Motivation: The widespread use of large language models has raised the quality of text generation and has also underscored the importance of watermarking as a safeguard against malicious misuse. Tokenization inconsistency can undermine the robustness of steganography and watermarking, so methods for addressing it are needed. Method: The study finds that the tokens responsible for TI share two characteristics, infrequency and temporariness, and on this basis proposes two solutions: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Result: Experiments show that directly addressing TI improves fluency, imperceptibility, and anti-steganalysis capacity in steganography, and enhances detectability and robustness against attacks in watermarking. Conclusion: By analyzing tokenization inconsistency, the study proposes effective solutions that improve the robustness and performance of steganography and watermarking. Abstract: Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
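Tokenization inconsistency can be illustrated with a toy example: Alice generates a token sequence, only the detokenized text is transmitted, and Bob's tokenizer splits that text differently. The sketch below uses a hypothetical four-token vocabulary and a greedy longest-match tokenizer as a stand-in for a real subword tokenizer (real LLM tokenizers typically use BPE merges instead). The `is_consistent` check is only an illustration of how TI can be detected; it is not the paper's stepwise verification or rollback method, which the abstract does not detail.

```python
# Hypothetical toy vocabulary, chosen so that "a" + "b" collides with the single token "ab".
VOCAB = {"a", "b", "ab", "c"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer (a toy stand-in for a real subword tokenizer)."""
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot tokenize text at position {i}")
    return tokens


def is_consistent(tokens, vocab=VOCAB):
    """True if re-tokenizing the detokenized text recovers the same token sequence."""
    return tokenize("".join(tokens), vocab) == list(tokens)


# Alice samples the tokens "a", "b", "c"; Bob only sees the string "abc".
alice_tokens = ["a", "b", "c"]
bob_tokens = tokenize("".join(alice_tokens))
```

Here Bob recovers `["ab", "c"]` rather than Alice's three tokens, so any hidden bits or watermark signal tied to Alice's token boundaries would be decoded against the wrong sequence.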

[29] rStar2-Agent: Agentic Reasoning Technical Report

Ning Shang,Yifei Liu,Yi Zhu,Li Lyna Zhang,Weijiang Xu,Xinyu Guan,Buze Zhang,Bingcheng Dong,Xudong Zhou,Bowen Zhang,Ying Xin,Ziming Miao,Scarlett Li,Fan Yang,Mao Yang

Main category: cs.CL

TL;DR: rStar2-Agent is a 14B math reasoning model trained with agentic reinforcement learning that achieves state-of-the-art performance on mathematical reasoning tasks and generalizes well to other tasks like alignment and scientific reasoning.

Details

Motivation: The motivation is to develop a math reasoning model capable of achieving frontier-level performance through agentic reinforcement learning. Method: The method involves an efficient RL infrastructure, GRPO-RoC algorithm, and a multi-stage training recipe starting with SFT. Result: rStar2-Agent achieves state-of-the-art results with average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25. Conclusion: rStar2-Agent-14B demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. 
Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.

[30] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees

Stephen Meisenbacher,Maulik Chevli,Florian Matthes

Main category: cs.CL

TL;DR: This paper introduces DP-ST, a method for private text generation under local differential privacy, showing improved coherence and balanced privacy-utility trade-offs, especially at lower ε values.

Details

Motivation: The motivation is to address the challenge of achieving reasonable privacy-utility trade-offs in local DP for NLP, particularly at low ε values where existing methods struggle. Method: The method involves using DP-ST to generate private documents under local DP guarantees, evaluated through a divide-and-conquer paradigm and enhanced with LLM post-processing. Result: The results show that DP-ST improves text generation coherence while maintaining privacy, demonstrating the effectiveness of the divide-and-conquer approach and highlighting the role of semantic structures in privatization. Conclusion: The study concludes that DP-ST, leveraging semantic triples and LLM post-processing, enables coherent text generation under local DP with balanced privacy and utility, especially effective at lower ε values. Abstract: Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. 
These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
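To make the role of $\varepsilon$ in local DP concrete, the sketch below implements k-ary randomized response, a standard textbook mechanism. It is not DP-ST and is not taken from the paper; the candidate list and function names are hypothetical. It reports the true item with probability e^ε / (e^ε + k − 1), so any two inputs produce a given output with probabilities that differ by at most a factor of e^ε.

```python
import math
import random


def keep_probability(epsilon: float, k: int) -> float:
    """Probability of reporting the true item under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(true_item, candidates, epsilon, rng=random):
    """Report true_item with probability p, otherwise a uniformly chosen other candidate.

    Assumes true_item is in candidates and len(candidates) >= 2.
    """
    k = len(candidates)
    if rng.random() < keep_probability(epsilon, k):
        return true_item
    others = [c for c in candidates if c != true_item]
    return rng.choice(others)


if __name__ == "__main__":
    # Hypothetical vocabulary; a lower epsilon means the true word is kept less often.
    words = ["doctor", "teacher", "pilot", "chef"]
    for eps in (0.5, 2.0, 8.0):
        print(eps, round(keep_probability(eps, len(words)), 3))
```

With few candidates and a low $\varepsilon$, the output is close to uniform noise, which illustrates why the abstract describes the local DP guarantee as demanding.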

[31] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

Vassiliy Cheremetiev,Quang Long Ho Ngo,Chau Ying Kot,Alina Elena Baia,Andrea Cavallaro

Main category: cs.CL

TL;DR: This paper shows that fine-tuning general-purpose embedding models based on large language models can achieve state-of-the-art performance in detecting implicit hate speech.

Details

Motivation: Implicit hate speech is difficult to detect because it uses subtle cues, sarcasm, or coded language instead of explicit derogatory words. Method: Fine-tuned recent general-purpose embedding models (Stella, Jasper, NV-Embed, and E5) for implicit hate speech detection. Result: Experiments on multiple IHS datasets showed up to 1.10 percentage points improvement in in-dataset evaluation and up to 20.35 percentage points improvement in cross-dataset evaluation in terms of F1-macro score. Conclusion: Fine-tuning general-purpose embedding models based on LLMs achieves state-of-the-art performance in detecting implicit hate speech without additional task-specific pipelines or information. Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show up to 1.10 percentage points improvements for in-dataset, and up to 20.35 percentage points improvements in cross-dataset evaluation, in terms of F1-macro score.
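The reported gains are in F1-macro, which averages per-class F1 scores without weighting by class size. The following is a minimal sketch of that metric in plain Python, assuming integer or string class labels; the function name and the toy labels are illustrative and do not come from the paper.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: 1 = implicit hate speech, 0 = not hateful.
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1]
    print(f1_macro(y_true, y_pred))
```

Because each class counts equally, a model that neglects the minority class is penalized more than it would be under accuracy or micro-averaged F1.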

[32] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation

Yuanhao Ding,Esteban Garces Arias,Meimingwei Li,Julian Rodemann,Matthias Aßenmacher,Danlu Chen,Gaojuan Fan,Christian Heumann,Chongsheng Zhang

Main category: cs.CL

TL;DR: This paper proposes GUARD, a self-adaptive decoding method for open-ended text generation that balances coherence and diversity efficiently, achieving better performance and faster speed than existing methods.

Details

Motivation: The challenge of balancing coherence and diversity in LLM outputs, which existing contrastive search-based methods struggle to address due to hyperparameter dependence and high computational costs. Method: GUARD uses a 'Glocal' uncertainty-driven framework combining global entropy estimates with local entropy deviations, along with a token-count-based penalty to reduce computational overhead. Result: GUARD effectively balances coherence and diversity, mitigates abrupt uncertainty variations with theoretical guarantees, and achieves faster generation speed and strong performance validated by human and LLM evaluators. Conclusion: GUARD is an effective and efficient decoding method for open-ended text generation that balances coherence and diversity while improving generation speed. Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. 
In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
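The abstract does not give GUARD's formulas, so the sketch below only illustrates the general idea of pairing a long-term entropy signal with a short-term deviation. It assumes, purely for illustration, that the global estimate is a running mean of per-step Shannon entropies and that the local signal is the current step's deviation from that mean; the authors' actual formulation may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def glocal_signals(step_distributions):
    """For each decoding step, return (global_entropy, local_deviation).

    global_entropy: running mean of entropies up to and including this step
                    (an assumed stand-in for a global entropy estimate).
    local_deviation: this step's entropy minus that running mean.
    """
    signals = []
    total = 0.0
    for t, probs in enumerate(step_distributions, start=1):
        h = entropy(probs)
        total += h
        global_h = total / t
        signals.append((global_h, h - global_h))
    return signals


if __name__ == "__main__":
    steps = [
        [0.25, 0.25, 0.25, 0.25],  # uncertain step
        [1.0, 0.0, 0.0, 0.0],      # sudden overconfidence
    ]
    for g, d in glocal_signals(steps):
        print(round(g, 4), round(d, 4))
```

A large negative deviation flags the kind of sudden overconfidence the abstract mentions, while the running mean changes more slowly.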

[33] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions

Xiaoyi Wang,Jiwei Zhang,Guangtao Zhang,Honglei Guo

Main category: cs.CL

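TL;DR: This study analyzes how LLM-generated synthetic therapy dialogues differ from real ones in emotional dynamics, finds that synthetic dialogues are limited in emotional fidelity, and introduces the RealCBT dataset.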

Details

Motivation: To examine whether LLM-generated synthetic therapy dialogues capture the emotional dynamics of real therapy. Method: The Utterance Emotion Dynamics framework is used to analyze emotional trajectories in real and LLM-generated Cognitive Behavioral Therapy dialogues, covering both full dialogues and individual speaker roles. Result: Although synthetic dialogues are fluent and structurally coherent, they differ from real dialogues in emotional properties: real dialogues show greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Conclusion: Synthetic dialogues are limited in emotional fidelity, which underscores the importance of emotional authenticity in mental health applications; the RealCBT dataset is introduced to support future research. Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.

[34] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

Harethah Abu Shairah,Hasan Abed Al Kader Hammoud,George Turkiyyah,Bernard Ghanem

Main category: cs.CL

TL;DR: This paper introduces ROSI, a lightweight and interpretable method to enhance LLM safety by steering model activations toward refusal-mediating directions, improving safety without compromising utility.

Details

Motivation: Safety mechanisms in LLMs can be bypassed by ablating specific representational directions, so this paper proposes an approach to strengthen safety by steering the model towards refusal-mediating directions. Method: ROSI uses a fine-tuning-free rank-one weight modification on residual stream write matrices, computed from harmful and harmless instruction pairs, to amplify safety alignment. Result: ROSI consistently increases safety refusal rates evaluated by Llama Guard 3 and maintains model utility on benchmarks like MMLU, HellaSwag, and Arc. It also re-aligns uncensored models by amplifying latent safety directions. Conclusion: ROSI is an effective and efficient method for enhancing the safety alignment of LLMs by steering activations toward a refusal-mediating subspace, offering a complementary approach to resource-intensive fine-tuning methods. Abstract: Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. 
Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
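A rank-one weight modification of this kind can be sketched as follows. The code assumes the safety direction is the normalized difference between mean activations on harmful and harmless instructions, and it uses the update W + α·s·(sᵀW), the mirror image of directional ablation. Both choices, and all names and the scale α, are assumptions for illustration; the abstract does not specify ROSI's exact update.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean activation to the harmful mean activation."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def inject(W, s, alpha):
    """Return W + alpha * s (s^T W), a rank-one update of a write matrix.

    W has shape (d_model, d_in) and s has length d_model. The output component
    along s is scaled by (1 + alpha); orthogonal components are unchanged.
    """
    rows, cols = len(W), len(W[0])
    s_t_w = [sum(s[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] + alpha * s[i] * s_t_w[j] for j in range(cols)] for i in range(rows)]


def matvec(W, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


if __name__ == "__main__":
    s = safety_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    W = [[1.0, 2.0], [3.0, 4.0]]
    print(matvec(W, [1.0, 1.0]), matvec(inject(W, s, 0.5), [1.0, 1.0]))
```

Since the change is baked into the weights, no extra computation is needed at inference time, which is consistent with the abstract's description of a fine-tuning-free modification.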

[35] Signs of Struggle: Spotting Cognitive Distortions across Language and Register

Abhishek Kuber,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah

Main category: cs.CL

TL;DR: This paper presents the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection in digital text, focusing on Dutch adolescents' forum posts.

Details

Motivation: Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress, particularly cognitive distortions, to enable timely and low-cost interventions. Method: The study analyzed forum posts written by Dutch adolescents to assess the cross-lingual and cross-register generalization of cognitive distortion detection. Result: The findings indicate that changes in language and writing style significantly affect model performance in detecting cognitive distortions. Conclusion: Domain adaptation methods show the most promise in detecting cognitive distortions in Dutch adolescents' forum posts despite changes in language and writing style. Abstract: Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.

[36] Exploring Machine Learning and Language Models for Multimodal Depression Detection

Javier Si Zhao Hong,Timothy Zoe Delaya,Sherwyn Chan Yin Kit,Pai Chet Ng,Xiaoxiao Miao

Main category: cs.CL

TL;DR: This paper compares XGBoost, transformer-based models, and LLMs for multimodal depression detection, analyzing their effectiveness in capturing depression-related signals from audio, video, and text data.

Details

Motivation: The motivation of the paper is to explore multimodal depression detection using machine learning and deep learning models as part of the first Multimodal Personality-Aware Depression Detection Challenge. Method: The paper compares the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Result: The results show how each model performs in detecting depression-related signals across modalities, highlighting their respective strengths and limitations. Conclusion: The paper concludes that different models have unique strengths and limitations in capturing depression-related signals across modalities, providing insights into effective multimodal representation strategies for mental health prediction. Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.

[37] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction

Jie Zhao,Wanting Ning,Yuxiao Fei,Yubo Feng,Lishuang Li

Main category: cs.CL

TL;DR: This paper proposes GDLLM, a global distance-aware modeling method for event temporal relation extraction, combining graph structures and soft inference to enhance LLM performance, achieving SOTA results.

Details

Motivation: SLMs have limited pre-trained knowledge, while LLMs face issues with manually designed prompts introducing noise and affecting judgment of event dependencies. This work aims to address these limitations. Method: GDLLM uses a distance-aware graph structure with GAT and a soft inference-based temporal feature learning paradigm to enhance LLMs' ability in capturing long-distance dependencies and identifying short-distance relations. Result: The approach achieves state-of-the-art performance on TB-Dense and MATRES datasets. Conclusion: The proposed GDLLM framework effectively captures global features and improves the performance on minority relation classes, achieving SOTA results on TB-Dense and MATRES datasets. Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relation between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model's judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing a Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism. 
Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.

[38] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

Rohan Phanse,Yijie Zhou,Kejian Shi,Wencai Zhang,Yixin Liu,Yilun Zhao,Arman Cohan

Main category: cs.CL

TL;DR: This paper presents a framework for evaluating RAG systems in multi-source information integration tasks, revealing the dependency of generation quality on retrieval effectiveness and the superior performance of reasoning models over standard LLMs.

Details

Motivation: The motivation is to evaluate retrieval-augmented systems in real-world applications that demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. Method: The authors present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. They build two new benchmarks, MSRS-Story and MSRS-Meet, and conduct extensive experiments with various RAG pipelines. Result: The experiments reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. Multi-source synthesis proves challenging even in an oracle retrieval setting. Conclusion: The study concludes that while multi-source synthesis is challenging, reasoning models significantly outperform standard LLMs in this task. Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. 
Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.

[39] The Uneven Impact of Post-Training Quantization in Machine Translation

Benjamin Marie,Atsushi Fujita

Main category: cs.CL

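TL;DR: This paper examines how quantization affects the multilingual performance of large language models (LLMs) deployed on resource-constrained hardware.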

Details

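Motivation: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet its effect on multilingual tasks remains underexplored; this study investigates that effect. Method: The paper conducts a large-scale evaluation of post-training quantization (PTQ) on machine translation, covering 55 languages and five LLMs ranging from 1.7B to 70B parameters. It compares four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound) and quantifies the interactions between quantization, decoding hyperparameters, and calibration languages. Result: (1) 4-bit quantization often preserves translation quality for high-resource languages and large models, but quality degrades significantly for low-resource and typologically diverse languages, particularly in 2-bit settings; (2) the choice of quantization technique and model size jointly determine robustness, with GGUF variants giving the most consistent performance even at 2-bit precision; (3) language-matched calibration offers benefits primarily in low-bit scenarios. Conclusion: Quantization is essential for deploying LLMs on resource-constrained hardware, but its impact on multilingual tasks has been underexplored. Translation quality degrades significantly for low-resource and typologically diverse languages, particularly at 2 bits; quantization technique and model size jointly determine robustness, with GGUF variants the most consistent even at 2-bit precision; and language-matched calibration helps primarily in low-bit scenarios. The findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings. Abstract: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.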
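To show why 2-bit settings lose more information than 4-bit ones, the sketch below applies plain symmetric round-to-nearest quantization to a toy weight vector. This is a deliberately simple baseline, not AWQ, BitsAndBytes, GGUF, or AutoRound, which use more elaborate schemes; the function names and weights are hypothetical.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization to signed integers, then back to floats.

    Uses one scale for the whole vector and integer levels in [-qmax, qmax],
    where qmax = 2**(bits - 1) - 1 (so 2 bits gives levels -1, 0, 1).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    weights = [0.9, -0.4, 0.25, 0.05, -0.7, 0.6]  # toy values
    for bits in (4, 2):
        err = mean_abs_error(weights, quantize_dequantize(weights, bits))
        print(bits, round(err, 4))
```

With only three levels available at 2 bits, most small weights collapse to zero, which gives some intuition for why low-bit settings are the most fragile in the evaluation.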

[40] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge,Junxiang Zhang,Xiaoqian Liu,Bei Li,Xiangnan Ma,Chenglong Wang,Kaiyang Ye,Yangfan Du,Linfeng Zhang,Yuxin Huang,Tong Xiao,Zhengtao Yu,JingBo Zhu

Main category: cs.CL

TL;DR: SageLM is an end-to-end, multi-aspect evaluation model for speech-to-speech LLMs that considers both semantic and acoustic features, achieving high alignment with human evaluation.

Details

Motivation: The lack of comprehensive evaluation methods for speech-to-speech large language models that consider both semantic and acoustic aspects motivated the development of SageLM. Method: SageLM uses rationale-based supervision and a two-stage training paradigm on a synthetic preference dataset called SpeechFeedback to evaluate speech LLMs comprehensively. Result: SageLM achieved an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively. Conclusion: SageLM outperforms existing methods in evaluating speech-to-speech models by incorporating both semantic and acoustic features and achieves a high agreement rate with human evaluators. Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

[41] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

Venkatesh Mishra,Amir Saeidi,Satyam Raj,Mutsumi Nakamura,Jayanth Srinivasa,Gaowen Liu,Ali Payani,Chitta Baral

Main category: cs.CL

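TL;DR: The IRMA framework significantly improves the performance of large language models in multi-turn conversational environments through input reformulation.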

Details

Motivation: To capture and mitigate the errors common in conversation trajectories, improving LLMs' consistency of reasoning, adherence to domain policies, and information extraction in multi-turn conversational environments. Method: The authors conduct a comprehensive manual analysis, experiment with input reformulation, and propose the IRMA framework. Result: IRMA outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. Conclusion: IRMA offers a more reliable and consistent approach and outperforms other methods in dynamic environments. Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
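The pass^5 score is not defined in the summary. The sketch below assumes the usual $\tau$-bench reading of pass^k: the chance that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and counts are illustrative.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k i.i.d. trials of one task all succeed.

    Uses C(successes, k) / C(trials, k); assumes k <= trials.
    """
    if k > trials:
        raise ValueError("k must not exceed the number of trials")
    return comb(successes, k) / comb(trials, k)


def overall_pass_hat_k(per_task_successes, trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical results: three tasks, 8 trials each.
    successes = [8, 6, 4]
    print(overall_pass_hat_k(successes, trials=8, k=1))  # plain success rate
    print(overall_pass_hat_k(successes, trials=8, k=5))  # consistency across 5 trials
```

Unlike pass@k, which rewards at least one success, this measure falls quickly when an agent is only intermittently correct, so it reflects the reliability and consistency the abstract emphasizes.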

[42] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment

Jiaqian Li,Qisheng Hu,Jing Li,Wenya Wang

Main category: cs.CL

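TL;DR: This paper proposes a novel two-stage exemplar selection strategy that addresses insufficient structural alignment in in-context learning, and shows that it outperforms existing baselines on multiple semantic parsing tasks.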

Details

Motivation: The effectiveness of in-context learning depends heavily on the quality of exemplar selection. In structured prediction tasks such as semantic parsing, existing strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. Method: First, a BERT-based retriever is fine-tuned with structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned; then the retriever is enhanced with a plug-in module that amplifies syntactically meaningful information in the hidden representations. Result: Experiments on four benchmarks spanning three semantic parsing tasks show that the method consistently outperforms existing baselines with multiple recent LLMs as inference-time models. Conclusion: The proposed method effectively addresses insufficient structural alignment in in-context learning, improving performance and generalization. Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.

[43] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

Tianjian Liu,Fanqi Wan,Jiajian Guo,Xiaojun Quan

Main category: cs.CL

TL;DR: This paper introduces ProactiveEval, a framework for evaluating proactive dialogue abilities in LLMs, identifying top-performing models and exploring the impact of reasoning capabilities.

Details

Motivation: The research aims to address the lack of comprehensive evaluation of proactive dialogue capabilities in LLMs, as existing studies are often domain-specific or task-oriented, leading to fragmented assessments. Method: The study proposed ProactiveEval, a unified framework that decomposes proactive dialogue into target planning and dialogue guidance, and generates evaluation data across multiple domains. It also involved experiments with 22 LLMs to assess proactive dialogue capabilities. Result: The framework enabled the creation of 328 evaluation environments across 6 domains and revealed insights into how reasoning capabilities affect proactive behaviors in models. Conclusion: DeepSeek-R1 and Claude-3.7-Sonnet perform exceptionally well in target planning and dialogue guidance tasks, respectively, according to the ProactiveEval framework. Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.

[44] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Chen Chen,Yuchen Sun,Jiaxin Gao,Xueluan Gong,Qian Wang,Ziyao Wang,Yongsen Zheng,Kwok-Yan Lam

Main category: cs.CL

TL;DR: LETHE effectively neutralizes backdoor attacks on large language models using internal model merging and external prompt-based distraction, achieving high defense performance while preserving utility.

Details

Motivation: Large language models are vulnerable to backdoor attacks that activate harmful behaviors upon specific triggers. Existing defenses are limited in scope or fail to handle advanced attacks such as multi-trigger or triggerless scenarios. Method: The paper introduces LETHE, which uses internal and external mechanisms for knowledge dilution. Internally, a clean model trained on a lightweight dataset is merged with the backdoored model. Externally, benign and semantically relevant evidence is incorporated into the prompt to divert attention from backdoor features. Result: LETHE significantly reduces the attack success rate of advanced backdoor attacks by up to 98%, outperforms 8 state-of-the-art defense baselines across 5 LLMs, and maintains model utility while being cost-efficient and robust. Conclusion: LETHE is a comprehensive and effective method to defend against various backdoor attacks on large language models, demonstrating high efficacy in reducing attack success rates while maintaining model utility. Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. 
Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model's parametric memory. Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract LLM's attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
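
The internal dilution step described above can be pictured with a small sketch. The abstract does not say which merging algorithm LETHE uses, so plain linear interpolation of weights is assumed here purely for illustration. The function name `merge_weights`, the mixing coefficient `alpha`, and the dict-of-lists parameter layout are all hypothetical.

```python
def merge_weights(backdoored, clean, alpha=0.5):
    """Linearly interpolate two parameter dicts that share names and sizes.

    Assumption: linear interpolation stands in for the (unspecified)
    merging method; alpha is the share taken from the clean model.
    """
    if backdoored.keys() != clean.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name, w_bad in backdoored.items():
        w_clean = clean[name]
        if len(w_bad) != len(w_clean):
            raise ValueError(f"size mismatch for parameter {name}")
        merged[name] = [(1 - alpha) * b + alpha * c for b, c in zip(w_bad, w_clean)]
    return merged


# Toy parameters: the second weight differs between the two models.
backdoored = {"layer.weight": [1.0, -2.0, 4.0]}
clean = {"layer.weight": [0.0, 2.0, 4.0]}
merged = merge_weights(backdoored, clean, alpha=0.5)
```

With `alpha=0.5` each merged weight is the midpoint of the two models, so any weight that only the backdoored model pushed away from the clean value is pulled halfway back.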

[45] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs

Mathieu Bourdin,Anas Neumann,Thomas Paviot,Robert Pellerin,Samir Lamouri

Main category: cs.CL

TL;DR: This paper introduces EASI-RAG, a method designed to help resource-constrained SMEs deploy RAG systems quickly, and validates it through a real-world case study.

Details

Motivation: The paper addresses the challenges SMEs face when deploying RAG-based tools: limited resources and a lack of natural language processing expertise. Method: The paper applies EASI-RAG, a method grounded in method engineering principles that comprises well-defined roles, activities, and techniques, and validates it through a real-world case study. Result: The results show that EASI-RAG supports fast implementation and high user adoption, delivers accurate answers, and enhances the reliability of the underlying data. Conclusion: The paper concludes that EASI-RAG is a feasible way to deploy RAG systems in industrial SMEs, enabling fast implementation and high user adoption. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators' queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback. Results demonstrate that EASI-RAG supports fast implementation, high user adoption, delivers accurate answers, and enhances the reliability of underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future works include the need for generalization across diverse use cases and further integration with fine-tuned models.

[46] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm

Ramazan Ali Bahrami,Ramin Yahyapour

Main category: cs.CL

TL;DR: This paper proposes a dynamic routing capsule approach for sentential relation extraction that excels on standard datasets but faces challenges due to label noise and re-representation issues on larger datasets like Wikidata.

Details

Motivation: Sentential relation extraction is a key task in NLP, and while existing approaches perform well on standard datasets, their performance on larger, more complex datasets is less understood. This paper aims to address this gap and explores the role of re-representation in improving results. Method: Dynamic routing in capsules is proposed for sentential relation extraction, and its performance is evaluated on several datasets including Tacred, Tacredrev, Retacred, Conll04, and Wikidata. Result: The proposed method outperforms state-of-the-art models on standard datasets but shows lower performance on the Wikidata dataset, partly due to noisy labels. It also demonstrates better re-representation compared to vanilla models. Conclusion: The proposed dynamic routing approach in capsules performs well on standard sentential relation extraction datasets but faces challenges with label noise and re-representation on larger datasets like Wikidata. Abstract: Sentential relation extraction (RE) is an important task in natural language processing (NLP). In this paper we propose to do sentential RE with dynamic routing in capsules. We first show that the proposed approach outperforms the state of the art on the common sentential relation extraction datasets Tacred, Tacredrev, Retacred, and Conll04. We then investigate potential reasons for its good performance on the mentioned datasets, and yet low performance on another similar, yet larger sentential RE dataset, Wikidata. As such, we identify noise in Wikidata labels as one of the reasons that can hinder performance. Additionally, we show that better performance is associated with better re-representation, a term from neuroscience that refers to a change of representation in the human brain to improve the match at comparison time. 
As an example, in the given analogous terms King:Queen::Man:Woman, at comparison time, and as a result of re-representation, the similarity between related head terms (King,Man), and tail terms (Queen,Woman) increases. As such, our observations show that our proposed model can do re-representation better than the vanilla model it is compared with. To that end, besides noise in the labels of the distantly supervised RE datasets, we propose re-representation as a challenge in sentential RE.

[47] Enabling Equitable Access to Trustworthy Financial Reasoning

William Jurayj,Nils Holzenberger,Benjamin Van Durme

Main category: cs.CL

TL;DR: Combines LLMs with a symbolic solver to build an accurate and economical tax calculation system.

Details

Motivation: Tax filing is complex and error-prone, incorrect filings can incur serious penalties, and conventional LLMs are poorly suited to the task. Method: LLMs are combined with a symbolic solver; rules are translated into logic programs, and exemplars are used to assist reasoning. Result: Performance on the SARA dataset improves significantly, and deployment costs fall well below real-world averages. Conclusion: Neuro-symbolic systems are economically feasible and can help make tax assistance more equitable. Abstract: According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how combining up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.

cs.CV [Back]

[48] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

Alberto Compagnoni,Davide Caffagni,Nicholas Moratelli,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara

Main category: cs.CV

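TL;DR: This paper proposes CHAIR-DPO, a new method that uses the CHAIR metric and Direct Preference Optimization to reduce hallucinations in multimodal large language models.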

Details

Motivation: Multimodal large language models (MLLMs) are prone to hallucination when generating answers, that is, producing answers that are not grounded in the visual input. Method: The CHAIR metric is used to generate preference data, and the MLLM is fine-tuned via Direct Preference Optimization (DPO). Result: The method significantly reduces hallucinated answers on several hallucination benchmarks. Conclusion: CHAIR-DPO effectively reduces hallucinations in MLLMs and demonstrates its effectiveness on several benchmarks. Abstract: Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
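
As a rough illustration of how a CHAIR-style score could select winner and loser answers for DPO, consider the sketch below. It is not the authors' pipeline: it assumes the objects mentioned in each answer have already been extracted, scores an answer by the fraction of mentioned objects absent from the image (the instance-level CHAIR idea), and uses the standard DPO loss on sequence log-probabilities. The names `chair_instance`, `build_preference_pair`, `dpo_loss`, and the value of `beta` are hypothetical.

```python
import math


def chair_instance(mentioned_objects, image_objects):
    """Fraction of mentioned objects that are not present in the image."""
    mentioned = list(mentioned_objects)
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in image_objects]
    return len(hallucinated) / len(mentioned)


def build_preference_pair(answer_a, answer_b, image_objects):
    """Return a chosen/rejected pair, or None when the scores tie."""
    score_a = chair_instance(answer_a["objects"], image_objects)
    score_b = chair_instance(answer_b["objects"], image_objects)
    if score_a == score_b:
        return None  # no preference signal from this pair
    if score_a < score_b:
        winner, loser = answer_a, answer_b
    else:
        winner, loser = answer_b, answer_a
    return {"chosen": winner["text"], "rejected": loser["text"]}


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


image_objects = {"dog", "frisbee"}
answer_a = {"text": "A dog catches a frisbee.", "objects": ["dog", "frisbee"]}
answer_b = {"text": "A dog chases a cat.", "objects": ["dog", "cat"]}
pair = build_preference_pair(answer_a, answer_b, image_objects)
```

Here `answer_b` mentions a cat that is not in the image, so it receives the higher hallucination score and becomes the rejected sample.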

[49] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization

Yang Su,Shunquan Tan,Jiwu Huang

Main category: cs.CV

TL;DR: This paper proposes a novel image forensic framework using Stable Diffusion's multi-modal capabilities to enhance forgery localization, achieving significant performance improvements without relying on annotated data.

Details

Motivation: The motivation stems from the rapid advancement of multi-modal large models like Stable Diffusion, which challenge traditional image forensics methods that rely heavily on costly annotated data. Method: The method involves leveraging the multi-modal architecture of Stable DiffusionV3 (SD3) by treating image forgery residuals as an explicit modality, which is fused into the latent space during training to enhance forgery localization while preserving semantic information. Result: Experimental results show up to a 12% improvement in performance on benchmark datasets compared to current state-of-the-art models, with strong performance on real-world document and natural scene forgery images not seen during training. Conclusion: The paper concludes that integrating image generation and perceptual capabilities of Stable Diffusion into a forensic framework significantly improves forgery localization performance, with up to 12% improvement on benchmark datasets and robust results on real-world forgery images. Abstract: Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. 
Then, building on this foundation, we specifically leverage the multimodal framework of Stable DiffusionV3 (SD3) to enhance forgery localization performance. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals -- high-frequency signals extracted using specific highpass filters -- as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image. Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and natural scene forging images, even when such data were entirely unseen during training.

[50] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

Max Torop,Masih Eskandar,Nicholas Kurtansky,Jinyang Liu,Jochen Weber,Octavia Camps,Veronica Rotemberg,Jennifer Dy,Kivanc Kose

Main category: cs.CV

TL;DR: This paper explores combining Multimodal Large Language Models (MLLMs) and quantitative attribute usage to enhance the interpretability of Artificial Intelligence models in diagnosing skin diseases.

Details

Motivation: The motivation is to improve the interpretability of AI models in skin disease diagnosis, as this is crucial for practical use and clinical trust. Method: The method involves using MLLMs for natural language reasoning and predicting quantitative attributes related to lesion appearance. Fine-tuning MLLMs to predict these attributes from images is evaluated through a content-based image retrieval case study on the SLICE-3D dataset. Result: The study provides evidence that MLLM embedding spaces can be grounded in quantitative attributes, enhancing interpretability, as demonstrated by a content-based image retrieval case study. Conclusion: Combining MLLMs with quantitative attribute prediction offers a promising approach for more interpretable AI in skin disease diagnosis. Abstract: Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.

[51] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels

Hossein Ahmadi,Banafsheh Saffari

Main category: cs.CV

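TL;DR: This study proposes a new approach to automatic modulation recognition that uses a unified Vision Transformer framework combining supervised, self-supervised, and reconstruction objectives to make efficient use of limited labeled data, achieving strong results across several performance metrics.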

Details

Motivation: Existing automatic modulation recognition solutions often rely on large labeled datasets or multi-stage training pipelines, which limits their scalability and generalization in practice. Method: The approach integrates supervised, self-supervised, and reconstruction objectives; the model comprises a ViT encoder, a lightweight convolutional decoder, and a linear classifier. Result: On the RML2018.01A dataset, the approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across SNR levels. Conclusion: The study presents a unified Vision Transformer framework that offers a simple, generalizable, and label-efficient solution for automatic modulation recognition. Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.

[52] InfinityHuman: Towards Long-Term Audio-Driven Human

Xiaodi Li,Pan Xie,Yi Ren,Qijun Gan,Chen Zhang,Fangyuan Kong,Xiang Yin,Bingyue Peng,Zehuan Yuan

Main category: cs.CV

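TL;DR: Proposes InfinityHuman, a staged framework for generating high-resolution, long-duration videos that improves results through pose-guided refinement and a hand-specific reward mechanism.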

Details

Motivation: Existing methods for audio-driven human animation suffer from identity drift, color shifts, and unnatural hand motions. Method: A coarse-to-fine framework first generates audio-synchronized intermediate representations, then progressively improves video quality with a pose-guided refinement module, and introduces a reward mechanism based on high-quality hand motion data. Result: On the EMTD and HDTF datasets, InfinityHuman achieves state-of-the-art results in video quality, identity preservation, hand accuracy, and lip-sync. Conclusion: InfinityHuman effectively addresses key problems in long-duration, high-resolution audio-driven human animation and has practical value. Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.

[53] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos

Mert Cokelek,Halit Ozsoy,Nevrez Imamoglu,Cagri Ozcinar,Inci Ayhan,Erkut Erdem,Aykut Erdem

Main category: cs.CV

TL;DR: This study introduces new models for predicting visual saliency in 360-degree videos by leveraging audio-visual cues and vision transformers, addressing the need for comprehensive datasets in this domain.

Details

Motivation: The motivation stems from the lack of comprehensive datasets for 360-degree audio-visual saliency prediction and the transformative impact of spatial audio on user experience in spherical scenes. Method: The study proposes two novel saliency prediction models, SalViT360 and SalViT360-AV, which utilize vision transformers with spatio-temporal attention layers and incorporate audio input, respectively. Result: The proposed models, SalViT360 and SalViT360-AV, significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Conclusion: The study concludes that integrating spatial audio cues into the model architecture significantly enhances the accuracy of saliency prediction in omnidirectional videos. Abstract: Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. 
Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.

[54] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models

Phu-Vinh Nguyen,Tan-Hanh Pham,Chris Ngo,Truong Son Hy

Main category: cs.CV

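TL;DR: This paper proposes a new pipeline that uses vision-language models to explain the behavior of vision models, helping to discover failure cases and gain insights into vision models.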

Details

Motivation: The development of many vision models focuses mainly on improving performance metrics such as accuracy, IoU, and mAP, with less attention to explainability. In addition, understanding how vision models behave on general images is important for preventing biased judgments and identifying model trends. Method: A new pipeline for explaining vision models is proposed that applies vision-language models. Result: The pipeline can explain vision model behavior at both the sample level and the dataset level, helping to discover failure cases and gain insights into vision models. Conclusion: The paper presents a pipeline that explains vision model behavior at both the sample and dataset levels, helping to discover failure cases and gain insights into vision models. Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.

[55] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems

Mohamed Ohamouddou,Said Ohamouddou,Abdellatif El Afia,Rafik Lasri

Main category: cs.CV

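TL;DR: This study proposes ATMS-KD, a new knowledge distillation framework that effectively improves the performance of lightweight models in agricultural settings and outperforms existing methods on several evaluation metrics.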

Details

Motivation: To meet the need for lightweight model training and efficient knowledge distillation in resource-constrained agricultural environments. Method: Adaptive temperature scheduling is combined with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model to lightweight residual CNN student models. Result: All student models exceed 96.7% validation accuracy under ATMS-KD, the compact model reaches 97.11% accuracy, and knowledge retention rates exceed 99%. Conclusion: ATMS-KD is a new framework for developing lightweight CNN models suited to resource-constrained agricultural environments, and it outperforms existing methods in knowledge distillation effectiveness. Abstract: This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7 M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3 M parameters), Standard (2.4 M parameters), and Enhanced (3.8 M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95-96% with direct training. The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model -- a 1.60 percentage point improvement over the second-best approach while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
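
To make the temperature idea concrete, the sketch below shows a generic temperature-scaled distillation loss together with one possible temperature schedule. The abstract does not describe how ATMS-KD adapts the temperature, so the linear decay in `linear_temperature` is an assumption, as are the function names and default values. The loss itself is the classic softened-KL formulation scaled by the squared temperature, not necessarily the paper's exact objective, and the mixed-sample augmentation is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def linear_temperature(epoch, total_epochs, t_start=4.0, t_end=1.0):
    """Hypothetical schedule: decay linearly from t_start to t_end."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)


teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
loss_early = kd_loss(student, teacher, linear_temperature(0, 10))
loss_late = kd_loss(student, teacher, linear_temperature(9, 10))
```

A higher temperature flattens both distributions so the student also learns from the teacher's relative ranking of wrong classes; lowering it over training shifts the emphasis toward the teacher's top prediction.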

[56] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification

Mutahar Safdar,Gentry Wood,Max Zimmermann,Guy Lamouche,Priti Wanjara,Yaoyao Fiona Zhao

Main category: cs.CV

TL;DR: This study proposes a hybrid vision-language framework for rapid and reliable qualification of advanced materials by integrating microstructure informatics with expert knowledge, enabling zero-shot classification of microstructures without retraining models.

Details

Motivation: Rapid and reliable qualification of advanced materials, especially for heterogeneous structures produced via non-conventional additive manufacturing processes, remains a bottleneck in industrial manufacturing. This limitation necessitates the development of a more efficient and scalable qualification approach. Method: The study introduces a novel framework that integrates microstructure informatics with expert characterization knowledge using customized and hybrid vision-language representations (VLRs). It uses deep semantic segmentation combined with pre-trained multi-modal models (CLIP and FLAVA) to encode visual and textual data into shared representations. A customized similarity-based representation incorporating expert-annotated images and textual descriptions enables zero-shot classification of microstructures. Result: Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across various characterization criteria. Comparative analysis shows that the FLAVA model offers higher visual sensitivity, while the CLIP model aligns consistently with textual criteria. Z-score normalization improves alignment and classification performance by adjusting raw similarity scores based on dataset-driven distributions. Conclusion: The proposed hybrid vision-language framework enhances traceability and interpretability in material qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining, contributing to scalable and domain-adaptable qualification strategies in engineering informatics. Abstract: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. 
This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that the FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
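
One way to read the net similarity scoring and z-score normalization described above is sketched below. The abstract does not give the exact formula, so this assumes the net score is the mean cosine similarity to positive references minus the mean cosine similarity to negative references, and that z-scores are computed over the scores of the current dataset. All function names and the toy two-dimensional embeddings are hypothetical; real embeddings would come from a model such as CLIP or FLAVA.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def net_similarity(query, positives, negatives):
    """Mean similarity to positive references minus mean to negatives."""
    pos = mean([cosine(query, p) for p in positives])
    neg = mean([cosine(query, n) for n in negatives])
    return pos - neg


def z_scores(values):
    """Standardize scores using the dataset's own mean and spread."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


positives = [[1.0, 0.0]]   # embeddings of acceptable references (toy values)
negatives = [[0.0, 1.0]]   # embeddings of defective references (toy values)
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
raw = [net_similarity(q, positives, negatives) for q in queries]
normalized = z_scores(raw)
```

A sample would then be labeled acceptable when its normalized net score is positive, although the threshold used in the paper is not stated in the abstract.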

[57] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces

Zhen Xuen Brandon Low,Rory Zhang,Hang Min,William Pham,Lucy Vivash,Jasmine Moses,Miranda Lynch,Karina Dorfman,Cassandra Marotta,Shaun Koh,Jacob Bunyamin,Ella Rowsthorn,Alex Jarema,Himashi Peiris,Zhaolin Chen,Sandy R. Shultz,David K. Wright,Dexiao Kong,Sharon L. Naismith,Terence J. O'Brien,Ying Xia,Meng Law,Benjamin Sinclair

Main category: cs.CV

TL;DR: The study demonstrates that MedNeXt-L-k5 enables efficient automated segmentation of PVS in diverse MRI datasets but does not surpass the performance of existing models like nnU-Net, indicating that transformer-based attention mechanisms may not be necessary for accurate PVS segmentation.

Details

Motivation: Manual segmentation of PVS is time-consuming and has moderate inter-rater reliability, while existing automated deep learning models show moderate performance and limited generalizability across diverse MRI datasets. Method: MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, was adapted for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of T2w MRI scans and another using heterogeneous T1w MRI volumes. Performance was evaluated using 5-fold cross-validation and leave-one-site-out cross-validation. Result: MedNeXt-L-k5 models trained on T2w images achieved high voxel-level Dice scores (0.88+/-0.06 in WM), comparable to inter-rater reliability. However, performance was lower on T1w images (0.58+/-0.09 in WM). Under leave-one-site-out cross-validation, the model showed moderate voxel-level and cluster-level Dice scores across WM and BG. Conclusion: MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets, but does not outperform the nnU-Net, suggesting that attention-based mechanisms are not essential for high accuracy in PVS segmentation. Abstract: Enlarged perivascular spaces (PVS) are increasingly recognized as biomarkers of cerebral small vessel disease, Alzheimer's disease, stroke, and aging-related neurodegeneration. However, manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. We adapted MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated PVS segmentation. 
Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the Human Connectome Project-Aging (HCP-Aging) dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88+/-0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58+/-0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38+/-0.16 (WM) and 0.35+/-0.12 (BG), and cluster-level Dice scores of 0.61+/-0.19 (WM) and 0.62+/-0.21 (BG). MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. MedNeXt-L-k5 did not outperform the nnU-Net, indicating that the attention-based mechanisms present in transformer-inspired models to provide global context are not required for high accuracy in PVS segmentation.
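
The voxel-level Dice scores quoted above measure the overlap between a predicted mask and a reference mask. A minimal sketch of that computation, assuming flat binary masks and the convention (an assumption here, not stated in the abstract) that two empty masks score 1.0; the function name `dice_score` is illustrative:

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(dice_score(pred, truth))  # 2 shared voxels, 3 + 3 foreground voxels
```

The cluster-level Dice reported in the abstract scores connected components rather than individual voxels; its exact definition is not given here, so it is not sketched.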

[58] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Zhixiang Chi,Yanan Wu,Li Gu,Huan Liu,Ziqiang Wang,Yang Zhang,Yang Wang,Konstantinos N. Plataniotis

Main category: cs.CV

TL;DR: This work introduces a training-free framework that improves CLIP-based models' performance by enhancing spatial coherence and semantic consistency through feedback-driven adaptation of intermediate attention.

Details

Motivation: CLIP shows strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization and inconsistent propagation of spatial coherence to the final output. Intermediate attention also lacks direct interaction with text representations, limiting CLIP's potential. Method: The method involves a self-adaptive framework that adapts patch-level correspondences from the model outputs back to intermediate attention. Key modules include attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, which utilize output predictions as a spatial coherence prior. Result: The proposed framework consistently improves performance across eight benchmarks, functioning seamlessly as a plug-in module with four state-of-the-art approaches and three backbones (ViT-B, ViT-L, ViT-H), validated across multiple attention types. Conclusion: The proposed training-free, feedback-driven framework enhances the semantic consistency and spatial coherence of CLIP-based models, functioning as a plug-in module across multiple architectures and attention types while improving performance on eight benchmarks. Abstract: CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. 
Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

[59] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding

Zhuoran Yu,Yong Jae Lee

Main category: cs.CV

TL;DR: The paper introduces a probing framework to analyze how MLLMs process visual and textual inputs, identifying a consistent stage-wise structure across different models and variations.

Details

Motivation: The motivation of the paper is to explore the internal processing dynamics of MLLMs, which have demonstrated strong performance in vision-language tasks but remain underexplored. Method: The authors introduce a probing framework to analyze how MLLMs process visual and textual inputs. They train linear classifiers to predict visual categories from token embeddings and evaluate these probes under controlled prompt variations. Result: The study identifies a consistent stage-wise structure in MLLMs where early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. Conclusion: The study concludes that MLLMs have a consistent stage-wise structure in processing visual and textual inputs, and the layer-wise organization is stable across variations in visual tokenization, instruction tuning data, and pretraining corpus. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. 
Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.
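
The layer-wise probing described above can be illustrated with a toy example: fit one linear classifier per layer on frozen features and compare accuracies. This is only a sketch under loud assumptions. The "layers" are synthetic 2-D features, the probe is a simple perceptron (the abstract says only "linear classifiers"), and `train_probe`, `probe_accuracy`, and `layer_features` are hypothetical names:

```python
import random


def train_probe(features, labels, epochs=20):
    """Fit a binary linear probe (perceptron) on frozen features; labels are 0/1."""
    dim = len(features[0])
    weights = [0.0] * (dim + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * v for w, v in zip(weights, x)) + weights[-1]
            predicted = 1 if score > 0 else 0
            if predicted != y:
                # Move the hyperplane toward the misclassified example.
                step = 1.0 if y == 1 else -1.0
                for i, v in enumerate(x):
                    weights[i] += step * v
                weights[-1] += step
    return weights


def probe_accuracy(weights, features, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(w * v for w, v in zip(weights, x)) + weights[-1]
        correct += int((1 if score > 0 else 0) == y)
    return correct / len(labels)


rng = random.Random(0)
labels = [i % 2 for i in range(40)]
# Synthetic stand-ins for per-layer embeddings (assumption, not real model features):
# layer 0 carries no class signal, layer 1 separates the two classes with a margin.
layer_features = {
    0: [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in labels],
    1: [[(1.5 if y else -1.5) + rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
        for y in labels],
}
for layer, feats in layer_features.items():
    probe = train_probe(feats, labels)
    print(layer, probe_accuracy(probe, feats, labels))
```

For brevity the sketch scores each probe on the data it was fitted to; a real probing study would evaluate on held-out examples and on the prompt variants listed above.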

[60] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)

Zhi Li,Hau Phan,Matthew Emigh,Austin J. Brockmeier

Main category: cs.CV

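TL;DR: This paper proposes SLiCS, a new supervised dictionary learning method that disentangles the embedding space to improve the precision of image retrieval.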

Details

Motivation: To improve the precision of image retrieval, the authors hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes. Method: The authors propose a supervised dictionary learning approach that uses an alternating optimization algorithm to decompose embeddings into multiple concept-specific subspaces, enabling multi-label image retrieval and conditional generation. Result: Experiments show that concept-filtered image retrieval with SLiCS is more precise, and that the method also applies to other embedding spaces such as those of TiTok and DINOv2. Conclusion: SLiCS improves the precision of image retrieval and applies to different types of embedding spaces, demonstrating its broad applicability and effectiveness. Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated with a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept's group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.

[61] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models

Xiao Li,Yanfan Zhu,Ruining Deng,Wei-Qi Wei,Yu Wang,Shilin Zhao,Yaohong Wang,Haichun Yang,Yuankai Huo

Main category: cs.CV

TL;DR: This paper introduces MedFoundationHub, a secure and user-friendly toolkit for deploying medical vision-language models, identifying current challenges through expert evaluation on clinical cases.

Details

Motivation: The motivation is to advance the deployment of secure and efficient medical vision-language models for clinical applications while addressing privacy, security, and usability challenges. Method: The authors developed MedFoundationHub, a GUI toolkit for deploying medical VLMs, and evaluated five state-of-the-art models with expert pathologists on colon and renal cases. Result: Expert evaluations revealed limitations in the models, including off-target answers, vague reasoning, and inconsistent terminology, highlighting the need for improvement. Conclusion: The paper concludes that while medical vision-language models offer promising applications in clinical settings, they still face significant limitations in accuracy and consistency, requiring further improvement and careful deployment to ensure security and privacy. Abstract: Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. 
MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

[62] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction

Mang Cao,Sanping Zhou,Yizhe Li,Ye Deng,Wenli Huang,Le Wang

Main category: cs.CV

TL;DR: This paper proposes a Bidirectional Interaction Mamba (BIM) framework for multi-task dense prediction that achieves superior performance over state-of-the-art methods while maintaining computational efficiency through novel scanning mechanisms.

Details

Motivation: The authors aim to address the challenge of achieving sufficient cross-task interaction in multi-task dense prediction while maintaining computational efficiency, as existing methods face a trade-off between interaction completeness and computational efficiency. Method: The authors propose a Bidirectional Interaction Mamba (BIM) framework incorporating two novel scanning mechanisms: Bidirectional Interaction Scan (BI-Scan) and Multi-Scale Scan (MS-Scan). BI-Scan constructs task-specific representations as bidirectional sequences during interaction, integrating task-first and position-first scanning modes within a unified linear complexity architecture. MS-Scan achieves multi-granularity scene modeling to meet diverse granularity requirements of various tasks and enhance nuanced cross-task feature interactions. Result: Extensive experiments show the superiority of the proposed BIM framework over state-of-the-art competitors on two challenging benchmarks, NYUD-V2 and PASCAL-Context. Conclusion: The proposed BIM framework demonstrates superior performance over state-of-the-art methods in multi-task dense prediction tasks on two challenging benchmarks, NYUD-V2 and PASCAL-Context. Abstract: Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. 
By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM vs its state-of-the-art competitors.
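
One way to picture the two scanning modes is as two different orderings of the same (task, position) token grid, each read forward and backward. The sketch below is only an illustration: which ordering the paper calls "task-first" and which "position-first" is an assumption, and the function names are hypothetical.

```python
def task_major_order(num_tasks, num_positions):
    """Visit every position of task 0, then every position of task 1, and so on."""
    return [(t, p) for t in range(num_tasks) for p in range(num_positions)]


def position_major_order(num_tasks, num_positions):
    """Visit every task at position 0, then every task at position 1, and so on."""
    return [(t, p) for p in range(num_positions) for t in range(num_tasks)]


def bidirectional(sequence):
    """Return the forward and the reversed version of a scan order."""
    return sequence, sequence[::-1]


forward, backward = bidirectional(position_major_order(2, 3))
print(forward)   # tokens of different tasks at the same position sit next to each other
print(backward)
print(task_major_order(2, 3))  # each task's tokens form one contiguous run
```

Both orderings have length num_tasks * num_positions, so a sequence model with linear cost in sequence length stays linear in the number of tokens under either mode.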

Hyeonyu Kim,Seokhoon Jeong,Seonghee Han,Chanhyuk Choi,Taehwan Kim

Main category: cs.CV

TL;DR: This paper introduces an audio-guided visual editing framework that effectively handles complex tasks using multiple text and audio prompts without additional training, outperforming text-only approaches.

Details

Motivation: Visual editing with diffusion models struggles with complex scenarios that textual guidance alone cannot describe. There's a need for additional non-text editing prompts like audio to improve performance in real-world situations. Method: The paper proposes a novel approach using a pre-trained multi-modal encoder with zero-shot capabilities to integrate diverse audio into visual editing tasks. It introduces separate noise branching and adaptive patch selection to handle complex, multi-modal editing prompts. Result: The framework excels in handling complicated editing scenarios by incorporating rich audio information, where text-only approaches fail, as demonstrated by comprehensive experiments on diverse editing tasks. Conclusion: The paper concludes that the proposed audio-guided visual editing framework successfully handles complex editing tasks using multiple text and audio prompts without additional training, outperforming text-only approaches. Abstract: Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone could not adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model's prompt encoder space. 
Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.

[64] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning

Luong Tran,Thieu Vo,Anh Nguyen,Sang Dinh,Van Nguyen

Main category: cs.CV

TL;DR: This paper proposes AEVLP, a framework combining GPR Loss and DAMP technique, to address challenges in Single Positive Multi-Label Learning by effectively handling noisy pseudo-labels and achieving state-of-the-art results.

Details

Motivation: Fully annotating large-scale datasets for multi-label learning is costly and impractical. Existing methods for handling partially annotated data, especially in Single Positive Multi-Label Learning (SPML), often result in inaccuracies and noise, necessitating a more robust solution. Method: The study introduces the Generalized Pseudo-Label Robust Loss (GPR Loss) to handle noisy pseudo-labels and a Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. These methods are combined into the AEVLP framework for improved multi-label learning with limited annotations. Result: Extensive experiments on four benchmark datasets show that the AEVLP framework significantly outperforms existing methods in multi-label classification, achieving state-of-the-art performance. Conclusion: The proposed AEVLP framework, consisting of the GPR Loss and DAMP technique, significantly improves multi-label classification performance on partially annotated datasets, achieving state-of-the-art results. Abstract: Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. 
Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.

[65] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection

Chengjun Zhang,Yuhao Zhang,Jie Yang,Mohamad Sawan

Main category: cs.CV

TL;DR: This paper introduces a novel tdIF neuron architecture and delay-spike approach for SNNs, significantly improving performance in visual detection tasks with ultra-low latency.

Details

Motivation: Current ANN-SNN conversion methods perform well in classification tasks but show suboptimal results in visual detection tasks. This paper aims to address this issue by improving spike temporal representation and membrane potential handling. Method: A delay-spike approach and a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture are proposed to improve the performance of SNNs in visual detection tasks. Result: The tdIF method achieves superior performance in object detection and lane line detection tasks with ultra-low latency (within 5 time-steps), demonstrating more precise feature representation and high performance. Conclusion: The proposed tdIF neuron architecture outperforms existing ANN-SNN conversion methods in visual detection tasks by enabling dynamic adjustment of IF neurons based on temporal order, achieving state-of-the-art performance with ultra-low latency. Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with that of a traditional IF neuron. 
We demonstrate that our method achieves more precise feature representation with fewer time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct an extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).
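
For readers unfamiliar with the baseline, the sketch below simulates a plain Integrate-and-Fire neuron with subtractive reset and shows the residual membrane potential that remains after a few time-steps. It is the standard IF model only, not the tdIF neuron or the delay-spike scheme proposed in the paper, and the function name and input values are illustrative assumptions.

```python
def if_neuron(inputs, threshold=1.0):
    """Simulate a standard IF neuron with subtractive (soft) reset.

    Returns the spike train and the membrane potential left after the last step.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current          # integrate the input of this time-step
        if potential >= threshold:    # fire once the threshold is reached
            spikes.append(1)
            potential -= threshold    # soft reset keeps the surplus charge
        else:
            spikes.append(0)
    return spikes, potential


spikes, residual = if_neuron([0.75, 0.75, 0.75])
print(spikes, residual)
```

With three inputs of 0.75 the neuron receives 2.25 units of charge but emits only two spikes; the remaining 0.25 is the kind of residual potential that rate-based conversion loses when the number of time-steps is very small.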

[66] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection

Yuqi Xiong,Wuzhen Shi,Yang Wen,Ruhan Liu

Main category: cs.CV

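TL;DR: This paper proposes DUP-MCRNet, a new salient object detection method that uses dynamic uncertainty propagation and multimodal collaborative reasoning to improve detection accuracy, particularly edge clarity and robustness to complex backgrounds.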

Details

Motivation: Existing salient object detection methods tend to lose details, blur edges, and fuse single-modal information insufficiently in complex scenes; DUP-MCRNet is proposed to address these problems. Method: The paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet), comprising a dynamic uncertainty graph convolution module (DUGC) and a multimodal collaborative fusion strategy (MCF), and optimizes detection performance with multi-scale BCE and IoU losses, cross-scale consistency constraints, and an uncertainty-guided supervision mechanism. Result: DUP-MCRNet performs well in salient object detection, particularly in edge clarity and robustness to complex backgrounds, and its code is publicly available to support follow-up research and applications. Conclusion: DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in edge clarity and robustness to complex backgrounds. Abstract: Existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficiently fusing single-modal information in complex scenes. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance, and combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to compute a weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency between cross-modalities, thereby improving the ability to identify salient regions under occlusion, weak texture or background interference. Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.
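
The gated fusion step can be pictured as a softmax over one learnable logit per modality, followed by a weighted sum of the modality attention maps. The sketch below assumes exactly that form; the abstract does not specify the normalization, and `gated_fusion` and the toy values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(attention_maps, gate_logits):
    """Fuse per-modality attention maps (flat lists of equal length) with softmax gates."""
    weights = softmax(gate_logits)
    length = len(attention_maps[0])
    return [
        sum(w * m[i] for w, m in zip(weights, attention_maps))
        for i in range(length)
    ]


rgb, depth, edge = [1.0, 0.0], [0.0, 0.0], [0.5, 1.0]
# Equal gate logits reduce to a plain average of the three maps.
print(gated_fusion([rgb, depth, edge], [0.0, 0.0, 0.0]))
# A strongly negative logit almost switches the depth map off.
print(gated_fusion([rgb, depth, edge], [0.0, -10.0, 2.0]))
```

In a trained network the gate logits would be parameters (or predicted per scene) and the maps would be 2-D tensors; flat lists keep the sketch dependency-free.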

[67] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection

Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shota Orihashi,Tomohiro Tanaka,Mana Ihori,Naoki Makishima,Naotaka Kawata

Main category: cs.CV

TL;DR: The paper proposes Multi-Scale Multi-View Detection (MSMVD) to improve pedestrian detection in bird's eye view by leveraging multi-scale image features, outperforming previous methods on the GMVD dataset.

Details

Motivation: End-to-end deep learning methods struggle with detecting pedestrians with consistently small or large scales in views or with vastly different scales between views because they do not exploit multi-scale image features. Method: Multi-Scale Multi-View Detection (MSMVD) generates multi-scale BEV features by projecting multi-scale image features from individual views into the BEV space, scale-by-scale and combines information using a feature pyramid network. Result: MSMVD improves detection performance for pedestrians with varying scales in multi-view images by generating and combining multi-scale BEV features. Conclusion: MSMVD outperforms previous methods by 4.5 points on the GMVD dataset by exploiting multi-scale image features through multi-scale BEV features. Abstract: Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird's eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. 
Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset.
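
MODA (Multiple Object Detection Accuracy) is commonly defined as one minus the ratio of missed detections plus false positives to the number of ground-truth objects. A minimal sketch assuming unit cost weights for both error types; the counts in the example are made up for illustration and are not results from the paper:

```python
def moda(false_negatives, false_positives, num_ground_truth):
    """MODA with unit costs: 1 - (misses + false positives) / ground-truth count."""
    if num_ground_truth <= 0:
        raise ValueError("MODA is undefined without ground-truth objects")
    return 1.0 - (false_negatives + false_positives) / num_ground_truth


# Hypothetical counts: 7 missed pedestrians and 3 false alarms out of 100 annotations.
print(moda(false_negatives=7, false_positives=3, num_ground_truth=100))
```

Because both error types are subtracted, MODA can become negative when a detector produces many false positives, and a gain of 4.5 points corresponds to 4.5 fewer errors per 100 ground-truth pedestrians.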

[68] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection

Libo Lv,Tianyi Wang,Mengxiao Huang,Ruixia Liu,Yinglong Wang

Main category: cs.CV

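TL;DR: This paper proposes SFMFNet, a lightweight network for real-time deepfake detection that combines spatial-frequency information with multi-scale feature interaction to balance accuracy and efficiency.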

Details

Motivation: Existing deepfake detection methods perform well on standard datasets, but their high computational cost makes real-time deployment in practical applications difficult. A lightweight and efficient solution is therefore needed. Method: The paper designs a spatial-frequency hybrid aware module that combines spatial textures and frequency artifacts through a gated mechanism, adopts a token-selective cross attention mechanism for efficient multi-level feature interaction, and uses a residual-enhanced blur pooling structure to retain key semantic information during downsampling. Result: Experiments show that SFMFNet achieves a favorable balance between accuracy and efficiency on several benchmark datasets, with strong generalization and practical value. Conclusion: SFMFNet is an effective real-time deepfake detection method that meets the dual demands of efficiency and accuracy in practical applications. Abstract: With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.
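
Blur pooling, the building block named above, low-pass filters a signal before subsampling it so that high-frequency content does not alias. The sketch below shows plain 1-D blur pooling with a [1, 2, 1] / 4 kernel and edge replication; it is not the residual-enhanced variant used in SFMFNet, whose details the abstract does not give.

```python
def blur_pool_1d(signal, stride=2):
    """Low-pass filter with the kernel [1, 2, 1] / 4 (edge-replicated), then subsample."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    blurred = [
        (padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
        for i in range(len(signal))
    ]
    return blurred[::stride]


signal = [0, 4, 0, 4, 0, 4]
print(signal[::2])           # naive stride-2 subsampling sees only the zeros
print(blur_pool_1d(signal))  # blurring first keeps evidence of the oscillation
```

On the alternating input, naive subsampling returns all zeros, while the blurred version keeps values near the local mean, which is the property that helps retain cues during downsampling.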

[69] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification

Ayaka Tsutsumi,Guang Li,Ren Togo,Takahiro Ogawa,Satoshi Kondo,Miki Haseyama

Main category: cs.CV

TL;DR: This paper introduces a lightweight medical image classification method combining dual-model weight selection and self-knowledge distillation, achieving high performance while maintaining computational efficiency.

Details

Motivation: The motivation stems from the need to overcome computational constraints in deploying large-scale models in real-world medical settings, requiring lightweight models that maintain high performance. Method: The method involves initializing two lightweight models using weights from a large pretrained model (dual-model weight selection) and applying self-knowledge distillation (SKD) to enhance knowledge transfer and model performance. Result: Extensive experiments on medical datasets (chest X-ray, lung CT, and brain MRI) show that the proposed method achieves superior performance and robustness compared to existing methods. Conclusion: The proposed method combining dual-model weight selection with self-knowledge distillation outperforms existing approaches in medical image classification tasks, offering a computationally efficient yet high-performing solution. Abstract: We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. 
By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets (chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans) demonstrate the superior performance and robustness of our approach compared to existing methods.
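The abstract does not spell out the distillation objective, so the following is only a minimal sketch of a generic temperature-scaled distillation term between two models' logits. The function names, the temperature value, and the choice of which model acts as the reference are assumptions for illustration, not the authors' SKD formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(reference_logits, student_logits, temperature=2.0):
    """KL(reference || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. This is a generic distillation term,
    assumed here for illustration only.
    """
    if len(reference_logits) != len(student_logits):
        raise ValueError("logit vectors must have the same length")
    p = softmax(reference_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when both models produce the same softened distribution and grows as they diverge, which is the property a distillation term relies on.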

[70] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds

Pengpeng Yu,Haoran Li,Dingquan Li,Runqing Jiang,Jing Wang,Liang Lin,Yulan Guo

Main category: cs.CV

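TL;DR: This paper proposes an efficient LiDAR point cloud compression method that uses geometry re-densification and cross-scale feature propagation modules to achieve state-of-the-art compression performance at real-time speed.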

Details

Motivation: LiDAR point clouds are widely used, but high-precision scans incur substantial storage and transmission overhead. Existing methods are hindered by the extreme sparsity of geometric details, which limits their compression performance and speed. Method: The paper proposes an efficient predictive coding framework built on compact feature representations, comprising a Geometry Re-Densification Module and a Cross-scale Feature Propagation Module. Result: Experiments on the KITTI dataset validate the method, showing high compression ratios and real-time processing. Conclusion: The method achieves state-of-the-art compression ratios and real-time performance on the KITTI dataset, reaching 26 FPS for both encoding and decoding at 12-bit quantization. Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at a denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.

[71] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

Xiaochuan Li,Guoguang Du,Runze Zhang,Liang Jin,Qi Jia,Lihua Lu,Zhenhua Guo,Yaqian Zhao,Haiyang Liu,Tianqi Wang,Changsheng Li,Xiaoli Gong,Rengang Li,Baoyu Fan

Main category: cs.CV

TL;DR: This paper proposes a new approach for 3D asset generation using video data, introducing a large-scale dataset and a generative model, which leverages spatial and semantic information from videos to overcome the limitations of scarce 3D data.

Details

Motivation: The motivation is driven by the data scarcity in the 3D domain, which limits the generalization of models. Videos are explored as an alternative source of supervisory signals due to their spatial consistency and rich semantic information. Method: The paper introduces Droplet3D-4M, a large-scale video dataset with multi-view annotations, and Droplet3D, a generative model trained on this dataset to support image and dense text input for 3D generation. Result: Extensive experiments show that the proposed approach is effective in generating spatially consistent and semantically plausible 3D content, outperforming prevailing 3D solutions. Conclusion: The paper concludes that commonsense priors from videos can significantly facilitate 3D creation, and their approach has the potential for extension to scene-level applications. Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. 
Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.

[72] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation

Jiusi Li,Jackson Jiang,Jinyu Miao,Miao Long,Tuopu Wen,Peijin Jia,Shengxiang Liu,Chunlei Yu,Maolin Liu,Yuzhan Cai,Kun Jiang,Mengmeng Yang,Diange Yang

Main category: cs.CV

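TL;DR: G^2Editor is an efficient framework for generating autonomous driving scenarios that achieves photorealistic and precise object editing through a 3D Gaussian representation and hierarchical feature-guided generation.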

Details

Motivation: Collecting real-world corner cases is costly and hazardous, so an efficient way to generate diverse scenarios is needed. Method: The method uses a 3D Gaussian representation as a dense prior, combined with a scene-level 3D bounding box layout and hierarchical fine-grained features that condition generation. Result: Experiments on the Waymo Open Dataset show that G^2Editor outperforms existing methods in pose controllability and visual quality. Conclusion: G^2Editor enables photorealistic and precise object editing in driving videos, supports object repositioning, insertion, and deletion, and outperforms existing methods. Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.

[73] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization

Marina Grifell i Plana,Vladyslav Zalevskyi,Léa Schmidt,Yvan Gomez,Thomas Sanchez,Vincent Dunet,Mériam Koob,Vanessa Siffredi,Meritxell Bach Cuadra

Main category: cs.CV

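TL;DR: This paper proposes a pathology-informed domain randomization strategy that mitigates data scarcity and enhances the analysis of rare but clinically significant malformations.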

Details

Motivation: The rarity of corpus callosum dysgenesis (CCD) severely limits annotated data, hindering the generalization of deep learning models. Method: The paper proposes a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. Result: The method is validated on a cohort of 248 healthy fetuses, 26 CCD cases, and 47 cases with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on healthy fetuses and on other pathologies. Conclusion: Incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance the analysis of rare but clinically significant malformations. Abstract: Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses. Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.

[74] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding

Jeong Hun Yeo,Hyeongseop Rha,Sungjune Park,Junil Won,Yong Man Ro

Main category: cs.CV

TL;DR: This paper introduces a unified framework for processing sign language, lip movements, and audio, achieving strong performance across multiple tasks and highlighting the importance of integrating modalities, especially lip cues in sign language translation.

Details

Motivation: The motivation is to create a more inclusive communication system by integrating visual and audio modalities, particularly focusing on the underexplored relationship between sign language and lip reading. Method: The authors built a unified, modality-agnostic architecture to process diverse combinations of sign language, lip movements, and audio, exploring the synergy among modalities. Result: The framework achieves performance equal to or better than existing specialized models across SLT, VSR, ASR, and AVSR tasks, with significant improvement in SLT when lip movements are explicitly modeled. Conclusion: The paper concludes that their unified framework performs on par with or better than state-of-the-art models for individual tasks and that explicitly modeling lip movements enhances SLT performance. Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. 
We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.

[75] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Yuan Xie,Tianshui Chen,Zheng Ge,Lionel Ni

Main category: cs.CV

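TL;DR: Video-MTR is a reinforced multi-turn reasoning framework for long video understanding that iteratively selects key video segments and refines question comprehension, using a gated bi-level reward system for end-to-end training to improve accuracy and efficiency.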

Details

Motivation: Long video understanding is challenged by long-range temporal dependencies and multiple events. Existing methods rely on static reasoning or external vision-language models, which leads to complexity and sub-optimal performance. Method: The paper proposes Video-MTR, a reinforced multi-turn reasoning framework that progressively selects video segments and refines question comprehension, optimizing the model with trajectory-level and turn-level rewards. Result: On benchmarks such as VideoMME, MLVU, and EgoSchema, Video-MTR outperforms existing methods in accuracy and efficiency. Conclusion: Video-MTR addresses the challenges of long video understanding, improving the accuracy and efficiency of video analysis through end-to-end training and an iterative reasoning process. Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.

[76] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

Zixuan Hu,Dongxiao Li,Xinzhu Ma,Shixiang Tang,Xiaotong Li,Wenhan Yang,Ling-Yu Duan

Main category: cs.CV

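TL;DR: DUO is a test-time adaptation framework for monocular 3D object detection that improves model robustness by optimizing semantic and geometric uncertainty.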

Details

Motivation: The work addresses the declining reliability of monocular 3D object detection under real-world domain shifts, filling the gap left by existing test-time adaptation methods that fail to handle semantic and geometric uncertainty jointly. Method: The paper proposes Dual Uncertainty Optimization (DUO), which includes a new convex structure of the focal loss derived through convex optimization, an unsupervised version of that loss, and a semantic-aware normal field constraint. Result: Experiments show that DUO outperforms existing methods across various datasets and domain shift types. Conclusion: DUO is a new test-time adaptation framework that improves the robustness of monocular 3D object detection by jointly minimizing semantic and geometric uncertainty. Abstract: Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.

[77] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information

Seunghyeon Jung,Seoyoung Hong,Jiwoo Jeong,Seungwon Jeong,Jaerim Choi,Hoki Kim,Woojin Lee

Main category: cs.CV

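TL;DR: This paper presents CaddieSet, a new golf dataset that uses computer vision to extract joint information from a single swing video and defines 15 key metrics for analyzing swing outcomes, enabling ball trajectory prediction.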

Details

Motivation: Existing golf swing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide insights for swing improvement. Method: The paper proposes the CaddieSet dataset, which uses a computer vision approach to extract joint information from swing videos and defines 15 key metrics that influence a swing based on expert knowledge. Result: Experiments show that CaddieSet can be used to predict ball trajectories effectively, and that swing feedback derived from the dataset is quantitatively consistent with established domain knowledge. Conclusion: The work offers a new perspective on golf swing analysis for both academia and the sports industry. Abstract: Recent advances in deep learning have led to more studies to enhance golfers' shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.

[78] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection

Xuanming Cao,Chengyu Tao,Yifeng Cheng,Juan Du

Main category: cs.CV

TL;DR: This paper introduces IAENet, an ensemble framework that effectively combines 2D and 3D models for surface anomaly detection, achieving state-of-the-art results with a lower false positive rate.

Details

Motivation: The key bottleneck in 3D point cloud-based surface anomaly detection is the lack of powerful pretrained foundation backbones compared to 2D methods. The authors aim to bridge this gap by leveraging both 2D and 3D models. Method: The proposed Importance-Aware Ensemble Network (IAENet) combines 2D pretrained expert models with 3D expert models using an Importance-Aware Fusion (IAF) module that dynamically reweights anomaly scores based on each source's contribution. Result: Extensive experiments on the MVTec 3D-AD dataset show that IAENet outperforms existing methods by achieving a new state-of-the-art performance with a significantly reduced false positive rate. Conclusion: IAENet achieves a new state-of-the-art in surface anomaly detection with a markedly lower false positive rate, demonstrating its practical value for industrial deployment. Abstract: Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. 
Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it not only to combine the collective knowledge of the source experts but also to preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.
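The abstract describes the fusion step only at a high level (assess each source's contribution, then reweight its anomaly scores). As a rough illustration of that idea, the sketch below turns per-source importance logits into weights with a softmax and fuses the scores as a weighted sum. The softmax choice, the function names, and the inputs are assumptions for illustration and are not the IAF module itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(scores, importance_logits):
    """Fuse per-source anomaly scores with importance-derived weights.

    scores: one anomaly score per expert (e.g. a 2D and a 3D expert).
    importance_logits: hypothetical per-expert importance estimates;
    in a learned system these would come from a trained module.
    """
    if len(scores) != len(importance_logits):
        raise ValueError("need one importance logit per score")
    weights = softmax(importance_logits)
    return sum(w * s for w, s in zip(weights, scores))


# Equal importance averages the two experts; a large logit for the
# second expert pulls the fused score toward that expert's score.
balanced = fuse_scores([0.2, 0.8], [0.0, 0.0])
skewed = fuse_scores([0.2, 0.8], [0.0, 10.0])
```

Because the weights sum to one, a source judged unimportant contributes little to the fused score, which is the behavior the abstract attributes to importance-aware reweighting.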

[79] Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent

En Ci,Shanyan Guan,Yanhao Ge,Yilin Zhang,Wei Li,Zhenyu Zhang,Jian Yang,Ying Tai

Main category: cs.CV

TL;DR: DescriptiveEdit is a new image editing framework that improves editing quality by combining a reference image with a text prompt.

Details

Motivation: Existing instruction-based datasets are limited in quality, while inversion-based algorithms unavoidably introduce reconstruction errors. Method: The paper proposes DescriptiveEdit, an editing framework based on descriptive prompts, and introduces a Cross-Attentive UNet that injects reference image features into the text-to-image generation process. Result: On the Emu Edit benchmark, DescriptiveEdit shows higher editing accuracy and consistency, and it integrates seamlessly with extensions such as ControlNet and IP-Adapter. Conclusion: DescriptiveEdit overcomes the limitations of existing image editing methods and improves editing accuracy and consistency. Abstract: Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame 'instruction-based image editing' as 'reference-image-based text-to-image generation', which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.

[80] DCFS: Continual Test-Time Adaptation via Dual Consistency of Feature and Sample

Wenting Yin,Han Sun,Xinru Meng,Ningzhong Liu,Huiyu Zhou

Main category: cs.CV

TL;DR: This paper proposes DCFS, a new method for continual test-time adaptation that improves model performance by reducing noise and error accumulation through dual-path feature consistency and confidence-aware learning.

Details

Motivation: Continual test-time adaptation without access to source data can lead to confusion and learning biases, and existing methods struggle with pseudo-label quality and error accumulation. Method: DCFS, a new CTTA framework that uses dual-path feature consistency and confidence-aware sample learning to reduce pseudo-label noise and error accumulation. Result: The DCFS framework improves feature representation learning and reduces pseudo-label noise through dual classifiers and confidence-based weighting, validated through experiments on datasets like CIFAR10-C, CIFAR100-C, and ImageNet-C. Conclusion: DCFS demonstrates consistent performance in continual test-time adaptation scenarios across various datasets. Abstract: Continual test-time adaptation aims to continuously adapt a pre-trained model to a stream of target domain data without accessing source data. Without access to source domain data, the model focuses solely on the feature characteristics of the target data. Relying exclusively on these features can lead to confusion and introduce learning biases. Currently, many existing methods generate pseudo-labels via model predictions. However, the quality of pseudo-labels cannot be guaranteed and the problem of error accumulation must be solved. To address these challenges, we propose DCFS, a novel CTTA framework that introduces dual-path feature consistency and confidence-aware sample learning. This framework disentangles the whole feature representation of the target data into semantic-related feature and domain-related feature using dual classifiers to learn distinct feature representations. By maintaining consistency between the sub-features and the whole feature, the model can comprehensively capture data features from multiple perspectives. 
Additionally, to ensure that the whole feature information of the target domain samples is not overlooked, we set an adaptive threshold and calculate a confidence score for each sample to carry out loss-weighted self-supervised learning, effectively reducing the noise of pseudo-labels and alleviating the problem of error accumulation. The efficacy of our proposed method is validated through extensive experimentation across various datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, demonstrating consistent performance in continual test-time adaptation scenarios.
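The abstract does not say how the adaptive threshold or the confidence scores are computed. As a minimal sketch of confidence-weighted learning in general, the code below assumes the threshold is the batch-mean confidence and that samples below it are dropped while the rest are weighted by their confidence. Both rules and all names are illustrative assumptions, not the DCFS formulation.

```python
def confidence_weights(confidences):
    """Per-sample loss weights from confidence scores.

    Assumed rule: the threshold adapts to the batch by using its mean
    confidence; samples below it get weight 0, the rest keep their
    confidence as the weight.
    """
    if not confidences:
        raise ValueError("confidences must be non-empty")
    threshold = sum(confidences) / len(confidences)
    return [c if c >= threshold else 0.0 for c in confidences]


def weighted_loss(losses, confidences):
    """Confidence-weighted mean of per-sample losses."""
    if len(losses) != len(confidences):
        raise ValueError("need one confidence per loss")
    weights = confidence_weights(confidences)
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # no sample is trusted, so nothing contributes
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# The low-confidence third sample (with a large, likely noisy loss)
# is excluded from the batch loss.
batch_loss = weighted_loss([1.0, 2.0, 10.0], [0.9, 0.6, 0.1])
```

Down-weighting uncertain samples in this way limits how much a wrong pseudo-label can influence each update, which is the mechanism the abstract credits for reducing error accumulation.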

[81] Adam SLAM - the last mile of camera calibration with 3DGS

Matthieu Gendrin,Stéphane Pateux,Xiaoran Jiang,Théo Ladune,Luce Morin

Main category: cs.CV

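TL;DR: This paper proposes backpropagating the novel view color loss through a 3DGS model to optimize camera calibration, significantly improving novel view synthesis quality.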

Details

Motivation: Camera calibration quality has a major impact on novel view synthesis, but real scenes lack ground truth, so calibration has to be assessed through the quality of novel view synthesis. Method: The paper proposes a calibration method that uses a 3D Gaussian Splatting (3DGS) model to fine-tune camera parameters by backpropagating the novel view color loss. Result: The method yields an average PSNR improvement of 0.4 dB on the reference dataset used by 3DGS, showing that calibration refinement markedly improves novel view synthesis quality. Conclusion: Optimizing camera calibration by backpropagation through a 3DGS model can significantly improve novel view synthesis quality. The process may be time-consuming, but it is worthwhile for calibrating reference scenes that demand high accuracy, such as Mip-NeRF 360. Abstract: The quality of the camera calibration is of major importance for evaluating progress in novel view synthesis, as a 1-pixel error on the calibration has a significant impact on the reconstruction quality. While there is no ground truth for real scenes, the quality of the calibration is assessed by the quality of the novel view synthesis. This paper proposes to use a 3DGS model to fine-tune calibration by backpropagation of novel view color loss with respect to the camera parameters. The new calibration alone brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. The fine-tuning may be slow, and its suitability depends on how critical training time is, but for calibration of reference scenes, such as Mip-NeRF 360, novel view quality matters most.
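For context on the 0.4 dB figure: PSNR is 10·log10(MAX²/MSE), so a fixed dB gain corresponds to a fixed multiplicative reduction in mean squared error. The short sketch below (function names are illustrative and not from the paper) computes PSNR from an MSE value and the MSE ratio implied by a given gain.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.4 dB gain means the new MSE is about 0.912 times the old one.
ratio = mse_ratio_for_gain(0.4)
```

In other words, a 0.4 dB improvement corresponds to roughly 9% lower mean squared error at any starting quality level.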

Jingyun Yang,Guoqing Zhang,Jingge Wang,Yang Li

Main category: cs.CV

TL;DR: This paper introduces a novel active and sequential domain adaptation framework for efficient and effective multi-modal medical image segmentation, particularly for gross tumor volume tasks.

Details

Motivation: To reduce annotation costs in medical image segmentation by selecting the most informative samples to label, particularly addressing limitations in existing ADA methods such as one-off selection causing negative transfer and lack of exploration in multi-modal query strategies. Method: The authors proposed a dynamic multi-modal sample selection approach based on informativeness and representativeness within an active and sequential domain adaptation framework. Result: The method achieved favorable and significantly superior segmentation performance on diverse gross tumor volume tasks compared to state-of-the-art ADA approaches. Conclusion: The proposed active and sequential domain adaptation framework effectively improves gross tumor volume segmentation performance, outperforming existing ADA methods. Abstract: Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. 
We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: https://github.com/Hiyoochan/mmActS.

[83] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection

Mingqian Ji,Jian Yang,Shanshan Zhang

Main category: cs.CV

TL;DR: This paper proposes a novel data-level fusion framework for unsupervised 3D object detection that integrates RGB images and LiDAR data at an early stage, resulting in significant performance improvements over existing methods.

Details

Motivation: Existing LiDAR-based 3D object detectors rely on manually annotated labels, which are time-consuming and labor-intensive to obtain. Current unsupervised methods that use RGB images for pseudo-box generation only integrate pseudo-boxes at the label level, which limits performance improvements. Method: A bi-directional fusion method that integrates RGB images and LiDAR data at an early stage, along with a local and global filtering method and a dynamic self-evolution strategy. Result: Extensive experiments on the nuScenes dataset demonstrate that the detector trained by the proposed method outperforms previous state-of-the-art methods, reaching 28.4% mAP on the nuScenes validation benchmark. Conclusion: The proposed data-level fusion framework significantly improves the performance of unsupervised 3D object detection compared to existing methods. Abstract: Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. 
To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
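The abstract names local radius filtering and global statistical filtering without giving their exact rules. The sketch below shows the two generic point-cloud outlier filters those names usually refer to: a point is kept by the radius filter if it has enough neighbors nearby, and by the statistical filter if its distance to the centroid is not unusually large. The thresholds, the centroid-distance statistic, and the function names are assumptions for illustration, not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def radius_filter(points, radius, min_neighbors):
    """Keep points that have at least `min_neighbors` other points within `radius`."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, num_std=1.0):
    """Drop points whose centroid distance exceeds mean + num_std * std."""
    if len(points) < 2:
        return list(points)
    centroid = tuple(mean(axis) for axis in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    limit = mean(dists) + num_std * pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= limit]


# Hypothetical toy cloud: a tight cluster plus one stray point.
cluster = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0)]
cloud = cluster + [(5.0, 5.0, 5.0)]
locally_clean = radius_filter(cloud, radius=0.5, min_neighbors=2)
globally_clean = statistical_filter(cloud, num_std=1.0)
```

The brute-force neighbor search is quadratic in the number of points, so a real pipeline would use a spatial index such as a k-d tree. The sketch keeps to the standard library to stay self-contained.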

[84] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset

Frederik Rajiv Manichand,Robin Deuber,Robert Jakob,Steve Swerling,Jamie Rosen,Elgar Fleisch,Patrick Langer

Main category: cs.CV

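TL;DR: This paper presents a deep learning-based BMI estimation method that uses the large-scale WayBED dataset and automatic image filtering, achieves the lowest error reported in the literature to date, and is deployed on mobile devices.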

Details

Motivation: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Method: The authors build WayBED, a large-scale dataset of 84,963 smartphone images, and develop a deep learning-based BMI estimation method. They also curate the dataset automatically, using posture clustering and person detection to remove low-quality images. Result: On the WayBED test set, the method achieves a Mean Absolute Percentage Error (MAPE) of 7.9% using full-body images, and a MAPE of 13% on the completely unseen VisualBodyToBMI dataset. After fine-tuning, the model reaches a MAPE of 8.56% on VisualBodyToBMI. Conclusion: Using a deep learning model and the large-scale WayBED dataset, the study achieves the lowest reported BMI estimation error and demonstrates the potential for deployment on mobile devices. Abstract: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework. We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
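
For readers unfamiliar with the reported metric, the sketch below shows how BMI and the Mean Absolute Percentage Error (MAPE) are conventionally computed. The function names and the sample values are hypothetical and are not taken from the paper or its released code.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumes every ground-truth value is non-zero, which holds for BMI.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical ground-truth and predicted BMI values for two people.
true_bmi = [20.0, 25.0]
predicted_bmi = [22.0, 25.0]
error_percent = mape(true_bmi, predicted_bmi)  # one 10% error and one 0% error
```

Because MAPE is relative to the true value, the same absolute BMI error weighs more heavily for individuals with a low BMI.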

[85] Domain Adaptation Techniques for Natural and Medical Image Classification

Ahmad Chaddad,Yihang Wu,Reem Kateb,Christian Desrosiers

Main category: cs.CV

TL;DR: This study evaluates domain adaptation techniques for image classification in natural and medical datasets, highlighting the strong performance and explainability of the DSAN algorithm, particularly in medical contexts like the COVID-19 and skin cancer datasets.

Details

Motivation: To better understand the benefits of domain adaptation for both natural and medical images, considering the challenges in working with medical data and potential biases in mainstream datasets. Method: The study uses 557 simulation studies across seven DA techniques, applied to five natural and eight medical datasets, focusing on scenarios like out-of-distribution and dynamic data streams. Result: DSAN demonstrated high classification accuracy (91.2%) on the COVID-19 dataset and a significant improvement (+6.7%) in dynamic data stream scenarios, along with notable explainability on medical datasets. Conclusion: The study concludes that DA techniques, particularly DSAN, are effective in adapting models for both natural and medical images, offering high performance and explainability. Abstract: Domain adaptation (DA) techniques have the potential to alleviate distribution differences between training and test sets in machine learning by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples. Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm. 
This algorithm achieved feasible classification accuracy (91.2%) on the COVID-19 dataset using ResNet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits a remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.

[86] Contrastive Learning through Auxiliary Branch for Video Object Detection

Lucas Rakotoarivony

Main category: cs.CV

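TL;DR: This paper proposes CLAB, a new method that uses contrastive learning to improve video object detection performance without adding computational load at inference time.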

Details

Motivation: Video object detection is challenging because videos often suffer from image degradation such as motion blur, occlusion, and deformable shapes, making detection much harder than in still images. Method: The authors implement a contrastive auxiliary branch trained with a contrastive loss and propose a dynamic loss weighting strategy. Result: Comprehensive experiments and ablation studies validate the method, which reaches 84.0% mAP and 85.2% mAP on the ImageNet VID dataset with ResNet-101 and ResNeXt-101, respectively. Conclusion: The paper proposes CLAB, a method that strengthens feature representation in video object detection, improving robustness to image degradation without additional computational load during inference. Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector's backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
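
The dynamic loss weighting described above can be pictured as a schedule that starts with a high weight on the auxiliary contrastive loss and lowers it as training progresses. The abstract does not give the schedule, so the linear decay, the function names, and the default weights below are assumptions made purely for illustration.

```python
def auxiliary_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Linearly interpolate the auxiliary-loss weight from w_start to w_end.

    The linear shape is an assumption; the paper may use a different schedule.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress


def total_loss(detection_loss, contrastive_loss, step, total_steps):
    """Combine the detection loss with the down-weighted auxiliary loss."""
    return detection_loss + auxiliary_weight(step, total_steps) * contrastive_loss


# Hypothetical loss values observed halfway through a 100-step run.
halfway = total_loss(detection_loss=2.0, contrastive_loss=4.0, step=50, total_steps=100)
```

Since the auxiliary branch only contributes a training loss, dropping it at inference time leaves the detector's computational cost unchanged, which matches the claim in the abstract.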

[87] Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Lorenz Hufe,Constantin Venhoff,Maximilian Dreyer,Sebastian Lapuschkin,Wojciech Samek

Main category: cs.CV

TL;DR: This paper introduces a training-free method to defend CLIP models against typographic attacks by ablating attention heads responsible for typographic information, resulting in improved robustness while maintaining overall accuracy.

Details

Motivation: Typographic attacks pose a threat to multi-modal systems by causing targeted misclassifications and malicious content generation. Understanding how CLIP vision encoders process typographic information and developing defenses against such attacks is crucial for safety-critical applications. Method: The paper analyzes CLIP vision encoders under typographic attacks, identifies attention heads responsible for typographic information transmission, and proposes a method to defend against such attacks by selectively ablating these attention heads without requiring finetuning. Result: The proposed method improves performance by up to 19.6% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, making it competitive with state-of-the-art defenses relying on finetuning. Conclusion: The paper concludes that typographic attacks can be effectively mitigated in CLIP models by selectively ablating attention heads responsible for typographic information processing, resulting in robust models suitable for safety-critical applications. Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. 
Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
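
As a rough illustration of what ablating attention heads means, the sketch below treats each head's output as an additive contribution to a shared representation and simply skips the selected heads. This zero-ablation rule, the function name, and the toy numbers are assumptions; the paper's circuit identification and ablation procedure may differ.

```python
def combine_heads(head_outputs, ablated_heads=frozenset()):
    """Sum per-head contributions, skipping any head index in `ablated_heads`.

    `head_outputs` is a list of equal-length vectors, one per attention head.
    """
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for index, contribution in enumerate(head_outputs):
        if index in ablated_heads:
            continue  # zero-ablation: this head contributes nothing
        for d in range(dim):
            combined[d] += contribution[d]
    return combined


# Hypothetical contributions from three heads; head 2 stands in for a
# head that carries typographic information.
heads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
full_output = combine_heads(heads)
defended_output = combine_heads(heads, ablated_heads={2})
```

No weights are updated in this procedure, which is why a defense of this kind can be applied without finetuning.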

[88] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition

Debasis Maji,Debaditya Barman

Main category: cs.CV

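TL;DR: This paper proposes GLaRE, a new graph neural network-based method for facial expression recognition that models relational dependencies between facial landmarks to improve recognition performance and interpretability.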

Details

Motivation: Traditional FER systems face challenges such as occlusion, expression variability, and a lack of interpretability, while Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks. Method: The authors propose GLaRE, a graph-based landmark region embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Result: The method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Conclusion: GLaRE's region-level embeddings contribute to improved prediction performance, and the method outperforms existing baselines on facial expression recognition. Abstract: Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human-computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies demonstrate that region-level embeddings from quotient graphs contribute to improved prediction performance.

[89] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models

Zheng Chong,Yanwei Lei,Shiyue Zhang,Zhuandi He,Zhen Wang,Xujie Zhang,Xiao Dong,Yiling Wu,Dongmei Jiang,Xiaodan Liang

Main category: cs.CV

TL;DR: The paper introduces FastFit, a high-speed multi-reference virtual try-on framework that overcomes efficiency bottlenecks and outperforms state-of-the-art methods on key fidelity metrics.

Details

Motivation: Virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions and their significant inefficiency. Method: We propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. We also introduce DressCode-MR, a new large-scale dataset for research on complex, multi-reference virtual try-on. Result: The model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. Conclusion: FastFit surpasses state-of-the-art methods on key fidelity metrics while offering its significant advantage in inference efficiency. Abstract: Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. 
Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering its significant advantage in inference efficiency.
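
The efficiency argument, that reference features are computed once and reused across all denoising steps, can be sketched with a simple cache. Everything below is a toy stand-in: the class name, the `encode_fn` and `step_fn` callables, and the integer "features" are hypothetical and only illustrate why decoupling reference encoding from the timestep removes redundant work.

```python
class CachedReferenceEncoder:
    """Wrap an encoder so each (reference id, class label) pair is encoded once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._cache = {}
        self.calls = 0  # number of real encoder invocations

    def __call__(self, ref_id, ref_image, class_label):
        key = (ref_id, class_label)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._encode_fn(ref_image, class_label)
        return self._cache[key]


def denoise(latent, references, encoder, num_steps, step_fn):
    """Run a toy denoising loop that looks up reference features at every step."""
    for t in reversed(range(num_steps)):
        feats = [encoder(rid, image, label) for rid, image, label in references]
        latent = step_fn(latent, feats, t)
    return latent


# Toy stand-ins: the "features" are just the sum of the image values, and
# crucially they do not depend on the timestep t.
encoder = CachedReferenceEncoder(lambda image, label: sum(image))
references = [("top", [1, 2], "tops"), ("bag", [3], "bags")]
result = denoise(0, references, encoder, num_steps=4,
                 step_fn=lambda latent, feats, t: latent + sum(feats))
```

The cache is only valid because the encoded features are independent of the timestep; if the encoder took `t` as input, the features would have to be recomputed at every step.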

[90] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching

Yuqi Han,Songqian Zhang,Weijian Su,Ke Li,Jiayu Yang,Jinli Suo,Qiang Zhang

Main category: cs.CV

TL;DR: This paper proposes UTA-Sign, a fusion method combining thermal and event cameras to improve traffic signage detection in low-light environments.

Details

Motivation: Thermal cameras struggle with detecting signage on objects made of similar materials, while event cameras offer high-speed perception in low-light conditions, making their fusion ideal for improving autonomous driving safety. Method: UTA-Sign employs a dual-boosting mechanism that integrates thermal frames with event signals to overcome thermal imaging's signage blind spots and event cameras' non-uniform sampling. Result: The method was validated on real-world datasets, showing improved traffic signage sketching quality and enhanced detection accuracy at the perceptual level. Conclusion: The proposed UTA-Sign method effectively addresses the limitations of thermal and event cameras by fusing their complementary features, resulting in enhanced traffic signage detection in low-illumination environments. Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. 
The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.

[91] Disruptive Attacks on Face Swapping via Low-Frequency Perceptual Perturbations

Mengxiao Huang,Minglei Shu,Shuwang Zhou,Zhaoyang Liu

Main category: cs.CV

TL;DR: This paper proposes an active defense against deepfakes by using low-frequency perceptual perturbations to disrupt facial manipulation models, showing promising results in reducing deepfake effectiveness while maintaining visual quality.

Details

Motivation: Deepfake technology poses risks to privacy and societal security, and current detection methods are largely passive, focusing on post-event analysis without active prevention. Method: The method uses a combination of frequency and spatial domain features, incorporating an encoder, perturbation generator, and decoder that utilizes discrete wavelet transform (DWT) to generate perturbations disrupting facial manipulation models. Result: Experiments on CelebA-HQ and LFW datasets show significant reductions in face-swapping effectiveness, improved defense success rates, and maintenance of visual quality. Conclusion: The proposed active defense method effectively reduces the performance and naturalness of deepfake content by targeting the generative process through low-frequency perceptual perturbations. Abstract: Deepfake technology, driven by Generative Adversarial Networks (GANs), poses significant risks to privacy and societal security. Existing detection methods are predominantly passive, focusing on post-event analysis without preventing attacks. To address this, we propose an active defense method based on low-frequency perceptual perturbations to disrupt face swapping manipulation, reducing the performance and naturalness of generated content. Unlike prior approaches that used low-frequency perturbations to impact classification accuracy, our method directly targets the generative process of deepfake techniques. We combine frequency and spatial domain features to strengthen defenses. By introducing artifacts through low-frequency perturbations while preserving high-frequency details, we ensure the output remains visually plausible. Additionally, we design a complete architecture featuring an encoder, a perturbation generator, and a decoder, leveraging discrete wavelet transform (DWT) to extract low-frequency components and generate perturbations that disrupt facial manipulation models. 
Experiments on CelebA-HQ and LFW demonstrate significant reductions in face-swapping effectiveness, improved defense success rates, and preservation of visual quality.
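
To illustrate how a perturbation can be confined to low-frequency components while leaving high-frequency detail intact, the sketch below applies a single-level Haar DWT to a 1D signal, shifts only the approximation coefficients, and inverts the transform. The Haar wavelet, the 1D setting, and the hand-picked perturbation are assumptions; the paper works on images and learns the perturbation with a generator.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """Single-level Haar DWT of an even-length sequence.

    Returns (approximation, detail): low- and high-frequency coefficients.
    """
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def perturb_low_frequency(signal, delta):
    """Add `delta` to the approximation coefficients only, then reconstruct."""
    approx, detail = haar_dwt(signal)
    shifted = [a + e for a, e in zip(approx, delta)]
    return haar_idwt(shifted, detail)


# Hypothetical pixel row and a hand-picked low-frequency perturbation.
row = [1.0, 3.0, 2.0, 2.0]
protected = perturb_low_frequency(row, delta=[0.1, 0.0])
```

Because the detail coefficients are untouched, the difference between neighboring samples within each pair is preserved, which is the 1D analogue of keeping high-frequency detail.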

[92] Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion

Zheng Qin,Yabing Wang,Minghui Yang,Sanping Zhou,Ming Yang,Le Wang

Main category: cs.CV

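TL;DR: This paper proposes Diverse-T2M, a method for generating diverse 3D human motions from text that introduces uncertainty and constructs a latent space to increase the diversity of generated results while maintaining text-motion consistency.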

Details

Motivation: Despite recent progress in text-to-motion generation, the diversity of generated motions remains a significant challenge, which this paper aims to address. Method: The method introduces uncertainty into the generation process, uses noise signals as carriers of diversity information, constructs a latent space that gives text a continuous representation, and integrates a latent space sampler for stochastic sampling. Result: Experiments on the HumanML3D and KIT-ML datasets show that the method significantly improves the diversity of generated motions while maintaining text consistency. Conclusion: Diverse-T2M produces diverse text-to-motion outputs while maintaining text-motion consistency, achieving state-of-the-art performance. Abstract: Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.

[93] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction

Karl-Philippe Beaudet,Sidaty El Hadramy,Philippe C Cattin,Juan Verde,Stéphane Cotin

Main category: cs.CV

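TL;DR: This paper proposes an optimization-based calibration method that uses a 3D-printed phantom for accurate 3D intravascular ultrasound volume reconstruction, with the aim of improving intraoperative navigation in liver surgery.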

Details

Motivation: Intraoperative ultrasound images are difficult to interpret in liver surgery because of the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. Method: The authors propose an optimization-based calibration method that uses a 3D-printed phantom for accurate 3D intravascular ultrasound volume reconstruction. Result: The method was validated on in vivo swine liver images, with a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm. Conclusion: The method provides a reliable and accurate means of calibration and volume reconstruction; it can be used to register intraoperative ultrasound images with preoperative CT images and to enhance intraoperative guidance. Abstract: Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.

[94] Physics Informed Generative Models for Magnetic Field Images

Aye Phyu Phyu Aung,Lucas Lum,Zhansen Shi,Wen Qiu,Bernice Zee,JM Chin,Yeow Kheng Lim,J. Senthilnath

Main category: cs.CV

TL;DR: This paper introduces PI-GenMFI, a physics-informed generative model for creating synthetic Magnetic Field Imaging (MFI) data to enhance machine learning-based defect detection in semiconductors, showing promising results in improving efficiency and accuracy.

Details

Motivation: The motivation stems from the need for efficient and accurate defect detection in semiconductor manufacturing. While X-ray imaging is reliable, it is resource-intensive. Magnetic Field Imaging (MFI) offers a more efficient alternative, but the lack of available MFI datasets due to proprietary concerns hampers machine learning model training, prompting the development of a synthetic data generation approach. Method: The study employs diffusion models enhanced with two physical constraints to generate synthetic Magnetic Field Imaging (MFI) samples. The method, termed Physics-Informed Generative Models for Magnetic Field Images (PI-GenMFI), integrates specific physical information to improve the accuracy and applicability of synthetic MFI images for training machine learning algorithms. Result: The PI-GenMFI model successfully generates synthetic MFI images for common defects like power shorts. The evaluation, including comparisons with state-of-the-art generative models and domain expert assessments, demonstrates promising results. Qualitative and quantitative metrics used in image generation and signal processing confirm the effectiveness of the proposed method in optimizing the defect localization process. Conclusion: The study concludes that the proposed PI-GenMFI model effectively generates synthetic MFI images for common defect types, particularly power shorts, which can serve as training data for ML algorithms to improve defect localization. The evaluation against state-of-the-art generative models shows promising results in optimizing the defect detection process. Abstract: In semiconductor manufacturing, defect detection and localization are critical to ensuring product quality and yield. While X-ray imaging is a reliable non-destructive testing method, it is memory-intensive and time-consuming for large-scale scanning. Magnetic Field Imaging (MFI) offers a more efficient means to localize regions of interest (ROI) for targeted X-ray scanning. 
However, the limited availability of MFI datasets due to proprietary concerns presents a significant bottleneck for training machine learning (ML) models using MFI. To address this challenge, we consider an ML-driven approach leveraging diffusion models with two physical constraints. We propose Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples by integrating specific physical information. We generate MFI images for the most common defect types: power shorts. These synthetic images will serve as training data for ML algorithms designed to localize defect areas efficiently. To evaluate generated MFIs, we compare our model to SOTA generative models from both variational autoencoder (VAE) and diffusion methods. We present a domain expert evaluation to assess the generated samples. In addition, we present qualitative and quantitative evaluation using various metrics used for image generation and signal processing, showing promising results to optimize the defect localization process.

[95] Improving Alignment in LVLMs with Debiased Self-Judgment

Sihan Yang,Chenhang Cui,Zihao Zhao,Yiyang Zhou,Weilong Yan,Ying Wei,Huaxiu Yao

Main category: cs.CV

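TL;DR: This paper proposes a new method to improve alignment in Large Visual-Language Models (LVLMs) by generating a debiased self-judgment score, allowing the model to improve alignment autonomously without relying on external resources.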

Details

Motivation: Effectively aligning the visual and linguistic modalities remains challenging, and existing alignment methods often rely on external datasets, human annotations, or complex post-processing, which limits scalability and increases costs. Method: The authors propose a new method in which the model evaluates itself using an internally generated debiased self-judgment score, thereby improving alignment. Result: Experiments show that the approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs. Conclusion: The method reduces hallucinations, enhances safety, and improves overall capability. Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.

[96] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization

Yixiang Qiu,Yanhan Liu,Hongyao Yu,Hao Fang,Bin Chen,Shu-Tao Xia,Ke Xu

Main category: cs.CV

TL;DR: This paper introduces a novel GAN-based Data Reconstruction Attack with Progressive Feature Optimization to enhance the quality and generalizability of input data reconstruction in Split Inference systems, particularly for deeper and more complex models.

Details

Motivation: The paper addresses the growing privacy risks in Split Inference (SI) systems, where intermediate features can be exploited by Data Reconstruction Attacks (DRAs) to recover sensitive input data. Current DRAs are limited in their effectiveness, particularly on deeper models and across diverse datasets, prompting the need for a more robust and generalizable solution. Method: A GAN-based Data Reconstruction Attack (DRA) framework with Progressive Feature Optimization (PFO) is introduced, where the generator is decomposed into hierarchical blocks that incrementally refine intermediate features. An L1-ball constraint is also applied to stabilize optimization and enhance image realism. Result: The proposed method significantly outperforms previous DRA techniques, achieving higher reconstruction quality across various scenarios, including high-resolution images, out-of-distribution data, and complex, deeper neural network architectures. Conclusion: The proposed GAN-based DRA framework with PFO demonstrates superior performance in reconstructing high-quality images, offering significant improvements over existing methods, especially in challenging scenarios involving high-resolution images, out-of-distribution settings, and deeper DNNs. Abstract: The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. 
In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.
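
One common way to enforce an L1-ball constraint during optimization is to project the variable back onto the ball after each update. Whether the paper uses projection or another mechanism is not stated in the abstract, so the sketch below is an assumption: it implements the standard sort-and-threshold Euclidean projection onto an L1 ball, with a hypothetical function name and toy inputs.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of vector `v` onto {x : sum(|x_i|) <= radius}.

    Uses the standard sort-and-threshold procedure (soft-thresholding with a
    data-dependent threshold theta).
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball

    theta = 0.0
    cumulative = 0.0
    for k, u_k in enumerate(sorted(magnitudes, reverse=True), start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k - candidate > 0:
            theta = candidate  # last k that satisfies the condition wins

    # Shrink every magnitude by theta, clip at zero, and restore the sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


# Hypothetical latent offset that violates an L1 budget of 2.0.
projected = project_onto_l1_ball([3.0, -1.0], radius=2.0)
```

The projection shrinks large coordinates and zeroes small ones, which keeps the optimized variable close to its starting region and may help stabilize reconstruction in the way the abstract describes.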

Fartash Faghri,Pavan Kumar Anasosalu Vasu,Cem Koc,Vaishaal Shankar,Alexander Toshev,Oncel Tuzel,Hadi Pouransari

Main category: cs.CV

TL;DR: MobileCLIP2 improves zero-shot image classification performance at low latency and reduced model size compared to previous methods.

Details

Motivation: To enhance the performance of low-latency, lightweight image-text models like MobileCLIP in zero-shot learning scenarios. Method: Improved multi-modal reinforced training through better CLIP and captioner teacher ensembles, temperature tuning in contrastive distillation, and combining synthetic captions from multiple models. Result: MobileCLIP2-B achieved a 2.2% improvement in ImageNet-1k accuracy over MobileCLIP-B; MobileCLIP2-S4 matched SigLIP-SO400M/14's accuracy while being 2× smaller and improved over DFN ViT-L/14 at 2.5× lower latency. Conclusion: MobileCLIP2 achieves state-of-the-art ImageNet-1k zero-shot accuracies at low latencies, with improvements from enhanced CLIP and captioner teacher models, and knowledge distillation techniques. Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. 
In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.

[98] EmoCAST: Emotional Talking Portrait via Emotive Text Description

Yiguo Jiang,Xiaodong Cun,Yong Zhang,Yudian Zheng,Fan Tang,Chi-Man Pun

Main category: cs.CV

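TL;DR: This paper proposes EmoCAST, a framework that achieves precise text-driven emotional synthesis through two key modules, and builds a dataset with emotive text descriptions to optimize the framework's performance, reaching state-of-the-art results in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.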

Details

Motivation: Existing methods still have limitations in control flexibility, motion naturalness, and expression quality, and currently available datasets are mostly collected in lab settings, which limits practical applications. Method: EmoCAST is proposed, comprising a text-guided decoupled emotive module and an emotive audio attention module, together with a new emotional talking-head dataset and optimized training strategies. Result: EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Conclusion: EmoCAST effectively addresses key challenges in emotional talking-head synthesis, offering greater control flexibility, more natural motion, and higher-quality emotional expression. Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework's performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model's ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST

[99] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering

Paritosh Parmar,Eric Peh,Basura Fernando

Main category: cs.CV

TL;DR: This paper introduces a modular framework for Causal-Why VideoQA that enhances interpretability and reasoning through natural language causal chains.

Details

Motivation: Existing models struggle with higher-order reasoning and lack interpretability due to entangled pipelines. Method: A two-stage architecture with a Causal Chain Extractor (CCE) and a Causal Chain-Driven Answerer (CCDA) that uses natural language causal chains as intermediate representations. Result: Experiments show superior performance on three benchmarks and improvements in explainability, user trust, and generalization. Conclusion: The proposed modular framework improves interpretability and performance in Causal-Why VideoQA tasks by decoupling causal reasoning from answer generation. Abstract: Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. 
Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

[100] Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification

Smriti Joshi,Lidia Garrucho,Richard Osuala,Oliver Diaz,Karim Lekadir

Main category: cs.CV

TL;DR: A SwinUNETR-based AI model was developed for improved breast cancer detection using MRI, achieving strong results in a multi-center challenge.

Details

Motivation: Breast cancer is a leading cause of cancer-related mortality in women, and early detection is crucial for improving outcomes, particularly in high-risk women or those with dense breast tissue where mammography is less effective. Method: The study used a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Result: The method achieved second place on the ODELIA challenge leaderboard, demonstrating its effectiveness in breast cancer diagnosis and classification using MRI data from multiple centers and vendors. Conclusion: The study concludes that their SwinUNETR-based deep learning framework has significant potential to enhance clinical breast MRI interpretation, achieving second place in the ODELIA consortium challenge. Abstract: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.

[101] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images

Shiqi Xin,Xiaolin Zhang,Yanbin Liu,Peng Zhang,Caifeng Shan

Main category: cs.CV

TL;DR: AvatarBack is a novel framework that enhances 3D Gaussian head avatar reconstruction by explicitly modeling missing back-head regions using a subject-specific generator and adaptive spatial alignment.

Details

Motivation: Existing Gaussian Splatting methods mainly rely on frontal-view images, leading to poor reconstruction of the back-head, causing geometric inconsistencies and reduced realism. Method: AvatarBack uses a Subject-specific Generator (SSG) to generate back-view pseudo-images and an Adaptive Spatial Alignment Strategy (ASA) to align synthetic views with the 3D Gaussian representation. Result: AvatarBack significantly improves back-head reconstruction quality while maintaining frontal fidelity, with consistent visual realism under diverse motions and full animatability. Conclusion: AvatarBack provides a solution to the incomplete back-head reconstruction in Gaussian Splatting for 3D head avatars, achieving high-quality, consistent, and animatable reconstructions. Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. 
To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.

[102] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation

Francois Poh,Anjith George,Sébastien Marcel

Main category: cs.CV

TL;DR: This study improves face recognition in artworks by combining fine-tuned foundation models with conventional facial recognition networks.

Details

Motivation: Automated facial recognition performs poorly on paintings because of domain shift and high intra-class variation, so the potential of foundation models to improve face recognition in artworks needs to be explored. Method: Foundation models are fine-tuned and their embeddings are combined with those from conventional facial recognition networks to improve face recognition in artworks. Result: The results show that foundation models bring notable improvements over current state-of-the-art methods for face recognition in artworks. Conclusion: By fine-tuning foundation models and combining their embeddings with those of conventional facial recognition networks, foundation models can bridge the gap where traditional methods are ineffective. Abstract: Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at https://www.idiap.ch/paper/artface/

[103] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models

Ayan Banerjee,Fernando Vilariño,Josep Lladós

Main category: cs.CV

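TL;DR: CraftGraffiti is an end-to-end text-guided graffiti generation framework with facial feature preservation as a primary objective; by applying style transfer first and enforcing identity afterward, it achieves style transfer while preserving facial features.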

Details

Motivation: Preserving facial identity under extreme stylistic transformation is a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions can erase the subject's recognizability, undermining both personal and cultural authenticity. Method: CraftGraffiti applies style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments the attention layers. Result: Experiments show that CraftGraffiti performs well on facial feature consistency and reaches state-of-the-art aesthetic and human preference scores. Conclusion: CraftGraffiti achieves style transfer while preserving facial features, offering a principled approach to AI-assisted artistic creation. Abstract: Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject's recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the "style-first, identity-after" paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system's real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.

[104] "Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Anastasios Skoularikis,Stefanos-Iordanis Papadopoulos,Symeon Papadopoulos,Panagiotis C. Petrantonakis

Main category: cs.CV

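TL;DR: This paper introduces the S-HArM dataset for intent-aware classification and explores several model training strategies, showing that models trained on image-guided and multimodally-guided data perform better on real-world content.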

Details

Motivation: Most existing methods for detecting synthetic and out-of-context content overlook the intent behind AI-generated images; this paper aims to fill that gap. Method: The authors introduce the S-HArM dataset and explore three prompting strategies to construct a large-scale synthetic training dataset. They conduct an extensive comparative study covering modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Result: Models trained on image-guided and multimodally-guided data perform better on "in the wild" content, but overall performance remains limited, highlighting the complexity of inferring intent. Conclusion: S-HArM is a multimodal dataset for intent-aware classification that can advance research on detecting synthetic and out-of-context content. The study underscores the complexity of inferring intent and the need for specialized architectures. Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

[105] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network

Chenhao Zhang,Wei Gao

Main category: cs.CV

TL;DR: This paper proposes a dynamic neural video compression framework for variable bitrate scenarios, achieving better rate control and overall performance compared to existing methods.

Details

Motivation: Precise rate control remains a challenge in Neural Video Compression due to the limitations of learning-based codecs, especially in variable bitrate scenarios. Method: The framework uses a Dynamic-Route Autoencoder with variable coding routes, a Rate Control Agent to adjust routes at runtime, and Joint-Routes Optimization for collaborative training. Result: Experiments on HEVC and UVG datasets show a 14.8% BD-Rate reduction, 0.47dB BD-PSNR gain, and an average bitrate error of 1.66%. Conclusion: The proposed dynamic video compression framework achieves better performance in terms of BD-Rate reduction, BD-PSNR gain, and bitrate error control, offering effective Rate-Distortion-Complexity Optimization for variable bitrate scenarios. Abstract: Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications. 
Our code is available at https://git.openi.org.cn/OpenAICoding/DynamicDVC.
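
The run-time route selection described above can be illustrated with a small sketch. It is not the authors' Rate Control Agent: it assumes the per-route bitrate estimates are already available as a plain dictionary, picks the route whose estimate is closest to the target, and measures bitrate error as the relative deviation from the target. The names `pick_route` and `bitrate_error`, the route labels, and the relative-error definition are all assumptions made for illustration.

```python
def pick_route(estimated_bitrates, target_bitrate):
    """Return the route whose estimated bitrate is closest to the target.

    `estimated_bitrates` maps a (hypothetical) route name to its estimated
    bitrate; in the paper these estimates come from a learned agent.
    """
    return min(
        estimated_bitrates,
        key=lambda route: abs(estimated_bitrates[route] - target_bitrate),
    )


def bitrate_error(actual_bitrate, target_bitrate):
    """Relative bitrate error |actual - target| / target (assumed definition)."""
    return abs(actual_bitrate - target_bitrate) / target_bitrate


# Hypothetical estimates in bits per pixel for three coding routes.
estimates = {"light": 0.05, "medium": 0.10, "full": 0.20}
chosen = pick_route(estimates, target_bitrate=0.12)
error = bitrate_error(estimates[chosen], target_bitrate=0.12)
```

A greedy nearest-estimate rule like this ignores distortion entirely; it only shows where a bitrate-error figure such as the reported 1.66% would come from under the assumed definition.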

[106] CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network

Reza Akbari Movahed,Abuzar Rezaee,Arezoo Zakeri,Colin Berry,Edmond S. L. Ho,Ali Gooya

Main category: cs.CV

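TL;DR: CardioMorphNet is a new cardiac motion estimation framework that uses an anatomy-aware deep learning approach to achieve more accurate estimates and higher prediction confidence.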

Details

Motivation: Existing cardiac motion estimation methods based on intensity similarity losses struggle to capture heart motion and tend to overlook cardiac anatomical regions, so a more accurate, anatomy-aware method is needed. Method: CardioMorphNet is proposed, a Bayesian deep learning framework based on a recurrent variational autoencoder that models spatio-temporal dependencies over the cardiac cycle and uses two posterior models for bi-ventricular segmentation and motion estimation. Result: Validation on the UK Biobank dataset shows that CardioMorphNet outperforms existing techniques in cardiac motion estimation, and uncertainty assessment indicates higher confidence in its predictions. Conclusion: By using anatomy-based, 3D cardiac shape-guided deformable registration combined with Bayesian modelling and uncertainty estimation, CardioMorphNet outperforms existing methods in cardiac motion estimation. Abstract: Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to capture heart motion accurately because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank dataset by comparing warped mask shapes with ground truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions.

[107] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification

Kaustubh Atey,Sameer Anand Jha,Gouranga Bala,Amit Sethi

Main category: cs.CV

TL;DR: The paper presents a training method for domain-robust classification of atypical mitotic figures, involving style perturbations, feature alignment using weak domain labels, and prediction stabilization using an EMA teacher. The method performs well on the MIDOG 2025 Task 2 preliminary leaderboard with balanced accuracy, sensitivity, specificity, and ROC AUC metrics.

Details

Motivation: Atypical mitotic figures are important histopathological markers that are challenging to identify consistently, particularly under domain shift due to differences in scanner, stain, and acquisition. The motivation is to develop a domain-robust classification method for AMFs. Method: The method involves three steps: (i) increasing feature diversity via style perturbations, (ii) aligning features across sites using weak domain labels through an auxiliary alignment loss, and (iii) stabilizing predictions by distilling from an exponential moving average teacher with temperature-scaled KL divergence. Result: The approach attained balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499 on the organizer-run preliminary leaderboard for atypical mitosis classification. Conclusion: The method described is a competitive submission for the MIDOG 2025 challenge due to its strong, balanced performance, negligible inference-time overhead, and reliance only on coarse domain metadata. Abstract: Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. 
The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.
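
Step (iii) of the recipe, distilling from an EMA teacher with a temperature-scaled KL divergence, can be sketched in plain Python. This is a generic illustration rather than the submission's code: the decay value, the temperature, and the conventional multiplication of the loss by the squared temperature are assumptions, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL between teacher and student predictions.

    The temperature**2 factor is a common convention that keeps gradient
    magnitudes comparable across temperatures; it is assumed here.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_params, student_params)
    ]


# Two-class example (e.g. atypical vs. normal mitotic figure).
loss = distillation_loss(student_logits=[1.0, -1.0], teacher_logits=[2.0, -2.0])
new_teacher = ema_update(teacher_params=[1.0, 0.0], student_params=[0.0, 1.0])
```

Because the teacher is only an averaged copy of the student's weights and is not needed at test time, this kind of distillation is consistent with the negligible inference-time overhead the abstract reports.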

[108] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang,Zhimin Li,Yuhang Zang,Yujie Zhou,Jiazi Bu,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang

Main category: cs.CV

TL;DR: This paper proposes Pref-GRPO to stabilize reinforcement learning in text-to-image generation and introduces UniGenBench, a detailed benchmark for evaluating T2I models.

Details

Motivation: Current T2I methods using pointwise reward models are prone to reward hacking due to score amplification after normalization, which destabilizes training. Method: Pref-GRPO uses pairwise preference reward optimization instead of pointwise reward scoring, comparing images within groups to determine a win rate as the reward signal. Result: Extensive experiments show that Pref-GRPO provides more stable training and better differentiation of subtle image quality differences, while UniGenBench offers a detailed evaluation framework for T2I models. Conclusion: The paper concludes that Pref-GRPO effectively mitigates reward hacking in T2I generation while UniGenBench provides a comprehensive benchmark for evaluating T2I models. Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. 
To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
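
The amplification effect and the win-rate reward described in the abstract can be made concrete with a short sketch. It assumes the usual group standardization (subtract the group mean, divide by the group standard deviation) as the normalization step, and it stands in for the preference reward model with a hypothetical `prefer` callable; neither detail is taken from the paper's code.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rates(items, prefer):
    """Reward each item with the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a hypothetical stand-in for a preference reward model
    and should return True when `a` is preferred over `b`.
    """
    n = len(items)
    rates = []
    for i in range(n):
        wins = sum(
            1 for j in range(n) if j != i and prefer(items[i], items[j])
        )
        rates.append(wins / (n - 1))
    return rates


# Four images whose pointwise scores differ only in the third decimal place.
pointwise_scores = [0.800, 0.801, 0.802, 0.803]

# After standardization the near-identical scores are spread over a wide
# range, which is the "illusory advantage" the abstract warns about.
advantages = group_normalize(pointwise_scores)

# Toy preference model: prefers the item with the higher underlying score.
rates = win_rates(pointwise_scores, prefer=lambda a, b: a > b)
```

In this toy case the preference model is just a score comparison, so the ranking is unchanged; the point of the sketch is only the shape of the reward signal, which is bounded in [0, 1] and depends on pairwise outcomes rather than on raw score gaps.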

[109] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting

Yuxi Hu,Jun Zhang,Kuangyi Chen,Zhe Zhang,Friedrich Fraundorfer

Main category: cs.CV

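TL;DR: This paper proposes C3-GS, a framework that enhances feature learning through context-aware, cross-dimension, and cross-scale constraints to achieve high-quality novel view synthesis.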

Details

Motivation: Existing methods fall short in encoding discriminative, multi-view consistent features for Gaussian prediction and struggle to construct accurate geometry from sparse views. Method: C3-GS integrates three lightweight modules into a unified rendering pipeline, enhancing feature learning and enabling photorealistic synthesis. Result: Experiments show that C3-GS achieves state-of-the-art rendering quality and generalization ability on benchmark datasets. Conclusion: C3-GS achieves state-of-the-art rendering quality and generalization ability. Abstract: Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, and thus struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.

[110] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

Jiawen Lin,Shiran Bian,Yihang Zhu,Wenbin Tan,Yachao Zhang,Yuan Xie,Yanyun Qu

Main category: cs.CV

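TL;DR: This paper proposes SeqVLM, a zero-shot 3D visual grounding framework that uses multi-view images and a dynamic scheduling mechanism to significantly improve natural-language grounding in 3D scenes, with stronger potential for real-world use.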

Details

Motivation: Existing zero-shot 3D visual grounding methods rely on single-view localization and therefore suffer from limited spatial reasoning, contextual omissions, or detail degradation; the proposed method aims to address these issues and improve zero-shot 3D visual grounding in real-world applications. Method: SeqVLM first generates instance proposals with a 3D semantic segmentation network and keeps only relevant candidates through semantic filtering; it then uses a proposal-guided multi-view projection strategy to project the candidates onto real scene image sequences; finally, a dynamic scheduling mechanism iteratively processes sequence-query prompts, using the VLM's cross-modal reasoning to identify the object specified by the text. Result: On the ScanRefer and Nr3D benchmarks, SeqVLM achieves Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively. Conclusion: By leveraging multi-view real-world scene images and a dynamic scheduling mechanism, SeqVLM achieves state-of-the-art performance in zero-shot 3D visual grounding and advances the generalization of 3DVG in practical applications. Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion process of 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging VLM's cross-modal reasoning capabilities to identify textually specified objects. 
Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
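
For readers unfamiliar with the Acc@0.25 metric quoted above, the following sketch shows one way it can be computed: a prediction counts as correct when the 3D IoU between the predicted and ground-truth boxes reaches the threshold. The sketch assumes axis-aligned boxes given as (min corner, max corner) pairs and treats an IoU exactly at the threshold as a hit; both choices are assumptions, and benchmark toolkits may differ in such details.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1))."""
    lo, hi = box
    volume = 1.0
    for a, b in zip(lo, hi):
        volume *= max(0.0, b - a)
    return volume


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(a, b) for a, b in zip(box_a[0], box_b[0]))
    hi = tuple(min(a, b) for a, b in zip(box_a[1], box_b[1]))
    intersection = box_volume((lo, hi))
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.25):
    """Fraction of predictions whose IoU with the ground truth >= threshold."""
    hits = sum(
        1
        for pred, gt in zip(predictions, ground_truths)
        if iou_3d(pred, gt) >= threshold
    )
    return hits / len(predictions)


# Two boxes of volume 8 that overlap in a 1 x 2 x 2 region.
predicted = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
target = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
overlap = iou_3d(predicted, target)
```

With an intersection of 4 and a union of 12, this pair has an IoU of one third, so it counts as correct at the 0.25 threshold but not at 0.5.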

[111] Occlusion Robustness of CLIP for Military Vehicle Classification

Jan Erik van Woerden,Gertjan Burghouts,Lotte Nijskens,Alma M. Liezenga,Sabina van Rooij,Frank Ruis,Hugo J. Kuijf

Main category: cs.CV

TL;DR: This study investigates the robustness of CLIP models under occlusion in military environments, revealing that Transformer-based models perform best, and that performance drops significantly with increasing occlusion unless the model is fine-tuned.

Details

Motivation: Vision-language models like CLIP have potential for defense applications due to their ability to function with limited labeled data. However, their robustness in challenging military environments with partial occlusion and low signal-to-noise ratio (SNR) is not well understood. Method: The researchers evaluated CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes. Performance was measured using Normalized Area Under the Curve (NAUC) across varying occlusion percentages. Result: 1) Transformer-based CLIP models outperformed CNNs. 2) Fine-grained, dispersed occlusions had a more significant negative impact on performance than larger contiguous occlusions. 3) Linear-probed models experienced a sharp performance drop at around 35% occlusion. 4) Finetuning the model's backbone delayed the performance drop to over 60% occlusion. Conclusion: The study concludes that occlusion-specific training augmentations are crucial for improving the robustness of CLIP models in real-world military applications. Further exploration into patch-level sensitivity and architectural resilience is needed. Abstract: Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. 
Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
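
The abstract does not spell out how NAUC is computed, so the sketch below is only one plausible reading: integrate accuracy over the occlusion fraction with the trapezoidal rule and divide by the width of the occlusion range, so that a model with perfect accuracy at every level scores 1. The function name, the normalization, and the sample numbers are assumptions for illustration and are not results from the paper.

```python
def normalized_auc(occlusion_levels, accuracies):
    """Trapezoidal area under accuracy vs. occlusion, divided by the range.

    Assumes `occlusion_levels` is sorted in ascending order and that both
    sequences have the same length (at least two points).
    """
    if len(occlusion_levels) != len(accuracies) or len(occlusion_levels) < 2:
        raise ValueError("need two or more matching (level, accuracy) points")
    area = 0.0
    for i in range(1, len(occlusion_levels)):
        width = occlusion_levels[i] - occlusion_levels[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area / (occlusion_levels[-1] - occlusion_levels[0])


# Made-up curves: one degrades linearly, one collapses early.
levels = [0.0, 0.5, 1.0]
linear_decay = normalized_auc(levels, [1.0, 0.5, 0.0])
early_collapse = normalized_auc(levels, [1.0, 0.2, 0.0])
```

Under this reading, a model whose accuracy holds up until high occlusion (as reported for the fine-tuned backbone) would score higher than one that drops sharply at moderate occlusion, even if both start at the same unoccluded accuracy.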

Fachri Najm Noer Kartiman,Rasim,Yaya Wihardi,Nurul Hasanah,Oskar Natan,Bambang Wahono,Taufik Ibnu Salim

Main category: cs.CV

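TL;DR: This study proposes the SKGE-Swin architecture, which combines the Swin Transformer with a skip-stage mechanism to improve an autonomous driving model's perception and driving ability in complex environments.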

Details

Motivation: To develop an end-to-end autonomous vehicle model with pixel-to-pixel context awareness. Method: A Swin Transformer with a skip-stage mechanism is used to broaden feature representation, and the model is evaluated on the CARLA platform. Result: Experiments show that the SKGE-Swin architecture achieves a higher Driving Score than existing methods, and an ablation study is used to evaluate the contribution of each component. Conclusion: The SKGE-Swin architecture achieves a better Driving Score than previous methods in adversarial scenarios, demonstrating an enhanced ability to comprehend complex patterns. Abstract: Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer's Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle's surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.

[113] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

Gowreesh Mago,Pascal Mettes,Stevan Rudinac

Main category: cs.CV

TL;DR: This paper surveys progress in understanding abstract concepts in videos, arguing that modern foundation models and historical insights can help machines better align with human reasoning.

Details

Motivation: The motivation stems from the need to bridge the gap between machine and human understanding of abstract concepts in videos, such as justice and freedom, which remain a crucial open challenge in the field. Method: The authors conducted a survey of different tasks and datasets used to understand abstract concepts in video content, and examined how researchers have historically approached these challenges with available tools. Result: The study identifies a long-standing effort by researchers to tackle abstract concept understanding in videos and highlights the importance of building on past work to avoid redundant efforts in the context of modern multi-modal foundation models. Conclusion: The paper concludes that leveraging foundation models and decades of community experience can significantly advance the understanding of abstract concepts in video content, aiding alignment with human reasoning and values. Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. 
We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid ``re-inventing the wheel'' as we start revisiting it in the era of multi-modal foundation models.

[114] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML

Kuniko Paxton,Koorosh Aslansefat,Amila Akagić,Dhavalkumar Thakker,Yiannis Papadopoulos

Main category: cs.CV

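TL;DR: The study proposes a new method that improves the interpretability and reliability of skin lesion classification models, helping to reduce the risk of misdiagnosis and improve patient safety.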

Details

Motivation: Existing explainability methods have reliability issues: LIME-based methods are inconsistent, while CAM-based methods do not consider all classes. Method: Global Class Activation Probabilistic Map Evaluation and SafeML are evaluated on the ISIC datasets using MobileNetV2 and Vision Transformers. Result: The proposed method, Global Class Activation Probabilistic Map Evaluation, analyzes the activation maps of all classes probabilistically at the pixel level, and SafeML strengthens the detection of false diagnoses. Conclusion: Global Class Activation Probabilistic Map Evaluation together with SafeML improves the interpretability and reliability of skin lesion classification models, helping to reduce the risk of misdiagnosis and improve patient safety. Abstract: Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists' diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues, with LIME-based methods suffering from inconsistency, while CAM-based methods fail to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes' activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety. We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.

[115] Evaluating Compositional Generalisation in VLMs and Diffusion Models

Beth Pearson,Bilal Boulbarss,Michael Wray,Martha Lewis

Main category: cs.CV

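TL;DR: This paper investigates whether diffusion classifiers have better compositional generalisation than discriminative vision-language models such as CLIP, finding that generative models perform well on some tasks but that all models still struggle considerably with relational reasoning.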

Details

Motivation: The study explores whether generative diffusion classifiers have better compositional generalisation than discriminative models, given that existing vision-language models such as CLIP perform poorly on compositional semantics tasks. Method: Three models (Diffusion Classifier, CLIP, and ViLT) are evaluated on their ability to bind objects with attributes and relations in zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings, and an analysis of CLIP embeddings probes possible causes of the models' difficulties. Result: The Diffusion Classifier and ViLT perform well on concept binding tasks, but all models struggle significantly on the relational GZSL task; zero-shot classifiers based on diffusion models are competitive with CLIP on certain compositional tasks. Conclusion: Although the Diffusion Classifier and ViLT perform well on concept binding tasks, all models face significant challenges on the relational GZSL task, indicating that vision-language models have broad difficulty with relational reasoning. Abstract: A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a `bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. 
Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip

[116] Surfel-based 3D Registration with Equivariant SE(3) Features

Xueyang Kang,Hang Zhao,Kourosh Khoshelham,Patrick Vandewalle

Main category: cs.CV

TL;DR: A new surfel-based pose learning regression method is proposed for point cloud registration, which considers point orientations and uncertainties, and demonstrates robust performance on real-world datasets.

Details

Motivation: The motivation is to address the limitations of existing point cloud registration methods that ignore point orientations and uncertainties, making them susceptible to noise and aggressive rotations. Method: The method involves initializing surfels from a Lidar point cloud using virtual perspective camera parameters, learning explicit SE(3) equivariant features through SE(3) equivariant convolutional kernels, and utilizing an equivariant convolutional encoder, a cross-attention mechanism, a fully-connected decoder, and a non-linear Huber loss. Result: Experimental results show that the model performs well on both indoor and outdoor datasets, demonstrating its robustness and superiority in handling real point-cloud scans. Conclusion: The proposed surfel-based pose learning regression approach demonstrates robust performance and superiority over state-of-the-art methods in point cloud registration. Abstract: Point cloud registration is crucial for ensuring 3D alignment consistency of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. While various point cloud-based registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties, making the model susceptible to noisy input and aggressive rotations of the input point cloud like orthogonal transformation; thus, it necessitates extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose learning regression approach. Our method can initialize surfels from Lidar point cloud using virtual perspective camera parameters, and learns explicit $\mathbf{SE(3)}$ equivariant features, including both position and rotation through $\mathbf{SE(3)}$ equivariant convolutional kernels to predict relative transformation between source and target scans. 
The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model's superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.
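For readers unfamiliar with the Huber loss named in the abstract, the following is a minimal sketch of its standard scalar form: quadratic for small residuals and linear for large ones, which limits the influence of outliers. How the paper applies it to pose residuals, and the threshold `delta` it uses, are not stated in this digest; the default of 1.0 below is an assumption.

```python
def huber_loss(residual, delta=1.0):
    """Standard Huber loss for a single scalar residual.

    delta is the switch point between the quadratic and linear regimes
    (assumed value; the paper's setting is not given here).
    """
    abs_r = abs(residual)
    if abs_r <= delta:
        # Quadratic region: behaves like half the squared error.
        return 0.5 * residual * residual
    # Linear region: grows with |residual|, so outliers are penalized less harshly.
    return delta * (abs_r - 0.5 * delta)


# Small residuals are penalized quadratically, large ones only linearly.
for r in (0.5, 1.0, 3.0):
    print(r, huber_loss(r))
```

The two branches agree at `abs(residual) == delta`, so the loss is continuous and has a continuous first derivative there.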

[117] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training

Tao Luo,Han Wu,Tong Yang,Dinggang Shen,Zhiming Cui

Main category: cs.CV

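TL;DR: This paper proposes DVCTNet, a novel dual-view co-training network that achieves accurate dental caries detection by combining a global view from panoramic X-ray images with a local view from tooth images.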

Details

Motivation: Because of subtle contrast variations and the diverse morphology of caries lesions, current detection methods often fall short in accuracy, so a more accurate detection method is needed. Method: DVCTNet first uses automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. Two vision foundation models are pretrained separately on the two views, and a Gated Cross-View Attention (GCV-Atten) module dynamically fuses the dual-view features to improve detection. Result: Experiments show that DVCTNet outperforms existing state-of-the-art methods on both a public dataset and a newly curated high-precision dataset, indicating its clinical applicability. Conclusion: By combining global and local view information, DVCTNet substantially improves the accuracy of dental caries detection and shows good prospects for clinical application. Abstract: Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. 
Experimental results demonstrate DVCTNet's superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.

[118] FusionCounting: Robust visible-infrared image fusion guided by crowd counting via multi-task learning

He Li,Xinyu Liu,Weihang Kong,Xingchen Zhang

Main category: cs.CV

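TL;DR: This paper proposes FusionCounting, a new multi-task learning framework that integrates crowd counting into the visible and infrared image fusion (VIF) process, using crowd counting as a direct measure of population density to address the challenges VIF methods face in dense scenes.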

Details

Motivation: Although RGB-T crowd counting has attracted growing attention in recent years, no study has integrated VIF and crowd counting into a unified framework. Crowd counting provides a direct measure of population density with minimal annotation, making it particularly suitable for dense scenes. Method: The paper proposes FusionCounting, a new multi-task learning framework that integrates crowd counting into the VIF process and uses a dynamic loss function weighting strategy and adversarial training to accelerate convergence, balance task contributions, and enhance robustness. Result: Experiments on public datasets show that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance. Conclusion: FusionCounting both improves image fusion quality and achieves superior crowd counting performance. Abstract: Most visible and infrared image fusion (VIF) methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB-T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi-task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model's stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.

[119] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

Krit Duangprom,Tryphon Lambrou,Binod Bhattarai

Main category: cs.CV

TL;DR: This paper proposes a new method for surgical tool keypoint estimation using a LoRA-fine-tuned Vision Language Model, achieving better performance with minimal training.

Details

Motivation: Traditional CNN or Transformer-based methods often overfit small-scale medical datasets. The authors aim to leverage the generalization ability of pre-trained VLMs to improve keypoint estimation performance in such low-resource scenarios. Method: The authors use a Vision Language Model (VLM) fine-tuned with the Low-Rank Adaptation (LoRA) technique. They design prompts to create an instruction-tuning dataset, aligning visual features with semantic descriptions of keypoints. Result: Experimental results show that after only two epochs of fine-tuning, the LoRA-adapted VLM outperforms baseline models, demonstrating its effectiveness in low-resource settings. Conclusion: The paper concludes that the proposed pipeline effectively improves 2D keypoint estimation for surgical tools using a VLM fine-tuned with LoRA, highlighting its potential for future 3D pose estimation applications. Abstract: This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
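To make the LoRA idea concrete, here is a minimal sketch of the general technique in plain Python: the pre-trained weight matrix `W` stays frozen, and only two small matrices `A` (r x d_in) and `B` (d_out x r) are trained, giving an effective weight of `W + (alpha / r) * B @ A`. The rank, the scaling factor `alpha`, the matrix sizes, and the helper names are illustrative assumptions; the digest does not say which layers or hyperparameters the authors used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 2x3 frozen weight with a rank-1 adapter.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[1.0, 0.0, -1.0]]      # r x d_in
b_init = [[0.0], [0.0]]     # d_out x r, zero-initialized so training starts at W
b_trained = [[1.0], [2.0]]  # made-up values standing in for a trained adapter

print(lora_effective_weight(w, a, b_init, alpha=2.0))
print(lora_effective_weight(w, a, b_trained, alpha=2.0))
```

With a zero-initialized `B`, the adapted model starts out identical to the pre-trained one. The adapter holds r * (d_in + d_out) trainable values rather than d_in * d_out, a saving that becomes substantial at the layer sizes of real VLMs.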

[120] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

Hao Yang,Qianyu Zhou,Haijia Sun,Xiangtai Li,Xuequan Lu,Lizhuang Ma,Shuicheng Yan

Main category: cs.CV

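TL;DR: This paper proposes PointDGRWKV, an RWKV framework for domain-generalized point cloud classification (DG PCC) that addresses spatial modeling and cross-domain robustness and achieves state-of-the-art performance.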

Details

Motivation: Existing DG PCC methods based on convolutional networks, Transformer, or Mamba architectures are limited by receptive field, computational cost, or long-range dependency modeling. The emerging RWKV architecture offers linear complexity and a global receptive field, but in DG PCC it suffers from spatial distortion and cross-domain attention shift. Method: The paper proposes Adaptive Geometric Token Shift to improve geometric context awareness and designs Cross-Domain key feature Distribution Alignment to reduce attention drift. Result: PointDGRWKV performs strongly on multiple benchmarks, improving the generalization and robustness of RWKV models for point cloud classification in unseen domains. Conclusion: The PointDGRWKV framework achieves state-of-the-art performance on multiple DG PCC benchmarks and addresses the spatial modeling and cross-domain robustness problems of RWKV models in point cloud classification. Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, suffering from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. 
Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.

[121] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis

Ye Zhang,Yu Zhou,Jingwen Qi,Yongbing Zhang,Simon Puettmann,Finn Wichmann,Larissa Pereira Ferreira,Lara Sichward,Julius Keyl,Sylvia Hartmann,Shuo Zhao,Hongxiao Wang,Xiaowei Xu,Jianxu Chen

Main category: cs.CV

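TL;DR: This paper introduces PathMR, a new cell-level multimodal visual reasoning framework for pathological image analysis that outperforms existing methods on multiple datasets and improves the interpretability of AI in pathological diagnosis.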

Details

Motivation: Clinical adoption of deep learning for automated pathological diagnosis is limited by opaque model decisions and a lack of traceable reasoning. To address this, the study builds on multimodal visual reasoning architectures that provide a unified framework for generating pixel-level segmentation masks and semantically aligned textual explanations. Method: The paper proposes PathMR, a cell-level multimodal visual reasoning framework for pathological image analysis. Result: Extensive experiments on the publicly available PathGen dataset and the newly developed GADVR dataset show that PathMR consistently outperforms the state of the art in text generation quality, segmentation accuracy, and cross-modal alignment. Conclusion: PathMR shows potential for improving interpretability in AI-driven pathological diagnosis and consistently outperforms the state of the art in text generation quality, segmentation accuracy, and cross-modal alignment. Abstract: Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.

Dennis Slobodzian,Karissa Tilbury,Amir Kordijazi

Main category: cs.CV

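TL;DR: The study develops an efficient deep learning framework for early detection of pancreatic cancer that achieves over 90% accuracy and offers useful insights for applying deep learning to small medical imaging datasets.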

Details

Motivation: Pancreatic ductal adenocarcinoma (PDAC) has a five-year survival rate below 10% because early detection is difficult, so an efficient automated detection method is needed to overcome the limitations of current manual analysis. Method: The team used dual-modality imaging data, autofluorescence and second harmonic generation (SHG), analyzed 40 unique patient samples, and evaluated six deep learning architectures, including traditional convolutional neural networks (CNNs) and modern Vision Transformers (ViTs). The final optimized framework is based on a ResNet architecture with frozen pre-trained layers and class-weighted training. Result: The framework achieved over 90% accuracy in cancer detection, a significant improvement over current manual analysis methods. Conclusion: The study developed a deep learning framework based on a modified ResNet architecture for early PDAC detection, with over 90% accuracy, and demonstrated its potential for clinical deployment. Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
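The abstract mentions class-weighted training as a response to class imbalance but does not say how the weights were chosen. One common heuristic, shown here purely as an assumed example, is inverse-frequency weighting: each class gets weight n_samples / (n_classes * class_count), so rarer classes contribute more to the loss. The label counts below are invented and are not the paper's data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label set over the three tissue types named in the abstract.
labels = ["normal"] * 6 + ["fibrotic"] * 3 + ["cancerous"] * 1
for cls, weight in inverse_frequency_weights(labels).items():
    print(cls, round(weight, 3))
```

Weights of this kind are typically passed to a weighted cross-entropy loss; a perfectly balanced dataset yields a weight of 1.0 for every class.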

[123] Understanding and evaluating computer vision models through the lens of counterfactuals

Pushkar Shukla

Main category: cs.CV

TL;DR: This thesis explores counterfactual reasoning as a method to enhance interpretability, fairness, and causality in AI models, particularly addressing bias in vision classifiers and generative models.

Details

Motivation: Counterfactual reasoning has become central to interpretable and fair AI. The motivation is to uncover spurious correlations, probe causal dependencies, and build more robust systems by asking 'what if' questions through variations in inputs. Method: The thesis develops frameworks that use counterfactual reasoning to explain, audit, and mitigate bias in vision classifiers and generative models by systematically altering semantically meaningful attributes while holding others fixed. Result: The contributions include frameworks like CAVLI, ASAC, TIBET, BiasConnect, and InterMit, which demonstrate the effectiveness of counterfactuals in addressing bias in both vision classifiers and generative Text-to-Image models through explanation, auditing, and mitigation. Conclusion: Counterfactuals can serve as a unifying approach for enhancing interpretability, fairness, and causality in both discriminative and generative AI models, providing scalable and principled methods for socially responsible bias evaluation and mitigation. Abstract: Counterfactual reasoning -- the practice of asking ``what if'' by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. 
Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.

[124] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software

Loïc Stratil,Felix Fent,Esteban Rivera,Markus Lienkamp

Main category: cs.CV

TL;DR: This paper surveys the development of unified perception in autonomous driving, proposing a taxonomy and summarizing future directions.

Details

Motivation: Traditional modular perception pipelines for autonomous driving suffer from error accumulation and limited inter-task synergy, so a new unified perception paradigm is needed to improve robustness, contextual reasoning, and efficiency. Method: The paper surveys the existing literature and proposes a taxonomy of unified perception covering task integration, tracking formulation, and representation flow, and defines three paradigms: Early, Late, and Full Unified Perception. Result: The paper provides a comprehensive survey of unified perception, including existing methods, architectures, training strategies, datasets, and open-source resources, and identifies future research directions. Conclusion: The paper summarizes the current state of unified perception and proposes a systematic framework to support future research on more robust, generalizable, and interpretable perception. Abstract: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.

[125] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

Yifan Gao,Haoyue Li,Feng Yuan,Xiaosong Wang,Xin Gao

Main category: cs.CV

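TL;DR: Dino U-Net fuses DINOv3 semantic features with low-level details and uses the FAPM module to refine feature projection, significantly improving the accuracy and scalability of medical image segmentation.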

Details

Motivation: Although foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation, effectively transferring their learned representations for precise clinical applications remains a challenge. Method: The paper proposes Dino U-Net, a new encoder-decoder architecture whose encoder is built on a frozen DINOv3 backbone with a specialized adapter that fuses rich semantic features with low-level spatial details; a fidelity-aware projection module (FAPM) preserves feature quality during dimensionality reduction. Result: Dino U-Net achieves state-of-the-art performance on seven diverse public medical image segmentation datasets, and segmentation accuracy keeps improving as the backbone grows (up to 7 billion parameters). Conclusion: By exploiting the high-fidelity dense features of the DINOv3 vision foundation model, Dino U-Net offers an effective and parameter-efficient approach to medical image segmentation and advances segmentation accuracy. Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.

Sara Krauss,Ellena Spieß,Daniel Hieber,Frank Kramer,Johannes Schobel,Dominik Müller

Main category: cs.CV

TL;DR: This paper studies automatic differentiation of atypical from normal mitotic figures using a deep ensemble model and an RBR module.

Details

Motivation: Differentiating atypical mitotic figures (AMFs) from normal mitotic figures (NMFs) remains difficult because manual annotation is time-consuming and subjective. Method: An ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. Result: On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. The RBR module increased specificity but reduced sensitivity and overall performance. Conclusion: Deep ensembles perform well for AMF classification; the RBR module can improve specific metrics but requires further research. Abstract: Mitotic figures (MFs) are relevant biomarkers in tumor grading. Differentiating atypical MFs (AMFs) from normal MFs (NMFs) remains difficult, as manual annotation is time-consuming and subjective. In this work an ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. While the RBR increased specificity, it reduced sensitivity and overall performance. The results show that deep ensembles perform well for AMF classification. RBR can increase specific metrics but requires further research.
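Balanced accuracy for a binary task is the mean of sensitivity and specificity, which is why a gain in specificity can still lower the overall score if sensitivity falls further. The sketch below shows that trade-off with invented confusion-matrix counts; they are not the MIDOG25 results, and the function name is an assumption.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


# Hypothetical counts: a refinement step that raises specificity (0.90 to 0.95)
# but lowers sensitivity (0.80 to 0.70) ends up with lower balanced accuracy.
before = balanced_accuracy(tp=80, fn=20, tn=90, fp=10)
after = balanced_accuracy(tp=70, fn=30, tn=95, fp=5)
print(round(before, 3), round(after, 3))
```

Because each class contributes equally regardless of its size, the metric is a common choice when one class is much rarer than the other.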

[127] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans

Enrico Martini,Ho Jin Choi,Nadia Figueroa,Nicola Bombieri

Main category: cs.CV

TL;DR: COMETH is a lightweight, real-time multi-view human pose fusion algorithm that improves joint positioning accuracy through kinematic constraints and convex optimization, making it ideal for industrial applications.

Details

Motivation: Monitoring human activity is crucial for ergonomic safety and well-being in Industry 5.0. However, centralized multi-camera setups face scalability and real-time challenges due to high computational costs and bandwidth requirements, while edge-based processing faces accuracy and consistency issues. Method: COMETH uses convex optimization-based inverse kinematics for spatial fusion, integrates kinematic and biomechanical constraints, and implements a state observer for temporal consistency in real-time multi-view human pose fusion. Result: COMETH outperforms state-of-the-art methods in localization, detection, and tracking accuracy on both public and industrial datasets. Conclusion: The proposed COMETH algorithm enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications in the era of Industry 5.0. Abstract: In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. 
We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
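
As a rough illustration of what a convex multi-view fusion step can look like, the sketch below fuses per-camera estimates of a single joint by confidence-weighted least squares. This is a deliberately simplified stand-in: it omits the kinematic and biomechanical constraints and the state observer described in the abstract, and `fuse_joint` is a hypothetical helper, not part of the COMETH code base.

```python
def fuse_joint(estimates, weights):
    """Confidence-weighted least-squares fusion of one 3D joint.

    Minimizes sum_i w_i * ||x - p_i||^2 over x. The objective is convex and
    its closed-form minimizer is the weighted mean of the estimates.
    Assumption: each camera supplies a 3D point and a non-negative confidence.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one camera must have positive confidence")
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, estimates)) / total
        for axis in range(3)
    )


# Two cameras disagree along x; the more confident camera dominates.
fused = fuse_joint([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0])
print(fused)  # (1.5, 0.0, 0.0)
```

Adding constraints such as fixed bone lengths or joint limits would turn this closed-form mean into a constrained optimization problem that needs a solver.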

Amir Jmal,Chaima Chtourou,Mahdi Louati,Abdelaziz Kallel,Houda Khmila

Main category: cs.CV

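TL;DR: This study proposes an olive tree segmentation approach that combines the Segment Anything Model (SAM) with corrections based on the trees' alignment in the field and a learnable constraint on shape and size, improving segmentation accuracy.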

Details

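Motivation: In the context of climate change, early anomaly detection and treatment using remote sensing is needed to maintain olive biodiversity and to offer effective management solutions. Method: Leveraging foundation models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The method comprises SAM segmentation followed by corrections based on the trees' alignment in the field and a learnable constraint on shape and size. Result: The method substantially improves olive tree segmentation, reaching 98% accuracy compared with 82% for the initial SAM. Conclusion: Combining SAM segmentation with corrections based on tree alignment and learnable shape and size constraints substantially improves olive tree segmentation accuracy, from 82% for the initial SAM to 98%. Abstract: In the context of proven climate change, maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial, offering effective management solutions. This paper presents an innovative approach to olive tree segmentation from satellite images. By leveraging foundational models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The methodology includes SAM segmentation and corrections based on the trees' alignment in the field and a learnable constraint on shape and size. Our approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%.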

[129] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections

Fang Wang,Huitao Li,Wenhan Chao,Zheng Zhuo,Yiran Ji,Chang Peng,Yupeng Sun

Main category: cs.CV

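TL;DR: This paper proposes E-ConvNeXt, which uses structural improvements and optimized designs to substantially reduce network complexity while maintaining high accuracy, making it suitable for lightweight applications.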

Details

Motivation: The original ConvNeXt is not suited to lightweight application scenarios, so its parameter scale and network complexity need to be reduced. Method: ConvNeXt is improved by integrating the Cross Stage Partial Connections mechanism with a series of optimized designs. Result: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Conclusion: E-ConvNeXt balances efficiency and accuracy and is suitable for lightweight scenarios. Abstract: Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model's network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model's feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt's superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.
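
The cross-stage partial idea can be sketched in a few lines: only part of the channels passes through the expensive blocks, and the rest bypasses them and is concatenated back. The sketch below is a toy model under stated assumptions: channels are represented as plain list entries, the 50/50 split ratio is arbitrary, and the cost model counts only a dense channel-mixing layer while ignoring the transition layers a real CSP stage adds. `csp_stage`, `dense_params`, and `csp_dense_params` are hypothetical names.

```python
def csp_stage(channels, block, split_ratio=0.5):
    """Route only part of the channels through `block`; bypass the rest.

    `channels` is a list standing in for a feature map's channel axis.
    """
    k = int(len(channels) * split_ratio)
    bypass, active = channels[:k], channels[k:]
    return bypass + block(active)  # concatenate along the channel axis


def dense_params(c):
    """Weights of one dense channel-mixing (pointwise) layer on c channels."""
    return c * c


def csp_dense_params(c, split_ratio=0.5):
    """Same layer applied only to the channels that are not bypassed."""
    active = c - int(c * split_ratio)
    return active * active


print(csp_stage([1, 2, 3, 4], lambda xs: [x * 10 for x in xs]))  # [1, 2, 30, 40]
print(dense_params(64), csp_dense_params(64))  # 4096 1024
```

Under this simplified model, halving the processed channels quarters the cost of each channel-mixing layer, which gives some intuition for why partial connections can reduce complexity substantially; the 80% figure in the abstract comes from the authors' full design, not from this arithmetic.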

[130] DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes

Yajiao Xiong,Xiaoyu Zhou,Yongtao Wan,Deqing Sun,Ming-Hsuan Yang

Main category: cs.CV

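TL;DR: DrivingGaussian++ is an efficient framework for realistic reconstruction and controllable editing of autonomous driving scenes that combines a LiDAR prior with LLMs to generate dynamic object trajectories and enhance realism.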

Details

Motivation: To achieve more efficient and more realistic reconstruction and controllable editing of autonomous driving scenes, DrivingGaussian++ introduces a LiDAR prior and LLMs to generate dynamic object motion trajectories and increase scene diversity. Method: DrivingGaussian++ models the static background with incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, and combines a LiDAR prior with multi-view images and depth priors to support training-free controllable editing. Result: DrivingGaussian++ outperforms existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis, supports training-free controllable editing such as texture modification, weather simulation, and object manipulation, and significantly enhances scene diversity. Conclusion: DrivingGaussian++ is an efficient and effective framework for realistic reconstruction and controllable editing of autonomous driving scenes; by integrating a LiDAR prior and large language models (LLMs), it automatically generates dynamic object motion trajectories and enhances their realism. Abstract: We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io

[131] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation

Chenfan Qu,Yiwu Zhong,Bin Li,Lianwen Jin

Main category: cs.CV

TL;DR: Web-IML addresses data scarcity in image manipulation localization by leveraging web data and novel methods, achieving significant performance gains.

Details

Motivation: Image manipulation localization remains challenging because high-quality annotated data is scarce and costly to acquire. Method: A large-scale, diverse dataset, MIMLv2, is constructed using CAAA v2 and QES, and the Object Jitter technique is introduced to generate high-quality manipulation artifacts. Result: Web-IML achieves a substantial performance gain, surpassing the previous state-of-the-art TruFor by 24.1 average IoU points. Conclusion: Web-IML effectively leverages web supervision for image manipulation localization, alleviating data scarcity and significantly improving the performance of various models on multiple real-world forgery benchmarks. Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAA v2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. 
With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
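
Since the headline result is stated in IoU points, a minimal sketch of intersection-over-union for binary manipulation masks may help. The nested-list mask representation and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration; `mask_iou` is a hypothetical helper and not the benchmark's evaluation code.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [0, 0]]
print(mask_iou(pred, target))  # 0.5
```

A gain of "24.1 average IoU points" then means the per-benchmark IoU, expressed as a percentage, is on average 24.1 higher than the baseline's.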

[132] ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts

Patryk Będkowski,Jan Dubiński,Filip Szatkowski,Kamil Deja,Przemysław Rokita,Tomasz Trzciński

Main category: cs.CV

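TL;DR: ExpertSim is a deep learning simulation approach for the Zero Degree Calorimeter in the ALICE experiment that uses a Mixture-of-Generative-Experts architecture to provide more efficient and more accurate simulation than traditional Monte Carlo methods.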

Details

Motivation: Particle collision simulations currently rely on computationally expensive statistical Monte Carlo methods, while generative machine learning methods struggle with the variation in data distribution, so a more efficient simulation approach is needed. Method: The paper proposes ExpertSim, which adopts a Mixture-of-Generative-Experts architecture in which each expert specializes in simulating a different subset of the data, making generation more precise and efficient. Result: ExpertSim improves accuracy while significantly speeding up simulation, offering an efficient detector simulation solution for particle physics experiments at CERN. Conclusion: ExpertSim is an effective deep learning simulation approach that significantly improves the efficiency and accuracy of detector simulation in particle physics experiments. Abstract: Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN's computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim, a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
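
The routing idea behind a mixture of generative experts can be sketched as follows. This is an assumption-laden toy: hard top-1 routing, a hand-written gate, and lambda "experts" stand in for the learned router and generators, whose actual design is not described in this summary. `softmax`, `route`, and `simulate` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores):
    """Pick the expert with the highest gate probability (top-1 routing)."""
    probs = softmax(gate_scores)
    return max(range(len(probs)), key=probs.__getitem__)


def simulate(condition, experts, gate):
    """Send one conditioning input to its expert and return (index, response)."""
    idx = route(gate(condition))
    return idx, experts[idx](condition)


# Toy stand-ins: two "experts" covering low- and high-energy inputs.
experts = [lambda e: ("low", e * 0.5), lambda e: ("high", e * 2.0)]
gate = lambda e: [1.0 - e, e]  # favors expert 1 once e exceeds 0.5
print(simulate(0.9, experts, gate))  # (1, ('high', 1.8))
```

Because each input reaches only one expert, every expert sees a narrower slice of the data distribution, which is the property the abstract credits for more precise generation.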

[133] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Jiaxiang Cheng,Bing Ma,Xuhua Ren,Hongyi Jin,Kai Yu,Peng Zhang,Wenyue Li,Yuan Zhou,Tianxiang Zheng,Qinglin Lu

Main category: cs.CV

TL;DR: POSE is a new framework for video diffusion models that makes video generation faster and more efficient while maintaining high quality.

Details

Motivation: The motivation is to overcome the inefficiency in video diffusion generation, especially for large models and long sequences, by creating a method that models temporal coherence and allows single-step distillation. Method: POSE uses a two-phase distillation process involving stability priming and unified adversarial equilibrium, along with conditional adversarial consistency for better video generation. Result: POSE improves latency by 100x, reduces sampling time from 1000 to 10 seconds, and outperforms other methods by 7.15% in video quality metrics. Conclusion: POSE offers a distillation framework that significantly reduces sampling steps for large-scale video diffusion models, improving efficiency without compromising quality. Abstract: The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models:(i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. 
For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality, reducing the latency of the pre-trained model by 100x, from 1000 seconds to 10 seconds, while maintaining competitive performance.

[134] Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Dale Decatur,Thibault Groueix,Wang Yifan,Rana Hanocka,Vladimir Kim,Matheus Gadelha

Main category: cs.CV

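TL;DR: This paper proposes a training-free method that reduces redundancy in text-to-image generation by sharing computation in the early stages of the diffusion process, lowering compute cost while improving image quality.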

Details

Motivation: Text-to-image diffusion models generate high-quality images but are computationally expensive. Prior work optimizes per-inference efficiency, whereas this paper explores an orthogonal approach: reducing redundancy across correlated prompts. Method: Exploiting the coarse-to-fine nature of diffusion models, the method clusters prompts by semantic similarity and shares computation in the early denoising steps. Result: Experiments show that the method reduces compute cost while improving image quality, integrates seamlessly with existing pipelines, and scales to large prompt sets. Conclusion: The paper presents a training-free method that clusters semantically similar prompts and shares computation in the early diffusion steps to reduce redundant computation in text-to-image generation. Abstract: Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
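
The compute saving from sharing early denoising steps can be estimated with simple accounting. The sketch below assumes a greedy cosine-similarity clustering with a fixed threshold and a single shared prefix of steps per cluster; both are stand-ins chosen for illustration, not the clustering or step-allocation scheme used in the paper. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_clusters(embeddings, threshold=0.9):
    """Assign each prompt to the first cluster whose representative is similar.

    The first member of each cluster acts as its representative.
    """
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def denoiser_calls(clusters, total_steps, shared_steps):
    """Shared steps run once per cluster; remaining steps run once per prompt."""
    return sum(shared_steps + len(c) * (total_steps - shared_steps) for c in clusters)


embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # toy prompt embeddings
clusters = greedy_clusters(embeddings)
print(clusters)                          # [[0, 1], [2]]
print(denoiser_calls(clusters, 50, 20))  # 130, versus 3 * 50 = 150 unshared
```

The saving grows with cluster size and with the fraction of steps that can be shared, which is why the approach is most attractive for large sets of closely related prompts.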

[135] Mitosis detection in domain shift scenarios: a Mamba-based approach

Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento

Main category: cs.CV

TL;DR: This paper proposes a Mamba-based VM-UNet approach combined with stain augmentation to improve cross-domain performance in mitosis detection.

Details

Motivation: Mitosis detection is crucial in tumor assessment, but current machine learning algorithms degrade on data from other domains, so a more robust method is needed. Method: The paper uses a VM-UNet architecture for mitosis detection and adds stain augmentation operations to improve robustness under domain shift. Result: Preliminary experiments on the MIDOG++ dataset show that the method still has large room for improvement in cross-domain mitosis detection. Conclusion: The paper proposes a Mamba-based approach to improve mitosis detection under domain shift; preliminary experiments on the MIDOG++ dataset show large room for improvement. Abstract: Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.

[136] A multi-task neural network for atypical mitosis recognition under domain shift

Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento

Main category: cs.CV

TL;DR: This paper proposes a multi-task learning approach to address the issue of domain shift in recognizing atypical mitotic figures in histopathology images, showing promising results on multiple datasets.

Details

Motivation: Recognizing atypical mitotic figures in histopathology images helps physicians assess tumor aggressiveness, but machine learning models suffer from significant performance drops under domain shift. Method: An approach based on multi-task learning is proposed, exploiting auxiliary tasks correlated to the main classification task. Result: The proposed approach demonstrates promising performance in a preliminary evaluation on three distinct datasets: the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, and the preliminary test set of the MIDOG25 challenge. Conclusion: The proposed multi-task learning approach shows promising performance in recognizing atypical mitotic figures under domain shift. Abstract: Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, and the preliminary test set of the MIDOG25 challenge.

[137] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator

Huynh Tong Dang Khoa,Dang Hoai Nam,Vo Nguyen Le Duy

Main category: cs.CV

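TL;DR: FW-GAN addresses two major limitations of handwriting synthesis with a novel generative adversarial network design, achieving high-quality handwritten text generation.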

Details

Motivation: Handwriting recognition systems are limited by the need for diverse, style-consistent training samples, and existing handwriting synthesis methods have two major limitations: conventional convolutional architectures struggle to model long-range dependencies and complex stroke patterns, and they ignore frequency information that is essential for capturing fine-grained stylistic and structural details. Method: The paper proposes FW-GAN, a one-shot handwriting synthesis framework whose generator integrates a phase-aware Wave-MLP, together with a frequency-guided discriminator and a Frequency Distribution Loss. Result: Experiments show that FW-GAN generates high-quality, style-consistent handwriting on Vietnamese and English handwriting datasets. Conclusion: FW-GAN is a valuable tool for data augmentation in low-resource handwriting recognition pipelines. Abstract: Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN

[138] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Junpeng Ma,Qizhe Zhang,Ming Lu,Zhibin Wang,Qiang Zhou,Jun Song,Shanghang Zhang

Main category: cs.CV

TL;DR: MMG-Vid is a novel, training-free visual token pruning framework for Video Large Language Models that improves inference efficiency by maximizing marginal gains at both segment and token levels without compromising performance.

Details

Motivation: Existing visual token pruning methods for Video Large Language Models (VLLMs) fail to consider dynamic video characteristics and temporal dependencies, leading to inefficiencies in real-world applications due to excessive visual tokens. Method: MMG-Vid uses a two-stage, training-free visual token pruning approach: segment-level token budget allocation based on frame similarity and a temporal-guided DPC algorithm for intra-frame token pruning, both aimed at maximizing marginal gains. Result: MMG-Vid achieves significant efficiency improvements by reducing 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B, while preserving over 99.5% of the original model performance. Conclusion: MMG-Vid maintains over 99.5% of original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x, demonstrating its effectiveness in enhancing video large language model efficiency. Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. 
By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
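
The first stage, splitting the video into segments and dividing a token budget among them, can be sketched as below. Two loud assumptions: segments are cut wherever adjacent-frame similarity falls under a fixed threshold, and the budget is split in proportion to segment length. The proportional rule is a plain stand-in, not the marginal-gain allocation the paper proposes, and both function names are hypothetical.

```python
def split_segments(similarities, threshold):
    """Cut the video wherever adjacent frames are dissimilar.

    similarities[i] is the similarity between frame i and frame i + 1.
    Returns half-open (start, end) frame ranges.
    """
    segments, start = [], 0
    for i, sim in enumerate(similarities):
        if sim < threshold:
            segments.append((start, i + 1))
            start = i + 1
    segments.append((start, len(similarities) + 1))
    return segments


def allocate_budget(segments, total_budget):
    """Split the token budget proportionally to segment length (stand-in rule)."""
    lengths = [end - start for start, end in segments]
    total = sum(lengths)
    shares = [total_budget * n // total for n in lengths]
    leftover = total_budget - sum(shares)
    # Hand any rounding leftover to the longest segments first.
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i])[:leftover]:
        shares[i] += 1
    return shares


segments = split_segments([0.9, 0.2, 0.95, 0.9], threshold=0.5)  # 5 frames
print(segments)                        # [(0, 2), (2, 5)]
print(allocate_budget(segments, 100))  # [40, 60]
```

A marginal-gain allocation would instead give extra tokens to whichever segment benefits most from them, so a short but visually rich segment could receive more than its proportional share.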

[139] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Wei Li,Renshan Zhang,Rui Shao,Jie He,Liqiang Nie

Main category: cs.CV

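TL;DR: CogVLA is an efficient vision-language-action framework that uses instruction-driven routing and sparsification to achieve state-of-the-art performance while reducing computational overhead.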

Details

Motivation: Existing vision-language-action models built on pre-trained vision-language models require extensive post-training, and the resulting computational overhead limits scalability and deployment. Method: CogVLA introduces a 3-stage progressive architecture: 1) EFA-Routing injects instruction information into the vision encoder to form an instruction-aware latent representation; 2) LFP-Routing achieves token-level sparsity by pruning instruction-irrelevant visual tokens; 3) CAtten combines causal vision-language attention with bidirectional action parallel decoding so that compressed perception inputs still support accurate action generation. Result: CogVLA achieves success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robotic tasks, with 2.5-fold lower training cost and 2.8-fold lower inference latency. Conclusion: CogVLA achieves state-of-the-art performance while significantly reducing training cost and inference latency, making it suitable for large-scale deployment and practical applications. Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
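
Two building blocks named in the abstract, FiLM conditioning and instruction-driven token pruning, can be sketched generically. FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters derived from a conditioning signal. How CogVLA computes those parameters or the relevance scores is not stated in this summary, so the values below are placeholders and `film` and `prune_tokens` are hypothetical helpers.

```python
def film(tokens, gamma, beta):
    """Feature-wise linear modulation: y = gamma * x + beta, per channel.

    `tokens` is a list of feature vectors; `gamma` and `beta` would come from
    an instruction encoder in practice (here they are given directly).
    """
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


print(film([[1.0, 2.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0]))  # [[2.0, 2.0]]
print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.2], keep=2))  # ['b', 'c']
```

Preserving the original token order after pruning matters when positional information is attached to the tokens before they reach the language model.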

[140] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Hao Tan,Jun Lan,Zichang Tan,Ajian Liu,Chuanbiao Song,Senyuan Shi,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei

Main category: cs.CV

TL;DR: This paper introduces HydraFake, a new dataset for deepfake detection, and Veritas, a multi-modal large language model that outperforms existing detectors in real-world scenarios.

Details

Motivation: The challenge of deepfake detection due to discrepancies between academic benchmarks and industrial practice, including homogeneous training sources and low-quality testing images. Method: Development of the HydraFake dataset and the Veritas multi-modal large language model (MLLM) based deepfake detector with pattern-aware reasoning and a two-stage training pipeline. Result: Experiments showed that previous detectors have good cross-model generalization but struggle with unseen forgeries and data domains, while Veritas achieves significant improvements in these OOD scenarios. Conclusion: Veritas, a multi-modal large language model, achieves significant gains in detecting deepfake across different OOD scenarios and delivers transparent and faithful detection outputs. Abstract: Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. 
Experiments on the HydraFake dataset reveal that although previous detectors generalize well in cross-model scenarios, they fall short on unseen forgeries and data domains. Veritas achieves significant gains across different OOD scenarios and is capable of delivering transparent and faithful detection outputs.

[141] FakeParts: a New Family of AI-Generated DeepFakes

Gaetan Brison,Soobash Daiboo,Samy Aimeur,Awais Hussain Sani,Xi Wang,Gianni Franchi,Vicky Kalogeiton

Main category: cs.CV

TL;DR: This paper introduces FakeParts, a new class of deepfakes, and presents FakePartsBench, a large-scale benchmark dataset for detecting partial video manipulations.

Details

Motivation: Current deepfake detection methods have a critical gap in detecting partial video manipulations. Method: The paper introduces FakePartsBench, a large-scale benchmark dataset comprising over 25K videos with pixel-level and frame-level manipulation annotations. Result: User studies show that FakeParts reduces human detection accuracy by over 30% compared with traditional deepfakes. Conclusion: FakeParts is a new class of deepfakes with subtle, localized manipulations that can deceive both humans and existing detection models. Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.

[142] Multi-View 3D Point Tracking

Frano Rajič,Haofei Xu,Marko Mihajlovic,Siyuan Li,Irem Demir,Emircan Gündoğdu,Lei Ke,Sergey Prokudin,Marc Pollefeys,Siyu Tang

Main category: cs.CV

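TL;DR: The paper introduces a new multi-view 3D point tracker that achieves robust and accurate online tracking with just four cameras, addressing the limitations of prior methods.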

Details

Motivation: Existing monocular trackers struggle with depth ambiguities and occlusion, while prior multi-camera methods require over 20 cameras and tedious per-sequence optimization, so a more practical and efficient method is needed. Method: The method fuses multi-view features and uses k-nearest-neighbors correlation with a transformer-based update to estimate long-range 3D correspondences; it is trained on 5K synthetic multi-view Kubric sequences and evaluated on two real-world benchmarks. Result: The method achieves median trajectory errors of 3.1 cm on Panoptic Studio and 2.0 cm on DexYCB, and generalizes well to camera setups of 1-8 views and video lengths of 24-150 frames. Conclusion: The paper presents a new multi-view 3D point tracking method that addresses the depth-ambiguity and occlusion problems of existing monocular trackers and achieves robust, accurate online tracking with four cameras. Abstract: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
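
Two ingredients mentioned above, nearest-neighbor lookup in a fused point cloud and a median trajectory error, can be sketched with the standard library. The brute-force search and the exact error definition (median of per-point Euclidean distances) are assumptions for illustration; the benchmark's official metric may aggregate differently, and `knn` and `median_trajectory_error` are hypothetical helpers.

```python
import math
import statistics


def knn(query, points, k):
    """Indices of the k points closest to `query` (brute-force search)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]


def median_trajectory_error(predicted, ground_truth):
    """Median Euclidean distance between predicted and true 3D positions."""
    return statistics.median(math.dist(p, g) for p, g in zip(predicted, ground_truth))


cloud = [(1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (3.0, 3.0, 3.0)]
print(knn((0.0, 0.0, 0.0), cloud, k=2))  # [1, 0]

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
truth = [(0.0, 0.0, 0.0)] * 3
print(median_trajectory_error(pred, truth))  # 1.0
```

A real tracker would replace the brute-force search with a spatial index or a batched GPU kernel, since the fused point cloud from several cameras can contain a very large number of points.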

[143] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Yuan Gong,Xionghui Wang,Jie Wu,Shiyin Wang,Yitong Wang,Xinglong Wu

Main category: cs.CV

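TL;DR: This paper proposes OneReward, a unified reinforcement learning framework, and Seedream 3.0 Fill, a model built on it. A single reward model improves the performance of multi-task generation models, and the resulting model outperforms existing models across multiple evaluation dimensions.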

Details

Motivation: Existing methods typically rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. This paper aims to solve the problem with a unified reinforcement learning framework. Method: A single vision-language model (VLM) serves as the generative reward model, distinguishing the winner and loser for a given task and evaluation criterion. Building on OneReward, the authors develop Seedream 3.0 Fill, trained via multi-task reinforcement learning directly on a pre-trained model without task-specific SFT. Result: Experiments show that the method outperforms commercial and open-source competitors such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro] across multiple evaluation dimensions. Conclusion: OneReward is a unified reinforcement learning framework that improves multi-task generation models with a single reward model, and Seedream 3.0 Fill, built on this framework, outperforms existing commercial and open-source models. Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only \textit{One Reward} model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

[144] Dress&Dance: Dress up and Dance as You Like It - Technical Preview

Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang

Main category: cs.CV

TL;DR: Dress&Dance is a video diffusion framework that uses its CondNet conditioning network to deliver a high-quality, flexible virtual try-on experience.

Details

Motivation: To address high-quality generation and motion synchronization in virtual try-on while supporting multiple garment types and multi-modal inputs. Method: The paper proposes CondNet, a novel conditioning network that uses attention to unify multi-modal inputs (text, images, and videos). It is trained in a multistage progressive manner on limited video data combined with a larger image dataset. Result: The framework generates 5-second, 24 FPS virtual try-on videos at 1152x720 resolution from a single user image, and supports trying on multiple garment types in a single pass. Conclusion: With CondNet, Dress&Dance delivers a high-quality, flexible virtual try-on experience and outperforms existing open-source and commercial solutions. Abstract: We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.

[145] First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge

Fahad Shamshad,Tameem Bakr,Yahia Shaaban,Noor Hussein,Karthik Nandakumar,Nils Lukas

Main category: cs.CV

TL;DR: This paper presents advanced adversarial attacks on image watermarking, achieving high watermark removal rates while preserving image quality, and aims to inspire more robust watermarking techniques.

Details

Motivation: The goal is to stress-test watermark robustness against adversarial attacks under varying degrees of adversary knowledge, in order to expose the vulnerabilities of current watermarking methods. Method: The challenge has two tracks, black-box and beige-box, and the paper addresses both. In the beige-box track, a VAE-based evasion attack is used with test-time optimization and color-contrast restoration in CIELAB space. In the black-box track, images are clustered based on artifacts, and image-to-image diffusion models with controlled noise injection and semantic priors are applied. Result: Empirical evaluations show that the method achieves near-perfect watermark removal (95.7%) with minimal impact on image quality. Conclusion: The authors conclude that their methods achieve near-perfect watermark removal while maintaining image quality, and they hope to inspire the development of more robust watermarking techniques. Abstract: Content watermarking is an important tool for the authentication and copyright protection of digital media. However, it is unclear whether existing watermarks are robust against adversarial attacks. We present the winning solution to the NeurIPS 2024 Erasing the Invisible challenge, which stress-tests watermark robustness under varying degrees of adversary knowledge. The challenge consisted of two tracks: a black-box and beige-box track, depending on whether the adversary knows which watermarking method was used by the provider. For the beige-box track, we leverage an adaptive VAE-based evasion attack, with a test-time optimization and color-contrast restoration in CIELAB space to preserve the image's quality. For the black-box track, we first cluster images based on their artifacts in the spatial or frequency-domain. Then, we apply image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions to each cluster with optimized parameter settings. Empirical evaluations demonstrate that our method successfully achieves near-perfect watermark removal (95.7%) with negligible impact on the residual image's quality. We hope that our attacks inspire the development of more robust image watermarking methods.

Table of Contents

cs.CL [Back]

[1] Social Bias in Multilingual Language Models: A Survey

[2] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models

[3] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach

[4] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

[5] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

[6] Joint Enhancement of Relational Reasoning for Long-Context LLMs

[7] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems

[8] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

[9] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

[10] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

[11] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

[12] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval

[13] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance

[14] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method

[15] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

[16] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques

[17] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety

[18] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark

[19] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

[20] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

[21] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

[22] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data

[23] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search

[24] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling

[25] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models

[26] Generative Annotation for ASR Named Entity Correction

[27] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning

[28] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models

[29] rStar2-Agent: Agentic Reasoning Technical Report

[30] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees

[31] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

[32] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation

[33] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions

[34] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

[35] Signs of Struggle: Spotting Cognitive Distortions across Language and Register

[36] Exploring Machine Learning and Language Models for Multimodal Depression Detection

[37] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction

[38] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation

[39] The Uneven Impact of Post-Training Quantization in Machine Translation

[40] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

[41] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

[42] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment

[43] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

[44] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

[45] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs

[46] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm

[47] Enabling Equitable Access to Trustworthy Financial Reasoning

cs.CV [Back]

[48] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

[49] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization

[50] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

[51] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels

[52] InfinityHuman: Towards Long-Term Audio-Driven Human

[53] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos

[54] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models

[55] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems

[56] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification

[57] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces

[58] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

[59] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding

[60] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)

[61] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models

[62] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction

[63] Audio-Guided Visual Editing with Complex Multi-Modal Prompts

[64] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning

[65] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection

[66] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection

[67] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection

[68] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection

[69] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification

[70] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds

[71] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

[72] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation

[73] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization

[74] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding

[75] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

[76] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

[77] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information