Table of Contents
cs.CL
[1] Social Bias in Multilingual Language Models: A Survey
Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
Main category: cs.CL
TL;DR: This paper reviews research on social bias in multilingual AI models, emphasizing the need for more inclusive and culturally aware methodologies.
Details
Motivation: To understand and address the social biases present in multilingual pretrained models, similar to those found in English language models.
Method: A systematic review of existing research on multilingual and non-English bias evaluation and mitigation approaches.
Result: The review highlights issues like preference for certain languages, lack of multilingual mitigation experiments, and challenges in adapting bias benchmarks across languages and cultures.
Conclusion: The study identifies gaps in the current methodologies in multilingual bias research and suggests future directions for more inclusive and culturally appropriate approaches.
Abstract: Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
[2] Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models
Mohammad Amini,Babak Ahmadi,Xiaomeng Xiong,Yilin Zhang,Christopher Qiao
Main category: cs.CL
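TL;DR: This study explores using language models to generate multiple-choice questions in order to reduce the cost of manual test development, and finds that structured prompting and fine-tuning a mid-size model can effectively produce items aligned with assessment goals.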
Details
Motivation: To reduce the cost and inconsistency of manual test development by exploring automatic item generation (AIG) with language models to create multiple-choice questions (MCQs) for morphological assessment.
Method: The study took a two-fold approach. First, it compared a fine-tuned mid-size model (Gemma, 2B) with a larger untuned model (GPT-3.5, 175B). Second, it evaluated seven structured prompting strategies. Generated items were assessed with automated metrics and expert scoring, and GPT-4.1 was used to simulate human scoring at scale.
Result: Structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's output quality. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, and prompt design played a key role in mid-size model performance.
Conclusion: Structured prompting and efficient fine-tuning can enhance mid-size models for AIG under limited-data conditions. Combining automated metrics, expert judgment, and large-model simulation helps ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate K-12 language assessment items.
Abstract: This study explores automatic item generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment, aiming to reduce the cost and inconsistency of manual test development. The study used a two-fold approach. First, we compared a fine-tuned medium model (Gemma, 2B) with a larger untuned one (GPT-3.5, 175B). Second, we evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Generated items were assessed using automated metrics and expert scoring across five dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate human scoring at scale. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs. Gemma generally produced more construct-aligned and instructionally appropriate items than GPT-3.5's zero-shot responses, with prompt design playing a key role in mid-size model performance. This study demonstrates that structured prompting and efficient fine-tuning can enhance midsized models for AIG under limited data conditions. We highlight the value of combining automated metrics, expert judgment, and large-model simulation to ensure alignment with assessment goals. The proposed workflow offers a practical and scalable way to develop and validate language assessment items for K-12.
[3] Integrating SystemC TLM into FMI 3.0 Co-Simulations with an Open-Source Approach
Andrei Mihai Albu,Giovanni Pollo,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Alessandra Neri,Dario Soldi,Fabio Autieri,Sara Vinco
Main category: cs.CL
TL;DR: This paper presents an open-source methodology to integrate SystemC TLM models into FMI-based co-simulation workflows, addressing interoperability challenges in cyber-physical systems.
Details
Motivation: The motivation stems from the increasing complexity of cyber-physical systems in automotive applications, which demands efficient modeling and cross-domain co-simulation techniques. SystemC TLM's limited interoperability with other domains poses integration challenges that this paper aims to solve.
Method: The authors introduce a lightweight open-source toolchain to encapsulate SystemC TLM components into FMI 3.0 standards, addressing technical challenges like time synchronization and data exchange.
Result: The result is a feasible and effective integration methodology demonstrated through case studies, enabling standardized and seamless integration across heterogeneous simulation environments.
Conclusion: The paper concludes that by encapsulating SystemC TLM components as FMI 3.0 Co Simulation FMUs, a fully open-source methodology allows for seamless integration of SystemC TLM models into FMI-based co-simulation workflows.
Abstract: The growing complexity of cyber-physical systems, particularly in automotive applications, has increased the demand for efficient modeling and cross-domain co-simulation techniques. While SystemC Transaction-Level Modeling (TLM) enables effective hardware/software co-design, its limited interoperability with models from other engineering domains poses integration challenges. This paper presents a fully open-source methodology for integrating SystemC TLM models into Functional Mock-up Interface (FMI)-based co-simulation workflows. By encapsulating SystemC TLM components as FMI 3.0 Co Simulation Functional Mock-up Units (FMUs), the proposed approach facilitates seamless, standardized integration across heterogeneous simulation environments. We introduce a lightweight open-source toolchain, address key technical challenges such as time synchronization and data exchange, and demonstrate the feasibility and effectiveness of the integration through representative case studies.
[4] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Rikuto Kotoge,Mai Nishimura,Jiaxin Ma
Main category: cs.CL
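TL;DR: This paper proposes DGPO, a new method that addresses the difficulty compact language models have in acquiring agentic RAG behaviors through reinforcement learning; it demonstrates the method's effectiveness and its feasibility in resource-constrained environments.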
Details
Motivation: Compact language models reason poorly, which makes it difficult for reinforcement learning to elicit agentic RAG behaviors such as search and planning.
Method: The authors propose DGPO, which addresses these challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization.
Result: Comprehensive experiments show that DGPO enables compact models to achieve sophisticated agentic search behaviors, in some cases even outperforming the larger teacher model.
Conclusion: DGPO makes agentic RAG feasible in computing resource-constrained environments.
Abstract: Reinforcement Learning has emerged as a post-training approach to elicit agentic RAG behaviors such as search and planning from language models. However, compact language models (e.g., 0.5B parameters) struggle due to poor reasoning ability, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which addresses the challenges through cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To systematically evaluate our approach, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.
[5] GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Haibo Jin,Ruoxi Chen,Peiyan Zhang,Andy Zhou,Yang Zhang,Haohan Wang
Main category: cs.CL
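TL;DR: This paper introduces GUARD, a method that turns government-issued ethics guidelines into concrete test questions for assessing the compliance of large language models (LLMs). GUARD automatically generates guideline-violating questions and adds "jailbreak diagnostics" to test whether an LLM produces unethical or non-compliant responses. Experiments show the method is effective on multiple LLMs and extends to vision-language models, supporting the development of reliable applications.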
Details
Motivation: As LLMs are deployed widely across domains, their potential to generate harmful responses has raised societal and regulatory concerns. Governments have issued ethics guidelines to promote trustworthy AI, but these guidelines are usually high-level and lack concrete, operational testing methods. The paper aims to fill this gap by translating ethics guidelines into actionable test questions that verify whether LLMs comply.
Method: The paper proposes the GUARD testing method, which (1) automatically generates guideline-violating questions based on government guidelines; (2) introduces jailbreak diagnostics (GUARD-JD), which construct provocative scenarios to elicit unethical or guideline-violating responses; and (3) produces a compliance report stating the degree of adherence and identifying violations. GUARD-JD is also extended to vision-language models to test its range of applicability.
Result: GUARD was validated empirically on seven LLMs (e.g., Vicuna-13B, GPT-4, Claude-3.7), testing their compliance under three government-issued guidelines and conducting jailbreak diagnostics. The results show that GUARD effectively identifies model violations and that GUARD-JD transfers successfully to vision-language models, supporting the construction of reliable LLM applications.
Conclusion: GUARD successfully translates government ethics guidelines into concrete test questions and uses jailbreak diagnostics to expose potential safety risks in LLMs. The method makes LLM compliance testing more actionable and offers a practical tool for promoting trustworthy AI.
Abstract: As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
[6] Joint Enhancement of Relational Reasoning for Long-Context LLMs
Zhirui Chen,Wei Shen,Jiashui Huang,Ling Shao
Main category: cs.CL
TL;DR: JERR improves LLMs' ability to handle long contexts and complex reasoning tasks through graph-based reasoning and MCTS, leading to better reliability and transparency.
Details
Motivation: LLMs struggle with long contexts due to memory limitations, lack of transparency, and hallucinations.
Method: JERR integrates synopsis extraction, graph construction, and relational reasoning using Monte Carlo Tree Search (MCTS).
Result: Experimental results show that JERR outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.
Conclusion: JERR provides a novel solution to enhance long-context comprehension and complex reasoning tasks for LLMs, improving reliability and transparency.
Abstract: Despite significant progress, large language models (LLMs) still struggle with long contexts due to memory limitations and their inability to tackle complex and long-context tasks. Additionally, LLMs often suffer from a lack of transparency and are prone to producing hallucinations. To address these challenges, we propose JERR, a novel framework designed to enhance long-context comprehension via graph-based reasoning in LLMs. JERR integrates three key components: synopsis extraction, graph construction, and relational reasoning. First, synopsis is extracted by chunking text strategically, allowing the model to summarize and understand information more efficiently. Second, we build a directed acyclic graph (DAG) to resolve redundancy, ensuring logical consistency and clarity. Finally, we incorporate Monte Carlo Tree Search (MCTS) to help the model navigate complex reasoning paths, ensuring more accurate and interpretable outputs. This framework provides a novel solution that enables LLMs to handle extended contexts and complex reasoning tasks with improved reliability and transparency. Experimental results show that JERR consistently outperforms all baselines on the ROUGE and F1 metrics, achieving the highest scores on the LLM-Rater evaluation.
[7] Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problems
Yuyao Wang,Bowen Liu,Jianheng Tang,Nuo Chen,Yuhan Li,Qifan Zhang,Jia Li
Main category: cs.CL
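TL;DR: This study proposes using NP-hard (NPH) graph problems as synthetic training data to improve the long chain-of-thought reasoning of large language models, and develops Graph-R1-7B, which surpasses QwQ-32B on NPH graph problems.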
Details
Motivation: Developing long chain-of-thought (Long CoT) reasoning currently depends on high-quality, costly, human-curated datasets, and scalable alternatives remain unexplored.
Method: The authors introduce NPH graph problems as a synthetic training corpus and develop a two-stage post-training framework: Long CoT supervised fine-tuning on rejection-sampled instances, followed by reinforcement learning with a fine-grained reward design.
Result: Graph-R1-7B generalizes strongly across mathematics, coding, STEM, and logic tasks, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency.
Conclusion: NPH graph problems prove to be an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training.
Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. Our flagship model, Graph-R1-7B, demonstrates strong generalization across mathematics, coding, STEM, and logic, and surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLMs, opening a new frontier for LLM post-training. Our implementation is available at https://github.com/Graph-Reasoner/Graph-R1, with models and datasets hosted in our Hugging Face collection HKUST-DSAIL/Graph-R1.
[8] CAPE: Context-Aware Personality Evaluation Framework for Large Language Models
Jivnesh Sandhan,Fei Cheng,Tushar Sandhan,Yugo Murawaki
Main category: cs.CL
TL;DR: This paper proposes CAPE, a context-aware framework for evaluating LLM personalities, showing that prior interactions influence responses, improve consistency, and cause personality shifts, particularly in certain models.
Details
Motivation: The motivation is to address the lack of context in traditional psychometric-style evaluations of LLMs, which ignore real-world scenarios where prior interactions influence responses, thus limiting their behavioral assessment.
Method: The researchers introduced the Context-Aware Personality Evaluation (CAPE) framework and novel metrics to evaluate the impact of conversational history on LLM responses, conducting experiments on 7 LLMs and analyzing response consistency and personality shifts.
Result: Experiments showed that context improves response consistency through in-context learning but also causes personality shifts. GPT models are robust to question ordering and draw on intrinsic traits as well as prior interactions, while Gemini-1.5-Flash and Llama-8B are more context-sensitive.
Conclusion: The study concludes that conversational history plays a significant role in shaping LLM responses, affecting both consistency and personality traits, and context-aware approaches like CAPE are essential for accurate personality evaluation.
Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models' responses stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama-8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE
[9] Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
Xu Guo
Main category: cs.CL
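TL;DR: This study analyzes the utility of reasoning steps for predicting final-answer correctness, using conditional entropy to measure uncertainty at each step; it finds that entropy decreases over steps for correct answers, whereas it stays flat or increases for incorrect ones.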
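As a rough illustration of the conditional-entropy measurement this entry describes, the sketch below computes one entropy value per reasoning step and checks whether the trajectory is strictly decreasing. It is not the paper's implementation: the function names, the averaging over answer-token positions, and the toy 4-token vocabulary are all assumptions made for the example, and in practice the per-step distributions would come from a scoring model rather than being written by hand.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Mean entropy over the token positions of the answer span (assumed aggregation)."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def entropy_trajectory(per_step_dists) -> List[float]:
    """One entropy value per reasoning step; step k conditions on steps 1..k."""
    return [answer_entropy(d) for d in per_step_dists]


def is_decreasing(trajectory: Sequence[float], tol: float = 1e-9) -> bool:
    """True when every step lowers the entropy by more than `tol`."""
    return all(b < a - tol for a, b in zip(trajectory, trajectory[1:]))


# Hypothetical toy data: a one-token answer over a 4-token vocabulary,
# scored after 0, 1, and 2 reasoning steps have been added to the context.
toy_steps = [
    [[0.25, 0.25, 0.25, 0.25]],  # no reasoning yet: uniform, maximal uncertainty
    [[0.70, 0.10, 0.10, 0.10]],  # after step 1: one candidate starts to dominate
    [[0.97, 0.01, 0.01, 0.01]],  # after step 2: the model is nearly certain
]
toy_trajectory = entropy_trajectory(toy_steps)
```

A trajectory for which `is_decreasing` returns True matches the pattern the entry associates with correct answers; a flat or rising one could be used as an early-stopping or pruning signal.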
Details
Motivation: Current LLMs rely on generating intermediate reasoning steps to improve accuracy, but little work has examined how reasoning utility relates to answer correctness.
Method: In experiments on the MATH dataset, Qwen2.5-32B and GPT-4o generate reasoning chains, and a separate model, Qwen3-8B, quantifies their utility by computing the conditional entropy of the answer as the context is expanded step by step.
Result: Conditional entropy that decreases as reasoning steps accumulate is strongly associated with correct answers, whereas flat or increasing entropy is associated with incorrect ones; incorrect reasoning paths also tend to be longer than correct ones.
Conclusion: The findings provide a foundation for designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
[10] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung,Agustin Garcinuno,Spencer Mateega
Main category: cs.CL
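TL;DR: UI-Bench is a large-scale benchmark for evaluating the visual quality of websites generated by AI text-to-app tools, released with a leaderboard and an evaluation framework.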
Details
Motivation: AI text-to-app tools claim to produce high-quality websites in minutes, but no public benchmark rigorously verifies those claims.
Method: Expert pairwise comparisons, ranked with a TrueSkill-derived model, evaluate the visual quality of 10 AI text-to-app tools.
Result: UI-Bench covers 300 generated websites and more than 4,000 expert judgments, and ranks the tools.
Conclusion: UI-Bench provides a reproducible standard for evaluating AI-generated websites and releases the accompanying tools and leaderboard.
Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
[11] DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
Hengchuan Zhu,Yihuan Xu,Yichen Li,Zijie Meng,Zuozhu Liu
Main category: cs.CL
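TL;DR: This paper introduces DentalBench, the first comprehensive bilingual benchmark for the dental domain, designed to evaluate and advance large language models (LLMs) in dentistry.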
Details
Motivation: Although LLMs and medical LLMs perform well on general medical benchmarks, specialized fields that require deeper domain knowledge, such as dentistry, remain underexplored because targeted evaluation resources are lacking.
Method: The authors build DentalBench, comprising DentalQA (an English-Chinese question-answering benchmark with 36,597 questions) and DentalCorpus (a high-quality dental corpus of 337.35 million tokens) that supports supervised fine-tuning and retrieval-augmented generation, and evaluate 14 LLMs.
Result: The evaluation reveals significant performance gaps across task types and languages; domain adaptation substantially improves performance on knowledge-intensive and terminology-focused tasks.
Conclusion: Domain-specific benchmarks are essential for developing trustworthy and effective LLMs for healthcare applications.
Abstract: Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields, such as dentistry, which require deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
[12] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui,Ngoc Mai Thieu,Van Vinh Nguyen,Json J. Jung,Khac-Hoai Nam Bui
Main category: cs.CL
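TL;DR: KG-CQR is a new framework for RAG systems that uses a knowledge graph to enrich the contextual representation of queries, improving retrieval without additional training.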
Details
Motivation: Existing methods mainly address corpus-level context loss; this work instead enriches queries with structured relation representations to improve retrieval.
Method: KG-CQR uses subgraph extraction, completion, and contextual generation modules to enrich the contextual representation of complex input queries with a corpus-centric knowledge graph.
Result: On the RAGBench and MultiHop-RAG datasets, KG-CQR improves mAP by 4-6% and Recall@25 by 2-3% over strong baselines, and it also improves retrieval effectiveness on multi-hop question answering.
Conclusion: KG-CQR is a new framework that improves the retrieval phase of RAG systems, outperforming baseline models on multiple datasets.
Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that, by incorporating KG-CQR, the performance consistently outperforms the existing baseline in terms of retrieval effectiveness.
[13] CAMB: A comprehensive industrial LLM benchmark on civil aviation maintenance
Feng Zhang,Chengjie Pang,Yuehan Zhang,Chenyu Luo
Main category: cs.CL
TL;DR: This paper introduces an industrial-grade benchmark for evaluating large language models in civil aviation maintenance, identifying domain knowledge gaps and promoting the development of more intelligent solutions through open-sourced tools.
Details
Motivation: The motivation stems from the lack of specialized evaluation tools for large language models (LLMs) in civil aviation maintenance, a field requiring sophisticated reasoning and domain-specific knowledge, as current evaluations predominantly focus on mathematical and coding tasks.
Method: The authors developed an industrial-grade benchmark tailored for civil aviation maintenance to evaluate LLMs and vector embedding models. They conducted experimental exploration and analysis to validate the effectiveness of the benchmark and open-sourced the tools to promote further research.
Result: The benchmark successfully identifies gaps in domain knowledge and complex reasoning in LLMs used for civil aviation maintenance. The experimental results demonstrate its effectiveness, and the open-sourced benchmark and code are made available for future research and development.
Conclusion: The study concludes that the proposed benchmark effectively assesses LLM capabilities in civil aviation maintenance and identifies specific knowledge gaps and reasoning deficiencies, laying the groundwork for future improvements in intelligent solutions within this domain.
Abstract: Civil aviation maintenance is a domain characterized by stringent industry standards. Within this field, maintenance procedures and troubleshooting represent critical, knowledge-intensive tasks that require sophisticated reasoning. To address the lack of specialized evaluation tools for large language models (LLMs) in this vertical, we propose and develop an industrial-grade benchmark specifically designed for civil aviation maintenance. This benchmark serves a dual purpose: It provides a standardized tool to measure LLM capabilities within civil aviation maintenance, identifying specific gaps in domain knowledge and complex reasoning. By pinpointing these deficiencies, the benchmark establishes a foundation for targeted improvement efforts (e.g., domain-specific fine-tuning, RAG optimization, or specialized prompt engineering), ultimately facilitating progress toward more intelligent solutions within civil aviation maintenance. Our work addresses a significant gap in the current LLM evaluation, which primarily focuses on mathematical and coding reasoning tasks. In addition, given that Retrieval-Augmented Generation (RAG) systems are currently the dominant solutions in practical applications, we leverage this benchmark to evaluate existing well-known vector embedding models and LLMs for civil aviation maintenance scenarios. Through experimental exploration and analysis, we demonstrate the effectiveness of our benchmark in assessing model performance within this domain, and we open-source this evaluation benchmark and code to foster further research and development: https://github.com/CamBenchmark/cambenchmark
[14] Searching the Title of Practical Work of the Informatics Engineering Bachelor Program with the Case Base Reasoning Method
Agung Sukrisna Jaya,Osvari Arsalan,Danny Matthew Saputra
Main category: cs.CL
TL;DR: A system using CBR, TF-IDF, and Cosine Similarity efficiently searches for similar practical work titles.
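The TF-IDF and cosine-similarity retrieval named in this entry can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' system: the whitespace tokenizer, the smoothed IDF formula, the function names, and the three example titles are all assumptions introduced for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumed, deliberately simple choice)."""
    return text.lower().split()


def build_idf(titles: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the stored titles."""
    n = len(titles)
    df = Counter(term for title in titles for term in set(tokenize(title)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the stored titles are ignored."""
    tf = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, titles: List[str]) -> List[Tuple[float, str]]:
    """Return (match score, title) pairs, best match first."""
    idf = build_idf(titles)
    query_vec = tfidf(query, idf)
    scored = [(cosine(query_vec, tfidf(title, idf)), title) for title in titles]
    return sorted(scored, reverse=True)


# Hypothetical stored cases; a real case base would hold the archived titles.
titles = [
    "inventory information system for a retail store",
    "web based library information system",
    "android attendance application using qr code",
]
ranked = search("library information system", titles)
```

Because the query may be a full title or a few keywords, the same `search` call covers both modes the entry mentions; the returned score plays the role of the per-title match value.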
Details
Motivation: To efficiently find practical work titles with the highest similarity using existing experience and data.
Method: The system uses TF-IDF for vectorization and Cosine Similarity for similarity calculations.
Result: The system successfully found titles with the highest average match score in the second testing stage.
Conclusion: CBR, TF-IDF, and Cosine Similarity are effectively used in searching for practical work titles with high similarity.
Abstract: Case Base Reasoning (CBR) is a case solving technique based on experience in cases that have occurred before with the highest similarity. CBR is used to search for practical work titles. TF-IDF is applied to process the vectorization of each practical work title word and Cosine Similarity for the calculation of similarity values. This system can search either in the form of titles or keywords. The output of the system is the title of practical work and the match value of each title. Based on the test results using 705 practical work titles, testing was carried out with five titles and carried out in two stages. The first stage searches with existing titles and the second stage randomizes the title from the first stage. In the second stage, the same number of titles was found, with the highest average match score.
[15] MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Zhenting Wang,Qi Chang,Hemani Patel,Shashank Biju,Cheng-En Wu,Quan Liu,Aolin Ding,Alireza Rezazadeh,Ankit Shah,Yujia Bao,Eugene Siow
Main category: cs.CL
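TL;DR: MCP-Bench is a new benchmark for evaluating large language models on complex, multi-step tool-use tasks, and it reveals the limitations of current models.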
Details
Motivation: Existing benchmarks do not adequately evaluate LLMs on complex, multi-step tasks; a more realistic test environment is needed.
Method: The benchmark is built on the Model Context Protocol (MCP), connecting LLMs to 28 MCP servers spanning 250 tools, with multi-step tasks and a multi-faceted evaluation framework.
Result: Experiments on 20 advanced LLMs show persistent challenges in tool use, planning, and cross-domain workflows.
Conclusion: MCP-Bench provides a comprehensive framework for evaluating LLMs and exposes the challenges they face on complex tasks.
Abstract: We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
[16] Prediction of mortality and resource utilization in critical care: a deep learning approach using multimodal electronic health records with natural language processing techniques
Yucheng Ruan,Xiang Lan,Daniel J. Tan,Hairil Rizal Abdullah,Mengling Feng
Main category: cs.CL
TL;DR: This study introduces a deep learning framework using natural language processing techniques to integrate multimodal EHRs for predicting mortality and resource utilization in critical care. The proposed model outperforms existing methods and demonstrates robustness against data corruption.
Details
Motivation: Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. Additionally, the potential of textual information within structured data is not fully leveraged. Method: Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Result: Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusion: The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels. Abstract: Background Predicting mortality and resource utilization from electronic health records (EHRs) is challenging yet crucial for optimizing patient outcomes and managing costs in intensive care unit (ICU). Existing approaches predominantly focus on structured EHRs, often ignoring the valuable clinical insights in free-text notes. 
Additionally, the potential of textual information within structured data is not fully leveraged. This study aimed to introduce and assess a deep learning framework using natural language processing techniques that integrates multimodal EHRs to predict mortality and resource utilization in critical care settings. Methods Utilizing two real-world EHR datasets, we developed and evaluated our model on three clinical tasks with leading existing methods. We also performed an ablation study on three key components in our framework: medical prompts, free-texts, and pre-trained sentence encoder. Furthermore, we assessed the model's robustness against the corruption in structured EHRs. Results Our experiments on two real-world datasets across three clinical tasks showed that our proposed model improved performance metrics by 1.6%/0.8% on BACC/AUROC for mortality prediction, 0.5%/2.2% on RMSE/MAE for LOS prediction, 10.9%/11.0% on RMSE/MAE for surgical duration estimation compared to the best existing methods. It consistently demonstrated superior performance compared to other baselines across three tasks at different corruption rates. Conclusions The proposed framework is an effective and accurate deep learning approach for predicting mortality and resource utilization in critical care. The study also highlights the success of using prompt learning with a transformer encoder in analyzing multimodal EHRs. Importantly, the model showed strong resilience to data corruption within structured data, especially at high corruption levels.[17] ConspirED: A Dataset for Cognitive Traits of Conspiracy Theories and Large Language Model Safety
Luke Bates,Max Glockner,Preslav Nakov,Iryna Gurevych
Main category: cs.CL
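TL;DR: This paper introduces the ConspirED dataset for analyzing the cognitive traits of conspiratorial content and evaluating the robustness of large language models to conspiracy theories.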
Details
Motivation: Conspiracy theories erode public trust in science and institutions while resisting debunking by absorbing counter-evidence, so analyzing their patterns is essential for developing interventions. Method: Develop and use the ConspirED dataset, analyzing the cognitive traits of conspiratorial content through the CONSPIR cognitive framework. Result: Both large language models and reasoning models produce output that converges on the reasoning patterns of conspiratorial input, even when they successfully deflect comparable fact-checked misinformation. Conclusion: As AI-generated misinformation grows more sophisticated, studying the rhetorical patterns of conspiracy theories is essential for developing interventions and assessing AI vulnerabilities. Abstract: Conspiracy theories erode public trust in science and institutions while resisting debunking by evolving and absorbing counter-evidence. As AI-generated misinformation becomes increasingly sophisticated, understanding rhetorical patterns in conspiratorial content is important for developing interventions such as targeted prebunking and assessing AI vulnerabilities. We introduce ConspirED (CONSPIR Evaluation Dataset), which captures the cognitive traits of conspiratorial ideation in multi-sentence excerpts (80--120 words) from online conspiracy articles, annotated using the CONSPIR cognitive framework (Lewandowsky and Cook, 2020). ConspirED is the first dataset of conspiratorial content annotated for general cognitive traits. Using ConspirED, we (i) develop computational models that identify conspiratorial traits and determine dominant traits in text excerpts, and (ii) evaluate large language/reasoning model (LLM/LRM) robustness to conspiratorial inputs. We find that both are misaligned by conspiratorial content, producing output that mirrors input reasoning patterns, even when successfully deflecting comparable fact-checked misinformation.[18] Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark
Chihiro Taguchi,Seng Mai,Keita Kurabe,Yusuke Sakai,Georgina Agyei,Soudabeh Eslami,David Chiang
Main category: cs.CL
TL;DR: The paper identifies significant shortcomings in the FLORES+ multilingual machine translation benchmark, including translation quality below claimed standards, domain-specific and culturally biased source sentences, and vulnerabilities in evaluation protocols. The authors advocate for improved benchmarks that better reflect real-world translation challenges.
Details
Motivation: The motivation is to investigate the suitability of the FLORES+ benchmark for truly multilingual evaluation due to observed shortcomings in translation quality, domain specificity, and cultural bias. Method: The authors studied data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and conducted human assessments of translation quality. They also evaluated the impact of domain specificity and cultural bias in source sentences and tested the effectiveness of simple heuristics in achieving BLEU scores. Additionally, they compared the performance of MT models on FLORES+ and a domain-relevant evaluation set. Result: Human assessments revealed that many translations fell below the claimed 90% quality standard, and source sentences were often too domain-specific and culturally biased toward the English-speaking world. Simple heuristics were found to yield non-trivial BLEU scores, indicating vulnerabilities in the evaluation protocol. MT models trained on high-quality, naturalistic data performed poorly on FLORES+ but achieved significant gains on the domain-relevant evaluation set. Conclusion: The paper concludes that current multilingual MT benchmarks, particularly FLORES+, have significant shortcomings in terms of domain specificity, cultural bias, and evaluation protocol vulnerabilities. The authors advocate for improved benchmarks that use domain-general and culturally neutral texts to better reflect real-world translation challenges. Abstract: Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation. 
Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on high-quality, naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts and rely less on named entities, in order to better reflect real-world translation challenges.[19] SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
Pengjiang Li,Zaitian Wang,Xinhao Zhang,Ran Zhang,Lu Jiang,Pengfei Wang,Yuanchun Zhou
Main category: cs.CL
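TL;DR: SciTopic is a method that uses large language models to improve scientific topic identification; it builds a text encoder and a space optimization module and fine-tunes with a contrastive loss, outperforming existing methods.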
Details
Motivation: Most existing topic discovery methods rely on word embeddings to capture semantics, lack a comprehensive understanding of scientific publications, and struggle with complex, high-dimensional text relationships. SciTopic aims to improve scientific topic identification by drawing on the strong text comprehension of large language models. Method: SciTopic first builds a text encoder to capture the content of scientific publications, including metadata, title, and abstract. It then constructs a space optimization module that combines entropy-based sampling with LLM-guided triplet tasks, sharpening the focus on thematic relevance and contextual intricacies between ambiguous instances. Finally, it fine-tunes the text encoder under LLM guidance by optimizing the contrastive loss of the triplets, forcing the encoder to better discriminate instances of different topics. Result: Extensive experiments on three real-world datasets of scientific publications show that SciTopic outperforms existing state-of-the-art scientific topic discovery methods. Conclusion: By incorporating LLM guidance, SciTopic improves scientific topic identification and outperforms existing state-of-the-art methods, giving researchers deeper and faster insights. Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. 
Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.[20] Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Salvador Lima-López,Eulàlia Farré-Maduell,Martin Krallinger,Natalia Loukachevitch,Vera Davydova,Elena Tutubalina,Georgios Paliouras
Main category: cs.CL
TL;DR: The 2024 BioASQ challenge saw strong global participation and notable advancements in biomedical NLP tasks like multilingual clinical NER and nested NER.
Details
Motivation: To promote progress in large-scale biomedical semantic indexing and question answering by providing a platform for innovation and benchmarking. Method: BioASQ organized four tasks, including two established ones (Task B and Synergy) and two new ones (MultiCardioNER and BIONNE), involving 37 competing teams and over 700 submissions. Result: Most participating systems achieved competitive results, indicating ongoing advancements in the state-of-the-art for this domain. Conclusion: The twelfth edition of the BioASQ challenge successfully advanced biomedical semantic indexing and question answering, with significant participation and competitive performance across four tasks. Abstract: This is an overview of the twelfth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and two new tasks: a) MultiCardioNER on the adaptation of clinical entity detection to the cardiology domain in a multilingual setting, and b) BIONNE on nested NER in Russian and English. In this edition of BioASQ, 37 competing teams participated with more than 700 distinct submissions in total for the four different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.[21] Overview of BioASQ 2025: The Thirteenth BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
Anastasios Nentidis,Georgios Katsimpras,Anastasia Krithara,Martin Krallinger,Miguel Rodríguez-Ortega,Eduard Rodriguez-López,Natalia Loukachevitch,Andrey Sakhovskiy,Elena Tutubalina,Dimitris Dimitriadis,Grigorios Tsoumakas,George Giannakoulas,Alexandra Bekiaridou,Athanasios Samaras,Giorgio Maria Di Nunzio,Nicola Ferro,Stefano Marchesin,Marco Martinelli,Gianmaria Silvello,Georgios Paliouras
Main category: cs.CL
TL;DR: The BioASQ 2025 challenge promoted progress in biomedical semantic indexing and question answering through new and existing tasks, drawing 83 competing teams and over 1000 submissions.
Details
Motivation: To highlight the advancements in biomedical semantic indexing and question answering through the BioASQ challenge. Method: Providing an overview of the BioASQ 2025 challenge, including descriptions of new and existing tasks and analyzing the performance of participating teams. Result: 83 teams participated with over 1000 submissions, showing competitive performance across six tasks, including new ones like multilingual clinical summarization and nested named entity linking. Conclusion: The thirteenth edition of the BioASQ challenge successfully advanced biomedical semantic indexing and question answering with diverse tasks and significant team participation. Abstract: This is an overview of the thirteenth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2025. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks, b and Synergy, and four new tasks: a) Task MultiClinSum on multilingual clinical summarization. b) Task BioNNE-L on nested named entity linking in Russian and English. c) Task ELCardioCC on clinical coding in cardiology. d) Task GutBrainIE on gut-brain interplay information extraction. In this edition of BioASQ, 83 competing teams participated with more than 1000 distinct submissions in total for the six different shared tasks of the challenge. Similar to previous editions, several participating systems achieved competitive performance, indicating the continuous advancement of the state-of-the-art in the field.[22] Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data
Jiahao Xiao,Jiangming Liu
Main category: cs.CL
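TL;DR: This paper proposes an Adaptive Federated Distillation (AdaFD) framework to address the challenges of multi-domain non-IID data in federated learning, and introduces a set of multi-domain non-IID scenarios along with a unified benchmarking framework.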
Details
Motivation: Current non-IID scenarios in federated learning focus mainly on label diversity and overlook the diversity of language domains, which is crucial in natural language processing. Method: Propose an Adaptive Federated Distillation (AdaFD) framework that applies to both homogeneous and heterogeneous settings, and validate it experimentally in multi-domain non-IID scenarios. Result: Experimental results show that the proposed framework better captures the diversity of local clients and outperforms existing methods. Conclusion: AdaFD provides an effective framework for addressing the challenges of multi-domain non-IID data in federated learning, and a unified benchmark confirms its advantage. Abstract: The widespread success of pre-trained language models has established a new training paradigm, where a global PLM is fine-tuned using task-specific data from local clients. The local data are highly different from each other and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and highly investigated. However, previous experimental non-IID scenarios are primarily identified with the label (output) diversity, without considering the diversity of language domains (input) that is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate the federated learning framework in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance compared to the existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.[23] Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search
Zeyu Xiong,Yixuan Nan,Li Gao,Hengzhu Tang,Shuaiqiang Wang,Junfeng Wang,Dawei Yin
Main category: cs.CL
TL;DR: This study introduces a novel generative framework for Query-Driven Text Summarization that improves semantic understanding and achieves superior performance and efficiency in industrial web search.
Details
Motivation: Traditional extractive summarization models face limitations in information loss, architectural bottlenecks, and insufficient semantic understanding for complex search intents. Method: The study integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to develop a lightweight, domain-specialized QDTS model. Result: The model outperforms the production baseline on multiple metrics, achieving state-of-the-art results while efficiently handling ~50,000 queries per second with low latency. Conclusion: The proposed framework successfully applies generative models to real-time QDTS, achieving superior performance and deployment efficiency. Abstract: In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. 
Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.[24] KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling
Yangfan Wang,Jie Liu,Chen Tang,Lian Yan,Jingchi Jiang
Main category: cs.CL
TL;DR: This paper proposes Knowledge Composition Sampling (KCS) to improve multi-hop question generation by enhancing diversity and accuracy through knowledge composition modeling and probabilistic contrastive loss.
Details
Motivation: Multi-hop question answering faces challenges due to data sparsity, leading to models learning spurious patterns. Prior approaches often neglect essential knowledge integration, necessitating a more effective method for generating diverse and knowledge-rich questions. Method: KCS models knowledge composition selection as a sentence-level conditional prediction task and uses a probabilistic contrastive loss along with stochastic decoding during inference to balance accuracy and diversity. Result: KCS improves the accuracy of knowledge composition selection by 3.9% over baselines and shows performance improvements on the HotpotQA and 2WikiMultihopQA datasets when used for data augmentation. Conclusion: The KCS framework effectively enhances the diversity and accuracy of multi-hop question generation by modeling knowledge composition selection and employing a probabilistic contrastive loss. Abstract: Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the Knowledge Composition Sampling (KCS), an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. 
Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.[25] A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models
Soham Petkar,Hari Aakash K,Anirudh Vempati,Akshit Sinha,Ponnurangam Kumarauguru,Chirag Agarwal
Main category: cs.CL
TL;DR: This paper critiques current GLM evaluation benchmarks and proposes a new benchmark, CLEGR, for assessing multimodal reasoning. It finds that current GLMs don't necessarily benefit from graph integration and struggle with structural reasoning tasks.
Details
Motivation: The motivation is to address the insufficiency of current GLM evaluation benchmarks in assessing true multimodal reasoning involving both graph structure and language. Method: The study introduces a new benchmark called CLEGR for evaluating multimodal reasoning and evaluates GLM architectures against it. Result: The results show that GLMs perform similarly to LLM baselines without graph integration and exhibit degradation in structural reasoning tasks. Conclusion: The study concludes that current GLMs have limitations in graph reasoning capabilities and questions the necessity of integrating graph structure into LLMs. Abstract: Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current evaluation benchmarks for GLMs, which are primarily repurposed node-level classification datasets, are insufficient to assess multimodal reasoning. Our analysis reveals that strong performance on these benchmarks is achievable using unimodal information alone, suggesting that they do not necessitate graph-language integration. To address this evaluation gap, we introduce the CLEGR (Compositional Language-Graph Reasoning) benchmark, designed to evaluate multimodal reasoning at various complexity levels. Our benchmark employs a synthetic graph generation pipeline paired with questions that require joint reasoning over structure and textual semantics. We perform a thorough evaluation of representative GLM architectures and find that soft-prompted LLM baselines perform on par with GLMs that incorporate a full GNN backbone. This result calls into question the architectural necessity of incorporating graph structure into LLMs. We further show that GLMs exhibit significant performance degradation in tasks that require structural reasoning. 
These findings highlight limitations in the graph reasoning capabilities of current GLMs and provide a foundation for advancing the community toward explicit multimodal reasoning involving graph structure and language.[26] Generative Annotation for ASR Named Entity Correction
Yuanchang Luo,Daimeng Wei,Shaojun Li,Hengchao Shang,Jiaxin Guo,Zongyao Li,Zhanglin Wu,Xiaoyu Chen,Zhiqiang Rao,Jinlong Yang,Hao Yang
Main category: cs.CL
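TL;DR: This paper proposes a new named entity correction method that uses speech sound features and a generative approach to significantly improve entity accuracy in automatic speech recognition.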
Details
Motivation: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, and existing NEC models based on phonetic-level edit distance algorithms perform poorly when the wrongly transcribed words differ substantially in form from the ground-truth entity. Method: Use speech sound features to retrieve candidate entities, and design a generative method to annotate and replace erroneous entities in ASR transcripts. Result: The method performs well in scenarios with varied word-form differences; tests on open-source and self-constructed test sets show that it significantly improves entity accuracy. Conclusion: The paper proposes a new named entity correction method that retrieves candidate entities using speech sound features and annotates entity errors in ASR transcripts with a generative method, effectively improving entity accuracy. Abstract: End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performances. However, when the forms of the wrongly-transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in hypothesis, thus limiting their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios of word form difference. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method can bring significant improvement to entity accuracy. We will open source our self-constructed test set and training data.[27] Multi-Lingual Implicit Discourse Relation Recognition with Multi-Label Hierarchical Learning
Nelson Filipe Costa,Leila Kosseim
Main category: cs.CL
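TL;DR: This paper proposes HArch, a new multi-lingual model for implicit discourse relation recognition that outperforms large language models across multiple tasks and languages.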
Details
Motivation: To improve implicit discourse relation recognition and extend it to multiple languages. Method: Propose HArch, a multi-lingual, multi-label classification model for implicit discourse relation recognition, and compare the effect of different pre-trained encoders. Result: HArch performs well on the DiscoGeM 2.0 corpus and outperforms large language models such as GPT-4o and Llama-4-Maverick in the multi-lingual setting. Conclusion: Task-specific fine-tuned models outperform prompting approaches in implicit discourse relation recognition. Abstract: This paper introduces the first multi-lingual and multi-label classification model for implicit discourse relation recognition (IDRR). Our model, HArch, is evaluated on the recently released DiscoGeM 2.0 corpus and leverages hierarchical dependencies between discourse senses to predict probability distributions across all three sense levels in the PDTB 3.0 framework. We compare several pre-trained encoder backbones and find that RoBERTa-HArch achieves the best performance in English, while XLM-RoBERTa-HArch performs best in the multi-lingual setting. In addition, we compare our fine-tuned models against GPT-4o and Llama-4-Maverick using few-shot prompting across all language configurations. Our results show that our fine-tuned models consistently outperform these LLMs, highlighting the advantages of task-specific fine-tuning over prompting in IDRR. Finally, we report SOTA results on the DiscoGeM 1.0 corpus, further validating the effectiveness of our hierarchical approach.[28] Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
Ruiyi Yan,Yugo Murawaki
Main category: cs.CL
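TL;DR: This study examines how large language models affect text steganography and watermarking, and proposes methods that resolve tokenization inconsistency to improve the performance of both.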
Details
Motivation: The paper studies how tokenization inconsistency (TI) between Alice and Bob undermines robustness in steganography and watermarking. Method: The study uses a stepwise verification method to resolve TI in steganography and a post-hoc rollback method to resolve TI in watermarking. Result: Experiments show that directly addressing TI in steganography improves fluency, imperceptibility, and anti-steganalysis capacity compared with traditional disambiguation methods; for watermarking, addressing TI enhances detectability and robustness against attacks. Conclusion: Resolving TI improves performance in both steganography and watermarking, indicating that the TI-specific solutions are effective. Abstract: Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based steganography. On the other hand, they have also underscored the importance of watermarking as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: infrequency and temporariness. Based on these findings, we propose two tailored solutions for TI elimination: a stepwise verification method for steganography and a post-hoc rollback method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.[29] rStar2-Agent: Agentic Reasoning Technical Report
Ning Shang,Yifei Liu,Yi Zhu,Li Lyna Zhang,Weijiang Xu,Xinyu Guan,Buze Zhang,Bingcheng Dong,Xudong Zhou,Bowen Zhang,Ying Xin,Ziming Miao,Scarlett Li,Fan Yang,Mao Yang
Main category: cs.CL
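TL;DR: rStar2-Agent is a 14B math reasoning model that reaches frontier-level performance through agentic reinforcement learning; it exhibits advanced cognitive behaviors, trains efficiently, and outperforms larger models.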
Details
Motivation: Current long chain-of-thought (CoT) reasoning has limitations; the goal is to achieve more advanced cognitive behaviors, such as autonomously exploring, verifying, and refining intermediate steps in complex problem-solving. Method: Develop three key techniques: an efficient RL infrastructure, the GRPO-RoC algorithm, and an efficient agent training recipe, with training run on 64 MI300X GPUs. Result: rStar2-Agent reaches SOTA within 510 RL steps, with average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, and with shorter responses. Conclusion: Through agentic reinforcement learning, rStar2-Agent achieves frontier-level math reasoning performance under limited resources and shows strong generalization. Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. 
Code and training recipes are available at https://github.com/microsoft/rStar.[30] Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees
Stephen Meisenbacher,Maulik Chevli,Florian Matthes
Main category: cs.CL
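TL;DR: This work proposes DP-ST, a method that uses a divide-and-conquer strategy and semantic triples to achieve efficient privacy-preserving text generation under local differential privacy.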
Details
Motivation: Under local differential privacy, text privatization is challenging, especially at low ε values, where privacy and utility must be balanced. Method: Semantic triples are used to generate private documents under local differential privacy through a divide-and-conquer strategy combined with LLM post-processing. Result: The method generates coherent text even at lower ε values, effectively balancing privacy and utility. Conclusion: The study highlights the importance of coherence in achieving balanced privatization outputs at reasonable ε levels. Abstract: Many works at the intersection of Differential Privacy (DP) and Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under local DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter $\varepsilon$. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high $\varepsilon$ values. Addressing this challenge, we introduce DP-ST, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the divide-and-conquer paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a privatization neighborhood. When combined with LLM post-processing, our method allows for coherent text generation even at lower $\varepsilon$ values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable $\varepsilon$ levels.
[31] Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets
Vassiliy Cheremetiev,Quang Long Ho Ngo,Chau Ying Kot,Alina Elena Baia,Andrea Cavallaro
Main category: cs.CL
TL;DR: Fine-tuning embedding models based on LLMs improves implicit hate speech detection performance.
Details
Motivation: Implicit hate speech is challenging to detect due to its subtlety and lack of explicit derogatory language, so external knowledge or additional information like context, emotions, and sentiment data are needed. Method: Fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed, and E5. Result: Experiments on multiple IHS datasets show improvements of up to 1.10 percentage points for in-dataset evaluation and up to 20.35 percentage points for cross-dataset evaluation in terms of F1-macro score. Conclusion: Fine-tuning general-purpose embedding models based on large language models can achieve state-of-the-art performance in detecting implicit hate speech. Abstract: Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show improvements of up to 1.10 percentage points in in-dataset evaluation and up to 20.35 percentage points in cross-dataset evaluation, in terms of F1-macro score.
[32] GUARD: Glocal Uncertainty-Aware Robust Decoding for Effective and Efficient Open-Ended Text Generation
Yuanhao Ding,Esteban Garces Arias,Meimingwei Li,Julian Rodemann,Matthias Aßenmacher,Danlu Chen,Gaojuan Fan,Christian Heumann,Chongsheng Zhang
Main category: cs.CL
TL;DR: GUARD is a self-adaptive decoding method that balances coherence and diversity in text generation by combining global and local uncertainty signals, achieving better performance and faster generation speed.
Details
Motivation: Open-ended text generation struggles to balance coherence with diversity in LLM outputs. Existing contrastive search-based decoding strategies are limited by hyperparameter dependence and high computational costs. Method: GUARD combines global entropy estimates with local entropy deviations in a "Glocal" uncertainty-driven framework, incorporating a token-count-based penalty to reduce computational overhead. Result: Experimental results show that GUARD achieves a good balance between text diversity and coherence, with significant improvements in generation speed. It also received positive evaluations from both humans and LLMs across multiple text quality dimensions. Conclusion: GUARD is an effective decoding method for open-ended text generation, successfully balancing coherence and diversity while improving generation speed. Abstract: Open-ended text generation faces a critical challenge: balancing coherence with diversity in LLM outputs. While contrastive search-based decoding strategies have emerged to address this trade-off, their practical utility is often limited by hyperparameter dependence and high computational costs. We introduce GUARD, a self-adaptive decoding method that effectively balances these competing objectives through a novel "Glocal" uncertainty-driven framework. GUARD combines global entropy estimates with local entropy deviations to integrate both long-term and short-term uncertainty signals. We demonstrate that our proposed global entropy formulation effectively mitigates abrupt variations in uncertainty, such as sudden overconfidence or high entropy spikes, and provides theoretical guarantees of unbiasedness and consistency. To reduce computational overhead, we incorporate a simple yet effective token-count-based penalty into GUARD. Experimental results demonstrate that GUARD achieves a good balance between text diversity and coherence, while exhibiting substantial improvements in generation speed. 
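To make the idea of combining global and local uncertainty signals concrete, here is a minimal sketch that computes per-step token entropy, a running mean as a stand-in for a global entropy estimate, and the local deviation from it. The running mean is an assumption chosen for illustration; the abstract does not give GUARD's actual global entropy formulation or its token-count-based penalty, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def glocal_signals(step_logits):
    """Return (local_entropy, global_entropy, deviation) for each decoding step.

    global_entropy is the running mean of all entropies seen so far
    (an illustrative assumption), and deviation = local - global.
    """
    signals = []
    running_sum = 0.0
    for step, logits in enumerate(step_logits, start=1):
        local = entropy(softmax(logits))
        running_sum += local
        global_estimate = running_sum / step
        signals.append((local, global_estimate, local - global_estimate))
    return signals


if __name__ == "__main__":
    # Step 1 is maximally uncertain over 4 tokens; step 2 is very confident.
    for local, global_estimate, deviation in glocal_signals(
        [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
    ):
        print(f"local={local:.4f} global={global_estimate:.4f} dev={deviation:+.4f}")
```

A strongly negative deviation flags sudden overconfidence relative to the history, and a strongly positive one flags an entropy spike; a decoder could use either signal to adapt its behaviour without a hand-tuned hyperparameter.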
In a more nuanced comparison study across different dimensions of text quality, both human and LLM evaluators validated its remarkable performance. Our code is available at https://github.com/YecanLee/GUARD.
[33] Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions
Xiaoyi Wang,Jiwei Zhang,Guangtao Zhang,Honglei Guo
Main category: cs.CL
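TL;DR: This study compares emotional dynamics in real and LLM-generated Cognitive Behavioral Therapy dialogues, finds that synthetic dialogues are limited in emotional authenticity, and introduces the RealCBT dataset for future research.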
Details
Motivation: To understand whether LLM-generated synthetic therapy dialogues capture the complex emotional dynamics of real therapy. Method: The Utterance Emotion Dynamics framework is used to compare emotional trajectories in real and LLM-generated CBT dialogues, covering both full dialogues and individual roles (counselor and client). Result: Although synthetic dialogues are fluent and structurally coherent, they differ from real dialogues in emotional variability, use of emotion-laden language, and patterns of reactivity and regulation. Real dialogues show greater emotional variability and more authentic emotional dynamics, especially for the client role. Conclusion: Synthetic therapy dialogues differ markedly from real ones in emotional dynamics, underscoring the limitations of current LLM-generated therapy data; the RealCBT dataset is introduced to support future research. Abstract: Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we conduct the first comparative analysis of emotional arcs between real and LLM-generated Cognitive Behavioral Therapy dialogues. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions transcribed from public videos and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity between real and synthetic speakers is low, especially for clients. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. We introduce RealCBT, a curated dataset of real CBT sessions, to support future research in this space.
[34] Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
Harethah Abu Shairah,Hasan Abed Al Kader Hammoud,George Turkiyyah,Bernard Ghanem
Main category: cs.CL
TL;DR: This paper introduces ROSI, a lightweight and interpretable method to enhance safety in Large Language Models by steering model activations toward a safety-aligned subspace without the need for fine-tuning.
Details
Motivation: Safety mechanisms in LLMs can often be bypassed by manipulating representational directions. This paper aims to develop a method to strengthen these safety mechanisms in a permanent and interpretable way. Method: The paper introduces ROSI, a white-box, fine-tuning-free method that applies a rank-one weight modification to the residual stream write matrices of an LLM, steering its activations toward a safety-aligned subspace computed from harmful and harmless instruction pairs. Result: ROSI was shown to increase safety refusal rates, as evaluated by Llama Guard 3, while maintaining model utility on standard benchmarks. It was also effective in re-aligning 'uncensored' models by amplifying their latent safety directions. Conclusion: ROSI proves to be an effective and efficient method for enhancing the safety of LLMs by steering their activation toward a refusal-mediating subspace, offering a complementary alternative to resource-heavy fine-tuning methods. Abstract: Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and Arc. 
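The rank-one weight modification described above can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the ROSI procedure itself: the safety direction is taken as a normalised difference of mean activations between harmful and harmless prompts, the update has the generic form W + alpha * outer(direction, probe), and the names `safety_direction`, `rank_one_update`, `probe`, and `alpha` are hypothetical. The abstract does not specify how the second vector of the rank-one term is chosen.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def safety_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean activation."""
    diff = [
        a - b
        for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction found")
    return [d / norm for d in diff]


def rank_one_update(weight, direction, probe, alpha):
    """Return weight + alpha * outer(direction, probe) without mutating the input.

    weight: matrix as a list of rows (output_dim x input_dim).
    direction: length output_dim; probe: length input_dim (hypothetical choice).
    """
    return [
        [w + alpha * d * p for w, p in zip(row, probe)]
        for row, d in zip(weight, direction)
    ]


def matvec(weight, x):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


if __name__ == "__main__":
    direction = safety_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    steered = rank_one_update([[0.0, 0.0], [0.0, 0.0]], direction, [1.0, 1.0], 0.5)
    print(direction, steered, matvec(steered, [2.0, 3.0]))
```

Because the change is a single outer product, every output of the modified matrix is shifted along `direction` by `alpha` times the dot product of `probe` and the input, which is why such an edit can be applied once to the weights instead of at inference time.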
Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
[35] Signs of Struggle: Spotting Cognitive Distortions across Language and Register
Abhishek Kuber,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah
Main category: cs.CL
TL;DR: This study explores the use of automated methods to detect cognitive distortions in Dutch adolescents' forum posts, highlighting the potential of domain adaptation techniques for cross-lingual and cross-register applications.
Details
Motivation: The increasing mental health issues among youth and the need for early, low-cost interventions have driven the interest in detecting cognitive distortions in digital text across different languages and writing styles. Method: The study analyzed forum posts written by Dutch adolescents, focusing on the cross-lingual and cross-register generalization of cognitive distortion detection using automated approaches. Result: The research found that variations in language and writing style significantly affect model performance in detecting cognitive distortions. Conclusion: The study concludes that domain adaptation methods are promising for overcoming challenges in cross-lingual and cross-register generalization of cognitive distortion detection. Abstract: Rising mental health issues among youth have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One key focus is the identification of cognitive distortions, irrational thought patterns that have a role in aggravating mental distress. Early detection of these distortions may enable timely, low-cost interventions. While prior work has focused on English clinical data, we present the first in-depth study of cross-lingual and cross-register generalization of cognitive distortion detection, analyzing forum posts written by Dutch adolescents. Our findings show that while changes in language and writing style can significantly affect model performance, domain adaptation methods show the most promise.
[36] Exploring Machine Learning and Language Models for Multimodal Depression Detection
Javier Si Zhao Hong,Timothy Zoe Delaya,Sherwyn Chan Yin Kit,Pai Chet Ng,Xiaoxiao Miao
Main category: cs.CL
TL;DR: This paper explores the use of machine learning and deep learning models like XGBoost, transformers, and LLMs for multimodal depression detection, comparing their strengths and limitations.
Details
Motivation: The motivation is to explore effective multimodal depression detection strategies using advanced machine learning and deep learning models. Method: The paper compares the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features using machine learning and deep learning methods. Result: The results highlight the varying effectiveness of XGBoost, transformer-based models, and LLMs in detecting depression-related signals across multiple modalities. Conclusion: The paper concludes that different models have distinct strengths and limitations in capturing depression-related signals, providing insights into multimodal representation strategies for mental health prediction. Abstract: This paper presents our approach to the first Multimodal Personality-Aware Depression Detection Challenge, focusing on multimodal depression detection using machine learning and deep learning models. We explore and compare the performance of XGBoost, transformer-based architectures, and large language models (LLMs) on audio, video, and text features. Our results highlight the strengths and limitations of each type of model in capturing depression-related signals across modalities, offering insights into effective multimodal representation strategies for mental health prediction.
[37] GDLLM: A Global Distance-aware Modeling Approach Based on Large Language Models for Event Temporal Relation Extraction
Jie Zhao,Wanting Ning,Yuxiao Fei,Yubo Feng,Lishuang Li
Main category: cs.CL
TL;DR: This paper proposes GDLLM, a global distance-aware modeling approach based on Large Language Models (LLMs), to improve Event Temporal Relation Extraction (ETRE) by capturing long-distance dependencies and enhancing minority class relation identification, achieving state-of-the-art results on two public datasets.
Details
Motivation: The motivation is to overcome the limitations of Small Language Models (SLMs) and Large Language Models (LLMs) in handling minority class relations and capturing long-distance dependencies due to manually designed prompts or instructions. Method: The method involves a distance-aware graph structure using Graph Attention Networks (GAT) and a temporal feature learning paradigm based on soft inference to enhance relation identification. Result: Experiments demonstrate that the GDLLM approach achieves state-of-the-art (SOTA) performance on the TB-Dense and MATRES datasets. Conclusion: The proposed GDLLM framework significantly improves the performance on minority relation classes and achieves state-of-the-art results on two public datasets, TB-Dense and MATRES. Abstract: In Natural Language Processing (NLP), Event Temporal Relation Extraction (ETRE) aims to recognize the temporal relation between two events. Prior studies have noted the importance of language models for ETRE. However, the restricted pre-trained knowledge of Small Language Models (SLMs) limits their capability to handle minority class relations in imbalanced classification datasets. For Large Language Models (LLMs), researchers adopt manually designed prompts or instructions, which may introduce extra noise, leading to interference with the model's judgment of the long-distance dependencies between events. To address these issues, we propose GDLLM, a Global Distance-aware modeling approach based on LLMs. We first present a distance-aware graph structure utilizing a Graph Attention Network (GAT) to assist the LLMs in capturing long-distance dependency features. Additionally, we design a temporal feature learning paradigm based on soft inference to augment the identification of relations with a short-distance proximity band, which supplements the probabilistic information generated by LLMs into the multi-head attention mechanism.
Since the global feature can be captured effectively, our framework substantially enhances the performance of minority relation classes and improves the overall learning ability. Experiments on two publicly available datasets, TB-Dense and MATRES, demonstrate that our approach achieves state-of-the-art (SOTA) performance.
[38] MSRS: Evaluating Multi-Source Retrieval-Augmented Generation
Rohan Phanse,Yijie Zhou,Kejian Shi,Wencai Zhang,Yixin Liu,Yilun Zhao,Arman Cohan
Main category: cs.CL
TL;DR: This paper presents a framework for evaluating RAG systems in multi-source information integration and summarization tasks, showing that reasoning models outperform standard LLMs.
Details
Motivation: The motivation is to evaluate retrieval-augmented systems in real-world scenarios where information integration and summarization across multiple sources are required, as opposed to relying on a single source or factoid-based answers. Method: A scalable framework was developed to construct evaluation benchmarks for RAG systems, focusing on multi-source retrieval and synthesis. Two benchmarks, MSRS-Story and MSRS-Meet, were created and tested using various RAG pipelines, including sparse and dense retrievers combined with frontier LLMs. Result: The experiments revealed that generation quality is highly dependent on retrieval effectiveness, which varies by task. Reasoning models showed superior performance in multi-source synthesis compared to standard LLMs. Conclusion: The study concludes that while multi-source synthesis is challenging, reasoning models perform significantly better than standard LLMs in integrating and summarizing information from multiple sources. Abstract: Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. 
Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
[39] The Uneven Impact of Post-Training Quantization in Machine Translation
Benjamin Marie,Atsushi Fujita
Main category: cs.CL
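TL;DR: This paper presents a large-scale evaluation of post-training quantization in machine translation, finds that quantization affects low-resource languages more strongly, and offers actionable insights for deploying multilingual LLMs under quantization constraints.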
Details
Motivation: To explore how quantization affects multilingual tasks, given the need to deploy large language models under resource constraints. Method: Post-training quantization (PTQ) is evaluated at scale on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters, comparing four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound). Result: 4-bit quantization generally preserves translation quality, but low-resource and typologically diverse languages degrade significantly, particularly in 2-bit settings. GGUF variants provide the most consistent performance, even at 2-bit precision. Conclusion: Quantization is essential for deploying LLMs on resource-constrained hardware, but its effect on multilingual tasks remains underexplored. The paper offers actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings. Abstract: Quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, but its implications for multilingual tasks remain underexplored. We conduct the first large-scale evaluation of post-training quantization (PTQ) on machine translation across 55 languages using five LLMs ranging from 1.7B to 70B parameters. Our analysis reveals that while 4-bit quantization often preserves translation quality for high-resource languages and large models, significant degradation occurs for low-resource and typologically diverse languages, particularly in 2-bit settings. We compare four quantization techniques (AWQ, BitsAndBytes, GGUF, and AutoRound), showing that algorithm choice and model size jointly determine robustness. GGUF variants provide the most consistent performance, even at 2-bit precision. Additionally, we quantify the interactions between quantization, decoding hyperparameters, and calibration languages, finding that language-matched calibration offers benefits primarily in low-bit scenarios. Our findings offer actionable insights for deploying multilingual LLMs for machine translation under quantization constraints, especially in low-resource settings.
[40] SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Yuan Ge,Junxiang Zhang,Xiaoqian Liu,Bei Li,Xiangnan Ma,Chenglong Wang,Kaiyang Ye,Yangfan Du,Linfeng Zhang,Yuxin Huang,Tong Xiao,Zhengtao Yu,JingBo Zhu
Main category: cs.CL
TL;DR: SageLM improves the evaluation of Speech-to-Speech models by combining semantic and acoustic analysis with enhanced explainability and achieves superior performance over existing methods.
Details
Motivation: The challenge of evaluating Speech-to-Speech Large Language Models necessitates a comprehensive and explainable evaluation approach. Method: SageLM jointly assesses semantic and acoustic dimensions using rationale-based supervision and a synthetic preference dataset called SpeechFeedback. Result: SageLM achieves an 82.79% agreement rate with human evaluators, surpassing cascaded and SLM-based baselines by at least 7.42% and 26.20%. Conclusion: SageLM is a highly effective and explainable speech LLM that outperforms existing methods in evaluating Speech-to-Speech models. Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
[41] How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench
Venkatesh Mishra,Amir Saeidi,Satyam Raj,Mutsumi Nakamura,Jayanth Srinivasa,Gaowen Liu,Ali Payani,Chitta Baral
Main category: cs.CL
TL;DR: The Input-Reformulation Multi-Agent (IRMA) framework improves the reasoning and planning capabilities of large language models in dynamic environments, outperforming existing methods by significant margins.
Details
Motivation: Large language models (LLMs) struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over long horizons in multi-turn conversational environments like τ-bench. This research aims to identify and mitigate these failures. Method: A comprehensive manual analysis of common errors in conversation trajectories was conducted. Experiments were performed with input reformulations to improve agent decision-making. The Input-Reformulation Multi-Agent (IRMA) framework was proposed, which automatically reformulates user queries with domain rules and tool suggestions. Result: IRMA significantly outperformed ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. Conclusion: The proposed IRMA framework demonstrates superior reliability and consistency compared to other methods like ReAct, Function Calling, and Self-Reflection in dynamic environments. Abstract: Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on.
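Because this paper's scores are reported as pass^k rather than the more familiar pass@k, a short sketch of the difference may help. It assumes the usual combinatorial estimators over n recorded trials with c successes per task: pass^k estimates the chance that k sampled trials all succeed (a consistency measure), while pass@k estimates the chance that at least one succeeds. The function names and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(num_trials, num_success, k):
    """Estimate the probability that k trials drawn without replacement ALL succeed."""
    _validate(num_trials, num_success, k)
    return comb(num_success, k) / comb(num_trials, k)


def pass_at_k(num_trials, num_success, k):
    """Estimate the probability that AT LEAST ONE of k drawn trials succeeds."""
    _validate(num_trials, num_success, k)
    return 1.0 - comb(num_trials - num_success, k) / comb(num_trials, k)


def _validate(num_trials, num_success, k):
    if not 0 <= num_success <= num_trials:
        raise ValueError("num_success must lie between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie between 1 and num_trials")


def mean_over_tasks(metric, per_task_counts, k):
    """Average a per-task metric over (num_trials, num_success) pairs."""
    return sum(metric(n, c, k) for n, c in per_task_counts) / len(per_task_counts)


if __name__ == "__main__":
    # Hypothetical task: 8 trials recorded, 6 of them successful, k = 5.
    print(pass_hat_k(8, 6, 5), pass_at_k(8, 6, 5))
```

With the same trial counts, pass^k is never higher than pass@k, which is why it is the stricter choice when the goal is an agent that behaves reliably across repeated conversations.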
The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
[42] STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment
Jiaqian Li,Qisheng Hu,Jing Li,Wenya Wang
Main category: cs.CL
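TL;DR: This paper proposes a new two-stage exemplar selection strategy that improves ICL performance and generalization on structured prediction tasks.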
Details
Motivation: Existing ICL selection strategies often overlook structural alignment in structured prediction tasks, leading to suboptimal performance and poor generalization. Method: A two-stage exemplar selection strategy is proposed: a BERT-based retriever is first fine-tuned with structure-aware supervision, and the retriever is then enhanced to amplify syntactically meaningful information in the hidden representations. Result: The method performs well on four benchmarks and works with multiple recent LLMs as inference-time models. Conclusion: Experiments show that the proposed method outperforms existing baselines across multiple benchmark datasets. Abstract: In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.
[43] ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents
Tianjian Liu,Fanqi Wan,Jiajian Guo,Xiaojun Quan
Main category: cs.CL
TL;DR: This paper introduces ProactiveEval, a comprehensive framework for evaluating proactive dialogue capabilities in large language models, identifying top-performing models and linking reasoning abilities to proactive behavior.
Details
Motivation: Existing research on proactive dialogue is fragmented and domain-specific, limiting comprehensive evaluation of LLMs' proactive conversation abilities. A unified framework is needed to better assess and explore these capabilities. Method: The paper proposes ProactiveEval, a unified framework that decomposes proactive dialogue into target planning and dialogue guidance, and enables automated generation of diverse evaluation data across multiple domains. Result: The framework was used to create 328 evaluation environments across 6 domains. Experiments with 22 LLMs identified top-performing models and showed how reasoning capabilities affect proactive dialogue performance. Conclusion: DeepSeek-R1 and Claude-3.7-Sonnet perform exceptionally well in target planning and dialogue guidance tasks, respectively, highlighting the importance of reasoning capabilities in proactive dialogue behaviors. Abstract: Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. 
Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.
[44] Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution
Chen Chen,Yuchen Sun,Jiaxin Gao,Xueluan Gong,Qian Wang,Ziyao Wang,Yongsen Zheng,Kwok-Yan Lam
Main category: cs.CL
TL;DR: LETHE is a new defense method that effectively removes backdoor behaviors from large language models using both internal model merging and external prompt engineering, significantly reducing attack success rates while preserving model performance.
Details
Motivation: LLMs are vulnerable to backdoor attacks that activate harmful behavior upon specific triggers. Existing defenses are limited in scope and ineffective against advanced attacks such as multi-trigger or triggerless attacks. Method: The paper introduces LETHE, which uses two mechanisms: (1) internal knowledge dilution via merging a clean model with a backdoored model, and (2) external defense by incorporating benign evidence into prompts to distract the model from backdoor triggers. Result: LETHE achieves up to a 98% reduction in attack success rate across 5 LLMs and outperforms 8 state-of-the-art defense methods against 8 types of backdoor attacks. It is also cost-efficient and robust against adaptive attacks. Conclusion: LETHE effectively neutralizes backdoor behaviors in LLMs by combining internal model merging and external prompt-based defenses, demonstrating high efficacy against advanced backdoor attacks while maintaining model utility. Abstract: Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious behaviors by diluting the backdoor impact within the model's parametric memory. 
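The internal dilution step, merging a clean model with a backdoored one, can be pictured with a minimal sketch. Plain linear interpolation of parameters is used here purely as an assumption for illustration; the abstract does not say which merging method LETHE uses, and the function name `merge_state_dicts`, the `clean_weight` parameter, and the toy parameter dictionaries are hypothetical.

```python
def merge_state_dicts(backdoored, clean, clean_weight=0.5):
    """Linearly interpolate two models' parameters, name by name.

    backdoored, clean: dicts mapping parameter names to flat lists of floats.
    clean_weight: share of the clean model in the merge, between 0 and 1.
    """
    if backdoored.keys() != clean.keys():
        raise ValueError("both models must expose the same parameter names")
    if not 0.0 <= clean_weight <= 1.0:
        raise ValueError("clean_weight must lie in [0, 1]")
    merged = {}
    for name, backdoored_values in backdoored.items():
        clean_values = clean[name]
        if len(backdoored_values) != len(clean_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - clean_weight) * b + clean_weight * c
            for b, c in zip(backdoored_values, clean_values)
        ]
    return merged


if __name__ == "__main__":
    backdoored_model = {"layer.weight": [1.0, 3.0]}
    clean_model = {"layer.weight": [3.0, 1.0]}
    print(merge_state_dicts(backdoored_model, clean_model, clean_weight=0.5))
```

Raising `clean_weight` pulls the merged parameters further from the backdoored ones, which conveys the intuition of diluting whatever the backdoor stored in parametric memory; the trade-off against retained task performance would have to be measured empirically.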
Externally, LETHE incorporates benign and semantically relevant evidence into the prompt to distract the LLM's attention from backdoor features. Experimental results on classification and generation domains across 5 widely used LLMs demonstrate that LETHE outperforms 8 state-of-the-art defense baselines against 8 backdoor attacks. LETHE reduces the attack success rate of advanced backdoor attacks by up to 98% while maintaining model utility. Furthermore, LETHE has proven to be cost-efficient and robust against adaptive backdoor attacks.
[45] An Agile Method for Implementing Retrieval Augmented Generation Tools in Industrial SMEs
Mathieu Bourdin,Anas Neumann,Thomas Paviot,Robert Pellerin,Samir Lamouri
Main category: cs.CL
TL;DR: This paper presents EASI-RAG, a structured and agile method for deploying Retrieval-Augmented Generation systems in industrial SMEs, validated through a real-world case study. The results show that it enables fast implementation, high user adoption, and accurate responses despite limited resources and expertise.
Details
Motivation: The motivation for this work stems from the challenges faced by Small and Medium Enterprises (SMEs) in deploying Retrieval-Augmented Generation (RAG) tools due to limited resources and lack of NLP expertise. This paper aims to address these challenges by proposing a practical solution. Method: The paper introduces EASI-RAG, a structured and agile method based on method engineering principles, validated through a real-world case study in an environmental testing laboratory. Result: The results show that EASI-RAG supports fast implementation, high user adoption, accurate answers, and enhances the reliability of underlying data. The system was deployed in under a month by a team with no prior RAG experience. Conclusion: This paper concludes that EASI-RAG has the potential to enable the successful deployment of RAG systems in industrial SMEs, given its ability to support fast implementation, high user adoption, and accurate results. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful solution to mitigate the limitations of Large Language Models (LLMs), such as hallucinations and outdated knowledge. However, deploying RAG-based tools in Small and Medium Enterprises (SMEs) remains a challenge due to their limited resources and lack of expertise in natural language processing (NLP). This paper introduces EASI-RAG, Enterprise Application Support for Industrial RAG, a structured, agile method designed to facilitate the deployment of RAG systems in industrial SME contexts. EASI-RAG is based on method engineering principles and comprises well-defined roles, activities, and techniques. The method was validated through a real-world case study in an environmental testing laboratory, where a RAG tool was implemented to answer operators' queries using data extracted from operational procedures. The system was deployed in under a month by a team with no prior RAG experience and was later iteratively improved based on user feedback.
Results demonstrate that EASI-RAG supports fast implementation and high user adoption, delivers accurate answers, and enhances the reliability of underlying data. This work highlights the potential of RAG deployment in industrial SMEs. Future work includes generalization across diverse use cases and further integration with fine-tuned models.
[46] Re-Representation in Sentential Relation Extraction with Sequence Routing Algorithm
Ramazan Ali Bahrami,Ramin Yahyapour
Main category: cs.CL
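TL;DR: This paper studies sentential relation extraction and proposes a new method based on dynamic routing in capsules. The method achieves strong results on several datasets, and the paper identifies label noise and re-representation as factors that affect performance.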
Details
Motivation: The paper aims to explore ways of improving performance in sentential relation extraction and to identify the factors that affect it. Method: The paper proposes a sentential relation extraction method that uses dynamic routing in capsules and evaluates its performance on several datasets. Result: The method outperforms the state of the art on the Tacred, Tacredrev, Retacred, and Conll04 datasets but performs worse on the larger Wikidata dataset. The paper identifies noise in the Wikidata labels as one cause and shows an association between better performance and better re-representation. Conclusion: The paper concludes that, besides label noise in distantly supervised relation extraction datasets, re-representation is a further challenge in sentential relation extraction. Abstract: Sentential relation extraction (RE) is an important task in natural language processing (NLP). In this paper we propose to do sentential RE with dynamic routing in capsules. We first show that the proposed approach outperforms the state of the art on the common sentential relation extraction datasets Tacred, Tacredrev, Retacred, and Conll04. We then investigate potential reasons for its good performance on the mentioned datasets, and yet low performance on another similar, yet larger sentential RE dataset, Wikidata. As such, we identify noise in Wikidata labels as one of the reasons that can hinder performance. Additionally, we show an association of better performance with better re-representation, a term from neuroscience referring to a change of representation in the human brain to improve the match at comparison time. As an example, in the given analogous terms King:Queen::Man:Woman, at comparison time, and as a result of re-representation, the similarity between related head terms (King,Man) and tail terms (Queen,Woman) increases. As such, our observations show that our proposed model can do re-representation better than the vanilla model it is compared with. To that end, besides noise in the labels of the distantly supervised RE datasets, we propose re-representation as a challenge in sentential RE.
[47] Enabling Equitable Access to Trustworthy Financial Reasoning
William Jurayj,Nils Holzenberger,Benjamin Van Durme
Main category: cs.CL
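TL;DR: This paper explores combining large language models (LLMs) with a symbolic solver to calculate tax obligations, improving the accuracy and economic feasibility of automated tax systems.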
Details
Motivation: Tax filing requires complex reasoning and precise calculation, and errors can incur costly penalties. Modern LLMs struggle to meet the high accuracy and auditability this task demands, so a more reliable solution is needed. Method: LLMs are combined with a symbolic solver; system performance is improved by translating plain-text rules into formal logic programs and pairing them with intelligently retrieved exemplars for case representations. The system is evaluated on the StAtutory Reasoning Assessment (SARA) dataset, and a method is proposed for estimating deployment cost based on the real-world cost of tax errors. Result: Experiments show that the approach substantially improves performance on the SARA dataset and reduces costs to below real-world averages, demonstrating the economic feasibility of neuro-symbolic architectures for tax assistance. Conclusion: Combining LLMs with a symbolic solver can improve the accuracy of tax calculation and reduce costs, showing the potential of neuro-symbolic architectures for increasing equitable access to reliable tax services. Abstract: According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the promise and economic feasibility of neuro-symbolic architectures for increasing equitable access to reliable tax assistance.
cs.CV [Back]
[48] Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
Alberto Compagnoni,Davide Caffagni,Nicholas Moratelli,Lorenzo Baraldi,Marcella Cornia,Rita Cucchiara
Main category: cs.CV
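TL;DR: This paper proposes CHAIR-DPO, which fine-tunes multimodal large language models with a reward based on the CHAIR metric and effectively reduces hallucinations.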
Details
Motivation: Multimodal large language models (MLLMs) tend to generate hallucinated answers that are inconsistent with the visual input, an alignment problem that needs to be addressed. Method: The CHAIR metric is used to distinguish hallucinated from non-hallucinated samples among generated answers, and the MLLM is fine-tuned via Direct Preference Optimization (DPO). Result: On several hallucination benchmarks, CHAIR-DPO significantly reduces the number of hallucinated answers. Conclusion: CHAIR-DPO effectively reduces hallucinations in multimodal large language models, demonstrating the effectiveness of fine-tuning MLLMs with a CHAIR-based reward. Abstract: Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively reduces the number of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
[49] SDiFL: Stable Diffusion-Driven Framework for Image Forgery Localization
Yang Su,Shunquan Tan,Jiwu Huang
Main category: cs.CV
TL;DR: This paper introduces a novel image forensic framework using Stable Diffusion's multi-modal capabilities to enhance forgery localization, achieving significant performance improvements without relying on extensive annotated data.
Details
Motivation: The motivation stems from the rapid advancement of image manipulation technologies, such as Stable Diffusion, which outpace current image forgery localization techniques that rely heavily on annotated data. Method: The method involves leveraging the multi-modal architecture of Stable DiffusionV3 (SD3) and treating image forgery residuals as an explicit modality fused into the latent space during training to enhance localization accuracy. Result: The proposed framework achieves up to a 12% performance improvement on benchmark datasets and demonstrates strong generalization on real-world and natural scene forgery images not seen during training. Conclusion: The paper concludes that integrating both image generation and perceptual capabilities of Stable Diffusion into a forensic framework significantly improves forgery localization performance, with up to a 12% improvement over existing models. Abstract: Driven by the new generation of multi-modal large models, such as Stable Diffusion (SD), image manipulation technologies have advanced rapidly, posing significant challenges to image forensics. However, existing image forgery localization methods, which heavily rely on labor-intensive and costly annotated data, are struggling to keep pace with these emerging image manipulation technologies. To address these challenges, we are the first to integrate both image generation and powerful perceptual capabilities of SD into an image forensic framework, enabling more efficient and accurate forgery localization. First, we theoretically show that the multi-modal architecture of SD can be conditioned on forgery-related information, enabling the model to inherently output forgery localization results. 
Then, building on this foundation, we specifically leverage the multimodal framework of Stable DiffusionV3 (SD3) to enhance forgery localization performance. We leverage the multi-modal processing capabilities of SD3 in the latent space by treating image forgery residuals -- high-frequency signals extracted using specific highpass filters -- as an explicit modality. This modality is fused into the latent space during training to enhance forgery localization performance. Notably, our method fully preserves the latent features extracted by SD3, thereby retaining the rich semantic information of the input image. Experimental results show that our framework achieves up to 12% improvements in performance on widely used benchmarking datasets compared to current state-of-the-art image forgery localization models. Encouragingly, the model demonstrates strong performance on forensic tasks involving real-world document forgery images and natural scene forgery images, even when such data were entirely unseen during training.
[50] Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study
Max Torop,Masih Eskandar,Nicholas Kurtansky,Jinyang Liu,Jochen Weber,Octavia Camps,Veronica Rotemberg,Jennifer Dy,Kivanc Kose
Main category: cs.CV
TL;DR: This paper explores combining Multimodal Large Language Models with quantitative attributes to improve the interpretability of AI models in diagnosing skin diseases, showing that MLLMs can offer more understandable and accurate diagnostic support through fine-tuning.
Details
Motivation: The motivation is to enhance the interpretability of Artificial Intelligence models used in diagnosing skin diseases, particularly cancer, to make them more useful and trustworthy for clinicians. Method: The research combines Multimodal Large Language Models (MLLMs) with quantitative attribute usage to improve interpretability. It involves fine-tuning MLLMs to predict attribute values from images and evaluates this through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset. Result: The study demonstrates that MLLM embedding spaces can be successfully grounded in quantitative attributes, offering improved interpretability for diagnosis through natural language reasoning and interactive formats. Conclusion: The study concludes that MLLM embedding spaces can be effectively grounded in quantitative attributes related to lesion appearance through fine-tuning, enhancing the interpretability of model predictions in skin disease diagnosis. Abstract: Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. 
We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.
[51] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels
Hossein Ahmadi,Banafsheh Saffari
Main category: cs.CV
TL;DR: A new Vision Transformer framework improves Automatic Modulation Recognition by being more efficient with labels and performing well under various conditions.
Details
Motivation: To overcome the limitations of existing Automatic Modulation Recognition solutions that rely on large labeled datasets or complex training pipelines, thus affecting scalability and generalization. Method: A unified Vision Transformer framework integrating supervised, self-supervised, and reconstruction objectives was developed, using a ViT encoder, a lightweight convolutional decoder, and a linear classifier. Result: The approach outperformed supervised CNN and ViT baselines in low-label scenarios, achieving ResNet-level accuracy with only 15-20% labeled data and maintaining performance across various SNR levels. Conclusion: The proposed unified Vision Transformer framework is effective for Automatic Modulation Recognition, offering a label-efficient and generalizable solution that performs well across different conditions. Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. 
Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.
[52] InfinityHuman: Towards Long-Term Audio-Driven Human
Xiaodi Li,Pan Xie,Yi Ren,Qijun Gan,Chen Zhang,Fangyuan Kong,Xiang Yin,Bingyue Peng,Zehuan Yuan
Main category: cs.CV
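TL;DR: Proposes InfinityHuman, a new framework that generates high-resolution, long-duration videos from audio and addresses identity drift and poor hand motion.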
Details
Motivation: Existing methods accumulate errors when extending videos, leading to identity drift, color shifts, and scene instability; hand movements are also poorly modeled and misaligned with the audio. Method: A coarse-to-fine framework is proposed that first generates audio-synchronized representations and then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Result: Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Conclusion: InfinityHuman effectively addresses identity drift and hand motion problems in audio-driven human animation and offers a new direction for future research. Abstract: Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.
[53] Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
Mert Cokelek,Halit Ozsoy,Nevrez Imamoglu,Cagri Ozcinar,Inci Ayhan,Erkut Erdem,Aykut Erdem
Main category: cs.CV
TL;DR: This study introduces a new dataset and two transformer-based models for 360-degree audio-visual saliency prediction, showing that integrating spatial audio significantly improves performance.
Details
Motivation: The motivation stems from the lack of comprehensive datasets for 360-degree audio-visual saliency prediction and the need to better understand how to utilize audio-visual cues in predicting visual saliency in 360-degree videos. Method: The study introduces two novel models, SalViT360 and SalViT360-AV, which use vision transformers with spatio-temporal attention layers and incorporate audio input through transformer adapters. Result: The proposed models, SalViT360 and SalViT360-AV, outperformed existing methods in predicting viewer attention in 360-degree scenes across multiple benchmark datasets, including the newly curated YT360-EyeTracking dataset. Conclusion: The study concludes that integrating spatial audio cues into the model architecture significantly improves saliency prediction accuracy in omnidirectional videos. Abstract: Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. 
Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.
[54] A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
Phu-Vinh Nguyen,Tan-Hanh Pham,Chris Ngo,Truong Son Hy
Main category: cs.CV
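TL;DR: This paper proposes an explanation pipeline based on vision-language models that analyzes vision model behavior at both the sample and dataset levels, helping to uncover model problems and improve explainability.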
Details
Motivation: Current vision model development focuses mainly on improving performance and pays less attention to explainability; in particular, methods that explain a model's overall behavior remain scarce. Understanding how vision models behave on general images is important for avoiding bias and identifying model trends. Method: Vision-language models are applied to design a pipeline that explains vision model behavior, with analysis at both the sample and dataset levels. Result: The proposed pipeline effectively explains vision model behavior, discovers failure cases, and provides in-depth insights for model analysis. Conclusion: The paper proposes a pipeline that explains vision models at both the sample and dataset levels and can discover failure cases, making it possible to integrate vision model development with xAI analysis. Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.
[55] ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems
Mohamed Ohamouddou,Said Ohamouddou,Abdellatif El Afia,Rafik Lasri
Main category: cs.CV
TL;DR: This paper proposes ATMS-KD, a new knowledge distillation framework that enhances the performance of lightweight CNN models for agricultural image classification, achieving high accuracy and low latency.
Details
Motivation: To address the need for efficient CNN models suitable for resource-constrained agricultural environments, enabling accurate and fast image classification. Method: The study introduces ATMS-KD, combining adaptive temperature scheduling and mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model to lightweight CNN student models. Result: All student models using ATMS-KD achieved validation accuracies exceeding 96.7%, with the compact model reaching 97.11% accuracy and the lowest inference latency of 72.19 ms. Conclusion: ATMS-KD is an effective framework for developing lightweight CNN models for agricultural environments, showing superior performance compared to existing methods. Abstract: This study proposes ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation), a novel framework for developing lightweight CNN models suitable for resource-constrained agricultural environments. The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model (5.7 M parameters) to lightweight residual CNN students. Three student configurations were evaluated: Compact (1.3 M parameters), Standard (2.4 M parameters), and Enhanced (3.8 M parameters). The dataset used in this study consists of images of Rosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco, providing a realistic benchmark for agricultural computer vision applications under diverse environmental conditions. Experimental evaluation on the Damascena rose maturity classification dataset demonstrated significant improvements over direct training methods. All student models achieved validation accuracies exceeding 96.7% with ATMS-KD compared to 95-96% with direct training.
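The temperature-scaled distillation that a framework such as ATMS-KD builds on can be sketched as follows. This is the standard softened-softmax distillation loss with a fixed temperature, shown only for orientation; the abstract does not specify the adaptive temperature schedule or the mixed-sample augmentation, so neither is reproduced here. The function names and the example logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a temperature (> 0).

    Higher temperatures flatten the distribution, exposing more of the
    teacher's relative preferences among non-target classes.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float,
) -> float:
    """KL(teacher || student) on softened outputs, scaled by T^2.

    The T^2 factor is the usual correction that keeps gradient magnitudes
    comparable across temperatures.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0
    )
    return temperature ** 2 * kl


# Hypothetical logits for a three-class problem.
teacher_out = [4.0, 1.0, 0.5]
student_out = [2.5, 1.5, 0.2]
loss_value = distillation_loss(student_out, teacher_out, temperature=4.0)
```

An adaptive scheme would replace the fixed `temperature` argument with a value that changes during training; how ATMS-KD chooses that value is not stated in the abstract.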
The framework outperformed eleven established knowledge distillation methods, achieving 97.11% accuracy with the compact model -- a 1.60 percentage point improvement over the second-best approach while maintaining the lowest inference latency of 72.19 ms. Knowledge retention rates exceeded 99% for all configurations, demonstrating effective knowledge transfer regardless of student model capacity.
[56] Linking heterogeneous microstructure informatics with expert characterization knowledge through customized and hybrid vision-language representations for industrial qualification
Mutahar Safdar,Gentry Wood,Max Zimmermann,Guy Lamouche,Priti Wanjara,Yaoyao Fiona Zhao
Main category: cs.CV
TL;DR: This study proposes a hybrid vision-language framework to improve the qualification of advanced materials in industrial manufacturing, combining microstructure informatics with expert knowledge for zero-shot classification of microstructures without retraining models.
Details
Motivation: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. Method: The study introduces a novel framework that links microstructure informatics with expert characterization knowledge using customized and hybrid vision-language representations (VLRs), integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA) and developing a customized similarity-based representation incorporating expert-annotated images and textual descriptions. Result: Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across a range of characterization criteria, with FLAVA offering higher visual sensitivity and CLIP providing consistent alignment with textual criteria. Z-score normalization improves alignment and classification effectiveness. Conclusion: The proposed hybrid vision-language framework enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining, contributing to scalable and domain-adaptable qualification strategies in engineering informatics. Abstract: Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non-conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision-language representations (VLRs). By integrating deep semantic segmentation with pre-trained multi-modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. 
To overcome limitations in general-purpose embeddings, we develop a customized similarity-based representation that incorporates both positive and negative references from expert-annotated images and their associated textual descriptions. This allows zero-shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that the FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z-score normalization adjusts raw unimodal and cross-modal similarity scores based on their local dataset-driven distributions, enabling more effective alignment and classification in the hybrid vision-language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human-in-the-loop decision-making without task-specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain-adaptable qualification strategies in engineering informatics.
[57] MedNet-PVS: A MedNeXt-Based Deep Learning Model for Automated Segmentation of Perivascular Spaces
Zhen Xuen Brandon Low,Rory Zhang,Hang Min,William Pham,Lucy Vivash,Jasmine Moses,Miranda Lynch,Karina Dorfman,Cassandra Marotta,Shaun Koh,Jacob Bunyamin,Ella Rowsthorn,Alex Jarema,Himashi Peiris,Zhaolin Chen,Sandy R. Shultz,David K. Wright,Dexiao Kong,Sharon L. Naismith,Terence J. O'Brien,Ying Xia,Meng Law,Benjamin Sinclair
Main category: cs.CV
TL;DR: This paper presents MedNeXt-L-k5, a 3D encoder-decoder convolutional network adapted for automated segmentation of enlarged perivascular spaces (PVS) in MRI scans, showing comparable performance to inter-rater reliability on a homogeneous dataset but lower generalization across diverse datasets.
Details
Motivation: Manual segmentation of PVS is time-consuming and has moderate inter-rater reliability, while existing automated deep learning models show moderate performance and often fail to generalize across diverse clinical and research MRI datasets. Method: MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, was adapted for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted MRI scans and another using 40 heterogeneous T1-weighted MRI volumes. Performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). Result: Models trained on the HCP-Aging dataset's T2w images achieved voxel-level Dice scores of 0.88+/-0.06 (white matter), comparable to the dataset's inter-rater reliability. Models trained on T1w images achieved a Dice score of 0.58+/-0.09 (white matter). Under LOSOCV, the model achieved voxel-level Dice scores of 0.38+/-0.16 (white matter) and 0.35+/-0.12 (basal ganglia), and cluster-level Dice scores of 0.61+/-0.19 (white matter) and 0.62+/-0.21 (basal ganglia). Conclusion: MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. However, it did not outperform the nnU-Net, indicating that attention-based mechanisms from transformer-inspired models are not necessary for achieving high accuracy in PVS segmentation. Abstract: Enlarged perivascular spaces (PVS) are increasingly recognized as biomarkers of cerebral small vessel disease, Alzheimer's disease, stroke, and aging-related neurodegeneration. However, manual segmentation of PVS is time-consuming and subject to moderate inter-rater reliability, while existing automated deep learning models have moderate performance and typically fail to generalize across diverse clinical and research MRI datasets. 
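For readers unfamiliar with the metric, the voxel-level Dice scores quoted in the Result summary above follow the standard overlap definition, 2|A ∩ B| / (|A| + |B|), for a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and the cluster-level variant reported in the paper is not covered.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Toy four-voxel masks: one shared foreground voxel out of two in each mask.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
overlap = dice_score(predicted_mask, reference_mask)
```

A score of 1.0 means the two masks coincide and 0.0 means they share no foreground voxels, which gives a scale for comparing the values reported for the T2w and T1w models.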
We adapted MedNeXt-L-k5, a Transformer-inspired 3D encoder-decoder convolutional network, for automated PVS segmentation. Two models were trained: one using a homogeneous dataset of 200 T2-weighted (T2w) MRI scans from the Human Connectome Project-Aging (HCP-Aging) dataset and another using 40 heterogeneous T1-weighted (T1w) MRI volumes from seven studies across six scanners. Model performance was evaluated using internal 5-fold cross validation (5FCV) and leave-one-site-out cross validation (LOSOCV). MedNeXt-L-k5 models trained on the T2w images of the HCP-Aging dataset achieved voxel-level Dice scores of 0.88+/-0.06 (white matter, WM), comparable to the reported inter-rater reliability of that dataset, and the highest yet reported in the literature. The same models trained on the T1w images of the HCP-Aging dataset achieved a substantially lower Dice score of 0.58+/-0.09 (WM). Under LOSOCV, the model had voxel-level Dice scores of 0.38+/-0.16 (WM) and 0.35+/-0.12 (BG), and cluster-level Dice scores of 0.61+/-0.19 (WM) and 0.62+/-0.21 (BG). MedNeXt-L-k5 provides an efficient solution for automated PVS segmentation across diverse T1w and T2w MRI datasets. MedNeXt-L-k5 did not outperform the nnU-Net, indicating that the attention-based mechanisms present in transformer-inspired models to provide global context are not required for high accuracy in PVS segmentation.[58] Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Zhixiang Chi,Yanan Wu,Li Gu,Huan Liu,Ziqiang Wang,Yang Zhang,Yang Wang,Konstantinos N. Plataniotis
Main category: cs.CV
TL;DR: This paper introduces a feedback-driven framework that enhances CLIP's open-vocabulary segmentation by improving spatial coherence and semantic consistency without training.
Details
Motivation: CLIP struggles with open-vocabulary segmentation due to poor localization and inconsistent spatial coherence propagation. Method: A training-free, feedback-driven self-adaptive framework that adapts patch-level correspondences from output predictions back to intermediate attention. Result: The approach improves performance across eight benchmarks and integrates seamlessly into state-of-the-art methods with various backbones and attention types. Conclusion: The proposed framework effectively enhances semantic consistency and spatial coherence in CLIP models for improved open-vocabulary segmentation. Abstract: CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). 
We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.[59] How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
Zhuoran Yu,Yong Jae Lee
Main category: cs.CV
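TL;DR: This study proposes a new framework that reveals the stage-wise processing structure inside multimodal large language models and finds that this structure is stable across different model settings.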
Details
Motivation: Multimodal large language models (MLLMs) perform strongly on vision-language tasks, but their internal processing mechanisms remain underexplored. Method: A probing framework is introduced that trains linear classifiers to predict fine-grained visual categories from the token embeddings at each layer and evaluates the probes under three types of controlled prompt variations. Result: A consistent stage-wise structure is identified; it remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, but the layer allocation shifts notably when the base LLM architecture changes. Conclusion: MLLMs exhibit a stable stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, we introduce a probing framework to systematically analyze how MLLMs process visual and textual inputs across layers. We train linear classifiers to predict fine-grained visual categories (e.g., dog breeds) from token embeddings extracted at each layer, using a standardized anchor question. To uncover the functional roles of different layers, we evaluate these probes under three types of controlled prompt variations: (1) lexical variants that test sensitivity to surface-level changes, (2) semantic negation variants that flip the expected answer by modifying the visual concept in the prompt, and (3) output format variants that preserve reasoning but alter the answer format. Applying our framework to LLaVA-1.5, LLaVA-Next-LLaMA-3, and Qwen2-VL, we identify a consistent stage-wise structure in which early layers perform visual grounding, middle layers support lexical integration and semantic reasoning, and final layers prepare task-specific outputs. We further show that while the overall stage-wise structure remains stable across variations in visual tokenization, instruction tuning data, and pretraining corpus, the specific layer allocation to each stage shifts notably with changes in the base LLM architecture. Our findings provide a unified perspective on the layer-wise organization of MLLMs and offer a lightweight, model-agnostic approach for analyzing multimodal representation dynamics.[60] Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)
Zhi Li,Hau Phan,Matthew Emigh,Austin J. Brockmeier
Main category: cs.CV
TL;DR: This paper proposes SLiCS, a method for disentangling vision-language embeddings into concept-specific components, enabling more precise image retrieval and conditional generation.
Details
Motivation: The hypothesis is that embedding spaces from vision-language co-embedding networks can be disentangled to separate information about complex scenes into concept-specific subspaces. Method: Proposed a supervised dictionary learning approach to estimate a linear synthesis model that decomposes embeddings into concept-specific components. Result: SLiCS enabled more precise concept-filtered image retrieval and was applied successfully to TiTok autoencoder embeddings and DINOv2 latent embeddings. Conclusion: The disentangled embeddings provided by SLiCS enable more precise concept-filtered image retrieval and conditional generation using image-to-prompt. Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated to a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept's group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. 
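The linear synthesis model described above can be illustrated with a small sketch: given a group-structured dictionary and non-negative coefficients, each concept-specific component is the weighted sum of the atoms in that concept's group, and the embedding is approximated by the sum of the components. The function name `concept_components`, the toy atoms, and the assumption that the sparse codes are already available are all hypothetical; the paper's alternating optimization for learning the dictionary and codes is not reproduced here.

```python
def concept_components(atoms, codes, groups):
    """Split an embedding into concept-specific components.

    atoms  : list of d-dimensional dictionary vectors
    codes  : one non-negative coefficient per atom (assumed precomputed)
    groups : concept label of each atom (group-structured dictionary)
    Returns a dict mapping each concept to its component vector.
    """
    dim = len(atoms[0])
    components = {}
    for atom, code, concept in zip(atoms, codes, groups):
        if code < 0:
            raise ValueError("codes must be non-negative")
        component = components.setdefault(concept, [0.0] * dim)
        for i in range(dim):
            component[i] += code * atom[i]
    return components


# Toy dictionary: two atoms for "dog", one for "grass" (hypothetical labels).
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
codes = [0.5, 0.0, 2.0]          # sparse: the second atom is inactive
groups = ["dog", "dog", "grass"]

parts = concept_components(atoms, codes, groups)
# Summing the components gives the reconstructed embedding.
reconstruction = [sum(vals) for vals in zip(*parts.values())]
```

Concept-filtered retrieval would then compare only the component for the concept of interest (for example `parts["dog"]`) against the corresponding components of database images.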
We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.[61] MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
Xiao Li,Yanfan Zhu,Ruining Deng,Wei-Qi Wei,Yu Wang,Shilin Zhao,Yaohong Wang,Haichun Yang,Yuankai Huo
Main category: cs.CV
TL;DR: MedFoundationHub is a secure, user-friendly toolkit for deploying medical vision-language models, aiming to enhance clinical applications while addressing privacy and security concerns.
Details
Motivation: Medical vision-language models offer significant potential for clinical applications, but pose security risks such as PHI exposure and data leakage. A secure and easy-to-use deployment solution is needed. Method: Developing a GUI toolkit for deploying medical VLMs with privacy-preserving features and evaluating its performance using expert assessments on colon and renal pathology cases. Result: Engaged board-certified pathologists evaluated five state-of-the-art VLMs using MedFoundationHub, revealing issues such as off-target answers, vague reasoning, and inconsistent terminology. Conclusion: MedFoundationHub provides a secure and accessible solution for deploying medical vision-language models, but current models still show limitations in precision and consistency. Abstract: Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. 
MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.[62] Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
Mang Cao,Sanping Zhou,Yizhe Li,Ye Deng,Wenli Huang,Le Wang
Main category: cs.CV
TL;DR: This paper proposes BIM, a method for multi-task dense prediction that improves cross-task interaction with novel scanning mechanisms while maintaining computational efficiency, showing superior performance on benchmark datasets.
Details
Motivation: The authors aim to address the challenge of balancing interaction completeness and computational efficiency in multi-task dense prediction. Method: The paper introduces BIM, which uses Bidirectional Interaction Scan (BI-Scan) and Multi-Scale Scan (MS-Scan) mechanisms to improve cross-task interaction while maintaining computational efficiency. Result: Extensive experiments on NYUD-V2 and PASCAL-Context benchmarks show that BIM outperforms existing state-of-the-art methods. Conclusion: The proposed Bidirectional Interaction Mamba (BIM) demonstrates superior performance compared to state-of-the-art methods in multi-task dense prediction. Abstract: Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.[63] Audio-Guided Visual Editing with Complex Multi-Modal Prompts
Hyeonyu Kim,Seokhoon Jeong,Seonghee Han,Chanhyuk Choi,Taehwan Kim
Main category: cs.CV
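TL;DR: This paper proposes an audio-guided visual editing framework that requires no additional training and addresses the limitations of text guidance in complex scenarios.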
Details
Motivation: Existing audio-guided visual editing methods require training on specific datasets, which limits their generalization to real-world scenarios, so a more general solution is needed. Method: The framework leverages a pre-trained multi-modal encoder with zero-shot capabilities, integrates diverse audio inputs, and handles complex multi-modal editing prompts through separate noise branching and adaptive patch selection. Result: Experiments show that the framework performs well in complex editing scenarios, effectively incorporates audio information, and outperforms text-only approaches. Conclusion: The study proposes a new audio-guided visual editing framework that excels at complex multi-modal editing tasks, particularly where text cannot adequately describe the desired edit. Abstract: Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone cannot adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often necessitate training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks, by alleviating the discrepancy between the audio encoder space and the diffusion model's prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple and multi-modal editing prompts through our separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels in handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.[64] More Reliable Pseudo-labels, Better Performance: A Generalized Approach to Single Positive Multi-label Learning
Luong Tran,Thieu Vo,Anh Nguyen,Sang Dinh,Van Nguyen
Main category: cs.CV
TL;DR: This paper proposes AEVLP, a framework combining a novel loss function (GPR Loss) and a pseudo-labeling technique (DAMP) to improve multi-label classification from partially annotated data, achieving state-of-the-art results.
Details
Motivation: Multi-label learning is challenging due to the high cost and effort of fully annotating large-scale datasets. Existing methods for Single Positive Multi-Label Learning (SPML) often lead to inaccuracies and false negatives, prompting the need for more robust and efficient solutions. Method: The authors propose a novel loss function called Generalized Pseudo-Label Robust Loss (GPR Loss) to handle noisy pseudo-labels and introduce a Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique, forming the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Result: Extensive experiments show that the AEVLP framework significantly improves multi-label classification performance and achieves state-of-the-art results. Conclusion: The proposed AEVLP framework, combining GPR Loss and DAMP technique, significantly advances multi-label classification and achieves state-of-the-art results on four benchmark datasets. Abstract: Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. 
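To make the problem setting concrete, here is a minimal sketch of the traditional baseline mentioned above, in which every unannotated label is treated as a negative. It is not the GPR Loss or DAMP proposed in the paper; the function name `assume_negative_bce` and the toy probabilities are illustrative assumptions.

```python
import math


def assume_negative_bce(probs, positive_index, eps=1e-12):
    """Binary cross-entropy for one image under the 'assume negative' rule.

    probs          : predicted probability for each class
    positive_index : index of the single annotated positive label
    All other classes are treated as negatives, which is exactly what
    produces false negatives when an unannotated class is truly present.
    """
    loss = 0.0
    for k, p in enumerate(probs):
        target = 1.0 if k == positive_index else 0.0
        loss -= target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps)
    return loss / len(probs)


# Class 0 is the annotated positive. Suppose class 1 is also present in the
# image but unannotated: a confident (correct) prediction for it is penalized.
loss_confident = assume_negative_bce([0.9, 0.8, 0.1], positive_index=0)
loss_suppressed = assume_negative_bce([0.9, 0.1, 0.1], positive_index=0)
```

Here `loss_confident` is larger than `loss_suppressed`, so the baseline pushes the model away from a correct prediction; this is the failure mode that pseudo-labeling approaches aim to reduce.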
Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.[65] Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection
Chengjun Zhang,Yuhao Zhang,Jie Yang,Mohamad Sawan
Main category: cs.CV
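TL;DR: This paper introduces a new spiking neural network method that improves performance on visual detection tasks, particularly at ultra-low time-steps.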
Details
Motivation: Current ANN-SNN conversion methods achieve strong results on classification tasks, but their performance on visual detection tasks remains suboptimal, which this paper sets out to address. Method: A delay-spike approach and a temporal-dependent Integrate-and-Fire (tdIF) neuron architecture are proposed, enabling Integrate-and-Fire neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Result: The method is evaluated extensively on two key vision tasks (object detection and lane line detection); it surpasses current ANN-SNN conversion methods and achieves state-of-the-art performance at ultra-low latency (within 5 time-steps). Conclusion: The paper proposes a delay-spike approach and a new temporal-dependent Integrate-and-Fire (tdIF) neuron architecture to address the weak performance of spiking neural networks on visual detection tasks, achieving state-of-the-art performance at ultra-low time-steps. Abstract: Spiking Neural Networks (SNNs), inspired by the brain, are characterized by minimal power consumption and swift inference capabilities on neuromorphic hardware, and have been widely applied to various visual perception tasks. Current ANN-SNN conversion methods have achieved excellent results in classification tasks with ultra-low time-steps, but their performance in visual detection tasks remains suboptimal. In this paper, we propose a delay-spike approach to mitigate the issue of residual membrane potential caused by heterogeneous spiking patterns. Furthermore, we propose a novel temporal-dependent Integrate-and-Fire (tdIF) neuron architecture for SNNs. This enables Integrate-and-Fire (IF) neurons to dynamically adjust their accumulation and firing behaviors based on the temporal order of time-steps. Our method enables spikes to exhibit distinct temporal properties, rather than relying solely on frequency-based representations. Moreover, the tdIF neuron maintains energy consumption on par with the traditional IF neuron. We demonstrate that our method achieves more precise feature representation with fewer time-steps, enabling high performance and ultra-low latency in visual detection tasks. In this study, we conduct an extensive evaluation of the tdIF method across two critical vision tasks: object detection and lane line detection. 
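As background for the residual membrane potential issue raised above, the sketch below simulates a plain Integrate-and-Fire neuron with soft reset over a handful of time-steps. It is the traditional IF neuron only; the delay-spike approach and the tdIF neuron are not reproduced, and the function name `if_neuron`, the constant input, and the threshold value are illustrative assumptions.

```python
def if_neuron(inputs, threshold=1.0):
    """Simulate a plain Integrate-and-Fire neuron with soft reset.

    inputs : input current at each time-step
    Returns (spike train, residual membrane potential after the last step).
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential += current            # integrate
        if potential >= threshold:      # fire
            spikes.append(1)
            potential -= threshold      # soft reset keeps the surplus
        else:
            spikes.append(0)
    return spikes, potential


# Constant input of 0.75 for 5 time-steps (values chosen to be exact in binary).
spikes, residual = if_neuron([0.75] * 5)
# spikes == [0, 1, 1, 1, 0], residual == 0.75
firing_rate = sum(spikes) / len(spikes)   # 0.6, versus an input level of 0.75
```

With only 5 time-steps the firing rate (0.6) under-represents the input level (0.75), and the difference is held in the leftover membrane potential. This is the kind of few-time-step error that frequency-based representations suffer from.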
The results demonstrate that the proposed method surpasses current ANN-SNN conversion approaches, achieving state-of-the-art performance with ultra-low latency (within 5 time-steps).[66] Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection
Yuqi Xiong,Wuzhen Shi,Yang Wen,Ruhan Liu
Main category: cs.CV
TL;DR: This paper introduces DUP-MCRNet, an improved method for salient object detection that enhances detail preservation, edge clarity, and multimodal information fusion, demonstrating superior performance on benchmark datasets.
Details
Motivation: Existing salient object detection (SOD) methods tend to lose details, blur edges, and insufficiently fuse single-modal information in complex scenes. Method: The paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet), which includes a dynamic uncertainty graph convolution module (DUGC) and a multimodal collaborative fusion strategy (MCF), optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Result: DUP-MCRNet outperforms various SOD methods on most common benchmark datasets. Conclusion: DUP-MCRNet demonstrates superior performance over existing SOD methods, particularly in edge clarity and robustness to complex backgrounds. Abstract: Existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficiently fusing single-modal information in complex scenes. To address these problems, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance; combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to compute a weighted fusion of the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency across modalities, thereby improving the ability to identify salient regions under occlusion, weak texture, or background interference. 
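The gated fusion idea in the MCF description above can be sketched as a softmax over per-modality gate values followed by a weighted sum of attention maps. This is only an illustration of weighted fusion under stated assumptions: the names `softmax` and `fuse_attention`, the flat-list attention maps, and the use of fixed gate logits are hypothetical, whereas the paper's gates are learnable and scene-dependent.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_attention(maps, gate_logits):
    """Fuse per-modality attention maps with softmax-normalized gates.

    maps        : list of attention maps, one per modality, each a flat list
    gate_logits : one gate value per modality (learned in practice; fixed here)
    """
    weights = softmax(gate_logits)
    size = len(maps[0])
    return [sum(w * m[i] for w, m in zip(weights, maps)) for i in range(size)]


# Toy two-pixel attention maps for RGB, depth, and edge features.
rgb, depth, edge = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

equal = fuse_attention([rgb, depth, edge], [0.0, 0.0, 0.0])     # plain average
rgb_heavy = fuse_attention([rgb, depth, edge], [5.0, 0.0, 0.0])  # RGB dominates
```

Raising one modality's gate moves the fused map toward that modality, which is how a gate can suppress a redundant or interfering input.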
Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.[67] MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection
Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shota Orihashi,Tomohiro Tanaka,Mana Ihori,Naoki Makishima,Naotaka Kawata
Main category: cs.CV
TL;DR: The paper proposes Multi-Scale Multi-View Detection (MSMVD) to improve Multi-View Pedestrian Detection by generating multi-scale BEV features and combining information using a feature pyramid network, which outperforms previous methods on the GMVD dataset.
Details
Motivation: End-to-end trainable deep learning methods struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. Method: Multi-Scale Multi-View Detection (MSMVD) generates multi-scale BEV features and combines information using a feature pyramid network. Result: Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves detection performance. Conclusion: MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset. Abstract: Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird's eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. 
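The MODA figure cited in the conclusion above is the Multiple Object Detection Accuracy from the CLEAR evaluation metrics, commonly computed as one minus the ratio of detection errors (misses plus false positives) to the number of ground-truth objects. The sketch below accumulates the counts over frames; the function name `moda` and the per-frame counts are illustrative assumptions, and the matching of detections to ground truth is not shown.

```python
def moda(misses, false_positives, ground_truths):
    """MODA = 1 - (sum of misses + sum of false positives) / sum of ground truths.

    Each argument is a list with one count per frame.
    """
    total_gt = sum(ground_truths)
    if total_gt <= 0:
        raise ValueError("need at least one ground-truth object")
    errors = sum(misses) + sum(false_positives)
    return 1.0 - errors / total_gt


# Toy example: two frames with 50 pedestrians each, 6 misses and 4 false positives overall.
score = moda(misses=[4, 2], false_positives=[1, 3], ground_truths=[50, 50])
points = 100.0 * score   # reported in points, so 0.9 corresponds to 90
```

Because false positives are counted as errors, MODA can become negative when a detector produces many spurious detections.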
Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by 4.5 points on the GMVD dataset.[68] A Spatial-Frequency Aware Multi-Scale Fusion Network for Real-Time Deepfake Detection
Libo Lv,Tianyi Wang,Mengxiao Huang,Ruixia Liu,Yinglong Wang
Main category: cs.CV
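TL;DR: This paper proposes SFMFNet, an efficient real-time deepfake detection method whose novel network design reduces computational cost while maintaining detection performance.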
Details
Motivation: Existing deepfake detection methods perform well on standard benchmarks, but their computational cost is too high for real-time applications. Method: A lightweight network called the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet) is proposed, comprising a spatial-frequency hybrid aware module, a token-selective cross attention mechanism, and a residual-enhanced blur pooling structure. Result: SFMFNet achieves a favorable balance between accuracy and efficiency on several benchmark datasets, with strong generalization and practical value for real-time applications. Conclusion: SFMFNet provides an efficient and effective architecture for real-time deepfake detection, achieving good generalization and practicality while maintaining high accuracy. Abstract: With the rapid advancement of real-time deepfake generation techniques, forged content is becoming increasingly realistic and widespread across applications like video conferencing and social media. Although state-of-the-art detectors achieve high accuracy on standard benchmarks, their heavy computational cost hinders real-time deployment in practical applications. To address this, we propose the Spatial-Frequency Aware Multi-Scale Fusion Network (SFMFNet), a lightweight yet effective architecture for real-time deepfake detection. We design a spatial-frequency hybrid aware module that jointly leverages spatial textures and frequency artifacts through a gated mechanism, enhancing sensitivity to subtle manipulations. A token-selective cross attention mechanism enables efficient multi-level feature interaction, while a residual-enhanced blur pooling structure helps retain key semantic cues during downsampling. Experiments on several benchmark datasets show that SFMFNet achieves a favorable balance between accuracy and efficiency, with strong generalization and practical value for real-time applications.[69] Dual-Model Weight Selection and Self-Knowledge Distillation for Medical Image Classification
Ayaka Tsutsumi,Guang Li,Ren Togo,Takahiro Ogawa,Satoshi Kondo,Miki Haseyama
Main category: cs.CV
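TL;DR: This paper proposes a new medical image classification method that combines dual-model weight selection with self-knowledge distillation, significantly improving the performance of lightweight models under limited computational resources.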
Details
Motivation: In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, so it is crucial to develop lightweight models that achieve performance comparable to large-scale models while maintaining computational efficiency. Method: The method first applies a dual-model weight selection strategy that initializes two lightweight models with weights from a large pretrained model, enabling effective knowledge transfer; it then applies self-knowledge distillation to these models and fine-tunes them on the target classification task. Result: Extensive experiments on publicly available datasets (chest X-ray images, lung CT scans, and brain MRI scans) show that the method outperforms existing methods in both performance and robustness. Conclusion: The paper proposes a new medical image classification method that combines dual-model weight selection with self-knowledge distillation (SKD), overcomes the limitations of conventional approaches in retaining critical information in compact models, and achieves performance and robustness superior to existing methods. Abstract: We propose a novel medical image classification method that integrates dual-model weight selection with self-knowledge distillation (SKD). In real-world medical settings, deploying large-scale models is often limited by computational resource constraints, which pose significant challenges for their practical implementation. Thus, developing lightweight models that achieve comparable performance to large-scale models while maintaining computational efficiency is crucial. To address this, we employ a dual-model weight selection strategy that initializes two lightweight models with weights derived from a large pretrained model, enabling effective knowledge transfer. Next, SKD is applied to these selected models, allowing the use of a broad range of initial weight configurations without imposing additional excessive computational cost, followed by fine-tuning for the target classification tasks. By combining dual-model weight selection with self-knowledge distillation, our method overcomes the limitations of conventional approaches, which often fail to retain critical information in compact models. Extensive experiments on publicly available datasets (chest X-ray images, lung computed tomography scans, and brain magnetic resonance imaging scans) demonstrate the superior performance and robustness of our approach compared to existing methods.[70] Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds
Pengpeng Yu,Haoran Li,Dingquan Li,Runqing Jiang,Jing Wang,Liang Lin,Yulan Guo
Main category: cs.CV
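TL;DR: The paper proposes an efficient LiDAR point cloud compression method that uses lightweight modules to improve feature representation and multi-scale information propagation, achieving high compression performance with real-time processing.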
Details
Motivation: High-precision LiDAR point clouds incur storage and transmission overhead, and the sparsity of geometric details limits the compression performance and speed of existing methods. Method: The method comprises a Geometry Re-Densification Module and a Cross-scale Feature Propagation Module, used respectively to generate compact feature representations and to share information across scales. Result: Experiments on the KITTI dataset show real-time performance of 26 frames per second for both encoding and decoding at 12-bit quantization. Conclusion: The paper proposes an efficient LiDAR point cloud compression framework that achieves state-of-the-art compression ratios and real-time performance through two lightweight modules. Abstract: LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.[71] Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Xiaochuan Li,Guoguang Du,Runze Zhang,Liang Jin,Qi Jia,Lihua Lu,Zhenhua Guo,Yaqian Zhao,Haiyang Liu,Tianqi Wang,Changsheng Li,Xiaoli Gong,Rengang Li,Baoyu Fan
Main category: cs.CV
TL;DR: This paper introduces Droplet3D-4M and Droplet3D, utilizing video data to overcome 3D data scarcity, enabling effective 3D asset generation with spatial and semantic accuracy.
Details
Motivation: The motivation stems from the scarcity of 3D data on the internet compared to text, image, and video domains. Videos are proposed as a solution due to their inherent spatial consistency and rich semantic information, which can improve 3D generation. Method: The paper introduces Droplet3D-4M, a large-scale video dataset with multi-view annotations, and trains Droplet3D, a generative model that supports image and dense text input for 3D asset generation. Result: The experiments show that the proposed approach is effective in generating spatially consistent and semantically plausible 3D content. Additionally, the approach has potential for scene-level applications, as opposed to existing 3D solutions. Conclusion: The paper concludes that commonsense priors from videos can significantly aid in 3D creation, as demonstrated by the Droplet3D model and Droplet3D-4M dataset, which enable the generation of spatially consistent and semantically plausible content. Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. 
We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
[72] Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation
Jiusi Li,Jackson Jiang,Jinyu Miao,Miao Long,Tuopu Wen,Peijin Jia,Shengxiang Liu,Chunlei Yu,Maolin Liu,Yuzhan Cai,Kun Jiang,Mengmeng Yang,Diange Yang
Main category: cs.CV
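TL;DR: G^2Editor is an efficient framework for object editing in autonomous driving videos. By leveraging a 3D Gaussian representation, a scene-level 3D bounding box layout, and hierarchical fine-grained features, it addresses the limited visual fidelity and pose control of existing methods and outperforms them on the Waymo Open Dataset.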
Details
Motivation: Collecting real-world corner cases for training and validating autonomous driving systems is costly and hazardous, while existing object editing methods suffer from limited visual fidelity and imprecise pose control. Method: A 3D Gaussian representation serves as a dense prior and is combined with a scene-level 3D bounding box layout and hierarchical fine-grained features for precise object editing. Result: Experiments on the Waymo Open Dataset show that G^2Editor outperforms existing methods in pose controllability and visual quality and supports object repositioning, insertion, and deletion. Conclusion: G^2Editor is an effective framework for object editing in autonomous driving videos that addresses the limited visual fidelity and pose control of existing methods. Abstract: Corner cases are crucial for training and validating autonomous driving systems, yet collecting them from the real world is often costly and hazardous. Editing objects within captured sensor data offers an effective alternative for generating diverse scenarios, commonly achieved through 3D Gaussian Splatting or image generative models. However, these approaches often suffer from limited visual fidelity or imprecise pose control. To address these issues, we propose G^2Editor, a framework designed for photorealistic and precise object editing in driving videos. Our method leverages a 3D Gaussian representation of the edited object as a dense prior, injected into the denoising process to ensure accurate pose control and spatial consistency. A scene-level 3D bounding box layout is employed to reconstruct occluded areas of non-target objects. Furthermore, to guide the appearance details of the edited object, we incorporate hierarchical fine-grained features as additional conditions during generation. Experiments on the Waymo Open Dataset demonstrate that G^2Editor effectively supports object repositioning, insertion, and deletion within a unified framework, outperforming existing methods in both pose controllability and visual quality, while also benefiting downstream data-driven tasks.
[73] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization
Marina Grifell i Plana,Vladyslav Zalevskyi,Léa Schmidt,Yvan Gomez,Thomas Sanchez,Vincent Dunet,Mériam Koob,Vanessa Siffredi,Meritxell Bach Cuadra
Main category: cs.CV
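TL;DR: This paper proposes a pathology-informed domain randomization strategy for fetal brain segmentation that uses synthetic data generation to mitigate the shortage of annotated data for rare conditions such as corpus callosum dysgenesis (CCD).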
Details
Motivation: The rarity of corpus callosum dysgenesis (CCD) severely limits the amount of annotated data, hindering the generalization of deep learning models. Method: A pathology-informed domain randomization strategy embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline; by simulating diverse brain alterations from healthy data alone, it enables robust segmentation without pathological annotations. Result: Validated on a cohort of 248 healthy fetuses, 26 CCD cases, and 47 cases with other brain pathologies, the method achieves substantial improvements on CCD cases while maintaining segmentation performance on healthy fetuses and other pathologies. Clinically relevant biomarkers (such as corpus callosum length and volume) derived from the predicted segmentations are shown to distinguish CCD subtypes. The method reduces the corpus callosum length estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. The segmentations also show improved topological consistency relative to the available ground truth, making shape-based analyses more reliable. Conclusion: Incorporating domain-specific anatomical priors into synthetic data generation pipelines can effectively mitigate data scarcity and enhance the analysis of rare but clinically significant malformations. Abstract: Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses.
Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.
[74] Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
Jeong Hun Yeo,Hyeongseop Rha,Sungjune Park,Junil Won,Yong Man Ro
Main category: cs.CV
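TL;DR: This paper introduces a unified framework that handles combinations of sign language, lip movements, and audio to generate spoken-language text. It aims to design a unified architecture for heterogeneous inputs, explore the synergy among modalities, and match or exceed current state-of-the-art models.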
Details
Motivation: Although audio is the primary modality of human communication, automatic speech recognition (ASR) remains inaccessible to people who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading are effective substitutes, but these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. Method: A unified, modality-agnostic architecture is designed to process heterogeneous inputs effectively, and the synergy among modalities is explored, particularly the role of lip movements in sign language comprehension. Result: The framework achieves performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Conclusion: The paper builds the first unified framework that handles diverse combinations of sign language, lip movements, and audio for spoken-language text generation, and shows the importance of explicitly modeling lip movements as non-manual cues for SLT. Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
[75] Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Yuan Xie,Tianshui Chen,Zheng Ge,Lionel Ni
Main category: cs.CV
TL;DR: Video-MTR is an end-to-end framework for long-form video understanding that uses iterative reasoning and a novel reward system to improve performance without relying on external models.
Details
Motivation: Existing methods for long-form video understanding face challenges like complexity and sub-optimal performance due to reliance on static reasoning or external visual-language models. Method: Video-MTR uses a reinforced multi-turn reasoning framework with a gated bi-level reward system for iterative video segment selection and question comprehension. Result: Video-MTR achieves state-of-the-art performance on benchmarks like VideoMME, MLVU, and EgoSchema in both accuracy and efficiency. Conclusion: Video-MTR provides a new framework for long-form video understanding, outperforming existing methods in accuracy and efficiency. Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training.
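As a rough illustration of how a reward that combines a trajectory-level term with turn-level terms might be computed, the Python sketch below assigns one reward per reasoning turn. The gating rule (turn-level relevance bonuses count only when the final answer is correct), the `turn_weight` parameter, and the function name are assumptions made for illustration; the abstract does not specify them, and this is not presented as Video-MTR's actual formulation.

```python
def gated_bilevel_rewards(answer_correct, turn_relevance, turn_weight=0.5):
    """Return one reward per turn (hypothetical sketch, not the paper's formula).

    answer_correct: whether the final answer of the trajectory is correct.
    turn_relevance: per-turn frame-query relevance scores in [0, 1].
    turn_weight:    assumed scale of the turn-level bonus.
    """
    # Trajectory-level reward: shared by every turn of the trajectory.
    trajectory_reward = 1.0 if answer_correct else 0.0
    # Assumed gate: turn-level bonuses are only granted when the
    # trajectory ends in a correct answer.
    gate = 1.0 if answer_correct else 0.0
    return [trajectory_reward + gate * turn_weight * r for r in turn_relevance]


if __name__ == "__main__":
    print(gated_bilevel_rewards(True, [0.2, 0.8]))
    print(gated_bilevel_rewards(False, [0.2, 0.8]))
```

Under this assumed gate, a trajectory that selects relevant frames but answers incorrectly earns nothing, which is one simple way to keep turn-level bonuses from being exploited on their own.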
Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
[76] Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
Zixuan Hu,Dongxiao Li,Xinzhu Ma,Shixiang Tang,Xiaotong Li,Wenhan Yang,Ling-Yu Duan
Main category: cs.CV
TL;DR: This paper proposes DUO, a novel Test-Time Adaptation framework for Monocular 3D Object Detection that jointly reduces semantic and geometric uncertainties, leading to better robustness under domain shifts.
Details
Motivation: Monocular 3D object detection (M3OD) suffers from performance degradation under real-world domain shifts. Existing Test-Time Adaptation (TTA) methods fail to address both semantic and geometric uncertainties, which are critical for robust detection. Method: The proposed method uses a dual-branch mechanism: one branch minimizes semantic and geometric uncertainties through convex optimization and unsupervised focal loss, while the other enforces geometric coherence via a semantic-aware normal field constraint. Result: Extensive experiments show that DUO outperforms existing TTA methods across various datasets and domain shift types, demonstrating its effectiveness in improving generalization and reliability. Conclusion: DUO is the first TTA framework that effectively addresses both semantic and geometric uncertainties in M3OD, leading to improved robustness and performance under domain shifts. Abstract: Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. 
In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types.
[77] CaddieSet: A Golf Swing Dataset with Human Joint Features and Ball Information
Seunghyeon Jung,Seoyoung Hong,Jiwoo Jeong,Seungwon Jeong,Jaerim Choi,Hoki Kim,Woojin Lee
Main category: cs.CV
TL;DR: This paper introduces CaddieSet, a new dataset for golf swing analysis that uses computer vision and domain-specific metrics to quantitatively link swing posture with ball trajectory, enabling more interpretable and actionable feedback for golfers.
Details
Motivation: The motivation is to quantitatively establish the relationship between swing posture and ball trajectory, addressing the limitation of existing studies in providing actionable insights for golfers' swing improvement. Method: The paper proposes a new dataset, CaddieSet, which extracts joint information from swing videos segmented into eight phases using computer vision. Additionally, 15 key metrics based on golf domain knowledge are defined to interpret swing outcomes. Result: Experiments demonstrated the feasibility of CaddieSet for predicting ball trajectories using benchmarks, particularly with interpretable models. Swing feedback using joint features was found to align quantitatively with established domain knowledge. Conclusion: This paper concludes that CaddieSet provides a feasible and interpretable way to predict ball trajectories and offers insights for golf swing improvement, benefiting both academia and the sports industry. Abstract: Recent advances in deep learning have led to more studies to enhance golfers' shot precision. However, these existing studies have not quantitatively established the relationship between swing posture and ball trajectory, limiting their ability to provide golfers with the necessary insights for swing improvement. In this paper, we propose a new dataset called CaddieSet, which includes joint information and various ball information from a single shot. CaddieSet extracts joint information from a single swing video by segmenting it into eight swing phases using a computer vision-based approach. Furthermore, based on expert golf domain knowledge, we define 15 key metrics that influence a golf swing, enabling the interpretation of swing outcomes through swing-related features. Through experiments, we demonstrated the feasibility of CaddieSet for predicting ball trajectories using various benchmarks. 
In particular, we focus on interpretable models among several benchmarks and verify that swing feedback using our joint features is quantitatively consistent with established domain knowledge. This work is expected to offer new insight into golf swing analysis for both academia and the sports industry.
[78] IAENet: An Importance-Aware Ensemble Model for 3D Point Cloud-Based Anomaly Detection
Xuanming Cao,Chengyu Tao,Yifeng Cheng,Juan Du
Main category: cs.CV
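TL;DR: This paper proposes IAENet, which combines 2D and 3D models for surface anomaly detection and uses a dynamic fusion strategy to achieve higher accuracy with fewer false positives.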
Details
Motivation: 2D image-based methods perform well in surface anomaly detection, but 3D point cloud-based methods lag behind for lack of powerful pretrained models, and fusing the two modalities is non-trivial. Method: The paper proposes the Importance-Aware Ensemble Network (IAENet), an ensemble framework of 2D and 3D models, and introduces an Importance-Aware Fusion (IAF) module that dynamically assesses and reweights the anomaly scores from each source. Result: Experiments on the MVTec 3D-AD dataset show that IAENet sets a new state of the art, with markedly better anomaly detection performance and a lower false positive rate. Conclusion: With its fusion module and optimization strategy, IAENet delivers strong performance with a low false positive rate for surface anomaly detection in industrial manufacturing. Abstract: Surface anomaly detection is pivotal for ensuring product quality in industrial manufacturing. While 2D image-based methods have achieved remarkable success, 3D point cloud-based detection remains underexplored despite its richer geometric cues. We argue that the key bottleneck is the absence of powerful pretrained foundation backbones in 3D comparable to those in 2D. To bridge this gap, we propose Importance-Aware Ensemble Network (IAENet), an ensemble framework that synergizes a 2D pretrained expert with 3D expert models. However, naively fusing predictions from disparate sources is non-trivial: existing strategies can be affected by a poorly performing modality and thus degrade overall accuracy. To address this challenge, we introduce a novel Importance-Aware Fusion (IAF) module that dynamically assesses the contribution of each source and reweights their anomaly scores. Furthermore, we devise critical loss functions that explicitly guide the optimization of IAF, enabling it not only to combine the collective knowledge of the source experts but also to preserve their unique strengths, thereby enhancing the overall performance of anomaly detection. Extensive experiments on MVTec 3D-AD demonstrate that our IAENet achieves a new state-of-the-art with a markedly lower false positive rate, underscoring its practical value for industrial deployment.
[79] Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
En Ci,Shanyan Guan,Yanhao Ge,Yilin Zhang,Wei Li,Zhenyu Zhang,Jian Yang,Ying Tai
Main category: cs.CV
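TL;DR: This paper proposes DescriptiveEdit, an image editing framework based on descriptive prompts that reframes "instruction-based image editing" as "reference-image-based text-to-image generation" to address the problems of existing approaches.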
Details
Motivation: Despite progress in text-to-image generation, semantic image editing remains challenging: inversion-based methods introduce reconstruction errors, while methods that rely on instruction datasets are limited by data quality and scale. Method: A Cross-Attentive UNet is proposed that uses attention bridges to inject reference image features into the generation of the edited image, combining the reference image with the text prompt. Result: On the Emu Edit benchmark, DescriptiveEdit shows higher editing accuracy and consistency and integrates seamlessly with extensions such as ControlNet and IP-Adapter. Conclusion: DescriptiveEdit overcomes the limitations of conventional image editing methods and offers better scalability and practicality. Abstract: Despite the progress in text-to-image generation, semantic image editing remains a challenge. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. The core idea is to re-frame "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained Text-to-Image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which newly adds attention bridges to inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes limitations in instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show it improves editing accuracy and consistency.
[80] DCFS: Continual Test-Time Adaptation via Dual Consistency of Feature and Sample
Wenting Yin,Han Sun,Xinru Meng,Ningzhong Liu,Huiyu Zhou
Main category: cs.CV
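TL;DR: This paper proposes DCFS, which uses dual-path feature disentanglement and a confidence-weighting strategy to address poor pseudo-label quality and error accumulation in continual test-time adaptation.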
Details
Motivation: Existing methods rely on target-domain features to generate pseudo-labels, but pseudo-label quality cannot be guaranteed and errors accumulate, so a more robust feature learning mechanism is needed. Method: The DCFS framework uses dual classifiers to disentangle semantic-related and domain-related features and performs self-supervised learning with consistency constraints and a confidence-weighted loss. Result: The method is validated on several datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, and performs consistently in continual test-time adaptation scenarios. Conclusion: DCFS effectively mitigates pseudo-label noise and error accumulation in continual test-time adaptation, achieving stable performance gains across multiple datasets through dual-path feature consistency and confidence-aware learning. Abstract: Continual test-time adaptation aims to continuously adapt a pre-trained model to a stream of target domain data without accessing source data. Without access to source domain data, the model focuses solely on the feature characteristics of the target data. Relying exclusively on these features can lead to confusion and introduce learning biases. Currently, many existing methods generate pseudo-labels via model predictions. However, the quality of pseudo-labels cannot be guaranteed and the problem of error accumulation must be solved. To address these challenges, we propose DCFS, a novel CTTA framework that introduces dual-path feature consistency and confidence-aware sample learning. This framework disentangles the whole feature representation of the target data into semantic-related and domain-related features using dual classifiers to learn distinct feature representations. By maintaining consistency between the sub-features and the whole feature, the model can comprehensively capture data features from multiple perspectives. Additionally, to ensure that the whole feature information of the target domain samples is not overlooked, we set an adaptive threshold and calculate a confidence score for each sample to carry out loss-weighted self-supervised learning, effectively reducing the noise of pseudo-labels and alleviating the problem of error accumulation. The efficacy of our proposed method is validated through extensive experimentation across various datasets, including CIFAR10-C, CIFAR100-C, and ImageNet-C, demonstrating consistent performance in continual test-time adaptation scenarios.
[81] Adam SLAM - the last mile of camera calibration with 3DGS
Matthieu Gendrin,Stéphane Pateux,Xiaoran Jiang,Théo Ladune,Luce Morin
Main category: cs.CV
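TL;DR: This paper proposes using a 3DGS model to refine camera calibration by backpropagating the novel view color loss, thereby improving novel view synthesis quality.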
Details
Motivation: Camera calibration quality is crucial for novel view synthesis; a 1-pixel calibration error significantly degrades reconstruction quality. Method: Camera parameters are adjusted by backpropagating the novel view color loss, using a 3DGS model to fine-tune the calibration. Result: The refined calibration yields an average improvement of 0.4 dB PSNR on the dataset used as a reference by 3DGS. Conclusion: Fine-tuning camera calibration with a 3DGS model improves novel view synthesis quality; although the fine-tuning can be slow, it is valuable for calibrating reference scenes. Abstract: The quality of the camera calibration is of major importance for evaluating progress in novel view synthesis, as a 1-pixel error on the calibration has a significant impact on the reconstruction quality. While there is no ground truth for real scenes, the quality of the calibration is assessed by the quality of the novel view synthesis. This paper proposes to use a 3DGS model to fine-tune calibration by backpropagation of novel view color loss with respect to the camera parameters. The new calibration alone brings an average improvement of 0.4 dB PSNR on the dataset used as reference by 3DGS. The fine-tuning may be long and its suitability depends on the criticality of training time, but for calibration of reference scenes, such as Mip-NeRF 360, the stake of novel view quality is the most important.
[82] Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Jingyun Yang,Guoqing Zhang,Jingge Wang,Yang Li
Main category: cs.CV
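TL;DR: This paper proposes an active and sequential domain adaptation framework for multi-modal medical image segmentation that dynamically selects the most valuable samples for labeling and training, reducing annotation cost and improving model performance.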
Details
Motivation: Labeling medical images is time-consuming and labor-intensive. Active learning is used to reduce annotation cost, but existing active domain adaptation methods suffer from negative transfer and limited access to source-domain data, and query strategies for multi-modal medical data remain unexplored. Method: An active and sequential domain adaptation framework is proposed for dynamic multi-modal sample selection. Result: The method achieves favorable performance on multi-modal medical image segmentation tasks, significantly outperforming state-of-the-art ADA methods. Conclusion: Experiments show that the method performs well on gross tumor volume segmentation tasks and significantly outperforms existing ADA methods. Abstract: Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at https://github.com/Hiyoochan/mmActS.
[83] Enhancing Pseudo-Boxes via Data-Level LiDAR-Camera Fusion for Unsupervised 3D Object Detection
Mingqian Ji,Jian Yang,Shanshan Zhang
Main category: cs.CV
TL;DR: This paper proposes a new unsupervised 3D object detection method that improves detection performance by combining RGB images and LiDAR data.
Details
Motivation: To reduce the reliance of current 3D object detection on manual annotation and to improve the quality of generated pseudo-boxes. Method: A new data-level fusion framework combines RGB images and LiDAR data, using a bi-directional fusion method and a dynamic self-evolution strategy. Result: The trained detector achieves 28.4% mAP on the nuScenes validation benchmark. Conclusion: Experiments show that the method significantly outperforms previous state-of-the-art methods on the nuScenes dataset. Abstract: Existing LiDAR-based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high-quality 3D labels is time-consuming and labor-intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo-box generation. However, these methods simply integrate pseudo-boxes generated by LiDAR point clouds and RGB images. Such a label-level fusion strategy brings limited improvements to the quality of pseudo-boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data-level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi-directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation-induced outliers. Furthermore, we propose a data-level fusion based dynamic self-evolution strategy, which iteratively refines pseudo-boxes under a dense representation, significantly improving localization accuracy.
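The local and global filtering step described above could look roughly like the following pure-Python sketch. It assumes a conventional radius-outlier filter (keep a point only if enough neighbors lie within a radius) for the local stage and a conventional statistical filter (drop points whose distance to the instance centroid exceeds the mean by more than a few standard deviations) for the global stage. The function names, thresholds, and the centroid-distance statistic are illustrative assumptions and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def radius_filter(points, radius, min_neighbors):
    """Local stage (assumed form): keep points with enough close neighbors."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def statistical_filter(points, n_sigma=2.0):
    """Global stage (assumed form): drop points far from the centroid."""
    if len(points) < 2:
        return list(points)
    centroid = tuple(mean(axis) for axis in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    mu, sigma = mean(dists), pstdev(dists)
    if sigma == 0:
        return list(points)
    return [p for p, d in zip(points, dists) if d <= mu + n_sigma * sigma]


if __name__ == "__main__":
    # A small planar cluster plus one far-away point standing in for noise.
    cloud = [(0.1 * i, 0.1 * j, 0.0) for i in range(3) for j in range(3)]
    cloud.append((10.0, 10.0, 10.0))
    print(len(radius_filter(cloud, radius=0.15, min_neighbors=2)))
    print(len(statistical_filter(cloud, n_sigma=2.0)))
```

The brute-force neighbor search is quadratic in the number of points; a real pipeline would normally use a spatial index such as a k-d tree.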
Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state-of-the-art methods with 28.4% mAP on the nuScenes validation benchmark.
[84] Digital Scale: Open-Source On-Device BMI Estimation from Smartphone Camera Images Trained on a Large-Scale Real-World Dataset
Frederik Rajiv Manichand,Robin Deuber,Robert Jakob,Steve Swerling,Jamie Rosen,Elgar Fleisch,Patrick Langer
Main category: cs.CV
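TL;DR: This paper proposes a deep learning method for estimating BMI from smartphone images. It builds the large-scale WayBED dataset, achieves the lowest reported MAPE through automatic filtering and model optimization, and open-sources all code and the mobile deployment framework.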
Details
Motivation: Estimating BMI from camera images enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergencies. Method: A deep learning model is trained on the large-scale WayBED dataset; an automatic filtering method based on posture clustering and person detection removes low-quality images; the model is evaluated on the VisualBodyToBMI dataset and then fine-tuned on it. Result: The model achieves a MAPE of 7.9% on the WayBED test set and 13% on the unseen VisualBodyToBMI dataset; after fine-tuning it reaches 8.56% on that dataset, the lowest reported value so far. Conclusion: The paper presents a deep learning-based BMI estimation method and the large-scale WayBED dataset, improves data quality with automatic filtering, achieves a low MAPE, validates generalization across datasets, and open-sources the code and deployment framework. Abstract: Estimating Body Mass Index (BMI) from camera images with machine learning models enables rapid weight assessment when traditional methods are unavailable or impractical, such as in telehealth or emergency scenarios. Existing computer vision approaches have been limited to datasets of up to 14,500 images. In this study, we present a deep learning-based BMI estimation method trained on our WayBED dataset, a large proprietary collection of 84,963 smartphone images from 25,353 individuals. We introduce an automatic filtering method that uses posture clustering and person detection to curate the dataset by removing low-quality images, such as those with atypical postures or incomplete views. This process retained 71,322 high-quality images suitable for training. We achieve a Mean Absolute Percentage Error (MAPE) of 7.9% on our hold-out test set (WayBED data) using full-body images, the lowest value in the published literature to the best of our knowledge. Further, we achieve a MAPE of 13% on the completely unseen (during training) VisualBodyToBMI dataset, comparable with state-of-the-art approaches trained on it, demonstrating robust generalization. Lastly, we fine-tune our model on VisualBodyToBMI and achieve a MAPE of 8.56%, the lowest reported value on this dataset so far. We deploy the full pipeline, including image filtering and BMI estimation, on Android devices using the CLAID framework.
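For readers unfamiliar with the metric, the MAPE values quoted above follow the standard definition: the mean of |true - predicted| / |true|, expressed in percent. A minimal Python sketch is given below; the helper names and the sample BMI values are made up for illustration and do not come from the paper or its code release.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


if __name__ == "__main__":
    true_bmi = [20.0, 25.0, 40.0]       # illustrative ground-truth values
    predicted_bmi = [22.0, 25.0, 36.0]  # illustrative model outputs
    print(f"MAPE: {mape(true_bmi, predicted_bmi):.2f}%")
```

Because each error is divided by the true value, a 2-point miss counts for more at a BMI of 20 than at a BMI of 40.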
We release our complete code for model training, filtering, and the CLAID package for mobile deployment as open-source contributions.
[85] Domain Adaptation Techniques for Natural and Medical Image Classification
Ahmad Chaddad,Yihang Wu,Reem Kateb,Christian Desrosiers
Main category: cs.CV
TL;DR: This paper evaluates the effectiveness of domain adaptation techniques for both natural and medical image classification, with a focus on the Deep Subdomain Adaptation Network (DSAN), which showed strong performance and explainability, particularly in medical datasets.
Details
Motivation: The motivation is to better understand the benefits of DA for both natural and medical images, addressing the challenges posed by the difficulty of working with medical data and potential performance bias in mainstream datasets. Method: The study conducted 557 simulation studies using seven widely-used DA techniques for image classification across five natural and eight medical datasets, covering various scenarios such as out-of-distribution, dynamic data streams, and limited training samples. Result: The Deep Subdomain Adaptation Network (DSAN) algorithm showed outstanding performance, achieving feasible classification accuracy (91.2%) in the COVID-19 dataset using Resnet50, showing a significant accuracy improvement (+6.7%) in the dynamic data stream DA scenario compared to the baseline, and demonstrating remarkable explainability on the COVID-19 and skin cancer datasets. Conclusion: This study contributes to the understanding of domain adaptation (DA) techniques, particularly highlighting the effectiveness and explainability of the Deep Subdomain Adaptation Network (DSAN) in adapting models for medical data. Abstract: Domain adaptation (DA) techniques have the potential in machine learning to alleviate distribution differences between training and test sets by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. 
With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples. Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm. This algorithm achieved feasible classification accuracy (91.2%) in the COVID-19 dataset using Resnet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits a remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.[86] Contrastive Learning through Auxiliary Branch for Video Object Detection
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: This paper proposes CLAB, an efficient video object detection method that uses contrastive learning and a dynamic loss weighting strategy to improve detection performance without increasing inference cost, achieving state-of-the-art results.
Details
Motivation: Video frames suffer from motion blur, occlusion, and deformable shapes, which makes video object detection more challenging than detection in still images; robustness needs to improve without adding computational load. Method: The paper introduces Contrastive Learning through Auxiliary Branch (CLAB) together with a dynamic loss weighting strategy. Result: CLAB reaches 84.0% and 85.2% mAP on the ImageNet VID dataset with ResNet-101 and ResNeXt-101, respectively, outperforming existing CNN-based models. Conclusion: CLAB improves the robustness of video object detection without increasing the computational burden at inference and achieves state-of-the-art performance on the ImageNet VID dataset. Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector's backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.[87] Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Lorenz Hufe,Constantin Venhoff,Maximilian Dreyer,Sebastian Lapuschkin,Wojciech Samek
Main category: cs.CV
TL;DR: This paper introduces a training-free method to defend CLIP models against typographic attacks by selectively ablating attention heads, significantly improving robustness with minimal impact on standard accuracy.
Details
Motivation: The motivation stems from the vulnerability of multi-modal systems to typographic attacks that can lead to misclassifications and security breaches, necessitating robust defense mechanisms. Method: The researchers analyzed CLIP vision encoders to identify attention heads responsible for typographic information processing and selectively ablated these to defend against attacks. Result: The method improved performance by up to 19.6% on a typographic variant of ImageNet-100 while reducing standard accuracy by less than 1%, proving competitive against state-of-the-art defenses requiring finetuning. Conclusion: The study concludes that typographic attacks can be effectively countered in CLIP models by selectively ablating a typographic circuit without significant loss in overall performance, offering a training-free alternative to current defenses. Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. 
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.[88] GLaRE: A Graph-based Landmark Region Embedding Network for Emotion Recognition
Debasis Maji,Debaditya Barman
Main category: cs.CV
TL;DR: This paper proposes GLaRE, a graph neural network-based facial expression recognition method that models the relationships between facial landmarks, improving recognition accuracy and outperforming existing methods.
Details
Motivation: Traditional FER systems are held back by occlusion, expression variability, and a lack of interpretability, which calls for a structured and interpretable learning approach. Method: Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to reduce complexity while preserving spatial structure. Result: The method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, demonstrating its effectiveness. Conclusion: GLaRE is a graph-based landmark region embedding network for emotion recognition that achieves high accuracy and outperforms existing baselines. Abstract: Facial expression recognition (FER) is a crucial task in computer vision with a wide range of applications including human-computer interaction, surveillance, and assistive technologies. However, challenges such as occlusion, expression variability, and lack of interpretability hinder the performance of traditional FER systems. Graph Neural Networks (GNNs) offer a powerful alternative by modeling relational dependencies between facial landmarks, enabling structured and interpretable learning. In this paper, we propose GLaRE, a novel Graph-based Landmark Region Embedding network for emotion recognition. Facial landmarks are extracted using 3D facial alignment, and a quotient graph is constructed via hierarchical coarsening to preserve spatial structure while reducing complexity. Our method achieves 64.89% accuracy on AffectNet and 94.24% on FERG, outperforming several existing baselines. Additionally, ablation studies demonstrate that region-level embeddings from quotient graphs contribute to improved prediction performance.[89] FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models
Zheng Chong,Yanwei Lei,Shiyue Zhang,Zhuandi He,Zhen Wang,Xujie Zhang,Xiao Dong,Yiling Wu,Dongmei Jiang,Xiaodan Liang
Main category: cs.CV
TL;DR: FastFit is a high-speed virtual try-on framework that supports multi-reference outfit compositions and removes the inefficiency caused by redundant re-computation of reference features.
Details
Motivation: To address two major problems that hinder real-world application of virtual try-on technology: current methods cannot support multi-reference outfit compositions (including garments and accessories), and they are significantly inefficient because reference features are redundantly re-computed at every denoising step. Method: FastFit is based on a cacheable diffusion architecture; it uses a Semi-Attention mechanism and replaces traditional timestep embeddings with class embeddings for reference items. This fully decouples reference feature encoding from the denoising process, so that reference features can be reused efficiently with negligible parameter overhead. Result: FastFit achieves an average 3.5x speedup over existing methods and outperforms state-of-the-art methods in quality metrics and inference efficiency on VITON-HD, DressCode, and the newly introduced DressCode-MR dataset. Conclusion: FastFit is a new framework that improves both the efficiency and the performance of multi-reference virtual try-on, facilitating research on and real-world application of complex outfit compositions. Abstract: Despite its great potential, virtual try-on technology is hindered from real-world application by two major challenges: the inability of current methods to support multi-reference outfit compositions (including garments and accessories), and their significant inefficiency caused by the redundant re-computation of reference features in each denoising step. To address these challenges, we propose FastFit, a high-speed multi-reference virtual try-on framework based on a novel cacheable diffusion architecture. By employing a Semi-Attention mechanism and substituting traditional timestep embeddings with class embeddings for reference items, our model fully decouples reference feature encoding from the denoising process with negligible parameter overhead. This allows reference features to be computed only once and losslessly reused across all steps, fundamentally breaking the efficiency bottleneck and achieving an average 3.5x speedup over comparable methods. Furthermore, to facilitate research on complex, multi-reference virtual try-on, we introduce DressCode-MR, a new large-scale dataset. It comprises 28,179 sets of high-quality, paired images covering five key categories (tops, bottoms, dresses, shoes, and bags), constructed through a pipeline of expert models and human feedback refinement. 
Extensive experiments on the VITON-HD, DressCode, and our DressCode-MR datasets show that FastFit surpasses state-of-the-art methods on key fidelity metrics while offering its significant advantage in inference efficiency.[90] UTA-Sign: Unsupervised Thermal Video Augmentation via Event-Assisted Traffic Signage Sketching
Yuqi Han,Songqian Zhang,Weijian Su,Ke Li,Jiayu Yang,Jinli Suo,Qiang Zhang
Main category: cs.CV
TL;DR: This paper introduces UTA-Sign, an unsupervised thermal-event video augmentation approach for better traffic signage detection in low-light environments, combining the strengths of thermal and event cameras.
Details
Motivation: Thermal cameras struggle with detecting signage on similar materials, while event cameras offer high-speed performance in low-light conditions; combining both modalities can enhance autonomous driving safety. Method: UTA-Sign employs a dual-boosting mechanism that integrates thermal frames and event signals to overcome thermal camera blind spots and event camera non-uniform sampling, providing consistent signage representation. Result: The method achieves superior performance in traffic signage sketching and improves detection accuracy in real-world low-illumination scenarios. Conclusion: The proposed UTA-Sign method effectively addresses the limitations of thermal and event cameras by fusing their complementary features, resulting in enhanced traffic signage understanding in low-light environments. Abstract: The thermal camera excels at perceiving outdoor environments under low-light conditions, making it ideal for applications such as nighttime autonomous driving and unmanned navigation. However, thermal cameras encounter challenges when capturing signage from objects made of similar materials, which can pose safety risks for accurately understanding semantics in autonomous driving systems. In contrast, the neuromorphic vision camera, also known as an event camera, detects changes in light intensity asynchronously and has proven effective in high-speed, low-light traffic environments. Recognizing the complementary characteristics of these two modalities, this paper proposes UTA-Sign, an unsupervised thermal-event video augmentation for traffic signage in low-illumination environments, targeting elements such as license plates and roadblock indicators. To address the signage blind spots of thermal imaging and the non-uniform sampling of event cameras, we developed a dual-boosting mechanism that fuses thermal frames and event signals for consistent signage representation over time. 
The proposed method utilizes thermal frames to provide accurate motion cues as temporal references for aligning the uneven event signals. At the same time, event signals contribute subtle signage content to the raw thermal frames, enhancing the overall understanding of the environment. The proposed method is validated on datasets collected from real-world scenarios, demonstrating superior quality in traffic signage sketching and improved detection accuracy at the perceptual level.[91] Disruptive Attacks on Face Swapping via Low-Frequency Perceptual Perturbations
Mengxiao Huang,Minglei Shu,Shuwang Zhou,Zhaoyang Liu
Main category: cs.CV
TL;DR: This paper introduces an active defense strategy against deepfakes by applying low-frequency perturbations to disrupt face-swapping GANs while maintaining visual fidelity.
Details
Motivation: Deepfake technology poses serious threats to privacy and societal security, and current detection methods are mostly passive. An active defense mechanism is needed to proactively disrupt deepfake generation processes. Method: An active defense approach using low-frequency perceptual perturbations was developed, combining frequency and spatial domain features. The architecture includes an encoder, perturbation generator, and decoder, utilizing discrete wavelet transform (DWT) to generate perturbations that disrupt facial manipulation models. Result: Experiments on CelebA-HQ and LFW datasets showed a significant reduction in face-swapping effectiveness, improved defense success rates, and preservation of visual quality. Conclusion: The proposed active defense method effectively disrupts deepfake technology by targeting its generative process, thereby reducing the effectiveness of face-swapping while maintaining visual quality. Abstract: Deepfake technology, driven by Generative Adversarial Networks (GANs), poses significant risks to privacy and societal security. Existing detection methods are predominantly passive, focusing on post-event analysis without preventing attacks. To address this, we propose an active defense method based on low-frequency perceptual perturbations to disrupt face swapping manipulation, reducing the performance and naturalness of generated content. Unlike prior approaches that used low-frequency perturbations to impact classification accuracy, our method directly targets the generative process of deepfake techniques. We combine frequency and spatial domain features to strengthen defenses. By introducing artifacts through low-frequency perturbations while preserving high-frequency details, we ensure the output remains visually plausible. 
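As a rough illustration of perturbing only the low-frequency band, the following minimal sketch applies a one-level 2D Haar wavelet transform, adds a perturbation to the low-frequency subband, and reconstructs the image with the detail subbands untouched. The function names, the choice of the Haar wavelet, and the hand-supplied `delta` are assumptions made for illustration; the paper's perturbations come from a learned generator that this sketch does not attempt to reproduce.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of a grid with even height and width.

    Returns four half-size subbands: the low-frequency band `ll` and three
    detail bands (column differences, row differences, diagonal).
    """
    ll, col_diff, row_diff, diag = [], [], [], []
    for i in range(0, len(img), 2):
        r_ll, r_col, r_row, r_diag = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)    # local average (low frequency)
            r_col.append((a - b + c - d) / 2)   # left column minus right column
            r_row.append((a + b - c - d) / 2)   # top row minus bottom row
            r_diag.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(r_ll)
        col_diff.append(r_col)
        row_diff.append(r_row)
        diag.append(r_diag)
    return ll, col_diff, row_diff, diag


def haar_idwt2(ll, col_diff, row_diff, diag):
    """Inverse of haar_dwt2: rebuild the full-size grid from the subbands."""
    h, w = len(ll), len(ll[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], col_diff[i][j], row_diff[i][j], diag[i][j]
            out[2 * i][2 * j] = (s + x + y + z) / 2
            out[2 * i][2 * j + 1] = (s - x + y - z) / 2
            out[2 * i + 1][2 * j] = (s + x - y - z) / 2
            out[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return out


def perturb_low_frequency(img, delta):
    """Add `delta` (same shape as the low band) to the low band only.

    `delta` is a hypothetical stand-in for the output of a perturbation
    generator; the three detail subbands are passed through unchanged.
    """
    ll, col_diff, row_diff, diag = haar_dwt2(img)
    ll = [[v + dv for v, dv in zip(row, drow)] for row, drow in zip(ll, delta)]
    return haar_idwt2(ll, col_diff, row_diff, diag)
```

Because the detail subbands are untouched, re-applying `haar_dwt2` to the perturbed image returns the same three detail bands as the original, which is the sense in which high-frequency content is preserved in this sketch.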
Additionally, we design a complete architecture featuring an encoder, a perturbation generator, and a decoder, leveraging discrete wavelet transform (DWT) to extract low-frequency components and generate perturbations that disrupt facial manipulation models. Experiments on CelebA-HQ and LFW demonstrate significant reductions in face-swapping effectiveness, improved defense success rates, and preservation of visual quality.[92] Embracing Aleatoric Uncertainty: Generating Diverse 3D Human Motion
Zheng Qin,Yabing Wang,Minghui Yang,Sanping Zhou,Ming Yang,Le Wang
Main category: cs.CV
TL;DR: This paper proposes Diverse-T2M, a method that diversifies text-to-motion generation while preserving semantic consistency by using noise signals as carriers of diversity information and constructing a latent space of continuous text representations.
Details
Motivation: Although text-to-motion generation has advanced in quality, the diversity of generated motions remains a challenge. Method: The method uses noise signals as carriers of diversity information and constructs a latent space of continuous text representations, introducing randomness through a sampling strategy. Result: Experiments on the HumanML3D and KIT-ML datasets validate the method, showing improved generation diversity and text consistency. Conclusion: Diverse-T2M diversifies text-to-motion generation by introducing uncertainty while preserving semantic consistency. Abstract: Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advancements have enabled the generation of precise and high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome the above challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving the semantic consistency of the text. Specifically, we propose a novel perspective that utilizes noise signals as carriers of diversity information in transformer-based methods, facilitating explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.[93] Optimization-Based Calibration for Intravascular Ultrasound Volume Reconstruction
Karl-Philippe Beaudet,Sidaty El Hadramy,Philippe C Cattin,Juan Verde,Stéphane Cotin
Main category: cs.CV
TL;DR: This paper proposes an optimization-based calibration method that uses a 3D-printed phantom for accurate 3D intravascular ultrasound volume reconstruction, enabling registration of intraoperative ultrasound images with preoperative CT images and thereby improving navigation accuracy in liver surgery.
Details
Motivation: Intraoperative ultrasound images are difficult to interpret in liver surgery because of the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. Method: An optimization-based calibration method using a 3D-printed phantom is proposed for accurate 3D intravascular ultrasound volume reconstruction. Result: The method was validated on in vivo swine liver images, with a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Conclusion: The method provides a reliable and accurate means of calibration and volume reconstruction; it can be used to register intraoperative ultrasound images with preoperative CT images in liver surgery and to enhance intraoperative guidance. Abstract: Intraoperative ultrasound images are inherently challenging to interpret in liver surgery due to the limited field of view and complex anatomical structures. Bridging the gap between preoperative and intraoperative data is crucial for effective surgical guidance. 3D IntraVascular UltraSound (IVUS) offers a potential solution by enabling the reconstruction of the entire organ, which facilitates registration between preoperative computed tomography (CT) scans and intraoperative IVUS images. In this work, we propose an optimization-based calibration method using a 3D-printed phantom for accurate 3D Intravascular Ultrasound volume reconstruction. Our approach ensures precise alignment of tracked IVUS data with preoperative CT images, improving intraoperative navigation. We validated our method using in vivo swine liver images, achieving a calibration error from 0.88 to 1.80 mm and a registration error from 3.40 to 5.71 mm between the 3D IVUS data and the corresponding CT scan. Our method provides a reliable and accurate means of calibration and volume reconstruction. It can be used to register intraoperative ultrasound images with preoperative CT images in the context of liver surgery, and enhance intraoperative guidance.[94] Physics Informed Generative Models for Magnetic Field Images
Aye Phyu Phyu Aung,Lucas Lum,Zhansen Shi,Wen Qiu,Bernice Zee,JM Chin,Yeow Kheng Lim,J. Senthilnath
Main category: cs.CV
TL;DR: This paper presents a physics-informed generative model for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples, which can be used to train machine learning algorithms for efficient defect localization in semiconductor manufacturing.
Details
Motivation: The motivation for this study is the need for efficient defect detection and localization in semiconductor manufacturing, where the limited availability of Magnetic Field Imaging (MFI) datasets due to proprietary concerns presents a significant bottleneck for training machine learning models. Method: The researchers used a machine learning-driven approach leveraging diffusion models with two physical constraints to generate synthetic MFI samples. They compared their model with state-of-the-art generative models from both variational autoencoder (VAE) and diffusion methods. Result: The proposed PI-GenMFI model showed promising results in generating synthetic MFI samples that can serve as training data for ML algorithms. The evaluation, both qualitative and quantitative, showed that the model can effectively optimize the defect localization process. Conclusion: The study concludes that using Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) can effectively generate synthetic MFI samples for training ML algorithms, optimizing the defect localization process in semiconductor manufacturing. Abstract: In semiconductor manufacturing, defect detection and localization are critical to ensuring product quality and yield. X-ray imaging is a reliable non-destructive testing method, but it is memory-intensive and time-consuming for large-scale scanning; Magnetic Field Imaging (MFI) offers a more efficient means to localize regions of interest (ROI) for targeted X-ray scanning. However, the limited availability of MFI datasets due to proprietary concerns presents a significant bottleneck for training machine learning (ML) models using MFI. To address this challenge, we consider an ML-driven approach leveraging diffusion models with two physical constraints. We propose Physics Informed Generative Models for Magnetic Field Images (PI-GenMFI) to generate synthetic MFI samples by integrating specific physical information. 
We generate MFI images for the most common defect types: power shorts. These synthetic images will serve as training data for ML algorithms designed to localize defect areas efficiently. To evaluate generated MFIs, we compare our model to SOTA generative models from both variational autoencoder (VAE) and diffusion methods. We present a domain expert evaluation to assess the generated samples. In addition, we present qualitative and quantitative evaluation using various metrics used for image generation and signal processing, showing promising results to optimize the defect localization process.[95] Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang,Chenhang Cui,Zihao Zhao,Yiyang Zhou,Weilong Yan,Ying Wei,Huaxiu Yao
Main category: cs.CV
TL;DR: This paper proposes a novel alignment method for Large Visual-Language Models (LVLMs) that uses an internal self-evaluation metric called the debiased self-judgment score, enabling autonomous improvement and outperforming traditional approaches in reducing hallucinations and enhancing safety.
Details
Motivation: Aligning visual and linguistic modalities in LVLMs remains challenging, often leading to hallucinations and safety concerns. Existing methods rely on external resources, limiting scalability and increasing costs, which necessitates a more autonomous and efficient solution. Method: The method introduces a debiased self-judgment score, an internal self-evaluation metric that enables the model to autonomously improve alignment by enhancing decoding strategies and preference tuning processes. Result: The proposed approach significantly outperforms traditional alignment methods, reducing hallucinations, enhancing safety, and improving overall model capability. Conclusion: The proposed approach effectively improves the alignment of LVLMs by reducing hallucinations and enhancing safety, offering a more efficient solution compared to traditional methods. Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. 
Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.[96] Revisiting the Privacy Risks of Split Inference: A GAN-Based Data Reconstruction Attack via Progressive Feature Optimization
Yixiang Qiu,Yanhan Liu,Hongyao Yu,Hao Fang,Bin Chen,Shu-Tao Xia,Ke Xu
Main category: cs.CV
TL;DR: This paper introduces a novel GAN-based Data Reconstruction Attack framework with Progressive Feature Optimization (PFO) that improves the semantic fidelity of reconstructed images in Split Inference systems, offering superior performance over existing methods.
Details
Motivation: The motivation stems from the limitations of existing Data Reconstruction Attacks, which are typically effective only on shallow models and fail to fully leverage semantic priors, posing privacy risks in Split Inference systems. Method: The authors propose a GAN-based Data Reconstruction Attack framework with Progressive Feature Optimization (PFO), which uses hierarchical generator blocks and an L1-ball constraint to improve image reconstruction realism and semantic fidelity. Result: The proposed method outperforms prior attacks by a large margin, particularly in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs. Conclusion: The paper concludes that the proposed GAN-based DRA framework with PFO significantly enhances the semantic fidelity of reconstructed images, outperforming previous methods in various scenarios. Abstract: The growing complexity of Deep Neural Networks (DNNs) has led to the adoption of Split Inference (SI), a collaborative paradigm that partitions computation between edge devices and the cloud to reduce latency and protect user privacy. However, recent advances in Data Reconstruction Attacks (DRAs) reveal that intermediate features exchanged in SI can be exploited to recover sensitive input data, posing significant privacy risks. Existing DRAs are typically effective only on shallow models and fail to fully leverage semantic priors, limiting their reconstruction quality and generalizability across datasets and model architectures. In this paper, we propose a novel GAN-based DRA framework with Progressive Feature Optimization (PFO), which decomposes the generator into hierarchical blocks and incrementally refines intermediate representations to enhance the semantic fidelity of reconstructed images. To stabilize the optimization and improve image realism, we introduce an L1-ball constraint during reconstruction. 
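The abstract does not say how the L1-ball constraint is enforced. One common option, shown here purely as an assumption, is to project the optimized vector back onto the ball after each update step. The sketch below implements the standard sort-based Euclidean projection onto an L1 ball; the function name and the `radius` parameter are hypothetical and not taken from the paper.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the set {x : sum(|x_i|) <= radius}."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    magnitudes = [abs(x) for x in v]
    if sum(magnitudes) <= radius:
        return list(v)  # already inside the ball, nothing to do

    # Find the soft-threshold `theta` from the magnitudes sorted in
    # descending order: keep the last k for which u_k stays positive
    # after subtracting the candidate threshold.
    theta = 0.0
    cumulative = 0.0
    for k, u_k in enumerate(sorted(magnitudes, reverse=True), start=1):
        cumulative += u_k
        candidate = (cumulative - radius) / k
        if u_k - candidate > 0:
            theta = candidate

    # Shrink every magnitude by theta (clipping at zero) and restore signs.
    return [math.copysign(max(m - theta, 0.0), x) for m, x in zip(magnitudes, v)]
```

In an optimization loop this would be called after each gradient step on the constrained variable, so that every iterate stays inside the ball; whether the paper constrains latent codes, intermediate features, or something else is not stated in the abstract.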
Extensive experiments show that our method outperforms prior attacks by a large margin, especially in high-resolution scenarios, out-of-distribution settings, and against deeper and more complex DNNs.[97] MobileCLIP2: Improving Multi-Modal Reinforced Training
Fartash Faghri,Pavan Kumar Anasosalu Vasu,Cem Koc,Vaishaal Shankar,Alexander Toshev,Oncel Tuzel,Hadi Pouransari
Main category: cs.CV
TL;DR: MobileCLIP2 improves upon MobileCLIP with enhanced multi-modal training techniques, achieving state-of-the-art zero-shot accuracy on ImageNet-1k at low latencies and with reduced model size.
Details
Motivation: To enhance the performance of low-latency, lightweight image-text models like MobileCLIP while maintaining efficiency and reproducibility in zero-shot learning applications. Method: Improvements in MobileCLIP's multi-modal reinforced training through better CLIP teacher ensembles, improved captioner teachers, temperature tuning in contrastive knowledge distillation, caption-generator fine-tuning for caption diversity, and combining synthetic captions from multiple models. Result: A new family of models, MobileCLIP2, was trained, achieving a 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared to MobileCLIP-B, with MobileCLIP2-S4 matching the accuracy of SigLIP-SO400M/14 while being smaller and improving on DFN ViT-L/14 at lower latency. Conclusion: MobileCLIP2 achieves state-of-the-art ImageNet-1k zero-shot accuracies at low latencies, with improvements over the previous MobileCLIP models and competitive performance compared to other models like SigLIP-SO400M/14 and DFN ViT-L/14. Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. 
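To make the role of temperature in distillation concrete, here is a minimal sketch of a temperature-scaled distillation loss over one row of image-text similarity scores. The exact loss used for MobileCLIP2 is not given in this abstract, so the KL form, the function names, and the single shared `temperature` are assumptions for illustration only.

```python
import math


def softmax(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_scores, student_scores, temperature):
    """KL(teacher || student) between temperature-scaled softmax distributions.

    Each list holds the similarities of one image to every caption in a batch
    (or one caption to every image). Lower temperatures sharpen both
    distributions; higher temperatures flatten them. Extreme score gaps at
    very low temperatures can underflow to zero; this sketch does not guard
    against that case.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when student and teacher similarities agree, and the temperature controls how strongly the student is pushed to match the teacher's ranking of near-miss captions versus only its top match.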
We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2x smaller and improves on DFN ViT-L/14 at 2.5x lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.[98] EmoCAST: Emotional Talking Portrait via Emotive Text Description
Yiguo Jiang,Xiaodong Cun,Yong Zhang,Yudian Zheng,Fan Tang,Chi-Man Pun
Main category: cs.CV
TL;DR: This paper proposes EmoCAST, a framework that achieves precise text-driven emotional synthesis through appearance modeling and an emotive audio attention module, together with a newly constructed emotional talking head dataset, and reaches state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
Details
Motivation: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions, but existing methods remain limited in control flexibility, motion naturalness, and expression quality. Method: The paper proposes EmoCAST, a diffusion framework with appearance modeling and an emotive audio attention module, and constructs a new emotional talking head dataset. Result: EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Conclusion: EmoCAST is a diffusion-based framework that achieves precise text-driven emotional synthesis through appearance modeling and an emotive audio attention module, and it constructs an emotional talking head dataset with comprehensive emotive text descriptions. Abstract: Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework's performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model's ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST[99] ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering
Paritosh Parmar,Eric Peh,Basura Fernando
Main category: cs.CV
TL;DR: A new modular framework for Causal-Why VideoQA improves interpretability and performance by decoupling causal reasoning from answer generation using natural language causal chains.
Details
Motivation: Existing VideoQA models struggle with higher-order causal reasoning and lack interpretability, relying on opaque pipelines that entangle multiple processes. Method: The framework uses a two-stage architecture: a Causal Chain Extractor (CCE) and a Causal Chain-Driven Answerer (CCDA), with causal chains generated using large language models. Result: The approach outperforms state-of-the-art models on three large-scale benchmarks while offering improved explainability and generalization. Conclusion: The proposed modular framework improves causal reasoning in VideoQA models by decoupling reasoning from answer generation, leading to better explainability, user trust, and generalization. Abstract: Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. 
Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
[100] Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification
Smriti Joshi,Lidia Garrucho,Richard Osuala,Oliver Diaz,Karim Lekadir
Main category: cs.CV
TL;DR: This study proposes a SwinUNETR-based deep learning method for breast MRI analysis to improve the accuracy of breast cancer detection and classification.
Details
Motivation: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. MRI is highly sensitive for breast cancer detection, particularly in women at high risk or with dense breast tissue. AI-based solutions could further improve the diagnostic accuracy of MRI. Method: The study adopts a SwinUNETR-based deep learning framework that combines breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Result: On a dataset of 511 studies from six European centers, the method performed strongly and placed second on the challenge leaderboard. Conclusion: The study presents a SwinUNETR-based deep learning framework with good robustness and generalizability for breast MRI analysis; its second-place finish in the challenge indicates potential for clinical use. The codebase is publicly available for further research. Abstract: Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.
[101] AvatarBack: Back-Head Generation for Complete 3D Avatars from Front-View Images
Shiqi Xin,Xiaolin Zhang,Yanbin Liu,Peng Zhang,Caifeng Shan
Main category: cs.CV
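TL;DR: AvatarBack is a new framework for 3D Gaussian head avatar reconstruction that focuses on improving the back-head region, raising overall reconstruction quality and realism.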
Details
Motivation: Existing methods rely mainly on frontal-view images, leaving the back-head region poorly constructed and causing geometric inconsistencies and blurring. Method: AvatarBack comprises a Subject-specific Generator (SSG) and an Adaptive Spatial Alignment Strategy (ASA), which synthesize back-view pseudo-images and optimize geometric alignment. Result: Experiments on the NeRSemble and K-hairstyle datasets show that AvatarBack significantly improves back-head reconstruction quality while preserving frontal fidelity. Conclusion: AvatarBack effectively addresses the shortcomings of existing methods in back-head reconstruction, improving the overall quality and realism of 3D head avatars. Abstract: Recent advances in Gaussian Splatting have significantly boosted the reconstruction of head avatars, enabling high-quality facial modeling by representing a 3D avatar as a collection of 3D Gaussians. However, existing methods predominantly rely on frontal-view images, leaving the back-head poorly constructed. This leads to geometric inconsistencies, structural blurring, and reduced realism in the rear regions, ultimately limiting the fidelity of reconstructed avatars. To address this challenge, we propose AvatarBack, a novel plug-and-play framework specifically designed to reconstruct complete and consistent 3D Gaussian avatars by explicitly modeling the missing back-head regions. AvatarBack integrates two core technical innovations, i.e., the Subject-specific Generator (SSG) and the Adaptive Spatial Alignment Strategy (ASA). The former leverages a generative prior to synthesize identity-consistent, plausible back-view pseudo-images from sparse frontal inputs, providing robust multi-view supervision. To achieve precise geometric alignment between these synthetic views and the 3D Gaussian representation, the latter employs learnable transformation matrices optimized during training, effectively resolving inherent pose and coordinate discrepancies. Extensive experiments on NeRSemble and K-hairstyle datasets, evaluated using geometric, photometric, and GPT-4o-based perceptual metrics, demonstrate that AvatarBack significantly enhances back-head reconstruction quality while preserving frontal fidelity. Moreover, the reconstructed avatars maintain consistent visual realism under diverse motions and remain fully animatable.
[102] ArtFace: Towards Historical Portrait Face Identification via Model Adaptation
Francois Poh,Anjith George,Sébastien Marcel
Main category: cs.CV
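TL;DR: This paper investigates how foundation models can improve facial recognition in artworks; combining conventional methods with fine-tuned models effectively addresses domain shift and intra-class variation.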
Details
Motivation: Traditional facial recognition models perform poorly on paintings because of domain shift and high intra-class variation. Method: Foundation models are fine-tuned and their embeddings are combined with those from conventional facial recognition networks. Result: The results show that foundation models can compensate for the weaknesses of traditional methods and markedly improve facial recognition in artworks. Conclusion: Combining embeddings from foundation models and conventional facial recognition networks markedly improves facial recognition in artworks. Abstract: Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and limited by the lack of data and stylistic variations. Automated facial recognition is capable of handling challenging conditions and can assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition. In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective. Paper page at https://www.idiap.ch/paper/artface/
[103] CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models
Ayan Banerjee,Fernando Vilariño,Josep Lladós
Main category: cs.CV
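TL;DR: CraftGraffiti is an end-to-end text-guided graffiti generation framework that follows a "style-first, identity-after" paradigm to achieve high-quality artistic style transfer while preserving facial identity.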
Details
Motivation: Preserving facial identity under extreme stylistic transformation is a major challenge in generative art, especially in graffiti, a high-contrast, abstract medium in which subtle distortions of facial features can make the subject unrecognizable. Method: CraftGraffiti uses a two-step "style-first, identity-after" paradigm: style transfer with a LoRA-fine-tuned pretrained diffusion model, followed by a face-consistent self-attention mechanism based on explicit identity embeddings to ensure identity fidelity. Result: CraftGraffiti performs strongly on facial feature consistency, aesthetic scores, and user preference, achieving state-of-the-art artistic generation, and was validated in a live deployment at the Cruilla Festival. Conclusion: CraftGraffiti preserves facial identity under extreme stylistic transformation, offering a new approach to AI-assisted artistic creation. Abstract: Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject's recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the "style-first, identity-after" paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system's real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.
[104] "Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
Anastasios Skoularikis,Stefanos-Iordanis Papadopoulos,Symeon Papadopoulos,Panagiotis C. Petrantonakis
Main category: cs.CV
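TL;DR: This paper introduces the S-HArM dataset and three prompting strategies for intent-aware classification of AI-generated images; although model generalization improves, overall performance remains limited.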
Details
Motivation: Existing work on detecting synthetic and out-of-context content largely overlooks the intent behind AI-generated images, so a dataset for intent-aware classification is needed. Method: The paper introduces the S-HArM dataset, explores three prompting strategies for constructing a large-scale synthetic training dataset, and conducts an extensive comparative study. Result: Models trained on image-guided and multimodally-guided data generalize better to "in the wild" content, but overall performance remains limited. Conclusion: S-HArM is an intent-aware multimodal dataset that supports classifying the intent of AI-generated images; inferring intent nonetheless remains complex and calls for specialized architectures. Abstract: Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
[105] Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
Chenhao Zhang,Wei Gao
Main category: cs.CV
TL;DR: The paper proposes a dynamic video compression framework with variable coding routes and rate control for optimizing bitrate, distortion, and computational complexity.
Details
Motivation: Precise rate control remains a challenge for learning-based codecs in Neural Video Compression (NVC) despite their recent performance advancements. Method: Dynamic-Route Autoencoder with variable coding routes, Rate Control Agent for bitrate estimation and adjustment, Joint-Routes Optimization for collaborative training. Result: Experiments on HEVC and UVG datasets show 14.8% BD-Rate reduction, 0.47dB BD-PSNR gain, and 1.66% average bitrate error. Conclusion: The proposed dynamic video compression framework achieves RDCO for various bitrate and bitrate-constrained applications, outperforming state-of-the-art methods in BD-Rate reduction, BD-PSNR gain, and average bitrate error. Abstract: Neural Video Compression (NVC) has achieved remarkable performance in recent years. However, precise rate control remains a challenge due to the inherent limitations of learning-based codecs. To solve this issue, we propose a dynamic video compression framework designed for variable bitrate scenarios. First, to achieve variable bitrate implementation, we propose the Dynamic-Route Autoencoder with variable coding routes, each occupying partial computational complexity of the whole network and navigating to a distinct RD trade-off. Second, to approach the target bitrate, the Rate Control Agent estimates the bitrate of each route and adjusts the coding route of DRA at run time. To encompass a broad spectrum of variable bitrates while preserving overall RD performance, we employ the Joint-Routes Optimization strategy, achieving collaborative training of various routes. Extensive experiments on the HEVC and UVG datasets show that the proposed method achieves an average BD-Rate reduction of 14.8% and BD-PSNR gain of 0.47dB over state-of-the-art methods while maintaining an average bitrate error of 1.66%, achieving Rate-Distortion-Complexity Optimization (RDCO) for various bitrate and bitrate-constrained applications. 
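The run-time route selection described above can be illustrated with a minimal sketch. It assumes the Rate Control Agent has already produced one bitrate estimate per coding route; the function names `pick_route` and `bitrate_error`, the bits-per-pixel units, and the relative-error definition are hypothetical choices for illustration and are not taken from the paper.

```python
def pick_route(estimated_bpp, target_bpp):
    """Return the index of the coding route whose estimated bitrate
    (bits per pixel, an assumed unit) is closest to the target."""
    return min(range(len(estimated_bpp)),
               key=lambda i: abs(estimated_bpp[i] - target_bpp))


def bitrate_error(actual_bpp, target_bpp):
    """Relative bitrate error |actual - target| / target
    (an assumed definition of the reported error metric)."""
    return abs(actual_bpp - target_bpp) / target_bpp


# Hypothetical per-route estimates for one frame
routes = [0.05, 0.10, 0.20]
chosen = pick_route(routes, target_bpp=0.12)
error = bitrate_error(routes[chosen], 0.12)
```

A nearest-estimate rule is the simplest possible policy; the actual agent may use a learned or constrained selection instead.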
Our code is available at https://git.openi.org.cn/OpenAICoding/DynamicDVC.
[106] CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network
Reza Akbari Movahed,Abuzar Rezaee,Arezoo Zakeri,Colin Berry,Edmond S. L. Ho,Ali Gooya
Main category: cs.CV
TL;DR: CardioMorphNet is a new deep learning framework for accurate cardiac motion estimation using 3D cardiac shape-guided deformable registration, showing better performance and lower uncertainty than current methods.
Details
Motivation: Existing methods struggle to capture heart motion accurately due to reliance on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. Method: CardioMorphNet uses a recurrent Bayesian deep learning framework with a variational autoencoder and two posterior models for bi-ventricular segmentation and motion estimation, focusing on anatomical regions without intensity-based image registration similarity loss. Result: Validated on the UK Biobank dataset, CardioMorphNet outperforms state-of-the-art methods in cardiac motion estimation and yields lower uncertainty values for motion fields in the cardiac region. Conclusion: CardioMorphNet demonstrates superior performance in cardiac motion estimation and provides lower uncertainty values, indicating higher confidence in its predictions compared to existing methods. Abstract: Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to capture heart motion accurately because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies over the cardiac cycle and two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables computation of uncertainty maps for the estimated motion fields. 
Validated on the UK Biobank dataset by comparing warped mask shapes with ground truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions.
[107] Mix, Align, Distil: Reliable Cross-Domain Atypical Mitosis Classification
Kaustubh Atey,Sameer Anand Jha,Gouranga Bala,Amit Sethi
Main category: cs.CV
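TL;DR: This paper presents a simple training recipe for domain-robust AMF classification in MIDOG 2025 Task 2, combining style perturbations, attention-refined feature alignment, and prediction stabilization with an EMA teacher to achieve strong classification performance.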
Details
Motivation: Atypical mitotic figures (AMFs) are important histopathological markers but are difficult to identify consistently under domain shift caused by scanner, stain, and acquisition differences. Method: (i) Style perturbations are inserted at early and mid backbone stages to increase feature diversity; (ii) an auxiliary alignment loss aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor); (iii) predictions are stabilized by distilling from an EMA teacher with temperature-scaled KL divergence. Result: On the preliminary leaderboard for MIDOG 2025 Task 2, the method attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499, with negligible inference overhead. Conclusion: The method relies only on coarse domain metadata, shows strong domain adaptability, and is a competitive submission for the MIDOG 2025 challenge. Abstract: Atypical mitotic figures (AMFs) are important histopathological markers yet remain challenging to identify consistently, particularly under domain shift stemming from scanner, stain, and acquisition differences. We present a simple training-time recipe for domain-robust AMF classification in MIDOG 2025 Task 2. The approach (i) increases feature diversity via style perturbations inserted at early and mid backbone stages, (ii) aligns attention-refined features across sites using weak domain labels (Scanner, Origin, Species, Tumor) through an auxiliary alignment loss, and (iii) stabilizes predictions by distilling from an exponential moving average (EMA) teacher with temperature-scaled KL divergence. On the organizer-run preliminary leaderboard for atypical mitosis classification, our submission attains balanced accuracy of 0.8762, sensitivity of 0.8873, specificity of 0.8651, and ROC AUC of 0.9499. The method incurs negligible inference-time overhead, relies only on coarse domain metadata, and delivers strong, balanced performance, positioning it as a competitive submission for the MIDOG 2025 challenge.
[108] Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang,Zhimin Li,Yuhang Zang,Yujie Zhou,Jiazi Bu,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang
Main category: cs.CV
TL;DR: The paper introduces Pref-GRPO to stabilize reinforcement learning in text-to-image generation and UniGenBench for comprehensive model evaluation.
Details
Motivation: Current GRPO-based reinforcement learning methods for text-to-image generation suffer from reward hacking due to pointwise reward models. Additionally, existing benchmarks lack granularity for comprehensive model assessment. Method: Pref-GRPO uses pairwise comparisons and win rates as reward signals, while UniGenBench evaluates image generation using detailed semantic consistency criteria. Result: Experiments show that Pref-GRPO provides more stable training and effectively mitigates reward hacking, while UniGenBench offers a detailed evaluation framework that reveals model strengths and weaknesses. Conclusion: The paper proposes Pref-GRPO, a pairwise preference reward-based GRPO method, and introduces UniGenBench, a comprehensive T2I benchmark, to address reward hacking and improve model evaluation in text-to-image generation. Abstract: Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that PREF-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. 
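The two mechanisms named above, amplification of small score gaps by group normalization and win rate as a reward, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: images are stood in for by scalar quality values, the `prefer` callable is a hypothetical placeholder for a pairwise preference reward model, and the normalization is the usual (r - mean) / std form assumed for GRPO-style advantages.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def win_rates(items, prefer):
    """Fraction of the other group members each item beats,
    according to the pairwise comparator prefer(a, b) -> bool."""
    n = len(items)
    rates = []
    for i in range(n):
        wins = sum(1 for j in range(n)
                   if j != i and prefer(items[i], items[j]))
        rates.append(wins / (n - 1))
    return rates


# Pointwise scores that differ only in the third decimal place ...
scores = [0.700, 0.701, 0.702, 0.703]
# ... still become advantages of order one after normalization.
pointwise_adv = normalize(scores)

# Win-rate rewards from a stand-in comparator (assumption: higher is preferred).
rewards = win_rates(scores, lambda a, b: a > b)
```

Here `pointwise_adv` spans roughly -1.34 to 1.34 even though the raw scores are nearly identical, which is the amplification effect the abstract describes. The win-rate reward depends only on pairwise outcomes, so its scale is bounded in [0, 1] regardless of how the underlying scorer is calibrated.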
Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLM for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
[109] ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting
Yuxi Hu,Jun Zhang,Kuangyi Chen,Zhe Zhang,Friedrich Fraundorfer
Main category: cs.CV
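TL;DR: C3-GS is a new Gaussian Splatting method that improves feature learning to address insufficient multi-view consistency and geometric accuracy under sparse views, enabling high-quality novel view synthesis.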
Details
Motivation: Existing methods fall short in constructing accurate geometry from sparse views and lack multi-view consistent feature encoding, so a more effective feature learning framework is needed. Method: C3-GS integrates three lightweight modules into a unified rendering pipeline to improve feature fusion and predict more discriminative, multi-view consistent Gaussian parameters. Result: Experiments on benchmark datasets show that C3-GS achieves state-of-the-art rendering quality and generalization ability. Conclusion: By introducing context-aware, cross-dimension, and cross-scale constraints, C3-GS strengthens feature learning and achieves high-quality Gaussian Splatting and generalization without additional supervision. Abstract: Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, which struggle to construct accurate geometry with sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.
[110] SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Jiawen Lin,Shiran Bian,Yihang Zhu,Wenbin Tan,Yachao Zhang,Yuan Xie,Yanyun Qu
Main category: cs.CV
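TL;DR: SeqVLM is a new zero-shot 3D visual grounding framework that reasons about target objects using multi-view real-world scene images and spatial information, achieving state-of-the-art performance on the ScanRefer and Nr3D benchmarks.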
Details
Motivation: Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3D visual grounding holds greater promise for real-world applications because it eliminates scene-specific training. Method: First, a 3D semantic segmentation network generates 3D instance proposals, which are refined through semantic filtering to retain semantically relevant candidates. A proposal-guided multi-view projection strategy then projects the candidate proposals onto real scene image sequences. To reduce the computational load on the vision-language model, a dynamic scheduling mechanism iteratively processes sequence-query prompts. Result: Experiments yield Acc@0.25 scores of 55.6% and 53.2% on the ScanRefer and Nr3D benchmarks, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively. Conclusion: With its dynamic scheduling mechanism and multi-view projection strategy, SeqVLM addresses spatially limited reasoning and contextual omissions in zero-shot 3D visual grounding, achieving state-of-the-art performance. Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatial-limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion process of 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability.
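For readers unfamiliar with the Acc@0.25 metric quoted above, the following is a minimal sketch of how such a score is commonly computed: the fraction of queries whose predicted 3D box overlaps the ground-truth box with IoU of at least 0.25. The axis-aligned box format `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function names are assumptions for illustration; the benchmarks' official evaluation scripts may differ in detail.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def acc_at_iou(preds, gts, threshold=0.25):
    """Fraction of predictions with IoU >= threshold against their ground truth."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou_3d(p, g) >= threshold)
    return hits / len(gts)


# Two hypothetical queries: one overlapping prediction, one complete miss
preds = [(1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)]
gts = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
score = acc_at_iou(preds, gts)
```

In this toy case the first pair has an IoU of 1/3 and counts as a hit, while the second pair does not overlap, so `score` is 0.5.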
The code is available at https://github.com/JiawLin/SeqVLM.
[111] Occlusion Robustness of CLIP for Military Vehicle Classification
Jan Erik van Woerden,Gertjan Burghouts,Lotte Nijskens,Alma M. Liezenga,Sabina van Rooij,Frank Ruis,Hugo J. Kuijf
Main category: cs.CV
TL;DR: This paper investigates the robustness of Vision-Language Models (VLMs) like CLIP in challenging military environments with occlusions. It highlights the effectiveness of Transformer-based models, the impact of occlusion types on performance, and how fine-tuning can delay performance degradation, emphasizing the need for occlusion-specific training and further exploration of architectural resilience.
Details
Motivation: Vision-language models (VLMs) such as CLIP offer advantages for defense applications with scarce labeled data by enabling zero-shot classification. However, their robustness in challenging military environments with partial occlusion and degraded signal-to-noise ratio (SNR) remains underexplored. Method: The research evaluates CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes. Performance is measured using Normalized Area Under the Curve (NAUC) across varying occlusion percentages. Result: Four key insights emerged: (1) Transformer-based CLIP models outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous ones, (3) linear-probed models' performance drops sharply at around 35% occlusion, and (4) finetuning the model's backbone delays the performance drop to more than 60% occlusion. Conclusion: The study concludes that occlusion-specific augmentations during training are crucial for improving the robustness of Vision-Language Models (VLMs) like CLIP in challenging military environments. Furthermore, exploring patch-level sensitivity and architectural resilience is necessary for their real-world deployment. Abstract: Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. 
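The abstract does not spell out how NAUC is computed. One plausible reading, sketched below purely as an assumption, is the trapezoidal area under the accuracy-versus-occlusion curve divided by the width of the occlusion range, so that a classifier unaffected by occlusion scores 1.0. The function name `nauc` and the sample curves are hypothetical.

```python
def nauc(occlusion, accuracy):
    """Trapezoidal area under accuracy vs. occlusion fraction,
    normalized by the occlusion range (assumed definition)."""
    area = 0.0
    for i in range(1, len(occlusion)):
        width = occlusion[i] - occlusion[i - 1]
        area += 0.5 * (accuracy[i] + accuracy[i - 1]) * width
    return area / (occlusion[-1] - occlusion[0])


# Hypothetical curves: accuracy measured at 0%, 50%, and 100% occlusion
levels = [0.0, 0.5, 1.0]
robust = nauc(levels, [1.0, 1.0, 1.0])    # unaffected by occlusion
fragile = nauc(levels, [1.0, 0.5, 0.0])   # linear collapse
```

Under this definition a single number summarizes the whole degradation curve, which makes models with different drop-off points (for example, around 35% versus above 60% occlusion) directly comparable.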
Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
[112] SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer
Fachri Najm Noer Kartiman,Rasim,Yaya Wihardi,Nurul Hasanah,Oskar Natan,Bambang Wahono,Taufik Ibnu Salim
Main category: cs.CV
TL;DR: This research proposes the SKGE-Swin architecture for autonomous vehicles, combining the Swin Transformer with a skip-stage mechanism to enhance context awareness and feature representation, achieving better performance in driving scores on the CARLA platform.
Details
Motivation: To develop an end-to-end autonomous vehicle model with enhanced pixel-to-pixel context awareness for better comprehension of complex environmental patterns. Method: The SKGE-Swin architecture integrates the Swin Transformer with a skip-stage mechanism, leveraging the Shifted Window-based Multi-head Self-Attention (SW-MSA) for global and multi-level feature representation. Result: The model achieved a superior Driving Score on the CARLA platform and demonstrated improved performance through ablation studies focusing on architectural components like skip connections and the Swin Transformer. Conclusion: The SKGE-Swin architecture outperforms previous methods in autonomous vehicle modeling by enhancing feature representation and context awareness through the Swin Transformer and skip-stage mechanism. Abstract: Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer's Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle's surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. 
Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance.
[113] Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
Gowreesh Mago,Pascal Mettes,Stevan Rudinac
Main category: cs.CV
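TL;DR: This paper discusses the importance of recognizing abstract concepts in video content and surveys the potential of foundation models to address this challenge.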
Details
Motivation: Although machines have made progress in understanding concrete entities in video, they still fall short of humans in recognizing abstract concepts such as justice and freedom. The paper explores how foundation models can help address this open challenge. Method: The paper surveys past tasks and datasets and analyzes the potential of foundation models for abstract concept understanding. Result: The survey finds that researchers have long attempted to solve abstract concept recognition and have made some progress with the tools available. Conclusion: The paper stresses drawing on community experience to avoid duplicated effort and using multi-modal foundation models to advance abstract concept understanding. Abstract: The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid "re-inventing the wheel" as we start revisiting it in the era of multi-modal foundation models.
[114] Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML
Kuniko Paxton,Koorosh Aslansefat,Amila Akagić,Dhavalkumar Thakker,Yiannis Papadopoulos
Main category: cs.CV
TL;DR: This paper proposes a novel explainability method for skin lesion classification models that improves diagnostic reliability and patient safety by analyzing activation probability maps and detecting false diagnoses.
Details
Motivation: Despite high accuracy in skin lesion classification models, distrust in AI models persists in medical practice. Existing explainability methods have reliability issues, necessitating a more trustworthy and explainable diagnostic approach. Method: The proposed method, Global Class Activation Probabilistic Map Evaluation, analyzes all classes' activation probability maps probabilistically at a pixel level, and SafeML is applied to detect false diagnoses. Result: The method was evaluated using the ISIC datasets with MobileNetV2 and Vision Transformers, showing improved diagnostic reliability and the ability to issue warnings for potential misdiagnoses. Conclusion: The proposed method enhances diagnostic reliability and patient safety by improving the explainability and detection of false diagnoses in skin lesion classification models. Abstract: Recent advancements in skin lesion classification models have significantly improved accuracy, with some models even surpassing dermatologists' diagnostic performance. However, in medical practice, distrust in AI models remains a challenge. Beyond high accuracy, trustworthy, explainable diagnoses are essential. Existing explainability methods have reliability issues: LIME-based methods suffer from inconsistency, while CAM-based methods fail to consider all classes. To address these limitations, we propose Global Class Activation Probabilistic Map Evaluation, a method that analyses all classes' activation probability maps probabilistically and at a pixel level. By visualizing the diagnostic process in a unified manner, it helps reduce the risk of misdiagnosis. Furthermore, the application of SafeML enhances the detection of false diagnoses and issues warnings to doctors and patients as needed, improving diagnostic reliability and ultimately patient safety.
We evaluated our method using the ISIC datasets with MobileNetV2 and Vision Transformers.
[115] Evaluating Compositional Generalisation in VLMs and Diffusion Models
Beth Pearson,Bilal Boulbarss,Michael Wray,Martha Lewis
Main category: cs.CV
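TL;DR: This study examines how the Diffusion Classifier and discriminative models differ in compositional generalisation, finding that the Diffusion Classifier and ViLT perform well on concept binding tasks but that all models struggle with relational generalisation tasks.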
Details
Motivation: A fundamental aspect of natural language is that novel meanings can be formed by composing known parts, yet existing vision-language models (VLMs) fall short on compositional semantics. Diffusion models show strong generative abilities and are competitive on certain compositional tasks, so this study explores whether they have stronger compositional generalisation abilities. Method: The study evaluates three models (Diffusion Classifier, CLIP, and ViLT) on object-attribute binding and relational reasoning in zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings, and analyzes CLIP embeddings to investigate the root of the problem. Result: The Diffusion Classifier and ViLT perform well on concept binding tasks, but all models struggle significantly with the relational GZSL task, indicating that VLMs have general difficulty with relational reasoning. Conclusion: Although the Diffusion Classifier and ViLT perform well on some compositional tasks, vision-language models still face major challenges in relational reasoning, indicating that further improvement is needed to achieve true compositional generalisation. Abstract: A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right.
Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip
[116] Surfel-based 3D Registration with Equivariant SE(3) Features
Xueyang Kang,Hang Zhao,Kourosh Khoshelham,Patrick Vandewalle
Main category: cs.CV
TL;DR: This paper introduces a surfel-based pose learning regression method for point cloud registration that improves robustness to noise and aggressive rotations by learning SE(3) equivariant features.
Details
Motivation: Current point cloud registration methods, both non-learning and learning-based, often ignore point orientations and uncertainties, making them vulnerable to noise and aggressive transformations. This necessitates a method that considers these factors to improve robustness and reduce dependency on augmented training data. Method: The method involves initializing surfels from a Lidar point cloud using virtual perspective camera parameters and learning explicit SE(3) equivariant features using SE(3) equivariant convolutional kernels. The model uses a convolutional encoder, a cross-attention mechanism, a decoder, and a Huber loss for training. Result: The proposed method achieves superior performance on both indoor and outdoor datasets, demonstrating robustness in real-world point cloud registration tasks. Conclusion: The proposed surfel-based pose learning regression approach demonstrates superiority and robust performance in point cloud registration tasks, especially under noisy inputs and aggressive rotations. Abstract: Point cloud registration is crucial for ensuring 3D alignment consistency of multiple local point clouds in 3D reconstruction for remote sensing or digital heritage. While various point cloud-based registration methods exist, both non-learning and learning-based, they ignore point orientations and point uncertainties, making the model susceptible to noisy input and aggressive rotations of the input point cloud like orthogonal transformation; thus, it necessitates extensive training point clouds with transformation augmentations. To address these issues, we propose a novel surfel-based pose learning regression approach. Our method can initialize surfels from Lidar point cloud using virtual perspective camera parameters, and learns explicit $\mathbf{SE(3)}$ equivariant features, including both position and rotation through $\mathbf{SE(3)}$ equivariant convolutional kernels to predict relative transformation between source and target scans. 
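As a reminder of what SE(3) equivariance means in this context: a feature map f is equivariant if f(Rx + t) = R f(x) + t for every rotation R and translation t. The sketch below is only an illustration of that property, not the paper's learned features or kernels; it checks numerically that the point-cloud centroid is SE(3)-equivariant. All function names here (`random_rotation`, `centroid`, `equivariance_error`) are hypothetical helpers introduced for the example.

```python
import numpy as np


def random_rotation(rng):
    """Draw a random 3x3 rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs of the QR factor
    if np.linalg.det(q) < 0:         # flip one axis if we got a reflection
        q[:, 0] = -q[:, 0]
    return q


def centroid(points):
    """Toy feature: the mean of an (N, 3) point cloud."""
    return points.mean(axis=0)


def equivariance_error(feature_fn, points, rotation, translation):
    """Max absolute gap between f(Rx + t) and R f(x) + t."""
    transformed = points @ rotation.T + translation
    lhs = feature_fn(transformed)
    rhs = rotation @ feature_fn(points) + translation
    return float(np.max(np.abs(lhs - rhs)))


rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))   # synthetic point cloud (assumption)
R = random_rotation(rng)
t = rng.normal(size=3)
err = equivariance_error(centroid, points, R, t)
print(f"centroid equivariance error: {err:.2e}")
```

A learned feature extractor built from equivariant layers should pass the same kind of check up to numerical precision, which is why such models can cope with large rotations without seeing them in augmented training data.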
The model comprises an equivariant convolutional encoder, a cross-attention mechanism for similarity computation, a fully-connected decoder, and a non-linear Huber loss. Experimental results on indoor and outdoor datasets demonstrate our model's superiority and robust performance on real point-cloud scans compared to state-of-the-art methods.
[117] Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training
Tao Luo,Han Wu,Tong Yang,Dinggang Shen,Zhiming Cui
Main category: cs.CV
TL;DR: DVCTNet is a novel method for accurate dental caries detection using panoramic X-rays, combining global and local views with advanced feature fusion techniques, and showing superior performance over existing approaches.
Details
Motivation: Current dental caries detection methods struggle with accuracy due to contrast variations and lesion diversity, necessitating a more effective solution. Method: DVCTNet uses a dual-view approach with global and local image analyses, integrating features through a Gated Cross-View Attention module. Result: DVCTNet outperforms state-of-the-art methods on both a public dataset and a newly curated high-precision dataset. Conclusion: DVCTNet demonstrates superior performance in dental caries detection compared to existing methods, showing its potential for clinical application. Abstract: Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. 
To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet's superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
[118] FusionCounting: Robust visible-infrared image fusion guided by crowd counting via multi-task learning
He Li,Xinyu Liu,Weihang Kong,Xingchen Zhang
Main category: cs.CV
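TL;DR: This paper proposes FusionCounting, a new multi-task learning framework that integrates crowd counting into the visible and infrared image fusion (VIF) process, addressing the limitations of existing methods in dense scenes.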
Details
Motivation: Although RGB-T crowd counting has attracted growing attention in recent years, no study has integrated VIF and crowd counting into a unified framework. Meanwhile, semantic segmentation requires extensive annotations, and object detection struggles in highly crowded scenes because of overlapping bounding boxes and occlusion. Method: The paper proposes FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process, and introduces a dynamic loss weighting strategy and adversarial training to accelerate convergence, balance task contributions, and enhance robustness. Result: FusionCounting not only improves image fusion quality but also achieves superior crowd counting performance. Conclusion: Experimental results show that FusionCounting not only improves image fusion quality but also achieves superior crowd counting performance. Abstract: Most visible and infrared image fusion (VIF) methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB-T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi-task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi-task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model's stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.
[119] Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation
Krit Duangprom,Tryphon Lambrou,Binod Bhattarai
Main category: cs.CV
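TL;DR: This paper proposes fine-tuning vision-language models with low-rank adaptation for 2D keypoint estimation of surgical tools, effectively addressing the overfitting of conventional methods on small-scale medical datasets.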
Details
Motivation: Conventional convolutional neural network (CNN) or Transformer-based methods often overfit on small-scale medical datasets; this work uses the generalization ability of pre-trained VLMs to address the problem. Method: A pre-trained vision-language model (VLM) is fine-tuned with low-rank adaptation (LoRA), and prompts are carefully designed to build an instruction-tuning dataset that aligns visual features with semantic keypoint descriptions. Result: Experiments show that after only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of the proposed method. Conclusion: The paper proposes a new method that fine-tunes vision-language models with low-rank adaptation for 2D keypoint estimation of surgical tools, demonstrates the effectiveness of LoRA in low-resource settings, and paves the way for future work on 3D pose estimation of surgical hands and tools. Abstract: This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
[120] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification
Hao Yang,Qianyu Zhou,Haijia Sun,Xiangtai Li,Xuequan Lu,Lizhuang Ma,Shuicheng Yan
Main category: cs.CV
TL;DR: This paper proposes PointDGRWKV, the first RWKV-based framework tailored for Domain Generalization in Point Cloud Classification, introducing modules to address spatial distortions and attention shifts for improved generalization.
Details
Motivation: Prior works on DG for PCC based on convolutional networks, Transformer, or Mamba architectures face limitations in receptive fields, computational cost, or long-range dependency modeling. RWKV offers promising features like linear complexity and global receptive fields, but its direct application to DG PCC faces significant challenges. Method: PointDGRWKV introduces Adaptive Geometric Token Shift and Cross-Domain key feature Distribution Alignment to enhance spatial modeling and cross-domain robustness for RWKV-based DG PCC. Result: Experiments show that PointDGRWKV achieves state-of-the-art performance on DG PCC, overcoming the limitations of RWKV's fixed token shift and Bi-WKV attention for improved spatial modeling and cross-domain robustness. Conclusion: PointDGRWKV successfully addresses the challenges of applying RWKV to DG PCC, achieving state-of-the-art performance on DG PCC. Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. 
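To see why exponential weighting can amplify small differences in keys, consider a simplified normalized exponential weighting. This is an assumption made for illustration: it omits the positional decay and bonus terms and is not the exact Bi-WKV formula. The symbols $k_i$, $a_i$, and $\delta$ are introduced here only for the example.

```latex
% Simplified attention weight on token i, with keys k_1, ..., k_T:
a_i = \frac{e^{k_i}}{\sum_{j=1}^{T} e^{k_j}}
\qquad\Longrightarrow\qquad
\frac{a_i}{1 - a_i} = \frac{e^{k_i}}{\sum_{j \neq i} e^{k_j}} .

% A domain shift that moves a single key, k_i \mapsto k_i + \delta,
% multiplies the odds of attending to token i by e^{\delta}:
\frac{a_i'}{1 - a_i'} = e^{\delta} \, \frac{a_i}{1 - a_i},
\qquad \text{e.g. } \delta = 0.5 \;\Rightarrow\; e^{0.5} \approx 1.65 .
```

Under this simplification an additive shift in a key becomes a multiplicative change in attention, which is consistent with the motivation for aligning key distributions across domains.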
To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.
[121] PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis
Ye Zhang,Yu Zhou,Jingwen Qi,Yongbing Zhang,Simon Puettmann,Finn Wichmann,Larissa Pereira Ferreira,Lara Sichward,Julius Keyl,Sylvia Hartmann,Shuo Zhao,Hongxiao Wang,Xiaowei Xu,Jianxu Chen
Main category: cs.CV
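TL;DR: PathMR is a cell-level multimodal visual reasoning framework for pathological image analysis that generates expert-level diagnostic explanations and predicts cell distribution patterns, outperforming existing methods in text generation quality, segmentation accuracy, and cross-modal alignment.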
Details
Motivation: The clinical adoption of deep learning for automated pathological diagnosis is limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates pixel-level segmentation masks together with semantically aligned textual explanations. Method: PathMR is a cell-level multimodal visual reasoning framework for pathological image analysis; given a pathological image and a textual query, it generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. Result: Extensive experiments on the PathGen dataset and the newly developed GADVR dataset show that PathMR consistently outperforms existing visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. Conclusion: PathMR shows potential for improving interpretability in AI-driven pathological diagnosis and consistently outperforms existing visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. Abstract: Deep learning based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.
[122] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis
Dennis Slobodzian,Karissa Tilbury,Amir Kordijazi
Main category: cs.CV
TL;DR: A deep learning framework using a modified ResNet architecture achieved over 90% accuracy in early PDAC detection from dual-modality images, offering a promising tool for automated diagnosis and future applications in broader cancer detection.
Details
Motivation: Pancreatic ductal adenocarcinoma (PDAC) has a very low five-year survival rate due to late detection, highlighting the urgent need for more effective early detection methods. Method: The study analyzed 40 patient samples and evaluated six deep learning architectures, comparing Convolutional Neural Networks (CNNs) with Vision Transformers (ViTs). The final framework was based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training to address dataset limitations. Result: The optimized framework achieved over 90% accuracy in distinguishing between normal, fibrotic, and cancerous tissue, significantly outperforming current manual analysis methods. Conclusion: This research successfully developed a deep learning framework for early detection of pancreatic ductal adenocarcinoma using dual-modality imaging, demonstrating significant improvement over manual analysis and establishing a pipeline for automated detection that can assist pathologists and potentially be expanded to other cancer types. Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10% primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance.
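Class imbalance of this kind is commonly handled by weighting each class inversely to its frequency in the loss. The sketch below is a minimal illustration, not the authors' implementation: the function name `inverse_frequency_weights`, the toy label counts, and the choice of the "balanced" heuristic `n_samples / (n_classes * count)` are all assumptions made for the example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a per-class weight of n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and frequent classes below 1, so each
    class contributes the same total weight to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, made-up label distribution (not the study's data).
labels = ["normal"] * 6 + ["fibrotic"] * 3 + ["cancerous"] * 1
weights = inverse_frequency_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

The resulting dictionary can be converted to a tensor or array in class-index order and passed to a weighted cross-entropy loss in whichever framework is used for training.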
The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
[123] Understanding and evaluating computer vision models through the lens of counterfactuals
Pushkar Shukla
Main category: cs.CV
TL;DR: This thesis introduces counterfactual-based frameworks to improve interpretability, fairness, and causality in AI models, targeting bias evaluation and mitigation in both vision classifiers and generative systems.
Details
Motivation: The motivation is to improve interpretability, fairness, and robustness of AI systems by uncovering spurious correlations, probing causal dependencies, and mitigating biases in both vision classifiers and generative models. Method: The thesis develops frameworks that use counterfactual reasoning to systematically alter inputs and analyze model behavior, including methods like CAVLI, ASAC, TIBET, BiasConnect, and InterMit. Result: The developed frameworks, such as CAVLI for vision classifiers and TIBET, BiasConnect, and InterMit for generative models, demonstrate effective bias evaluation and mitigation while enhancing interpretability and fairness. Conclusion: The thesis concludes that counterfactuals provide a unifying approach for enhancing interpretability, fairness, and causality in AI models, offering scalable and principled methods for responsible bias evaluation and mitigation. Abstract: Counterfactual reasoning -- the practice of asking ``what if'' by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. 
Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.
[124] To New Beginnings: A Survey of Unified Perception in Autonomous Vehicle Software
Loïc Stratil,Felix Fent,Esteban Rivera,Markus Lienkamp
Main category: cs.CV
TL;DR: This paper presents a comprehensive framework for unified perception in autonomous vehicles, consolidating fragmented efforts and guiding future research towards more robust and interpretable perception systems.
Details
Motivation: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm to address these issues. Method: The paper provides a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. It defines three paradigms - Early, Late, and Full Unified Perception - and systematically reviews existing methods, their architectures, training strategies, datasets used, and open-source availability. Result: The paper highlights future research directions while consolidating the current state of unified perception into a systemic taxonomy. It reviews existing methods and their effectiveness in improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. Conclusion: This paper establishes the first comprehensive framework for understanding and advancing unified perception, consolidating fragmented efforts and guiding future research toward more robust, generalizable, and interpretable perception. Abstract: Autonomous vehicle perception typically relies on modular pipelines that decompose the task into detection, tracking, and prediction. While interpretable, these pipelines suffer from error accumulation and limited inter-task synergy. Unified perception has emerged as a promising paradigm that integrates these sub-tasks within a shared architecture, potentially improving robustness, contextual reasoning, and efficiency while retaining interpretable outputs. In this survey, we provide a comprehensive overview of unified perception, introducing a holistic and systemic taxonomy that categorizes methods along task integration, tracking formulation, and representation flow. 
We define three paradigms (Early, Late, and Full Unified Perception) and systematically review existing methods, their architectures, training strategies, datasets used, and open-source availability, while highlighting future research directions. This work establishes the first comprehensive framework for understanding and advancing unified perception, consolidates fragmented efforts, and guides future research toward more robust, generalizable, and interpretable perception.
[125] Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Yifan Gao,Haoyue Li,Feng Yuan,Xiaosong Wang,Xin Gao
Main category: cs.CV
TL;DR: Dino U-Net is a novel encoder-decoder architecture that leverages the DINOv3 vision foundation model to improve medical image segmentation accuracy by effectively combining semantic and spatial features.
Details
Motivation: Foundation models pre-trained on natural images offer powerful tools for medical image segmentation, but effectively transferring their representations for precise clinical applications remains a challenge. This work aims to bridge that gap. Method: The paper introduces Dino U-Net, an encoder-decoder architecture that utilizes a frozen DINOv3 backbone with a specialized adapter to fuse semantic and spatial features. It also proposes a fidelity-aware projection module (FAPM) to preserve feature quality during dimensionality reduction. Result: Dino U-Net achieved state-of-the-art performance across seven public medical image segmentation datasets, consistently outperforming previous methods. Segmentation accuracy improved with larger backbone model sizes, up to the 7-billion-parameter variant. Conclusion: Dino U-Net proves to be a highly scalable and effective framework for medical image segmentation, demonstrating that leveraging dense-pretrained features from a general-purpose foundation model can significantly advance segmentation accuracy. Abstract: Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. 
Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
[126] Classifying Mitotic Figures in the MIDOG25 Challenge with Deep Ensemble Learning and Rule Based Refinement
Sara Krauss,Ellena Spieß,Daniel Hieber,Frank Kramer,Johannes Schobel,Dominik Müller
Main category: cs.CV
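TL;DR: This paper proposes a deep ensemble learning approach for distinguishing atypical mitotic figures (AMFs) from normal mitotic figures (NMFs) and explores the effect of a rule-based refinement (RBR) module, achieving good classification performance.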
Details
Motivation: Mitotic figures (MFs) are important biomarkers in tumor grading, but manual annotation is time-consuming and subjective, making it difficult to distinguish AMFs from NMFs. Method: An ensemble of ConvNeXtBase models is trained with AUCMEDI and extended with a rule-based refinement (RBR) module. Result: On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. RBR increased specificity but reduced sensitivity and overall performance. Conclusion: Deep ensembles perform well on AMF classification; the RBR module can improve specific metrics but requires further research. Abstract: Mitotic figures (MFs) are relevant biomarkers in tumor grading. Differentiating atypical MFs (AMFs) from normal MFs (NMFs) remains difficult, as manual annotation is time-consuming and subjective. In this work, an ensemble of ConvNeXtBase models was trained with AUCMEDI and extended with a rule-based refinement (RBR) module. On the MIDOG25 preliminary test set, the ensemble achieved a balanced accuracy of 84.02%. While the RBR increased specificity, it reduced sensitivity and overall performance. The results show that deep ensembles perform well for AMF classification. RBR can increase specific metrics but requires further research.
[127] COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans
Enrico Martini,Ho Jin Choi,Nadia Figueroa,Nicola Bombieri
Main category: cs.CV
TL;DR: COMETH is a lightweight, real-time multi-view human pose fusion algorithm designed for improved accuracy and scalability in industrial human motion tracking applications.
Details
Motivation: The motivation is to overcome the limitations of multi-camera centralized setups and edge devices in terms of computational costs, scalability, and accuracy for real-time human pose estimation in the context of Industry 5.0. Method: COMETH utilizes convex optimization for multiview estimation and tracking, integrating kinematic and biomechanical constraints, employing convex optimization-based inverse kinematics, and implementing a state observer for temporal consistency. Result: COMETH outperforms state-of-the-art methods in localization, detection, and tracking accuracy on both public and industrial datasets. Conclusion: The proposed COMETH algorithm effectively addresses the challenges of accurate and scalable human motion tracking in industrial settings, demonstrating superior performance compared to existing methods. Abstract: In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. 
We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
[128] Olive Tree Satellite Image Segmentation Based On SAM and Multi-Phase Refinement
Amir Jmal,Chaima Chtourou,Mahdi Louati,Abdelaziz Kallel,Houda Khmila
Main category: cs.CV
TL;DR: This paper proposes an innovative method for olive tree segmentation from satellite images using the Segment Anything Model (SAM) and advanced techniques, achieving 98% accuracy.
Details
Motivation: Maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial in the context of proven climate change. Method: The study integrates the Segment Anything Model (SAM) for segmentation, incorporating corrections based on tree alignment in the field and a learnable constraint regarding shape and size. Result: The approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%. Conclusion: This innovative method offers effective management solutions for identifying and segmenting olive trees in agricultural plots using remote sensing technology. Abstract: In the context of proven climate change, maintaining olive biodiversity through early anomaly detection and treatment using remote sensing technology is crucial, offering effective management solutions. This paper presents an innovative approach to olive tree segmentation from satellite images. By leveraging foundational models and advanced segmentation techniques, the study integrates the Segment Anything Model (SAM) to accurately identify and segment olive trees in agricultural plots. The methodology includes SAM segmentation and corrections based on tree alignment in the field and a learnable constraint on shape and size. Our approach achieved a 98% accuracy rate, significantly surpassing the initial SAM performance of 82%.
[129] E-ConvNeXt: A Lightweight and Efficient ConvNeXt Variant with Cross-Stage Partial Connections
Fang Wang,Huitao Li,Wenhan Chao,Zheng Zhuo,Yiran Ji,Chang Peng,Yupeng Sun
Main category: cs.CV
TL;DR: This paper proposes E-ConvNeXt, an efficient network that reduces complexity through structural optimization while maintaining high accuracy.
Details
Motivation: Many high-performance networks were not designed with lightweight application scenarios in mind, which limits their scope of application. Method: The ConvNeXt architecture is improved by combining the Cross Stage Partial Connections mechanism with a series of optimized designs. Result: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs; the network was also validated in transfer learning tests on object detection tasks. Conclusion: E-ConvNeXt significantly reduces network complexity while maintaining high accuracy, and shows good application prospects. Abstract: Many high-performance networks were not designed with lightweight application scenarios in mind from the outset, which has greatly restricted their scope of application. This paper takes ConvNeXt as the research object and significantly reduces the parameter scale and network complexity of ConvNeXt by integrating the Cross Stage Partial Connections mechanism and a series of optimized designs. The new network is named E-ConvNeXt, which can maintain high accuracy performance under different complexity configurations. The three core innovations of E-ConvNeXt are: (1) integrating the Cross Stage Partial Network (CSPNet) with ConvNeXt and adjusting the network structure, which reduces the model's network complexity by up to 80%; (2) optimizing the Stem and Block structures to enhance the model's feature expression capability and operational efficiency; (3) replacing Layer Scale with channel attention. Experimental validation on ImageNet classification demonstrates E-ConvNeXt's superior accuracy-efficiency balance: E-ConvNeXt-mini reaches 78.3% Top-1 accuracy at 0.9 GFLOPs, and E-ConvNeXt-small reaches 81.9% Top-1 accuracy at 3.1 GFLOPs. Transfer learning tests on object detection tasks further confirm its generalization capability.
[130] DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes
Yajiao Xiong,Xiaoyu Zhou,Yongtao Wan,Deqing Sun,Ming-Hsuan Yang
Main category: cs.CV
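TL;DR: DrivingGaussian++ is an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes.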
Details
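Motivation: DrivingGaussian++ is proposed to improve realistic reconstruction and controllable editing of autonomous driving scenes. It aims to achieve accurate positions and occlusions by using incremental 3D Gaussians and a composite dynamic Gaussian graph, and to achieve more detailed and consistent scene reconstruction by integrating a LiDAR prior. It also supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, and uses large language models (LLMs) together with controllable editing to automatically generate dynamic object motion trajectories and enhance their realism during optimization. Method: DrivingGaussian++ models the static background with incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph. Integrating a LiDAR prior yields more detailed and consistent scene reconstruction. To support training-free controllable editing, it uses multi-view images and depth priors for texture modification, weather simulation, and object manipulation, and combines LLMs with controllable editing to automatically generate dynamic object motion trajectories during optimization. Result: DrivingGaussian++ outperforms existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. It supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. It also uses LLMs and controllable editing to automatically generate dynamic object motion trajectories during optimization, enhancing their realism. Conclusion: DrivingGaussian++ is an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. By modeling the static background with incremental 3D Gaussians and reconstructing moving objects with a composite dynamic Gaussian graph, it achieves accurate positions and occlusions, and by integrating a LiDAR prior it achieves more detailed and consistent scene reconstruction. It also supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, and uses LLMs and controllable editing to automatically generate dynamic object motion trajectories during optimization, enhancing their realism. Abstract: We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. 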
More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io
[131] Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
Chenfan Qu,Yiwu Zhong,Bin Li,Lianwen Jin
Main category: cs.CV
TL;DR: This paper introduces a new framework for image manipulation localization using automatically generated annotations and a novel quality evaluation metric, significantly improving performance while reducing reliance on manually annotated data.
Details
Motivation: The high cost of acquiring high-quality annotated datasets poses a significant barrier to accurately localizing manipulated regions in images. The motivation is to develop methods that can mitigate data scarcity and improve model performance for image manipulation localization. Method: The paper introduces a novel paradigm CAAA v2 for automatic annotation of manipulated regions, a quality evaluation metric QES to filter unreliable annotations, and a data enhancement technique called Object Jitter. These methods are used to construct the MIMLv2 dataset and train the Web-IML model. Result: The authors constructed MIMLv2, a dataset with 246,212 manually forged images, 120x larger than existing datasets. Using Web-IML, they achieved a 31% performance gain and surpassed the previous state-of-the-art method, TruFor, by 24.1 average IoU points. Conclusion: The paper concludes that the proposed Web-IML model, along with the newly introduced CAAA v2 and QES techniques, significantly improves the performance of image manipulation localization, outperforming previous state-of-the-art methods by a large margin. Abstract: Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm CAAAv2, which automatically and accurately annotates manipulated regions at the pixel level. 
To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAA v2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
[132] ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Patryk Będkowski,Jan Dubiński,Filip Szatkowski,Kamil Deja,Przemysław Rokita,Tomasz Trzciński
Main category: cs.CV
TL;DR: ExpertSim uses a deep learning Mixture-of-Generative-Experts architecture to simulate detector responses more efficiently and accurately than traditional Monte Carlo methods.
Details
Motivation: Current simulations based on statistical Monte Carlo methods are computationally expensive and put significant strain on CERN's computational grid, and the data distribution varies widely across simulations, which conventional methods struggle to handle. Method: ExpertSim uses a Mixture-of-Generative-Experts architecture in which each expert specializes in simulating a different subset of the data, so that simulation results are generated more precisely and efficiently. Result: ExpertSim improves accuracy and significantly speeds up simulation, providing a more efficient simulation method for the Zero Degree Calorimeter in the ALICE experiment. Conclusion: ExpertSim is a promising solution for high-efficiency detector simulation in particle physics experiments, improving on traditional Monte Carlo methods in both accuracy and speed. Abstract: Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN's computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim, a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
[133] POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models
Jiaxiang Cheng,Bing Ma,Xuhua Ren,Hongyi Jin,Kai Yu,Peng Zhang,Wenyue Li,Yuan Zhou,Tianxiang Zheng,Qinglin Lu
Main category: cs.CV
TL;DR: POSE is a novel distillation framework that enhances sampling efficiency in video diffusion models, enabling high-quality video generation in a single step with significantly reduced latency.
Details
Motivation: The motivation is to overcome the limitations of existing video acceleration methods that do not model temporal coherence or provide single-step distillation for large-scale video models. Method: POSE uses a two-phase distillation process including stability priming and unified adversarial equilibrium, along with conditional adversarial consistency for conditional video generation. Result: POSE reduces sampling steps and latency by 100x, while maintaining or improving video quality and performance across various metrics. Conclusion: POSE effectively addresses the critical bottlenecks in sampling efficiency for video diffusion generation, particularly for large-scale models and long sequences. Abstract: The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models:(i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. 
For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence, and frame quality, reducing the latency of the pre-trained model by 100x, from 1000 seconds to 10 seconds, while maintaining competitive performance.
[134] Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
Dale Decatur,Thibault Groueix,Wang Yifan,Rana Hanocka,Vladimir Kim,Matheus Gadelha
Main category: cs.CV
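TL;DR: This paper proposes a method that reduces redundancy in text-to-image generation by sharing early diffusion steps across similar prompts, lowering compute cost and improving image quality.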
Details
Motivation: Text-to-image diffusion models generate high-quality images but are computationally expensive. Prior work mainly optimizes per-inference efficiency, whereas this paper explores reducing redundant computation across correlated prompts. Method: The method builds on the coarse-to-fine nature of diffusion models, clustering prompts by semantic similarity and sharing computation in early diffusion steps. Result: Experiments show that, for models conditioned on image embeddings, the method significantly reduces compute cost while improving image quality; it also integrates seamlessly with existing pipelines and scales to large prompt sets. Conclusion: The paper proposes a training-free approach that reduces redundancy in text-to-image generation by sharing computation in the early diffusion steps of similar prompts, lowering compute cost and improving the quality of generated images. Abstract: Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
[135] Mitosis detection in domain shift scenarios: a Mamba-based approach
Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Main category: cs.CV
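TL;DR: This paper proposes a Mamba-based approach for mitosis detection under domain shift, using a VM-UNet architecture and stain augmentation to improve the model's robustness to domain shift.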
Details
Motivation: Mitosis detection in histopathology images plays a key role in tumor assessment, but the performance of machine learning algorithms on this task drops significantly under domain shift. Method: The approach uses a Mamba-based VM-UNet architecture for mitosis detection, combined with stain augmentation operations to improve model robustness. Result: Preliminary experiments on the MIDOG++ dataset show large room for improvement for the proposed method. Conclusion: The approach offers a potentially promising solution to the domain shift problem in mitosis detection. Abstract: Mitosis detection in histopathology images plays a key role in tumor assessment. Although machine learning algorithms could be exploited for aiding physicians in accurately performing such a task, these algorithms suffer from a significant performance drop when evaluated on images coming from domains that are different from the training ones. In this work, we propose a Mamba-based approach for mitosis detection under domain shift, inspired by the promising performance demonstrated by Mamba in medical imaging segmentation tasks. Specifically, our approach exploits a VM-UNet architecture for carrying out the addressed task, as well as stain augmentation operations for further improving model robustness against domain shift. Our approach has been submitted to track 1 of the MItosis DOmain Generalization (MIDOG) challenge. Preliminary experiments, conducted on the MIDOG++ dataset, show large room for improvement for the proposed method.
[136] A multi-task neural network for atypical mitosis recognition under domain shift
Gennaro Percannella,Mattia Sarno,Francesco Tortorella,Mario Vento
Main category: cs.CV
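TL;DR: This study proposes a multi-task learning approach for automatically recognizing atypical mitotic figures in histopathology images under domain shift, with promising preliminary results.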
Details
Motivation: Machine learning models suffer significant performance drops under domain shift, yet recognizing atypical mitotic figures in histopathology images is essential for physicians to correctly assess tumor aggressiveness. Method: An approach based on multi-task learning is proposed that addresses domain shift by exploiting auxiliary tasks correlated with the main classification task. Result: The approach was preliminarily evaluated on three distinct datasets, namely the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, and the preliminary test set of the MIDOG25 challenge, and showed promising performance on all of them. Conclusion: The multi-task learning approach shows promising performance in preliminary evaluation and helps the model focus on the object to classify while ignoring the domain-varying background of the image. Abstract: Recognizing atypical mitotic figures in histopathology images allows physicians to correctly assess tumor aggressiveness. Although machine learning models could be exploited for automatically performing such a task, under domain shift these models suffer from significant performance drops. In this work, an approach based on multi-task learning is proposed for addressing this problem. By exploiting auxiliary tasks, correlated to the main classification task, the proposed approach, submitted to track 2 of the MItosis DOmain Generalization (MIDOG) challenge, aims to help the model focus only on the object to classify, ignoring the domain-varying background of the image. The proposed approach shows promising performance in a preliminary evaluation conducted on three distinct datasets, i.e., the MIDOG 2025 Atypical Training Set, the Ami-Br dataset, as well as the preliminary test set of the MIDOG25 challenge.
[137] FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator
Huynh Tong Dang Khoa,Dang Hoai Nam,Vo Nguyen Le Duy
Main category: cs.CV
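TL;DR: This paper proposes FW-GAN, a GAN-based one-shot handwriting synthesis framework that introduces a Wave-MLP generator, a frequency-guided discriminator, and a Frequency Distribution Loss to address the shortcomings of conventional methods in modeling long-range dependencies and frequency information.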
Details
Motivation: Handwriting recognition systems need diverse training samples, but labeled handwriting data is scarce and inconsistent in style. Current synthesis methods struggle to model long-range dependencies and complex stroke patterns, and overlook the importance of frequency information. Method: FW-GAN combines a phase-aware Wave-MLP generator, a frequency-guided discriminator, and a Frequency Distribution Loss to improve the realism and style consistency of generated handwriting. Result: Experiments on Vietnamese and English handwriting datasets show that FW-GAN generates high-quality, style-consistent handwriting and effectively augments low-resource handwriting recognition pipelines. Conclusion: FW-GAN is an effective data augmentation tool for low-resource handwriting recognition, capable of generating high-quality, style-consistent handwriting. Abstract: Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN
[138] MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs
Junpeng Ma,Qizhe Zhang,Ming Lu,Zhibin Wang,Qiang Zhou,Jun Song,Shanghang Zhang
Main category: cs.CV
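TL;DR: MMG-Vid is a training-free visual token pruning framework that dynamically allocates token budgets and uses a temporal-guided DPC algorithm to significantly improve the inference efficiency of video large language models while maintaining strong performance.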
Details
Motivation: Existing video understanding methods fail to account for the dynamic characteristics and temporal dependencies of video frames, leading to computational inefficiency that limits the practicality of video large language models in real-world applications. Method: MMG-Vid maximizes marginal gains at both the segment level and the token level, dynamically allocating token budgets and using a temporal-guided DPC algorithm to jointly model inter-frame uniqueness and intra-frame diversity. Result: Experiments show that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Conclusion: MMG-Vid significantly improves the inference efficiency of video large language models while maintaining strong performance, demonstrating its potential for real-world applications. Abstract: Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
[139] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Wei Li,Renshan Zhang,Rui Shao,Jie He,Liqiang Nie
Main category: cs.CV
TL;DR: CogVLA is an efficient Vision-Language-Action framework inspired by human multimodal coordination that improves performance while significantly reducing computational costs.
Details
Motivation: The motivation behind CogVLA is to address the high computational overhead and scalability issues of existing Vision-Language-Action models that rely on extensive post-training. Method: CogVLA introduces a 3-stage progressive architecture: Encoder-FiLM based Aggregation Routing (EFA-Routing) to compress visual tokens, LLM-FiLM based Pruning Routing (LFP-Routing) to introduce action intent and achieve token-level sparsity, and V-L-A Coupled Attention (CAtten) for coherent action generation. Result: CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0% on the LIBERO benchmark and real-world robotic tasks, respectively. It also reduces training costs by 2.5-fold and decreases inference latency by 2.8-fold compared to OpenVLA. Conclusion: CogVLA is an efficient and high-performing Vision-Language-Action framework that significantly reduces computational overhead while achieving superior performance on vision-language and robotic tasks compared to existing models like OpenVLA. Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 
2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
[140] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning
Hao Tan,Jun Lan,Zichang Tan,Ajian Liu,Chuanbiao Song,Senyuan Shi,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei
Main category: cs.CV
TL;DR: This paper introduces the HydraFake dataset and the Veritas model, which together improve deepfake detection in real-world scenarios through pattern-aware reasoning and a two-stage training pipeline.
Details
Motivation: Existing deepfake detection benchmarks do not align well with real-world industrial practices, featuring homogeneous training sources and low-quality testing images. Method: Introducing HydraFake dataset with hierarchical generalization testing and developing Veritas with pattern-aware reasoning and a two-stage training pipeline. Result: Veritas outperforms previous detectors in handling unseen forgeries and data domains on the HydraFake dataset. Conclusion: Veritas, a multi-modal large language model-based deepfake detector, achieves significant gains across different OOD scenarios and delivers transparent and faithful detection outputs. Abstract: Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. 
Experiments on the HydraFake dataset reveal that although previous detectors generalize well in cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.
[141] FakeParts: a New Family of AI-Generated DeepFakes
Gaetan Brison,Soobash Daiboo,Samy Aimeur,Awais Hussain Sani,Xi Wang,Gianni Franchi,Vicky Kalogeiton
Main category: cs.CV
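TL;DR: This paper introduces FakeParts, a class of deepfake videos with localized manipulations, and releases the FakePartsBench dataset to support research on detecting them.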
Details
Motivation: Partial deepfake videos are more deceptive and harder to detect than fully synthetic content, and call for new detection methods and supporting datasets. Method: The paper introduces FakeParts and the FakePartsBench dataset, and assesses detection difficulty through user studies. Result: FakeParts reduces human detection accuracy by over 30%, and state-of-the-art detection models show similar performance degradation. Conclusion: FakePartsBench is a large-scale benchmark dataset for detecting partial deepfake videos, intended to drive the development of more robust detection methods. Abstract: We introduce FakeParts, a new class of deepfakes characterized by subtle, localized manipulations to specific spatial regions or temporal segments of otherwise authentic videos. Unlike fully synthetic content, these partial manipulations, ranging from altered facial expressions to object substitutions and background modifications, blend seamlessly with real elements, making them particularly deceptive and difficult to detect. To address the critical gap in detection capabilities, we present FakePartsBench, the first large-scale benchmark dataset specifically designed to capture the full spectrum of partial deepfakes. Comprising over 25K videos with pixel-level and frame-level manipulation annotations, our dataset enables comprehensive evaluation of detection methods. Our user studies demonstrate that FakeParts reduces human detection accuracy by over 30% compared to traditional deepfakes, with similar performance degradation observed in state-of-the-art detection models. This work identifies an urgent vulnerability in current deepfake detection approaches and provides the necessary resources to develop more robust methods for partial video manipulations.
[142] Multi-View 3D Point Tracking
Frano Rajič,Haofei Xu,Marko Mihajlovic,Siyuan Li,Irem Demir,Emircan Gündoğdu,Lei Ke,Sergey Prokudin,Marc Pollefeys,Siyu Tang
Main category: cs.CV
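TL;DR: The paper introduces a new multi-view 3D point tracker that achieves robust and accurate online tracking with a practical number of cameras (e.g., four), outperforming existing monocular trackers and multi-camera methods.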
Details
Motivation: Existing monocular trackers struggle with depth ambiguities and occlusion, while prior multi-camera methods require over 20 cameras and tedious per-sequence optimization. The authors therefore aim to develop a method that achieves robust and accurate online tracking with a practical number of cameras (e.g., four). Method: A feed-forward model directly predicts 3D correspondences, fusing multi-view features into a unified point cloud and applying k-nearest-neighbors correlation with a transformer-based update. Result: The model is trained on 5K synthetic multi-view Kubric sequences and evaluated on two real-world benchmarks, Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Conclusion: The paper presents the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Compared with existing monocular trackers and other multi-camera methods, it performs better under depth ambiguity and occlusion. Abstract: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
[143] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
Yuan Gong,Xionghui Wang,Jie Wu,Shiyin Wang,Yitong Wang,Xinglong Wu
Main category: cs.CV
TL;DR: OneReward is a unified reinforcement learning framework that improves multi-task image generation by using a single reward model, eliminating the need for task-specific fine-tuning and achieving better performance than existing models.
Details
Motivation: Existing methods for multi-task image generation require task-specific supervised fine-tuning, which limits generalization and training efficiency. A unified framework is needed to handle diverse tasks with varied data distributions and evaluation metrics. Method: The paper proposes OneReward, which uses a single vision-language model (VLM) as a reward model to guide multi-task reinforcement learning for mask-guided image generation tasks. Result: Seedream 3.0 Fill, trained with OneReward, outperforms commercial and open-source models like Ideogram, Adobe Photoshop, and FLUX Fill across multiple evaluation dimensions. Conclusion: OneReward provides a unified reinforcement learning framework that enhances generative capabilities across multiple tasks without task-specific supervised fine-tuning, achieving superior performance in mask-guided image generation. Abstract: In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. 
Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
[144] Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang
Main category: cs.CV
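TL;DR: Dress&Dance is a video diffusion framework that generates high-quality virtual try-on videos; its CondNet network handles multi-modal inputs in a unified way, providing a flexible try-on experience.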
Details
Motivation: There is a need to generate high-quality virtual try-on videos that let users try on desired garments while following the motion of a reference video. Method: The paper proposes CondNet, a new conditioning network that uses attention to unify multi-modal inputs (text, images, and videos), and trains it in a multistage progressive manner that combines limited video data with a larger image dataset. Result: The framework generates high-quality 5-second, 24 FPS virtual try-on videos at 1152x720 resolution, supporting multiple garment types and simultaneous try-on of tops and bottoms. Conclusion: Through CondNet, Dress&Dance delivers a high-quality, flexible virtual try-on experience and outperforms existing open-source and commercial solutions. Abstract: We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.
[145] First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge
Fahad Shamshad,Tameem Bakr,Yahia Shaaban,Noor Hussein,Karthik Nandakumar,Nils Lukas
Main category: cs.CV
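TL;DR: We developed two attack methods to test the robustness of image watermarks; in the NeurIPS 2024 Erasing the Invisible challenge, our methods removed the watermarks almost perfectly while preserving image quality.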