cs.CL
[1] GreenTEA: Gradient Descent with Topic-modeling and Evolutionary Auto-prompting
Zheng Dong,Luming Shang,Gabriela Olinto
Main category: cs.CL
TL;DR: GreenTEA, an agentic LLM workflow, optimizes prompts by balancing exploration of new candidates and exploitation of existing knowledge, outperforming manual and existing automatic methods.
Details
Motivation: Manual prompt crafting is labor-intensive and domain expertise-dependent, while existing automatic methods face challenges in balancing exploration and exploitation. Method: GreenTEA uses a collaborative team of agents within a genetic algorithm framework to iteratively refine prompts based on error sample feedback. Result: Experiments show GreenTEA outperforms human-engineered and state-of-the-art automatic prompt optimization methods across multiple domains. Conclusion: GreenTEA effectively balances exploration and exploitation in prompt optimization, enhancing LLM performance with reduced computational cost. Abstract: High-quality prompts are crucial for Large Language Models (LLMs) to achieve exceptional performance. However, manually crafting effective prompts is labor-intensive and demands significant domain expertise, limiting its scalability. Existing automatic prompt optimization methods either extensively explore new prompt candidates, incurring high computational costs due to inefficient searches within a large solution space, or overly exploit feedback on existing prompts, risking suboptimal optimization because of the complex prompt landscape. To address these challenges, we introduce GreenTEA, an agentic LLM workflow for automatic prompt optimization that balances candidate exploration and knowledge exploitation. It leverages a collaborative team of agents to iteratively refine prompts based on feedback from error samples. An analyzing agent identifies common error patterns resulting from the current prompt via topic modeling, and a generation agent revises the prompt to directly address these key deficiencies. This refinement process is guided by a genetic algorithm framework, which simulates natural selection by evolving candidate prompts through operations such as crossover and mutation to progressively optimize model performance. 
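The evolutionary loop described above (selection, crossover, mutation) can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: a prompt is modeled as a tuple of instruction snippets, `fitness` is a made-up score standing in for LLM evaluation on a dataset, and `mutate` stands in for the generation agent that revises a prompt to address an error topic. All names (`INSTRUCTION_POOL`, `TARGET`, `evolve`) are hypothetical.

```python
import random

# Hypothetical stand-ins: in the paper an LLM scores and rewrites prompts;
# here a prompt is a tuple of instruction snippets and fitness is a toy count.
INSTRUCTION_POOL = [
    "Think step by step.",
    "Check units before answering.",
    "State the final answer on its own line.",
    "Consider edge cases.",
    "Cite the relevant rule.",
    "Keep the answer concise.",
]
TARGET = set(INSTRUCTION_POOL[:4])  # toy notion of "instructions that fix errors"


def fitness(prompt):
    """Toy score: useful instructions present, minus a small length penalty."""
    return len(TARGET & set(prompt)) - 0.1 * len(prompt)


def crossover(a, b, rng):
    """Single-point crossover; duplicates are removed while preserving order."""
    cut_a = rng.randint(0, len(a))
    cut_b = rng.randint(0, len(b))
    return tuple(dict.fromkeys(a[:cut_a] + b[cut_b:]))


def mutate(prompt, rng, rate=0.3):
    """With probability `rate`, append one instruction not yet in the prompt."""
    if rng.random() < rate:
        unused = [s for s in INSTRUCTION_POOL if s not in prompt]
        if unused:
            return prompt + (rng.choice(unused),)
    return prompt


def evolve(population, generations=20, seed=0):
    """Run the loop and return the best prompt plus best-fitness history."""
    rng = random.Random(seed)
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]  # selection (exploitation)
        children = [ranked[0]]                         # elitism keeps the best
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))  # exploration
        population = children
        history.append(fitness(max(population, key=fitness)))
    return max(population, key=fitness), history


initial = [(s,) for s in INSTRUCTION_POOL[2:6]]
best, history = evolve(initial)
print(best, history[-1])
```

Because the best candidate is always carried over (elitism), the best fitness never decreases from one generation to the next; the balance between exploration and exploitation is controlled here only by the mutation rate and the parent fraction.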
Extensive numerical experiments conducted on public benchmark datasets suggest the superior performance of GreenTEA against human-engineered prompts and existing state-of-the-art methods for automatic prompt optimization, covering logical and quantitative reasoning, commonsense, and ethical decision-making.
[2] Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow
Y. Du,C. Guo,W. Wang,G. Tang
Main category: cs.CL
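TL;DR: This paper proposes Cognitive Decision Routing (CDR), a framework grounded in cognitive science that dynamically selects an appropriate reasoning strategy for each query, improving both the performance and the computational efficiency of large language models.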
Details
Motivation: Large language models face a fundamental challenge in deciding when to rely on fast, intuitive responses and when to engage in slower, more deliberate reasoning. Existing models either apply a uniform reasoning depth or rely on computationally expensive methods for every query. Method: A meta-cognitive layer analyzes query complexity along several dimensions: correlation strength between the given information and the required conclusions, domain boundary crossings, stakeholder multiplicity, and uncertainty level. Result: Across diverse reasoning tasks, CDR outperforms uniform deep reasoning approaches while reducing computational cost by 34%. On professional judgment tasks, consistency improves by 23% and accuracy on expert-level evaluations improves by 18%. Conclusion: The CDR framework combines cognitive science principles with AI system design, offering a principled approach to adaptive reasoning in large language models. Abstract: Large Language Models (LLMs) face a fundamental challenge in deciding when to rely on rapid, intuitive responses versus engaging in slower, more deliberate reasoning. Inspired by Daniel Kahneman's dual-process theory and his insights on human cognitive biases, we propose a novel Cognitive Decision Routing (CDR) framework that dynamically determines the appropriate reasoning strategy based on query characteristics. Our approach addresses the current limitations where models either apply uniform reasoning depth or rely on computationally expensive methods for all queries. We introduce a meta-cognitive layer that analyzes query complexity through multiple dimensions: correlation strength between given information and required conclusions, domain boundary crossings, stakeholder multiplicity, and uncertainty levels. Through extensive experiments on diverse reasoning tasks, we demonstrate that CDR achieves superior performance while reducing computational costs by 34% compared to uniform deep reasoning approaches. Our framework shows particular strength in professional judgment tasks, achieving 23% improvement in consistency and 18% better accuracy on expert-level evaluations. This work bridges cognitive science principles with practical AI system design, offering a principled approach to adaptive reasoning in LLMs.
[3] Trust but Verify! A Survey on Verification Design for Test-time Scaling
V Venktesh,Mandeep Rathee,Avishek Anand
Main category: cs.CL
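TL;DR: This survey examines how verifiers used in test-time scaling (TTS) are trained, what types exist, and how they improve the reasoning of large language models, and it systematically organizes and categorizes existing verification methods.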
Details
Motivation: Although verifiers are widely used in test-time scaling (TTS), a comprehensive collection and analysis of their training mechanisms, types, and utility is still missing. Method: The paper surveys the different verification approaches in the literature, presents a unified view of verifier training, and discusses verifier types and applications by category. Result: The paper provides a systematic survey of verifiers covering how they are trained, how they are categorized, and their role in improving reasoning performance, along with an accompanying repository. Conclusion: As a key component of test-time scaling, verifiers can effectively improve model performance, and the survey offers a comprehensive reference framework for future research. Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. The verifiers could be prompt-based, or fine-tuned as a discriminative or generative model to verify process paths, outcomes, or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
[4] Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
Siddhant Bhambri,Upasana Biswas,Subbarao Kambhampati
Main category: cs.CL
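TL;DR: The study finds that interpretable reasoning traces do not necessarily improve task performance: R1 traces yield the best performance but are the hardest to interpret.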
Details
Motivation: The paper asks whether reasoning traces must be interpretable to improve the task performance of large language models, challenging a common implicit assumption. Method: In the Open Book Question-Answering domain, LLaMA and Qwen models are fine-tuned with supervision on four types of reasoning traces, and a human-subject study with 100 participants rates the interpretability of each trace type. Result: Fine-tuning on DeepSeek R1 traces yields the strongest performance, yet participants rated these traces as the least interpretable, revealing a mismatch between interpretability and task performance. Conclusion: The interpretability of intermediate reasoning traces is not directly tied to final task performance, which suggests that intermediate tokens can be decoupled from end-user interpretability. Abstract: Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide inference but also serve as supervision signals for distillation into smaller models. A common but often implicit assumption is that CoT traces should be semantically meaningful and interpretable to the end user. While recent research questions the need for semantic nature of these traces, in this paper, we ask: "Must CoT reasoning traces be interpretable to enhance LLM task performance?" We investigate this question in the Open Book Question-Answering domain by supervised fine-tuning LLaMA and Qwen models on four types of reasoning traces: (1) DeepSeek R1 traces, (2) LLM-generated summaries of R1 traces, (3) LLM-generated post-hoc explanations of R1 traces, and (4) algorithmically generated verifiably correct traces. To quantify the trade-off between interpretability and performance, we further conduct a human-subject study with 100 participants rating the interpretability of each trace type. Our results reveal a striking mismatch: while fine-tuning on R1 traces yields the strongest performance, participants judged these traces to be the least interpretable. These findings suggest that it is useful to decouple intermediate tokens from end user interpretability.
[5] QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting
Nicole Cho,William Watson,Alec Koppel,Sumitra Ganesh,Manuela Veloso
Main category: cs.CL
TL;DR: QueryBandits is a bandit framework that proactively rewrites queries to reduce hallucinations in Large Language Models (LLMs) by leveraging linguistic features and reward models, showing significant improvement over static rewriting strategies.
Details
Motivation: LLMs exhibit advanced reasoning but are prone to hallucinations, which current approaches mostly address through post-hoc filtering rather than preventing triggers. This work aims to proactively shape queries to reduce hallucinations. Method: The authors introduce QueryBandits, a bandit framework that learns query rewrite strategies by modeling the sensitivities of 17 linguistic features to guide LLMs away from hallucinations. It uses Thompson Sampling to adaptively optimize rewrites. Result: On 13 QA benchmarks with 1,050 perturbed queries each, QueryBandits achieved an 87.5% win rate over a no-rewrite baseline and outperformed zero-shot static prompts by 42.6% ("paraphrase") and 60.3% ("expand"). Static rewriting strategies were found to sometimes increase hallucinations. Conclusion: QueryBandits effectively mitigates hallucinations through adaptive query rewriting based on linguistic features, without requiring model retraining. This approach outperforms static rewriting methods and demonstrates the importance of contextual, dynamic interventions. Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have caused higher hallucination prevalence; yet most mitigation work focuses on after-the-fact filtering rather than shaping the queries that trigger them. We introduce QueryBandits, a bandit framework that designs rewrite strategies to maximize a reward model that encapsulates hallucination propensity based upon the sensitivities of 17 linguistic features of the input query, and therefore proactively steers LLMs away from generating hallucinations. Across 13 diverse QA benchmarks and 1,050 lexically perturbed queries per dataset, our top contextual QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a no-rewrite baseline and also outperforms zero-shot static prompting ("paraphrase" or "expand") by 42.6% and 60.3% respectively.
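To make the Thompson Sampling idea concrete, here is a minimal sketch. It is deliberately simplified: the paper's best variant is contextual and conditions on 17 linguistic features, whereas this sketch uses a plain, non-contextual Beta-Bernoulli bandit. The arm names, the simulated success rates, and the `simulate` helper are assumptions for illustration only; they are not taken from the paper.

```python
import random

ARMS = ["no_rewrite", "paraphrase", "expand"]  # illustrative arm names


class BetaThompson:
    """Beta-Bernoulli Thompson Sampling over a fixed set of rewrite arms."""

    def __init__(self, n_arms, seed=0):
        self.rng = random.Random(seed)
        self.successes = [1] * n_arms  # Beta(1, 1) prior for every arm
        self.failures = [1] * n_arms

    def select(self):
        # Sample a plausible success rate per arm and play the largest sample.
        draws = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # A reward of True means the rewritten query produced a good answer.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds=3000, seed=0):
    """Run the bandit against made-up per-arm success rates; return pull counts."""
    env = random.Random(seed + 1)
    bandit = BetaThompson(len(true_rates), seed)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.select()
        reward = env.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


pulls = simulate([0.2, 0.5, 0.8])  # hypothetical rates, one per entry in ARMS
print(dict(zip(ARMS, pulls)))
```

In a contextual version, each arm would instead keep a posterior over regression weights on the query features, so that different queries can favor different rewrite strategies.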
Therefore, we empirically substantiate the effectiveness of QueryBandits in mitigating hallucination via the intervention that takes the form of a query rewrite. Interestingly, certain static prompting strategies, which constitute a considerable share of the current query rewriting literature, have a higher cumulative regret than the no-rewrite baseline, signifying that static rewrites can worsen hallucination. Moreover, we discover that the converged per-arm regression feature weight vectors substantiate that there is no single rewrite strategy optimal for all queries. In this context, guided rewriting via exploiting semantic features with QueryBandits can induce significant shifts in output behavior through forward-pass mechanisms, bypassing the need for retraining or gradient-based adaptation.
[6] Assessing Consciousness-Related Behaviors in Large Language Models Using the Maze Test
Rui A. Pimenta,Tim Schlippe,Kristina Schaaff
Main category: cs.CL
TL;DR: This study evaluates consciousness-like traits in LLMs using the Maze Test, showing that while reasoning models perform better, they still lack persistent self-awareness.
Details
Motivation: To investigate consciousness-like behaviors in Large Language Models (LLMs) and evaluate their spatial awareness, perspective-taking, goal-directed behavior, and temporal sequencing. Method: The Maze Test was used to evaluate 12 leading LLMs across zero-shot, one-shot, and few-shot learning scenarios, based on 13 synthesized consciousness characteristics. Result: Reasoning-capable LLMs outperformed standard versions, with Gemini 2.0 Pro achieving 52.9% Complete Path Accuracy and DeepSeek-R1 reaching 80.5% Partial Path Accuracy. However, the gap between these metrics indicates LLMs struggle to maintain coherent self-models throughout solutions. Conclusion: LLMs show progress in consciousness-related behaviors through reasoning mechanisms, but they lack the integrated, persistent self-awareness characteristic of consciousness. Abstract: We investigate consciousness-like behaviors in Large Language Models (LLMs) using the Maze Test, challenging models to navigate mazes from a first-person perspective. This test simultaneously probes spatial awareness, perspective-taking, goal-directed behavior, and temporal sequencing, all key consciousness-associated characteristics. After synthesizing consciousness theories into 13 essential characteristics, we evaluated 12 leading LLMs across zero-shot, one-shot, and few-shot learning scenarios. Results showed reasoning-capable LLMs consistently outperforming standard versions, with Gemini 2.0 Pro achieving 52.9% Complete Path Accuracy and DeepSeek-R1 reaching 80.5% Partial Path Accuracy. The gap between these metrics indicates LLMs struggle to maintain coherent self-models throughout solutions, a fundamental consciousness aspect. While LLMs show progress in consciousness-related behaviors through reasoning mechanisms, they lack the integrated, persistent self-awareness characteristic of consciousness.
[7] Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
Jonghyun Song,Youngjune Lee,Gyu-Hwung Cho,Ilhyeon Song,Saehun Kim,Yohan Jo
Main category: cs.CL
TL;DR: This paper proposes a framework for bidirectional learning between dense and sparse representations in multimodal retrieval, achieving superior performance with greater efficiency.
Details
Motivation: To overcome the limitations of existing multimodal Learned Sparse Retrieval methods that rely on computationally expensive contrastive pre-training or distillation from a frozen dense model. Method: An integrated similarity score, a weighted sum of dense and sparse similarities, is used as a shared teacher signal for both representations. The final layer of the dense encoder and the sparse projection head are fine-tuned. Result: The sparse retriever outperforms existing sparse baselines and achieves performance comparable to or surpassing dense models on MSCOCO and Flickr30k datasets. Conclusion: The proposed framework allows for bidirectional learning between dense and sparse representations using Self-Knowledge Distillation, enhancing performance while retaining efficiency. Abstract: Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training, or distillation from a frozen dense model, which limits the potential for mutual enhancement. To address these limitations, we propose a simple yet effective framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score, a weighted sum of dense and sparse similarities, which serves as a shared teacher signal for both representations. To ensure efficiency, we fine-tune the final layer of the dense encoder and the sparse projection head, enabling easy adaptation of any existing VLP model.
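The integrated similarity score can be illustrated with a small numerical sketch. Only the weighted sum itself comes from the abstract; the softmax over candidates, the KL-divergence form of the distillation loss, and the function names below are assumptions chosen for illustration, not the authors' exact objective.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn similarity scores over candidates into a probability distribution."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def integrated_scores(dense, sparse, weight=0.5):
    """Weighted sum of dense and sparse similarities (the shared teacher signal)."""
    return [weight * d + (1 - weight) * s for d, s in zip(dense, sparse)]


def self_distillation_loss(dense, sparse, weight=0.5):
    """Pull both the dense and the sparse distribution toward the shared teacher."""
    teacher = softmax(integrated_scores(dense, sparse, weight))
    return (
        kl_divergence(teacher, softmax(dense))
        + kl_divergence(teacher, softmax(sparse))
    )


# One text query scored against three candidate images (made-up numbers).
dense_sims = [2.0, 0.5, 0.1]
sparse_sims = [1.0, 1.5, 0.2]
print(self_distillation_loss(dense_sims, sparse_sims))
```

When the dense and sparse scores already agree, the teacher coincides with both students and the loss is zero; any disagreement yields a positive loss that moves both representations toward the combined signal.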
Experiments on MSCOCO and Flickr30k demonstrate that our sparse retriever not only outperforms existing sparse baselines, but also achieves performance comparable to, or even surpassing, its dense counterparts, while retaining the benefits of sparse models.
[8] Error Reflection Prompting: Can Large Language Models Successfully Understand Errors?
Jason Li,Lauren Yraola,Kevin Zhu,Sean O'Brien
Main category: cs.CL
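TL;DR: This paper proposes Error Reflection Prompting (ERP), a method for strengthening reasoning in language models. Building on Chain-of-Thought (CoT), it adds error recognition and correction to improve accuracy and interpretability.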
Details
Motivation: Prompting methods such as Chain-of-Thought (CoT) have advanced problem solving, but they lack reflection and error correction, so a model may keep repeating its mistakes. Inspired by the human ability to reflect and correct errors, the paper proposes a new method. Method: ERP involves three steps: producing an incorrect answer, recognizing the error, and producing the correct answer. The model can generate the error outlines automatically, integrating error recognition and correction into the reasoning chain. Result: Experiments show that ERP is an effective supplement to conventional CoT, improving reasoning ability and interpretability. Conclusion: ERP gives language models an effective mechanism for reflection and error correction and points to new directions for future research. Abstract: Prompting methods for language models, such as Chain-of-thought (CoT), present intuitive step-by-step processes for problem solving. These methodologies aim to equip models with a better understanding of the correct procedures for addressing a given task. Despite these advancements, CoT lacks the ability of reflection and error correction, potentially causing a model to perpetuate mistakes and errors. Therefore, inspired by the human ability for said tasks, we propose Error Reflection Prompting (ERP) to further enhance reasoning in language models. Building upon CoT, ERP is a method comprised of an incorrect answer, error recognition, and a correct answer. This process enables the model to recognize types of errors and the steps that lead to incorrect answers, allowing the model to better discern which steps to avoid and which to take. The model is able to generate the error outlines itself with automated ERP generation, allowing for error recognition and correction to be integrated into the reasoning chain and produce scalability and reliability in the process. The results demonstrate that ERP serves as a versatile supplement to conventional CoT, ultimately contributing to more robust and capable reasoning abilities along with increased interpretability in how models ultimately reach their errors.
[9] GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs
Nitin Gupta,Pallav Koppisetti,Kausik Lakkaraju,Biplav Srivastava
Main category: cs.CL
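TL;DR: GAICo is a standardized, open-source Python library for evaluating and comparing generative AI outputs. It supports multiple data types and multimodal analysis and aims to improve the comparability, reproducibility, and development efficiency of AI systems.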
Details
Motivation: The wide use of generative AI in high-stakes domains calls for robust, reproducible evaluation, but current evaluation tooling is not standardized and struggles with structured outputs and holistic comparison across modalities, which slows AI system development. Method: The authors developed GAICo (Generative AI Comparator), a unified and extensible framework that provides reference-based metrics and a high-level API for end-to-end analysis, with support for multi-model comparison, visualization, and report generation. Result: GAICo was applied to evaluate and debug complex multimodal AI Travel Assistant pipelines, demonstrating that it can improve evaluation efficiency and reproducibility. Between its release in June 2025 and August 2025 it was downloaded more than 13,000 times, indicating strong community interest. Conclusion: GAICo gives researchers and developers an efficient, standardized evaluation tool that helps make generative AI systems more trustworthy and faster to develop, supporting safer and faster AI deployment. Abstract: The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, as common metrics are often unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo's utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. Since its release on PyPI in Jun 2025, the tool has been downloaded over 13K times, across versions, by Aug 2025, demonstrating growing community interest.
[10] How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
Abdelrahman Abdallah,Bhawna Piryani,Jamshid Mozafari,Mohammed Ali,Adam Jatowt
Main category: cs.CL
TL;DR: A comprehensive evaluation of reranking methods reveals that while LLM-based rerankers perform well on familiar queries, lightweight models offer comparable efficiency and better handle unseen queries.
Details
Motivation: To determine performance differences between LLM-based and lightweight rerankers on new queries. Method: Evaluated 22 reranking methods across multiple benchmarks and a new dataset for unseen queries. Result: LLM rerankers excel on familiar queries, but lightweight models offer comparable efficiency and better generalization. Conclusion: LLM rerankers perform better on familiar queries but have varied generalization on new queries, with lightweight models being comparably efficient. Abstract: In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate in total 22 methods, including 40 variants (depending on used LLM) across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches. 
https://github.com/DataScienceUIBK/llm-reranking-generalization-study
[11] Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation
Arka Mukherjee,Shreya Ghosh
Main category: cs.CL
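TL;DR: This paper presents the first systematic evaluation of the cultural competence of vision-language models (VLMs) on multimodal story generation, revealing both promise and challenges, and introduces a new evaluation framework and dataset.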
Details
Motivation: As vision-language models (VLMs) are deployed across diverse cultural contexts worldwide, ensuring their cultural competence becomes critical for responsible AI systems. However, no prior work has systematically assessed how VLMs adapt in generative tasks to cultural identity cues embedded in textual prompts and visual inputs. Method: The paper proposes a new multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on story generation, providing a comprehensive assessment of VLM cultural competence. Result: The analysis reveals significant cultural adaptation capabilities, including rich culture-specific vocabulary such as names, familial terms, and geographic markers. It also uncovers concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias that contradicts human assessments. Cross-modal evaluation shows that culturally distinct outputs are detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), but visual-cultural understanding remains limited. Conclusion: Multimodal AI shows promise in cultural competence but faces significant challenges, including differences in cultural adaptation across architectures, inverse cultural alignment, and limited visual-cultural understanding. Abstract: As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI.
We publicly release our codebase and data: https://github.com/ArkaMukherjee0/mmCultural
[12] Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities
Bhagesh Gaur,Karan Gupta,Aseem Srivastava,Manish Gupta,Md Shad Akhtar
Main category: cs.CL
TL;DR: This study introduces MH-COPILOT, a reinforcement learning-based system that identifies missing support attributes in mental health posts and generates prompts to improve user engagement.
Details
Motivation: Many posts in Online Mental Health Communities (OMHCs) remain unanswered due to the absence of key support attributes that signal the need for help. Method: The study introduces REDDME, a dataset annotated for support attributes, CueTaxo, a hierarchical taxonomy for question generation, and MH-COPILOT, a reinforcement learning-based system integrating attribute identification, classification, controlled question generation, and reward modeling. Result: Empirical results across four language models show significant improvements in attribute elicitation and user engagement, while human evaluation confirms the model's effectiveness in real-world settings. Conclusion: The proposed MH-COPILOT system effectively improves user engagement in OMHCs by dynamically identifying missing support attributes and generating targeted prompts. Abstract: Online Mental Health Communities (OMHCs) provide crucial peer and expert support, yet many posts remain unanswered due to missing support attributes that signal the need for help. We present a novel framework that identifies these gaps and prompts users to enrich their posts, thereby improving engagement. To support this, we introduce REDDME, a new dataset of 4,760 posts from mental health subreddits annotated for the span and intensity of three key support attributes: event (what happened?), effect (what did the user experience?), and requirement (what support do they need?). Next, we devise a hierarchical taxonomy, CueTaxo, of support attributes for controlled question generation. Further, we propose MH-COPILOT, a reinforcement learning-based system that integrates (a) contextual attribute-span identification, (b) support attribute intensity classification, (c) controlled question generation via a hierarchical taxonomy, and (d) a verifier for reward modeling. Our model dynamically assesses posts for the presence/absence of support attributes, and generates targeted prompts to elicit missing information.
Empirical results across four notable language models demonstrate significant improvements in attribute elicitation and user engagement. A human evaluation further validates the model's effectiveness in real-world OMHC settings.
[13] ReProCon: Scalable and Resource-Efficient Few-Shot Biomedical Named Entity Recognition
Jeongkyun Yoo,Nela Riddle,Andrew Hoblitzell
Main category: cs.CL
TL;DR: ReProCon is a novel few-shot NER framework designed for biomedical domains that effectively addresses data scarcity and label imbalance, achieving near-BERT performance with lower memory usage.
Details
Motivation: Biomedical Named Entity Recognition faces challenges due to data scarcity and imbalanced label distributions, especially with fine-grained entity types, which motivated the development of ReProCon. Method: ReProCon combines multi-prototype modeling, cosine-contrastive learning, and Reptile meta-learning to address challenges in few-shot NER scenarios. Result: ReProCon achieves a macro-F1 score close to BERT-based baselines, remains stable with a low label budget, and outperforms other models like SpanProto and CONTaiNER when expanding categories. Conclusion: ReProCon demonstrates state-of-the-art performance in resource-limited biomedical NER settings, particularly managing data scarcity and label imbalance effectively. Abstract: Named Entity Recognition (NER) in biomedical domains faces challenges due to data scarcity and imbalanced label distributions, especially with fine-grained entity types. We propose ReProCon, a novel few-shot NER framework that combines multi-prototype modeling, cosine-contrastive learning, and Reptile meta-learning to tackle these issues. By representing each category with multiple prototypes, ReProCon captures semantic variability, such as synonyms and contextual differences, while a cosine-contrastive objective ensures strong interclass separation. Reptile meta-updates enable quick adaptation with little data. Using a lightweight fastText + BiLSTM encoder with much lower memory usage, ReProCon achieves a macro-$F_1$ score close to BERT-based baselines (around 99 percent of BERT performance). The model remains stable with a label budget of 30 percent and only drops 7.8 percent in $F_1$ when expanding from 19 to 50 categories, outperforming baselines such as SpanProto and CONTaiNER, which see 10 to 32 percent degradation in Few-NERD. Ablation studies highlight the importance of multi-prototype modeling and contrastive learning in managing class imbalance. 
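The multi-prototype idea can be shown with a small sketch: each category keeps several prototype vectors, and a token embedding is assigned to the category whose closest prototype has the highest cosine similarity. Taking the maximum over a category's prototypes, the two-dimensional vectors, and the label names are assumptions made for illustration; the abstract does not specify the scoring rule, and the contrastive training and Reptile meta-updates are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(embedding, prototypes):
    """Score each label by its best-matching prototype and return the top label."""
    scores = {
        label: max(cosine(embedding, proto) for proto in protos)
        for label, protos in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical prototypes: two for GENE (e.g. a symbol and a long-form name),
# one for DRUG. Real prototypes would live in the encoder's embedding space.
PROTOTYPES = {
    "GENE": [[1.0, 0.0], [0.8, 0.6]],
    "DRUG": [[0.0, 1.0]],
}

label, scores = classify([0.7, 0.7], PROTOTYPES)
print(label, scores)
```

With a single prototype per category, the embedding `[0.7, 0.7]` would be equally close to GENE's `[1.0, 0.0]` and DRUG's `[0.0, 1.0]`; the second GENE prototype resolves the tie, which is the kind of semantic variability multiple prototypes are meant to capture.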
Despite difficulties with label ambiguity, ReProCon demonstrates state-of-the-art performance in resource-limited settings, making it suitable for biomedical applications.
[14] LLMs Learn Constructions That Humans Do Not Know
Jonathan Dunn,Mai Mohamed Eida
Main category: cs.CL
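TL;DR: This paper studies false positive constructions in large language models, finding that models hallucinate grammatical structures that do not exist and exposing a confirmation bias in probing methods.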
Details
Motivation: The paper explores "false positive constructions" in large language models (LLMs): grammatical structures that a model hallucinates but that human intuition does not support. Method: The paper uses a behavioural probing task based on contextual embeddings and a meta-linguistic probing task based on prompts, distinguishing the model's implicit and explicit linguistic knowledge, and it simulates the hypothesis-testing process. Result: Models do hallucinate constructions, and if a linguist had falsely hypothesized that these constructions exist, the hypothesis would have been strongly confirmed by high accuracy. Conclusion: Construction probing methods suffer from a confirmation bias, raising concerns about what unknown and incorrect syntactic knowledge these models possess. Abstract: This paper investigates false positive constructions: grammatical structures which an LLM hallucinates as distinct constructions but which human introspection does not support. Both a behavioural probing task using contextual embeddings and a meta-linguistic probing task using prompts are included, allowing us to distinguish between implicit and explicit linguistic knowledge. Both methods reveal that models do indeed hallucinate constructions. We then simulate hypothesis testing to determine what would have happened if a linguist had falsely hypothesized that these hallucinated constructions do exist. The high accuracy obtained shows that such false hypotheses would have been overwhelmingly confirmed. This suggests that construction probing methods suffer from a confirmation bias and raises the issue of what unknown and incorrect syntactic knowledge these models also possess.
[15] If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition
Shubhashis Roy Dipta,Francis Ferraro
Main category: cs.CL
TL;DR: This study addresses prompt sensitivity and presupposition issues in LLMs for claim verification by proposing a structured framework that improves performance by up to 2-5%.
Details
Motivation: Prior work has shown that presupposition in generated questions can introduce unverified assumptions, leading to inconsistencies in claim verification. Prompt sensitivity also remains a challenge for LLMs, creating a need for more robust solutions. Method: We proposed a structured and robust claim verification framework that operates through presupposition-free, decomposed questions. Experiments were conducted across multiple prompts, datasets, and LLMs to evaluate its effectiveness. Result: Even state-of-the-art models were found to be susceptible to prompt variance and presupposition. The proposed method consistently reduced these issues, achieving up to a 2-5% improvement in performance. Conclusion: The study concludes that prompt sensitivity and presupposition in questions continue to impact the performance of LLMs in claim verification, and the proposed structured framework effectively mitigates these issues. Abstract: Prior work has shown that presupposition in generated questions can introduce unverified assumptions, leading to inconsistencies in claim verification. Additionally, prompt sensitivity remains a significant challenge for large language models (LLMs), resulting in performance variance as high as 3-6%. While recent advancements have reduced this gap, our study demonstrates that prompt sensitivity remains a persistent issue. To address this, we propose a structured and robust claim verification framework that reasons through presupposition-free, decomposed questions. Extensive experiments across multiple prompts, datasets, and LLMs reveal that even state-of-the-art models remain susceptible to prompt variance and presupposition. Our method consistently mitigates these issues, achieving up to a 2-5% improvement.
[16] Learning from Diverse Reasoning Paths with Routing and Collaboration
Zhenyu Lei,Zhen Tan,Song Wang,Yaochen Zhu,Zihan Chen,Yushun Dong,Jundong Li
Main category: cs.CL
TL;DR: QR-Distill improves knowledge transfer from large models to smaller ones by filtering quality reasoning paths, dynamic path assignment, and peer teaching among students.
Details
Motivation: The limitations of conventional token-level supervision and the varying quality of multiple reasoning paths necessitate a more effective distillation method. Method: QR-Distill combines quality filtering, conditional routing, and cooperative peer teaching for efficient knowledge transfer from teacher to student models. Result: Experiments showed QR-Distill's superiority in knowledge transfer, effectively addressing knowledge gaps and biases in reasoning. Conclusion: QR-Distill proves to be more effective than traditional distillation methods, with ablation studies underscoring the importance of its individual components. Abstract: Advances in large language models (LLMs) significantly enhance reasoning capabilities but their deployment is restricted in resource-constrained scenarios. Knowledge distillation addresses this by transferring knowledge from powerful teacher models to compact and transparent students. However, effectively capturing the teacher's comprehensive reasoning is challenging due to conventional token-level supervision's limited scope. Using multiple reasoning paths per query alleviates this problem, but treating each path identically is suboptimal as paths vary widely in quality and suitability across tasks and models. We propose Quality-filtered Routing with Cooperative Distillation (QR-Distill), combining path quality filtering, conditional routing, and cooperative peer teaching. First, quality filtering retains only correct reasoning paths scored by an LLM-based evaluation. Second, conditional routing dynamically assigns paths tailored to each student's current learning state. Finally, cooperative peer teaching enables students to mutually distill diverse insights, addressing knowledge gaps and biases toward specific reasoning styles. Experiments demonstrate QR-Distill's superiority over traditional single- and multi-path distillation methods. 
Ablation studies further highlight the importance of each component, including quality filtering, conditional routing, and peer teaching, in effective knowledge transfer. Our code is available at https://github.com/LzyFischer/Distill.
[17] QFrCoLA: a Quebec-French Corpus of Linguistic Acceptability Judgments
David Beauchemin,Richard Khoury
Main category: cs.CL
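TL;DR: This study introduces QFrCoLA, a Quebec-French dataset of linguistic acceptability judgments, and uses it together with seven other acceptability judgment corpora to benchmark seven language models.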
Details
Motivation: To better understand how large and Transformer-based language models internalize linguistic knowledge, and to evaluate their performance across languages. Method: Seven language models are benchmarked on the QFrCoLA dataset and seven other binary acceptability judgment corpora. Result: On average, fine-tuned Transformer-based language models are strong baselines for most languages, while zero-shot large language models perform poorly on the task. On the QFrCoLA benchmark, a fine-tuned Transformer-based model outperformed the other methods tested, and the pre-trained cross-lingual models do not appear to have acquired linguistic judgment capabilities for Quebec French. Conclusion: QFrCoLA is a challenging dataset for measuring the linguistic judgment capabilities of language models, and the experiments show that fine-tuned Transformer-based models are strong baselines for most languages. Abstract: Large and Transformer-based language models perform outstandingly in various downstream tasks. However, there is limited understanding regarding how these models internalize linguistic knowledge, so various linguistic benchmarks have recently been proposed to facilitate syntactic evaluation of language models across languages. This paper introduces QFrCoLA (Quebec-French Corpus of Linguistic Acceptability Judgments), a normative binary acceptability judgments dataset comprising 25,153 in-domain and 2,675 out-of-domain sentences. Our study leverages the QFrCoLA dataset and seven other linguistic binary acceptability judgment corpora to benchmark seven language models. The results demonstrate that, on average, fine-tuned Transformer-based LMs are strong baselines for most languages and that zero-shot binary classification large language models perform poorly on the task. However, for the QFrCoLA benchmark, on average, a fine-tuned Transformer-based LM outperformed other methods tested. It also shows that pre-trained cross-lingual LLMs selected for our experimentation do not seem to have acquired linguistic judgment capabilities during their pre-training for Quebec French. Finally, our experiment results on QFrCoLA show that our dataset, built from examples that illustrate linguistic norms rather than speakers' feelings, is similar to linguistic acceptability judgment; it is a challenging dataset that can benchmark LMs on their linguistic judgment capabilities.
[18] JUDGEBERT: Assessing Legal Meaning Preservation Between Sentences
David Beauchemin,Michelle Albert-Rochette,Richard Khoury,Pierre-Luc Déziel
Main category: cs.CL
TL;DR: This paper introduces FrJUDGE and JUDGEBERT, a dataset and evaluation metric, respectively, designed to improve legal text simplification by accurately preserving meaning, with JUDGEBERT showing strong performance in correlation with human judgment and passing critical sanity checks.
Details
Motivation: Text simplification in specialized domains like the legal field requires meaning preservation that differs significantly from general text simplification. Existing evaluation metrics are insufficient for capturing this nuance in legal texts. Method: The paper introduces FrJUDGE, a dataset for assessing legal meaning preservation, and JUDGEBERT, a novel evaluation metric for French legal text simplification. The metric's performance is evaluated based on its correlation with human judgment and its ability to pass sanity checks. Result: JUDGEBERT demonstrates a superior correlation with human judgment compared to existing metrics and successfully passes critical sanity checks: it returns a score of 100% for identical sentences and 0% for unrelated sentences. Conclusion: JUDGEBERT has the potential to transform legal NLP applications by ensuring accurate and accessible text simplification for both legal practitioners and lay users. Abstract: Simplifying text while preserving its meaning is a complex yet essential task, especially in sensitive domain applications like legal texts. When applied to a specialized field, like the legal domain, preservation differs significantly from its role in regular texts. This paper introduces FrJUDGE, a new dataset to assess legal meaning preservation between two legal texts. It also introduces JUDGEBERT, a novel evaluation metric designed to assess legal meaning preservation in French legal text simplification. JUDGEBERT demonstrates a superior correlation with human judgment compared to existing metrics. It also passes two crucial sanity checks, while other metrics did not: For two identical sentences, it always returns a score of 100%; on the other hand, it returns 0% for two unrelated sentences. 
Our findings highlight its potential to transform legal NLP applications, ensuring accuracy and accessibility for text simplification for legal practitioners and lay users.
[19] Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling
Yue Zhao,Xiaoyu Wang,Dan Wang,Zhonglin Jiang,Qingqing Gu,Teng Chen,Ningyuan Xi,Jinxian Qu,Yong Chen,Luo Ji
Main category: cs.CL
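TL;DR: This paper proposes DreamCUB, a framework that builds a dialogue world model and achieves state-of-the-art performance on natural language tasks, notably emotion classification and sentiment identification.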
Details
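Motivation: Although world models are widely used in robotics, gaming, and autonomous driving, their application to natural language tasks remains relatively limited. The authors aim to fill this gap by building a dialogue world model that predicts the user's emotion, sentiment, and intention as well as future utterances. Method: By defining a POMDP, emotion, sentiment, and intention are modeled as the user belief and solved by maximizing the information bottleneck. A model-based reinforcement learning framework is then applied to the dialogue system, yielding the DreamCUB framework. Result: The pretrained dialogue world model achieves state-of-the-art performance on emotion classification and sentiment identification, and joint training of the policy, critic, and dialogue world model improves dialogue quality. The approach also maintains a reasonable exploration-exploitation balance and transfers well to out-of-domain scenarios such as empathetic dialogues. Conclusion: With a pretrained dialogue world model, DreamCUB reaches state-of-the-art performance on emotion classification and sentiment identification, and joint training of the policy, critic, and dialogue world model improves dialogue quality. Abstract: World models have been widely utilized in robotics, gaming, and auto-driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which could predict the user's emotion, sentiment, and intention, and future utterances. By defining a POMDP, we argue emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. By this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this approach maintains a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.
[20] ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks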
Hyunjun Kim,Junwoo Ha,Sangyoon Yu,Haon Park
Main category: cs.CL
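TL;DR: This paper introduces the OBJEX(MT) benchmark, which evaluates how well LLMs extract the objective of a conversation under multi-turn jailbreak attacks. It finds that LLM judges often misinfer objectives with high confidence and recommends providing explicit objectives and using selective prediction to manage risk.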
Details
Motivation: The authors ask whether LLMs used as judges can accurately infer the latent objective of a conversation, especially when the goal is distributed across noisy, adversarial, multi-turn jailbreaks. Method: They introduce the OBJEX(MT) benchmark, which requires a model to extract the base objective from a transcript and report its confidence. Accuracy, correctness, and metacognition are evaluated with semantic similarity scoring, ECE, Brier score, and related metrics. Result: claude-sonnet-4 performs best in objective-extraction accuracy (0.515) and calibration (ECE 0.296; Brier 0.324), while gpt-4.1 and Qwen3 reach 0.441 accuracy but show marked overconfidence. Conclusion: LLM judges often misinfer objectives with high confidence when evaluating multi-turn jailbreaks. The authors recommend providing judges with explicit objectives when possible and using selective prediction or abstention to manage risk. Abstract: Large language models (LLMs) are increasingly used as judges of other models, yet it is unclear whether a judge can reliably infer the latent objective of the conversation it evaluates, especially when the goal is distributed across noisy, adversarial, multi-turn jailbreaks. We introduce OBJEX(MT), a benchmark that requires a model to (i) distill a transcript into a single-sentence base objective and (ii) report its own confidence. Accuracy is scored by an LLM judge using semantic similarity between extracted and gold objectives; correctness uses a single human-aligned threshold calibrated once on N=100 items (tau* = 0.61); and metacognition is evaluated with ECE, Brier score, Wrong@High-Conf, and risk-coverage curves. We evaluate gpt-4.1, claude-sonnet-4, and Qwen3-235B-A22B-FP8 on SafeMT Attack_600, SafeMTData_1K, MHJ, and CoSafe. claude-sonnet-4 attains the highest objective-extraction accuracy (0.515) and the best calibration (ECE 0.296; Brier 0.324), while gpt-4.1 and Qwen3 tie at 0.441 accuracy yet show marked overconfidence (mean confidence approx. 0.88 vs. accuracy approx. 0.44; Wrong@0.90 approx. 48-52%). Performance varies sharply across datasets (approx. 0.167-0.865), with MHJ comparatively easy and Attack_600/CoSafe harder. These results indicate that LLM judges often misinfer objectives with high confidence in multi-turn jailbreaks and suggest operational guidance: provide judges with explicit objectives when possible and use selective prediction or abstention to manage risk.
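For readers unfamiliar with the calibration metrics named in this entry, the following is a minimal Python sketch of the Brier score and Expected Calibration Error (ECE) over a set of confidence reports. It is not the authors' scoring code. The function names, the equal-width binning with 10 bins, and the `>=` comparison against the threshold are assumptions made for illustration; only the threshold value 0.61 comes from the abstract.

```python
# Illustrative sketch only; not the OBJEX(MT) scoring code.
TAU = 0.61  # similarity threshold quoted in the abstract


def to_correct(similarities, tau=TAU):
    """Binarize judge similarity scores (assumed rule: correct if score >= tau)."""
    return [1 if s >= tau else 0 for s in similarities]


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    confs = [0.9, 0.9, 0.6, 0.6]
    labels = to_correct([0.80, 0.20, 0.70, 0.50])  # gives [1, 0, 1, 0]
    print(brier_score(confs, labels), expected_calibration_error(confs, labels))
```

A large gap between mean confidence and accuracy, as reported for gpt-4.1 and Qwen3, pushes both numbers up; lower is better for both metrics.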
We release prompts, scoring templates, and complete logs to facilitate replication and analysis.
[21] Unbiased Reasoning for Knowledge-Intensive Tasks in Large Language Models via Conditional Front-Door Adjustment
Bo Zhao,Yinghao Zhang,Ziqi Xu,Yongli Ren,Xiuzhen Zhang,Renqiang Luo,Zaiwen Feng,Feng Xia
Main category: cs.CL
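TL;DR: This paper proposes CFD-Prompting, a new causal prompting framework that constructs counterfactual external knowledge to mitigate internal bias in large language models, improving accuracy and robustness on knowledge-intensive tasks.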
Details
Motivation: Large language models perform poorly on knowledge-intensive tasks, and their internal bias needs to be addressed. Method: The paper proposes Conditional Front-Door Prompting (CFD-Prompting), which constructs counterfactual external knowledge to simulate how the query behaves under varying contexts. Result: Experiments show that CFD-Prompting significantly outperforms existing baselines across multiple LLMs and benchmark datasets. Conclusion: CFD-Prompting is an effective prompting framework that improves LLM performance on knowledge-intensive tasks. Abstract: Large Language Models (LLMs) have shown impressive capabilities in natural language processing but still struggle to perform well on knowledge-intensive tasks that require deep reasoning and the integration of external knowledge. Although methods such as Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) have been proposed to enhance LLMs with external knowledge, they still suffer from internal bias in LLMs, which often leads to incorrect answers. In this paper, we propose a novel causal prompting framework, Conditional Front-Door Prompting (CFD-Prompting), which enables the unbiased estimation of the causal effect between the query and the answer, conditional on external knowledge, while mitigating internal bias. By constructing counterfactual external knowledge, our framework simulates how the query behaves under varying contexts, addressing the challenge that the query is fixed and is not amenable to direct causal intervention. Compared to the standard front-door adjustment, the conditional variant operates under weaker assumptions, enhancing both robustness and generalisability of the reasoning process. Extensive experiments across multiple LLMs and benchmark datasets demonstrate that CFD-Prompting significantly outperforms existing baselines in both accuracy and robustness.
[22] Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs
Sewon Kim,Jiwon Kim,Seungwoo Shin,Hyejin Chung,Daeun Moon,Yejin Kwon,Hyunsoo Yoon
Main category: cs.CL
TL;DR: This paper identifies and addresses the risk of affective hallucination in Large Language Models (LLMs), where models simulate empathy and create false emotional connections. The authors introduce AHaBench and AHaPairs as tools to evaluate and align models with emotionally responsible behavior, showing that DPO fine-tuning effectively reduces this risk.
Details
Motivation: The motivation is to address the risk of affective hallucination, where LLMs simulate empathy and create the illusion of genuine emotional connection despite lacking true affective capacity, potentially leading to psychological harm. Method: The authors introduce AHaBench, a benchmark of 500 mental health-related prompts, and AHaPairs, a 5K-instance preference dataset. These tools are used to evaluate and align LLMs with emotionally responsible behavior through Direct Preference Optimization (DPO). Result: Experiments show that DPO fine-tuning significantly reduces affective hallucination without compromising core model performance. AHaBench is validated as an effective diagnostic tool for capturing affective hallucination. Conclusion: The paper concludes that affective hallucination is a distinct safety concern for Large Language Models (LLMs), and they provide practical resources to develop LLMs that are both factually reliable and psychologically safe. Abstract: Large Language Models (LLMs) are increasingly used in emotionally sensitive interactions, where their simulated empathy can create the illusion of genuine relational connection. We define this risk as Affective Hallucination, the production of emotionally immersive responses that foster illusory social presence despite the model's lack of affective capacity. To systematically diagnose and mitigate this risk, we introduce AHaBench, a benchmark of 500 mental health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. Experiments across multiple model families show that DPO fine-tuning substantially reduces affective hallucination without degrading core reasoning and knowledge performance. 
Human-model agreement analyses confirm that AHaBench reliably captures affective hallucination, validating it as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides practical resources for developing LLMs that are not only factually reliable but also psychologically safe. AHaBench and AHaPairs are accessible via https://huggingface.co/datasets/o0oMiNGo0o/AHaBench, and code for fine-tuning and evaluation is available at https://github.com/0oOMiNGOo0/AHaBench. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
[23] Explaining Black-box Language Models with Knowledge Probing Systems: A Post-hoc Explanation Perspective
Yunxiao Zhao,Hao Xu,Zhiqiang Wang,Xiaoli Li,Jiye Liang,Ru Li
Main category: cs.CL
TL;DR: This paper introduces KnowProb, a new method to analyze pre-trained language models' understanding of implicit knowledge, revealing their limitations and promoting explainable AI research.
Details
Motivation: PLMs, despite their reasoning skills, are black-box models with trustworthiness challenges. This paper aims to probe their understanding of implicit knowledge. Method: KnowProb uses a knowledge-guided probing approach to analyze PLMs' understanding beyond surface-level text content. Result: Current PLMs struggle to capture hidden knowledge and only learn a single representation distribution, as validated through experiments. Conclusion: KnowProb helps identify limitations in current PLMs regarding hidden knowledge capture and promotes explainable model detection research. Abstract: Pre-trained Language Models (PLMs) are trained on large amounts of unlabeled data, yet they exhibit remarkable reasoning skills. However, the trustworthiness challenges posed by these black-box models have become increasingly evident in recent years. To alleviate this problem, this paper proposes a novel post-hoc, knowledge-guided probing approach called KnowProb, which aims to probe whether black-box PLMs understand implicit knowledge beyond the given text, rather than focusing only on the surface-level content of the text. We provide six potential explanations derived from the underlying content of the given text, including three based on knowledge-based understanding and three based on association-based reasoning. In experiments, we validate that current small-scale (or large-scale) PLMs only learn a single distribution of representation, and still face significant challenges in capturing the hidden knowledge behind a given text. Furthermore, we demonstrate that our proposed approach is effective for identifying the limitations of existing black-box models from multiple probing perspectives, which helps researchers advance the study of detecting black-box models in an explainable way.
[24] Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens
Ilias Chalkidis
Main category: cs.CL
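TL;DR: This paper examines how AI alignment is applied in large language models by reviewing the documentation of 6 development initiatives, focusing on the values set for alignment and the data used.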
Details
Motivation: AI alignment is central to the post-training phase and is studied across multiple disciplines, yet the scope of alignment processes has received limited attention. Method: The author surveys and audits publicly available documentation from 6 LLM development initiatives, covering both proprietary and open-weight projects. Result: Findings are documented in detail per initiative, with an overall summary from a value-setting and data-centric perspective. Conclusion: The study reveals how AI alignment is applied in practice from a value-setting and data-centric perspective and discusses a series of broader related concerns. Abstract: AI Alignment, primarily in the form of Reinforcement Learning from Human Feedback (RLHF), has been a cornerstone of the post-training phase in developing Large Language Models (LLMs). It has also been a popular research topic across various disciplines beyond Computer Science, including Philosophy and Law, among others, highlighting the socio-technical challenges involved. Nonetheless, except for the computational techniques related to alignment, there has been limited focus on the broader picture: the scope of these processes, which primarily rely on the selected objectives (values), and the data collected and used to imprint such objectives into the models. This work aims to reveal how alignment is understood and applied in practice from a value-setting and data-centric perspective. For this purpose, we investigate and survey ('audit') publicly available documentation released by 6 LLM development initiatives by 5 leading organizations shaping this technology, focusing on proprietary (OpenAI's GPT, Anthropic's Claude, Google's Gemini) and open-weight (Meta's Llama, Google's Gemma, and Alibaba's Qwen) initiatives, all published in the last 3 years. The findings are documented in detail per initiative, while there is also an overall summary concerning different aspects, mainly from a value-setting and data-centric perspective. On the basis of our findings, we discuss a series of broader related concerns.
[25] ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation
Riccardo Pozzi,Matteo Palmonari,Andrea Coletta,Luigi Bellomarini,Jens Lehmann,Sahar Vahdati
Main category: cs.CL
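TL;DR: ReFactX lets LLMs access external knowledge directly through constrained generation over a prefix-tree index, addressing knowledge gaps and hallucinations while remaining scalable and efficient.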
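The prefix-tree constraint mentioned for ReFactX can be illustrated with a toy Python sketch. This is not the ReFactX implementation: whitespace splitting stands in for a real tokenizer, `score_fn` is a hypothetical stand-in for language-model scores, and the `PrefixTree` class, the `END` marker, and the greedy loop are all assumptions made for illustration.

```python
# Toy sketch of prefix-tree constrained decoding; not the ReFactX code.
END = "<end>"  # hypothetical marker for "a complete fact ends here"


class PrefixTree:
    """Nested-dict trie over token sequences."""

    def __init__(self):
        self.root = {}

    def add(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}

    def allowed_next(self, prefix):
        """Tokens that keep the prefix inside some indexed fact."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return set()
        return set(node)


def constrained_greedy(tree, score_fn):
    """Greedy decoding where only tokens permitted by the tree are candidates."""
    prefix = []
    while True:
        allowed = tree.allowed_next(prefix)
        if not allowed:
            return prefix
        # sorted() makes tie-breaking deterministic
        best = max(sorted(allowed), key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


if __name__ == "__main__":
    tree = PrefixTree()
    for fact in ["Paris is the capital of France",
                 "Paris is in Europe",
                 "Rome is the capital of Italy"]:
        tree.add(fact.split())  # whitespace split stands in for a tokenizer
    preferred = {"Rome", "capital", "Italy"}
    print(constrained_greedy(tree, lambda prefix, tok: 1.0 if tok in preferred else 0.0))
```

Because every step is restricted to children of the current trie node, the decoded sequence is always one of the indexed facts, whatever the scorer prefers.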
Details
Motivation: LLMs generate unreliable responses when they lack the necessary information, which leads to knowledge gaps and hallucinations. Existing approaches such as RAG and tool use rely on additional models or services, resulting in complex pipelines, potential error propagation, and high token consumption. Method: ReFactX performs constrained generation with a pre-built prefix-tree index. Triples from a knowledge graph are verbalized as textual facts, tokenized, and indexed in a prefix tree. During inference, the LLM acquires external knowledge by generating only token sequences that form an existing fact. Result: ReFactX scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results on question answering with minimal generation-time overhead. Conclusion: ReFactX is a scalable method that lets LLMs access external knowledge without relying on retrievers or auxiliary models, effectively addressing knowledge gaps and hallucinations. Abstract: Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized in textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
[26] GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation
Jeongsoo Lee,Daeyong Kwon,Kyohoon Jin
Main category: cs.CL
TL;DR: GRADE is a new evaluation framework for RAG systems that models task difficulty through reasoning depth and semantic distance, enabling detailed performance analysis and improvements in multi-hop reasoning.
Details
Motivation: Current evaluations of RAG systems overlook structural complexity and multi-step reasoning, missing key factors like the interaction between retrieval difficulty and reasoning depth. Method: GRADE constructs a synthetic multi-hop QA dataset from factual news articles using knowledge graphs and semantic clustering, and employs a 2D difficulty matrix combining generator-side and retriever-side difficulty. Result: Experiments show that error rates strongly correlate with GRADE's difficulty measures, validating their diagnostic utility. Conclusion: GRADE provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications by enabling fine-grained analysis of RAG performance. Abstract: Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility.
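One way to picture the 2D difficulty matrix is as a table of error rates indexed by hop count and a binned semantic distance. The Python sketch below is a hedged illustration, not the GRADE code: the record layout `(hops, semantic_distance, is_error)`, the assumption that distances lie in [0, 1], and the bin edges 0.33 and 0.66 are all invented for this example.

```python
# Hypothetical illustration of a hops x semantic-distance error matrix.
from collections import defaultdict


def difficulty_matrix(records, distance_edges=(0.33, 0.66)):
    """Return {(hops, distance_bin): error_rate} for the observed cells.

    records: iterable of (hops, semantic_distance, is_error) tuples.
    distance_edges: assumed bin boundaries; bin index = number of edges exceeded.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for hops, distance, is_error in records:
        distance_bin = sum(distance > edge for edge in distance_edges)
        totals[(hops, distance_bin)] += 1
        errors[(hops, distance_bin)] += int(is_error)
    return {cell: errors[cell] / totals[cell] for cell in totals}


if __name__ == "__main__":
    sample = [
        (2, 0.10, False),
        (2, 0.20, True),
        (3, 0.90, True),
    ]
    for cell, rate in sorted(difficulty_matrix(sample).items()):
        print(cell, rate)
```

Reading error rates cell by cell, rather than as a single aggregate, is what separates failures driven by reasoning depth from failures driven by retrieval distance.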
GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.
[27] DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation
Abdelrahman Abdallah,Jamshid Mozafari,Bhawna Piryani,Adam Jatowt
Main category: cs.CL
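TL;DR: DeAR is an open-source framework for document reranking that uses a dual-stage approach to achieve better accuracy and interpretability, performing strongly on multiple evaluation datasets.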
Details
Motivation: Single models struggle to balance fine-grained relevance scoring with holistic cross-document analysis, so an approach that decouples these tasks is needed. Method: DeAR uses a dual-stage approach. Stage 1 distills token-level relevance signals from a LLaMA teacher model using a hybrid of cross-entropy, RankNet, and KL divergence losses. Stage 2 attaches a LoRA adapter and fine-tunes on GPT-4o-generated chain-of-thought permutations to enable listwise reasoning. Result: DeAR performs strongly on TREC-DL19/20, the BEIR datasets, and NovelEval-2306, surpassing open-source baselines and outperforming GPT-4 by 3.09 points on NovelEval. Conclusion: DeAR offers an efficient and interpretable solution for modern reranking systems and surpasses open-source baselines and GPT-4 on multiple datasets. Abstract: Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact {3, 8}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems. Dataset and code are available at https://github.com/DataScienceUIBK/DeAR-Reranking.
[28] KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF
Jason R Brown,Lennie Wells,Edward James Young,Sergio Bacallado
Main category: cs.CL
TL;DR: This paper introduces KLQ, a new action-value RL method for LM-RLHF that is theoretically equivalent to PPO in a specific sense, despite differing motivations, and demonstrates its effectiveness in language generation tasks.
Details
Motivation: PPO, while effective, has a heuristic motivation and handles KL-divergence in an ad-hoc manner in LM-RLHF. The paper aims to develop a method that better addresses KL-divergence constraints with theoretical grounding. Method: The paper proposes KL-regularised Q-Learning (KLQ), a new action-value RL method for LM-RLHF. Theoretical analysis shows KLQ's equivalence to a version of PPO. The method is benchmarked on summarisation and single-turn dialogue tasks. Result: KLQ performs on par with PPO in optimising the LM-RLHF objective and achieves a higher win-rate in LLM-as-a-judge evaluations. Conclusion: KLQ offers a theoretically grounded alternative to PPO in LM-RLHF, performing as well as or better than PPO in empirical benchmarks. Abstract: Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks: summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
[29] Planning for Success: Exploring LLM Long-term Planning Capabilities in Table Understanding
Thi-Nhung Nguyen,Hoang Ngo,Dinh Phung,Thuy-Trang Vu,Dat Quoc Nguyen
Main category: cs.CL
TL;DR: This paper proposes a novel method leveraging LLMs' long-term planning to enhance table understanding, achieving state-of-the-art results on WikiTableQuestions and TabFact.
Details
Motivation: Existing methods, such as Chain-of-Thought and question decomposition, suffer from a lack of explicit long-term planning and weak inter-step connections, leading to missed constraints within questions. This work aims to overcome these limitations. Method: The approach utilizes long-term planning capabilities of LLMs to create tightly interconnected steps that serve the ultimate goal, minimizing unnecessary details in solving short-term goals. Result: Extensive experiments show that the proposed method outperforms strong baselines on table-based question answering and fact verification tasks. Conclusion: The proposed method enhances table understanding by leveraging the long-term planning capabilities of large language models (LLMs), achieving state-of-the-art performance on WikiTableQuestions and TabFact datasets. Abstract: Table understanding is key to addressing challenging downstream tasks such as table-based question answering and fact verification. Recent works have focused on leveraging Chain-of-Thought and question decomposition to solve complex questions requiring multiple operations on tables. However, these methods often suffer from a lack of explicit long-term planning and weak inter-step connections, leading to missed constraints within questions. In this paper, we propose leveraging the long-term planning capabilities of large language models (LLMs) to enhance table understanding. Our approach enables the execution of a long-term plan, where the steps are tightly interconnected and serve the ultimate goal, an aspect that methods based on Chain-of-Thought and question decomposition lack. In addition, our method effectively minimizes the inclusion of unnecessary details in the process of solving the next short-term goals, a limitation of methods based on Chain-of-Thought.
Extensive experiments demonstrate that our method outperforms strong baselines and achieves state-of-the-art performance on WikiTableQuestions and TabFact datasets.
[30] EduRABSA: An Education Review Dataset for Aspect-based Sentiment Analysis Tasks
Yan Cathy Hua,Paul Denny,Jörg Wicker,Katerina Taskova
Main category: cs.CL
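TL;DR: This paper introduces EduRABSA, the first public, annotated ABSA dataset of education reviews, and ASQE-DPT, an offline manual data annotation tool that generates labelled datasets for comprehensive ABSA tasks.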
Details
Motivation: Applying automatic opinion mining to education is challenging because of content complexity and low-granularity reporting requirements, and existing ABSA research and resources focus mainly on the commercial domain, so the education domain needs a high-quality annotated dataset to advance research. Method: The authors build EduRABSA, the first public, annotated ABSA education review dataset, covering three review subject types (course, teaching staff, university), and present ASQE-DPT, an offline, lightweight, installation-free manual data annotation tool that generates labelled datasets for comprehensive ABSA tasks. Result: The resulting dataset and annotation tool support the ABSA community and the education domain by removing the dataset barrier and enabling the creation and sharing of further resources. Conclusion: This work introduces EduRABSA, the first public, annotated ABSA dataset of education reviews, and ASQE-DPT, an offline manual annotation tool for generating labelled datasets for comprehensive ABSA tasks. These resources help advance research in this under-resourced area, support research transparency and reproducibility, and enable the creation and sharing of further resources. Abstract: Every year, most educational institutions seek and receive an enormous volume of text feedback from students on courses, teaching, and overall experience. Yet, turning this raw feedback into useful insights is far from straightforward. It has been a long-standing challenge to adopt automatic opinion mining solutions for such education review text data due to the content complexity and low-granularity reporting requirements. Aspect-based Sentiment Analysis (ABSA) offers a promising solution with its rich, sub-sentence-level opinion mining capabilities. However, existing ABSA research and resources are very heavily focused on the commercial domain. In education, they are scarce and hard to develop due to limited public datasets and strict data protection. A high-quality, annotated dataset is urgently needed to advance research in this under-resourced area. In this work, we present EduRABSA (Education Review ABSA), the first public, annotated ABSA education review dataset that covers three review subject types (course, teaching staff, university) in the English language and all main ABSA tasks, including the under-explored implicit aspect and implicit opinion extraction. We also share ASQE-DPT (Data Processing Tool), an offline, lightweight, installation-free manual data annotation tool that generates labelled datasets for comprehensive ABSA tasks from a single-task annotation.
Together, these resources contribute to the ABSA community and education domain by removing the dataset barrier, supporting research transparency and reproducibility, and enabling the creation and sharing of further resources. The dataset, annotation tool, and scripts and statistics for dataset processing and sampling are available at https://github.com/yhua219/edurabsa_dataset_and_annotation_tool.
[31] Improving Table Understanding with LLMs and Entity-Oriented Search
Thi-Nhung Nguyen,Hoang Ngo,Dinh Phung,Thuy-Trang Vu,Dat Quoc Nguyen
Main category: cs.CL
TL;DR: This paper proposes an entity-oriented search method to improve table understanding by leveraging semantic similarities and a graph query language, achieving state-of-the-art results on standard benchmarks.
Details
Motivation: Existing methods struggle with unpredictable table content and lack of contextual information, which complicates the reasoning processes of LLMs. Method: An entity-oriented search method that leverages semantic similarities between questions and table data, and uses a graph query language for table understanding. Result: The approach achieves state-of-the-art performances on the WikiTableQuestions and TabFact benchmarks. Conclusion: The proposed entity-oriented search method significantly enhances table understanding with LLMs and establishes a new research direction through the use of a graph query language. Abstract: Our work addresses the challenges of understanding tables. Existing methods often struggle with the unpredictable nature of table content, leading to a reliance on preprocessing and keyword matching. They also face limitations due to the lack of contextual information, which complicates the reasoning processes of large language models (LLMs). To overcome these challenges, we introduce an entity-oriented search method to improve table understanding with LLMs. This approach effectively leverages the semantic similarities between questions and table data, as well as the implicit relationships between table cells, minimizing the need for data preprocessing and keyword matching. Additionally, it focuses on table entities, ensuring that table cells are semantically tightly bound, thereby enhancing contextual clarity. Furthermore, we pioneer the use of a graph query language for table understanding, establishing a new research direction. Experiments show that our approach achieves new state-of-the-art performances on standard benchmarks WikiTableQuestions and TabFact.
[32] GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
Melissa Kazemi Rad,Alberto Purpura,Himanshu Kumar,Emily Chen,Mohammad Shahed Sorower
Main category: cs.CL
TL;DR: GRAID is a novel data augmentation method that uses LLMs to enhance harmful text classification by addressing data scarcity through geometric control and reflective processing, leading to improved model performance.
Details
Motivation: The motivation stems from the challenge of data scarcity in harmful text classification for guardrailing applications, which limits the performance of guardrail models. Method: The research introduces GRAID, a two-stage pipeline that utilizes Large Language Models (LLMs) to generate geometrically controlled examples and augment data through a multi-agentic reflective process, enhancing stylistic diversity and uncovering edge cases. Result: The experiments on two benchmark datasets show that GRAID significantly enhances the performance of downstream guardrail models by improving dataset augmentation. Conclusion: The study concludes that GRAID effectively improves harmful text classification by addressing data scarcity through geometric and reflective AI-driven data augmentation. Abstract: We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark data sets, we demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.
[33] Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages
Yuemei Xu,Kexin Xu,Jian Zhou,Ling Hu,Lin Gui
Main category: cs.CL
TL;DR: BridgeX-ICL improves cross-lingual performance in LLMs by identifying shared neurons and guiding bridge selection using an HSIC-based metric, validated across multiple language pairs.
Details
Motivation: LLMs struggle with low-resource languages and require data-efficient methods to enhance cross-lingual performance without costly fine-tuning. Method: BridgeX-ICL constructs neuron probe data from MUSE bilingual dictionaries, identifies overlapping neurons, and uses an HSIC-based metric to guide optimal bridge selection for cross-lingual in-context learning. Result: BridgeX-ICL was validated across 2 cross-lingual tasks and 15 language pairs, showing effectiveness in improving zero-shot X-ICL and offering insights into multilingual mechanisms in LLMs. Conclusion: BridgeX-ICL provides an effective method for improving zero-shot cross-lingual performance in LLMs by leveraging shared neurons and HSIC-based metrics. Abstract: The current Large Language Models (LLMs) face significant challenges in improving performance on low-resource languages and urgently need data-efficient methods without costly fine-tuning. From the perspective of language-bridge, we propose BridgeX-ICL, a simple yet effective method to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs or not. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly, to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlap neurons, which guides optimal bridge selection. The experiments conducted on 2 cross-lingual tasks and 15 language pairs from 7 diverse families (covering both high-low and moderate-low pairs) validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs.
[34] Token Homogenization under Positional Bias
Viacheslav Yusupov,Danil Maksimov,Ameliia Alaeva,Tatiana Zaitceva,Antipina Anna,Anna Vasileva,Chenlin Liu,Rayuth Chheng,Danil Sazanakov,Andrey Chetvergov,Alina Ermilova,Egor Shvetsov
Main category: cs.CL
TL;DR: This paper explores token homogenization in transformer layers and its link to positional bias in language models, showing that token distinctiveness decreases with processing, particularly at extremal positions.
Details
Motivation: The motivation is to understand token homogenization and its relationship to positional bias in large language models. Method: Layer-wise similarity analysis and controlled experiments were used to examine token homogenization and the impact of positional bias. Result: The results show that tokens systematically lose distinctiveness during processing, especially when biased toward extremal positions. Conclusion: The paper concludes that token homogenization exists and is dependent on positional attention mechanisms. Abstract: This paper investigates token homogenization, the convergence of token representations toward uniformity across transformer layers, and its relationship to positional bias in large language models. We empirically examine whether homogenization occurs and how positional bias amplifies this effect. Through layer-wise similarity analysis and controlled experiments, we demonstrate that tokens systematically lose distinctiveness during processing, particularly when biased toward extremal positions. Our findings confirm both the existence of homogenization and its dependence on positional attention mechanisms.
[35] A Straightforward Pipeline for Targeted Entailment and Contradiction Detection
Antonin Sulc
Main category: cs.CL
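TL;DR: This paper proposes a new method that combines transformer attention mechanisms with NLI models to efficiently identify key semantic relationships between sentences in a text, performing particularly well at identifying premise and contradiction sentences.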
Details
Motivation: Identifying the relationships between sentences in a document is crucial for tasks such as fact-checking, argument mining, and text summarization, but existing methods face a trade-off between attention mechanisms and NLI models. Method: The pipeline first identifies candidate sentences relevant to a user-selected target sentence by aggregating token-level attention scores, then uses a pretrained NLI model to classify each candidate as a premise (entailment) or contradiction. Finally, it filters the NLI-identified relationships with attention-based saliency scores to isolate the most significant semantic relationships. Result: The method efficiently isolates the most significant semantic relationships for any given claim in a text, thereby effectively identifying the key semantic connections between sentences. Conclusion: By combining the strengths of transformer attention mechanisms and NLI models, the method effectively identifies key semantic relationships between sentences in a text and provides an efficient and precise analysis for a specific claim. Abstract: Finding the relationships between sentences in a document is crucial for tasks like fact-checking, argument mining, and text summarization. A key challenge is to identify which sentences act as premises or contradictions for a specific claim. Existing methods often face a trade-off: transformer attention mechanisms can identify salient textual connections but lack explicit semantic labels, while Natural Language Inference (NLI) models can classify relationships between sentence pairs but operate independently of contextual saliency. In this work, we introduce a method that combines the strengths of both approaches for a targeted analysis. Our pipeline first identifies candidate sentences that are contextually relevant to a user-selected target sentence by aggregating token-level attention scores. It then uses a pretrained NLI model to classify each candidate as a premise (entailment) or contradiction. By filtering NLI-identified relationships with attention-based saliency scores, our method efficiently isolates the most significant semantic relationships for any given claim in a text.
[36] The Power of Framing: How News Headlines Guide Search Behavior
Amrit Poudel,Maria Milkowski,Tim Weninger
Main category: cs.CL
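TL;DR: The study shows that search behavior is influenced by headline framing: brief exposure to different linguistic frames changes the direction of users' information seeking.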
Details
Motivation: While framing effects on judgment are well documented, their impact on subsequent search behavior remains unclear. Method: A controlled experiment in which participants issued queries and selected from headlines filtered by specific linguistic frames. Result: Headline framing significantly shaped follow-up queries: conflict and strategy frames disrupted alignment with prior selections, while episodic frames led to more concrete queries than thematic ones. A modest short-term frame persistence effect was also observed, which declined over time. Conclusion: Brief exposure to framing can alter the direction of users' information-seeking behavior. Abstract: Search engines play a central role in how people gather information, but subtle cues like headline framing may influence not only what users believe but also how they search. While framing effects on judgment are well documented, their impact on subsequent search behavior is less understood. We conducted a controlled experiment where participants issued queries and selected from headlines filtered by specific linguistic frames. Headline framing significantly shaped follow-up queries: conflict and strategy frames disrupted alignment with prior selections, while episodic frames led to more concrete queries than thematic ones. We also observed modest short-term frame persistence that declined over time. These results suggest that even brief exposure to framing can meaningfully alter the direction of users' information-seeking behavior.
[37] Geolocation-Aware Robust Spoken Language Identification
Qingzheng Wang,Hye-jin Shim,Jiancheng Sun,Shinji Watanabe
Main category: cs.CL
TL;DR: This paper proposes a geolocation-aware LID approach that incorporates geolocation information into an SSL-based model to better handle dialects and accents, achieving strong performance on multilingual datasets.
Details
Motivation: Existing Self-supervised Learning (SSL) models for Spoken Language Identification (LID) struggle to consistently classify dialects and accents of the same language as a unified class. Method: Introduce geolocation prediction as an auxiliary task and inject predicted vectors into intermediate representations as conditioning signals in an SSL-based LID model. Result: Experiments across six multilingual datasets demonstrate improved robustness to intra-language variations and unseen domains, with new state-of-the-art accuracy on FLEURS (97.7%) and 9.7% relative improvement on ML-SUPERB 2.0 dialect set. Conclusion: The proposed geolocation-aware LID approach improves robustness to intra-language variations and unseen domains, achieving state-of-the-art accuracy on FLEURS and significant improvement on ML-SUPERB 2.0 dialect set. Abstract: While Self-supervised Learning (SSL) has significantly improved Spoken Language Identification (LID), existing models often struggle to consistently classify dialects and accents of the same language as a unified class. To address this challenge, we propose geolocation-aware LID, a novel approach that incorporates language-level geolocation information into the SSL-based LID model. Specifically, we introduce geolocation prediction as an auxiliary task and inject the predicted vectors into intermediate representations as conditioning signals. This explicit conditioning encourages the model to learn more unified representations for dialectal and accented variations. Experiments across six multilingual datasets demonstrate that our approach improves robustness to intra-language variations and unseen domains, achieving new state-of-the-art accuracy on FLEURS (97.7%) and 9.7% relative improvement on ML-SUPERB 2.0 dialect set.
[38] Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models
Tharindu Madusanka,Ian Pratt-Hartmann,Riza Batista-Navarro
Main category: cs.CL
TL;DR: This paper investigates how different computational complexity classes and grammatical constructs impact the ability of transformer-based language models to learn rules of inference for satisfiability problems in natural language, and conducts an empirical study for model evaluation.
Details
Motivation: The motivation behind this paper is the increasing success of applying transformer-based language models to reasoning in natural language, with a focus on satisfiability as a fundamental task. However, the logical variations of satisfiability problems and their impact on TLMs' learning ability have not been adequately discussed in prior research. Method: The authors conduct an empirical study to explore the distribution of satisfiability problems and how problem instances from varying computational complexity classes and having different grammatical constructs impact TLMs' ability to learn rules of inference. Result: The paper's results reveal the effects of different computational complexity classes and grammatical constructs on the ability of TLMs to learn rules of inference, and provide insights into the distribution of satisfiability problems for faithful evaluation of TLMs. Conclusion: The paper emphasizes the impact of varying computational complexity classes and grammatical constructs on TLMs' ability to learn rules of inference, and stresses the need for faithful evaluation of TLMs through an empirical study. Abstract: Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area to which nearly all others can be reduced is that of determining satisfiability. However, from a logical point of view, satisfiability problems vary along various dimensions, which may affect TLMs' ability to learn how to solve them. The problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored the problem of natural language satisfiability, the above-mentioned point has not been discussed adequately. 
Hence, we investigate how problem instances from varying computational complexity classes and having different grammatical constructs impact TLMs' ability to learn rules of inference. Furthermore, to faithfully evaluate TLMs, we conduct an empirical study to explore the distribution of satisfiability problems.
[39] SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization
Sebastian Martinez,Naman Ahuja,Fenil Bardoliya,Chris Bryan,Vivek Gupta
Main category: cs.CL
TL;DR: SPORTSQL is a natural language interface for querying and visualizing dynamic sports data using LLMs and real-time FPL data, allowing non-experts to explore sports statistics conversationally.
Details
Motivation: The authors aim to provide an interactive and natural language-based interface for querying and visualizing dynamic sports data, particularly for non-expert users. Method: The system uses Large Language Models (LLMs) for query parsing, schema linking, and visualization selection on a live, temporally indexed database from real-time Fantasy Premier League (FPL) data. Result: The SPORTSQL system successfully translates user questions into SQL queries on a real-time database and supports both tabular and visual outputs, evaluated using the DSQABENCH benchmark with 1,700+ annotated queries. Conclusion: SPORTSQL enables non-expert users to seamlessly explore evolving sports statistics through a natural, conversational interface by translating questions into SQL and providing visualizations. Abstract: We present a modular, interactive system, SPORTSQL, for natural language querying and visualization of dynamic sports data, with a focus on the English Premier League (EPL). The system translates user questions into executable SQL over a live, temporally indexed database constructed from real-time Fantasy Premier League (FPL) data. It supports both tabular and visual outputs, leveraging the symbolic reasoning capabilities of Large Language Models (LLMs) for query parsing, schema linking, and visualization selection. To evaluate system performance, we introduce the Dynamic Sport Question Answering benchmark (DSQABENCH), comprising 1,700+ queries annotated with SQL programs, gold answers, and database snapshots. Our demo highlights how non-expert users can seamlessly explore evolving sports statistics through a natural, conversational interface.
[40] Quantifying Language Disparities in Multilingual Large Language Models
Songbo Hu,Ivan Vulić,Anna Korhonen
Main category: cs.CL
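TL;DR: This paper proposes a new evaluation framework that uses three interpretable metrics to disentangle confounding factors in multilingual evaluation, yielding a more accurate measurement of model performance and language disparities.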
Details
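Motivation: The work is motivated by the fragmentation and confounding factors that affect current large-scale multilingual evaluations, including differences in target languages, experimental setups, and model choices. By proposing a new evaluation framework, it aims to measure model performance and disparities across languages more accurately. Method: The study proposes a framework that disentangles these confounding variables through three interpretable metrics (the performance realisation ratio, its coefficient of variation, and language potential), enabling a finer-grained and more insightful quantification of performance disparities. The framework is validated through a case study of 13 model variants on 11 multilingual datasets. Result: The framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages. The study also finds that higher overall model performance does not necessarily imply greater fairness across languages. Conclusion: Results of existing large-scale multilingual evaluations are often confounded by multiple factors; introducing three interpretable metrics allows model performance and language disparities to be measured more accurately, especially for low-resource languages. The study stresses that higher overall model performance does not necessarily imply greater fairness across languages. Abstract: Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential--enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
[41] The Impact of Annotator Personas on LLM Behavior Across the Perspectivism Spectrum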
Olufunke O. Sarumi,Charles Welch,Daniel Braun,Jörg Schlötterer
Main category: cs.CL
TL;DR: This paper explores the use of Large Language Models (LLMs) in annotating hate speech and abusiveness by incorporating predefined annotator personas, revealing that LLMs selectively utilize demographic attributes and perform well under weak data perspectivism but do not surpass human annotators in strong perspectivism scenarios.
Details
Motivation: To understand how LLMs can model annotator perspectives in hate speech annotation and assess their performance relative to human annotators and existing modeling techniques. Method: The study evaluates LLM-generated annotations using predefined annotator personas across strong-to-weak data perspectivism spectra and compares them to existing annotator modeling techniques. Result: LLMs selectively use demographic attributes from personas, with some alignment to human annotators. Models without explicit annotator info perform better under weak perspectivism, while LLM performance approaches but does not exceed human annotators in strong perspectivism. Conclusion: LLMs can model annotator perspectives to some extent and perform well under weak data perspectivism, but they still fall short of fully replicating or surpassing human annotators in more personalized, strong perspectivism contexts. Abstract: In this work, we explore the capability of Large Language Models (LLMs) to annotate hate speech and abusiveness while considering predefined annotator personas within the strong-to-weak data perspectivism spectra. We evaluated LLM-generated annotations against existing annotator modeling techniques for perspective modeling. Our findings show that LLMs selectively use demographic attributes from the personas. We identified prototypical annotators, with persona features that show varying degrees of alignment with the original human annotators. Within the data perspectivism paradigm, annotator modeling techniques that do not explicitly rely on annotator information performed better under weak data perspectivism compared to both strong data perspectivism and human annotations, suggesting LLM-generated views tend towards aggregation despite subjective prompting. 
However, for more personalized datasets tailored to strong perspectivism, the performance of LLM annotator modeling approached, but did not exceed, human annotators.
[42] Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models
Xudong Han,Junjie Yang,Tianyang Wang,Ziqian Bi,Junfeng Hao,Junhao Song
Main category: cs.CL
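TL;DR: This survey reviews instruction tuning for large language models, covering data collection, parameter-efficient fine-tuning strategies, and evaluation protocols, and outlines future directions.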
Details
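Motivation: Instruction tuning is a key technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. The survey aims to serve as a practical reference for researchers and practitioners designing LLMs that are both effective and reliably aligned with human intentions. Method: The survey systematically covers the full instruction tuning pipeline, including data collection methodologies, full-parameter and parameter-efficient fine-tuning strategies, and evaluation protocols. Result: The survey groups data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms. Fine-tuning techniques range from conventional supervised training to lightweight approaches such as LoRA and prefix tuning, with a focus on computational efficiency and model reusability. It also discusses the challenges of evaluating faithfulness, utility, and safety in multilingual and multimodal scenarios, and notes the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Conclusion: The survey summarizes the current state of instruction tuning and identifies automated data generation, adaptive optimization, and robust evaluation frameworks as future directions, stressing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. Abstract: Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline, encompassing (i) data collection methodologies, (ii) full-parameter and parameter-efficient fine-tuning strategies, and (iii) evaluation protocols. We categorized data construction into three major paradigms: expert annotation, distillation from larger models, and self-improvement mechanisms, each offering distinct trade-offs between quality, scalability, and resource cost. Fine-tuning techniques range from conventional supervised training to lightweight approaches, such as low-rank adaptation (LoRA) and prefix tuning, with a focus on computational efficiency and model reusability. We further examine the challenges of evaluating faithfulness, utility, and safety across multilingual and multimodal scenarios, highlighting the emergence of domain-specific benchmarks in healthcare, legal, and financial applications. Finally, we discuss promising directions for automated data generation, adaptive optimization, and robust evaluation frameworks, arguing that a closer integration of data, algorithms, and human feedback is essential for advancing instruction-tuned LLMs. 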
This survey aims to serve as a practical reference for researchers and practitioners seeking to design LLMs that are both effective and reliably aligned with human intentions.
[43] Active Domain Knowledge Acquisition with $100 Budget: Enhancing LLMs via Cost-Efficient, Expert-Involved Interaction in Sensitive Domains
Yang Wu,Raha Moraffah,Rujing Yao,Jinhong Yu,Zhimin Tao,Xiaozhong Liu
Main category: cs.CL
TL;DR: This paper introduces PU-ADKA, a budget-aware framework for enhancing LLMs through targeted expert input, validated in both simulations and real-world drug development scenarios.
Details
Motivation: LLMs often lack expert-level knowledge in highly specialized domains like drug discovery, making it challenging to achieve high performance without exceeding budget constraints. Method: PU-ADKA selectively engages domain experts based on availability, expertise, and cost, using simulations on PubMed data and real-world validation with a drug development team. Result: PU-ADKA improved LLM performance in specialized domains under budget limitations and led to the creation of the CKAD benchmark dataset for cost-effective knowledge acquisition. Conclusion: PU-ADKA is an effective framework for enhancing domain-specific LLMs by integrating expert knowledge within a fixed budget, showing promise in specialized fields like drug discovery. Abstract: Large Language Models (LLMs) have demonstrated an impressive level of general knowledge. However, they often struggle in highly specialized and cost-sensitive domains such as drug discovery and rare disease research due to the lack of expert knowledge. In this paper, we propose a novel framework (PU-ADKA) designed to efficiently enhance domain-specific LLMs by actively engaging domain experts within a fixed budget. Unlike traditional fine-tuning approaches, PU-ADKA selectively identifies and queries the most appropriate expert from a team, taking into account each expert's availability, knowledge boundaries, and consultation costs. We train PU-ADKA using simulations on PubMed data and validate it through both controlled expert interactions and real-world deployment with a drug development team, demonstrating its effectiveness in enhancing LLM performance in specialized domains under strict budget constraints. 
In addition to outlining our methodological innovations and experimental results, we introduce a new benchmark dataset, CKAD, for cost-effective LLM domain knowledge acquisition to foster further research in this challenging area.
[44] SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation
Xiaqiang Tang,Yi Wang,Keyu Hu,Rui Xu,Chuang Li,Weigao Sun,Jian Li,Sihong Xie
Main category: cs.CL
TL;DR: SSFO is a self-supervised alignment approach that enhances RAG faithfulness by leveraging Direct Preference Optimization (DPO) without incurring labeling costs or additional inference burden.
Details
Motivation: Faithfulness hallucination in Retrieval-Augmented Generation (RAG) systems remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. Method: SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context and leverages Direct Preference Optimization (DPO) to align model faithfulness. Result: Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Conclusion: SSFO is an effective self-supervised alignment approach for enhancing RAG faithfulness without labeling costs or additional inference burden. Abstract: Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of likelihood displacement, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. 
Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: https://github.com/chkwy/SSFO
[45] ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
Siying Zhou,Yiquan Wu,Hui Chen,Xavier Hu,Kun Kuang,Adam Jatowt,Ming Hu,Chunyan Zheng,Fei Wu
Main category: cs.CL
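TL;DR: This paper presents a new dataset and evaluation method for Chinese legal claim generation, reveals the shortcomings of existing models, and calls for deeper research in this area.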
Details
Motivation: While many studies aim to improve the efficiency of legal professionals, research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper aims to fill that gap by generating legal claims from case facts. Method: The authors construct ClaimGen-CN, the first dataset for the Chinese legal claim generation task, design an evaluation metric covering two dimensions, factuality and clarity, and on this basis conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Result: Current models show limitations in factual precision and expressive clarity, indicating the need for more targeted development in this domain. Conclusion: The paper highlights the limitations of current models in factual precision and expressive clarity, points to the need for more targeted development in this domain, and commits to releasing the dataset publicly to encourage further research. Abstract: Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution. While many works have focused on improving the efficiency of legal professionals, the research on helping non-professionals (e.g., plaintiffs) remains unexplored. This paper explores the problem of legal claim generation based on the given case's facts. First, we construct ClaimGen-CN, the first dataset for Chinese legal claim generation task, from various real-world legal disputes. Additionally, we design an evaluation metric tailored for assessing the generated claims, which encompasses two essential dimensions: factuality and clarity. Building on this, we conduct a comprehensive zero-shot evaluation of state-of-the-art general and legal-domain large language models. Our findings highlight the limitations of the current models in factual precision and expressive clarity, pointing to the need for more targeted development in this domain. To encourage further exploration of this important task, we will make the dataset publicly available.
[46] Routing Distilled Knowledge via Mixture of LoRA Experts for Large Language Model based Bundle Generation
Kaidong Feng,Zhu Sun,Hui Fang,Jie Yang,Wenyuan Liu,Yew-Soon Ong
Main category: cs.CL
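TL;DR: This paper proposes RouteDK, a framework that uses a mixture of LoRA experts architecture and dynamic knowledge integration to mitigate knowledge conflict in knowledge distillation, improving bundle generation accuracy while maintaining computational efficiency.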
Details
Motivation: Large language models show potential for automatic bundle generation but are computationally expensive. Knowledge distillation can improve efficiency, but naively integrating different types of knowledge leads to conflict and degraded performance. Method: The paper proposes RouteDK, a framework that distills knowledge through a mixture of LoRA experts architecture and introduces a dynamic fusion module and an inference-time enhancement module to improve knowledge integration and reduce inference variance. Result: Experiments on three public datasets show that RouteDK achieves accuracy comparable to or better than the teacher model and outperforms state-of-the-art bundle generation methods. Conclusion: RouteDK effectively mitigates the knowledge conflict caused by integrating different types of distilled knowledge, achieving accuracy comparable to or better than the teacher model while maintaining computational efficiency. Abstract: Large Language Models (LLMs) have shown potential in automatic bundle generation but suffer from prohibitive computational costs. Although knowledge distillation offers a pathway to more efficient student models, our preliminary study reveals that naively integrating diverse types of distilled knowledge from teacher LLMs into student LLMs leads to knowledge conflict, negatively impacting the performance of bundle generation. To address this, we propose RouteDK, a framework for routing distilled knowledge through a mixture of LoRA expert architecture. Specifically, we first distill knowledge from the teacher LLM for bundle generation in two complementary types: high-level knowledge (generalizable rules) and fine-grained knowledge (session-specific reasoning). We then train knowledge-specific LoRA experts for each type of knowledge together with a base LoRA expert. For effective integration, we propose a dynamic fusion module, featuring an input-aware router, where the router balances expert contributions by dynamically determining optimal weights based on input, thereby effectively mitigating knowledge conflicts. To further improve inference reliability, we design an inference-time enhancement module to reduce variance and mitigate suboptimal reasoning. Experiments on three public datasets show that our RouteDK achieves accuracy comparable to or even better than the teacher LLM, while maintaining strong computational efficiency. In addition, it outperforms state-of-the-art approaches for bundle generation.
[47] Are You Sure You're Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis
Filippos Ventirozos,Peter Appleby,Matthew Shardlow
Main category: cs.CL
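TL;DR: This paper proposes a new zero-shot approach to aspect-category sentiment analysis that uses the token-level uncertainty scores of large language models, addressing the scarcity of annotated data.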
Details
Motivation: Supervised learning methods are limited in new domains by data scarcity and high annotation costs, and annotation bias can cause models to transfer poorly. Method: The paper proposes new techniques that combine multiple chain-of-thought agents, using the token-level uncertainty scores of large language models for aspect-category sentiment analysis. Result: Using Llama and Qwen variants of different parameter sizes, the paper shows how these approaches can meet practical needs and discusses how to gauge accuracy when labels are scarce. Conclusion: Zero-shot methods are advantageous when annotated data is limited, and the work offers new ideas for evaluating accuracy under label-scarce conditions. Abstract: Aspect-category sentiment analysis provides granular insights by identifying specific themes within product reviews that are associated with particular opinions. Supervised learning approaches dominate the field. However, data is scarce and expensive to annotate for new domains. We argue that leveraging large language models in a zero-shot setting is beneficial where the time and resources required for dataset annotation are limited. Furthermore, annotation bias may lead to strong results using supervised methods but transfer poorly to new domains in contexts that lack annotations and demand reproducibility. In our work, we propose novel techniques that combine multiple chain-of-thought agents by leveraging large language models' token-level uncertainty scores. We experiment with the 3B and 70B+ parameter size variants of Llama and Qwen models, demonstrating how these approaches can fulfil practical needs and opening a discussion on how to gauge accuracy in label-scarce conditions.
[48] From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
Sadia Sultana Chowa,Riasad Alvi,Subhey Sadi Rahman,Md Abdur Rahman,Mohaimenul Azam Khan Raiaan,Md Rafiqul Islam,Mukhtar Hussain,Sami Azam
Main category: cs.CL
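TL;DR: This paper reviews research progress on using large language models as autonomous agents, analyzing architectural design, cognitive mechanisms, and evaluation methods, and proposes future research directions.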
Details
Motivation: With progress toward human-level AI, LLMs are widely used as decision-making agents because they can interpret instructions, manage sequential tasks, and adapt through feedback. This paper reviews recent advances in LLMs as autonomous agents and examines their key design principles and the factors that affect performance. Method: The paper uses a structured analysis that divides LLM agent architectures into single-agent and multi-agent systems and examines strategies for integrating external tools, cognitive mechanisms, and the impact of prompting methods and fine-tuning procedures on agent performance. It also evaluates current benchmarks and assessment protocols and analyzes 68 publicly available datasets. Result: The paper identifies critical findings on verifiable reasoning, capacity for self-improvement, and personalization of LLM-based agents, analyzes their architectural design, cognitive mechanisms, and evaluation methods, and proposes future research directions. Conclusion: The paper summarizes current research on using large language models (LLMs) as autonomous agents and tool users and proposes ten future research directions to address gaps in existing work. Abstract: The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision-making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. We only used the papers published between 2023 and 2025 in conferences of the A* and A rank and Q1 journals. A structured analysis of the LLM agents' architectural design principles, dividing their applications into single-agent and multi-agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLM, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM-based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self-improvement, and the personalization of LLM-based agents. Finally, we have discussed ten future research directions to overcome these gaps.
[49] Handling Students Dropouts in an LLM-driven Interactive Online Course Using Language Models
Yuanchun Wang,Yiyang Fu,Jifan Yu,Daniel Zhang-Li,Zheyuan Zhang,Joy Lim Jia Yin,Yucheng Wang,Peng Zhou,Jing Zhang,Huiqin Liu
Main category: cs.CL
TL;DR: This paper explores dropout prediction and reduction in interactive online courses (MAIC), achieving high prediction accuracy and effectively re-engaging at-risk students through personalized email interventions.
Details
Motivation: The motivation of this paper is to understand the factors leading to dropouts in interactive online learning environments like MAIC, and to develop strategies to predict and reduce these dropouts. Method: The paper uses interaction logs from a specific MAIC course to define dropouts and identify contributing factors. A course-progress-adaptive dropout prediction framework (CPADP) is proposed, and a personalized email recall agent is designed to re-engage at-risk students. Result: The study achieves up to 95.4% accuracy in predicting dropouts using the CPADP framework. The personalized email recall agent is shown to be feasible and effective in re-engaging at-risk students across diverse backgrounds. Conclusion: The study concludes that there is a strong connection between dropout behaviors and textual interaction patterns in MAIC courses. The proposed CPADP framework can effectively predict dropouts, and the personalized email recall agent can successfully re-engage at-risk students. Abstract: Interactive online learning environments, represented by Massive AI-empowered Courses (MAIC), leverage LLM-driven multi-agent systems to transform passive MOOCs into dynamic, text-based platforms, enhancing interactivity through LLMs. This paper conducts an empirical study on a specific MAIC course to explore three research questions about dropouts in these interactive online courses: (1) What factors might lead to dropouts? (2) Can we predict dropouts? (3) Can we reduce dropouts? We analyze interaction logs to define dropouts and identify contributing factors. Our findings reveal strong links between dropout behaviors and textual interaction patterns. We then propose a course-progress-adaptive dropout prediction framework (CPADP) to predict dropouts with at most 95.4% accuracy. Based on this, we design a personalized email recall agent to re-engage at-risk students. Applied in the deployed MAIC system with over 3,000 students, the feasibility and effectiveness of our approach have been validated on students with diverse backgrounds.
[50] CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation
Hunzalah Hassan Bhatti,Youssef Ahmed,Md Arid Hasan,Firoj Alam
Main category: cs.CL
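TL;DR: This paper presents CultranAI, a system that uses data augmentation and LoRA fine-tuning of large language models to improve Arabic cultural knowledge representation, ranking 5th on the PalmX blind test set.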
Details
Motivation: To improve the representation of Arabic cultural knowledge in large language models, the authors took part in the PalmX cultural evaluation shared task. Method: The system uses data augmentation and LoRA fine-tuning of large language models (LLMs); several LLMs were benchmarked to select the best model, and Fanar-1-9B-Instruct was then fine-tuned. Result: On the blind test set the system ranked 5th with an accuracy of 70.50%; on the PalmX development set it reached 84.1%. Conclusion: With data augmentation and LoRA fine-tuning, CultranAI achieves good results on Arabic cultural knowledge representation, reaching 84.1% accuracy on the PalmX development set. Abstract: In this paper, we report our participation in the PalmX cultural evaluation shared task. Our system, CultranAI, focused on data augmentation and LoRA fine-tuning of large language models (LLMs) for Arabic cultural knowledge representation. We benchmarked several LLMs to identify the best-performing model for the task. In addition to utilizing the PalmX dataset, we augmented it by incorporating the Palm dataset and curated a new dataset of over 22K culturally grounded multiple-choice questions (MCQs). Our experiments showed that the Fanar-1-9B-Instruct model achieved the highest performance. We fine-tuned this model on the combined augmented dataset of 22K+ MCQs. On the blind test set, our submitted system ranked 5th with an accuracy of 70.50%, while on the PalmX development set, it achieved an accuracy of 84.1%.
[51] Omne-R1: Learning to Reason with Memory for Multi-hop Question Answering
Boyuan Liu,Feng Ji,Jiayan Nan,Han Zhao,Weiling Chen,Shihao Xu,Xing Zhou
Main category: cs.CL
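TL;DR: This paper proposes Omne-R1, which uses multi-stage training and auto-generated data to substantially improve multi-hop question answering over schema-free knowledge graphs, with especially strong results on complex questions.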
Details
Motivation: Multi-hop question answering over schema-free knowledge graphs is limited, in particular by scarce data and weak performance on complex questions; Omne-R1 is proposed to address this. Method: The paper proposes a multi-stage training workflow with two reinforcement learning phases and one supervised fine-tuning phase, and addresses the shortage of knowledge graphs and QA data by constructing domain-independent knowledge graphs and auto-generating QA pairs. Result: Experiments show significant improvements in multi-hop question answering, most notably on complex 3+ hop questions, and the training framework generalizes well across knowledge domains. Conclusion: Omne-R1 generalizes strongly across knowledge domains and significantly improves multi-hop question answering, with clear gains on more complex 3+ hop questions. Abstract: This paper introduces Omne-R1, a novel approach designed to enhance multi-hop question answering capabilities on schema-free knowledge graphs by integrating advanced reasoning models. Our method employs a multi-stage training workflow, including two reinforcement learning phases and one supervised fine-tuning phase. We address the challenge of limited suitable knowledge graphs and QA data by constructing domain-independent knowledge graphs and auto-generating QA pairs. Experimental results show significant improvements in answering multi-hop questions, with notable performance gains on more complex 3+ hop questions. Our proposed training framework demonstrates strong generalization abilities across diverse knowledge domains.
[52] DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
Haojie Zhang
Main category: cs.CL
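TL;DR: DropLoRA is a pruning-based method that inserts a pruning module into LoRA to simulate dynamic subspace learning, overcoming the static-subspace limitation of conventional LoRA and improving performance at no extra cost.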
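As a rough illustration of the idea summarized in this entry (a pruning step placed between the two low-rank LoRA matrices), the following minimal sketch masks the rank dimension of a LoRA update. It is not the authors' implementation: the Bernoulli mask, the `keep_prob` and `training` parameters, and the function names are assumptions made for illustration only.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def droplora_forward(W, A, B, x, keep_prob=0.5, training=True, rng=random):
    """Compute W x + B (mask * (A x)) with a mask over the rank dimension.

    The Bernoulli mask and the `keep_prob` / `training` parameters are
    illustrative assumptions, not the paper's exact pruning rule.
    """
    rank = len(A)
    if training:
        # Resample the mask on every call so each step uses a different
        # subset of the rank dimensions.
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(rank)]
    else:
        mask = [1.0] * rank  # no pruning at inference in this sketch
    low_rank = [m * z for m, z in zip(mask, matvec(A, x))]
    update = matvec(B, low_rank)
    return [base + delta for base, delta in zip(matvec(W, x), update)]
```

With `keep_prob=1.0` the function reduces to a plain LoRA forward pass, and with `keep_prob=0.0` only the frozen weight `W` contributes.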
Details
Motivation: LoRA-based parameter-efficient fine-tuning (PEFT) methods for large models use low-rank decomposition to approximate updates to model parameters, but compared with full-parameter fine-tuning, low-rank updates often lead to a performance gap on downstream tasks. DropLoRA is introduced to address this. Method: DropLoRA is a pruning-based method that focuses on pruning the rank dimension. Unlike conventional methods, it inserts a pruning module between the two low-rank matrices in LoRA to simulate dynamic subspace learning. Result: Experiments show that when fine-tuning the LLaMA series, DropLoRA consistently outperforms LoRA across a range of large language model generation tasks, including commonsense reasoning, mathematical reasoning, code generation, and instruction following. Conclusion: By inserting a pruning module into LoRA to simulate dynamic subspace learning, DropLoRA overcomes the static-subspace limitation of conventional LoRA and significantly improves performance without additional training or inference cost. Abstract: LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank decomposition to approximate updates to model parameters. However, compared to full-parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conventional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dynamic subspace learning. This dynamic low-rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or inference costs. Our experimental results demonstrate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model generation tasks, including commonsense reasoning, mathematical reasoning, code generation, and instruction-following. Our code is available at https://github.com/TayeeChang/DropLoRA.
[53] Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs
Ryoma Kondo,Riona Matsuoka,Takahiro Yoshida,Kazuyuki Yamasawa,Ryohei Hisano
Main category: cs.CL
TL;DR: This paper presents a legal knowledge graph built from Japanese administrative court decisions that captures the structured legal reasoning using prompt-based LLMs and an ontology of legal inference, enabling more accurate retrieval of legal provisions from factual scenarios.
Details
Motivation: Existing automated approaches, including large language models, often fail to accurately capture legal context, trace the relation of facts to legal norms, or represent the layered structure of judicial reasoning. This work aims to overcome these limitations. Method: The method involves extracting components of legal reasoning using prompt-based large language models, normalizing references to legal provisions, and linking facts, norms, and legal applications through an ontology of legal inference. Result: The system achieves more accurate retrieval of relevant legal provisions from facts compared to large language model baselines and retrieval-augmented methods, as evaluated using expert annotated data. Conclusion: The constructed legal knowledge graph effectively captures the full structure of legal reasoning from court decisions, making implicit reasoning explicit and machine-readable. Abstract: Court judgments reveal how legal rules have been interpreted and applied to facts, providing a foundation for understanding structured legal reasoning. However, existing automated approaches for capturing legal reasoning, including large language models, often fail to identify the relevant legal context, do not accurately trace how facts relate to legal norms, and may misrepresent the layered structure of judicial reasoning. These limitations hinder the ability to capture how courts apply the law to facts in practice. In this paper, we address these challenges by constructing a legal knowledge graph from 648 Japanese administrative court decisions. Our method extracts components of legal reasoning using prompt-based large language models, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference. The resulting graph captures the full structure of legal reasoning as it appears in real court decisions, making implicit reasoning explicit and machine-readable. We evaluate our system using expert annotated data, and find that it achieves more accurate retrieval of relevant legal provisions from facts than large language model baselines and retrieval-augmented methods.
[54] The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness
Sanad Shaban,Nizar Habash
Main category: cs.CL
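TL;DR: This paper proposes the Arabic Generality Score (AGS), a new measure of how general a word is across Arabic dialects, and shows that it is effective for modeling lexical generality.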
Details
Motivation: Arabic dialects form a complex continuum, yet existing models often treat them as discrete categories. ALDi models dialectness as a continuous variable but reduces it to a single dimension, so a more comprehensive measure is needed. Method: The paper introduces the Arabic Generality Score (AGS) and a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus, then trains a regression model to predict AGS in context. Result: The approach outperforms strong baselines, including state-of-the-art dialect identification systems, on a multi-dialect benchmark. Conclusion: AGS is a scalable, linguistically grounded approach that enriches representations of Arabic dialectness. Abstract: Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.
[55] UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Omer Nacar
Main category: cs.CL
TL;DR: This paper evaluates the Arabic language model ALLaM-34B, showing high performance across various tasks including generation, code-switching, and handling of regional dialects, indicating its readiness for real-world deployment.
Details
Motivation: Large language models trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. This research aims to address this gap by evaluating the performance of an Arabic-focused model, ALLaM-34B. Method: An expanded and refined UI-level evaluation of ALLaM-34B was conducted using a prompt pack that spans modern standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety. 115 outputs were collected and scored by three frontier LLM judges. Result: ALLaM-34B showed consistently high performance on generation and code-switching tasks (both averaging 4.92/5), strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts showed stable, reliable performance (4.54/5). Conclusion: ALLaM-34B is a robust and culturally grounded Arabic LLM that demonstrates technical strength and practical readiness for real-world deployment. Abstract: Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning modern standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts times 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
[56] Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
Sameer Komoravolu,Khalil Mrini
Main category: cs.CL
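TL;DR: This paper introduces the Agent-Testing Agent (ATA), a meta-agent for automatically testing LLM agents that generates adversarial tests by several methods and adapts their difficulty to quickly expose an agent's weaknesses.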
Details
Motivation: Evaluation of LLM agents currently relies on static benchmarks and small human studies, and lacks efficient, comprehensive automated testing. Method: ATA combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation, scores each dialogue with an LLM-as-a-Judge (LAAJ) rubric, and adjusts the difficulty of subsequent tests based on that feedback. Result: On a travel planner and a Wikipedia writer agent, ATA surfaces more diverse and severe failures than expert annotators and does so faster, finishing in 20-30 minutes. Conclusion: ATA provides an efficient automated testing method that produces quantitative metrics and qualitative bug reports, supporting the development and evaluation of LLM agents. Abstract: LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20-30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open-source implementation for reproducible agent testing: https://github.com/KhalilMrini/Agent-Testing-Agent
[57] DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards
Aaryaman Kartha,Ahmed Masry,Mohammed Saidul Islam,Thinh Lang,Shadikur Rahman,Ridwan Mahbub,Mizanur Rahman,Mahir Ahmed,Md Rizwan Parvez,Enamul Hoque,Shafiq Joty
Main category: cs.CL
TL;DR: DashboardQA is introduced as the first benchmark to assess how vision-language GUI agents comprehend and interact with real-world dashboards, highlighting the challenges faced by current models.
Details
Motivation: Existing question-answering benchmarks for data visualizations overlook interactivity, limiting their ability to evaluate modern multimodal agents designed for GUI-based reasoning. Method: The authors introduced DashboardQA, a benchmark for assessing vision-language GUI agents' ability to comprehend and interact with real-world dashboards. It includes 112 interactive dashboards and 405 question-answer pairs across five categories. Result: Even the top-performing agents struggled on the DashboardQA benchmark, with Gemini-Pro-2.5 achieving only 38.69% accuracy and the OpenAI CUA agent reaching just 22.69%. Conclusion: Interactive dashboard reasoning is a challenging task for current vision-language models (VLMs), as demonstrated by the performance of leading GUI agents on the DashboardQA benchmark. Abstract: Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs with interactive dashboards spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark's significant difficulty. We release DashboardQA at https://github.com/vis-nlp/DashboardQA
[58] DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization
Aleksandar Pramov,Jiangqin Ma,Bina Patel
Main category: cs.CL
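TL;DR: This paper proposes a lightweight approach to claim normalization that performs well in monolingual conditions but poorly in zero-shot conditions.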
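The retrieval-first, LLM-backed idea behind this entry can be sketched roughly as follows. This is a hedged illustration, not the authors' system: the string-similarity measure from `difflib`, the `threshold` rule for choosing between retrieval and prompting, and the `llm` callable are all assumptions standing in for components the digest does not specify.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a real retriever."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_claim(post, train_pairs, llm, threshold=0.9, k=3):
    """Return a normalized claim for `post`.

    train_pairs: list of (noisy_post, normalized_claim) tuples.
    llm: hypothetical callable (post, examples) -> str that prompts a model
         with the retrieved pairs as in-context examples.
    """
    ranked = sorted(train_pairs, key=lambda pair: similarity(post, pair[0]), reverse=True)
    # Retrieval first: reuse the stored normalization when a near-duplicate exists.
    if ranked and similarity(post, ranked[0][0]) >= threshold:
        return ranked[0][1]
    # Otherwise fall back to the LLM with the k closest pairs as examples.
    return llm(post, ranked[:k])
```

Passing the model as a plain callable keeps the sketch independent of any particular API client.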
Details
Motivation: Claim normalization is an important component of automatic fact-checking systems; it parses typically noisy claim data, such as social media posts, into normalized claims for downstream veracity classification. Method: The paper proposes a lightweight "retrieval-first, LLM-backed" pipeline that either dynamically prompts GPT-4o-mini or retrieves the closest normalization directly from the training dataset. Result: On the official test set the system ranks near the top for most monolingual tracks, taking first place in 7 of the 13 languages, but underperforms in the zero-shot setting. Conclusion: The solution performs well on monolingual tasks but poorly in the zero-shot setting, which exposes its limitations. Abstract: Claim normalization is an integral part of any automatic fact-check verification system. It parses the typically noisy claim data, such as social media posts, into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution consists of a lightweight retrieval-first, LLM-backed pipeline, in which we either dynamically prompt a GPT-4o-mini with in-context examples, or retrieve the closest normalization from the train dataset directly. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 out of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitation of the proposed solution.
[59] MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models
Suramya Jadhav,Abhay Shanbhag,Amogh Thakurdesai,Ridhima Sinare,Ananya Joshi,Raviraj Joshi
Main category: cs.CL
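TL;DR: This work creates a high-quality paraphrase corpus for Marathi and reports experimental results for BERT-based models on it, with the aim of advancing NLP research for low-resource languages.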
Details
Motivation: Indic languages are challenging because of their morphological and syntactic variation, diverse scripts, and limited annotated data, so a high-quality paraphrase corpus is needed to advance NLP research for low-resource languages. Method: The authors create a high-quality Marathi paraphrase corpus of 8,000 sentence pairs and run experiments on it with standard transformer-based BERT models. Result: The paper reports the results of standard transformer-based BERT models on the dataset. Conclusion: The authors release the dataset and models publicly to promote NLP research for low-resource languages. Abstract: Paraphrases are a vital tool to assist language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation tasks. Indic languages are complex in natural language processing (NLP) due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on these datasets. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
[60] Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan,Daniel Wai Kit Chin,Zhengyuan Liu,Nancy F. Chen,Roy Ka-Wei Lee
Main category: cs.CL
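TL;DR: DuET-PD evaluates how large language models respond to misleading and corrective persuasion in multi-turn dialogue, and Holistic DPO training significantly improves both resistance to misinformation and receptiveness to valid corrections.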
Details
Motivation: Large language models struggle in persuasive dialogues to balance gullibility to misinformation against resistance to valid corrections, so new methods are needed to make them more reliable. Method: DuET-PD evaluates multi-turn stance-change dynamics along two dimensions, and Holistic DPO training balances positive and negative persuasion examples to improve model performance. Result: GPT-4o achieves only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion, while Holistic DPO raises Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. Conclusion: DuET-PD and Holistic DPO help make large language models more reliable and adaptable in multi-turn dialogue, so that they better resist misleading information and accept valid corrections. Abstract: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
[61] Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
Linfeng Liu,Saptarshi Ghosh,Tianyu Jiang
Main category: cs.CL
TL;DR: This study examines how verbal multiword expressions hinder machine translation and proposes an LLM-based solution to enhance translation accuracy for specific expression types.
Details
Motivation: VMWEs are challenging for natural language processing, and despite improvements in machine translation models, accurately translating these structures remains problematic. Method: The study evaluates the impact of three VMWE categories on machine translation quality using multiword expression datasets and sentences from machine translation datasets. An LLM-based paraphrasing method is also proposed and tested. Result: Experimental results show that VMWEs reduce translation quality, and the proposed LLM-based paraphrasing approach significantly improves the translation of certain VMWEs like verbal idioms and verb-particle constructions. Conclusion: VMWEs negatively affect translation quality, and an LLM-based paraphrasing approach can improve the translation of verbal idioms and verb-particle constructions. Abstract: Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.
[62] Efficient Zero-Shot Long Document Classification by Reducing Context Through Sentence Ranking
Prathamesh Kokate,Mitali Sarnaik,Manavi Khopade,Mukta Takalikar,Raviraj Joshi
Main category: cs.CL
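TL;DR: This paper proposes an efficient zero-shot method for long document classification based on sentence ranking; by selecting the most informative sentences, it preserves classification accuracy while significantly reducing inference time.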
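To make the sentence-ranking idea in this entry concrete, here is a minimal sketch of TF-IDF-based context reduction that keeps the top-ranked half of a document's sentences. It is an illustration under stated assumptions rather than the paper's code: the regex tokenizer, the scoring formula (sum of length-normalized TF times IDF, with each sentence treated as a document), and the `keep_ratio` parameter are all choices made here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a simplification suited to an English demo)."""
    return re.findall(r"\w+", text.lower())


def rank_sentences(sentences):
    """Score each sentence by the sum of TF-IDF weights of its distinct terms."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        scores.append(sum((count / len(tokens)) * math.log(n / df[term])
                          for term, count in tf.items()))
    return scores


def reduce_context(sentences, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of sentences, in their original order."""
    scores = rank_sentences(sentences)
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

The reduced sentence list would then be joined and passed to an unchanged short-text classifier, which is what keeps the approach zero-shot.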
Details
Motivation: Transformer-based models such as BERT excel at short text classification but struggle with long document classification (LDC) because of input length limits and computational inefficiency. Method: The authors propose an efficient zero-shot approach that uses sentence ranking to reduce the input context without altering the model architecture. Result: Retaining only the top 50% of ranked sentences maintains performance comparable to full-document inference while reducing inference time by up to 35%. Conclusion: Sentence ranking is a simple yet effective technique for scalable zero-shot long document classification. Abstract: Transformer-based models like BERT excel at short text classification but struggle with long document classification (LDC) due to input length limitations and computational inefficiencies. In this work, we propose an efficient, zero-shot approach to LDC that leverages sentence ranking to reduce input context without altering the model architecture. Our method enables the adaptation of models trained on short texts, such as headlines, to long-form documents by selecting the most informative sentences using a TF-IDF-based ranking strategy. Using the MahaNews dataset of long Marathi news articles, we evaluate three context reduction strategies that prioritize essential content while preserving classification accuracy. Our results show that retaining only the top 50% ranked sentences maintains performance comparable to full-document inference while reducing inference time by up to 35%. This demonstrates that sentence ranking is a simple yet effective technique for scalable and efficient zero-shot LDC.
[63] Improving French Synthetic Speech Quality via SSML Prosody Control
Nassima Ould Ouali,Awais Hussain Sani,Ruben Bueno,Jonah Dauvet,Tim Luka Horstmann,Eric Moulines
Main category: cs.CL
TL;DR: This paper introduces an end-to-end pipeline for inserting SSML tags into French text to enhance prosody control in TTS systems, achieving significant improvements in naturalness and performance over existing methods.
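As a small illustration of what inserting SSML tags for prosody control can look like, the sketch below wraps phrases in `<prosody>` elements and adds `<break>` pauses. It only shows the markup-generation step; the function names and default values are assumptions, the predicted pitch, rate, volume, and pause values would come from models such as those described in this entry, and commercial TTS engines differ in which attribute values they accept.

```python
from xml.sax.saxutils import escape, quoteattr


def phrase_to_ssml(text, pitch="+0%", rate="100%", volume="medium", break_ms=0):
    """Wrap one phrase in a <prosody> element, optionally followed by a pause."""
    attrs = f"pitch={quoteattr(pitch)} rate={quoteattr(rate)} volume={quoteattr(volume)}"
    ssml = f"<prosody {attrs}>{escape(text)}</prosody>"
    if break_ms > 0:
        ssml += f'<break time="{int(break_ms)}ms"/>'
    return ssml


def build_ssml(phrases):
    """phrases: list of dicts whose keys match the phrase_to_ssml parameters."""
    body = "".join(phrase_to_ssml(**p) for p in phrases)
    return f"<speak>{body}</speak>"
```

For example, `build_ssml([{"text": "Bonjour", "pitch": "+5%", "break_ms": 300}])` yields a single `<speak>` document containing one prosody-wrapped phrase and a 300 ms pause.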
Details
Motivation: The motivation is to address the lack of expressiveness in synthetic voices due to limited prosody control in commercial TTS systems. Method: The method involves a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one for predicting phrase-break positions and another for regression on prosodic targets, generating SSML markup compatible with commercial TTS systems. Result: Evaluated on a 14-hour French podcast corpus, the method achieved 99.2% F1 for break placement and reduced mean absolute error on pitch, rate, and volume by 25-40% compared to other approaches. Perceptual evaluation showed improved naturalness with a mean opinion score increase from 3.20 to 3.87, and 15 out of 18 listeners preferred the enhanced synthesis. Conclusion: The paper concludes that their proposed end-to-end pipeline significantly improves prosody control in French TTS systems, demonstrating progress in bridging the expressiveness gap between synthetic and natural French speech. Abstract: Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. 
In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at https://github.com/hi-paris/Prosody-Control-French-TTS.
[64] Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
Hyeong Kyu Choi,Xiaojin Zhu,Yixuan Li
Main category: cs.CL
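TL;DR: The study finds that majority voting is the main driver of performance gains in MAD, while inter-agent debate contributes little; simple ensembling remains a reliable alternative.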
Details
Motivation: Although MAD shows promise for improving large language model performance, the key factors behind its effectiveness remain unclear. Method: MAD is decomposed into two key components, Majority Voting and inter-agent Debate, and the contribution of each is assessed through extensive experiments. A theoretical framework that models debate as a stochastic process is also proposed. Result: Majority Voting alone accounts for most of the performance gains typically attributed to MAD. Debate alone does not improve expected correctness, but targeted interventions can meaningfully enhance debate effectiveness. Conclusion: While MAD has potential, simple ensembling methods remain stronger and more reliable alternatives in many practical settings. Abstract: Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components, Majority Voting and inter-agent Debate, and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.
[65] Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design
Yunze Xiao,Lynnette Hui Xian Ng,Jiarui Liu,Mona T. Diab
Main category: cs.CL
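TL;DR: This paper discusses the design of anthropomorphic features in large language models and proposes a unified taxonomy based on the interaction of cues.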
Details
Motivation: Current research on anthropomorphism focuses mainly on risks and offers little design guidance, so a new design perspective is needed. Method: Drawing from multiple disciplines, the paper analyzes the manifestation and effectiveness of cues and proposes a unified taxonomy. Result: An interaction model is proposed in which anthropomorphic design arises from the interaction between designers and interpreters, with cues divided into four dimensions: perceptive, linguistic, behavioral, and cognitive. Conclusion: The authors argue for treating anthropomorphism as a concept of design, provide a unified taxonomy through cues in four dimensions, and advocate function-oriented evaluation of anthropomorphic design. Abstract: Large Language Models (LLMs) increasingly exhibit anthropomorphic characteristics: human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a concept of design that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to the cues. Cues are categorized into four dimensions: perceptive, linguistic, behavioral, and cognitive. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.
[66] CausalSent: Interpretable Sentiment Classification with RieszNet
Daniel Frees,Martin Pollack
Main category: cs.CL
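TL;DR: CausalSent significantly improves treatment effect estimation accuracy through a revised model architecture and reveals the causal effects of text features.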
Details
Motivation: Despite the strong performance of NLP models, their decision-making lacks transparency, so there is a need to improve interpretability and examine the causal effects of text features. Method: A two-headed RieszNet-based neural network architecture is developed, and an ensemble of models is used for an observational case study of the word "love" in IMDB movie reviews. Result: On semi-synthetic IMDB data, CausalSent reduces the MAE of treatment effect estimates by 2-3x, and the presence of the word "love" is found to increase the probability of positive sentiment by 2.9%. Conclusion: The CausalSent framework accurately predicts treatment effects, reveals the causal influence of words such as "love" on sentiment judgments, and improves model interpretability. Abstract: Despite the overwhelming performance improvements offered by recent natural language processing (NLP) models, the decisions made by these models are largely a black box. Towards closing this gap, the field of causal NLP combines causal inference literature with modern NLP models to elucidate causal effects of text features. We replicate and extend Bansal et al.'s work on regularizing text classifiers to adhere to estimated effects, focusing instead on model interpretability. Specifically, we focus on developing a two-headed RieszNet-based neural network architecture which achieves better treatment effect estimation accuracy. Our framework, CausalSent, accurately predicts treatment effects in semi-synthetic IMDB movie reviews, reducing MAE of effect estimates by 2-3x compared to Bansal et al.'s MAE on synthetic Civil Comments data. With an ensemble of validated models, we perform an observational case study on the causal effect of the word "love" in IMDB movie reviews, finding that the presence of the word "love" causes a +2.9% increase in the probability of a positive sentiment.
[67] UQ: Assessing Language Models on Unsolved Questions
Fan Nie,Ken Ziyu Liu,Zihao Wang,Rui Sun,Wei Liu,Weijia Shi,Huaxiu Yao,Linjun Zhang,Andrew Y. Ng,James Zou,Sanmi Koyejo,Yejin Choi,Percy Liang,Niklas Muennighoff
Main category: cs.CL
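TL;DR: This study proposes UQ, an evaluation approach based on unsolved questions that aims to resolve the tension between difficulty and realism in current AI benchmarks.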
Details
Motivation: Current benchmarks face a tension between difficulty and realism, and a new evaluation approach is needed to resolve it. Method: The paper introduces UQ, comprising UQ-Dataset, UQ-Validators, and UQ-Platform, and evaluates models through rule-based filters, LLM judges, human review, validator-assisted screening, and community verification. Result: The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. Conclusion: UQ offers a new way to evaluate frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. Abstract: Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed.
UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
[68] Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions
Nannan Huang,Haytham Fayek,Xiuzhen Zhang
Main category: cs.CL
TL;DR: This paper introduces HGLA pruning, a new method that better maintains or improves fairness in LLM-generated opinion summaries compared to existing pruning techniques.
Details
Motivation: The effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarization where biased outputs could influence public views. Method: The paper proposes High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. The authors conducted experiments on three open-source LLMs using four fairness metrics and compared HGLA with three state-of-the-art pruning methods. Result: Pruning methods have a greater impact on fairness than calibration sets. HGLA demonstrates better fairness preservation or improvement across models and tasks where traditional pruning methods have limitations. Human evaluation confirms HGLA-generated outputs are fairer than those of existing methods. Conclusion: The HGLA pruning method can better maintain or improve fairness in opinion summarization compared to existing pruning methods. Abstract: Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views. In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation.
Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: https://github.com/amberhuang01/HGLA.
[69] Steering When Necessary: Flexible Steering Large Language Models with Backtracking
Jinwei Gan,Zifeng Cheng,Zhiwei Jiang,Cong Wang,Yafeng Yin,Xiang Luo,Yuchen Fu,Qing Gu
Main category: cs.CL
TL;DR: The paper proposes the Flexible Activation Steering with Backtracking (FASB) framework to dynamically determine the necessity and strength of intervention in aligning large language models (LLMs) with desired behaviors, outperforming existing methods.
Details
Motivation: Existing activation steering methods either intervene indiscriminately or rely solely on the question to determine intervention, limiting accurate assessment of intervention strength. Method: The FASB framework dynamically determines the necessity and strength of intervention by tracking the internal states of LLMs during generation, using a backtracking mechanism to correct deviations. Result: Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that the FASB framework outperforms baseline methods. Conclusion: Activation steering is an effective and cost-efficient approach for aligning LLMs with desired behaviors, and the proposed FASB framework outperforms existing methods. Abstract: Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically intervene indiscriminately in all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior.
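The control flow described above (monitor the generation, decide whether to intervene, backtrack, then steer) can be sketched with toy stand-ins. Everything in this block is hypothetical: `next_token` stands in for the language model, `deviation_score` stands in for a probe on its internal states, and the rule that scales steering strength by the score is an assumption, not the FASB formulation.

```python
def generate_with_backtracking(next_token, deviation_score, max_len=8,
                               threshold=0.5, backtrack=2, max_strength=4.0):
    """Toy loop: steer only when a deviation is detected, after backtracking."""
    tokens = []
    strength = 0.0  # 0.0 means no intervention is applied
    while len(tokens) < max_len:
        tokens.append(next_token(tokens, strength))
        score = deviation_score(tokens)
        if strength == 0.0 and score > threshold:
            # Drop the most recent tokens and regenerate them under steering.
            del tokens[-backtrack:]
            strength = max_strength * score
    return tokens, strength


# Hypothetical stand-ins: the unsteered "model" drifts after two tokens.
def toy_next_token(tokens, strength):
    return "bad" if strength == 0.0 and len(tokens) >= 2 else "ok"


def toy_deviation_score(tokens):
    return 1.0 if tokens[-1] == "bad" else 0.0


tokens, strength = generate_with_backtracking(toy_next_token, toy_deviation_score)
print(tokens, strength)
```

In this toy run the deviation is detected at the third token, two tokens are discarded, and the rest of the sequence is generated with a non-zero steering strength.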
Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
[70] EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Jingwen Liu,Kan Jen Cheng,Jiachen Lian,Akshay Anand,Rishi Jain,Faith Qiao,Robin Netzorg,Huang-Cheng Chou,Tingle Li,Guan-Ting Lin,Gopala Anumanchipalli
Main category: cs.CL
TL;DR: This paper introduces EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems, addressing the lack of emotional speech data and providing a method to evaluate emotion transitions.
Details
Motivation: The motivation is to address the lack of a holistic system for evaluating emotional reasoning in spoken dialogue systems, aiming to enhance engagement and context-aware communication. Method: The paper introduces EMO-Reasoning, which uses a curated dataset generated via text-to-speech to simulate diverse emotional states and proposes the Cross-turn Emotion Reasoning Score to evaluate emotion transitions in multi-turn dialogues. Result: By evaluating seven dialogue systems using continuous, categorical, and perceptual metrics, the framework was shown to effectively detect emotional inconsistencies. Conclusion: The paper concludes that EMO-Reasoning provides an effective framework for detecting emotional inconsistencies in dialogue systems, offering insights for their improvement. Abstract: Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.
[71] Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit
Zihao Wei,Liang Pang,Jiahao Liu,Jingcheng Deng,Shicheng Xu,Zenghao Duan,Jingang Wang,Fei Sun,Xunliang Cai,Huawei Shen,Xueqi Cheng
Main category: cs.CL
TL;DR: This paper proposes a method to detect the Reasoning Completion Point (RCP) in large language models (LLMs), reducing token consumption while maintaining or enhancing reasoning accuracy by identifying RCP using heuristic rules.
Details
Motivation: Overthinking in LLMs can degrade performance, increase resource usage, or cause infinite loops. Detecting the Reasoning Completion Point (RCP) is crucial to mitigating overthinking. Method: Mining sensitive and consistent RCP patterns and developing a lightweight thresholding strategy based on heuristic rules to detect the end of the compensatory reasoning stage. Result: Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) show that the method reduces token consumption while preserving or enhancing reasoning accuracy. Conclusion: The proposed method effectively reduces token consumption while maintaining or improving reasoning accuracy by identifying the Reasoning Completion Point (RCP) using a lightweight thresholding strategy based on heuristic rules. Abstract: Large language models (LLMs) enhance complex reasoning tasks by scaling the individual thinking process. However, prior work shows that overthinking can degrade overall performance. Motivated by observed patterns in thinking length and content length, we categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage. Typically, LLMs produce correct answers in the compensatory reasoning stage, whereas reasoning convergence often triggers overthinking, causing increased resource usage or even infinite loops. Therefore, mitigating overthinking hinges on detecting the end of the compensatory reasoning stage, defined as the Reasoning Completion Point (RCP). RCP typically appears at the end of the first complete reasoning cycle and can be identified by querying the LLM sentence by sentence or monitoring the probability of an end-of-thinking token, though these methods lack an efficient and precise balance. To improve this, we mine more sensitive and consistent RCP patterns and develop a lightweight thresholding strategy based on heuristic rules.
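The baseline signal mentioned above, monitoring the probability of an end-of-thinking token, can be turned into an exit rule with a simple threshold. The sketch below is an illustration under stated assumptions: the function name, the default threshold, and the `patience` parameter are hypothetical, and it does not reproduce the RCP patterns or heuristic rules mined in the paper.

```python
def find_exit_step(eot_probs, threshold=0.5, patience=2):
    """Return the first step at which reasoning could be cut off, else None.

    `eot_probs` is assumed to hold, for each reasoning step, the model's
    probability of emitting the end-of-thinking token at that step. The exit
    fires once the probability stays at or above `threshold` for `patience`
    consecutive steps, which filters out isolated spikes.
    """
    streak = 0
    for step, prob in enumerate(eot_probs):
        streak = streak + 1 if prob >= threshold else 0
        if streak >= patience:
            return step
    return None


# The spike at step 1 is ignored; steps 3 and 4 form the first full streak.
print(find_exit_step([0.1, 0.6, 0.2, 0.7, 0.8, 0.9]))  # 4
```

Requiring a short streak rather than a single crossing trades a few extra tokens for fewer premature exits; both parameters would need tuning per model.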
Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) demonstrate that the proposed method reduces token consumption while preserving or enhancing reasoning accuracy.
[72] Weights-Rotated Preference Optimization for Large Language Models
Chenxu Yang,Ruipeng Jia,Mingyu Zheng,Naibin Gu,Zheng Lin,Siyuan Chen,Weichong Yin,Hua Wu,Weiping Wang
Main category: cs.CL
TL;DR: The RoPO algorithm mitigates reward hacking in DPO and improves language model alignment.
Details
Motivation: To address the reward hacking problem that arises when DPO is used to align large language models, which leads to overly long generations, reduced diversity, and catastrophic forgetting of knowledge. Method: The paper proposes a new Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits through KL divergence and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. Result: RoPO achieves up to a 3.27-point improvement on AlpacaEval 2 and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with only 0.015% of the trainable parameters. Conclusion: RoPO effectively alleviates the reward hacking problem of DPO while retaining the knowledge and expressive capabilities acquired during the pre-training and SFT stages. Abstract: Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. As a result, this leads to overly lengthy generation lacking diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying reason behind this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during pre-training and SFT stages. Our RoPO achieves up to a 3.27-point improvement on AlpacaEval 2, and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the trainable parameters, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
[73] SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Tong Bao,Mir Tafseer Nayeem,Davood Rafiei,Chengzhi Zhang
Main category: cs.CL
TL;DR: SurveyGen is a large-scale dataset of over 4,200 human-written surveys and 242,143 cited references, built to improve quality assessment of automatic survey generation.
Details
Motivation: The lack of standardized evaluation datasets critically hampers rigorous assessment of LLM-generated surveys against human-written surveys. Method: The authors build a large-scale dataset named SurveyGen and develop the QUAL-SG framework, which introduces quality-aware indicators to improve survey generation. Result: Experiments show that while semi-automatic pipelines can achieve partially competitive results, fully automatic survey generation still suffers from low citation quality and limited critical analysis. Conclusion: The SurveyGen dataset and the QUAL-SG framework provide useful resources and methods for improving automatic survey generation, but fully automatic approaches need further work to raise generation quality. Abstract: Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement, from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.
[74] CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
Anant Khandelwal,Manish Gupta,Puneet Agrawal
Main category: cs.CL
TL;DR: CoCoA is a new token-level decoding algorithm that resolves knowledge conflicts in large language models and makes generation more faithful.
Details
Motivation: Existing contrastive decoding methods lack adaptability when handling conflicts and can degrade performance in low conflict settings. Method: CoCoA resolves conflicts using confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Result: Across multiple benchmarks, CoCoA improves QA accuracy by up to 9.2 points on average and improves factuality in summarization and LFQA by up to 2.5 points on average. Conclusion: CoCoA enables more informed, context-aware, and more faithful token generation. Abstract: Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.
[75] Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks
Danny Wang,Ruihong Qiu,Guangdong Bai,Zi Huang
Main category: cs.CL
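TL;DR: This paper proposes the TextTopoOOD evaluation framework and the TNT-OOD method for out-of-distribution detection in text-rich networks; by modeling the complex interplay between text and topology, the method performs well across a range of OOD scenarios.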
Details
Motivation: In text-rich networks, textual features intertwine with topological structure. Existing OOD detection methods mainly address label shifts or rudimentary domain-based splits and overlook the intricate textual-structural diversity, so a new approach is needed. Method: The paper proposes TNT-OOD, which uses a cross-attention module to fuse local structure into node-level text representations and a HyperNetwork to generate node-specific transformation parameters, aligning topological and semantic features. Result: Experiments on 11 datasets show that TNT-OOD performs well across four OOD scenarios: attribute-level shifts, structural shifts, thematically-guided label shifts, and domain-based divisions. Conclusion: By modeling the complex interplay between text and topology, TNT-OOD handles OOD detection in text-rich networks effectively across attribute-level, structural, thematically-guided label, and domain-based shifts. Abstract: Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
[76] EMPOWER: Evolutionary Medical Prompt Optimization With Reinforcement Learning
Yinda Chen,Yangfan He,Jing Yang,Dapeng Zhang,Zhenlong Yuan,Muhammad Attique Khan,Jamel Baili,Por Lip Yee
Main category: cs.CL
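TL;DR: This paper introduces EMPOWER, a new medical prompt optimization framework that significantly improves the reliability and clinical utility of large language models in medical applications.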
Details
Motivation: Current optimization approaches inadequately address domain-specific medical knowledge and safety requirements, even though prompt engineering significantly influences the reliability and clinical utility of LLMs in medical applications. Method: The paper proposes EMPOWER, a novel evolutionary framework built on specialized representation learning, multi-dimensional evaluation, and structure-preserving algorithms. Its components are a medical terminology attention mechanism, a comprehensive assessment architecture, a component-level evolutionary algorithm, and a semantic verification module. Result: Evaluation across diagnostic, therapeutic, and educational tasks shows a 24.7% reduction in factually incorrect content, a 19.6% enhancement in domain specificity, and 15.3% higher clinician preference in blinded evaluations. Conclusion: EMPOWER improves the quality of medical prompts, addresses critical challenges in developing clinically appropriate prompts, and facilitates more responsible integration of LLMs into healthcare settings. Abstract: Prompt engineering significantly influences the reliability and clinical utility of Large Language Models (LLMs) in medical applications. Current optimization approaches inadequately address domain-specific medical knowledge and safety requirements. This paper introduces EMPOWER, a novel evolutionary framework that enhances medical prompt quality through specialized representation learning, multi-dimensional evaluation, and structure-preserving algorithms. Our methodology incorporates: (1) a medical terminology attention mechanism, (2) a comprehensive assessment architecture evaluating clarity, specificity, clinical relevance, and factual accuracy, (3) a component-level evolutionary algorithm preserving clinical reasoning integrity, and (4) a semantic verification module ensuring adherence to medical knowledge. Evaluation across diagnostic, therapeutic, and educational tasks demonstrates significant improvements: 24.7% reduction in factually incorrect content, 19.6% enhancement in domain specificity, and 15.3% higher clinician preference in blinded evaluations. The framework addresses critical challenges in developing clinically appropriate prompts, facilitating more responsible integration of LLMs into healthcare settings.
[77] Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models
Wataru Ikeda,Kazuki Yano,Ryosuke Takahashi,Jaesung Lee,Keigo Shibata,Jun Suzuki
Main category: cs.CL
TL;DR: This study shows that concentrating feed-forward networks (FFNs) in the middle layers of Transformer-based language models enhances performance on downstream tasks.
Details
Motivation: The study aims to investigate the layerwise importance of FFNs in Transformer-based language models during pretraining, focusing on how their importance varies depending on their position in the model. Method: An experimental approach was introduced to increase FFN dimensions in some layers while removing FFNs from other layers, maintaining the total parameter count. Models were trained from scratch to evaluate the importance of FFNs in different layer positions. Result: Evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers) showed that concentrating FFNs in the middle layers improved performance on downstream tasks. Conclusion: Concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks. Abstract: This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
[78] SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation
Garima Chhikara,Kripabandhu Ghosh,Abhijnan Chakraborty
Main category: cs.CL
TL;DR: This research proposes a dynamic validation approach and the SMITE algorithm to enhance the fairness and accuracy of Large Language Models in tasks like tabular classification.
Details
Motivation: The motivation is to enhance the performance and fairness of Large Language Models (LLMs) in downstream tasks such as tabular classification, ensuring inclusivity and responsible AI deployment. Method: The study introduces a dynamic validation set that evolves alongside the test set and proposes an iterative algorithm called SMITE to select optimal in-context examples for validation. Result: Experiments across four different LLMs demonstrated that the proposed techniques significantly improve both predictive accuracy and fairness in comparison to baseline methods. Conclusion: This study presents a new method using dynamic validation sets and the SMITE algorithm for selecting optimal in-context examples, enhancing both the predictive accuracy and fairness of LLMs. Abstract: Large Language Models (LLMs) are widely used for downstream tasks such as tabular classification, where ensuring fairness in their outputs is critical for inclusivity, equal representation, and responsible AI deployment. This study introduces a novel approach to enhancing LLM performance and fairness through the concept of a dynamic validation set, which evolves alongside the test set, replacing the traditional static validation approach. We also propose an iterative algorithm, SMITE, to select optimal in-context examples, with each example set validated against its corresponding dynamic validation set. The in-context set with the lowest total error is used as the final demonstration set. Our experiments across four different LLMs show that our proposed techniques significantly improve both predictive accuracy and fairness compared to baseline methods. To our knowledge, this is the first study to apply dynamic validation in the context of in-context learning for LLMs.
[79] ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
Guangwei Zhang,Qisheng Su,Jiateng Liu,Cheng Qian,Yanzhou Pan,Yanjie Fu,Denghui Zhang
Main category: cs.CL
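TL;DR: This study proposes a new method that detects and prevents copyrighted data leakage by analyzing the internal states of large language models, allowing intervention before text is generated and thereby protecting sensitive information.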
Details
Motivation: Large Language Models (LLMs) have transformed Natural Language Processing (NLP), but they may inadvertently expose copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address leaks only after content is generated, which can expose sensitive information. Method: A neural network classifier is trained on a curated dataset of copyrighted materials to identify risks and is integrated with a Retrieval-Augmented Generation (RAG) system. Result: Analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows and ensures compliance with copyright regulations while maintaining high-quality text generation. Conclusion: Analyzing an LLM's internal states makes it possible to detect potential copyrighted data leakage before text is generated, enabling early intervention that prevents disclosure of sensitive information. Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub at https://github.com/changhu73/Internal_states_leakage.
[80] Speculating LLMs' Chinese Training Data Pollution from Their Tokens
Qingjie Zhang,Di Wang,Haoting Qian,Liu Yan,Tianwei Zhang,Ke Xu,Qi Li,Minlie Huang,Hewu Li,Han Qiu
Main category: cs.CL
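TL;DR: This study analyzes the presence of polluted Chinese tokens in LLMs, reveals a large number of pornography- and gambling-related tokens in GPT's vocabulary, and uses experiments to estimate the degree of training data pollution.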
Details
Motivation: Having observed that GPT's vocabulary contains many Chinese tokens related to pornography or online gambling, the study aims to locate Polluted Chinese (PoC) tokens in LLMs and examine how they relate to the training data. Method: (1) PoC tokens are formally defined and classified based on GPT's vocabulary; (2) a PoC token detector is built by fine-tuning an LLM, analyzing each token's semantics together with information from search engines; (3) training data pollution is studied through the appearance of PoC tokens, and the speculation method is validated on datasets such as C4 and Pile. Result: Experiments show that PoC tokens are widespread across multiple LLMs, with GPT's vocabulary faring worst. The study also estimates that webpages related to a specific topic (e.g., "Yui Hatano") make up roughly 0.5% of GPT-4o's training data. Conclusion: Experiments find a high proportion of Polluted Chinese (PoC) tokens in GPT's vocabulary, especially among long Chinese tokens, more than 23% of which relate to pornography or online gambling. The study also validates the accuracy of its speculation method and estimates the proportion of specific webpages in GPT-4o's training data. Abstract: Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT's vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering both each token's semantics and related content from search engines. (3) We study the speculation on the training data pollution via PoC tokens' appearances (token ID). Experiments on GPT and 23 other LLMs indicate that PoC tokens widely exist while GPT's vocabulary behaves the worst: more than 23% of long Chinese tokens (i.e., tokens with more than two Chinese characters) are either porn or online gambling. We validate the accuracy of our speculation method on famous pre-training datasets like C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.
[81] Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation
Changsong Liu,Yizhou Peng,Eng Siong Chng
Main category: cs.CL
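TL;DR: This paper proposes a zero-shot contextual automatic speech recognition method based on the Whisper model that improves recognition accuracy for rare words by synthesizing diverse speech samples and extracting multiple pronunciation variants.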
Details
Motivation: Contextual automatic speech recognition systems struggle with out-of-vocabulary words (such as named entities or rare words) because of limited training data and ambiguous or inconsistent pronunciations. Method: A synthesis-driven multi-pronunciation contextual biasing method is proposed that builds a prefix trie and assigns rewards in a shallow-fusion manner during beam-search decoding to recognize rare words. Result: Evaluation on the Librispeech dataset shows that the method reduces biased word error rate by 42% on test-clean and 43% on test-other while keeping unbiased word error rate essentially unchanged. Conclusion: By using text-to-speech systems to synthesize diverse speech samples containing the target rare words and using a pretrained Whisper model to extract multiple predicted pronunciation variants, the method achieves contextual automatic speech recognition under zero-shot conditions. Abstract: Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, it remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix-trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Afterwards, any recognized variant is mapped back to the original rare word in the final transcription. The evaluation results on the Librispeech dataset show that our method reduces biased word error rate (WER) by 42% on test-clean and 43% on test-other while maintaining unbiased WER essentially unchanged.
[82] DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Kaiwen Yan,Xuanqing Shi,Hongcheng Guo,Wenxuan Wang,Zhuosheng Zhang,Chengwei Qin
Main category: cs.CL
TL;DR: This paper proposes DRQA to make reasoning large language models more efficient by reducing unnecessary reasoning steps, thereby saving computational resources while maintaining or improving accuracy.
Details
Motivation: RLLMs often suffer from overthinking, producing unnecessarily lengthy reasoning chains even for simple questions. The study aims to reduce token usage and computational inefficiency. Method: Dynamic Reasoning Quota Allocation (DRQA), using batch-generated preference data and reinforcement learning to train models to allocate reasoning resources adaptively. Result: DRQA significantly reduces token usage while maintaining or improving answer accuracy on mathematical and scientific reasoning benchmarks. Conclusion: DRQA offers a promising direction for more efficient deployment of RLLMs by mitigating the overthinking problem and enabling fine-grained control of reasoning behaviors. Abstract: Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. 
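The stated preference for responses that are "both accurate and concise" can be illustrated with a small sketch. The abstract does not say how DRQA actually builds its batch-generated preference pairs, so the `Candidate` type, the `build_preference_pair` helper, and the pairing rule below are hypothetical and only show one way such a preference could be encoded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """One sampled response to a question (hypothetical structure)."""
    text: str
    is_correct: bool
    num_tokens: int


def build_preference_pair(
    candidates: List[Candidate],
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) under an assumed rule, or None if no pair exists.

    Assumed rule, not taken from the paper:
    - chosen   = the shortest correct response
    - rejected = the shortest incorrect response if one exists,
                 otherwise the longest correct response if it is
                 strictly longer than the chosen one
    """
    correct = [c for c in candidates if c.is_correct]
    if not correct:
        return None  # nothing accurate to prefer

    chosen = min(correct, key=lambda c: c.num_tokens)

    incorrect = [c for c in candidates if not c.is_correct]
    if incorrect:
        rejected = min(incorrect, key=lambda c: c.num_tokens)
        return chosen, rejected

    longest = max(correct, key=lambda c: c.num_tokens)
    if longest.num_tokens > chosen.num_tokens:
        return chosen, longest  # same accuracy, so prefer the concise one
    return None
```

Pairs of this kind could then feed a preference-based training objective; which objective DRQA uses is not specified beyond "reinforcement learning".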
Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.
[83] Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning
Haijiang Liu,Qiyuan Li,Chao Gao,Yong Cao,Xiangyu Xu,Xun Wu,Daniel Hershcovich,Jinguang Gu
Main category: cs.CL
TL;DR: MARK, a multi-stage reasoning framework inspired by MBTI psychological theory, enhances the accuracy and interpretability of large language models in simulating cultural value survey responses, outperforming baselines by 10% and reducing prediction divergence from human preferences.
Details
Motivation: To enhance the performance of large language models in simulating cultural value survey responses by improving accuracy, steerability, and interpretability. Method: Inspired by the type dynamics theory in the MBTI psychological framework, MARK utilizes human demographic information for simulation through life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Result: Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% in accuracy and reduces the divergence between model predictions and human preferences. Conclusion: The MARK framework improves the accuracy, steerability, and interpretability of large language models in simulating cultural value survey responses, highlighting its potential for zero-shot personalization and aiding social scientists in interpreting model predictions. Abstract: Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.
[84] Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
Dingdong Wang,Junan Li,Mingyu Cui,Dongchao Yang,Xueyuan Chen,Helen Meng
Main category: cs.CL
TL;DR: This paper compares discrete tokens and continuous features in SpeechLLMs for spoken language understanding, showing that continuous features generally perform better across tasks and offering insights into their learning patterns.
Details
Motivation: To explore the performance gap between discrete token and continuous feature approaches in SpeechLLMs and provide insights for advancing spoken language understanding. Method: A fair comparison of SSL-based discrete and continuous features under the same experimental settings, evaluating performance on six spoken language understanding tasks using Qwen1.5-0.5B and Llama3.1-8B LLMs, along with in-depth analyses (efficiency, SSL/LLM layers, robustness). Result: Continuous features generally outperform discrete tokens across various tasks, showing distinct learning and processing characteristics for each method. Conclusion: Continuous features are more effective than discrete tokens in most speech processing tasks, and the study provides valuable insights for improving spoken language understanding in SpeechLLMs. Abstract: With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. 
We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.
[85] ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Manlai Liang,Mandi Liu,Jiangzhou Ji,Huaijun Li,Haobo Yang,Yaohan He,Jinlong Li
Main category: cs.CL
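TL;DR: This paper proposes a context compression pipeline called Intermediate Layer Retrieval (ILRe), which encodes context and retrieves tokens at a designated intermediate decoder layer, substantially reducing computational complexity in long-context scenarios while achieving performance close to or better than the full context.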
Details
Motivation: Large language models (LLMs) suffer from short effective context length, high computational complexity, and large memory overhead when processing long contexts; this paper aims to address these limitations. Method: ILRe consists of three main steps: 1) selecting one intermediate decoder layer offline; 2) streaming the chunked input only up to the selected intermediate layer for encoding; 3) retrieving tokens using the attention scores between the query and the full key cache in the selected layer, with a multi-pooling kernel allocation strategy proposed to preserve semantic completeness. Result: ILRe reduces prefilling complexity from $O(L^2)$ to $O(L)$; without additional fine-tuning or dedicated operator development, it can process a 1M-token request in less than half a minute, a speedup of about 180×, and scores about 79.8 on the RULER-1M benchmark. Conclusion: ILRe is an efficient context compression method for long-context scenarios, offering significant performance gains with low computational resource requirements, and is suited to deploying large language models in practical applications. Abstract: Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernel allocation strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in the long context scenarios. Without additional post training or operator development, ILRe can process a single $1M$-token request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
[86] Pandora: Leveraging Code-driven Knowledge Transfer for Unified Structured Knowledge Reasoning
Yongrui Chen,Junhao He,Linbo Fu,Shenyu Zhang,Rihui Jin,Xinbang Dai,Jiaqi Li,Dehai Min,Nan Hu,Yuxin Zhang,Guilin Qi,Yi Huang,Tongtong Wu
Main category: cs.CL
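TL;DR: This paper proposes Pandora, a new unified structured knowledge reasoning framework that uses a code-based unified knowledge representation and knowledge transfer to overcome the limitations of existing methods in cross-task scenarios, and performs strongly on multiple benchmarks.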
Details
Motivation: Existing Unified Structured Knowledge Reasoning (USKR) methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks and limit their overall performance in cross-task scenarios. Method: The paper first proposes a code-based unified knowledge representation built on the Python Pandas API, then uses knowledge transfer to strengthen the unified reasoning process of LLMs, and adaptively corrects reasoning using feedback from code execution. Result: On six widely used benchmarks, Pandora performs strongly, outperforming existing unified reasoning frameworks and competing effectively with task-specific methods. Conclusion: Through unified knowledge representation and knowledge transfer, Pandora effectively addresses the limitations of existing USKR methods and offers a new approach to structured knowledge reasoning in cross-task scenarios. Abstract: Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks, thereby constraining their overall performance in cross-task scenarios. In this paper, we introduce Pandora, a novel USKR framework that addresses the limitations of existing methods by leveraging two key innovations. First, we propose a code-based unified knowledge representation using Python's Pandas API, which aligns seamlessly with the pre-training of LLMs. This representation facilitates a cohesive approach to handling different structured knowledge sources. Building on this foundation, we employ knowledge transfer to bolster the unified reasoning process of LLMs by automatically building cross-task memory. By adaptively correcting reasoning using feedback from code execution, Pandora showcases impressive unified reasoning capabilities. Extensive experiments on six widely used benchmarks across three SKR tasks demonstrate that Pandora outperforms existing unified reasoning frameworks and competes effectively with task-specific methods.
[87] Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs
Domenico De Cristofaro,Vincenzo Norman Vitale,Alessandro Vietti
Main category: cs.CL
TL;DR: This study evaluates the effectiveness of CNN-extracted features in representing phonetic information for vowels using SVM classifiers on the TIMIT corpus.
Details
Motivation: The motivation is to understand the phonetic representation quality of CNN-extracted features in automatic speech recognition systems. Method: The study uses SVM classifiers to identify front-back vowels using features from MFCCs, MFCCs with formants, and CNN activations. Result: The classification accuracy of SVM classifiers using different feature extraction methods was assessed to evaluate phonetic representation. Conclusion: The study concludes that CNN-extracted features can effectively represent phonetic information for monophthong vowels, comparable to traditional MFCCs. Abstract: Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
[88] Information availability in different languages and various technological constraints related to multilinguism on the Internet
Sonal Khosla,Haridasa Acharya
Main category: cs.CL
TL;DR: This paper explores the linguistic barriers on the Internet due to English dominance and analyzes the need for multilingual solutions to improve accessibility for non-English speakers.
Details
Motivation: The exponential growth of Internet users and the dominance of English online have created accessibility issues for non-English speakers, prompting the need for multilingual solutions. Method: The paper analyzes the growth of the Internet and the linguistic barriers faced by non-English speaking users, while examining technological constraints related to multilingualism. Result: The study highlights the increasing number of non-English speaking Internet users and identifies the technological challenges that must be addressed to support multilingualism online. Conclusion: The paper concludes that despite the dominance of English on the Internet, there is a growing need to address linguistic diversity to make the Internet more accessible to non-English speakers. Abstract: Internet usage has grown exponentially over the last two decades. The number of Internet users grew from 16 million in 1995 to 1,650 million in 2010. The Internet has become a major repository of information catering to almost every area. Since the Internet originated in the USA, an English-speaking country, English dominates the World Wide Web. Although English is a globally accepted language, a large part of the world's population is still unable to access the Internet because of language constraints. It has been estimated that only 20-25% of the world population speaks English as a native language. More and more people are accessing the Internet nowadays, removing cultural and linguistic barriers, and hence the number of non-English-speaking users on the Internet has grown rapidly over the last few years. Although many solutions have been provided to remove the linguistic barriers, there is still a huge gap to be filled. 
This paper attempts to analyze the need for information availability in different languages and the various technological constraints related to multilingualism on the Internet.
[89] Feature-Refined Unsupervised Model for Loanword Detection
Promise Dodzi Kpoglu
Main category: cs.CL
TL;DR: This paper proposes an unsupervised, language-internal method for detecting loanwords across Indo-European languages, leveraging linguistic and statistical cues for improved accuracy and scalability.
Details
Motivation: Prior loanword detection methods rely on language-external information, introducing circularity and constraints; a language-internal approach is needed. Method: The method extracts linguistic features, scores them, and uses probabilistic mapping to iteratively refine results, combining linguistic and statistical cues. Result: The model achieves strong performance on loanword detection across six Indo-European languages, with significant gains in cross-linguistic scalability. Conclusion: The proposed unsupervised method effectively detects loanwords by using language-internal information, outperforming baseline methods particularly on cross-linguistic data. Abstract: We propose an unsupervised method for detecting loanwords, i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.
[90] AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation
Henri Savigny,Bruno Yun
Main category: cs.CL
TL;DR: This paper explores using a large language model for argument mining tasks through multi-task learning and model merging, showing that these approaches improve performance and reduce computational costs.
Details
Motivation: The motivation of the paper is to investigate how a single large language model can be utilized to perform one or multiple argument mining tasks, aiming to improve performance and efficiency through multi-task learning and model merging strategies. Method: The authors constructed a multi-task dataset by converting 19 well-known argument mining datasets into a unified format. They explored training strategies using the Llama-3.1-8B-Instruct model, including fine-tuning on individual tasks, joint fine-tuning on multiple tasks, and merging models fine-tuned on individual tasks. Result: Experiments showed that task-specific fine-tuning significantly improved performance on individual tasks. Multi-task fine-tuning maintained strong performance across related tasks, indicating effective transfer learning. Model merging also yielded competitive results while reducing computational costs compared to full multi-task fine-tuning. Conclusion: The study concludes that task-specific fine-tuning significantly enhances performance across all tasks, multi-task fine-tuning allows for effective transfer learning without performance degradation, and model merging provides a cost-effective alternative to multi-task fine-tuning with competitive results. Abstract: Argument mining is a subfield of argumentation that aims to automatically extract argumentative structures and their relations from natural language texts. This paper investigates how a single large language model can be leveraged to perform one or several argument mining tasks. Our contributions are two-fold. First, we construct a multi-task dataset by surveying and converting 19 well-known argument mining datasets from the literature into a unified format. Second, we explore various training strategies using Meta AI's Llama-3.1-8B-Instruct model: (1) fine-tuning on individual tasks, (2) fine-tuning jointly on multiple tasks, and (3) merging models fine-tuned separately on individual tasks. 
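Strategy (3), merging separately fine-tuned models, can be sketched in its simplest form as uniform parameter averaging. The abstract does not name the merging method the authors used, so the averaging rule, the `StateDict` alias, and the `merge_by_averaging` helper are assumptions for illustration; parameters are plain lists of floats here to keep the example dependency-free.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat weights.
StateDict = Dict[str, List[float]]


def merge_by_averaging(models: List[StateDict]) -> StateDict:
    """Merge models fine-tuned from the same base by averaging each parameter.

    This is uniform weight averaging, one common merging baseline; it is an
    assumed choice and may differ from what the paper evaluates.
    """
    if not models:
        raise ValueError("need at least one model to merge")

    reference = models[0]
    for model in models[1:]:
        if model.keys() != reference.keys():
            raise ValueError("models must share the same parameter names")
        for name in reference:
            if len(model[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")

    merged: StateDict = {}
    for name in reference:
        # zip(*) walks the same position of this parameter across all models.
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(values) / len(models) for values in columns]
    return merged
```

Averaging touches each parameter once and needs no further gradient steps, which is why merging is cheap relative to a full multi-task fine-tuning run.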
Our experiments show that task-specific fine-tuning significantly improves individual performance across all tasks. Moreover, multi-task fine-tuning maintains strong performance without degradation, suggesting effective transfer learning across related tasks. Finally, we demonstrate that model merging offers a viable compromise: it yields competitive performance while mitigating the computational costs associated with full multi-task fine-tuning.
[91] Debiasing Multilingual LLMs in Cross-lingual Latent Space
Qiwei Peng,Guimin Hu,Yekun Chai,Anders Søgaard
Main category: cs.CL
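TL;DR: This paper proposes a new method that performs debiasing in a joint latent space, significantly improving debiasing effectiveness and cross-lingual transferability.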
Details
Motivation: Previous studies found that applying debiasing techniques directly to LLM representations yields limited cross-lingual transferability, so this paper seeks a more effective debiasing approach. Method: An autoencoder trained on parallel TED talk scripts is used to construct a well-aligned cross-lingual latent space, and two debiasing techniques are applied in that space. Result: Experiments show that autoencoders effectively construct a well-aligned cross-lingual latent space and that applying debiasing techniques in this space significantly improves both debiasing performance and cross-lingual transferability. Conclusion: Debiasing in a joint latent space is more effective than operating directly on LLM representations, improving overall debiasing performance and cross-lingual transferability. Abstract: Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.
[92] Understanding Subword Compositionality of Large Language Models
Qiwei Peng,Yekun Chai,Anders Søgaard
Main category: cs.CL
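TL;DR: This paper studies how large language models compose subwords into whole-word representations and finds that the models fall into three distinct patterns of composition strategy.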
Details
Motivation: Large language models need to compose subword representations effectively into meaningful word-level representations, so it is important to understand how they do so. Method: The paper runs a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Result: The analysis shows that these LLMs exhibit three distinct patterns of structural similarity across layers, are highly sensitive to semantic decomposability, and also show three distinct patterns when probed for sensitivity to formal features. Conclusion: The paper concludes that the five large language model (LLM) families can be classified into three distinct groups, reflecting their different strategies for composing subword information. Abstract: Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional patterns in how LLMs encode and integrate subword information.
[93] German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
Miriam Anschütz,Thanh Mai Pham,Eslam Nasrallah,Maximilian Müller,Cristian-George Craciun,Georg Groh
Main category: cs.CL
TL;DR: The paper presents German4All, a large-scale German dataset for readability-controlled paraphrasing, and a model achieving state-of-the-art performance in text simplification.
Details
Motivation: The motivation for this research is the essential need for paraphrasing texts across different complexity levels to create accessible texts tailored toward diverse reader groups. Method: The paper introduces German4All, a large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases, synthesized using GPT-4. The dataset and a readability-controlled paraphrasing model are evaluated through human and LLM-based judgments. Result: The result is a state-of-the-art performance in German text simplification using the trained open-source readability-controlled paraphrasing model on the German4All dataset, which spans five readability levels and comprises over 25,000 samples. Conclusion: The paper concludes that the German4All dataset and the trained model enable more nuanced and reader-specific adaptations in German text simplification, and they are open-sourced to encourage further research on multi-level paraphrasing. Abstract: The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
[94] A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models
Oleg Silcenco,Marcos R. Machad,Wallace C. Ugulino,Daniel Braun
Main category: cs.CL
TL;DR: A new multilingual dataset for aspect-based sentiment analysis is introduced, and GPT-4 and LLaMA-3 models are evaluated on it, with GPT-4 showing superior performance.
Details
Motivation: Aspect-based sentiment analysis provides deeper insights than traditional sentiment analysis by associating sentiment with specific aspects. This study aims to establish a baseline performance on a newly introduced dataset. Method: A manually annotated multilingual dataset was created, and GPT-4 and LLaMA-3 models were evaluated for performance on this dataset. Result: Both GPT-4 and LLaMA-3 achieved over 85% accuracy, with GPT-4 outperforming LLaMA-3 on all relevant metrics. Conclusion: GPT-4 outperforms LLaMA-3 in aspect-based sentiment analysis on the newly introduced dataset. Abstract: Aspect-based sentiment analysis enhances sentiment detection by associating it with specific aspects, offering deeper insights than traditional sentiment analysis. This study introduces a manually annotated dataset of 10,814 multilingual customer reviews covering brick-and-mortar retail stores, labeled with eight aspect categories and their sentiment. Using this dataset, the performance of GPT-4 and LLaMA-3 in aspect-based sentiment analysis is evaluated to establish a baseline for the newly introduced data. The results show both models achieving over 85% accuracy, while GPT-4 outperforms LLaMA-3 overall with regard to all relevant metrics.
[95] Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni,Mohammed Haddou,Jackie Chi Kit Cheung,Golnoosh Farnadi
Main category: cs.CL
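TL;DR: This paper discusses the use of large language models as judges (LLJs) in natural language generation (NLG) evaluation, argues that the adoption of LLJs has outpaced rigorous scrutiny of their reliability and validity, analyzes four core assumptions behind LLJs using measurement theory from the social sciences, and stresses the need for more responsible evaluation practices.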
Details
Motivation: The rise of large language models (LLMs) poses new challenges for evaluating natural language generation systems; LLJs are seen as a promising alternative to traditional evaluation methods, but their validity has not been sufficiently studied. Method: Drawing on measurement theory from the social sciences, the paper critically assesses four core assumptions underlying the use of LLJs and analyzes three applications (text summarization, data annotation, and safety alignment). Result: The paper finds that the assumptions behind LLJs may be challenged by the inherent limitations of LLMs, LLJs, or current NLG evaluation practices, suggesting that the current enthusiasm for LLJs may be premature. Conclusion: The paper calls for more responsible practices in evaluating LLJs so that their growing role supports, rather than undermines, progress in NLG. Abstract: Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible evaluation practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
[96] How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi,Xuefei Ning,Roy Schwartz,Iryna Gurevych
Main category: cs.CL
TL;DR: This study shows that quantization has mixed effects on model bias: it can reduce toxicity but may increase stereotypes and unfairness, especially in aggressive compression scenarios.
Details
Motivation: The motivation is to understand how quantization, a technique used to improve model efficiency, affects model bias, particularly regarding individual demographic subgroups. Method: The authors conducted a comprehensive evaluation of weight and activation quantization strategies, using probabilistic and generated text-based metrics across nine benchmarks, focusing on bias types like stereotypes, toxicity, sentiment, and fairness. Result: Quantization can reduce model toxicity without significantly impacting sentiment, but it may slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. Conclusion: The study concludes that quantization has a complex impact on model bias, which requires careful balancing of efficiency and ethical considerations in practical applications. Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. 
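To make "aggressive compression" concrete, the following is a minimal sketch of generic symmetric round-to-nearest weight quantization. It is an illustration only: the abstract does not name the specific weight and activation quantization strategies the authors evaluate, so the function names, the bit widths, and the Gaussian toy weights here are all assumptions.

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization, returned in dequantized (float) form.

    Generic textbook scheme, NOT necessarily the one used in the paper.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits, 3 for 3 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)                # nothing to scale
    scale = max_abs / qmax                  # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy "weights": 10,000 samples from a standard normal distribution.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Reconstruction error per bit width; fewer bits means a coarser grid.
errors = {
    bits: mean_squared_error(weights, quantize_symmetric(weights, bits))
    for bits in (8, 4, 3)
}
```

In this toy setting the reconstruction error grows as the bit width shrinks, which is the sense in which lower-bit settings are "more aggressive". How such numerical error translates into changes in stereotypes, toxicity, or fairness is the empirical question the paper studies and is not something this sketch can show.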
Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
[97] Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study
Monica Gonzalez-Machorro,Uwe Reichel,Pascal Hecker,Helly Hammer,Hesam Sagha,Florian Eyben,Robert Hoepner,Björn W. Schuller
Main category: cs.CL
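TL;DR: This study offers a first exploration of how well speech-based depression detection generalizes to people with multiple sclerosis, that is, in the presence of a co-occurring neurodegenerative disease.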
Details
Motivation: Depression often co-occurs with neurodegenerative diseases such as multiple sclerosis, yet the potential of speech-based AI for detecting depression in this population has not been explored. Method: The study uses supervised machine learning models with conventional speech and language features, emotional dimensions extracted from a speech emotion recognition model, and exploratory speech feature analysis. Result: Despite limited data, the models reach a 66% unweighted average recall (UAR) on a binary classification task, and feature selection raises UAR to 74%. Conclusion: The study concludes that speech-based depression detection shows some potential to generalize to populations such as people with multiple sclerosis, even in the presence of a co-occurring neurodegenerative disease. Abstract: Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role emotional changes have as an indicator of depressive mood in both the general population and within pwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions, such as neurodegenerative diseases.
[98] Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Julius Gun,Timo Oksanen
Main category: cs.CL
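TL;DR: This paper studies the performance of large language models on a technical question answering task, particularly in a cross-lingual information retrieval scenario. Comparing different models and strategies, it finds that the Hybrid RAG strategy performs best, with Gemini 2.5 Flash and Qwen 2.5 7B achieving high accuracy (over 85%) across all languages.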
Details
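Motivation: This paper aims to evaluate the performance of large language models (LLMs) on a technical question answering task, particularly in a cross-lingual information retrieval scenario. The motivation includes exploring realistic "needle-in-a-haystack" challenges and testing models for hallucinations by including unanswerable questions. Method: The paper presents a case study of LLMs with 128K-token context windows on a technical question answering task. A benchmark was built on the user manual of an agricultural machine, available in English, French, and German. Nine long-context LLMs using direct prompting were compared against three retrieval-augmented generation (RAG) strategies (keyword, semantic, hybrid), with an LLM serving as the judge for evaluation. Result: For this specific user manual, the Hybrid RAG strategy consistently outperforms direct long-context prompting in the cross-lingual information retrieval scenario. Models such as Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. Conclusion: For this specific user manual, the Hybrid RAG strategy consistently outperforms direct long-context prompting in the cross-lingual information retrieval scenario. Models such as Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. Abstract: We present a case study evaluating large language models (LLMs) with 128K-token context windows on a technical question answering (QA) task. Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German. It simulates a cross-lingual information retrieval scenario where questions are posed in English against all three language versions of the manual. The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations. We compare nine long-context LLMs using direct prompting against three Retrieval-Augmented Generation (RAG) strategies (keyword, semantic, hybrid), with an LLM-as-a-judge for evaluation. Our findings for this specific manual show that Hybrid RAG consistently outperforms direct long-context prompting. Models like Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. This paper contributes a detailed analysis of LLM performance in a specialized industrial domain and an open framework for similar evaluations, highlighting practical trade-offs and challenges.
[99] Detecting and Characterizing Planning in Language Models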
Jatin Nainani,Sankaran Vaidyanathan,Connor Watts,Andre N. Assis,Alice Rigg
Main category: cs.CL
TL;DR: This study introduces a scalable method to detect planning in LLMs, showing that planning behaviors vary across models and tasks, and are refined rather than created by instruction tuning.
Details
Motivation: Existing studies of planning in LLMs assume fixed planning horizons and focus on limited prompts or domains. There is a need for reproducible and scalable methods to distinguish planning from improvisation. Method: Formal and causally grounded criteria for detecting planning were developed and operationalized into a semi-automated annotation pipeline. This was applied to Gemma-2-2B and Claude 3.5 Haiku models on code and poem generation tasks. Result: Gemma-2-2B was found to use improvisation for poem generation (unlike Haiku) and switched between planning and improvisation on similar MBPP tasks. Instruction tuning was shown to refine rather than create planning behaviors. Conclusion: Planning behavior in LLMs is not universal and varies across models and tasks. Instruction tuning refines existing planning behaviors rather than creating them anew. Abstract: Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. 
We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.
[100] SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media
Xilai Xu,Zilin Zhao,Chengye Song,Zining Wang,Jinhe Qiang,Jiongrui Yan,Yuhuai Lin
Main category: cs.CL
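TL;DR: This paper proposes SentiMM, a new multi-agent framework that addresses challenges in multimodal sentiment analysis, and introduces SentiMMD, a large-scale multimodal dataset.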
Details
Motivation: With the growing prevalence of multimodal content on social media, sentiment analysis faces major challenges in effectively processing heterogeneous data and recognizing multi-label emotions. Existing methods often lack effective cross-modal fusion and external knowledge integration. Method: The paper proposes SentiMM, a new multi-agent framework that processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. It also introduces SentiMMD, a large-scale multimodal dataset. Result: Experiments show that SentiMM outperforms existing state-of-the-art baselines. Conclusion: SentiMM achieves better performance than existing state-of-the-art baselines, demonstrating the effectiveness of the structured approach. Abstract: With the increasing prevalence of multimodal content on social media, sentiment analysis faces significant challenges in effectively processing heterogeneous data and recognizing multi-label emotions. Existing methods often lack effective cross-modal fusion and external knowledge integration. We propose SentiMM, a novel multi-agent framework designed to systematically address these challenges. SentiMM processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. We also introduce SentiMMD, a large-scale multimodal dataset with seven fine-grained sentiment categories. Extensive experiments demonstrate that SentiMM achieves superior performance compared to state-of-the-art baselines, validating the effectiveness of our structured approach.
[101] Toward a Better Localization of Princeton WordNet
Abed Alhakim Freihat
Main category: cs.CL
TL;DR: This paper introduces a framework for the high-quality localization of Princeton WordNet that maintains cultural authenticity, demonstrated through the localization of 10,000 synsets.
Details
Motivation: The increasing significance of Princeton WordNet in Natural Language Processing necessitates its localization, ensuring both quality and cultural alignment, especially for Arabic. Method: The authors proposed a structured framework for the localization of Princeton WordNet and applied it to 10,000 synsets to evaluate its effectiveness. Result: The framework was successfully applied to localize 10,000 synsets, demonstrating its practicality and effectiveness. Conclusion: The paper concludes that a structured framework can effectively localize Princeton WordNet while preserving cultural authenticity. Abstract: As Princeton WordNet continues to gain significance as a semantic lexicon in Natural Language Processing, the need for its localization and for ensuring the quality of this process has become increasingly critical. Existing efforts remain limited in both scale and rigor, and there is a notable absence of studies addressing the accuracy of localization or its alignment with the cultural context of Arabic. This paper proposes a structured framework for the localization of Princeton WordNet, detailing the stages and procedures required to achieve high-quality results without compromising cultural authenticity. We further present our experience in applying this framework, reporting outcomes from the localization of 10,000 synsets.
[102] S2Sent: Nested Selectivity Aware Sentence Representation Learning
Jianxiang Zang,Nijia Mo,Yonda Wei,Meiling Ning,Hui Liu
Main category: cs.CL
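TL;DR: This paper proposes S²Sent, a new sentence representation selection mechanism that optimizes cross-block representation fusion through spatial selection and nested frequency selection, achieving sentence representation learning with low redundancy and low semantic loss.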
Details
Motivation: Different blocks in a Transformer encoder differ in their semantic perception ability, so cross-block representation fusion needs to be optimized. Method: The paper proposes S²Sent, a new sentence representation selection mechanism that combines spatial selection with nested frequency selection. Result: Experiments show that S²Sent achieves significant improvements over baseline methods and is highly integrable and scalable. Conclusion: S²Sent effectively improves sentence representation learning while keeping redundancy and semantic loss low. Abstract: The combination of Transformer-based encoders with contrastive learning represents the current mainstream paradigm for sentence representation learning. This paradigm is typically based on the hidden states of the last Transformer block of the encoder. However, within Transformer-based encoders, different blocks exhibit varying degrees of semantic perception ability. From the perspective of interpretability, the semantic perception potential of knowledge neurons is modulated by stimuli, thus rational cross-block representation fusion is a direction worth optimizing. To balance the semantic redundancy and loss across block fusion, we propose a sentence representation selection mechanism S²Sent, which integrates a parameterized nested selector downstream of the Transformer-based encoder. This selector performs spatial selection (SS) and nested frequency selection (FS) from a modular perspective. The SS innovatively employs a spatial squeeze based self-gating mechanism to obtain adaptive weights, which not only achieves fusion with low information redundancy but also captures the dependencies between embedding features. The nested FS replaces GAP with different DCT basis functions to achieve spatial squeeze with low semantic loss. Extensive experiments have demonstrated that S²Sent achieves significant improvements over baseline methods with negligible additional parameters and inference latency, while highlighting high integrability and scalability.
[103] DiscussLLM: Teaching Large Language Models When to Speak
Deep Anil Patel,Iain Melvin,Christopher Malon,Martin Renqiang Min
Main category: cs.CL
TL;DR: This paper proposes DiscussLLM, a framework that trains Large Language Models (LLMs) to proactively decide when and what to contribute in human discussions, using a scalable data generation pipeline and two architectural baselines. The approach aims to bridge the 'awareness gap' and enhance the situational awareness of conversational AI.
Details
Motivation: The motivation stems from the passivity of Large Language Models (LLMs), which limits their potential as collaborative partners in dynamic human discussions. The paper aims to bridge the 'awareness gap' by enabling LLMs to proactively decide when and how to intervene. Method: The paper introduces a scalable two-stage data generation pipeline to create a large-scale dataset of realistic multi-turn human discussions annotated with intervention types and explicit conversational triggers. Two architectural baselines are explored: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. Models are trained to predict a silent token when no intervention is needed. Result: The paper evaluates the models on their ability to accurately time interventions and generate helpful responses, showing promising results in creating more situationally aware and proactive conversational AI. Conclusion: The paper concludes that DiscussLLM effectively trains models to proactively decide when and what to contribute in dynamic human discussions, thereby enhancing the situational awareness and proactivity of conversational AI. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically, when to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. 
Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.
[104] Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation
Hongyu Cao,Yuxuan Wu,Yucheng Cai,Xianyu Zhao,Zhijian Ou
Main category: cs.CL
TL;DR: This paper proposes JSA-RAG, a novel training method for RAG models that improves performance by reducing gradient variance, outperforming existing approaches like vanilla RAG and VRAG.
Details
Motivation: Traditional top-K marginalization and VRAG suffer from biased or high-variance gradient estimates, which hinder the end-to-end optimization of RAG models. Method: Joint Stochastic Approximation (JSA), a stochastic extension of the EM algorithm, is proposed for end-to-end training of RAG models. Result: Extensive experiments on five datasets across two tasks (open-domain question answering and knowledge-grounded dialogs) demonstrate the superiority of JSA-RAG over existing methods. Conclusion: JSA-RAG significantly outperforms vanilla RAG and VRAG by providing low-variance gradient estimates and improving performance in generation and retrieval tasks. Abstract: Retrieval-augmented generation (RAG) has become a widely recognized paradigm to combine parametric memory with non-parametric memories. An RAG model consists of two serially connected components (retriever and generator). A major challenge in end-to-end optimization of the RAG model is that marginalization over relevant passages (modeled as discrete latent variables) from a knowledge base is required. Traditional top-K marginalization and variational RAG (VRAG) suffer from biased or high-variance gradient estimates. In this paper, we propose and develop joint stochastic approximation (JSA) based end-to-end training of RAG, which is referred to as JSA-RAG. The JSA algorithm is a stochastic extension of the EM (expectation-maximization) algorithm and is particularly powerful in estimating discrete latent variable models. Extensive experiments are conducted on five datasets for two tasks (open-domain question answering, knowledge-grounded dialogs) and show that JSA-RAG significantly outperforms both vanilla RAG and VRAG. Further analysis shows the efficacy of JSA-RAG from the perspectives of generation, retrieval, and low-variance gradient estimates.
[105] Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
Luana Bulla,Gabriele Tuccio,Misael Mongiovì,Aldo Gangemi
Main category: cs.CL
TL;DR: AulSign is a novel approach for sign language translation that effectively utilizes Large Language Models through dynamic prompting and in-context learning, addressing the lack of sign language data and showing strong results in low-data environments.
Details
Motivation: The task of translating natural languages into sign languages is complex and underexplored, largely due to the scarcity of standardized parallel corpora that align natural language with sign language data. Method: AulSign uses Large Language Models (LLMs) with dynamic prompting and in-context learning, associating signs with natural language descriptions to overcome the lack of intrinsic sign language knowledge in LLMs. Result: The AulSign method shows superior performance compared to state-of-the-art models in low-data scenarios, as demonstrated through evaluations on English and Italian languages using SignBank+ and the Italian LaCAM CNR-ISTC dataset. Conclusion: AulSign demonstrates effectiveness in translating natural languages into sign languages, particularly in low-data scenarios, thereby enhancing accessibility and inclusivity for underrepresented linguistic communities. Abstract: Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. 
To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenarios. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.
[106] Exploring the Interplay between Musical Preferences and Personality through the Lens of Language
Eliran Shem-Tov,Ella Rabinovich
Main category: cs.CL
TL;DR: 研究发现音乐偏好可以通过语言反映人格特质,并揭示了不同音乐粉丝群体的人格差异。
Details
Motivation: Prior research has shown that musical preferences correlate with personality traits and that personality can be detected through linguistic analysis. This study seeks to bridge the two areas. Method: Using over 500,000 text samples and musical preference data from nearly 5,000 authors, the study builds advanced models to assess personality characteristics. Result: The study finds significant personality differences across fans of five musical genres and releases resources for future research. Conclusion: Musical preferences are reflected in spontaneous language through the Big Five personality traits, and personality characteristics differ significantly across fans of different musical genres. Abstract: Music serves as a powerful reflection of individual identity, often aligning with deeper psychological traits. Prior research has established correlations between musical preferences and personality traits, while separate studies have demonstrated that personality is detectable through linguistic analysis. Our study bridges these two research domains by investigating whether individuals' musical preferences are recognizable in their spontaneous language through the lens of the Big Five personality traits (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism). Using a carefully curated dataset of over 500,000 text samples from nearly 5,000 authors with reliably identified musical preferences, we build advanced models to assess personality characteristics. Our results reveal significant personality differences across fans of five musical genres. We release resources for future research at the intersection of computational linguistics, music psychology and personality analysis.
[107] Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan,Varun Nathan,Ayush Kumar
Main category: cs.CL
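TL;DR: This paper proposes a synthetic dialogue generation method guided by supervision signals and evaluates generation quality with 18 metrics. The results show that no current method fully reproduces the complexity of contact center dialogues.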
Details
Motivation: In the contact center domain, privacy and data scarcity create a need for synthetic dialogues for model training and evaluation. However, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, and existing methods struggle to meet these requirements. Method: Supervision signals such as Intent Summaries, Topic Flow, and QA Evaluation Forms are used to guide synthetic dialogue generation, and generation quality is assessed with 18 linguistic and behavioral metrics. Result: Benchmarking four generation strategies shows that no method currently excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Conclusion: Synthetic dialogue generation for contact centers remains challenging. No method currently performs well across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Abstract: Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.
[108] Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries
Meiling Ning,Zhongbao Zhang,Junda Ye,Jiabao Guo,Qingyuan Guan
Main category: cs.CL
TL;DR: This paper introduces ESFP-RM, a novel reward model using MLMs and explanation-based slot prediction, which improves performance in reinforcement learning scenarios by leveraging natural language inference capabilities.
Details
Motivation: The authors aim to improve reward modeling in reinforcement learning by exploring the connection between LM-based judging reward modeling and natural language inference (NLI), seeking more stable and scalable solutions. Method: The authors propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation-based slot framework for prediction, leveraging the advantages of MLMs. This was validated through extensive experiments on NLI tasks and feedback scenarios. Result: Slot prediction MLMs with contextual explanations outperformed mainstream autoregressive models on NLI tasks. The ESFP-RM framework demonstrated superior performance in both RLHF and OOD settings compared to generative reward models. Conclusion: The proposed ESFP-RM framework offers more stable and generalizable reward signals compared to generative reward models, particularly in RLHF and OOD scenarios. Abstract: The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. 
Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.
[109] MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Yuhao Du,Qianwei Huang,Guo Zhu,Zhanchen Dai,Sunian Chen,Qiming Zhu,Yuhao Zhang,Li Zhou,Benyou Wang
Main category: cs.CL
TL;DR: This paper introduces MTalk-Bench, a new multi-turn speech-to-speech evaluation benchmark, revealing current model limitations in paralinguistic and ambient sound perception and highlighting the need for more robust evaluation frameworks.
Details
Motivation: The motivation for this work stems from the inadequacy of current evaluation frameworks in assessing the performance of speech-to-speech (S2S) large language models (LLMs) within complex, multi-turn dialogues. Method: The researchers introduced MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. They employed a dual-method evaluation framework combining Arena-style (pairwise comparison) and Rubrics-based (absolute scoring) evaluations for both relative and absolute assessments. Result: Experimental results revealed that S2S LLMs excel at semantic information processing but underperform in paralinguistic information and ambient sounds perception. Models often regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues. Additionally, modality-aware, task-specific designs outperformed brute scaling. Conclusion: The study concludes that current S2S LLMs face limitations in evaluating paralinguistic information and ambient sounds and highlights the need for more robust, speech-aware assessment frameworks. It also finds that modality-aware, task-specific designs outperform brute scaling. Abstract: The rapid advancement of speech-to-speech (S2S) large language models (LLMs) has significantly improved real-time spoken interaction. However, current evaluation frameworks remain inadequate for assessing performance in complex, multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Each dimension includes nine realistic scenarios, along with targeted tasks to assess specific capabilities such as reasoning. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. 
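The dual-method evaluation described above can be illustrated with a small sketch that turns pairwise judgments into per-model win rates (the relative view) and rubric scores into per-model means (the absolute view), then compares the two rankings. The aggregation choices (plain win rate with ties counted as half a win, and an arithmetic mean of rubric scores), the model names, and the toy judgments are all assumptions for illustration; the abstract does not specify how MTalk-Bench aggregates its results.

```python
from collections import defaultdict


def arena_win_rates(comparisons):
    """comparisons: iterable of (model_a, model_b, winner); winner is None for a tie."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in comparisons:
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:                  # tie: half a win for each side
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    return {model: wins[model] / games[model] for model in games}


def rubric_means(scores):
    """scores: iterable of (model, rubric_score) pairs on a shared absolute scale."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, score in scores:
        totals[model] += score
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in counts}


def ranking(metric):
    """Models sorted from best to worst by the given metric."""
    return sorted(metric, key=metric.get, reverse=True)


# Hypothetical toy judgments for three placeholder models.
comparisons = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B"), ("B", "A", None)]
rubric_scores = [("A", 4), ("A", 5), ("B", 3), ("B", 4), ("C", 2)]

arena = arena_win_rates(comparisons)
rubric = rubric_means(rubric_scores)
rankings_agree = ranking(arena) == ranking(rubric)
```

Keeping the two views separate makes it easy to check whether the relative and absolute protocols order the models the same way, which is the kind of consistency the benchmark reports on.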
The benchmark includes both model and human outputs, evaluated by human evaluators and LLMs. Experimental results reveal two sets of findings. Overall performance of S2S LLMs: (1) models excel at semantic information processing yet underperform on paralinguistic information and ambient sounds perception; (2) models typically regain coherence by increasing response length, sacrificing efficiency in multi-turn dialogues; (3) modality-aware, task-specific designs outperform brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics yield consistent, complementary rankings, but reliable distinctions emerge only when performance gaps are large; (2) LLM-as-a-judge aligns with humans when gaps are clear or criteria explicit, but exhibits position and length biases and is reliable on nonverbal evaluation only with text annotations. These results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
[110] Demographic Biases and Gaps in the Perception of Sexism in Large Language Models
Judith Tavarez-Rodríguez,Fernando Sánchez-Vega,A. Pastor López-Monroy
Main category: cs.CL
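TL;DR: This study examines the ability of different LLMs to detect sexism in social media text and finds that they are limited in replicating the perceptions of different demographic groups, underscoring the need for better-calibrated models that account for the diversity of perspectives across populations.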
Details
Motivation: Despite various efforts to improve the detection of sexist content, the task remains a significant challenge because of its subjective nature and the biases present in automated models. Method: The study uses the EXIST 2024 tweet dataset to explore the ability of different LLMs to detect sexism in social media text, and conducts a statistical analysis to identify which demographic characteristics (age, gender) contribute most effectively to the task. Result: The results show that, while LLMs can detect sexism to some extent when the overall opinion of the population is considered, they do not accurately replicate the differences in perception among demographic groups. Conclusion: Better-calibrated models are needed that account for the diversity of perspectives across different populations. Abstract: The use of Large Language Models (LLMs) has proven to be a tool that could help in the automatic detection of sexism. Previous studies have shown that these models contain biases that do not accurately reflect reality, especially for minority groups. Despite various efforts to improve the detection of sexist content, this task remains a significant challenge due to its subjective nature and the biases present in automated models. We explore the capabilities of different LLMs to detect sexism in social media text using the EXIST 2024 tweet dataset. It includes annotations from six distinct profiles for each tweet, allowing us to evaluate to what extent LLMs can mimic these groups' perceptions in sexism detection. Additionally, we analyze the demographic biases present in the models and conduct a statistical analysis to identify which demographic characteristics (age, gender) contribute most effectively to this task. Our results show that, while LLMs can to some extent detect sexism when considering the overall opinion of populations, they do not accurately replicate the diversity of perceptions among different demographic groups. This highlights the need for better-calibrated models that account for the diversity of perspectives across different populations.
[111] From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Ziqi Zhang,Jianfei Ma,Emmanuele Chersoni,Jieshun You,Zhaoxin Feng
Main category: cs.CL
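TL;DR: This paper evaluates how well large language models predict Chinese classifiers and finds that they perform worse than BERT, even after fine-tuning.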
Details
Motivation: Chinese classifiers are essential in educational applications, but whether mainstream large language models can predict them correctly has not been widely studied. Method: Various masking strategies are used to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and how the attention mechanisms work during prediction. Fine-tuning is also explored as a way to improve classifier prediction. Result: Even after fine-tuning, LLMs perform worse than BERT. As expected, information about the following noun greatly helps prediction, which also explains the advantage of models with bidirectional attention such as BERT. Conclusion: Predicting Chinese classifiers remains a challenge for large language models, and future research may need to focus specifically on improving how models handle classifiers. Abstract: Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge of Chinese classifiers is an issue that has largely remained unexplored in the Natural Language Processing (NLP) literature. To address this question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and the workings of the attention mechanisms during prediction. In addition, we explore fine-tuning LLMs to enhance classifier prediction performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.
[112] MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains
Kaiwen Wei,Rui Shan,Dongsheng Zou,Jianzhong Yang,Bi Zhao,Junnan Zhu,Jiang Zhong
Main category: cs.CL
TL;DR: MIRAGE is a new framework for medical question-answering that uses dynamic multi-chain inference over structured medical knowledge graphs to enhance accuracy, traceability, and interpretability, outperforming current approaches like GPT-4o and Tree-of-Thought variants.
Details
Motivation: Current approaches to medical question-answering, such as search-o1, have limitations in handling error accumulation and lack of traceability due to reliance on single, linear reasoning chains and flat incorporation of unstructured textual information. MIRAGE aims to overcome these challenges by utilizing a structured, multi-chain reasoning approach. Method: MIRAGE decomposes complex queries into entity-grounded sub-questions, executes parallel inference chains, retrieves evidence adaptively through neighbor expansion and multi-hop traversal, and integrates answers using cross-chain verification to resolve contradictions. Result: Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations, while also improving interpretability through explicit reasoning chains. Conclusion: MIRAGE is a novel test-time scalable reasoning framework that enhances medical question-answering tasks by performing dynamic multi-chain inference over structured medical knowledge graphs, offering improved accuracy, traceability, and interpretability over existing methods. Abstract: Large reasoning models (LRMs) have shown significant progress in test-time scaling through chain-of-thought prompting. Current approaches like search-o1 integrate retrieval augmented generation (RAG) into multi-step reasoning processes but rely on a single, linear reasoning chain while incorporating unstructured textual information in a flat, context-agnostic manner. As a result, these approaches can lead to error accumulation throughout the reasoning chain, which significantly limits its effectiveness in medical question-answering (QA) tasks where both accuracy and traceability are critical requirements. 
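The neighbor expansion and multi-hop traversal mentioned in the Method summary can be illustrated with a small breadth-first walk over a knowledge graph. This is only a sketch under stated assumptions, not the MIRAGE implementation: the graph `TOY_KG`, the function name `multi_hop_paths`, and the `max_hops` parameter are all hypothetical and chosen for illustration.

```python
from collections import deque

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
TOY_KG = {
    "fever": [("symptom_of", "influenza"), ("symptom_of", "malaria")],
    "influenza": [("treated_by", "oseltamivir")],
    "malaria": [("treated_by", "artemisinin")],
    "oseltamivir": [],
    "artemisinin": [],
}


def multi_hop_paths(graph, start, max_hops):
    """Breadth-first neighbor expansion from `start`.

    Returns every cycle-free path of 1..max_hops edges, where a path is a
    list of (head, relation, tail) triples.
    """
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget used up on this path
        visited = {start} | {tail for _, _, tail in path}
        for relation, neighbor in graph.get(node, []):
            if neighbor in visited:
                continue  # do not revisit entities already on this path
            new_path = path + [(node, relation, neighbor)]
            paths.append(new_path)
            queue.append((neighbor, new_path))
    return paths


if __name__ == "__main__":
    for p in multi_hop_paths(TOY_KG, "fever", max_hops=2):
        print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in p))
```

Each returned path is an explicit chain of triples, which is the kind of evidence a claim could be traced back to; how MIRAGE scores, prunes, or verifies such chains is not specified here.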
To address these challenges, we propose MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration), a novel test-time scalable reasoning framework that performs dynamic multi-chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity-grounded sub-questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi-hop traversal, and 4) integrates answers using cross-chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well-suited for complex medical reasoning scenarios. The code will be available for further research.
cs.CV [Back]
[113] Towards High-Precision Depth Sensing via Monocular-Aided iToF and RGB Integration
Yansong Du,Yutong Deng,Yuting Zhou,Feiyu Jiao,Jian Song,Xun Guan
Main category: cs.CV
TL;DR: This paper introduces an iToF-RGB fusion framework that enhances depth sensing by improving spatial resolution, expanding field-of-view, and reducing structural distortion through geometric calibration and a dual-encoder network, outperforming current methods.
Details
Motivation: The motivation is to overcome the limitations of iToF depth sensing, such as low spatial resolution, limited field-of-view, and structural distortion in complex scenes. Method: The method involves a geometric calibration and alignment module to reproject the iToF depth map onto the RGB coordinate system, followed by a dual-encoder fusion network that extracts features from the reprojected iToF depth and RGB image, guided by monocular depth priors. Result: The experiments show that the proposed framework significantly outperforms existing methods in terms of accuracy, structural consistency, and visual quality. Conclusion: The paper concludes that the proposed iToF-RGB fusion framework significantly enhances depth accuracy, improves edge sharpness, and enables seamless field-of-view expansion, outperforming state-of-the-art methods on both synthetic and real-world datasets. Abstract: This paper presents a novel iToF-RGB fusion framework designed to address the inherent limitations of indirect Time-of-Flight (iToF) depth sensing, such as low spatial resolution, limited field-of-view (FoV), and structural distortion in complex scenes. The proposed method first reprojects the narrow-FoV iToF depth map onto the wide-FoV RGB coordinate system through a precise geometric calibration and alignment module, ensuring pixel-level correspondence between modalities. A dual-encoder fusion network is then employed to jointly extract complementary features from the reprojected iToF depth and RGB image, guided by monocular depth priors to recover fine-grained structural details and perform depth super-resolution. By integrating cross-modal structural cues and depth consistency constraints, our approach achieves enhanced depth accuracy, improved edge sharpness, and seamless FoV expansion. 
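The reprojection step described above (mapping the iToF depth map into the RGB coordinate system) follows standard pinhole-camera geometry, which the sketch below works through for a single pixel. It is an illustration under assumptions, not the paper's calibration module: the function name, the `(fx, fy, cx, cy)` intrinsics tuples, and the example extrinsics are hypothetical, and lens distortion is ignored.

```python
def reproject_depth_pixel(u, v, depth, k_tof, rotation, translation, k_rgb):
    """Map one iToF pixel (u, v) with metric depth into RGB image coordinates.

    k_tof and k_rgb are (fx, fy, cx, cy) pinhole intrinsics; rotation is a
    3x3 nested list and translation a 3-vector taking iToF-frame points into
    the RGB camera frame. Returns (u_rgb, v_rgb, depth_rgb), or None if the
    point falls behind the RGB camera.
    """
    fx, fy, cx, cy = k_tof
    # Back-project the pixel to a 3D point in the iToF camera frame.
    point_tof = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # Rigid transform into the RGB camera frame: p_rgb = R * p_tof + t.
    point_rgb = [
        sum(rotation[i][j] * point_tof[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if point_rgb[2] <= 0:
        return None
    # Perspective projection with the RGB intrinsics.
    fx2, fy2, cx2, cy2 = k_rgb
    u_rgb = fx2 * point_rgb[0] / point_rgb[2] + cx2
    v_rgb = fy2 * point_rgb[1] / point_rgb[2] + cy2
    return u_rgb, v_rgb, point_rgb[2]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical values
    # A 10 cm horizontal baseline shifts the projected pixel horizontally.
    print(reproject_depth_pixel(400, 300, 2.0, intrinsics, identity,
                                [0.1, 0.0, 0.0], intrinsics))
```

Applying this to every valid iToF pixel yields a sparse depth map on the RGB grid; filling the gaps and extending the field of view is the job of the fusion network, which this sketch does not attempt.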
Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods in terms of accuracy, structural consistency, and visual quality.
[114] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance
Anindya Mondal,Ayan Banerjee,Sauradip Nag,Josep Lladós,Xiatian Zhu,Anjan Dutta
Main category: cs.CV
TL;DR: This paper proposes CountLoop, a training-free method that improves the instance-count control of diffusion models when generating complex scenes.
Details
Motivation: Diffusion models remain unreliable at generating scenes with a precise number of object instances, especially in complex, high-density settings. Method: Through iterative structured feedback, the approach alternates between image generation and multimodal agent evaluation, using a language-guided planner and critic to assess object counts, spatial arrangements, and attribute consistency, and introduces instance-driven attention masking and compositional generation techniques to improve object separation. Result: On COCO Count, T2I CompBench, and two new high-instance benchmarks, CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97. Conclusion: CountLoop is a training-free framework that gives diffusion models accurate control over instance generation. Abstract: Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.
[115] Do VLMs Have Bad Eyes? Diagnosing Compositional Failures via Mechanistic Interpretability
Ashwath Vaithinathan Aravindan,Abha Jha,Mihir Kulkarni
Main category: cs.CV
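TL;DR: This study traces CLIP's difficulty with novel combinations of objects and attributes to "superposition" in the neurons of its vision encoder's MLP layers, which hinders compositional feature representation.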
Details
Motivation: Although vision-language models (VLMs) excel at integrating visual and textual information, they still struggle with compositional generalization and object binding, which limits their ability to handle novel combinations of objects and their attributes. Method: The study uses mechanistic interpretability techniques to explore the root causes of VLM failures in compositional generalization and object binding. Result: The study shows that individual neurons in the MLP layers of CLIP's vision encoder represent multiple features, and that this "superposition" directly hinders compositional feature representation. Conclusion: Neurons in the MLP layers of CLIP's vision encoder exhibit "superposition," which directly hinders compositional feature representation and in turn affects compositional reasoning and object binding. The authors hope the study will be a first step toward uncovering the mechanistic roots of compositional failures in VLMs. Abstract: Vision-Language Models (VLMs) have shown remarkable performance in integrating visual and textual information for tasks such as image captioning and visual question answering. However, these models struggle with compositional generalization and object binding, which limit their ability to handle novel combinations of objects and their attributes. Our work explores the root causes of these failures using mechanistic interpretability techniques. We show evidence that individual neurons in the MLP layers of CLIP's vision encoder represent multiple features, and this "superposition" directly hinders its compositional feature representation, which consequently affects compositional reasoning and object binding capabilities. We hope this study will serve as an initial step toward uncovering the mechanistic roots of compositional failures in VLMs. The code and supporting results can be found at https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes .
[116] MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning
Chenghao Liu,Zhimu Zhou,Jiachen Zhang,Minghao Zhang,Songfang Huang,Huiling Duan
Main category: cs.CV
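TL;DR: This paper proposes MSNav, a new vision-and-language navigation framework that significantly improves navigation performance by integrating memory, spatial, and decision modules.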
Details
Motivation: Current vision-and-language navigation methods suffer from critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks, and a more systematic approach is needed to address them. Method: The paper proposes MSNav, a framework that integrates a Memory Module (to address memory overload), a Spatial Module (for spatial reasoning and object relationship inference), and a Decision Module (LLM-based path planning to execute robust actions). Result: Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets show that MSNav achieves state-of-the-art performance, with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL). Conclusion: By combining memory, spatial, and decision modules, MSNav significantly improves vision-and-language navigation performance, particularly in SR and SPL. Abstract: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate complex environments. Current approaches often adopt a "black-box" paradigm, where a single Large Language Model (LLM) makes end-to-end decisions. However, this paradigm is plagued by critical vulnerabilities, including poor spatial reasoning, weak cross-modal grounding, and memory overload in long-horizon tasks. To systematically address these issues, we propose Memory Spatial Navigation (MSNav), a framework that fuses three modules into a synergistic architecture, which transforms fragile inference into a robust, integrated intelligence. MSNav integrates three modules: Memory Module, a dynamic map memory module that tackles memory overload through selective node pruning, enhancing long-range exploration; Spatial Module, a module for spatial reasoning and object relationship inference that improves endpoint recognition; and Decision Module, a module using LLM-based path planning to execute robust actions. Powering Spatial Module, we also introduce an Instruction-Object-Space (I-O-S) dataset and fine-tune the Qwen3-4B model into Qwen-Spatial (Qwen-Sp), which outperforms leading commercial LLMs in object list extraction, achieving higher F1 and NDCG scores on the I-O-S test set. Extensive experiments on the Room-to-Room (R2R) and REVERIE datasets demonstrate MSNav's state-of-the-art performance with significant improvements in Success Rate (SR) and Success weighted by Path Length (SPL).
[117] Optimizing Hyper parameters in CNN for Soil Classification using PSO and Whale Optimization Algorithm
Yasir Nooruldeen Ibrahim,Fawziya Mahmood Ramo,Mahmood Siddeeq Qadir,Muna Jaffer Al-Shamdeen
Main category: cs.CV
TL;DR: This study uses Convolutional Neural Networks optimized with swarm algorithms to efficiently classify soil types, contributing to improved agricultural and environmental management.
Details
Motivation: Classifying soil images contributes to better land management, increased agricultural output, and practical solutions for environmental issues. The motivation is to improve the accuracy and efficiency of soil classification using artificial intelligence techniques. Method: The study constructs an intelligent model using Convolutional Neural Networks (CNN) for soil classification and employs swarm algorithms, namely the Whale optimization algorithm and the Particle swarm optimization algorithm, to optimize hyperparameters for improved performance. Result: The proposed model achieved efficient results in soil type classification, as demonstrated by the Accuracy and F1 measures used to evaluate the system's performance. Conclusion: The study concludes that the use of swarm algorithms, specifically the Whale optimization algorithm and the Particle swarm optimization algorithm, enhances the performance of Convolutional Neural Networks in classifying soil types, leading to efficient results. Abstract: Classifying soil images contributes to better land management, increased agricultural output, and practical solutions for environmental issues. The development of various disciplines, particularly agriculture, civil engineering, and natural resource management, is aided by an understanding of soil quality, since it helps with risk reduction, performance improvement, and sound decision-making. Artificial intelligence has recently been used in a number of different fields. In this study, an intelligent model was constructed using Convolutional Neural Networks to classify soil types, and machine learning algorithms were used to enhance the performance of soil classification.
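To make the swarm-based hyperparameter search concrete, here is a minimal particle swarm optimization sketch. It is not the study's setup: the objective `toy_validation_error` is a hypothetical stand-in for "train the CNN and return its validation error", the two hyperparameters (log10 learning rate and dropout) and their bounds are assumptions, and the inertia and acceleration coefficients are common textbook defaults.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box given by `bounds` with basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # per-particle best position
    best_val = [objective(p) for p in pos]       # per-particle best value
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]   # swarm-wide best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move and clamp to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def toy_validation_error(params):
    """Hypothetical stand-in for CNN validation error; minimum at (-3.0, 0.3)."""
    log_lr, dropout = params
    return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2


if __name__ == "__main__":
    best, err = pso_minimize(toy_validation_error, [(-5.0, -1.0), (0.0, 0.8)])
    print(best, err)
```

In a real search each objective call would be a full training run, so the particle and iteration counts would be far smaller; the Whale optimization algorithm would plug into the same interface with a different position-update rule.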
To improve the implementation and performance of the Convolutional Neural Network and obtain useful results when classifying soil-type images, swarm algorithms were used to select its hyperparameters, namely the Whale optimization algorithm and the Particle swarm optimization algorithm, and the results of the two algorithms were compared on multi-class soil classification. The Accuracy and F1 measures were adopted to test the system, and the proposed approach produced efficient results.
[118] QA-VLM: Providing human-interpretable quality assessment for wire-feed laser additive manufacturing parts with Vision Language Models
Qiaojie Zheng,Jiucai Zhang,Joy Gockel,Michael B. Wakin,Craig Brice,Xiaoli Zhang
Main category: cs.CV
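TL;DR: This study develops QA-VLM, a highly interpretable vision-language model framework for image-based quality assessment in additive manufacturing; it incorporates domain knowledge to improve explanations, and experiments show it outperforms existing methods in the consistency and validity of its assessments.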
Details
Motivation: Image-based quality assessment (QA) in additive manufacturing (AM) relies on manual work by experts, while existing machine learning and deep learning methods typically produce black-box outputs that lack interpretability, limiting trust and adoption in real-world settings. Method: The paper proposes a new QA-VLM framework that combines the attention mechanisms and reasoning capabilities of vision-language models (VLMs) with application-specific knowledge from peer-reviewed journal articles to generate human-interpretable quality assessments. Result: Evaluated on 24 single-bead samples produced by laser wire direct energy deposition (DED-LW), the framework outperforms existing VLMs in explanation quality and shows higher validity and consistency. Conclusion: The QA-VLM framework enables reliable, interpretable quality assessment in AM applications, with higher explanation quality and consistency than existing VLMs. Abstract: Image-based quality assessment (QA) in additive manufacturing (AM) often relies heavily on the expertise and constant attention of skilled human operators. While machine learning and deep learning methods have been introduced to assist in this task, they typically provide black-box outputs without interpretable justifications, limiting their trust and adoption in real-world settings. In this work, we introduce a novel QA-VLM framework that leverages the attention mechanisms and reasoning capabilities of vision-language models (VLMs), enriched with application-specific knowledge distilled from peer-reviewed journal articles, to generate human-interpretable quality assessments. Evaluated on 24 single-bead samples produced by laser wire direct energy deposition (DED-LW), our framework demonstrates higher validity and consistency in explanation quality than off-the-shelf VLMs. These results highlight the potential of our approach to enable trustworthy, interpretable quality assessment in AM applications.
[119] The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers
Naren Sengodan
Main category: cs.CV
TL;DR: The paper introduces The Loupe, a lightweight attention module for Vision Transformers, improving accuracy and interpretability in fine-grained visual classification without needing part-level annotations.
Details
Motivation: Fine-Grained Visual Classification (FGVC) requires identifying subtle visual cues in critical applications like biodiversity monitoring and medical diagnostics. While Vision Transformers perform well, their lack of interpretability hinders trust and verification in these domains. Method: The Loupe is a plug-and-play attention module inserted into pre-trained backbones like Swin Transformer. It is trained end-to-end with a composite loss function that guides the model to focus on discriminative object parts without explicit annotations, leveraging attention as a regularizer. Result: On the CUB-200-2011 dataset, The Loupe improved Swin-Base model accuracy from 85.40% to 88.06% (a 2.66% gain). The attention maps localized semantically meaningful features, enhancing model interpretability. Conclusion: The Loupe demonstrates that a simple intrinsic attention mechanism can significantly boost FGVC performance while offering clear visual explanations, making models more trustworthy and interpretable. Abstract: Fine-Grained Visual Classification (FGVC) is a critical and challenging area within computer vision, demanding the identification of highly subtle, localized visual cues. The importance of FGVC extends to critical applications such as biodiversity monitoring and medical diagnostics, where precision is paramount. While large-scale Vision Transformers have achieved state-of-the-art performance, their decision-making processes often lack the interpretability required for trust and verification in such domains. In this paper, we introduce The Loupe, a novel, lightweight, and plug-and-play attention module designed to be inserted into pre-trained backbones like the Swin Transformer. The Loupe is trained end-to-end with a composite loss function that implicitly guides the model to focus on the most discriminative object parts without requiring explicit part-level annotations. 
Our unique contribution lies in demonstrating that a simple, intrinsic attention mechanism can act as a powerful regularizer, significantly boosting performance while simultaneously providing clear visual explanations. Our experimental evaluation on the challenging CUB-200-2011 dataset shows that The Loupe improves the accuracy of a Swin-Base model from 85.40% to 88.06%, a gain of 2.66 percentage points. Crucially, our qualitative analysis of the learned attention maps reveals that The Loupe effectively localizes semantically meaningful features, providing a valuable tool for understanding and trusting the model's decision-making process.
[120] COVID19 Prediction Based On CT Scans Of Lungs Using DenseNet Architecture
Deborup Sanyal
Main category: cs.CV
TL;DR: A machine learning model is developed to assess the severity of COVID19 infections from CT scans, aiming to aid doctors in making more accurate and timely treatment decisions amidst resource shortages.
Details
Motivation: The motivation for this project stems from the high mortality rate due to the failure of the respiratory system in COVID19 patients and the need for more accurate and timely medical assessments due to shortages in healthcare resources. Method: The method involves using a Convolutional Neural Network model to analyze Computed Tomography (CT) scans of the lungs to determine the severity of COVID19 infection. Result: The result of the project is the development of a machine learning model that can analyze the severity of a patient's COVID19 infection within one month of testing positive, with the potential to predict if the infection will be promising or unfavorable. Conclusion: The project concludes that the use of a Convolutional Neural Network model can help in analyzing the severity of COVID19 infection based on CT scans, potentially improving the decision-making process for medical professionals. Abstract: COVID19 has taken the world by storm since December 2019. A highly infectious communicable disease, COVID19 is caused by the SARS-CoV-2 virus. By March 2020, the World Health Organization (WHO) had declared COVID19 a global pandemic. A pandemic in the 21st century after almost 100 years was something the world was not prepared for, which resulted in the deaths of around 1.6 million people worldwide. The most common symptoms of COVID19 were associated with the respiratory system and resembled a cold, flu, or pneumonia. After extensive research, doctors and scientists concluded that the main reason for lives being lost due to COVID19 was failure of the respiratory system. Patients were dying gasping for breath. Top healthcare systems of the world were failing badly as there was an acute shortage of hospital beds, oxygen cylinders, and ventilators. Many were dying without receiving any treatment at all. The aim of this project is to help doctors decide the severity of COVID19 by reading the patient's Computed Tomography (CT) scans of the lungs.
Computer models are less prone to human error, and Machine Learning or Neural Network models tend to give better accuracy as training improves over time. We have decided to use a Convolutional Neural Network model. Given that a patient tests positive, our model will analyze the severity of COVID19 infection within one month of the positive test result. The severity of the infection may be promising or unfavorable (if it leads to intubation or death), based entirely on the CT scans in the dataset.
[121] MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation
Fangxin Shang,Yuan Xia,Dalu Yang,Yahui Wang,Binglin Yang
Main category: cs.CV
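TL;DR: MedRepBench is a comprehensive benchmark of Chinese medical reports for evaluating the understanding capabilities of vision-language models and large language models.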
Details
Motivation: There is currently no standardized benchmark for assessing the quality of structured interpretation of medical reports. Method: The paper introduces MedRepBench, which includes a text-only evaluation setting, an objective evaluation, an automated subjective evaluation, and a reward function design used with Group Relative Policy Optimization (GRPO). Result: The approach achieves up to a 6% recall gain, and the OCR+LLM pipeline is observed to suffer from layout-blindness and latency issues. Conclusion: MedRepBench is a benchmark built from 1,900 de-identified real-world Chinese medical reports for evaluating end-to-end vision-language models on structured medical report understanding. The study also designs a reward function and applies GRPO to improve the recall of a mid-scale vision-language model. Abstract: Medical report interpretation plays a crucial role in healthcare, enabling both patient-facing explanations and effective information flow across clinical systems. While recent vision-language models (VLMs) and large language models (LLMs) have demonstrated general document understanding capabilities, there remains a lack of standardized benchmarks to assess structured interpretation quality in medical reports. We introduce MedRepBench, a comprehensive benchmark built from 1,900 de-identified real-world Chinese medical reports spanning diverse departments, patient demographics, and acquisition formats. The benchmark is designed primarily to evaluate end-to-end VLMs for structured medical report understanding. To enable controlled comparisons, we also include a text-only evaluation setting using high-quality OCR outputs combined with LLMs, allowing us to estimate the upper-bound performance when character recognition errors are minimized. Our evaluation framework supports two complementary protocols: (1) an objective evaluation measuring field-level recall of structured clinical items, and (2) an automated subjective evaluation using a powerful LLM as a scoring agent to assess factuality, interpretability, and reasoning quality. Based on the objective metric, we further design a reward function and apply Group Relative Policy Optimization (GRPO) to improve a mid-scale VLM, achieving up to 6% recall gain.
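The objective protocol above, field-level recall of structured clinical items, can be sketched as a set comparison. The benchmark's actual matching rule is not given here, so the sketch assumes exact match on (field name, value) pairs after trimming whitespace and lowercasing; the function name and the sample report fields are hypothetical.

```python
def field_level_recall(gold_items, predicted_items):
    """Fraction of gold (field, value) pairs recovered by the prediction.

    Assumption: a pair counts as recovered only on exact match after
    whitespace trimming and lowercasing. Returns 0.0 when there are no
    gold items.
    """
    def normalize(item):
        name, value = item
        return name.strip().lower(), value.strip().lower()

    gold = {normalize(item) for item in gold_items}
    if not gold:
        return 0.0
    predicted = {normalize(item) for item in predicted_items}
    return len(gold & predicted) / len(gold)


if __name__ == "__main__":
    # Hypothetical lab-report fields, for illustration only.
    gold = [("Hemoglobin", "132 g/L"), ("WBC", "6.1"),
            ("Platelets", "210"), ("Glucose", "5.4 mmol/L")]
    predicted = [("hemoglobin", "132 g/L"), ("WBC", "6.1 "),
                 ("Platelets", "201"), ("Glucose", "5.4 mmol/L")]
    print(field_level_recall(gold, predicted))  # 3 of 4 gold pairs matched
```

Because the score is a single number per report, it can serve directly as a scalar reward, which is presumably what makes it convenient for the GRPO step; the exact reward shaping used by the authors is not described in this digest.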
We also observe that the OCR+LLM pipeline, despite strong performance, suffers from layout-blindness and latency issues, motivating further progress toward robust, fully vision-based report understanding.
[122] Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection
Yanbing Bai,Rui-Yang Ju,Lemeng Zhao,Junjie Hu,Jianchao Bi,Erick Mas,Shunichi Koshimura
Main category: cs.CV
TL;DR: A lightweight two-stage framework enables efficient and accurate real-time wildfire monitoring on UAVs by reducing redundancy and improving fire source detection.
Details
Motivation: UAVs have limited computational resources, making it challenging to run large models for real-time wildfire monitoring. An efficient solution is required to enable timely and accurate disaster response. Method: Stage 1 uses a policy network with frame compression and a station point mechanism to reduce redundancy and improve accuracy. Stage 2 employs an improved YOLOv8 model for fire source localization. Result: In Stage 1, the method significantly reduces computational costs while maintaining classification accuracy. In Stage 2, it achieves higher detection accuracy with similar inference time compared to baseline methods. Conclusion: The proposed two-stage framework effectively reduces computational costs while maintaining or improving detection accuracy for real-time wildfire monitoring using UAVs. Abstract: Unmanned Aerial Vehicles (UAVs) have become increasingly important in disaster emergency response by enabling real-time aerial video analysis. Due to the limited computational resources available on UAVs, large models cannot be run independently for real-time analysis. To overcome this challenge, we propose a lightweight and efficient two-stage framework for real-time wildfire monitoring and fire source detection on UAV platforms. Specifically, in Stage 1, we utilize a policy network to identify and discard redundant video clips using frame compression techniques, thereby reducing computational costs. In addition, we introduce a station point mechanism that leverages future frame information within the sequential policy network to improve prediction accuracy. In Stage 2, once the frame is classified as "fire", we employ the improved YOLOv8 model to localize the fire source. We evaluate the Stage 1 method using the FLAME and HMDB51 datasets, and the Stage 2 method using the Fire & Smoke dataset. 
Experimental results show that our method significantly reduces computational costs while maintaining classification accuracy in Stage 1, and achieves higher detection accuracy with similar inference time in Stage 2 compared to baseline methods.
[123] CellEcoNet: Decoding the Cellular Language of Pathology with Deep Learning for Invasive Lung Adenocarcinoma Recurrence Prediction
Abdul Rehman Akbar,Usama Sajjad,Ziyu Su,Wencheng Li,Fei Xing,Jimmy Ruiz,Wei Chen,Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: The paper introduces CellEcoNet, a deep learning framework that models whole slide images through natural language analogy, demonstrating superior predictive performance in identifying invasive lung adenocarcinoma patients at risk of recurrence.
Details
Motivation: Despite surgical resection, ~70% of invasive lung adenocarcinoma (ILA) patients recur within five years, and current tools fail to identify those needing adjuvant therapy, leaving an unmet clinical need. Method: CellEcoNet is a novel spatially aware deep learning framework that models whole slide images (WSIs) through natural language analogy, defining a 'language of pathology,' where cells act as words, cellular neighborhoods become phrases, and tissue architecture forms sentences. Result: On a dataset of 456 H&E-stained WSIs, CellEcoNet achieved superior predictive performance (AUC: 77.8%, HR: 9.54), outperforming the IASLC grading system (AUC: 71.4%, HR: 2.36), AJCC Stage (AUC: 64.0%, HR: 1.17) and state-of-the-art computational methods (AUCs: 62.2-67.4%). CellEcoNet demonstrated fairness and consistent performance across diverse demographic and clinical subgroups. Conclusion: CellEcoNet marks a paradigm shift by decoding the tumor microenvironment's cellular 'language' to reveal how subtle cell variations encode recurrence risk. Abstract: Despite surgical resection, ~70% of invasive lung adenocarcinoma (ILA) patients recur within five years, and current tools fail to identify those needing adjuvant therapy. To address this unmet clinical need, we introduce CellEcoNet, a novel spatially aware deep learning framework that models whole slide images (WSIs) through natural language analogy, defining a "language of pathology," where cells act as words, cellular neighborhoods become phrases, and tissue architecture forms sentences. CellEcoNet learns these context-dependent meanings automatically, capturing how subtle variations and spatial interactions derive recurrence risk. On a dataset of 456 H&E-stained WSIs, CellEcoNet achieved superior predictive performance (AUC: 77.8%, HR: 9.54), outperforming the IASLC grading system (AUC: 71.4%, HR: 2.36), AJCC Stage (AUC: 64.0%, HR: 1.17) and state-of-the-art computational methods (AUCs: 62.2-67.4%).
CellEcoNet demonstrated fairness and consistent performance across diverse demographic and clinical subgroups. Beyond prognosis, CellEcoNet marks a paradigm shift by decoding the tumor microenvironment's cellular "language" to reveal how subtle cell variations encode recurrence risk.
[124] A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers
Marco N. Bochernitsan,Rodrigo C. Barros,Lucas S. Kupssinskü
Main category: cs.CV
TL;DR: This paper proposes a systematic method for evaluating and optimizing fairness and utility in text-to-image generation models using Pareto-optimal frontiers, showing that better hyperparameter choices outperform default settings.
Details
Motivation: Current fairness evaluation methods for text-to-image models rely on subjective, qualitative judgments and narrow comparisons, which are error-prone and not reproducible. There's a need for a systematic, quantitative approach to evaluate fairness and utility. Method: The authors introduce a fairness and utility evaluation framework using Pareto-optimal frontiers with hyperparameter tuning. They use Normalized Shannon Entropy for fairness and ClipScore for utility, testing across multiple text-to-image models. Result: The method successfully identifies better hyperparameter configurations that optimize fairness for a given level of utility and vice versa. Most default model settings are found to be dominated solutions in the fairness-utility space. Conclusion: The proposed evaluation method using Pareto-optimal frontiers helps identify superior hyperparameter configurations for fairness and utility in text-to-image models, showing that most default settings are suboptimal. Abstract: Achieving fairness in text-to-image generation demands mitigating social biases without compromising visual fidelity, a challenge critical to responsible AI. Current fairness evaluation procedures for text-to-image models rely on qualitative judgment or narrow comparisons, which limit the capacity to assess both fairness and utility in these models and prevent reproducible assessment of debiasing methods. Existing approaches typically employ ad-hoc, human-centered visual inspections that are both error-prone and difficult to replicate. We propose a method for evaluating fairness and utility in text-to-image models using Pareto-optimal frontiers across hyperparametrization of debiasing methods. Our method allows for comparison between distinct text-to-image models, outlining all configurations that optimize fairness for a given utility and vice-versa. 
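The evaluation idea above can be sketched in two small functions: a normalized Shannon entropy as the fairness score, and a filter that keeps only non-dominated (fairness, utility) configurations. This is an illustrative sketch rather than the authors' code: the function names, the attribute counts, and the configuration scores are hypothetical, and the utility values merely stand in for ClipScore.

```python
import math


def normalized_shannon_entropy(counts):
    """Entropy of a count distribution divided by log(k), so 1.0 is uniform."""
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)


def pareto_frontier(configs):
    """Names of configurations not dominated in (fairness, utility).

    `configs` is a list of (name, fairness, utility) with both objectives
    maximized. A configuration is dominated if another is at least as good
    on both objectives and strictly better on at least one.
    """
    frontier = []
    for name, fair, util in configs:
        dominated = any(
            f2 >= fair and u2 >= util and (f2 > fair or u2 > util)
            for _, f2, u2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Hypothetical counts of a perceived attribute across generated images.
    print(normalized_shannon_entropy([70, 30]))
    # Hypothetical (fairness, utility) scores for four hyperparameter settings.
    configs = [("default", 0.50, 0.30), ("a", 0.70, 0.31),
               ("b", 0.90, 0.25), ("c", 0.60, 0.28)]
    print(pareto_frontier(configs))
```

In this toy data the default setting is dominated by configuration "a", which mirrors the paper's observation that default hyperparameters often sit off the frontier.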
To illustrate our evaluation method, we use Normalized Shannon Entropy and ClipScore for fairness and utility evaluation, respectively. We assess fairness and utility in Stable Diffusion, Fair Diffusion, SDXL, DeCoDi, and FLUX text-to-image models. Our method shows that most default hyperparameterizations of the text-to-image model are dominated solutions in the fairness-utility space, and it is straightforward to find better hyperparameters.
[125] WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal,Mahsa Massoud,Aarash Feizi,Zichao Li,Suyuchen Wang,Christopher Pal,Aishwarya Agrawal,David Vazquez,Siva Reddy,Juan A. Rodriguez,Perouz Taslakian,Spandana Gella,Sai Rajeswar
Main category: cs.CV
TL;DR: WebMMU is a new multilingual benchmark for evaluating models on integrated web tasks like visual question answering, code editing, and mockup-to-code generation, revealing limitations in current MLLMs' reasoning, grounding, and cross-lingual capabilities.
Details
Motivation: Prior benchmarks evaluate web tasks separately, but real-world scenarios require models to handle complex, multilingual, and multimodal tasks. WebMMU aims to address this gap by assessing models' capabilities in realistic and integrated web development scenarios. Method: WebMMU introduces a multilingual benchmark that unifies three core web tasks using expert-annotated, real-world web data to evaluate models' abilities in multi-step reasoning, element grounding, and UI comprehension and coding. Result: MLLMs perform well on basic information extraction but struggle with complex reasoning, precise element grounding, functional code editing, and multilingual design-to-code generation. Conclusion: The paper concludes that current MLLMs have limitations in reasoning, grounding, code editing, and design-to-code generation for multilingual web tasks, highlighting the need for improved multimodal and cross-lingual reasoning in future web agents. Abstract: We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. 
These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
[126] Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data
Stefania L. Moroianu,Christian Bluethgen,Pierre Chambon,Mehdi Cherti,Jean-Benoit Delbrouck,Magdalini Paschali,Brandon Price,Judy Gichoya,Jenia Jitsev,Curtis P. Langlotz,Akshay S. Chaudhari
Main category: cs.CV
TL;DR: RoentGen-v2 is a new model that generates synthetic chest radiograph images with demographic control, improving the fairness and performance of diagnostic imaging models.
Details
Motivation: To address challenges in achieving robust performance and fairness across diverse patient populations in diagnostic imaging deep learning models. Method: Development of RoentGen-v2, a text-to-image diffusion model for chest radiographs, and use of synthetic data for supervised pretraining followed by fine-tuning on real data. Result: Synthetic pretraining led to a 6.5% accuracy increase in downstream classification models, compared to a 2.7% increase when combining real and synthetic data naively, while reducing the underdiagnosis fairness gap by 19.3%. Conclusion: RoentGen-v2, a text-to-image diffusion model, has the potential to enhance the fairness and generalizability of medical deep learning by generating a large, demographically balanced synthetic dataset of chest radiographs. Abstract: Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. 
Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2 .
[127] Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes
Xinhao Xiang,Kuan-Chuan Peng,Suhas Lohit,Michael J. Jones,Jiawei Zhang
Main category: cs.CV
TL;DR: OVODA is a new framework for open-vocabulary 3D object and attribute detection that doesn't require prior knowledge of novel class sizes, outperforming current methods while also identifying object attributes like spatial relationships and motion states.
Details
Motivation: Existing 3D object detection methods are limited by closed-set assumptions and struggle to detect novel objects and their attributes in real-world scenarios. Method: OVODA uses foundation models to bridge the semantic gap between 3D features and texts, with innovations like feature concatenation, prompt tuning, perspective-specified prompts, and horizontal flip augmentation for attribute detection. Result: OVODA achieves superior performance in open-vocabulary 3D object detection on nuScenes and Argoverse 2 datasets without knowing novel class anchor sizes and successfully detects object attributes. The OVAD dataset is also released for future research. Conclusion: OVODA is a new framework that improves open-vocabulary 3D object and attribute detection without requiring prior knowledge of novel class anchor sizes, outperforming existing methods on nuScenes and Argoverse 2 datasets. Abstract: 3D object detection plays a crucial role in autonomous systems, yet existing methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenarios. We propose OVODA, a novel framework enabling both open-vocabulary 3D object and attribute detection with no need to know the novel class anchor size. OVODA uses foundation models to bridge the semantic gap between 3D features and texts while jointly detecting attributes, e.g., spatial relationships, motion states, etc. To facilitate such research direction, we propose OVAD, a new dataset that supplements existing 3D object detection benchmarks with comprehensive attribute annotations. OVODA incorporates several key innovations, including foundation model feature concatenation, prompt tuning strategies, and specialized techniques for attribute detection, including perspective-specified prompts and horizontal flip augmentation. 
Our results on both the nuScenes and Argoverse 2 datasets show that under the condition of no given anchor sizes of novel classes, OVODA outperforms the state-of-the-art methods in open-vocabulary 3D object detection while successfully recognizing object attributes. Our OVAD dataset is released here: https://doi.org/10.5281/zenodo.16904069 .
[128] AIM 2025 Low-light RAW Video Denoising Challenge: Dataset, Methods and Results
Alexander Yakovenko,George Chakvetadze,Ilya Khrapov,Maksim Zhelezov,Dmitry Vatolin,Radu Timofte,Youngjin Oh,Junhyeong Kwon,Junyoung Park,Nam Ik Cho,Senyan Xu,Ruixuan Jiang,Long Peng,Xueyang Fu,Zheng-Jun Zha,Xiaoping Peng,Hansen Feng,Zhanyi Tie,Ziming Xia,Lizhi Wang
Main category: cs.CV
TL;DR: The paper introduces a benchmark for low-light RAW video denoising under exposure constraints, aiming to advance temporal redundancy-based methods through the AIM 2025 challenge.
Details
Motivation: To advance denoising techniques for low-light RAW videos by leveraging temporal redundancy while adhering to exposure-time limits and adapting to sensor-specific noise. Method: The organizers create a benchmark dataset of low-light RAW video sequences captured under various conditions and sensor types, and evaluate denoising approaches using PSNR and SSIM. Result: A benchmark dataset of 756 sequences and the identification of effective approaches through the AIM 2025 challenge. Conclusion: The AIM 2025 challenge introduced a new benchmark for low-light RAW video denoising, highlighting effective methods and fostering advances in exploiting temporal redundancy under exposure constraints. Abstract: This paper reviews the AIM 2025 (Advances in Image Manipulation) Low-Light RAW Video Denoising Challenge. The task is to develop methods that denoise low-light RAW video by exploiting temporal redundancy while operating under exposure-time limits imposed by frame rate and adapting to sensor-specific, signal-dependent noise. We introduce a new benchmark of 756 ten-frame sequences captured with 14 smartphone camera sensors across nine conditions (illumination: 1/5/10 lx; exposure: 1/24, 1/60, 1/120 s), with high-SNR references obtained via burst averaging. Participants process linear RAW sequences and output the denoised 10th frame while preserving the Bayer pattern. Submissions are evaluated on a private test set using full-reference PSNR and SSIM, with final ranking given by the mean of per-metric ranks. This report describes the dataset, challenge protocol, and submitted approaches.
[129] Transformer-Based Neural Network for Transient Detection without Image Subtraction
Adi Inada,Masao Sako,Tatiana Acero-Cuellar,Federica Bianco
Main category: cs.CV
TL;DR: A new transformer-based neural network improves supernova detection accuracy and efficiency in astronomical surveys without requiring difference imaging.
Details
Motivation: To enhance the classification accuracy of real and bogus transient detections in astronomical images, particularly for large-scale surveys like the Dark Energy Survey (DES). Method: A transformer-based neural network architecture was developed, designed for pixel-by-pixel comparison of search and template images, eliminating the need for computationally expensive difference imaging. Result: The network achieved a classification accuracy of 97.4% on the autoScan dataset from DES, and further experiments showed robust performance even when input images were not centered on supernova candidates. Conclusion: The paper concludes that the introduced transformer-based neural network improves the accuracy and efficiency of supernova detection in astronomical surveys, surpassing traditional CNN methods. Abstract: We introduce a transformer-based neural network for the accurate classification of real and bogus transient detections in astronomical images. This network advances beyond the conventional convolutional neural network (CNN) methods, widely used in image processing tasks, by adopting an architecture better suited for detailed pixel-by-pixel comparison. The architecture enables efficient analysis of search and template images only, thus removing the necessity for computationally expensive difference imaging, while maintaining high performance. Our primary evaluation was conducted using the autoScan dataset from the Dark Energy Survey (DES), where the network achieved a classification accuracy of 97.4%, with the performance benefit of including the difference image diminishing as the size of the training set grew. Further experiments with DES data confirmed that the network can operate at a similar level even when the input images are not centered on the supernova candidate. These findings highlight the network's effectiveness in enhancing both accuracy and efficiency of supernova detection in large-scale astronomical surveys.
[130] NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Denis Tarasov,Alexander Nikulin,Ilya Zisman,Albina Klepach,Nikita Lyubaykin,Andrei Polubarov,Alexander Derevyagin,Vladislav Kurenkov
Main category: cs.CV
TL;DR: NinA is a fast and efficient alternative to diffusion-based action decoders for Vision-Language-Action (VLA) models, offering similar performance with much faster inference times.
Details
Motivation: Diffusion models, while effective at modeling complex action distributions, require multiple iterative denoising steps during inference, which limits their applicability in real-time, high-frequency control scenarios. A faster alternative was needed. Method: The authors replaced the diffusion-based action decoder with a Normalizing Flow (NF) model, which allows for one-shot sampling through invertible transformations. This approach was integrated into the FLOWER VLA architecture and fine-tuned on the LIBERO benchmark. Result: NinA achieves comparable performance to diffusion-based decoders on the LIBERO benchmark while significantly reducing inference time due to one-shot sampling. Conclusion: NinA provides a faster and more efficient alternative to diffusion-based action decoders in VLA models without compromising performance. Abstract: Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. 
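To make "one-shot sampling through an invertible transformation" concrete, here is a minimal sketch of a single conditional affine coupling layer, a standard normalizing-flow building block. The abstract does not specify NinA's actual architecture, so the class name, the dimensions, and the random linear maps that stand in for a learned network are all assumptions made for illustration.

```python
import numpy as np

class AffineCoupling:
    """One coupling layer: the first half of the vector passes through unchanged
    and, together with a conditioning vector, sets a scale and shift for the rest."""

    def __init__(self, dim, cond_dim, rng):
        self.half = dim // 2
        in_dim = self.half + cond_dim
        out_dim = dim - self.half
        # Hypothetical stand-in for a trained network: fixed random linear maps
        self.w_scale = 0.1 * rng.standard_normal((in_dim, out_dim))
        self.w_shift = 0.1 * rng.standard_normal((in_dim, out_dim))

    def _params(self, first_half, cond):
        h = np.concatenate([first_half, cond], axis=-1)
        return np.tanh(h @ self.w_scale), h @ self.w_shift

    def forward(self, z, cond):
        """Noise -> action: a single pass, with no iterative denoising loop."""
        z1, z2 = z[..., :self.half], z[..., self.half:]
        log_s, t = self._params(z1, cond)
        return np.concatenate([z1, z2 * np.exp(log_s) + t], axis=-1)

    def inverse(self, x, cond):
        """Action -> noise: the exact inverse, as needed for likelihood training."""
        x1, x2 = x[..., :self.half], x[..., self.half:]
        log_s, t = self._params(x1, cond)
        return np.concatenate([x1, (x2 - t) * np.exp(-log_s)], axis=-1)

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=4, cond_dim=3, rng=rng)
noise = rng.standard_normal((5, 4))    # latent samples
context = rng.standard_normal((5, 3))  # stand-in for VLM features
actions = layer.forward(noise, context)
recovered = layer.inverse(actions, context)
```

Because the first half is passed through untouched, the scale and shift can be recomputed exactly in the inverse direction, which is what makes the layer invertible regardless of how complex the conditioning network is.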
Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
[131] RF-PGS: Fully-structured Spatial Wireless Channel Representation with Planar Gaussian Splatting
Lihao Zhang,Zongtan Li,Haijian Sun
Main category: cs.CV
TL;DR: This paper proposes RF-PGS, a novel framework for high-fidelity radio propagation path reconstruction using sparse path loss spectra, offering improved accuracy, reduced training costs, and efficient wireless channel representation for scalable 6G Spatial-CSI modeling.
Details
Motivation: Traditional channel modeling approaches face challenges in spatial resolution, efficiency, and scalability, while radiance field-based methods suffer from geometric inaccuracy and costly supervision. This work aims to provide a more accurate and efficient solution for 6G Spatial-CSI modeling. Method: RF-PGS uses Planar Gaussians as geometry primitives with RF-specific optimizations for dense, surface-aligned scene reconstruction in the geometry training stage. In the RF training stage, it employs a fully-structured radio radiance and a tailored multi-view loss to model radio propagation behavior accurately. Result: RF-PGS reconstructs high-fidelity radio propagation paths from sparse path loss spectra, significantly improving reconstruction accuracy, reducing training costs, and enabling efficient representation of wireless channels. Conclusion: RF-PGS provides a practical solution for scalable 6G Spatial-CSI modeling by efficiently representing wireless channels with high reconstruction accuracy and reduced training costs compared to previous radiance field methods. Abstract: In the 6G era, the demand for higher system throughput and the implementation of emerging 6G technologies require large-scale antenna arrays and accurate spatial channel state information (Spatial-CSI). Traditional channel modeling approaches, such as empirical models, ray tracing, and measurement-based methods, face challenges in spatial resolution, efficiency, and scalability. Radiance field-based methods have emerged as promising alternatives but still suffer from geometric inaccuracy and costly supervision. This paper proposes RF-PGS, a novel framework that reconstructs high-fidelity radio propagation paths from only sparse path loss spectra. By introducing Planar Gaussians as geometry primitives with certain RF-specific optimizations, RF-PGS achieves dense, surface-aligned scene reconstruction in the first geometry training stage. 
In the subsequent Radio Frequency (RF) training stage, the proposed fully-structured radio radiance, combined with a tailored multi-view loss, accurately models radio propagation behavior. Compared to prior radiance field methods, RF-PGS significantly improves reconstruction accuracy, reduces training costs, and enables efficient representation of wireless channels, offering a practical solution for scalable 6G Spatial-CSI modeling.
[132] Gaussian Primitive Optimized Deformable Retinal Image Registration
Xin Tian,Jiazheng Wang,Yuxi Zhang,Xiang Chen,Renjiu Hu,Gaolei Li,Min Liu,Hang Zhang
Main category: cs.CV
TL;DR: This paper proposes Gaussian Primitive Optimization (GPO), a novel framework for deformable retinal image registration that effectively improves registration accuracy and performance over existing methods.
Details
Motivation: Deformable retinal image registration is challenging due to large homogeneous regions and sparse vascular features that lead to limited gradient signals. This paper aims to address these challenges with a novel iterative framework. Method: The paper introduces Gaussian Primitive Optimization (GPO), which uses structured message passing, Gaussian primitive modeling, and KNN Gaussian interpolation to construct a globally coherent displacement field for image registration. Result: Experiments on the FIRE dataset show that GPO reduces the target registration error from 6.2 px to ~2.4 px and increases the AUC at 25 px from 0.770 to 0.938. Conclusion: GPO is an effective method for deformable retinal image registration that outperforms existing methods by significantly reducing registration error and increasing accuracy. Abstract: Deformable retinal image registration is notoriously difficult due to large homogeneous regions and sparse but critical vascular features, which cause limited gradient signals in standard learning-based frameworks. In this paper, we introduce Gaussian Primitive Optimization (GPO), a novel iterative framework that performs structured message passing to overcome these challenges. After an initial coarse alignment, we extract keypoints at salient anatomical structures (e.g., major vessels) to serve as a minimal set of descriptor-based control nodes (DCN). Each node is modelled as a Gaussian primitive with trainable position, displacement, and radius, thus adapting its spatial influence to local deformation scales. A K-Nearest Neighbors (KNN) Gaussian interpolation then blends and propagates displacement signals from these information-rich nodes to construct a globally coherent displacement field; focusing interpolation on the top K neighbors reduces computational overhead while preserving local detail. 
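The KNN Gaussian interpolation step can be illustrated with a short NumPy sketch. The abstract does not give the exact weighting, so this assumes an isotropic Gaussian weight per node with normalization over the K nearest nodes; the function name and the toy node data are hypothetical.

```python
import numpy as np

def knn_gaussian_interpolate(query, nodes, displacements, radii, k=3):
    """Blend node displacements at each query point using its k nearest nodes,
    weighted by an isotropic Gaussian with a per-node radius."""
    # Squared distances between every query point and every node: shape (Q, N)
    d2 = ((query[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]               # (Q, k) nearest node indices
    d2_k = np.take_along_axis(d2, idx, axis=1)        # (Q, k) their squared distances
    w = np.exp(-d2_k / (2.0 * radii[idx] ** 2))       # (Q, k) Gaussian weights
    w = w / (w.sum(axis=1, keepdims=True) + 1e-12)    # normalize per query point
    return (w[..., None] * displacements[idx]).sum(axis=1)  # (Q, 2) displacement

# Toy control nodes (positions in pixels), all sharing one displacement
nodes = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
displacements = np.array([[1.0, -2.0], [1.0, -2.0], [1.0, -2.0]])
radii = np.array([5.0, 5.0, 5.0])
field = knn_gaussian_interpolate(np.array([[3.0, 3.0]]), nodes, displacements, radii, k=2)
```

Restricting the sum to k neighbors keeps the cost per query point independent of the total number of nodes once the neighbors are known; the full distance matrix here is only for brevity, and a spatial index would replace it at scale.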
By strategically anchoring nodes in high-gradient regions, GPO ensures robust gradient flow, mitigating vanishing gradient signal in textureless areas. The framework is optimized end-to-end via a multi-term loss that enforces both keypoint consistency and intensity alignment. Experiments on the FIRE dataset show that GPO reduces the target registration error from 6.2 px to ~2.4 px and increases the AUC at 25 px from 0.770 to 0.938, substantially outperforming existing methods. The source code can be accessed via https://github.com/xintian-99/GPOreg.
[133] Beyond Emotion Recognition: A Multi-Turn Multimodal Emotion Understanding and Reasoning Benchmark
Jinpeng Hu,Hongchang Shi,Chongyuan Dai,Zhuo Li,Peipei Song,Meng Wang
Main category: cs.CV
TL;DR: This paper introduces the MTMEUR benchmark and a multi-agent framework to advance emotion reasoning in MLLMs, highlighting challenges faced by current models.
Details
Motivation: While MLLMs have strong emotion recognition abilities, their potential in emotion reasoning remains underexplored, which is essential for enhancing human-machine interactions. Method: The paper proposes a multi-agent framework and introduces the MTMEUR benchmark, which includes video data and progressive questions to evaluate emotion understanding and reasoning capabilities. Result: Experiments with existing MLLMs on the MTMEUR benchmark show that most models struggle with emotion reasoning tasks. Conclusion: MLLMs face significant challenges in emotion reasoning tasks, indicating a need for further research and development in this area. Abstract: Multimodal large language models (MLLMs) have been widely applied across various fields due to their powerful perceptual and reasoning capabilities. In the realm of psychology, these models hold promise for a deeper understanding of human emotions and behaviors. However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving the substantial potential in emotion reasoning, which is crucial for improving the naturalness and effectiveness of human-machine interactions, underexplored. Therefore, in this paper, we introduce a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 videos from real-life scenarios, along with 5,101 progressive questions. These questions cover various aspects, including emotion recognition, potential causes of emotions, future action prediction, etc. In addition, we propose a multi-agent framework, where each agent specializes in a specific aspect, such as background context, character dynamics, and event details, to improve the system's reasoning capabilities. Furthermore, we conduct experiments with existing MLLMs and our agent-based method on the proposed benchmark, revealing that most models face significant challenges with this task.
[134] Delta-SVD: Efficient Compression for Personalized Text-to-Image Models
Tangyuan Zhang,Shangyu Chen,Qixiang Chen,Jianfei Cai
Main category: cs.CV
TL;DR: Delta-SVD is a training-free compression method for personalized text-to-image models that efficiently reduces storage requirements while maintaining generation quality.
Details
Motivation: Personalized text-to-image models like DreamBooth require significant storage due to fine-tuning large-scale diffusion models. Delta-SVD aims to address this storage overhead by compressing the weight updates in a training-free manner. Method: Delta-SVD applies Singular Value Decomposition (SVD) to factorize weight deltas from DreamBooth fine-tuning, followed by an energy-based rank truncation strategy to balance compression efficiency and reconstruction fidelity. Result: Experiments show that Delta-SVD achieves substantial compression with minimal loss in generation quality, as measured by CLIP score, SSIM, and FID. Conclusion: Delta-SVD enables scalable and efficient deployment of personalized diffusion models, making it a practical solution for real-world applications involving large-scale subject customizations. Abstract: Personalized text-to-image models such as DreamBooth require fine-tuning large-scale diffusion backbones, resulting in significant storage overhead when maintaining many subject-specific models. We present Delta-SVD, a post-hoc, training-free compression method that targets the parameter weights update induced by DreamBooth fine-tuning. Our key observation is that these delta weights exhibit strong low-rank structure due to the sparse and localized nature of personalization. Delta-SVD first applies Singular Value Decomposition (SVD) to factorize the weight deltas, followed by an energy-based rank truncation strategy to balance compression efficiency and reconstruction fidelity. The resulting compressed models are fully plug-and-play and can be re-constructed on-the-fly during inference. Notably, the proposed approach is simple, efficient, and preserves the original model architecture. Experiments on a multiple subject dataset demonstrate that Delta-SVD achieves substantial compression with negligible loss in generation quality measured by CLIP score, SSIM and FID. 
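The two steps described here, SVD of the weight delta followed by energy-based rank truncation, can be sketched with NumPy as follows. This is an illustrative reading of the abstract rather than the authors' code: the function names, the 0.95 energy threshold, and the toy matrices are assumptions, and the delta is assumed to be nonzero.

```python
import numpy as np

def compress_delta(w_finetuned, w_base, energy=0.95):
    """Factorize the weight delta and keep the smallest rank whose squared
    singular values retain at least `energy` of the total. Assumes delta != 0."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where the retained energy reaches the threshold, as a rank
    rank = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return u[:, :rank] * s[:rank], vt[:rank]   # factors of shape (m, r) and (r, n)

def reconstruct(w_base, a, b):
    """Rebuild the personalized weights on the fly from the stored factors."""
    return w_base + a @ b

# Toy example: a 4x3 layer whose delta has singular values 3 and 1
w_base = np.zeros((4, 3))
w_finetuned = w_base.copy()
w_finetuned[0, 0] = 3.0
w_finetuned[1, 1] = 1.0
a, b = compress_delta(w_finetuned, w_base, energy=0.95)
w_restored = reconstruct(w_base, a, b)
```

Storing the two factors costs r(m + n) values instead of mn, so the saving grows as the retained rank r shrinks relative to the layer dimensions; lowering `energy` trades reconstruction fidelity for a smaller r.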
Our method enables scalable and efficient deployment of personalized diffusion models, making it a practical solution for real-world applications that require storing and deploying large-scale subject customizations.
[135] Do Multimodal LLMs See Sentiment?
Neemias B. da Silva,John Harrison,Rodrigo Minetto,Myriam R. Delgado,Bogdan T. Nassu,Thiago H. Silva
Main category: cs.CV
TL;DR: The paper introduces MLLMsent, a framework for investigating sentiment reasoning in MLLMs, showing that fine-tuned models can achieve superior performance in sentiment classification from images compared to existing methods.
Details
Motivation: Understanding how visual content communicates sentiment is crucial due to the dominance of visual media on social platforms, but it is a challenging problem because sentiment perception is linked to complex scene-level semantics. Method: The paper proposes a framework called MLLMsent to study the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three approaches: direct sentiment classification from images, association with pre-trained LLMs for sentiment analysis, and fine-tuning LLMs on sentiment-labeled image descriptions. Result: Experiments showed that the fine-tuned approach achieved state-of-the-art results, outperforming other baselines by up to 30.9%, 64.8%, and 42.4%. In a cross-dataset test without training on new data, the model still outperformed the best runner-up by up to 8.26%. Conclusion: The proposed visual reasoning scheme shows potential for advancing affective computing and sets new benchmarks for future research. Abstract: Understanding how visual content communicates sentiment is critical in an era where online interaction is increasingly dominated by this kind of media on social platforms. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. 
Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators' agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on these new data, our model still outperforms, by up to 8.26%, the best runner-up, which has been trained directly on them. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.
[136] AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text Perception
Xilai Li,Huichun Liu,Xiaosong Li,Tao Ye,Zhenyu Kuang,Huafeng Li
Main category: cs.CV
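TL;DR: This paper proposes AWM-Fuse, a new multi-modality image fusion method that incorporates textual information to improve image fusion under adverse weather conditions and achieves superior performance.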
Details
Motivation: To address the loss of visual information under adverse weather conditions and to improve semantic perception in multi-modality image fusion, particularly the shortcomings of existing approaches in exploiting textual information. Method: The authors propose AWM-Fuse, a new fusion method that combines global and local text perception modules and uses textual descriptions generated by BLIP and ChatGPT to improve multi-modality image fusion. Result: Experiments show that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and on downstream tasks. Conclusion: AWM-Fuse outperforms existing methods under adverse weather conditions and enables more effective semantically aware image fusion. Abstract: Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.
[137] A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism
Yi Zhang,Lingxiao Wei,Bowei Zhang,Ziwei Liu,Kai Yi,Shu Hu
Main category: cs.CV
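TL;DR: This paper proposes SAEViT, a lightweight vision transformer that uses a sparse attention mechanism and depth-wise separable convolutions to reduce computational cost while maintaining high performance across a variety of vision tasks.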
Details
Motivation: Although the Vision Transformer (ViT) performs well on vision tasks, its high computational cost and weak local feature modeling limit its use in real-world scenarios. SAEViT is proposed to balance computational efficiency and performance. Method: SAEViT introduces a Sparsely Aggregated Attention (SAA) module that performs adaptive sparse sampling based on image redundancy, develops a Channel-Interactive Feed-Forward Network (CIFFN) layer to enhance inter-channel information exchange, and adopts a hierarchical pyramid structure with embedded depth-wise separable convolution blocks to further strengthen convolutional features. Result: On the ImageNet-1K classification task, SAEViT achieves Top-1 accuracies of 76.3% and 79.6% with only 0.8 GFLOPs and 1.3 GFLOPs, respectively, confirming its efficiency and effectiveness. Conclusion: SAEViT is a lightweight vision transformer that uses a sparse attention mechanism and depth-wise separable convolution modules to balance computational efficiency and performance effectively, making it suitable for a range of fundamental vision tasks. Abstract: Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability. However, its large model size with high computational cost and weak local feature modeling ability hinder its application in real scenarios. To balance computation efficiency and performance, we propose SAEViT (Sparse-Attention-Efficient-ViT), a lightweight ViT-based model with convolution blocks, in this paper to achieve efficient downstream vision tasks. Specifically, SAEViT introduces a Sparsely Aggregated Attention (SAA) module that performs adaptive sparse sampling based on image redundancy and recovers the feature map via deconvolution operation, which significantly reduces the computational complexity of attention operations. In addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed to enhance inter-channel information exchange through feature decomposition and redistribution, mitigating redundancy in traditional feed-forward networks (FFN). Finally, a hierarchical pyramid structure with embedded depth-wise separable convolutional blocks (DWSConv) is devised to further strengthen convolutional features. 
Extensive experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task with only 0.8 GFLOPs and 1.3 GFLOPs, respectively, demonstrating a lightweight solution for various fundamental vision tasks.
[138] MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration
Shunyu Yao,Ming Liu,Zhilu Zhang,Zhaolin Wan,Zhilong Ji,Jinfeng Bai,Wangmeng Zuo
Main category: cs.CV
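TL;DR: This paper proposes a multi-dimensional image quality assessment method (MDIQA) that better models human visual perception and can be applied flexibly to image restoration tasks.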
Details
Motivation: Existing methods focus too heavily on the overall score and overlook the fact that humans assess image quality along multiple dimensions. Method: Image quality is modeled along five technical and four aesthetic dimensions; each dimension is trained separately, and the features are then fused to produce the final IQA score. Result: Experiments show that MDIQA outperforms existing methods and can be applied flexibly to image restoration tasks. Conclusion: The paper proposes a multi-dimensional image quality assessment framework (MDIQA) that better captures the multifaceted nature of human visual perception and can be applied flexibly to image restoration tasks. Abstract: Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods are obsessed with fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, when the MDIQA model is ready, we can deploy it for a flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available at https://github.com/YaoShunyu19/MDIQA.
[139] Structural Energy-Guided Sampling for View-Consistent Text-to-3D
Qing Zhang,Jinguang Tong,Jie Hong,Jing Zhang,Xuesong Li
Main category: cs.CV
TL;DR: Structural Energy-Guided Sampling (SEGS) mitigates the Janus problem in Text-to-3D generation by enforcing multi-view consistency during sampling without requiring retraining or weight changes.
Details
Motivation: Text-to-3D generation often suffers from the Janus problem, where objects appear correct from the front but have distorted geometry from other angles due to viewpoint bias in 2D diffusion priors. Method: SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory. Result: SEGS improves geometric alignment and viewpoint consistency while preserving appearance fidelity when integrated into SDS/VSD pipelines. Conclusion: SEGS successfully reduces the Janus problem in Text-to-3D generation by enforcing multi-view consistency at sampling time without retraining or weight modification. Abstract: Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.
[140] MSPCaps: A Multi-Scale Patchify Capsule Network with Cross-Agreement Routing for Visual Recognition
Yudong Hu,Yueju Han,Rui Sun,Jinke Ren
Main category: cs.CV
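TL;DR: This paper proposes MSPCaps, a new capsule network architecture that introduces multi-scale feature learning and an adaptive capsule routing mechanism, significantly improving performance on visual recognition tasks.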
Details
Motivation: Existing capsule networks and their variants usually rely on a single high-level feature map and overlook the rich information in multi-scale features, while conventional feature fusion strategies struggle to reconcile multi-scale feature discrepancies, leading to suboptimal classification performance. Method: Proposes the Multi-Scale Patchify Capsule Network (MSPCaps), which comprises a Multi-Scale ResNet Backbone (MSRB), a Patchify Capsule Layer (PatchifyCaps), and Cross-Agreement Routing (CAR) blocks, used respectively to extract multi-scale features, generate primary capsules, and perform adaptive routing. Result: Across configurations (Tiny and Large models), MSPCaps surpasses multiple baseline methods in classification accuracy and shows good scalability and robustness. Conclusion: By integrating multi-scale feature learning with efficient capsule routing, MSPCaps significantly improves classification accuracy and model robustness, showing its potential for feature representation learning. Abstract: Capsule Network (CapsNet) has demonstrated significant potential in visual recognition by capturing spatial relationships and part-whole hierarchies for learning equivariant feature representations. However, existing CapsNet and variants often rely on a single high-level feature map, overlooking the rich complementary information from multi-scale features. Furthermore, conventional feature fusion strategies (e.g., addition and concatenation) struggle to reconcile multi-scale feature discrepancies, leading to suboptimal classification performance. To address these limitations, we propose the Multi-Scale Patchify Capsule Network (MSPCaps), a novel architecture that integrates multi-scale feature learning and efficient capsule routing. Specifically, MSPCaps consists of three key components: a Multi-Scale ResNet Backbone (MSRB), a Patchify Capsule Layer (PatchifyCaps), and Cross-Agreement Routing (CAR) blocks. First, the MSRB extracts diverse multi-scale feature representations from input images, preserving both fine-grained details and global contextual information. Second, the PatchifyCaps partitions these multi-scale features into primary capsules using a uniform patch size, equipping the model with the ability to learn from diverse receptive fields. Finally, the CAR block adaptively routes the multi-scale capsules by identifying cross-scale prediction pairs with maximum agreement. Unlike the simple concatenation of multiple self-routing blocks, CAR ensures that only the most coherent capsules contribute to the final voting. 
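The idea of "identifying cross-scale prediction pairs with maximum agreement" can be pictured with a toy example. The sketch below is not the paper's CAR block: it assumes, purely for illustration, that agreement is measured by cosine similarity between prediction vectors from two scales, and the function names and vote values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_cross_scale_pair(votes_a, votes_b):
    """Return (i, j, score) for the most-agreeing pair of votes across two scales."""
    best = None
    for i, u in enumerate(votes_a):
        for j, v in enumerate(votes_b):
            score = cosine(u, v)
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical prediction vectors from a fine and a coarse scale.
votes_fine = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
votes_coarse = [[-1.0, 0.2], [0.6, 0.8]]

i, j, score = best_cross_scale_pair(votes_fine, votes_coarse)
print(i, j, round(score, 3))
```

Here the third fine-scale vote and the second coarse-scale vote point in almost the same direction, so that pair would be the one allowed to contribute to the final vote under this simplified reading.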
Our proposed MSPCaps achieves remarkable scalability and superior robustness, consistently surpassing multiple baseline methods in terms of classification accuracy, with configurations ranging from a highly efficient Tiny model (344.3K parameters) to a powerful Large model (10.9M parameters), highlighting its potential in advancing feature representation learning.
[141] LGE-Guided Cross-Modality Contrastive Learning for Gadolinium-Free Cardiomyopathy Screening in Cine CMR
Siqing Yuan,Yulin Wang,Zirui Cao,Yueyan Wang,Zehao Weng,Hui Wang,Lei Xu,Zixian Chen,Lei Chen,Zhong Xue,Dinggang Shen
Main category: cs.CV
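TL;DR: This paper proposes CC-CMR, a gadolinium-free cardiomyopathy screening framework that uses contrastive learning and cross-modal alignment to achieve high-accuracy screening from cine CMR sequences, removing the reliance of conventional approaches on gadolinium contrast and manual interpretation.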
Details
Motivation: Cardiomyopathy is a leading cause of heart failure and sudden death and requires precise early screening. Cardiac Magnetic Resonance (CMR) is regarded as the diagnostic 'gold standard', but its reliance on gadolinium contrast and labor-intensive manual interpretation limits its use at population scale. Method: Proposes CC-CMR, a contrastive learning and cross-modal alignment framework that aligns the latent spaces of cine CMR and Late Gadolinium Enhancement (LGE) sequences to encode fibrosis-specific pathology into cine CMR embeddings. A feature interaction module optimizes diagnostic precision and cross-modal feature consistency, and an uncertainty-guided adaptive training mechanism dynamically calibrates task objectives to ensure model generalizability. Result: Evaluated on multi-center data from 231 subjects, CC-CMR achieves an accuracy of 0.943 (95% CI: 0.886-0.986), outperforming existing cine-CMR-only models by 4.3% without using gadolinium contrast. Conclusion: The CC-CMR framework achieves high-accuracy, gadolinium-free cardiomyopathy screening and shows potential for clinical use across different populations and healthcare settings. Abstract: Cardiomyopathy, a principal contributor to heart failure and sudden cardiac mortality, demands precise early screening. Cardiac Magnetic Resonance (CMR), recognized as the diagnostic 'gold standard' through multiparametric protocols, holds the potential to serve as an accurate screening tool. However, its reliance on gadolinium contrast and labor-intensive interpretation hinders population-scale deployment. We propose CC-CMR, a Contrastive Learning and Cross-Modal alignment framework for gadolinium-free cardiomyopathy screening using cine CMR sequences. By aligning the latent spaces of cine CMR and Late Gadolinium Enhancement (LGE) sequences, our model encodes fibrosis-specific pathology into cine CMR embeddings. A Feature Interaction Module concurrently optimizes diagnostic precision and cross-modal feature congruence, augmented by an uncertainty-guided adaptive training mechanism that dynamically calibrates task-specific objectives to ensure model generalizability. Evaluated on multi-center data from 231 subjects, CC-CMR achieves an accuracy of 0.943 (95% CI: 0.886-0.986), outperforming state-of-the-art cine-CMR-only models by 4.3% while eliminating gadolinium dependency, demonstrating its clinical viability for a wide range of populations and healthcare environments.
[142] Align 3D Representation and Text Embedding for 3D Content Personalization
Qi Song,Ziyuan Luo,Ka Chun Cheung,Simon See,Renjie Wan
Main category: cs.CV
TL;DR: Invert3D is an efficient framework for 3D content personalization that requires no expensive retraining.
Details
Motivation: Existing 3D personalization methods rely on knowledge distillation and require expensive retraining, which motivates an efficient framework for 3D content personalization. Method: Invert3D personalizes 3D content by establishing alignment between 3D representations and the text embedding space. Result: Experiments show that Invert3D personalizes 3D content effectively. Conclusion: Invert3D achieves efficient 3D content personalization without an expensive retraining process. Abstract: Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. Nowadays, vision-language models such as CLIP enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D content into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.
[143] Addressing Annotation Scarcity in Hyperspectral Brain Image Segmentation with Unsupervised Domain Adaptation
Tim Mach,Daniel Rueckert,Alex Berger,Laurin Lux,Ivan Ezhov
Main category: cs.CV
TL;DR: A new deep learning framework using unsupervised domain adaptation improves segmentation of cerebral vasculature in hyperspectral brain images despite label scarcity.
Details
Motivation: The motivation stems from the critical challenge of severe label scarcity in supervised training for segmenting cerebral vasculature in hyperspectral brain images. Method: A novel deep learning framework incorporating unsupervised domain adaptation was developed, utilizing a small, expert-annotated ground truth alongside unlabeled data. Result: Quantitative and qualitative evaluations show that the proposed method significantly outperforms existing state-of-the-art approaches. Conclusion: The proposed unsupervised domain adaptation methodology effectively addresses the challenge of label scarcity in biomedical imaging tasks, specifically in segmenting cerebral vasculature in hyperspectral brain images. Abstract: This work presents a novel deep learning framework for segmenting cerebral vasculature in hyperspectral brain images. We address the critical challenge of severe label scarcity, which impedes conventional supervised training. Our approach utilizes a novel unsupervised domain adaptation methodology, using a small, expert-annotated ground truth alongside unlabeled data. Quantitative and qualitative evaluations confirm that our method significantly outperforms existing state-of-the-art approaches, demonstrating the efficacy of domain adaptation for label-scarce biomedical imaging tasks.
[144] NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability
Krishna Kanth Nakka,Alexandre Alahi
Main category: cs.CV
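TL;DR: NAT improves the transferability of adversarial perturbations by targeting specific neurons, achieving higher fooling rates across multiple models and domains while needing only a small number of queries.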
Details
Motivation: Previous embedding-level optimization methods tend to focus disproportionately on a few neurons that represent similar concepts, leaving the other neurons in the attacked layer minimally affected. Method: NAT (Neuron Attack for Transferability) attacks specific neurons in the network rather than optimizing at the embedding level as in prior work. Result: In extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT's fooling rates exceed those of existing baselines, and significant fooling rates are reached within just 10 queries. Conclusion: By targeting specific neurons, NAT achieves fooling rates more than 14% and 4% higher than existing baselines in cross-model and cross-domain settings, respectively, and by exploiting the complementary attacking capabilities of the trained generators it reaches significant fooling rates within just 10 queries. Abstract: The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models. Through extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT achieves fooling rates that surpass existing baselines by over 14% in cross-model and 4% in cross-domain settings. Furthermore, by leveraging the complementary attacking capabilities of the trained generators, we achieve impressive fooling rates within just 10 queries. Our code is available at: https://krishnakanthnakka.github.io/NAT/
[145] HieroAction: Hierarchically Guided VLM for Fine-Grained Action Analysis
Junhao Wu,Xiuer Gu,Zhiying Li,Yeying Jin,Yunfeng Diao,Zhiyu Li,Zhenbo Song,Xiaomei Zhang,Zhaoxin Fan
Main category: cs.CV
TL;DR: HieroAction is a model that provides accurate and interpretable assessments of human actions, with applications in sports, healthcare, and robotics.
Details
Motivation: Existing methods provide only a final score without detailed explanation or analysis, which limits their practical use. Method: Proposes HieroAction, which combines Stepwise Action Reasoning and Hierarchical Policy Learning. Result: The model shows superior performance on multiple benchmark datasets; code will be released upon acceptance. Conclusion: By combining stepwise action reasoning with hierarchical policy learning, HieroAction delivers accurate and interpretable assessments of human actions. Abstract: Evaluating human actions with clear and detailed feedback is important in areas such as sports, healthcare, and robotics, where decisions rely not only on final outcomes but also on interpretable reasoning. However, most existing methods provide only a final score without explanation or detailed analysis, limiting their practical applicability. To address this, we introduce HieroAction, a vision-language model that delivers accurate and structured assessments of human actions. HieroAction builds on two key ideas: (1) Stepwise Action Reasoning, a tailored chain-of-thought process designed specifically for action assessment, which guides the model to evaluate actions step by step, from overall recognition through sub-action analysis to final scoring, thus enhancing interpretability and structured understanding; and (2) Hierarchical Policy Learning, a reinforcement learning strategy that enables the model to learn fine-grained sub-action dynamics and align them with high-level action quality, thereby improving scoring precision. The reasoning pathway structures the evaluation process, while policy learning refines each stage through reward-based optimization. Their integration ensures accurate and interpretable assessments, as demonstrated by superior performance across multiple benchmark datasets. Code will be released upon acceptance.
[146] RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze
Ruicheng Zhang,Puxin Yan,Zeyu Zhang,Yicheng Chang,Hongyi Chen,Zhi Jin
Main category: cs.CV
TL;DR: RPD-Diff is a novel dehazing approach that improves visibility in dense and non-uniform haze conditions using a physics-guided diffusion model, outperforming traditional methods with high-quality image restoration.
Details
Motivation: Single-image dehazing in dense and non-uniform haze conditions is challenging due to information degradation and spatial heterogeneity. Traditional diffusion-based methods lack adaptability and sufficient conditioning, leading to suboptimal results. Method: The paper introduces RPD-Diff, a Region-adaptive Physics-guided Dehazing Diffusion Model. It incorporates a Physics-guided Intermediate State Targeting (PIST) strategy and a Haze-Aware Denoising Timestep Predictor (HADTP) with a transmission map cross-attention mechanism. Result: Extensive experiments on four real-world datasets show that RPD-Diff outperforms existing methods in challenging haze scenarios, producing haze-free images with superior detail clarity and color fidelity. Conclusion: The proposed RPD-Diff method effectively addresses challenges in single-image dehazing under dense and non-uniform haze conditions, achieving state-of-the-art performance with high-quality results. Abstract: Single-image dehazing under dense and non-uniform haze conditions remains challenging due to severe information degradation and spatial heterogeneity. Traditional diffusion-based dehazing methods struggle with insufficient generation conditioning and lack of adaptability to spatially varying haze distributions, which leads to suboptimal restoration. To address these limitations, we propose RPD-Diff, a Region-adaptive Physics-guided Dehazing Diffusion Model for robust visibility enhancement in complex haze scenarios. RPD-Diff introduces a Physics-guided Intermediate State Targeting (PIST) strategy, which leverages physical priors to reformulate the diffusion Markov chain by generation target transitions, mitigating the issue of insufficient conditioning in dense haze scenarios. Additionally, the Haze-Aware Denoising Timestep Predictor (HADTP) dynamically adjusts patch-specific denoising timesteps employing a transmission map cross-attention mechanism, adeptly managing non-uniform haze distributions. 
Extensive experiments across four real-world datasets demonstrate that RPD-Diff achieves state-of-the-art performance in challenging dense and non-uniform haze scenarios, delivering high-quality, haze-free images with superior detail clarity and color fidelity.
[147] Local Information Matters: A Rethink of Crowd Counting
Tianhang Pan,Xiuyi Jia
Main category: cs.CV
TL;DR: This paper proposes LIMM, a new crowd counting model that focuses on local modeling capabilities through window partitioning and contrastive learning, achieving state-of-the-art results.
Details
Motivation: The motivation stems from the observation that existing crowd counting methods do not focus on the unique characteristic where individuals occupy a small portion of the image, typically using backbones designed for other visual tasks. Method: The paper introduces a new model design principle for crowd counting, incorporating a window partitioning design, window-wise contrastive learning, and a global attention module. Result: Extensive experiments demonstrate that the LIMM model significantly improves local modeling capability (e.g., 8.7% improvement in MAE on the JHU-Crowd++ high-density subset) without compromising performance on large-sized individuals. Conclusion: The proposed LIMM model achieves state-of-the-art performance in crowd counting by emphasizing local modeling capabilities while maintaining the ability to handle large-sized individuals. Abstract: The motivation of this paper originates from rethinking an essential characteristic of crowd counting: individuals (heads of humans) in the crowd counting task typically occupy a very small portion of the image. This characteristic has never been the focus of existing works: they typically use the same backbone as other visual tasks and pursue a large receptive field. This drives us to propose a new model design principle of crowd counting: emphasizing local modeling capability of the model. We follow the principle and design a crowd counting model named Local Information Matters Model (LIMM). The main innovation lies in two strategies: a window partitioning design that applies grid windows to the model input, and a window-wise contrastive learning design to enhance the model's ability to distinguish between local density levels. Moreover, a global attention module is applied to the end of the model to handle the occasionally occurring large-sized individuals. 
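The window partitioning idea (applying grid windows to the model input) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LIMM's implementation: the input is a small 2D list standing in for a density map, windows are non-overlapping squares, and the function name `partition_windows` is hypothetical.

```python
def partition_windows(image, win):
    """Split a 2D list into non-overlapping win x win windows, row-major order."""
    h, w = len(image), len(image[0])
    if h % win or w % win:
        raise ValueError("image size must be divisible by the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([row[left:left + win] for row in image[top:top + win]])
    return windows


# Toy 4 x 4 "density map": each entry is the head count in that cell.
density = [
    [0, 1, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
]

windows = partition_windows(density, 2)
window_counts = [sum(sum(row) for row in win) for win in windows]
total = sum(sum(row) for row in density)

print(window_counts)  # per-window counts
print(total)          # the window counts add up to the global count
```

Because each window is processed on its own, the per-window counts expose local density levels directly, while their sum still recovers the global count.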
Extensive experiments on multiple public datasets illustrate that the proposed model shows a significant improvement in local modeling capability (for example, 8.7% in MAE on the JHU-Crowd++ high-density subset) without compromising its ability to count large-sized individuals, achieving state-of-the-art performance. Code is available at: https://github.com/tianhangpan/LIMM.
[148] Robust Diagram Reasoning: A Framework for Enhancing LVLM Performance on Visually Perturbed Scientific Diagrams
Minghao Zhou,Rafael Souza,Yaqian Hu,Luming Che
Main category: cs.CV
TL;DR: This paper introduces RDR, a framework for evaluating and improving LVLM robustness against visual perturbations in scientific diagrams, revealing that even top models like GPT-4V struggle with real-world image degradation.
Details
Motivation: LVLMs show promise in processing scientific diagrams but lack robustness against real-world visual perturbations like noise and blur. Current benchmarks do not adequately evaluate this aspect. Method: The researchers introduced the Robust Diagram Reasoning (RDR) framework, which uses an Adaptive Multi-View & Consistency Verification (AMCV) mechanism involving multiple perturbed diagram versions, parallel inference, and consistency-based self-correction. They also proposed two new metrics, PRS and VDC, and built the SciDiagram-Robust dataset. Result: Experiments showed that even state-of-the-art LVLMs suffer performance degradation under visual perturbations, with GPT-4V dropping from 85.2% clean accuracy to 72.1% on the Perturbation Robustness Score (PRS). Conclusion: The study concludes that existing LVLMs, including advanced models like GPT-4V, lack sufficient robustness to visual perturbations in scientific diagrams, highlighting the need for better evaluation frameworks like RDR. Abstract: Large Language Models (LLMs) and their multimodal variants (LVLMs) hold immense promise for scientific and engineering applications, particularly in processing visual information like scientific diagrams. However, their practical deployment is hindered by a critical lack of robustness to common visual perturbations such as noise, blur, and occlusions, which are prevalent in real-world scientific documents. Existing evaluation benchmarks largely overlook this challenge, leaving the robust reasoning capabilities of LVLMs on visually degraded scientific diagrams underexplored. To address this, we introduce the Robust Diagram Reasoning (RDR) framework, a novel approach designed to enhance and rigorously evaluate LVLMs' performance under such conditions. 
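The perturbation types named above (noise, blur, and occlusion) are easy to generate programmatically. The sketch below is only an illustration: it operates on a tiny grayscale image stored as a 2D list with values in [0, 1], and the function names, parameters, and perturbation strengths are hypothetical rather than taken from the RDR framework.

```python
import random


def add_noise(img, sigma, seed=0):
    """Add Gaussian noise to every pixel and clamp the result to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def box_blur(img):
    """3 x 3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neigh = [img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(neigh) / len(neigh))
        out.append(row)
    return out


def occlude(img, top, left, size):
    """Return a copy with a size x size square set to black."""
    out = [row[:] for row in img]
    for y in range(top, min(len(img), top + size)):
        for x in range(left, min(len(img[0]), left + size)):
            out[y][x] = 0.0
    return out


# Toy "diagram": a white diagonal line on a black 5 x 5 background.
diagram = [[1.0 if x == y else 0.0 for x in range(5)] for y in range(5)]
views = {
    "noise": add_noise(diagram, 0.1),
    "blur": box_blur(diagram),
    "occlusion": occlude(diagram, 1, 1, 2),
}
```

Each function returns a new image and leaves the clean diagram untouched, so the same input can be perturbed several ways and the answers compared.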
At its core, RDR employs an Adaptive Multi-View & Consistency Verification (AMCV) mechanism, which involves generating multiple perturbed versions of a diagram, performing parallel inference, and then applying a consistency-based self-correction loop. We also propose two new metrics, Perturbation Robustness Score (PRS) and Visual Degradation Consistency (VDC), to quantify robustness. Furthermore, we construct SciDiagram-Robust, the first large-scale scientific diagram question-answering dataset specifically augmented with diverse, programmatically generated visual perturbations. Our extensive experiments demonstrate that even state-of-the-art closed-source LVLMs like GPT-4V exhibit significant performance degradation when faced with perturbed inputs (Clean Accuracy 85.2% vs. PRS 72.1%).
[149] Balanced Sharpness-Aware Minimization for Imbalanced Regression
Yahao Liu,Qin Wang,Lixin Duan,Wen Li
Main category: cs.CV
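TL;DR: This paper proposes BSAM, which tackles imbalanced regression by making the model's generalization ability more uniform across the observation space.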
Details
Motivation: To address the poor performance of regression models on rare target values when real-world data is imbalanced. Method: Proposes Balanced Sharpness-Aware Minimization (BSAM), which introduces a targeted reweighting strategy to make the generalization ability of regression models more uniform across the observation space. Result: Experiments show that BSAM outperforms existing methods on tasks such as age estimation and depth estimation. Conclusion: BSAM outperforms existing methods on multiple vision regression tasks and comes with a theoretical generalization bound. Abstract: Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, target localization, etc. However, real-world data often exhibits imbalanced distribution, making regression models perform poorly especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle that, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce the uniform generalization ability of regression models for the entire observation space. In particular, we start from the traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code is available at https://github.com/manmanjun/BSAM_for_Imbalanced_Regression.
[150] Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
Leilei Guo,Antonio Carlos Rivera,Peiyu Tang,Haoxuan Ren,Zheyu Song
Main category: cs.CV
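TL;DR: This paper proposes HCG-LVLM, a new vision-language model that mimics human coarse-to-fine cognitive processing and significantly improves robustness, precision, and fine-grained understanding in complex scenes.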
Details
Motivation: Current large vision-language models show insufficient robustness, a tendency to hallucinate, and reasoning errors in complex real-world scenarios, especially when precise image region localization and fine-grained visual reasoning are required. Method: Proposes HCG-LVLM, a new architecture that mimics human coarse-to-fine cognitive processing, with a Global Contextual Perception layer and a Fine-grained Local Grounding layer whose outputs are integrated by an adaptive fusion mechanism. Result: Extensive experiments on challenging datasets including GQA, A-OKVQA, and RefCOCO/+/g show that HCG-LVLM outperforms existing state-of-the-art models such as Flamingo, BLIP-2, and MiniGPT-4 in accuracy and in reducing hallucination. Conclusion: Through its hierarchical design, HCG-LVLM improves fine-grained visual-language understanding and precise image region grounding while reducing hallucination. Abstract: Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) have achieved remarkable progress in natural language processing and multimodal understanding. Despite their impressive generalization capabilities, current LVLMs often exhibit insufficient robustness, proneness to hallucination, and reasoning errors in complex real-world scenarios, particularly when precise image region localization and fine-grained visual reasoning are required. To address these limitations, we propose the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel architecture that mimics human coarse-to-fine cognitive processing. HCG-LVLM employs a two-layered approach: a Global Contextual Perception layer for initial broad understanding and a Fine-grained Local Grounding layer. The latter incorporates a Local Detail Enhancement Module to extract high-resolution features and a Semantic Consistency Validator to ensure accurate, hallucination-free visual-language alignment. Through an adaptive fusion mechanism, information from both layers is integrated for robust and precise outputs. Extensive experiments on challenging datasets, including GQA, A-OKVQA for fine-grained VQA, and RefCOCO/+/g for Referring Expression Comprehension, demonstrate that HCG-LVLM consistently outperforms state-of-the-art models such as Flamingo, BLIP-2, and MiniGPT-4. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design in enhancing fine-grained visual-language understanding and precise grounding capabilities.
[151] Combating Digitally Altered Images: Deepfake Detection
Saksham Kumar,Rhythm Narang
Main category: cs.CV
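TL;DR: This study develops an efficient Deepfake image detection method based on a modified Vision Transformer (ViT) model and reports excellent performance on the test dataset.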
Details
Motivation: The rise of Deepfake technology poses a significant challenge to the public and relevant authorities, so a robust and effective Deepfake detection method is needed. Method: A modified Vision Transformer (ViT) model is trained with multiple augmentation techniques to distinguish real images from Deepfake images. Class imbalance is handled by oversampling and a stratified train-validation split. Result: The model shows state-of-the-art results on the test dataset and detects Deepfake images meticulously. Conclusion: The study proposes a Deepfake detection method based on a modified Vision Transformer (ViT) and shows excellent detection performance on a subset of the OpenForensics dataset. Abstract: The rise of Deepfake technology to generate hyper-realistic manipulated images and videos poses a significant challenge to the public and relevant authorities. This study presents a robust Deepfake detection method based on a modified Vision Transformer (ViT) model, trained to distinguish between real and Deepfake images. The model has been trained on a subset of the OpenForensics Dataset with multiple augmentation techniques to increase robustness for diverse image manipulations. The class imbalance issues are handled by oversampling and a train-validation split of the dataset in a stratified manner. Performance is evaluated using the accuracy metric on the training and testing datasets, followed by a prediction score on a random image of people, irrespective of their realness. The model demonstrates state-of-the-art results on the test dataset to meticulously detect Deepfake images.
[152] Preserving Domain Generalization in Fine-Tuning via Joint Parameter Selection
Bin Pan,Shiyu Shen,Zongbin Wang,Zhenwei Shi,Xia Xu
Main category: cs.CV
TL;DR: JPS is a novel domain generalization method that efficiently fine-tunes only a sparse subset of model parameters, resulting in superior performance while preserving the generalization capabilities of pre-trained models.
Details
Motivation: Full fine-tuning can compromise the intrinsic generalization capabilities of large-scale pre-trained vision models, necessitating parameter-efficient adaptation strategies. Method: Joint Parameter Selection (JPS) identifies and updates parameters with consistent and significant gradients across all source domains using dual operators. Result: Extensive benchmark experiments demonstrate that JPS achieves superior performance compared to state-of-the-art domain generalization methods. Conclusion: JPS is an effective domain generalization approach that selectively fine-tunes a sparse subset of parameters, preserving the generalization capabilities of pre-trained models. Abstract: Domain generalization seeks to develop models trained on a limited set of source domains that are capable of generalizing effectively to unseen target domains. While the predominant approach leverages large-scale pre-trained vision models as initialization, recent studies have highlighted that full fine-tuning can compromise the intrinsic generalization capabilities of these models. To address this limitation, parameter-efficient adaptation strategies have emerged, wherein only a subset of model parameters is selectively fine-tuned, thereby balancing task adaptation with the preservation of generalization. Motivated by this paradigm, we introduce Joint Parameter Selection (JPS), a novel method that restricts updates to a small, sparse subset of parameters, thereby retaining and harnessing the generalization strength of pre-trained models. Theoretically, we establish a generalization error bound that explicitly accounts for the sparsity of parameter updates, thereby providing a principled justification for selective fine-tuning. Practically, we design a selection mechanism employing dual operators to identify and update parameters exhibiting consistent and significant gradients across all source domains. 
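To make "parameters exhibiting consistent and significant gradients across all source domains" concrete, here is a minimal sketch. It is not the JPS selection mechanism itself: the two criteria below (same gradient sign in every domain, and mean gradient magnitude above a threshold) are assumed stand-ins for the paper's dual operators, and the gradient values and threshold are made up.

```python
def select_parameters(domain_grads, min_magnitude):
    """Return indices of parameters whose gradients agree in sign across all
    domains (consistency) and whose mean gradient is large enough (significance)."""
    n_params = len(domain_grads[0])
    selected = []
    for p in range(n_params):
        grads = [g[p] for g in domain_grads]
        consistent = all(g > 0 for g in grads) or all(g < 0 for g in grads)
        significant = abs(sum(grads) / len(grads)) >= min_magnitude
        if consistent and significant:
            selected.append(p)
    return selected


# Hypothetical per-parameter gradients from three source domains.
domain_grads = [
    [0.9, -0.2, 0.01, -0.5],
    [0.7, 0.3, 0.02, -0.6],
    [0.8, -0.1, 0.01, -0.4],
]

mask = select_parameters(domain_grads, min_magnitude=0.1)
print(mask)  # only these parameters would be updated during fine-tuning
```

In this toy case the second parameter is dropped because its gradient sign flips between domains, and the third because its gradient is consistently tiny; the rest of the pre-trained weights would stay frozen.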
Extensive benchmark experiments demonstrate that JPS achieves superior performance compared to state-of-the-art domain generalization methods, substantiating both the efficiency and efficacy of the proposed approach.
[153] HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching
Liang Feng,Shikang Zheng,Jiacheng Liu,Yuqi Lin,Qinming Zhou,Peiliang Cai,Xinyu Wang,Junjie Chen,Chang Zou,Yue Ma,Linfeng Zhang
Main category: cs.CV
TL;DR: This paper introduces HiCache, a training-free acceleration framework for diffusion models that significantly speeds up inference while maintaining output quality by leveraging Hermite polynomials and a dual-scaling mechanism.
Details
Motivation: Diffusion models suffer from high computational costs due to iterative sampling, and existing feature caching methods still experience quality loss due to the inability to model complex feature dynamics. Method: HiCache uses Hermite polynomials for feature prediction based on the insight that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics. It also introduces a dual-scaling mechanism for numerical stability and predictive accuracy. Result: HiCache achieves a 6.24x speedup on FLUX.1-dev while maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Conclusion: HiCache is a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties, achieving a 6.24x speedup on FLUX.1-dev while maintaining or exceeding baseline quality across various tasks. Abstract: Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods accelerate inference through temporal extrapolation, they still suffer from severe quality loss because they fail to model the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache, a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the potentially theoretically optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy.
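As a rough illustration of swapping the extrapolation basis, the sketch below evaluates physicists' Hermite polynomials with the standard recurrence and plugs them into a cached-difference forecast. The abstract does not give HiCache's actual prediction formula or its dual-scaling rule, so `scaled_hermite_basis`, the `sigma` factor, and the scalar "feature" are hypothetical stand-ins; only the recurrence itself is standard.

```python
import math


def hermite_values(x, order):
    """Physicists' Hermite polynomials H_0..H_order at x.

    Uses the recurrence H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x).
    """
    values = [1.0]
    if order >= 1:
        values.append(2.0 * x)
    for n in range(1, order):
        values.append(2.0 * x * values[n] - 2.0 * n * values[n - 1])
    return values


def monomial_basis(i, step):
    # Taylor-style extrapolation term: step^i / i!
    return step ** i / math.factorial(i)


def scaled_hermite_basis(i, step, sigma=0.5):
    # Hypothetical scaling (assumption): shrink the argument and the
    # coefficient by sigma to keep high-order terms from growing quickly.
    return (sigma ** i) * hermite_values(sigma * step, i)[i] / math.factorial(i)


def forecast(cached_diffs, step, basis):
    """Predict a feature `step` steps ahead from cached finite differences."""
    return sum(d * basis(i, step) for i, d in enumerate(cached_diffs))


diffs = [1.0, 0.5, 0.2]  # made-up 0th, 1st, 2nd order differences
taylor_prediction = forecast(diffs, 2, monomial_basis)
hermite_prediction = forecast(diffs, 2, scaled_hermite_basis)
print(taylor_prediction, hermite_prediction)
```

The point of the sketch is only that the cached quantities stay the same while the basis functions change; the skipped timesteps are filled in by such a forecast instead of a full network evaluation.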
Extensive experiments demonstrate HiCache's superiority: it achieves a 6.24x speedup on FLUX.1-dev while exceeding baseline quality, and it maintains strong performance across text-to-image, video generation, and super-resolution tasks. Core implementation is provided in the appendix, with complete code to be released upon acceptance.
[154] An Efficient Dual-Line Decoder Network with Multi-Scale Convolutional Attention for Multi-organ Segmentation
Riad Hassan,M. Rubaiyat Hossain Mondal,Sheikh Iqbal Ahamed,Fahad Mostafa,Md Mostafijur Rahman
Main category: cs.CV
TL;DR: This paper introduces EDLDNet, a medical image segmentation approach that balances accuracy and efficiency, achieving superior performance with reduced computational cost.
Details
Motivation: To address the gap in balancing segmentation accuracy with computational efficiency in deep learning-based medical image segmentation. Method: An efficient dual-line decoder segmentation network (EDLDNet) is introduced, incorporating a noisy decoder for robustness, multi-scale convolutional attention modules, attention gates, up-convolution blocks, and a mutation-based loss function. Result: EDLDNet outperforms state-of-the-art segmentation architectures on four publicly available medical imaging datasets, achieving a Dice score of 84.00% on the Synapse dataset and reducing MACs by 89.7% compared to UNet. Conclusion: EDLDNet demonstrates strong generalization, computational efficiency, and robustness across diverse medical imaging datasets. Abstract: Proper segmentation of organs-at-risk is important for radiation therapy, surgical planning, and diagnostic decision-making in medical image analysis. While deep learning-based segmentation architectures have made significant progress, they often fail to balance segmentation accuracy with computational efficiency. Most of the current state-of-the-art methods either prioritize performance at the cost of high computational complexity or compromise accuracy for efficiency. This paper addresses this gap by introducing an efficient dual-line decoder segmentation network (EDLDNet). The proposed method features a noisy decoder, which learns to incorporate structured perturbation at training time for better model robustness, yet at inference time only the noise-free decoder is executed, leading to lower computational cost. Multi-Scale convolutional Attention Modules (MSCAMs), Attention Gates (AGs), and Up-Convolution Blocks (UCBs) are further utilized to optimize feature representation and boost segmentation performance. By leveraging multi-scale segmentation masks from both decoders, we also utilize a mutation-based loss function to enhance the model's generalization. 
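The Dice score quoted for this entry measures overlap between a predicted mask and the ground truth. A minimal sketch for flat binary masks follows; the helper name `dice_score` and the toy masks are illustrative and are not taken from the EDLDNet code.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)
print(score)
```

For multi-organ segmentation the score is typically computed per organ class and then averaged.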
Our approach outperforms SOTA segmentation architectures on four publicly available medical imaging datasets. EDLDNet achieves SOTA performance with an 84.00% Dice score on the Synapse dataset, surpassing baseline models such as UNet by 13.89% in Dice score while significantly reducing Multiply-Accumulate Operations (MACs) by 89.7%. Compared to recent approaches like EMCAD, our EDLDNet not only achieves a higher Dice score but also maintains comparable computational efficiency. The outstanding performance across diverse datasets establishes EDLDNet's strong generalization, computational efficiency, and robustness. The source code, pre-processed data, and pre-trained weights will be available at https://github.com/riadhassan/EDLDNet .
[155] Contrastive Prompt Clustering for Weakly Supervised Semantic Segmentation
Wangyu Wu,Zhenhong Chen,Xiaowen Ma,Wenqiao Zhang,Xianglin Qiu,Siqi Song,Xiaowei Huang,Fei Ma,Jimin Xiao
Main category: cs.CV
TL;DR: Contrastive Prompt Clustering (CPC) improves Weakly Supervised Semantic Segmentation by incorporating category clusters from Large Language Models and a contrastive loss for better intra-class consistency and inter-class separation.
Details
Motivation: Existing WSSS methods neglect shared semantics among related categories and lack fine-grained discrimination, prompting the need for a method that can better handle inter-class relationships and reduce confusion among visually similar categories. Method: Contrastive Prompt Clustering (CPC), which uses Large Language Models to derive category clusters and introduces a class-aware patch-level contrastive loss. Result: CPC outperforms state-of-the-art methods on the PASCAL VOC 2012 and MS COCO 2014 datasets. Conclusion: CPC demonstrates superior performance over existing state-of-the-art methods in WSSS by leveraging category clusters as semantic priors and enforcing intra-class consistency and inter-class separation. Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained attention for its cost-effectiveness. Most existing methods emphasize inter-class separation, often neglecting the shared semantics among related categories and lacking fine-grained discrimination. To address this, we propose Contrastive Prompt Clustering (CPC), a novel WSSS framework. CPC exploits Large Language Models (LLMs) to derive category clusters that encode intrinsic inter-class relationships, and further introduces a class-aware patch-level contrastive loss to enforce intra-class consistency and inter-class separation. This hierarchical design leverages clusters as coarse-grained semantic priors while preserving fine-grained boundaries, thereby reducing confusion among visually similar categories. Experiments on PASCAL VOC 2012 and MS COCO 2014 demonstrate that CPC surpasses existing state-of-the-art methods in WSSS.
[156] Fiducial Marker Splatting for High-Fidelity Robotics Simulations
Diram Tabaa,Gianni Di Caro
Main category: cs.CV
TL;DR: This paper proposes a hybrid framework combining Gaussian Splatting with structured markers to enhance robotic localization and control in complex environments, offering improved efficiency and accuracy over traditional methods.
Details
Motivation: Traditional mesh-based 3D simulations struggle in complex environments like greenhouses, and while neural rendering methods like Gaussian Splatting provide visual realism, they lack the flexibility to incorporate fiducial markers essential for robotic localization and control. Method: The study introduces a novel algorithm for generating GS-based fiducial markers (e.g., AprilTags) in cluttered scenes, integrating the photorealism of Gaussian Splatting with structured marker representations. Result: The experiments demonstrate that the proposed approach outperforms traditional image-fitting techniques in both efficiency and pose-estimation accuracy, particularly in challenging environments such as greenhouses. Conclusion: The proposed hybrid framework combining Gaussian Splatting with structured marker representations offers improved efficiency and pose-estimation accuracy, making it suitable for real-world robotic applications in complex environments like greenhouses. Abstract: High-fidelity 3D simulation is critical for training mobile robots, but its traditional reliance on mesh-based representations often struggles in complex environments, such as densely packed greenhouses featuring occlusions and repetitive structures. Recent neural rendering methods, like Gaussian Splatting (GS), achieve remarkable visual realism but lack the flexibility to incorporate fiducial markers, which are essential for robotic localization and control. We propose a hybrid framework that combines the photorealism of GS with structured marker representations. Our core contribution is a novel algorithm for efficiently generating GS-based fiducial markers (e.g., AprilTags) within cluttered scenes. Experiments show that our approach outperforms traditional image-fitting techniques in both efficiency and pose-estimation accuracy. We further demonstrate the framework's potential in a greenhouse simulation.
This agricultural setting serves as a challenging testbed, as its combination of dense foliage, similar-looking elements, and occlusions pushes the limits of perception, thereby highlighting the framework's value for real-world applications.
[157] Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation
Konstantina Nikolaidou,George Retsinas,Giorgos Sfikas,Silvia Cascianelli,Rita Cucchiara,Marcus Liwicki
Main category: cs.CV
TL;DR: This paper introduces a new sampling guidance strategy, Dual Orthogonal Guidance (DOG), to improve the clarity and style variability of diffusion-based handwritten text generation, especially for challenging writing styles and out-of-vocabulary words.
Details
Motivation: Diffusion-based Handwritten Text Generation (HTG) approaches struggle with style variability and generation clarity, often producing artifacts or distortions that negatively affect readability, especially for hard-to-produce styles. The aim is to find a more stable and diverse output generation method. Method: The paper proposes a novel sampling guidance strategy called Dual Orthogonal Guidance (DOG), which leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. A triangular schedule is applied to control the strength of the guidance across the denoising process. Result: Experimental results on state-of-the-art models (DiffusionPen and One-DM) show that DOG improves both content clarity and style variability in generated handwritten text. Conclusion: DOG presents a promising solution for improving clarity and variability in handwritten text generation, particularly for out-of-vocabulary words and challenging writing styles. Abstract: Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. In particular, standard diffusion models tend to produce artifacts or distortions that negatively affect the readability of the generated text, especially when the style is hard to produce. To tackle these issues, we propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. This approach helps steer the generation away from artifacts while maintaining the intended content, and encourages more diverse, yet plausible, outputs. 
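The two ingredients named for DOG, an orthogonal projection between prompt embeddings and a triangular guidance schedule, can be sketched in a few lines. The abstract does not specify how the projection enters the denoising update, so this shows only the vector decomposition and a schedule shape; the function names, the two-dimensional embeddings, and the `peak` parameter are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(negative, positive):
    """Split `negative` into parts parallel and orthogonal to `positive`."""
    scale = dot(negative, positive) / dot(positive, positive)
    parallel = [scale * p for p in positive]
    orthogonal = [n - q for n, q in zip(negative, parallel)]
    return parallel, orthogonal


def triangular_weight(step, total_steps, peak=1.0):
    """Guidance strength: zero at both ends, `peak` at the middle step."""
    if total_steps < 2:
        return peak
    position = step / (total_steps - 1)  # 0.0 at start, 1.0 at end
    return peak * (1.0 - abs(2.0 * position - 1.0))


# Toy embeddings standing in for positive and perturbed negative prompts.
parallel, orthogonal = decompose([1.0, 1.0], [2.0, 0.0])
weights = [triangular_weight(s, 5) for s in range(5)]
print(parallel, orthogonal, weights)
```

The orthogonal part is, by construction, the component of the negative embedding that carries no content shared with the positive prompt, which matches the stated aim of steering away from artifacts while keeping the intended text.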
Unlike standard Classifier-Free Guidance (CFG), which relies on unconditional predictions and produces noise at high guidance scales, DOG introduces a more stable, disentangled direction in the latent space. To control the strength of the guidance across the denoising process, we apply a triangular schedule: weak at the start and end of denoising, when the process is most sensitive, and strongest in the middle steps. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.
[158] Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
Liping Xie,Yang Tan,Shicheng Jing,Huimin Lu,Kanjian Zhang
Main category: cs.CV
TL;DR: The paper proposes a novel Probabilistic Temporal Masked Attention (PTMA) model for Online Action Detection (OAD) that effectively handles cross-view variations, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: The mainstream OAD models are sensitive to varying video viewpoints, which hampers their generalization when confronted with unseen sources. This limitation motivates the development of a more robust model that can handle cross-view variations. Method: The paper proposes a Probabilistic Temporal Masked Attention (PTMA) model that uses probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. It employs a GRU-based temporal masked attention (TMA) cell to query the input video sequence and integrates multi-view information into probabilistic modeling for view-invariant feature extraction. Result: The proposed PTMA model achieves state-of-the-art performance on three datasets (DAHLIA, IKEA ASM, and Breakfast) under three evaluation protocols (cross-subject, cross-view, and cross-subject-view). Conclusion: The PTMA model achieves state-of-the-art performance on DAHLIA, IKEA ASM, and Breakfast datasets under cross-subject, cross-view, and cross-subject-view evaluation protocols. Abstract: As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. 
Experiments conducted under three evaluation protocols, cross-subject (cs), cross-view (cv), and cross-subject-view (csv), show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.
[159] A Novel Local Focusing Mechanism for Deepfake Detection Generalization
Mingliang Li,Lin Yuanbo Wu,Changhong Liu,Hanxi Li
Main category: cs.CV
TL;DR: The paper proposes a new Local Focus Mechanism (LFM) for deepfake detection that improves cross-domain generalization by focusing on discriminative local features and using regularization to prevent overfitting, outperforming existing methods in both accuracy and efficiency.
Details
Motivation: The motivation stems from the limitations of existing deepfake detection methods based on reconstruction learning and deep CNNs, which struggle with generalization across different object categories and generation domains due to overfitting and loss of local forgery cues through Global Average Pooling. Method: The method introduces a Salience Network (SNet) and a Top-K Pooling (TKP) module to focus on discriminative local features. It also incorporates two regularization techniques, Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), to prevent overfitting. Result: The LFM approach achieves a 3.7% improvement in accuracy and a 2.8% increase in average precision compared to the state-of-the-art NPR method, while maintaining high efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Conclusion: The proposed Local Focus Mechanism (LFM) effectively improves the robustness and generalizability of deepfake detection across different object categories and generation domains, setting a new benchmark for cross-domain deepfake detection. Abstract: The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. 
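The contrast between Global Average Pooling and the Top-K Pooling named in the Method summary can be seen on a toy score map. This is a minimal sketch on a flat list of per-patch scores; the function names and the numbers are illustrative and do not reproduce the LFM implementation, its Salience Network, or its regularizers.

```python
def global_average_pool(scores):
    """Average over every patch: strong local responses get diluted."""
    return sum(scores) / len(scores)


def top_k_pool(scores, k):
    """Average over only the k highest-scoring patches."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# Eight hypothetical patch scores: two patches carry a forgery cue.
patch_scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
gap_value = global_average_pool(patch_scores)
topk_value = top_k_pool(patch_scores, k=2)
print(gap_value, topk_value)
```

With these numbers the averaged score is dominated by the six uninformative patches, while the top-2 pool keeps the full response of the two suspicious ones.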
To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model's robustness. LFM achieves a 3.7% improvement in accuracy and a 2.8% increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code is available at https://github.com/lmlpy/LFM.git
[160] F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search
Raghul Asokan
Main category: cs.CV
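TL;DR: This study proposes F4-ITS, a training-free food image-to-text retrieval framework that significantly improves retrieval performance through multimodal feature fusion and a re-ranking mechanism, and is especially suited to resource-constrained settings.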
Details
Motivation: The surge in digital food content has increased the need for systems capable of fine-grained visual understanding and retrieval, especially in applications such as dietary monitoring, smart kitchens, and restaurant automation. Method: The paper proposes F4-ITS, a training-free, vision-language model (VLM)-guided framework that combines image embeddings with VLM-generated textual descriptions and employs a feature-based re-ranking mechanism. Result: F4-ITS achieves substantial gains over standard baselines: top-1 retrieval improves by about 10% and 7.7% in dense and sparse caption scenarios, respectively, and top-k ingredient-level retrieval improves by 28.6%. Conclusion: The F4-ITS framework performs well on food image-to-text retrieval, particularly in resource-constrained settings, where textual fusion enables smaller models to match or exceed the performance of larger ones. Abstract: The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval, leveraging predicted food ingredients to refine results and boost precision. Leveraging open-source image-text encoders, we demonstrate substantial gains over standard baselines, achieving ~10% and ~7.7% improvements in top-1 retrieval under dense and sparse caption scenarios, and a ~28.6% gain in top-k ingredient-level retrieval. Additionally, we show that smaller models (e.g., ViT-B/32) can match or outperform larger counterparts (e.g., ViT-H, ViT-G, ViT-bigG) when augmented with textual fusion, highlighting the effectiveness of our method in resource-constrained settings. Code and test datasets will be made publicly available at: https://github.com/mailcorahul/f4-its
[161] M3DMap: Object-aware Multimodal 3D Mapping for Dynamic Environments
Dmitry Yudin
Main category: cs.CV
TL;DR: This article proposes a taxonomy and a modular method called M3DMap for constructing multimodal 3D maps in static and dynamic environments, integrating data like images and point clouds, and highlights its advantages in practical applications.
Details
Motivation: 3D mapping in dynamic environments is challenging due to the lack of universal representations that can integrate multimodal data such as images, point clouds, and text. This research aims to address this gap by proposing a structured approach to multimodal 3D mapping. Method: The article proposes a taxonomy of methods for constructing multimodal 3D maps, classifying approaches based on scene types, representations, learning methods, and applications. It introduces a modular method called M3DMap, which includes modules for neural multimodal object segmentation and tracking, odometry estimation, 3D map construction and updating, and multimodal data retrieval. Result: The research provides a taxonomy for multimodal 3D mapping methods and introduces the M3DMap method, which demonstrates advantages in practical tasks like 3D object grounding and mobile manipulation. Theoretical propositions also highlight the benefits of using multimodal data and foundational models. Conclusion: The article concludes that incorporating multimodal data and modern foundational models into 3D mapping methods enhances their effectiveness in handling dynamic environments, and the proposed M3DMap method provides a modular and object-aware solution for this task. Abstract: 3D mapping in dynamic environments poses a challenge for modern researchers in robotics and autonomous transportation. There are no universal representations for dynamic 3D scenes that incorporate multimodal data such as images, point clouds, and text. This article takes a step toward solving this problem. It proposes a taxonomy of methods for constructing multimodal 3D maps, classifying contemporary approaches based on scene types and representations, learning methods, and practical applications. Using this taxonomy, a brief structured analysis of recent methods is provided. 
The article also describes an original modular method called M3DMap, designed for object-aware construction of multimodal 3D maps for both static and dynamic scenes. It consists of several interconnected components: a neural multimodal object segmentation and tracking module; an odometry estimation module, including trainable algorithms; a module for 3D map construction and updating with various implementations depending on the desired scene representation; and a multimodal data retrieval module. The article highlights original implementations of these modules and their advantages in solving various practical tasks, from 3D object grounding to mobile manipulation. Additionally, it presents theoretical propositions demonstrating the positive effect of using multimodal data and modern foundational models in 3D mapping methods. Details of the taxonomy and method implementation are available at https://yuddim.github.io/M3DMap.
[162] Styleclone: Face Stylization with Diffusion Based Data Augmentation
Neeraj Matiyali,Siddharth Srivastava,Gaurav Sharma
Main category: cs.CV
TL;DR: StyleClone augments small style datasets to train fast image-to-image translation networks with better stylization quality.
Details
Motivation: To improve the stylization capability and diversity of image-to-image translation networks when style images are limited. Method: StyleClone uses textual inversion and diffusion-based guided image generation to augment small style datasets. By systematically generating diverse style samples, it trains fast image-to-image translation networks on the augmented dataset. Result: StyleClone significantly increases the diversity of the style dataset, and the resulting image-to-image translation networks outperform diffusion-based methods in both speed and quality. Conclusion: StyleClone is a method for training image-to-image translation networks that stylize faces in a specific style, even when style images are limited. Abstract: We present StyleClone, a method for training image-to-image translation networks to stylize faces in a specific style, even with limited style images. Our approach leverages textual inversion and diffusion-based guided image generation to augment small style datasets. By systematically generating diverse style samples guided by both the original style images and real face images, we significantly enhance the diversity of the style dataset. Using this augmented dataset, we train fast image-to-image translation networks that outperform diffusion-based methods in speed and quality. Experiments on multiple styles demonstrate that our method improves stylization quality, better preserves source image content, and significantly accelerates inference. Additionally, we provide a systematic evaluation of the augmentation techniques and their impact on stylization performance.
[163] PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models
Xianjing Cheng,Lintai Wu,Zuowen Wang,Junhui Hou,Jie Wen,Yong Xu
Main category: cs.CV
TL;DR: PVNet is a novel point cloud upsampling method that addresses the issue of LiDAR data sparsity in 3D scene understanding, offering improved performance and arbitrary upsampling rates.
Details
Motivation: LiDAR-scanned data often suffer from extreme sparsity, which hinders 3D perception tasks, and existing methods have limited generalization capability for complex outdoor scenes. Method: PVNet uses a diffusion model-based point-voxel interaction framework for LiDAR point cloud upsampling, incorporating a voxel completion module and a point-voxel interaction module. Result: PVNet achieves state-of-the-art performance on various benchmarks for LiDAR point cloud upsampling. Conclusion: PVNet is the first scene-level point cloud upsampling method that supports arbitrary upsampling rates and achieves state-of-the-art performance on various benchmarks. Abstract: Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. 
Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at https://github.com/chengxianjing/PVNet.
[164] DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method
Qingwen Zhang,Xiaomeng Zhu,Yushan Zhang,Yixi Cai,Olov Andersson,Patric Jensfelt
Main category: cs.CV
TL;DR: DeltaFlow is a new efficient framework for scene flow estimation that leverages temporal information effectively, achieving superior performance with minimal computational cost.
Details
Motivation: Previous methods for scene flow estimation either neglect temporal information or suffer from high computational costs with multi-frame reasoning. Method: DeltaFlow uses a $\Delta$ scheme to capture motion cues and introduces a Category-Balanced Loss and an Instance Consistency Loss to address challenges in scene flow estimation. Result: DeltaFlow achieves up to 22% lower error and 2x faster inference compared to the next-best multi-frame supervised method, with strong cross-domain generalization. Conclusion: DeltaFlow ($\Delta$Flow) is introduced as a lightweight 3D framework for scene flow estimation, achieving state-of-the-art performance with reduced computational costs and improved accuracy. Abstract: Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow ($\Delta$Flow), a lightweight 3D framework that captures motion cues via a $\Delta$ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2 and Waymo datasets show that $\Delta$Flow achieves state-of-the-art performance with up to 22% lower error and $2\times$ faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. 
The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.
[165] REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework
Stefanos Pasios,Nikos Nikolaidis
Main category: cs.CV
TL;DR: This paper introduces REGEN, a dual-stage generative adversarial network framework, for enhancing photorealism in video games in real-time, achieving faster inference and better results than existing methods.
Details
Motivation: The motivation is to address the challenge of achieving true photorealism in dynamic environments at real-time frame rates, balancing visual quality and performance. Method: The paper proposes the REGEN framework, which uses a dual-stage generative adversarial network for real-time photorealism enhancement in video games. Result: The REGEN framework achieves visual results comparable to the unpaired Im2Im method and improves inference speed by 32.14 times, outperforming lightweight unpaired Im2Im approaches. Conclusion: The paper concludes that the REGEN framework effectively enhances photorealism in video games, achieving results comparable to more robust methods while significantly improving inference speed. Abstract: Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates still remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference time without compromising visual quality. 
We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN.
[166] SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation
Peng Hu,Yu Gu,Liang Luo,Fuji Ren
Main category: cs.CV
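TL;DR: This paper proposes SSG-DiT, which uses spatial signal guidance to achieve efficient, controllable video generation, addressing the weak semantic consistency of existing models.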
Details
Motivation: Existing controllable video generation models struggle to maintain strong semantic consistency with user-specified conditions (such as text descriptions and initial images) and tend to deviate from fine details. A more precise conditional guidance method is therefore needed. Method: The method proposes a decoupled two-stage framework: the first stage uses a pre-trained multi-modal model to generate a spatially aware visual prompt, and the second stage combines this prompt with the original text condition through a lightweight SSG-Adapter and feeds it into a frozen video DiT backbone for video generation. Result: SSG-DiT outperforms existing models on multiple key metrics of the VBench benchmark, reaching SOTA performance particularly in spatial relationship control and overall consistency. Conclusion: SSG-DiT achieves efficient spatial signal guidance in video generation, significantly improving the semantic consistency between generated videos and user-specified conditions, especially in spatial relationship control and overall consistency. Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.
[167] Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry
Haoyu Yun,Hamid Krim
Main category: cs.CV
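TL;DR: This paper proposes a new framework that combines the Vision Transformer (ViT) with proximal tools to enhance feature representation and classification performance.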
Details
Motivation: ViT's optimization is confined to modeling local relationships within individual images, which limits its ability to capture global geometric relationships between data points. Method: The tangent bundle of the manifold is constructed through ViT's self-attention mechanism, and proximal iterations are introduced to define sections within the tangent bundle and project the data, achieving global feature alignment and optimization. Result: Experimental results confirm that the proposed method outperforms traditional ViT in classification accuracy and data distribution. Conclusion: The new framework combining ViT with proximal tools outperforms traditional ViT in classification accuracy and data distribution. Abstract: The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains confined to modeling local relationships within individual images, limiting its ability to capture the global geometric relationships between data points. To address this limitation, this paper proposes a novel framework that integrates ViT with proximal tools, enabling a unified geometric optimization approach to enhance feature representation and classification performance. In this framework, ViT constructs the tangent bundle of the manifold through its self-attention mechanism, where each attention head corresponds to a tangent space, offering geometric representations from diverse local perspectives. Proximal iterations are then introduced to define sections within the tangent bundle and project data from tangent spaces onto the base space, achieving global feature alignment and optimization. Experimental results confirm that the proposed method outperforms traditional ViT in terms of classification accuracy and data distribution.
[168] PD-Loss: Proxy-Decidability for Efficient Metric Learning
Pedro Silva,Guilherme A. L. Silva,Pablo Coelho,Vander Freitas,Gladston Moreira,David Menotii,Eduardo Luz
Main category: cs.CV
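TL;DR: This paper proposes PD-Loss, a new method that combines the strengths of proxy-based losses and D-Loss to optimize embedding spaces efficiently, and that applies to a range of tasks such as fine-grained classification and face verification.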
Details
Motivation: Existing deep metric learning methods are limited by complex sampling requirements and slow convergence, while proxy-based losses, although more scalable, often fail to optimize global distribution properties. D-Loss addresses this by improving distribution separability, but its reliance on large mini-batches imposes computational constraints. Method: PD-Loss combines learnable proxies with the statistical framework of d', estimating the genuine and impostor distributions through proxies to optimize the embedding space efficiently. Result: Experiments show that PD-Loss achieves performance comparable to state-of-the-art methods across various tasks and offers a new perspective on embedding optimization. Conclusion: PD-Loss is a scalable, distribution-aware deep metric learning method that combines the computational efficiency of proxy-based methods with the separability optimization of D-Loss and has broad application potential. Abstract: Deep Metric Learning (DML) aims to learn embedding functions that map semantically similar inputs to proximate points in a metric space while separating dissimilar ones. Existing methods, such as pairwise losses, are hindered by complex sampling requirements and slow convergence. In contrast, proxy-based losses, despite their improved scalability, often fail to optimize global distribution properties. The Decidability-based Loss (D-Loss) addresses this by targeting the decidability index (d') to enhance distribution separability, but its reliance on large mini-batches imposes significant computational constraints. We introduce Proxy-Decidability Loss (PD-Loss), a novel objective that integrates learnable proxies with the statistical framework of d' to optimize embedding spaces efficiently. By estimating genuine and impostor distributions through proxies, PD-Loss combines the computational efficiency of proxy-based methods with the principled separability of D-Loss, offering a scalable approach to distribution-aware DML. Experiments across various tasks, including fine-grained classification and face verification, demonstrate that PD-Loss achieves performance comparable to that of state-of-the-art methods while introducing a new perspective on embedding optimization, with potential for broader applications.
[169] GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Chengjie Jiang,Yunqi Zhou,Jiafeng Yan,Jing Li
Main category: cs.CV
TL;DR: GRASP introduces a reinforcement learning-based approach for geospatial pixel reasoning that avoids expensive supervision, achieving superior performance on both in-domain and out-of-domain data.
Details
Motivation: Current methods for geospatial pixel reasoning rely on expensive dense pixel supervision and perform poorly on out-of-domain data. The authors aim to develop a more efficient and generalizable approach by leveraging reinforcement learning and minimizing reliance on detailed annotations. Method: GRASP uses a multimodal large language model to generate bounding boxes and points from instructions, which are then used as prompts by a pre-trained segmentation model. The system is optimized using reinforcement learning with GRPO, based on format and accuracy rewards, without requiring mask supervision. Result: GRASP achieves a 4% improvement on in-domain data and up to 54% improvement on out-of-domain benchmarks. The model demonstrates strong generalization and learns complex segmentation behaviors through reinforcement learning using weak spatial cues. Conclusion: The GRASP framework achieves state-of-the-art results in geospatial pixel reasoning by using reinforcement learning without mask supervision, demonstrating robust generalization and efficient learning from weak spatial cues. Abstract: Geospatial pixel reasoning is a nascent remote-sensing task that aims to generate segmentation masks directly from natural-language instructions. Prevailing MLLM-based systems co-train a language model and a mask decoder with dense pixel supervision, which is expensive and often weak on out-of-domain (OOD) data. We introduce GRASP, a structured policy-learning framework. In our design, a multimodal large language model first emits task-relevant bounding boxes and positive points from a vision-language instruction. These outputs are then passed to a pre-trained segmentation model, which consumes them as prompts to generate the final mask. Instead of supervised fine-tuning, we optimize the system purely with reinforcement learning: the model is trained solely with GRPO, guided by format rewards and accuracy rewards computed on boxes and points (no mask supervision). 
This leverages strong priors in foundation models, minimizes trainable parameters, and enables learning from inexpensive annotations. We additionally curate GRASP-1k, which contains reasoning-intensive queries, detailed reasoning traces, and fine-grained segmentation annotations. Evaluations on both in-domain and out-of-domain test sets show state-of-the-art results: about 4% improvement in-domain and up to 54% on OOD benchmarks. The experiment results evidence our model's robust generalization and demonstrate that complex geospatial segmentation behaviors can be learned via RL from weak spatial cues. Code and the dataset will be released open-source.
[170] SugarcaneShuffleNet: A Very Fast, Lightweight Convolutional Neural Network for Diagnosis of 15 Sugarcane Leaf Diseases
Shifat E. Arman,Hasan Muhammad Abdullah,Syed Nazmus Sakib,RM Saiem,Shamima Nasrin Asha,Md Mehedi Hasan,Shahrear Bin Amin,S M Mahin Abrar
Main category: cs.CV
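TL;DR: This paper presents an efficient, interpretable solution for sugarcane leaf-disease classification in low-resource regions, comprising a new dataset, SugarcaneLD-BD; an optimized lightweight model, SugarcaneShuffleNet; and a Progressive Web Application for field deployment, SugarcaneAI.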
Details
Motivation: For lack of scalable, efficient, and interpretable tools, sugarcane farmers in low-resource regions remain vulnerable to leaf diseases. Many deep learning models generalize poorly under real-world conditions and require substantial computational resources, which limits their use in resource-constrained regions. Method: The SugarcaneLD-BD dataset was combined with two other datasets to increase diversity, and several lightweight convolutional neural networks (SugarcaneShuffleNet, MnasNet, EdgeNeXt, EfficientNet-Lite, MobileNet, and SqueezeNet) were fine-tuned and optimized via transfer learning and Bayesian optimization. Result: SugarcaneShuffleNet has a model size of 9.26 MB and achieves 98.02% accuracy, an F1-score of 0.98, and an average inference time of 4.14 ms per image. MnasNet and EdgeNeXt reach accuracy comparable to SugarcaneShuffleNet but require more parameters, memory, and computation, limiting their suitability for low-resource settings. Conclusion: The paper presents the SugarcaneLD-BD dataset, the optimized lightweight model SugarcaneShuffleNet, and the Progressive Web Application SugarcaneAI for field deployment, offering an efficient, interpretable solution for sugarcane leaf-disease classification in low-resource settings. Abstract: Despite progress in AI-based plant diagnostics, sugarcane farmers in low-resource regions remain vulnerable to leaf diseases due to the lack of scalable, efficient, and interpretable tools. Many deep learning models fail to generalize under real-world conditions and require substantial computational resources, limiting their use in resource-constrained regions. In this paper, we present SugarcaneLD-BD, a curated dataset for sugarcane leaf-disease classification; SugarcaneShuffleNet, an optimized lightweight model for rapid on-device diagnosis; and SugarcaneAI, a Progressive Web Application for field deployment. SugarcaneLD-BD contains 638 curated images across five classes, including four major sugarcane diseases, collected in Bangladesh under diverse field conditions and verified by expert pathologists. To enhance diversity, we combined SugarcaneLD-BD with two additional datasets, yielding a larger and more representative corpus. Our optimized model, SugarcaneShuffleNet, offers the best trade-off between speed and accuracy for real-time, on-device diagnosis. This 9.26 MB model achieved 98.02% accuracy, an F1-score of 0.98, and an average inference time of 4.14 ms per image. For comparison, we fine-tuned five other lightweight convolutional neural networks: MnasNet, EdgeNeXt, EfficientNet-Lite, MobileNet, and SqueezeNet via transfer learning and Bayesian optimization.
MnasNet and EdgeNeXt achieved comparable accuracy to SugarcaneShuffleNet, but required significantly more parameters, memory, and computation, limiting their suitability for low-resource deployment. We integrate SugarcaneShuffleNet into SugarcaneAI, delivering Grad-CAM-based explanations in the field. Together, these contributions offer a diverse benchmark, efficient models for low-resource environments, and a practical tool for sugarcane disease classification. It spans varied lighting, backgrounds and devices used on-farm.
[171] PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science
Syed Nazmus Sakib,Nafiul Haque,Mohammad Zabed Hossain,Shifat E. Arman
Main category: cs.CV
TL;DR: PlantVillageVQA is a large-scale visual question answering dataset intended to advance vision-language models in agriculture.
Details
Motivation: Advancing the development and evaluation of vision-language models in agriculture requires a large-scale visual question answering (VQA) dataset. Method: The PlantVillageVQA dataset generates QA pairs through an automated two-stage pipeline and was iteratively reviewed by domain experts. Result: The PlantVillageVQA dataset contains 193,609 high-quality question-answer (QA) pairs covering 14 crop species and 38 disease conditions. Conclusion: PlantVillageVQA is a high-quality dataset for developing and evaluating vision-language models for agricultural decision-making and analysis. Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
[172] CE-RS-SBCIT A Novel Channel Enhanced Hybrid CNN Transformer with Residual, Spatial, and Boundary-Aware Learning for Brain Tumor MRI Analysis
Mirza Mumtaz Zahoor,Saddam Hussain Khan
Main category: cs.CV
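TL;DR: This paper proposes CE-RS-SBCIT, a new hybrid framework for early detection and accurate classification of brain tumors that integrates residual and spatial learning CNNs with Transformer-driven modules and achieves strong performance.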
Details
Motivation: Brain tumors are among the most lethal human diseases, and early detection and accurate classification are critical for effective diagnosis and treatment planning. Conventional CNNs and Transformers face challenges on MRI data, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies. Method: The framework comprises four core innovations: a smoothing and boundary-based CNN-integrated Transformer (SBCIT), tailored residual and spatial learning CNNs, a channel enhancement (CE) strategy, and a new spatial attention mechanism. The SBCIT module combines stem convolution with contextual interaction Transformer blocks for efficient global feature modeling. Result: Extensive evaluation on MRI datasets from Kaggle and Figshare shows that the framework achieves 98.30% accuracy, 98.08% sensitivity, a 98.25% F1-score, and 98.43% precision. Conclusion: The paper proposes CE-RS-SBCIT, a new hybrid framework for early detection and accurate classification of brain tumors that integrates residual and spatial learning CNNs with Transformer-driven modules and achieves strong performance. Abstract: Brain tumors remain among the most lethal human diseases, where early detection and accurate classification are critical for effective diagnosis and treatment planning. Although deep learning-based computer-aided diagnostic (CADx) systems have shown remarkable progress, conventional convolutional neural networks (CNNs) and Transformers face persistent challenges, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies in MRI data. Therefore, a novel hybrid framework, CE-RS-SBCIT, is introduced, integrating residual and spatial learning-based CNNs with transformer-driven modules. The proposed framework exploits local fine-grained and global contextual cues through four core innovations: (i) a smoothing and boundary-based CNN-integrated Transformer (SBCIT), (ii) tailored residual and spatial learning CNNs, (iii) a channel enhancement (CE) strategy, and (iv) a novel spatial attention mechanism. The developed SBCIT employs stem convolution and contextual interaction transformer blocks with systematic smoothing and boundary operations, enabling efficient global feature modeling. Moreover, residual and spatial CNNs, enhanced by auxiliary transfer-learned feature maps, enrich the representation space, while the CE module amplifies discriminative channels and mitigates redundancy. Furthermore, the spatial attention mechanism selectively emphasizes subtle contrast and textural variations across tumor classes.
Extensive evaluation on challenging MRI datasets from Kaggle and Figshare, encompassing glioma, meningioma, pituitary tumors, and healthy controls, demonstrates superior performance, achieving 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.
[173] Structural Damage Detection Using AI Super Resolution and Visual Language Model
Catherine Hoier,Khandaker Mamun Ahmed
Main category: cs.CV
TL;DR: This study introduces a cost-effective AI-driven system that utilizes drone footage to rapidly and accurately assess structural damage after natural disasters.
Details
Motivation: The motivation stems from the challenges of traditional disaster damage assessment methods, which are labor-intensive, costly, and unsuitable for rapid response in resource-limited settings. Method: The methodology involves using aerial drone footage, Video Restoration Transformer (VRT) for video enhancement, and Gemma3:27b, a Visual Language Model, to classify structural damage into four categories. Result: The framework achieved a classification accuracy of 84.5% and allows non-technical users to perform preliminary analyses, improving disaster management responsiveness. Conclusion: The study concludes that the proposed integrated system, combining drone footage with AI models, significantly enhances the accuracy and efficiency of disaster damage assessment. Abstract: Natural disasters pose significant challenges to timely and accurate damage assessment due to their sudden onset and the extensive areas they affect. Traditional assessment methods are often labor-intensive, costly, and hazardous to personnel, making them impractical for rapid response, especially in resource-limited settings. This study proposes a novel, cost-effective framework that leverages aerial drone footage, an advanced AI-based video super-resolution model, Video Restoration Transformer (VRT), and Gemma3:27b, a 27 billion parameter Visual Language Model (VLM). This integrated system is designed to improve low-resolution disaster footage, identify structural damage, and classify buildings into four damage categories, ranging from no/slight damage to total destruction, along with associated risk levels. The methodology was validated using pre- and post-event drone imagery from the 2023 Turkey earthquakes (courtesy of The Guardian) and satellite data from the 2013 Moore Tornado (xBD dataset). The framework achieved a classification accuracy of 84.5%, demonstrating its ability to provide highly accurate results. 
Furthermore, the system's accessibility allows non-technical users to perform preliminary analyses, thereby improving the responsiveness and efficiency of disaster management efforts.
[174] Beyond Play and Pause: Turning GPT-4o Spatial Weakness into a Strength for In-Depth Interactive Video Learning
Sajad Goudarzi,Samaneh Zamanifard
Main category: cs.CV
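TL;DR: Untwist is an AI-driven interactive video learning system that combines GPT APIs with computer vision techniques, letting users ask questions about an entire video or specific regions and turning passive video viewing into an interactive learning experience.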
Details
Motivation: Traditional video-based learning offers few opportunities for dynamic interaction, and existing AI tools, while providing transcription and summarization, lack real-time, region-specific interaction capabilities. Method: By integrating GPT APIs with computer vision techniques, the system extracts, processes, and structures video content, using annotated frames to compensate for GPT-4o's spatial weakness and achieve more accurate localization and content interpretation. Result: Untwist enables real-time interaction and context-aware multimodal responses, significantly improving users' comprehension of and engagement with video content. Conclusion: Through AI-driven real-time interaction, Untwist turns traditional passive video viewing into an interactive learning experience with the potential to improve learning outcomes. Abstract: Traditional video-based learning remains passive, offering limited opportunities for users to engage dynamically with content. While current AI-powered tools offer transcription and summarization, they lack real-time, region-specific interaction capabilities. This paper introduces Untwist, an AI-driven system that enables interactive video learning by allowing users to ask questions about the entire video or specific regions using a bounding box, receiving context-aware, multimodal responses. By integrating GPT APIs with Computer Vision techniques, Untwist extracts, processes, and structures video content to enhance comprehension. Our approach addresses GPT-4o's spatial weakness by leveraging annotated frames instead of raw coordinate data, significantly improving accuracy in localizing and interpreting video content. This paper describes the system architecture, including video pre-processing and real-time interaction, and outlines how Untwist can transform passive video consumption into an interactive, AI-driven learning experience with the potential to enhance engagement and comprehension.
[175] Development of an isotropic segmentation model for medial temporal lobe subregions on anisotropic MRI atlas using implicit neural representation
Yue Li,Pulkit Khandelwal,Rohit Jena,Long Xie,Michael Duong,Amanda E. Denning,Christopher A. Brown,Laura E. M. Wisse,Sandhitsu R. Das,David A. Wolk,Paul A. Yushkevich
Main category: cs.CV
TL;DR: This study developed an isotropic MTL subregion segmentation model that improves the accuracy of Alzheimer's disease imaging biomarkers by overcoming the anisotropic resolution limitations of T2w MRI, leading to better disease tracking without additional atlas annotation work.
Details
Motivation: Accurate segmentation of MTL subregions is crucial for extracting imaging biomarkers due to the early manifestation of AD-related atrophy in this area. However, anisotropic resolution in T2w MRI poses a challenge. Method: An implicit neural representation method was used to combine T1-weighted and T2w MRI, enabling accurate upsampling of an MTL subregion atlas to isotropic space. Result: The cortical subregion thickness extracted using the isotropic model showed higher significance in distinguishing between participants with mild cognitive impairment and cognitively unimpaired participants, and greater stability in longitudinal analysis. Conclusion: The study successfully improved the accuracy of Alzheimer's disease imaging biomarkers by developing an isotropic MTL subregion segmentation model without increasing atlas annotation work. Abstract: Imaging biomarkers in magnetic resonance imaging (MRI) are important tools for diagnosing and tracking Alzheimer's disease (AD). As the medial temporal lobe (MTL) is the earliest region to show AD-related hallmarks, brain atrophy caused by AD can first be observed in the MTL. Accurate segmentation of MTL subregions and extraction of imaging biomarkers from them are important. However, due to imaging limitations, the resolution of T2-weighted (T2w) MRI is anisotropic, which makes it difficult to accurately extract the thickness of cortical subregions in the MTL. In this study, we used an implicit neural representation method to combine the resolution advantages of T1-weighted and T2w MRI to accurately upsample an MTL subregion atlas set from anisotropic space to isotropic space, establishing a multi-modality, high-resolution atlas set. Based on this atlas, we developed an isotropic MTL subregion segmentation model.
In an independent test set, the cortical subregion thickness extracted using this isotropic model showed higher significance than an anisotropic method in distinguishing between participants with mild cognitive impairment and cognitively unimpaired (CU) participants. In longitudinal analysis, the biomarkers extracted using the isotropic method showed greater stability in CU participants. This study improved the accuracy of AD imaging biomarkers without increasing the amount of atlas annotation work, which may help to more accurately quantify the relationship between AD and brain atrophy and provide more accurate measures for disease tracking.
[176] VROOM - Visual Reconstruction over Onboard Multiview
Yajat Yadav,Varun Bharadwaj,Jathin Korrapati,Tanish Baranwal
Main category: cs.CV
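TL;DR: The VROOM system uses onboard camera footage from F1 racecars and, by combining multiple methods and techniques, partially reconstructs track and vehicle trajectories, showing the feasibility of 4D reconstruction from onboard video in real-world settings.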
Details
Motivation: The authors aim to reconstruct 3D models from video captured by cameras on fast-moving racecars, in particular addressing the challenges posed by high-speed motion and abrupt changes between frames. Method: VROOM combines several SLAM and video-processing methods, such as DROID-SLAM, AnyCam, and Monst3r, with preprocessing techniques such as masking, temporal chunking, and resolution scaling. Result: VROOM partially recovers track and vehicle trajectories in complex environments, indicating that scalable 4D reconstruction from onboard video is feasible. Conclusion: The study shows that track and vehicle trajectories can be reconstructed from racecar onboard camera footage, offering new ideas and methods for future 4D reconstruction. Abstract: We introduce VROOM, a system for reconstructing 3D models of Formula 1 circuits using only onboard camera footage from racecars. Leveraging video data from the 2023 Monaco Grand Prix, we address video challenges such as high-speed motion and sharp cuts in camera frames. Our pipeline analyzes different methods such as DROID-SLAM, AnyCam, and Monst3r and combines preprocessing techniques such as different methods of masking, temporal chunking, and resolution scaling to account for dynamic motion and computational constraints. We show that VROOM is able to partially recover track and vehicle trajectories in complex environments. These findings indicate the feasibility of using onboard video for scalable 4D reconstruction in real-world settings. The project page can be found at https://varun-bharadwaj.github.io/vroom, and our code is available at https://github.com/yajatyadav/vroom.
[177] Advancing Weakly-Supervised Change Detection in Satellite Images via Adversarial Class Prompting
Zhenghui Zhao,Chen Wu,Di Wang,Hongruixuan Chen,Cuiqun Chen,Zhuo Zheng,Bo Du,Liangpei Zhang
Main category: cs.CV
TL;DR: This paper introduces AdvCP, a novel method for Weakly-Supervised Change Detection that improves performance by identifying and correcting misclassified background variations using adversarial prompting and prototype rectification.
Details
Motivation: WSCD reduces the need for dense annotations in change detection but suffers from misclassifying background variations as object changes, especially under weak supervision. This work aims to address this limitation. Method: AdvCP includes two phases: a) Adversarial Prompt Mining to reveal misclassified background variations, and b) Adversarial Sample Rectification using an online global prototype built from historical data and current batches. Result: Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines showed significant performance improvements, and AdvCP demonstrated generalizability to other multi-class weakly-supervised dense prediction tasks. Conclusion: The proposed AdvCP method effectively addresses the co-occurring noise problem in Weakly-Supervised Change Detection by leveraging adversarial prompting and online prototype rectification, significantly enhancing performance across various architectures without added inference cost. Abstract: Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. 
This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
[178] MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling
Hyeyeon Kim,Sungwoo Han,Jingun Kwon,Hidetaka Kamigaito,Manabu Okumura
Main category: cs.CV
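TL;DR: This paper proposes a multimodal pseudo-labeling method for generating cover images and summaries from text-only documents, addressing the lack of suitable datasets and outperforming unimodal methods.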
Details
Motivation: Existing datasets do not meet the needs of generating a summary and a corresponding image from text-only documents, so a low-cost, efficient dataset construction method is needed. Method: Documents containing multiple images with captions are collected; images and captions are ranked separately against the gold summary; a pseudo-label is assigned when both rank first; and documents containing direct image references are finally removed. Result: The proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher-quality images. Conclusion: The study proposes a multimodal pseudo-labeling method for constructing high-quality cover image generation datasets, and experiments show that it outperforms text-only and image-only methods. Abstract: In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with their captions, together with their summaries, excluding factually inconsistent instances. Our approach selects one image from the multiple images accompanying the documents. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: https://github.com/HyeyeeonKim/MMCIG
[179] Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding
Yunxiang Yang,Ningning Xu,Jidong J. Yang
Main category: cs.CV
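TL;DR: This study designs a multi-agent framework that uses a mixture-of-experts strategy for comprehensive highway scene understanding and multi-task perception; a large model generates prompts that guide a smaller model's reasoning, enabling efficient traffic monitoring and alerting.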
Details
Motivation: To achieve comprehensive understanding of highway scenes, address multi-task perception, and improve the efficiency and accuracy of traffic monitoring systems. Method: A multi-agent framework based on a mixture-of-experts strategy is proposed, which handles multiple perception tasks simultaneously through collaborative reasoning between a large model and a small model. Result: The framework performs well across diverse traffic and environmental conditions, and three specialized datasets were built to support empirical validation. Conclusion: The framework can be deployed effectively in resource-constrained environments, enhancing the situational awareness of traffic monitoring systems and delivering timely alerts. Abstract: This paper introduces a multi-agent framework for comprehensive highway scene understanding, designed around a mixture-of-experts strategy. In this framework, a large generic vision-language model (VLM), such as GPT-4o, is contextualized with domain knowledge to generate task-specific chain-of-thought (CoT) prompts. These fine-grained prompts are then used to guide a smaller, efficient VLM (e.g., Qwen2.5-VL-7B) in reasoning over short videos, along with complementary modalities as applicable. The framework simultaneously addresses multiple critical perception tasks, including weather classification, pavement wetness assessment, and traffic congestion detection, achieving robust multi-task reasoning while balancing accuracy and computational efficiency. To support empirical validation, we curated three specialized datasets aligned with these tasks. Notably, the pavement wetness dataset is multimodal, combining video streams with road weather sensor data, highlighting the benefits of multimodal reasoning. Experimental results demonstrate consistently strong performance across diverse traffic and environmental conditions. From a deployment perspective, the framework can be readily integrated with existing traffic camera systems and strategically applied to high-risk rural locations, such as sharp curves, flood-prone lowlands, or icy bridges. By continuously monitoring the targeted sites, the system enhances situational awareness and delivers timely alerts, even in resource-constrained environments.
[180] Multi-modal Knowledge Decomposition based Online Distillation for Biomarker Prediction in Breast Cancer Histopathology
Qibin Zhang,Xinyu Hao,Qiao Chen,Rui Xu,Fengyu Cong,Cheng Lu,Hongming Xu
Main category: cs.CV
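TL;DR: This work proposes a new online distillation method that significantly improves IHC biomarker prediction when only single-modality data is available.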
Details
Motivation: Because of cost and technical limitations, acquiring multi-modal data (such as genomic and pathological information) simultaneously is challenging, so an effective method is needed that improves IHC biomarker prediction even when only a single modality is available. Method: Multi-modal Knowledge Decomposition (MKD) and Similarity-preserving Knowledge Distillation (SKD) are combined with Collaborative Learning for Online Distillation (CLOD); paired genomic-pathology data are used during training, while inference supports either pathology images alone or both modalities. Result: Experiments on the TCGA-BRCA and in-house QHSU datasets show superior performance in IHC biomarker prediction using uni-modal data. Conclusion: The study proposes an online distillation method based on Multi-modal Knowledge Decomposition (MKD) that effectively improves immunohistochemical (IHC) biomarker prediction, particularly when only single-modality data is used. Abstract: Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC biomarker prediction in haematoxylin and eosin (H&E) stained histopathology images. This method leverages paired genomic-pathology data during training while enabling inference using either pathology slides alone or both modalities. Two teacher models and one student model are developed to extract modality-specific and modality-general features by minimizing the MKD loss. To maintain the internal structural relationships between samples, Similarity-preserving Knowledge Distillation (SKD) is applied. Additionally, Collaborative Learning for Online Distillation (CLOD) facilitates mutual learning between teacher and student models, encouraging diverse and complementary learning dynamics. Experiments on the TCGA-BRCA and in-house QHSU datasets demonstrate that our approach achieves superior performance in IHC biomarker prediction using uni-modal data. Our code is available at https://github.com/qiyuanzz/MICCAI2025_MKD.
[181] Deep Learning with Self-Attention and Enhanced Preprocessing for Precise Diagnosis of Acute Lymphoblastic Leukemia from Bone Marrow Smears in Hemato-Oncology
Md. Maruf,Md. Mahbubul Haque,Bishowjit Paul
Main category: cs.CV
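TL;DR: This work develops a deep learning method for automated diagnosis of acute lymphoblastic leukemia that combines preprocessing, an attention-augmented CNN, and Focal Loss optimization to achieve efficient and accurate diagnosis.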
Details
Motivation: Early and accurate detection of ALL with precise diagnosis is essential for guiding therapy, whereas conventional workflows are complex, time-consuming, and error-prone. Method: The method combines a robust preprocessing pipeline with a convolutional neural network (CNN) and inserts a multi-head self-attention (MHSA) block into a VGG19 backbone to model long-range dependencies and contextual relationships among cellular features. Focal Loss is used to mitigate class imbalance. Result: Among the evaluated architectures, the enhanced VGG19+MHSA trained with Focal Loss reaches 99.25% accuracy, surpassing a strong ResNet101 baseline (98.62%). Conclusion: The study presents a deep learning framework for automated diagnosis of acute lymphoblastic leukemia (ALL) that is highly accurate and computationally efficient, with potential to accelerate diagnostic workflows and support reliable decision-making in clinical settings. Abstract: Acute lymphoblastic leukemia (ALL) is a prevalent hematological malignancy in both pediatric and adult populations. Early and accurate detection with precise subtyping is essential for guiding therapy. Conventional workflows are complex, time-consuming, and prone to human error. We present a deep learning framework for automated ALL diagnosis from bone marrow smear images. The method combines a robust preprocessing pipeline with convolutional neural networks (CNNs) to standardize image quality and improve inference efficiency. As a key design, we insert a multi-head self-attention (MHSA) block into a VGG19 backbone to model long-range dependencies and contextual relationships among cellular features. To mitigate class imbalance, we train with Focal Loss. Across evaluated architectures, the enhanced VGG19+MHSA trained with Focal Loss achieves 99.25% accuracy, surpassing a strong ResNet101 baseline (98.62%). These results indicate that attention-augmented CNNs, coupled with targeted loss optimization and preprocessing, yield more discriminative representations of leukemic cell morphology. Our approach offers a highly accurate and computationally efficient tool for automated ALL recognition and subtyping, with potential to accelerate diagnostic workflows and support reliable decision-making in clinical settings.
[182] 4D Visual Pre-training for Robot Learning
Chengkai Hou,Yanjie Ze,Yankai Fu,Zeyu Gao,Songbo Hu,Yue Yu,Shanghang Zhang,Huazhe Xu
Main category: cs.CV
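TL;DR: FVP is a 4D visual pre-training framework that improves 3D representations for robot learning through next-point-cloud prediction with a diffusion model, significantly raising task success rates.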
Details
Motivation: Existing visual representations are mostly based on 2D images and neglect the 3D nature of the world, while large-scale 3D data is scarce, so a new framework is needed to improve 3D representations. Method: FVP frames the visual pre-training objective as a next-point-cloud-prediction problem and pre-trains with a diffusion model. Result: Across twelve real-world manipulation tasks, FVP raises the average success rate of 3D Diffusion Policy by 28% and achieves state-of-the-art performance among imitation learning methods. Conclusion: FVP is an effective 4D visual pre-training framework that significantly improves 3D representations and performs well across a variety of robot tasks. Abstract: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/.
[183] PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
Xiaoyang Hao,Han Li
Main category: cs.CV
TL;DR: PersPose is a new 3D human pose estimation framework that addresses the limitations of existing methods by encoding camera intrinsics and reducing perspective distortions, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: Existing 3D human pose estimation methods suffer from inaccuracies due to the lack of camera intrinsics information and perspective distortions when using cropped images as input. This work aims to address these issues. Method: The authors propose Perspective Encoding to encode camera intrinsics and Perspective Rotation to center the human subject in the image, reducing perspective distortions. These techniques are integrated into a novel 3D HPE framework called PersPose. Result: PersPose achieves state-of-the-art results on the 3DPW, MPIINF-3DHP, and Human3.6M datasets, with a 7.54% improvement in MPJPE on the 3DPW dataset compared to the previous state-of-the-art approach. Conclusion: PersPose is a new 3D HPE framework that achieves state-of-the-art performance on multiple datasets by incorporating Perspective Encoding and Perspective Rotation techniques. Abstract: Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. 
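A minimal sketch of how a subject-centering transformation of this kind could be built, assuming Perspective Rotation can be modeled as a pure rotation of a pinhole camera, which induces the homography K R K^-1 on the image. This modeling choice, the function names, the intrinsics matrix `K`, and the pixel coordinates are illustrative assumptions and are not taken from the PersPose code.

```python
import numpy as np


def rotation_aligning(a, b):
    """Rotation matrix R with R @ a == b for unit vectors a and b.

    Uses Rodrigues' formula; the antiparallel case (a == -b) is not
    handled because a viewing ray in front of the camera never points
    opposite to the optical axis.
    """
    v = np.cross(a, b)               # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))          # cos(angle)
    s = float(np.linalg.norm(v))     # sin(angle)
    if s < 1e-12:
        return np.eye(3)             # vectors already aligned
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)


def centering_homography(K, center_px):
    """Homography that moves `center_px` to the principal point.

    Assumption: the camera is virtually rotated so that the ray through
    the subject's center becomes the optical axis.
    """
    K_inv = np.linalg.inv(K)
    ray = K_inv @ np.array([center_px[0], center_px[1], 1.0])
    ray /= np.linalg.norm(ray)
    R = rotation_aligning(ray, np.array([0.0, 0.0, 1.0]))
    return K @ R @ K_inv


# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
H = centering_homography(K, (1000.0, 200.0))
p = H @ np.array([1000.0, 200.0, 1.0])
centered = p[:2] / p[2]              # subject center after the rotation
```

Warping the original image with `H` (for example with an image-warping routine that accepts a 3x3 homography) would place the subject's center at the principal point, which is the centering effect the abstract describes.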
By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPIINF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
[184] CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang,Ziyang Ma,Suqing Wang,Zuchao Li,Lefei Zhang,Hai Zhao,Yun Li,Qianren Wang
Main category: cs.CV
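TL;DR: CoViPAL is an efficient visual token pruning method that improves the inference efficiency of large vision-language models while preserving accuracy.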
Details
Motivation: Because visual information is rich, a single image can produce thousands of vision tokens, leading to high computational cost and memory overhead; existing methods handle redundant vision tokens poorly in shallow layers. Method: CoViPAL uses a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens in shallow layers, independently of the LVLM architecture. Result: Across multiple benchmarks, CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. Conclusion: CoViPAL is a lightweight, model-agnostic visual token pruning method that effectively improves LVLM inference efficiency and outperforms existing training-based and training-free pruning methods on multiple benchmarks. Abstract: Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.
[185] Uncovering and Mitigating Destructive Multi-Embedding Attacks in Deepfake Proactive Forensics
Lixin Jia,Haiyang Sun,Zhiqing Guo,Yunfeng Diao,Dan Ma,Gaobo Yang
Main category: cs.CV
TL;DR: This paper identifies a new threat, Multi-Embedding Attacks (MEA), which can erase forensic watermarks from images. To counter this, the authors propose a training method called Adversarial Interference Simulation (AIS) that improves the robustness of watermarking techniques against such attacks.
Details
Motivation: The motivation is to address the practical limitations of current watermarking techniques, which assume single watermark embedding, making them vulnerable to real-world Multi-Embedding Attacks where additional watermarking can erase the original forensic watermark. Method: The authors proposed a training paradigm called Adversarial Interference Simulation (AIS), which simulates MEA scenarios during fine-tuning and incorporates a resilience-driven loss function to ensure sparse and stable watermark representations. Result: The experiments show that the AIS method enables models to retain the ability to extract original watermarks even after a second watermark embedding, significantly enhancing robustness against MEA. Conclusion: The study concludes that the proposed AIS training paradigm significantly improves the robustness of existing watermarking methods against Multi-Embedding Attacks (MEA). Abstract: With the rapid evolution of deepfake technologies and the wide dissemination of digital media, personal privacy is facing increasingly serious security threats. Deepfake proactive forensics, which involves embedding imperceptible watermarks to enable reliable source tracking, serves as a crucial defense against these threats. Although existing methods show strong forensic ability, they rely on an idealized assumption of single watermark embedding, which proves impractical in real-world scenarios. In this paper, we formally define and demonstrate the existence of Multi-Embedding Attacks (MEA) for the first time. When a previously protected image undergoes additional rounds of watermark embedding, the original forensic watermark can be destroyed or removed, rendering the entire proactive forensic mechanism ineffective. To address this vulnerability, we propose a general training paradigm named Adversarial Interference Simulation (AIS). 
Rather than modifying the network architecture, AIS explicitly simulates MEA scenarios during fine-tuning and introduces a resilience-driven loss function to enforce the learning of sparse and stable watermark representations. Our method enables the model to maintain the ability to extract the original watermark correctly even after a second embedding. Extensive experiments demonstrate that our plug-and-play AIS training paradigm significantly enhances the robustness of various existing methods against MEA.
[186] A biological vision inspired framework for machine perception of abutting grating illusory contours
Xiao Zhang,Kai-Fu Yang,Xian-Shi Zhang,Hong-Zhi You,Hong-Mei Yan,Yong-Jie Li
Main category: cs.CV
TL;DR: This paper introduces ICPNet, a deep learning architecture inspired by human visual perception, which significantly improves the ability of machines to perceive illusory contours.
Details
Motivation: The discrepancy between deep neural networks' perception and human cognition regarding illusory contours necessitates a solution that aligns machine intelligence with human perception. Method: A novel deep network architecture named ICPNet was proposed, incorporating a multi-scale feature projection module, a feature interaction attention module, and an edge fusion module to improve contour perception. Result: ICPNet showed superior sensitivity to abutting grating illusory contours and achieved notable improvements in top-1 accuracy on AG-MNIST and AG-Fashion-MNIST test sets. Conclusion: ICPNet demonstrates significant improvements in perceiving illusory contours compared to existing models, aligning more closely with human perception. Abstract: Higher levels of machine intelligence demand alignment with human perception and cognition. Machine intelligence dominated by deep neural networks (DNNs) has demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy with human perception patterns. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed in this work. 
Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to take a step towards human-level intelligence for DNN-based models.
[187] SEER-VAR: Semantic Egocentric Environment Reasoner for Vehicle Augmented Reality
Yuzhi Lai,Shenghai Yuan,Peizheng Li,Jun Lou,Andreas Zell
Main category: cs.CV
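TL;DR: SEER-VAR is a novel egocentric vehicle AR framework that combines semantic decomposition, context-aware SLAM, and LLM-driven recommendation, supporting dynamic scene separation and high-quality AR rendering.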
Details
Motivation: Existing systems typically assume static or single-view settings, whereas real driving scenes are dynamic and complex. SEER-VAR aims to remove this limitation and provide a more robust, perceptually coherent AR experience. Method: SEER-VAR dynamically separates cabin and road scenes via depth-guided vision-language grounding; two SLAM branches track egocentric motion in each context, while a GPT-based module generates context-aware overlays. Result: Experiments show that SEER-VAR achieves robust spatial alignment and perceptually coherent AR rendering across varied environments, and improves scene understanding, overlay relevance, and the driving experience. Conclusion: By unifying semantic decomposition, context-aware SLAM branches, and LLM-driven recommendation, SEER-VAR provides an effective framework for vehicle-based egocentric AR and a foundation for future research in this direction. Abstract: We present SEER-VAR, a novel framework for egocentric vehicle-based augmented reality (AR) that unifies semantic decomposition, Context-Aware SLAM Branches (CASB), and LLM-driven recommendation. Unlike existing systems that assume static or single-view settings, SEER-VAR dynamically separates cabin and road scenes via depth-guided vision-language grounding. Two SLAM branches track egocentric motion in each context, while a GPT-based module generates context-aware overlays such as dashboard cues and hazard alerts. To support evaluation, we introduce EgoSLAM-Drive, a real-world dataset featuring synchronized egocentric views, 6DoF ground-truth poses, and AR annotations across diverse driving scenarios. Experiments demonstrate that SEER-VAR achieves robust spatial alignment and perceptually coherent AR rendering across varied environments. As one of the first to explore LLM-based AR recommendation in egocentric driving, we address the lack of comparable systems through structured prompting and detailed user studies. Results show that SEER-VAR enhances perceived scene understanding, overlay relevance, and driver ease, providing an effective foundation for future research in this direction. Code and dataset will be made open source.
[188] ResLink: A Novel Deep Learning Architecture for Brain Tumor Classification with Area Attention and Residual Connections
Sumedha Arya,Nirmal Gaud
Main category: cs.CV
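TL;DR: ResLink is a novel deep learning model for brain tumor classification that combines attention mechanisms with residual connections, achieving high accuracy and showing promise for medical imaging applications.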
Details
Motivation: Brain tumors can severely affect neurological function, so early and accurate diagnosis is needed. Existing methods may fall short in accuracy and generalization, which calls for a more robust and efficient classification technique. Method: ResLink employs a multi-stage convolutional pipeline incorporating dropout, regularization, and downsampling, followed by a final attention-based refinement for classification. Result: Trained on a balanced dataset, ResLink achieves a high accuracy of 95% and demonstrates strong generalizability. Conclusion: ResLink is a novel deep learning architecture for brain tumor classification with high accuracy and good generalization, offering a robust and efficient technique for medical imaging applications. Abstract: Brain tumors pose significant health challenges due to their potential to impair critical neurological functions. Early and accurate diagnosis is crucial for effective treatment. In this research, we propose ResLink, a novel deep learning architecture for brain tumor classification using CT scan images. ResLink integrates novel area attention mechanisms with residual connections to enhance feature learning and spatial understanding for spatially rich image classification tasks. The model employs a multi-stage convolutional pipeline, incorporating dropout, regularization, and downsampling, followed by a final attention-based refinement for classification. Trained on a balanced dataset, ResLink achieves a high accuracy of 95% and demonstrates strong generalizability. This research demonstrates the potential of ResLink in improving brain tumor classification, offering a robust and efficient technique for medical imaging applications.
[189] CLIFF: Continual Learning for Incremental Flake Features in 2D Material Identification
Sankalp Pandey,Xuan Bac Nguyen,Nicholas Borys,Hugh Churchill,Khoa Luu
Main category: cs.CV
TL;DR: This paper introduces CLIFF, a continual-learning framework for flake layer classification in 2D materials that reduces forgetting and improves classification accuracy.
Details
Motivation: Automated classification of quantum flakes from optical microscopy is challenging due to appearance variations across materials, and existing methods struggle with forgetting when learning new materials. Method: The proposed method, CLIFF, uses a continual-learning framework that freezes a trained backbone and base head, learns material-specific features, and utilizes a prompt pool and cosine-similarity gate alongside memory replay and knowledge distillation. Result: CLIFF achieves competitive accuracy while significantly reducing forgetting compared to naive fine-tuning and prompt-based baselines. Conclusion: CLIFF provides a new method for layer classification in 2D materials with reduced forgetting compared to existing methods. Abstract: Identifying quantum flakes is crucial for scalable quantum hardware; however, automated layer classification from optical microscopy remains challenging due to substantial appearance shifts across different materials. In this paper, we propose a new Continual-Learning Framework for Flake Layer Classification (CLIFF). To our knowledge, this is the first systematic study of continual learning in the domain of two-dimensional (2D) materials. Our method enables the model to differentiate between materials and their physical and optical properties by freezing a backbone and base head trained on a reference material. For each new material, it learns a material-specific prompt, embedding, and a delta head. A prompt pool and a cosine-similarity gate modulate features and compute material-specific corrections. Additionally, we incorporate memory replay with knowledge distillation. CLIFF achieves competitive accuracy with significantly lower forgetting than naive fine-tuning and a prompt-based baseline.
[190] AdaGAT: Adaptive Guidance Adversarial Training for the Robustness of Deep Neural Networks
Zhenyu Liu,Huizhi Liang,Xinrun Li,Vaclav Snasel,Varun Ojha
Main category: cs.CV
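TL;DR: This paper proposes AdaGAT, a new method that improves the robustness of the target model by dynamically adjusting the training state of the guide model.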
Details
Motivation: Because a learnable guide model starts learning from scratch, keeping it in its optimal state for effective knowledge transfer during co-training is challenging. Method: Two separate loss functions are developed so that the guide model participates more actively in backpropagation and reaches its optimal state. Result: Extensive experiments on CIFAR-10, CIFAR-100, and TinyImageNet show that, compared with a variety of baseline models, appropriately adjusting the guide model within a certain accuracy range enhances the target model's robustness against various adversarial attacks. Conclusion: By dynamically adjusting the guide model's training state, AdaGAT effectively improves the robustness of the target model. Abstract: Adversarial distillation (AD) is a knowledge distillation technique that facilitates the transfer of robustness from teacher deep neural network (DNN) models to lightweight target (student) DNN models, enabling the target models to perform better than when the student model is trained independently. Some previous works focus on using a small, learnable teacher (guide) model to improve the robustness of a student model. Since a learnable guide model starts learning from scratch, maintaining its optimal state for effective knowledge transfer during co-training is challenging. Therefore, we propose a novel Adaptive Guidance Adversarial Training (AdaGAT) method. Our method, AdaGAT, dynamically adjusts the training state of the guide model to instill robustness in the target model. Specifically, we develop two separate loss functions as part of the AdaGAT method, allowing the guide model to participate more actively in backpropagation to achieve its optimal state. We evaluated our approach via extensive experiments on three datasets: CIFAR-10, CIFAR-100, and TinyImageNet, using the WideResNet-34-10 model as the target model. Our observations reveal that appropriately adjusting the guide model within a certain accuracy range enhances the target model's robustness across various adversarial attacks compared to a variety of baseline models.
[191] Spatial-Temporal Human-Object Interaction Detection
Xu Sun,Yunqing He,Tongwei Ren,Gangshan Wu
Main category: cs.CV
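TL;DR: This paper introduces ST-HOID, a new human-object interaction detection task, proposes a method to solve it, and constructs the VidOR-HOID dataset for evaluation; experiments show the method outperforms existing baselines.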
Details
Motivation: HOI is crucial for human-centric video content understanding, which motivates this work. Method: A novel method consisting of an object trajectory detection module and an interaction reasoning module is proposed to solve ST-HOID. Result: Experiments show that the method outperforms baselines built from state-of-the-art methods for image human-object interaction detection, video visual relation detection, and video human-object interaction recognition. Conclusion: The paper proposes ST-HOID, a new instance-level human-object interaction detection task on videos, and introduces VidOR-HOID, the first dataset for its evaluation. Abstract: In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object interaction recognition.
[192] Deep Learning-Assisted Detection of Sarcopenia in Cross-Sectional Computed Tomography Imaging
Manish Bhardwaj,Huizhi Liang,Ashwin Sivaharan,Sandip Nandhra,Vaclav Snasel,Tamer El-Sayed,Varun Ojha
Main category: cs.CV
TL;DR: This paper presents an AI-based method to automate sarcopenia detection using CT images, achieving high accuracy and offering a scalable solution for clinical use.
Details
Motivation: The motivation is to improve the efficiency and scalability of sarcopenia detection, as manual assessment through cross-sectional imaging is time-consuming and limits timely clinical intervention. Method: The researchers used high-quality 3D CT images annotated by clinicians and applied deep-learning models, including transfer learning and self-supervised learning techniques, to quantify and detect sarcopenia. Result: The deep-learning model predicted SMA with an error of ±3 percentage points compared to manual measurements, and the average Dice similarity coefficient of predicted masks was 93%. Conclusion: The study concludes that deep-learning models can effectively automate the assessment and detection of sarcopenia, offering a precise and efficient alternative to manual methods. Abstract: Sarcopenia is a progressive loss of muscle mass and function linked to poor surgical outcomes such as prolonged hospital stays, impaired mobility, and increased mortality. Although it can be assessed through cross-sectional imaging by measuring skeletal muscle area (SMA), the process is time-consuming and adds to clinical workloads, limiting timely detection and management; however, this process could become more efficient and scalable with the assistance of artificial intelligence applications. This paper presents high-quality three-dimensional cross-sectional computed tomography (CT) images of patients with sarcopenia collected at the Freeman Hospital, Newcastle upon Tyne Hospitals NHS Foundation Trust. Expert clinicians manually annotated the SMA at the third lumbar vertebra, generating precise segmentation masks. We develop deep-learning models to measure SMA in CT images and automate this task. Our methodology employed transfer learning and self-supervised learning approaches using labelled and unlabeled CT scan datasets. 
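The Dice similarity coefficient reported in the Result above measures the overlap between a predicted mask and a manually annotated mask as twice the intersection divided by the sum of the two mask areas. The following is a minimal sketch of that standard formula for binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two binary masks.

    Dice = 2 * |A intersect B| / (|A| + |B|), ranging from 0 (no
    overlap) to 1 (identical masks).
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    total = pred.sum() + true.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
true = [[1, 0, 0],
        [0, 1, 1]]
score = dice_coefficient(pred, true)  # 2 * 2 / (3 + 3)
```

Averaging this score over all test scans gives a single summary figure of the kind quoted for the predicted masks.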
While we developed qualitative assessment models for detecting sarcopenia, we observed that the quantitative assessment of SMA is more precise and informative. This approach also mitigates the issue of class imbalance and limited data availability. Our model predicted the SMA, on average, with an error of ±3 percentage points against the manually measured SMA. The average Dice similarity coefficient of the predicted masks was 93%. Our results, therefore, show a pathway to full automation of sarcopenia assessment and detection.
[193] MTNet: Learning modality-aware representation with transformer for RGBT tracking
Ruichao Hou,Boyue Xu,Tongwei Ren,Gangshan Wu
Main category: cs.CV
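TL;DR: This paper proposes MTNet, a Transformer-based modality-aware tracker that achieves strong performance and real-time speed on RGBT tracking through multi-modal feature extraction and a dynamic template update strategy.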
Details
Motivation: To improve multi-modality representation in RGBT tracking and overcome the limits that the conventional fusion paradigm and a fixed tracking template place on feature interaction. Method: MTNet, a Transformer-based modality-aware tracking network, is proposed, comprising multi-modal feature extraction, a Transformer fusion network, and a dynamic template update strategy. Result: Experiments show that MTNet achieves results comparable to state-of-the-art methods on three RGBT benchmarks while running at real-time speed. Conclusion: With its dynamic update strategy and trident prediction head, MTNet effectively handles challenges such as scale variation and deformation and achieves precise target localization. Abstract: The ability to learn robust multi-modality representation has played a critical role in the development of RGBT tracking. However, the regular fusion paradigm and the invariable tracking template remain restrictive to the feature interaction. In this paper, we propose a modality-aware tracker based on a transformer, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues, which contains both a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies to reinforce instance representations. To estimate the precise location and tackle challenges such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy which jointly maintain a reliable template for facilitating inter-frame communication. Extensive experiments validate that the proposed method achieves satisfactory results compared with the state-of-the-art competitors on three RGBT benchmarks while reaching real-time speed.
[194] Quickly Tuning Foundation Models for Image Segmentation
Breenda Das,Lennart Purucker,Timur Carstensen,Frank Hutter
Main category: cs.CV
TL;DR: QTT-SEG is a meta-learning-based automated fine-tuning method that quickly optimizes SAM for domain-specific image segmentation tasks.
Details
Motivation: Although foundation models such as SAM perform well at zero-shot image segmentation, domain-specific tasks still require substantial manual fine-tuning and expertise, so an automated method is needed. Method: Built on the Quick-Tune hyperparameter optimization framework, the method uses meta-learned cost and performance models to predict high-performing configurations. Result: On eight binary and five multiclass segmentation datasets, QTT-SEG exceeds SAM's zero-shot performance and outperforms AutoGluon Multimodal on most tasks. Conclusion: QTT-SEG accelerates and automates the fine-tuning of SAM through meta-learning, demonstrating better performance than existing methods on specialized segmentation tasks. Abstract: Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: https://github.com/ds-brx/QTT-SEG/
[195] Explain Before You Answer: A Survey on Compositional Visual Reasoning
Fucai Ke,Joy Hsu,Zhixi Cai,Zixian Ma,Xin Zheng,Xindi Wu,Sukai Huang,Weiqing Wang,Pari Delir Haghighi,Gholamreza Haffari,Ranjay Krishna,Jiajun Wu,Hamid Rezatofighi
Main category: cs.CV
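TL;DR: This paper systematically reviews the compositional visual reasoning literature from 2023 to 2025, analyzing 260+ papers and 60+ benchmarks; it aims to provide a unified taxonomy, historical roadmap, and critical outlook, identifying open challenges and outlining future research directions.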
Details
Motivation: Compositional visual reasoning is a key research frontier in multimodal AI that aims to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. A dedicated synthesis of this rapidly expanding literature is still missing. Method: The paper systematically reviews 260+ papers from 2023 to 2025 across top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.) and analyzes 60+ benchmarks and their metrics. Result: The paper systematically reviews the compositional visual reasoning literature, traces a five-stage paradigm shift, catalogs 60+ benchmarks, distills key insights, identifies open challenges, and outlines future directions. Conclusion: By offering a unified taxonomy, historical roadmap, and critical outlook, the paper aims to serve as a foundational reference for compositional visual reasoning research and to inspire the next generation of work. Abstract: Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. 
Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
[196] FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising
Zhihao Chen,Qi Gao,Zilong Li,Junping Zhang,Yi Zhang,Jun Zhao,Hongming Shan
Main category: cs.CV
TL;DR: FoundDiff is a new foundational diffusion model for low-dose CT denoising that adapts to different dose levels and anatomical regions, improving generalizability and robustness in clinical scenarios.
Details
Motivation: Existing DL-based LDCT denoising methods struggle with diverse noise characteristics and anatomical heterogeneity across varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. Method: FoundDiff uses a two-stage strategy: (i) dose-anatomy perception with DA-CLIP, which learns continuous representations of dose variations and identifies anatomical regions, and (ii) adaptive denoising with DA-Diff, which integrates dose and anatomy embeddings through a DACB based on Mamba. Result: Extensive experiments show that FoundDiff outperforms existing state-of-the-art methods in LDCT denoising and generalizes remarkably well to unseen dose levels. Conclusion: FoundDiff is a foundational diffusion model for LDCT denoising that provides adaptive and generalizable denoising across various dose levels and anatomical regions by using a two-stage strategy involving DA-CLIP and DA-Diff. Abstract: Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. 
First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DA-CLIP into the diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.
[197] PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing
Peilin Xiong,Junwen Chen,Honghui Yuan,Keiji Yanai
Main category: cs.CV
TL;DR: PosBridge is an efficient and flexible framework for inserting custom objects into images, using positional embedding transplant and Corner Centered Layout to guide the FLUX.1-Fill model in generating identity-consistent content.
Details
Motivation: The increasing cost of training generative models highlights the need for training-free and scalable editing frameworks for localized subject-driven image editing. Method: PosBridge uses positional embedding transplant and Corner Centered Layout to guide the FLUX.1-Fill model in synthesizing identity-consistent content at the desired location. Result: Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency. Conclusion: PosBridge is a training-free and scalable framework for localized subject-driven image editing that outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency. Abstract: Localized subject-driven image editing aims to seamlessly integrate user-specified objects into target scenes. As generative models continue to scale, training becomes increasingly costly in terms of memory and computation, highlighting the need for training-free and scalable editing frameworks. To this end, we propose PosBridge, an efficient and flexible framework for inserting custom objects. A key component of our method is positional embedding transplant, which guides the diffusion model to faithfully replicate the structural characteristics of reference objects. Meanwhile, we introduce the Corner Centered Layout, which concatenates reference images and the background image as input to the FLUX.1-Fill model. During progressive denoising, positional embedding transplant is applied to guide the noise distribution in the target region toward that of the reference object. In this way, Corner Centered Layout effectively directs the FLUX.1-Fill model to synthesize identity-consistent content at the desired location. 
Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency, showcasing its practical value and potential for broad adoption.
[198] First Place Solution to the MLCAS 2025 GWFSS Challenge: The Devil is in the Detail and Minority
Songliang Cao,Tianqi Hu,Hao Lu
Main category: cs.CV
TL;DR: This report presents a stem-focused semantic segmentation method for wheat that won first place in the MLCAS 2025 GWFSS Challenge.
Details
Motivation: In current semantic segmentation competitions, conventional tricks no longer give a clear edge, so the solution must target the properties of wheat itself, especially the fine structure of stems. Method: Starting from a ViT-Adapter baseline, the authors add three stem-oriented improvements: the dynamic upsampler SAPA, stem-aware semi-supervised guided distillation, and a test-time image scaling strategy. Result: The method took first place in the MLCAS 2025 GWFSS Challenge, clearly ahead of the second-place entry. Conclusion: With stem-specific improvements (dynamic upsampling, semi-supervised guided distillation, and test-time scaling), the authors' method won first place in the MLCAS 2025 GWFSS Challenge. Abstract: In this report, we present our solution for the MLCAS 2025 GWFSS Challenge. This challenge hosts a semantic segmentation competition specific to wheat plants, which requires segmenting three wheat organs, namely the head, leaf, and stem, plus a background class. In 2025, participating in a segmentation competition is significantly different from previous years, when many tricks could play important roles. Nowadays most segmentation tricks have been well integrated into existing codebases such that our naive ViT-Adapter baseline has already achieved sufficiently good performance. Hence, we believe the key to standing out among other competitors is to focus on the problem nature of wheat per se. By probing visualizations, we identify the key -- the stem matters. In contrast to heads and leaves, stems exhibit fine structure and occupy only a few pixels, so they suffer from fragile predictions and class imbalance. Building on our baseline, we present three technical improvements tailored to stems: i) incorporating a dynamic upsampler SAPA used to enhance detail delineation; ii) leveraging semi-supervised guided distillation with stem-aware sample selection to mine the treasure beneath unlabeled data; and iii) applying a test-time scaling strategy to zoom in and segment the image twice. Despite being simple, the three improvements bring us to the first place of the competition, outperforming the second place by clear margins. Code and models will be released at https://github.com/tiny-smart/gwfss25.
[199] Defending Deepfake via Texture Feature Perturbation
Xiao Zhang,Changfang Chen,Tianyi Wang
Main category: cs.CV
TL;DR: This paper proposes a proactive Deepfake detection framework by inserting texture-guided perturbations, achieving strong performance in disrupting fake image generation.
Details
Motivation: The increasing threat of high-quality Deepfake technology to social trust and information security necessitates the development of proactive defense mechanisms, as traditional passive detection methods face limitations. Method: A proactive detection method based on facial texture features, using Local Binary Patterns (LBP) to extract texture features and a dual-model attention strategy to generate and optimize texture perturbations. Result: The proposed method demonstrated promising performance on CelebA-HQ and LFW datasets, effectively distorting Deepfake generation and creating visible defects under multiple attack models. Conclusion: The proposed proactive Deepfake detection approach effectively distorts Deepfake generation and introduces visible defects, offering an efficient and scalable solution for combating high-quality Deepfake content. Abstract: The rapid development of Deepfake technology poses severe challenges to social trust and information security. While most existing detection methods primarily rely on passive analyses, due to unresolvable high-quality Deepfake contents, proactive defense has recently emerged by inserting invisible signals in advance of image editing. In this paper, we introduce a proactive Deepfake detection approach based on facial texture features. Since human eyes are more sensitive to perturbations in smooth regions, we invisibly insert perturbations within texture regions that have low perceptual saliency, applying localized perturbations to key texture regions while minimizing unwanted noise in non-textured areas. Our texture-guided perturbation framework first extracts preliminary texture features via Local Binary Patterns (LBP), and then introduces a dual-model attention strategy to generate and optimize texture perturbations. 
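The texture-extraction step named in this abstract, Local Binary Patterns, can be illustrated with a minimal sketch. It assumes the basic 3x3, 8-neighbor LBP on a grayscale array; the abstract does not say which LBP variant, radius, or neighbor ordering the authors use, and the function name `lbp_8neighbor` is hypothetical.

```python
import numpy as np


def lbp_8neighbor(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP code for every interior pixel of a 2-D grayscale array.

    Assumption: plain 8-neighbor LBP; the paper's exact variant is not given.
    """
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Clockwise neighbor offsets (dy, dx), starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the interior pixels.
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.int32) << bit
    return codes.astype(np.uint8)
```

The output is two pixels smaller than the input in each dimension because border pixels lack a full neighborhood. A perturbation generator could use such codes to tell textured regions from smooth ones, though how the paper does this is not described here.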
Experiments on CelebA-HQ and LFW datasets demonstrate the promising performance of our method in distorting Deepfake generation and producing obvious visual defects under multiple attack models, providing an efficient and scalable solution for proactive Deepfake detection.
[200] SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation
Zhenyu Jin,Wenjie Li,Zhanyu Ma,Heng Guo
Main category: cs.CV
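TL;DR: This paper proposes SpecGen, a new method that generates spectral BRDFs from a single RGB image, addressing the scarcity of spectral data and outperforming existing methods in spectral image reconstruction.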
Details
Motivation: Spectral image synthesis is essential for photorealistic rendering, but progress is limited by the scarcity of measured spectral BRDF data. Method: The paper proposes the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions so that abundant RGB BRDF data can be used to improve spectral BRDF generation. Result: Experiments show that the method accurately reconstructs spectral BRDFs from limited spectral data and improves PSNR by 8 dB over existing methods in hyperspectral image reconstruction. Conclusion: SpecGen generates spectral BRDFs from a single RGB image and outperforms existing methods in spectral image rendering. Abstract: Synthesizing spectral images across different wavelengths is essential for photorealistic rendering. Unlike conventional spectral uplifting methods that convert RGB images into spectral ones, we introduce SpecGen, a novel method that generates spectral bidirectional reflectance distribution functions (BRDFs) from a single RGB image of a sphere. This enables spectral image rendering under arbitrary illuminations and shapes covered by the corresponding material. A key challenge in spectral BRDF generation is the scarcity of measured spectral BRDF data. To address this, we propose the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions, allowing the training strategy to leverage abundant RGB BRDF data to enhance spectral BRDF generation. Experiments show that our method accurately reconstructs spectral BRDFs from limited spectral data and surpasses state-of-the-art methods in hyperspectral image reconstruction, achieving an improvement of 8 dB in PSNR. Codes and data will be released upon acceptance.
[201] Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Somraj Gautam,Abhirama Subramanyam Penamakuri,Abhishek Bhandari,Gaurav Harit
Main category: cs.CV
TL;DR: The paper introduces MMCRICBENCH-3K, a benchmark for evaluating large vision-language models on complex numerical and cross-lingual reasoning tasks involving cricket scorecards.
Details
Motivation: To evaluate large vision-language models on complex numerical and cross-lingual reasoning over semi-structured tabular images. Method: Creation of a synthetic benchmark dataset for VQA on cricket scorecards, comprising English and Hindi subsets for cross-lingual evaluation. Result: State-of-the-art LVLMs like GPT-4o and Qwen2.5VL struggle with the English subset and show further performance drops on the Hindi subset. Conclusion: The MMCRICBENCH-3K dataset reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization by LVLMs. Abstract: We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
[202] No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection
Lianrui Mu,Zou Xingze,Jianhong Bai,Jiaqi Hu,Wenjie Zheng,Jiangnan Ye,Jiedong Zhuang,Mudassar Ali,Jing Wang,Haoji Hu
Main category: cs.CV
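TL;DR: This paper proposes HiDA-Net, a new framework for detecting high-resolution AI-generated images that combines full-resolution detail with a global view to improve detection accuracy.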
Details
Motivation: Existing detectors are trained and evaluated on low-resolution datasets and struggle with high-resolution AI-generated images, and common preprocessing strategies can discard information. Method: The paper introduces the HiDA-Net framework, comprising a Feature Aggregation Module (FAM), a Token-wise Forgery Localization (TFL) module, and a JPEG Quality Factor Estimation (QFE) module, and builds the HiRes-50K dataset. Result: HiDA-Net improves accuracy by over 13% on the Chameleon dataset and by 10% on HiRes-50K, reaching state-of-the-art performance. Conclusion: By fusing full-resolution local tile features with a global image view, HiDA-Net performs strongly on high-resolution AI-generated image detection, is more robust to localized AI manipulation and compression, and supports future research through the HiRes-50K dataset. Abstract: The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the High-Resolution Detail-Aggregation Network (HiDA-Net), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce a Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and a JPEG Quality Factor Estimation (QFE) module to disentangle generative artifacts from compression noise explicitly. Furthermore, to facilitate future research, we introduce HiRes-50K, a new challenging benchmark consisting of 50,568 images with up to 64 megapixels. 
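The input preparation this abstract describes, full-resolution tiles that together cover every pixel plus a down-sampled global view, can be sketched as follows. This is an illustration under stated assumptions, not HiDA-Net's implementation: the helper names `tile_boxes` and `split_views`, the overlapping-tile layout, and the strided down-sampling are all hypothetical choices.

```python
import numpy as np


def tile_boxes(height: int, width: int, tile: int) -> list[tuple[int, int]]:
    """Top-left corners of tile x tile windows that together cover every pixel.

    Assumption: tiles may overlap; the paper's tiling scheme is not specified.
    """
    def starts(size: int) -> list[int]:
        if size <= tile:
            return [0]
        n = -(-size // tile)  # ceiling division: tiles needed along this axis
        # Evenly spaced starts; the last tile ends exactly at the image border.
        return [(i * (size - tile)) // (n - 1) for i in range(n)]

    return [(top, left) for top in starts(height) for left in starts(width)]


def split_views(image: np.ndarray, tile: int, global_stride: int):
    """Return native-resolution tiles and a crude down-sampled global view."""
    boxes = tile_boxes(image.shape[0], image.shape[1], tile)
    tiles = [image[t:t + tile, l:l + tile] for t, l in boxes]
    # Strided subsampling stands in for a proper resize with anti-aliasing.
    global_view = image[::global_stride, ::global_stride]
    return tiles, global_view
```

Letting tiles overlap rather than padding the image keeps every tile at native resolution, which matches the stated goal of preserving high-frequency detail. A real pipeline would likely use an anti-aliased resize for the global view.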
Extensive experiments show that HiDA-Net achieves state-of-the-art performance, increasing accuracy by over 13% on the challenging Chameleon dataset and 10% on our HiRes-50K.
[203] DiCache: Let Diffusion Model Determine Its Own Cache
Jiazi Bu,Pengyang Ling,Yujie Zhou,Yibin Wang,Yuhang Zang,Tong Wu,Dahua Lin,Jiaqi Wang
Main category: cs.CV
TL;DR: DiCache is a novel, training-free adaptive caching strategy that dynamically determines when and how to cache in diffusion models, significantly improving efficiency and visual fidelity for tasks like video and image generation.
Details
Motivation: The motivation is to address the limited generalizability of existing caching-based acceleration methods in diffusion models, particularly their failure on outlier samples due to reliance on predefined rules and dataset-level priors. Method: The method involves two components: an Online Probe Profiling Scheme to determine caching schedules and Dynamic Cache Trajectory Alignment to combine multi-step caches for better feature approximation. Result: DiCache achieves higher efficiency and improved visual quality over state-of-the-art methods on various diffusion models for both video and image generation. Conclusion: DiCache is a training-free adaptive caching strategy that improves efficiency and visual fidelity in diffusion models, validated through experiments across various models like WAN 2.1, HunyuanVideo, and Flux. Abstract: Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. 
Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.
[204] Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation
Guoqing Zhang,Xingtong Ge,Lu Shi,Xin Zhang,Muqing Xue,Wanru Xu,Yigang Cen
Main category: cs.CV
TL;DR: This paper proposes a unified framework for image-to-image generation that improves efficiency and expressiveness by introducing the CoMoE and WeaveNet modules. Experimental results show its state-of-the-art performance in conditional image generation tasks.
Details
Motivation: Existing methods for image-to-image generation train separate control branches for each condition, leading to redundant model structures and inefficient resource use. A more efficient and versatile framework is needed. Method: A Unified image-to-image Generation (UniGen) framework is proposed, which includes the Condition Modulated Expert (CoMoE) module to reduce parameter redundancy and computational inefficiency, and WeaveNet for effective interaction between global text-level and fine-grained control. Result: Extensive experiments on the Subjects-200K and MultiGen-20M datasets show that the proposed method achieves state-of-the-art performance across various conditional image generation tasks. Conclusion: The proposed UniGen framework, incorporating CoMoE and WeaveNet modules, demonstrates state-of-the-art performance in conditional image generation tasks while improving efficiency and expressiveness. Abstract: The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. 
Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.
[205] Lightweight Joint Optimization of General-Purpose Vision-Language Models and Retrievers for Medical Diagnosis
Nir Mazor,Tom Hope
Main category: cs.CV
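TL;DR: This paper proposes a model that jointly optimizes a multimodal retriever and a large vision-language model (LVLM) for medical diagnosis, outperforming standard RAG on challenging cases.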
Details
Motivation: Clinical decisions often depend on interpreting images, and standard RAG does not optimize the retriever against the downstream task, so a better model is needed to improve diagnostic accuracy. Method: The study jointly optimizes a multimodal retriever and an LVLM for medical diagnosis, instead of using a fixed retriever as in standard RAG. Result: Using only general-purpose models with lightweight fine-tuning, the method matches medically-pretrained models on several clinical tasks and significantly outperforms standard RAG on challenging cases. Conclusion: Although the proposed model improves diagnostic performance in some cases, a large gap to the oracle remains, leaving room for future improvement. Abstract: Clinical decision-making often involves interpreting images (e.g., radiology) for making diagnoses. Retrieving relevant visual information from medical literature and hospital records could enhance diagnostic accuracy. In this paper, we develop a model in which a multimodal retriever is jointly optimized with an LVLM for medical diagnosis, unlike standard RAG where LVLM error signal is not propagated down to the retriever. We show that using only general-purpose backbones, with only lightweight fine-tuning, our model is able to achieve competitive results with medically-pretrained models across clinical multi-label classification and visual question answering tasks. In a novel analysis, we additionally find that in many cases different top retrieved images each lead to different predictions for a given target, and that these cases are empirically challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these challenging cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code will be made publicly available.
[206] Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches
Aoqi Li,Yanghui Song,Jichao Dao,Chengfu Yang
Main category: cs.CV
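TL;DR: The paper presents a deep learning method that integrates the VGG19 and ResNet50 models for image enhancement in complex underwater scenes, and offers suggestions for improving the system's practicality and stability.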
Details
Motivation: Image enhancement in complex underwater scenes is a challenging problem, and the paper aims to propose a deep learning solution. Method: The paper integrates two deep convolutional neural network models, VGG19 and ResNet50, using their feature extraction capabilities for multi-scale, multi-level deep feature analysis of underwater images. Result: A unified model combines the complementary strengths of the two networks, yielding more comprehensive and accurate image enhancement. Conclusion: By integrating VGG19 and ResNet50, the paper offers an effective underwater image enhancement solution, along with suggestions for improving system practicality and stability, providing technical support for visual enhancement in complex underwater environments. Abstract: This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method skillfully integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively integrated, achieving a more comprehensive and accurate image enhancement effect. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement and deeply analyzes the performance of different models in different scenarios. Furthermore, to improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions from aspects such as model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.
[207] MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling
Haoyu Wang,Hao Tang,Donglin Di,Zhilu Zhang,Wangmeng Zuo,Feng Gao,Siwei Ma,Shiliang Zhang
Main category: cs.CV
TL;DR: MoCo improves human video generation by decoupling structure and appearance, using a 3D structure generator and Human-Aware Dynamic Control modules, outperforming existing methods in realism and coherence.
Details
Motivation: Generating human videos with consistent motion from text prompts is a significant challenge, especially for whole-body or long-range motion. Existing models focus on appearance fidelity, leading to unrealistic movements. Additionally, current datasets limit the scope of generation methods to simple movements. Method: MoCo decouples human video generation into structure generation and appearance generation. It uses a 3D structure generator for motion sequences and synthesizes appearance guided by the structure. Human-Aware Dynamic Control modules and dense tracking constraints are introduced for better control and coherence. Result: MoCo achieves better performance in generating realistic and structurally coherent human videos compared to existing methods. Conclusion: MoCo outperforms existing approaches in generating realistic and structurally coherent human videos. Abstract: Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. 
To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.
[208] E-BayesSAM: Efficient Bayesian Adaptation of SAM with Self-Optimizing KAN-Based Interpretation for Uncertainty-Aware Ultrasonic Segmentation
Bin Huang,Zhong Liu,Huiying Wen,Bingsheng Huang,Xin Chen,Shuo Li
Main category: cs.CV
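TL;DR: This study proposes E-BayesSAM, an efficient framework that combines Token-wise Variational Bayesian Inference (T-VBI) and a Self-Optimizing Kolmogorov-Arnold Network (SO-KAN) to address the Bayesian adaptation, computational cost, and interpretability limitations of the Segment Anything Model (SAM) in medical image segmentation.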
Details
Motivation: The Segment Anything Model (SAM) has advanced medical image segmentation, but its Bayesian adaptation for uncertainty-aware segmentation is hindered by three key issues: (1) Bayesian fine-tuning of large pre-trained SAMs is unstable; (2) SAM's massive parameter count makes computation expensive; (3) SAM's black-box design limits interpretability. Method: The paper proposes E-BayesSAM, an efficient framework that combines Token-wise Variational Bayesian Inference (T-VBI) and a Self-Optimizing Kolmogorov-Arnold Network (SO-KAN). Result: Experiments on five ultrasound datasets show that E-BayesSAM achieves (i) real-time inference (0.03s/image), (ii) superior segmentation accuracy (average DSC: 89.0% for Pruned E-BayesSAM vs. 88.0% for E-BayesSAM vs. 88.3% for MedSAM), and (iii) identification of four critical tokens that govern SAM's decisions. Conclusion: By combining T-VBI and SO-KAN, E-BayesSAM addresses SAM's limitations in Bayesian adaptation, computational cost, and interpretability for medical image segmentation, advancing SAM's deployment in safety-critical medical applications. Abstract: Although the Segment Anything Model (SAM) has advanced medical image segmentation, its Bayesian adaptation for uncertainty-aware segmentation remains hindered by three key issues: (1) instability in Bayesian fine-tuning of large pre-trained SAMs; (2) high computation cost due to SAM's massive parameters; (3) SAM's black-box design limits interpretability. To overcome these, we propose E-BayesSAM, an efficient framework combining Token-wise Variational Bayesian Inference (T-VBI) for efficient Bayesian adaptation and Self-Optimizing Kolmogorov-Arnold Network (SO-KAN) for improving interpretability. T-VBI innovatively reinterprets SAM's output tokens as dynamic probabilistic weights and reparameterizes them as latent variables without auxiliary training, enabling training-free VBI for uncertainty estimation. SO-KAN improves token prediction with learnable spline activations via self-supervised learning, providing insight to prune redundant tokens to boost efficiency and accuracy. Experiments on five ultrasound datasets demonstrated that E-BayesSAM achieves: (i) real-time inference (0.03s/image), (ii) superior segmentation accuracy (average DSC: Pruned E-BayesSAM's 89.0% vs. E-BayesSAM's 88.0% vs. MedSAM's 88.3%), and (iii) identification of four critical tokens governing SAM's decisions. 
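The DSC figures reported above are Dice similarity coefficients, defined for a predicted mask P and a ground-truth mask G as 2|P ∩ G| / (|P| + |G|). A minimal sketch for binary masks follows; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
import numpy as np


def dice_coefficient(pred, target) -> float:
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)
```

An average DSC such as the 89.0% quoted above would then be the mean of this score over the evaluated images or datasets, depending on the paper's protocol.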
By unifying efficiency, reliability, and interpretability, E-BayesSAM bridges SAM's versatility with clinical needs, advancing deployment in safety-critical medical applications. The source code is available at https://github.com/mp31192/E-BayesSAM.
[209] Data Leakage in Visual Datasets
Patrick Ramos,Ryan Ramos,Noa Garcia
Main category: cs.CV
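TL;DR: This paper studies data leakage in visual datasets and finds that every dataset analyzed exhibits leakage, which threatens the reliability of model evaluation.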
Details
Motivation: Large-scale datasets are usually sourced from the internet, where many computer vision benchmarks are publicly available, so images from evaluation benchmarks may be seen during training, undermining fair model evaluation. Method: Image retrieval techniques are applied to identify and study different types of visual leakage. Result: Every analyzed dataset shows some form of leakage, and all types of leakage, from severe instances to subtler cases, compromise model evaluation in downstream tasks. Conclusion: Data leakage undermines the reliability of model evaluation in downstream tasks, and all analyzed datasets exhibit some degree of leakage. Abstract: We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Given that large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, our efforts are focused on identifying and studying this phenomenon. We characterize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
[210] Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
Xiaojie Yin,Qilong Wang,Qinghua Hu
Main category: cs.CV
TL;DR: This paper proposes a constrained prompt enhancement (CPE) method to improve visual-textual alignment in vision-language models by generating comprehensive textual prompts and compact visual prompts using TGSSG and CADRS, resulting in better zero-shot generalization.
Details
Motivation: The study aims to address semantic misalignment in vision-language models caused by incomplete textual prompts and noisy visual prompts during pre-training and downstream tasks. Method: The method includes two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) for generating comprehensive textual prompts and Category-Agnostic Discriminative Region Selection (CADRS) for identifying compact visual prompts. It also incorporates set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) for better alignment. Result: The proposed method improves zero-shot generalization by constructing more comprehensive textual prompts and compact visual prompts, leading to better visual-textual alignment. Conclusion: The proposed constrained prompt enhancement method effectively improves visual-textual alignment and enhances the zero-shot generalization capabilities of vision-language models (VLMs). Abstract: Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). 
Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates a synonymous semantic set for each category via large language models, and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions with activation maps output by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, thereby improving zero-shot generalization of VLMs.
[211] Robust Point Cloud Registration via Geometric Overlapping Guided Rotation Search
Zhao Zheng,Jingfan Fan,Long Shao,Hong Song,Danni Ai,Tianyu Fu,Deqiang Xiao,Yongtian Wang,Jian Yang
Main category: cs.CV
TL;DR: This paper introduces a new method for point cloud registration that improves accuracy and efficiency by using a geometric maximum overlapping framework and rotation-only branch-and-bound search, outperforming existing state-of-the-art techniques.
Details
Motivation: The motivation for the paper is to improve the efficiency and accuracy of point cloud registration, particularly under high outlier ratios. Existing methods either require excessive space and time complexity or suffer from inaccuracies due to local optima. Method: The method involves decomposing the rigid transformation using Chasles' theorem into translation along a rotation axis and a 2D rigid transformation. The optimal rotation axis and angle are found using a branch-and-bound search, and residual parameters are formulated as range maximum query problems. Top-k candidate rotation axes are searched within a hemisphere parameterized by cube mapping, and translation along each axis is estimated by interval stabbing. 2D registration is simplified to a 1D rotation angle search using a sweep line algorithm with a segment tree for solving 2D range maximum query problems. Result: Experimental results on 3DMatch, 3DLoMatch, and KITTI datasets show that the proposed method achieves superior accuracy and efficiency compared to state-of-the-art methods. The time complexity remains polynomial, and the space complexity increases linearly with the number of points. Conclusion: The paper concludes that their proposed method for point cloud registration, which uses a geometric maximum overlapping framework through rotation-only branch-and-bound search, outperforms current state-of-the-art methods in terms of accuracy and efficiency. It also maintains a lower time and space complexity. Abstract: Point cloud registration based on correspondences computes the rigid transformation that maximizes the number of inliers constrained within the noise threshold. Current state-of-the-art (SOTA) methods employing spatial compatibility graphs or branch-and-bound (BnB) search mainly focus on registration under high outlier ratios. 
However, graph-based methods require at least quadratic space and time complexity for graph construction, while multi-stage BnB search methods often suffer from inaccuracy due to local optima between decomposed stages. This paper proposes a geometric maximum overlapping registration framework via rotation-only BnB search. The rigid transformation is decomposed using Chasles' theorem into a translation along the rotation axis and a 2D rigid transformation. The optimal rotation axis and angle are searched via BnB, with residual parameters formulated as range maximum query (RMQ) problems. Firstly, the top-k candidate rotation axes are searched within a hemisphere parameterized by cube mapping, and the translation along each axis is estimated through interval stabbing of the correspondences projected onto that axis. Secondly, the 2D registration is relaxed to 1D rotation angle search with 2D RMQ of geometric overlapping for axis-aligned rectangles, which is solved deterministically in polynomial time using a sweep line algorithm with a segment tree. Experimental results on 3DMatch, 3DLoMatch, and KITTI datasets demonstrate superior accuracy and efficiency over SOTA methods, while the time complexity is polynomial and the space complexity increases linearly with the number of points, even in the worst case.
[212] FedKLPR: Personalized Federated Learning for Person Re-Identification with Adaptive Pruning
Po-Hsien Yu,Yu-Syuan Tseng,Shao-Yi Chien
Main category: cs.CV
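TL;DR: FedKLPR is a lightweight, communication-efficient federated learning framework that addresses statistical heterogeneity and communication overhead in person re-identification.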
Details
Motivation: Federated learning offers privacy-preserving advantages, but applying it to real-world person re-identification systems faces the challenges of statistical heterogeneity and communication overhead. Method: FedKLPR introduces four components, KLL, KLPWA, SAS, and CRR, to address statistical heterogeneity and communication overhead. Result: Experiments show that FedKLPR reduces communication cost by 33%-38% on ResNet-50 and by 20%-40% on ResNet-34 while maintaining model accuracy. Conclusion: FedKLPR effectively addresses the challenges that non-IID data and communication overhead pose for federated learning, enabling efficient person re-identification. Abstract: Person re-identification (Re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) offers a privacy-preserving solution by enabling collaborative model training without centralized data collection. However, applying FL to real-world Re-ID systems faces two major challenges: statistical heterogeneity across clients due to non-IID data distributions, and substantial communication overhead caused by frequent transmission of large-scale models. To address these issues, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-identification. FedKLPR introduces four key components. First, the KL-Divergence Regularization Loss (KLL) constrains local models by minimizing the divergence from the global feature distribution, effectively mitigating the effects of statistical heterogeneity and improving convergence stability under non-IID conditions. Second, KL-Divergence-Prune Weighted Aggregation (KLPWA) integrates pruning ratio and distributional similarity into the aggregation process, thereby improving the robustness of the global model while significantly reducing communication overhead. Furthermore, Sparse Activation Skipping (SAS) mitigates the dilution of critical parameters during the aggregation of pruned client models by excluding zero-valued weights from the update process. Finally, Cross-Round Recovery (CRR) introduces a dynamic pruning control mechanism that halts pruning when necessary, enabling deeper compression while maintaining model accuracy. Experimental results on eight benchmark datasets demonstrate that FedKLPR achieves significant communication reduction.
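The idea behind excluding zero-valued weights during aggregation can be illustrated with a small sketch. This is an assumed, simplified reading of SAS, not the authors' implementation: each parameter is averaged only over the clients that kept it (non-zero), so pruned zeros do not dilute surviving values. The function name is hypothetical, and the KLPWA client weighting is deliberately omitted.

```python
import numpy as np

def sparse_aware_average(client_weights):
    """Average each parameter over only the clients where it is non-zero."""
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    counts = (stacked != 0).sum(axis=0)   # how many clients kept each weight
    totals = stacked.sum(axis=0)          # zeros contribute nothing to the sum
    # Where every client pruned the weight, the aggregated weight stays zero.
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)

# Three pruned client models holding the same three parameters.
clients = [
    [0.4, 0.0, 1.0],
    [0.6, 0.0, 0.0],   # third weight pruned on this client
    [0.5, 0.0, 0.0],   # and on this one
]
aggregated = sparse_aware_average(clients)
plain_mean = np.mean(clients, axis=0)
```

In this toy case the plain mean shrinks the third weight to about 0.33, while the sparse-aware average keeps it at 1.0 because only one client retained it.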
Compared with the state-of-the-art, FedKLPR reduces communication cost by 33%-38% on ResNet-50 and by 20%-40% on ResNet-34, while maintaining model accuracy within 1% degradation.
[213] TinySR: Pruning Diffusion for Real-World Image Super-Resolution
Linwei Dong,Qingnan Fan,Yuhang Yu,Qi Zhang,Jinwei Chen,Yawei Luo,Changqing Zou
Main category: cs.CV
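TL;DR: TinySR is a compact diffusion model designed for real-time image super-resolution; through structural optimization and compression, it substantially improves speed and efficiency while preserving high quality.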
Details
Motivation: Existing diffusion models face high computational overhead or overly large model architectures in real-time applications, so a compact and efficient model is needed for high-quality image super-resolution. Method: Introduces Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to support effective decisions in depth pruning; compresses the VAE through channel pruning, attention removal, and lightweight SepConv; and applies pre-caching to further accelerate the model. Result: Compared with its teacher model TSD-SR, TinySR achieves up to 5.68x speedup and an 83% parameter reduction while still delivering high-quality results. Conclusion: TinySR achieves real-time performance while maintaining perceptual quality, offering an efficient solution for practical applications by reducing computational cost and model size. Abstract: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, posing challenges for real-time applications. Although one-step distillation methods, such as OSEDiff and TSD-SR, offer faster inference, they remain fundamentally constrained by their large, over-parameterized model architectures. In this work, we present TinySR, a compact yet effective diffusion model specifically designed for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce a Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to facilitate more effective decision-making in depth pruning. We achieve VAE compression through channel pruning, attention removal and lightweight SepConv. We eliminate time- and prompt-related modules and apply pre-caching techniques to further speed up the model. TinySR significantly reduces computational cost and model size, achieving up to 5.68x speedup and 83% parameter reduction compared to its teacher TSD-SR, while still providing high-quality results.
[214] An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
Zihan Liang,Jiahao Sun,Haoran Ma
Main category: cs.CV
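TL;DR: This work proposes RefineEdit-Agent, a closed-loop system that leverages large language models and vision-language models to address iterative image editing, a practical challenge for text-to-image generation models.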
Details
Motivation: Existing text-to-image generation models face challenges in real-world applications: fine-grained instruction understanding, context preservation during modifications, and the lack of intelligent feedback mechanisms. Method: Leverages Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) to build a closed-loop system comprising instruction parsing, editing planning, image editing, and feedback evaluation modules. Result: RefineEdit-Agent achieves an average score of 3.67 on the LongBench-T2I-Edit benchmark, significantly outperforming state-of-the-art baselines. Conclusion: RefineEdit-Agent offers an effective solution that overcomes the shortcomings of existing methods in instruction understanding, context preservation, and feedback mechanisms, enabling higher-quality iterative image editing. Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL.
Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
[215] Disentangled Geometry and Appearance for Efficient Multi-View Surface Reconstruction and Rendering
Qitong Zhang,Jieqing Feng
Main category: cs.CV
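TL;DR: This paper proposes an efficient multi-view surface reconstruction and rendering method that, built on an explicit mesh and a differentiable rasterization framework, achieves high quality, high efficiency, and broad application potential.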
Details
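Motivation: Existing neural rendering-based multi-view surface reconstruction methods require an additional mesh extraction step, which is cumbersome and produces low-quality surfaces with mesh aliasing, restricting downstream applications. The paper therefore aims for a more efficient, higher-quality solution. Method: The paper adopts an explicit mesh representation and a differentiable rasterization framework and introduces a model that disentangles geometry and appearance. A neural deformation field incorporates global geometric context, and a novel regularization constrains the geometric features passed to the neural shader to improve its accuracy. In addition, a view-invariant diffuse term is separated and baked into the mesh vertices, further improving rendering efficiency. Result: Experiments show that the proposed method reaches state-of-the-art training speed (4.84 minutes) and rendering speed (0.023 seconds), with reconstruction quality comparable to top-performing methods. The method also supports practical applications such as mesh and texture editing, demonstrating its flexibility and application potential. Conclusion: The paper presents an efficient multi-view surface reconstruction and rendering method that combines an explicit mesh representation with a differentiable rasterization framework, significantly improving reconstruction quality and applicability and making an important contribution to multi-view surface reconstruction. Abstract: This paper addresses the limitations of neural rendering-based multi-view surface reconstruction methods, which require an additional mesh extraction step that is inconvenient and produces poor-quality surfaces with mesh aliasing, restricting downstream applications. Building on the explicit mesh representation and differentiable rasterization framework, this work proposes an efficient solution that preserves the high efficiency of this framework while significantly improving reconstruction quality and versatility. Specifically, we introduce a disentangled geometry and appearance model that does not rely on deep networks, enhancing learning and broadening applicability. A neural deformation field is constructed to incorporate global geometric context, enhancing geometry learning, while a novel regularization constrains geometric features passed to a neural shader to ensure its accuracy and boost shading. For appearance, a view-invariant diffuse term is separated and baked into mesh vertices, further improving rendering efficiency. Experimental results demonstrate that the proposed method achieves state-of-the-art training (4.84 minutes) and rendering (0.023 seconds) speeds, with reconstruction quality that is competitive with top-performing methods. Moreover, the method enables practical applications such as mesh and texture editing, showcasing its versatility and application potential.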
This combination of efficiency, competitive quality, and broad applicability makes our approach a valuable contribution to multi-view surface reconstruction and rendering.
[216] Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
Long Le,Ryan Lucas,Chen Wang,Chuhao Chen,Dinesh Jayaraman,Eric Eaton,Lingjie Liu
Main category: cs.CV
TL;DR: PIXIE is a fast and generalizable neural network method that predicts physical properties of 3D scenes from visual features, enabling realistic physics simulations and zero-shot transfer to real-world scenes.
Details
Motivation: Inferring physical properties of 3D scenes is crucial for realistic virtual worlds, but existing methods are limited by slow, per-scene optimization. The goal is to develop a faster and more generalizable solution. Method: PIXIE trains a generalizable neural network using supervised losses to predict physical properties from 3D visual features. It uses a feed-forward network for fast inference and integrates with a learned static scene representation for physics simulation. Result: PIXIE achieves 1.46-4.39x better performance than test-time optimization methods and is significantly faster. It can also generalize to real-world scenes using pretrained visual features like CLIP, despite being trained only on synthetic data. Conclusion: PIXIE is a novel and efficient method for inferring physical properties of 3D scenes, offering significant improvements in speed and performance over existing methods, and demonstrating zero-shot generalization to real-world scenes. Abstract: Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physical material annotations.
Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/
[217] Investigating Domain Gaps for Indoor 3D Object Detection
Zijing Zhao,Zhu Xu,Qingchao Chen,Yuxin Peng,Yang Liu
Main category: cs.CV
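TL;DR: This paper proposes a new benchmark for domain adaptation of indoor 3D object detectors and analyzes how different domain gaps affect detector performance.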
Details
Motivation: Existing research on indoor point cloud datasets tends to overfit to dataset-specific factors such as point cloud quality, bounding box layout, and instance features, motivating the study of more adaptable 3D object detectors. Method: Cross-dataset experiments are conducted on the ScanNet, SUN RGB-D, and 3D Front datasets, together with the newly proposed large-scale datasets ProcTHOR-OD and ProcFront, which are generated by a 3D simulator. Result: The experiments analyze the impact of different domain gaps on 3D object detectors and examine performance under adaptation scenarios including synthetic-to-real, point cloud quality, layout, and instance feature adaptation. Conclusion: The paper presents a comprehensive benchmark for domain adaptation of indoor 3D object detectors and introduces several approaches to improve adaptation performance, aiming to foster future detectors with stronger cross-domain generalization. Abstract: As a fundamental task for indoor scene understanding, 3D object detection has been extensively studied, and the accuracy on indoor point cloud data has been substantially improved. However, existing research has been conducted on limited datasets, where the training and testing sets share the same distribution. In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. Since indoor point cloud datasets are collected and constructed in different ways, the object detectors are likely to overfit to specific factors within each dataset, such as point cloud quality, bounding box layout and instance features. We conduct experiments across datasets on different adaptation scenarios including synthetic-to-real adaptation, point cloud quality adaptation, layout adaptation and instance feature adaptation, analyzing the impact of different domain gaps on 3D object detectors. We also introduce several approaches to improve adaptation performance, providing baselines for domain adaptive indoor 3D object detection, hoping that future work may propose detectors with stronger generalization ability across domains. Our project homepage can be found at https://jeremyzhao1998.github.io/DAVoteNet-release/.
[218] Multi-Level LVLM Guidance for Untrimmed Video Action Recognition
Liyang Peng,Sihan Zhu,Yunjie Guo
Main category: cs.CV
TL;DR: This paper proposes the Event-Contextualized Video Transformer (ECVT) to improve action recognition and localization in complex, untrimmed videos by leveraging Large Vision-Language Models (LVLMs) for multi-granularity semantic descriptions and integrating them into a video encoding framework.
Details
Motivation: The motivation is to overcome the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features in complex, untrimmed videos. Method: ECVT uses a dual-branch design with a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch that leverages LVLMs to generate semantic descriptions. It integrates multi-level textual cues into the video encoder through adaptive gating, cross-modal attention, and an event graph module. Result: ECVT achieves an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines. Conclusion: ECVT achieves state-of-the-art performance on the ActivityNet v1.3 and THUMOS14 datasets, demonstrating its effectiveness in recognizing and localizing actions in complex, untrimmed videos. Abstract: Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details.
These multi-level textual cues are integrated into the video encoder's learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model's ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.
[219] A Synthetic Dataset for Manometry Recognition in Robotic Applications
Pedro Antonio Rabelo Saraiva,Enzo Ferreira de Souza,Joao Manoel Herrera Pinheiro,Thiago H. Segreto,Ricardo V. Godoy,Marcelo Becker
Main category: cs.CV
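TL;DR: This paper proposes a synthetic data generation method to overcome the difficulty of acquiring real data in complex industrial environments and demonstrates the effectiveness of synthetic data for object detection.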
Details
Motivation: To address the practical and economic barriers to acquiring real data in complex industrial environments. Method: Combines procedural rendering with AI-driven video generation, using BlenderProc and NVIDIA's Cosmos-Predict2 model to generate synthetic data. Result: A YOLO detection network trained on a mix of real and synthetic data outperforms models trained only on real data. Conclusion: A synthetic-first approach is an efficient, cost-effective, and safe alternative for developing reliable perception systems in industrial environments. Abstract: This work addresses the challenges of data scarcity and high acquisition costs for training robust object detection models in complex industrial environments, such as offshore oil platforms. The practical and economic barriers to collecting real-world data in these hazardous settings often hamper the development of autonomous inspection systems. To overcome this, we propose and validate a hybrid data synthesis pipeline that combines procedural rendering with AI-driven video generation. Our methodology leverages BlenderProc to create photorealistic images with precise annotations and controlled domain randomization, and integrates NVIDIA's Cosmos-Predict2 world-foundation model to synthesize physically plausible video sequences with temporal diversity, capturing rare viewpoints and adverse conditions. We demonstrate that a YOLO-based detection network trained on a composite dataset, blending real images with our synthetic data, achieves superior performance compared to models trained exclusively on real-world data. Notably, a 1:1 mixture of real and synthetic data yielded the highest accuracy, surpassing the real-only baseline. These findings highlight the viability of a synthetic-first approach as an efficient, cost-effective, and safe alternative for developing reliable perception systems in safety-critical and resource-constrained industrial applications.
[220] T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Kaiyue Sun,Rongyao Fang,Chengqi Duan,Xian Liu,Xihui Liu
Main category: cs.CV
TL;DR: T2I-ReasonBench is introduced as a multi-dimensional benchmark to evaluate reasoning and image quality in text-to-image models.
Details
Motivation: To accurately evaluate and analyze the reasoning capabilities of text-to-image models, ensuring both reasoning accuracy and high-quality image generation. Method: Proposed a two-stage evaluation protocol to assess reasoning accuracy and image quality across four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. Result: Various T2I generation models were benchmarked, providing a comprehensive analysis of their reasoning and generation performance. Conclusion: T2I-ReasonBench provides a comprehensive benchmark for evaluating reasoning capabilities of text-to-image models across multiple dimensions. Abstract: We propose T2I-ReasonBench, a benchmark evaluating reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess the reasoning accuracy and image quality. We benchmark various T2I generation models, and provide a comprehensive analysis of their performance.
[221] GraphMMP: A Graph Neural Network Model with Mutual Information and Global Fusion for Multimodal Medical Prognosis
Xuhao Shan,Ruiquan Ge,Jikui Liu,Linglong Wu,Chi Zhang,Siqi Liu,Wenjian Qin,Wenwen Min,Ahmed Elazab,Changmiao Wang
Main category: cs.CV
TL;DR: This paper introduces GraphMMP, a two-stage multimodal prognosis model using graph neural networks and Mamba, which outperforms existing methods in liver prognosis and METABRIC datasets.
Details
Motivation: The main challenges lie in effectively modeling the complex interactions between heterogeneous data modalities with distinct characteristics while capturing both local and global dependencies across modalities. Method: The proposed model constructs feature graphs using mutual information and features a global fusion module built on Mamba. Result: Empirical results show that GraphMMP surpasses existing methods on datasets related to liver prognosis and the METABRIC study. Conclusion: GraphMMP, a two-stage multimodal prognosis model based on graph neural networks and Mamba, demonstrates effectiveness in multimodal medical prognosis tasks. Abstract: In the field of multimodal medical data analysis, leveraging diverse types of data and understanding their hidden relationships continues to be a research focus. The main challenges lie in effectively modeling the complex interactions between heterogeneous data modalities with distinct characteristics while capturing both local and global dependencies across modalities. To address these challenges, this paper presents a two-stage multimodal prognosis model, GraphMMP, which is based on graph neural networks. The proposed model constructs feature graphs using mutual information and features a global fusion module built on Mamba, which significantly boosts prognosis performance. Empirical results show that GraphMMP surpasses existing methods on datasets related to liver prognosis and the METABRIC study, demonstrating its effectiveness in multimodal medical prognosis tasks.
[222] Optimizing Multi-Modal Trackers via Sensitivity-aware Regularized Tuning
Zhiwen Chen,Jinjian Wu,Zhiyu Zhu,Yifan Zhang,Guangming Shi,Junhui Hou
Main category: cs.CV
TL;DR: This paper proposes a sensitivity-aware regularized tuning framework to improve the transferability of pre-trained models for multi-modal tracking, outperforming state-of-the-art techniques.
Details
Motivation: The motivation is to address the suboptimal plasticity-stability trade-off in existing fine-tuning paradigms for multi-modal trackers, which either have excessive freedom or are over-restricted. Method: The paper proposes a novel sensitivity-aware regularized tuning framework that incorporates intrinsic parameter sensitivities to refine the learning process. It analyzes the tangent space of pre-trained weights to measure and orient prior sensitivities, and explores transfer sensitivities during the tuning phase. Result: Extensive experiments demonstrate the superior performance of the proposed method, surpassing current state-of-the-art techniques across various multi-modal tracking tasks. Conclusion: The paper concludes that the proposed sensitivity-aware regularized tuning framework enhances the transferability across modalities and outperforms current state-of-the-art techniques in multi-modal tracking. Abstract: This paper tackles the critical challenge of optimizing multi-modal trackers by effectively adapting the pre-trained models for RGB data. Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off. To mitigate this dilemma, we propose a novel sensitivity-aware regularized tuning framework, which delicately refines the learning process by incorporating intrinsic parameter sensitivities. Through a comprehensive investigation from pre-trained to multi-modal contexts, we identify that parameters sensitive to pivotal foundational patterns and cross-domain shifts are primary drivers of this issue. Specifically, we first analyze the tangent space of pre-trained weights to measure and orient prior sensitivities, dedicated to preserving generalization. Then, we further explore transfer sensitivities during the tuning phase, emphasizing adaptability and stability. 
By incorporating these sensitivities as regularization terms, our method significantly enhances the transferability across modalities. Extensive experiments showcase the superior performance of the proposed method, surpassing current state-of-the-art techniques across various multi-modal tracking tasks. The source code and models will be publicly available at https://github.com/zhiwen-xdu/SRTrack.
[223] Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Hugo Bohy,Minh Tran,Kevin El Haddad,Thierry Dutoit,Mohammad Soleymani
Main category: cs.CV
TL;DR: This paper presents Social-MAE, a pre-trained audiovisual autoencoder for perceiving human social behaviors.
Details
Motivation: Because human social behaviors are inherently multimodal, powerful audiovisual models are needed to perceive them. Method: Social-MAE builds on CAV-MAE, increasing the number of input frames and pre-training in a self-supervised manner on the VoxCeleb2 dataset. Result: The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results on apparent personality estimation. Conclusion: Through self-supervised pre-training on social data, Social-MAE achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results on apparent personality estimation. Abstract: Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating the model on different social and affective downstream tasks, namely, emotion recognition, laughter detection and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
[224] DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers
Michael Podsiadly,Brendon K Lay
Main category: cs.CV
TL;DR: This paper proposes a hybrid model combining DINO and Barlow Twins techniques to train AI models for image understanding with fewer labels and less compute. The model is trained on the MS COCO dataset using only 10% of labeled data, achieving performance comparable to standalone DINO while maintaining strong feature representations and showing improved semantic segmentation capability.
Details
Motivation: Training AI models to understand images without costly labeled data remains a challenge. DINO may be sensitive to certain augmentations and Barlow Twins often requires batch sizes too large to fit on consumer hardware. Method: Combined DINO (teacher-student learning) with Barlow Twins (redundancy reduction) to leverage their complementary strengths. Result: The combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. Conclusion: The combined method offers a scalable and label-efficient alternative for training ViTs in resource-constrained environments. Abstract: Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques--DINO (teacher-student learning) and Barlow Twins (redundancy reduction)--to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self-supervised learning, each comes with limitations--DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy-reduction objective of Barlow Twins with the self-distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10\% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. 
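The combination described above can be illustrated with a small loss computation. The sketch below is not the authors' implementation: it assumes the hybrid objective is a simple weighted sum of a DINO-style self-distillation loss and a Barlow Twins redundancy-reduction loss, and the names `combined_loss`, `alpha`, and `lam`, along with the temperature defaults, are illustrative assumptions. It uses plain Python lists so that it runs without any dependencies.

```python
import math

def log_softmax(logits, temp):
    """Numerically stable log-softmax of logits / temp."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - lse for v in scaled]

def dino_loss(student, teacher, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between centered, sharpened teacher outputs and student outputs."""
    total = 0.0
    for s, t in zip(student, teacher):
        centered = [ti - ci for ti, ci in zip(t, center)]
        p_teacher = [math.exp(v) for v in log_softmax(centered, t_teacher)]
        log_p_student = log_softmax(s, t_student)
        total -= sum(p * lp for p, lp in zip(p_teacher, log_p_student))
    return total / len(student)

def standardize(z):
    """Zero-mean, unit-variance normalization of each feature across the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation matrix of two views toward the identity."""
    a, b = standardize(z1), standardize(z2)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c) ** 2 if i == j else lam * c ** 2
    return loss

def combined_loss(student, teacher, center, z1, z2, alpha=1.0):
    """Assumed hybrid objective: DINO term plus alpha times the Barlow Twins term."""
    return dino_loss(student, teacher, center) + alpha * barlow_twins_loss(z1, z2)
```

In this sketch `alpha` trades off the two terms; how the paper actually balances them is not stated in the abstract.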
This combined method offers a scalable, label-efficient alternative for training ViTs in resource-constrained environments.
[225] OmniMRI: A Unified Vision-Language Foundation Model for Generalist MRI Interpretation
Xingxin He,Aurora Rofena,Ruimin Feng,Haozhe Liao,Zhaoye Zhou,Albert Jang,Fang Liu
Main category: cs.CV
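TL;DR: OmniMRI is a unified vision-language foundation model that integrates MRI images with language information to provide a general framework for performing the many tasks of the MRI workflow.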
Details
Motivation: MRI is indispensable in clinical practice but is constrained by fragmented, multi-stage workflows; existing deep learning methods lack generalizability across clinical settings and rarely integrate imaging data with the language information radiologists rely on. Method: OmniMRI uses a multi-stage training paradigm comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning; its training data include image-only data, paired vision-text data, and instruction-response data. Result: OmniMRI demonstrates the ability to perform diverse MRI tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. Conclusion: By integrating MRI images and language information, OmniMRI provides a general framework with the potential to consolidate fragmented MRI pipelines into a comprehensive end-to-end interpretation system. Abstract: Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows encompassing acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. While deep learning has achieved progress in individual tasks, existing approaches are often anatomy- or application-specific and lack generalizability across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets, over 220,000 MRI volumes and 19 million MRI slices, incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results demonstrate OmniMRI's ability to perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. 
These findings highlight OmniMRI's potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.
[226] Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
Xinyu Wei,Guoli Yang,Jialu Zhou,Mingyue Yang,Leqian Li,Kedi Zhang,Chunping Qiu
Main category: cs.CV
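TL;DR: DEHVF is an efficient vision-language fine-tuning method that dynamically embeds and fuses hierarchical visual features to address increased input sequence length and high computational overhead.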
Details
Motivation: Existing methods that fuse visual information into the intermediate layers of LLMs neglect the hierarchical semantic representations within the model and the fine-grained visual information in the shallower visual encoding layers. Method: DEHVF dynamically selects and fuses multi-layer visual features and embeds them into the FFNs of LLMs, avoiding sequence expansion. Result: Experiments show that DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning methods on several vision-language benchmarks (such as visual question answering on ScienceQA and image captioning on COCO Captions) while keeping training and inference efficient. Conclusion: DEHVF effectively addresses the increased input sequence length and high computational overhead in LVLMs while achieving precise alignment and complementarity of cross-modal information. Abstract: Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified sequence input for Large Language Models (LLMs). However, this paradigm leads to a significant increase in the length of the input sequence, resulting in substantial computational overhead. Existing methods attempt to fuse visual information into the intermediate layers of LLMs, which alleviates the sequence length issue but often neglects the hierarchical semantic representations within the model and the fine-grained visual information available in the shallower visual encoding layers. To address this limitation, we propose DEHVF, an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. Its core lies in leveraging the inherent hierarchical representation characteristics of visual encoders and language models. Through a lightweight hierarchical visual fuser, it dynamically selects and fuses hierarchical features corresponding to semantic granularity based on the internal representations of each layer in LLMs. The fused layer-related visual features are then projected and aligned before being directly embedded into the Feed-Forward Network (FFN) of the corresponding layer in LLMs. This approach not only avoids sequence expansion but also dynamically fuses multi-layer visual information. By fine-tuning only a small number of parameters, DEHVF achieves precise alignment and complementarity of cross-modal information at the same semantic granularity. 
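To make the fusion idea concrete, here is a minimal sketch under strong assumptions; it is not the DEHVF architecture. It assumes the fuser scores each visual-encoder layer by a dot product with the current LLM hidden state, mixes the layers with softmax weights, and adds the result to the FFN output. The names `fuse_hierarchical` and `ffn_with_visual` are hypothetical, and the projection and alignment steps mentioned in the abstract are omitted.

```python
import math

def fuse_hierarchical(hidden, visual_layers):
    """Softmax-weighted mix of per-layer visual features.

    hidden:        LLM hidden state for one layer (list of floats)
    visual_layers: one feature vector per visual-encoder layer, same dimension
    Scoring by dot product is an assumption made for this sketch.
    """
    scores = [sum(h * v for h, v in zip(hidden, feat)) for feat in visual_layers]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, visual_layers))
        for i in range(len(hidden))
    ]
    return fused, weights

def ffn_with_visual(hidden, ffn_out, visual_layers):
    """Add the fused visual vector to the FFN output; no tokens are appended."""
    fused, _ = fuse_hierarchical(hidden, visual_layers)
    return [o + f for o, f in zip(ffn_out, fused)]
```

The point of the sketch is that the visual signal enters as a per-layer vector of the hidden dimension, so the token sequence seen by attention does not grow.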
We conducted experiments across various VL benchmarks, including visual question answering on ScienceQA and image captioning on COCO Captions. The results demonstrate that DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning (PEFT) baselines while maintaining efficient training and inference.
[227] Minimal Solvers for Full DoF Motion Estimation from Asynchronous Tracks
Petr Hruby,Marc Pollefeys
Main category: cs.CV
TL;DR: This paper proposes a polynomial approximation and minimal solvers to estimate camera motion from asynchronous point tracks, which is relevant for rolling shutter and event cameras.
Details
Motivation: The motivation is to address the challenge of estimating a camera's translational and angular velocity from asynchronous point tracks, which is particularly relevant for rolling shutter and event cameras. Method: The paper proposes a polynomial approximation for the non-polynomial original problem, classifies the resulting minimal problems, develops minimal solvers for low-degree problems, and evaluates them on synthetic and real datasets. Result: The result is the development of minimal solvers for several low-degree problems, along with an evaluation on synthetic and real datasets demonstrating their effectiveness. Conclusion: The paper concludes that by using polynomial approximation and minimal solvers, translational and angular velocity estimation for asynchronous point tracks can be efficiently solved, with the code being made publicly available. Abstract: We address the problem of estimating both translational and angular velocity of a camera from asynchronous point tracks, a formulation relevant to rolling shutter and event cameras. Since the original problem is non-polynomial, we propose a polynomial approximation, classify the resulting minimal problems, and determine their algebraic degrees. Furthermore, we develop minimal solvers for several problems with low degrees and evaluate them on synthetic and real datasets. The code will be made publicly available.
[228] CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation
Mingyue Yang,Dianxi Shi,Jialu Zhou,Xinyu Wei,Leqian Li,Shaowu Yang,Chunping Qiu
Main category: cs.CV
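TL;DR: This paper proposes CEIDM, a diffusion-based text-to-image generation method that uses entity interactive relationship mining, interactive action clustering and offset, and an entity control network to control entities and their interactions effectively and improve image quality.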
Details
Motivation: In text-to-image generation, the complexity of entities and their intricate interactions pose a major challenge for diffusion-based T2I methods: how to effectively control entities and their interactions to generate high-quality images. Method: The paper proposes CEIDM, a diffusion-based text-to-image generation method comprising an entity interactive relationship mining approach, an interactive action clustering and offset method, and an entity control network. Result: Constructing global and local bidirectional offsets enhances semantic understanding and detail supplementation of the original actions, making the model's understanding of interactive "actions" more accurate and yielding images with more accurate interactive actions. Conclusion: Experiments show that CEIDM outperforms the most representative existing methods in both entity control and interaction control. Abstract: In Text-to-Image (T2I) generation, the complexity of entities and their intricate interactions pose a significant challenge for T2I methods based on diffusion models: how to effectively control entities and their interactions to produce high-quality images. To address this, we propose CEIDM, an image generation method based on a diffusion model with dual controls for entity and interaction. First, we propose an entity interactive relationships mining approach based on Large Language Models (LLMs), extracting reasonable and rich implicit interactive relationships through chain of thought to guide diffusion models to generate high-quality images that are closer to realistic logic and have more reasonable interactive relationships. Furthermore, we propose an interactive action clustering and offset method to cluster and offset the interactive action features contained in each text prompt. By constructing global and local bidirectional offsets, we enhance semantic understanding and detail supplementation of original actions, making the model's understanding of the concept of interactive "actions" more accurate and generating images with more accurate interactive actions. Finally, we design an entity control network which generates masks with entity semantic guidance, then leverages a multi-scale convolutional network to enhance entity features and a dynamic network to fuse features. It effectively controls entities and significantly improves image quality. 
Experiments show that the proposed CEIDM method is better than the most representative existing methods in both entity control and their interaction control.
[229] Towards Optimal Convolutional Transfer Learning Architectures for Breast Lesion Classification and ACL Tear Detection
Daniel Frees,Moritz Bolling,Aditri Bhagirath
Main category: cs.CV
TL;DR: This study identifies optimal CNN architectures for medical classification tasks such as ACL tear and breast lesion detection, achieving high AUCs. However, it finds no evidence that RadImageNet pre-training outperforms ImageNet pre-training for these specific tasks.
Details
Motivation: The motivation stems from the scarcity of medical imaging data limiting the efficacy of models trained from scratch, and the need to explore transfer learning as a solution to enhance model performance on small datasets. Method: The research extends previous work by conducting a comprehensive investigation to determine optimal CNN architectures for breast lesion malignancy detection and ACL tear detection. It includes statistical analysis to compare the effects of RadImageNet and ImageNet pre-training on downstream performance. Result: The best models achieved AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, which are competitive with Mei et al. (2022) and surpass other prior works. However, the study found no evidence that RadImageNet pre-training yields superior downstream performance for these tasks. Conclusion: The study concludes that specific CNN architectures, including 1-dimensional convolutional classifiers with skip connections, ResNet50 pre-trained backbones, and partial backbone unfreezing, yield optimal performance for medical classification tasks. RadImageNet pre-training did not demonstrate superior performance over ImageNet pre-training for the studied tasks. Abstract: Modern computer vision models have proven to be highly useful for medical imaging classification and segmentation tasks, but the scarcity of medical imaging data often limits the efficacy of models trained from scratch. Transfer learning has emerged as a pivotal solution to this, enabling the fine-tuning of high-performance models on small data. Mei et al. (2022) found that pre-training CNNs on a large dataset of radiologist-labeled images (RadImageNet) enhanced model performance on downstream tasks compared to ImageNet pretraining. The present work extends Mei et al. (2022) by conducting a comprehensive investigation to determine optimal CNN architectures for breast lesion malignancy detection and ACL tear detection, as well as performing statistical analysis to compare the effect of RadImageNet and ImageNet pre-training on downstream model performance. Our findings suggest that 1-dimensional convolutional classifiers with skip connections, ResNet50 pre-trained backbones, and partial backbone unfreezing yield optimal downstream medical classification performance. Our best models achieve AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with the results reported by Mei et al. (2022) and surpassing other previous works. We do not find evidence confirming RadImageNet pre-training to provide superior downstream performance for ACL tear and breast lesion classification tasks.
[230] Designing Practical Models for Isolated Word Visual Speech Recognition
Iason Ioannis Panagos,Giorgos Sfikas,Christophoros Nikou
Main category: cs.CV
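TL;DR: This paper proposes visual speech recognition (VSR) systems with low hardware costs, using efficient image classification models and lightweight block designs to achieve efficient recognition in resource-constrained settings.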
Details
Motivation: Existing VSR systems can recognize spoken words from video data, but they typically rely on deep neural networks, which incur significant computational costs, raise hardware requirements, and limit applicability in resource-constrained real-world scenarios. The authors aim to address this by developing VSR architectures with low hardware costs. Method: Following the standard two-network design paradigm, the authors first benchmark efficient models from the image classification literature and then adopt lightweight block designs in a temporal convolution network backbone, yielding lightweight end-to-end architectures. Result: Experiments show that the resulting low-resource models achieve strong recognition performance on a large public database of English words. Conclusion: By developing VSR architectures with low hardware costs, the authors address the high computational cost of existing deep models and demonstrate the effectiveness and practicality of these models on a large public database. Abstract: Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. To accurately predict the spoken words, these architectures often rely on deep neural networks to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs, which translate into increased hardware requirements and limit applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in more practical applications. In this work, we aim to alleviate this issue by developing architectures for VSR that have low hardware costs. Following the standard two-network design paradigm, where one network handles visual feature extraction and another one utilizes the extracted features to classify the entire sequence, we develop lightweight end-to-end architectures by first benchmarking efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone. We create several unified models with low resource requirements but strong recognition performance. 
Experiments on the largest public database for English words demonstrate the effectiveness and practicality of our developed models. Code and trained models will be made publicly available.
[231] MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation
Liane Makatura,Benjamin Jones,Siyuan Bian,Wojciech Matusik
Main category: cs.CV
TL;DR: The paper introduces a framework for metamaterial design, including a domain-specific language (MetaDSL), a repository of designs (MetaDB), and benchmarking tools (MetaBench), providing a strong step toward simplified and integrated metamaterial design.
Details
Motivation: The motivation stems from the difficulty in designing metamaterials due to their geometric complexity and the non-trivial mapping from architecture to behavior. The authors aim to simplify and enhance the design process through their contributions. Method: The paper introduces MetaDSL, a domain-specific language for metamaterial design; MetaDB, a repository of parameterized designs; and MetaBench, benchmark suites for evaluating vision-language models. They use these tools to establish baselines and deploy an interactive CAD-like interface. Result: The results include the creation of MetaDSL, MetaDB, and MetaBench, which enable efficient metamaterial design and analysis. The framework demonstrates its effectiveness in case studies and establishes baselines for future developments in metamaterial design tools. Conclusion: The paper concludes that the developed framework provides a strong foundation for the integrated design and understanding of metamaterials, highlighting the effectiveness of the contributions in bridging the gap between design and functionality. Abstract: Metamaterials are micro-architected structures whose geometry imparts highly tunable, often counter-intuitive, bulk properties. Yet their design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives: three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants: structure reconstruction, property-driven inverse design, and performance prediction. 
We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships.
[232] IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data
Meida Chen,Luis Leal,Yue Hu,Rong Liu,Butian Xiong,Andrew Feng,Jiuyi Xu,Yangming Shi
Main category: cs.CV
TL;DR: The paper proposes the Incremental Dynamic Update (IDU) pipeline, which efficiently updates 3D models using minimal new images, reducing time and cost for military simulation and training in dynamic environments.
Details
Motivation: Military organizations require frequent updates to high-resolution 3D virtual environments due to the dynamic nature of battlefield conditions, which is both time-consuming and costly with traditional methods. Method: The Incremental Dynamic Update (IDU) pipeline combines camera pose estimation, change detection, and a 3D generative AI model to integrate new elements into existing 3D reconstructions, such as 3D Gaussian Splatting, with human guidance for accuracy. Result: The IDU pipeline significantly reduces update time and labor, providing a targeted solution for incorporating new elements into 3D models in rapidly evolving scenarios. Conclusion: The IDU pipeline offers a cost-effective and efficient method for maintaining updated 3D models in dynamic military environments by focusing on targeted updates with minimal input. Abstract: For simulation and training purposes, military organizations have made substantial investments in developing high-resolution 3D virtual environments through extensive imaging and 3D scanning. However, the dynamic nature of battlefield conditions, where objects may appear or vanish over time, makes frequent full-scale updates both time-consuming and costly. In response, we introduce the Incremental Dynamic Update (IDU) pipeline, which efficiently updates existing 3D reconstructions, such as 3D Gaussian Splatting (3DGS), with only a small set of newly acquired images. Our approach starts with camera pose estimation to align new images with the existing 3D model, followed by change detection to pinpoint modifications in the scene. A 3D generative AI model is then used to create high-quality 3D assets of the new elements, which are seamlessly integrated into the existing 3D model. The IDU pipeline incorporates human guidance to ensure high accuracy in object identification and placement, with each update focusing on a single new object at a time. 
Experimental results confirm that our proposed IDU pipeline significantly reduces update time and labor, offering a cost-effective and targeted solution for maintaining up-to-date 3D models in rapidly evolving military scenarios.
[233] HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Quanjian Song,Xinyu Wang,Donghao Zhou,Jingyu Lin,Cunjian Chen,Yue Ma,Xiu Li
Main category: cs.CV
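TL;DR: This paper proposes HERO, an efficient inference framework for world models that accelerates inference with hierarchical strategies, achieving a 1.73× speedup with minimal quality degradation.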
Details
Motivation: The multi-modal nature of world models causes shallow layers to exhibit high temporal variability, while deeper layers yield more stable feature representations. Method: HERO adopts hierarchical strategies for acceleration: (i) in shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation; with patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features, completely bypassing the computations in attention modules and feed-forward networks. Result: HERO speeds up world-model inference by 1.73× with minimal quality degradation. Conclusion: HERO achieves a 1.73× speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods. Abstract: Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73× speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
[234] TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints
Vinh-Thuan Ly,Hoang M. Truong,Xuan-Huong Nguyen
Main category: cs.CV
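TL;DR: This paper proposes TinyGiantVLM, a lightweight two-stage framework for physical spatial reasoning in industrial environments that improves spatial understanding by fusing RGB and depth features with an MoE module.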
Details
Motivation: Existing vision-language models struggle to understand fine-grained spatial relationships in warehouse-scale environments, particularly when handling 3D layouts, object arrangements, and multimodal cues. Method: Pretrained visual backbones encode global and region-level features from the RGB and depth modalities, and a Mixture-of-Experts (MoE) fusion module dynamically combines the spatial representations. Result: On Track 3 of the AI City Challenge 2025, the 64M-parameter base model placed 5th with a score of 66.8861, and the 80M-parameter variant performed better on spatial reasoning tasks. Conclusion: With its lightweight, modular two-stage framework, TinyGiantVLM effectively bridges visual perception and spatial understanding in industrial environments. Abstract: Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is conducted in a two-phase strategy: the first phase focuses on generating free-form answers to enhance spatial reasoning ability, while the second phase uses normalized answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We further present an 80M-parameter variant with expanded MoE capacity, which demonstrates improved performance on spatial reasoning tasks.
[235] HotSpotter - Patterned Species Instance Recognition
Jonathan P. Crall,Charles V. Stewart,Tanya Y. Berger-Wolf,Daniel I. Rubenstein,Siva R. Sundaresan
Main category: cs.CV
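TL;DR: HotSpotter is a fast, accurate, species-independent algorithm for identifying individual animals that outperforms existing methods on large image databases.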
Details
Motivation: A fast, accurate algorithm for identifying individual animals that is not tied to a particular species is needed. Method: Two approaches are proposed: one tests each new query image sequentially against every database image, generating scores and ranking them; the other matches the query image against the database using a fast nearest neighbor search with a competitive scoring mechanism. Result: On databases of more than 1000 images, HotSpotter produces more accurate matches than existing methods and matches each query image within a few seconds. Conclusion: HotSpotter is a fast, accurate, species-independent algorithm for identifying individual animals that outperforms published methods on databases of more than 1000 images. Abstract: We present HotSpotter, a fast, accurate algorithm for identifying individual animals against a labeled database. It is not species specific and has been applied to Grevy's and plains zebras, giraffes, leopards, and lionfish. We describe two approaches, both based on extracting and matching keypoints or "hotspots". The first tests each new query image sequentially against each database image, generating a score for each database image in isolation, and ranking the results. The second, building on recent techniques for instance recognition, matches the query image against the database using a fast nearest neighbor search. It uses a competitive scoring mechanism derived from the Local Naive Bayes Nearest Neighbor algorithm recently proposed for category recognition. We demonstrate results on databases of more than 1000 images, producing more accurate matches than published methods and matching each query image in just a few seconds.
[236] A Weighted Vision Transformer-Based Multi-Task Learning Framework for Predicting ADAS-Cog Scores
Nur Amirah Abd Hamid,Mohd Ibrahim Shapiai,Daphne Teck Ching Lai
Main category: cs.CV
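TL;DR: This paper proposes a weighted Vision Transformer (ViT)-based multi-task learning framework that predicts the ADAS-Cog global score and its sub-scores to improve the accuracy and interpretability of Alzheimer's disease prediction.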
Details
Motivation: Existing methods often overlook the predictive value of the 13 sub-scores of ADAS-Cog, which reflect distinct cognitive domains. Assigning higher loss weights to clinically meaningful sub-scores can guide the model toward more relevant cognitive domains, improving predictive accuracy and interpretability. Method: Using baseline MRI scans and the Month 24 ADAS-Cog global score and its 13 sub-scores, the study proposes a weighted Vision Transformer (ViT)-based multi-task learning (MTL) framework. Result: The proposed weighting strategies are group-dependent: strong weighting improves performance for MCI subjects, while moderate weighting is more effective for CN subjects. Uniform weighting underutilizes key sub-scores and limits generalization. Conclusion: The paper proposes a weighted ViT-based multi-task learning framework for predicting the ADAS-Cog global score and its sub-scores; the results indicate that it improves predictive accuracy and interpretability and offers a flexible, interpretable approach to Alzheimer's disease prognosis. Abstract: Prognostic modeling is essential for forecasting future clinical scores and enabling early detection of Alzheimer's disease (AD). While most existing methods focus on predicting the ADAS-Cog global score, they often overlook the predictive value of its 13 sub-scores, which reflect distinct cognitive domains. Some sub-scores may exert greater influence on determining global scores. Assigning higher loss weights to these clinically meaningful sub-scores can guide the model to focus on more relevant cognitive domains, enhancing both predictive accuracy and interpretability. In this study, we propose a weighted Vision Transformer (ViT)-based multi-task learning (MTL) framework to jointly predict the ADAS-Cog global score and its 13 sub-scores at Month 24 using baseline MRI scans. Our framework integrates ViT as a feature extractor and systematically investigates the impact of sub-score-specific loss weighting on model performance. Results show that our proposed weighting strategies are group-dependent: strong weighting improves performance for MCI subjects with more heterogeneous MRI patterns, while moderate weighting is more effective for CN subjects with lower variability. Our findings suggest that uniform weighting underutilizes key sub-scores and limits generalization. The proposed framework offers a flexible, interpretable approach to AD prognosis using end-to-end MRI-based learning. (Github repo link will be provided after review)
[237] JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on
Aowen Wang,Wei Li,Hao Luo,Mengxing Ao,Chenyu Zhu,Xinyang Li,Fan Wang
Main category: cs.CV
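TL;DR: This paper proposes JCo-MVTON, a new mask-free virtual try-on framework that combines diffusion models with multi-modal conditional fusion to achieve more precise garment-person integration; it generalizes strongly to real-world scenarios and surpasses existing methods and commercial systems.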
Details
Motivation: Virtual try-on systems have long been limited by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world scenarios. Method: The paper proposes JCo-MVTON, a new mask-free virtual try-on framework that combines diffusion-based image generation with multi-modal conditional fusion. Built on a Multi-Modal Diffusion Transformer (MM-DiT) backbone, it fuses diverse control signals, such as the reference person image and the target garment image, into the denoising process through dedicated conditional pathways. Result: JCo-MVTON achieves SOTA results on public benchmark datasets and generalizes strongly to real-world scenarios, surpassing existing commercial systems. In addition, a synthetic dataset built with a bidirectional generation strategy effectively mitigates data scarcity and quality issues. Conclusion: JCo-MVTON achieves SOTA performance on public benchmarks (including DressCode), significantly outperforming existing methods in both quantitative metrics and human evaluations, and shows good prospects for application in commercial systems. Abstract: Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals, such as the reference person image and the target garment image, into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. 
Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.
[238] Improving Interpretability in Alzheimer's Prediction via Joint Learning of ADAS-Cog Scores
Nur Amirah Abd Hamid,Mohd Shahrizal Rusli,Muhammad Thaqif Iman Mohd Taufek,Mohd Ibrahim Shapiai,Daphne Teck Ching Lai
Main category: cs.CV
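TL;DR: This study jointly predicts the ADAS-Cog global score and its sub-scores with a multi-task learning framework, finds that the sub-scores contribute substantially to global score prediction, and points out that the model needs better multimodal fusion and loss weighting.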
Details
Motivation: Existing methods mainly focus on predicting the ADAS-Cog global score while overlooking the predictive value of its sub-scores, which capture domain-specific cognitive decline. Method: Uses Vision Transformer (ViT) and Swin Transformer architectures to extract imaging features, which are fused with longitudinal clinical inputs to model cognitive progression. Result: Incorporating sub-score learning improves the accuracy of global score prediction. Q1 (Word Recall), Q4 (Delayed Recall), and Q8 (Word Recognition) have a significant influence on the predicted global score, but some sub-scores show large prediction errors, indicating model instability. Conclusion: The study proposes a multi-task learning framework for jointly predicting the ADAS-Cog global score and its sub-scores, and emphasizes the importance of sub-scores for early detection and prognosis of Alzheimer's disease. Abstract: Accurate prediction of clinical scores is critical for early detection and prognosis of Alzheimer's disease (AD). While existing approaches primarily focus on forecasting the ADAS-Cog global score, they often overlook the predictive value of its sub-scores (13 items), which capture domain-specific cognitive decline. In this study, we propose a multi-task learning (MTL) framework that jointly predicts the global ADAS-Cog score and its sub-scores (13 items) at Month 24 using baseline MRI and longitudinal clinical scores from baseline and Month 6. The main goal is to examine how each sub-score, particularly those associated with MRI features, contributes to the prediction of the global score, an aspect largely neglected in prior MTL studies. We employ Vision Transformer (ViT) and Swin Transformer architectures to extract imaging features, which are fused with longitudinal clinical inputs to model cognitive progression. Our results show that incorporating sub-score learning improves global score prediction. Sub-score-level analysis reveals that a small subset, especially Q1 (Word Recall), Q4 (Delayed Recall), and Q8 (Word Recognition), consistently dominates the predicted global score. However, some of these influential sub-scores exhibit high prediction errors, pointing to model instability. Further analysis suggests that this is caused by clinical feature dominance, where the model prioritizes easily predictable clinical scores over more complex MRI-derived features. These findings emphasize the need for improved multimodal fusion and adaptive loss weighting to achieve more balanced learning. 
Our study demonstrates the value of sub-score-informed modeling and provides insights into building more interpretable and clinically robust AD prediction frameworks. (GitHub repo provided)
[239] Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes
Ryan Faulkner,Ian Reid,Simon Ratcliffe,Tat-Jun Chin
Main category: cs.CV
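TL;DR: This paper proposes a new open-set segmentation method based on the Mamba architecture for large-scale LiDAR point cloud data, with strong performance.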
Details
Motivation: In outdoor LiDAR scanning applications, outlier objects from outside the training data inevitably appear, so an effective open-set segmentation method is needed. Method: Combines findings from object defect-detection research with the Mamba architecture's strengths in long-range dependencies and scalability to large data to design a reconstruction-based open-set segmentation method. Result: The approach improves performance not only for the authors' own open-set segmentation method but also for existing methods, and the Mamba-based architecture is competitive with voxel-convolution-based methods on large-scale point clouds. Conclusion: The paper proposes a new Mamba-based approach to open-set segmentation of outdoor scenes and demonstrates its effectiveness and competitiveness on large-scale point cloud data. Abstract: LiDAR scanning in outdoor scenes acquires accurate distance measurements over wide areas, producing large-scale point clouds. Application examples for this data include robotics, automotive vehicles, and land surveillance. During such applications, outlier objects from outside the training data will inevitably appear. Our research contributes a novel approach to open-set segmentation, leveraging the learnings of object defect-detection research. We also draw on the Mamba architecture's strong performance in utilising long-range dependencies and scalability to large data. Combining both, we create a reconstruction-based approach for the task of outdoor scene open-set segmentation. We show that our approach improves performance not only when applied to our own open-set segmentation method, but also when applied to existing methods. Furthermore, we contribute a Mamba-based architecture which is competitive with existing voxel-convolution-based methods on challenging, large-scale pointclouds.
[240] Wound3DAssist: A Practical Framework for 3D Wound Assessment
Remi Chierchia,Rodrigo Santa Cruz,Léo Lebrat,Yulia Arzhaeva,Mohammad Ali Armin,Jeremy Oorloff,Chuong Nguyen,Olivier Salvado,Clinton Fookes,David Ahmedt-Aristizabal
Main category: cs.CV
TL;DR: Wound3DAssist is a 3D wound assessment framework based on monocular video that enables efficient, accurate, and automated wound measurement and analysis.
Details
Motivation: Traditional clinical assessment is subjective and time-consuming, and 2D digital videometry methods suffer from perspective distortion, a limited field of view, and an inability to capture wound depth. Method: Uses monocular consumer-grade video for 3D reconstruction, wound segmentation, tissue classification, and periwound analysis, organized as a modular workflow. Result: Supports high-quality wound bed visualization, millimeter-level accuracy, and reliable tissue composition analysis, with full assessments completed in under 20 minutes. Conclusion: Wound3DAssist is a practical 3D wound assessment framework that enables non-contact, automated measurement and is suitable for real-world clinical use. Abstract: Managing chronic wounds remains a major healthcare challenge, with clinical assessment often relying on subjective and time-consuming manual documentation methods. Although 2D digital videometry frameworks aided the measurement process, these approaches struggle with perspective distortion, a limited field of view, and an inability to capture wound depth, especially in anatomically complex or curved regions. To overcome these limitations, we present Wound3DAssist, a practical framework for 3D wound assessment using monocular consumer-grade videos. Our framework generates accurate 3D models from short handheld smartphone video recordings, enabling non-contact, automatic measurements that are view-independent and robust to camera motion. We integrate 3D reconstruction, wound segmentation, tissue classification, and periwound analysis into a modular workflow. We evaluate Wound3DAssist across digital models with known geometry, silicone phantoms, and real patients. Results show that the framework supports high-quality wound bed visualization, millimeter-level accuracy, and reliable tissue composition analysis. Full assessments are completed in under 20 minutes, demonstrating feasibility for real-world clinical use.
[241] Few-Shot Pattern Detection via Template Matching and Regression
Eunchan Jo,Dahyun Kang,Sanghyun Kim,Yunseon Choi,Minsu Cho
Main category: cs.CV
TL;DR: This paper introduces TMR, a novel few-shot pattern detection method based on template matching and regression, which outperforms existing approaches and generalizes well across datasets.
Details
Motivation: Existing few-shot object counting and detection methods focus on object categories and struggle with non-object pattern localization, necessitating a more effective and generalizable approach. Method: TMR utilizes template matching and regression with a minimalistic structure, preserving spatial layout through a few learnable convolutional or projection layers on a frozen backbone. Result: The method achieves superior performance on three benchmarks (RPINE, FSCD-147, and FSCD-LVIS) and shows strong cross-dataset generalization. Conclusion: The proposed TMR method outperforms state-of-the-art approaches on multiple benchmarks and demonstrates strong generalization across datasets. Abstract: We address the problem of few-shot pattern detection, which aims to detect all instances of a given pattern, typically represented by a few exemplars, from an input image. Although similar problems have been studied in few-shot object counting and detection (FSCD), previous methods and their benchmarks have narrowed patterns of interest to object categories and often fail to localize non-object patterns. In this work, we propose a simple yet effective detector based on template matching and regression, dubbed TMR. While previous FSCD methods typically represent target exemplars as spatially collapsed prototypes and lose structural information, we revisit classic template matching and regression. It effectively preserves and leverages the spatial layout of exemplars through a minimalistic structure with a small number of learnable convolutional or projection layers on top of a frozen backbone. We also introduce a new dataset, dubbed RPINE, which covers a wider range of patterns than existing object-centric datasets. 
Our method outperforms the state-of-the-art methods on the three benchmarks, RPINE, FSCD-147, and FSCD-LVIS, and demonstrates strong generalization in cross-dataset evaluation.
[242] HyTver: A Novel Loss Function for Longitudinal Multiple Sclerosis Lesion Segmentation
Dayan Perera,Ting Fung Fung,Vishnu Monn
Main category: cs.CV
TL;DR: The paper introduces HyTver, a novel hybrid loss function for longitudinal multiple sclerosis lesion segmentation that effectively balances segmentation performance and other metric requirements while being computationally efficient.
Details
Motivation: Longitudinal multiple sclerosis lesion segmentation presents significant data and output imbalance, which impacts model performance. Traditional loss functions like Dice or Cross-Entropy are often used naively, but they may not adequately address these imbalances, necessitating the development of more effective and balanced loss functions. Method: A novel hybrid loss function called HyTver was proposed, and its performance was evaluated by comparing it with other popular loss functions on a pre-trained model, focusing on both segmentation accuracy and other metrics like distance-based ones. Result: The HyTver loss achieved a Dice score of 0.659 while ensuring that distance-based metrics remained comparable to those of other popular loss functions. It also demonstrated stability when applied to pre-trained models. Conclusion: The proposed HyTver loss function effectively addresses data and segmentation imbalance in longitudinal multiple sclerosis lesion segmentation, achieving good segmentation performance while maintaining performance in other metrics. Abstract: Longitudinal Multiple Sclerosis Lesion Segmentation is a particularly challenging problem that involves both input and output imbalance in the data and segmentation. Therefore in order to develop models that are practical, one of the solutions is to develop better loss functions. Most models naively use either Dice loss or Cross-Entropy loss or their combination without too much consideration. However, one must select an appropriate loss function as the imbalance can be mitigated by selecting a proper loss function. In order to solve the imbalance problem, multiple loss functions were proposed that claimed to solve it. They come with problems of their own which include being too computationally complex due to hyperparameters as exponents or having detrimental performance in metrics other than region-based ones. 
We propose a novel hybrid loss called HyTver that achieves good segmentation performance while maintaining performance in other metrics. We achieve a Dice score of 0.659 while also ensuring that the distance-based metrics are comparable to other popular functions. In addition, we evaluate the stability of the loss functions when used on a pre-trained model and perform extensive comparisons with other popular loss functions.
[243] FloraSyntropy-Net: Scalable Deep Learning with Novel FloraSyntropy Archive for Large-Scale Plant Disease Diagnosis
Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
Main category: cs.CV
TL;DR: This paper introduces FloraSyntropy-Net, a new AI framework for plant disease diagnosis with high accuracy and generalization across diverse species.
Details
Motivation: Early diagnosis of plant diseases is critical for food safety, but existing AI models lack generalization across diverse plant species. Method: FloraSyntropy-Net uses a federated learning framework with a Memetic Algorithm for model selection, a novel Deep Block for feature representation, and a client-cloning strategy for training. Result: FloraSyntropy-Net achieved 96.38% accuracy on the FloraSyntropy benchmark and 99.84% accuracy on the Pest dataset. Conclusion: FloraSyntropy-Net provides a robust and highly generalizable framework that advances agricultural AI applications. Abstract: Early diagnosis of plant diseases is critical for global food safety, yet most AI solutions lack the generalization required for real-world agricultural diversity. These models are typically constrained to specific species, failing to perform accurately across the broad spectrum of cultivated plants. To address this gap, we first introduce the FloraSyntropy Archive, a large-scale dataset of 178,922 images across 35 plant species, annotated with 97 distinct disease classes. We establish a benchmark by evaluating numerous existing models on this archive, revealing a significant performance gap. We then propose FloraSyntropy-Net, a novel federated learning framework (FL) that integrates a Memetic Algorithm (MAO) for optimal base model selection (DenseNet201), a novel Deep Block for enhanced feature representation, and a client-cloning strategy for scalable, privacy-preserving training. FloraSyntropy-Net achieves a state-of-the-art accuracy of 96.38% on the FloraSyntropy benchmark. Crucially, to validate its generalization capability, we test the model on the unrelated multiclass Pest dataset, where it demonstrates exceptional adaptability, achieving 99.84% accuracy. 
This work provides not only a valuable new resource but also a robust and highly generalizable framework that advances the field towards practical, large-scale agricultural AI applications.
[244] Rethinking the Detail-Preserved Completion of Complex Tubular Structures based on Point Cloud: a Dataset and a Benchmark
Yaolei Qi,Yikai Yang,Wenbo Peng,Shumei Miao,Yutao Hu,Guanyu Yang
Main category: cs.CV
TL;DR: This study proposes TSRNet for tubular structure completion, particularly addressing structural discontinuities in coronary arteries, achieving superior performance on multiple datasets.
Details
Motivation: Existing segmentation algorithms face challenges with structural discontinuities in complex tubular structures like coronary arteries, affecting diagnostic accuracy. Method: The study introduces TSRNet, which combines a detail-preserving feature extractor, a multiple dense refinement strategy, and a global-to-local loss function for accurate tubular structure reconnection. Result: Experiments on PC-CAC and two public datasets (PC-ImageCAS and PC-PTR) show that TSRNet consistently surpasses existing methods across multiple evaluation metrics. Conclusion: The proposed TSRNet outperforms state-of-the-art methods in tubular structure completion, particularly on the newly introduced PC-CAC dataset derived from real clinical data. Abstract: Complex tubular structures are essential in medical imaging and computer-assisted diagnosis, where their integrity enhances anatomical visualization and lesion detection. However, existing segmentation algorithms struggle with structural discontinuities, particularly in severe clinical cases such as coronary artery stenosis and vessel occlusions, which leads to undesired discontinuity and compromises downstream diagnostic accuracy. Therefore, it is imperative to reconnect discontinuous structures to ensure their completeness. In this study, we explore tubular structure completion based on point clouds for the first time and establish a Point Cloud-based Coronary Artery Completion (PC-CAC) dataset, which is derived from real clinical data. This dataset provides a novel benchmark for tubular structure completion. Additionally, we propose TSRNet, a Tubular Structure Reconnection Network that integrates a detail-preserving feature extractor, a multiple dense refinement strategy, and a global-to-local loss function to ensure accurate reconnection while maintaining structural integrity. 
Comprehensive experiments on our PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate that our method consistently outperforms state-of-the-art approaches across multiple evaluation metrics, setting a new benchmark for point cloud-based tubular structure reconstruction. Our benchmark is available at https://github.com/YaoleiQi/PCCAC.
[245] M^3-GloDets: Multi-Region and Multi-Scale Analysis of Fine-Grained Diseased Glomerular Detection
Tianyu Shi,Xinzi He,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Main category: cs.CV
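TL;DR: M^3-GloDet is a framework for evaluating detection models for diseased glomeruli; the study finds that intermediate patch sizes and moderate magnifications improve detection efficiency and generalization.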
Details
Motivation: Most current research focuses on normal glomeruli or cases of global sclerosis, while the broader range of diseased glomerular subtypes is understudied, and the morphological characteristics of these subtypes are often hard for computational models to capture. Method: Evaluates long-standing benchmark architectures and recently introduced state-of-the-art models, using an experimental design that reflects the diversity of region-of-interest sizes and imaging resolutions encountered in routine digital renal pathology. Result: Intermediate patch sizes offer the best balance between context and efficiency. In addition, moderate magnifications improve generalization by reducing overfitting. Conclusion: M^3-GloDet is a systematic framework for thoroughly evaluating detection models, with the aim of improving automated detection strategies and clinical workflows in digital pathology. Abstract: Accurate detection of diseased glomeruli is fundamental to progress in renal pathology and underpins the delivery of reliable clinical diagnoses. Although recent advances in computer vision have produced increasingly sophisticated detection algorithms, the majority of research efforts have focused on normal glomeruli or instances of global sclerosis, leaving the wider spectrum of diseased glomerular subtypes comparatively understudied. This disparity is not without consequence; the nuanced and highly variable morphological characteristics that define these disease variants frequently elude even the most advanced computational models. Moreover, ongoing debate surrounds the choice of optimal imaging magnifications and region-of-view dimensions for fine-grained glomerular analysis, adding further complexity to the pursuit of accurate classification and robust segmentation. To bridge these gaps, we present M^3-GloDet, a systematic framework designed to enable thorough evaluation of detection models across a broad continuum of regions, scales, and classes. Within this framework, we evaluate both long-standing benchmark architectures and recently introduced state-of-the-art models that have achieved notable performance, using an experimental design that reflects the diversity of region-of-interest sizes and imaging resolutions encountered in routine digital renal pathology. As a result, we found that intermediate patch sizes offered the best balance between context and efficiency. Additionally, moderate magnifications enhanced generalization by reducing overfitting. 
Through systematic comparison of these approaches on a multi-class diseased glomerular dataset, our aim is to advance the understanding of model strengths and limitations, and to offer actionable insights for the refinement of automated detection strategies and clinical workflows in the digital pathology domain.
[246] Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection
Runhe Lai,Xinhua Lu,Kanghao Chen,Qichao Chen,Wei-Shi Zheng,Ruixuan Wang
Main category: cs.CV
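TL;DR: This paper proposes a novel medical OOD detection framework based on vision-language models that combines multi-scale visual fusion with a pseudo-sample generation strategy, improving the detection of unknown diseases.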
Details
Motivation: In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples, thereby reducing the risk of misdiagnosis. Method: Proposes a cross-scale visual fusion strategy and a cross-scale hard pseudo-OOD sample generation strategy, validated experimentally on three public medical datasets. Result: Experiments show that the proposed framework achieves better OOD detection performance than existing methods on all three public medical datasets. Conclusion: The study proposes a novel OOD detection framework based on vision-language models that improves detection of unknown diseases by integrating multi-scale visual information and generating hard pseudo-OOD samples. Abstract: In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples, thereby mitigating the risk of misdiagnosis. In this study, we propose a novel OOD detection framework based on vision-language models (VLMs), which integrates hierarchical visual information to cope with challenging unknown diseases that resemble known diseases. Specifically, a cross-scale visual fusion strategy is proposed to couple visual embeddings from multiple scales. This enriches the detailed representation of medical images and thus improves the discrimination of unknown diseases. Moreover, a cross-scale hard pseudo-OOD sample generation strategy is proposed to benefit OOD detection maximally. Experimental evaluations on three public medical datasets support that the proposed framework achieves superior OOD detection performance compared to existing methods. The source code is available at https://openi.pcl.ac.cn/OpenMedIA/HVL.
[247] Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Yogesh Kumar
Main category: cs.CV
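TL;DR: This paper proposes LGTTP, a new video processing method that uses temporal cues in queries to adaptively prune video tokens, reducing computation while preserving performance across multiple datasets and tasks.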
Details
Motivation: Vision Language Models (VLMs) face the quadratic complexity of attention mechanisms when processing long videos, and need an effective way to reduce computational overhead while maintaining performance. Method: Proposes Language-Guided Temporal Token Pruning (LGTTP), which uses temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Result: LGTTP achieves a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, HIT@1 improves by +9.5%; on Charades-STA, it retains 99.6% of R@1. Conclusion: LGTTP is a model-agnostic framework that reduces the computation of video processing while maintaining high performance, and performs well both on queries with explicit temporal markers and on general video understanding tasks. Abstract: Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
[248] Benchmarking Class Activation Map Methods for Explainable Brain Hemorrhage Classification on Hemorica Dataset
Z. Rafati,M. Hoseyni,J. Khoramdel,A. Nikoofard
Main category: cs.CV
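TL;DR: This study evaluates nine CAM algorithms for explainability in brain hemorrhage diagnosis and finds that AblationCAM performs best for pixel-level segmentation.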
Details
Motivation: Explainable Artificial Intelligence (XAI) has become important for increasing transparency and clinical trust in deep learning models in medical imaging research. Method: Extracts pixel-level segmentation and detection annotations from classification models using nine state-of-the-art CAM algorithms, evaluated across multiple network stages. Result: At stage 5 of EfficientNetV2S, HiResCAM gives the best bounding-box alignment, while AblationCAM achieves the best pixel-level Dice (0.57) and IoU (0.40). Conclusion: This is among the first studies to quantitatively compare CAM methods for brain hemorrhage detection, laying a foundation for XAI-driven clinical diagnostic tools. Abstract: Explainable Artificial Intelligence (XAI) has become an essential component of medical imaging research, aiming to increase transparency and clinical trust in deep learning models. This study investigates brain hemorrhage diagnosis with a focus on explainability through Class Activation Mapping (CAM) techniques. A pipeline was developed to extract pixel-level segmentation and detection annotations from classification models using nine state-of-the-art CAM algorithms, applied across multiple network stages, and quantitatively evaluated on the Hemorica dataset, which uniquely provides both slice-level labels and high-quality segmentation masks. Metrics including Dice, IoU, and pixel-wise overlap were employed to benchmark CAM variants. Results show that the strongest localization performance occurred at stage 5 of EfficientNetV2S, with HiResCAM yielding the highest bounding-box alignment and AblationCAM achieving the best pixel-level Dice (0.57) and IoU (0.40), representing strong accuracy given that models were trained solely for classification without segmentation supervision. To the best of current knowledge, this is among the first works to quantitatively compare CAM methods for brain hemorrhage detection, establishing a reproducible benchmark and underscoring the potential of XAI-driven pipelines for clinically meaningful AI-assisted diagnosis.
[249] CATformer: Contrastive Adversarial Transformer for Image Super-Resolution
Qinyi Tian,Spence Cox,Laura E. Dalton
Main category: cs.CV
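TL;DR: This paper proposes CATformer, which combines a diffusion-inspired transformer with adversarial and contrastive learning and achieves strong performance on image super-resolution.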
Details
Motivation: To improve the quality of low-resolution images while combining the strengths of diffusion models, adversarial learning, and contrastive learning to enhance robustness and reconstruction quality. Method: CATformer uses a dual-branch architecture that combines a primary diffusion-inspired transformer with an auxiliary transformer branch, and uses deep Residual-in-Residual Dense Blocks for reconstruction. Result: Experiments show that CATformer outperforms recent transformer-based and diffusion-inspired methods in both efficiency and visual image quality. Conclusion: CATformer bridges the performance gap among transformer-, diffusion-, and GAN-based methods, laying a foundation for practical applications of diffusion-inspired transformers in super-resolution. Abstract: Super-resolution remains a promising technique to enhance the quality of low-resolution images. This study introduces CATformer (Contrastive Adversarial Transformer), a novel neural network integrating diffusion-inspired feature refinement with adversarial and contrastive learning. CATformer employs a dual-branch architecture combining a primary diffusion-inspired transformer, which progressively refines latent representations, with an auxiliary transformer branch designed to enhance robustness to noise through learned latent contrasts. These complementary representations are fused and decoded using deep Residual-in-Residual Dense Blocks for enhanced reconstruction quality. Extensive experiments on benchmark datasets demonstrate that CATformer outperforms recent transformer-based and diffusion-inspired methods both in efficiency and visual image quality. This work bridges the performance gap among transformer-, diffusion-, and GAN-based methods, laying a foundation for practical applications of diffusion-inspired transformers in super-resolution.
[250] NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction
Soham Dasgupta,Shanthika Naik,Preet Savalia,Sujay Kumar Ingle,Avinash Sharma
Main category: cs.CV
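TL;DR: This paper proposes NGD, which uses neural gradient-based deformation and dynamic texture maps to reconstruct dynamic garments from monocular video with high quality.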
Details
Motivation: Existing implicit representation methods struggle to model high-frequency details, while template-based reconstruction methods suffer from deformation artifacts. Method: Proposes a Neural Gradient-based Deformation method (NGD) and a method for learning dynamic texture maps, combined with an adaptive remeshing strategy. Result: Extensive qualitative and quantitative evaluations demonstrate significant improvements by NGD in dynamic garment reconstruction. Conclusion: NGD outperforms existing state-of-the-art methods in dynamic garment reconstruction, providing high-quality geometry and texture reconstruction. Abstract: Dynamic garment reconstruction from monocular video is an important yet challenging task due to the complex dynamics and unconstrained nature of the garments. Recent advancements in neural rendering have enabled high-quality geometric reconstruction with image/video supervision. However, implicit representation methods that use volume rendering often provide smooth geometry and fail to model high-frequency details. While template reconstruction methods model explicit geometry, they use vertex displacement for deformation, which results in artifacts. Addressing these limitations, we propose NGD, a Neural Gradient-based Deformation method to reconstruct dynamically evolving textured garments from monocular videos. Additionally, we propose a novel adaptive remeshing strategy for modelling dynamically evolving surfaces like wrinkles and pleats of the skirt, leading to high-quality reconstruction. Finally, we learn dynamic texture maps to capture per-frame lighting and shadow effects. We provide extensive qualitative and quantitative evaluations to demonstrate significant improvements over existing SOTA methods and provide high-quality garment reconstructions.
[251] F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model
Hanbo Bi,Zhiqiang Yuan,Zexi Jia,Jiapei Zhang,Chongyang Li,Peixiang Luo,Ying Deng,Xiaoyue Duan,Jinchao Zhang
Main category: cs.CV
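TL;DR: This paper proposes F2RVLM, a generative retrieval model for retrieving relevant fragments from long dialogues; through supervised fine-tuning and reinforcement learning combined with difficulty-aware sampling, it outperforms existing methods.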
Details
Motivation: Traditional dialogue retrieval cannot meet users' actual needs for revisiting semantically coherent content scattered across long conversations, so the Fine-grained Fragment Retrieval (FFR) task is proposed. Method: Proposes F2RVLM, a generative retrieval model trained in two stages: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. In addition, a difficulty-aware curriculum sampling method is introduced. Result: Constructs MLDR, the longest-turn multimodal dialogue retrieval dataset to date, along with a WeChat-based real-world test set. F2RVLM outperforms existing models in retrieval performance, particularly on complex multi-turn long dialogues. Conclusion: F2RVLM effectively addresses the retrieval of semantically coherent fragments from multimodal long-form dialogues and outperforms popular VLMs in both in-domain and real-domain settings. Abstract: Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. 
To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
[252] Instant Preference Alignment for Text-to-Image Diffusion Models
Yang Li,Songlin Yang,Xiaoxuan Han,Wei Wang,Jing Dong,Yueming Lyu,Ziyu Xue
Main category: cs.CV
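TL;DR: This paper proposes a real-time, training-free text-to-image generation framework grounded in multimodal large language model priors that generates images aligned with user preferences.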
Details
Motivation: Text-to-image generation has greatly enhanced creative expression, but achieving preference-aligned generation in a real-time, training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. Method: The framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, it uses multimodal large language models to automatically extract global preference signals from a reference image and enriches the given prompt using structured instruction design. For preference-guided generation, it integrates global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training. Result: Experiments show that the method outperforms prior approaches on the Viper dataset and the authors' collected benchmark, in both quantitative metrics and human evaluations. Conclusion: The paper proposes a training-free text-to-image generation framework, grounded in multimodal large language model priors, that generates images aligned with user preferences in real time. Abstract: Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing methods. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training, enabling precise alignment across both global attributes and local elements. The entire framework supports multi-round interactive refinement, facilitating real-time and context-aware image generation. 
Extensive experiments on the Viper dataset and our collected benchmark demonstrate that our method outperforms prior approaches in both quantitative metrics and human evaluations, and opens up new possibilities for dialog-based generation and MLLM-diffusion integration.
[253] Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework
Koichiro Kamide,Shunsuke Sakai,Shun Maeda,Chunzhi Gu,Chao Zhang
Main category: cs.CV
TL;DR: This paper proposes a unified few-shot framework for Human Action Anomaly Detection (HAAD) using contrastive learning and generative motion augmentation, achieving state-of-the-art results on the HumanAct12 dataset.
Details
Motivation: Current HAAD methods require separate training for each action category and a large amount of normal data, limiting their scalability and applicability in real-world situations with scarce or new categories. This work aims to overcome these limitations by proposing a unified, few-shot compatible framework. Method: The method utilizes contrastive learning to create a category-agnostic representation space and introduces a generative motion augmentation strategy based on a diffusion model to enhance inter-category generalization and intra-category robustness. Result: Extensive experiments on the HumanAct12 dataset demonstrate the effectiveness of the proposed approach, particularly in few-shot settings, with notable improvements in model scalability and training efficiency. Conclusion: The proposed framework for HAAD shows state-of-the-art effectiveness on the HumanAct12 dataset in both seen and unseen category settings, particularly excelling in training efficiency and scalability for few-shot scenarios. Abstract: Human Action Anomaly Detection (HAAD) aims to identify anomalous actions given only normal action data during training. Existing methods typically follow a one-model-per-category paradigm, requiring separate training for each action category and a large number of normal samples. These constraints hinder scalability and limit applicability in real-world scenarios, where data is often scarce or novel categories frequently appear. To address these limitations, we propose a unified framework for HAAD that is compatible with few-shot scenarios. Our method constructs a category-agnostic representation space via contrastive learning, enabling AD by comparing test samples with a given small set of normal examples (referred to as the support set). 
To improve inter-category generalization and intra-category robustness, we introduce a generative motion augmentation strategy harnessing a diffusion-based foundation model for creating diverse and realistic training samples. Notably, to the best of our knowledge, our work is the first to introduce such a strategy specifically tailored to enhance contrastive learning for action AD. Extensive experiments on the HumanAct12 dataset demonstrate the state-of-the-art effectiveness of our approach under both seen and unseen category settings, regarding training efficiency and model scalability for few-shot HAAD.
[254] Segmentation and Classification of Pap Smear Images for Cervical Cancer Detection Using Deep Learning
Nisreen Albzour,Sarah S. Lam
Main category: cs.CV
TL;DR: This study proposes a deep learning framework using U-Net for segmentation and a classification model to improve cervical cancer diagnosis. While segmentation slightly enhances performance, its overall impact is limited, but the framework can serve as a useful tool for pathologists.
Details
Motivation: Cervical cancer is a leading cause of cancer-related deaths among women globally, and early detection through Pap smear tests is crucial. However, manual analysis is time-consuming and susceptible to human error. This study aims to improve diagnostic accuracy and efficiency by developing a deep learning-based automated system. Method: The study employed a deep learning framework integrating U-Net for image segmentation and a classification model for diagnosis. The Herlev Pap Smear Dataset was used for training and evaluation. The effect of segmentation on classification performance was assessed by comparing models trained on segmented and non-segmented images. Result: Experimental results showed that using segmented images marginally improved model performance, with a 0.41% increase in precision and a 1.30% increase in F1-score compared to non-segmented images. This indicates a slightly more balanced classification performance, although the overall impact of segmentation was limited. Conclusion: The study concludes that while image segmentation using U-Net can slightly enhance the precision and F1-score of cervical cancer diagnosis, its overall impact on classification performance is limited. However, the proposed deep learning framework can serve as a useful supplemental tool for pathologists in early cervical cancer diagnosis. Abstract: Cervical cancer remains a significant global health concern and a leading cause of cancer-related deaths among women. Early detection through Pap smear tests is essential to reduce mortality rates; however, the manual examination is time consuming and prone to human error. This study proposes a deep learning framework that integrates U-Net for segmentation and a classification model to enhance diagnostic performance. The Herlev Pap Smear Dataset, a publicly available cervical cell dataset, was utilized for training and evaluation. 
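The precision and F1-score figures quoted in this entry follow the standard definitions, which a short sketch makes concrete. The function name and the confusion counts below are hypothetical and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one class, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

F1 is the harmonic mean of precision and recall, so a larger gain in F1 than in precision, as reported here, implies that recall improved as well.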
The impact of segmentation on classification performance was evaluated by comparing the model trained on segmented images and another trained on non-segmented images. Experimental results showed that the use of segmented images marginally improved the model performance on precision (about 0.41 percent higher) and F1-score (about 1.30 percent higher), which suggests a slightly more balanced classification performance. While segmentation helps in feature extraction, the results showed that its impact on classification performance appears to be limited. The proposed framework offers a supplemental tool for clinical applications, which may aid pathologists in early diagnosis.
[255] CMFDNet: Cross-Mamba and Feature Discovery Network for Polyp Segmentation
Feng Jiang,Zongfei Zhang,Xin Xu
Main category: cs.CV
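TL;DR: This paper proposes CMFDNet, a new architecture for colonic polyp segmentation whose CMD, MSA, and FD modules address blurry polyp boundaries, shape diversity, and missed small polyps; experiments show that it outperforms existing methods.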
Details
Motivation: Existing polyp segmentation methods are limited by significant variation in polyp shape and size, indistinct boundaries between polyps and adjacent tissue, and the tendency for small polyps to be overlooked, so a more effective segmentation method is needed. Method: The paper proposes the CMFDNet architecture, comprising a CMD module, an MSA module, and an FD module, which respectively reduce blurry boundaries, strengthen recognition of polyps with diverse geometries and scale distributions, and alleviate the under-detection of small-scale polyps. Result: Experiments show that CMFDNet outperforms six state-of-the-art methods, notably on the ETIS and ColonDB datasets, where its mDice scores exceed the best SOTA method by 1.83% and 1.55%, respectively. Conclusion: Through its novel module design, CMFDNet performs strongly on colonic polyp segmentation and handles polyps of varying scale and shape better than existing methods. Abstract: Automated colonic polyp segmentation is crucial for assisting doctors in screening of precancerous polyps and diagnosis of colorectal neoplasms. Although existing methods have achieved promising results, polyp segmentation remains hindered by the following limitations: (1) significant variation in polyp shapes and sizes, (2) indistinct boundaries between polyps and adjacent tissues, and (3) the tendency for small-sized polyps to be overlooked during the segmentation process. Driven by these practical difficulties, an innovative architecture, CMFDNet, is proposed with the CMD module, MSA module, and FD module. The CMD module, serving as an innovative decoder, introduces a cross-scanning method to reduce blurry boundaries. The MSA module adopts a multi-branch parallel structure to enhance the recognition ability for polyps with diverse geometries and scale distributions. The FD module establishes dependencies among all decoder features to alleviate the under-detection of polyps with small-scale features. Experimental results show that CMFDNet outperforms six SOTA methods used for comparison, especially on ETIS and ColonDB datasets, where mDice scores exceed the best SOTA method by 1.83% and 1.55%, respectively.
[256] DroneKey: Drone 3D Pose Estimation in Image Sequences using Gated Key-representation and Pose-adaptive Learning
Seo-Bin Hwang,Yeong-Jun Cho
Main category: cs.CV
TL;DR: DroneKey is a new framework for estimating the 3D pose of drones, combining a 2D keypoint detector and a 3D pose estimator to address challenges in drone keypoint detection.
Details
Motivation: Estimating the 3D pose of drones is crucial for anti-drone systems, but existing methods face challenges due to the visual similarity of drone propellers and diversity of poses. Method: The paper proposes DroneKey, which combines a 2D keypoint detector and a 3D pose estimator. Key-representations are extracted and combined using a gated sum, while a pose-adaptive Mahalanobis distance is introduced in the loss function. Result: DroneKey achieved an AP of 99.68% (OKS) in keypoint detection, outperforming existing methods. For 3D pose estimation, it achieved an MAE-angle of 10.62°, an RMSE of 0.221m, and an MAE-absolute of 0.076m. Conclusion: The paper concludes that DroneKey is an effective framework for estimating the 3D pose of drones, offering high accuracy and reliability in keypoint detection and pose estimation. Abstract: Estimating the 3D pose of a drone is important for anti-drone systems, but existing methods struggle with the unique challenges of drone keypoint detection. Drone propellers serve as keypoints but are difficult to detect due to their high visual similarity and diversity of poses. To address these challenges, we propose DroneKey, a framework that combines a 2D keypoint detector and a 3D pose estimator specifically designed for drones. In the keypoint detection stage, we extract two key-representations (intermediate and compact) from each transformer encoder layer and optimally combine them using a gated sum. We also introduce a pose-adaptive Mahalanobis distance in the loss function to ensure stable keypoint predictions across extreme poses. We built new datasets of drone 2D keypoints and 3D pose to train and evaluate our method, which have been publicly released. Experiments show that our method achieves an AP of 99.68% (OKS) in keypoint detection, outperforming existing methods. Ablation studies confirm that the pose-adaptive Mahalanobis loss function improves keypoint prediction stability and accuracy. 
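The two ingredients named above, a gated sum and a Mahalanobis distance, have simple generic forms, sketched here. This is not DroneKey's implementation: the abstract does not specify how the gate is learned or how the covariance adapts to pose, so the scalar `gate_logit` and the fixed `cov` matrix are placeholder assumptions.

```python
import numpy as np


def gated_sum(a, b, gate_logit):
    """Blend two representations with a sigmoid gate g in (0, 1): g*a + (1-g)*b.

    In a trained model the gate would be a learned quantity; here it is a
    plain number supplied by the caller (assumption).
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return g * a + (1.0 - g) * b


def mahalanobis(x, mean, cov):
    """Mahalanobis distance sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = x - mean
    # Solve cov @ z = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


blended = gated_sum(np.array([2.0]), np.array([0.0]), gate_logit=0.0)
dist = mahalanobis(np.array([2.0, 0.0]), np.zeros(2), np.diag([4.0, 1.0]))
```

Unlike a Euclidean loss, the Mahalanobis form scales each error direction by the covariance, so errors along a high-variance direction are penalized less; a pose-dependent covariance would presumably let that scaling change with the drone's pose.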
Additionally, improvements in the encoder design enable real-time processing at 44 FPS. For 3D pose estimation, our method achieved an MAE-angle of 10.62°, an RMSE of 0.221m, and an MAE-absolute of 0.076m, demonstrating high accuracy and reliability. The code and dataset are available at https://github.com/kkanuseobin/DroneKey.
[257] From Global to Local: Social Bias Transfer in CLIP
Ryan Ramos,Yusuke Hirota,Yuta Nakashima,Noa Garcia
Main category: cs.CV
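TL;DR: This paper studies how biases in CLIP models propagate to downstream tasks, finding that bias measurements depend on the data subset and that bias-transfer trends are inconsistent, which it attributes to the convergence of representation spaces during task adaptation.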
Details
Motivation: CLIP models are used as backbones for many downstream tasks, but they can reproduce social biases and human stereotypes, so the effects of this propagation need to be analyzed. Method: The paper investigates how biases in pre-trained models propagate to downstream tasks through a comprehensive empirical analysis. Result: Bias measurement is highly dependent on the data subset on which it is computed, and consistent trends between pre-training bias and downstream-task bias are hard to find, because the representation spaces of different pre-trained CLIPs tend to converge when adapted for downstream tasks. Conclusion: The authors hope that this empirical analysis of bias transfer in CLIP models will offer valuable insights for future research on bias mitigation. Abstract: The recycling of contrastive language-image pre-trained (CLIP) models as backbones for a large number of downstream tasks calls for a thorough analysis of their transferability implications, especially their well-documented reproduction of social biases and human stereotypes. How do such biases, learned during pre-training, propagate to downstream applications like visual question answering or image captioning? Do they transfer at all? We investigate this phenomenon, referred to as bias transfer in prior literature, through a comprehensive empirical analysis. Firstly, we examine how pre-training bias varies between global and local views of data, finding that bias measurement is highly dependent on the subset of data on which it is computed. Secondly, we analyze correlations between biases in the pre-trained models and the downstream tasks across varying levels of pre-training bias, finding difficulty in discovering consistent trends in bias transfer. Finally, we explore why this inconsistency occurs, showing that under the current paradigm, representation spaces of different pre-trained CLIPs tend to converge when adapted for downstream tasks. We hope this work offers valuable insights into bias behavior and informs future research to promote better bias mitigation practices.
[258] Robust Anomaly Detection in Industrial Environments via Meta-Learning
Muhammad Aqeel,Shakiba Sharifi,Marco Cristani,Francesco Setti
Main category: cs.CV
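TL;DR: This paper proposes RAD, an industrial anomaly detection framework that combines Normalizing Flows with meta-learning to handle label noise in training data, and shows strong detection performance and robustness on multiple datasets.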
Details
Motivation: Conventional anomaly detection methods face significant challenges when training data contains mislabeled samples, which are common in real-world scenarios. Method: RAD combines Normalizing Flows with Model-Agnostic Meta-Learning in a bi-level optimization strategy: meta-learning enables rapid adaptation to varying noise conditions, uncertainty quantification guides adaptive L2 regularization to maintain model stability, and pretrained feature extractors provide multiscale feature processing. Result: Comprehensive evaluation on the MVTec-AD and KSDD2 datasets shows I-AUROC scores of 95.4% and 94.6%, respectively, under clean conditions, with detection capability remaining above 86.8% and 92.1% even when 50% of training samples are mislabeled. Conclusion: RAD shows strong robustness to label noise in industrial anomaly detection and can detect subtle anomalies, making it suitable for real-world settings where perfect data curation is difficult. Abstract: Anomaly detection is fundamental for ensuring quality control and operational efficiency in industrial environments, yet conventional approaches face significant challenges when training data contains mislabeled samples, a common occurrence in real-world scenarios. This paper presents RAD, a robust anomaly detection framework that integrates Normalizing Flows with Model-Agnostic Meta-Learning to address the critical challenge of label noise in industrial settings. Our approach employs a bi-level optimization strategy where meta-learning enables rapid adaptation to varying noise conditions, while uncertainty quantification guides adaptive L2 regularization to maintain model stability. The framework incorporates multiscale feature processing through pretrained feature extractors and leverages the precise likelihood estimation capabilities of Normalizing Flows for robust anomaly scoring. Comprehensive evaluation on the MVTec-AD and KSDD2 datasets demonstrates superior performance, achieving I-AUROC scores of 95.4% and 94.6% respectively under clean conditions, while maintaining robust detection capabilities above 86.8% and 92.1% even when 50% of training samples are mislabeled. The results highlight RAD's exceptional resilience to noisy training conditions and its ability to detect subtle anomalies across diverse industrial scenarios, making it a practical solution for real-world anomaly detection applications where perfect data curation is challenging.
[259] Sketchpose: Learning to Segment Cells with Partial Annotations
Clément Cazorla,Nathanaël Munier,Renaud Morin,Pierre Weiss
Main category: cs.CV
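TL;DR: This paper proposes a new cell segmentation method that learns effectively from partially annotated data, reducing annotation effort and resource consumption.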
Details
Motivation: Existing cell segmentation networks depend on fully annotated datasets, which limits the generation of training sets and the use of transfer learning. Method: The method is based on distance-map prediction and handles partially annotated objects; it is evaluated for frugal learning, transfer learning, and regular learning on regular databases. Result: Experiments show that the method yields substantial savings in time and resources without sacrificing segmentation quality; it is provided as a user-friendly Napari plugin. Conclusion: The paper proposes a new cell segmentation method that maintains high accuracy with only partially annotated data while reducing time and resource costs. Abstract: The most popular networks used for cell segmentation (e.g., Cellpose, Stardist, HoverNet, ...) rely on a prediction of a distance map. This yields unprecedented accuracy but hinges on fully annotated datasets, which is a serious limitation for generating training sets and performing transfer learning. In this paper, we propose a method that still relies on the distance map and handles partially annotated objects. We evaluate the performance of the proposed approach in the contexts of frugal learning, transfer learning and regular learning on regular databases. Our experiments show that it can lead to substantial savings in time and resources without sacrificing segmentation quality. The proposed algorithm is embedded in a user-friendly Napari plugin.
[260] PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models
Kai Zhao,Wubang Yuan,Alex Lingyu Hung,Dan Zeng
Main category: cs.CV
TL;DR: This paper introduces Position-reweighted Visual Token Pruning, a simple and effective method to address recency bias in visual token pruning for Vision-Language Models, improving efficiency without altering the model or requiring extra training.
Details
Motivation: The motivation stems from the inefficiency in Vision-Language Models (VLMs) due to redundant visual tokens and the recency bias that causes suboptimal pruning, disproportionately retaining tokens from the bottom of images. Method: The paper proposes a reweighting mechanism that adjusts attention scores of visual tokens based on their spatial positions in the image. This method is plug-and-play and does not require changes to the model architecture or additional training. Result: Extensive experiments on Large Vision-Language Models (LVLMs) demonstrate that the proposed method enhances the performance of visual token pruning while incurring minimal computational overhead. Conclusion: The paper concludes that the proposed Position-reweighted Visual Token Pruning method effectively alleviates recency bias in visual token pruning, improving the performance with minimal computational overhead. Abstract: Vision-Language Models (VLMs) typically process a significantly larger number of visual tokens compared to text tokens due to the inherent redundancy in visual signals. Visual token pruning is a promising direction to reduce the computational cost of VLMs by eliminating redundant visual tokens. The text-visual attention score is a widely adopted criterion for visual token pruning as it reflects the relevance of visual tokens to the text input. However, many sequence models exhibit a recency bias, where tokens appearing later in the sequence exert a disproportionately large influence on the model's output. In VLMs, this bias manifests as inflated attention scores for tokens corresponding to the lower regions of the image, leading to suboptimal pruning that disproportionately retains tokens from the image bottom. In this paper, we present an extremely simple yet effective approach to alleviate the recency bias in visual token pruning. 
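To make the recency bias and a position-based correction concrete, the sketch below reweights text-to-visual attention scores by image row before selecting the top-k tokens. The linear row weight, the `strength` parameter, and the function name are hypothetical choices for illustration; the abstract does not give the actual weighting function.

```python
import numpy as np


def reweighted_topk(attn, grid_h, grid_w, k, strength=0.5):
    """Pick k visual tokens after down-weighting scores of lower image rows.

    attn: 1-D array of attention scores for grid_h * grid_w visual tokens in
    raster (row-major) order. The weight is 1.0 for the top row and falls
    linearly to 1 - strength at the bottom row (assumed form).
    """
    rows = np.arange(grid_h * grid_w) // grid_w
    weights = 1.0 - strength * rows / max(grid_h - 1, 1)
    adjusted = attn * weights
    keep = np.argsort(-adjusted, kind="stable")[:k]
    return np.sort(keep)


# Toy 2x2 grid where the bottom-right token has a slightly inflated score.
attn = np.array([0.30, 0.10, 0.10, 0.35])
kept_plain = reweighted_topk(attn, 2, 2, k=1, strength=0.0)
kept_reweighted = reweighted_topk(attn, 2, 2, k=1)
```

With `strength=0.0` the selection reduces to ordinary attention-score pruning, which is why a correction of this kind can sit in front of an existing pruning criterion without touching the model itself.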
We propose a straightforward reweighting mechanism that adjusts the attention scores of visual tokens according to their spatial positions in the image. Our method, termed Position-reweighted Visual Token Pruning, is a plug-and-play solution that can be seamlessly incorporated into existing visual token pruning frameworks without any changes to the model architecture or extra training. Extensive experiments on LVLMs demonstrate that our method improves the performance of visual token pruning with minimal computational overhead.
[261] UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization
Xingyu Ai,Shaoyu Wang,Zhiyuan Jia,Ao Xu,Hongming Shan,Jianhua Ma,Qiegen Liu
Main category: cs.CV
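TL;DR: This paper proposes UniSino, a foundation model for universal CT sinogram standardization that addresses image artifacts caused by undersampling and noise in CT imaging; it standardizes data directly in the projection domain and is more generalizable and robust than conventional methods.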
Details
Motivation: In CT imaging, factors such as undersampling and noise cause severe artifacts and degraded image quality; conventional methods rely on manually designed algorithms or fixed empirical parameters and generalize poorly across artifact types. Method: The paper proposes UniSino, a foundation model that standardizes data directly in the projection domain; its training framework incorporates the physical characteristics of sinograms to enhance generalization and achieves robust performance across multiple subtasks and four benchmark datasets. Result: Experiments show that UniSino achieves superior reconstruction quality in both single and mixed undersampling cases, demonstrating strong robustness and generalization in sinogram enhancement for CT imaging. Conclusion: UniSino is an effective method for universal CT sinogram standardization, with stronger generalization and robustness than conventional methods. Abstract: During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images and compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundational models that operate in the image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. Experimental results demonstrate that UniSino achieves superior reconstruction quality in both single and mixed undersampling cases, demonstrating exceptional robustness and generalization in sinogram enhancement for CT imaging. The code is available at: https://github.com/yqx7150/UniSino.
[262] TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration
Meiqi Gong,Hao Zhang,Xunpeng Yi,Linfeng Tang,Jiayi Ma
Main category: cs.CV
TL;DR: This paper proposes a novel video fusion framework that explicitly incorporates temporal modeling and visual-semantic collaboration to ensure visual fidelity, semantic accuracy, and temporal consistency, outperforming existing methods.
Details
Motivation: The motivation is to address the limitations of existing multi-modal fusion methods that neglect temporal dependencies in video fusion tasks, leading to inconsistent results across frames. Method: The method involves a visual-semantic interaction module with Dinov2 and VGG19 for distillation, a temporal cooperative module for weak information recovery, a temporal-enhanced mechanism for consistency, and a new temporal loss for optimization. Result: The result is a novel video fusion framework that effectively integrates temporal modeling and visual-semantic collaboration, validated through extensive experiments on public video datasets and supported by the release of the project code. Conclusion: The paper concludes that their proposed video fusion framework, which incorporates temporal modeling and visual-semantic collaboration, outperforms existing methods and ensures visual fidelity, semantic accuracy, and temporal consistency. Abstract: Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with Dinov2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both the visual and semantic representations. Second, we pioneer the integration of the video degradation enhancement task into the video fusion pipeline by constructing a temporal cooperative module, which leverages temporal dependencies to facilitate weak information recovery. 
Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative evaluation metrics tailored for video fusion, aimed at assessing the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets demonstrate the superiority of our method. Our code is released at https://github.com/Meiqi-Gong/TemCoCo.
[263] A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection
Muhammad Aqeel,Danijel Skocaj,Marco Cristani,Francesco Setti
Main category: cs.CV
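TL;DR: CoZAD is a zero-shot anomaly detection framework for industrial and medical settings that addresses data scarcity and annotation cost, achieving state-of-the-art performance by integrating soft confident learning, meta-learning, and contrastive learning.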
Details
Motivation: Industrial and medical anomaly detection faces challenges from data scarcity and high annotation costs, particularly in evolving manufacturing and healthcare settings. Method: The framework quantifies data uncertainty through IQR-based thresholding and model uncertainty through covariance-based regularization, uses contrastive learning to create a discriminative feature space, and applies a confidence-based weighting mechanism to preserve boundary information. Result: Comprehensive evaluation on 10 datasets spanning industrial and medical domains shows that CoZAD outperforms existing methods on 6 of 7 industrial benchmarks, with notable gains on texture-rich datasets (such as DTD-Synthetic and BTAD) and in pixel-level localization (such as on MVTec-AD). Conclusion: CoZAD is a novel zero-shot anomaly detection framework that combines soft confident learning, meta-learning, and contrastive feature representation without depending on vision-language alignment or model ensembles, making it valuable in resource-constrained environments. Abstract: Industrial and medical anomaly detection faces critical challenges from data scarcity and prohibitive annotation costs, particularly in evolving manufacturing and healthcare settings. To address this, we propose CoZAD, a novel zero-shot anomaly detection framework that integrates soft confident learning with meta-learning and contrastive feature representation. Unlike traditional confident learning that discards uncertain samples, our method assigns confidence-based weights to all training data, preserving boundary information while emphasizing prototypical normal patterns. The framework quantifies data uncertainty through IQR-based thresholding and model uncertainty via covariance-based regularization within a Model-Agnostic Meta-Learning framework. Contrastive learning creates discriminative feature spaces where normal patterns form compact clusters, enabling rapid domain adaptation. Comprehensive evaluation across 10 datasets spanning industrial and medical domains demonstrates state-of-the-art performance, outperforming existing methods on 6 out of 7 industrial benchmarks with notable improvements on texture-rich datasets (99.2% I-AUROC on DTD-Synthetic, 97.2% on BTAD) and pixel-level localization (96.3% P-AUROC on MVTec-AD). The framework eliminates dependence on vision-language alignments or model ensembles, making it valuable for resource-constrained environments requiring rapid deployment.
[264] HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation
Xiping Wang,Yuxi Wang,Mengqi Zhou,Junsong Fan,Zhaoxiang Zhang
Main category: cs.CV
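TL;DR: This paper proposes a new hierarchical layout generation method for producing more realistic 3D indoor scenes and validates its effectiveness experimentally.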
Details
Motivation: Existing 3D indoor scene generation methods fall short in fine-grained object placement, which limits the realism and utility of generated environments. Method: The method takes a coarse-to-fine hierarchical approach, building a hierarchical layout through vertical and horizontal decoupling and using a trainable layout optimization network to resolve placement issues. Result: The method performs well at generating realistic indoor scenes, showing its potential for applications that require detailed 3D environments. Conclusion: The paper proposes a new hierarchical layout generation method for fine-grained 3D indoor scene generation and shows experimentally that it outperforms existing methods at generating realistic indoor scenes. Abstract: Realistic 3D indoor scene generation is crucial for virtual reality, interior design, embodied intelligence, and scene understanding. While existing methods have made progress in coarse-scale furniture arrangement, they struggle to capture fine-grained object placements, limiting the realism and utility of generated environments. This gap hinders immersive virtual experiences and detailed scene comprehension for embodied AI applications. To address these issues, we propose Hierarchical Layout Generation (HLG), a novel method for fine-grained 3D scene generation. HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements. Specifically, our fine-grained layout alignment module constructs a hierarchical layout through vertical and horizontal decoupling, effectively decomposing complex 3D indoor scenes into multiple levels of granularity. Additionally, our trainable layout optimization network addresses placement issues, such as incorrect positioning, orientation errors, and object intersections, ensuring structurally coherent and physically plausible scene generation. We demonstrate the effectiveness of our approach through extensive experiments, showing superior performance in generating realistic indoor scenes compared to existing methods. This work advances the field of scene generation and opens new possibilities for applications requiring detailed 3D environments. We will release our code upon publication to encourage future research.
[265] SCOUT: Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection
Weiqi Yan,Lvhai Chen,Shengchuan Zhang,Yan Zhang,Liujuan Cao
Main category: cs.CV
TL;DR: SCOUT improves semi-supervised Camouflaged Object Detection by better utilizing unlabeled data through adaptive augmentation and text-visual fusion, achieving top performance.
Details
Motivation: The high cost of pixel-level annotation hampers COD development. Current semi-supervised approaches underutilize unlabeled data, leaving room for improvement. Method: SCOUT introduces an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM), using adversarial augment and sampling strategy along with text-visual interaction to leverage unlabeled data and camouflage-related knowledge. Result: SCOUT achieves superior performance over existing semi-supervised COD methods and reaches state-of-the-art results. Conclusion: SCOUT, a new semi-supervised COD framework, outperforms previous methods and achieves state-of-the-art performance. Abstract: The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage the semi-supervised COD framework that relies on a small amount of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce a Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection (SCOUT). It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augment and sampling strategy. The TFM module further leverages the selected valuable data by combining camouflage-related knowledge and text-visual interaction. To adapt to this work, we build a new dataset, namely RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/SCOUT.
[266] Diffusion-Based Data Augmentation for Medical Image Segmentation
Maham Nazir,Muhammad Aqeel,Francesco Setti
Main category: cs.CV
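TL;DR: DiffAug is a new framework that combines text-guided diffusion-based generation with automatic segmentation validation to address the segmentation of rare abnormalities in medical images.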
Details
Motivation: Medical image segmentation models struggle because annotated data for rare abnormalities is scarce, so a framework is needed that can generate diverse abnormal samples and ensure their quality. Method: The approach uses latent diffusion models conditioned on text descriptions and spatial masks for inpainting, combined with a latent-space segmentation network for dynamic quality validation. Result: On three medical imaging benchmarks, Dice scores improve by 8-10% over baselines, and false negative rates fall by up to 28% for challenging cases such as small polyps and flat lesions. Conclusion: DiffAug effectively addresses the difficulty medical image segmentation models have with rare abnormalities, achieving state-of-the-art performance by combining text-guided diffusion-based generation with automatic segmentation validation. Abstract: Medical image segmentation models struggle with rare abnormalities due to scarce annotated pathological data. We propose DiffAug, a novel framework that combines text-guided diffusion-based generation with automatic segmentation validation to address this challenge. Our proposed approach uses latent diffusion models conditioned on medical text descriptions and spatial masks to synthesize abnormalities via inpainting on normal images. Generated samples undergo dynamic quality validation through a latent-space segmentation network that ensures accurate localization while enabling single-step inference. The text prompts, derived from medical literature, guide the generation of diverse abnormality types without requiring manual annotation. Our validation mechanism filters synthetic samples based on spatial accuracy, maintaining quality while operating efficiently through direct latent estimation. Evaluated on three medical imaging benchmarks (CVC-ClinicDB, Kvasir-SEG, REFUGE2), our framework achieves state-of-the-art performance with 8-10% Dice improvements over baselines and reduces false negative rates by up to 28% for challenging cases like small polyps and flat lesions critical for early detection in screening applications.
[267] Alternating Training-based Label Smoothing Enhances Prompt Generalization
Yang Chen,Yanbin Wei,Ke Jin,Yi Kong,James Kwok,Yu Zhang
Main category: cs.CV
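TL;DR: This paper proposes ATLaS, a new alternating training-based label smoothing method, and combines it with two efficient soft labels (CSL and ISL) to improve the generalization and compatibility of prompt tuning.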
Details
Motivation: Although prompt tuning is a parameter-efficient fine-tuning method, its generalization ability remains limited; label smoothing (LS) is an effective regularization technique that prevents over-confidence and improves generalization, so the paper explores combining LS with prompt tuning. Method: The paper proposes Alternating Training-based Label Smoothing (ATLaS), which supervises prompt tuning by alternating between standard one-hot labels and soft labels, together with two efficient offline soft labels: Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL). Result: Extensive experiments show that ATLaS combined with CSL and ISL consistently improves the generalization performance of prompt tuning. Conclusion: ATLaS combined with CSL and ISL effectively improves the generalization of prompt tuning and is highly compatible with prevalent prompt tuning methods, allowing seamless integration into existing methods. Abstract: Recent advances in pre-trained vision-language models have demonstrated remarkable zero-shot generalization capabilities. To further enhance these models' adaptability to various downstream tasks, prompt tuning has emerged as a parameter-efficient fine-tuning method. However, despite its efficiency, the generalization ability of prompt tuning remains limited. In contrast, label smoothing (LS) has been widely recognized as an effective regularization technique that prevents models from becoming over-confident and improves their generalization. This inspires us to explore the integration of LS with prompt tuning. However, we have observed that vanilla LS even weakens the generalization ability of prompt tuning. To address this issue, we propose the Alternating Training-based Label Smoothing (ATLaS) method, which alternately trains with standard one-hot labels and soft labels generated by LS to supervise the prompt tuning. Moreover, we introduce two types of efficient offline soft labels, including Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to provide inter-class or instance-class relationships for prompt tuning. The theoretical properties of the proposed ATLaS method are analyzed. Extensive experiments demonstrate that the proposed ATLaS method, combined with CSL and ISL, consistently enhances the generalization performance of prompt tuning. Moreover, the proposed ATLaS method exhibits high compatibility with prevalent prompt tuning methods, enabling seamless integration into existing methods.
[268] Box-Level Class-Balanced Sampling for Active Object Detection
Jingyi Liao,Xun Xu,Chuan-Sheng Foo,Lile Cai
Main category: cs.CV
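TL;DR: This paper proposes an improved box-level active learning strategy that boosts object detection performance through class-balanced sampling and task-aware pseudo labels, reducing the annotation burden.
Details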
Motivation: Training deep object detectors relies on expensive bounding box annotation, which active learning can reduce. However, early-stage models perform well only on majority classes, which makes the pseudo labels severely class-imbalanced, so an improved strategy is needed. Method: A class-balanced sampling strategy selects more objects from minority classes for labelling, and a task-aware soft pseudo labelling strategy increases the accuracy of pseudo labels. Result: The proposed method achieves state-of-the-art performance on public benchmark datasets. Conclusion: The paper proposes a class-balanced sampling strategy and a task-aware soft pseudo labelling strategy to improve box-level active learning for object detection and validates their performance on public benchmark datasets. Abstract: Training deep object detectors demands expensive bounding box annotation. Active learning (AL) is a promising technique to alleviate the annotation burden. Performing AL at the box level for object detection, i.e., selecting the most informative boxes to label and supplementing the sparsely-labelled image with pseudo labels, has been shown to be more cost-effective than selecting and labelling the entire image. In box-level AL for object detection, we observe that models at an early stage can only perform well on majority classes, making the pseudo labels severely class-imbalanced. We propose a class-balanced sampling strategy to select more objects from minority classes for labelling, so as to make the final training data, i.e., ground truth labels obtained by AL and pseudo labels, more class-balanced to train a better model. We also propose a task-aware soft pseudo labelling strategy to increase the accuracy of pseudo labels. We evaluate our method on public benchmarking datasets and show that our method achieves state-of-the-art performance.
[269] VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
Pengfei Jiang,Hanjun Li,Linglan Zhao,Fei Chao,Ke Yan,Shouhong Ding,Rongrong Ji
Main category: cs.CV
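TL;DR: This study proposes VISA, which improves the inference efficiency of multimodal large language models through graph-based token aggregation and a group-wise token selection strategy.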
Details
Motivation: An excessive number of visual tokens in multimodal large language models leads to inefficient inference, so a method is needed that compresses the token count while preserving more visual information. Method: The paper proposes a graph-based visual token aggregation (VTA) module and a group-wise token selection (GTS) strategy. VTA treats visual tokens as graph nodes, builds the graph from semantic similarity, and aggregates information over it; GTS divides visual tokens into kept and removed ones under the guidance of text tokens. Result: Experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA show that the method achieves a better trade-off between model performance and inference speed than previous token pruning methods. Conclusion: VISA effectively addresses the inference inefficiency caused by excessive visual tokens in multimodal large language models and shows a better balance of performance and inference speed than existing methods across multiple benchmarks. Abstract: In this study, we introduce a novel method called group-wise VIsual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method can preserve more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
[270] AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Kang Zeng,Guojin Zhong,Jintao Cheng,Jin Yuan,Zhiyong Li
Main category: cs.CV
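TL;DR: This paper proposes an adaptive visual anchoring strategy and a collaborative decoding mechanism for multi-image visual question answering, effectively reducing visual redundancy and improving model accuracy and efficiency.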
Details
Motivation: Multi-image visual question answering involves substantial visual redundancy that is irrelevant to the question and hurts both accuracy and efficiency, while existing methods lack flexibility in compressing visual tokens and produce discrete visual fragments. Method: The paper proposes an adaptive visual anchoring strategy and a collaborative decoding mechanism to reduce visual redundancy and improve the model's holistic understanding. Result: Experiments show performance gains across multiple multimodal large language models, along with good generalization. Conclusion: The paper proposes a universal adaptive visual anchoring strategy and collaborative decoding mechanism that effectively address visual redundancy in multi-image visual question answering and improve performance across multiple multimodal large language models. Abstract: The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single to Multi Image VQA (MVQA). However, the increased number of images in MVQA inevitably introduces substantial visual redundancy that is irrelevant to question answering, negatively impacting both accuracy and efficiency. To address this issue, existing methods lack flexibility in controlling the number of compressed visual tokens and tend to produce discrete visual fragments, which hinder MLLMs' ability to comprehend images holistically. In this paper, we propose a straightforward yet universal Adaptive Visual Anchoring strategy, which can be seamlessly integrated into existing MLLMs, offering significant accuracy improvements through adaptive compression. Meanwhile, to balance the results derived from both global and compressed visual input, we further introduce a novel collaborative decoding mechanism, enabling optimal performance. Extensive experiments validate the effectiveness of our method, demonstrating consistent performance improvements across various MLLMs. The code will be publicly available.
[271] Camera Pose Refinement via 3D Gaussian Splatting
Lulu Hao,Lipu Zhou,Zhenzhong Wei,Xu Wang
Main category: cs.CV
TL;DR: GS-SMC is a novel camera pose refinement framework based on 3D Gaussian Splatting that improves accuracy using epipolar geometry constraints, eliminating the need for additional training or scene-specific adjustments.
Details
Motivation: Camera pose refinement aims to enhance the accuracy of initial pose estimation in 3D computer vision applications, overcoming limitations of existing methods that rely on specific descriptors or dedicated networks. Method: An iterative optimization approach using epipolar geometric constraints among the query and multiple rendered images is introduced, allowing flexible selection of feature extractors and matchers. Result: Empirical evaluations on the 7-Scenes and Cambridge Landmarks datasets show that the proposed method outperforms state-of-the-art approaches, achieving significant reductions in median translation and rotation errors. Conclusion: The proposed GS-SMC method provides a lightweight and effective solution for camera pose refinement, leveraging 3D Gaussian Splatting to achieve improved accuracy without additional training. Abstract: Camera pose refinement aims at improving the accuracy of initial pose estimation for applications in 3D computer vision. Most refinement approaches rely on 2D-3D correspondences with specific descriptors or dedicated networks, requiring reconstructing the scene again for a different descriptor or fully retraining the network for each scene. Some recent methods instead infer pose from feature similarity, but their lack of geometry constraints results in less accuracy. To overcome these limitations, we propose a novel camera pose refinement framework leveraging 3D Gaussian Splatting (3DGS), referred to as GS-SMC. Given the widespread usage of 3DGS, our method can employ an existing 3DGS model to render novel views, providing a lightweight solution that can be directly applied to diverse scenes without additional training or fine-tuning. Specifically, we introduce an iterative optimization approach, which refines the camera pose using epipolar geometric constraints among the query and multiple rendered images. Our method allows flexibly choosing feature extractors and matchers to establish these constraints. 
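The epipolar constraint mentioned above can be illustrated with a minimal sketch. It is not the GS-SMC implementation: it only shows the algebraic residual that a candidate relative pose produces on matched points between two views (for example, a query image and one rendered view). The helper names `essential_matrix` and `epipolar_residuals`, the use of normalized image coordinates, and the pose convention X2 = R X1 + t are assumptions made for illustration.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for a relative pose X2 = R @ X1 + t (assumed convention)."""
    return skew(t) @ R

def epipolar_residuals(E, x1, x2):
    """Algebraic residuals x2^T E x1 for N matches given as (N, 2) normalized coordinates."""
    h1 = np.hstack([x1, np.ones((len(x1), 1))])
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    return np.einsum("ni,ij,nj->n", h2, E, h1)

# Synthetic check: points seen from two cameras related by a known pose.
rng = np.random.default_rng(0)
X1 = rng.uniform(-1, 1, (20, 3)) + np.array([0.0, 0.0, 5.0])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.3, 0.0, 0.1])
X2 = X1 @ R.T + t
x1 = X1[:, :2] / X1[:, 2:]
x2 = X2[:, :2] / X2[:, 2:]

res_true = epipolar_residuals(essential_matrix(R, t), x1, x2)        # correct pose
res_bad = epipolar_residuals(essential_matrix(np.eye(3), t), x1, x2)  # wrong rotation
```

With the correct pose the residuals vanish up to floating-point error, while a wrong rotation leaves nonzero residuals; an iterative refinement would adjust the pose to drive such residuals toward zero across several views.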
Extensive empirical evaluations on the 7-Scenes and the Cambridge Landmarks datasets demonstrate that our method outperforms state-of-the-art camera pose refinement approaches, achieving 53.3% and 56.9% reductions in median translation and rotation errors on 7-Scenes, and 40.7% and 53.2% on Cambridge.
[272] Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection
Dabbrata Das,Mahshar Yahan,Md Tareq Zaman,Md Rishadul Bayesh
Main category: cs.CV
TL;DR: This paper proposes a hybrid detection framework combining a fine-tuned Vision Transformer (ViT) with an edge-based image processing module to effectively detect AI-generated images, achieving high accuracy and F1-score while maintaining computational efficiency.
Details
Motivation: The motivation is to overcome the limitations of conventional detection methods that rely on deep learning models extracting global features, often overlooking subtle structural inconsistencies and demanding substantial computational resources. The goal is to develop a more sensitive, efficient, and interpretable method for detecting highly realistic AI-generated images. Method: The framework uses a fine-tuned Vision Transformer (ViT) combined with an edge-based image processing module. The edge-based module computes variance from edge-difference maps generated before and after smoothing to exploit the observation that AI-generated images have smoother textures, weaker edges, and reduced noise compared to real images. This module acts as a post-processing step on ViT predictions. Result: Extensive experiments on the CIFAKE, Artistic, and Custom Curated datasets show that the proposed framework achieves superior detection performance across all benchmarks, attaining 97.75% accuracy and a 97.77% F1-score on CIFAKE, surpassing widely adopted state-of-the-art models. Conclusion: The proposed hybrid detection framework combining a fine-tuned Vision Transformer (ViT) with an edge-based image processing module is a lightweight, interpretable, and effective solution for detecting AI-generated images, suitable for real-world applications in automated content verification and digital forensics. Abstract: The rapid advancement of generative models has led to a growing prevalence of highly realistic AI-generated images, posing significant challenges for digital forensics and content authentication. Conventional detection methods mainly rely on deep learning models that extract global features, which often overlook subtle structural inconsistencies and demand substantial computational resources. To address these limitations, we propose a hybrid detection framework that combines a fine-tuned Vision Transformer (ViT) with a novel edge-based image processing module. 
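The edge-difference variance cue described in the Method summary above can be sketched as follows. This is an illustrative reading, not the authors' code: the summary does not name the edge detector or the smoothing filter, so a finite-difference gradient magnitude and a 3x3 box blur are assumed, and the function names are hypothetical. How the score is combined with the ViT prediction is not shown.

```python
import numpy as np

def edge_map(img):
    """Gradient-magnitude edge map (assumed edge detector)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy)

def box_blur(img, k=3):
    """k x k mean filter with edge padding (assumed smoothing filter)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def edge_difference_variance(img, k=3):
    """Variance of the difference between edge maps before and after smoothing."""
    diff = edge_map(img) - edge_map(box_blur(img, k))
    return float(diff.var())

# A smooth ramp versus the same ramp with added noise (grayscale, 32x32).
rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
noisy = smooth + 0.2 * rng.standard_normal((32, 32))
score_smooth = edge_difference_variance(smooth)
score_noisy = edge_difference_variance(noisy)
```

Under these assumptions, smoothing barely changes the edges of an already smooth image, so its score stays low, whereas a textured or noisy image loses edge energy when blurred and scores higher.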
The edge-based module computes variance from edge-difference maps generated before and after smoothing, exploiting the observation that AI-generated images typically exhibit smoother textures, weaker edges, and reduced noise compared to real images. When applied as a post-processing step on ViT predictions, this module enhances sensitivity to fine-grained structural cues while maintaining computational efficiency. Extensive experiments on the CIFAKE, Artistic, and Custom Curated datasets demonstrate that the proposed framework achieves superior detection performance across all benchmarks, attaining 97.75% accuracy and a 97.77% F1-score on CIFAKE, surpassing widely adopted state-of-the-art models. These results establish the proposed method as a lightweight, interpretable, and effective solution for both still images and video frames, making it highly suitable for real-world applications in automated content verification and digital forensics.
[273] ISALux: Illumination and Segmentation Aware Transformer Employing Mixture of Experts for Low Light Image Enhancement
Raul Balmez,Alexandru Brateanu,Ciprian Orhei,Codruta Ancuti,Cosmin Ancuti
Main category: cs.CV
TL;DR: ISALux is a new method for enhancing low-light images using transformer-based techniques that integrate illumination and semantic information effectively.
Details
Motivation: To enhance low-light images by integrating illumination and semantic priors effectively, addressing overfitting due to distinct light patterns in datasets. Method: Developed a novel transformer-based approach called ISALux incorporating HISA-MSA for integrating illumination and semantic features, using two self-attention modules and a MoE-based FFN with LoRA. Result: Extensive evaluations show ISALux's competitiveness with SOTA methods, and ablation studies underline the significance of its components. Conclusion: ISALux proves to be competitive with SOTA methods for LLIE, with added benefits from the MoE-based FFN and LoRA in contextual learning and avoiding overfitting. Abstract: We introduce ISALux, a novel transformer-based approach for Low-Light Image Enhancement (LLIE) that seamlessly integrates illumination and semantic priors. Our architecture includes an original self-attention block, Hybrid Illumination and Semantics-Aware Multi-Headed Self-Attention (HISA-MSA), which integrates illumination and semantic segmentation maps for enhanced feature extraction. ISALux employs two self-attention modules to independently process illumination and semantic features, selectively enriching each other to regulate luminance and highlight structural variations in real-world scenarios. A Mixture of Experts (MoE)-based Feed-Forward Network (FFN) enhances contextual learning, with a gating mechanism conditionally activating the top K experts for specialized processing. To address overfitting in LLIE methods caused by distinct light patterns in benchmarking datasets, we enhance the HISA-MSA module with low-rank matrix adaptations (LoRA). Extensive qualitative and quantitative evaluations across multiple specialized datasets demonstrate that ISALux is competitive with state-of-the-art (SOTA) methods. Additionally, an ablation study highlights the contribution of each component in the proposed model.
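The top-K gating idea in the abstract above can be sketched generically. This is not ISALux's FFN: the linear experts, the softmax renormalized over the selected experts, and the name `top_k_moe` are assumptions chosen to keep the example small.

```python
import numpy as np

def top_k_moe(x, gate_w, expert_ws, k=2):
    """Route one token vector x through its k highest-scoring experts.

    x: (d,), gate_w: (d, num_experts), expert_ws: (num_experts, d, d_out).
    Experts are plain linear maps here (an assumption for illustration).
    """
    logits = x @ gate_w                        # one gate score per expert
    top = np.argsort(logits)[-k:]              # indices of the k largest scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    out = sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))
    return out, top, weights

rng = np.random.default_rng(0)
d, num_experts, d_out = 4, 3, 4
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, num_experts))
expert_ws = rng.standard_normal((num_experts, d, d_out))

out2, top2, w2 = top_k_moe(x, gate_w, expert_ws, k=2)
out1, top1, w1 = top_k_moe(x, gate_w, expert_ws, k=1)
```

Only the selected experts are evaluated, which is what makes the activation conditional; with `k=1` the layer reduces to the single best-scoring expert.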
Code will be released upon publication.
[274] UniAPO: Unified Multimodal Automated Prompt Optimization
Qipeng Zhu,Yanzhe Chen,Huasong Zhong,Yan Li,Jie Chen,Zhixin Zhang,Junping Zhang,Zhenheng Yang
Main category: cs.CV
TL;DR: UniAPO is a new framework for automated prompt optimization that effectively addresses challenges in multimodal tasks, offering a unified approach for efficient and transferable prompt optimization.
Details
Motivation: Existing APO methods are limited to text-only scenarios and face challenges in multimodal tasks, such as visual token inflation and lack of process-level supervision. Method: UniAPO uses an EM-inspired optimization process with decoupled feedback modeling and prompt refinement, incorporating a short-long term memory mechanism for improved optimization. Result: UniAPO achieves consistent performance improvements across text, image, and video benchmarks. Conclusion: UniAPO provides a unified framework for efficient and transferable multimodal prompt optimization, addressing key challenges in extending APO to multimodal tasks. Abstract: Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. 
UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
[275] EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images
Xinning Yao,Bo Liu,Bojian Li,Jingjing Wang,Jinghua Yue,Fugen Zhou
Main category: cs.CV
TL;DR: This study introduces EndoUFM, an improved unsupervised method for depth estimation in endoscopic surgery that outperforms existing techniques by leveraging dual foundation models and novel architectural components.
Details
Motivation: Existing monocular depth estimation methods struggle with varying illumination and complex textures in surgical environments, while foundation models trained on natural images lack domain adaptability and semantic understanding for endoscopy. Method: The study proposes EndoUFM, an unsupervised framework integrating dual foundation models, adaptive fine-tuning using RVLoRA, Res-DSC for local feature extraction, and a mask-guided smoothness loss. Result: EndoUFM achieves state-of-the-art performance on multiple surgical datasets (SCARED, Hamlyn, SERV-CT, EndoNeRF) while maintaining an efficient model size. Conclusion: EndoUFM provides a state-of-the-art solution for monocular depth estimation in endoscopic surgeries, enhancing spatial perception for surgeons and improving precision and safety. Abstract: Depth estimation is a foundational component for 3D reconstruction in minimally invasive endoscopic surgeries. However, existing monocular depth estimation techniques often perform poorly under the varying illumination and complex textures of the surgical environment. While powerful visual foundation models offer a promising solution, their training on natural images leads to significant domain adaptability limitations and semantic perception deficiencies when applied to endoscopy. In this study, we introduce EndoUFM, an unsupervised monocular depth estimation framework that integrates dual foundation models for surgical scenes, enhancing depth estimation performance by leveraging their powerful pre-learned priors. The framework features a novel adaptive fine-tuning strategy that incorporates Random Vector Low-Rank Adaptation (RVLoRA) to enhance model adaptability, and a Residual block based on Depthwise Separable Convolution (Res-DSC) to improve the capture of fine-grained local features. Furthermore, we design a mask-guided smoothness loss to enforce depth consistency within anatomical tissue structures.
Extensive experiments on the SCARED, Hamlyn, SERV-CT, and EndoNeRF datasets confirm that our method achieves state-of-the-art performance while maintaining an efficient model size. This work contributes to augmenting surgeons' spatial perception during minimally invasive procedures, thereby enhancing surgical precision and safety, with crucial implications for augmented reality and navigation systems.
[276] Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation
Konstantin Egorov,Stepan Botman,Pavel Blinov,Galina Zubkova,Anton Ivaschenko,Alexander Kolsanov,Andrey Savchenko
Main category: cs.CV
TL;DR: This paper presents a large-scale, diverse dataset and rPPG model to advance AI medical assistants, addressing the limitations of current datasets.
Details
Motivation: The motivation is to address the limitations of existing rPPG datasets, such as small size, privacy concerns, and lack of diversity, which hinder progress in the field. Method: The authors introduce a large-scale, multi-view video dataset for rPPG and health biomarkers estimation, including synchronized video recordings and physiological signals. They train an efficient rPPG model and compare its performance in cross-dataset scenarios. Result: The dataset includes 3600 synchronized video recordings from 600 subjects under varied conditions, paired with PPG signals and extended health metrics. The trained rPPG model demonstrates high quality in cross-dataset evaluations. Conclusion: The paper concludes that the newly introduced dataset and rPPG model can significantly advance the development of AI medical assistants by overcoming the limitations of existing datasets. Abstract: Progress in remote PhotoPlethysmoGraphy (rPPG) is limited by the critical issues of existing publicly available datasets: small size, privacy concerns with facial videos, and lack of diversity in conditions. The paper introduces a novel comprehensive large-scale multi-view video dataset for rPPG and health biomarkers estimation. Our dataset comprises 3600 synchronized video recordings from 600 subjects, captured under varied conditions (resting and post-exercise) using multiple consumer-grade cameras at different angles. To enable multimodal analysis of physiological states, each recording is paired with a 100 Hz PPG signal and extended health metrics, such as electrocardiogram, arterial blood pressure, biomarkers, temperature, oxygen saturation, respiratory rate, and stress level. Using this data, we train an efficient rPPG model and compare its quality with existing approaches in cross-dataset scenarios. 
The public release of our dataset and model should significantly speed up the progress in the development of AI medical assistants.
[277] See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops
Zixuan Dong,Baoyun Peng,Yufei Wang,Lin Liu,Xinxin Dong,Yunlong Cao,Xiaodong Wang
Main category: cs.CV
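TL;DR: CAVIA is a training-free video understanding framework that improves long-form video question answering by dynamically coordinating reasoning and perception.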
Details
Motivation: Current long-form video question answering systems decouple reasoning from perception, which causes information loss or computational inefficiency. Method: CAVIA is a training-free framework with three innovations: hierarchical reasoning, cross-modal semantic bridging, and confidence-driven iterative synthesis. Result: CAVIA achieves state-of-the-art performance on benchmarks including EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%). Conclusion: By dynamically coordinating reasoning and perception, CAVIA offers a scalable paradigm for video understanding. Abstract: Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details. However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing. The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements: different queries demand fundamentally different visual evidence from the same video content. In this work, we present CAVIA, a training-free framework that revolutionizes video understanding through reasoning-perception coordination. Unlike conventional approaches where visual processing operates independently of reasoning, CAVIA creates a closed-loop system where reasoning continuously guides visual extraction based on identified information gaps. CAVIA introduces three innovations: (1) hierarchical reasoning-guided localization to precise frames; (2) cross-modal semantic bridging for targeted extraction; (3) confidence-driven iterative synthesis. CAVIA achieves state-of-the-art performance on challenging benchmarks: EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%), demonstrating that dynamic reasoning-perception coordination provides a scalable paradigm for video understanding.
[278] Beam Geometry and Input Dimensionality: Impact on Sparse-Sampling Artifact Correction for Clinical CT with U-Nets
Tina Dorosti,Johannes Thalhammer,Sebastian Peterhansl,Daniela Pfeiffer,Franz Pfeiffer,Florian Schaff
Main category: cs.CV
TL;DR: This study investigates the impact of beam geometries and data dimensions on sparse-sampling streak artifact correction using U-Nets for CT scans, finding that 2D U-Nets trained on axial slices perform best in terms of MSE and SSIM metrics.
Details
Motivation: To incorporate volumetric context into artifact reduction tasks and improve model performance for sparse-sampling streak artifact correction in clinical CT scans. Method: A total of 22 subjects were selected, and sparsely-sampled CT volumes were simulated with the Astra toolbox for parallel, fan, and cone beam geometries. 2D and 3D U-Nets were trained and validated on 14, and tested on 8 subjects. CT volumes were divided into 64x64x64 voxel blocks for 2.5D and 3D data analysis. Result: For all beam geometries, the 2D U-Net trained on axial 2D slices achieved the best MSE and SSIM values, surpassing the 2.5D and 3D input data dimensions. Conclusion: 2D U-Net trained on axial 2D slices outperforms 2.5D and 3D input data dimensions for sparse-sampling streak artifact correction in CT scans. Abstract: This study aims to investigate the effect of various beam geometries and dimensions of input data on the sparse-sampling streak artifact correction task with U-Nets for clinical CT scans as a means of incorporating the volumetric context into artifact reduction tasks to improve model performance. A total of 22 subjects were retrospectively selected (01.2016-12.2018) from the Technical University of Munich's research hospital, TUM Klinikum rechts der Isar. Sparsely-sampled CT volumes were simulated with the Astra toolbox for parallel, fan, and cone beam geometries. 2048 views were taken as full-view scans. 2D and 3D U-Nets were trained and validated on 14, and tested on 8 subjects, respectively. For the dimensionality study, in addition to the 512x512 2D CT images, the CT scans were further pre-processed to generate a so-called '2.5D', and 3D data: Each CT volume was divided into 64x64x64 voxel blocks. The 3D data refers to individual 64-voxel blocks. An axial, coronal, and sagittal cut through the center of each block resulted in three 64x64 2D patches that were rearranged as a single 64x64x3 image, proposed as 2.5D data. 
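The 2.5D construction described above (three orthogonal center cuts of a 64-voxel block stacked into one 64x64x3 image) can be sketched as below. The axis order (z, y, x), the center index `size // 2`, the channel order axial/coronal/sagittal, and the dropping of trailing voxels that do not fill a whole block are assumptions; the function names are hypothetical and the code is not taken from the study.

```python
import numpy as np

def split_into_blocks(volume, size=64):
    """Yield non-overlapping size^3 blocks of a (z, y, x) volume.

    Voxels that do not fill a complete block are dropped (assumption).
    """
    nz, ny, nx = (s // size for s in volume.shape)
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                yield volume[z * size:(z + 1) * size,
                             y * size:(y + 1) * size,
                             x * size:(x + 1) * size]

def block_to_25d(block):
    """Stack the axial, coronal, and sagittal center cuts as three channels."""
    c = block.shape[0] // 2
    axial = block[c, :, :]      # fixed z
    coronal = block[:, c, :]    # fixed y
    sagittal = block[:, :, c]   # fixed x
    return np.stack([axial, coronal, sagittal], axis=-1)

# Toy volume: 64 slices of 128x128 voxels gives 1 * 2 * 2 = 4 blocks.
volume = np.arange(64 * 128 * 128, dtype=np.float32).reshape(64, 128, 128)
blocks = list(split_into_blocks(volume))
patches = [block_to_25d(b) for b in blocks]
```

Each 3D sample is the full block, while each 2.5D sample keeps only three of its slices, so the 2.5D input carries some volumetric context at roughly the memory cost of a small multi-channel 2D image.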
Model performance was assessed with the mean squared error (MSE) and structural similarity index measure (SSIM). For all geometries, the 2D U-Net trained on axial 2D slices results in the best MSE and SSIM values, outperforming the 2.5D and 3D input data dimensions.
[279] SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization
Junyuan Deng,Heng Li,Tao Xie,Weiqiang Ren,Qian Zhang,Ping Tan,Xiaoyang Guo
Main category: cs.CV
TL;DR: The paper introduces SAIL-Recon, a Transformer-based approach for large-scale Structure-from-Motion, which efficiently handles a large number of images and achieves superior performance in pose estimation and view synthesis.
Details
Motivation: Scene regression methods like VGGT struggle to handle a large number of input images, which limits their scalability for large-scale Structure-from-Motion (SfM) tasks. Method: The method introduces SAIL-Recon, a feed-forward Transformer that augments the scene regression network with visual localization capabilities. It computes a neural scene representation from anchor images and fine-tunes the regression network to reconstruct all input images based on this representation. Result: The proposed method scales efficiently to large-scale scenes and achieves state-of-the-art results on benchmarks for camera pose estimation and novel view synthesis, including TUM-RGBD, CO3Dv2, and Tanks & Temples. Conclusion: SAIL-Recon successfully addresses the limitation of scene regression methods in handling a large number of input images for large-scale Structure-from-Motion (SfM) tasks, achieving state-of-the-art results in camera pose estimation and novel view synthesis. Abstract: Scene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Transformer for large scale SfM, by augmenting the scene regression network with visual localization capabilities. Specifically, our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation. Comprehensive experiments show that our method not only scales efficiently to large-scale scenes, but also achieves state-of-the-art results on both camera pose estimation and novel view synthesis benchmarks, including TUM-RGBD, CO3Dv2, and Tanks & Temples. 
We will publish our model and code. Code and models are publicly available at: https://hkust-sail.github.io/sail-recon/.
[280] Enhanced Drift-Aware Computer Vision Architecture for Autonomous Driving
Md Shahi Amran Hossain,Abu Shad Ahammed,Sayeri Mukherjee,Roman Obermaisser
Main category: cs.CV
TL;DR: A hybrid model combining YOLO and CNN improves object detection for autonomous driving in challenging conditions.
Details
Motivation: To improve object detection accuracy in adverse weather and low-light conditions and to comply with ISO 8800 standards. Method: Utilized YOLO version 8 for detection and a five-layer CNN for verification using synthetic data. Result: Improved detection accuracy by over 90% under drift-augmented road conditions. Conclusion: The hybrid computer vision architecture enhances road safety and detection accuracy in adverse conditions. Abstract: The use of computer vision in the automotive domain is an active research area in which safety and security are a primary concern. In particular, for autonomous driving, preventing road accidents requires highly accurate object detection under diverse conditions. To address this issue, the International Organization for Standardization (ISO) recently released the 8800 norm, providing structured frameworks for managing the associated AI-related risks. However, challenging scenarios such as adverse weather or low lighting often introduce data drift, leading to degraded model performance and potential safety violations. In this work, we present a novel hybrid computer vision architecture trained with thousands of synthetic images from the road environment to improve robustness in unseen drifted environments. Our dual-mode framework utilized YOLO version 8 for swift detection and incorporated a five-layer CNN for verification. The system functioned in sequence and improved the detection accuracy by more than 90% when tested with drift-augmented road images. The focus was to demonstrate how the two models can provide better road safety when working together in a hybrid structure.
[281] Propose and Rectify: A Forensics-Driven MLLM Framework for Image Manipulation Localization
Keyang Zhang,Chenqi Kong,Hui Liu,Bo Ding,Xinghao Jiang,Haoliang Li
Main category: cs.CV
TL;DR: This paper proposes a Propose-Rectify framework for image forensics that integrates semantic reasoning with forensic-specific analysis, achieving superior performance in detecting and localizing image manipulations.
Details
Motivation: The motivation stems from the need for robust forensic solutions that can detect and localize image manipulations accurately, especially since existing MLLMs struggle with perceiving subtle low-level artifacts crucial for precise localization. Method: The paper introduces a two-stage framework: the Propose stage uses a forensic-adapted LLaVA model to generate initial manipulation analysis and localization, while the Rectify stage refines these proposals using a Forensics Rectification Module and an Enhanced Segmentation Module that integrates forensic cues into SAM's image embeddings. Result: The framework achieves state-of-the-art performance across diverse datasets, demonstrating exceptional robustness and generalization capabilities in detecting and localizing image manipulations. Conclusion: The paper concludes that the Propose-Rectify framework successfully combines semantic reasoning with forensic-specific analysis, achieving state-of-the-art performance in image manipulation detection and localization. Abstract: The increasing sophistication of image manipulation techniques demands robust forensic solutions that can both reliably detect alterations and precisely localize tampered regions. Recent Multimodal Large Language Models (MLLMs) show promise by leveraging world knowledge and semantic understanding for context-aware detection, yet they struggle with perceiving subtle, low-level forensic artifacts crucial for accurate manipulation localization. This paper presents a novel Propose-Rectify framework that effectively bridges semantic reasoning with forensic-specific analysis. In the proposal stage, our approach utilizes a forensic-adapted LLaVA model to generate initial manipulation analysis and preliminary localization of suspicious regions based on semantic understanding and contextual reasoning. 
In the rectification stage, we introduce a Forensics Rectification Module that systematically validates and refines these initial proposals through multi-scale forensic feature analysis, integrating technical evidence from several specialized filters. Additionally, we present an Enhanced Segmentation Module that incorporates critical forensic cues into SAM's encoded image embeddings, thereby overcoming inherent semantic biases to achieve precise delineation of manipulated regions. By synergistically combining advanced multimodal reasoning with established forensic methodologies, our framework ensures that initial semantic proposals are systematically validated and enhanced through concrete technical evidence, resulting in comprehensive detection accuracy and localization precision. Extensive experimental validation demonstrates state-of-the-art performance across diverse datasets with exceptional robustness and generalization capabilities.
[282] Fence off Anomaly Interference: Cross-Domain Distillation for Fully Unsupervised Anomaly Detection
Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
Main category: cs.CV
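TL;DR: This paper proposes a novel Cross-Domain Distillation (CDD) framework for Fully Unsupervised Anomaly Detection (FUAD). Through domain-specific training and cross-domain knowledge aggregation, it addresses the risk that conventional knowledge distillation lets the student learn anomaly representations under the FUAD setting, and it is validated on noisy versions of the MVTec AD and VisA datasets.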
Details
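Motivation: Fully Unsupervised Anomaly Detection (FUAD) is a practical extension of Unsupervised Anomaly Detection (UAD) that aims to detect anomalies without any labels even when the training set may contain anomalous samples. To achieve FUAD, the authors are the first to introduce the teacher-student Knowledge Distillation (KD) paradigm into the FUAD setting. However, because the training data contain anomalies, conventional KD methods risk letting the student learn the teacher's representation of anomalies, which degrades anomaly detection performance. Method: The paper proposes a novel Cross-Domain Distillation (CDD) framework built on the widely studied reverse distillation (RD) paradigm. Domain-Specific Training divides the training set into multiple domains with lower anomaly ratios and trains a domain-specific student for each; Cross-Domain Knowledge Aggregation then uses pseudo-normal features generated by the domain-specific students to collaboratively guide a global student in learning generalized normal representations across all samples. Result: The method achieves significant performance improvements over the baseline on noisy versions of the MVTec AD and VisA datasets. Conclusion: Experimental results on noisy versions of the MVTec AD and VisA datasets show significant performance improvements over the baseline, validating the method's effectiveness under the FUAD setting. Abstract: Fully Unsupervised Anomaly Detection (FUAD) is a practical extension of Unsupervised Anomaly Detection (UAD), aiming to detect anomalies without any labels even when the training set may contain anomalous samples. To achieve FUAD, we pioneer the introduction of the Knowledge Distillation (KD) paradigm based on the teacher-student framework into the FUAD setting. However, due to the presence of anomalies in the training data, traditional KD methods risk enabling the student to learn the teacher's representation of anomalies under the FUAD setting, thereby resulting in poor anomaly detection performance. To address this issue, we propose a novel Cross-Domain Distillation (CDD) framework based on the widely studied reverse distillation (RD) paradigm. Specifically, we design a Domain-Specific Training, which divides the training set into multiple domains with lower anomaly ratios and trains a domain-specific student for each. Cross-Domain Knowledge Aggregation is then performed, where pseudo-normal features generated by domain-specific students collaboratively guide a global student to learn generalized normal representations across all samples. Experimental results on noisy versions of the MVTec AD and VisA datasets demonstrate that our method achieves significant performance improvements over the baseline, validating its effectiveness under the FUAD setting.
[283] Development of a Neural Network Model for Currency Detection to aid visually impaired people in Nigeria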
Sochukwuma Nwokoye,Desmond Moru
Main category: cs.CV
TL;DR: This research developed an artificial neural network-based system to help visually impaired individuals differentiate Nigerian cash, achieving a Mean Average Precision score of over 90%.
Details
Motivation: The research aims to explore the potential of artificial neural networks to facilitate the differentiation of various forms of cash for individuals with visual impairments. Method: A custom dataset of 3,468 images was built and used to train an SSD neural network model. Result: The performance of the system in terms of accuracy was assessed, and the Mean Average Precision score was over 90%. Conclusion: The system developed in this research can accurately identify Nigerian cash, contributing to assistive technology and improving the quality of life for visually impaired individuals in Nigeria and beyond. Abstract: Neural networks in assistive technology for visually impaired leverage artificial intelligence's capacity to recognize patterns in complex data. They are used for converting visual data into auditory or tactile representations, helping the visually impaired understand their surroundings. The primary aim of this research is to explore the potential of artificial neural networks to facilitate the differentiation of various forms of cash for individuals with visual impairments. In this study, we built a custom dataset of 3,468 images, which was subsequently used to train an SSD neural network model. The proposed system can accurately identify Nigerian cash, thereby streamlining commercial transactions. The performance of the system in terms of accuracy was assessed, and the Mean Average Precision score was over 90%. We believe that our system has the potential to make a substantial contribution to the field of assistive technology while also improving the quality of life of visually challenged persons in Nigeria and beyond.
[284] Towards Continual Visual Anomaly Detection in the Medical Domain
Manuel Barusco,Francesco Borsatti,Nicola Beda,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
TL;DR: This study introduces PatchCoreCL, a continual learning approach for visual anomaly detection in medical imaging, showing that it can adapt to evolving data with minimal forgetting while maintaining performance comparable to specialized models.
Details
Motivation: Visual Anomaly Detection (VAD) is crucial in domains like manufacturing and medical imaging, where accurate and explainable detection is essential. However, the impact of evolving input data distributions over time has received limited attention. Continual Learning (CL) offers a promising framework to adapt models incrementally while preserving prior knowledge, especially relevant in the dynamic medical imaging context. Method: The study introduces PatchCoreCL, a continual learning version of the PatchCore model, and evaluates its performance using the BMAD dataset, which includes real-world medical imaging data with both image-level and pixel-level annotations. Result: PatchCoreCL demonstrated effective performance on the BMAD dataset, achieving results comparable to task-specific models with a forgetting value of less than 1%, indicating minimal loss of previously acquired knowledge. Conclusion: The study concludes that PatchCoreCL, a continual learning adaptation of the PatchCore model, is a feasible and effective solution for visual anomaly detection in the medical field, with minimal forgetting and performance comparable to task-specific models. Abstract: Visual Anomaly Detection (VAD) seeks to identify abnormal images and precisely localize the corresponding anomalous regions, relying solely on normal data during training. This approach has proven essential in domains such as manufacturing and, more recently, in the medical field, where accurate and explainable detection is critical. Despite its importance, the impact of evolving input data distributions over time has received limited attention, even though such changes can significantly degrade model performance. In particular, given the dynamic and evolving nature of medical imaging data, Continual Learning (CL) provides a natural and effective framework to incrementally adapt models while preserving previously acquired knowledge. 
This study explores for the first time the application of VAD models in a CL scenario for the medical field. In this work, we utilize a CL version of the well-established PatchCore model, called PatchCoreCL, and evaluate its performance using BMAD, a real-world medical imaging dataset with both image-level and pixel-level annotations. Our results demonstrate that PatchCoreCL is an effective solution, achieving performance comparable to the task-specific models, with a forgetting value of less than 1%, highlighting the feasibility and potential of CL for adaptive VAD in medical imaging.
[285] FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction
Ravi Shankar Prasad,Dinesh Singh
Main category: cs.CV
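TL;DR: This paper proposes a generic framework for craniofacial reconstruction from 2D X-ray images, using several generative models to produce more realistic face images, and demonstrates its effectiveness for forensic science.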
Details
Motivation: Traditional methods such as clay-based craniofacial reconstruction require expert knowledge and are time-consuming, while other probabilistic generative models fail to capture the cross-domain attributes of the skull and face. Method: Several generative models (e.g., CycleGANs and cGANs) are used, with the generator and discriminator fine-tuned to produce more realistic images in two distinct domains, the skull and the face. Result: The quality of the generated faces is evaluated with FID, IS, and SSIM scores, and a retrieval framework is proposed in which the query is the generated face image and the gallery is a database of real faces. Conclusion: The 2D X-ray-based craniofacial reconstruction framework is an effective tool for forensic science, especially when other identification methods fail. Abstract: Craniofacial reconstruction in forensics is one of the processes to identify victims of crime and natural disasters. Identifying an individual from their remains plays a crucial role when all other identification methods fail. Traditional methods for this task, such as clay-based craniofacial reconstruction, require expert domain knowledge and are a time-consuming process. At the same time, other probabilistic generative models like the statistical shape model or the Basel face model fail to capture the skull and face cross-domain attributes. Looking at these limitations, we propose a generic framework for craniofacial reconstruction from 2D X-ray images. Here, we used various generative models (e.g., CycleGANs, cGANs, etc.) and fine-tuned the generator and discriminator parts to generate more realistic images in two distinct domains, which are the skull and face of an individual. This is the first time that 2D X-rays are being used as a representation of the skull by generative models for craniofacial reconstruction. We have evaluated the quality of generated faces using FID, IS, and SSIM scores. Finally, we have proposed a retrieval framework where the query is the generated face image and the gallery is the database of real faces. By experimental results, we have found that this can be an effective tool for forensic science.
[286] Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Yaqi Li,Peng Chen,Mingyang Han,Bu Pi,Haoxiang Shi,Runzhou Zhao,Yang Yao,Xuan Zhang,Jun Song
Main category: cs.CV
TL;DR: The paper proposes a new Visual-Chain of Guidance (Visual-CoG) paradigm for text-to-image generation that improves reasoning capabilities and achieves significant performance improvements.
Details
Motivation: Recent autoregressive models for text-to-image generation face limitations in handling multi-attribute and ambiguous prompts, and existing approaches provide reward signals only at the end of the generation stage, leading to suboptimal policies. Method: A three-stage Visual-Chain of Guidance (Visual-CoG) paradigm is proposed, which includes semantic reasoning, process refining, and outcome evaluation, along with stage-aware rewards. A visual cognition benchmark, VisCog-Bench, is also constructed to evaluate semantic reasoning effectiveness. Result: Comprehensive evaluations on GenEval, T2I-CompBench, and VisCog-Bench show performance improvements of 15%, 5%, and 19%, respectively. Conclusion: The proposed Visual-CoG paradigm improves the reasoning capabilities and effectiveness of text-to-image generation models, demonstrating superior performance compared to existing approaches. Abstract: Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. 
Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
[287] ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation
Jianwen Tan,Huiyao Zhang,Rui Xiong,Han Zhou,Hongfei Wang,Ye Li
Main category: cs.CV
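TL;DR: This paper proposes ArgusCogito, a novel zero-shot, chain-of-thought framework that achieves state-of-the-art performance on COS and MIS tasks through cross-modal synergy and omnidirectional reasoning.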
Details
Motivation: Existing COS methods are limited by shallow feature representation, inadequate reasoning mechanisms, and weak cross-modal integration; they struggle to reach deep cognition, leading to incomplete target separation and imprecise segmentation. Method: A zero-shot, chain-of-thought framework named ArgusCogito is proposed, comprising three stages (Conjecture, Focus, and Sculpting) and leveraging cross-modal synergy and omnidirectional reasoning. Result: Extensive evaluations on four challenging COS benchmarks and three medical image segmentation benchmarks show that ArgusCogito achieves state-of-the-art performance. Conclusion: ArgusCogito achieves state-of-the-art performance, demonstrating its effectiveness, generalization capability, and robustness on COS and MIS tasks. Abstract: Camouflaged Object Segmentation (COS) poses a significant challenge due to the intrinsic high similarity between targets and backgrounds, demanding models capable of profound holistic understanding beyond superficial cues. Prevailing methods, often limited by shallow feature representation, inadequate reasoning mechanisms, and weak cross-modal integration, struggle to achieve this depth of cognition, resulting in prevalent issues like incomplete target separation and imprecise segmentation. Inspired by the perceptual strategy of the Hundred-eyed Giant (emphasizing holistic observation, omnidirectional focus, and intensive scrutiny), we introduce ArgusCogito, a novel zero-shot, chain-of-thought framework underpinned by cross-modal synergy and omnidirectional reasoning within Vision-Language Models (VLMs). ArgusCogito orchestrates three cognitively-inspired stages: (1) Conjecture: Constructs a strong cognitive prior through global reasoning with cross-modal fusion (RGB, depth, semantic maps), enabling holistic scene understanding and enhanced target-background disambiguation. (2) Focus: Performs omnidirectional, attention-driven scanning and focused reasoning, guided by semantic priors from Conjecture, enabling precise target localization and region-of-interest refinement. (3) Sculpting: Progressively sculpts high-fidelity segmentation masks by integrating cross-modal information and iteratively generating dense positive/negative point prompts within focused regions, emulating Argus' intensive scrutiny. 
Extensive evaluations on four challenging COS benchmarks and three Medical Image Segmentation (MIS) benchmarks demonstrate that ArgusCogito achieves state-of-the-art (SOTA) performance, validating the framework's exceptional efficacy, superior generalization capability, and robustness.
[288] Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images
Kaiyu Li,Xiangyong Cao,Ruixun Liu,Shihong Wang,Zixuan Jiang,Zhi Wang,Deyu Meng
Main category: cs.CV
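TL;DR: This paper proposes SegEarth-OV, an annotation-free open-vocabulary semantic segmentation framework for optical and SAR remote sensing images. It improves segmentation with SimFeatUp and a Global Bias Alleviation operation, and uses the AlignEarth strategy to transfer semantics across sensor types.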
Details
Motivation: Semantic segmentation of remote sensing images is crucial for Earth observation, but existing open-vocabulary semantic segmentation frameworks designed for natural images struggle with the complexity of remote sensing data, especially given the high cost of large-scale annotation. Method: The paper proposes SimFeatUp, a universal upsampler that restores high-resolution spatial details, and introduces a Global Bias Alleviation operation to improve local semantic fidelity; it further proposes AlignEarth, which uses knowledge distillation to transfer semantic knowledge from optical images to a SAR image encoder. Result: Extensive experiments on optical and SAR datasets show that SegEarth-OV significantly outperforms existing state-of-the-art methods, achieving effective annotation-free open-vocabulary semantic segmentation. Conclusion: By combining SimFeatUp and Global Bias Alleviation with the AlignEarth strategy, SegEarth-OV provides a universal annotation-free open-vocabulary semantic segmentation framework for optical and SAR remote sensing images, laying a solid foundation for annotation-free, open-world Earth observation. Abstract: Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open-vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine-grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features, correcting distorted target shapes without any task-specific post-training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth-OV to effectively harness the rich semantics of pre-trained VLMs, making OVSS possible in optical RS contexts. 
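The Global Bias Alleviation idea described above (subtracting global context from patch features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `alleviate_global_bias`, the strength parameter `lam`, the use of a mean vector as the global feature, and the re-normalization step are all assumptions made for illustration.

```python
import math

def alleviate_global_bias(patch_feats, global_feat, lam=0.3):
    """Subtract a scaled global-context vector from every patch feature.

    patch_feats: list of per-patch feature vectors (lists of floats).
    global_feat: one vector summarizing the whole image (assumed here to be
                 something like a [CLS]-style embedding or the patch mean).
    lam:         hypothetical subtraction strength, not taken from the paper.
    """
    debiased = []
    for feat in patch_feats:
        shifted = [p - lam * g for p, g in zip(feat, global_feat)]
        # Re-normalize so cosine similarity with text embeddings stays comparable.
        norm = math.sqrt(sum(v * v for v in shifted)) or 1.0
        debiased.append([v / norm for v in shifted])
    return debiased

# Two toy patches that share the same third component (the "global context").
patches = [[1.0, 0.0, 2.0], [0.0, 3.0, 2.0]]
global_vec = [0.5, 1.5, 2.0]  # mean of the two patches
out = alleviate_global_bias(patches, global_vec, lam=1.0)
print(out)
```

In this toy case the component shared by both patches cancels, so what remains in each vector is only what distinguishes that patch from the image as a whole.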
Furthermore, to extend the framework's universality to other challenging RS modalities like SAR images, where large-scale VLMs are unavailable and expensive to create, we introduce AlignEarth, which is a distillation-based strategy and can efficiently transfer semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth-OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation-free and open-world Earth observation.
[289] EventTracer: Fast Path Tracing-based Event Stream Rendering
Zhenyang Li,Xiaoyang Bai,Jinfan Lu,Pengfei Shen,Edmund Y. Lam,Yifan Peng
Main category: cs.CV
TL;DR: EventTracer is a path tracing-based rendering pipeline that efficiently simulates high-fidelity event sequences from 3D scenes, offering better performance and realism compared to existing methods.
Details
Motivation: Simulating event streams from 3D scenes meets the demand for large-scale, high temporal frequency data without expensive hardware or extensive data collection. Existing methods use costly noiseless RGB frames and achieve lower temporal resolution compared to real-world event data. Method: EventTracer uses a path tracing-based rendering pipeline to simulate high-fidelity event sequences from complex 3D scenes. It speeds up the rendering process via low sample-per-pixel (SPP) path tracing and employs a lightweight event spiking network to denoise RGB videos into realistic event sequences. The network utilizes a bipolar leaky integrate-and-fired (BiLIF) spiking unit and is trained with a bidirectional earth mover distance (EMD) loss. Result: EventTracer runs at a speed of about 4 minutes per second of 720p video and captures better scene details, demonstrating a greater similarity to real-world event data than other simulators. Conclusion: EventTracer is a promising tool for creating large-scale event-RGB datasets at a low cost, narrowing the sim-to-real gap in event-based vision, and boosting various application scenarios such as robotics, autonomous driving, and VRAR. Abstract: Simulating event streams from 3D scenes has become a common practice in event-based vision research, as it meets the demand for large-scale, high temporal frequency data without setting up expensive hardware devices or undertaking extensive data collections. Yet existing methods in this direction typically work with noiseless RGB frames that are costly to render, and therefore they can only achieve a temporal resolution equivalent to 100-300 FPS, far lower than that of real-world event data. In this work, we propose EventTracer, a path tracing-based rendering pipeline that simulates high-fidelity event sequences from complex 3D scenes in an efficient and physics-aware manner. 
Specifically, we speed up the rendering process via low sample-per-pixel (SPP) path tracing, and train a lightweight event spiking network to denoise the resulting RGB videos into realistic event sequences. To capture the physical properties of event streams, the network is equipped with a bipolar leaky integrate-and-fired (BiLIF) spiking unit and trained with a bidirectional earth mover distance (EMD) loss. Our EventTracer pipeline runs at a speed of about 4 minutes per second of 720p video, and it inherits the merit of accurate spatiotemporal modeling from its path tracing backbone. We show in two downstream tasks that EventTracer captures better scene details and demonstrates a greater similarity to real-world event data than other event simulators, which establishes it as a promising tool for creating large-scale event-RGB datasets at a low cost, narrowing the sim-to-real gap in event-based vision, and boosting various application scenarios such as robotics, autonomous driving, and VRAR.
[290] Few-shot Unknown Class Discovery of Hyperspectral Images with Prototype Learning and Clustering
Chun Liu,Chen Zhang,Zhuo Li,Zheng Li,Wei Yang
Main category: cs.CV
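TL;DR: This paper proposes a new open-set few-shot hyperspectral image classification method that can effectively distinguish and discover unknown classes.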
Details
Motivation: Existing open-set few-shot hyperspectral image classification methods focus mainly on distinguishing unknown-class samples from known-class samples and cannot further identify or discover the unknown classes. Method: A prototype learning and clustering method is proposed for discovering unknown classes in hyperspectral images in the few-shot setting. Result: The proposed method rejects unknown-class samples using the learned known-class classifier and clusters them into different classes according to their distance to the unknown-class prototypes. Conclusion: Experimental results show that the proposed method achieves competitive performance in open-set few-shot hyperspectral image classification tasks. Abstract: Open-set few-shot hyperspectral image (HSI) classification aims to classify image pixels by using few labeled pixels per class, where the pixels to be classified may not all be from the classes that have been seen. To address the open-set HSI classification challenge, current methods focus mainly on distinguishing the unknown class samples from the known class samples and rejecting them to increase the accuracy of identifying known class samples. They fail to further identify or discover the unknown classes among the samples. This paper proposes a prototype learning and clustering method for discovering unknown classes in HSIs under the few-shot environment. Using few labeled samples, it strives to develop the ability of inferring the prototypes of unknown classes while distinguishing unknown classes from known classes. Once the unknown class samples are rejected by the learned known class classifier, the proposed method can further cluster the unknown class samples into different classes according to their distance to the inferred unknown class prototypes. Compared to existing state-of-the-art methods, extensive experiments on four benchmark HSI datasets demonstrate that our proposed method exhibits competitive performance in open-set few-shot HSI classification tasks. The code is available at https://github.com/KOBEN-ff/OpenFUCD-main
[291] Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem
Zhicong Tang,Tiankai Hang,Shuyang Gu,Dong Chen,Baining Guo
Main category: cs.CV
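TL;DR: This paper unifies Score-based Generative Models (SGMs) and the Schrödinger Bridge (SB) problem through three reparameterization techniques, and proposes new strategies that use SGMs as initialization to accelerate and stabilize the training of SB-based models.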
Details
Motivation: To accelerate and stabilize the training of SB-based models and to combine the strengths of SGMs to improve their performance. Method: Three reparameterization techniques (IPMM, IPTM, and IPFM) unify Score-based Generative Models (SGMs) and the Schrödinger Bridge (SB) problem, and a new initialization strategy uses pre-trained SGMs to train SB-based models effectively. Result: Experiments show that the proposed methods yield significant improvements in effectiveness. Conclusion: The paper contributes to research on generative models and paves the way for future work. Abstract: This paper aims to unify Score-based Generative Models (SGMs), also known as Diffusion models, and the Schrödinger Bridge (SB) problem through three reparameterization techniques: Iterative Proportional Mean-Matching (IPMM), Iterative Proportional Terminus-Matching (IPTM), and Iterative Proportional Flow-Matching (IPFM). These techniques significantly accelerate and stabilize the training of SB-based models. Furthermore, the paper introduces novel initialization strategies that use pre-trained SGMs to effectively train SB-based models. By using SGMs as initialization, we leverage the advantages of both SB-based models and SGMs, ensuring efficient training of SB-based models and further improving the performance of SGMs. Extensive experiments demonstrate the significant effectiveness and improvements of the proposed methods. We believe this work contributes to and paves the way for future research on generative models.
[292] BirdRecorder's AI on Sky: Safeguarding birds of prey by detection and classification of tiny objects around wind turbines
Nico Klar,Nizam Gifary,Felix P. G. Ziegler,Frank Sehnke,Anton Kaifel,Eric Price,Aamir Ahmad
Main category: cs.CV
TL;DR: BirdRecorder is an AI-based anti-collision system designed to protect endangered birds like the red kite from wind turbine collisions, contributing to the sustainable coexistence of technology and nature.
Details
Motivation: The urgent need for renewable energy expansion is hindered by conflicts with wildlife conservation, particularly bird collisions with wind turbines. Method: BirdRecorder integrates AI methods with optimized hardware and software architectures, using Single Shot Detector (SSD) for detection, along with hardware acceleration and tracking algorithms. Result: BirdRecorder achieves high detection precision and maintains speed for real-time decision-making, outperforming existing approaches in accuracy and efficiency. Conclusion: BirdRecorder helps in reducing bird-turbine collisions, thereby bridging the gap between renewable energy expansion and wildlife conservation. Abstract: The urgent need for renewable energy expansion, particularly wind power, is hindered by conflicts with wildlife conservation. To address this, we developed BirdRecorder, an advanced AI-based anti-collision system to protect endangered birds, especially the red kite (Milvus milvus). Integrating robotics, telemetry, and high-performance AI algorithms, BirdRecorder aims to detect, track, and classify avian species within a range of 800 m to minimize bird-turbine collisions. BirdRecorder integrates advanced AI methods with optimized hardware and software architectures to enable real-time image processing. Leveraging Single Shot Detector (SSD) for detection, combined with specialized hardware acceleration and tracking algorithms, our system achieves high detection precision while maintaining the speed necessary for real-time decision-making. By combining these components, BirdRecorder outperforms existing approaches in both accuracy and efficiency. In this paper, we summarize results on field tests and performance of the BirdRecorder system. 
By bridging the gap between renewable energy expansion and wildlife conservation, BirdRecorder contributes to a more sustainable coexistence of technology and nature.
[293] Assessing the Noise Robustness of Class Activation Maps: A Framework for Reliable Model Interpretability
Syamantak Sarkar,Revoti P. Bora,Bhupender Kaushal,Sudhish N George,Kiran Raja
Main category: cs.CV
TL;DR: This paper investigates the robustness of Class Activation Maps (CAMs) to noise perturbations and proposes a new metric to assess their stability and sensitivity.
Details
Motivation: The motivation is to explore the robustness of Class Activation Maps (CAMs) to different types of noise, as prior work has largely overlooked this aspect despite CAMs being widely used for model visualization. Method: The authors evaluate CAM methods under various noise perturbations across multiple models and datasets, proposing a robustness metric that considers both consistency and responsiveness. Result: The study finds significant variability in the noise sensitivity of different CAM methods and demonstrates the effectiveness of the proposed robustness metric through empirical evaluation and statistical tests. Conclusion: The paper concludes that different CAMs exhibit variable sensitivity to noise and introduces a robustness metric based on consistency and responsiveness to evaluate CAMs under perturbations. Abstract: Class Activation Maps (CAMs) are one of the important methods for visualizing regions used by deep learning models. Yet their robustness to different noise remains underexplored. In this work, we evaluate and report the resilience of various CAM methods for different noise perturbations across multiple architectures and datasets. By analyzing the influence of different noise types on CAM explanations, we assess the susceptibility to noise and the extent to which dataset characteristics may impact explanation stability. The findings highlight considerable variability in noise sensitivity for various CAMs. We propose a robustness metric for CAMs that captures two key properties: consistency and responsiveness. Consistency reflects the ability of CAMs to remain stable under input perturbations that do not alter the predicted class, while responsiveness measures the sensitivity of CAMs to changes in the prediction caused by such perturbations. 
The metric is evaluated empirically across models, different perturbations, and datasets along with complementary statistical tests to exemplify the applicability of our proposed approach.
[294] SpotEdit: Evaluating Visually-Guided Image Editing Methods
Sara Ghazanfari,Wei-An Lin,Haitong Tian,Ersin Yumer
Main category: cs.CV
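TL;DR: SpotEdit is a new benchmark for evaluating visually-guided image editing methods, highlighting the hallucination problem of current models when handling visual cues.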
Details
Motivation: Existing evaluations of visually-guided image editing are too simple to reflect real-world editing challenges, so a more comprehensive benchmark is needed. Method: SpotEdit systematically evaluates methods spanning diffusion, autoregressive, and hybrid generative models, with a dedicated focus on hallucination. Result: SpotEdit reveals substantial performance differences among generative models on visually-guided image editing and exposes shortcomings of current leading models in handling visual cues. Conclusion: SpotEdit is a comprehensive benchmark for evaluating visually-guided image editing methods, and it finds that current models such as GPT-4o hallucinate when handling visual cues. Abstract: Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
[295] Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance
Xiangxiang Wang,Xuanyu Wang,YiJia Luo,Yongbin Yu,Manping Fan,Jingtao Zhang,Liyong Ren
Main category: cs.CV
TL;DR: This study proposes a dual innovation framework for vision-language models, significantly improving computational efficiency and offering real-time assistance to visually impaired users with minimal performance loss.
Details
Motivation: The motivation is to enhance the efficiency of vision-language models and provide real-time assistance to visually impaired individuals through improved scene perception, text recognition, and navigation. Method: The study introduces a dual technological innovation: a cross-modal differentiated quantization framework for vision-language models and a scene-aware vectorized memory multi-agent system for visually impaired assistance. Result: The modular framework reduced memory requirements from 38GB to 16GB, maintained model performance with only a 2.05% drop on MMBench, and achieved a response latency of 2.83-3.52 seconds, outperforming existing models. Conclusion: This research successfully improves computational efficiency and assistive technology for visually impaired users, achieving reduced memory requirements and latency while maintaining performance. Abstract: This study proposes a dual technological innovation framework, including a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular framework was developed implementing differentiated processing strategies, effectively reducing memory requirements from 38GB to 16GB while maintaining model performance. The multi-agent architecture combines scene classification, vectorized memory, and multimodal interaction, enabling persistent storage and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the system provides environmental information beyond the current view using historical memories. Experiments show the quantized 19B-parameter model only experiences a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements like the Molmo-7B series. 
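The memory figures quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a 16-bit baseline and decimal gigabytes (neither is stated in the abstract), counts weights only (ignoring activations and caches), and uses a hypothetical helper name `weight_memory_gb`.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weight storage in decimal GB for a given average bit-width."""
    return num_params * bits_per_param / 8 / 1e9

params = 19e9  # 19B parameters, as stated in the abstract

# Assumed 16-bit baseline: 19e9 * 2 bytes = 38 GB, matching the quoted 38GB.
fp16_gb = weight_memory_gb(params, 16)

# Average bit-width implied by the quoted 16GB footprint (weights only).
avg_bits = 16e9 * 8 / params

print(f"16-bit baseline: {fp16_gb:.1f} GB")
print(f"implied average bits per parameter at 16 GB: {avg_bits:.2f}")
```

Under these assumptions the 16GB footprint corresponds to roughly 6.7 bits per parameter on average, which is consistent with a scheme that assigns different precisions to different components rather than a uniform 4-bit or 8-bit format.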
The system maintains response latency between 2.83-3.52 seconds from scene analysis to initial speech output, substantially faster than non-streaming methods. This research advances computational efficiency and assistive technology, offering visually impaired users comprehensive real-time assistance in scene perception, text recognition, and navigation.
[296] Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning
Le Zhang,Fuping Wu,Arun Thirunavukarasu,Kevin Bronik,Thomas Nichols,Bartlomiej W. Papiez
Main category: cs.CV
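TL;DR: This paper proposes a method for training semantic segmentation models from coarse annotations, using two coupled CNNs to learn the true label distribution, and outperforms existing methods on multiple datasets.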
Details
Motivation: Pixel-level annotation is time-consuming and error-prone, whereas coarse annotations are quicker, cheaper, and easier to obtain, especially for medical images. Method: The paper proposes training a convolutional neural network with coarse annotations from both positive and negative classes, using two coupled CNNs to learn the true segmentation label distribution from noisy coarse annotations. Result: Experiments on the Cityscapes dataset and a retinal image dataset show that the method outperforms existing methods, particularly when the ratio of coarse annotations is small. Conclusion: The experimental results show that the method outperforms state-of-the-art methods on multiple datasets, especially when the ratio of coarse annotations is small. Abstract: Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by high fidelity with the characters of the noisy training annotations. We propose to add a complementary label learning that encourages estimating negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: Cityscapes dataset for multi-class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.
[297] BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding
Xuan-Bac Nguyen,Thanh-Dat Truong,Pawan Sinha,Khoa Luu
Main category: cs.CV
TL;DR: This paper introduces the BRAIN approach to address memory decay and inconsistency in brain signals, achieving superior performance in Vision-Brain Understanding tasks through continual learning, De-bias Contrastive Learning, and Angular-based Forgetting Mitigation techniques.
Details
Motivation: The study aims to overcome the challenges of memory decay and inconsistent brain signal representations over time, which hinder the performance of Vision-Brain Understanding models. Method: The authors propose the BRAIN method, which includes a continual learning setup, De-bias Contrastive Learning loss function, and Angular-based Forgetting Mitigation to address bias and forgetting in brain signal interpretation. Result: The BRAIN approach outperforms prior and non-continual learning methods, achieving State-of-the-Art performance across multiple benchmarks. Conclusion: The BRAIN approach effectively addresses memory decay and inconsistency in brain signals, achieving SOTA performance in Vision-Brain Understanding tasks. Abstract: Memory decay makes it harder for the human brain to recognize visual objects and retain details. Consequently, recorded brain signals become weaker, uncertain, and contain poor visual context over time. This paper presents one of the first vision-learning approaches to address this problem. First, we statistically and experimentally demonstrate the existence of inconsistency in brain signals and its impact on the Vision-Brain Understanding (VBU) model. Our findings show that brain signal representations shift over recording sessions, leading to compounding bias, which poses challenges for model learning and degrades performance. Then, we propose a new Bias-Mitigation Continual Learning (BRAIN) approach to address these limitations. In this approach, the model is trained in a continual learning setup and mitigates the growing bias from each learning step. A new loss function named De-bias Contrastive Learning is also introduced to address the bias problem. In addition, to prevent catastrophic forgetting, where the model loses knowledge from previous sessions, the new Angular-based Forgetting Mitigation approach is introduced to preserve learned knowledge in the model. 
Finally, the empirical experiments demonstrate that our approach achieves State-of-the-Art (SOTA) performance across various benchmarks, surpassing prior and non-continual learning methods.
[298] Explain and Monitor Deep Learning Models for Computer Vision using Obz AI
Neo Christopher Chung,Jakub Binda
Main category: cs.CV
TL;DR: This paper introduces Obz AI, a software ecosystem designed to improve the explainability and observability of computer vision systems.
Details
Motivation: Existing computer vision systems are regarded as "black boxes" that lack transparency, and there is a shortage of software solutions that integrate XAI techniques with robust knowledge management and monitoring frameworks. Method: The authors developed Obz AI, a comprehensive software ecosystem that provides a seamless integration pipeline from a Python client library to a full-stack analytics dashboard. Result: Obz AI lets machine learning engineers easily incorporate advanced XAI methods, extract and analyze features for outlier detection, and continuously monitor AI models in real time. Conclusion: Obz AI makes the decision-making mechanisms of deep learning models interpretable and promotes observability and responsible deployment of computer vision systems. Abstract: Deep learning has transformed computer vision (CV), achieving outstanding performance in classification, segmentation, and related tasks. Such AI-based CV systems are becoming prevalent, with applications spanning from medical imaging to surveillance. State-of-the-art models such as convolutional neural networks (CNNs) and vision transformers (ViTs) are often regarded as "black boxes," offering limited transparency into their decision-making processes. Despite recent advances in explainable AI (XAI), explainability remains underutilized in practical CV deployments. A primary obstacle is the absence of integrated software solutions that connect XAI techniques with robust knowledge management and monitoring frameworks. To close this gap, we have developed Obz AI, a comprehensive software ecosystem designed to facilitate state-of-the-art explainability and observability for vision AI systems. Obz AI provides a seamless integration pipeline, from a Python client library to a full-stack analytics dashboard. With Obz AI, a machine learning engineer can easily incorporate advanced XAI methodologies, extract and analyze features for outlier detection, and continuously monitor AI models in real time. By making the decision-making mechanisms of deep models interpretable, Obz AI promotes observability and responsible deployment of computer vision systems.
[299] Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance
Ayce Idil Aytekin,Helge Rhodin,Rishabh Dabral,Christian Theobalt
Main category: cs.CV
TL;DR: This paper proposes a diffusion-based framework that generates high-quality 3D reconstructions of hand-held objects from single RGB images by leveraging hand-object interaction as geometric guidance, avoiding the need for post-processing and improving robustness under occlusion.
Details
Motivation: The motivation is to overcome the limitations of prior methods that either rely on extensive post-processing or produce low-quality reconstructions, by directly generating high-quality object geometry during the diffusion process while ensuring plausible hand-object interactions. Method: The method involves conditioning a latent diffusion model on inpainted object appearance and utilizing inference-time guidance to optimize object reconstruction. This includes applying supervision to the velocity field and optimizing transformations of both hand and object using multi-modal geometric cues like normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. Additionally, signed distance field supervision and constraints on contact and non-intersection are enforced. Result: The result is a framework that generates high-quality, physically plausible 3D reconstructions of hand-held objects under occlusion, while effectively generalizing to in-the-wild scenarios. Conclusion: The paper concludes that their proposed diffusion-based framework can accurately and robustly reconstruct 3D geometry of hand-held objects from monocular RGB images, even under occlusion, while maintaining physical plausibility in hand-object interactions. Abstract: We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. 
Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.
[300] GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models
Lianming Huang,Haibo Hu,Qiao Li,Xin He,Nan Guan,Chun Jason Xue
Main category: cs.CV
TL;DR: GM-Skip is an inference acceleration framework for Transformer-based Vision-Language Models that significantly reduces latency while preserving performance.
Details
Motivation: Transformer-based Vision-Language Models (VLMs) perform well on many tasks, but their high computational cost limits deployment in latency-sensitive applications such as autonomous driving, so an effective way to accelerate inference without losing performance is needed. Method: GM-Skip uses a greedy, metric-guided block selection strategy that relies on metric feedback (e.g., accuracy, CIDEr) to identify redundant layers, combined with a reverse-order deletion mechanism that preserves early foundational blocks to avoid performance collapse. It also supports a tunable trade-off between sparsity and performance through a score-sparsity balance objective. Result: Experiments show that GM-Skip significantly improves inference speed while maintaining task performance across multiple tasks and datasets (such as COCO and CODA). On COCO, it raises single-object classification accuracy on the Person category from 19.1% to 87.3% while skipping more than 40% of Transformer blocks. In real-world deployment, integrated into an autonomous vehicle running Autoware.Universe, it reduces single-object detection latency by up to 45.4%. Conclusion: GM-Skip is a flexible, metric-adaptive Transformer block skipping framework that accelerates VLM inference while preserving output quality, with notable latency reductions in real-world deployment. Abstract: Transformer-based Vision-Language Models (VLMs) have achieved impressive performance on tasks such as image captioning, object recognition, and visual reasoning, but their high computational cost hinders deployment in latency-sensitive applications like autonomous driving. We introduce GM-Skip, a flexible and metric-adaptive framework for Transformer block skipping that accelerates VLM inference while preserving output quality. GM-Skip features a greedy, metric-guided block selection strategy that uses metric feedback (e.g., accuracy, CIDEr) to identify redundant layers, along with a reverse-order deletion mechanism that preserves early foundational blocks to avoid performance collapse. To support diverse deployment needs, it incorporates a tunable trade-off between sparsity and performance via a score-sparsity balance objective. Experiments across multiple tasks and datasets, including COCO and CODA, show that GM-Skip consistently improves inference speed while maintaining task performance. On the COCO dataset, GM-Skip improves single-object classification accuracy on the Person category from 19.1 percent to 87.3 percent while skipping more than 40 percent of Transformer blocks. 
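The greedy, metric-guided selection with reverse-order deletion described in the GM-Skip abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `greedy_reverse_skip`, the `evaluate` callback, the weight `lam`, the `protect` count, and the toy importance values are all assumptions made for illustration, and the additive objective (metric score plus `lam` times the fraction of skipped blocks) is only one plausible reading of a "score-sparsity balance objective".

```python
from typing import Callable, FrozenSet, List


def greedy_reverse_skip(
    num_blocks: int,
    evaluate: Callable[[FrozenSet[int]], float],
    lam: float = 0.5,
    protect: int = 1,
) -> List[int]:
    """Scan blocks from last to first and skip a block whenever doing so
    does not lower the (assumed) objective: score + lam * sparsity.

    `evaluate` is a hypothetical callback returning a task metric
    (e.g. accuracy) for the model run with the given blocks skipped.
    The first `protect` blocks are never considered for skipping.
    """
    skipped: set = set()

    def objective(candidate: set) -> float:
        sparsity = len(candidate) / num_blocks
        return evaluate(frozenset(candidate)) + lam * sparsity

    best = objective(skipped)
    # Reverse order: later blocks are tried first, early blocks are kept longest.
    for block in range(num_blocks - 1, protect - 1, -1):
        trial = skipped | {block}
        value = objective(trial)
        if value >= best:
            skipped, best = trial, value
    return sorted(skipped)


# Toy stand-in for a real evaluation loop: each block has a made-up
# "importance", and skipping a block subtracts it from a perfect score.
TOY_IMPORTANCE = [0.5, 0.3, 0.01, 0.02, 0.2, 0.01]


def toy_evaluate(skipped: FrozenSet[int]) -> float:
    return 1.0 - sum(TOY_IMPORTANCE[b] for b in skipped)


if __name__ == "__main__":
    print(greedy_reverse_skip(len(TOY_IMPORTANCE), toy_evaluate))
```

In practice `evaluate` would run the VLM on a validation set with the chosen blocks bypassed, which makes each greedy step expensive; the single reverse pass keeps the number of evaluations linear in the number of blocks.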
In real-world deployment, it achieves up to 45.4 percent latency reduction on single-object detection when integrated into an autonomous vehicle running Autoware.Universe, validating the effectiveness of its skip configurations and confirming its practical value in accelerating real-world inference.
[301] Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation
Ashwath Vaithinathan Aravindan,Abha Jha,Matthew Salaway,Atharva Sandeep Bhide,Duygu Nur Yaldiz
Main category: cs.CV
TL;DR: This paper proposes SKD-CAG, a novel defense method that effectively neutralizes backdoor attacks in text-to-image diffusion models without affecting generation quality.
Details
Motivation: Text-to-image diffusion models are vulnerable to backdoor attacks, where adversarial text triggers can manipulate outputs. Existing defenses for classification models are insufficient for generative models, prompting the need for an effective mitigation technique. Method: The paper proposes Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), which uses knowledge distillation and cross-attention mechanisms to correct responses to poisoned prompts while preserving image quality. Result: SKD-CAG achieved 100% removal accuracy for pixel backdoors and 93% for style-based attacks, outperforming existing methods while maintaining robustness and image quality. Conclusion: The paper concludes that targeted unlearning through SKD-CAG is a promising defense mechanism to secure generative models against backdoor attacks, effectively removing adversarial effects without compromising image fidelity. Abstract: Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well-explored, generative models lack effective mitigation techniques against such attacks. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers. Using the cross-attention mechanism, SKD-CAG neutralizes backdoor influences at the attention level, ensuring the targeted removal of adversarial effects. 
Extensive experiments show that our method outperforms existing approaches, achieving 100% removal accuracy for pixel backdoors and 93% for style-based attacks, without sacrificing robustness or image fidelity. Our findings highlight targeted unlearning as a promising defense to secure generative models. Code and model weights can be found at https://github.com/Mystic-Slice/Sealing-The-Backdoor.
[302] Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders
Yiming Tang,Arash Lagzian,Srinivas Anumasa,Qiran Zou,Trang Nguyen,Ehsan Adeli,Ching-Yu Cheng,Yilun Du,Dianbo Liu
Main category: cs.CV
TL;DR: LanSE introduces interpretable evaluation metrics for synthetic images, offering fine-grained assessments of generation quality along four dimensions: prompt match, realism, plausibility, and diversity.
Details
Motivation: Current AI-generated content evaluation metrics lack fine-grained assessment capabilities, hindering model selection, development, and commercial deployment. Method: Language-Grounded Sparse Encoders (LanSE) architecture identifies interpretable visual patterns and describes them in natural language for evaluation of synthetic images. Result: LanSE detects interpretable visual patterns with over 93% accuracy and quantifies four key dimensions of generation quality, revealing model differences not visible to existing metrics. Conclusion: LanSE provides a powerful tool for model selection, quality control of synthetic content, and model improvement in generative AI applications. Abstract: While the quality of AI-generated content, such as synthetic images, has become remarkably high, current evaluation metrics provide only coarse-grained assessments, failing to identify specific strengths and weaknesses that researchers and practitioners need for model selection and development, further limiting the scientific understanding and commercial deployment of these generative models. To address this, we introduce Language-Grounded Sparse Encoders (LanSE), a novel architecture that creates interpretable evaluation metrics by identifying interpretable visual patterns and automatically describing them in natural language. Through large-scale human evaluation (more than 11,000 annotations) and large multimodal model (LMM) based analysis, LanSE demonstrates reliable capabilities to detect interpretable visual patterns in synthetic images with more than 93% accuracy in natural images. LanSE further provides a fine-grained evaluation framework that quantifies four key dimensions of generation quality: prompt match, visual realism, physical plausibility, and content diversity. 
LanSE reveals nuanced model differences invisible to existing metrics, for instance, FLUX's superior physical plausibility and SDXL-medium's strong content diversity, while aligning with human judgments. By bridging interpretability with practical evaluation needs, LanSE offers all users of generative AI models a powerful tool for model selection, quality control of synthetic content, and model improvement. These capabilities directly address the need for public confidence and safety in AI-generated content, both critical for the future of generative AI applications.
[303] PriorFormer: A Transformer for Real-time Monocular 3D Human Pose Estimation with Versatile Geometric Priors
Mohamed Adjel,Vincent Bonnet
Main category: cs.CV
TL;DR: This paper introduces a lightweight Transformer-based model for accurate 3D pose estimation from 2D joint positions using a single camera, capable of adapting to both calibrated and uncalibrated environments with high efficiency and accuracy.
Details
Motivation: The motivation is to develop a versatile and lightweight model capable of adapting to different environments (calibrated or uncalibrated) for accurate 3D pose estimation using a single camera. Method: The model incorporates geometric priors such as segment lengths and camera intrinsics, utilizing a masking mechanism to handle missing priors during training and inference. It was trained using 3D keypoints from the AMASS dataset and synthetic 2D data generated by random camera poses and intrinsics, followed by an ablation study for validation. Result: Results indicate that the versatile model outperforms an expert model trained only on complete priors, achieving an average 3D joint center position estimation error of 36 mm while maintaining high accuracy even when priors are missing. The model runs in 380μs on GPU and 1800μs on CPU. Conclusion: The proposed lightweight Transformer-based lifter effectively maps 2D joint positions to 3D poses, demonstrating adaptability across various deployment scenarios, including calibrated and uncalibrated settings. It achieves high accuracy and low computational cost, making it suitable for embedded and low-power devices. Abstract: This paper proposes a new lightweight Transformer-based lifter that maps short sequences of human 2D joint positions to 3D poses using a single camera. The proposed model takes as input geometric priors including segment lengths and camera intrinsics and is designed to operate in both calibrated and uncalibrated settings. To this end, a masking mechanism enables the model to ignore missing priors during training and inference. This yields a single versatile network that can adapt to different deployment scenarios, from fully calibrated lab environments to in-the-wild monocular videos without calibration. The model was trained using 3D keypoints from the AMASS dataset with corresponding 2D synthetic data generated by sampling random camera poses and intrinsics. 
It was then compared to an expert model trained only on complete priors, and validation was done by conducting an ablation study. Results show that both camera and segment length priors improve performance and that the versatile model outperforms the expert, even when all priors are available, and maintains high accuracy when priors are missing. Overall, the average 3D joint center position estimation error was as low as 36 mm, improving on the state of the art by half a centimeter at a much lower computational cost. Indeed, the proposed model runs in 380μs on GPU and 1800μs on CPU, making it suitable for deployment on embedded platforms and low-power devices.
[304] GSVisLoc: Generalizable Visual Localization for Gaussian Splatting Scene Representations
Fadi Khatib,Dror Moran,Guy Trostianetsky,Yoni Kasten,Meirav Galun,Ronen Basri
Main category: cs.CV
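TL;DR: GSVisLoc is a visual localization method based on 3D Gaussian Splatting that uses feature matching and pose refinement to achieve efficient camera localization without retraining.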
Details
Motivation: Existing visual localization methods based on 3D Gaussian Splatting require additional data or modifications to the model, so a more efficient and general approach is needed. Method: GSVisLoc localizes in three steps: coarse matching, fine matching, and pose refinement. It produces scene features by downsampling and encoding the 3D Gaussians, obtains image features by encoding image patches, and matches the two. Result: GSVisLoc achieves localization performance on standard datasets that is comparable to or better than existing methods and generalizes effectively to novel scenes. Conclusion: As a new visual localization method built on the 3D Gaussian Splatting scene representation, GSVisLoc achieves efficient localization without modification or retraining and performs well on both indoor and outdoor scenes. Abstract: We introduce GSVisLoc, a visual localization method designed for 3D Gaussian Splatting (3DGS) scene representations. Given a 3DGS model of a scene and a query image, our goal is to estimate the camera's position and orientation. We accomplish this by robustly matching scene features to image features. Scene features are produced by downsampling and encoding the 3D Gaussians while image features are obtained by encoding image patches. Our algorithm proceeds in three steps, starting with coarse matching, then fine matching, and finally applying pose refinement for an accurate final estimate. Importantly, our method leverages the explicit 3DGS scene representation for visual localization without requiring modifications, retraining, or additional reference images. We evaluate GSVisLoc on both indoor and outdoor scenes, demonstrating competitive localization performance on standard benchmarks while outperforming existing 3DGS-based baselines. Moreover, our approach generalizes effectively to novel scenes without additional training.
[305] MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong,Juhua Hu,Mian Zhang,Ming Yin,Yanjie Fu,Qi Qian
Main category: cs.CV
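TL;DR: This paper proposes MMTok, a vision token selection method that combines multimodal information and uses a maximum coverage criterion to improve the inference efficiency and performance of vision-language models.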
Details
Motivation: Existing methods mainly use unimodal information for vision token pruning, ignoring the multimodal nature of vision-language tasks, and lack a generic criterion applicable to different modalities. Method: Vision token selection is formulated as a maximum coverage problem in which both vision and text tokens are used to select informative vision tokens, and a VLM agent further improves the quality of the text tokens. Result: Under the maximum coverage criterion on the POPE dataset, MMTok achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B; with only four vision tokens, it still preserves 87.7% of the performance on LLaVA-1.5-7B. Conclusion: Combining multimodal information for vision token selection can clearly outperform unimodal methods and delivers good performance and efficiency across datasets and models. Abstract: Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. 
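The coverage criterion in the MMTok abstract (a subset of vision tokens chosen to cover both the text tokens and the full set of vision tokens) can be illustrated with a standard greedy routine. This is a sketch under stated assumptions, not the paper's algorithm: the soft coverage objective (sum over targets of the best similarity to any selected token), the use of clipped cosine similarity, and the names `greedy_coverage` and `select_vision_tokens` are all hypothetical choices made here.

```python
import numpy as np


def greedy_coverage(sim: np.ndarray, k: int) -> list:
    """Greedily pick up to k rows of `sim` (candidates x targets) so that
    sum_j max_{i in selected} sim[i, j] grows as much as possible per step.
    Assumes `sim` is non-negative."""
    n_candidates, n_targets = sim.shape
    covered = np.zeros(n_targets)  # best similarity reached so far per target
    selected = []
    for _ in range(min(k, n_candidates)):
        # Marginal gain of adding each candidate to the current selection.
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same token twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected


def select_vision_tokens(vision: np.ndarray, text: np.ndarray, k: int) -> list:
    """Select k vision tokens that cover both text tokens and all vision
    tokens, using clipped cosine similarity (an assumption of this sketch)."""
    v = vision / np.linalg.norm(vision, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    targets = np.concatenate([t, v], axis=0)
    sim = np.clip(v @ targets.T, 0.0, None)
    return greedy_coverage(sim, k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision_tokens = rng.normal(size=(16, 8))  # made-up embeddings
    text_tokens = rng.normal(size=(5, 8))
    print(select_vision_tokens(vision_tokens, text_tokens, k=4))
```

Greedy selection is a natural fit because coverage objectives of this form are monotone submodular, for which the greedy choice carries the classic (1 - 1/e) approximation guarantee; whether MMTok uses exactly this objective or solver is not stated in the abstract.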
These results highlight the effectiveness of coverage in token selection.
[306] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang,Zhangwei Gao,Lixin Gu,Hengjun Pu,Long Cui,Xingguang Wei,Zhaoyang Liu,Linglin Jing,Shenglong Ye,Jie Shao,Zhaokai Wang,Zhe Chen,Hongjie Zhang,Ganlin Yang,Haomin Wang,Qi Wei,Jinhui Yin,Wenhao Li,Erfei Cui,Guanzhou Chen,Zichen Ding,Changyao Tian,Zhenyu Wu,Jingjing Xie,Zehao Li,Bowen Yang,Yuchen Duan,Xuehui Wang,Songze Li,Xiangyu Zhao,Haodong Duan,Nianchen Deng,Bin Fu,Yinan He,Yi Wang,Conghui He,Botian Shi,Junjun He,Yingtong Xiong,Han Lv,Lijun Wu,Wenqi Shao,Kaipeng Zhang,Huipeng Deng,Biqing Qi,Jiaye Ge,Qipeng Guo,Wenwei Zhang,Wanli Ouyang,Limin Wang,Min Dou,Xizhou Zhu,Tong Lu,Dahua Lin,Jifeng Dai,Bowen Zhou,Weijie Su,Kai Chen,Yu Qiao,Wenhai Wang,Gen Luo
Main category: cs.CV
TL;DR: InternVL 3.5 is an improved open-source multimodal model with enhanced reasoning and efficiency, achieving state-of-the-art results among open-source MLLMs.
Details
Motivation: To significantly advance versatility, reasoning capability, and inference efficiency in multimodal models. Method: Cascade Reinforcement Learning (Cascade RL) framework, Visual Resolution Router (ViR), and Decoupled Vision-Language Deployment (DvD) strategy. Result: Achieves up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to InternVL3. Conclusion: InternVL 3.5 narrows the performance gap with commercial models while supporting novel capabilities like GUI interaction and embodied agency. Abstract: We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. 
All models and code are publicly released.
[307] ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models
Haitang Feng,Jie Liu,Jie Tang,Gangshan Wu,Beiqi Chen,Jianhuang Lai,Guangcong Wang
Main category: cs.CV
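TL;DR: ObjFiller-3D is a new method for high-quality, consistent 3D object completion that achieves better reconstruction than traditional 2D image inpainting approaches.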