cs.CL

[1] Uncovering the Vulnerability of Large Language Models in the Financial Domain via Risk Concealment

Gang Cheng,Haibo Jin,Wenbin Zhang,Haohan Wang,Jun Zhuang

Main category: cs.CL

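TL;DR: This paper proposes Risk-Concealment Attacks (RCA), a new multi-turn attack framework that exposes regulatory-compliance vulnerabilities of large language models in the financial domain, and builds the FIN-Bench benchmark for systematic evaluation. Experiments show that all tested mainstream models are susceptible, highlighting the shortcomings of current alignment techniques.

Details Motivation: Existing red-teaming research focuses mainly on harmful content and overlooks regulatory risk in financial applications, so compliance vulnerabilities specific to financial LLMs need dedicated study. Method: The authors propose Risk-Concealment Attacks (RCA), a multi-turn attack framework that iteratively conceals regulatory risk to elicit responses that appear compliant but actually violate regulations, and they construct FIN-Bench, a safety evaluation benchmark for the financial domain. Result: Experiments on FIN-Bench show that RCA bypasses nine mainstream LLMs with an average attack success rate of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1. Conclusion: Current alignment techniques have serious gaps with respect to financial regulatory risk, and stronger moderation mechanisms are urgently needed; the study offers practical insights for building more robust, domain-aware LLM alignment. Abstract: Large Language Models (LLMs) are increasingly integrated into financial applications, yet existing red-teaming research primarily targets harmful content, largely neglecting regulatory risks. In this work, we aim to investigate the vulnerability of financial LLMs through red-teaming approaches. We introduce Risk-Concealment Attacks (RCA), a novel multi-turn framework that iteratively conceals regulatory risks to provoke seemingly compliant yet regulatory-violating responses from LLMs. To enable systematic evaluation, we construct FIN-Bench, a domain-specific benchmark for assessing LLM safety in financial contexts. Extensive experiments on FIN-Bench demonstrate that RCA effectively bypasses nine mainstream LLMs, achieving an average attack success rate (ASR) of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1. These findings reveal a critical gap in current alignment techniques and underscore the urgent need for stronger moderation mechanisms in financial domains. We hope this work offers practical insights for advancing robust and domain-aware LLM alignment.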

[2] No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

Iván Vicente Moreno Cencerrado,Arnau Padrés Masdemont,Anton Gonzalvez Hawthorne,David Demitri Africa,Lorenzo Pacchiardi

Main category: cs.CL

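TL;DR: This study trains linear probes on a large language model's activations, taken before it generates an answer, to predict whether the answer will be correct. It finds a generalizable "in-advance correctness direction" whose predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation, and shows that the same direction also correlates with model confidence.

Details Motivation: To examine whether large language models anticipate the correctness of their answers before generating them, and thereby to understand their internal self-assessment mechanisms. Method: Activations are extracted after the question is read but before any answer tokens are generated; linear probes are trained on them to predict the correctness of the forthcoming answer, and their predictive power is validated across several model families and datasets. Result: An "in-advance correctness direction" predicts model performance in distribution and on diverse out-of-distribution knowledge tasks, outperforming black-box baselines and verbalised confidence; predictive power saturates in intermediate layers and the probe score correlates strongly with "I don't know" responses, but generalisation falters on mathematical reasoning tasks. Conclusion: Large language models carry an internal signal about answer correctness before generation; this self-assessment signal lives in intermediate-layer representations and is related to confidence, revealing part of the model's internal self-monitoring. Abstract: Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this "in-advance correctness direction" trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.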
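
The probing setup described in [2] can be illustrated with a minimal sketch. It assumes synthetic 8-dimensional "activations" and a hypothetical hidden direction that determines correctness; no real model is loaded, and the plain gradient-descent logistic regression below stands in for whatever probe training the authors actually used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit a linear (logistic-regression) probe with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)
DIM = 8
# Assumption: a single hidden direction encodes "will answer correctly".
hidden_direction = [rng.gauss(0, 1) for _ in range(DIM)]


def make_example():
    # Synthetic stand-in for a pre-generation activation and its correctness label.
    x = [rng.gauss(0, 1) for _ in range(DIM)]
    y = 1 if dot(hidden_direction, x) + rng.gauss(0, 0.5) > 0 else 0
    return x, y


train_set = [make_example() for _ in range(400)]
held_out = [make_example() for _ in range(200)]

w, b = train_probe([x for x, _ in train_set], [y for _, y in train_set])
# The probe score is the projection of an activation onto the learned direction.
accuracy = sum((dot(w, x) + b > 0) == (y == 1) for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, `x` would be an intermediate-layer activation captured at the last prompt token, and the label would come from grading the answer the model then generates.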

[3] Interdisciplinary Research in Conversation: A Case Study in Computational Morphology for Language Documentation

Enora Rice,Katharina von der Wense,Alexis Palmer

Main category: cs.CL

TL;DR: This paper examines the disconnect between computational morphology and real-world language documentation and argues for reorienting research through User-Centered Design (UCD). A case study of the GlossLM model shows how user feedback exposes shortcomings of existing systems in practice and points to more relevant and effective research directions.

Details Motivation: Research outputs from computational morphology see limited use in real language documentation, reflecting a broader gap between research and practice in NLP that user-centered design could help close. Method: A small-scale user study of GlossLM, a state-of-the-art multilingual IGT generation model, in which three documentary linguists were interviewed to analyze usability problems in real settings. Result: Although GlossLM performs well on metrics, it fails to meet core usability needs in real documentation work, exposing gaps around model constraints, label standardization, segmentation, and personalization. Conclusion: Centering users not only makes tools more effective but also surfaces richer and more relevant research directions, bringing NLP research back to real application contexts. Abstract: Computational morphology has the potential to support language documentation through tasks like morphological segmentation and the generation of Interlinear Glossed Text (IGT). However, our research outputs have seen limited use in real-world language documentation settings. This position paper situates the disconnect between computational morphology and language documentation within a broader misalignment between research and practice in NLP and argues that the field risks becoming decontextualized and ineffectual without systematic integration of User-Centered Design (UCD). To demonstrate how principles from UCD can reshape the research agenda, we present a case study of GlossLM, a state-of-the-art multilingual IGT generation model. Through a small-scale user study with three documentary linguists, we find that despite strong metric-based performance, the system fails to meet core usability needs in real documentation contexts. These insights raise new research questions around model constraints, label standardization, segmentation, and personalization. We argue that centering users not only produces more effective tools, but surfaces richer, more relevant research directions.

[4] Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts

Zineddine Tighidet,Andrea Mogini,Hedi Ben-younes,Jiali Mei,Patrick Gallinari,Benjamin Piwowarski

Main category: cs.CL

TL;DR: This paper explores how entropy neurons in transformers suppress context copying when there is conflicting information, enhancing the understanding of LLM internal dynamics.

Details Motivation: The inconsistent behavior of LLMs when facing conflicting parametric and contextual knowledge requires an explanation, and entropy neurons are potential candidates for influencing this behavior. Method: Investigated the role of entropy neurons in transformers by analyzing their impact on the generation process upon ablating them. Result: Findings show that entropy neurons suppress context copying, and ablating them significantly alters the generation process. Conclusion: Entropy neurons play a crucial role in suppressing context copying in LLMs when dealing with conflicting contextual and parametric information. Abstract: The behavior of Large Language Models (LLMs) when facing contextual information that conflicts with their internal parametric knowledge is inconsistent, with no generally accepted explanation for the expected outcome distribution. Recent work has identified in autoregressive transformer models a class of neurons -- called entropy neurons -- that produce a significant effect on the model output entropy while having an overall moderate impact on the ranking of the predicted tokens. In this paper, we investigate the preliminary claim that these neurons are involved in inhibiting context copying behavior in transformers by looking at their role in resolving conflicts between contextual and parametric information. We show that entropy neurons are responsible for suppressing context copying across a range of LLMs, and that ablating them leads to a significant change in the generation process. These results enhance our understanding of the internal dynamics of LLMs when handling conflicting information.

[5] Pluralistic Alignment for Healthcare: A Role-Driven Framework

Jiayou Zhong,Anudeex Shetty,Chao Jia,Xuanrui Lin,Usman Naseem

Main category: cs.CL

TL;DR: This paper introduces EthosAgents, a novel pluralistic alignment method for large language models tailored to healthcare domains, demonstrating its effectiveness in simulating diverse perspectives and values.

Details Motivation: The motivation stems from the challenges in deploying large language models in sensitive domains like healthcare, where existing alignment approaches fail to account for personal, cultural, and situational factors shaping pluralism. Method: The study proposes EthosAgents, a lightweight and generalizable pluralistic alignment approach designed to simulate diverse perspectives and values. It empirically evaluates the approach across three modes and seven models of varying sizes. Result: The findings show that EthosAgents advances pluralistic alignment across all three modes and models tested, highlighting the need for adaptable and normatively aware approaches in healthcare contexts. Conclusion: The study concludes that health-related pluralism demands adaptable and normatively aware approaches for aligning large language models, offering insights into respecting diversity in high-stakes domains. Abstract: As large language models are increasingly deployed in sensitive domains such as healthcare, ensuring their outputs reflect the diverse values and perspectives held across populations is critical. However, existing alignment approaches, including pluralistic paradigms like Modular Pluralism, often fall short in the health domain, where personal, cultural, and situational factors shape pluralism. Motivated by the aforementioned healthcare challenges, we propose a first lightweight, generalizable, pluralistic alignment approach, EthosAgents, designed to simulate diverse perspectives and values. We empirically show that it advances the pluralistic alignment for all three modes across seven varying-sized open and closed models. Our findings reveal that health-related pluralism demands adaptable and normatively aware approaches, offering insights into how these models can better respect diversity in other high-stakes domains.

[6] Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Shuaiqi Wang,Vikas Raunak,Arturs Backurs,Victor Reis,Pei Zhou,Sihao Chen,Longqi Yang,Zinan Lin,Sergey Yekhanin,Giulia Fanti

Main category: cs.CL

TL;DR: This paper presents Struct-Bench, a framework and benchmark for evaluating synthetic versions of structured datasets that contain natural language. Dataset structure is described with a Context-Free Grammar (CFG), and the benchmark provides real and synthetic datasets, evaluation metrics, and a leaderboard to advance research on privacy-preserving synthetic data methods.

Details Motivation: Existing synthetic data evaluation methods struggle to capture the structural properties and correlations of structured data (e.g., tabular data), especially in enterprise settings where fields contain natural language, so a dedicated evaluation framework is needed. Method: The Struct-Bench framework requires users to provide a Context-Free Grammar (CFG) representation of their dataset structure; the benchmark comprises 5 real-world and 2 synthetic datasets, each annotated with a CFG; it bundles several evaluation metrics and a leaderboard; and a case study shows how it can be used to improve Private Evolution (PE). Result: Experiments show that even state-of-the-art differentially private synthetic data generation methods are strongly challenged by the Struct-Bench datasets; the framework provides a standardized platform for evaluating structured synthetic data. Conclusion: Struct-Bench offers an effective, extensible, and practical evaluation framework for differentially private synthetic structured data that contains natural language, and it should help advance research in this area. Abstract: Differentially private (DP) synthetic data generation is a promising technique for utilizing private datasets that otherwise cannot be exposed for model training or other analytics. While much research literature has focused on generating private unstructured text and image data, in enterprise settings, structured data (e.g., tabular) is more common, often including natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets, each annotated with CFGs. We show that these datasets demonstrably present a great challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of different metrics and a leaderboard, thereby providing researchers a standardized evaluation platform to benchmark and investigate privacy-preserving synthetic data generation methods. Further, we also present a case study showing how to use Struct-Bench to improve the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard have been made publicly available at https://struct-bench.github.io.

[7] A Survey on Retrieval And Structuring Augmented Generation with Large Language Models

Pengcheng Jiang,Siru Ouyang,Yizhu Jiao,Ming Zhong,Runchu Tian,Jiawei Han

Main category: cs.CL

TL;DR: This paper explores how Retrieval And Structuring Augmented Generation can address the limitations of Large Language Models in real-world applications, by integrating dynamic information retrieval with structured knowledge representations.

Details Motivation: The motivation of the paper is to address critical challenges faced by Large Language Models (LLMs) in real-world applications, such as hallucination generation, outdated knowledge, and limited domain expertise. Method: The paper surveys and analyzes retrieval mechanisms, text structuring techniques, and integration methods of structured knowledge representations with LLMs. Result: The paper examines retrieval mechanisms, explores text structuring techniques, and investigates the integration of structured representations with LLMs, identifying technical challenges and highlighting research opportunities. Conclusion: The paper concludes that RAS Augmented Generation can address the limitations of LLMs, and it provides a comprehensive overview of RAS methods, applications, and future research directions. Abstract: Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explores text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigates how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.

[8] SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Iman Barati,Mostafa Amiri,Heshaam Faili

Main category: cs.CL

TL;DR: SearchInstruct is a new method for building high-quality instruction datasets: it expands a set of human-written questions and generates accurate answers for them, improving large language model performance in specific domains.

Details Motivation: To address the challenges of building supervised fine-tuning (SFT) datasets for specific domains, such as data scarcity and domain constraints. Method: SearchInstruct starts from a small set of domain-relevant, human-written questions, systematically expands them with a large language model, and dynamically retrieves domain-relevant resources to generate accurate answers. Result: Experimental evaluation shows that SearchInstruct improves the diversity and quality of SFT datasets and yields measurable gains in LLM performance in specialized domains. Conclusion: SearchInstruct is an effective method that not only builds high-quality SFT datasets but also supports tasks such as model editing, improving LLM performance in specific domains. Abstract: Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: https://github.com/mostafaamiri/SearchInstruct

[9] PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models

Zaur Gouliev,Jennifer Waters,Chengqian Wang

Main category: cs.CL

TL;DR: This paper presents a systematic evaluation of multilingual disinformation detection. Using the PolyTruth Disinfo Corpus, which spans more than 25 languages, it compares five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a fake-vs-true classification task. RemBERT performs best in low-resource languages, while mBERT and XLM are limited when training data is scarce, showing both the potential and the limits of current AI for multilingual disinformation detection.

Details Motivation: Most AI models are benchmarked only on English, yet disinformation spreads rapidly across languages, so detection ability needs to be evaluated in multilingual settings. Method: The authors build PolyTruth Disinfo Corpus, a new multilingual corpus of 60,486 statement pairs (false claim vs. factual correction) covering more than 25 languages, five language families, and multiple topic areas, and systematically compare the classification performance of five multilingual transformer models on it. Result: RemBERT achieves the best overall accuracy and stands out in low-resource languages; mBERT and XLM degrade noticeably when training data is insufficient; the models differ considerably in multilingual disinformation detection. Conclusion: Multilingual transformer models show promise for disinformation detection, but their performance depends heavily on how well-resourced a language is, so real-world deployment must account for data availability and model choice. Abstract: Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models: mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5 on a common fake-vs-true machine learning classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains up for debate. To facilitate evaluation, we introduce PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages that collectively cover five language families and a broad topical range including politics, health, climate, finance, and conspiracy, half of which are fact-checked disinformation claims verified by an augmented MindBugs Discovery dataset. Our experiments revealed performance variations. Models such as RemBERT achieved better overall accuracy, particularly excelling in low-resource languages, whereas models like mBERT and XLM exhibit considerable limitations when training data is scarce. We provide a discussion of these performance patterns and implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.

[10] Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs

Mobina Pournemat,Keivan Rezaei,Gaurang Sriramanan,Arman Zarei,Jiaxiang Fu,Yang Wang,Hamid Eghbalzadeh,Soheil Feizi

Main category: cs.CL

TL;DR: This study evaluates how well large language models reason about discrete probability distributions, showing that while larger models perform better, they still struggle with notation sensitivity and longer contexts.

Details Motivation: The motivation is to understand the probabilistic reasoning abilities of LLMs, especially given their inconsistent behavior on tasks requiring such reasoning, despite their success in language understanding and generation. Method: The researchers evaluated LLMs on three tasks—mode identification, maximum likelihood estimation, and sample generation—using prompts based on observations from discrete probability distributions. They conducted comprehensive empirical evaluations to assess model performance. Result: The results show a clear performance gap between smaller and larger models, with larger models performing better in inference and sample generation. However, the models showed limitations, such as sensitivity to notation and over 60% performance degradation as context length increased. Conclusion: The study concludes that larger LLMs show better probabilistic reasoning capabilities, particularly in inference and sample generation, but they still face significant limitations, including sensitivity to notation and performance degradation with increased context length. Abstract: Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks, mode identification, maximum likelihood estimation, and sample generation, by prompting them to provide responses to queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. 
Through comprehensive empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models, with the latter demonstrating stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to variations in the notation utilized to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
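
To make the three tasks in [10] concrete, the following sketch computes reference answers from a handful of observations of a discrete joint distribution. The weather data and helper names are illustrative assumptions, not taken from the paper; the maximum likelihood estimate of a categorical distribution is simply the empirical frequency of each outcome.

```python
import random
from collections import Counter

# Hypothetical observations of a joint distribution over (weather, ground).
observations = [
    ("sunny", "dry"), ("sunny", "dry"), ("rainy", "wet"),
    ("sunny", "wet"), ("rainy", "wet"), ("sunny", "dry"),
]


def mle_distribution(obs):
    """Maximum likelihood estimate for a categorical distribution: relative frequencies."""
    counts = Counter(obs)
    total = len(obs)
    return {outcome: count / total for outcome, count in counts.items()}


def mode(dist):
    """Mode identification: the outcome with the highest estimated probability."""
    return max(dist, key=dist.get)


def conditional(obs, index, value):
    """Estimate the conditional distribution given that variable `index` equals `value`."""
    return mle_distribution([o for o in obs if o[index] == value])


def sample(dist, k, rng):
    """Sample generation: draw k outcomes in proportion to their estimated probability."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=k)


joint = mle_distribution(observations)
given_wet = conditional(observations, index=1, value="wet")
draws = sample(joint, k=5, rng=random.Random(0))

print("joint mode:", mode(joint))
print("mode given wet ground:", mode(given_wet))
print("samples:", draws)
```

An evaluation in this style would compare a model's answer to each query against these reference values; the conditional query is where marginalization skill is exercised.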

[11] Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

Ozan Gokdemir,Neil Getty,Robert Underwood,Sandeep Madireddy,Franck Cappello,Arvind Ramanathan,Ian T. Foster,Rick L. Stevens

Main category: cs.CL

TL;DR: A framework for creating MCQA benchmarks from scientific papers improves small language model evaluation through reasoning-trace retrieval.

Details Motivation: To keep up with the rapid growth of scientific knowledge and ensure language models are tested on current literature. Method: A scalable, modular framework for generating MCQA benchmarks from scientific papers, including PDF parsing, semantic chunking, question generation, and model evaluation. Result: Generated over 16,000 MCQs and improved small model performance on a 2023 exam using reasoning-trace retrieval. Conclusion: The proposed framework successfully generates MCQA benchmarks and enhances small language model performance using reasoning-trace retrieval. Abstract: As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.

[12] RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems

Adarsh Srinivasan,Jacob Dineen,Muhammad Umar Afzal,Muhammad Uzair Sarfraz,Irbaz B. Riaz,Ben Zhou

Main category: cs.CL

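TL;DR: RECAP is an inference-time framework that requires no retraining; it improves the emotional capabilities of medical LLMs through structured emotional reasoning and markedly raises empathy performance on several benchmarks.

Details Motivation: Large language models in medical settings often lack emotional attunement and fail to meet the communication needs of distressed patients, which undermines trust and adherence. Method: The RECAP framework, grounded in appraisal theory, decomposes empathy into five stages (reflect, extract, calibrate, align, and produce) and exposes auditable per-dimension rating signals at inference time. Result: On EmoBench, SECEU, and EQ-Bench, emotional reasoning improves by 22-28% on 8B models and by 10-13% on larger models; clinician evaluations rate its empathetic communication as superior. Conclusion: Modular, theory-grounded prompting can systematically enhance emotional intelligence in medical AI while preserving the transparency and accountability required for deployment. Abstract: Large language models in healthcare often miss critical emotional cues, delivering medically sound but emotionally flat advice. This is especially problematic in clinical contexts where patients are distressed and vulnerable, and require empathic communication to support safety, adherence, and trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework that adds structured emotional reasoning without retraining. By decomposing empathy into transparent appraisal-theoretic stages and exposing per-dimension Likert signals, RECAP produces nuanced, auditable responses. Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning by 22-28% on 8B models and 10-13% on larger models over zero-shot baselines. Clinician evaluations further confirm superior empathetic communication. RECAP shows that modular, theory-grounded prompting can systematically enhance emotional intelligence in medical AI while preserving the accountability required for deployment.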

[13] Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

Yijun Liu,Yixuan Wang,Yuzhuang Xu,Shiyu Ji,Yang Xu,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

TL;DR: The paper proposes Judge Q, a new method that introduces a soft token list to assess the importance of keys and values in the KV cache more effectively, preserving decoding quality when the KV cache is reduced.

Details Motivation: Existing KV cache eviction methods focus too heavily on local information and may overlook crucial global information, hurting decoding quality and efficiency. Method: Judge Q is a training method that tunes only the model's embedding layer: a soft token list is concatenated to the end of the input sequence, and the attention map from these tokens to the original input is trained to match that of the actual decoded tokens. Result: Under the same eviction budget, performance degrades less than with existing methods; experiments show a gain of about 1 point on LongBench and over 3 points on RULER. Conclusion: Judge Q captures global information effectively and improves model behavior under KV cache eviction, and it can be integrated into existing open-source models with little training overhead. Abstract: Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model's embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
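
The query-based scoring that [13] builds on can be sketched in a few lines. This is a simplified single-head illustration under stated assumptions: the toy keys and queries are invented, and real eviction methods work per head and per layer with additional details. Under the paper's proposal, the queries would come from the trained soft tokens rather than from the last prefill window, while the scoring step stays the same.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Sum, over all scoring queries, the attention weight each cached key receives."""
    scale = math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight
    return scores


def select_kept_positions(queries, keys, budget):
    """Keep the `budget` highest-scoring cache positions, returned in original order."""
    scores = importance_scores(queries, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache with five 2-dimensional keys and two scoring queries (illustrative values).
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.1], [4.0, 0.0]]
queries = [[1.0, 0.0], [2.0, 0.0]]

kept = select_kept_positions(queries, keys, budget=2)
print("kept positions:", kept)
```

Because each query's attention weights sum to one, the total importance mass equals the number of scoring queries, which makes budgets comparable across different query sets.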

[14] Towards Automated Error Discovery: A Study in Conversational AI

Dominic Petrak,Thy Thy Tran,Iryna Gurevych

Main category: cs.CL

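TL;DR: This paper proposes SEEED, an encoder-based error detection framework for automatically discovering unknown errors in conversational AI. By improving the Soft Nearest Neighbor Loss and the sample selection strategy, it raises error detection accuracy across multiple datasets.

Details Motivation: Existing large language models have limited ability to detect error types that are not explicitly specified, such as those caused by updates to the response-generation model or shifts in user behavior, so they cannot reliably keep undesirable behavior from reaching users. Method: The authors introduce the Automated Error Discovery framework and the SEEED method, which enhances the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduces Label-Based Sample Ranking to select highly contrastive examples for better representation learning. Result: SEEED outperforms baselines including GPT-4o and Phi-4 on multiple error-annotated dialogue datasets, improves accuracy for detecting unknown errors by up to 8 points, and generalizes well to unknown intent detection. Conclusion: SEEED can effectively identify and define unknown errors in dialogue systems, generalizes well, and offers a practical route toward deploying safe and reliable conversational AI. Abstract: Although LLM-based conversational agents demonstrate strong fluency and coherence, they still produce undesirable behaviors (errors) that are challenging to prevent from reaching users during deployment. Recent research leverages large language models (LLMs) to detect errors and guide response-generation models toward improvement. However, current LLMs struggle to identify errors not explicitly specified in their instructions, such as those arising from updates to the response-generation model or shifts in user behavior. In this work, we introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and propose SEEED (Soft Clustering Extended Encoder-Based Error Detection) as an encoder-based approach to its implementation. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning. SEEED outperforms adapted baselines -- including GPT-4o and Phi-4 -- across multiple error-annotated dialogue datasets, improving the accuracy for detecting unknown errors by up to 8 points and demonstrating strong generalization to unknown intent detection.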
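
For readers unfamiliar with the loss that [14] modifies, here is a minimal sketch of the standard Soft Nearest Neighbor Loss on a toy batch. It does not include SEEED's amplified distance weighting for negative samples, which the abstract does not specify; the points, labels, and the small `eps` stabilizer are illustrative assumptions.

```python
import math


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_nearest_neighbor_loss(points, labels, temperature=1.0, eps=1e-12):
    """Average negative log ratio of same-class neighbor mass to all-neighbor mass.

    Low values mean each point's close neighbors mostly share its label.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        same_class_mass = 0.0
        all_mass = 0.0
        for j in range(n):
            if j == i:
                continue
            weight = math.exp(-squared_distance(points[i], points[j]) / temperature)
            all_mass += weight
            if labels[j] == labels[i]:
                same_class_mass += weight
        # eps avoids log(0) when a point has no same-class neighbor in the batch.
        total += -math.log((same_class_mass + eps) / (all_mass + eps))
    return total / n


# Two tight clusters in 2-D (illustrative embeddings).
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]

loss_clustered = soft_nearest_neighbor_loss(points, labels=[0, 0, 1, 1])
loss_entangled = soft_nearest_neighbor_loss(points, labels=[0, 1, 0, 1])

print(f"labels match clusters: {loss_clustered:.4f}")
print(f"labels cut across clusters: {loss_entangled:.4f}")
```

Minimizing this quantity pulls same-label embeddings together relative to other points, which is why it suits representation learning for clustering-style error discovery.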

[15] Evaluating Large Language Models for Evidence-Based Clinical Question Answering

Can Wang,Yiqun Chen

Main category: cs.CL

TL;DR: LLMs show promise for evidence-based clinical question answering: they perform best on structured guidelines, and retrieval augmentation can improve accuracy, but model ability remains limited and stratified evaluation is needed.

Details Motivation: To evaluate the ability of Large Language Models (LLMs) to answer complex, evidence-based medical questions. Method: GPT-4o-mini and GPT-5 are tested on a multi-source benchmark drawn from Cochrane systematic reviews and clinical guidelines, and their accuracy and ability to reason about evidence quality are analyzed. Result: Accuracy is highest on structured guideline recommendations (90%) and lower on narrative guideline and systematic review questions (60-70%). Accuracy correlates strongly with citation count: each doubling of citations is associated with roughly a 30% increase in the odds of a correct answer. Supplying relevant literature markedly improves accuracy, whereas random literature reduces it. Conclusion: LLMs show both promise and current limitations for evidence-based clinical question answering; retrieval-augmented prompting can improve factual accuracy and alignment with source evidence, but stratified evaluation by specialty and question type remains essential. Abstract: Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications, motivating rigorous evaluation of their ability to answer nuanced, evidence-based questions. We curate a multi-source benchmark drawing from Cochrane systematic reviews and clinical guidelines, including structured recommendations from the American Heart Association and narrative guidance used by insurers. Using GPT-4o-mini and GPT-5, we observe consistent performance patterns across sources and clinical domains: accuracy is highest on structured guideline recommendations (90%) and lower on narrative guideline and systematic review questions (60--70%). We also find a strong correlation between accuracy and the citation count of the underlying systematic reviews, where each doubling of citations is associated with roughly a 30% increase in the odds of a correct answer. Models show moderate ability to reason about evidence quality when contextual information is supplied. When we incorporate retrieval-augmented prompting, providing the gold-source abstract raises accuracy on previously incorrect items to 0.79; providing top 3 PubMed abstracts (ranked by semantic relevance) improves accuracy to 0.23, while random abstracts reduce accuracy (0.10, within temperature variation). These effects are mirrored in GPT-4o-mini, underscoring that source clarity and targeted retrieval -- not just model size -- drive performance. Overall, our results highlight both the promise and current limitations of LLMs for evidence-based clinical question answering. Retrieval-augmented prompting emerges as a useful strategy to improve factual accuracy and alignment with source evidence, while stratified evaluation by specialty and question type remains essential to understand current knowledge access and to contextualize model performance.
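
The citation finding in [15] is stated in odds, which is easy to misread as a change in accuracy. The short sketch below converts an odds ratio of 1.3 per doubling into probabilities; the 0.60 baseline is an illustrative assumption chosen from the reported 60-70% range, not a figure tied to any citation count in the paper.

```python
def shift_probability(p_base, doublings, odds_ratio_per_doubling=1.3):
    """Apply a multiplicative odds ratio once per citation doubling.

    odds = p / (1 - p); each doubling multiplies the odds by the ratio,
    and the result is converted back to a probability.
    """
    odds = p_base / (1.0 - p_base)
    odds *= odds_ratio_per_doubling ** doublings
    return odds / (1.0 + odds)


baseline = 0.60  # hypothetical baseline accuracy
p_one = shift_probability(baseline, doublings=1)
p_three = shift_probability(baseline, doublings=3)

print(f"after 1 doubling:  {p_one:.3f}")
print(f"after 3 doublings: {p_three:.3f}")
```

Under these assumptions one doubling moves a 0.60 baseline to about 0.661, a gain of roughly six percentage points rather than thirty, and the gain per doubling shrinks as accuracy approaches 1.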

[16] GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings

Yixuan Tang,Yi Yang

Main category: cs.CL

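TL;DR: This paper proposes GAPrune, a pruning framework that compresses domain-specific embedding models by combining domain importance with preservation of the general linguistic foundation. It outperforms baselines on the FinMTEB and ChemTEB benchmarks and retains high performance after pruning.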

Details Motivation: Existing pruning methods fail to distinguish general semantic representations from domain-specific patterns, which makes it hard to compress large embedding models effectively for resource-constrained environments. Method: GAPrune uses Fisher Information to measure parameter importance, assesses parameter behavior through general-domain gradient alignment, and makes pruning decisions with a Domain Alignment Importance (DAI) score. Result: At 50% sparsity, GAPrune loses less than 2.5% performance in one-shot pruning; after 100 steps of retraining it improves by +4.51% on FinMTEB and +1.73% on ChemTEB. Conclusion: GAPrune compresses models effectively while strengthening domain specialization, offering a new approach to deploying domain-specific embedding models. Abstract: Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and preserving general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. 
Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.
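
To make the idea of combining an importance signal with a gradient-alignment signal more concrete, here is a minimal sketch. The abstract does not give the DAI formula, so the combination used here (`fisher * (1 + alpha * alignment)`), the per-sample gradient inputs, the `alpha` weight, and both function names are assumptions for illustration only, not the authors' scoring rule.

```python
import math


def dai_scores(domain_grads, general_grads, alpha=0.5):
    """Hypothetical score per parameter from per-sample gradients.

    domain_grads / general_grads: lists of equal length, one gradient vector
    per sample, each vector holding one value per parameter.
    ASSUMPTION: score = diagonal Fisher * (1 + alpha * cosine alignment).
    """
    num_params = len(domain_grads[0])
    scores = []
    for j in range(num_params):
        d = [g[j] for g in domain_grads]
        q = [g[j] for g in general_grads]
        # Empirical diagonal Fisher information on domain data.
        fisher = sum(x * x for x in d) / len(d)
        # Cosine similarity between domain and general gradients, in [-1, 1].
        dot = sum(x * y for x, y in zip(d, q))
        norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
        alignment = dot / norm if norm > 0 else 0.0
        scores.append(fisher * (1.0 + alpha * alignment))
    return scores


def prune_mask(scores, sparsity=0.5):
    """Return True for kept parameters; the lowest-scoring fraction is pruned."""
    k = int(round(sparsity * len(scores)))
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    pruned = set(order[:k])
    return [j not in pruned for j in range(len(scores))]
```

In this sketch a parameter scores low either because its domain Fisher information is small or because its domain and general gradients point in opposite directions, which mirrors the two reasons for low DAI scores described in the abstract.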

[17] Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production

Liqian Feng,Lintao Wang,Kun Hu,Dehui Kong,Zhiyong Wang

Main category: cs.CL

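TL;DR: Proposes Text2SignDiff, a diffusion-based, gloss-free sign language production method that generates sign pose sequences directly from text through cross-modal alignment and a non-autoregressive denoising process.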

Details Motivation: Existing sign language production methods rely on gloss, which is language-specific and hard to annotate, limiting generalization and practicality; an end-to-end method that does not need gloss is therefore required. Method: Text2SignDiff uses a gloss-free latent diffusion model that generates sign language sequences jointly from noisy latent sign codes and text, together with a cross-modal aligner that learns a shared latent space between textual and visual content to support conditioned diffusion generation. Result: The method achieves state-of-the-art performance on the PHOENIX14T and How2Sign datasets, validating its effectiveness in the gloss-free setting. Conclusion: Text2SignDiff enables high-quality sign language generation without gloss annotation, moving sign language synthesis toward more practical and scalable use. Abstract: Sign language production (SLP) aims to translate spoken language sentences into a sequence of pose frames in a sign language, bridging the communication gap and promoting digital inclusion for deaf and hard-of-hearing communities. Existing methods typically rely on gloss, a symbolic representation of sign language words or phrases that serves as an intermediate step in SLP. This limits the flexibility and generalization of SLP, as gloss annotations are often unavailable and language-specific. Therefore, we present a novel diffusion-based generative approach - Text2Sign Diffusion (Text2SignDiff) for gloss-free SLP. Specifically, a gloss-free latent diffusion model is proposed to generate sign language sequences from noisy latent sign codes and spoken text jointly, reducing the potential error accumulation through a non-autoregressive iterative denoising process. We also design a cross-modal signing aligner that learns a shared latent space to bridge visual and textual content in sign and spoken languages. This alignment supports the conditioned diffusion-based process, enabling more accurate and contextually relevant sign language generation without gloss. Extensive experiments on the commonly used PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method, achieving state-of-the-art performance.

[18] A funny companion: Distinct neural responses to perceived AI- versus human- generated humor

Xiaohui Rao,Hanlin Wu,Zhenguang G. Cai

Main category: cs.CL

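TL;DR: Using EEG, this study compares neurocognitive responses to AI-generated and human-generated humor. Although both were rated as comparably funny, AI humor elicited a smaller N400 effect but a larger LPP, indicating greater surprise and emotional response, and the emotional reward grew over time, challenging the notion of "algorithm aversion".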

Details Motivation: As AI companions acquire human-like communication abilities, understanding how people respond cognitively and emotionally to AI humor is increasingly important. Method: Electroencephalography (EEG) is combined with behavioral measures to compare participants' cognitive and emotional neural responses to AI versus human humor. Result: Behaviorally, AI and human humor were rated as equally funny. Neurally, AI humor elicited a smaller N400 (lower cognitive effort) and a larger LPP (higher emotional arousal); over time the LPP increased and the N400 decreased, indicating a growing emotional reward from AI humor. Individuals' trust in AI modulated these neural responses. Conclusion: The brain responds to AI humor strongly and positively, revealing humor's potential to foster human-AI social interaction and suggesting that cognitive adaptation can overcome algorithm aversion. Abstract: As AI companions become capable of human-like communication, including telling jokes, understanding how people cognitively and emotionally respond to AI humor becomes increasingly important. This study used electroencephalography (EEG) to compare how people process humor from AI versus human sources. Behavioral analysis revealed that participants rated AI and human humor as comparably funny. However, neurophysiological data showed that AI humor elicited a smaller N400 effect, suggesting reduced cognitive effort during the processing of incongruity. This was accompanied by a larger Late Positive Potential (LPP), indicating a greater degree of surprise and emotional response. This enhanced LPP likely stems from the violation of low initial expectations regarding AI's comedic capabilities. Furthermore, a key temporal dynamic emerged: human humor showed habituation effects, marked by an increasing N400 and a decreasing LPP over time. In contrast, AI humor demonstrated increasing processing efficiency and emotional reward, with a decreasing N400 and an increasing LPP. This trajectory reveals how the brain can dynamically update its predictive model of AI capabilities. This process of cumulative reinforcement challenges "algorithm aversion" in humor, as it demonstrates how cognitive adaptation to AI's language patterns can lead to an intensified emotional reward. Additionally, participants' social attitudes toward AI modulated these neural responses, with higher perceived AI trustworthiness correlating with enhanced emotional engagement. 
These findings indicate that the brain responds to AI humor with surprisingly positive and intense reactions, highlighting humor's potential for fostering genuine engagement in human-AI social interaction.

[19] Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Sangyeop Kim,Yohan Lee,Sanghwa Kim,Hyunjong Kim,Sungzoon Cho

Main category: cs.CL

TL;DR: The paper introduces PREMem, a method that improves long-term memory in conversational AI by shifting complex reasoning processes to memory construction, reducing computational demands.

Details Motivation: Current conversational AI systems excessively burden response generation with complex reasoning, making performance heavily reliant on model sizes. Method: The paper introduces PREMem, which extracts fine-grained memory fragments and establishes explicit relationships between memory items across sessions. Result: Experiments showed significant performance improvements across all model sizes, with smaller models achieving results comparable to larger baselines while maintaining effectiveness under constrained token budgets. Conclusion: PREMem proves to be an effective approach for enhancing long-term memory in conversational AI, reducing computational demands during interactions by shifting complex reasoning processes to memory construction. Abstract: Effective long-term memory in conversational AI requires synthesizing information across multiple sessions. However, current systems place excessive reasoning burden on response generation, making performance significantly dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for Episodic Memory), a novel approach that shifts complex reasoning processes from inference to memory construction. PREMem extracts fine-grained memory fragments categorized into factual, experiential, and subjective information; it then establishes explicit relationships between memory items across sessions, capturing evolution patterns like extensions, transformations, and implications. By performing this reasoning during pre-storage rather than when generating a response, PREMem creates enriched representations while reducing computational demands during interactions. Experiments show significant performance improvements across all model sizes, with smaller models achieving results comparable to much larger baselines while maintaining effectiveness even with constrained token budgets. Code and dataset are available at https://github.com/sangyeop-kim/PREMem.
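
As a rough picture of what "memory fragments with explicit cross-session relationships" could look like as a data structure, consider the sketch below. The three categories come from the abstract; the relation names are taken from the abstract's examples of evolution patterns. Every class, field, and method name here is hypothetical and is not drawn from the PREMem code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Categories named in the abstract.
CATEGORIES = ("factual", "experiential", "subjective")
# Relation labels based on the abstract's examples (assumed label set).
RELATIONS = ("extension", "transformation", "implication")


@dataclass
class MemoryFragment:
    fragment_id: int
    session: int
    category: str
    text: str


@dataclass
class MemoryStore:
    """Hypothetical container: fragments plus typed links built before storage."""
    fragments: List[MemoryFragment] = field(default_factory=list)
    links: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_fragment(self, session: int, category: str, text: str) -> MemoryFragment:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        fragment = MemoryFragment(len(self.fragments), session, category, text)
        self.fragments.append(fragment)
        return fragment

    def link(self, source_id: int, target_id: int, relation: str) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.links.append((source_id, target_id, relation))

    def related(self, fragment_id: int):
        """Return (relation, fragment) pairs reachable from fragment_id."""
        return [(rel, self.fragments[dst])
                for src, dst, rel in self.links if src == fragment_id]
```

The design choice this illustrates is that links are computed once, when memory is written, so that response generation only needs a lookup such as `related(...)` instead of re-deriving how two sessions connect.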

[20] Quantifier Scope Interpretation in Language Learners and LLMs

Shaohua Fang,Yue Li,Yan Cong

Main category: cs.CL

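TL;DR: The study finds that large language models favor surface scope interpretations in both English and Chinese, and that how closely they resemble humans depends on model architecture, scale, and the language background of the pre-training data.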

Details Motivation: Quantifier scope interpretation is ambiguous across languages; studying how large language models handle this ambiguity helps clarify how similar their behavior is to human language processing. Method: A cross-linguistic approach uses probabilities to assess LLMs' preferences for surface scope and inverse scope interpretations in English and Chinese, and human similarity (HS) scores quantify how closely the models match human performance. Result: Most LLMs prefer surface scope interpretations, consistent with human tendencies, while only some show human-like patterns in inverse scope preferences across English and Chinese. Conclusion: Model architecture, scale, and the language background of the pre-training data significantly affect how closely LLMs approximate human quantifier scope interpretation. Abstract: Sentences with multiple quantifiers often lead to interpretive ambiguities, which can vary across languages. This study adopts a cross-linguistic approach to examine how large language models (LLMs) handle quantifier scope interpretation in English and Chinese, using probabilities to assess interpretive likelihood. Human similarity (HS) scores were used to quantify the extent to which LLMs emulate human performance across language groups. Results reveal that most LLMs prefer the surface scope interpretations, aligning with human tendencies, while only some differentiate between English and Chinese in the inverse scope preferences, reflecting human-similar patterns. HS scores highlight variability in LLMs' approximation of human behavior, but their overall potential to align with humans is notable. Differences in model architecture, scale, and particularly models' pre-training data language background, significantly influence how closely LLMs approximate human quantifier scope interpretations.

[21] Term2Note: Synthesising Differentially Private Clinical Notes from Medical Terms

Yuping Wu,Viktor Schlegel,Warren Del-Pinto,Srinivasan Nandakumar,Iqra Zahid,Yidan Sun,Usama Farghaly Omar,Amirah Jasmine,Arun-Kumar Kaliya-Perumal,Chun Shen Tham,Gabriel Connors,Anil A Bharath,Goran Nenadic

Main category: cs.CL

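TL;DR: This paper proposes Term2Note, a method for generating long clinical notes under strong differential privacy constraints. By separating content from form and combining differentially private medical terms with a quality-maximisation mechanism, it protects privacy while maintaining high fidelity and utility.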

Details Motivation: In high-stakes domains such as healthcare, the use of real training data is restricted by privacy leakage concerns, and existing differentially private (DP) text generation methods struggle to balance privacy and utility, especially for long, domain-specific text. Method: Term2Note structurally separates content from form: it generates section-wise content conditioned on DP medical terms, with separate privacy constraints applied to each, and adds a DP quality maximiser that selects high-quality synthetic notes. Result: Experiments show that the synthetic clinical notes closely match real notes in their statistical properties, indicating high fidelity; multi-label classification models trained on them perform close to models trained on real data and outperform existing baselines. Conclusion: Term2Note achieves higher fidelity and utility under fewer assumptions and is a viable privacy-preserving alternative to using sensitive clinical notes. Abstract: Training data is fundamental to the success of modern machine learning models, yet in high-stakes domains such as healthcare, the use of real-world training data is severely constrained by concerns over privacy leakage. A promising solution to this challenge is the use of differentially private (DP) synthetic data, which offers formal privacy guarantees while maintaining data utility. However, striking the right balance between privacy protection and utility remains challenging in clinical note synthesis, given its domain specificity and the complexity of long-form text generation. In this paper, we present Term2Note, a methodology to synthesise long clinical notes under strong DP constraints. By structurally separating content and form, Term2Note generates section-wise note content conditioned on DP medical terms, with each governed by separate DP constraints. A DP quality maximiser further enhances synthetic notes by selecting high-quality outputs. Experimental results show that Term2Note produces synthetic notes with statistical properties closely aligned with real clinical notes, demonstrating strong fidelity. In addition, multi-label classification models trained on these synthetic notes perform comparably to those trained on real data, confirming their high utility. Compared to existing DP text generation baselines, Term2Note achieves substantial improvements in both fidelity and utility while operating under fewer assumptions, suggesting its potential as a viable privacy-preserving alternative to using sensitive clinical notes.

[22] CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis

Xinyu Zhang,Pei Zhang,Shuang Luo,Jialong Tang,Yu Wan,Baosong Yang,Fei Huang

Main category: cs.CL

TL;DR: This paper introduces CultureSynth, a framework for assessing cultural competence in large language models, addressing current evaluation limitations through an automated, comprehensive taxonomy and synthetic benchmark.

Details Motivation: The motivation is to address the limitations in current evaluations of cultural competence in large language models, which suffer from fragmented taxonomies, domain specificity, and reliance on manual data annotation. Method: The methodology involves a hierarchical multilingual cultural taxonomy and a Retrieval-Augmented Generation (RAG)-based approach to synthesize culturally relevant question-answer pairs. Result: The evaluation of 14 LLMs using the CultureSynth-7 benchmark revealed performance stratification, a 3B-parameter threshold for basic cultural competence, architectural biases in knowledge processing, and geographic disparities among models. Conclusion: The study concludes that CultureSynth provides a scalable framework for building culturally aware AI systems while reducing dependence on manual annotation. Abstract: Cultural competence, defined as the ability to understand and adapt to multicultural contexts, is increasingly vital for large language models (LLMs) in global environments. While several cultural benchmarks exist to assess LLMs' cultural competence, current evaluations suffer from fragmented taxonomies, domain specificity, and heavy reliance on manual data annotation. To address these limitations, we introduce CultureSynth, a novel framework comprising (1) a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based methodology leveraging factual knowledge to synthesize culturally relevant question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360 entries and 4,149 manually verified entries across 7 languages. Evaluation of 14 prevalent LLMs of different sizes reveals clear performance stratification led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. 
The results demonstrate that a 3B-parameter threshold is necessary for achieving basic cultural competence, models display varying architectural biases in knowledge processing, and significant geographic disparities exist across models. We believe that CultureSynth offers a scalable framework for developing culturally aware AI systems while reducing reliance on manual annotation. (Benchmark is available at https://github.com/Eyr3/CultureSynth.)

[23] Aligning ESG Controversy Data with International Guidelines through Semi-Automatic Ontology Construction

Tsuyoshi Iwata,Guillaume Comte,Melissa Flores,Ryoma Kondo,Ryohei Hisano

Main category: cs.CL

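TL;DR: Proposes a semi-automatic method that uses lightweight ontology design, formal pattern modeling, and large language models to map ESG events reported in the news onto the principles of international normative frameworks (such as the United Nations Global Compact) and build a structured knowledge graph.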

Details Motivation: There is no standardized way to align existing ESG controversy data with abstract international normative frameworks, and commercial data systems are inconsistent with principle-based frameworks, which makes non-financial risks hard to interpret accurately. Method: Lightweight ontology design, formal pattern modeling, and large language models are combined to convert normative principles into reusable templates expressed in RDF, which are then used to extract information from news and populate a structured knowledge graph. Result: The approach yields a structured representation of ESG events reported in the news and links specific incidents to specific principles of international sustainability guidelines, improving scalability and transparency. Conclusion: The method provides a scalable, transparent, and interpretable framework for identifying and interpreting violations of international sustainability guidelines, supporting non-financial risk management in regulatory and investment decisions. Abstract: The growing importance of environmental, social, and governance data in regulatory and investment contexts has increased the need for accurate, interpretable, and internationally aligned representations of non-financial risks, particularly those reported in unstructured news sources. However, aligning such controversy-related data with principle-based normative frameworks, such as the United Nations Global Compact or Sustainable Development Goals, presents significant challenges. These frameworks are typically expressed in abstract language, lack standardized taxonomies, and differ from the proprietary classification systems used by commercial data providers. In this paper, we present a semi-automatic method for constructing structured knowledge representations of environmental, social, and governance events reported in the news. Our approach uses lightweight ontology design, formal pattern modeling, and large language models to convert normative principles into reusable templates expressed in the Resource Description Framework. These templates are used to extract relevant information from news content and populate a structured knowledge graph that links reported incidents to specific framework principles. The result is a scalable and transparent framework for identifying and interpreting non-compliance with international sustainability guidelines.

[24] Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents

Ankan Mullick,Sombit Bose,Rounak Saha,Ayan Kumar Bhowmick,Aditya Vempaty,Prasenjit Dey,Ravi Kokku,Pawan Goyal,Niloy Ganguly

Main category: cs.CL

TL;DR: This paper introduces Spotlight, a method for creating engaging document summaries by focusing on compelling content, using a fine-tuned language model and DPO alignment.

Details Motivation: Traditional summaries prioritize comprehensive coverage, whereas spotlights aim to foster deeper engagement by emphasizing intriguing content. Method: A two-stage approach involving fine-tuning a large language model followed by alignment via Direct Preference Optimization (DPO) was used to generate spotlights. Result: The proposed model successfully identifies key elements, enhances readability, and boosts the engagement value of documents. Conclusion: Spotlight provides a new method for information extraction that improves reader engagement by highlighting compelling content. Abstract: In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.
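
For readers unfamiliar with the alignment stage, the following sketch computes the standard Direct Preference Optimization loss for a single preference pair. It shows the general DPO objective, not this paper's training code; the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs.

    loss = -log(sigmoid(beta * margin)), where margin compares how much the
    policy prefers the chosen output over the rejected one, relative to a
    frozen reference model. beta is a hypothetical default.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the preferred spotlight than the reference does.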

[25] An Interpretable Benchmark for Clickbait Detection and Tactic Attribution

Lihi Nofar,Tomer Portal,Aviv Elbaz,Alexander Apartsin,Yehudit Aperstein

Main category: cs.CL

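TL;DR: The paper proposes an explainable clickbait detection model. Using a synthetic dataset and a two-stage framework (detection followed by tactic attribution), it improves the identification of clickbait headlines and advances transparent AI systems for countering manipulative media content.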

Details Motivation: The spread of clickbait headlines seriously undermines information credibility and user trust in digital media, and the limited explainability of current machine learning methods restricts their practical use; explainable detection models are therefore needed to counter manipulative media content. Method: The paper introduces a synthetic dataset generated by systematically augmenting real news headlines and proposes a two-stage framework for automatic clickbait analysis. The first stage performs clickbait detection with a BERT classifier and with large language models (GPT-4.0 and Gemini 2.4 Flash) under zero-shot or few-shot prompting; the second stage uses a BERT classifier to predict the specific clickbait strategies in each headline. Result: The paper builds the synthetic dataset and carries out detection and tactic attribution with the two-stage framework, comparing models under zero-shot and few-shot prompting and demonstrating detailed analysis of clickbait strategies. Conclusion: The proposed explainable model not only identifies clickbait headlines but also attributes them to specific linguistic manipulation strategies, advancing transparent and trustworthy AI systems. Abstract: The proliferation of clickbait headlines poses significant challenges to the credibility of information and user trust in digital media. While recent advances in machine learning have improved the detection of manipulative content, the lack of explainability limits their practical adoption. This paper presents a model for explainable clickbait detection that not only identifies clickbait titles but also attributes them to specific linguistic manipulation strategies. We introduce a synthetic dataset generated by systematically augmenting real news headlines using a predefined catalogue of clickbait strategies. This dataset enables controlled experimentation and detailed analysis of model behaviour. We present a two-stage framework for automatic clickbait analysis comprising detection and tactic attribution. In the first stage, we compare a fine-tuned BERT classifier with large language models (LLMs), specifically GPT-4.0 and Gemini 2.4 Flash, under both zero-shot prompting and few-shot prompting enriched with illustrative clickbait headlines and their associated persuasive tactics. In the second stage, a dedicated BERT-based classifier predicts the specific clickbait strategies present in each headline. This work advances the development of transparent and trustworthy AI systems for combating manipulative media content. We share the dataset with the research community at https://github.com/LLM-HITCS25S/ClickbaitTacticsDetection

[26] EmoBench-Reddit: A Hierarchical Benchmark for Evaluating the Emotional Intelligence of Multimodal Large Language Models

Haokun Li,Yazhou Zhang,Jizhi Ding,Qiuchi Li,Peng Zhang

Main category: cs.CL

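TL;DR: This paper presents EmoBench-Reddit, a new hierarchical benchmark for multimodal emotion understanding. It contains 350 samples from Reddit covering images, text, and emotion categories, with tasks that progress from perception to cognition to evaluate how well MLLMs understand complex, subjective emotions.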

Details Motivation: Existing benchmarks focus mainly on objective visual question answering or caption generation and cannot adequately assess a model's understanding of complex, subjective human emotions, so a high-quality benchmark dedicated to multimodal emotion understanding is needed. Method: The EmoBench-Reddit dataset is built from Reddit; each sample contains an image, user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. A hierarchical task framework with six multiple-choice questions and one open-ended question per sample covers difficulty levels from basic perception to advanced cognition, and annotation quality is ensured by combining AI assistance (Claude 4) with manual verification. Result: The benchmark supports systematic evaluation of MLLMs on identifying basic visual elements, scene reasoning, intent understanding, and empathy, providing a new evaluation standard for multimodal emotion understanding. Conclusion: EmoBench-Reddit fills a gap in evaluating subjective emotion understanding in multimodal large models; its hierarchical task design measures abilities from perception to higher-order cognition and supports progress in the emotional intelligence of MLLMs. Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), they have demonstrated exceptional capabilities across a variety of vision-language tasks. However, current evaluation benchmarks predominantly focus on objective visual question answering or captioning, inadequately assessing the models' ability to understand complex and subjective human emotions. To bridge this gap, we introduce EmoBench-Reddit, a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit, each containing an image, associated user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. We designed a hierarchical task framework that progresses from basic perception to advanced cognition, with each data point featuring six multiple-choice questions and one open-ended question of increasing difficulty. Perception tasks evaluate the model's ability to identify basic visual elements (e.g., colors, objects), while cognition tasks require scene reasoning, intent understanding, and deep empathy integrating textual context. We ensured annotation quality through a combination of AI assistance (Claude 4) and manual verification.

[27] Fluid Language Model Benchmarking

Valentin Hofmann,David Heineman,Ian Magnusson,Kyle Lo,Jesse Dodge,Maarten Sap,Pang Wei Koh,Chun Wang,Hannaneh Hajishirzi,Noah A. Smith

Main category: cs.CL

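TL;DR: This paper presents Fluid Benchmarking, a dynamic language model evaluation method inspired by psychometrics. By modeling items with item response theory and selecting them adaptively, it substantially outperforms traditional static evaluation on efficiency, validity, variance, and saturation.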

Details Motivation: Current language model evaluation is costly, often fails to measure the intended capabilities, and suffers from labeling errors and benchmark saturation; existing remedies tend to address individual problems in isolation and neglect overall evaluation quality. Method: Based on item response theory (IRT), the method estimates an item response model from existing model evaluation results and then selects test items dynamically, in the manner of computerized adaptive testing in education. Result: On benchmarks such as MMLU, Fluid Benchmarking achieves higher validity and lower variance with only one-fiftieth of the items, and it outperforms random sampling and other IRT-based baselines on all four dimensions: efficiency, validity, variance, and saturation. Conclusion: Dynamic, adaptive evaluation can substantially improve the quality of language model benchmarking, and the field should move beyond the traditional static evaluation paradigm. Abstract: Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level, suggesting that evaluation should adapt to each LM. Methodologically, Fluid Benchmarking estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education. In our experiments, we compare Fluid Benchmarking against the common practice of random item sampling as well as more sophisticated baselines, including alternative methods grounded in item response theory. We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them (e.g., higher validity and less variance on MMLU with fifty times fewer items). 
Our analysis shows that the two components of Fluid Benchmarking have distinct effects: item response theory, used to map performance into a latent ability space, increases validity, while dynamic item selection reduces variance. Overall, our results suggest that LM benchmarking can be substantially improved by moving beyond static evaluation.
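
The two components can be sketched in a few lines, in the spirit of textbook computerized adaptive testing. The abstract does not state which item response model or selection rule is used, so the two-parameter logistic (2PL) model, the maximum-Fisher-information selection rule, the grid-search ability estimate with a standard normal prior, and all function names below are assumptions for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response model: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_ability(responses, items, grid=None) -> float:
    """Grid-search MAP ability estimate under a standard normal prior.

    responses: list of (item_index, correct) pairs; items: list of (a, b).
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]

    def log_posterior(theta: float) -> float:
        total = -0.5 * theta * theta  # log prior up to a constant
        for idx, correct in responses:
            a, b = items[idx]
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_posterior)


def next_item(theta: float, items, administered) -> int:
    """Pick the unused item that is most informative at the current ability."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))
```

A typical loop would alternate `next_item` and `estimate_ability` until an item budget is reached. Items whose difficulty is near the current ability estimate carry the most information, which is why an adaptive procedure can get by with far fewer items than random sampling.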

[28] We Argue to Agree: Towards Personality-Driven Argumentation-Based Negotiation Dialogue Systems for Tourism

Priyanshu Priya,Saurav Dudhate,Desai Vishesh Yasheshbhai,Asif Ekbal

Main category: cs.CL

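TL;DR: This paper proposes PAN-DG, a new task of personality-driven argumentation-based negotiation dialogue generation, and builds the accompanying PACT dataset to improve personalization and reasoning in negotiation dialogue systems.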

Details Motivation: To strengthen the personalized adaptability of negotiation dialogue systems in conflict resolution by combining argumentation mechanisms with personality attributes, advancing more intelligent and human-like dialogue systems. Method: The paper proposes the PAN-DG task, builds the LLM-generated PACT dataset with three personality profiles (argumentation, preference, and buying style), and runs comparative experiments between pre-trained and fine-tuned LLMs. Result: Automatic and manual evaluations indicate that the PACT dataset is of high quality; fine-tuned LLMs show stronger ability to generate personality-driven, rational responses in multi-dimensional evaluation. Conclusion: PACT effectively improves personalization and reasoning in negotiation dialogue systems and lays a foundation for future research. Abstract: Integrating argumentation mechanisms into negotiation dialogue systems improves conflict resolution through exchanges of arguments and critiques. Moreover, incorporating personality attributes enhances adaptability by aligning interactions with individuals' preferences and styles. To advance these capabilities in negotiation dialogue systems, we propose a novel Personality-driven Argumentation-based Negotiation Dialogue Generation (PAN-DG) task. To support this task, we introduce PACT, a dataset of Personality-driven Argumentation-based negotiation Conversations for Tourism sector. This dataset, generated using Large Language Models (LLMs), features three distinct personality profiles, viz. Argumentation Profile, Preference Profile, and Buying Style Profile to simulate a variety of negotiation scenarios involving diverse personalities. Thorough automatic and manual evaluations indicate that the dataset comprises high-quality dialogues. Further, we conduct comparative experiments between pre-trained and fine-tuned LLMs for the PAN-DG task. Multi-dimensional evaluation demonstrates that the fine-tuned LLMs effectively generate personality-driven rational responses during negotiations. This underscores the effectiveness of PACT in enhancing personalization and reasoning capabilities in negotiation dialogue systems, thereby establishing a foundation for future research in this domain.

[29] Joint Effects of Argumentation Theory, Audio Modality and Data Enrichment on LLM-Based Fallacy Classification

Hongxu Zhou,Hylke Westerdijk,Khondoker Ittehadul Islam

Main category: cs.CL

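TL;DR: The study finds that although theory-based prompting helps interpretability, adding context and emotional tone metadata often lowers performance in fallacy classification; emotional tone metadata in particular biases the model toward "Appeal to Emotion" judgments.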

Details Motivation: This study examines how context and emotional tone metadata affect the reasoning and performance of large language models (LLMs) in fallacy classification, particularly in political debate settings. Method: Using data from U.S. presidential debates, the Qwen-3 (8B) model is applied to the classification of six fallacy types under various prompting strategies. Two theoretically grounded Chain-of-Thought frameworks are introduced, Pragma-Dialectics and the Periodic Table of Arguments, and their effectiveness is evaluated under three input settings: text-only, text with context, and text with both context and audio-based emotional tone metadata. Result: Theory-based prompting can improve interpretability and, in some cases, accuracy, but adding context and especially emotional tone metadata often lowers performance. Emotional tone metadata makes the model more likely to label statements as "Appeal to Emotion", which harms logical reasoning. Conclusion: Basic prompts often outperform enhanced ones, suggesting that attention dilution from added inputs may worsen rather than improve fallacy classification in LLMs. Abstract: This study investigates how context and emotional tone metadata influence large language model (LLM) reasoning and performance in fallacy classification tasks, particularly within political debate settings. Using data from U.S. presidential debates, we classify six fallacy types through various prompting strategies applied to the Qwen-3 (8B) model. We introduce two theoretically grounded Chain-of-Thought frameworks: Pragma-Dialectics and the Periodic Table of Arguments, and evaluate their effectiveness against a baseline prompt under three input settings: text-only, text with context, and text with both context and audio-based emotional tone metadata. Results suggest that while theoretical prompting can improve interpretability and, in some cases, accuracy, the addition of context and especially emotional tone metadata often leads to lowered performance. Emotional tone metadata biases the model toward labeling statements as "Appeal to Emotion", worsening logical reasoning. Overall, basic prompts often outperformed enhanced ones, suggesting that attention dilution from added inputs may worsen rather than improve fallacy classification in LLMs.

[30] When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

Shiyao Cui,Xijia Feng,Yingkang Wang,Junxiao Yang,Zhexin Zhang,Biplab Sikdar,Hongning Wang,Han Qiu,Minlie Huang

Main category: cs.CL

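TL;DR: Emojis can induce large language models to generate toxic content; the study finds that they act as a heterogeneous semantic channel that bypasses safety mechanisms and points to a possible influence of data pollution.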

Details Motivation: Although emojis are usually seen as friendly or playful, they have been observed to trigger toxic content generation in large language models, which motivates a study of their effect and its causes. Method: Prompts containing emojis are constructed automatically, experiments are run across 5 mainstream languages and 7 well-known LLMs, and the results are followed by model-level interpretation and an in-depth analysis of the pre-training corpus. Result: Experiments show that prompts with emojis can easily induce toxic generation, with the effect especially evident in jailbreak tasks. Conclusion: Emojis can act as a heterogeneous semantic channel that bypasses safety mechanisms and induces LLMs to produce toxic content, and there is a potential correlation between emoji-related data pollution and toxicity generation behavior. Abstract: Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by such an observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts with emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate that prompts with emojis could easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation and tokenization, suggesting that emojis can act as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover a potential correlation between emoji-related data pollution and the toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: This paper contains potentially sensitive content)

[31] Text2Mem: A Unified Memory Operation Language for Memory Operating System

Felix Wang,Boyu Chen,Kerun Xu,Bo Tang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

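TL;DR: This paper proposes Text2Mem, a unified memory operation language that aims to give LLM agents a standardized, reliable, and executable memory management mechanism, filling gaps in existing frameworks around higher-order operations and formal specification.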

Details Motivation: Existing memory frameworks for LLM agents support only basic operations, lack higher-order operations such as merge, promote, demote, and split, and have no formal executable specification, which leads to unpredictable behavior. A unified, standard memory operation language is therefore needed to improve reliability and cross-system consistency. Method: Text2Mem defines a compact yet expressive operation set, represents each instruction as a JSON-based schema instance, and uses a parser to produce typed operation objects. A validator checks correctness before execution, adapters support multiple backends (a SQL prototype or real memory frameworks), model services such as embeddings or summarization are integrated when needed, and all results are returned through a unified execution contract. Result: Text2Mem provides safety, determinism, and portability across heterogeneous backends for memory operations; the paper also proposes Text2Mem Bench, a benchmark that separates schema generation from backend execution to support systematic evaluation of memory control. Conclusion: Text2Mem establishes the first standardized foundation for memory control in agents, addresses the incomplete operations and unclear specifications of existing memory systems, and shows good extensibility and practical potential. Abstract: Large language model agents increasingly depend on memory to sustain long-horizon interaction, but existing frameworks remain limited. Most expose only a few basic primitives such as encode, retrieve, and delete, while higher-order operations like merge, promote, demote, split, lock, and expire are missing or inconsistently supported. Moreover, there is no formal and executable specification for memory commands, leaving scope and lifecycle rules implicit and causing unpredictable behavior across systems. We introduce Text2Mem, a unified memory operation language that provides a standardized pathway from natural language to reliable execution. Text2Mem defines a compact yet expressive operation set aligned with encoding, storage, and retrieval. Each instruction is represented as a JSON-based schema instance with required fields and semantic invariants, which a parser transforms into typed operation objects with normalized parameters. A validator ensures correctness before execution, while adapters map typed objects either to a SQL prototype backend or to real memory frameworks. Model-based services such as embeddings or summarization are integrated when required. All results are returned through a unified execution contract. This design ensures safety, determinism, and portability across heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark that separates schema generation from backend execution to enable systematic evaluation. Together, these components establish the first standardized foundation for memory control in agents.
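The parse-then-validate path described in the Text2Mem abstract might look roughly like the sketch below. It is not the Text2Mem schema: the field names `op`, `target`, and `args`, the `MemoryOp` class, and the `parse_instruction` function are all hypothetical, and only the operation names are taken from the abstract.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Operation names come from the abstract; the real operation set may differ.
ALLOWED_OPS = {"encode", "retrieve", "delete", "merge", "promote",
               "demote", "split", "lock", "expire"}


@dataclass(frozen=True)
class MemoryOp:
    """Hypothetical typed operation object produced by the parser."""
    op: str
    target: str
    args: Dict[str, Any] = field(default_factory=dict)


def parse_instruction(raw: str) -> MemoryOp:
    """Parse a JSON instruction and validate it before any backend sees it."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("instruction must be a JSON object")
    missing = {"op", "target"} - set(data)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    op = str(data["op"]).strip().lower()  # normalize the operation name
    if op not in ALLOWED_OPS:
        raise ValueError(f"unknown operation: {op}")
    args = data.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    return MemoryOp(op=op, target=str(data["target"]), args=args)
```

Rejecting malformed instructions at this stage, rather than inside each adapter, is one way to get the same failure behavior across backends.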

[32] Differentially-private text generation degrades output language quality

Erion Çano,Ivan Habernal

Main category: cs.CL

TL;DR: Differential privacy in LLMs reduces text quality and downstream task performance.

Details Motivation: To understand the trade-off between privacy and text quality/utility in DP-tuned LLMs. Method: Five LLMs were tuned under four privacy levels and evaluated for text quality and classification utility. Result: Texts from more private LLMs are shorter, less grammatically correct, less lexically diverse, and reduce classification accuracy. Conclusion: Stronger privacy constraints in LLMs negatively impact text quality and utility. Abstract: Ensuring user privacy by synthesizing data from large language models (LLMs) tuned under differential privacy (DP) has become popular recently. However, the impact of DP fine-tuned LLMs on the quality of the language and the utility of the texts they produce has not been investigated. In this work, we tune five LLMs with three corpora under four levels of privacy and assess the length, the grammatical correctness, and the lexical diversity of the text outputs they produce. We also probe the utility of the synthetic outputs in downstream classification tasks such as book genre recognition based on book descriptions and cause of death recognition based on verbal autopsies. The results indicate that LLMs tuned under stronger privacy constraints produce texts that are shorter by at least 77%, that are less grammatically correct by at least 9%, and are less diverse by at least 10% in bi-gram diversity. Furthermore, the accuracy they reach in downstream classification tasks decreases, which might be detrimental to the usefulness of the generated synthetic data.
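The abstract does not define "bi-gram diversity". One common reading is the distinct-2 ratio (unique bigrams divided by total bigrams), sketched below under that assumption; the whitespace tokenization and the function name `bigram_diversity` are illustrative choices, not the authors'.

```python
def bigram_diversity(text: str) -> float:
    """Distinct-2 ratio: unique bigrams / total bigrams (0.0 if no bigrams)."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)
```

Under this definition, repetitive output such as `"a b a b a"` scores 0.5, while a text with no repeated bigram scores 1.0.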

[33] Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

Hang Guo,Yawei Li,Luca Benini

Main category: cs.CL

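TL;DR: This paper proposes Optimal Brain Restoration (OBR), a training-free framework that compresses large language models by combining quantization and sparsity. It uses second-order information and error compensation to deliver significant speedup and memory reduction while preserving performance.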

Details Motivation: As individual compression techniques such as quantization or pruning approach their limits, further compression becomes difficult, and quantization and pruning place conflicting requirements on weight distributions, so a method that reconciles the two is needed. Method: The OBR framework starts from a second-order Hessian objective, reformulates it into a tractable problem via surrogate approximation and group error compensation, and jointly optimizes quantization and sparsity without training. Result: On existing LLMs, OBR combines W4A4KV4 quantization with 50% sparsity, delivering up to 4.72x speedup and 6.4x memory reduction compared with the FP16 dense model. Conclusion: OBR effectively reconciles the conflict between quantization and pruning and offers a general, practical solution for efficient LLM compression. Abstract: Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.

[34] RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction

Jian Chen,Shengyi Lv,Leilei Su

Main category: cs.CL

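TL;DR: Proposes random adversarial training (RAT), a framework built on PubMedBERT that balances performance gains and computational efficiency on biomedical information extraction tasks.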

Details Motivation: Conventional adversarial training improves model performance but adds significant computational overhead on biomedical information extraction tasks, so a more efficient alternative is needed. Method: Building on PubMedBERT, RAT combines a random sampling mechanism with adversarial training to lower computational cost while improving generalization and robustness. Result: RAT outperforms baseline models on several biomedical information extraction tasks, significantly reducing computational overhead while improving performance. Conclusion: RAT is an efficient and effective adversarial training framework that offers biomedical natural language processing a balance between performance and efficiency. Abstract: We introduce random adversarial training (RAT), a novel framework successfully applied to biomedical information extraction (BioIE) tasks. Building on PubMedBERT as the foundational architecture, our study first validates the effectiveness of conventional adversarial training in enhancing pre-trained language models' performance on BioIE tasks. While adversarial training yields significant improvements across various performance metrics, it also introduces considerable computational overhead. To address this limitation, we propose RAT as an efficient solution for biomedical information extraction. This framework strategically integrates random sampling mechanisms with adversarial training principles, achieving dual objectives: enhanced model generalization and robustness while significantly reducing computational costs. Through comprehensive evaluations, RAT demonstrates superior performance compared to baseline models in BioIE tasks. The results highlight RAT's potential as a transformative framework for biomedical natural language processing, offering a balance between model performance and computational efficiency.

[35] The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Valentin Romanov,Steven A Niederer

Main category: cs.CL

TL;DR: This paper identifies and analyzes six core prompt engineering techniques to improve the efficiency and quality of responses from Large Language Models in life sciences workflows.

Details Motivation: The motivation is to streamline life sciences workflows by identifying core prompt engineering techniques that reduce the cognitive load and improve the reliability of responses from Large Language Models. Method: The study distills 58 text-based prompt engineering techniques into 6 core approaches and evaluates them based on use cases in the life sciences, while also analyzing current limitations and effectiveness across various platforms. Result: Six core prompt engineering techniques were identified and analyzed for their significance in life sciences, along with recommendations for prompt structuring and awareness of limitations. Conclusion: The paper concludes that prompt engineering can significantly enhance efficiency and quality in life sciences research when transitioning from ad-hoc prompting to systematic practice. Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. 
We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools like Claude Code, and analyze the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms, discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

[36] Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

Dasol Choi,Jungwhan Kim,Guijin Son

Main category: cs.CL

TL;DR: The paper introduces Ko-PIQA, a Korean physical commonsense reasoning dataset with cultural context, to address the lack of cultural diversity in existing datasets. Starting with a large set of web-crawled questions, the authors used multi-stage filtering and GPT-4o refinement to create a high-quality dataset of 441 question-answer pairs. Ko-PIQA includes culturally specific elements that highlight the importance of culturally-aware reasoning. The evaluation of several models shows significant room for improvement, especially in handling culturally specific scenarios.

Details Motivation: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. The introduction of Ko-PIQA aims to incorporate cultural context into such datasets. Method: Starting from 3.01 million web-crawled questions, a multi-stage filtering approach using three language models was employed to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, 441 high-quality question-answer pairs were obtained. Result: Seven language models were evaluated on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reached only 59.86%. Models particularly struggle with culturally specific scenarios. Conclusion: Ko-PIQA serves as a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. Abstract: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally-aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. 
Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.

[37] !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning

Mohamed Tarek,Seif Ahmed,Mohamed Basem

Main category: cs.CL

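TL;DR: This paper describes the authors' system for the AraHealthQA-2025 shared task, which placed second on Arabic health question answering and improved answer accuracy using the Gemini 2.5 Flash model with several optimization techniques.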

Details Motivation: The authors aim to improve the accuracy of question answering in Arabic clinical contexts and thereby perform well in the shared task. Method: For Sub-Task 1, they use Gemini 2.5 Flash with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy; for Sub-Task 2, they use a unified prompt with role-playing, few-shot examples, and post-processing to generate concise answers. Result: The approach placed second in both sub-tasks of the AraHealthQA-2025 shared task. Conclusion: By optimizing the model and prompting strategy, the authors improved the performance of Arabic clinical question answering. Abstract: We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we leverage the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we employ a unified prompt with the same model, incorporating role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across fill-in-the-blank, patient-doctor Q&A, GEC, and paraphrased variants.
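The abstract does not say how the outputs of the three prompt configurations are combined. Majority voting over the predicted option letters is one common choice and is sketched below as an assumption; the function name `majority_vote` and the tie-breaking rule (first answer in input order wins) are illustrative.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent normalized answer; ties go to the earliest one."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().upper() for a in answers]  # "b " -> "B"
    counts = Counter(normalized)
    best = max(counts.values())
    # Scan in input order so a tie resolves to the first configuration's answer.
    for answer in normalized:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")
```

For example, `majority_vote(["B", "b ", "C"])` returns `"B"`, and a three-way disagreement falls back to the first configuration's answer.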

[38] Transformer Enhanced Relation Classification: A Comparative Analysis of Contextuality, Data Efficiency and Sequence Complexity

Bowen Jing,Yang Cui,Tianpeng Huang

Main category: cs.CL

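TL;DR: This paper systematically compares transformer-based and non-transformer deep supervised learning methods on relation extraction. Transformer-based models significantly outperform non-transformer models on multiple datasets.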

Details Motivation: To assess how different deep learning architectures perform on relation extraction in the era of large language models, in particular the performance gap between transformer and non-transformer models. Method: Non-transformer models (PA-LSTM, C-GCN, AGGCN) and transformer models (BERT, RoBERTa, R-BERT) are evaluated on TACRED, TACREV, and RE-TACRED using metrics such as micro F1, with further analysis across sentence lengths and training-data proportions. Result: Transformer models reach micro F1 scores of 80-90%, significantly above the 64-67% of non-transformer models, and show stronger robustness and generalization across the experimental settings. Conclusion: Transformer-based models perform better on relation extraction, and future work should continue to explore the potential of large language models in this area. Abstract: In the era of large language models, relation extraction (RE) plays an important role in information extraction through the transformation of unstructured raw text into structured data (Wadhwa et al., 2023). In this paper, we systematically compare the performance of deep supervised learning approaches without transformers and those with transformers. We used a series of non-transformer architectures such as PA-LSTM (Zhang et al., 2017), C-GCN (Zhang et al., 2018), and AGGCN (attention-guided GCN) (Guo et al., 2019), and a series of transformer architectures such as BERT, RoBERTa, and R-BERT (Wu and He, 2019). Our comparison included traditional metrics like micro F1, as well as evaluations in different scenarios, varying sentence lengths, and different percentages of the dataset for training. Our experiments were conducted on TACRED, TACREV, and RE-TACRED. The results show that transformer-based models outperform non-transformer models, achieving micro F1 scores of 80-90% compared to 64-67% for non-transformer models. Additionally, we briefly review the research journey in supervised relation classification and discuss the role and current status of large language models (LLMs) in relation extraction.
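For readers unfamiliar with the metric, micro F1 for relation classification can be computed as in the sketch below. It assumes the convention commonly used with TACRED-style data, where the negative class (`no_relation`) is excluded from precision and recall; the abstract does not state which scoring script the authors used, and the function name is illustrative.

```python
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative: str = "no_relation") -> float:
    """Micro-averaged F1 over all non-negative relation labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A prediction counts as correct only if it matches a real relation.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    predicted = sum(1 for p in pred if p != negative)
    actual = sum(1 for g in gold if g != negative)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this convention, predicting `no_relation` everywhere scores 0.0 rather than benefiting from the large negative class.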

[39] Continually Adding New Languages to Multilingual Language Models

Abraham Toluwase Owodunni,Sachin Kumar

Main category: cs.CL

TL;DR: This paper introduces LayRA, a method for efficiently adding new languages to multilingual language models without retraining from scratch or needing original training data.

Details Motivation: Multilingual models are typically trained on a fixed set of languages, making it expensive and often infeasible to add new ones due to the lack of original training data and issues like catastrophic forgetting. Method: Layer-Selective LoRA (LayRA) adds Low-Rank Adapters to selected layers of a pre-trained model, leveraging the insight that different layers handle different parts of the language processing pipeline. Result: LayRA outperforms naive approaches and competes well with existing methods like LoRA, successfully adding new languages (Galician, Swahili, Urdu) while preserving performance on previously supported languages. Conclusion: LayRA is an effective approach for adding new languages to multilingual models, providing a good balance between preserving existing capabilities and learning new languages, even without instruction tuning data. Abstract: Multilingual language models are trained on a fixed set of languages, and to support new languages, the models need to be retrained from scratch. This is an expensive endeavor and is often infeasible, as model developers tend not to release their pre-training data. Naive approaches, such as continued pretraining, suffer from catastrophic forgetting; however, mitigation strategies like experience replay cannot be applied due to the lack of original pretraining data. In this work, we investigate the problem of continually adding new languages to a multilingual model, assuming access to pretraining data in only the target languages. We explore multiple approaches to address this problem and propose Layer-Selective LoRA (LayRA), which adds Low-Rank Adapters (LoRA) to selected initial and final layers while keeping the rest of the model frozen. LayRA builds on two insights: (1) LoRA reduces forgetting, and (2) multilingual models encode inputs in the source language in the initial layers, reason in English in intermediate layers, and translate back to the source language in final layers. 
We experiment with adding multiple combinations of Galician, Swahili, and Urdu to pretrained language models and evaluate each method on diverse multilingual tasks. We find that LayRA provides the overall best tradeoff between preserving models' capabilities in previously supported languages, while being competitive with existing approaches such as LoRA in learning new languages. We also demonstrate that using model arithmetic, the adapted models can be equipped with strong instruction following abilities without access to any instruction tuning data in the target languages.
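The layer-selection idea behind LayRA (adapters on some initial and final layers, everything else frozen) can be illustrated with a small index-selection helper. This is only a sketch: the abstract does not say how many layers are adapted, so `num_initial` and `num_final` are hypothetical parameters, and attaching actual LoRA modules to the chosen layers is left to whichever adapter library is in use.

```python
from typing import List


def select_adapter_layers(num_layers: int, num_initial: int,
                          num_final: int) -> List[int]:
    """Return indices of the layers that would receive LoRA adapters."""
    if num_initial < 0 or num_final < 0:
        raise ValueError("layer counts must be non-negative")
    if num_initial + num_final > num_layers:
        raise ValueError("selected layers exceed model depth")
    initial = list(range(num_initial))                       # first layers
    final = list(range(num_layers - num_final, num_layers))  # last layers
    return initial + final


def frozen_layers(num_layers: int, adapted: List[int]) -> List[int]:
    """Indices of the intermediate layers that stay frozen."""
    adapted_set = set(adapted)
    return [i for i in range(num_layers) if i not in adapted_set]
```

For a 12-layer model with two initial and three final layers adapted, the helper selects layers 0, 1, 9, 10, and 11 and leaves layers 2 through 8 frozen.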

[40] A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm

Gaurab Chhetri,Darrell Anderson,Boniphace Kutela,Subasish Das

Main category: cs.CL

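TL;DR: This study presents the first multi-platform sentiment analysis of public opinion on the 15-minute city concept across Twitter, Reddit, and news media. It uses compressed transformer models with Llama-3-8B for annotation, benchmarks five models, and proposes directions for scalable sentiment classification in urban planning discourse.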

Details Motivation: To better understand public views on the 15-minute city concept and to assess how sentiment analysis performs across platforms, advancing sentiment classification for urban planning discourse. Method: Compressed transformer models are used with Llama-3-8B for annotation; five models (DistilRoBERTa, DistilBERT, MiniLM, ELECTRA, TinyBERT) are benchmarked with stratified 5-fold cross-validation, reporting F1-score, AUC, and training time. Result: DistilRoBERTa achieved the highest F1 (0.8292), TinyBERT the best efficiency, and MiniLM the best cross-platform consistency. News data yields inflated performance due to class imbalance, Reddit suffers from summarization loss, and Twitter offers a moderate challenge. Conclusion: Compressed models perform well on sentiment analysis, challenging the assumption that larger models are necessary; the study identifies platform-specific trade-offs and proposes directions for scalable, real-world sentiment classification. Abstract: This study presents the first multi-platform sentiment analysis of public opinion on the 15-minute city concept across Twitter, Reddit, and news media. Using compressed transformer models and Llama-3-8B for annotation, we classify sentiment across heterogeneous text domains. Our pipeline handles long-form and short-form text, supports consistent annotation, and enables reproducible evaluation. We benchmark five models (DistilRoBERTa, DistilBERT, MiniLM, ELECTRA, TinyBERT) using stratified 5-fold cross-validation, reporting F1-score, AUC, and training time. DistilRoBERTa achieved the highest F1 (0.8292), TinyBERT the best efficiency, and MiniLM the best cross-platform consistency. Results show News data yields inflated performance due to class imbalance, Reddit suffers from summarization loss, and Twitter offers moderate challenge. Compressed models perform competitively, challenging assumptions that larger models are necessary. We identify platform-specific trade-offs and propose directions for scalable, real-world sentiment classification in urban planning discourse.

[41] CognitiveSky: Scalable Sentiment and Narrative Analysis for Decentralized Social Media

Gaurab Chhetri,Anandi Dutta,Subasish Das

Main category: cs.CL

TL;DR: CognitiveSky is an open-source framework for analyzing public discourse on decentralized social media platform Bluesky, using transformer-based models to visualize emotion and conversation trends.

Details Motivation: The emergence of decentralized social media platforms creates new opportunities and challenges for real-time analysis of public discourse, prompting the need for a scalable and open-source framework like CognitiveSky. Method: CognitiveSky uses transformer-based models to analyze data ingested via Bluesky's API, producing structured outputs that drive a dynamic dashboard for visualizing patterns in emotion, activity, and conversation topics. Result: CognitiveSky successfully provides sentiment, emotion, and narrative analysis on Bluesky with low operational cost and high accessibility, demonstrated for mental health discourse monitoring but applicable to various domains. Conclusion: CognitiveSky serves as a transparent and extensible tool that bridges large language models with decentralized networks, offering opportunities for computational social science in evolving digital ecosystems. Abstract: The emergence of decentralized social media platforms presents new opportunities and challenges for real-time analysis of public discourse. This study introduces CognitiveSky, an open-source and scalable framework designed for sentiment, emotion, and narrative analysis on Bluesky, a federated Twitter or X.com alternative. By ingesting data through Bluesky's Application Programming Interface (API), CognitiveSky applies transformer-based models to annotate large-scale user-generated content and produces structured and analyzable outputs. These summaries drive a dynamic dashboard that visualizes evolving patterns in emotion, activity, and conversation topics. Built entirely on free-tier infrastructure, CognitiveSky achieves both low operational cost and high accessibility. While demonstrated here for monitoring mental health discourse, its modular design enables applications across domains such as disinformation detection, crisis response, and civic sentiment analysis. 
By bridging large language models with decentralized networks, CognitiveSky offers a transparent, extensible tool for computational social science in an era of shifting digital ecosystems.

[42] CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi,Raymond Li,Chuyuan Li,Shafiq Joty,Giuseppe Carenini

Main category: cs.CL

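TL;DR: This paper introduces CEMTM, a context-enhanced multimodal topic model that infers coherent and interpretable topic structures from documents containing text and images.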

Details Motivation: CEMTM is designed to infer more coherent and interpretable topic structures from documents that contain both text and images. Method: CEMTM uses fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings and applies a distributional attention mechanism to weight token-level contributions to topic inference. Result: Experiments show that CEMTM outperforms unimodal and multimodal baselines on six multimodal benchmarks, reaching an average LLM score of 2.61. Conclusion: CEMTM is a context-enhanced multimodal topic model that infers coherent and interpretable topic structures from both short and long documents containing text and images. Abstract: We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

[43] Improving LLMs' Learning for Coreference Resolution

Yujian Gan,Yuan Liang,Yanni Lin,Juntao Yu,Massimo Poesio

Main category: cs.CL

TL;DR: This paper proposes two new methods for LLM-based coreference resolution that reduce hallucination and improve performance.

Details Motivation: Existing LLMs suffer from hallucination and under-performance on coreference resolution. Method: Two new techniques are proposed: Reversed Training with Joint Inference, and Iterative Document Generation. Result: Experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Conclusion: Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution. Abstract: Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR, specifically the Question-Answering (QA) Template and Document Template methods, and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.

[44] ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims

Anirban Saha Anik,Md Fahimul Kabir Chowdhury,Andrew Wyckoff,Sagnik Ray Choudhury

Main category: cs.CL

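TL;DR: This paper presents a system for Task 3 of the CLEF 2025 CheckThat! Lab, which verifies numerical and temporal claims using retrieved evidence. It combines zero-shot prompting with parameter-efficient LoRA fine-tuning of large models and examines how evidence selection strategies affect performance.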

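Details Motivation: To improve the accuracy of verifying numerical and temporal factual claims, especially when labeled data is scarce, by exploring LLM performance in zero-shot and few-shot settings and optimizing how evidence is used. Method: Two approaches are used: zero-shot prompting with instruction-tuned LLMs, and parameter-efficient fine-tuning with LoRA. Several evidence selection strategies are compared, including full-document input and top-k sentence filtering with BM25 and MiniLM. Result: On the English validation set, the LoRA-fine-tuned LLaMA model performs best, but its performance drops notably on the test set, exposing limited generalization; evidence granularity is also found to have a significant effect on results. Conclusion: Evidence quality and model adaptation are crucial for numerical fact verification, and future work needs stronger generalization and finer-grained evidence selection. Abstract: This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses on verifying numerical and temporal claims using retrieved evidence. We explore two complementary approaches: zero-shot prompting with instruction-tuned large language models (LLMs) and supervised fine-tuning using parameter-efficient LoRA. To enhance evidence quality, we investigate several selection strategies, including full-document input and top-k sentence filtering using BM25 and MiniLM. Our best-performing model, LLaMA fine-tuned with LoRA, achieves strong performance on the English validation set. However, a notable drop in the test set highlights a generalization challenge. These findings underscore the importance of evidence granularity and model adaptation for robust numerical fact verification.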
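Top-k sentence filtering with BM25, one of the evidence selection strategies mentioned above, can be sketched with the standard library alone. This is a generic Okapi BM25 scorer, not the authors' pipeline: the whitespace tokenization, the defaults `k1=1.5` and `b=0.75`, and the function name `bm25_top_k` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_top_k(claim: str, sentences: List[str], k: int = 3,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Rank evidence sentences against a claim with Okapi BM25; keep the top k."""
    docs = [s.lower().split() for s in sentences]  # naive tokenization
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for d in docs for term in set(d))
    query = claim.lower().split()
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # sorted() is stable, so tied sentences keep their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]
```

Passing only the top-ranked sentences to the verifier shortens the input compared with full-document evidence, at the risk of dropping a sentence that carries the decisive number.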

[45] AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization

Fabrycio Leite Nakano Almada,Kauan Divino Pouso Mariano,Maykon Adriell Dutra,Victor Emanuel da Silva Monteiro,Juliana Resplande Sant'Anna Gomes,Arlindo Rodrigues Galvão Filho,Anderson da Silva Soares

Main category: cs.CL

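TL;DR: This paper presents a claim normalization approach that performs well on both supervised and zero-shot languages, and publicly releases all implementation artifacts.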

Details Motivation: Transforming informal social media posts into concise, self-contained statements is a crucial step in automated fact-checking pipelines. Method: Fine-tuned Small Language Models (SLMs) are used for supervised languages, and Large Language Model (LLM) prompting is used for zero-shot scenarios. Result: The system reached the top three in 15 of 20 languages, including second place in 8 languages, 5 of which were zero-shot languages. For Portuguese, it achieved an average METEOR score of 0.5290, ranking third. Conclusion: The approach reached the top three in 15 of the 20 languages and performed strongly on zero-shot languages, demonstrating the effectiveness of the LLM-based zero-shot strategy. Abstract: Claim normalization, the transformation of informal social media posts into concise, self-contained statements, is a crucial step in automated fact-checking pipelines. This paper details our submission to the CLEF-2025 CheckThat! Task 2, which challenges systems to perform claim normalization across twenty languages, divided into thirteen supervised (high-resource) and seven zero-shot (no training data) tracks. Our approach, leveraging fine-tuned Small Language Models (SLMs) for supervised languages and Large Language Model (LLM) prompting for zero-shot scenarios, achieved podium positions (top three) in fifteen of the twenty languages. Notably, this included second-place rankings in eight languages, five of which were among the seven designated zero-shot languages, underscoring the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our initial development language, our system achieved an average METEOR score of 0.5290, ranking third. All implementation artifacts, including inference, training, evaluation scripts, and prompt configurations, are publicly available at https://github.com/ju-resplande/checkthat2025_normalization.

[46] DeDisCo at the DISRPT 2025 Shared Task: A System for Discourse Relation Classification

Zhuoxuan Ju,Jingni Wu,Abhishek Purushothama,Amir Zeldes

Main category: cs.CL

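TL;DR: This paper presents DeDisCo, Georgetown University's system for the DISRPT 2025 shared task on discourse relation classification. It compares an mt5-based encoder approach with a Qwen-based decoder approach and experiments with automatically translated augmentation data and additional linguistic features.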

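Details Motivation: To improve discourse relation classification in low-resource language settings and to explore the effectiveness of different model architectures and data augmentation methods. Method: Two architectures are used, an mt5-based encoder and a Qwen-based decoder, trained with augmented data produced by automatic translation and with additional linguistic features. Result: The system achieves a macro-accuracy score of 71.28 on the DISRPT 2025 task, accompanied by interpretation and error analysis of the results. Conclusion: The proposed DeDisCo system is competitive on multilingual discourse relation classification, and data augmentation and linguistic features offer some benefit for low-resource languages. Abstract: This paper presents DeDisCo, Georgetown University's entry in the DISRPT 2025 shared task on discourse relation classification. We test two approaches, using an mt5-based encoder and a decoder-based approach using the openly available Qwen model. We also experiment on training with an augmented dataset for low-resource languages using matched data translated automatically from English, as well as using some additional linguistic features inspired by entries in previous editions of the Shared Task. Our system achieves a macro-accuracy score of 71.28, and we provide some interpretation and error analysis for our results.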

[47] Unsupervised Candidate Ranking for Lexical Substitution via Holistic Sentence Semantics

Zhongyang Hu,Naijie Gu,Xiangzhi Tao,Tianhui Gu,Yibing Zhou

Main category: cs.CL

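TL;DR: This paper proposes two methods, based on attention weights and on integrated gradients, that measure the influence of context on the target word and combine it with sentence-level semantic similarity to improve candidate ranking in lexical substitution.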

Details Motivation: Existing methods fall short in modeling the bidirectional influence of candidate substitution on the target word and its context, struggle to characterize semantic change accurately, and often rely on tuning parameters over multiple metrics. Method: Two approaches are proposed, one based on attention weights and the other on the more interpretable integrated gradients method. Both measure the influence of context tokens on the target token and rank candidates by incorporating the semantic similarity between the original and substituted sentences. Result: Experiments on the LS07 and SWORDS datasets show that both approaches improve candidate ranking performance. Conclusion: Quantifying contextual influence with attention weights or integrated gradients and combining it with semantic similarity effectively improves ranking in lexical substitution, outperforming traditional approaches that model semantic change only at the target position. Abstract: A key subtask in lexical substitution is ranking the given candidate words. A common approach is to replace the target word with a candidate in the original sentence and feed the modified sentence into a model to capture semantic differences before and after substitution. However, effectively modeling the bidirectional influence of candidate substitution on both the target word and its context remains challenging. Existing methods often focus solely on semantic changes at the target position or rely on parameter tuning over multiple evaluation metrics, making it difficult to accurately characterize semantic variation. To address this, we investigate two approaches: one based on attention weights and another leveraging the more interpretable integrated gradients method, both designed to measure the influence of context tokens on the target token and to rank candidates by incorporating semantic similarity between the original and substituted sentences. Experiments on the LS07 and SWORDS datasets demonstrate that both approaches improve ranking performance.

[48] LVLMs are Bad at Overhearing Human Referential Communication

Zhengxiang Wang,Weiling Li,Panagiotis Kaliosis,Owen Rambow,Susan E. Brennan

Main category: cs.CL

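TL;DR: This paper studies how well seven state-of-the-art vision-language models understand referring expressions in spontaneous conversation, and finds that their performance does not improve consistently as they overhear more conversations.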

Details Motivation: For an embodied agent to carry out tasks in the real world, it must understand referring expressions, which requires integrating language, vision, and conversational interaction. Method: Seven state-of-the-art Large Vision Language Models (LVLMs) are evaluated as overhearers on a corpus of spontaneous conversations. Result: Current LVLMs show no consistent performance improvement across multiple rounds of conversation in which the same task is repeated. Conclusion: Understanding referring expressions in spontaneous conversation over repeated rounds remains challenging for current LVLMs, and none of them improves consistently. Abstract: During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

[49] PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

Rodrigo M. Carrillo-Larco,Jesus Lovón Melgarejo,Manuel Castillo-Cara,Gusseppe Bravo-Rocca

Main category: cs.CL

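TL;DR: This study builds PeruMedQA, a dataset of 8,380 Spanish-language multiple-choice medical questions, to evaluate and fine-tune medical large language models on Peruvian specialty training exams. medgemma-27b-text-it exceeds 90% accuracy in several instances, while models with fewer than 10 billion parameters generally score below 60%. A LoRA fine-tuned medgemma-4b-it improves markedly, beating the other small models and rivaling a 70-billion-parameter model.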

Details Motivation: To explore how well medical large language models handle Spanish-language medical questions from Latin America, filling a gap in localized evaluation and application of medical AI in the region. Method: The authors build the PeruMedQA dataset (8,380 questions, 12 domains), test eight medical LLMs zero-shot, and fine-tune medgemma-4b-it with PEFT and LoRA, holding out the 2025 questions as the test set. Result: medgemma-27b-text-it exceeds 90% accuracy in several tests; models with fewer than 10 billion parameters generally score below 60%, and on some exams below 50%; the fine-tuned medgemma-4b-it beats all models with fewer than 10 billion parameters and approaches a 70-billion-parameter model. Conclusion: For medical AI applications and research in Spanish-speaking countries and in regions with epidemiological profiles similar to Peru's, the authors recommend medgemma-27b-text-it or a fine-tuned medgemma-4b-it. Abstract: BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters exhibited <60% of correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations.
CONCLUSIONS: For medical AI application and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
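For readers unfamiliar with low-rank adaptation, the core arithmetic can be sketched in a few lines. This is a generic illustration of LoRA, not the authors' training setup: the 4x4 weight, the rank `r = 1`, and `alpha` are illustrative values chosen here, and no hyperparameters are given in the abstract.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_merge(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank (rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_trainable_params(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in); W itself stays frozen."""
    return r * (d_out + d_in)

# Frozen 4x4 base weight (identity, for readability).
w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Rank-1 adapter. In practice B typically starts at zero; the values here are
# illustrative "after training" numbers.
b = [[0.5], [0.0], [0.0], [0.0]]   # shape 4 x 1
a = [[0.0, 2.0, 0.0, 0.0]]         # shape 1 x 4
merged = lora_merge(w, b, a, alpha=1.0)
```

Only `b` and `a` would be trained, so the adapter here has 8 trainable values against 16 in the full matrix; the saving grows quickly with layer width.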

[50] On the Distinctive Co-occurrence Characteristics of Antonymy

Zhihan Cao,Hiroaki Yamada,Takenobu Tokunaga

Main category: cs.CL

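TL;DR: Using robust co-occurrence metrics, this study compares the co-occurrence patterns of antonymy with those of three other semantic relations across parts of speech, and finds that antonym pairs have three distinctive traits: high co-occurrence strength, a preferred linear order, and co-occurrence within short spans.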

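Details Motivation: Antonymy has received much attention in lexical semantics. Prior work shows that antonym pairs co-occur frequently in text, but because comparisons with other semantic relations are lacking, it is unclear whether this pattern is specific to antonymy. Method: Antonymy is compared with three other semantic relations across parts of speech using robust co-occurrence metrics. Result: Antonymy is distinctive in three respects: antonym pairs co-occur with high strength, in a preferred linear order, and within short text spans. All results are available online. Conclusion: The co-occurrence pattern of antonyms is distinctive among semantic relations and has quantifiable characteristics, which helps further the understanding of lexical semantic structure. Abstract: Antonymy has long received particular attention in lexical semantics. Previous studies have shown that antonym pairs frequently co-occur in text, across genres and parts of speech, more often than would be expected by chance. However, whether this co-occurrence pattern is distinctive of antonymy remains unclear, due to a lack of comparison with other semantic relations. This work fills the gap by comparing antonymy with three other relations across parts of speech using robust co-occurrence metrics. We find that antonymy is distinctive in three respects: antonym pairs co-occur with high strength, in a preferred linear order, and within short spans. All results are available online.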
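The three properties named above (strength, order, span) can each be measured with simple sentence-level statistics. The sketch below uses pointwise mutual information as the strength measure, which is an assumption made here: the abstract does not say which "robust co-occurrence metrics" the authors use, and the toy corpus is invented.

```python
import math

def cooccurrence_stats(sentences, w1, w2):
    """Return (PMI, share of co-occurrences with w1 before w2, mean token gap).

    Assumes the pair co-occurs at least once and each word appears at most
    once per sentence.
    """
    n = len(sentences)
    p1 = sum(w1 in s for s in sentences) / n
    p2 = sum(w2 in s for s in sentences) / n
    both = [s for s in sentences if w1 in s and w2 in s]
    pmi = math.log2((len(both) / n) / (p1 * p2))
    gaps = [s.index(w2) - s.index(w1) for s in both]
    order_ratio = sum(g > 0 for g in gaps) / len(gaps)
    mean_span = sum(abs(g) for g in gaps) / len(gaps)
    return pmi, order_ratio, mean_span

# Invented toy corpus of tokenized sentences.
corpus = [
    ["hot", "and", "cold", "water"],
    ["neither", "hot", "nor", "cold"],
    ["cold", "or", "hot", "drinks"],
    ["a", "hot", "day"],
    ["the", "river", "runs"],
    ["birds", "sing"],
    ["rain", "fell"],
    ["a", "quiet", "room"],
]
pmi, order_ratio, mean_span = cooccurrence_stats(corpus, "hot", "cold")
```

A positive PMI means the pair co-occurs more often than chance, an order ratio away from 0.5 indicates a preferred order, and a small mean gap indicates co-occurrence within short spans.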

[51] HARP: Hallucination Detection via Reasoning Subspace Projection

Junjie Hu,Gang Tu,ShengYu Cheng,Jinxin Li,Jinting Wang,Rui Chen,Zhilong Zhou,Dongbo Shan

Main category: cs.CL

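TL;DR: HARP is a new method that detects hallucinations in large language models by separating the semantic and reasoning subspaces, improving both detection performance and robustness.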

Details Motivation: Existing hallucination detection methods struggle to disentangle semantic and reasoning information while remaining robust. Method: The hidden state space is decomposed into semantic and reasoning subspaces by applying SVD to the parameters of the Unembedding layer, and hidden states are projected onto the reasoning subspace for hallucination detection. Result: HARP achieves state-of-the-art hallucination detection performance on multiple datasets, notably an AUROC of 92.8% on TriviaQA, 7.5% above the previous best method. Conclusion: HARP is a new hallucination detection framework that detects hallucinations efficiently through reasoning subspace projection. Abstract: Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.

[52] HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Wensheng Lu,Keyu Chen,Ruizhi Qiao,Xing Sun

Main category: cs.CL

TL;DR: This paper proposes HiCBench and HiChunk for evaluating and improving document chunking quality in RAG systems.

Details Motivation: Existing RAG evaluation benchmarks cannot effectively assess document chunking quality, mainly because of evidence sparsity. Method: The paper analyzes the shortcomings of existing benchmarks, builds a new evaluation benchmark, HiCBench, and proposes HiChunk, a multi-level document structuring framework based on fine-tuned LLMs and combined with the Auto-Merge retrieval algorithm. Result: Experiments show that HiCBench effectively evaluates the impact of different chunking methods on the RAG pipeline, while HiChunk achieves better chunking quality within reasonable time consumption and improves the overall performance of RAG systems. Conclusion: HiCBench and HiChunk address the evaluation and optimization of document chunking quality in RAG systems. Abstract: Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of a RAG system, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality within reasonable time consumption, thereby enhancing the overall performance of RAG systems.
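The abstract does not spell out how Auto-Merge works, so the following is only one plausible reading of auto-merging over a chunk hierarchy: when more than a threshold share of a parent's children has been retrieved, the children are replaced by the parent. The function name, the `threshold` parameter, and the toy hierarchy are all assumptions made for illustration.

```python
def auto_merge(retrieved, children_of, threshold=0.5):
    """Replace retrieved sibling chunks by their parent when enough were hit.

    `children_of` maps each parent node to its list of child nodes
    (hypothetical structure; the paper's actual merge rule may differ).
    """
    merged = set(retrieved)
    changed = True
    while changed:  # repeat so that merges can propagate up the hierarchy
        changed = False
        for parent, children in children_of.items():
            hits = [c for c in children if c in merged]
            if parent not in merged and len(hits) / len(children) > threshold:
                merged.difference_update(children)
                merged.add(parent)
                changed = True
    return sorted(merged)

# Toy two-level hierarchy: a document with two sections and five leaf chunks.
children_of = {
    "sec1": ["c1", "c2", "c3"],
    "sec2": ["c4", "c5"],
    "doc": ["sec1", "sec2"],
}
result = auto_merge({"c1", "c2", "c4"}, children_of)
```

Here two of the three chunks under `sec1` were retrieved, so they collapse into `sec1`, while the single hit under `sec2` stays a leaf chunk.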

[53] D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs

Yue Ding,Xiaofang Zhu,Tianze Xia,Junfei Wu,Xinlong Chen,Qiang Liu,Liang Wang

Main category: cs.CL

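TL;DR: The paper proposes D²HScore, a training-free and label-free hallucination detection framework that measures the intra-layer dispersion and inter-layer drift of token representations during LLM generation.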

Details Motivation: The practical use of large language models (LLMs) is often limited by non-factual content ("hallucination"), and ensuring output reliability is a key challenge in high-stakes domains such as finance, security, and healthcare. Existing methods mostly depend on training or labeled data and make little use of model architecture and generation dynamics. Method: Building on the multi-layer structure and autoregressive decoding of LLMs, hallucination signals are decomposed into two dimensions: the semantic breadth of token representations within each layer (Intra-Layer Dispersion) and the semantic depth of core concept representations as they evolve across layers (Inter-Layer Drift). Attention signals are used to select key tokens, and the D²HScore framework jointly models dispersion and drift to detect hallucinations without training or labels. Result: Experiments on five open-source LLMs and five mainstream benchmarks show that D²HScore consistently outperforms existing training-free baselines and offers good generalization and interpretability. Conclusion: By capturing the horizontal and vertical dynamics of representations during inference, D²HScore provides a lightweight, interpretable, and effective proxy for hallucination detection, suggesting a new direction for reliability assessment based on a model's internal mechanisms. Abstract: Although Large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called "hallucination". Ensuring the reliability of LLMs' outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare. In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. Based on this insight, we propose D$^2$HScore (Dispersion and Drift-based Hallucination Score), a training-free and label-free framework that jointly measures: (1) Intra-Layer Dispersion, which quantifies the semantic diversity of token representations within each layer; and (2) Inter-Layer Drift, which tracks the progressive transformation of key token representations across layers. To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals. By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection.
Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
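The two quantities can be illustrated with plain Euclidean geometry. The formulas below are assumptions chosen for readability (mean distance to the layer centroid for dispersion, mean layer-to-layer displacement of key tokens for drift, top-k attention for token selection); the abstract does not give the exact definitions or how the two signals are combined, so no combined score is computed here.

```python
import math

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def intra_layer_dispersion(layer):
    """Mean distance of token vectors to their centroid within one layer."""
    centroid = [sum(col) / len(layer) for col in zip(*layer)]
    return sum(dist(v, centroid) for v in layer) / len(layer)

def select_key_tokens(attention, k):
    """Indices of the k tokens with the highest attention mass."""
    order = sorted(range(len(attention)), key=lambda t: attention[t], reverse=True)
    return order[:k]

def inter_layer_drift(layers, key_tokens):
    """Mean displacement of key-token vectors between consecutive layers."""
    steps = []
    for lower, upper in zip(layers, layers[1:]):
        moved = [dist(lower[t], upper[t]) for t in key_tokens]
        steps.append(sum(moved) / len(moved))
    return sum(steps) / len(steps)

# Toy hidden states indexed as hidden[layer][token] -> vector (invented numbers).
hidden = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [2.0, 3.0]],
]
attention = [0.9, 0.1]  # hypothetical per-token attention mass

keys = select_key_tokens(attention, k=1)
dispersion = [intra_layer_dispersion(layer) for layer in hidden]
drift = inter_layer_drift(hidden, keys)
```

Dispersion looks across tokens within a layer (the "horizontal" view), while drift follows selected tokens up through the layers (the "vertical" view).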

[54] Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia -- Current Stage and Challenges

Sampoorna Poria,Xiaolei Huang

Main category: cs.CL

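TL;DR: This survey reviews the current state and challenges of South Asian languages in natural language processing, focusing on the progress and gaps of Transformer-based models in terms of data, models, and tasks.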

Details Motivation: South Asia has more than 650 languages, but most low-resource languages lack sufficient computational resources and language model support, so an assessment of the current stage is needed to advance research. Method: By retrieving studies published since 2020, the survey systematically reviews the application of Transformer-based models (such as BERT, T5, and GPT) to South Asian languages and analyzes achievements and gaps across three key aspects: data, models, and tasks. Result: Key issues are identified, including missing data in critical domains (such as health), widespread code-mixing, and a lack of standardized evaluation benchmarks. Conclusion: The authors call on the NLP community to strengthen data curation for South Asian languages and to establish unified benchmarks suited to their cultural and linguistic characteristics, promoting equitable representation and technical development for these languages. Abstract: Rapid developments of large language models have revolutionized many NLP tasks for English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages? In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020, with a focus on transformer-based models, such as BERT, T5, and GPT. We present advances and gaps across 3 essential aspects: data, models, and tasks, such as available data sources, fine-tuning strategies, and domain applications. Our findings highlight substantial issues, including missing data in critical domains (e.g., health), code-mixing, and lack of standardized evaluation benchmarks. Our survey aims to raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages. The complete list of resources is available at: https://github.com/trust-nlp/LM4SouthAsia-Survey.

[55] Analyzing Information-Seeking Behaviors in a Hakka AI Chatbot: A Cognitive-Pragmatic Study

Chu-Hsuan Lee,Chen-Chi Chang,Hung-Shin Lee,Yun-Hsiang Hsu,Ching-Yuan Chen

Main category: cs.CL

TL;DR: This study analyzes user behaviors in TALKA, a generative AI chatbot designed for Hakka language engagement.

Details Motivation: With many endangered languages at risk of disappearing, preservation efforts rely more than ever on combining technology with culturally informed teaching strategies. The study explores the potential of generative AI chatbots for language preservation and educational practice. Method: A dual-layered analytical framework, grounded in Bloom's Taxonomy of cognitive processes and dialogue act categorization, is applied to 7,077 user utterances from TALKA, each annotated according to six cognitive levels and eleven dialogue act types. Result: The results suggest that generative AI chatbots can support language learning in meaningful ways and may help learners express themselves more confidently and connect with their cultural identity. The study also offers empirical insights into how AI-mediated dialogue facilitates cognitive development, pragmatic negotiation, and socio-cultural affiliation among low-resource language learners. Conclusion: Generative AI chatbots can support language learning in meaningful ways, especially when their design reflects how users think and communicate, and they may help learners express themselves more confidently and connect with their cultural identity. Abstract: With many endangered languages at risk of disappearing, efforts to preserve them now rely more than ever on using technology alongside culturally informed teaching strategies. This study examines user behaviors in TALKA, a generative AI-powered chatbot designed for Hakka language engagement, by employing a dual-layered analytical framework grounded in Bloom's Taxonomy of cognitive processes and dialogue act categorization. We analyzed 7,077 user utterances, each carefully annotated according to six cognitive levels and eleven dialogue act types. These included a variety of functions, such as asking for information, requesting translations, making cultural inquiries, and using language creatively. Pragmatic classifications further highlight how different types of dialogue acts--such as feedback, control commands, and social greetings--align with specific cognitive intentions. The results suggest that generative AI chatbots can support language learning in meaningful ways--especially when they are designed with an understanding of how users think and communicate. They may also help learners express themselves more confidently and connect with their cultural identity. The TALKA case provides empirical insights into how AI-mediated dialogue facilitates cognitive development in low-resource language learners, as well as pragmatic negotiation and socio-cultural affiliation. By focusing on AI-assisted language learning, this study offers new insights into how technology can support language preservation and educational practice.

[56] Dynamic Span Interaction and Graph-Aware Memory for Entity-Level Sentiment Classification

Md. Mithun Hossain,Sanjara,Md. Shakil Hossain,Sudipto Chaki

Main category: cs.CL

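TL;DR: This paper proposes SpanEIT, a novel framework that combines dynamic span interaction with a graph-aware memory mechanism to improve entity-level sentiment classification.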

Details Motivation: Entity-level sentiment classification faces several challenges: modeling the complex interactions between entities and their surrounding sentiment expressions, capturing dependencies that span sentences, and keeping sentiment predictions consistent across multiple mentions of the same entity through coreference resolution. Method: SpanEIT builds span-based representations of entities and sentiment phrases, uses bidirectional attention and a graph attention network to capture syntactic and co-occurrence relations, and ensures entity-level consistency with a coreference-aware memory module. Result: Experiments show that SpanEIT outperforms state-of-the-art transformer and hybrid baselines in accuracy and F1 score on the FSAD, BARU, and IMDB datasets. Conclusion: By combining dynamic span interaction with graph-aware memory, SpanEIT performs strongly on entity-level sentiment classification, demonstrating its potential for fine-grained sentiment analysis. Abstract: Entity-level sentiment classification involves identifying the sentiment polarity linked to specific entities within text. This task poses several challenges: effectively modeling the subtle and complex interactions between entities and their surrounding sentiment expressions; capturing dependencies that may span across sentences; and ensuring consistent sentiment predictions for multiple mentions of the same entity through coreference resolution. Additionally, linguistic phenomena such as negation, ambiguity, and overlapping opinions further complicate the analysis. These complexities make entity-level sentiment classification a difficult problem, especially in real-world, noisy textual data. To address these issues, we propose SpanEIT, a novel framework integrating dynamic span interaction and graph-aware memory mechanisms for enhanced entity-sentiment relational modeling. SpanEIT builds span-based representations for entities and candidate sentiment phrases, employs bidirectional attention for fine-grained interactions, and uses a graph attention network to capture syntactic and co-occurrence relations. A coreference-aware memory module ensures entity-level consistency across documents. Experiments on FSAD, BARU, and IMDB datasets show SpanEIT outperforms state-of-the-art transformer and hybrid baselines in accuracy and F1 scores. Ablation and interpretability analyses validate the effectiveness of our approach, underscoring its potential for fine-grained sentiment analysis in applications like social media monitoring and customer feedback analysis.

[57] HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems

Spandan Anaokar,Shrey Ganatra,Harshvivek Kashid,Swapnil Bhattacharyya,Shruti Nair,Reshma Sekhar,Siddharth Manohar,Rahul Hemrajani,Pushpak Bhattacharyya

Main category: cs.CL

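TL;DR: This paper proposes HalluDetect, an LLM-based hallucination detection system, and evaluates five chatbot architectures, finding that AgentBot is best at reducing hallucinations, with only 0.4159 hallucinations per turn while maintaining a high token accuracy of 96.13%.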

Details Motivation: Large language models (LLMs) are widely used in industry but are prone to hallucinations, which undermines their reliability in critical applications, especially in high-risk settings such as consumer grievance chatbots. Effective hallucination mitigation strategies are therefore needed. Method: The authors develop HalluDetect, an LLM-based hallucination detection system, and benchmark five chatbot architectures built on LLaMA 3.1 8B Instruct, evaluating them on hallucination frequency and token accuracy. Result: HalluDetect reaches an F1 score of 69%, 25.44% above baseline detectors. Among the five architectures, AgentBot reduces hallucinations to 0.4159 per turn with a token accuracy of 96.13%, making it the best option. Conclusion: Optimized inference strategies can significantly improve factual accuracy. The proposed framework is scalable and applies beyond consumer law to other high-risk domains, strengthening trust in LLM-driven assistants. Abstract: Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications. This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry. We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 69%, outperforming baseline detectors by 25.44%. Benchmarking five chatbot architectures, we find that AgentBot minimizes hallucinations to 0.4159 per turn while maintaining the highest token accuracy (96.13%), making it the most effective mitigation strategy. Our findings provide a scalable framework for hallucination mitigation, demonstrating that optimized inference strategies can significantly improve factual accuracy. While applied to consumer law, our approach generalizes to other high-risk domains, enhancing trust in LLM-driven assistants. We will release the code and dataset.

[58] AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment

Kun Li,Lai-Man Po,Hongzheng Yang,Xuyuan Xu,Kangcheng Liu,Yuzhi Zhao

Main category: cs.CL

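TL;DR: The paper proposes AesBiasBench, a benchmark for evaluating stereotype bias and alignment with human aesthetics in multimodal large language models for personalized image aesthetic assessment. It finds that smaller models are more biased, larger models align more closely with human preferences, and adding identity information can exacerbate bias.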

Details Motivation: Existing MLLMs used for personalized image aesthetic assessment may carry subtle biases driven by demographic factors such as gender, age, and education, and no systematic evaluation framework exists. Method: The authors build the AesBiasBench benchmark, covering three subtasks (aesthetic perception, assessment, and empathy), and propose metrics such as IFD, NRD, and AAS to evaluate 19 MLLMs along two dimensions: stereotype bias and alignment with human preferences. Result: Smaller models exhibit stronger stereotype bias, while larger models align better with human aesthetic preferences; adding identity information often exacerbates bias, particularly in emotional judgments. Conclusion: Subjective vision-language tasks call for identity-aware evaluation frameworks to improve model fairness and alignment with humans. Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.

[59] EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI

Sai Kartheek Reddy Kasu

Main category: cs.CL

TL;DR: The paper introduces EthicsMH, a dataset of 125 ethically complex mental health scenarios, aiming to improve AI's ethical reasoning in sensitive domains.

Details Motivation: To address the lack of benchmarks capturing ethical dilemmas in mental health, where existing AI ethics datasets fall short in evaluating therapeutic and psychiatric decision-making. Method: Development of a structured dataset with 125 scenarios involving ethical dilemmas in mental health contexts, incorporating multiple decision options, expert reasoning, and stakeholder perspectives. Result: Creation of the EthicsMH dataset, which enables evaluation of AI systems on decision accuracy, explanation quality, and alignment with professional mental health norms. Conclusion: The EthicsMH dataset is a pilot framework for evaluating AI's ethical reasoning in mental health, aiming to be expanded by community contributions for responsible AI development. Abstract: The deployment of large language models (LLMs) in mental health and other sensitive domains raises urgent questions about ethical reasoning, fairness, and responsible alignment. Yet, existing benchmarks for moral and clinical decision-making do not adequately capture the unique ethical dilemmas encountered in mental health practice, where confidentiality, autonomy, beneficence, and bias frequently intersect. To address this gap, we introduce Ethical Reasoning in Mental Health (EthicsMH), a pilot dataset of 125 scenarios designed to evaluate how AI systems navigate ethically charged situations in therapeutic and psychiatric contexts. Each scenario is enriched with structured fields, including multiple decision options, expert-aligned reasoning, expected model behavior, real-world impact, and multi-stakeholder viewpoints. This structure enables evaluation not only of decision accuracy but also of explanation quality and alignment with professional norms. Although modest in scale and developed with model-assisted generation, EthicsMH establishes a task framework that bridges AI ethics and mental health decision-making. 
By releasing this dataset, we aim to provide a seed resource that can be expanded through community and expert contributions, fostering the development of AI systems capable of responsibly handling some of society's most delicate decisions.

[60] A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection

Di Jin,Jun Yang,Xiaobao Wang,Junwei Zhang,Shuqi Li,Dongxiao He

Main category: cs.CL

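TL;DR: The paper proposes DYNAMO, a fake news detection model driven by dynamic knowledge updates. It combines a knowledge graph with large language models, verifies news content step by step using Monte Carlo Tree Search, and continuously updates verified knowledge, improving detection performance.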

Details Motivation: News events develop quickly and involve complex information, so traditional fake news detection methods struggle to obtain the latest information, and retrieved content suffers from low credibility and heavy noise. A method is needed that updates knowledge dynamically and ensures the new knowledge is authentic. Method: The authors construct a news-domain-specific knowledge graph, use Monte Carlo Tree Search to decompose complex news and verify it step by step, combine this with large language models to judge news authenticity and verify the correctness of new knowledge, and extract and update knowledge from verified real news. Result: Experiments on two real-world datasets show that DYNAMO outperforms existing methods on fake news detection and achieves the best performance. Conclusion: Through dynamic knowledge updating and a dual verification mechanism, DYNAMO addresses the low credibility of retrieved content and insufficient semantic mining, improving the accuracy and timeliness of fake news detection. Abstract: As the Internet and social media evolve rapidly, distinguishing credible news from a vast amount of complex information poses a significant challenge. Due to the suddenness and instability of news events, the authenticity labels of news can potentially shift as events develop, making it crucial for fake news detection to obtain the latest event updates. Existing methods employ retrieval-augmented generation to fill knowledge gaps, but they suffer from issues such as insufficient credibility of retrieved content and interference from noisy information. We propose a dynamic knowledge update-driven model for fake news detection (DYNAMO), which leverages knowledge graphs to achieve continuous updating of new knowledge and integrates with large language models to fulfill dual functions: news authenticity detection and verification of new knowledge correctness, solving the two key problems of ensuring the authenticity of new knowledge and deeply mining news semantics. Specifically, we first construct a news-domain-specific knowledge graph. Then, we use Monte Carlo Tree Search to decompose complex news and verify them step by step. Finally, we extract and update new knowledge from verified real news texts and reasoning paths. Experimental results demonstrate that DYNAMO achieves the best performance on two real-world datasets.

[61] CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model

Wei-Hsin Yeh,Yu-An Su,Chih-Ning Chen,Yi-Hsueh Lin,Calvin Ku,Wen-Hsin Chiu,Min-Chun Hu,Lun-Wei Ku

Main category: cs.CL

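TL;DR: This paper proposes CoachMe, a reference-based motion instruction model that analyzes the temporal and physical differences between a learner's motion and a reference motion to generate precise, sport-specific suggestions for improvement. The model identifies movement errors and explains how to correct them, and it substantially outperforms GPT-4o on figure skating and boxing.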

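Details Motivation: Multimodal models have advanced motion understanding, but generating precise, sport-specific instruction remains difficult because sports are highly domain-specific and feedback must be informative. A model is needed that combines domain knowledge with a coach-like thinking process to give effective guidance. Method: CoachMe takes a reference-based approach, analyzing the differences between a learner's motion and a reference motion along temporal and physical dimensions. The model first learns domain knowledge from general movements and then adapts to specific sports (such as skating and boxing) with limited data, emulating a coach's judgment to identify errors and generate instructions that include concrete improvements. Result: Experiments show that CoachMe outperforms GPT-4o in G-Eval by 31.6% on figure skating and 58.3% on boxing, and that its instructions are not only professional in tone but also contain the critical corrective information. Analysis shows that the model describes errors and their improvement methods in detail, giving high-quality, informative feedback. Conclusion: By combining difference analysis against a reference motion with coach-like reasoning, CoachMe generates high-quality sport-specific instruction from little data, clearly outperforms existing large models, and shows practical potential for sports training. Abstract: Motion instruction is a crucial task that helps athletes refine their technique by analyzing movements and providing corrective guidance. Although recent advances in multimodal models have improved motion understanding, generating precise and sport-specific instruction remains challenging due to the highly domain-specific nature of sports and the need for informative guidance. We propose CoachMe, a reference-based model that analyzes the differences between a learner's motion and a reference under temporal and physical aspects. This approach enables both domain-knowledge learning and the acquisition of a coach-like thinking process that identifies movement errors effectively and provides feedback to explain how to improve. In this paper, we illustrate how CoachMe adapts well to specific sports such as skating and boxing by learning from general movements and then leveraging limited data. Experiments show that CoachMe provides high-quality instructions rather than directions that merely adopt the tone of a coach while lacking critical information. CoachMe outperforms GPT-4o by 31.6% in G-Eval on figure skating and by 58.3% on boxing. Analysis further confirms that it elaborates on errors and their corresponding improvement methods in the generated instructions. You can find CoachMe here: https://motionxperts.github.io/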

[62] Room acoustics affect communicative success in hybrid meeting spaces: a pilot study

Robert Einig,Stefan Janscha,Jonas Schuster,Julian Koch,Martin Hagmueller,Barbara Schuppler

Main category: cs.CL

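TL;DR: This study examines how acoustic interventions in a seminar room at Graz University of Technology affect communication in hybrid meetings. Although the small sample did not reach statistical significance, the results indicate that acoustic improvements help communication quality.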

Details Motivation: Hybrid meeting spaces have become common since the COVID-19 pandemic, but network connectivity is usually prioritized over acoustic design, leading to reduced speech intelligibility and communication fatigue. The practical effect of acoustic optimization on hybrid meetings therefore needs study. Method: Two groups were each recorded twice in the seminar room, once before and once after the acoustic intervention, and communicative success was compared between the two conditions. Result: Preliminary results show that communication in hybrid meetings improved after the acoustic intervention, notably through reduced reverberation and better speech intelligibility, but the small sample means the results are not statistically significant. Conclusion: Sound room acoustic design effectively supports communication quality in hybrid meetings, and the acoustic environment should be taken into account when building hybrid meeting spaces. Abstract: Since the COVID-19 pandemic in 2020, universities and companies have increasingly integrated hybrid features into their meeting spaces, or even created dedicated rooms for this purpose. While the importance of a fast and stable internet connection is often prioritized, the acoustic design of seminar rooms is frequently overlooked. Poor acoustics, particularly excessive reverberation, can lead to issues such as misunderstandings, reduced speech intelligibility or cognitive and vocal fatigue. This pilot study investigates whether room acoustic interventions in a seminar room at Graz University of Technology support better communication in hybrid meetings. For this purpose, we recorded two groups of persons twice, once before and once after improving the acoustics of the room. Our findings, despite not reaching statistical significance due to the small sample size, indicate clearly that our spatial interventions improve communicative success in hybrid meetings. To make the paper accessible also for readers from the speech communication community, we explain the room acoustics background relevant to the interpretation of our results.

[63] An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents

Gaye Colakoglu,Gürkan Solmaz,Jonathan Fürst

Main category: cs.CL

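TL;DR: The paper proposes a domain-specific, stateful agentic system with a planner-executor-responder architecture that dynamically orchestrates tools to robustly extract key information from DoP documents in diverse formats and languages.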

Details Motivation: Declaration of Performance (DoP) documents for construction products, mandated by EU regulation, vary widely in layout, language, schema, and format, so existing static or LLM-only information extraction methods tend to hallucinate and adapt poorly. Method: A domain-specific, stateful agentic system with a planner-executor-responder architecture infers user intent, detects document modality, and dynamically orchestrates tools for traceable reasoning while avoiding tool misuse and execution loops. Result: Evaluation on a curated DoP dataset shows improved robustness across formats and languages, supporting automated key-value pair extraction and question answering. Conclusion: The proposed agentic architecture offers a scalable and robust solution for structured data extraction in regulated settings. Abstract: Declaration of Performance (DoP) documents, mandated by EU regulation, certify the performance of construction products. While some of their content is standardized, DoPs vary widely in layout, language, schema, and format, posing challenges for automated key-value pair extraction (KVP) and question answering (QA). Existing static or LLM-only IE pipelines often hallucinate and fail to adapt to this structural diversity. Our domain-specific, stateful agentic system addresses these challenges through a planner-executor-responder architecture. The system infers user intent, detects document modality, and orchestrates tools dynamically for robust, traceable reasoning while avoiding tool misuse or execution loops. Evaluation on a curated DoP dataset demonstrates improved robustness across formats and languages, offering a scalable solution for structured data extraction in regulated workflows.
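The planner-executor-responder split can be pictured with a very small control loop. Everything below is a hypothetical sketch: the tool names (`ocr`, `extract_kvp`, `answer_question`), the document fields, and the keyword-based intent check are invented for illustration and stand in for the LLM-driven components the abstract describes.

```python
def planner(request, document):
    """Choose an ordered list of tool names from a crude intent/modality check."""
    steps = ["ocr"] if document.get("is_image") else []
    steps.append("extract_kvp" if "extract" in request.lower() else "answer_question")
    return steps

def executor(steps, document, tools, max_steps=5):
    """Run the planned tools in order, recording a trace; max_steps caps the run."""
    state = {"doc": document, "text": document.get("text", ""), "trace": []}
    for name in steps[:max_steps]:
        state = tools[name](state)
        state["trace"].append(name)
    return state

def responder(state):
    """Return the result together with the trace that produced it."""
    return {"result": state.get("result"), "trace": state["trace"]}

def fake_ocr(state):
    # Stand-in for a real OCR tool: copies pre-supplied text out of the document.
    return {**state, "text": state["doc"]["pixels_as_text"]}

def extract_kvp(state):
    # Parse "key: value" lines into a dictionary.
    pairs = dict(line.split(": ", 1) for line in state["text"].splitlines() if ": " in line)
    return {**state, "result": pairs}

tools = {"ocr": fake_ocr, "extract_kvp": extract_kvp}
document = {"is_image": True, "pixels_as_text": "Reaction to fire: A1\nProduct: Mineral wool"}
plan = planner("Extract the declared performance", document)
answer = responder(executor(plan, document, tools))
```

Keeping the trace alongside the result is what makes the run auditable, and the step cap is a simple guard against runaway execution.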

[64] User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

Mikhail Kulyabin,Jan Joosten,Choro Ulan uulu,Nuno Miguel Martins Pacheco,Fabian Ries,Filippos Petridis,Jan Bosch,Helena Holmström Olsson

Main category: cs.CL

TL;DR: This paper introduces UXPID, a dataset of 7130 synthetically generated user feedback entries from an industrial automation forum, annotated using a large language model to support research in user experience and AI-driven feedback processing.

Details Motivation: Customer feedback in industrial forums is a rich but underexplored source of real-world product experience insights, which is challenging to analyze systematically due to its unstructured and domain-specific nature. Method: A large language model was used to analyze and annotate user feedback branches from an industrial automation forum with UX insights, sentiment ratings, severity, user expectations, and topic classifications. Result: The User eXperience Perception Insights Dataset (UXPID) was created, consisting of 7130 synthesized and anonymized user feedback branches enriched with metadata and contextual conversation data. Conclusion: The UXPID dataset facilitates research in user requirements, UX analysis, and AI-driven feedback processing, especially in scenarios constrained by privacy and licensing issues. Abstract: Customer feedback in industrial forums reflects a rich but underexplored source of insight into real-world product experience. These publicly shared discussions offer an organic view of user expectations, frustrations, and success stories shaped by the specific contexts of use. Yet, harnessing this information for systematic analysis remains challenging due to the unstructured and domain-specific nature of the content. The lack of structure and specialized vocabulary makes it difficult for traditional data analysis techniques to accurately interpret, categorize, and quantify the feedback, thereby limiting its potential to inform product development and support strategies. To address these challenges, this paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 artificially synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JavaScript object notation (JSON) record contains multi-post comments related to specific hardware and software products, enriched with metadata and contextual conversation data.
Leveraging a large language model (LLM), each branch is systematically analyzed and annotated for UX insights, user expectations, severity and sentiment ratings, and topic classifications. The UXPID dataset is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. UXPID supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in the context of technical forums.

[65] When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries

Dvora Goncharok,Arbel Shifman,Alexander Apartsin,Yehudit Aperstein

Main category: cs.CL

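TL;DR: This study presents a new dataset of medication-related questions extracted from online medical forums and annotated with clinical risk levels, aiming to identify critical questions that may signal severe adverse events.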

Details Motivation: Online medical forums contain a wealth of information on patient concerns, especially questions about medication use, but this information remains underused. Identifying critical questions that may signal a serious health crisis enables timely intervention and improves patient safety. Method: A manually annotated dataset of medication-related questions was built, with each question labelled for criticality based on clinical risk factors; six traditional machine learning classifiers using TF-IDF representations were benchmarked against three state-of-the-art large language model (LLM)-based classification approaches. Result: Experiments show that both traditional machine learning methods and modern LLMs perform well at identifying high-risk questions and have potential for use in real-time triage and alert systems. Conclusion: The study demonstrates the feasibility of combining patient-generated text with natural language processing for early health-risk warning in digital health settings; the publicly released dataset and benchmark should help advance related research. Abstract: Online medical forums are a rich and underutilized source of insight into patient concerns, especially regarding medication use. Some of the many questions users pose may signal confusion, misuse, or even the early warning signs of a developing health crisis. Detecting these critical questions that may precede severe adverse events or life-threatening complications is vital for timely intervention and improving patient safety. This study introduces a novel annotated dataset of medication-related questions extracted from online forums. Each entry is manually labelled for criticality based on clinical risk factors. We benchmark the performance of six traditional machine learning classifiers using TF-IDF textual representations, alongside three state-of-the-art large language model (LLM)-based classification approaches that leverage deep contextual understanding. Our results highlight the potential of classical and modern methods to support real-time triage and alert systems in digital health spaces. The curated dataset is made publicly available to encourage further research at the intersection of patient-generated data, natural language processing, and early warning systems for critical health events. The dataset and benchmark are available at: https://github.com/Dvora-coder/LLM-Medication-QA-Risk-Classifier-MediGuard.
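
The TF-IDF representation used for the traditional classifiers can be illustrated with a minimal standard-library sketch. It assumes whitespace tokenization and a smoothed inverse document frequency; the function name `tfidf` and the three forum questions are invented for illustration and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (term frequency * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical forum questions, invented for illustration only.
questions = [
    "can i take ibuprofen with my blood thinner",
    "i took double my dose and feel dizzy",
    "can i take my dose with food",
]
vectors = tfidf(questions)
```

Terms that occur in only one question, such as "ibuprofen", receive a higher weight than terms shared across questions, such as "can". A classifier would then be trained on these sparse vectors.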

[66] From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives

Eden Mama,Liel Sheri,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CL

TL;DR: This paper introduces the Noisy Diagnostic Benchmark (NDB), a synthetic dataset of noisy patient descriptions, to evaluate large language models' diagnostic capabilities under realistic linguistic conditions.

Details Motivation: The motivation is to address the gap in evaluating large language models (LLMs) on realistic, noisy patient-generated narratives rather than structured clinical texts, to better understand their diagnostic capabilities in real-world conditions. Method: The paper introduces the Noisy Diagnostic Benchmark (NDB), a synthetic dataset with annotated, noisy patient descriptions. It uses clinically consistent scenarios with varying levels of linguistic noise and lay terminology. The dataset is used to fine-tune and evaluate state-of-the-art models like BERT-based and T5 models. Result: The result is the creation of the NDB dataset, which enables the evaluation of LLMs in diagnosing from noisy, synthetic patient descriptions, offering insights into model performance under realistic linguistic conditions. Conclusion: The paper concludes that the NDB dataset can effectively evaluate LLMs' diagnostic capabilities under realistic linguistic conditions, and the authors advocate for further research and model development using this benchmark. Abstract: The widespread adoption of large language models (LLMs) in healthcare raises critical questions about their ability to interpret patient-generated narratives, which are often informal, ambiguous, and noisy. Existing benchmarks typically rely on clean, structured clinical text, offering limited insight into model performance under realistic conditions. In this work, we present a novel synthetic dataset designed to simulate patient self-descriptions characterized by varying levels of linguistic noise, fuzzy language, and layperson terminology. Our dataset comprises clinically consistent scenarios annotated with ground-truth diagnoses, spanning a spectrum of communication clarity to reflect diverse real-world reporting styles. Using this benchmark, we fine-tune and evaluate several state-of-the-art models (LLMs), including BERT-based and encoder-decoder T5 models. 
To support reproducibility and future research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset of noisy, synthetic patient descriptions designed to stress-test and compare the diagnostic capabilities of large language models (LLMs) under realistic linguistic conditions. We made the benchmark available for the community: https://github.com/lielsheri/PatientSignal

[67] PledgeTracker: A System for Monitoring the Fulfilment of Pledges

Yulong Chen,Michael Sejr Schlichtkrull,Zhenyun Deng,David Corney,Nasim Asl,Joshua Salisbury,Andrew Dudfield,Andreas Vlachos

Main category: cs.CL

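TL;DR: This paper presents PledgeTracker, a system that verifies the fulfilment of political pledges by constructing structured event timelines, addressing the failure of existing methods to account for the task's dynamic, temporal, and multi-document nature.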

Details Motivation: Existing methods reduce political pledge verification to document classification and cannot handle its dynamic, temporal, and multi-source nature, so a solution that better reflects the task's real complexity is needed. Method: PledgeTracker comprises three core modules (multi-step evidence retrieval, timeline construction, and fulfilment filtering) and reformulates pledge verification as structured event timeline construction. Result: Evaluation in real-world collaboration with professional fact-checkers shows that the system retrieves relevant evidence effectively and significantly reduces manual verification effort. Conclusion: By modelling structured timelines, PledgeTracker improves the accuracy and interpretability of pledge fulfilment tracking and suits fact-checking in dynamic, multi-source information environments. Abstract: Political pledges reflect candidates' policy commitments, but tracking their fulfilment requires reasoning over incremental evidence distributed across multiple, dynamically updated sources. Existing methods simplify this task into a document classification task, overlooking its dynamic, temporal and multi-document nature. To address this issue, we introduce PledgeTracker, a system that reformulates pledge verification into structured event timeline construction. PledgeTracker consists of three core components: (1) a multi-step evidence retrieval module; (2) a timeline construction module; and (3) a fulfilment filtering module, allowing the capture of the evolving nature of pledge fulfilment and producing interpretable and structured timelines. We evaluate PledgeTracker in collaboration with professional fact-checkers in real-world workflows, demonstrating its effectiveness in retrieving relevant evidence and reducing human verification effort.

[68] SCDTour: Embedding Axis Ordering and Merging for Interpretable Semantic Change Detection

Taichi Aida,Danushka Bollegala

Main category: cs.CL

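TL;DR: This paper proposes SCDTour, a method that orders and merges interpretable axes to balance interpretability and performance in semantic change detection.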

Details Motivation: In semantic change detection, improving interpretability often degrades performance and vice versa, so a method that achieves both is needed. Method: SCDTour orders and aggregates axes by considering both the semantic similarity between axes in the embedding space and the degree to which each axis contributes to semantic change. Result: Experiments show that SCDTour preserves semantic change detection performance while maintaining high interpretability, and that the aggregated axes yield a more refined set of word senses that matches or outperforms the full-dimensional embeddings. Conclusion: SCDTour effectively balances interpretability and performance, enabling meaningful interpretation of semantic change through a small number of refined axes. Abstract: In Semantic Change Detection (SCD), it is a common problem to obtain embeddings that are both interpretable and high-performing. However, improving interpretability often leads to a loss in the SCD performance, and vice versa. To address this problem, we propose SCDTour, a method that orders and merges interpretable axes to alleviate the performance degradation of SCD. SCDTour considers both (a) semantic similarity between axes in the embedding space, as well as (b) the degree to which each axis contributes to semantic change. Experimental results show that SCDTour preserves performance in semantic change detection while maintaining high interpretability. Moreover, agglomerating the sorted axes produces a more refined set of word senses, which achieves comparable or improved performance against the original full-dimensional embeddings in the SCD task. These findings demonstrate that SCDTour effectively balances interpretability and SCD performance, enabling meaningful interpretation of semantic shifts through a small number of refined axes. Source code is available at https://github.com/LivNLP/svp-tour .

[69] MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues

Weishu Chen,Jinyi Tang,Zhouhui Hou,Shihao Han,Mingjie Zhan,Zhiyuan Huang,Delong Liu,Jiawei Guo,Zhicheng Zhao,Fei Su

Main category: cs.CL

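TL;DR: The paper proposes MOOM, a dual-branch memory plugin grounded in literary theory, to address memory extraction in ultra-long human-robot role-playing dialogues; it summarizes plot conflicts, extracts the user's character profile, and uses a forgetting mechanism to control memory capacity.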

Details Motivation: Existing methods suffer from unbounded memory growth when handling ultra-long dialogues, which harms dialogue coherence and system efficiency. Method: A dual-branch design is used: one branch summarizes plot conflicts across multiple time scales, and the other extracts the user's character profile; a forgetting mechanism inspired by the "competition-inhibition" memory theory constrains memory capacity. Result: Experiments on ZH-4O, a newly proposed Chinese ultra-long dialogue dataset (averaging 600 turns), show that MOOM outperforms existing state-of-the-art methods, requiring fewer large language model invocations while keeping memory capacity controllable. Conclusion: MOOM effectively addresses the scalability of memory extraction in ultra-long dialogues, balancing efficiency and memory quality, and offers a new approach to memory management in role-playing scenarios. Abstract: Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios. However, existing methods often exhibit uncontrolled memory growth. To address this, we propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements. Specifically, one branch summarizes plot conflicts across multiple time scales, while the other extracts the user's character profile. MOOM further integrates a forgetting mechanism, inspired by the "competition-inhibition" memory theory, to constrain memory capacity and mitigate uncontrolled growth. Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset specifically designed for role-playing, featuring dialogues that average 600 turns and include manually annotated memory information. Experimental results demonstrate that MOOM outperforms all state-of-the-art memory extraction methods, requiring fewer large language model invocations while maintaining a controllable memory capacity.

[70] Growing Perspectives: Modelling Embodied Perspective Taking and Inner Narrative Development Using Large Language Models

Sabrina Patania,Luca Annese,Anna Lambiase,Anita Pellegrini,Tom Foulsham,Azzurra Ruggeri,Silvia Rossi,Silvia Serino,Dimitri Ognibene

Main category: cs.CL

TL;DR: This study explores how combining language and embodied perspective taking in computational models, specifically using the PerspAct system with LLMs, can better simulate human developmental dynamics and improve collaborative performance.

Details Motivation: Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. Method: The study investigates the PerspAct system, which combines the ReAct paradigm with Large Language Models (LLMs), evaluating GPT's ability to generate narratives aligned with Selman's developmental stages of perspective taking using an extended director task. Collaborative performance is assessed both qualitatively and quantitatively. Result: GPT reliably produces developmentally-consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges refine internal representations. Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. Conclusion: The study concludes that integrating embodied perspective taking and language in LLMs can better model developmental dynamics, highlighting the importance of evaluating internal speech during combined linguistic and embodied tasks. Abstract: Language and embodied perspective taking are essential for human collaboration, yet few computational models address both simultaneously. This work investigates the PerspAct system [1], which integrates the ReAct (Reason and Act) paradigm with Large Language Models (LLMs) to simulate developmental stages of perspective taking, grounded in Selman's theory [2]. Using an extended director task, we evaluate GPT's ability to generate internal narratives aligned with specified developmental stages, and assess how these influence collaborative performance both qualitatively (action selection) and quantitatively (task efficiency). Results show that GPT reliably produces developmentally-consistent narratives before task execution but often shifts towards more advanced stages during interaction, suggesting that language exchanges help refine internal representations. 
Higher developmental stages generally enhance collaborative effectiveness, while earlier stages yield more variable outcomes in complex contexts. These findings highlight the potential of integrating embodied perspective taking and language in LLMs to better model developmental dynamics and stress the importance of evaluating internal speech during combined linguistic and embodied tasks.

[71] Uncertainty in Authorship: Why Perfect AI Detection Is Mathematically Impossible

Aadil Gani Ganie

Main category: cs.CL

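TL;DR: This paper argues that the detection of AI-generated text has a theoretical limit: by analogy with the quantum uncertainty principle, the more precisely one tries to detect AI writing, the more likely one is to disrupt the text's naturalness, so perfect detection is theoretically impossible.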

Details Motivation: As large language models advance, distinguishing human-written from AI-generated text is becoming increasingly difficult, and the author seeks to examine the fundamental limits of detection at a theoretical level. Method: By analogy with the uncertainty principle in quantum mechanics, the paper analyzes the limitations of current detection methods (such as stylometry, watermarking, and neural classifiers) and argues that the act of detection itself introduces new uncertainty. Result: The analysis shows that when AI text is sufficiently close to human writing, perfect authorship detection is not merely a technical challenge but a theoretical impossibility. Conclusion: The challenge of AI-text detection stems not only from inadequate tools but also reflects an unavoidable tension in the nature of language; authorship, ethics, and policy frameworks should be reconsidered accordingly. Abstract: As large language models (LLMs) become more advanced, it is increasingly difficult to distinguish between human-written and AI-generated text. This paper draws a conceptual parallel between quantum uncertainty and the limits of authorship detection in natural language. We argue that there is a fundamental trade-off: the more confidently one tries to identify whether a text was written by a human or an AI, the more one risks disrupting the text's natural flow and authenticity. This mirrors the tension between precision and disturbance found in quantum systems. We explore how current detection methods--such as stylometry, watermarking, and neural classifiers--face inherent limitations. Enhancing detection accuracy often leads to changes in the AI's output, making other features less reliable. In effect, the very act of trying to detect AI authorship introduces uncertainty elsewhere in the text. Our analysis shows that when AI-generated text closely mimics human writing, perfect detection becomes not just technologically difficult but theoretically impossible. We address counterarguments and discuss the broader implications for authorship, ethics, and policy. Ultimately, we suggest that the challenge of AI-text detection is not just a matter of better tools--it reflects a deeper, unavoidable tension in the nature of language itself.

[72] Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation

Helene Tenzer,Oumnia Abidi,Stefan Feuerriegel

Main category: cs.CL

TL;DR: This paper explores how culturally-tailored prompting enhances the cultural sensitivity of LLMs when translating workplace emails between English and Japanese, using a mixed-methods study and prompting strategies.

Details Motivation: The motivation stems from the increasing use of large language models (LLMs) in multilingual communication and the question of whether these models support culturally appropriate interactions, especially in professional settings. Method: The research uses a mixed-methods approach, analyzing translations of workplace emails from English to Japanese. It evaluates the impact of three prompting strategies: naive translation prompts, audience-targeted prompts, and instructional prompts with guidance on Japanese communication norms. Result: The findings indicate that prompting strategies tailored to cultural context improve the cultural appropriateness of translations, as assessed by native speakers and through analysis of language patterns. Conclusion: The study concludes that culturally-tailored prompting can enhance the cultural sensitivity of LLMs in multilingual settings, specifically in translating workplace emails between English and Japanese. Abstract: Large language models (LLMs) are increasingly used in everyday communication, including multilingual interactions across different cultural contexts. While LLMs can now generate near-perfect literal translations, it remains unclear whether LLMs support culturally appropriate communication. In this paper, we analyze the cultural sensitivity of different LLM designs when applied to English-Japanese translations of workplace e-mails. Here, we vary the prompting strategies: (1) naive "just translate" prompts, (2) audience-targeted prompts specifying the recipient's cultural background, and (3) instructional prompts with explicit guidance on Japanese communication norms. Using a mixed-methods study, we then analyze culture-specific language patterns to evaluate how well translations adapt to cultural norms. Further, we examine the appropriateness of the tone of the translations as perceived by native speakers. 
We find that culturally-tailored prompting can improve cultural fit, based on which we offer recommendations for designing culturally inclusive LLMs in multilingual settings.

[73] Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding

Mingxiao Huo,Jiayi Zhang,Hewei Wang,Jinfeng Xu,Zheyu Chen,Huilin Tai,Yijun Chen

Main category: cs.CL

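TL;DR: This paper proposes Spec-LLaVA, an inference framework for vision-language models that achieves lossless acceleration through dynamic tree-structured speculative decoding, substantially increasing decoding speed and suiting resource-constrained settings.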

Details Motivation: Vision-language models (VLMs) are hard to deploy in real-time applications because autoregressive inference is slow, so inference efficiency must be improved without sacrificing generation quality. Method: Spec-LLaVA pairs a lightweight draft VLM with a large target model: the draft speculates future tokens, which the target verifies in parallel; a dynamic tree-based verification algorithm adaptively expands or prunes speculative branches according to draft model confidence. Result: On MS COCO out-of-domain images, Spec-LLaVA speeds up decoding on LLaVA-1.5 (7B, 13B) by up to 3.28×, with no loss in generation quality. Conclusion: The work presents an efficient lossless acceleration framework that offers a viable path toward real-time multimodal assistants; the lightweight draft model design makes it amenable to resource-constrained or on-device deployment. Abstract: Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer from slow autoregressive inference, limiting their deployment in real-time applications. We introduce Spec-LLaVA, a system that applies speculative decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA pairs a lightweight draft VLM with a large target model: the draft speculates future tokens, which the target verifies in parallel, allowing multiple tokens to be generated per step. To maximize efficiency, we design a dynamic tree-based verification algorithm that adaptively expands and prunes speculative branches using draft model confidence. On MS COCO out-of-domain images, Spec-LLaVA achieves up to 3.28$\times$ faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality. This work presents a lossless acceleration framework for VLMs using dynamic tree-structured speculative decoding, opening a path toward practical real-time multimodal assistants. Importantly, the lightweight draft model design makes the framework amenable to resource-constrained or on-device deployment settings.
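
The draft-then-verify loop behind speculative decoding can be sketched in a few lines. The sketch below is a deliberate simplification: it uses a single linear chain of drafted tokens with greedy decoding, not the dynamic tree with confidence-based pruning that the abstract describes, and the two toy "models" (`target_next`, `draft_next`) are hypothetical stand-ins for real networks.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain autoregressive greedy decoding, one model call per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with a linear draft of k tokens per round."""
    tokens = list(prompt)
    verification_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real system does
        #    this in one batched forward pass; here it is a Python loop.
        verification_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the mismatch
                break
        else:
            # Every drafted token was accepted: the target adds one more.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verification_passes


def target_next(tokens):
    # Toy target model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Toy draft model: agrees with the target except after token 5.
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


output, passes = speculative_decode(target_next, draft_next, [0], 12, k=4)
```

Because every accepted token is one the target itself would have produced, the output is identical to plain greedy decoding with the target, which is the sense in which the acceleration is lossless. In this toy run the target needs fewer verification passes than the number of tokens generated.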

[74] ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Mayank Agarwal,Ibrahim Abdelaziz,Kinjal Basu,Merve Unuvar,Luis A. Lastras,Yara Rizk,Pavan Kapanipathi

Main category: cs.CL

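TL;DR: This paper introduces FC-RewardBench, the first benchmark for evaluating reward models in tool-calling scenarios, and proposes a training framework for outcome-based reward models that uses data synthesized from open-weight LLMs, substantially improving reward model performance on tool use.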

Details Motivation: Existing reward models are trained mainly on natural language outputs, perform poorly when evaluating tool-based reasoning and execution, and miss key signals of effective tool use, so reward modeling tailored to tool calling is needed. Method: The paper introduces the FC-RewardBench benchmark to systematically assess reward models in tool-calling scenarios, and designs a training framework for outcome-based reward models that synthesizes training data with permissively licensed, open-weight LLMs, training reward models from 1.7B to 14B parameters. Result: The trained reward models consistently outperform general-purpose baselines on seven out-of-domain benchmarks, with up to 25% average improvement in downstream task performance, and enable data-efficient fine-tuning through reward-guided filtering. Conclusion: Dedicated reward models for tool calling are both necessary and effective; the synthetic-data training framework substantially improves reward models' ability to assess tool-use behavior and offers a viable path for reward modeling in future LLM agents. Abstract: As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
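
Reward-guided filtering, as mentioned in the abstract, amounts to scoring candidate training samples and keeping the best ones. The sketch below assumes a top-fraction selection rule, which the abstract does not specify; `toy_reward` is a hypothetical stand-in for a learned reward model and merely checks that a tool call is well-formed JSON with `name` and `arguments` keys.

```python
import json


def toy_reward(sample):
    """Hypothetical stand-in for a reward model: 1.0 for a well-formed call."""
    try:
        call = json.loads(sample["call"])
    except json.JSONDecodeError:
        return 0.0
    well_formed = isinstance(call, dict) and {"name", "arguments"} <= call.keys()
    return 1.0 if well_formed else 0.0


def reward_filter(samples, reward, keep_fraction=0.5):
    """Keep the highest-scoring fraction of samples (at least one)."""
    ranked = sorted(samples, key=reward, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Invented tool-call samples for illustration only.
samples = [
    {"call": '{"name": "get_weather", "arguments": {"city": "Paris"}}'},
    {"call": '{"name": "get_weather"}'},
    {"call": "get_weather(Paris)"},
    {"call": '{"name": "search", "arguments": {"q": "llm"}}'},
]
kept = reward_filter(samples, toy_reward, keep_fraction=0.5)
```

The retained subset would then serve as fine-tuning data; swapping `toy_reward` for a trained reward model changes only the scoring function, not the filtering logic.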

[75] Query-Focused Extractive Summarization for Sentiment Explanation

Ahmed Moubtahij,Sylvie Ratté,Yazid Attabi,Maxime Dumas

Main category: cs.CL

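TL;DR: The paper proposes a multi-bias framework to improve query-focused summarization, particularly for sentiment explanation; using sentiment-based biases and query expansion, it outperforms baseline models on a real-world sentiment-aware QFS dataset.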

Details Motivation: To address the linguistic mismatch between queries and source documents and to improve the efficiency and accuracy of analyzing the causes of sentiment in client feedback. Method: A domain-agnostic multi-bias framework is designed and combined with sentiment-based biases and query expansion tailored to the sentiment explanation task. Result: Experimental results on a real-world proprietary sentiment-aware QFS dataset outperform baseline models. Conclusion: The proposed multi-bias framework effectively mitigates the linguistic mismatch and significantly improves query-focused summarization for sentiment explanation. Abstract: Constructive analysis of feedback from clients often requires determining the cause of their sentiment from a substantial amount of text documents. To assist and improve the productivity of such endeavors, we leverage the task of Query-Focused Summarization (QFS). Models of this task are often impeded by the linguistic dissonance between the query and the source documents. We propose and substantiate a multi-bias framework to help bridge this gap at a domain-agnostic, generic level; we then formulate specialized approaches for the problem of sentiment explanation through sentiment-based biases and query expansion. We achieve experimental results outperforming baseline models on a real-world proprietary sentiment-aware QFS dataset.

[76] Text Adaptation to Plain Language and Easy Read via Automatic Post-Editing Cycles

Jesús Calleja,David Ponce,Thierry Etchegoyhen

Main category: cs.CL

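TL;DR: Vicomtech used large language models in the CLEARS challenge to adapt Spanish text to Plain Language and Easy Read; with automatic post-editing and iterative refinement, its submissions ranked first and second, respectively, on the average of the official metrics.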

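Details Motivation: To improve the accessibility of Spanish text through Plain Language and Easy Read adaptation, the team took part in the CLEARS challenge and explored LLM-based automatic text adaptation. Method: A large language model generates initial adaptations, which are refined iteratively through automatic post-editing until readability and similarity metrics indicate that no further improvement is possible. Result: In the CLEARS challenge, the approach placed first in Plain Language adaptation and second in Easy Read adaptation, by average of the official metrics. Conclusion: The LLM approach with iterative automatic post-editing performs strongly on Spanish text simplification and has practical potential. Abstract: We describe Vicomtech's participation in the CLEARS challenge on text adaptation to Plain Language and Easy Read in Spanish. Our approach features automatic post-editing of different types of initial Large Language Model adaptations, where successive adaptations are generated iteratively until readability and similarity metrics indicate that no further adaptation refinement can be successfully performed. Taking the average of all official metrics, our submissions achieved first and second place in Plain Language and Easy Read adaptation, respectively.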
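
The iterative post-editing cycle can be sketched as a loop that stops once the score no longer improves. Everything concrete below is an assumption made for illustration: the callables `toy_adapt`, `toy_post_edit`, and `toy_score` stand in for the LLM adaptation, the LLM post-editor, and the readability and similarity metrics, none of which are specified in the abstract, and the example is in English for readability although the challenge targets Spanish.

```python
def iterative_post_edit(source, adapt, post_edit, score, max_rounds=10):
    """Post-edit repeatedly while the score improves; return the best text."""
    best = adapt(source)
    best_score = score(source, best)
    for _ in range(max_rounds):
        candidate = post_edit(source, best)
        candidate_score = score(source, candidate)
        if candidate_score <= best_score:
            break  # metrics indicate no further refinement
        best, best_score = candidate, candidate_score
    return best


# Hypothetical simplification dictionary used by the toy post-editor.
SIMPLER = {"utilize": "use", "commence": "start", "terminate": "end"}


def toy_adapt(source):
    # Stand-in for an initial LLM adaptation: returns the text unchanged.
    return source


def toy_post_edit(source, text):
    # Stand-in for an LLM post-editor: simplify the first replaceable word.
    words = text.split()
    for i, word in enumerate(words):
        if word in SIMPLER:
            words[i] = SIMPLER[word]
            break
    return " ".join(words)


def toy_score(source, text):
    # Crude readability proxy (hypothetical): shorter words score higher.
    words = text.split()
    return -sum(len(word) for word in words) / len(words)


result = iterative_post_edit(
    "we utilize forms to commence the process",
    toy_adapt, toy_post_edit, toy_score,
)
```

The loop keeps only the best adaptation so far, so a post-editing round that makes the text worse is discarded instead of propagated.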

[77] Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect

Alina Klerings,Jannik Brinkmann,Daniel Ruffinelli,Simone Ponzetto

Main category: cs.CL

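TL;DR: This study examines how large language models internally represent and control verb tense and aspect, two multidimensional hierarchical grammatical phenomena; it uses linear discriminant analysis to identify orthogonal directions in residual space and demonstrates causal control during generation.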

Details Motivation: To understand how LLMs internally encode syntactic knowledge, in particular multidimensional hierarchical grammatical structures such as tense and aspect, rather than only binary grammatical contrasts. Method: Linear discriminant analysis (LDA) is used to identify distinct, orthogonal directions in residual space that represent tense and aspect; concept steering then provides causal control over both features, followed by a case study with multi-task generation experiments. Result: Orthogonal directions controlling tense and aspect were identified, and control was achieved across three generation tasks; steering strength, location, and duration proved crucial for reducing side effects such as topic shift and degeneration. Conclusion: LLMs encode tense and aspect in structurally organized, human-like ways, but effective control is sensitive to multiple factors and requires careful parameter tuning, pointing to a need for automated optimization. Abstract: Large language models (LLMs) are able to generate grammatically well-formed text, but how do they encode their syntactic knowledge internally? While prior work has focused largely on binary grammatical contrasts, in this work, we study the representation and control of two multidimensional hierarchical grammar phenomena - verb tense and aspect - and for each, identify distinct, orthogonal directions in residual space using linear discriminant analysis. Next, we demonstrate causal control over both grammatical features through concept steering across three generation tasks. Then, we use these identified features in a case study to investigate factors influencing effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.

[78] SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

Salima Mdhaffar,Haroun Elleuch,Chaimae Chellaf,Ha Nguyen,Yannick Estève

Main category: cs.CL

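TL;DR: This paper presents SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework; by improving the teacher text model and the initial speech encoder, it achieves competitive performance on multilingual speech-text semantic alignment tasks.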

Details Motivation: To align speech and text semantically in multilingual settings, improve cross-modal and multilingual understanding, and advance openly available unified embedding models. Method: A teacher-student framework aligns a self-supervised speech encoder with a language-agnostic text encoder at the utterance level; a stronger teacher text model and a better initial speech encoder are selected, and training and inference are implemented in the SpeechBrain toolkit. Result: Experiments on multilingual and multimodal semantic tasks show that the SENSE model achieves highly competitive performance, and the model has been publicly released. Conclusion: SENSE effectively aligns speech and text semantics across languages, offers new insights into how semantically aligned speech encoders capture semantics, and is open and broadly applicable. Abstract: This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.

[79] Is 'Hope' a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

Payam Latifi

Main category: cs.CL

TL;DR: This study compares NER performance of LLMs and traditional NLP tools, showing that LLMs excel in context-sensitive entities while traditional tools perform better in structured tagging.

Details Motivation: To compare the performance of traditional NLP tools and large language models (LLMs) in Named Entity Recognition (NER), particularly in recognizing context-sensitive entities versus structured tags. Method: The study evaluated six NER systems (three non-LLM tools: NLTK, spaCy, Stanza; and three LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B) using a manually annotated dataset of 119 tokens across five entity types. Performance was measured using F1-score against a gold standard dataset. Result: LLMs generally outperformed conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza showed greater consistency in structured tags such as LOCATION and DATE. Variability was observed among LLMs in handling temporal expressions and multi-word organizations. Conclusion: LLMs provide better contextual understanding, but traditional tools like Stanza are still competitive in specific tasks such as structured tagging, which informs the selection of models based on task requirements. Abstract: This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system's output against the manually annotated gold standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency in structured tags such as LOCATION and DATE. 
We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive in specific tasks, informing model selection.
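
Scoring a system against the gold standard with a per-entity-type F1 can be sketched as follows. The sketch assumes token-level scoring with an `O` tag for non-entities, which the abstract does not specify, and the five-token example is invented, not drawn from the 119-token dataset.

```python
from collections import Counter


def per_label_f1(gold, pred, ignore="O"):
    """Token-level F1 per entity label for two aligned tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1  # system predicted a label the gold lacks here
            if g != ignore:
                fn[g] += 1  # system missed the gold label
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Invented example: "Hope visited Paris on Monday", where a system
# fails to tag the ambiguous name "Hope" as a person.
gold = ["PERSON", "O", "LOCATION", "O", "DATE"]
pred = ["O", "O", "LOCATION", "O", "DATE"]
scores = per_label_f1(gold, pred)
```

Averaging the per-label values gives a macro F1, one plausible reading of the "average F1-score" reported above.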

[80] In-domain SSL pre-training and streaming ASR

Jarod Duret,Salima Mdhaffar,Gaëlle Laperrière,Ryan Whetten,Audrey Galametz,Catherine Kobus,Marion-Cécile Martin,Jo Oleiwan,Yannick Estève

Main category: cs.CL

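TL;DR: This study examines the benefits of self-supervised pre-training for speech recognition in air traffic control environments and proposes a low-latency streaming approach.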

Details Motivation: To explore the benefits of domain-specific self-supervised pre-training for offline and streaming ASR, particularly in air traffic control (ATC) environments. Method: BEST-RQ models are trained and then fine-tuned on a smaller supervised ATC dataset; chunked attention and dynamic convolutions enable streaming. Result: Domain-adapted pre-training substantially improves performance on ATC benchmarks, significantly reducing word error rate compared with general-purpose speech encoders. Conclusion: Domain-adapted pre-training on ATC data is a practical path toward more accurate and efficient ASR systems, especially for safety-critical aviation applications. Abstract: In this study, we investigate the benefits of domain-specific self-supervised pre-training for both offline and streaming ASR in Air Traffic Control (ATC) environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then fine-tune on a smaller supervised ATC set. To enable real-time processing, we propose using chunked attention and dynamic convolutions, ensuring low-latency inference. We compare these in-domain SSL models against state-of-the-art, general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show that domain-adapted pre-training substantially improves performance on standard ATC benchmarks, significantly reducing word error rates when compared to models trained on broad speech corpora. Furthermore, the proposed streaming approach further improves word error rate under tighter latency constraints, making it particularly suitable for safety-critical aviation applications. These findings highlight that specializing SSL representations for ATC data is a practical path toward more accurate and efficient ASR systems in real-world operational settings.
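
Chunked attention limits each frame to its own chunk plus a bounded amount of left context, which is what keeps streaming latency low. A minimal sketch of such a mask follows; the chunk size and the number of left-context chunks are illustrative assumptions, since the abstract does not give the configuration used.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Return mask[i][j], True when frame i may attend to frame j.

    A frame sees every frame in its own chunk and in the previous
    `left_chunks` chunks, and never sees future chunks.
    """
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        row = []
        for j in range(num_frames):
            chunk_j = j // chunk_size
            row.append(chunk_i - left_chunks <= chunk_j <= chunk_i)
        mask.append(row)
    return mask


# Six frames in chunks of two, with one chunk of left context.
mask = chunked_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
```

With this mask the encoder can emit output as soon as a chunk is complete, so the look-ahead delay is bounded by the chunk length instead of the utterance length.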

[81] GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models

Min Zeng,Jinfei Sun,Xueyou Luo,Caiquan Liu,Shiqi Zhang,Li Xie,Xiaoxin Chen

Main category: cs.CL

TL;DR: The paper proposes the GTA framework that effectively combines SFT and RL for improved performance and faster convergence in NLP tasks.

Details Motivation: To address the efficiency-capability trade-off in NLP tasks between pure RL and SFT methods. Method: The Guess-Think-Answer (GTA) framework has the model produce a provisional guess optimized with cross-entropy loss, reflect on it, and then generate a final answer shaped by RL rewards; loss masking and gradient constraints mitigate conflicts between the two training signals. Result: GTA accelerates convergence and outperforms standalone SFT and RL methods on four text classification benchmarks. Conclusion: GTA is an effective framework that combines SFT and RL to achieve faster convergence and higher performance in NLP tasks. Abstract: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence, while supervised fine-tuning (SFT) methods, although efficient in training, have a limited performance ceiling and a less solid theoretical foundation compared to RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework that combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. GTA works by having the model first produce a provisional guess (optimized via cross-entropy loss), then reflect on this guess before generating the final answer, with RL rewards shaping both the final output and the format of the entire GTA structure. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we employ loss masking and gradient constraints. Empirical results on four text classification benchmarks demonstrate that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.

[82] CBP-Tuning: Efficient Local Customization for Black-box Large Language Models

Jiaxuan Zhao,Naibin Gu,Yuchen Feng,Xiyu Liu,Peng Fu,Zheng Lin,Weiping Wang

Main category: cs.CL

TL;DR: CBP-Tuning is a privacy-preserving method for customizing large language models locally without the need to access model weights or upload private data, showing superior performance in various domains.

Details Motivation: The high cost of customizing large language models and the privacy risks associated with uploading sensitive data motivated the development of CBP-Tuning. Method: CBP-Tuning uses a two-stage framework involving a server-side prompt generator and user-side gradient-free optimization. Result: CBP-Tuning outperformed baselines in commonsense reasoning, medical, and financial domains, demonstrating its effectiveness in preserving privacy and enabling customization. Conclusion: CBP-Tuning effectively addresses the challenges of customization and privacy in using large language models by providing an efficient and privacy-preserving framework. Abstract: The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.

[83] XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models

Ariana Sahitaj,Jiaao Li,Pia Wenzel Neves,Fedor Splitt,Premtim Sahitaj,Charlott Jakob,Veronika Solopova,Vera Schmitt

Main category: cs.CL

TL;DR: This paper presents the XplaiNLP submission to the CheckThat! 2025 shared task, evaluating supervised and zero-shot methods for multilingual subjectivity detection, achieving strong results in some languages while highlighting challenges in low-resource cross-lingual settings.

Details Motivation: To address the challenge of multilingual subjectivity detection, this study aims to evaluate and compare the effectiveness of supervised fine-tuning and zero-shot prompting approaches across multiple languages and subtasks. Method: Two approaches were evaluated: (1) supervised fine-tuning of transformer encoders (EuroBERT, XLM-RoBERTa, German-BERT) on monolingual and machine-translated data, and (2) zero-shot prompting using LLMs (o3-mini for Annotation, gpt-4.1-mini for DoubleDown and Perspective methods). Result: The Annotation Approach ranked 1st in the Italian monolingual subtask (F_1: 0.8104). Fine-tuned XLM-RoBERTa ranked 3rd in the Romanian zero-shot setting (F_1: 0.7917) and improved over baselines in Greek and German subtasks. Performance in Ukrainian and Polish zero-shot settings fell slightly below baselines. Conclusion: XplaiNLP submission showcased competitive performance in multilingual subjectivity detection, with success in specific languages and subtasks, while identifying challenges in low-resource cross-lingual scenarios. Abstract: This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared task on multilingual subjectivity detection. We evaluate two approaches: (1) supervised fine-tuning of transformer encoders, EuroBERT, XLM-RoBERTa, and German-BERT, on monolingual and machine-translated training data; and (2) zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and Perspective (comparative reasoning). The Annotation Approach achieves 1st place in the Italian monolingual subtask with an F_1 score of 0.8104, outperforming the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned XLM-RoBERTa model obtains an F_1 score of 0.7917, ranking 3rd and exceeding the baseline of 0.6461. The same model also performs reliably in the multilingual task and improves over the baseline in Greek. 
For German, a German-BERT model fine-tuned on translated training data from typologically related languages yields competitive performance over the baseline. In contrast, performance in the Ukrainian and Polish zero-shot settings falls slightly below the respective baselines, reflecting the challenge of generalization in low-resource cross-lingual scenarios.

[84] Pun Unintended: LLMs and the Illusion of Humor Understanding

Alessandro Zangari,Matteo Marcuzzo,Andrea Albarelli,Mohammad Taher Pilehvar,Jose Camacho-Collados

Main category: cs.CL

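TL;DR: This paper studies how large language models (LLMs) perform at pun detection and shows that their understanding is often shallow, lacking the nuanced grasp typical of humans. By systematically analyzing and reformulating existing pun benchmarks, it shows that subtle changes are enough to mislead LLMs. Contributions include more comprehensive and nuanced pun detection benchmarks, a human evaluation across recent LLMs, and an analysis of the robustness challenges these models face when processing puns.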

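Details Motivation: Although LLMs show promise in pun detection, their depth of understanding falls short of humans and they have robustness problems. The authors aim to expose the limitations of current models and to promote more challenging evaluation. Method: Existing pun datasets are systematically reformulated and analyzed, subtle linguistic changes are introduced, several mainstream LLMs are tested, and human evaluation is used to gauge model performance. Result: Experiments show that even small modifications to puns significantly reduce LLM detection accuracy, indicating that their understanding is not robust; human evaluation further reveals the gap between model and human understanding. Conclusion: Current LLMs still understand puns only superficially and lack a deep grasp of linguistic subtlety; more robust models and finer-grained evaluation benchmarks are needed. Abstract: Puns are a form of humorous wordplay that exploits polysemy and phonetic similarity. While LLMs have shown promise in detecting puns, we show in this paper that their understanding often remains shallow, lacking the nuanced grasp typical of human interpretation. By systematically analyzing and reformulating existing pun benchmarks, we demonstrate how subtle changes in puns are sufficient to mislead LLMs. Our contributions include comprehensive and nuanced pun detection benchmarks, human evaluation across recent LLMs, and an analysis of the robustness challenges these models face in processing puns.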

[85] RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing

Timothy Rupprecht,Enfu Nan,Arash Akbari,Arman Akbari,Lei Lu,Priyanka Maan,Sean Duffy,Pu Zhao,Yumei He,David Kaeli,Yanzhi Wang

Main category: cs.CL

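TL;DR: Proposes RAGs-to-Riches, a prompting framework that borrows ideas from retrieval-augmented generation to improve the fidelity and stability of LLM role-playing, with the clearest gains against hostile users.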

Details Motivation: Existing few-shot learning methods tend to make role-playing LLMs break character in high-stakes domains, and can produce harmful behavior especially with hostile users, so a more reliable way of staying in character is needed. Method: LLM role-playing is reformulated as a text retrieval problem, with curated reference demonstrations conditioning generation; the RAGs-to-Riches prompting framework is introduced along with two new token-level ROUGE metrics (IOO and IOR) that assess how much the model improvises and how much of the demonstrations it uses. Result: When simulating a hostile user, responses incorporate on average 35% more tokens from the reference demonstrations at inference time; across 453 role-playing interactions, the models are judged more authentic and less likely to break character. Conclusion: RAGs-to-Riches offers a scalable, robust, human-aligned strategy for building LLM role-playing systems. Abstract: Role-playing large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance, where failures can directly impact user trust and well-being. A cost-effective paradigm for LLM role-playing is few-shot learning, but existing approaches often cause models to break character in unexpected and potentially harmful ways, especially when interacting with hostile users. Inspired by Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a text retrieval problem and propose a new prompting framework called RAGs-to-Riches, which leverages curated reference demonstrations to condition LLM responses. We evaluate our framework with LLM-as-a-judge preference voting and introduce two novel token-level ROUGE metrics: Intersection over Output (IOO) to quantify how much an LLM improvises and Intersection over References (IOR) to measure the utilization rate of few-shot demonstrations during the evaluation tasks. When simulating interactions with a hostile user, our prompting strategy incorporates in its responses during inference an average of 35% more tokens from the reference demonstrations. As a result, across 453 role-playing interactions, our models are consistently judged as being more authentic, and remain in-character more often than zero-shot and in-context learning (ICL) methods. Our method presents a scalable strategy for building robust, human-aligned LLM role-playing frameworks.
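
The abstract names the two metrics but does not give their formulas. The sketch below is one plausible reading, assuming whitespace tokenization, lowercasing, and multiset (clipped-count) overlap: IOO divides the shared token count by the output length, and IOR divides it by the total length of the reference demonstrations. The function names and the example strings are hypothetical, and the paper's actual definitions may differ.

```python
from collections import Counter
from typing import Iterable, Tuple


def _token_counts(text: str) -> Counter:
    # Assumption: lowercase whitespace tokens stand in for the paper's tokenizer.
    return Counter(text.lower().split())


def overlap_metrics(output: str, references: Iterable[str]) -> Tuple[float, float]:
    """Return (IOO, IOR) under the assumed multiset-overlap definitions."""
    out_counts = _token_counts(output)
    ref_counts: Counter = Counter()
    for reference in references:
        ref_counts.update(_token_counts(reference))
    # Counter & Counter keeps the minimum count per token (clipped overlap).
    shared = sum((out_counts & ref_counts).values())
    out_total = sum(out_counts.values())
    ref_total = sum(ref_counts.values())
    ioo = shared / out_total if out_total else 0.0
    ior = shared / ref_total if ref_total else 0.0
    return ioo, ior


ioo, ior = overlap_metrics(
    "i am the ship doctor",
    ["i am the doctor", "welcome aboard"],
)
```

Under this reading, a low IOO means most of the output was improvised rather than drawn from the demonstrations, and a high IOR means a large share of the demonstration text was reused.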

[86] Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

Marek Kubis,Paweł Skórzewski,Iwona Christop,Mateusz Czyżnikiewicz,Jakub Kubiak,Łukasz Bondaruk,Marcin Lewandowski

Main category: cs.CL

TL;DR: The paper introduces C3T, a benchmark for assessing speech-aware large language models by evaluating their language understanding capabilities, fairness, and robustness through both text and speech modalities.

Details Motivation: To accurately evaluate how well speech-aware large language models preserve language understanding when accessed via speech input, and to assess their fairness and robustness across different modalities and speaker categories. Method: The paper introduces C3T, which combines textual tasks with a voice cloning text-to-speech model to assess the preservation of language understanding capabilities in speech-aware models. Result: C3T successfully quantifies the fairness of models for different speaker categories and their robustness across text and speech modalities. Conclusion: C3T is an effective benchmark for evaluating the language understanding capabilities of speech-aware large language models, including their fairness and robustness across different speaker categories and modalities. Abstract: The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.

cs.CV [Back]

[87] A Real-Time Diminished Reality Approach to Privacy in MR Collaboration

Christian Fane

Main category: cs.CV

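TL;DR: This work presents a real-time, inpainting-based diminished reality (DR) system that enables privacy control in shared-space mixed reality (MR) meetings.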

Details Motivation: In mixed reality meetings, users may want to hide personal or sensitive items to protect their privacy. Existing approaches usually require a fixed viewpoint or a pre-scanned environment, which limits portability and practicality. Method: The system combines semantic segmentation with precise object selection, uses YOLOv11 for object detection, and applies a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. A ZED 2i depth camera captures and completes the background in real time from a secondary viewpoint. Result: At 720p resolution the system sustains more than 20 fps and removes real objects in real time without a fixed viewpoint or prior modelling of the environment. Conclusion: The solution is portable and robust, and demonstrates the feasibility of real-time diminished reality for practical privacy-preserving MR applications. Abstract: Diminished reality (DR) refers to the digital removal of real-world objects by compositing background content in their place. This thesis presents a real-time, inpainting-based DR system designed to enable privacy control in shared-space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real-time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real-time diminished reality for practical privacy-preserving MR applications.

[88] SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning

Alejandra Perez,Chinedu Nwoye,Ramtin Raji Kermani,Omid Mohareri,Muhammad Abdullah Jamal

Main category: cs.CV

TL;DR: This paper presents the SurgLaVi dataset and the SurgCLIP framework, which substantially improve surgical vision-language pre-training.

Details Motivation: Existing surgical vision-language datasets are limited in scale, procedural diversity, semantic quality, and hierarchical structure, which holds back surgical vision-language pre-training. Method: The authors build the SurgLaVi dataset and the SurgCLIP contrastive learning framework. Result: SurgCLIP surpasses prior state-of-the-art methods across multiple surgical recognition tasks, sometimes by a large margin. Conclusion: SurgLaVi is a key resource for developing surgical foundation models and for learning stronger, more generalizable representations. Abstract: Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures and spanning hierarchical phase, step, and task levels. At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-β, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins.
These results validate that large-scale, semantically rich, and hierarchically structured datasets directly translate into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.

[89] Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses

Emily Kaczmarek,Justin Szeto,Brennan Nichyporuk,Tal Arbel

Main category: cs.CV

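TL;DR: Presents a SimCLR-based self-supervised foundation model for high-resolution 3D structural brain MRI, pre-trained on 44,958 scans from 18,759 patients. It outperforms MAE and supervised baselines on multiple downstream tasks and remains strong even with little labeled data.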

Details Motivation: Existing deep learning models for 3D brain MRI are usually task-specific and generalize poorly, while existing self-supervised foundation models fall short in resolution, scope, or accessibility, so a general, high-performing foundation model is needed to make clinical brain MRI analysis more efficient and accurate. Method: A self-supervised model is built with the SimCLR framework and pre-trained on 18,759 patients (44,958 scans) from 11 public datasets. It is evaluated on four different downstream prediction tasks in both in-distribution and out-of-distribution settings, and compared with MAE and two supervised baselines. Result: The fine-tuned SimCLR model outperforms the other models on all downstream tasks, and keeps its advantage on Alzheimer's disease prediction even when only 20% of the labeled data is used. Conclusion: The study delivers a general, high-resolution, publicly accessible foundation model for 3D brain MRI that clearly improves performance on several clinically relevant tasks and has broad applicability. Abstract: 3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease.
We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
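
SimCLR trains an encoder with the NT-Xent contrastive loss, which pulls two augmented views of the same scan together and pushes views of different scans apart. The sketch below is a plain-Python illustration of that standard loss on tiny vectors; it is not the authors' code, and the temperature value, function names, and toy embeddings are assumptions chosen for readability.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(embeddings: List[Sequence[float]], temperature: float = 0.5) -> float:
    """NT-Xent over 2N embeddings; the two views of sample k sit at indices 2k and 2k+1."""
    n = len(embeddings)
    if n < 2 or n % 2:
        raise ValueError("expected an even number of embeddings (two views per sample)")
    total = 0.0
    for i in range(n):
        partner = i ^ 1  # flips the last bit: 0<->1, 2<->3, ...
        positive = cosine_similarity(embeddings[i], embeddings[partner]) / temperature
        # The denominator covers every other embedding in the batch, including the partner.
        logits = [
            cosine_similarity(embeddings[i], embeddings[k]) / temperature
            for k in range(n)
            if k != i
        ]
        total += math.log(sum(math.exp(logit) for logit in logits)) - positive
    return total / n


# Toy 2-D embeddings: in `aligned`, paired views point the same way; in `shuffled`, they do not.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent_loss(aligned)
loss_shuffled = nt_xent_loss(shuffled)
```

The loss is lower when paired views agree, which is the signal that drives pre-training without labels.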

[90] USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction

Xiaoyang Ma,Yiyang Chai,Xinran Qu,Hong Sun

Main category: cs.CV

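TL;DR: Proposes USCTNet, a physics-grounded deep unfolding solver that reconstructs hyperspectral images from a single RGB image, using nuclear-norm regularization in a learnable transform domain and explicit estimation of camera spectral sensitivity and illumination to ensure colorimetric consistency.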

Details Motivation: Reconstructing hyperspectral images from a single RGB image is ill-posed and easily becomes physically inconsistent when the camera spectral sensitivity and illumination are misspecified, so both the accuracy and the physical plausibility of the reconstruction need to improve. Method: RGB-to-HSI reconstruction is formulated as a physics-grounded inverse problem with nuclear-norm regularization in a learnable transform domain; an explicitly estimated forward operator (covering CSS and illumination) is embedded in each iteration, and a data-adaptive low-rank subspace SVT operator avoids the cost of full SVDs. Result: Experiments on standard benchmarks show higher reconstruction accuracy than existing state-of-the-art RGB-based methods. Conclusion: By combining a physical model with deep learning, USCTNet improves hyperspectral reconstruction while preserving colorimetric consistency, and shows good application potential. Abstract: Reconstructing hyperspectral images (HSIs) from a single RGB image is ill-posed and can become physically inconsistent when the camera spectral sensitivity (CSS) and scene illumination are misspecified. We formulate RGB-to-HSI reconstruction as a physics-grounded inverse problem regularized by a nuclear norm in a learnable transform domain, and we explicitly estimate CSS and illumination to define the forward operator embedded in each iteration, ensuring colorimetric consistency. To avoid the cost and instability of full singular-value decompositions (SVDs) required by singular-value thresholding (SVT), we introduce a data-adaptive low-rank subspace SVT operator. Building on these components, we develop USCTNet, a deep unfolding solver tailored to HSI that couples a parameter estimation module with learnable proximal updates. Extensive experiments on standard benchmarks show consistent improvements over state-of-the-art RGB-based methods in reconstruction accuracy. Code: https://github.com/psykheXX/USCTNet-Code-Implementation.git
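
For context, the textbook singular-value thresholding step that the abstract refers to is the proximal operator of the nuclear norm. The formula below states only that standard operator; the symbols X, τ, U, Σ, V are generic notation introduced here, and the paper's data-adaptive subspace variant and learnable transform are not reproduced.

```latex
% Standard SVT: proximal operator of the nuclear norm.
% X = U \Sigma V^\top is the SVD of X, with singular values \sigma_i and threshold \tau > 0.
\[
\operatorname{prox}_{\tau \|\cdot\|_*}(X)
  = \arg\min_{Z} \; \tfrac{1}{2}\,\|Z - X\|_F^2 + \tau\,\|Z\|_*
  = U \, \operatorname{diag}\!\big(\max(\sigma_i - \tau,\, 0)\big) \, V^\top
\]
```

Each evaluation requires an SVD of X, which is the cost the abstract says the subspace operator is designed to avoid.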

[91] A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI

Felicia Liu,Jay J. Yoo,Farzad Khalvati

Main category: cs.CV

TL;DR: Large language models (LLMs) perform poorly on medical imaging tasks, while convolutional neural networks (CNNs) perform better.

Details Motivation: LLMs perform well on text-based healthcare tasks, but their utility in image-based applications has not been explored. Method: Using the BraTS 2020 multi-modal brain MRI dataset, a general-purpose vision-language LLM (LLaMA 3.2 Instruct) is evaluated before and after fine-tuning on glioma classification and segmentation, and benchmarked against custom 3D CNNs. Result: For glioma classification, the CNN reached 80% accuracy versus 76% for the general LLM; fine-tuning raised the LLM's specificity from 18% to 55% but lowered overall performance. For segmentation, the CNNs localized gliomas accurately, whereas the LLMs' predictions consistently clustered near the image center and did not distinguish tumor size, location, or placement. Conclusion: CNNs outperform LLMs at image understanding; in their current form LLMs are poorly suited to medical imaging tasks and need more rigorous fine-tuning or alternative training strategies to improve performance, robustness, and utility. Abstract: Large Language Models (LLMs) have shown strong performance in text-based healthcare tasks. However, their utility in image-based applications remains unexplored. We investigate the effectiveness of LLMs for medical imaging tasks, specifically glioma classification and segmentation, and compare their performance to that of traditional convolutional neural networks (CNNs). Using the BraTS 2020 dataset of multi-modal brain MRIs, we evaluated a general-purpose vision-language LLM (LLaMA 3.2 Instruct) both before and after fine-tuning, and benchmarked its performance against custom 3D CNNs. For glioma classification (Low-Grade vs. High-Grade), the CNN achieved 80% accuracy and balanced precision and recall. The general LLM reached 76% accuracy but suffered from a specificity of only 18%, often misclassifying Low-Grade tumors. Fine-tuning improved specificity to 55%, but overall performance declined (e.g., accuracy dropped to 72%). For segmentation, three methods (center point, bounding box, and polygon extraction) were implemented. CNNs accurately localized gliomas, though small tumors were sometimes missed. In contrast, LLMs consistently clustered predictions near the image center, with no distinction of glioma size, location, or placement. Fine-tuning improved output formatting but failed to meaningfully enhance spatial accuracy. The bounding polygon method yielded random, unstructured outputs. Overall, CNNs outperformed LLMs in both tasks.
LLMs showed limited spatial understanding and minimal improvement from fine-tuning, indicating that, in their current form, they are not well-suited for image-based tasks. More rigorous fine-tuning or alternative training strategies may be needed for LLMs to achieve better performance, robustness, and utility in the medical space.
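
The classification results pair a fairly high accuracy (76%) with a very low specificity (18%), which can happen when one class dominates the test set. The sketch below shows how the three metrics fall out of a confusion matrix. It assumes High-Grade is the positive class, which the abstract's wording suggests but does not state, and the counts are invented for illustration and are not the paper's data.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity (recall on positives), and specificity (recall on negatives)."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical imbalanced test set: 75 High-Grade (positive) and 25 Low-Grade (negative) cases.
# A model that labels almost everything High-Grade still scores 75% accuracy.
metrics = binary_metrics(tp=70, fp=20, tn=5, fn=5)
```

In this made-up example only 5 of the 25 negative cases are recognized, so specificity is 0.2 even though three quarters of all predictions are correct.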

[92] Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Hao Zhang,Chun-Han Yao,Simon Donné,Narendra Ahuja,Varun Jampani

Main category: cs.CV

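TL;DR: Proposes SP4D, a framework that generates paired RGB and kinematic part videos from monocular inputs. A dual-branch diffusion model jointly synthesizes RGB frames and part segmentation maps, and a spatial color encoding plus a bidirectional diffusion fusion module improve consistency.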

Details Motivation: Conventional part segmentation relies on appearance-based semantic cues and struggles to capture an object's kinematic structure; the goal is to generate kinematic parts that align with object articulation and stay consistent across views and time. Method: A dual-branch diffusion model shares a latent VAE, a spatial color encoding maps part masks to continuous RGB-like images, and the BiDiFuse module together with a contrastive part consistency loss strengthens cross-branch consistency. Result: Trained on the KinematicParts20K dataset, SP4D generalizes to real-world videos, novel generated objects, and rare poses; the generated 2D part maps can be lifted to 3D to derive skeletal structures and skinning weights. Conclusion: SP4D effectively generates kinematic-aware paired RGB and part videos suitable for animation and motion-related downstream tasks. Abstract: We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.

[93] SegSLR: Promptable Video Segmentation for Isolated Sign Language Recognition

Sven Schreiber,Noha Sarhan,Simone Frintrop,Christian Wilms

Main category: cs.CV

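TL;DR: This paper proposes SegSLR, a new isolated sign language recognition system that combines RGB and pose information through promptable zero-shot video segmentation. It preserves hand and body shape information, improves recognition performance, and outperforms existing methods on the ChaLearn249 IsoGD dataset.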

Details Motivation: Existing isolated sign language recognition (ISLR) methods rely mainly on RGB data or pose information, but when they combine the two modalities they often lose important details such as hand shape and orientation. Method: RGB and pose information are combined through promptable zero-shot video segmentation: the pose information gives a rough localization of the body and hands, and the corresponding parts are segmented through the video so that all relevant shape information is kept. Result: Evaluation on the complex ChaLearn249 IsoGD dataset shows that SegSLR outperforms state-of-the-art methods. Ablation studies show that SegSLR benefits from focusing on the signer's body and hands, which supports the design choices. Conclusion: SegSLR effectively combines RGB and pose information through promptable zero-shot video segmentation and outperforms state-of-the-art methods on the ISLR task. Abstract: Isolated Sign Language Recognition (ISLR) approaches primarily rely on RGB data or signer pose information. However, combining these modalities often results in the loss of crucial details, such as hand shape and orientation, due to imprecise representations like bounding boxes. Therefore, we propose the ISLR system SegSLR, which combines RGB and pose information through promptable zero-shot video segmentation. Given the rough localization of the hands and the signer's body from pose information, we segment the respective parts through the video to maintain all relevant shape information. Subsequently, the segmentations focus the processing of the RGB data on the most relevant body parts for ISLR. This effectively combines RGB and pose information. Our evaluation on the complex ChaLearn249 IsoGD dataset shows that SegSLR outperforms state-of-the-art methods. Furthermore, ablation studies indicate that SegSLR strongly benefits from focusing on the signer's body and hands, justifying our design choices.

[94] SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation

Jecia Z. Y. Mao,Francis X Creighton,Russell H Taylor,Manish Sahu

Main category: cs.CV

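TL;DR: Proposes SCOPE, a speech-guided collaborative perception framework that combines a large language model with open-set vision foundation models to segment, label, and track surgical instruments and anatomy on the fly in intraoperative video streams.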

Details Motivation: Existing methods depend on labeled data and domain-specific models, adapt poorly to new surgical scenarios, and are restricted to predefined categories; current vision foundation models, in turn, depend on manual prompts and are hard to deploy intraoperatively. Method: The SCOPE framework fuses the reasoning capabilities of a large language model with the perception capabilities of open-set vision foundation models. A collaborative perception agent generates segmentation candidates and refines them using the clinician's speech feedback, and segmented instruments then act as interactive pointers for labeling other elements of the surgical scene. Result: The framework is validated on a subset of Cataract1k and an in-house ex-vivo skull-base dataset, achieving on-the-fly segmentation and tracking of the intraoperative scene, and its dynamic capabilities are shown in a live mock ex-vivo experiment. Conclusion: SCOPE shows the potential of human-machine collaboration in surgical settings and offers a new paradigm for adaptable, hands-free, surgeon-centric intelligent tools. Abstract: Accurate segmentation and tracking of relevant elements of the surgical scene is crucial to enable context-aware intraoperative assistance and decision making. Current solutions remain tethered to domain-specific, supervised models that rely on labeled data and require domain-specific data to adapt to new surgical scenarios and to go beyond predefined label categories. Recent advances in prompt-driven vision foundation models (VFM) have enabled open-set, zero-shot segmentation across heterogeneous medical images. However, the dependence of these models on manual visual or textual cues restricts their deployment in intraoperative surgical settings. We introduce a speech-guided collaborative perception (SCOPE) framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs to support on-the-fly segmentation, labeling and tracking of surgical instruments and anatomy in intraoperative video streams. A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentation and incorporates intuitive speech feedback from clinicians to guide the segmentation of surgical instruments in a natural human-machine collaboration paradigm. Afterwards, instruments themselves serve as interactive pointers to label additional elements of the surgical scene. We evaluated our proposed framework on a subset of the publicly available Cataract1k dataset and an in-house ex-vivo skull-base dataset to demonstrate its potential to generate on-the-fly segmentation and tracking of the surgical scene.
Furthermore, we demonstrate its dynamic capabilities through a live mock ex-vivo experiment. This human-AI collaboration paradigm showcases the potential of developing adaptable, hands-free, surgeon-centric tools for dynamic operating-room environments.

[95] Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation

Yi-Ruei Liu,You-Zhe Xie,Yu-Hsiang Hsu,I-Sheng Fang,Yu-Lun Liu,Jun-Cheng Chen

Main category: cs.CV

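TL;DR: Proposes 4D Gaussian Ray Tracing (4D-GRT), a two-stage pipeline that generates dynamic-scene videos with realistic camera effects, combining fast rendering with physical accuracy.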

Details Motivation: Existing methods for simulating real camera effects such as fisheye distortion and rolling shutter are costly, leave a large sim-to-real gap, or model the effects inaccurately, so effective training data is lacking. Method: 4D Gaussian Splatting is combined with physically-based ray tracing: dynamic scenes are first reconstructed from multi-view videos, and ray tracing then generates videos with controllable camera effects. Result: 4D-GRT has the fastest rendering speed, with rendering quality better than or comparable to existing baselines; a benchmark of eight synthetic dynamic scenes covering four camera effects is built to evaluate video generation with camera effects. Conclusion: 4D-GRT simulates real camera effects efficiently and accurately, gives computer vision systems a high-quality, low-cost way to generate videos with camera effects, and advances the related datasets. Abstract: Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from high costs or sim-to-real gaps, or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while achieving better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.

[96] EditDuet: A Multi-Agent System for Video Non-Linear Editing

Marcelo Sandoval-Castaneda,Bryan Russell,Josef Sivic,Gregory Shakhnarovich,Fabian Caba Heilbron

Main category: cs.CV

TL;DR: This paper introduces an automated video editing system using a multi-agent approach with an Editor and Critic agent, significantly outperforming existing methods in efficiency and user satisfaction.

Details Motivation: The motivation is to automate the core task of video editing, moving beyond previous approaches that focused on retrieval or user interfaces, by formulating video editing as a sequential decision-making process. Method: The method involves a multi-agent system where the Editor agent uses video editing tools to create sequences based on input clips and instructions, and the Critic agent provides feedback or approves the output. A learning-based approach enables communication between agents, and an LLM-as-a-judge metric is used for evaluation. Result: The system was evaluated qualitatively and quantitatively through a user study, showing vast improvements over existing approaches in multiple metrics including coverage, time constraint satisfaction, and human preference. Conclusion: The proposed multi-agent approach for video editing, which includes an Editor agent and a Critic agent, outperforms existing methods in coverage, time constraint satisfaction, and human preference. Abstract: Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task.
Finally, we explore an LLM-as-a-judge metric for evaluating the quality of video editing system and compare it with general human preference. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.
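
The Editor and Critic interaction described in the abstract can be pictured as a propose-and-review loop. The sketch below is a toy control-flow illustration only: the function signatures, the round budget, and the rule-based stand-in agents are all hypothetical, and the real system uses LLM agents with video editing tools.

```python
from typing import Callable, List, Optional, Tuple

Clips = List[str]
EditorFn = Callable[[Clips, str, Optional[Clips], Optional[str]], Clips]
CriticFn = Callable[[Clips, str], Tuple[bool, Optional[str]]]


def edit_loop(editor: EditorFn, critic: CriticFn, clips: Clips,
              instruction: str, max_rounds: int = 5) -> Tuple[Clips, int]:
    """Alternate Editor proposals and Critic reviews until approval or the round budget runs out."""
    sequence: Optional[Clips] = None
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        sequence = editor(clips, instruction, sequence, feedback)
        approved, feedback = critic(sequence, instruction)
        if approved:
            return sequence, round_index  # the real Critic would render the sequence here
    return sequence or [], max_rounds


def toy_editor(clips, instruction, previous, feedback):
    # Stand-in for an LLM Editor: start with every clip, trim one when told it is too long.
    if previous is None:
        return list(clips)
    if feedback == "too long":
        return previous[:-1]
    return previous


def toy_critic(sequence, instruction):
    # Stand-in for an LLM Critic: approve once the sequence has at most two clips.
    if len(sequence) > 2:
        return False, "too long"
    return True, None


final_sequence, rounds_used = edit_loop(
    toy_editor, toy_critic, ["a", "b", "c", "d"], "keep it short"
)
```

Capping the number of rounds keeps the loop from running forever when the two agents never agree, which is a design choice of this sketch rather than something the abstract specifies.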

[97] Enhancement Without Contrast: Stability-Aware Multicenter Machine Learning for Glioma MRI Imaging

Sajad Amiri,Shahram Taeb,Sara Gharibi,Setareh Dehghanfard,Somayeh Sadat Mehrnia,Mehrdad Oveisi,Ilker Hacihaliloglu,Arman Rahmim,Mohammad R. Salmanpour

Main category: cs.CV

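TL;DR: Proposes a stability-aware machine learning framework for multicenter prediction of glioma MRI contrast enhancement, achieving reliable enhancement prediction from non-contrast MRI and reducing reliance on gadolinium-based contrast agents.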

Details Motivation: Gadolinium-based contrast agents raise safety, cost, and accessibility concerns in glioma imaging, and variability across scanners and cohorts hinders robust model selection. Method: Using PyRadiomics (IBSI-compliant), 108 features are extracted from non-contrast T1-weighted images and combined with 48 dimensionality reduction methods and 25 classifiers to build 1,200 machine learning pipelines; a rotational validation strategy trains on three datasets and tests on the fourth. Result: Cross-validation accuracy ranged from 0.91 to 0.96, and external testing reached an average accuracy of 0.93 (range 0.87 to 0.98); F1, precision, and recall were stable (0.87 to 0.96), while ROC-AUC varied more widely (0.50 to 0.82); the pipeline combining MI with ETr performed best and was the most stable. Conclusion: The stability-aware framework reliably predicts glioma contrast enhancement from non-contrast MRI, improves generalization across centers, and provides a scalable, reproducible machine learning template for neuro-oncology. Abstract: Gadolinium-based contrast agents (GBCAs) are central to glioma imaging but raise safety, cost, and accessibility concerns. Predicting contrast enhancement from non-contrast MRI using machine learning (ML) offers a safer alternative, as enhancement reflects tumor aggressiveness and informs treatment planning. Yet scanner and cohort variability hinder robust model selection. We propose a stability-aware framework to identify reproducible ML pipelines for multicenter prediction of glioma MRI contrast enhancement. We analyzed 1,446 glioma cases from four TCIA datasets (UCSF-PDGM, UPENN-GB, BRATS-Africa, BRATS-TCGA-LGG). Non-contrast T1WI served as input, with enhancement derived from paired post-contrast T1WI. Using PyRadiomics under IBSI standards, 108 features were extracted and combined with 48 dimensionality reduction methods and 25 classifiers, yielding 1,200 pipelines. In rotational validation, pipelines were trained on three datasets and tested on the fourth. Cross-validation prediction accuracies ranged from 0.91 to 0.96, with external testing achieving 0.87 (UCSF-PDGM), 0.98 (UPENN-GB), and 0.95 (BRATS-Africa), with an average of 0.93. F1, precision, and recall were stable (0.87 to 0.96), while ROC-AUC varied more widely (0.50 to 0.82), reflecting cohort heterogeneity. The MI linked with ETr pipeline consistently ranked highest, balancing accuracy and stability. This framework demonstrates that stability-aware model selection enables reliable prediction of contrast enhancement from non-contrast glioma MRI, reducing reliance on GBCAs and improving generalizability across centers. 
It provides a scalable template for reproducible ML in neuro-oncology and beyond.
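
As a rough illustration of the pipeline grid and rotational validation described for [97], the sketch below enumerates the 48 × 25 = 1,200 combinations and builds leave-one-dataset-out folds over the four cohorts named in the abstract. The method labels (`dr_00`, `clf_00`, and so on) are placeholders rather than the methods used in the paper, and nothing is trained here.

```python
from itertools import product

# Dataset names come from the abstract; the method labels are placeholders.
DATASETS = ["UCSF-PDGM", "UPENN-GB", "BRATS-Africa", "BRATS-TCGA-LGG"]
REDUCERS = [f"dr_{i:02d}" for i in range(48)]      # 48 dimensionality reduction methods
CLASSIFIERS = [f"clf_{i:02d}" for i in range(25)]  # 25 classifiers


def build_pipelines(reducers, classifiers):
    """Pair every dimensionality reduction method with every classifier."""
    return list(product(reducers, classifiers))


def rotational_folds(datasets):
    """Yield (train_datasets, test_dataset), holding out each dataset once."""
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        yield train, held_out


pipelines = build_pipelines(REDUCERS, CLASSIFIERS)
folds = list(rotational_folds(DATASETS))

print(len(pipelines))
for train, test in folds:
    print(f"train on {train}, test on {test}")
```

Ranking pipelines by both their mean score and their spread across such folds is one plausible reading of "stability-aware" selection; the abstract does not spell out the exact criterion.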

[98] Group Evidence Matters: Tiling-based Semantic Gating for Dense Object Detection

Yilun Xiao

Main category: cs.CV

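TL;DR: This paper presents a post-processing framework that converts overlap-induced redundancy into group evidence, improving detection recall for dense small objects in UAV imagery.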

Details Motivation: Dense small objects in UAV imagery are often missed because of long-range viewpoints, occlusion, and clutter. Method: The paper proposes a detector-agnostic post-processing framework that converts overlap-induced redundancy into group evidence. Overlapping tiling first recovers low-confidence candidates. A spatial gate (DBSCAN on box centroids) and a semantic gate (DBSCAN on ResNet-18 embeddings) then validate the group evidence. Validated groups receive controlled confidence reweighting before class-aware NMS fusion. Result: On VisDrone, recall rises from 0.685 to 0.778 (+0.093) while precision moves from 0.801 to 0.595, giving an F1 score of 0.669. Post-processing latency averages 0.095 s per image. Ablations confirm that tiling exposes missed objects, spatial clustering stabilizes geometry, semantic clustering enforces appearance coherence, and reweighting provides calibrated integration with the baseline. Conclusion: The framework requires no retraining and integrates with modern detectors; future work will reduce the cost of semantic gating and extend the approach with temporal cues. Abstract: Dense small objects in UAV imagery are often missed due to long-range viewpoints, occlusion, and clutter. This paper presents a detector-agnostic post-processing framework that converts overlap-induced redundancy into group evidence. Overlapping tiling first recovers low-confidence candidates. A Spatial Gate (DBSCAN on box centroids) and a Semantic Gate (DBSCAN on ResNet-18 embeddings) then validate group evidence. Validated groups receive controlled confidence reweighting before class-aware NMS fusion. Experiments on VisDrone show a recall increase from 0.685 to 0.778 (+0.093) and a precision adjustment from 0.801 to 0.595, yielding F1=0.669. Post-processing latency averages 0.095 s per image. These results indicate recall-first, precision-trade-off behavior that benefits recall-sensitive applications such as far-field counting and monitoring. Ablation confirms that tiling exposes missed objects, spatial clustering stabilizes geometry, semantic clustering enforces appearance coherence, and reweighting provides calibrated integration with the baseline. The framework requires no retraining and integrates with modern detectors. Future work will reduce semantic gating cost and extend the approach with temporal cues.
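
The overlapping-tiling step in [98] can be sketched as follows. This is a minimal illustration only: the tile size, overlap ratio, and helper names are assumptions, not values or code from the paper, and the DBSCAN gates and reweighting are not shown.

```python
def tile_origins(length, tile, stride):
    """Start offsets of tiles along one axis, with the last tile flush to the edge."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # make sure the far edge is covered
    return origins


def overlapping_tiles(width, height, tile=640, overlap=0.25):
    """Return (x0, y0, x1, y1) tiles that overlap by roughly `overlap` of a tile."""
    stride = max(1, int(tile * (1.0 - overlap)))
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, stride)
        for x in tile_origins(width, tile, stride)
    ]


def to_global(box, tile_box):
    """Shift a tile-local (x0, y0, x1, y1) box into full-image coordinates."""
    ox, oy = tile_box[0], tile_box[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


tiles = overlapping_tiles(1920, 1080)
print(len(tiles), tiles[0], tiles[-1])
```

Because neighbouring tiles overlap, one object can be detected several times after `to_global`; that redundancy is what the framework treats as group evidence instead of suppressing it immediately.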

[99] InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong,Peizhou Cao,Yichen Jin,Li Luo,Wenzhe Cai,Jingli Lin,Hanqing Wang,Zhaoyang Lyu,Tai Wang,Bo Dai,Xudong Xu,Jiangmiao Pang

Main category: cs.CV

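TL;DR: This paper introduces InternScenes, a large-scale simulatable indoor scene dataset with approximately 40,000 diverse scenes and 1.96M 3D objects that integrates real-world scans, procedurally generated scenes, and designer-created scenes, addressing the limitations of existing datasets in scale, diversity, and layout realism.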

Details Motivation: Existing 3D scene datasets are limited in data scale, diversity, layout realism, and coverage of small objects, which constrains progress in Embodied AI; a more realistic, complex, and simulatable large-scale scene dataset is therefore needed. Method: InternScenes, a dataset of roughly 40,000 scenes, is built by merging three scene sources: real-world scans, procedurally generated scenes, and designer-created scenes. A complete data processing pipeline, covering real-to-sim conversion, injection of interactive objects, and collision resolution through physical simulation, ensures that the scenes are simulatable and interactive. Result: InternScenes covers 15 common scene types and 288 object classes with an average of 41.5 objects per region, substantially increasing scene complexity and realism; its difficulty and usefulness are validated on two benchmark tasks, scene layout generation and point-goal navigation. Conclusion: InternScenes provides high-quality, large-scale, highly realistic indoor scene data for Embodied AI, supports model training and evaluation in complex scenes, and advances scene generation and navigation; the data, models, and benchmarks will be open-sourced to benefit the community. Abstract: The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce InternScenes, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources (real-world scans, procedurally generated scenes, and designer-created scenes), including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

[100] Well-Conditioned Polynomial Representations for Mathematical Handwriting Recognition

Robert M. Corless,Deepak Singh Kalhan,Stephen M. Watt

Main category: cs.CV

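TL;DR: This paper explores the trade-offs of representing parameterized plane curve polynomials in different bases (Legendre, Legendre-Sobolev, Chebyshev, and their variants) for mathematical handwriting recognition, focusing on how basis choice and polynomial degree affect modeling accuracy and computational cost.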

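Details Motivation: To lower the computational cost of mathematical handwriting recognition while keeping accuracy high, the behavior and numerical stability of different polynomial bases need to be evaluated systematically. Method: The authors analyze the condition number for polynomial evaluation in each basis and use norms derived from the inner products to bound the variation between symbols, comparing the bases across polynomial degrees. Result: The work characterizes how the bases (Legendre, Chebyshev, and others) differ in conditioning and in their ability to separate symbols, and describes the trade-off between basis choice and polynomial degree for accurate modeling at low computational cost. Conclusion: Choosing a suitable basis and an appropriate polynomial degree can substantially reduce computational complexity while preserving modeling accuracy, offering a path to optimizing digital ink representations for mathematical handwriting recognition. Abstract: Previous work has made use of a parameterized plane curve polynomial representation for mathematical handwriting, with the polynomials represented in a Legendre or Legendre-Sobolev graded basis. This provides a compact geometric representation for the digital ink. Preliminary results have also been shown for Chebyshev and Chebyshev-Sobolev bases. This article explores the trade-offs between basis choice and polynomial degree to achieve accurate modeling with a low computational cost. To do this, we consider the condition number for polynomial evaluation in these bases and bound how the various inner products give norms for the variations between symbols.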
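
To make "condition number for polynomial evaluation" in [100] concrete, the sketch below uses one standard definition, the ratio Σ|c_k||φ_k(x)| / |p(x)| for p(x) = Σ c_k φ_k(x), with the Chebyshev and Legendre bases generated by their three-term recurrences. The paper's exact definition and its Sobolev variants may differ, and the coefficients are made up for illustration.

```python
def chebyshev_values(n, x):
    """T_0(x)..T_n(x) via T_{k+1} = 2x T_k - T_{k-1}."""
    vals = [1.0, x]
    for _ in range(1, n):
        vals.append(2.0 * x * vals[-1] - vals[-2])
    return vals[: n + 1]


def legendre_values(n, x):
    """P_0(x)..P_n(x) via (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}."""
    vals = [1.0, x]
    for k in range(1, n):
        vals.append(((2 * k + 1) * x * vals[-1] - k * vals[-2]) / (k + 1))
    return vals[: n + 1]


def evaluate(coeffs, basis_values):
    """p(x) = sum_k c_k * phi_k(x) for precomputed basis values phi_k(x)."""
    return sum(c * b for c, b in zip(coeffs, basis_values))


def evaluation_condition(coeffs, basis_values):
    """Relative sensitivity of p(x) to relative perturbations of the coefficients."""
    p = evaluate(coeffs, basis_values)
    bound = sum(abs(c) * abs(b) for c, b in zip(coeffs, basis_values))
    return float("inf") if p == 0 else bound / abs(p)


coeffs = [1.0, -0.5, 0.25]  # illustrative coefficients only
x = 0.5
print(evaluation_condition(coeffs, chebyshev_values(2, x)))
print(evaluation_condition(coeffs, legendre_values(2, x)))
```

Note that the same coefficient vector denotes a different polynomial in each basis, so the two printed numbers demonstrate the formula rather than a like-for-like comparison of bases.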

[101] Multi-Task Diffusion Approach For Prediction of Glioma Tumor Progression

Aghiles Kebaili,Romain Modzelewski,Jérôme Lapuyade-Lahorgue,Maxime Fontanilles,Sébastien Thureau,Su Ruan

Main category: cs.CV

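TL;DR: This study proposes a multi-task diffusion framework for predicting glioma progression that combines a deformation module with a radiotherapy-weighted focal loss term, and reports results on a public and an internal dataset.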

Details Motivation: Glioma progresses rapidly and has a poor prognosis, and longitudinal MRI data in clinical practice are sparse and irregularly acquired; a method for accurate prediction of tumor evolution is needed that copes with the resulting data imbalance and modeling difficulty. Method: A multi-task diffusion framework simultaneously generates future FLAIR sequences and estimates spatial probabilistic tumor evolution maps based on signed distance fields. A pretrained deformation module and a radiotherapy-weighted focal loss term are added to model the dynamics of tumor evolution and improve training accuracy. Result: The method achieves promising results on both a public dataset and an internal private dataset, produces time-dependent tumor progression probability maps, and improves model stability and accuracy by synthesizing complete follow-up scan sequences and imputing missing MRI modalities. Conclusion: The proposed multi-task diffusion framework performs well at predicting glioma progression and can generate flexible time-dependent probability maps from only two follow-up scans, giving clinicians a tool for assessing progression risk. Abstract: Glioma, an aggressive brain malignancy characterized by rapid progression and poor prognosis, poses significant challenges for accurate evolution prediction. These challenges are exacerbated by sparse, irregularly acquired longitudinal MRI data in clinical practice, where incomplete follow-up sequences create data imbalances and make reliable modeling difficult. In this paper, we present a multitask diffusion framework for time-agnostic, pixel-wise prediction of glioma progression. The model simultaneously generates future FLAIR sequences at any chosen time point and estimates spatial probabilistic tumor evolution maps derived using signed distance fields (SDFs), allowing uncertainty quantification. To capture temporal dynamics of tumor evolution across arbitrary intervals, we integrate a pretrained deformation module that models inter-scan changes using deformation fields. Regarding the common clinical limitation of data scarcity, we implement a targeted augmentation pipeline that synthesizes complete sequences of three follow-up scans and imputes missing MRI modalities from available patient studies, improving the stability and accuracy of predictive models. Based on merely two follow-up scans at earlier timepoints, our framework produces flexible time-dependent probability maps, enabling clinicians to interrogate tumor progression risks at any future temporal milestone. We further introduce a radiotherapy-weighted focal loss term that leverages radiation dose maps, as these highlight regions of greater clinical importance during model training. 
The proposed method was trained on a public dataset and evaluated on an internal private dataset, achieving promising results in both cases.

[102] Point-Plane Projections for Accurate LiDAR Semantic Segmentation in Small Data Scenarios

Simone Mosco,Daniel Fusaro,Wanmeng Li,Emanuele Menegatti,Alberto Pretto

Main category: cs.CV

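TL;DR: This paper proposes a point-based semantic segmentation method for LiDAR point clouds that uses point-plane projections and geometry-aware data augmentation to improve performance in limited-data scenarios.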

Details Motivation: LiDAR point cloud semantic segmentation is essential for autonomous driving and robotics, but existing methods are computationally expensive and need large amounts of training data, which limits their generalization in data-scarce scenarios. Method: The authors improve point-based methods by learning features from 2D representations through point-plane projections, and introduce a geometry-aware data augmentation technique that mitigates class imbalance. Result: Experiments show significant improvements in limited-data scenarios and competitive results on two standard datasets, SemanticKITTI and PandaSet. Conclusion: With point-plane projections and data augmentation, the method improves point cloud semantic segmentation while relying solely on LiDAR data, especially when training data are scarce. Abstract: LiDAR point cloud semantic segmentation is essential for interpreting 3D environments in applications such as autonomous driving and robotics. Recent methods achieve strong performance by exploiting different point cloud representations or incorporating data from other sensors, such as cameras or external datasets. However, these approaches often suffer from high computational complexity and require large amounts of training data, limiting their generalization in data-scarce scenarios. In this paper, we improve the performance of point-based methods by effectively learning features from 2D representations through point-plane projections, enabling the extraction of complementary information while relying solely on LiDAR data. Additionally, we introduce a geometry-aware technique for data augmentation that aligns with LiDAR sensor properties and mitigates class imbalance. We implemented and evaluated our method, which applies point-plane projections onto multiple informative 2D representations of the point cloud. Experiments demonstrate that this approach leads to significant improvements in limited-data scenarios, while also achieving competitive results on two publicly available standard datasets, SemanticKITTI and PandaSet. The code of our method is available at https://github.com/SiMoM0/3PNet
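
A point-plane projection in the sense of [102] can be pictured as binning each 3D point into a 2D grid on some plane while remembering which points fell into which cell. The sketch below assumes the three axis-aligned planes and a uniform cell size; the planes and resolutions actually used by the paper are not stated in the abstract.

```python
from collections import defaultdict

# Axis pairs that span each plane; the remaining coordinate is dropped.
PLANES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}


def project_to_plane(points, plane, cell_size):
    """Bin 3D points into a 2D grid on the given plane.

    Returns a dict mapping grid cells to lists of point indices, so that
    features computed per cell can be gathered back onto the points.
    """
    a, b = PLANES[plane]
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = (int(p[a] // cell_size), int(p[b] // cell_size))
        grid[cell].append(idx)
    return dict(grid)


points = [(0.2, 0.3, 1.5), (0.4, 0.1, 0.2), (2.6, 0.3, 1.7)]
for name in PLANES:
    print(name, project_to_plane(points, name, cell_size=1.0))
```

Points that share a cell on one plane can land in different cells on another, which is why several projections can carry complementary information.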

[103] OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds

Chongyu Wang,Kunlei Jing,Jihua Zhu,Di Wang

Main category: cs.CV

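TL;DR: Proposes OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban point cloud scenes that requires no aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations.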

Details Motivation: Existing 3D segmentation methods struggle with open-vocabulary semantic segmentation of large-scale urban point clouds because high-quality multi-view imagery is often missing and generalization across environments is poor. Method: Semantic features are generated directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, and are then distilled into a 3D backbone network. Result: Segmentation accuracy and cross-scene generalization improve significantly on large-scale urban benchmarks such as SensatUrban and SUM. Conclusion: OpenUrban3D enables zero-shot, text-driven 3D semantic segmentation that captures both semantic richness and geometric priors, making it a flexible and scalable solution for urban scene understanding. Abstract: Open-vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine-grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large-scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets and the poor generalization of existing three-dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that operates without aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, followed by distillation into a 3D backbone model. This design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. 
Extensive experiments on large-scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross-scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.

[104] AutoOEP -- A Multi-modal Framework for Online Exam Proctoring

Aryan Kashyap Naveen,Bhuvanesh Singla,Raajan Wankhade,Shreesha M,Ramu S,Ram Mohana Reddy Guddeti

Main category: cs.CV

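TL;DR: AutoOEP is an automated online exam proctoring framework based on computer vision and machine learning. It uses a dual-camera setup and several analysis modules (face recognition, head pose estimation, gaze tracking, mouth movement analysis, and object detection) to detect suspicious behavior, and an LSTM network analyzes temporal patterns to compute a real-time cheating probability score.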

Details Motivation: The rise of online education creates an urgent need for robust, scalable systems that safeguard academic integrity in remote exams. Human proctoring is often infeasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. Method: AutoOEP uses a dual-camera setup to capture a frontal view of the examinee and a side view of the workspace. It performs face recognition with ArcFace together with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. A fine-tuned YOLOv11 model detects prohibited items and tracks hand proximity to them, and an LSTM network analyzes temporal patterns to compute a real-time cheating probability score. Result: On a custom-collected dataset simulating diverse exam conditions, the system classifies suspicious activities with 90.7% accuracy. The object detection component reaches a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the whole framework processes video at roughly 2.4 frames per second without a GPU. Conclusion: AutoOEP is an effective automated proctoring solution that significantly reduces the need for human intervention and strengthens the integrity of online assessments. Abstract: The burgeoning of online education has created an urgent need for robust and scalable systems to ensure academic integrity during remote examinations. Traditional human proctoring is often not feasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. Our approach integrates several parallel analyses: the Face Module performs continuous identity verification using ArcFace, along with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. Concurrently, the Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects. Features from these modules are aggregated and fed into a Long Short-Term Memory (LSTM) network that analyzes temporal patterns to calculate a real-time cheating probability score. We evaluate AutoOEP on a custom-collected dataset simulating diverse exam conditions. Our system achieves an accuracy of 90.7% in classifying suspicious activities. 
The object detection component obtains a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the entire framework processes video streams at approximately 2.4 frames per second without a GPU. The results demonstrate that AutoOEP is an effective and resource-efficient solution for automated proctoring, significantly reducing the need for human intervention and enhancing the integrity of online assessments.

[105] Total Variation Subgradient Guided Image Fusion for Dual-Camera CASSI System

Weiqiang Zhao,Tianzhu Liu,Yuzhe Gui,Yanfeng Gu

Main category: cs.CV

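TL;DR: Proposes an interpretable computational spectral imaging framework based on dual-camera CASSI and total variation (TV) subgradient theory; a dynamic regularization strategy and an adaptive reference generation mechanism preserve spatial-spectral structural consistency at high compression ratios.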

Details Motivation: Conventional compressive-sensing spectral imaging reconstructs poorly at high compression ratios; model-based methods depend on handcrafted priors, and deep learning methods lack physical interpretability. Method: An end-to-end SD-CASSI mathematical model is combined with TV subgradient theory; spatial priors from the auxiliary camera drive a dynamic regularization strategy and an adaptive reference update mechanism, giving a reconstruction with strict convex optimization guarantees. Result: Experiments show that the method preserves spatial-spectral structural consistency across diverse reconstruction scenarios, with good robustness and physical interpretability. Conclusion: The framework provides a mathematically rigorous and interpretable foundation for multi-camera computational spectral imaging, balancing performance and physical plausibility. Abstract: Spectral imaging technology has long faced fundamental challenges in balancing spectral, spatial, and temporal resolutions. While compressive sensing-based Coded Aperture Snapshot Spectral Imaging (CASSI) mitigates this trade-off through optical encoding, high compression ratios result in ill-posed reconstruction problems. Traditional model-based methods exhibit limited performance due to reliance on handcrafted inherent image priors, while deep learning approaches are constrained by their black-box nature, which compromises physical interpretability. To address these limitations, we propose a dual-camera CASSI reconstruction framework that integrates total variation (TV) subgradient theory. By establishing an end-to-end SD-CASSI mathematical model, we reduce the computational complexity of solving the inverse problem and provide a mathematically well-founded framework for analyzing multi-camera systems. A dynamic regularization strategy is introduced, incorporating normalized gradient constraints from RGB/panchromatic-derived reference images, which constructs a TV subgradient similarity function with strict convex optimization guarantees. Leveraging spatial priors from auxiliary cameras, an adaptive reference generation and updating mechanism is designed to provide subgradient guidance. Experimental results demonstrate that the proposed method effectively preserves spatial-spectral structural consistency. The theoretical framework establishes an interpretable mathematical foundation for computational spectral imaging, demonstrating robust performance across diverse reconstruction scenarios. The source code is available at https://github.com/bestwishes43/ADMM-TVDS.
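
For readers unfamiliar with the TV subgradient that [105] builds on, the sketch below computes the anisotropic total variation of a 1D signal and one of its subgradients, D^T sign(D x), where D is the forward-difference operator. It shows only this textbook building block under simplifying assumptions (1D, unweighted); it is not the paper's similarity function or its ADMM solver.

```python
def sign(v):
    """Sign with sign(0) = 0, which is a valid subgradient choice for |v| at 0."""
    return (v > 0) - (v < 0)


def total_variation(x):
    """Anisotropic 1D total variation: sum of |x[i+1] - x[i]|."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def tv_subgradient(x):
    """One subgradient of total_variation at x, i.e. D^T sign(D x)."""
    s = [sign(b - a) for a, b in zip(x, x[1:])]
    g = [0] * len(x)
    for i, si in enumerate(s):
        g[i] -= si      # contribution of |x[i+1] - x[i]| to coordinate i
        g[i + 1] += si  # contribution of |x[i+1] - x[i]| to coordinate i + 1
    return g


x = [0, 1, 3, 2]
print(total_variation(x), tv_subgradient(x))
```

A subgradient g at x satisfies TV(y) ≥ TV(x) + g·(y − x) for every y, which is the property that lets a reference image's edge directions guide the reconstruction.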

[106] Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation

Mohanad Albughdadi

Main category: cs.CV

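TL;DR: This study proposes a compact Metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters that transfers efficiently to Earth Observation tasks and stays competitive even when metadata is unavailable, suggesting that compact models are a viable direction for EO foundation models.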

Details Motivation: Existing Earth Observation foundation models are computationally expensive, which limits their accessibility and reuse, so compact architectures are worth exploring as a practical alternative. Method: A Metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters combines sparse expert routing with geo-temporal conditioning; it is pretrained on the BigEarthNet-Landsat dataset using imagery together with latitude/longitude and seasonal/daily cyclic encodings, and embeddings from its frozen encoder are evaluated with linear probes. Result: Despite its small size, MoE-MAE competes with much larger models, indicating that metadata-aware pretraining improves transfer and label efficiency; it also performs competitively on the EuroSAT-Landsat dataset, which has no explicit metadata. Conclusion: Compact, metadata-aware MoE-MAE models are an efficient and scalable step toward future Earth Observation foundation models. Abstract: Recent advances in Earth Observation have focused on large-scale foundation models. However, these models are computationally expensive, limiting their accessibility and reuse for downstream tasks. In this work, we investigate compact architectures as a practical pathway toward smaller general-purpose EO models. We propose a Metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters. The model combines sparse expert routing with geo-temporal conditioning, incorporating imagery alongside latitude/longitude and seasonal/daily cyclic encodings. We pretrain the MoE-MAE on the BigEarthNet-Landsat dataset and evaluate embeddings from its frozen encoder using linear probes. Despite its small size, the model competes with much larger architectures, demonstrating that metadata-aware pretraining improves transfer and label efficiency. To further assess generalization, we evaluate on the EuroSAT-Landsat dataset, which lacks explicit metadata, and still observe competitive performance compared to models with hundreds of millions of parameters. These results suggest that compact, metadata-aware MoE-MAEs are an efficient and scalable step toward future EO foundation models.
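
The "seasonal/daily cyclic encodings" mentioned in [106] are commonly implemented as sine/cosine pairs, so that values at the two ends of a period (for example 23:59 and 00:01) end up close together. The sketch below is a generic version of that idea; how the paper scales latitude and longitude, and the layout of its metadata vector, are assumptions made here for illustration.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic quantity to a point (sin, cos) on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))


def encode_metadata(lat_deg, lon_deg, day_of_year, hour):
    """Hypothetical 7-value metadata vector for one image."""
    return (
        lat_deg / 90.0,                       # latitude is not periodic: scale to [-1, 1]
        *cyclic_encode(lon_deg, 360.0),       # longitude wraps at +/-180 degrees
        *cyclic_encode(day_of_year, 365.25),  # seasonal cycle
        *cyclic_encode(hour, 24.0),           # daily cycle
    )


print(encode_metadata(48.85, 2.35, day_of_year=172, hour=10.5))
```

Feeding such a vector to the encoder alongside image patches is one simple way to condition on geo-temporal context; when metadata is missing, as with EuroSAT-Landsat, the conditioning input has to be dropped or replaced by a default.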

[107] Simulating Sinogram-Domain Motion and Correcting Image-Domain Artifacts Using Deep Learning in HR-pQCT Bone Imaging

Farhan Sadik,Christopher L. Newman,Stuart J. Warden,Rachel K. Surowiec

Main category: cs.CV

TL;DR: The study introduces ESWGAN-GP, a deep learning method for correcting motion artifacts in HR-pQCT images, showing promising results in both simulated and real-world datasets.

Details Motivation: Rigid-motion artifacts hinder in vivo assessment of bone microstructures in HR-pQCT, and no motion correction methods currently exist due to the lack of standardized degradation models. Method: An Edge-enhanced Self-attention Wasserstein Generative Adversarial Network with Gradient Penalty (ESWGAN-GP) was proposed, incorporating edge-enhancing skip connections, self-attention mechanisms, and a VGG-based perceptual loss for motion correction in HR-pQCT images. Result: The ESWGAN-GP achieved improved performance metrics, including a mean SNR of 26.78, SSIM of 0.81, and VIF of 0.76 for the source dataset, and SNR of 29.31, SSIM of 0.87, and VIF of 0.81 for the target dataset. Conclusion: The proposed ESWGAN-GP method represents an important initial step toward implementing deep learning-based motion correction in HR-pQCT, despite addressing only a simplified representation of real-world motion. Abstract: Rigid-motion artifacts, such as cortical bone streaking and trabecular smearing, hinder in vivo assessment of bone microstructures in high-resolution peripheral quantitative computed tomography (HR-pQCT). Despite various motion grading techniques, no motion correction methods exist due to the lack of standardized degradation models. We optimize a conventional sinogram-based method to simulate motion artifacts in HR-pQCT images, creating paired datasets of motion-corrupted images and their corresponding ground truth, which enables seamless integration into supervised learning frameworks for motion correction. As such, we propose an Edge-enhanced Self-attention Wasserstein Generative Adversarial Network with Gradient Penalty (ESWGAN-GP) to address motion artifacts in both simulated (source) and real-world (target) datasets. The model incorporates edge-enhancing skip connections to preserve trabecular edges and self-attention mechanisms to capture long-range dependencies, facilitating motion correction. 
A visual geometry group (VGG)-based perceptual loss is used to reconstruct fine micro-structural features. The ESWGAN-GP achieves a mean signal-to-noise ratio (SNR) of 26.78, structural similarity index measure (SSIM) of 0.81, and visual information fidelity (VIF) of 0.76 for the source dataset, while showing improved performance on the target dataset with an SNR of 29.31, SSIM of 0.87, and VIF of 0.81. The proposed methods address a simplified representation of real-world motion that may not fully capture the complexity of in vivo motion artifacts. Nevertheless, because motion artifacts present one of the foremost challenges to more widespread adoption of this modality, these methods represent an important initial step toward implementing deep learning-based motion correction in HR-pQCT.

[108] Gaze Authentication: Factors Influencing Authentication Performance

Dillon Lohr,Michael J Proulx,Mehedi Hasan Raju,Oleg V Komogortsev

Main category: cs.CV

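TL;DR: Through large-scale experiments, this study analyzes the factors that affect gaze-based user authentication and finds that appropriate calibration and better signal quality improve authentication accuracy.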

Details Motivation: To identify the key factors that influence gaze-based authentication performance so that authentication systems can be made more accurate and reliable. Method: Experiments were run on a large in-house dataset of 8,849 subjects collected with Meta Quest Pro equivalent hardware, using a state-of-the-art neural network architecture to study how different factors affect authentication performance. Result: Using the same calibration target depth, fusing calibrated and non-calibrated gaze, and improving signal quality all improve authentication performance, whereas a simple moving average filter slightly reduces it. Conclusion: Calibrating at the same target depth, fusing calibrated and non-calibrated gaze, and improving eye tracking signal quality all raise authentication performance, while a simple three-sample moving average filter generally lowers it slightly. Abstract: This paper examines the key factors that influence the performance of state-of-the-art gaze-based authentication. Experiments were conducted on a large-scale, in-house dataset comprising 8,849 subjects collected with Meta Quest Pro equivalent hardware running a video oculography-driven gaze estimation pipeline at 72 Hz. A state-of-the-art neural network architecture was employed to study the influence of the following factors on authentication performance: eye tracking signal quality, various aspects of eye tracking calibration, and simple filtering on estimated raw gaze. We found that using the same calibration target depth for eye tracking calibration, fusing calibrated and non-calibrated gaze, and improving eye tracking signal quality all enhance authentication performance. We also found that a simple three-sample moving average filter slightly reduces authentication performance in general. While these findings hold true for the most part, some exceptions were noted.
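
The "simple three-sample moving average filter" in [108] can be sketched as below. The abstract does not say whether the window is trailing or centered or how the first samples are handled, so the trailing window with a shortened start used here is an assumption.

```python
def moving_average(samples, window=3):
    """Trailing moving average; early samples average over what is available."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def smooth_gaze(gaze_xy, window=3):
    """Filter the horizontal and vertical gaze channels independently."""
    xs = moving_average([p[0] for p in gaze_xy], window)
    ys = moving_average([p[1] for p in gaze_xy], window)
    return list(zip(xs, ys))


print(moving_average([3, 6, 9, 12]))
print(smooth_gaze([(0, 0), (3, 6), (6, 12)]))
```

Averaging suppresses noise but also blurs fast eye movements; at 72 Hz a three-sample window spans about 42 ms, which may be part of why the filter slightly hurts authentication, although the abstract does not give a reason.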

[109] TrueSkin: Towards Fair and Accurate Skin Tone Recognition and Generation

Haoming Lu

Main category: cs.CV

TL;DR: This paper introduces TrueSkin, a new dataset that helps improve the accuracy and fairness of skin tone recognition and generation.

Details Motivation: Skin tone recognition and generation remain challenging because comprehensive datasets and robust methodologies are lacking. Method: The authors introduce the TrueSkin dataset and use it to benchmark existing recognition and generation approaches. Result: A recognition model trained on TrueSkin improves classification accuracy by more than 20% over LMMs and conventional approaches, and fine-tuning image generation models with TrueSkin significantly improves skin tone fidelity. Conclusion: TrueSkin is a valuable training resource that can improve fairness and accuracy in skin tone recognition and generation. Abstract: Skin tone recognition and generation play important roles in model fairness, healthcare, and generative AI, yet they remain challenging due to the lack of comprehensive datasets and robust methodologies. Compared to other human image analysis tasks, state-of-the-art large multimodal models (LMMs) and image generation models struggle to recognize and synthesize skin tones accurately. To address this, we introduce TrueSkin, a dataset with 7299 images systematically categorized into 6 classes, collected under diverse lighting conditions, camera angles, and capture settings. Using TrueSkin, we benchmark existing recognition and generation approaches, revealing substantial biases: LMMs tend to misclassify intermediate skin tones as lighter ones, whereas generative models struggle to accurately produce specified skin tones when influenced by inherent biases from unrelated attributes in the prompts, such as hairstyle or environmental context. We further demonstrate that training a recognition model on TrueSkin improves classification accuracy by more than 20% compared to LMMs and conventional approaches, and fine-tuning with TrueSkin significantly improves skin tone fidelity in image generation models. Our findings highlight the need for comprehensive datasets like TrueSkin, which not only serves as a benchmark for evaluating existing models but also provides a valuable training resource to enhance fairness and accuracy in skin tone recognition and generation tasks.

[110] Policy-Driven Transfer Learning in Resource-Limited Animal Monitoring

Nisha Pillai,Aditi Virupakshaiah,Harrison W. Smith,Amanda J. Ashworth,Prasanna Gowda,Phillip R. Owens,Adam R. Rivers,Bindu Nanduri,Mahalingam Ramkumar

Main category: cs.CV

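TL;DR: Proposes a reinforcement-learning-based transfer learning framework that uses the UCB algorithm to automatically select the most suitable pre-trained model for animal detection, raising the detection rate while reducing computation time.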

Details Motivation: Limited labeled training data makes it hard to develop effective deep learning models for wildlife monitoring, and choosing the best model among the many pre-trained options is especially difficult for researchers new to the field. Method: A reinforcement learning framework based on the upper confidence bound (UCB) algorithm systematically evaluates and ranks candidate pre-trained models, automating model selection. Result: Experiments show that the framework achieves a higher detection rate on animal detection tasks than traditional methods while requiring significantly less computation time. Conclusion: The framework simplifies model selection in resource-limited settings and offers an efficient, automated solution for UAV-assisted animal monitoring. Abstract: Animal health monitoring and population management are critical aspects of wildlife conservation and livestock management that increasingly rely on automated detection and tracking systems. While Unmanned Aerial Vehicle (UAV) based systems combined with computer vision offer promising solutions for non-invasive animal monitoring across challenging terrains, limited availability of labeled training data remains an obstacle in developing effective deep learning (DL) models for these applications. Transfer learning has emerged as a potential solution, allowing models trained on large datasets to be adapted for resource-limited scenarios such as those with limited data. However, the vast landscape of pre-trained neural network architectures makes it challenging to select optimal models, particularly for researchers new to the field. In this paper, we propose a reinforcement learning (RL)-based transfer learning framework that employs an upper confidence bound (UCB) algorithm to automatically select the most suitable pre-trained model for animal detection tasks. Our approach systematically evaluates and ranks candidate models based on their performance, streamlining the model selection process. Experimental results demonstrate that our framework achieves a higher detection rate while requiring significantly less computational time compared to traditional methods.
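
The UCB-based model selection in [110] can be illustrated with the classic UCB1 rule, where each candidate pre-trained model is a bandit arm and each "pull" is one short evaluation. This is a generic sketch: the model names, the fixed scores standing in for noisy validation results, and the exploration constant `c` are all hypothetical, and the paper's reward definition is not given in the abstract.

```python
import math


def ucb_select(models, evaluate, rounds, c=1.0):
    """Pick a model with UCB1: try each arm once, then maximise mean + bonus."""
    if rounds < len(models):
        raise ValueError("need at least one round per candidate model")
    counts = {m: 0 for m in models}
    totals = {m: 0.0 for m in models}

    for t in range(1, rounds + 1):
        untried = [m for m in models if counts[m] == 0]
        if untried:
            choice = untried[0]
        else:
            choice = max(
                models,
                key=lambda m: totals[m] / counts[m]
                + c * math.sqrt(2.0 * math.log(t) / counts[m]),
            )
        totals[choice] += evaluate(choice)  # e.g. a validation score in [0, 1]
        counts[choice] += 1

    best = max(models, key=lambda m: totals[m] / counts[m])
    return best, counts


# Hypothetical candidates with fixed scores in place of real fine-tuning runs.
scores = {"model_a": 0.5, "model_b": 0.7, "model_c": 0.9}
best, counts = ucb_select(list(scores), scores.get, rounds=60)
print(best, counts)
```

The exploration bonus shrinks as a model is evaluated more often, so the evaluation budget drifts toward promising candidates instead of being spread evenly, which is where the computational savings over exhaustive comparison would come from.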

[111] Improving Fungi Prototype Representations for Few-Shot Classification

Abdarahmane Traore,Éric Hervet,Andy Couturier

Main category: cs.CV

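TL;DR: The FungiCLEF 2025 challenge targets automatic fungal species recognition from field-collected observations. This paper proposes a deep learning method based on prototypical networks that markedly improves classification of rare species with very few samples, exceeding the baseline by more than 30 percentage points in Recall@5 on both the public and private leaderboards.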

Details Motivation: Fungal species identification is important for biodiversity monitoring, but class distributions are extremely imbalanced and many rare species have very few samples, especially species missing from standard training sets. FungiCLEF 2025 aims to advance accurate identification of both common and rare fungi from real observational data. Method: A deep learning method based on prototypical networks improves few-shot fungal classification by enhancing the prototype representations, with particular attention to rare species that have scarce samples. Result: The method performs strongly in the FungiCLEF 2025 competition, exceeding the competition baseline by more than 30 percentage points in Recall@5 on both the public (PB) and private (PR) leaderboards, and identifies both common and rare species well. Conclusion: Prototypical-network-based methods handle the class imbalance and few-shot challenges of fungal species identification effectively, providing reliable support for large-scale biodiversity monitoring and meeting the core objectives of FungiCLEF 2025. Abstract: The FungiCLEF 2025 competition addresses the challenge of automatic fungal species recognition using realistic, field-collected observational data. Accurate identification tools support both mycologists and citizen scientists, greatly enhancing large-scale biodiversity monitoring. Effective recognition systems in this context must handle highly imbalanced class distributions and provide reliable performance even when very few training samples are available for many species, especially rare and under-documented taxa that are often missing from standard training sets. According to competition organizers, about 20% of all verified fungi observations, representing nearly 20,000 instances, are associated with these rarely recorded species. To tackle this challenge, we propose a robust deep learning method based on prototypical networks, which enhances prototype representations for few-shot fungal classification. Our prototypical network approach exceeds the competition baseline by more than 30 percentage points in Recall@5 on both the public (PB) and private (PR) leaderboards. This demonstrates strong potential for accurately identifying both common and rare fungal species, supporting the main objectives of FungiCLEF 2025.
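
The basic prototypical-network inference step behind [111] can be sketched as follows: each class prototype is the mean of its support embeddings, queries are ranked by distance to the prototypes, and Recall@k counts how often the true class appears in the top k. The 2D embeddings and species labels are toy values chosen for illustration, and the paper's specific prototype enhancements are not shown.

```python
import math
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Mean embedding per class (the prototype in a prototypical network)."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in groups.items()
    }


def top_k(query, prototypes, k=5):
    """Labels of the k prototypes closest to the query in Euclidean distance."""
    ranked = sorted(prototypes, key=lambda label: math.dist(query, prototypes[label]))
    return ranked[:k]


def recall_at_k(queries, true_labels, prototypes, k=5):
    """Fraction of queries whose true label is among the top-k predictions."""
    hits = sum(t in top_k(q, prototypes, k) for q, t in zip(queries, true_labels))
    return hits / len(queries)


# Toy support set; "rare_sp" has a single example, as rare species often do.
support = [(0, 0), (0, 2), (4, 4), (6, 4), (10, 0)]
support_labels = ["amanita", "amanita", "boletus", "boletus", "rare_sp"]
prototypes = class_prototypes(support, support_labels)

queries = [(9, 1), (1, 1), (5, 5)]
truth = ["rare_sp", "amanita", "amanita"]
print(recall_at_k(queries, truth, prototypes, k=1))
print(recall_at_k(queries, truth, prototypes, k=2))
```

A class with one support example still gets a usable prototype, which is why this family of methods suits the rare, under-documented species the competition emphasizes.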

[112] Cluster-Level Sparse Multi-Instance Learning for Whole-Slide Images

Yuedi Zhang,Zhixiang Xia,Guosheng Yin,Bin Liu

Main category: cs.CV

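TL;DR: The csMIL framework integrates global-local instance clustering, within-cluster attention, and cluster-level sparsity induction to address the instance redundancy of traditional multi-instance learning and its lack of a mechanism for discarding non-informative instances, improving robustness and interpretability while reducing computational complexity.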

Details Motivation: On complex, weakly labeled datasets such as whole-slide images, traditional multi-instance learning methods suffer from instance redundancy and have no mechanism for discarding non-informative instances, which limits their robustness and interpretability. Method: The Cluster-level Sparse MIL (csMIL) framework combines global clustering, local clustering, within-cluster attention, and sparse regularization of cluster weights. Result: csMIL achieves state-of-the-art performance on two public histopathology benchmarks (CAMELYON16, TCGA-NSCLC). Theoretical analysis shows that csMIL needs $O(s \log K)$ bags to recover $s$ relevant clusters, in line with compressed sensing principles. Conclusion: Through cluster-level sparsity and attention, csMIL overcomes the limitations of traditional MIL methods and offers good robustness, interpretability, and computational efficiency for analyzing weakly labeled datasets. Abstract: Multi-Instance Learning (MIL) is pivotal for analyzing complex, weakly labeled datasets, such as whole-slide images (WSIs) in computational pathology, where bags comprise unordered collections of instances with sparse diagnostic relevance. Traditional MIL approaches, including early statistical methods and recent attention-based frameworks, struggle with instance redundancy and lack explicit mechanisms for discarding non-informative instances, limiting their robustness and interpretability. We propose Cluster-level Sparse MIL (csMIL), a novel framework that integrates global-local instance clustering, within-cluster attention, and cluster-level sparsity induction to address these challenges. Our csMIL first performs global clustering across all bags to establish $K$ cluster centers, followed by local clustering within each bag to assign cluster labels. Attention scores are computed within each cluster, and sparse regularization is applied to cluster weights, enabling the selective retention of diagnostically relevant clusters while discarding irrelevant ones. This approach enhances robustness to noisy instances, improves interpretability by identifying critical regions, and reduces computational complexity. Theoretical analysis demonstrates that csMIL requires $O(s \log K)$ bags to recover $s$ relevant clusters, aligning with compressed sensing principles. Empirically, csMIL achieves state-of-the-art performance on two public histopathology benchmarks (CAMELYON16, TCGA-NSCLC).

[113] Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection

Canhui Tang,Sanping Zhou,Haoyue Shi,Le Wang

Main category: cs.CV

TL;DR: This paper proposes a new skeleton-based framework for zero-shot video anomaly detection that learns action typicality and uniqueness, enabling cross-scene anomaly localization without target-domain training data.

Details Motivation: Existing methods learn only low-level skeleton representations and rely on domain-limited normality boundaries, so they generalize poorly to new scenes with different normal and abnormal behavior patterns. Method: Proposes a language-guided semantic typicality modeling module that projects skeleton snippets into an action semantic space and draws on LLM knowledge, together with a test-time context uniqueness analysis module that finely analyzes spatio-temporal differences and derives scene-adaptive boundaries. Result: Achieves state-of-the-art performance on four large-scale datasets (ShanghaiTech, UBnormal, NWPU, and UCF-Crime), covering over 100 unseen surveillance scenes. Conclusion: The proposed method effectively unlocks the potential of skeleton data for zero-shot anomaly detection and significantly improves cross-domain generalization. Abstract: Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. Skeleton-based approaches have an inherent generalization advantage in achieving ZS-VAD, as they eliminate domain disparities in both background and human appearance. However, existing methods only learn low-level skeleton representations and rely on a domain-limited normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space and distills LLM's knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.

[114] Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos

Xiaoyu Huang,Lauren M Maxson,Trang Nguyen,Cheng Jack Song,Yuankai Huo

Main category: cs.CV

TL;DR: This paper presents Organoid Tracker, an open-source platform built on an advanced vision model for automated analysis of kidney organoid microscopy videos, improving the efficiency of polycystic kidney disease research.

Details Motivation: Existing manual analysis is limited to coarse classification and often misses valuable pixel-level and longitudinal information, so a method is needed that extracts detailed quantitative metrics without programming expertise. Method: Organoid Tracker is built on Segment Anything Model 2 (SAM2) as a GUI platform with a modular plugin architecture, supporting zero-shot segmentation and automated analysis of spatial-temporal microscopy videos. Result: Organoid Tracker quantifies key metrics such as cyst formation rate, growth velocity, and morphological changes, and generates comprehensive reports, providing an efficient screening platform for PKD research. Conclusion: Organoid Tracker offers a powerful, extensible, open-source solution that helps improve and accelerate research in kidney development, PKD modeling, and therapeutic discovery. Abstract: Recent advances in organoid models have revolutionized the study of human kidney disease mechanisms and drug discovery by enabling scalable, cost-effective research without the need for animal sacrifice. Here, we present a kidney organoid platform optimized for efficient screening in polycystic kidney disease (PKD). While these systems generate rich spatial-temporal microscopy video datasets, current manual approaches to analysis remain limited to coarse classifications (e.g., hit vs. non-hit), often missing valuable pixel-level and longitudinal information. To help overcome this bottleneck, we developed Organoid Tracker, a graphical user interface (GUI) platform designed with a modular plugin architecture, which empowers researchers to extract detailed, quantitative metrics without programming expertise. Built on the cutting-edge vision foundation model Segment Anything Model 2 (SAM2), Organoid Tracker enables zero-shot segmentation and automated analysis of spatial-temporal microscopy videos. It quantifies key metrics such as cyst formation rate, growth velocity, and morphological changes, while generating comprehensive reports. By providing an extensible, open-source framework, Organoid Tracker offers a powerful solution for improving and accelerating research in kidney development, PKD modeling, and therapeutic discovery. The platform is publicly available as open-source software at https://github.com/hrlblab/OrganoidTracker.

[115] The System Description of CPS Team for Track on Driving with Language of CVPR 2024 Autonomous Grand Challenge

Jinghan Peng,Jingwen Wang,Xing Yu,Dehui Du

Main category: cs.CV

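TL;DR: This paper presents a vision-language-model-based approach to the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge, trained on the DriveLM-nuScenes dataset and enhanced with LoRA/DoRA fine-tuning and depth information, ranking first on the validation set.

Details Motivation: To improve language-driven decision making in autonomous driving, using vision language models for more accurate instruction understanding and reasoning. Method: Built on LLaVA models fine-tuned with LoRA and DoRA, the system integrates depth information from open-source depth estimation models and adopts a Chain-of-Thought strategy at inference time to improve accuracy on multiple-choice and yes/no questions. Result: Achieved a top score of 0.7799 on the validation set, ranking first. Conclusion: A fine-tuning strategy that combines depth information with Chain-of-Thought reasoning significantly improves the performance of vision language models on language-driven autonomous driving tasks. Abstract: This report outlines our approach using vision language model systems for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We have exclusively utilized the DriveLM-nuScenes dataset for training our models. Our systems are built on the LLaVA models, which we enhanced through fine-tuning with the LoRA and DoRA methods. Additionally, we have integrated depth information from open-source depth estimation models to enrich the training and inference processes. For inference, particularly with multiple-choice and yes/no questions, we adopted a Chain-of-Thought reasoning approach to improve the accuracy of the results. This comprehensive methodology enabled us to achieve a top score of 0.7799 on the validation set leaderboard, ranking 1st on the leaderboard.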

[116] Mars Traversability Prediction: A Multi-modal Self-supervised Approach for Costmap Generation

Zongwu Xie,Kaijie Yun,Yang Liu,Yiming Ji,Han Li

Main category: cs.CV

TL;DR: The paper proposes a highly robust, self-supervised multi-modal model that fuses camera and LiDAR data to predict traversability costmaps for planetary rovers.

Details Motivation: To improve the navigation of planetary rovers over complex terrain, a costmap prediction model is needed that handles multiple inputs and learns in a self-supervised way. Method: The model fuses camera and LiDAR data, using a DINOv3 image encoder, FiLM-based sensor fusion, and an optimization loss that combines Huber and smoothness terms. Result: Experiments show only a slight performance drop even when inputs are partially occluded or corrupted with noise, indicating that the model relies on geometric information and is robust. Conclusion: The paper presents a robust multi-modal framework for predicting traversability costmaps for planetary rovers and emphasizes its high robustness and self-supervised learning capability. Abstract: We present a robust multi-modal framework for predicting traversability costmaps for planetary rovers. Our model fuses camera and LiDAR data to produce a bird's-eye-view (BEV) terrain costmap, trained self-supervised using IMU-derived labels. Key updates include a DINOv3-based image encoder, FiLM-based sensor fusion, and an optimization loss combining Huber and smoothness terms. Experimental ablations (removing image color, occluding inputs, adding noise) show only minor changes in MAE/MSE (e.g. MAE increases from ~0.0775 to 0.0915 when LiDAR is sparsified), indicating that geometry dominates the learned cost and the model is highly robust. We attribute the small performance differences to the IMU labeling primarily reflecting terrain geometry rather than semantics and to limited data diversity. Unlike prior work claiming large gains, we emphasize our contributions: (1) a high-fidelity, reproducible simulation environment; (2) a self-supervised IMU-based labeling pipeline; and (3) a strong multi-modal BEV costmap prediction model. We discuss limitations and future work such as domain generalization and dataset expansion.
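A loss "combining Huber and smoothness terms" can take many forms, and the abstract does not specify one. The sketch below is one plausible instance, assuming a 2D costmap stored as nested lists, a first-order total-variation smoothness term, and a hypothetical weighting `smooth_weight`; none of these choices come from the paper.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def costmap_loss(pred, target, delta=1.0, smooth_weight=0.1):
    """Mean Huber data term plus mean absolute difference between
    neighbouring cells (an assumed smoothness term)."""
    rows, cols = len(pred), len(pred[0])
    data = sum(
        huber(pred[i][j] - target[i][j], delta)
        for i in range(rows) for j in range(cols)
    ) / (rows * cols)

    smooth, count = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # horizontal neighbour
                smooth += abs(pred[i][j + 1] - pred[i][j])
                count += 1
            if i + 1 < rows:  # vertical neighbour
                smooth += abs(pred[i + 1][j] - pred[i][j])
                count += 1
    smooth = smooth / count if count else 0.0
    return data + smooth_weight * smooth
```

The Huber term limits the influence of outlier labels, while the smoothness term penalizes abrupt jumps between adjacent cells.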

[117] End-to-End Visual Autonomous Parking via Control-Aided Attention

Chao Chen,Shunyu Yao,Yuanwu He,Tao Feng,Ruojing Song,Yuliang Guo,Xinyu Huang,Chenxu Wu,Ren Liu,Chen Feng

Main category: cs.CV

TL;DR: This paper proposes CAA-Policy, a new end-to-end imitation learning system that uses control signals to guide the attention mechanism, improving performance on autonomous parking.

Details Motivation: Existing end-to-end learning methods lack effective synergy between perception and control, which leads to unstable spatial attention and undermines the reliability of parking. Method: Proposes CAA-Policy, which combines a control-guided attention mechanism with self-supervised learning, and validates it through experiments in the CARLA simulator. Result: In CARLA experiments, CAA-Policy outperforms both the end-to-end baseline and the modular approach, with higher accuracy, robustness, and interpretability. Conclusion: By combining control signals with an attention mechanism, CAA-Policy improves the accuracy, robustness, and interpretability of autonomous vehicles in precise parking tasks. Abstract: Precise parking requires an end-to-end system where perception adaptively provides policy-relevant details, especially in critical areas where fine control decisions are essential. End-to-end learning offers a unified framework by directly mapping sensor inputs to control actions, but existing approaches lack effective synergy between perception and control. We find that transformer-based self-attention, when used alone, tends to produce unstable and temporally inconsistent spatial attention, which undermines the reliability of downstream policy decisions over time. Instead, we propose CAA-Policy, an end-to-end imitation learning system that allows the control signal to guide the learning of visual attention via a novel Control-Aided Attention (CAA) mechanism. For the first time, we train such an attention module in a self-supervised manner, using backpropagated gradients from the control outputs instead of from the training loss. This strategy encourages the attention to focus on visual features that induce high variance in action outputs, rather than merely minimizing the training loss, a shift we demonstrate leads to a more robust and generalizable policy. To further enhance stability, CAA-Policy integrates short-horizon waypoint prediction as an auxiliary task, and introduces a separately trained motion prediction module to robustly track the target spot over time. Extensive experiments in the CARLA simulator show that CAA-Policy consistently surpasses both the end-to-end learning baseline and the modular BEV segmentation + hybrid A* pipeline, achieving superior accuracy, robustness, and interpretability. Code is released at https://github.com/Joechencc/CAAPolicy.

[118] PanoLora: Bridging Perspective and Panoramic Video Generation with LoRA Adaptation

Zeyu Dong,Yuyang Yin,Yuqi Li,Eric Li,Hao-Xiang Guo,Yikai Wang

Main category: cs.CV

TL;DR: Proposes an efficient LoRA-based fine-tuning method for generating high-quality 360-degree panoramic videos from perspective video, achieving superior performance with a small amount of data.

Details Motivation: Because panoramic projection differs fundamentally from traditional perspective projection, existing video generation models struggle to produce high-quality panoramic videos directly, and prior methods are complex and inefficient. Method: Treats panoramic video generation as an adaptation problem from perspective views, fine-tunes a pretrained video diffusion model with LoRA, and derives a lower bound on the LoRA rank through theoretical analysis. Result: Fine-tuning on only about 1,000 videos yields high-quality panoramic video generation that surpasses previous state-of-the-art methods in projection geometry, visual quality, left-right consistency, and motion diversity. Conclusion: LoRA can effectively model the projection transformation from perspective to panorama, offering an efficient and simple new approach to panoramic video generation. Abstract: Generating high-quality 360° panoramic videos remains a significant challenge due to the fundamental differences between panoramic and traditional perspective-view projections. While perspective videos rely on a single viewpoint with a limited field of view, panoramic content requires rendering the full surrounding environment, making it difficult for standard video generation models to adapt. Existing solutions often introduce complex architectures or large-scale training, leading to inefficiency and suboptimal results. Motivated by the success of Low-Rank Adaptation (LoRA) in style transfer tasks, we propose treating panoramic video generation as an adaptation problem from perspective views. Through theoretical analysis, we demonstrate that LoRA can effectively model the transformation between these projections when its rank exceeds the degrees of freedom in the task. Our approach efficiently fine-tunes a pretrained video diffusion model using only approximately 1,000 videos while achieving high-quality panoramic generation. Experimental results demonstrate that our method maintains proper projection geometry and surpasses previous state-of-the-art approaches in visual quality, left-right consistency, and motion diversity.
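For readers unfamiliar with LoRA, the general technique keeps a pretrained weight matrix frozen and learns a low-rank update, so the adapted weight is W0 + (alpha / r) * B A. The sketch below shows that standard formulation on plain nested lists. It says nothing about how PanoLora applies it to a video diffusion model, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(w0, b, a, alpha):
    """Return W0 + (alpha / r) * B @ A, where B is d_out x r and A is r x d_in.
    W0 stays frozen; only A and B would be trained."""
    r = len(a)  # rank = number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]

def trainable_params(d_out, d_in, r):
    """Parameters in the low-rank factors, versus d_out * d_in for full fine-tuning."""
    return r * (d_out + d_in)
```

As a sense of scale, a 1024 x 1024 layer has 1,048,576 weights, whereas rank-8 factors hold 16,384. The rank `r` is the quantity the abstract relates to the task's degrees of freedom.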

[119] SMILE: A Super-resolution Guided Multi-task Learning Method for Hyperspectral Unmixing

Ruiying Li,Bin Pan,Qiaoying Qu,Xia Xu,Zhenwei Shi

Main category: cs.CV

TL;DR: This paper proposes SMILE, a super-resolution guided multi-task learning framework for hyperspectral unmixing, supported by theoretical analysis and convergence guarantees, effectively improving unmixing performance.

Details Motivation: The motivation stems from the limitations of hyperspectral unmixing due to low spatial resolution and the challenges in integrating super-resolution with unmixing, such as unverified task affinity and lack of convergence guarantees. Method: The paper provides theoretical analysis to validate the feasibility of multi-task learning and proposes the SMILE framework, which learns shared and specific representations to integrate super-resolution and unmixing. An accessibility theorem is also introduced to guarantee convergence. Result: The experiments on synthetic and real datasets demonstrate the effectiveness of the SMILE framework in enhancing unmixing performance under the guidance of super-resolution. Conclusion: The paper concludes that the proposed SMILE framework effectively enhances hyperspectral unmixing performance by leveraging super-resolution through a multi-task learning approach, backed by theoretical analysis and validated task affinity. Abstract: The performance of hyperspectral unmixing may be constrained by low spatial resolution, which can be enhanced using super-resolution in a multi-task learning manner. However, integrating super-resolution and unmixing directly faces two challenges: task affinity is not verified, and the convergence of unmixing is not guaranteed. To address these issues, in this paper, we provide theoretical analysis and propose a super-resolution guided multi-task learning method for hyperspectral unmixing (SMILE). The theoretical analysis validates the feasibility of the multi-task learning approach and verifies task affinity through relationship and existence theorems that prove the positive guidance of super-resolution. The proposed framework generalizes positive information from super-resolution to unmixing by learning both shared and specific representations. Moreover, to guarantee convergence, we provide the accessibility theorem by proving the optimal solution of unmixing.
The major contributions of SMILE include providing progressive theoretical support and designing a new framework for unmixing under the guidance of super-resolution. Our experiments on both synthetic and real datasets have substantiated the usefulness of our work.

[120] A Copula-Guided Temporal Dependency Method for Multitemporal Hyperspectral Images Unmixing

Ruiying Li,Bin Pan,Qiaoying Qu,Xia Xu,Zhenwei Shi

Main category: cs.CV

TL;DR: This paper introduces a copula-guided method for multitemporal hyperspectral unmixing, improving the modeling of temporal dependency and dynamical material evolution, with experimental validation of its effectiveness.

Details Motivation: The motivation stems from the limitations of existing methods in capturing temporal dependency and dynamical material evolution in multitemporal hyperspectral unmixing. Copula theory is used due to its ability to explicitly model dependency structures. Method: The paper proposes a copula-guided temporal dependency method (Cog-TD), which involves defining a new mathematical model, constructing a copula-guided framework, and developing two key modules for copula function estimation and temporal dependency guidance. Result: The results show that the proposed method successfully captures temporal dependency in hyperspectral images and improves unmixing performance, as demonstrated on both synthetic and real-world datasets. Conclusion: The paper concludes that the proposed Cog-TD method effectively models temporal dependency in multitemporal hyperspectral unmixing, showing its utility through experimental results on synthetic and real-world datasets. Abstract: Multitemporal hyperspectral unmixing (MTHU) aims to model variable endmembers and dynamical abundances, which emphasizes the critical temporal information. However, existing methods have limitations in modeling temporal dependency and thus fail to capture the dynamical material evolution. Motivated by the ability of copula theory to model dependency structure explicitly, in this paper, we propose a copula-guided temporal dependency method (Cog-TD) for multitemporal hyperspectral unmixing. Cog-TD defines a new mathematical model, constructs a copula-guided framework, and provides two key modules with theoretical support. The mathematical model provides explicit formulations for the MTHU problem definition, describing the temporal dependency structure by incorporating copula theory. The copula-guided framework is constructed to utilize the copula function, and it estimates dynamical endmembers and abundances with temporal dependency.
The key modules consist of copula function estimation and temporal dependency guidance, which compute and employ temporal information to guide the unmixing process. Moreover, the theoretical support demonstrates that the estimated copula function is valid and that the represented temporal dependency exists in hyperspectral images. The major contributions of this paper include redefining the MTHU problem with temporal dependency, proposing a copula-guided framework, developing two key modules, and providing theoretical support. Our experimental results on both synthetic and real-world datasets demonstrate the utility of the proposed method.
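As background on what "copula function estimation" can mean in general, the sketch below estimates the correlation parameter of a Gaussian copula between two series (for example, an abundance value at two time steps) from rank-based pseudo-observations. The Gaussian family, the function names, and the no-ties assumption are illustrative choices and are not taken from Cog-TD.

```python
import math
from statistics import NormalDist

def pseudo_observations(xs):
    """Map samples to (0, 1) via ranks / (n + 1); assumes no tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    n = len(xs)
    return [r / (n + 1) for r in ranks]

def gaussian_copula_corr(xs, ys):
    """Pearson correlation of the normal scores, a common estimate of the
    Gaussian copula parameter. It depends only on ranks, not on the marginals."""
    nd = NormalDist()
    zx = [nd.inv_cdf(u) for u in pseudo_observations(xs)]
    zy = [nd.inv_cdf(u) for u in pseudo_observations(ys)]
    mx, my = sum(zx) / len(zx), sum(zy) / len(zy)
    cov = sum((a - mx) * (b - my) for a, b in zip(zx, zy))
    vx = sum((a - mx) ** 2 for a in zx)
    vy = sum((b - my) ** 2 for b in zy)
    return cov / math.sqrt(vx * vy)
```

Because only ranks enter the estimate, the dependency structure is separated from the marginal distributions, which is the property that makes copulas attractive for modeling dependency explicitly.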

[121] 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment

Nhut Le,Ehsan Karimi,Maryam Rahnemoonfar

Main category: cs.CV

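TL;DR: This paper presents 3DAeroRelief, the first 3D semantic segmentation benchmark dataset dedicated to post-disaster assessment, built from 3D point clouds of hurricane-damaged areas captured with low-cost UAVs, filling a gap left by existing 3D datasets in disaster scenarios.

Details Motivation: Existing natural disaster analysis relies mostly on 2D imagery, which lacks depth information and is prone to occlusion, while existing 3D datasets focus mainly on urban or indoor scenes and pay little attention to post-disaster environments. A 3D dataset of real disaster scenes is therefore needed to support more accurate structural damage assessment. Method: Images are collected with low-cost UAVs over hurricane-affected regions, dense 3D point clouds are reconstructed with Structure-from-Motion (SfM) and Multi-View Stereo (MVS), and manually annotated 2D semantic labels are projected into 3D space to build a 3D post-disaster scene dataset with fine-grained semantic annotations. Result: The authors construct 3DAeroRelief, a large-scale outdoor 3D post-disaster assessment dataset with fine-grained structural damage annotations from real disaster environments, and experiments with several state-of-the-art 3D semantic segmentation models confirm that the dataset is both challenging and useful. Conclusion: 3DAeroRelief provides an important resource for 3D scene understanding in disaster response, advances robust 3D vision systems for real-world damage assessment, and demonstrates the advantages of UAVs for efficient, safe data collection in hazardous environments. Abstract: Timely assessment of structural damage is critical for disaster response and recovery. However, most prior work in natural disaster analysis relies on 2D imagery, which lacks depth, suffers from occlusions, and provides limited spatial context. 3D semantic segmentation offers a richer alternative, but existing 3D benchmarks focus mainly on urban or indoor scenes, with little attention to disaster-affected areas. To address this gap, we present 3DAeroRelief, the first 3D benchmark dataset specifically designed for post-disaster assessment. Collected using low-cost unmanned aerial vehicles (UAVs) over hurricane-damaged regions, the dataset features dense 3D point clouds reconstructed via Structure-from-Motion and Multi-View Stereo techniques. Semantic annotations were produced through manual 2D labeling and projected into 3D space. Unlike existing datasets, 3DAeroRelief captures 3D large-scale outdoor environments with fine-grained structural damage in real-world disaster contexts. UAVs enable affordable, flexible, and safe data collection in hazardous areas, making them particularly well-suited for emergency scenarios. To demonstrate the utility of 3DAeroRelief, we evaluate several state-of-the-art 3D segmentation models on the dataset to highlight both the challenges and opportunities of 3D scene understanding in disaster response. Our dataset serves as a valuable resource for advancing robust 3D vision systems in real-world applications for post-disaster scenarios.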
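Transferring 2D labels to a 3D point cloud is commonly done by projecting each point into a labeled image and reading off the class at that pixel. The sketch below illustrates this under simplifying assumptions: points are already in the camera frame, the camera is an ideal pinhole with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and occlusion is ignored. The dataset's actual annotation pipeline is not described at this level of detail in the abstract.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to integer pixel
    coordinates (u, v). Returns None for points behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))

def label_points(points, label_image, fx, fy, cx, cy, unlabeled=-1):
    """Give each 3D point the class stored at its projected pixel.
    label_image is indexed as label_image[row][col]."""
    height, width = len(label_image), len(label_image[0])
    labels = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            labels.append(unlabeled)  # point is not visible in this view
        else:
            u, v = uv
            labels.append(label_image[v][u])
    return labels
```

In a multi-view setting, one would typically repeat this per image and combine the results, for example by majority vote, after checking visibility against the reconstructed geometry.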

[122] Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation

Nhi Kieu,Kien Nguyen,Arnold Wiliem,Clinton Fookes,Sridha Sridharan

Main category: cs.CV

TL;DR: This paper proposes GEMMNet, a new multimodal learning method that effectively addresses missing modalities in remote sensing semantic segmentation and performs well on multiple datasets.

Details Motivation: Existing generative models have limited effectiveness against missing modalities in remote sensing semantic segmentation, mainly because they struggle to capture the semantic context of complex scenes and depend too heavily on the dominant modality. Method: Proposes a new Generative-Enhanced MultiModal learning Network (GEMMNet), comprising a Hybrid Feature Extractor (HyFEx), Hybrid Fusion with Multiscale Awareness (HyFMA), and a Complementary Loss (CoLoss) scheme. Result: On the Vaihingen and Potsdam remote sensing datasets, GEMMNet outperforms both generative baselines (such as AE and cGAN) and state-of-the-art non-generative methods (such as mmformer and shaspec). Conclusion: With its three key components, GEMMNet effectively addresses heterogeneity and missing modalities in multimodal remote sensing data and outperforms existing methods on two remote sensing datasets. Abstract: Multimodal learning has shown a significant performance boost compared to ordinary unimodal models across various domains. However, in real-world scenarios, multimodal signals are susceptible to being missing because of sensor failures and adverse weather conditions, which drastically degrades models' operation and performance. Generative models such as AutoEncoder (AE) and Generative Adversarial Network (GAN) are intuitive solutions aiming to reconstruct the missing modality from available ones. Yet, their efficacy in remote sensing semantic segmentation remains underexplored. In this paper, we first examine the limitations of existing generative approaches in handling the heterogeneity of multimodal remote sensing data. They inadequately capture semantic context in complex scenes with large intra-class and small inter-class variation. In addition, traditional generative models are susceptible to heavy dependence on the dominant modality, introducing bias that affects model robustness under missing modality conditions. To tackle these limitations, we propose a novel Generative-Enhanced MultiModal learning Network (GEMMNet) with three key components: (1) Hybrid Feature Extractor (HyFEx) to effectively learn modality-specific representations, (2) Hybrid Fusion with Multiscale Awareness (HyFMA) to capture modality-synergistic semantic context across scales and (3) Complementary Loss (CoLoss) scheme to alleviate the inherent bias by encouraging consistency across modalities and tasks.
Our method, GEMMNet, outperforms both generative baselines AE, cGAN (conditional GAN), and state-of-the-art non-generative approaches - mmformer and shaspec - on two challenging semantic segmentation remote sensing datasets (Vaihingen and Potsdam). Source code is made available.

[123] WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild

Yuqiu Liu,Jialin Song,Manolis Savva,Wuyang Chen

Main category: cs.CV

TL;DR: The paper presents a pipeline for extracting and reconstructing 3D smoke from real-world videos, enabling realistic smoke editing through simulation, with demonstrated improvements in quality and versatility over existing approaches.

Details Motivation: The motivation is to overcome the limitations of current fluid reconstruction techniques, which typically require controlled environments, by developing a method that works effectively with real-world videos captured in the wild. Method: The paper outlines a method involving smoke extraction with background removal, initialization of smoke particles and camera poses, and inferring multi-view videos to address the challenges of reconstructing smoke in real-world settings. Result: The result is a high-quality smoke reconstruction that outperforms previous methods, achieving a +2.22 average PSNR on wild videos, and allows for diverse and realistic editing of fluid dynamics. Conclusion: The paper concludes that their proposed pipeline effectively extracts and reconstructs dynamic 3D smoke assets from real-world videos, offering improved performance over existing methods and enabling realistic editing through simulation. Abstract: We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from a single in-the-wild video, and further integrate interactive simulation for smoke design and editing. Recent developments in 3D vision have significantly improved reconstructing and rendering fluid dynamics, supporting realistic and temporally consistent view synthesis. However, current fluid reconstructions rely heavily on carefully controlled clean lab environments, whereas real-world videos captured in the wild are largely underexplored. We pinpoint three key challenges of reconstructing smoke in real-world videos and design targeted techniques, including smoke extraction with background removal, initialization of smoke particles and camera poses, and inferring multi-view videos. Our method not only outperforms previous reconstruction and generation methods with high-quality smoke reconstructions (+2.22 average PSNR on wild videos), but also enables diverse and realistic editing of fluid dynamics by simulating our smoke assets. 
We provide our models, data, and 4D smoke assets at https://autumnyq.github.io/WildSmoke.

[124] SVR-GS: Spatially Variant Regularization for Probabilistic Masks in 3D Gaussian Splatting

Ashkan Taghipour,Vahid Naghshin,Benjamin Southwell,Farid Boussaid,Hamid Laga,Mohammed Bennamoun

Main category: cs.CV

TL;DR: This paper proposes SVR-GS, a spatially variant regularization method for 3D Gaussian Splatting that renders a per-pixel spatial mask along each ray to apply sparsity pressure to low-importance Gaussians. Compared with MaskGS and 3DGS, it significantly reduces the number of Gaussians with only a slight PSNR drop, improving model efficiency.

Details Motivation: Existing mask-based pruning methods such as MaskGS regularize the global mean of the mask, which is misaligned with the local per-pixel reconstruction loss that determines image quality, so pruning is suboptimal. Method: Proposes SVR-GS, which renders a per-pixel spatial mask from each Gaussian's effective contribution along the ray, explores three spatial-mask aggregation strategies, implements them in CUDA, and uses a gradient analysis to arrive at the final design. Result: On the Tanks&Temples, Deep Blending, and Mip-NeRF360 datasets, SVR-GS reduces the number of Gaussians on average by 1.79× compared to MaskGS and 5.63× compared to 3DGS, with PSNR drops of only 0.50 dB and 0.40 dB, respectively. Conclusion: SVR-GS compresses 3D Gaussian representations more effectively, producing smaller, faster, and more memory-efficient models suited to real-time applications such as robotics, AR/VR, and mobile perception. Abstract: 3D Gaussian Splatting (3DGS) enables fast, high-quality novel view synthesis but typically relies on densification followed by pruning to optimize the number of Gaussians. Existing mask-based pruning, such as MaskGS, regularizes the global mean of the mask, which is misaligned with the local per-pixel (per-ray) reconstruction loss that determines image quality along individual camera rays. This paper introduces SVR-GS, a spatially variant regularizer that renders a per-pixel spatial mask from each Gaussian's effective contribution along the ray, thereby applying sparsity pressure where it matters: on low-importance Gaussians. We explore three spatial-mask aggregation strategies, implement them in CUDA, and conduct a gradient analysis to motivate our final design. Extensive experiments on Tanks&Temples, Deep Blending, and Mip-NeRF360 datasets demonstrate that, on average across the three datasets, the proposed SVR-GS reduces the number of Gaussians by 1.79× compared to MaskGS and 5.63× compared to 3DGS, while incurring only 0.50 dB and 0.40 dB PSNR drops, respectively. These gains translate into significantly smaller, faster, and more memory-efficient models, making them well-suited for real-time applications such as robotics, AR/VR, and mobile perception.
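To put the reported PSNR drops in perspective, recall that PSNR = 10 log10(MAX² / MSE), so a drop of d dB corresponds to the MSE growing by a factor of 10^(d/10). The short sketch below works this out. It assumes a single MSE value per comparison, whereas the paper reports dataset averages, so the percentages are only indicative.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_psnr_drop(drop_db):
    """Factor by which MSE increases when PSNR falls by drop_db decibels."""
    return 10.0 ** (drop_db / 10.0)
```

Under that assumption, a 0.50 dB drop corresponds to roughly a 12% increase in MSE (factor of about 1.122), and a 0.40 dB drop to roughly 10% (factor of about 1.096).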

[125] No Mesh, No Problem: Estimating Coral Volume and Surface from Sparse Multi-View Images

Diego Eustachio Farchione,Ramzi Idoughi,Peter Wonka

Main category: cs.CV

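TL;DR: Proposes a lightweight learning framework that predicts the 3D volume and surface area of corals from 2D multi-view images, combining VGGT with DGCNN and introducing a composite loss function to improve prediction stability and uncertainty estimation.

Details Motivation: Accurately quantifying coral growth requires precise volume and surface area estimates, but the complex morphology of corals makes efficient, scalable 3D reconstruction difficult for traditional methods. Method: A pre-trained VGGT module extracts dense point maps from multi-view RGB images, which are merged into a unified point cloud and enriched with confidence scores. Two parallel DGCNN decoder heads jointly output the volume, the surface area, and their confidence estimates, and the model is optimized with a composite loss based on Gaussian negative log-likelihood. Result: The method achieves competitive accuracy in volume and surface area prediction, generalizes well to unseen coral morphologies, and provides reliable uncertainty estimates. Conclusion: The proposed framework enables efficient and scalable estimation of coral geometry from sparse images, offering a practical tool for coral growth analysis and reef monitoring. Abstract: Effective reef monitoring requires the quantification of coral growth via accurate volumetric and surface area estimates, which is a challenging task due to the complex morphology of corals. We propose a novel, lightweight, and scalable learning framework that addresses this challenge by predicting the 3D volume and surface area of coral-like objects from 2D multi-view RGB images. Our approach utilizes a pre-trained module (VGGT) to extract dense point maps from each view; these maps are merged into a unified point cloud and enriched with per-view confidence scores. The resulting cloud is fed to two parallel DGCNN decoder heads, which jointly output the volume and the surface area of the coral, as well as their corresponding confidence estimates. To enhance prediction stability and provide uncertainty estimates, we introduce a composite loss function based on Gaussian negative log-likelihood in both real and log domains. Our method achieves competitive accuracy and generalizes well to unseen morphologies. This framework paves the way for efficient and scalable coral geometry estimation directly from a sparse set of images, with potential applications in coral growth analysis and reef monitoring.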
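A Gaussian negative log-likelihood loss lets a network predict both a value and its uncertainty. The sketch below shows the standard per-sample formula and one way to combine a real-domain term with a log-domain term. The separate log-domain prediction, the parameter names, and the mixing `weight` are assumptions, since the abstract does not give the exact composition.

```python
import math

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of target under N(mean, exp(log_var)).
    Predicting log-variance keeps the variance positive."""
    var = math.exp(log_var)
    return 0.5 * (log_var + (target - mean) ** 2 / var + math.log(2.0 * math.pi))

def composite_nll(target, mean, log_var, log_mean, log_var_logdomain, weight=0.5):
    """Weighted sum of NLL on the raw target and NLL on log(target).
    Assumes target > 0, as for a volume or a surface area."""
    real_term = gaussian_nll(target, mean, log_var)
    log_term = gaussian_nll(math.log(target), log_mean, log_var_logdomain)
    return weight * real_term + (1.0 - weight) * log_term
```

The log-domain term penalizes relative rather than absolute error, which can help when targets span several orders of magnitude, while the predicted variance serves as the uncertainty estimate.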

[126] Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic

Waikit Xiu,Qiang Lu,Xiying Li,Chen Hu,Shengbo Sun

Main category: cs.CV

TL;DR: This paper proposes Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis, which uses Low-Rank Adaptation for lightweight fine-tuning and introduces a knowledge prompting module, effectively improving logical reasoning and knowledge adaptation.

Details Motivation: Existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, which limits their effectiveness in complex scenarios. Method: Built on the Qwen2.5-VL backbone, the model uses high-quality traffic-specific multimodal datasets and Low-Rank Adaptation (LoRA) for lightweight fine-tuning, and introduces a knowledge prompting module that fuses Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG). Result: On the TrafficQA and DriveQA benchmarks, Traffic-MLLM achieves state-of-the-art performance, validating its strong ability to process multimodal traffic data. Conclusion: Traffic-MLLM exhibits remarkable zero-shot reasoning and cross-scenario generalization, achieving state-of-the-art performance. Abstract: As intelligent transportation systems advance, traffic video understanding plays an increasingly pivotal role in comprehensive scene perception and causal analysis. Yet, existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, limiting their effectiveness in complex scenarios. To address these limitations, we propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning, significantly enhancing its capacity to model continuous spatiotemporal features in video sequences. Furthermore, we introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG), enabling precise injection of detailed traffic regulations and domain knowledge into the inference process. This design markedly boosts the model's logical reasoning and knowledge adaptation capabilities. Experimental results on TrafficQA and DriveQA benchmarks show Traffic-MLLM achieves state-of-the-art performance, validating its superior ability to process multimodal traffic data. It also exhibits remarkable zero-shot reasoning and cross-scenario generalization capabilities.

[127] Multispectral-NeRF: a multispectral modeling approach based on neural radiance fields

Hong Zhang,Fei Guo,Zihan Xie,Dizhao Yao

Main category: cs.CV

TL;DR: Multispectral-NeRF is an improved NeRF architecture for 3D reconstruction that effectively integrates multispectral information.

Details Motivation: Traditional 3D reconstruction techniques typically rely on RGB spectral information, and existing methods that incorporate additional spectral bands fall short in cost, accuracy, and geometric features. Method: Improves the NeRF architecture to handle multispectral data by expanding hidden layer dimensionality, redesigning the residual functions, and adapting the data compression modules. Result: Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the spectral characteristics of the original scene. Conclusion: Multispectral-NeRF is an effective 3D reconstruction technique that handles multispectral information and produces high-precision, high-quality reconstructions. Abstract: 3D reconstruction technology generates three-dimensional representations of real-world objects, scenes, or environments using sensor data such as 2D images, with extensive applications in robotics, autonomous vehicles, and virtual reality systems. Traditional 3D reconstruction techniques based on 2D images typically rely on RGB spectral information. With advances in sensor technology, additional spectral bands beyond RGB have been increasingly incorporated into 3D reconstruction workflows. Existing methods that integrate these expanded spectral data often suffer from high cost, low accuracy, and poor geometric features. Three-dimensional reconstruction based on NeRF can effectively address the various issues in current multispectral 3D reconstruction methods, producing high-precision and high-quality reconstruction results. However, NeRF and some improved models such as NeRFacto are currently trained on three-band data and cannot take multi-band information into account. To address this problem, we propose Multispectral-NeRF, an enhanced neural architecture derived from NeRF that can effectively integrate multispectral information. Our technical contributions comprise three modifications: expanding hidden layer dimensionality to accommodate 6-band spectral inputs; redesigning residual functions to optimize spectral discrepancy calculations between reconstructed and reference images; and adapting data compression modules to address the increased bit-depth requirements of multispectral imagery. Experimental results confirm that Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the original scenes' spectral characteristics.

[128] SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion

Zhiwen Yang,Yuxin Peng

Main category: cs.CV

TL;DR: Proposes SPHERE, a new method that combines voxel and Gaussian representations for camera-based 3D semantic scene completion, balancing semantic accuracy and physical realism, with strong results on SemanticKITTI and SSCBench-KITTI-360.

Details Motivation: Existing voxel-based or plane-based methods struggle to capture realistic geometric details, while neural reconstruction methods such as NeRF are computationally expensive and fall short in semantic accuracy, so an efficient method that balances physical awareness and semantic understanding is needed. Method: Proposes the SPHERE framework, comprising a Semantic-guided Gaussian Initialization (SGI) module and a Physical-aware Harmonics Enhancement (PHE) module. It fuses voxel and Gaussian representations, models physical context with semantic spherical harmonics, and improves semantic-geometry consistency. Result: Achieves strong performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, with more realistic geometric details and higher semantic accuracy while remaining efficient. Conclusion: SPHERE effectively integrates semantic and physical information and achieves high-quality 3D semantic scene completion in large-scale autonomous driving scenes, outperforming existing methods. Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.

[129] StegOT: Trade-offs in Steganography via Optimal Transport

Chengde Lin,Xuezhu Gong,Shuxue Ding,Mingzhe Yang,Xijun Lu,Chengjun Mo

Main category: cs.CV

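TL;DR: This paper proposes StegOT, a new image steganography model based on an autoencoder and optimal transport theory; its multiple channel optimal transport (MCOT) module mitigates mode collapse, balances information between the cover and secret images, and improves the quality of the stego and recovered images.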

Details Motivation: Existing GAN-based and VAE-based steganography models commonly suffer from mode collapse, which causes an information imbalance between the cover and secret images and degrades both hiding and extraction. Method: Proposes StegOT, which combines an autoencoder architecture with optimal transport theory and introduces a multiple channel optimal transport (MCOT) module that transforms a multi-peak feature distribution into a single-peak one to balance the information. Result: Experiments show that StegOT mitigates mode collapse while maintaining high hiding capacity, improves the visual quality of the stego and recovered images, and strikes a better balance between information hiding and extraction. Conclusion: By introducing optimal transport theory, StegOT addresses the mode collapse of conventional generative models in image steganography and offers a new direction for high-quality, robust image steganography. Abstract: Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on generative adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.

[130] The Impact of Skin Tone Label Granularity on the Performance and Fairness of AI Based Dermatology Image Classification Models

Partha Shah,Durva Sankhe,Maariyah Rashid,Zakaa Khaled,Esther Puyol-Antón,Tiarna Lee,Maram Alqarni,Sweta Rai,Andrew P. King

Main category: cs.CV

TL;DR: This paper investigates how the granularity of the Fitzpatrick Skin Tone (FST) scale affects AI models for skin lesion classification, showing that finer FST groupings improve performance and reduce bias, while suggesting a move to more equitable skin tone scales.

Details Motivation: AI models for classifying skin lesions have shown susceptibility to bias based on skin tone. The Fitzpatrick Skin Tone (FST) scale, commonly used to represent skin tone, has been criticized for greater granularity in lighter-skinned categories. This paper aims to understand how FST granularity affects model performance and bias. Method: The paper investigates the impact of granularity in the FST scale on AI classification models by training multiple models to classify benign versus malignant lesions using FST-specific data with varying levels of granularity. Result: The study shows that training models using FST-specific data with higher granularity (e.g., three groups: FST 1/2, 3/4, and 5/6) generally performs better than models trained on FST-balanced general data. Reducing FST granularity (e.g., combining groups into broader categories) can negatively impact performance. Conclusion: The paper concludes that the granularity of Fitzpatrick Skin Tone (FST) groups plays a critical role in training lesion classification models. It highlights the need to move away from the FST scale due to potential human biases and suggests adopting an alternative scale that better captures the diversity of human skin tones for fair AI research. Abstract: Artificial intelligence (AI) models to automatically classify skin lesions from dermatology images have shown promising performance but also susceptibility to bias by skin tone. The most common way of representing skin tone information is the Fitzpatrick Skin Tone (FST) scale. The FST scale has been criticised for having greater granularity in its skin tone categories for lighter-skinned subjects. This paper conducts an investigation of the impact (on performance and bias) on AI classification models of granularity in the FST scale. By training multiple AI models to classify benign vs. malignant lesions using FST-specific data of differing granularity, we show that: (i) when training models using FST-specific data based on three groups (FST 1/2, 3/4 and 5/6), performance is generally better for models trained on FST-specific data compared to a general model trained on FST-balanced data; (ii) reducing the granularity of FST scale information (from 1/2 and 3/4 to 1/2/3/4) can have a detrimental effect on performance. Our results highlight the importance of the granularity of FST groups when training lesion classification models. Given the question marks over possible human biases in the choice of categories in the FST scale, this paper provides evidence for a move away from the FST scale in fair AI research and a transition to an alternative scale that better represents the diversity of human skin tones.

[131] Scaling Up Forest Vision with Synthetic Data

Yihang She,Andrew Blake,David Coomes,Srinivasan Keshav

Main category: cs.CV

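TL;DR: This study proposes a synthetic-data approach to forest tree segmentation: a game engine combined with physics-based LiDAR simulation generates a large-scale, diverse 3D forest dataset, substantially reducing reliance on annotated real data. Experiments show that, after fine-tuning on a real forest plot of less than 0.1 hectare, the pretrained model is competitive with a model trained on the full real dataset.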

Details Motivation: Existing public 3D forest datasets are too small to build robust tree segmentation systems. Inspired by the success of synthetic data in fields such as autonomous driving, this paper explores pretraining on synthetic data to reduce reliance on expensive field collection and annotation. Method: Develops a new synthetic data generation pipeline that combines a game engine with physics-based LiDAR simulation to produce a large-scale, diverse, annotated 3D forest dataset, which is used to pretrain a tree segmentation model that is then fine-tuned on a small amount of real data. Result: Experiments show that a model pretrained on synthetic data and fine-tuned on a single small real forest plot (<0.1 hectare) achieves segmentation performance competitive with a model trained on the full real dataset. The study also identifies the key factors that determine how well synthetic data works: physical realism, data diversity, and scale. Conclusion: Synthetic data can effectively reduce the need for annotated real forest data and offers a viable path toward more robust 3D forest vision systems; the data generation pipeline and dataset have been open-sourced. Abstract: Accurate tree segmentation is a key step in extracting individual tree metrics from forest laser scans, and is essential to understanding ecosystem functions in carbon cycling and beyond. Over the past decade, tree segmentation algorithms have advanced rapidly due to developments in AI. However, existing public 3D forest datasets are not large enough to build robust tree segmentation systems. Motivated by the success of synthetic data in other domains such as self-driving, we investigate whether similar approaches can help with tree segmentation. In place of expensive field data collection and annotation, we use synthetic data during pretraining, and then require only minimal, real forest plot annotation for fine-tuning. We have developed a new synthetic data generation pipeline to do this for forest vision tasks, integrating advances in game-engines with physics-based LiDAR simulation. As a result, we have produced a comprehensive, diverse, annotated 3D forest dataset on an unprecedented scale. Extensive experiments with a state-of-the-art tree segmentation algorithm and a popular real dataset show that our synthetic data can substantially reduce the need for labelled real data. After fine-tuning on just a single, real, forest plot of less than 0.1 hectare, the pretrained model achieves segmentations that are competitive with a model trained on the full scale real data. We have also identified critical factors for successful use of synthetic data: physics, diversity, and scale, paving the way for more robust 3D forest vision systems in the future. 
Our data generation pipeline and the resulting dataset are available at https://github.com/yihshe/CAMP3D.git.

[132] Beyond Sliders: Mastering the Art of Diffusion-based Image Manipulation

Yufei Tang,Daiheng Gao,Pingyu Wu,Wenbo Zhou,Bang Zhang,Weiming Zhang

Main category: cs.CV

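TL;DR: Beyond Sliders combines GANs and diffusion models and uses fine-grained textual and visual guidance to achieve high-quality image editing across a wide range of image types.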

Details Motivation: Existing image editing methods such as concept sliders perform poorly on images that are not AI-generated, especially images captured in the real world, so a more robust and general solution is needed. Method: The method improves on concept sliders by combining GANs and diffusion models and refining the image adversarially under fine-grained textual and visual guidance. Result: Experimental validation shows that Beyond Sliders performs well across a variety of application scenarios and markedly improves image quality and realism. Conclusion: Beyond Sliders is an image manipulation framework that combines the strengths of GANs and diffusion models to deliver high-quality image editing across image categories. Abstract: In the realm of image generation, the quest for realism and customization has never been more pressing. While existing methods like concept sliders have made strides, they often falter when it comes to non-AIGC images, particularly images captured in real-world settings. To bridge this gap, we introduce Beyond Sliders, an innovative framework that integrates GANs and diffusion models to facilitate sophisticated image manipulation across diverse image categories. Improving upon concept sliders, our method refines the image through fine-grained guidance, both textual and visual, in an adversarial manner, leading to a marked enhancement in image quality and realism. Extensive experimental validation confirms the robustness and versatility of Beyond Sliders across a spectrum of applications.

[133] Geometrically Constrained and Token-Based Probabilistic Spatial Transformers

Johann Schmidt,Sebastian Stober

Main category: cs.CV

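TL;DR: Proposes an improved Spatial Transformer Network based on a probabilistic, component-wise decomposition to make fine-grained visual classification more robust to geometric variation.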

Details Motivation: Fine-grained visual classification is sensitive to geometric variation, and existing equivariant architectures are computationally expensive and limit model flexibility. Method: Decomposes affine transformations into rotation, scaling, and shearing components, regresses each component with a shared localization encoder, models uncertainty with a Gaussian variational posterior, and optimizes with sampling-based canonicalization and a component-wise alignment loss. Result: On challenging moth classification benchmarks, the method improves robustness markedly over other STN approaches. Conclusion: The proposed probabilistic, component-wise STN framework is flexible and backbone-agnostic, and it makes transformer-based vision models better able to handle geometric deformation. Abstract: Fine-grained visual classification (FGVC) remains highly sensitive to geometric variability, where objects appear under arbitrary orientations, scales, and perspective distortions. While equivariant architectures address this issue, they typically require substantial computational resources and restrict the hypothesis space. We revisit Spatial Transformer Networks (STNs) as a canonicalization tool for transformer-based vision pipelines, emphasizing their flexibility, backbone-agnostic nature, and lack of architectural constraints. We propose a probabilistic, component-wise extension that improves robustness. Specifically, we decompose affine transformations into rotation, scaling, and shearing, and regress each component under geometric constraints using a shared localization encoder. To capture uncertainty, we model each component with a Gaussian variational posterior and perform sampling-based canonicalization during inference. A novel component-wise alignment loss leverages augmentation parameters to guide spatial alignment. Experiments on challenging moth classification benchmarks demonstrate that our method consistently improves robustness compared to other STNs.
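
The decomposition of an affine transformation into rotation, scaling, and shearing mentioned in the abstract can be illustrated with a small sketch. The composition order A = R · Sh · S, the single x-shear parameter, and the function names below are assumptions chosen for illustration; the abstract does not state which parameterization the authors use.

```python
import math

def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def compose_affine(theta, scale_x, scale_y, shear):
    """Build the 2x2 linear part A = R(theta) @ Sh(shear) @ S(scale_x, scale_y).

    The order R @ Sh @ S is an illustrative assumption, not the paper's stated choice.
    """
    rot = [[math.cos(theta), -math.sin(theta)],
           [math.sin(theta), math.cos(theta)]]
    shr = [[1.0, shear], [0.0, 1.0]]
    scl = [[scale_x, 0.0], [0.0, scale_y]]
    return matmul2(rot, matmul2(shr, scl))

def decompose_affine(a):
    """Recover (theta, scale_x, scale_y, shear), assuming scale_x > 0 and det(A) != 0."""
    (a00, a01), (a10, a11) = a
    scale_x = math.hypot(a00, a10)          # length of the first column
    theta = math.atan2(a10, a00)            # direction of the first column
    det = a00 * a11 - a01 * a10             # equals scale_x * scale_y
    scale_y = det / scale_x
    # Dot product of the two columns equals scale_x * shear * scale_y.
    shear = (a00 * a01 + a10 * a11) / (scale_x * scale_y)
    return theta, scale_x, scale_y, shear
```

Keeping the components separate is what lets each one be constrained or given its own uncertainty, for example by bounding the shear while leaving the rotation free.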

[134] CCoMAML: Efficient Cattle Identification Using Cooperative Model-Agnostic Meta-Learning

Rabin Dulal,Lihong Zheng,Ashad Kabir

Main category: cs.CV

TL;DR: This paper proposes a novel few-shot learning framework (CCoMAML with MHAFF) for cattle identification, achieving high accuracy and adaptability without frequent retraining.

Details Motivation: Cattle identification is essential for livestock management, but current RFID-based systems are prone to failures. Biometric identification using muzzle patterns offers a promising alternative, though deep learning models face challenges like limited data and dynamic herd compositions. Method: The study introduces a few-shot learning framework using Cooperative Model-Agnostic Meta-Learning (CCoMAML) combined with Multi-Head Attention Feature Fusion (MHAFF) for real-time cattle identification, evaluated against state-of-the-art techniques. Result: The proposed CCoMAML with MHAFF achieved F1 scores of 98.46% and 97.91%, outperforming existing few-shot learning methods in cattle identification. Conclusion: The proposed CCoMAML with MHAFF demonstrates superior cattle identification performance, achieving high F1 scores of 98.46% and 97.91%, offering a robust and adaptable solution without the need for frequent retraining. Abstract: Cattle identification is critical for efficient livestock farming management, currently reliant on radio-frequency identification (RFID) ear tags. However, RFID-based systems are prone to failure due to loss, damage, tampering, and vulnerability to external attacks. As a robust alternative, biometric identification using cattle muzzle patterns similar to human fingerprints has emerged as a promising solution. Deep learning techniques have demonstrated success in leveraging these unique patterns for accurate identification. But deep learning models face significant challenges, including limited data availability, disruptions during data collection, and dynamic herd compositions that require frequent model retraining. To address these limitations, this paper proposes a novel few-shot learning framework for real-time cattle identification using Cooperative Model-Agnostic Meta-Learning (CCoMAML) with Multi-Head Attention Feature Fusion (MHAFF) as a feature extractor model. 
This model offers great model adaptability to new data through efficient learning from few data samples without retraining. The proposed approach has been rigorously evaluated against current state-of-the-art few-shot learning techniques applied in cattle identification. Comprehensive experimental results demonstrate that our proposed CCoMAML with MHAFF has superior cattle identification performance with 98.46% and 97.91% F1 scores.

[135] ANROT-HELANet: Adverserially and Naturally Robust Attention-Based Aggregation Network via The Hellinger Distance for Few-Shot Classification

Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu N. Duong

Main category: cs.CV

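TL;DR: This paper proposes ANROT-HELANet, a new few-shot learning method with adversarial and natural robustness that combines the Hellinger distance, attention mechanisms, and a new loss function to improve both performance and robustness.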

Details Motivation: Existing Bayesian-estimation-based methods have brought improvements but remain vulnerable to adversarial attacks and natural noise. Method: Introduces a Hellinger distance-based feature class aggregation scheme and a Hellinger Similarity contrastive loss function, combined with attention mechanisms. Result: ANROT-HELANet improves accuracy on miniImageNet by 1.20% and 1.40% in the 1-shot and 5-shot settings respectively, remains robust under adversarial perturbations (ε=0.30) and Gaussian noise (σ=0.30), and achieves an FID score of 2.75. Conclusion: Through Hellinger distance-based feature aggregation, attention mechanisms, and a new loss function, ANROT-HELANet achieves state-of-the-art FSL performance together with robustness to adversarial and natural perturbations. Abstract: Few-Shot Learning (FSL), which involves learning to generalize using only a few data samples, has demonstrated promising and superior performances to ordinary CNN methods. While Bayesian based estimation approaches using Kullback-Leibler (KL) divergence have shown improvements, they remain vulnerable to adversarial attacks and natural noises. We introduce ANROT-HELANet, an Adversarially and Naturally RObusT Hellinger Aggregation Network that significantly advances the state-of-the-art in FSL robustness and performance. Our approach implements an adversarially and naturally robust Hellinger distance-based feature class aggregation scheme, demonstrating resilience to adversarial perturbations up to $\epsilon=0.30$ and Gaussian noise up to $\sigma=0.30$. The network achieves substantial improvements across benchmark datasets, including gains of 1.20\% and 1.40\% for 1-shot and 5-shot scenarios on miniImageNet respectively. We introduce a novel Hellinger Similarity contrastive loss function that generalizes cosine similarity contrastive loss for variational few-shot inference scenarios. Our approach also achieves superior image reconstruction quality with a FID score of 2.75, outperforming traditional VAE (3.43) and WAE (3.38) approaches. Extensive experiments conducted on four few-shot benchmarked datasets verify that ANROT-HELANet's combination of Hellinger distance-based feature aggregation, attention mechanisms, and our novel loss function establishes new state-of-the-art performance while maintaining robustness against both adversarial and natural perturbations. 
Our code repository will be available at https://github.com/GreedYLearner1146/ANROT-HELANet/tree/main.
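
For readers unfamiliar with the Hellinger distance that the abstract contrasts with KL divergence, the following minimal sketch computes it for two discrete probability distributions. It shows only the textbook definition; it is not the paper's feature aggregation scheme or loss, and the function name is illustrative.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    Both inputs are assumed to be non-negative and to sum to 1.
    """
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)
```

Unlike KL divergence, this quantity is symmetric and bounded: it is 0 for identical distributions and 1 for distributions with disjoint support, where KL divergence would be infinite.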

[136] MIS-LSTM: Multichannel Image-Sequence LSTM for Sleep Quality and Stress Prediction

Seongwan Park,Jieun Woo,Siheon Yang

Main category: cs.CV

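TL;DR: This paper proposes MIS-LSTM, a hybrid framework that combines CNN encoders with an LSTM sequence model to predict day-level sleep quality and stress from multimodal lifelog data, and introduces UALRE, an uncertainty-aware ensemble method that improves robustness; it outperforms baseline models on the 2025 ETRI Lifelog Challenge dataset.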

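Details Motivation: Making effective use of the continuous sensor signals and sparse discrete events in multimodal lifelogs for day-level sleep quality and stress prediction requires addressing multimodal fusion, temporal dependency modeling, and prediction uncertainty. Method: Continuous sensor streams are partitioned into N-hour blocks and converted into multi-channel images, while sparse discrete events are encoded by a dedicated 1D-CNN; a Convolutional Block Attention Module fuses the two modalities into block embeddings, and an LSTM captures long-range temporal dependencies; the uncertainty-aware ensemble UALRE then overrides low-confidence majority votes with high-confidence individual predictions. Result: On the 2025 ETRI Lifelog Challenge dataset, the base MIS-LSTM model reaches Macro-F1 0.615, rising to 0.647 with UALRE and outperforming strong LSTM, 1D-CNN, and CNN baselines; ablations confirm that multi-channel imaging beats vertical stacking, that a 4-hour block granularity works best, and that modality-specific discrete encoding is effective. Conclusion: MIS-LSTM combines multimodal feature extraction, attention-based fusion, and LSTM temporal modeling, with UALRE adding robustness; it clearly improves day-level sleep and stress prediction and validates the proposed architecture and fusion strategy. Abstract: This paper presents MIS-LSTM, a hybrid framework that joins CNN encoders with an LSTM sequence model for sleep quality and stress prediction at the day level from multimodal lifelog data. Continuous sensor streams are first partitioned into N-hour blocks and rendered as multi-channel images, while sparse discrete events are encoded with a dedicated 1D-CNN. A Convolutional Block Attention Module fuses the two modalities into refined block embeddings, which an LSTM then aggregates to capture long-range temporal dependencies. To further boost robustness, we introduce UALRE, an uncertainty-aware ensemble that overrides low-confidence majority votes with high-confidence individual predictions. Experiments on the 2025 ETRI Lifelog Challenge dataset show that our base MIS-LSTM achieves Macro-F1 0.615; with the UALRE ensemble, the score improves to 0.647, outperforming strong LSTM, 1D-CNN, and CNN baselines. Ablations confirm (i) the superiority of multi-channel over stacked-vertical imaging, (ii) the benefit of a 4-hour block granularity, and (iii) the efficacy of modality-specific discrete encoding.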
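
The idea of overriding a low-confidence majority vote with a high-confidence individual prediction can be sketched as follows. This is a hypothetical reading of the one-sentence description in the abstract: the function name, the use of vote share as the measure of majority confidence, and both thresholds are assumptions, not the authors' UALRE rule.

```python
from collections import Counter

def uncertainty_aware_vote(predictions, vote_threshold=0.7, conf_threshold=0.9):
    """Combine ensemble outputs given as (label, confidence) pairs.

    Assumed rule: if the majority label's vote share is below vote_threshold,
    and some member is at least conf_threshold confident, that member's label
    replaces the majority label. Otherwise the majority label is returned.
    """
    counts = Counter(label for label, _ in predictions)
    majority_label, majority_count = counts.most_common(1)[0]
    vote_share = majority_count / len(predictions)
    if vote_share >= vote_threshold:
        return majority_label  # the majority is decisive, keep it

    best_label, best_conf = max(predictions, key=lambda pair: pair[1])
    if best_conf >= conf_threshold:
        return best_label  # a confident member overrides a weak majority
    return majority_label
```

With five members where three weakly agree and one dissenter is 0.95 confident, the dissenter's label is returned; when four of five agree, the majority stands regardless of individual confidence.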

[137] Contextualized Multimodal Lifelong Person Re-Identification in Hybrid Clothing States

Robert Long,Rongxin Jiang,Mingrui Yan

Main category: cs.CV

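TL;DR: Proposes the LReID-Hybrid task, which combines continual learning with person re-identification under clothing changes; the CMLReID framework (with CASP and AKFP modules) models same-cloth and cloth-changing scenarios in a unified way and markedly improves robustness and generalization.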

Details Motivation: Existing methods are mostly limited to same-cloth scenarios or treat clothing change as a separate problem, so they struggle to handle clothing variation and knowledge forgetting together during continual learning; a unified, effective solution is lacking. Method: Proposes the CMLReID framework with two new modules: (1) Context-Aware Semantic Prompt (CASP), which generates adaptive prompts and aligns multi-grained visual cues with the semantic text space; and (2) Adaptive Knowledge Fusion and Projection (AKFP), which builds robust same-cloth and cloth-changing prototypes through dual-path learning and a clothing-state-aware projection loss. Result: Experiments on multiple datasets show that CMLReID outperforms all existing state-of-the-art methods under clothing variation and sequential learning, with strong robustness and generalization. Conclusion: CMLReID addresses the unified modeling of same-cloth and cloth-changing person re-identification under continual learning; through semantic alignment and adaptive knowledge fusion, it achieves robust recognition that is closer to real surveillance scenarios. Abstract: Person Re-Identification (ReID) faces several challenges in real-world surveillance systems due to clothing changes (CCReID) and the need for maintaining continual learning (LReID). Existing methods either develop models specifically for one application, which is mostly a same-cloth (SC) setting, or treat CCReID as its own separate sub-problem. In this work, we introduce the LReID-Hybrid task with the goal of developing a model that handles both SC and CC while learning in a continual setting. Mismatched representations and forgetting from one task to the next are significant issues; we address them with CMLReID, a CLIP-based framework composed of two novel modules: (1) Context-Aware Semantic Prompt (CASP), which generates adaptive prompts and incorporates context to align rich multi-grained visual cues with the semantic text space; and (2) Adaptive Knowledge Fusion and Projection (AKFP), which produces robust SC/CC prototypes through a dual-path learner that aligns features with our Clothing-State-Aware Projection Loss. Experiments on a wide range of datasets show that CMLReID outperforms all state-of-the-art methods, with strong robustness and generalization despite clothing variations and a sophisticated process of sequential learning.

[138] Cross-Domain Attribute Alignment with CLIP: A Rehearsal-Free Approach for Class-Incremental Unsupervised Domain Adaptation

Kerun Mi,Guoliang Kang,Guangyu Li,Lin Zhao,Tao Zhou,Chen Gong

Main category: cs.CV

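TL;DR: This paper proposes a rehearsal-free method for class-incremental unsupervised domain adaptation. It mines and preserves domain-invariant, class-agnostic knowledge ("attributes"), uses CLIP to form "key-value" pairs from visual prototypes and textual prompts, and maintains two attribute dictionaries for cross-domain attribute alignment, mitigating domain shift while reducing catastrophic forgetting and outperforming existing methods on three benchmarks.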

Details Motivation: Existing CI-UDA methods usually rely on sample rehearsal and perform asymmetric alignment only between shared classes, which leads to ever-growing memory and knowledge forgetting; a rehearsal-free method that preserves past knowledge more effectively is therefore needed. Method: Uses CLIP to extract class-agnostic attributes, builds "key-value" pairs from visual prototypes (keys) and textual prompts (values), and maintains a separate attribute dictionary for the source and target domains; attribute alignment is achieved by encouraging cross-domain visual attention consistency and prediction consistency, enabling domain adaptation without rehearsal samples. Result: Experiments on three CI-UDA benchmarks show that the method outperforms previous state-of-the-art methods and alleviates catastrophic forgetting without storing past samples. Conclusion: Through attribute modeling and cross-domain alignment, the proposed method achieves rehearsal-free CI-UDA, performs well at mitigating domain shift and avoiding knowledge forgetting, and offers a new direction for class-incremental domain adaptation. Abstract: Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes while continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target samples from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, the memory will continuously increase and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, using CLIP, we extract the class-agnostic properties, which we call "attributes". In our framework, we learn a "key-value" pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift, via encouraging visual attention consistency and prediction consistency. 
Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift, in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.

[139] Synthetic Dataset Evaluation Based on Generalized Cross Validation

Zhihang Song,Dingyi Yao,Ruibo Ming,Lihui Peng,Danya Yao,Yi Zhang

Main category: cs.CV

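TL;DR: Proposes a new framework for evaluating synthetic dataset quality that combines generalized cross-validation with domain transfer learning and uses a cross-performance matrix and two new metrics to quantify the simulation quality and transfer quality of synthetic data.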

Details Motivation: Existing methods for evaluating synthetic datasets lack a unified standard, making it hard to measure the quality and usefulness of synthetic data. Method: Designs an evaluation framework built on generalized cross-validation experiments and domain transfer learning: task models (e.g., YOLOv5s) are trained on the synthetic dataset and on multiple real datasets to build a normalized GCV matrix, and two metrics are proposed, simulation quality and transfer quality. Result: Experiments on Virtual KITTI show that the framework can effectively assess the fidelity of synthetic data and that it is scalable and quantifiable. Conclusion: The proposed framework offers a general, comparable way to evaluate synthetic dataset quality and helps guide the optimization and use of synthetic data in AI. Abstract: With the rapid advancement of synthetic dataset generation techniques, evaluating the quality of synthetic data has become a critical research focus. Robust evaluation not only drives innovations in data generation methods but also guides researchers in optimizing the utilization of these synthetic resources. However, current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework. To address this, this paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles, enabling generalizable and comparable assessments of synthetic dataset quality. The framework involves training task-specific models (e.g., YOLOv5s) on both synthetic datasets and multiple real-world benchmarks (e.g., KITTI, BDD100K), forming a cross-performance matrix. Following normalization, a Generalized Cross-Validation (GCV) Matrix is constructed to quantify domain transferability. The framework introduces two key metrics. One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data across various real-world scenarios. Experimental validation on Virtual KITTI demonstrates the effectiveness of our proposed framework and metrics in assessing synthetic data fidelity. This scalable and quantifiable evaluation solution overcomes traditional limitations, providing a principled approach to guide synthetic dataset optimization in artificial intelligence research.
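
The cross-performance matrix described above can be pictured with a small sketch. The abstract does not say how the matrix is normalized, so the rule below (dividing each entry by the in-domain score on the same test set) is purely an assumption for illustration, as are the function name and the example scores.

```python
def normalize_cross_performance(perf):
    """Normalize a square cross-performance matrix.

    perf[i][j] is the score of a model trained on dataset i and tested on
    dataset j. Each entry is divided by perf[j][j], the in-domain score for
    test set j, so the diagonal becomes 1.0. This normalization is a
    hypothetical choice, not the paper's stated definition.
    """
    n = len(perf)
    return [[perf[i][j] / perf[j][j] for j in range(n)] for i in range(n)]

# Made-up scores for illustration: row/column 0 is a synthetic dataset,
# row/column 1 is a real dataset.
example = [[0.8, 0.4],
           [0.6, 0.5]]
normalized = normalize_cross_performance(example)
```

Under this assumed normalization, the off-diagonal entry in the synthetic row reads as the fraction of in-domain performance that a model trained only on synthetic data retains on the real test set.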

[140] ROSGS: Relightable Outdoor Scenes With Gaussian Splatting

Lianjun Liao,Chunhui Zhang,Tong Wu,Henglei Lv,Bailin Deng,Lin Gao

Main category: cs.CV

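TL;DR: Proposes ROSGS, a two-stage Gaussian Splatting method for efficiently reconstructing relightable outdoor scenes; it combines monocular normal priors with a hybrid lighting model and achieves leading performance in decomposing geometry, reflectance, and illumination.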

Details Motivation: Outdoor images are hard to decompose into geometry, reflectance, and illumination because of complex scenes and varying lighting; existing NeRF and 3DGS methods suffer from high computational overhead and low-frequency lighting representations. Method: Uses a two-stage pipeline: the first stage reconstructs geometry with 2D Gaussian Splatting (2DGS) guided by monocular normal priors; the second stage builds on that geometry and decomposes texture and lighting with a hybrid lighting model, in which sunlight is modeled by a spherical Gaussian function and skylight by a radiance transfer function learned via spherical harmonic coefficients. Result: In both quantitative metrics and qualitative comparisons, ROSGS reaches state-of-the-art performance on outdoor scene relighting, with higher relighting accuracy and rendering efficiency. Conclusion: With its compact 2DGS representation and hybrid lighting model, ROSGS addresses the efficiency and accuracy problems of outdoor scene decomposition and markedly improves relighting quality and speed. Abstract: Image data captured outdoors often exhibit unbounded scenes and unconstrained, varying lighting conditions, making it challenging to decompose them into geometry, reflectance, and illumination. Recent works have focused on achieving this decomposition using Neural Radiance Fields (NeRF) or the 3D Gaussian Splatting (3DGS) representation but remain hindered by two key limitations: the high computational overhead associated with neural networks of NeRF and the use of low-frequency lighting representations, which often result in inefficient rendering and suboptimal relighting accuracy. We propose ROSGS, a two-stage pipeline designed to efficiently reconstruct relightable outdoor scenes using the Gaussian Splatting representation. By leveraging monocular normal priors, ROSGS first reconstructs the scene's geometry with the compact 2D Gaussian Splatting (2DGS) representation, providing an efficient and accurate geometric foundation. Building upon this reconstructed geometry, ROSGS then decomposes the scene's texture and lighting through a hybrid lighting model. This model effectively represents typical outdoor lighting by employing a spherical Gaussian function to capture the directional, high-frequency components of sunlight, while learning a radiance transfer function via Spherical Harmonic coefficients to model the remaining low-frequency skylight comprehensively. Both quantitative metrics and qualitative comparisons demonstrate that ROSGS achieves state-of-the-art performance in relighting outdoor scenes and highlight its ability to deliver superior relighting accuracy and rendering efficiency.
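
To make the sunlight term more concrete, here is a minimal sketch of a spherical Gaussian lobe in its commonly used form G(v) = a · exp(λ(μ·v − 1)). The abstract does not give the exact parameterization used in ROSGS, so the formula choice, function name, and parameter names are assumptions.

```python
import math

def spherical_gaussian(direction, axis, sharpness, amplitude):
    """Evaluate a spherical Gaussian lobe at a unit direction.

    direction, axis: 3D unit vectors (assumed normalized by the caller).
    sharpness: lambda > 0; larger values give a narrower lobe.
    amplitude: peak value, reached when direction equals axis.
    """
    cos_angle = sum(d * a for d, a in zip(direction, axis))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))
```

A large sharpness concentrates the energy around the axis, which is why a single lobe can stand in for a directional, high-frequency source such as the sun, while smooth skylight is left to a low-order spherical harmonic expansion.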

[141] Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations

Yifan Lu,Ziqi Zhang,Chunfeng Yuan,Jun Gao,Congxuan Zhang,Xiaojuan Qi,Bing Li,Weiming Hu

Main category: cs.CV

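TL;DR: Proposes APASI, an autonomous preference alignment method with no external dependencies; a self-injection mechanism generates response pairs with different preference levels and effectively mitigates hallucinations in large vision-language models.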

Details Motivation: Existing hallucination mitigation methods rely on external human annotation or auxiliary models to collect preference data, which is costly and hard to improve continuously; a method that needs no external dependencies and can keep improving autonomously is therefore required. Method: Proposes APASI, which uses the target LVLM itself to inject hallucinated content into a generated answer and so builds response pairs with different degrees of hallucination; the dis-preferred, hallucinated response is generated according to three observed characteristics of hallucinations, and iterative alignment training combined with curriculum learning periodically updates the data and progressively improves the model. Result: In extensive experiments on six benchmarks, APASI reduces hallucinations for three baseline models and matches or exceeds alignment methods that depend on external data. Conclusion: APASI is a general and efficient hallucination mitigation method that achieves stable, continuous model improvement without external dependencies and shows good generalization and practical potential. Abstract: Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.

[142] Leveraging Geometric Priors for Unaligned Scene Change Detection

Ziling Liu,Ziwei Chen,Mingqi Gao,Jinyu Yang,Feng Zheng

Main category: cs.CV

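TL;DR: This paper is the first to use geometric priors from a geometric foundation model to tackle the core challenges of unaligned scene change detection; it proposes a training-free framework that combines these priors with the strong representations of a visual foundation model to detect changes reliably under viewpoint misalignment.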

Details Motivation: Existing methods rely only on 2D visual cues for cross-image matching, which tends to fail under large viewpoint changes; they are also limited by 2D supervision from small-scale datasets and lack explicit geometric reasoning. Method: Uses the geometric priors provided by a geometric foundation model to build a training-free framework that combines them with the representations of a visual foundation model, enabling robust correspondence establishment, visual overlap identification, and explicit occlusion detection. Result: Extensive experiments on the PSCD, ChangeSim, and PASLCD datasets show that the proposed method outperforms existing methods in both performance and robustness. Conclusion: Introducing geometric priors effectively addresses the key problems of unaligned scene change detection and offers a new approach to multi-view change detection. Abstract: Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we are the first to leverage geometric priors from a Geometric Foundation Model to address the core challenges of unaligned SCD, including reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.

[143] UnLoc: Leveraging Depth Uncertainties for Floorplan Localization

Matthias Wüest,Francis Engelmann,Ondrej Miksik,Marc Pollefeys,Daniel Barath

Main category: cs.CV

TL;DR: UnLoc improves sequential camera localization within floorplans by incorporating uncertainty estimation and leveraging pre-trained depth models, eliminating the need for environment-specific training and achieving significant performance improvements on large-scale datasets.

Details Motivation: The motivation is to overcome key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the requirement for custom depth networks trained for each environment, while utilizing the availability and robustness of floorplan data. Method: The method introduces a probabilistic model that models depth predictions as explicit probability distributions, using off-the-shelf pre-trained monocular depth models to eliminate the need for per-environment-trained networks. Result: UnLoc demonstrated significant improvements in accuracy and robustness over existing methods, achieving 2.7 times higher localization recall on long sequences and 16.7 times higher on short sequences than the state of the art on the LaMAR HGE dataset. Conclusion: UnLoc is a data-driven solution for sequential camera localization within floorplans that enhances generalization to unseen spaces by leveraging pre-trained monocular depth models and incorporating uncertainty estimation. Abstract: We propose UnLoc, an efficient data-driven solution for sequential camera localization within floorplans. Floorplan data is readily available, long-term persistent, and robust to changes in visual appearance. We address key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the necessity for custom depth networks trained for each environment. We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions. By leveraging off-the-shelf pre-trained monocular depth models, we eliminate the need to rely on per-environment-trained depth networks, enhancing generalization to unseen spaces. We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements over existing methods in terms of accuracy and robustness. 
Notably, we achieve $2.7$ times higher localization recall on long sequences (100 frames) and $16.7$ times higher on short ones (15 frames) than the state of the art on the challenging LaMAR HGE dataset.

[144] Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

Jian Song,Wei Mei,Yunfeng Xu,Qiang Fu,Renke Kou,Lina Bu,Yucheng Long

Main category: cs.CV

TL;DR: This paper introduces SIKNet, a novel learning-aided filter for motion estimation in multi-object tracking, which outperforms traditional Kalman filters and existing learning-aided filters in terms of robustness and accuracy.

Details Motivation: Traditional Kalman filters (KF) may yield unsatisfactory results when parameters are mismatched or objects move in non-stationary patterns, which motivates a more robust and accurate learning-aided filter for motion estimation in MOT. Method: The paper proposes Semantic-Independent KalmanNet (SIKNet) for motion estimation in MOT. It uses a Semantic-Independent Encoder (SIE) that encodes the state vector in two steps: first a 1D convolution along homogeneous-semantic elements, then a fully-connected layer and a nonlinear activation layer to capture nonlinear and cross-dependency information between heterogeneous-semantic elements. Result: SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters in motion estimation for MOT. Performance was evaluated on a large-scale semi-simulated dataset constructed from open-source MOT datasets. Conclusion: SIKNet, a novel learning-aided filter, outperforms traditional Kalman filters and existing learning-aided filters in motion estimation for multi-object tracking (MOT), with better robustness and accuracy. Abstract: Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectory of objects by analyzing the changes in their positions in consecutive frames of images, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move in non-stationary patterns. In this work, we utilize the learning-aided filter to handle the motion estimation of MOT.
In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
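
The two-step encoding described in this abstract can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: it reads the kernel-size-1 convolution as lifting every scalar state element to several channels with shared weights, and the function name `semantic_independent_encode`, the weight shapes, and the ReLU activation are hypothetical choices.

```python
def semantic_independent_encode(state, conv_w, conv_b, fc_w, fc_b):
    """Hypothetical two-step encoder for one state vector (list of floats)."""
    # Step 1 (assumed reading of the kernel-size-1 convolution): each scalar
    # element is mapped to len(conv_w) channels on its own, with weights
    # shared across elements, so no information is mixed between elements.
    lifted = [[w * s + b for w, b in zip(conv_w, conv_b)] for s in state]
    flat = [v for row in lifted for v in row]

    # Step 2: a fully connected layer followed by ReLU mixes the per-element
    # codes, capturing cross-dependencies between different elements.
    out = []
    for row, b in zip(fc_w, fc_b):
        z = sum(wi * xi for wi, xi in zip(row, flat)) + b
        out.append(max(0.0, z))
    return out


encoded = semantic_independent_encode(
    state=[1.0, 2.0],
    conv_w=[1.0, -1.0],
    conv_b=[0.0, 0.5],
    fc_w=[[1.0, 1.0, 1.0, 1.0], [-1.0, 0.0, 0.0, 0.0]],
    fc_b=[0.0, 0.0],
)
```

The point of the split is that step 1 never combines different elements of the state vector, while step 2 is where they interact.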

[145] Toward Next-generation Medical Vision Backbones: Modeling Finer-grained Long-range Visual Dependency

Mingyuan Meng

Main category: cs.CV

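TL;DR: This doctoral research investigates effective long-range visual dependency modeling for medical image computing. It proposes MLP-based models and validates their advantage in capturing fine-grained long-range dependencies in high-resolution medical images, suggesting that MLPs can serve as next-generation medical vision backbones that surpass CNNs and transformers.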

Details Motivation: Conventional CNNs are limited by local receptive fields and struggle to model long-range dependencies. Transformers excel at long-range modeling but are computationally expensive and cannot process high-resolution features, which limits their ability to capture subtle details in medical images. More efficient, finer-grained long-range dependency modeling is therefore needed. Method: The work first explores transformers for pixel-wise and image-wise medical vision tasks, then pioneers multi-layer perceptron (MLP)-based visual models that capture fine-grained long-range dependencies in high-resolution medical images, validated through extensive experiments. Result: Experiments show that long-range dependency modeling is critical in medical image analysis. The key finding is that MLPs can model finer-grained long-range dependencies on high-resolution features rich in anatomical/pathological detail, and that they outperform CNNs and transformers across a range of medical vision tasks. Conclusion: MLP-based models show greater potential than CNNs and transformers in medical image computing. They model fine-grained long-range dependencies efficiently and are promising candidates for the mainstream paradigm of next-generation medical vision backbones. Abstract: Medical Image Computing (MIC) is a broad research topic covering both pixel-wise (e.g., segmentation, registration) and image-wise (e.g., classification, regression) vision tasks. Effective analysis demands models that capture both global long-range context and local subtle visual characteristics, necessitating fine-grained long-range visual dependency modeling. Compared to Convolutional Neural Networks (CNNs) that are limited by intrinsic locality, transformers excel at long-range modeling; however, due to the high computational loads of self-attention, transformers typically cannot process high-resolution features (e.g., full-scale image features before downsampling or patch embedding) and thus face difficulties in modeling fine-grained dependency among subtle medical image details. Concurrently, Multi-layer Perceptron (MLP)-based visual models are recognized as computation/memory-efficient alternatives in modeling long-range visual dependency but have yet to be widely investigated in the MIC community. This doctoral research advances deep learning-based MIC by investigating effective long-range visual dependency modeling. It first presents innovative use of transformers for both pixel- and image-wise medical vision tasks. The focus then shifts to MLPs, pioneeringly developing MLP-based visual models to capture fine-grained long-range visual dependency in medical images.
Extensive experiments confirm the critical role of long-range dependency modeling in MIC and reveal a key finding: MLPs provide feasibility in modeling finer-grained long-range dependency among higher-resolution medical features containing enriched anatomical/pathological details. This finding establishes MLPs as a superior paradigm over transformers/CNNs, consistently enhancing performance across various medical vision tasks and paving the way for next-generation medical vision backbones.

[146] Dual Band Video Thermography Near Ambient Conditions

Sriram Narayanan,Mani Ramanagopal,Srinivasa G. Narasimhan

Main category: cs.CV

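TL;DR: This paper proposes a new method that uses dual-band thermal imaging to separate reflected and emitted components, estimating surface emissivity and time-varying temperature in near-ambient conditions while isolating a dynamic background.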

Details Motivation: In near-ambient conditions, the reflected and emitted components of a thermal image are typically comparable in magnitude and vary over time, so conventional assumptions no longer hold. A method that accurately separates the two components is needed to better understand object properties. Method: The authors derive a dual-band thermal image formation model. Using video from two thermal cameras with different spectral sensitivities, they develop algorithms that estimate a surface's emissivity and time-varying temperature while isolating a dynamic background. Result: The approach is evaluated quantitatively using carefully calibrated emissivities for a range of materials, and qualitative results are shown on complex everyday scenes, such as a glass filled with hot liquid and moving people. Conclusion: The method is the first to separate reflected and emitted components of thermal images in near-ambient conditions, improving the understanding and estimation accuracy of object properties such as emissivity and temperature. Abstract: Long-wave infrared radiation captured by a thermal camera consists of two components: (a) light from the environment reflected or transmitted by a surface, and (b) light emitted by the surface after undergoing heat transport through the object and exchanging heat with the surrounding environment. Separating these components is essential for understanding object properties such as emissivity, temperature, reflectance and shape. Previous thermography studies often assume that only one component is dominant (e.g., in welding) or that the second component is constant and can be subtracted. However, in near-ambient conditions, which are most relevant to computer vision applications, both components are typically comparable in magnitude and vary over time. We introduce the first method that separates reflected and emitted components of light in videos captured by two thermal cameras with different spectral sensitivities. We derive a dual-band thermal image formation model and develop algorithms to estimate the surface's emissivity and its time-varying temperature while isolating a dynamic background. We quantitatively evaluate our approach using carefully calibrated emissivities for a range of materials and show qualitative results on complex everyday scenes, such as a glass filled with hot liquid and people moving in the background.

[147] Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning

Huaiyuan Qin,Muli Yang,Siyuan Hu,Peng Hu,Yu Zhang,Chen Gong,Hongyuan Zhu

Main category: cs.CV

TL;DR: This paper shows that self-supervised learning (SSL) can work well even when different views of an image do not show the same object, as long as view diversity is balanced: neither too little nor too much.

Details Motivation: The motivation stems from the breakdown of the instance consistency assumption in SSL for non-iconic data, where different views may contain distinct objects or semantic information. Method: The authors conducted extensive ablation studies to evaluate the impact of view diversity on SSL performance, using Earth Mover's Distance (EMD) to measure mutual information between views. Result: The study found that SSL remains effective without strict instance consistency, with optimal performance achieved through moderate view diversity. Excessive diversity, however, diminishes effectiveness. Conclusion: The paper concludes that SSL can learn meaningful representations even without strict instance consistency, and that moderate view diversity enhances downstream performance, while excessive diversity reduces effectiveness. Abstract: Self-supervised learning (SSL) conventionally relies on the instance consistency paradigm, assuming that different views of the same image can be treated as positive pairs. However, this assumption breaks down for non-iconic data, where different views may contain distinct objects or semantic information. In this paper, we investigate the effectiveness of SSL when instance consistency is not guaranteed. Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis further reveals that increasing view diversity, by enforcing zero overlapping or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. However, excessive diversity is found to reduce effectiveness, suggesting an optimal range for view diversity. To quantify this, we adopt the Earth Mover's Distance (EMD) as an estimator to measure mutual information between views, finding that moderate EMD values correlate with improved SSL learning, providing insights for future SSL framework design. 
We validate our findings across a range of settings, highlighting their robustness and applicability on diverse data sources.
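
For intuition about the EMD measure mentioned in this abstract, here is a minimal sketch of the Earth Mover's Distance between two 1D histograms defined over the same bins. How the paper applies EMD to pairs of views is not specified in this summary, so the function name `emd_1d` and the unit bin spacing are assumptions made for illustration only.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1D histograms on identical bins.

    Both histograms are normalized to sum to 1. With unit spacing between
    neighbouring bins, the EMD equals the sum of absolute differences of
    the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")

    carried = 0.0   # mass that still has to be moved to later bins
    distance = 0.0
    for a, b in zip(p, q):
        carried += a / total_p - b / total_q
        distance += abs(carried)
    return distance


identical = emd_1d([1, 1], [1, 1])
far_apart = emd_1d([1, 0, 0], [0, 0, 1])
```

In the second call all mass has to travel two bins, so the distance is larger than for two identical histograms.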

[148] Promoting Shape Bias in CNNs: Frequency-Based and Contrastive Regularization for Corruption Robustness

Robin Narsingh Ranabhat,Longwei Wang,Amit Kumar Patel,KC santosh

Main category: cs.CV

TL;DR: This paper proposes regularization techniques to make CNNs more robust by promoting shape-based representations.

Details Motivation: CNNs are vulnerable to common image corruptions due to reliance on local textures, unlike human perception which focuses on global shapes. Method: Two regularization strategies were proposed: (1) an auxiliary loss enforcing feature consistency with low-frequency filtering, and (2) supervised contrastive learning to structure shape-relevant features. Result: Both methods improved robustness on CIFAR-10-C benchmark without sacrificing clean accuracy. Conclusion: Loss-level regularization methods can effectively improve the robustness of CNNs by encouraging shape-aware representations. Abstract: Convolutional Neural Networks (CNNs) excel at image classification but remain vulnerable to common corruptions that humans handle with ease. A key reason for this fragility is their reliance on local texture cues rather than global object shapes -- a stark contrast to human perception. To address this, we propose two complementary regularization strategies designed to encourage shape-biased representations and enhance robustness. The first introduces an auxiliary loss that enforces feature consistency between original and low-frequency filtered inputs, discouraging dependence on high-frequency textures. The second incorporates supervised contrastive learning to structure the feature space around class-consistent, shape-relevant representations. Evaluated on the CIFAR-10-C benchmark, both methods improve corruption robustness without degrading clean accuracy. Our results suggest that loss-level regularization can effectively steer CNNs toward more shape-aware, resilient representations.
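
The first regularizer, feature consistency between an input and its low-frequency version, can be sketched as follows. This is a minimal illustration under assumptions: a 3x3 mean filter stands in for the low-pass filter, the loss is a mean squared error, and `box_blur`, `consistency_loss`, and the toy feature extractors are hypothetical names rather than the authors' code.

```python
def box_blur(img):
    """3x3 mean filter (a simple low-pass filter) on a 2D list of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def consistency_loss(features, img):
    """Mean squared difference between features of img and of blurred img."""
    f_orig = features(img)
    f_low = features(box_blur(img))
    return sum((a - b) ** 2 for a, b in zip(f_orig, f_low)) / len(f_orig)


def pixel_features(img):
    """Toy 'texture-sensitive' features: the raw pixels."""
    return [v for row in img for v in row]


def row_mean_features(img):
    """Toy 'coarse' features: one mean per row."""
    return [sum(row) / len(row) for row in img]


checkerboard = [[0.0, 1.0], [1.0, 0.0]]
texture_loss = consistency_loss(pixel_features, checkerboard)
coarse_loss = consistency_loss(row_mean_features, checkerboard)
```

Features that depend on the high-frequency checkerboard pattern are penalized, while features that only see coarse structure incur no loss, which is the behaviour such an auxiliary term is meant to encourage.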

[149] GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration

Wan Xu,Feng Zhu,Yihan Zeng,Yuanfan Guo,Ming Liu,Hang Xu,Wangmeng Zuo

Main category: cs.CV

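TL;DR: The paper proposes GLaVE-Cap, a framework that uses global-local alignment and vision expert integration to generate more detailed, contextually consistent video captions. It also builds the large-scale dataset GLaVE-1.2M and the new benchmark GLaVE-Bench; experiments show state-of-the-art performance on multiple benchmarks.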

Details Motivation: Existing video detailed captioning methods follow a local-to-global paradigm and suffer from insufficient detail and contextual inconsistency, mainly because they lack a mechanism for fine-grained control and have weak interaction between local and global captions. Method: GLaVE-Cap comprises two modules. TrackFusion uses vision experts to extract cross-frame visual prompts and combines them with a dual-stream structure to generate comprehensive local captions. CaptionBridge establishes bidirectional interaction between local and global captions, using global context to guide local captioning and adaptively summarizing local captions into a global caption. Result: Experiments on four benchmarks show that GLaVE-Cap achieves state-of-the-art performance. Ablation studies and student model analyses validate the effectiveness of each module and the contribution of the GLaVE-1.2M dataset to the video understanding community. Conclusion: By strengthening the alignment and interaction between local and global captions, GLaVE-Cap substantially improves the quality of detailed video captions. GLaVE-Bench and GLaVE-1.2M provide important resources for future research. Abstract: Video detailed captioning aims to generate comprehensive video descriptions to facilitate video understanding. Recently, most efforts in the video detailed captioning community have been made towards a local-to-global paradigm, which first generates local captions from video clips and then summarizes them into a global caption. However, we find this paradigm leads to less detailed and contextually inconsistent captions, which can be attributed to (1) no mechanism to ensure fine-grained captions, and (2) weak interaction between local and global captions. To remedy the above two issues, we propose GLaVE-Cap, a Global-Local aligned framework with Vision Expert integration for Captioning, which consists of two core modules: TrackFusion enables comprehensive local caption generation, by leveraging vision experts to acquire cross-frame visual prompts, coupled with a dual-stream structure; while CaptionBridge establishes a local-global interaction, by using global context to guide local captioning, and adaptively summarizing local captions into a coherent global caption. Besides, we construct GLaVE-Bench, a comprehensive video captioning benchmark featuring 5X more queries per video than existing benchmarks, covering diverse visual dimensions to facilitate reliable evaluation. We further provide a training dataset GLaVE-1.2M containing 16K high-quality fine-grained video captions and 1.2M related question-answer pairs. Extensive experiments on four benchmarks show that our GLaVE-Cap achieves state-of-the-art performance.
Besides, the ablation studies and student model analyses further validate the effectiveness of the proposed modules and the contribution of GLaVE-1.2M to the video understanding community. The source code, model weights, benchmark, and dataset will be open-sourced.

[150] In-Vivo Skin 3-D Surface Reconstruction and Wrinkle Depth Estimation using Handheld High Resolution Tactile Sensing

Akhil Padmanabha,Arpit Agarwal,Catherine Li,Austin Williams,Dinesh K. Patel,Sankalp Chopkar,Achu Wilson,Ahmet Ozkan,Wenzhen Yuan,Sonal Choudhary,Arash Mostaghimi,Zackory Erickson,Carmel Majidi

Main category: cs.CV

TL;DR: This paper presents a compact 3-D skin reconstruction probe based on GelSight tactile imaging and a learning-based reconstruction algorithm for micron-level wrinkle height estimation. The probe has potential applications in clinical and cosmetic skin analysis.

Details Motivation: 3-D skin surface reconstruction offers promise for objective and quantitative dermatological assessment, but no portable, high-resolution device has been validated and used for depth reconstruction across various body locations. Method: The paper introduces a compact 3-D skin reconstruction probe based on GelSight tactile imaging with a custom elastic gel and a learning-based reconstruction algorithm for estimating wrinkle height at the micron level. The probe is integrated into a handheld device with force sensing for consistent contact. Result: The probe achieved a mean absolute error of 12.55 microns on wrinkle-like test objects. In a study with 15 participants without skin disorders, the authors provided the first validated wrinkle depth metrics across multiple body regions. They also demonstrated statistically significant reductions in wrinkle height at three locations following over-the-counter moisturizer application. Conclusion: The 3-D skin reconstruction probe provides a validated tool for clinical and cosmetic skin analysis with potential applications in diagnosis, treatment monitoring, and skincare efficacy evaluation. Abstract: Three-dimensional (3-D) skin surface reconstruction offers promise for objective and quantitative dermatological assessment, but no portable, high-resolution device exists that has been validated and used for depth reconstruction across various body locations. We present a compact 3-D skin reconstruction probe based on GelSight tactile imaging with a custom elastic gel and a learning-based reconstruction algorithm for micron-level wrinkle height estimation. Our probe, integrated into a handheld device with force sensing for consistent contact, achieves a mean absolute error of 12.55 microns on wrinkle-like test objects.
In a study with 15 participants without skin disorders, we provide the first validated wrinkle depth metrics across multiple body regions. We further demonstrate statistically significant reductions in wrinkle height at three locations following over-the-counter moisturizer application. Our work offers a validated tool for clinical and cosmetic skin analysis, with potential applications in diagnosis, treatment monitoring, and skincare efficacy evaluation.

[151] MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation

Syed Talal Wasim,Hamid Suleman,Olga Zatsarynna,Muzammal Naseer,Juergen Gall

Main category: cs.CV

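TL;DR: MixANT improves temporal memory control in existing state space models by dynamically selecting A matrices based on input features, improving the accuracy of human activity anticipation.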

Details Motivation: Current state space models such as Mamba apply input-dependent selectivity to three key parameters, but the forget gate (A matrix) that controls temporal memory remains static, which limits performance. Method: MixANT adopts a mixture-of-experts approach that dynamically selects relevant A matrices based on input features, increasing representational capacity without sacrificing computational efficiency. Result: Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets show that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Conclusion: MixANT addresses the limitation of a static forget gate in existing models. A mixture of experts selected dynamically from input features improves the reliability of human behavior prediction. Abstract: We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. While recent State Space Models (SSMs) like Mamba have shown promise through input-dependent selectivity on three key parameters, the critical forget-gate ($\textbf{A}$ matrix) controlling temporal memory remains static. We address this limitation by introducing a mixture of experts approach that dynamically selects contextually relevant $\textbf{A}$ matrices based on input features, enhancing representational capacity without sacrificing computational efficiency. Extensive experiments on the 50Salads, Breakfast, and Assembly101 datasets demonstrate that MixANT consistently outperforms state-of-the-art methods across all evaluation settings. Our results highlight the importance of input-dependent forget-gate mechanisms for reliable prediction of human behavior in diverse real-world scenarios.
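
The idea of an input-dependent $\textbf{A}$ matrix can be sketched as a gated mixture over a small bank of candidate matrices. This is a minimal illustration under assumptions, not MixANT itself: the experts are diagonal (stored as lists), the gate is a linear layer followed by a softmax, the mixing is soft rather than a hard top-k selection, and the names `mix_a`, `gate_w`, and `ssm_step` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def mix_a(x, gate_w, experts):
    """Blend diagonal A matrices with gate weights computed from input x.

    gate_w has one row of weights per expert; experts is a list of
    diagonals (one list of floats per expert).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    gate = softmax(logits)
    size = len(experts[0])
    mixed = [sum(gate[k] * experts[k][i] for k in range(len(experts)))
             for i in range(size)]
    return mixed, gate


def ssm_step(h, u, a_diag, dt=1.0):
    """One simplified recurrent update: decay the state, then add input u."""
    return [math.exp(dt * a) * hi + dt * u for a, hi in zip(a_diag, h)]


experts = [[-1.0, -2.0], [-3.0, -4.0]]
a_uniform, gate_uniform = mix_a([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], experts)
a_skewed, gate_skewed = mix_a([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], experts)
```

With a zero input the gate is uniform and the mixed diagonal is the plain average of the experts; a different input shifts the gate and therefore how quickly the state forgets.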

[152] No Modality Left Behind: Dynamic Model Generation for Incomplete Medical Data

Christoph Fürböck,Paul Weiser,Branko Mitic,Philipp Seeböck,Thomas Helbich,Georg Langs

Main category: cs.CV

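TL;DR: This paper proposes a hypernetwork-based method for multi-modal medical imaging analysis that maintains high classification performance when modalities are missing, outperforming existing methods.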

Details Motivation: In real-world clinical settings, multi-modal medical imaging data is often incomplete. Conventional approaches (discarding incomplete samples, imputation, or dropout schemes) limit model robustness and generalizability, so a more flexible and effective approach is needed. Method: The paper proposes a hypernetwork framework that learns to predict the parameters of a task model, so that the model adapts to whichever modalities are available. The method is compared against three standard approaches on datasets with artificially removed modalities. Result: Compared with models trained only on complete data, state-of-the-art channel dropout methods, and an imputation-based method, the approach improves accuracy by up to 8% (absolute) when trained on data with 25% completeness (75% of the training data has missing modalities). Conclusion: The hypernetwork-based method dynamically generates task-specific classification models conditioned on the available modalities, enabling effective training and inference on partially missing data and offering an efficient solution for real-world multi-modal medical data analysis. Abstract: In real world clinical environments, training and applying deep learning models on multi-modal medical imaging data often struggle with partially incomplete data. Standard approaches either discard missing samples, require imputation or repurpose dropout learning schemes, limiting robustness and generalizability. To address this, we propose a hypernetwork-based method that dynamically generates task-specific classification models conditioned on the set of available modalities. Instead of training a fixed model, a hypernetwork learns to predict the parameters of a task model adapted to available modalities, enabling training and inference on all samples, regardless of completeness. We compare this approach with (1) models trained only on complete data, (2) state of the art channel dropout methods, and (3) an imputation-based method, using artificially incomplete datasets to systematically analyze robustness to missing modalities. Results demonstrate superior adaptability of our method, outperforming state of the art approaches with an absolute increase in accuracy of up to 8% when trained on a dataset with 25% completeness (75% of training data with missing modalities). By enabling a single model to generalize across all modality configurations, our approach provides an efficient solution for real-world multi-modal medical data analysis.

[153] On the Skinning of Gaussian Avatars

Nikolaos Zioulis,Nikolaos Kotarelas,Georgios Albanis,Spyridon Thermos,Anargyros Chatzitofis

Main category: cs.CV

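TL;DR: The paper proposes a weighted rotation blending method based on quaternion averaging to handle the non-linear rotation of Gaussians when deforming Gaussian-splatting human avatars, simplifying the animation of vertex-based Gaussians.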

Details Motivation: Linear blend skinning does not handle the non-linear rotation properties of Gaussians correctly, which produces deformation artifacts. Method: A weighted rotation blending approach based on quaternion averaging modifies linear blend skinning, enabling forward skinning from the canonical space to the observation space. Result: The approach models Gaussian rotations more accurately and reduces deformation artifacts, works with existing Gaussian rasterizers, and is easy to integrate into different engines. Conclusion: The method simplifies the animation pipeline for Gaussian splats and improves rendering efficiency and quality, offering an effective solution for human avatars that are fast to train and render. Abstract: Radiance field-based methods have recently been used to reconstruct human avatars, showing that we can significantly downscale the systems needed for creating animated human avatars. Although this progress has been initiated by neural radiance fields, their slow rendering and backward mapping from the observation space to the canonical space have been the main challenges. With Gaussian splatting overcoming both challenges, a new family of approaches has emerged that are faster to train and render, while also straightforward to implement using forward skinning from the canonical to the observation space. However, the linear blend skinning required for the deformation of the Gaussians does not provide valid results for their non-linear rotation properties. To address such artifacts, recent works use mesh properties to rotate the non-linear Gaussian properties or train models to predict corrective offsets. Instead, we propose a weighted rotation blending approach that leverages quaternion averaging. This leads to simpler vertex-based Gaussians that can be efficiently animated and integrated in any engine by only modifying the linear blend skinning technique, and using any Gaussian rasterizer.
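
Weighted rotation blending with quaternions can be sketched as below. This is a minimal illustration under assumptions, not the paper's exact scheme: it uses the common approximation of a sign-aligned weighted sum followed by normalization (rather than an eigenvector-based average), quaternions are stored as `[w, x, y, z]`, and the name `blend_rotations` is hypothetical.

```python
import math


def blend_rotations(quats, weights):
    """Approximate weighted average of unit quaternions [w, x, y, z].

    q and -q encode the same rotation, so every quaternion is first flipped
    into the hemisphere of the highest-weight one; the weighted sum is then
    renormalized to unit length so that it is a valid rotation again.
    """
    ref = quats[max(range(len(weights)), key=lambda i: weights[i])]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q, w in zip(quats, weights):
        dot = sum(a * b for a, b in zip(q, ref))
        sign = -1.0 if dot < 0.0 else 1.0
        for i in range(4):
            acc[i] += sign * w * q[i]
    norm = math.sqrt(sum(c * c for c in acc))
    if norm == 0.0:
        raise ValueError("weights and quaternions cancel out")
    return [c / norm for c in acc]


identity = [1.0, 0.0, 0.0, 0.0]
quarter_turn_z = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
halfway = blend_rotations([identity, quarter_turn_z], [0.5, 0.5])
```

Blending the identity with a 90 degree rotation about z at equal weights yields a 45 degree rotation about z. Linearly blending the two rotation matrices instead would give a matrix that is no longer a pure rotation, which is the kind of artifact the abstract refers to.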

[154] Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery

Jeanny Pan,Philipp Seeböck,Christoph Fürböck,Svitlana Pochepnia,Jennifer Straub,Lucian Beer,Helmut Prosch,Georg Langs

Main category: cs.CV

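TL;DR: The paper proposes a label-free framework that disentangles biological and technical factors via latent space rotation, supporting biomarker discovery in medical imaging.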

Details Motivation: Medical imaging data is subject to domain shifts caused by different scanners and technical parameters, which hinders the discovery of biologically meaningful clusters. Method: The domain shift is actively learned via post-hoc rotation of the latent space, yielding a representation that disentangles biological and technical factors. Result: On real-world heterogeneous clinical data, cluster consistency improves over the entangled representation (ARI +19.01%, NMI +16.85%, Dice +12.39%), and the method outperforms four state-of-the-art methods. When used for survival prediction in pulmonary fibrosis patients, it improves Cox model performance. Conclusion: The label-free framework separates technical confounders, improves the stability of tissue clusters in multi-center imaging data, and facilitates biomarker discovery. Abstract: Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors and to scanning or reconstruction parameters. The resulting domain shift impedes data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue-types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub https://github.com/cirmuw/latent-space-rotation-disentanglement.

[155] MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder

Ayhan Can Erdur,Christian Beischl,Daniel Scholz,Jiazhen Pan,Benedikt Wiestler,Daniel Rueckert,Jan C Peeken

Main category: cs.CV

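TL;DR: The paper proposes a masked autoencoder (MAE)-based multi-modal, multi-task learning framework that handles missing sequences in brain MRI, inferring missing inputs through cross-sequence reasoning and substantially outperforming a baseline on downstream tasks.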

Details Motivation: Medical images often have missing input sequences, which challenges deep learning models that rely on complete data. A method is needed that handles incomplete multi-modal inputs while still learning rich representations. Method: Inspired by MultiMAE, each MRI sequence is treated as a separate modality. A late-fusion-style transformer encoder integrates multi-sequence information, and a separate decoder stream per modality performs multi-task reconstruction. Masked autoencoding pretraining gives the model the ability to reason across sequences. Result: On downstream segmentation and classification tasks with missing input sequences, the method improves over an MAE-ViT baseline by 10.1 in overall Dice score and 0.46 in MCC (absolute), showing stronger robustness and generalization. Conclusion: The proposed MAE paradigm handles missing sequences in brain MRI effectively and is flexible and transferable to a range of downstream applications. Abstract: Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing absolute improvement of $10.1$ overall Dice score and $0.46$ MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
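
For readers less familiar with the two reported metrics, here is a minimal sketch of the Dice score and the Matthews correlation coefficient (MCC) for binary labels. The helper names are hypothetical, and the paper's exact evaluation protocol (for example, per-class averaging or the scale on which Dice is reported) is not specified in this summary.

```python
import math


def confusion_counts(pred, target):
    """Return (tp, fp, fn, tn) for two equal-length lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def dice_score(pred, target):
    """Dice = 2*TP / (2*TP + FP + FN); defined as 1.0 when both are empty."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def mcc(pred, target):
    """Matthews correlation coefficient in [-1, 1]; 0.0 if undefined."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
example_dice = dice_score(pred, target)
example_mcc = mcc(pred, target)
```

Dice ignores true negatives and is therefore common for segmentation masks, whereas MCC uses all four confusion-matrix cells and is often preferred for imbalanced classification.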

[156] Beyond Frame-wise Tracking: A Trajectory-based Paradigm for Efficient Point Cloud Tracking

BaiChen Fan,Sifan Zhou,Jian Li,Shibo Zhao,Muqing Cao,Qin Wang

Main category: cs.CV

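TL;DR: The paper proposes TrajTrack, a trajectory-based paradigm for 3D single object tracking that implicitly learns motion continuity from historical bounding box trajectories, improving performance and efficiency without additional point cloud inputs.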

Details Motivation: Existing two-frame methods lack long-term temporal context, while sequence-based methods are computationally expensive, forcing a trade-off between robustness and efficiency. Method: TrajTrack is a trajectory-based tracking framework that combines an explicit motion proposal with an implicit motion modeling module, using historical trajectory information to refine the tracking result. Result: TrajTrack achieves state-of-the-art performance on NuScenes, improving precision by 4.48% while running at 56 FPS, and generalizes well across different base trackers. Conclusion: TrajTrack resolves the tension between efficiency and robustness in 3D SOT, offering an efficient and reliable new paradigm for LiDAR-based 3D tracking. Abstract: LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, the two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone, without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 4.48% over a strong baseline while running at 56 FPS. Besides, we also demonstrate the strong generalizability of TrajTrack across different base trackers. Video is available at https://www.bilibili.com/video/BV1ahYgzmEWP.

[157] Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision

Tianyao Sun,Dawei Xiang,Tianqi Ding,Xiang Fang,Yijiashun Qi,Zunduo Zhao

Main category: cs.CV

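TL;DR: This paper proposes FusionNet, an end-to-end image fusion framework that achieves semantic-aware infrared and visible image fusion through a modality-aware attention mechanism and a pixel-wise alpha blending module, validated on a public dataset.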

Details Motivation: Infrared and visible image fusion must integrate complementary structural and textural cues from different spectral domains, but existing methods fall short in semantic preservation and interpretability. Method: FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contributions of infrared and visible features, combined with a pixel-wise alpha blending module that learns spatially varying fusion weights. A target-aware loss uses weak ROI supervision to preserve semantic consistency in important regions. Result: Experiments on the M3FD dataset show that FusionNet produces fused images with improved semantic preservation, perceptual quality, and interpretability. Conclusion: FusionNet offers a general and extensible solution for semantic-aware multi-modal image fusion and benefits downstream tasks such as object detection and scene understanding. Abstract: Infrared and visible image fusion (IVIF) is a fundamental task in multi-modal perception that aims to integrate complementary structural and textural cues from different spectral domains. In this paper, we propose FusionNet, a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions. FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features based on their discriminative capacity. To achieve fine-grained, interpretable fusion, we further incorporate a pixel-wise alpha blending module, which learns spatially-varying fusion weights in an adaptive and content-aware manner. Moreover, we formulate a target-aware loss that leverages weak ROI supervision to preserve semantic consistency in regions containing important objects (e.g., pedestrians, vehicles). Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability. Our framework provides a general and extensible solution for semantic-aware multi-modal image fusion, with benefits for downstream tasks such as object detection and scene understanding.
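
Pixel-wise alpha blending itself is a simple operation, sketched below for single-channel images stored as nested lists. In FusionNet the alpha map is learned by the network; here it is passed in by hand, and the name `alpha_blend` is a hypothetical choice for illustration.

```python
def alpha_blend(infrared, visible, alpha):
    """Fuse two same-sized images with a per-pixel weight map.

    fused = alpha * infrared + (1 - alpha) * visible, with alpha in [0, 1].
    """
    fused = []
    for ir_row, vis_row, a_row in zip(infrared, visible, alpha):
        fused.append([a * ir + (1.0 - a) * vis
                      for ir, vis, a in zip(ir_row, vis_row, a_row)])
    return fused


infrared = [[1.0, 0.0]]
visible = [[0.0, 1.0]]
alpha = [[1.0, 0.25]]   # first pixel: infrared only; second: mostly visible
fused = alpha_blend(infrared, visible, alpha)
```

Because the weight map has the same resolution as the images, it can be inspected directly to see which modality each fused pixel draws on, which is where the interpretability claim comes from.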

[158] Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis

Wenhao Tang,Sheng Huang,Heng Fang,Fengtao Zhou,Bo Liu,Qingshan Liu

Main category: cs.CV

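TL;DR: The paper proposes MHIM-MIL, a new multiple instance learning framework that improves histopathology image analysis through masked hard instance mining.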

Details Motivation: Existing attention mechanisms are biased toward easy-to-classify instances and neglect hard ones, which weakens the model's discriminative ability. Method: The framework uses a Siamese structure with a consistency constraint, combining class-aware instance probabilities with a momentum teacher, and mines non-redundant hard instances through random masking and a global recycle network. Result: MHIM-MIL outperforms recent methods on 12 benchmarks (covering cancer diagnosis, subtyping, and survival analysis), with both higher performance and better efficiency. Conclusion: MHIM-MIL improves the accuracy and robustness of WSI analysis in CPath and offers a new direction for MIL framework design. Abstract: Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, survival analysis tasks, and 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
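
Two ingredients named in this abstract, masking the most salient instances and updating the teacher by an exponential moving average, can be sketched as follows. This is a simplified illustration under assumptions: the teacher scores are given as plain floats, the random masking and the global recycle network are omitted, and `mask_salient`, `ema_update`, and the parameter values are hypothetical.

```python
def mask_salient(teacher_scores, mask_ratio):
    """Return indices of instances kept for the student.

    The top `mask_ratio` fraction of instances by teacher score (the easy,
    salient ones) is masked out, so the student trains on the rest.
    """
    n_mask = int(len(teacher_scores) * mask_ratio)
    order = sorted(range(len(teacher_scores)),
                   key=lambda i: teacher_scores[i], reverse=True)
    masked = set(order[:n_mask])
    return [i for i in range(len(teacher_scores)) if i not in masked]


def ema_update(teacher_params, student_params, momentum=0.999):
    """Move each teacher parameter slightly toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


kept = mask_salient([0.9, 0.1, 0.5, 0.3], mask_ratio=0.5)
new_teacher = ema_update([1.0, 0.0], [0.0, 2.0], momentum=0.5)
```

As the student improves, the slowly updated teacher changes which instances it scores as salient, so the set of masked instances shifts over training iterations.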

[159] SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection

Dezhen Wang,Haixiang Zhao,Xiang Shen,Sheng Miao

Main category: cs.CV

TL;DR: Proposes SFGNet, which combines semantic prompts and frequency-domain features to significantly improve performance on camouflaged object detection (COD).

Details Motivation: Existing COD methods overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. Method: Proposes SFGNet, which combines semantic prompts and frequency-domain features, and designs the MBFM and ISEB modules. Result: Experiments on three COD benchmark datasets show that the method outperforms the state of the art. Conclusion: SFGNet handles complex backgrounds and blurred boundaries well and significantly outperforms existing methods. Abstract: Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design a Multi-Band Fourier Module (MBFM) to enhance the ability of the network in handling complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.

[160] How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Weiming Li,Yan Shao,Jing Yang,Yujing Lu,Ling Zhong,Yuhan Wang,Manni Duan

Main category: cs.CV

TL;DR: This paper proposes three zero-shot auxiliary reasoning methods to enhance GUI grounding in vision-language models by incorporating explicit spatial cues, significantly improving their performance without additional data or annotations.

Details Motivation: General vision-language models (VLMs) struggle with GUI grounding due to a lack of specific optimization, particularly when tasked with outputting explicit coordinates, despite having latent grounding potential. Method: Three zero-shot auxiliary reasoning methods were proposed, incorporating explicit spatial cues like axes, grids, and labeled intersections into the input image to improve the explicit spatial articulation of VLMs. These methods were evaluated on four GUI grounding benchmarks across seven VLMs. Result: The evaluation results show that the proposed methods substantially improve the performance of GUI grounding in VLMs. Conclusion: The proposed zero-shot auxiliary reasoning methods significantly enhance the GUI grounding performance of VLMs without requiring additional data or annotations. Abstract: Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
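
The summary mentions explicit spatial cues such as grids and labeled intersections added to the input image. The sketch below shows one way such labeled intersections could be generated and mapped back to pixel coordinates. It is an illustration only: the label scheme (`r{row}c{col}`), the grid step, and the function name `grid_intersections` are assumptions, not details from the paper.

```python
def grid_intersections(width, height, step):
    """Map a text label to the pixel coordinate of each grid intersection.

    A model that answers with a label such as "r1c2" can then be resolved
    to an explicit (x, y) coordinate by a dictionary lookup.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    labels = {}
    for row, y in enumerate(range(0, height + 1, step)):
        for col, x in enumerate(range(0, width + 1, step)):
            labels[f"r{row}c{col}"] = (x, y)
    return labels


# Toy usage: a 200x100 screenshot with a 100-pixel grid has 3 x 2 intersections.
labels = grid_intersections(200, 100, 100)
```

Drawing the grid and labels onto the screenshot is left out here, since it depends on the imaging library in use.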

[161] Gaussian-Plus-SDF SLAM: High-fidelity 3D Reconstruction at 150+ fps

Zhexi Peng,Kun Zhou,Tianjia Shao

Main category: cs.CV

TL;DR: This paper introduces GPS-SLAM, a new 3D reconstruction method that combines Gaussian and SDF representations to achieve real-time performance with high-quality results.

Details Motivation: The motivation is to overcome the computational bottleneck in existing Gaussian-based SLAM methods, which suffer from low frame rates despite photorealistic reconstruction capabilities. Method: The method involves combining a colorized Signed Distance Field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion, while Gaussians undergo iterative optimization. Result: The result is GPS-SLAM, a real-time 3D reconstruction system achieving over 150 fps on Azure Kinect sequences, offering an order-of-magnitude speedup over state-of-the-art techniques while maintaining comparable reconstruction quality. Conclusion: The paper concludes that GPS-SLAM, a hybrid Gaussian-SDF representation, significantly improves computational performance in 3D reconstruction while maintaining quality. Abstract: While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-centric approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data, where insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized Signed Distance Field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-centric methods), while Gaussians undergo iterative optimization. 
Our representation enables drastic Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-Plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences -- delivering an order-of-magnitude speedup over state-of-the-art techniques while maintaining comparable reconstruction quality. We will release the source code and data to facilitate future research.

[162] Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification

Haonan Shi,Yubin Wang,De Cheng,Lingfeng He,Nannan Wang,Xinbo Gao

Main category: cs.CV

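TL;DR: Proposes a Hierarchical Identity Learning (HIL) framework for unsupervised visible-infrared person re-identification that improves cross-modal matching through multi-center contrastive learning and a bidirectional reverse selection transmission mechanism.

Details Motivation: Existing methods for unsupervised visible-infrared person re-identification typically represent a person by a single cluster center, ignoring fine-grained differences among images within the same cluster, which limits performance. Method: Proposes the Hierarchical Identity Learning (HIL) framework, comprising a secondary clustering that generates multiple memory vectors, Multi-Center Contrastive Learning (MCCL) to refine representations, and a Bidirectional Reverse Selection Transmission (BRST) mechanism to establish reliable cross-modal correspondences. Result: Experiments on the SYSU-MM01 and RegDB datasets show that the method outperforms existing approaches and significantly improves cross-modal person re-identification. Conclusion: The HIL framework addresses the tendency of conventional clustering methods to ignore fine-grained differences and, through hierarchical learning and reliable pseudo-label matching, significantly improves the accuracy of unsupervised visible-infrared person re-identification. Abstract: Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets by reducing the modality gap while minimizing reliance on costly manual annotations. Existing methods typically address USVI-ReID using cluster-based contrastive learning, which represents a person by a single cluster center. However, they primarily focus on the commonality of images within each cluster while neglecting the finer-grained differences among them. To address the limitation, we propose a Hierarchical Identity Learning (HIL) framework. Since each cluster may contain several smaller sub-clusters that reflect fine-grained variations among images, we generate multiple memories for each existing coarse-grained cluster via a secondary clustering. Additionally, we propose Multi-Center Contrastive Learning (MCCL) to refine representations for enhancing intra-modal clustering and minimizing cross-modal discrepancies. To further improve cross-modal matching quality, we design a Bidirectional Reverse Selection Transmission (BRST) mechanism, which establishes reliable cross-modal correspondences by performing bidirectional matching of pseudo-labels. Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms existing approaches. The source code is available at: https://github.com/haonanshi0125/HIL.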

[163] Optimizing Class Distributions for Bias-Aware Multi-Class Learning

Mirco Felske,Stefan Stiene

Main category: cs.CV

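TL;DR: Proposes BiCDO, an iterative data-centric framework that identifies Pareto-optimal class distributions for multi-class image classification, improving performance on prioritized classes while reducing bias and variance.

Details Motivation: Safety-critical scenarios require prioritizing classification performance for certain classes, and conventional uniform data distributions cannot effectively balance per-class performance against model bias. Method: Iteratively searches for Pareto-optimal per-class sample distributions that control bias and variance, while supporting existing training pipelines and labelled multi-class datasets. Result: Validated with EfficientNet, ResNet, and ConvNeXt on the CIFAR-10 and iNaturalist21 datasets; the results show more balanced model performance and better overall performance. Conclusion: BiCDO effectively optimizes the data distribution for multi-class image classification, improves reliability on key classes, and is compatible with existing models and datasets, which gives it practical value. Abstract: We propose BiCDO (Bias-Controlled Class Distribution Optimizer), an iterative, data-centric framework that identifies Pareto optimized class distributions for multi-class image classification. BiCDO enables performance prioritization for specific classes, which is useful in safety-critical scenarios (e.g. prioritizing 'Human' over 'Dog'). Unlike uniform distributions, BiCDO determines the optimal number of images per class to enhance reliability and minimize bias and variance in the objective function. BiCDO can be incorporated into existing training pipelines with minimal code changes and supports any labelled multi-class dataset. We have validated BiCDO using EfficientNet, ResNet and ConvNeXt on CIFAR-10 and iNaturalist21 datasets, demonstrating improved, balanced model performance through optimized data distribution.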

[164] MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment

Yanyun Pu,Kehan Li,Zeyi Huang,Zhijie Zhong,Kaixiang Yang

Main category: cs.CV

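TL;DR: Presents MVQA-68K, a multi-dimensional video quality assessment dataset with over 68,000 annotated videos covering seven key quality dimensions, with chain-of-thought reasoning added for interpretability. Experiments show that the dataset significantly improves multimodal large models on the VQA task, reaching state-of-the-art results on several benchmarks, and that introducing an explicit reasoning process during training strengthens zero-shot generalization.

Details Motivation: Traditional video quality assessment methods usually produce only a single numerical score, lacking comprehensiveness and interpretability, which makes them a poor fit for selecting high-quality videos from large-scale pre-training datasets. Method: Builds MVQA-68K, a multi-dimensional VQA dataset with 7 quality dimensions in which each annotation comes with detailed chain-of-thought reasoning, and uses it to train multimodal large language models for better assessment performance and interpretability. Result: Achieves state-of-the-art performance on the internal test set and on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC; introducing an explicit reasoning process also substantially improves zero-shot generalization. Conclusion: MVQA-68K is a comprehensive and highly interpretable multi-dimensional video quality assessment dataset that improves multimodal large models on VQA and supports progress in high-quality video generation and selection. Abstract: With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig.1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meanwhile, incorporating an explicit reasoning process during VQA training substantially boosts the zero-shot generalization. Code and dataset will be available at GitHub: https://github.com/Controller01-ai/MVQA-68K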

[165] Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework

Siming Fu,Sijun Dong,Xiaoliang Meng

Main category: cs.CV

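TL;DR: Proposes HyGDL, a new hybrid generative-discriminative learning framework that addresses the generalization problems caused by shortcut learning in self-supervised learning. By systematically varying a bias (e.g., style) at the input while keeping the supervision signal constant, the method achieves explicit content-style disentanglement.

Details Motivation: Despite its notable success, self-supervised learning has its generalization fundamentally limited by shortcut learning, where models exploit superficial features instead of intrinsic structure. The phenomenon affects both generative and discriminative methods and is the root cause of their failure on unseen domains. Method: Proposes the HyGDL framework, based on the Invariance Pre-training Principle; it operates on a single encoder and uses vector projection to analytically define style as the part of a representation orthogonal to its style-invariant content, which explicitly separates content from style. Result: Experiments verify that shortcut learning is a pervasive problem and indicate that HyGDL mitigates it, promoting more robust content representations. Conclusion: HyGDL offers a new perspective for tackling shortcut learning in self-supervised learning at its root, strengthening cross-domain generalization through explicit content-style disentanglement. Abstract: Despite the remarkable success of Self-Supervised Learning (SSL), its generalization is fundamentally hindered by Shortcut Learning, where models exploit superficial features like texture instead of intrinsic structure. We experimentally verify this flaw within the generative paradigm (e.g., MAE) and argue it is a systemic issue also affecting discriminative methods, identifying it as the root cause of their failure on unseen domains. While existing methods often tackle this at a surface level by aligning or separating domain-specific features, they fail to alter the underlying learning mechanism that fosters shortcut dependency. To address this at its core, we propose HyGDL (Hybrid Generative-Discriminative Learning Framework), a hybrid framework that achieves explicit content-style disentanglement. Our approach is guided by the Invariance Pre-training Principle: forcing a model to learn an invariant essence by systematically varying a bias (e.g., style) at the input while keeping the supervision signal constant. HyGDL operates on a single encoder and analytically defines style as the component of a representation that is orthogonal to its style-invariant content, derived via vector projection.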
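
The abstract defines style as the component of a representation that is orthogonal to its style-invariant content, derived via vector projection. A minimal sketch of that decomposition in plain Python follows. It assumes a content direction is already available as a vector; how HyGDL obtains it is not described here, and the names `dot` and `split_content_style` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_content_style(z, content_dir):
    """Split representation z into a content part and an orthogonal style part.

    content = projection of z onto content_dir
    style   = z - content, which is orthogonal to content_dir by construction
    """
    denom = dot(content_dir, content_dir)
    if denom == 0:
        raise ValueError("content_dir must be non-zero")
    scale = dot(z, content_dir) / denom
    content = [scale * c for c in content_dir]
    style = [a - b for a, b in zip(z, content)]
    return content, style


# Toy usage: with the content direction along the first axis,
# the second coordinate is attributed entirely to style.
content, style = split_content_style([3.0, 4.0], [1.0, 0.0])
```

The two parts always sum back to the original representation, so the split loses no information; it only assigns each component to content or style.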

[166] DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection

Seoik Jung,Taekyung Song,Joshua Jordan Daniel,JinYoung Lee,SungJun Lee

Main category: cs.CV

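TL;DR: Proposes a softmax-based frame allocation strategy and builds two complementary benchmarks (image-based and video-based), validating the approach on frame-level and video-level tasks on the UCF-Crime dataset.

Details Motivation: Existing video anomaly detection benchmarks are limited to either frame-level or video-level tasks and do not give a holistic view of model generalization, so a balanced evaluation framework spanning temporal scales is needed. Method: Proposes a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage, and builds two benchmarks: an image-based benchmark that uses representative frames to evaluate frame-level reasoning, and a video-based benchmark that uses temporally localized segments and adds an abnormality scoring task. Result: Experiments on the UCF-Crime dataset show improvements on both frame-level and video-level tasks, and ablation studies confirm that anomaly-focused sampling beats uniform and random sampling baselines. Conclusion: The proposed sampling strategy and dual-benchmark framework make video anomaly detection evaluation more comprehensive and improve model performance, supporting balanced training and evaluation across temporal scales. Abstract: Video Anomaly Detection (VAD) is critical for surveillance and public safety. However, existing benchmarks are limited to either frame-level or video-level tasks, restricting a holistic view of model generalization. This work first introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage, enabling balanced sampling across temporal scales. Building on this process, we construct two complementary benchmarks. The image-based benchmark evaluates frame-level reasoning with representative frames, while the video-based benchmark extends to temporally localized segments and incorporates an abnormality scoring task. Experiments on UCF-Crime demonstrate improvements at both the frame and video levels, and ablation studies confirm clear advantages of anomaly-focused sampling over uniform and random baselines.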
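
The abstract describes a softmax-based frame allocation that favors anomaly-dense segments while keeping full-video coverage. The sketch below is one plausible reading, not the paper's procedure: it assumes a per-segment anomaly score, a temperature, and a floor of one frame per segment to guarantee coverage. All parameter names and the largest-remainder rounding are assumptions.

```python
import math


def allocate_frames(scores, budget, temperature=1.0, min_per_segment=1):
    """Distribute `budget` frames over segments in proportion to softmax(scores).

    Every segment first receives `min_per_segment` frames (coverage); the
    remaining frames are split by softmax weight, rounded with the
    largest-remainder rule so the total equals `budget` exactly.
    """
    n = len(scores)
    if budget < n * min_per_segment:
        raise ValueError("budget too small to cover every segment")
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    spare = budget - n * min_per_segment
    shares = [spare * w / total for w in weights]
    alloc = [min_per_segment + math.floor(x) for x in shares]
    leftover = budget - sum(alloc)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Toy usage: the third segment has the highest anomaly score.
alloc = allocate_frames([0.0, 0.0, 2.0], budget=16)
```

A lower temperature concentrates frames on the highest-scoring segments, while a very high temperature approaches uniform sampling, which is the baseline the ablation compares against.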

[167] A Controllable 3D Deepfake Generation Framework with Gaussian Splatting

Wending Liu,Siyun Liang,Huy H. Nguyen,Isao Echizen

Main category: cs.CV

TL;DR: This paper introduces a 3D deepfake generation framework using 3D Gaussian Splatting that enables realistic, identity-preserving face swapping and reenactment in a fully controllable 3D space, outperforming 2D methods in multi-view rendering and 3D consistency.

Details Motivation: The motivation is to overcome the limitations of conventional 2D deepfake approaches, such as geometric inconsistencies and limited generalization to novel views, by leveraging 3D Gaussian Splatting for realistic and controllable face swapping and reenactment. Method: The method combines a parametric head model with dynamic Gaussian representations to support multi-view consistent rendering, precise expression control, and seamless background integration. It separates head and background Gaussians and uses pre-trained 2D guidance for facial region optimization, along with a repair module for visual consistency. Result: Experiments show comparable performance to state-of-the-art 2D approaches in identity preservation and pose and expression consistency, with significant improvements in multi-view rendering quality and 3D consistency. Conclusion: The paper concludes that the proposed 3D deepfake generation framework outperforms state-of-the-art 2D approaches in multi-view rendering quality and 3D consistency, bridging the gap between 3D modeling and deepfake synthesis. Abstract: We propose a novel 3D deepfake generation framework based on 3D Gaussian Splatting that enables realistic, identity-preserving face swapping and reenactment in a fully controllable 3D space. Compared to conventional 2D deepfake approaches that suffer from geometric inconsistencies and limited generalization to novel views, our method combines a parametric head model with dynamic Gaussian representations to support multi-view consistent rendering, precise expression control, and seamless background integration. To address editing challenges in point-based representations, we explicitly separate the head and background Gaussians and use pre-trained 2D guidance to optimize the facial region across views. We further introduce a repair module to enhance visual consistency under extreme poses and expressions. 
Experiments on NeRSemble and additional evaluation videos demonstrate that our method achieves comparable performance to state-of-the-art 2D approaches in identity preservation, as well as pose and expression consistency, while significantly outperforming them in multi-view rendering quality and 3D consistency. Our approach bridges the gap between 3D modeling and deepfake synthesis, enabling new directions for scene-aware, controllable, and immersive visual forgeries, revealing the threat that emerging 3D Gaussian Splatting technique could be used for manipulation attacks.

[168] IS-Diff: Improving Diffusion-Based Inpainting with Better Initial Seed

Yongzhe Lyu,Yu Wu,Yutian Lin,Bo Du

Main category: cs.CV

TL;DR: Proposes IS-Diff, a training-free diffusion model with refined initial seeds that samples the initial seed from unmasked regions and adds a dynamic selective refinement mechanism to improve the consistency and harmony of inpainting results.

Details Motivation: In free-form image inpainting, the randomly initialized noise used by conventional diffusion models can introduce mismatched semantics, leaving the inpainted result inconsistent and incoherent with the surrounding area. Method: IS-Diff uses distribution information from unmasked regions to generate the initial seed that guides generation in the masked region, and designs a dynamic selective refinement mechanism that detects unharmonious inpainting results in the intermediate latent and dynamically adjusts the strength of the initialization prior. Result: On standard and large-mask inpainting tasks using the CelebA-HQ, ImageNet, and Places2 datasets, IS-Diff outperforms existing state-of-the-art methods on all metrics. Conclusion: IS-Diff is an efficient training-free inpainting method that significantly improves the semantic consistency and visual quality of inpainting results through distributionally harmonious initial seeds and a dynamic refinement mechanism. Abstract: Diffusion models have shown promising results in free-form inpainting. Recent studies based on refined diffusion samplers or novel architectural designs led to realistic results and high data consistency. However, random initialization seed (noise) adopted in vanilla diffusion process may introduce mismatched semantic information in masked regions, leading to biased inpainting results, e.g., low consistency and low coherence with the other unmasked area. To address this issue, we propose the Initial Seed refined Diffusion Model (IS-Diff), a completely training-free approach incorporating distributional harmonious seeds to produce harmonious results. Specifically, IS-Diff employs initial seeds sampled from unmasked areas to imitate the masked data distribution, thereby setting a promising direction for the diffusion procedure. Moreover, a dynamic selective refinement mechanism is proposed to detect severe unharmonious inpaintings in intermediate latent and adjust the strength of our initialization prior dynamically. We validate our method on both standard and large-mask inpainting tasks using the CelebA-HQ, ImageNet, and Places2 datasets, demonstrating its effectiveness across all metrics compared to state-of-the-art inpainting methods.

[169] WeatherBench: A Real-World Benchmark Dataset for All-in-One Adverse Weather Image Restoration

Qiyuan Guan,Qianfeng Yang,Xiang Chen,Tianyu Song,Guiyue Jin,Jiyu Jin

Main category: cs.CV

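TL;DR: Presents and releases a real-world benchmark dataset for all-in-one adverse weather image restoration, with aligned degraded and clean image pairs under multiple weather conditions. It addresses the domain gap of existing synthetic datasets and supports supervised learning and fair evaluation.

Details Motivation: Existing all-in-one image restoration methods are mostly trained and evaluated on mixtures of single-weather synthetic datasets, which differ widely in resolution, style, and domain characteristics and therefore create substantial domain gaps. The lack of a large-scale, real-world all-in-one weather restoration dataset further limits the development and fair evaluation of unified models. Method: Builds a real-world all-in-one adverse weather image restoration benchmark containing aligned degraded-clean image pairs captured under rain, snow, haze, and other weather conditions, covering diverse outdoor scenes and illumination settings, and runs comprehensive experiments on it with a range of task-specific, task-general, and all-in-one restoration methods. Result: The dataset provides high-quality, precisely aligned real image pairs and narrows the domain gap between synthetic and real data; experiments show that existing methods still have limitations under complex real-world weather, confirming the dataset's value for research on robust, practical all-in-one image restoration. Conclusion: The proposed WeatherBench dataset provides a more reliable and fair evaluation platform for all-in-one image restoration in real-world scenes, helps move the field toward practical applications, and has been publicly released to support follow-up research. Abstract: Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.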

[170] Joint-octamamba: an octa joint segmentation network based on feature enhanced mamba

Chuang Liu,Nan Guo

Main category: cs.CV

TL;DR: Proposes RVMamba, a new architecture based on the Mamba state-space model, and Joint-OCTAMamba, a framework for joint segmentation of retinal vessels and the FAZ in OCTA images, which outperforms existing methods on the OCTA-500 dataset.

Details Motivation: Existing 2D-based retinal vessel segmentation methods are not accurate enough, and joint segmentation models for OCTA show a performance imbalance between tasks. Method: Combines multiple feature extraction modules with the Mamba state-space model to propose RVMamba, FAZMamba, and a unified Joint-OCTAMamba framework for accurate joint segmentation of retinal vessels and the FAZ. Result: Experiments on the OCTA-500 dataset show that Joint-OCTAMamba outperforms existing models across evaluation metrics. Conclusion: The proposed Joint-OCTAMamba framework improves the accuracy of retinal vessel and FAZ segmentation, mitigates the performance imbalance between tasks, and shows good potential for clinical use. Abstract: OCTA is a crucial non-invasive imaging technique for diagnosing and monitoring retinal diseases like diabetic retinopathy, age-related macular degeneration, and glaucoma. Current 2D-based methods for retinal vessel (RV) segmentation offer insufficient accuracy. To address this, we propose RVMamba, a novel architecture integrating multiple feature extraction modules with the Mamba state-space model. Moreover, existing joint segmentation models for OCTA data exhibit performance imbalance between different tasks. To simultaneously improve the segmentation of the foveal avascular zone (FAZ) and mitigate this imbalance, we introduce FAZMamba and a unified Joint-OCTAMamba framework. Experimental results on the OCTA-500 dataset demonstrate that Joint-OCTAMamba outperforms existing models across evaluation metrics. The code is available at https://github.com/lc-sfis/Joint-OCTAMamba.

[171] DTGen: Generative Diffusion-Based Few-Shot Data Augmentation for Fine-Grained Dirty Tableware Recognition

Lifei Hao,Yue Cheng,Baoqi Huang,Bing Jia,Xuandong Zhao

Main category: cs.CV

TL;DR: Proposes DTGen, a few-shot data augmentation scheme for fine-grained dirty tableware recognition, and discusses its application to smart dishwashers.

Details Motivation: Existing methods are limited by coarse-grained classification and the scarcity of few-shot data, making it difficult to meet industrialization requirements. Method: DTGen is based on generative diffusion models; it uses LoRA for efficient domain specialization, generates diverse dirty-tableware images through structured prompts, and ensures data quality through CLIP-based cross-modal filtering. Result: Under extremely limited real few-shot conditions, DTGen can synthesize virtually unlimited high-quality samples, significantly improving classifier performance and supporting fine-grained dirty tableware recognition. Conclusion: DTGen validates the value of generative AI in few-shot industrial vision and provides a feasible deployment path for automated tableware cleaning and food safety monitoring. Abstract: Intelligent tableware cleaning is a critical application in food safety and smart homes, but existing methods are limited by coarse-grained classification and scarcity of few-shot data, making it difficult to meet industrialization requirements. We propose DTGen, a few-shot data augmentation scheme based on generative diffusion models, specifically designed for fine-grained dirty tableware recognition. DTGen achieves efficient domain specialization through LoRA, generates diverse dirty images via structured prompts, and ensures data quality through CLIP-based cross-modal filtering. Under extremely limited real few-shot conditions, DTGen can synthesize virtually unlimited high-quality samples, significantly improving classifier performance and supporting fine-grained dirty tableware recognition. We further elaborate on lightweight deployment strategies, promising to transfer DTGen's benefits to embedded dishwashers and integrate with cleaning programs to intelligently regulate energy consumption and detergent usage. Research results demonstrate that DTGen not only validates the value of generative AI in few-shot industrial vision but also provides a feasible deployment path for automated tableware cleaning and food safety monitoring.

[172] MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Feilong Chen,Yijiang Liu,Yi Huang,Hao Wang,Miren Tian,Ya-Qi Yu,Minghui Liao,Jihao Wu

Main category: cs.CV

TL;DR: MindVL is a multimodal large language model that achieves strong performance with less data through native-resolution Vision Transformers and an optimized training framework.

Details Motivation: To address the degradation in image processing quality caused by fixed-resolution tiling while improving the training efficiency of multimodal models on Ascend NPUs. Method: MindVL adopts native-resolution Vision Transformers and the distributed multimodal training framework Mindspeed-MLLM, and builds up its capabilities through a three-phase training process. Result: Using only about 1/10 of the training data of Qwen2.5-VL, MindVL matches Qwen2.5-VL on evaluations of multimodal understanding and document/table comprehension and leads on OCR evaluations. Conclusion: With less training data, MindVL achieves performance on par with Qwen2.5-VL and performs strongly on OCR evaluations. Abstract: We propose MindVL, a multimodal large language model trained on Ascend NPUs. Similar to Qwen2.5-VL, MindVL adopts native-resolution Vision Transformers, which enables it to process images at their original variable resolutions. This design avoids the degradation caused by fixed-resolution tiling while preserving fine-grained details and global layouts, which is crucial for visually dense content such as complex charts and diagrams. To ensure the smooth training of MindVL on Ascend NPUs, we develop Mindspeed-MLLM, a distributed multimodal training framework tailored for Ascend NPUs. To maintain training accuracy, we implement equivalent replacements for certain operators. MindVL undergoes a three-phase training process, namely the warm-up phase, multitask training phase, and supervised instruction tuning phase, to gradually enhance its capabilities. This process starts with basic visual and multimodal pre-training, followed by large-scale multitask training and instruction tuning. We also adopt multimodal data packaging and hybrid parallelism techniques, which significantly improve end-to-end training speed. To further boost model performance, we specifically introduce test-time resolution search and model weight averaging. Notably, despite using about 1/10 of the training data required by Qwen2.5-VL, MindVL achieves performance on par with Qwen2.5-VL in evaluations of general multimodal understanding and document/table comprehension. Beyond overall scores, MindVL also delivers leading performance in OCR assessments.

[173] RouteExtract: A Modular Pipeline for Extracting Routes from Paper Maps

Bjoern Kremser,Yusuke Matsui

Main category: cs.CV

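TL;DR: Proposes a pipeline for extracting navigable trails from scanned maps that combines georeferencing, U-Net segmentation, graph construction, and refinement with a routing engine, producing routes suitable for GPS navigation.

Details Motivation: Paper maps remain widely used for hiking and sightseeing because they contain curated trails and locally relevant annotations that are often missing from digital navigation applications such as Google Maps. The paper aims to extract navigable trails from scanned maps so that they can be used in GPS-based navigation. Method: The method combines georeferencing, U-Net-based binary segmentation, graph construction, and an iterative refinement procedure using a routing engine. Result: The paper evaluates the full end-to-end pipeline as well as individual components, showing that the approach can robustly recover trail networks from diverse map styles and generate GPS routes suitable for practical use. Conclusion: The paper concludes that the proposed method robustly recovers trail networks from scanned maps and generates GPS routes suitable for practical use. Abstract: Paper maps remain widely used for hiking and sightseeing because they contain curated trails and locally relevant annotations that are often missing from digital navigation applications such as Google Maps. We propose a pipeline to extract navigable trails from scanned maps, enabling their use in GPS-based navigation. Our method combines georeferencing, U-Net-based binary segmentation, graph construction, and an iterative refinement procedure using a routing engine. We evaluate the full end-to-end pipeline as well as individual components, showing that the approach can robustly recover trail networks from diverse map styles and generate GPS routes suitable for practical use.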

[174] IMD: A 6-DoF Pose Estimation Benchmark for Industrial Metallic Objects

Ruimin Ma,Sebastian Zudaire,Zhen Li,Chi Zhang

Main category: cs.CV

TL;DR: The paper introduces a new industrial dataset (IMD) for 6D pose estimation, highlighting its importance for real-world robotic applications where traditional benchmarks fall short.

Details Motivation: Existing benchmarks for object 6D pose estimation mainly use textured, low-reflectivity everyday objects, which do not generalize well to industrial environments where objects are often metallic, texture-less, and reflective. Method: The authors evaluate state-of-the-art models like XMem, SAM2 (for segmentation) and BundleTrack, BundleSDF (for pose estimation) on their newly proposed dataset, which consists of 45 industrial components captured under real-world conditions. Result: The evaluation shows that the proposed industrial dataset is more challenging than existing household object datasets, emphasizing its relevance for industrial robotics applications. Conclusion: The proposed Industrial Metallic Dataset (IMD) serves as a new benchmark for industrial applications, providing a baseline for segmentation and pose estimation algorithms in industrial robotics scenarios. Abstract: Object 6DoF (6D) pose estimation is essential for robotic perception, especially in industrial settings. It enables robots to interact with the environment and manipulate objects. However, existing benchmarks on object 6D pose estimation primarily use everyday objects with rich textures and low reflectivity, limiting model generalization to industrial scenarios where objects are often metallic, texture-less, and highly reflective. To address this gap, we propose a novel dataset and benchmark, namely the Industrial Metallic Dataset (IMD), tailored for industrial applications. Our dataset comprises 45 true-to-scale industrial components, captured with an RGB-D camera under natural indoor lighting and varied object arrangements to replicate real-world conditions. The benchmark supports three tasks, including video object segmentation, 6D pose tracking, and one-shot 6D pose estimation. 
We evaluate existing state-of-the-art models, including XMem and SAM2 for segmentation, and BundleTrack and BundleSDF for pose estimation, to assess model performance in industrial contexts. Evaluation results show that our industrial dataset is more challenging than existing household object datasets. This benchmark provides the baseline for developing and comparing segmentation and pose estimation algorithms that better generalize to industrial robotics scenarios.

[175] Uncertainty-Aware Retinal Vessel Segmentation via Ensemble Distillation

Jeremiah Fadugba,Petru Manescu,Bolanle Oladejo,Delmiro Fernandez-Reyes,Philipp Berens

Main category: cs.CV

TL;DR: Proposes Ensemble Distillation, which distills the knowledge of multiple models into a single model to obtain efficient uncertainty estimation for retinal vessel segmentation.

Details Motivation: Uncertainty estimation is critical for reliable medical image segmentation, particularly in retinal vessel analysis, where accurate predictions are essential for diagnostic applications. Method: Proposes Ensemble Distillation as a robust alternative for uncertainty estimation, distilling the knowledge of multiple ensemble models into a single model. Result: Extensive experiments on the DRIVE and FIVES datasets show that Ensemble Distillation achieves comparable performance on calibration and segmentation metrics while significantly reducing computational complexity. Conclusion: Ensemble Distillation is an effective uncertainty estimation method that offers an efficient and reliable approach for retinal vessel segmentation, making it a promising tool for medical imaging applications. Abstract: Uncertainty estimation is critical for reliable medical image segmentation, particularly in retinal vessel analysis, where accurate predictions are essential for diagnostic applications. Deep Ensembles, where multiple networks are trained individually, are widely used to improve medical image segmentation performance. However, training and testing costs increase with the number of ensembles. In this work, we propose Ensemble Distillation as a robust alternative to commonly used uncertainty estimation techniques by distilling the knowledge of multiple ensemble models into a single model. Through extensive experiments on the DRIVE and FIVES datasets, we demonstrate that Ensemble Distillation achieves comparable performance via calibration and segmentation metrics, while significantly reducing computational complexity. These findings suggest that Ensemble Distillation provides an efficient and reliable approach for uncertainty estimation in the segmentation of the retinal vessels, making it a promising tool for medical imaging applications.
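
The abstract describes distilling multiple ensemble members into a single model but does not give the loss. The sketch below shows a generic distillation target as an illustration: it assumes the members' per-pixel vessel probabilities are averaged into a soft target and the student is trained with a binary cross-entropy against it. The function names and the choice of loss are assumptions, not the authors' formulation.

```python
import math


def ensemble_mean(member_probs):
    """Average per-pixel probabilities over ensemble members.

    `member_probs` is a list with one flat list of probabilities per member.
    """
    n = len(member_probs)
    return [sum(column) / n for column in zip(*member_probs)]


def soft_bce(student_probs, target_probs, eps=1e-7):
    """Mean binary cross-entropy of the student against soft targets."""
    total = 0.0
    for p, t in zip(student_probs, target_probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(student_probs)


# Toy usage: two ensemble members, two pixels.
target = ensemble_mean([[0.2, 0.8], [0.4, 1.0]])
loss_matched = soft_bce(target, target)
loss_uninformed = soft_bce([0.5, 0.5], target)
```

Soft targets keep the disagreement between members (for example 0.3 rather than 0 or 1), which is the signal a single distilled model needs in order to express uncertainty at inference time without running the whole ensemble.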

[176] The Quest for Universal Master Key Filters in DS-CNNs

Zahra Babaiee,Peyman M. Kiassari,Daniela Rus,Radu Grosu

Main category: cs.CV

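TL;DR: This paper extends the "Master Key Filters Hypothesis", finding that depthwise separable convolutional networks converge to just 8 universal filters that resemble Differences of Gaussians, Gaussians, and their derivatives, which offers insight into transfer learning and generalization.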

Details Motivation: A recent study proposed the "Master Key Filters Hypothesis" for convolutional neural network filters. This paper extends the hypothesis by radically constraining its scope to a single set of just 8 universal filters. Method: A systematic unsupervised search extracts these fundamental patterns across different architectures and datasets. Result: Networks initialized with these 8 unique frozen filters achieve over 80% ImageNet accuracy and even outperform models with thousands of trainable parameters when applied to smaller datasets. Conclusion: Depthwise convolutional layers naturally gravitate toward this fundamental set of spatial operators regardless of task or architecture. Abstract: A recent study has proposed the "Master Key Filters Hypothesis" for convolutional neural network filters. This paper extends this hypothesis by radically constraining its scope to a single set of just 8 universal filters that depthwise separable convolutional networks inherently converge to. While conventional DS-CNNs employ thousands of distinct trained filters, our analysis reveals these filters are predominantly linear shifts (ax+b) of our discovered universal set. Through systematic unsupervised search, we extracted these fundamental patterns across different architectures and datasets. Remarkably, networks initialized with these 8 unique frozen filters achieve over 80% ImageNet accuracy, and even outperform models with thousands of trainable parameters when applied to smaller datasets. The identified master key filters closely match Difference of Gaussians (DoGs), Gaussians, and their derivatives, structures that are not only fundamental to classical image processing but also strikingly similar to receptive fields in mammalian visual systems. Our findings provide compelling evidence that depthwise convolutional layers naturally gravitate toward this fundamental set of spatial operators regardless of task or architecture. This work offers new insights for understanding generalization and transfer learning through the universal language of these master key filters.
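The claim that trained filters are "linear shifts (ax+b)" of a master filter can be checked with an ordinary least-squares fit. The sketch below assumes the relation is applied element-wise to flattened filter weights; the helper name and the toy 3x3 kernel are hypothetical and not taken from the paper.

```python
def fit_linear_shift(master, filt):
    """Least-squares fit of filt ~ a * master + b over flattened filter weights.

    Returns (a, b, residual), where residual is the sum of squared errors.
    """
    n = len(master)
    mean_m = sum(master) / n
    mean_f = sum(filt) / n
    var_m = sum((m - mean_m) ** 2 for m in master)
    cov = sum((m - mean_m) * (f - mean_f) for m, f in zip(master, filt))
    a = cov / var_m
    b = mean_f - a * mean_m
    residual = sum((a * m + b - f) ** 2 for m, f in zip(master, filt))
    return a, b, residual

# A toy 3x3 blob-like master filter, flattened row by row.
master = [0.0, 1.0, 0.0, 1.0, 4.0, 1.0, 0.0, 1.0, 0.0]
# A trained filter that is an exact scaled and offset copy of the master.
trained = [-2.0 * m + 0.5 for m in master]
a, b, residual = fit_linear_shift(master, trained)
```

A near-zero residual for the best-matching master filter is what one would expect if the hypothesis holds for a given trained filter; a large residual for every candidate would count against it.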

[177] Advanced Layout Analysis Models for Docling

Nikolaos Livathinos,Christoph Auer,Ahmed Nassar,Rafael Teixeira de Lima,Maksym Lysak,Brown Ebouky,Cesar Berrospi,Michele Dolfi,Panagiotis Vagenas,Matteo Omenetti,Kasper Dinkla,Yusik Kim,Valery Weber,Lucas Morin,Ingmar Meijer,Viktor Kuropiatnyk,Tim Strohmeyer,A. Said Gurbuz,Peter W. J. Staar

Main category: cs.CV

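TL;DR: This report presents and releases five new document layout analysis models integrated into the Docling document-conversion pipeline. Built on the RT-DETR, RT-DETRv2 and DFINE architectures and trained on a 150,000-document corpus, they improve mAP by 20.6%-23.9% over the previous baseline. The best model, "heron-101", reaches 78% mAP at 28 ms/image on an NVIDIA A100, and all models and code are open source.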

Details Motivation: To improve the accuracy and practicality of layout analysis in document conversion, addressing weak detection on complex document structures and the lack of a unified evaluation standard. Method: State-of-the-art object detection architectures (RT-DETR, RT-DETRv2 and DFINE) are trained on a large corpus of 150,000 heterogeneous documents, and post-processing is applied to the raw detections to better suit the document conversion task. Result: The five new models clearly outperform the previous baseline on several document benchmarks, with mAP gains of 20.6%-23.9%. The best model, heron-101, attains 78% mAP at 28 ms/image on an NVIDIA A100, and runtime was also measured on other hardware such as CPUs and Apple GPUs. Conclusion: The report demonstrates an effective recipe for training and deploying high-performance layout analysis models, offers reproducible and extensible best practices for document conversion, and supports the community through open-source release. Abstract: This technical report documents the development of novel Layout Analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2 and DFINE architectures on a heterogeneous corpus of 150,000 documents (both openly available and proprietary). Post-processing steps were applied to the raw detections to make them more applicable to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies while also measuring the runtime performance across different environments (CPU, Nvidia and Apple GPUs). We introduce five new document layout models achieving 20.6% - 23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.

[178] Microsurgical Instrument Segmentation for Robot-Assisted Surgery

Tae Kyeong Jeong,Garam Kim,Juyoun Park

Main category: cs.CV

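TL;DR: The paper proposes MISRA, a framework that augments the RGB input, adds skip attention, and uses an iterative feedback module to substantially improve segmentation of thin structures in microsurgical instruments.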

Details Motivation: Accurate segmentation of thin structures is critical for microsurgical scene understanding but is challenging because of resolution loss, low contrast, and class imbalance. Method: The MISRA segmentation framework augments the RGB input with luminance channels, uses skip attention to preserve elongated features, and employs an Iterative Feedback Module (IFM) that restores structural continuity over multiple passes. A microsurgical dataset with fine-grained annotations is also introduced. Result: Experiments show that MISRA improves mean class IoU by 5.37% over competing methods and gives more stable predictions where instruments touch or overlap. Conclusion: MISRA is a promising step toward reliable scene parsing for computer-assisted and robotic microsurgery. Abstract: Accurate segmentation of thin structures is critical for microsurgical scene understanding but remains challenging due to resolution loss, low contrast, and class imbalance. We propose Microsurgery Instrument Segmentation for Robotic Assistance (MISRA), a segmentation framework that augments RGB input with luminance channels, integrates skip attention to preserve elongated features, and employs an Iterative Feedback Module (IFM) for continuity restoration across multiple passes. In addition, we introduce a dedicated microsurgical dataset with fine-grained annotations of surgical instruments including thin objects, providing a benchmark for robust evaluation. Dataset available at https://huggingface.co/datasets/KIST-HARILAB/MISAW-Seg. Experiments demonstrate that MISRA achieves competitive performance, improving the mean class IoU by 5.37% over competing methods, while delivering more stable predictions at instrument contacts and overlaps. These results position MISRA as a promising step toward reliable scene parsing for computer-assisted and robotic microsurgery.

[179] Bridging the Gap Between Sparsity and Redundancy: A Dual-Decoding Framework with Global Context for Map Inference

Yudong Shen,Wenyu Wu,Jiali Mao,Yixiao Tong,Guoping Liu,Chaoya Wang

Main category: cs.CV

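TL;DR: The paper proposes DGMap, a dual-decoding framework that tackles the road fragmentation and redundancy caused by uneven trajectory density, and clearly outperforms existing methods on real-world datasets.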

Details Motivation: Because trajectory density is uneven, existing automated map inference methods tend to produce broken roads in sparse areas and redundant segments in dense areas, which degrades map quality. Method: DGMap uses Multi-scale Grid Encoding, Mask-enhanced Keypoint Extraction, and Global Context-aware Relation Prediction, combining global semantics with local geometric features to improve keypoint detection accuracy and suppress false connections in dense regions. Result: Experiments on three real-world datasets show that DGMap outperforms state-of-the-art methods by 5% in APLS, with notable gains on trajectory data from the Didi Chuxing platform. Conclusion: By fusing global context with local features, DGMap addresses the challenges of uneven trajectory density and improves the accuracy and robustness of automated map inference. Abstract: Trajectory data has become a key resource for automated map inference due to its low cost, broad coverage, and continuous availability. However, uneven trajectory density often leads to fragmented roads in sparse areas and redundant segments in dense regions, posing significant challenges for existing methods. To address these issues, we propose DGMap, a dual-decoding framework with global context awareness, featuring Multi-scale Grid Encoding, Mask-enhanced Keypoint Extraction, and Global Context-aware Relation Prediction. By integrating global semantic context with local geometric features, DGMap improves keypoint detection accuracy to reduce road fragmentation in sparse-trajectory areas. Additionally, the Global Context-aware Relation Prediction module suppresses false connections in dense-trajectory regions by modeling long-range trajectory patterns. Experimental results on three real-world datasets show that DGMap outperforms state-of-the-art methods by 5% in APLS, with notable performance gains on trajectory data from the Didi Chuxing platform.

[180] Lost in Embeddings: Information Loss in Vision-Language Models

Wenyan Li,Raphael Tang,Chengzu Li,Caiqi Zhang,Ivan Vulić,Anders Søgaard

Main category: cs.CV

TL;DR: The projection step in vision-language models causes significant information loss, impacting performance on retrieval and question-answering tasks.

Details Motivation: The impact of the projection step in vision-language models on information loss and its effect on model capabilities is understudied. Method: Two approaches were used: analyzing k-nearest neighbor relationships before and after projection, and reconstructing visual embeddings at the patch level to localize information loss. Result: Connectors distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, and patch-level loss predicts model struggles in tasks. Conclusion: The projection step in vision-language models leads to significant information loss, affecting retrieval performance and model behavior on question-answering tasks. Abstract: Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
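The first analysis, comparing k-nearest neighbor sets before and after projection, can be sketched as follows. This is an illustrative version that assumes Euclidean distance and a simple mean overlap ratio; the paper's exact distance measure and aggregation are not given in the abstract, and the function names are hypothetical.

```python
import math

def knn(index, vectors, k):
    """Indices of the k nearest neighbours of vectors[index] (Euclidean, excluding itself)."""
    query = vectors[index]
    dists = [(math.dist(query, v), j) for j, v in enumerate(vectors) if j != index]
    dists.sort()
    return {j for _, j in dists[:k]}

def mean_knn_overlap(before, after, k):
    """Average fraction of each item's k neighbours that survive the projection."""
    total = 0.0
    for i in range(len(before)):
        total += len(knn(i, before, k) & knn(i, after, k)) / k
    return total / len(before)

# Toy embeddings for four images, before and after a hypothetical connector.
before = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
after = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (6.0, 0.0)]
overlap = mean_knn_overlap(before, after, k=1)
```

Under this reading, a divergence of 40-60% corresponds to an overlap ratio of roughly 0.4 to 0.6; the toy data above is deliberately scrambled so that no neighbor survives.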

[181] A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications

Hongyuan Zhang,Yuheng Wu,Mingyang Zhao,Zhiwei Chen,Rebecca Li,Fei Zhu,Haohan Zhao,Xiaohua Yuan,Meng Yang,Chunli Qiu,Xiang Cong,Haiyan Chen,Lina Luan,Randolph H. L. Wong,Huai Liao,Colin A Graham,Shi Chang,Guowei Tao,Dong Yi,Zhen Lei,Nassir Navab,Sebastien Ourselin,Jiebo Luo,Hongbin Liu,Gaofeng Meng

Main category: cs.CV

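TL;DR: This study presents EchoCare, a new generalist foundation model for clinical ultrasound that achieves strong performance and broad applicability through self-supervised learning and a hierarchical classifier.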

Details Motivation: The scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have held back clinical AI models for ultrasound. Method: The model is trained with self-supervised learning on EchoCareData, a large multi-center, multi-device, multi-ethnic dataset of 4.5 million ultrasound images, and introduces a hierarchical classifier for joint learning of pixel-level and representation-level features. Result: EchoCare outperforms state-of-the-art comparison models on 10 representative ultrasound benchmarks of varying diagnostic difficulty, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, imaging enhancement, and report generation. Conclusion: EchoCare provides a fully open and generalizable foundation model that can accelerate AI development for diverse clinical ultrasound applications. Abstract: Artificial intelligence (AI) that can effectively learn ultrasound representations by integrating multi-source data holds significant promise for advancing clinical care. However, the scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have hindered the development of generalizable clinical AI models for ultrasound applications. In this study, we present EchoCare, a novel ultrasound foundation model for generalist clinical use, developed via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData. EchoCareData comprises 4.5 million ultrasound images, sourced from over 23 countries across 5 continents and acquired via a diverse range of distinct imaging devices, thus encompassing global cohorts that are multi-center, multi-device, and multi-ethnic. Unlike prior studies that adopt off-the-shelf vision foundation model architectures, we introduce a hierarchical classifier into EchoCare to enable joint learning of pixel-level and representation-level features, capturing both global anatomical contexts and local ultrasound characteristics. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks of varying diagnostic difficulties, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, imaging enhancement and report generation. The code and pretrained model are publicly released, rendering EchoCare accessible for fine-tuning and local adaptation, supporting extensibility to additional applications.
EchoCare provides a fully open and generalizable foundation model to boost the development of AI technologies for diverse clinical ultrasound applications.

[182] Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Pu Jian,Junhong Wu,Wei Sun,Chen Wang,Shuo Ren,Jiajun Zhang

Main category: cs.CV

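TL;DR: This paper proposes Reflection-V, a visual reasoning model with enhanced visual reflection. It builds vision-centered reasoning data and designs a reward based on visual attention, yielding significant gains on multiple visual reasoning benchmarks.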

Details Motivation: In existing vision-language models, attention to visual information drops quickly during long reasoning, so they lack effective visual reflection, which limits their performance on complex visual reasoning tasks. Method: First, an agent that mediates between VLMs and reasoning LLMs is used to construct vision-centered reasoning data for cold-start learning of visual reflection patterns. Second, a reward model based on visual attention is used during reinforcement learning to encourage reasoning that relies on visual information. Result: Reflection-V shows significant improvements on multiple visual reasoning benchmarks and maintains a stronger, more consistent reliance on visual information during reasoning. Conclusion: Through reasoning data construction and reward design, Reflection-V effectively enhances visual reflection and offers a workable path toward slow-thinking reasoning in vision-language models. Abstract: Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: Effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. Therefore, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.

[183] MSMA: Multi-Scale Feature Fusion For Multi-Attribute 3D Face Reconstruction From Unconstrained Images

Danling Cao

Main category: cs.CV

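TL;DR: The paper proposes MSMA, a framework combining multi-scale feature fusion with multi-attribute learning for 3D face reconstruction from a single unconstrained image. On several datasets it matches and sometimes exceeds current state-of-the-art methods.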

Details Motivation: Existing learning-based methods depend on large amounts of annotated 3D face data and struggle to capture detail and multi-scale features under complex conditions, leading to incomplete or inaccurate reconstructions. Method: The MSMA framework combines multi-scale feature fusion with multi-attribute learning, adds a large-kernel attention module to sharpen feature extraction across scales, and constrains training with projection losses to reduce dependence on annotated 3D data. Result: Experiments on MICC Florence, Facewarehouse and a custom-collected dataset show that the method matches or exceeds existing state-of-the-art methods under a range of challenging conditions. Conclusion: MSMA improves the accuracy and detail of single-image 3D face reconstruction in unconstrained settings while reducing reliance on large-scale 3D annotations. Abstract: Reconstructing a 3D face from a single unconstrained image remains a challenging problem due to diverse conditions in unconstrained environments. Recently, learning-based methods have achieved notable results by effectively capturing complex facial structures and details across varying conditions. However, learning-based methods for 3D face reconstruction typically require substantial amounts of 3D facial data, which is difficult and costly to obtain. Consequently, to reduce reliance on labeled 3D face datasets, many existing approaches employ projection-based losses between generated and input images to constrain model training. Nonetheless, despite these advancements, existing approaches frequently struggle to capture detailed and multi-scale features under diverse facial attributes and conditions, leading to incomplete or less accurate reconstructions. In this paper, we propose a Multi-Scale Feature Fusion with Multi-Attribute (MSMA) framework for 3D face reconstruction from unconstrained images. Our method integrates multi-scale feature fusion with a focus on multi-attribute learning and leverages a large-kernel attention module to enhance the precision of feature extraction across scales, enabling accurate 3D facial parameter estimation from a single 2D image. Comprehensive experiments on the MICC Florence, Facewarehouse and custom-collected datasets demonstrate that our approach achieves results on par with current state-of-the-art methods, and in some instances, surpasses SOTA performance across challenging conditions.

[184] Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation for Zero-shot Generalization

Diogo Mendonça,Tiago Barros,Cristiano Premebida,Urbano J. Nunes

Main category: cs.CV

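TL;DR: This paper presents Seg2Track-SAM2, a detector-agnostic multi-object tracking and segmentation framework that needs no fine-tuning. It combines pre-trained detectors with SAM2, achieves SOTA performance on KITTI MOTS, and cuts memory consumption substantially with a sliding-window strategy.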

Details Motivation: Foundation models such as SAM2 perform well on video segmentation, but on MOTS they are limited by weak identity management and poor memory efficiency, so tracking consistency and deployability need to improve. Method: The Seg2Track-SAM2 framework integrates pre-trained detectors with SAM2 and adds a new Seg2Track module that handles track initialization, track management, and reinforcement. A sliding-window memory strategy improves efficiency. Result: The method achieves SOTA results on the KITTI MOT and MOTS benchmarks, ranking fourth in both the car and pedestrian classes on KITTI MOTS and setting a new best in association accuracy (AssA). Memory usage drops by up to 75% with negligible performance loss. Conclusion: With robust zero-shot tracking, better identity preservation, and efficient memory use, Seg2Track-SAM2 advances MOTS and suits resource-constrained deployments. Abstract: Autonomous systems require robust Multi-Object Tracking (MOT) capabilities to operate reliably in dynamic environments. MOT ensures consistent object identity assignment and precise spatial delineation. Recent advances in foundation models, such as SAM2, have demonstrated strong zero-shot generalization for video segmentation, but their direct application to MOTS (MOT+Segmentation) remains limited by insufficient identity management and memory efficiency. This work introduces Seg2Track-SAM2, a framework that integrates pre-trained object detectors with SAM2 and a novel Seg2Track module to address track initialization, track management, and reinforcement. The proposed approach requires no fine-tuning and remains detector-agnostic. Experimental results on KITTI MOT and KITTI MOTS benchmarks show that Seg2Track-SAM2 achieves state-of-the-art (SOTA) performance, ranking fourth overall in both car and pedestrian classes on KITTI MOTS, while establishing a new benchmark in association accuracy (AssA). Furthermore, a sliding-window memory strategy reduces memory usage by up to 75% with negligible performance degradation, supporting deployment under resource constraints. These results confirm that Seg2Track-SAM2 advances MOTS by combining robust zero-shot tracking, enhanced identity preservation, and efficient memory utilization. The code is available at https://github.com/hcmr-lab/Seg2Track-SAM2
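The sliding-window memory idea can be illustrated with a bounded buffer. The sketch below is a generic illustration only: the `SlidingWindowMemory` class is hypothetical, does not use SAM2's actual memory-bank API, and simply assumes that only the most recent frames are retained.

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the most recent `window` frame entries; older ones are evicted."""

    def __init__(self, window):
        # deque with maxlen drops the oldest item automatically on append.
        self.frames = deque(maxlen=window)

    def add(self, frame_id, features):
        self.frames.append((frame_id, features))

    def frame_ids(self):
        return [frame_id for frame_id, _ in self.frames]

memory = SlidingWindowMemory(window=4)
for frame_id in range(10):
    memory.add(frame_id, features=f"mask-features-{frame_id}")
kept = memory.frame_ids()
```

Because the buffer size is fixed, memory cost stays constant with video length instead of growing with every processed frame, which is the property the abstract's reported saving relies on.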

[185] SA-UNetv2: Rethinking Spatial Attention U-Net for Retinal Vessel Segmentation

Changlu Guo,Anders Nymark Christensen,Anders Bjorholm Dahl,Yugen Yi,Morten Rieger Hannemose

Main category: cs.CV

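TL;DR: The paper proposes SA-UNetv2, a lightweight retinal vessel segmentation model that adds cross-scale spatial attention to all skip connections and uses a weighted BCE plus MCC loss. It achieves state-of-the-art performance on public datasets with high efficiency and good deployability.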

Details Motivation: To address SA-UNet's underuse of attention in skip connections and the severe foreground-background imbalance. Method: Cross-scale spatial attention is injected into all skip connections, and the loss combines weighted Binary Cross-Entropy (BCE) with the Matthews Correlation Coefficient (MCC). Result: The model reaches state-of-the-art performance on the DRIVE and STARE datasets with only 1.2MB of memory and 0.26M parameters (less than half of SA-UNet), and 1-second CPU inference on 592 x 592 x 3 images. Conclusion: SA-UNetv2 shows strong efficiency and deployability in resource-constrained, CPU-only settings and is suitable for early disease diagnosis. Abstract: Retinal vessel segmentation is essential for early diagnosis of diseases such as diabetic retinopathy, hypertension, and neurodegenerative disorders. Although SA-UNet introduces spatial attention in the bottleneck, it underuses attention in skip connections and does not address the severe foreground-background imbalance. We propose SA-UNetv2, a lightweight model that injects cross-scale spatial attention into all skip connections to strengthen multi-scale feature fusion and adopts a weighted Binary Cross-Entropy (BCE) plus Matthews Correlation Coefficient (MCC) loss to improve robustness to class imbalance. On the public DRIVE and STARE datasets, SA-UNetv2 achieves state-of-the-art performance with only 1.2MB memory and 0.26M parameters (less than 50% of SA-UNet), and 1 second CPU inference on 592 x 592 x 3 images, demonstrating strong efficiency and deployability in resource-constrained, CPU-only settings.
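A weighted BCE plus MCC loss can be sketched as below. This version assumes a differentiable "soft" MCC computed from probability-weighted confusion counts and an unweighted sum of the two terms; the positive-class weight, the epsilon values, and the way the terms are combined are illustrative assumptions rather than the paper's settings.

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Binary cross-entropy with extra weight on the (rare) vessel class."""
    eps = 1e-7
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def soft_mcc(probs, labels):
    """Matthews correlation coefficient from soft confusion-matrix counts."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / (denom + 1e-7)

def combined_loss(probs, labels, pos_weight=2.0):
    """Weighted BCE plus (1 - MCC); lower is better."""
    return weighted_bce(probs, labels, pos_weight) + (1.0 - soft_mcc(probs, labels))

labels = [1, 0, 0, 1]
good = combined_loss([0.9, 0.1, 0.2, 0.8], labels)
bad = combined_loss([0.1, 0.9, 0.8, 0.2], labels)
```

MCC uses all four confusion-matrix cells, so it does not reward a model for predicting background everywhere, which is the usual argument for adding it when vessel pixels are a small minority.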

[186] FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

Haodong Chen,Haojian Huang,XinXiang Yin,Dian Shao

Main category: cs.CV

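TL;DR: This paper proposes FineQuest, a training-free dual-mode reasoning framework that uses a multimodal knowledge graph to improve question answering on sports videos.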

Details Motivation: LLM-based video question answering shows promise for general video understanding but faces challenges in the complex domain of sports videos, so a more effective solution is needed. Method: FineQuest uses dual-mode reasoning inspired by cognitive science, consisting of Reactive Reasoning and Deliberative Reasoning, and introduces SSGraph, a multimodal sports knowledge scene graph spanning nine sports. Result: FineQuest achieves state-of-the-art performance on the newly proposed Gym-QA and Diving-QA benchmarks and on the existing SPORTU dataset. Conclusion: FineQuest is a training-free framework that performs well on sports VideoQA while retaining strong general VideoQA capabilities. Abstract: Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.

[187] Pseudo-D: Informing Multi-View Uncertainty Estimation with Calibrated Neural Training Dynamics

Ang Nan Gu,Michael Tsang,Hooman Vaseli,Purang Abolmaesumi,Teresa Tsang

Main category: cs.CV

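TL;DR: The paper proposes a framework based on neural network training dynamics (NNTD) that generates pseudo-labels reflecting the uncertainty encountered during learning, improving uncertainty estimation and robustness in medical image classification.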

Details Motivation: Existing models are trained on overly simplistic one-hot labels that ignore diagnostic uncertainty, which makes them overconfident on noisy or ambiguous inputs. Method: The dynamics of neural network training are used to assess how hard each sample is to learn. Model predictions made during training are aggregated and calibrated to produce uncertainty-aware pseudo-labels, which are then used for label augmentation. Result: On a challenging echocardiography classification benchmark, the method outperforms specialized baselines in calibration, selective classification, and multi-view fusion. Conclusion: The framework brings uncertainty into the label space, improves modeling of diagnostic uncertainty, and applies to any supervised learning architecture. Abstract: Computer-aided diagnosis systems must make critical decisions from medical images that are often noisy, ambiguous, or conflicting, yet today's models are trained on overly simplistic labels that ignore diagnostic uncertainty. One-hot labels erase inter-rater variability and force models to make overconfident predictions, especially when faced with incomplete or artifact-laden inputs. We address this gap by introducing a novel framework that brings uncertainty back into the label space. Our method leverages neural network training dynamics (NNTD) to assess the inherent difficulty of each training sample. By aggregating and calibrating model predictions during training, we generate uncertainty-aware pseudo-labels that reflect the ambiguity encountered during learning. This label augmentation approach is architecture-agnostic and can be applied to any supervised learning pipeline to enhance uncertainty estimation and robustness. We validate our approach on a challenging echocardiography classification benchmark, demonstrating superior performance over specialized baselines in calibration, selective classification, and multi-view fusion.
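One simple way to picture "aggregating and calibrating model predictions during training" is to average a sample's predicted class probabilities across epochs and then rescale them. The sketch below assumes plain averaging and a temperature-style rescaling; both are stand-ins chosen for illustration, since the abstract does not specify the aggregation or calibration procedure.

```python
def aggregate_predictions(per_epoch_probs):
    """Average one sample's class probabilities over the recorded training epochs."""
    n = len(per_epoch_probs)
    num_classes = len(per_epoch_probs[0])
    return [sum(epoch[c] for epoch in per_epoch_probs) / n for c in range(num_classes)]

def temperature_scale(probs, temperature):
    """Hypothetical calibration step: temperature > 1 softens, < 1 sharpens."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# An easy sample is predicted confidently throughout training ...
easy = aggregate_predictions([[0.6, 0.4], [0.9, 0.1], [0.9, 0.1]])
# ... while an ambiguous one keeps flipping, giving a soft pseudo-label.
hard = aggregate_predictions([[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
softened = temperature_scale(easy, temperature=2.0)
```

The resulting vectors would replace one-hot targets, so samples the network found hard to learn carry a softer label than samples it learned quickly.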

[188] LFRA-Net: A Lightweight Focal and Region-Aware Attention Network for Retinal Vessel Segmentation

Mehwish Mehmood,Shahzaib Iqbal,Tariq Mahmood Khan,Ivor Spence,Muhammad Fahim

Main category: cs.CV

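TL;DR: This paper proposes LFRA-Net, a lightweight network for accurate and efficient retinal vessel segmentation. It combines focal modulation attention with region-aware attention and outperforms existing methods on several public datasets at low computational cost.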

Details Motivation: Existing deep learning models struggle to extract tiny vessels and are computationally expensive, which makes them hard to use in resource-constrained clinical settings. Method: Focal modulation attention is added at the encoder-decoder bottleneck and region-aware attention is used in selective skip connections to build the lightweight LFRA-Net. Result: On the DRIVE, STARE and CHASE_DB datasets, LFRA-Net achieves Dice scores of 84.28%, 88.44% and 85.50% and Jaccard indices of 72.86%, 79.31% and 74.70%, using only 0.17 million parameters, 0.66 MB of memory and 10.50 GFLOPs. Conclusion: LFRA-Net offers a favorable balance between segmentation accuracy and computational cost and suits real-time clinical use in resource-limited environments. Abstract: Retinal vessel segmentation is critical for the early diagnosis of vision-threatening and systemic diseases, especially in real-world clinical settings with limited computational resources. Although significant improvements have been made in deep learning-based segmentation methods, current models still face challenges in extracting tiny vessels and suffer from high computational costs. In this study, we present LFRA-Net by incorporating focal modulation attention at the encoder-decoder bottleneck and region-aware attention in the selective skip connections. LFRA-Net is a lightweight network optimized for precise and effective retinal vascular segmentation. It enhances feature representation and regional focus by efficiently capturing local and global dependencies. LFRA-Net outperformed many state-of-the-art models while maintaining lightweight characteristics with only 0.17 million parameters, 0.66 MB memory size, and 10.50 GFLOPs. We validated it on three publicly available datasets: DRIVE, STARE, and CHASE_DB. It performed better in terms of Dice score (84.28%, 88.44%, and 85.50%) and Jaccard index (72.86%, 79.31%, and 74.70%) on the DRIVE, STARE, and CHASE_DB datasets, respectively. LFRA-Net provides an ideal ratio between segmentation accuracy and computational cost compared to existing deep learning methods, which makes it suitable for real-time clinical applications in areas with limited resources. The code can be found at https://github.com/Mehwish4593/LFRA-Net.
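The Dice score and Jaccard index reported above are closely related: for a single binary mask pair, J = D / (2 - D). The short sketch below computes both metrics on toy masks and applies the identity to the reported DRIVE Dice score. Function names are illustrative, and the identity holds exactly only per mask pair, so averages over a test set need not match it exactly.

```python
def dice(pred, target):
    """Dice score for two binary masks given as flat lists of 0/1."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 2.0 * inter / (sum(pred) + sum(target))

def jaccard(pred, target):
    """Jaccard index (intersection over union) for two binary masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - inter
    return inter / union

def jaccard_from_dice(d):
    """Convert a Dice score to the Jaccard index for the same mask pair."""
    return d / (2.0 - d)

pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
d = dice(pred, target)      # one overlapping pixel out of two per mask
j = jaccard(pred, target)
drive_estimate = jaccard_from_dice(0.8428)
```

Applying the conversion to the DRIVE Dice score of 84.28% gives roughly 72.8%, close to the reported Jaccard index of 72.86%; the small gap is consistent with the metrics being averaged over images.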

[189] SpecVLM: Fast Speculative Decoding in Vision-Language Models

Haiduo Huang,Fuwei Yang,Zhenhua Liu,Xuanwu Yin,Dong Li,Pengju Ren,Emad Barsoum

Main category: cs.CV

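TL;DR: This paper presents SpecVLM, a practical speculative decoding system for vision-language models (VLMs). Using elastic visual compression and an online-logit distillation protocol, it delivers 2.5-2.9x end-to-end speedups while leaving the output distribution unchanged.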

Details Motivation: Applying speculative decoding directly to vision-language models is costly because the number of visual tokens grows with image resolution and video length, inflating compute and memory, so a more efficient way to accelerate inference is needed. Method: SpecVLM includes a strong baseline, EagleVLM, and adds an elastic visual compressor that adaptively chooses among pruning, pooling, convolution, and resampler strategies. It also introduces an online-logit distillation protocol that needs no offline distillation corpus and trains on logits and penultimate-layer features generated on the fly by the teacher model. Result: Within 5 epochs, SpecVLM achieves 2.5-2.9x end-to-end speedups on LLaVA and MMMU, better than the EagleVLM baseline (1.5-2.3x), consistently across resolutions and task difficulties, while preserving the target model's output distribution. Conclusion: With elastic compression and an efficient online training strategy, SpecVLM markedly improves VLM inference efficiency and offers a practical acceleration scheme for deployment. Abstract: Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5-2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5-2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding).
Our code is available at https://github.com/haiduo/SpecVLM.
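The "combined cross-entropy and Smooth L1 objective" used for online-logit distillation can be sketched for a single token position as follows. This is an illustrative reading: it assumes cross-entropy between teacher and draft token distributions, Smooth L1 between teacher and draft penultimate features, and a hypothetical `feature_weight` to balance the two terms.

```python
import math

def softmax(logits):
    """Numerically stable softmax over vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(draft_logits, teacher_logits):
    """Cross-entropy of the draft distribution against the teacher distribution."""
    teacher = softmax(teacher_logits)
    draft = softmax(draft_logits)
    return -sum(t * math.log(max(d, 1e-12)) for t, d in zip(teacher, draft))

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def online_distill_loss(draft_logits, teacher_logits,
                        draft_feats, teacher_feats, feature_weight=1.0):
    """Logit term plus weighted feature term (the weighting is an assumption)."""
    return (soft_cross_entropy(draft_logits, teacher_logits)
            + feature_weight * smooth_l1(draft_feats, teacher_feats))

loss = online_distill_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0],
                           [0.3, -0.1], [0.5, 0.4])
```

Because the teacher logits and features are produced during the training step itself, nothing has to be stored or preprocessed ahead of time, which is the storage saving the abstract describes.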

[190] MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation

Liying Wang,Xiaoli Zhang,Chuanmin Jia,Siwei Ma

Main category: cs.CV

TL;DR: The paper proposes MAFS, a unified framework for infrared-visible image fusion and semantic segmentation that enhances semantic-aware capabilities and improves feature-level fusion through a parallel network structure and dynamic task weighting.

Details Motivation: The motivation is to address the lack of exploration into the reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks in existing semantic-driven methods. The authors aim to enhance semantic-aware capabilities and improve feature-level fusion for downstream applications. Method: The authors propose a unified network for image fusion and semantic segmentation called MAFS, which contains a fusion sub-network and a segmentation sub-network. They introduce a heterogeneous feature fusion strategy, a multi-stage Transformer decoder, and a dynamic factor based on max-min fairness allocation for adaptive task weighting. Result: Extensive experiments show that the proposed MAFS approach achieves competitive results compared to state-of-the-art methods in both image fusion and semantic segmentation tasks. Conclusion: The paper concludes that the proposed MAFS framework achieves competitive results in infrared-visible image fusion and semantic segmentation, demonstrating the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks. Abstract: Infrared-visible image fusion methods aim to generate fused images with good visual quality and to facilitate the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion.
On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights of two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.

[191] Probabilistic Robustness Analysis in High Dimensional Space: Application to Semantic Segmentation Network

Navid Hashemi,Samuel Sasaki,Diego Manzanas Lopez,Ipek Oguz,Meiyi Ma,Taylor T. Johnson

Main category: cs.CV

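TL;DR: The paper proposes a scalable, architecture-agnostic probabilistic verification framework that combines sampling-based reachability analysis with conformal inference (CI) to give reliable, non-conservative safety guarantees for semantic segmentation networks.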

Details Motivation: Existing probabilistic verification methods cannot cope with the high dimensionality and complexity of modern semantic segmentation tasks, so their guarantees are too conservative to be practical. Method: Sampling-based reachability analysis is combined with conformal inference (CI), and new strategies are designed to reduce CI's conservatism on high-dimensional outputs, enabling efficient probabilistic verification of semantic segmentation networks. Result: The method is validated on large-scale segmentation models across CamVid, OCTA-500, Lung Segmentation, and Cityscapes, substantially tightening bounds compared with the state of the art while remaining rigorous. Conclusion: The framework provides scalable, non-conservative, provable safety verification for high-dimensional semantic segmentation models, has practical potential, and is supported by an open-source toolbox. Abstract: Semantic segmentation networks (SSNs) play a critical role in domains such as medical imaging, autonomous driving, and environmental monitoring, where safety hinges on reliable model behavior under uncertainty. Yet, existing probabilistic verification approaches struggle to scale with the complexity and dimensionality of modern segmentation tasks, often yielding guarantees that are too conservative to be practical. We introduce a probabilistic verification framework that is both architecture-agnostic and scalable to high-dimensional outputs. Our approach combines sampling-based reachability analysis with conformal inference (CI) to deliver provable guarantees while avoiding the excessive conservatism of prior methods. To counteract CI's limitations in high-dimensional settings, we propose novel strategies that reduce conservatism without compromising rigor. Empirical evaluation on large-scale segmentation models across CamVid, OCTA-500, Lung Segmentation, and Cityscapes demonstrates that our method provides reliable safety guarantees while substantially tightening bounds compared to SOTA. We also provide a toolbox implementing this technique, available on Github.
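For readers unfamiliar with conformal inference, the basic building block is a quantile of calibration scores with a finite-sample correction. The sketch below shows generic split conformal prediction, not the paper's high-dimensional variant: given n nonconformity scores from sampled inputs and a miscoverage level alpha, the threshold is the ceil((n + 1)(1 - alpha))-th smallest score.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal threshold covering a fresh score with probability >= 1 - alpha.

    Returns infinity when there are too few calibration samples for the
    requested coverage level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

# Hypothetical nonconformity scores, e.g. output deviations on sampled perturbations.
calibration_scores = [5, 1, 4, 2, 7, 3, 6]
threshold = conformal_threshold(calibration_scores, alpha=0.25)
```

With 7 samples and alpha = 0.25 the rank is ceil(8 x 0.75) = 6, so the threshold is the sixth smallest score. The guarantee assumes calibration and test samples are exchangeable, and applying it naively per output dimension is one place where conservatism can creep in for large segmentation maps.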

[192] Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation

Tim Lebailly,Vijay Veerabadran,Satwik Kottur,Karl Ridgeway,Michael Louis Iuzzolino

Main category: cs.CV

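TL;DR: Proposes a new method that densely aligns image and language modalities using synthetic descriptions generated by generative VLMs; it performs strongly on zero-shot open-vocabulary segmentation and is data-efficient.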

Details Motivation: Generative VLMs excel at high-level image understanding but lack spatially dense alignment between vision and language modalities, which motivates more effective alignment methods. Method: Uses synthetic descriptions generated by generative VLMs and densely aligns images with them, improving the efficiency and quality of alignment. Result: Outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks/datasets while also being more data-efficient. Conclusion: Synthetic captions bridge the gap between generative vision-language models (VLMs) and representation learning for vision-language alignment, enabling denser alignment between image and language modalities. Abstract: Generative vision-language models (VLMs) exhibit strong high-level image understanding but lack spatially dense alignment between vision and language modalities, as our findings indicate. Orthogonal to advancements in generative VLMs, another line of research has focused on representation learning for vision-language alignment, targeting zero-shot inference for dense tasks like segmentation. In this work, we bridge these two directions by densely aligning images with synthetic descriptions generated by VLMs. Synthetic captions are inexpensive, scalable, and easy to generate, making them an excellent source of high-level semantic understanding for dense alignment methods. Empirically, our approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks/datasets, while also being more data-efficient.

[193] Segmentation-Driven Initialization for Sparse-view 3D Gaussian Splatting

Yi-Hsin Li,Thomas Sikora,Sebastian Knorr,Mårten Sjöström

Main category: cs.CV

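TL;DR: Proposes SDI-GS, a segmentation-driven initialization method for Gaussian Splatting that uses region-based segmentation to reduce the number of redundant 3D Gaussians in sparse-view synthesis, significantly lowering memory consumption and training time while preserving rendering quality.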

Details Motivation: Existing 3D Gaussian Splatting methods rely on SfM for camera pose estimation, which performs poorly with sparse views, while SfM-free methods that use MVS generate large numbers of redundant 3D Gaussians, leading to high memory overhead. Method: Uses region-based segmentation to identify and retain structurally significant regions, and selectively downsamples the dense point cloud to reduce the number of 3D Gaussians for efficient initialization. Result: Experiments show the method reduces Gaussian count by up to 50%, achieves comparable or better rendering quality in PSNR and SSIM with only marginal degradation in LPIPS, and speeds up training while lowering memory usage. Conclusion: SDI-GS improves the efficiency and practicality of 3D Gaussian Splatting in sparse-view scenarios, offering a viable approach to novel view synthesis under resource constraints. Abstract: Sparse-view synthesis remains a challenging problem due to the difficulty of recovering accurate geometry and appearance from limited observations. While recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time rendering with competitive quality, existing pipelines often rely on Structure-from-Motion (SfM) for camera pose estimation, an approach that struggles in genuinely sparse-view settings. Moreover, several SfM-free methods replace SfM with multi-view stereo (MVS) models, but generate massive numbers of 3D Gaussians by back-projecting every pixel into 3D space, leading to high memory costs. We propose Segmentation-Driven Initialization for Gaussian Splatting (SDI-GS), a method that mitigates inefficiency by leveraging region-based segmentation to identify and retain only structurally significant regions. This enables selective downsampling of the dense point cloud, preserving scene fidelity while substantially reducing Gaussian count. Experiments across diverse benchmarks show that SDI-GS reduces Gaussian count by up to 50% and achieves comparable or superior rendering quality in PSNR and SSIM, with only marginal degradation in LPIPS. It further enables faster training and lower memory footprint, advancing the practicality of 3DGS for constrained-view scenarios.

[194] Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

Haodi Ma,Vyom Pathak,Daisy Zhe Wang

Main category: cs.CV

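TL;DR: Proposes SG-VLM, a framework that combines symbolic scene graphs (SGs) with vision language models (VLMs) to improve causal and temporal reasoning in video question answering.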

Details Motivation: Existing VLMs rely on shallow correlations in video question answering, leading to weak temporal grounding and poor interpretability, so structured intermediate representations are needed to strengthen reasoning. Method: Proposes SG-VLM, which keeps the VLM frozen and integrates symbolic scene graphs as intermediate grounding signals through prompting and visual localization modules, in a modular fashion. Result: Validated on three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs; SG-VLM improves causal and temporal reasoning, though gains over strong VLMs are limited. Conclusion: Symbolic grounding shows promise for video understanding but has limitations; the study offers direction for future hybrid VLM-symbolic approaches. Abstract: Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs' holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.

[195] Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Meng Luo,Shengqiong Wu,Liqiang Jing,Tianjie Ju,Li Zheng,Jinxiang Lai,Tianlong Wu,Xinya Du,Jian Li,Siyuan Yan,Jiebo Luo,William Yang Wang,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

TL;DR: Proposes the Dr.V framework, comprising Dr.V-Bench and Dr.V-Agent, for detecting hallucinations in large video models.

Details Motivation: Large video models suffer from hallucinations, and a reliable method is needed to detect and address them. Method: Builds a hierarchical framework comprising Dr.V-Bench and Dr.V-Agent, applying fine-grained spatial-temporal grounding and cognitive-level reasoning. Result: Dr.V-Bench contains 10k instances; Dr.V-Agent performs well at detecting hallucinations and improves interpretability. Conclusion: Dr.V-Agent effectively diagnoses hallucinations, improving the reliability and interpretability of video understanding. Abstract: Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.

[196] Multi-animal tracking in Transition: Comparative Insights into Established and Emerging Methods

Anne Marthe Sophie Ngo Bibinbe,Patrick Gagnon,Jamie Ahloy-Dallaire,Eric R. Paquet

Main category: cs.CV

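TL;DR: This study compares multi-animal tracking (MAT) and multi-object tracking (MOT) methods in livestock farming and finds that MOT methods perform better and can improve the accuracy and reliability of automated tracking.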

Details Motivation: Precision livestock farming requires advanced monitoring tools to meet the industry's growing management needs, and computer vision systems capable of long-term multi-animal tracking are essential for continuous behavioral monitoring. Method: Evaluates several methods, including DeepLabCut, idTracker, ByteTrack, DeepSORT, cross-input consistency, Track-Anything, and PromptTrack, on a 10-minute pig tracking dataset. Result: MOT methods outperform traditional MAT tools even in livestock multi-animal tracking scenarios, which matters for downstream tasks such as behavior analysis and health state estimation. Conclusion: Overall, MOT methods outperform traditional MAT tools even for long-term tracking, indicating that recent MOT techniques can improve the accuracy and reliability of automated livestock tracking. Abstract: Precision livestock farming requires advanced monitoring tools to meet the increasing management needs of the industry. Computer vision systems capable of long-term multi-animal tracking (MAT) are essential for continuous behavioral monitoring in livestock production. MAT, a specialized subset of multi-object tracking (MOT), shares many challenges with MOT, but also faces domain-specific issues including frequent animal occlusion, highly similar appearances among animals, erratic motion patterns, and a wide range of behavior types. While some existing MAT tools are user-friendly and widely adopted, they often underperform compared to state-of-the-art MOT methods, which can result in inaccurate downstream tasks such as behavior analysis, health state estimation, and related applications. In this study, we benchmarked both MAT and MOT approaches for long-term tracking of pigs. We compared tools such as DeepLabCut and idTracker with MOT-based methods including ByteTrack, DeepSORT, cross-input consistency, and newer approaches like Track-Anything and PromptTrack. All methods were evaluated on a 10-minute pig tracking dataset. Our results demonstrate that, overall, MOT approaches outperform traditional MAT tools, even for long-term tracking scenarios. These findings highlight the potential of recent MOT techniques to enhance the accuracy and reliability of automated livestock tracking.

[197] Do It Yourself (DIY): Modifying Images for Poems in a Zero-Shot Setting Using Weighted Prompt Manipulation

Sofia Jamil,Kotla Sai Charan,Sriparna Saha,Koustava Goswami,K J Joseph

Main category: cs.CV

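TL;DR: Proposes a new Weighted Prompt Manipulation (WPM) technique for generating and improving images for poems in a zero-shot setting, enhancing visual expressiveness by adjusting attention weights and text embeddings in diffusion models.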

Details Motivation: Readers often interpret poems differently based on their own emotions, experiences, and cultural backgrounds, so a method is needed that lets users flexibly modify the images associated with a poem. Method: Introduces Weighted Prompt Manipulation (WPM), which dynamically adjusts the attention weights and text embeddings of specific words in diffusion models, combined with large language models (such as GPT) and existing poetry datasets for image generation. Result: WPM effectively enhances or suppresses the influence of key words in the generated image, producing semantically richer and more contextually accurate visualizations. Conclusion: This is the first application of weighted prompt manipulation to enhancing imagery for poetic language, offering a novel and systematic approach to image generation in the literary domain. Abstract: Poetry is an expressive form of art that invites multiple interpretations, as readers often bring their own emotions, experiences, and cultural backgrounds into their understanding of a poem. Recognizing this, we aim to generate images for poems and improve these images in a zero-shot setting, enabling audiences to modify images as per their requirements. To achieve this, we introduce a novel Weighted Prompt Manipulation (WPM) technique, which systematically modifies attention weights and text embeddings within diffusion models. By dynamically adjusting the importance of specific words, WPM enhances or suppresses their influence in the final generated image, leading to semantically richer and more contextually accurate visualizations. Our approach exploits diffusion models and large language models (LLMs) such as GPT in conjunction with existing poetry datasets, ensuring a comprehensive and structured methodology for improved image generation in the literary domain. To the best of our knowledge, this is the first attempt at integrating weighted prompt manipulation for enhancing imagery in poetic language.

[198] SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection

Zhenni Yu,Li Zhao,Guobao Xiao,Xiaoqin Zhang

Main category: cs.CV

TL;DR: The paper proposes SAM-TTT for Camouflaged Object Detection, enhancing the Segment Anything Model by addressing adverse parameters and reinforcing advantageous ones, resulting in state-of-the-art performance.

Details Motivation: The motivation is to address the identified gap in existing SAM-based Camouflaged Object Detection models, which insufficiently address adverse parameters that impair the model's semantic understanding in downstream tasks. Method: The method involves two modules: the Reverse SAM Parameter Configuration Module, which mitigates adverse parameters in a train-free manner, and the T-Visioner Module, which enhances advantageous parameters using Test-Time Training layers. These modules work together to improve the model's performance. Result: The experimental results show that the proposed SAM-TTT approach achieves state-of-the-art performance on various Camouflaged Object Detection benchmarks, setting a new benchmark in the field. Conclusion: The paper concludes that the proposed SAM-TTT approach significantly improves the semantic understanding of the Segment Anything Model in Camouflaged Object Detection tasks, achieving state-of-the-art performance. Abstract: This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. 
Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.

[199] BREA-Depth: Bronchoscopy Realistic Airway-geometric Depth Estimation

Francis Xiatian Zhang,Emile Mackute,Mohammadreza Kasaei,Kevin Dhaliwal,Robert Thomson,Mohsen Khadem

Main category: cs.CV

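TL;DR: Proposes the Brea-Depth framework, which incorporates airway geometric priors to improve the accuracy and anatomical consistency of monocular depth estimation in bronchoscopy.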

Details Motivation: Existing depth foundation models lack anatomical awareness in bronchoscopy, are easily misled by local textures, and struggle with ambiguous depth cues and poor lighting. Method: Introduces airway-specific geometric priors derived from anatomical data, uses a depth-aware CycleGAN to align real images with airway geometries, and designs an airway structure awareness loss to maintain depth consistency within the lumen and structural integrity. Result: Outperforms existing methods on an ex vivo human lung dataset and an open bronchoscopic dataset, markedly improving anatomical depth preservation and the robustness of 3D reconstruction. Conclusion: By incorporating anatomical priors, Brea-Depth improves the accuracy and structural plausibility of monocular bronchoscopic depth estimation, making it suitable for navigation and intervention in complex airways. Abstract: Monocular depth estimation in bronchoscopy can significantly improve real-time navigation accuracy and enhance the safety of interventions in complex, branching airways. Recent advances in depth foundation models have shown promise for endoscopic scenarios, yet these models often lack anatomical awareness in bronchoscopy, overfitting to local textures rather than capturing the global airway structure, particularly under ambiguous depth cues and poor lighting. To address this, we propose Brea-Depth, a novel framework that integrates airway-specific geometric priors into foundation model adaptation for bronchoscopic depth estimation. Our method introduces a depth-aware CycleGAN, refining the translation between real bronchoscopic images and airway geometries from anatomical data, effectively bridging the domain gap. In addition, we introduce an airway structure awareness loss to enforce depth consistency within the airway lumen while preserving smooth transitions and structural integrity. By incorporating anatomical priors, Brea-Depth enhances model generalization and yields more robust, accurate 3D airway reconstructions. To assess anatomical realism, we introduce Airway Depth Structure Evaluation, a new metric for structural consistency. We validate BREA-Depth on a collected ex vivo human lung dataset and an open bronchoscopic dataset, where it outperforms existing methods in anatomical depth preservation.

[200] Logit Mixture Outlier Exposure for Fine-grained Out-of-Distribution Detection

Akito Shinohara,Kohei Fukuda,Hiroaki Aizawa

Main category: cs.CV

TL;DR: This paper proposes a logit-space interpolation method to enhance out-of-distribution detection by smoothing class boundaries and improving model consistency.

Details Motivation: Existing out-of-distribution detection methods like Outlier Exposure and Mixture Outlier Exposure struggle to effectively distinguish between in-distribution and out-of-distribution data. This work aims to improve class separation by focusing on the logit space. Method: A linear interpolation technique in the logit space was proposed to mix in-distribution and out-of-distribution data. Consistency between logit-space and input-space mixing was enforced. Result: The technique reduces abrupt fluctuations in model outputs near decision boundaries, leading to smoother and more reliable separation between in-distribution and out-of-distribution data. Conclusion: The proposed logit-space mixing technique improves the detection of out-of-distribution data, especially those close to in-distribution samples, by smoothing the model outputs near decision boundaries. Abstract: The ability to detect out-of-distribution data is essential not only for ensuring robustness against unknown or unexpected input data but also for improving the generalization performance of the model. Among various out-of-distribution detection methods, Outlier Exposure and Mixture Outlier Exposure are promising approaches that enhance out-of-distribution detection performance by exposing the model to outlier data during training. However, even with these sophisticated techniques, it remains challenging for models to learn the relationships between classes effectively and to clearly distinguish between in-distribution and out-of-distribution samples. Therefore, we focus on the logit space, where the properties between class-wise distributions are distinctly separated from those in the input or feature spaces. 
Specifically, we propose a linear interpolation technique in the logit space that mixes in-distribution and out-of-distribution data to facilitate smoothing logits between classes and improve the out-of-distribution detection performance, particularly for out-of-distribution data that lie close to the in-distribution data. Additionally, we enforce consistency between the logits obtained through mixing in the logit space and those generated via mixing in the input space. Our experiments demonstrate that our logit-space mixing technique reduces the abrupt fluctuations in the model outputs near the decision boundaries, resulting in smoother and more reliable separation between in-distribution and out-of-distribution data. Furthermore, we evaluate the effectiveness of the proposed method on a fine-grained out-of-distribution detection task.
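The two mixing paths described in the abstract can be sketched as follows: interpolate the logits of two samples, interpolate the inputs and then compute logits, and penalize the gap between the results. This is only an illustration under stated assumptions: `toy_logits` is a hypothetical stand-in for a trained network, and the squared-error form of the consistency term is an assumption rather than the paper's loss.

```python
import math


def mix(a, b, lam):
    """Element-wise linear interpolation: lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def toy_logits(x):
    """Hypothetical two-class model with a tanh nonlinearity (illustration only)."""
    h = [math.tanh(v) for v in x]
    return [h[0] - h[1], h[1] - h[0]]


def consistency_loss(x_in, x_out, lam):
    """Squared gap between logit-space mixing and input-space mixing."""
    mixed_logits = mix(toy_logits(x_in), toy_logits(x_out), lam)  # mix after the model
    logits_of_mixed = toy_logits(mix(x_in, x_out, lam))           # mix before the model
    return sum((p - q) ** 2 for p, q in zip(mixed_logits, logits_of_mixed))


x_id = [2.0, -1.0]    # stand-in for an in-distribution sample
x_ood = [-0.5, 0.5]   # stand-in for an outlier sample
print(consistency_loss(x_id, x_ood, lam=0.5))
```

For a purely linear model the two paths coincide and the term vanishes; the nonlinearity is what makes the consistency constraint informative.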

[201] Integrating Prior Observations for Incremental 3D Scene Graph Prediction

Marian Renz,Felix Igelbrink,Martin Atzmueller

Main category: cs.CV

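TL;DR: Proposes an incremental heterogeneous graph model for 3D semantic scene graph prediction that incorporates multi-modal information, improving expressiveness and applicability in real-world environments.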

Details Motivation: Existing 3DSSG methods rely mainly on sensor data and do not integrate semantically rich environmental information; most also assume access to complete scene reconstructions, limiting their applicability in real-world incremental settings. Method: Uses a multi-layer heterogeneous graph model that flexibly combines global and local scene representations and integrates multi-modal information directly into message passing, without specialized modules or full scene reconstructions. Result: Evaluation on the 3DSSG dataset shows that incorporating multi-modal information gives the model good scalability and generalization in complex real-world environments. Conclusion: The paper presents a new heterogeneous graph model for incremental 3D semantic scene graph (3DSSG) prediction that integrates multi-modal information (such as CLIP semantic embeddings and prior observations) to provide a scalable and generalizable solution for complex real-world environments. Abstract: 3D semantic scene graphs (3DSSG) provide compact structured representations of environments by explicitly modeling objects, attributes, and relationships. While 3DSSGs have shown promise in robotics and embodied AI, many existing methods rely mainly on sensor data, not integrating further information from semantically rich environments. Additionally, most methods assume access to complete scene reconstructions, limiting their applicability in real-world, incremental settings. This paper introduces a novel heterogeneous graph model for incremental 3DSSG prediction that integrates additional, multi-modal information, such as prior observations, directly into the message-passing process. Utilizing multiple layers, the model flexibly incorporates global and local scene representations without requiring specialized modules or full scene reconstructions. We evaluate our approach on the 3DSSG dataset, showing that GNNs enriched with multi-modal information such as semantic embeddings (e.g., CLIP) and prior observations offer a scalable and generalizable solution for complex, real-world environments. The full source code of the presented architecture will be made available at https://github.com/m4renz/incremental-scene-graph-prediction.

[202] NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition

Zilin Li,Weiwei Xu,Xuanqi Zhao,Yiran Zhu

Main category: cs.CV

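TL;DR: NeuroGaze-Distill is a novel cross-modal distillation method that transfers EEG-derived information into an image-only facial emotion recognition model, improving cross-dataset generalization through static valence/arousal prototypes and a depression-inspired geometric prior.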

Details Motivation: Facial emotion recognition (FER) models based only on pixels generalize poorly across datasets because facial appearance is an indirect and biased proxy for underlying affect. Method: A teacher is trained on EEG topographic maps from DREAMER (with MAHNOB-HCI as unlabeled support) and produces a frozen 5x5 valence/arousal prototype grid. The student (ResNet-18/50) is trained on FERPlus with conventional cross-entropy/knowledge distillation plus two lightweight regularizers: Proto-KD (cosine) aligns student features to the static prototypes; D-Geo softly shapes the embedding geometry in line with affective findings reported in depression research. Result: Evaluated on FERPlus validation and cross-dataset protocols (AffectNet-mini; optional CK+); prototypes and D-Geo contribute consistent gains, and the 5x5 grid is more stable than denser grids. Conclusion: NeuroGaze-Distill is a cross-modal distillation framework that transfers EEG-derived information into an image-only emotion recognition model via static valence/arousal prototypes and a depression-inspired geometric prior, requiring no EEG-face pairing and no non-visual signals at deployment. Abstract: Facial emotion recognition (FER) models trained only on pixels often fail to generalize across datasets because facial appearance is an indirect and biased proxy for underlying affect. We present NeuroGaze-Distill, a cross-modal distillation framework that transfers brain-informed priors into an image-only FER student via static Valence/Arousal (V/A) prototypes and a depression-inspired geometric prior (D-Geo). A teacher trained on EEG topographic maps from DREAMER (with MAHNOB-HCI as unlabeled support) produces a consolidated 5x5 V/A prototype grid that is frozen and reused; no EEG-face pairing and no non-visual signals at deployment are required. The student (ResNet-18/50) is trained on FERPlus with conventional CE/KD and two lightweight regularizers: (i) Proto-KD (cosine) aligns student features to the static prototypes; (ii) D-Geo softly shapes the embedding geometry in line with affective findings often reported in depression research (e.g., anhedonia-like contraction in high-valence regions). We evaluate both within-domain (FERPlus validation) and cross-dataset protocols (AffectNet-mini; optional CK+), reporting standard 8-way scores alongside present-only Macro-F1 and balanced accuracy to fairly handle label-set mismatch. Ablations attribute consistent gains to prototypes and D-Geo, and favor 5x5 over denser grids for stability. The method is simple, deployable, and improves robustness without architectural complexity.
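A cosine-based prototype alignment term of the kind named Proto-KD above might look like the sketch below. The abstract only says the regularizer is cosine-based, so the `1 - cosine` form, the function names, and the three-dimensional toy vectors are all assumptions for illustration, not the authors' definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proto_alignment_loss(feature, prototype):
    """1 - cosine similarity: 0 when aligned, 1 when orthogonal, 2 when opposite."""
    return 1.0 - cosine_similarity(feature, prototype)


# Hypothetical frozen prototype for one valence/arousal grid cell.
prototype = [1.0, 0.0, 0.0]
aligned = proto_alignment_loss([3.0, 0.0, 0.0], prototype)
orthogonal = proto_alignment_loss([0.0, 2.0, 0.0], prototype)
print(aligned, orthogonal)
```

Because cosine similarity ignores vector length, a term like this constrains only the direction of the student feature, which is one plausible reason to prefer it when teacher and student embeddings differ in scale.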

[203] Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI

Bo Cao,Fan Yu,Mengmeng Feng,SenHao Zhang,Xin Meng,Yue Zhang,Zhen Qian,Jie Lu

Main category: cs.CV

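TL;DR: Proposes a method based on variational inference and multimodal knowledge distillation (VMD) that leverages radiologists' domain knowledge to automate the diagnosis of carotid plaque vulnerability, improving diagnostic accuracy on unannotated 3D MRI images.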

Details Motivation: Diagnosing plaque vulnerability directly from carotid 3D MRI images is challenging for both radiologists and conventional 3D vision networks; in clinical practice, judgment draws on multiple modalities and expert knowledge. Method: Proposes the Variation inference and Multimodal knowledge Distillation (VMD) strategy, which uses cross-modality prior knowledge from limited image annotations and radiology reports, integrating imaging data and domain knowledge through multimodal learning. Result: In-depth experiments on an in-house dataset verify the effectiveness of the proposed VMD strategy, which significantly improves diagnostic accuracy on unannotated 3D MRI images. Conclusion: VMD effectively fuses radiologists' domain knowledge with multimodal data, improves automated plaque vulnerability diagnosis, and has potential for clinical application. Abstract: Multimodal learning has attracted much attention in recent years due to its ability to effectively utilize data features from a variety of different modalities. Diagnosing the vulnerability of atherosclerotic plaques directly from carotid 3D MRI images is relatively challenging for both radiologists and conventional 3D vision networks. In clinical practice, radiologists assess patient conditions using a multimodal approach that incorporates various imaging modalities and domain-specific expertise, paving the way for the creation of multimodal diagnostic networks. In this paper, we have developed an effective strategy to leverage radiologists' domain knowledge to automate the diagnosis of carotid plaque vulnerability through Variation inference and Multimodal knowledge Distillation (VMD). This method excels in harnessing cross-modality prior knowledge from limited image annotations and radiology reports within training data, thereby enhancing the diagnostic network's accuracy for unannotated 3D MRI images. We conducted in-depth experiments on the dataset collected in-house and verified the effectiveness of the VMD strategy we proposed.

[204] Graph Algorithm Unrolling with Douglas-Rachford Iterations for Image Interpolation with Guaranteed Initialization

Xue Zhang,Bingshuo Hu,Gene Cheung

Main category: cs.CV

TL;DR: This paper introduces a novel neural network approach for image interpolation that reduces parameters and improves performance by integrating graph signal processing and Douglas-Rachford iterations.

Details Motivation: The motivation is to overcome the limitations of conventional deep neural networks, such as the risk of poor-performing local minima due to random initialization, by employing a more structured and interpretable approach based on graph signal processing. Method: The method involves initializing a directed graph adjacency matrix based on an interpolator, learning perturbation matrices to enhance performance, and implementing restoration through unrolled Douglas-Rachford iterations in a neural network. Result: The experimental results show state-of-the-art performance in image interpolation while significantly reducing the number of network parameters. Conclusion: The paper concludes that by utilizing graph shift variation priors and unrolling Douglas-Rachford iterations into a neural network, they achieve state-of-the-art results in image interpolation with fewer network parameters. Abstract: Conventional deep neural nets (DNNs) initialize network parameters at random and then optimize each one via stochastic gradient descent (SGD), resulting in substantial risk of poor-performing local minima. Focusing on the image interpolation problem and leveraging a recent theorem that maps a (pseudo-)linear interpolator Θ to a directed graph filter that is a solution to a MAP problem regularized with a graph shift variation (GSV) prior, we first initialize a directed graph adjacency matrix A based on a known interpolator Θ, establishing a baseline performance. Then, towards further gain, we learn perturbation matrices P and P(2) from data to augment A, whose restoration effects are implemented via Douglas-Rachford (DR) iterations, which we unroll into a lightweight interpretable neural net. Experimental results demonstrate state-of-the-art image interpolation results, while drastically reducing network parameters.
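For readers unfamiliar with Douglas-Rachford (DR) iterations, the following sketch runs generic DR splitting on a toy problem: minimizing 0.5 * ||x - b||^2 subject to box constraints. It is not the paper's graph-based MAP problem; the objective, the function names, and the step size are assumptions chosen so that both proximal operators have closed forms.

```python
def prox_quadratic(z, b, gamma):
    """Prox of f(x) = 0.5 * ||x - b||^2 with step gamma."""
    return [(zi + gamma * bi) / (1.0 + gamma) for zi, bi in zip(z, b)]


def project_box(z, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n, i.e. clipping."""
    return [min(max(zi, lo), hi) for zi in z]


def douglas_rachford(b, lo, hi, gamma=1.0, iterations=100):
    """Minimize 0.5 * ||x - b||^2 subject to lo <= x_i <= hi."""
    z = [0.0] * len(b)
    for _ in range(iterations):
        x = prox_quadratic(z, b, gamma)                       # first prox step
        reflected = [2.0 * xi - zi for xi, zi in zip(x, z)]   # reflect through x
        y = project_box(reflected, lo, hi)                    # second prox step
        z = [zi + yi - xi for zi, yi, xi in zip(z, y, x)]     # update the DR variable
    # Final clip keeps the returned point feasible.
    return project_box(prox_quadratic(z, b, gamma), lo, hi)


solution = douglas_rachford([2.0, -1.0, 0.3], lo=0.0, hi=1.0)
print(solution)
```

In algorithm unrolling, each pass through a loop like this becomes one network layer with a fixed iteration count, and selected quantities (here only `gamma` would qualify) are learned from data instead of being hand-set.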

[205] Sphere-GAN: a GAN-based Approach for Saliency Estimation in 360° Videos

Mahmoud Z. A. Wahba,Sara Baldoni,Federica Battisti

Main category: cs.CV

TL;DR: Sphere-GAN is a novel model for 360° video saliency detection using GAN with spherical convolutions, which outperforms current state-of-the-art methods.

Details Motivation: With the rise of immersive applications, there is a need to develop new approaches for processing and transmitting 360° images and videos, particularly in identifying visually relevant areas through saliency estimation. Method: Sphere-GAN uses a Generative Adversarial Network with spherical convolutions for saliency detection in 360° videos. Result: Sphere-GAN demonstrated superior performance in predicting saliency maps compared to existing models based on experiments with a public 360° video saliency dataset. Conclusion: Sphere-GAN is an effective model for 360° video saliency detection, outperforming existing state-of-the-art models. Abstract: The recent success of immersive applications is pushing the research community to define new approaches to process 360° images and videos and optimize their transmission. Among these, saliency estimation provides a powerful tool that can be used to identify visually relevant areas and, consequently, adapt processing algorithms. Although saliency estimation has been widely investigated for 2D content, very few algorithms have been proposed for 360° saliency estimation. Towards this goal, we introduce Sphere-GAN, a saliency detection model for 360° videos that leverages a Generative Adversarial Network with spherical convolutions. Extensive experiments were conducted using a public 360° video saliency dataset, and the results demonstrate that Sphere-GAN outperforms state-of-the-art models in accurately predicting saliency maps.

[206] CLAIRE: A Dual Encoder Network with RIFT Loss and Phi-3 Small Language Model Based Interpretability for Cross-Modality Synthetic Aperture Radar and Optical Land Cover Segmentation

Debopom Sutradhar,Arefin Ittesafun Abian,Mohaimenul Azam Khan Raiaan,Reem E. Mohamed,Sheikh Izzal Azid,Sami Azam

Main category: cs.CV

TL;DR: This paper proposes an improved dual encoder architecture with a cross-modality fusion module and hybrid loss function for accurate land cover classification from satellite imagery, achieving strong performance and interpretability across multiple datasets.

Details Motivation: Accurate land cover classification from satellite imagery is essential for environmental monitoring and resource management, but challenges such as complex landscapes, visual similarity between classes, and class imbalance hinder performance. Method: A dual encoder architecture was developed to extract features from optical and SAR imagery, fused using the CLAIRE cross-modality attention-fusion module. A hybrid loss function (RIFT) combining Weighted Focal Loss and Tversky Loss was used to address class imbalance. Additionally, a metric-driven reasoning module based on a Small Language Model (Phi-3) was introduced to enhance interpretability. Result: The model achieved a mIoU of 56.02% and OA of 84.56% on the WHU-OPT-SAR dataset, a mIoU of 59.89% and OA of 73.92% on the OpenEarthMap-SAR dataset, and robust performance with a mIoU of 86.86% and OA of 94.58% on the PIE-RGB-SAR dataset under cloud-obstructed conditions. Conclusion: The proposed dual encoder architecture with the CLAIRE fusion module and RIFT loss function achieves competitive performance in land cover classification, showing robustness under cloud-obstructed conditions and strong generalization across datasets. Abstract: Accurate land cover classification from satellite imagery is crucial in environmental monitoring and sustainable resource management. However, it remains challenging due to the complexity of natural landscapes, the visual similarity between classes, and the significant class imbalance in the available datasets. To address these issues, we propose a dual encoder architecture that independently extracts modality-specific features from optical and Synthetic Aperture Radar (SAR) imagery, which are then fused using a cross-modality attention-fusion module named Cross-modality Land cover segmentation with Attention and Imbalance-aware Reasoning-Enhanced Explanations (CLAIRE). 
This fusion mechanism highlights complementary spatial and textural features, enabling the network to better capture detailed and diverse land cover patterns. We incorporate a hybrid loss function that utilizes Weighted Focal Loss and Tversky Loss named RIFT (Rare-Instance Focal-Tversky) to address class imbalance and improve segmentation performance across underrepresented categories. Our model achieves competitive performance across multiple benchmarks: a mean Intersection over Union (mIoU) of 56.02% and Overall Accuracy (OA) of 84.56% on the WHU-OPT-SAR dataset; strong generalization with a mIoU of 59.89% and OA of 73.92% on the OpenEarthMap-SAR dataset; and remarkable robustness under cloud-obstructed conditions, achieving an mIoU of 86.86% and OA of 94.58% on the PIE-RGB-SAR dataset. Additionally, we introduce a metric-driven reasoning module based on a Small Language Model (Phi-3), which generates expert-level, sample-specific justifications for model predictions, thereby enhancing transparency and interpretability.
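The two ingredients of the RIFT loss named above can be sketched for the binary case as follows. The abstract does not say how the Weighted Focal and Tversky terms are combined, so the convex combination in `focal_tversky_sum`, the default hyperparameters, and all function names are assumptions rather than the paper's formulation.

```python
import math


def weighted_focal_loss(probs, targets, class_weights=(1.0, 1.0), gamma=2.0):
    """Binary focal loss with per-class weights; probs are P(class 1)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p       # probability of the true class
        p_t = min(max(p_t, 1e-12), 1.0)      # avoid log(0)
        total += -class_weights[t] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def focal_tversky_sum(probs, targets, mix=0.5):
    """Hypothetical convex combination of the two losses."""
    focal = weighted_focal_loss(probs, targets)
    return mix * focal + (1.0 - mix) * tversky_loss(probs, targets)


targets = [1, 0, 1, 0]
confident = focal_tversky_sum([0.9, 0.1, 0.8, 0.2], targets)
uncertain = focal_tversky_sum([0.6, 0.4, 0.5, 0.5], targets)
print(confident, uncertain)
```

Setting `beta` above `alpha` penalizes missed positives more than false alarms, which is the usual motivation for a Tversky term when rare classes are under-segmented; the focal exponent `gamma` plays a complementary role by down-weighting pixels that are already classified confidently.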

[207] Learning to Generate 4D LiDAR Sequences

Ao Liang,Youquan Liu,Yu Yang,Dongyue Lu,Linfeng Li,Lingdong Kong,Huaici Zhao,Wei Tsang Ooi

Main category: cs.CV

TL;DR: LiDARCrafter is a unified framework that converts language into editable LiDAR sequences, addressing challenges in LiDAR generation for 3D perception.

Details Motivation: LiDAR generation remains underexplored despite its importance for accurate 3D perception, and extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. Method: LiDARCrafter uses a tri-branch diffusion model to transform parsed language instructions into object layouts, trajectories, and shapes, followed by a range-image diffusion model and an autoregressive module to generate temporally coherent LiDAR sequences. Result: LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency on nuScenes, with explicit layout design supporting object-level editing like insertion or relocation. Conclusion: LiDARCrafter provides a foundation for LiDAR-based simulation and data augmentation, achieving state-of-the-art fidelity, controllability, and temporal consistency on nuScenes. Abstract: While generative world models have advanced video and occupancy-based data synthesis, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. We present LiDARCrafter, a unified framework that converts free-form language into editable LiDAR sequences. Instructions are parsed into ego-centric scene graphs, which a tri-branch diffusion model transforms into object layouts, trajectories, and shapes. A range-image diffusion model generates the initial scan, and an autoregressive module extends it into a temporally coherent sequence. The explicit layout design further supports object-level editing, such as insertion or relocation. To enable fair assessment, we provide EvalSuite, a benchmark spanning scene-, object-, and sequence-level metrics. On nuScenes, LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency, offering a foundation for LiDAR-based simulation and data augmentation.

[208] Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness

Zixuan Fu,Yan Ren,Finn Carter,Chenyue Wen,Le Ku,Daheng Yu,Emily Davis,Bo Zhang

Main category: cs.CV

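TL;DR: SCORE is a new concept-erasure framework for diffusion models that casts erasure as an adversarial independence problem, robustly removing sensitive concepts while preserving generative capability.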

Details Motivation: Diffusion models excel at image generation but pose privacy, fairness, and security risks, so sensitive or harmful concepts must be erased from them while preserving overall generative capability. Method: Concept erasure is formulated as an adversarial independence problem: minimizing the mutual information between the target concept and the generated outputs makes the two statistically independent and yields provable erasure guarantees. Result: On Stable Diffusion and FLUX, SCORE outperforms existing methods including EraseAnything, ANT, MACE, ESD, and UCE, with up to 12.5% higher erasure efficacy while maintaining image quality. Conclusion: By combining adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models. Abstract: Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to erase sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce SCORE (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an adversarial independence problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to 12.5% higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
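
The independence criterion in the SCORE abstract can be made concrete with a toy calculation: mutual information is zero exactly when the concept and the output are statistically independent. The sketch below is only an illustration for a small discrete joint distribution; the function name and the example tables are assumptions, and it is not the estimator or training objective used in the paper.

```python
import math

def mutual_information(joint):
    """Mutual information I(C; X) in nats for a discrete joint table.

    joint[i][j] is P(C = i, X = j); the entries are assumed to sum to 1.
    """
    p_c = [sum(row) for row in joint]        # marginal over the concept
    p_x = [sum(col) for col in zip(*joint)]  # marginal over the output
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_c[i] * p_x[j]))
    return mi

# Hypothetical tables: outputs independent of the concept vs. fully determined by it.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
```

Under these assumptions the independent table gives zero mutual information, which is the target state for an erased concept, while the fully dependent table gives log 2 nats.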

[209] RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration

Zilong Zhang,Chujie Qin,Chunle Guo,Yong Zhang,Chao Xue,Ming-Ming Cheng,Chongyi Li

Main category: cs.CV

TL;DR: RAM++ is a two-stage image restoration framework that integrates semantic understanding and texture generation to achieve robust performance across diverse degradation scenarios.

Details Motivation: Existing degradation-oriented methods face challenges in extreme scenarios, unbalanced performance across tasks, overfitting to seen degradations, and weak generalization to unseen ones. RAM++ aims to overcome these issues with a content-oriented robust restoration approach. Method: RAM++ uses a two-stage framework with three key designs: Adaptive Semantic-Aware Mask (AdaSAM) for pretraining, Mask Attribute Conductance (MAC) for fine-tuning, and Robust Feature Regularization (RFR) for feature fusion and degradation-invariant representations. Result: RAM++ achieves state-of-the-art performance across seen, unseen, extreme, and mixed degradations, with robustness, balance, and improved generalization. Conclusion: RAM++ effectively addresses the limitations of existing degradation-oriented methods by integrating high-level semantic understanding and low-level texture generation, achieving robust and balanced performance across various degradation scenarios. Abstract: This work presents Robust Representation Learning via Adaptive Mask (RAM++), a two-stage framework for all-in-one image restoration. RAM++ integrates high-level semantic understanding with low-level texture generation to achieve content-oriented robust restoration. It addresses the limitations of existing degradation-oriented methods in extreme scenarios (e.g., degradations strongly coupled with image structures). RAM++ also mitigates common challenges such as unbalanced performance across tasks, overfitting to seen degradations, and weak generalization to unseen ones through three key designs: 1) Adaptive Semantic-Aware Mask (AdaSAM): a pretraining strategy that applies pixel-level masks to semantically rich and textured regions. This design enables the network to learn both generative priors and image content priors from various degradations. 
2) Mask Attribute Conductance (MAC): a selective fine-tuning strategy that adjusts the layers with higher contributions to bridge the integrity gap between masked pretraining and full-image fine-tuning while retaining learned priors. 3) Robust Feature Regularization (RFR): a strategy that leverages DINOv2's semantically consistent and degradation-invariant representations, together with efficient feature fusion, to achieve faithful and semantically coherent restoration. With these designs, RAM++ achieves robust, well-balanced, and state-of-the-art performance across seen, unseen, extreme, and mixed degradations. Our code and model will be released at https://github.com/DragonisCV/RAM

[210] Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Bingyu Li,Haocheng Dong,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li

Main category: cs.CV

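TL;DR: This paper proposes RSKT-Seg, an open-vocabulary segmentation framework for remote sensing images, and builds OVRSISBench, a standardized evaluation benchmark for the task. With multi-directional cost map aggregation, an efficient fusion transformer, and a knowledge transfer module, it outperforms existing methods in both accuracy and inference speed.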

Details Motivation: Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS) has been held back by the lack of a unified evaluation benchmark and by the domain gap between natural and remote sensing images, so models and evaluation protocols designed for remote sensing are needed. Method: The RSKT-Seg framework has three core components: Multi-Directional Cost Map Aggregation (RS-CMA), an Efficient Cost Map Fusion transformer (RS-Fusion), and a Remote Sensing Knowledge Transfer module (RS-Transfer). It is trained and evaluated on the newly built OVRSISBench benchmark. Result: On OVRSISBench, RSKT-Seg outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC with 2x faster inference. Conclusion: RSKT-Seg addresses the challenges of open-vocabulary segmentation in remote sensing imagery and gives the field a reliable benchmark and a high-performing solution. Abstract: Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.

[211] Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking

Zirui Zheng,Takashi Isobe,Tong Shen,Xu Jia,Jianbin Zhao,Xiaomin Li,Mengmeng Ge,Baolu Li,Qinghe Wang,Dong Li,Dong Zhou,Yunzhi Zhuge,Huchuan Lu,Emad Barsoum

Main category: cs.CV

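TL;DR: Proposes SMARLI, a layout-to-image generation framework based on structured masking that integrates spatial layout constraints into autoregressive models, improving generation quality and layout accuracy.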

Details Motivation: Autoregressive models perform well in image generation, but layout-conditioned generation is difficult because layout conditions are sparse and features can become entangled, so an effective layout control mechanism is needed. Method: A structured masking strategy is applied to attention computation to govern the interaction among global prompt, layout, and image tokens, combined with a Group Relative Policy Optimization (GRPO) based post-training scheme with specially designed layout reward functions. Result: Experiments show that SMARLI integrates layout, text, and image tokens seamlessly and achieves superior layout-aware control without sacrificing generation quality, while keeping the structural simplicity and generation efficiency of autoregressive models. Conclusion: SMARLI gives autoregressive models an effective layout control method, achieving high-quality generation with precise layout alignment in layout-to-image tasks. Abstract: While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layout-to-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip the AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate a Group Relative Policy Optimization (GRPO) based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layout-aware control while maintaining the structural simplicity and generation efficiency of AR models.
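
The SMARLI abstract says a structured mask governs which of the global prompt, layout, and image tokens may interact, so that a region's description is not associated with a different region. The abstract does not give the exact masking rules, so the sketch below uses rules invented for illustration: the token encoding, the function name, and every rule in `allowed` are assumptions, and causal ordering is ignored.

```python
def build_structured_mask(tokens):
    """Return allowed[q][k]: may query token q attend to key token k?

    Each token is a (kind, region) pair with kind in {"global", "layout", "image"}.
    region is an int for layout tokens and for image tokens inside a layout box,
    otherwise None. The rules are illustrative assumptions, not the paper's mask.
    """
    def allowed(query, key):
        q_kind, q_region = query
        k_kind, k_region = key
        if k_kind == "layout":
            # A region description is visible only to tokens tied to that region.
            return q_region is not None and q_region == k_region
        if q_kind == "layout":
            # Layout tokens read nothing outside their own region description.
            return False
        if q_kind == "global":
            # Global-prompt tokens read only the global prompt.
            return k_kind == "global"
        # Image tokens read the global prompt and other image tokens.
        return True

    return [[allowed(q, k) for k in tokens] for q in tokens]

# Hypothetical sequence: one prompt token, two region descriptions, three image tokens.
tokens = [("global", None), ("layout", 0), ("layout", 1),
          ("image", 0), ("image", 1), ("image", None)]
mask = build_structured_mask(tokens)
```

With this toy mask, the image token in region 0 can read the description of region 0 but not that of region 1, which is the kind of mis-association the abstract says the mask is meant to prevent.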

[212] A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

Haiyu Yang,Enhong Liu,Jennifer Sun,Sumit Sharma,Meike van Leerdam,Sebastien Franceschini,Puchun Niu,Miel Hostens

Main category: cs.CV

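TL;DR: Proposes a modular pipeline built on open-source computer vision techniques to automate animal behavior analysis in group housing, showing high accuracy and robustness in indoor pig monitoring.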

Details Motivation: Traditional manual observation is time-consuming, subjective, and hard to scale, and cannot meet modern agriculture's need for precise monitoring of animal welfare, health, and productivity. Method: The pipeline combines zero-shot object detection, motion-aware tracking and segmentation, and feature extraction with vision transformers into a modular automated behavior analysis workflow. Result: On the Edinburgh Pig Behavior Video Dataset, the temporal model reaches 94.2% overall accuracy, 21.2 percentage points above existing methods, with a 93.3% identity preservation score for tracking and 89.3% object detection precision. Conclusion: The modular pipeline performs well for pig behavior monitoring and is scalable and adaptable. It may extend to other species, subject to further validation, and supports precision farming and welfare assessment. Abstract: Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware tracking and segmentation, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with 93.3% identity preservation score and 89.3% object detection precision. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.

[213] AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective

Yuchen Deng,Xiuyang Wu,Hai-Tao Zheng,Suiyang Zhang,Yi He,Yuxing Han

Main category: cs.CV

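TL;DR: Proposes AvatarSync, an autoregressive framework for text- or audio-driven talking-head generation that uses a two-stage strategy to synthesize high-fidelity, temporally consistent animation efficiently.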

Details Motivation: Existing GAN- or diffusion-based methods suffer from inter-frame flicker, identity drift, and slow inference, which limits their practical use. Method: An autoregressive framework models phoneme representations in two stages. The first stage, Facial Keyframe Generation (FKG), generates facial keyframes through a Phoneme-to-Visual Mapping combined with a customized Text-Frame Causal Attention Mask. The second stage produces smooth transitions with a timestamp-aware adaptive interpolation strategy based on a selective state space model, which supports bidirectional context reasoning. The inference pipeline is also optimized to reduce latency. Result: Experiments show that AvatarSync outperforms existing methods in visual fidelity, temporal consistency, and computational efficiency. Conclusion: AvatarSync offers a scalable, controllable, and deployment-friendly solution for talking-head animation. Abstract: Existing talking-head animation approaches based on Generative Adversarial Networks (GANs) or diffusion models often suffer from inter-frame flicker, identity drift, and slow inference. These limitations inherent to their video generation pipelines restrict their suitability for applications. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. In addition, AvatarSync adopts a two-stage generation strategy, decoupling semantic modeling from visual dynamics, which is a deliberate "Divide and Conquer" design. The first stage, Facial Keyframe Generation (FKG), focuses on phoneme-level semantic representation by leveraging the many-to-one mapping from text or audio to phonemes. A Phoneme-to-Visual Mapping is constructed to anchor abstract phonemes to character-level units. Combined with a customized Text-Frame Causal Attention Mask, the keyframes are generated. The second stage, inter-frame interpolation, emphasizes temporal coherence and visual smoothness. We introduce a timestamp-aware adaptive strategy based on a selective state space model, enabling efficient bidirectional context reasoning. To support deployment, we optimize the inference pipeline to reduce latency without compromising visual fidelity. Extensive experiments show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.

[214] Robust Fetal Pose Estimation across Gestational Ages via Cross-Population Augmentation

Sebastian Diaz,Benjamin Billot,Neel Dey,Molin Zhang,Esra Abaci Turk,P. Ellen Grant,Polina Golland,Elfar Adalsteinsson

Main category: cs.CV

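TL;DR: Proposes a cross-population data augmentation framework that lets pose estimation models trained on annotated older gestational age (GA) data generalize reliably to younger GA fetal cohorts, improving early motion analysis in 4D fetal imaging.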

Details Motivation: Existing fetal motion tracking methods work well at older gestational ages (GA) but generalize poorly to early GA, mainly because maternal and fetal anatomy change substantially across gestation and annotated early-GA data are hard to obtain. A method is needed that improves early-GA performance without early-GA annotations. Method: A fetal-specific augmentation strategy simulates the intrauterine environment and fetal positioning characteristic of early GA. Through this cross-population augmentation, pose estimation models trained only on annotated older-GA images generalize better to younger-GA fetuses. Result: Experiments show that the augmentation reduces variability and significantly improves performance on both older-GA cases and the more challenging early-GA cases, giving more robust pose estimation across gestation. Conclusion: The cross-population augmentation framework addresses the poor early-GA generalization of fetal pose estimation models and offers a practical route to reliable 4D fetal motion analysis without large amounts of early-GA annotation, which may support early clinical detection and intervention. Abstract: Fetal motion is a critical indicator of neurological development and intrauterine health, yet its quantification remains challenging, particularly at earlier gestational ages (GA). Current methods track fetal motion by predicting the location of annotated landmarks on 3D echo planar imaging (EPI) time-series, primarily in third-trimester fetuses. The predicted landmarks enable simplification of the fetal body for downstream analysis. While these methods perform well within their training age distribution, they consistently fail to generalize to early GAs due to significant anatomical changes in both mother and fetus across gestation, as well as the difficulty of obtaining annotated early GA EPI data. In this work, we develop a cross-population data augmentation framework that enables pose estimation models to robustly generalize to younger GA clinical cohorts using only annotated images from older GA cohorts. Specifically, we introduce a fetal-specific augmentation strategy that simulates the distinct intrauterine environment and fetal positioning of early GAs. Our experiments find that cross-population augmentation yields reduced variability and significant improvements across both older GA and challenging early GA cases. By enabling more reliable pose estimation across gestation, our work potentially facilitates early clinical detection and intervention in challenging 4D fetal imaging settings. Code is available at https://github.com/sebodiaz/cross-population-pose.

[215] End-to-End Learning of Multi-Organ Implicit Surfaces from 3D Medical Imaging Data

Farahdiba Zarin,Nicolas Padoy,Jérémy Dana,Vinkle Srivastav

Main category: cs.CV

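TL;DR: ImplMORe is a multi-organ reconstruction method that uses a 3D CNN encoder and multi-scale interpolation with an implicit surface representation based on occupancy functions to recover fine-grained surface detail at high resolution.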

Details Motivation: Fine-grained surface reconstruction of organs from 3D medical images could improve diagnosis and surgical planning, but existing methods are limited by resolution, memory, and compute cost. Method: ImplMORe incorporates local features with a 3D CNN encoder and performs multi-scale interpolation using occupancy functions in the continuous domain to learn surface reconstruction. Result: The method outperforms approaches based on discrete explicit representations and reconstructs detailed organ surfaces at a resolution higher than the input image. Conclusion: ImplMORe overcomes limitations of organ reconstruction from 3D medical images and provides high-resolution, detailed surface information that can support diagnosis and surgical planning. Abstract: The fine-grained surface reconstruction of different organs from 3D medical imaging can provide advanced diagnostic support and improved surgical planning. However, the representation of the organs is often limited by the resolution, with a detailed higher resolution requiring more memory and computing footprint. Implicit representations of objects have been proposed to alleviate this problem in general computer vision by providing compact and differentiable functions to represent the 3D object shapes. However, architectural and data-related differences prevent the direct application of these methods to medical images. This work introduces ImplMORe, an end-to-end deep learning method using implicit surface representations for multi-organ reconstruction from 3D medical images. ImplMORe incorporates local features using a 3D CNN encoder and performs multi-scale interpolation to learn the features in the continuous domain using occupancy functions. We apply our method to single and multiple organ reconstructions using the TotalSegmentator dataset. By leveraging the continuous nature of occupancy functions, our approach outperforms the discrete explicit representation based surface reconstruction approaches, providing fine-grained surface details of the organ at a resolution higher than the given input image. The source code will be made publicly available at: https://github.com/CAMMA-public/ImplMORe

[216] U-Mamba2: Scaling State Space Models for Dental Anatomy Segmentation in CBCT

Zhi Qin Tan,Xiatian Zhu,Owen Addison,Yunpeng Li

Main category: cs.CV

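TL;DR: This paper presents U-Mamba2, a neural network for multi-anatomy segmentation in CBCT that combines Mamba2 state space models with the U-Net architecture and adds interactive click prompts, self-supervised pre-training, and dental domain knowledge. It is efficient and accurate in the ToothFairy3 challenge, placing in the top 3 in both tasks.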

Details Motivation: Accurate segmentation of anatomical structures in CBCT is critical for clinical diagnosis and surgical planning, but it remains time-consuming and difficult, and existing methods struggle to balance efficiency and accuracy. Method: U-Mamba2 integrates Mamba2 state space models into the U-Net structure, enforcing stronger structural constraints for higher efficiency. It adds interactive click prompts with cross-attention blocks, is pre-trained with self-supervised learning, and incorporates dental domain knowledge into the model design. Result: On the held-out ToothFairy3 test data, U-Mamba2 achieves a mean Dice of 0.792 and HD95 of 93.19 in Task 1, and a mean Dice of 0.852 and HD95 of 7.39 in Task 2, with efficient inference. Conclusion: U-Mamba2 improves segmentation efficiency while keeping high performance, is suited to automatic multi-anatomy segmentation of complex dental CBCT images, and shows promise for clinical use. Abstract: Cone-Beam Computed Tomography (CBCT) is a widely used 3D imaging technique in dentistry, providing volumetric information about the anatomical structures of jaws and teeth. Accurate segmentation of these anatomies is critical for clinical applications such as diagnosis and surgical planning, but remains time-consuming and challenging. In this paper, we present U-Mamba2, a new neural network architecture designed for multi-anatomy CBCT segmentation in the context of the ToothFairy3 challenge. U-Mamba2 integrates the Mamba2 state space models into the U-Net architecture, enforcing stronger structural constraints for higher efficiency without compromising performance. In addition, we integrate interactive click prompts with cross-attention blocks, pre-train U-Mamba2 using self-supervised learning, and incorporate dental domain knowledge into the model design to address key challenges of dental anatomy segmentation in CBCT. Extensive experiments, including independent tests, demonstrate that U-Mamba2 is both effective and efficient, securing top 3 places in both tasks of the ToothFairy3 challenge. In Task 1, U-Mamba2 achieved a mean Dice of 0.792, HD95 of 93.19 with the held-out test data, with an average inference time of XX (TBC during the ODIN workshop). In Task 2, U-Mamba2 achieved the mean Dice of 0.852 and HD95 of 7.39 with the held-out test data. The code is publicly available at https://github.com/zhiqin1998/UMamba2.
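
The U-Mamba2 results are reported as mean Dice, so a short reminder of how a Dice score is computed for one binary mask may help when reading the numbers. This is a generic sketch of the standard definition, not the challenge's evaluation code. The function name, the flat 0/1 mask encoding, and the handling of two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical four-voxel masks that overlap in one voxel.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_score(pred, target)
```

A mean Dice such as 0.792 would then be the average of per-structure scores like this one over the evaluated anatomies and test cases.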

[217] Progressive Flow-inspired Unfolding for Spectral Compressive Imaging

Xiaodong Wang,Ping Wang,Zijun He,Mengjie Qin,Xin Yuan

Main category: cs.CV

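TL;DR: Proposes a trajectory-controllable unfolding framework that improves the quality and efficiency of hyperspectral image reconstruction in CASSI.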

Details Motivation: Existing deep unfolding networks for CASSI reconstruction have uncontrollable reconstruction trajectories, which causes abrupt quality jumps and a lack of gradual refinement. Method: Inspired by diffusion trajectories and flow matching, the framework controls the optimization path across unfolding stages and pairs it with an efficient spatial-spectral Transformer and a frequency-domain fusion module. Result: Experiments on simulation and real data show better reconstruction quality and computational efficiency than prior state-of-the-art methods. Conclusion: The trajectory-controllable unfolding framework gives a smoother, more continuous optimization process and improves CASSI reconstruction performance. Abstract: Coded aperture snapshot spectral imaging (CASSI) retrieves a 3D hyperspectral image (HSI) from a single 2D compressed measurement, which is a highly challenging reconstruction task. Recent deep unfolding networks (DUNs), empowered by explicit data-fidelity updates and implicit deep denoisers, have achieved the state of the art in CASSI reconstruction. However, existing unfolding approaches suffer from uncontrollable reconstruction trajectories, leading to abrupt quality jumps and non-gradual refinement across stages. Inspired by diffusion trajectories and flow matching, we propose a novel trajectory-controllable unfolding framework that enforces smooth, continuous optimization paths from noisy initial estimates to high-quality reconstructions. To achieve computational efficiency, we design an efficient spatial-spectral Transformer tailored for hyperspectral reconstruction, along with a frequency-domain fusion module to guarantee feature consistency. Experiments on simulation and real data demonstrate that our method achieves better reconstruction quality and efficiency than prior state-of-the-art approaches.

[218] End-to-End 4D Heart Mesh Recovery Across Full-Stack and Sparse Cardiac MRI

Yihong Chen,Jiancheng Yang,Deniz Sayin Mercadier,Hieu Le,Juerg Schwitter,Pascal Fua

Main category: cs.CV

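TL;DR: TetHeart is an end-to-end framework that recovers full 4D multi-structure heart meshes from either full-stack or sparse CMR slices, covering both pre- and intra-procedural settings.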

Details Motivation: Existing methods rely on complete CMR stacks to infer cardiac motion, which limits their use in intra-procedural settings where only sparse slices are available. Method: TetHeart uses deep deformable tetrahedra, an explicit-implicit hybrid representation, together with an attention mechanism for slice-adaptive 2D-3D feature assembly, and is trained with a two-stage weakly supervised scheme that needs only keyframe annotations. Result: Validated on several public and private datasets, TetHeart reaches state-of-the-art reconstruction accuracy with both full and sparse inputs and generalizes well. Conclusion: TetHeart is the first framework to unify 4D cardiac motion reconstruction from full-stack and sparse-slice inputs, improving clinical applicability under sparse intra-procedural observation. Abstract: Reconstructing cardiac motion from cine CMR sequences is critical for diagnosis, prediction, and intervention. Existing methods rely on complete CMR stacks to infer full heart motion, limiting their utility in intra-procedural scenarios where only sparse observations are available. We present TetHeart, the first end-to-end framework that unifies full 4D multi-structure heart mesh recovery from both offline full-stack acquisitions and intra-procedural sparse-slice observations. Our method leverages deep deformable tetrahedra, an explicit-implicit hybrid representation, to capture shape and motion in a coherent space shared across cardiac structures. It is initialized from high-quality pre-procedural or offline-acquired full stacks to build detailed, patient-specific heart meshes, which can then be updated using whatever slices are available, from full stacks down to a single slice. We further incorporate several key innovations: (i) an attentive mechanism for slice-adaptive 2D-3D feature assembly that dynamically integrates information from arbitrary numbers of slices at any position, combined with a distillation strategy from full-slice to sparse-slice settings to ensure accurate reconstruction under extreme sparsity; and (ii) a two-stage weakly supervised motion learning scheme requiring only keyframe (e.g., ED and ES) annotations. Trained and validated on three large public datasets and externally evaluated zero-shot on additional private interventional and public CMR datasets, TetHeart achieves state-of-the-art accuracy and strong generalization in both pre- and intra-procedural settings.

[219] FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation

Bernardo Forni,Gabriele Lombardi,Federico Pozzi,Mirco Planamente

Main category: cs.CV

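TL;DR: This paper proposes FS-SAM2, a few-shot semantic segmentation method built on SAM2 that repurposes SAM2's video segmentation capability and applies Low-Rank Adaptation (LoRA) to adapt to standard image datasets, giving parameter-efficient fine-tuning and strong segmentation performance.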

Details Motivation: Existing few-shot segmentation methods usually train an extra module from scratch and depend on large-scale training data, which is computationally expensive and does not fully exploit the pre-trained model. A framework is needed that adapts efficiently, trains few parameters, and performs well. Method: FS-SAM2 uses SAM2's video modules for the few-shot task and applies LoRA to lightly fine-tune the original modules, so only a small number of parameters is needed to adapt to different image distributions. It supports any K-shot configuration. Result: The method achieves strong results on PASCAL-5^i, COCO-20^i, and FSS-1000, with efficient inference. Conclusion: By reusing SAM2's video capabilities with LoRA fine-tuning, FS-SAM2 gives an efficient, flexible, and high-performing few-shot semantic segmentation method and suggests a route for adapting foundation models to downstream tasks. Abstract: Few-shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre-trained model by training an additional module from scratch. Achieving optimal performance with these approaches requires extensive training on large-scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero-shot image and video segmentation with a modular design. In this paper, we propose a Few-Shot segmentation method based on SAM2 (FS-SAM2), where SAM2's video capabilities are directly repurposed for the few-shot task. Moreover, we apply a Low-Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2's pre-training. With this approach, only a small number of parameters is meta-trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K-shot configuration. We evaluate FS-SAM2 on the PASCAL-5$^i$, COCO-20$^i$ and FSS-1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS-SAM2
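
To show why LoRA leaves only a small number of parameters to train, here is a minimal pure-Python sketch of the standard LoRA update for one linear layer: the frozen weight W is augmented by a low-rank product, y = W x + (alpha / r) B A x. The helper names and the tiny matrices are assumptions for illustration. Which SAM2 modules FS-SAM2 adapts, and with what rank, is not specified in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(weight, lora_a, lora_b, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    weight: d_out x d_in, kept frozen.
    lora_a: r x d_in and lora_b: d_out x r, the only trainable parts.
    """
    rank = len(lora_a)
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, rank):
    """Parameters trained by LoRA versus by full fine-tuning of one layer."""
    return rank * (d_in + d_out), d_out * d_in

# Hypothetical 2x2 layer with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]
x = [3.0, 4.0]
y_init = lora_forward(weight, lora_a, [[0.0], [0.0]], x)     # B = 0: output unchanged
y_adapted = lora_forward(weight, lora_a, [[1.0], [2.0]], x)  # after a made-up update
```

Initializing B to zero, as is common for LoRA, means the adapted layer starts out identical to the pre-trained one. For a 1024 x 1024 layer at rank 8, `trainable_params` gives 16,384 trainable values against 1,048,576 for full fine-tuning.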

[220] RailSafeNet: Visual Scene Understanding for Tram Safety

Ing. Ondrej Valach,Ing. Ivan Gruber

Main category: cs.CV

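TL;DR: This paper presents RailSafeNet, a real-time framework that uses semantic segmentation, object detection, and a rule-based Distance Assessor to detect track intrusions from monocular video, improving the safety of tram-pedestrian interaction.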

Details Motivation: Trams often operate in densely populated areas, where they risk colliding with pedestrians, cyclists, and others, so an efficient and accurate safety warning system is needed. Method: The system fuses semantic segmentation (SegFormer B3), object detection (YOLOv8), and a rule-based Distance Assessor built on the standard 1435 mm rail gauge to analyze monocular video, identify the rails, and rate the risk of nearby objects. Result: On RailSem19, semantic segmentation reaches 65% IoU and object detection reaches 75.6% mAP at an IoU threshold of 0.5, giving accurate, annotation-light scene understanding. Conclusion: RailSafeNet identifies track intrusions and can warn drivers in time. It has practical potential, and the code is open source. Abstract: Tram-human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can range from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real-time framework that fuses semantic segmentation, object detection and a rule-based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435 mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU), while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) at an IoU threshold of 0.50. RailSafeNet therefore delivers accurate, annotation-light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
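
The idea of using the 1435 mm gauge as a ruler can be sketched as follows: on a given image row, the pixel distance between the two rails corresponds to 1435 mm, which gives a local scale for converting an object's pixel offset into a physical distance. This is only a guess at the spirit of the Distance Assessor. The function names, the same-row scaling assumption, and both risk thresholds are hypothetical and not taken from the paper.

```python
RAIL_GAUGE_MM = 1435.0  # standard gauge, as stated in the abstract

def lateral_distance_mm(object_x_px, nearest_rail_x_px, gauge_width_px):
    """Convert a pixel offset from the nearest rail into millimetres.

    gauge_width_px is the pixel distance between the two rails measured on the
    same image row as the object, so the scale follows the local perspective.
    """
    if gauge_width_px <= 0:
        raise ValueError("gauge width in pixels must be positive")
    mm_per_px = RAIL_GAUGE_MM / gauge_width_px
    return abs(object_x_px - nearest_rail_x_px) * mm_per_px

def risk_level(distance_mm, danger_mm=1000.0, warning_mm=2500.0):
    """Map a distance to a risk label; both thresholds are made-up placeholders."""
    if distance_mm <= danger_mm:
        return "danger"
    if distance_mm <= warning_mm:
        return "warning"
    return "safe"

# Hypothetical detection: rails 287 px apart on this row, object 100 px from a rail.
distance = lateral_distance_mm(500, 400, 287)
label = risk_level(distance)
```

Because the gauge is a known physical length visible in every frame, a rule of this kind needs no camera calibration or depth sensor, which fits the abstract's emphasis on monocular video.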

[221] 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data

Nojod M. Alotaibi,Areej M. Alhothali,Manar S. Ali

Main category: cs.CV

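TL;DR: Proposes a unified framework combining Vision Transformers and Graph Neural Networks to detect major depressive disorder (MDD) automatically from sMRI data. Validation on the REST-meta-MDD dataset shows that atlas-based regions outperform cube-based partitioning.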

Details Motivation: Existing automated MDD detection methods built on voxel-level features or on regional features from predefined brain atlases struggle to capture complex brain patterns, which limits diagnostic performance, so more flexible and biologically meaningful regional representations are needed. Method: Vision Transformers extract region embeddings from 3D sMRI patches, cosine similarity graphs model relationships between regions, and a Graph Neural Network performs classification. Two region definitions are compared: predefined brain atlases and uniformly extracted 3D cubes. Result: On REST-meta-MDD with stratified 10-fold cross-validation, the best model reaches 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Atlas-based models consistently outperform the cube-based approach. Conclusion: Combining ViT region features with GNN modeling of inter-region connectivity works well for MDD identification, and anatomical priors such as brain atlases are important for model performance. Abstract: Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and a Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) a cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
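
The cosine similarity graph step can be illustrated with a few lines of Python: each brain region has an embedding, and regions whose embeddings point in similar directions are linked. The abstract does not say how similarities become edges, so the fixed threshold below is an assumption (a top-k rule or weighted edges would be equally plausible), and the function names and toy embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_graph(embeddings, threshold=0.5):
    """Symmetric 0/1 adjacency matrix over regions, without self-loops.

    Two regions are connected when their cosine similarity reaches the
    threshold; the thresholding rule itself is an assumption.
    """
    n = len(embeddings)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

# Hypothetical 2-D embeddings for three regions.
regions = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
graph = similarity_graph(regions)
```

In this toy case the first two regions are linked because their embeddings are parallel, while the third is orthogonal to both and stays isolated. A GNN would then pass messages along such edges.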

[222] Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

Hyolim Kang,Yunsu Park,Youngbeom Yoo,Yeeun Choi,Seon Joo Kim

Main category: cs.CV

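TL;DR: Proposes OpenHOUSE, a hierarchical streaming video understanding system that combines online temporal action localization with free-form description generation and markedly improves boundary detection between adjacent actions.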

Details Motivation: Existing datasets lack hierarchical, fine-grained temporal annotations, and conventional methods handle the complex structure of continuous action streams poorly. Method: Large language models (LLMs) group atomic actions into higher-level events to enrich existing datasets, and a specialized streaming module detects the boundaries between closely adjacent actions. Result: OpenHOUSE nearly doubles action boundary detection performance compared with direct extensions of existing methods. Conclusion: OpenHOUSE is a key step toward future streaming action perception systems that integrate powerful generative models. Abstract: We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.

[223] Multi Anatomy X-Ray Foundation Model

Nishank Singla,Krisztian Koos,Farzin Haddadpour,Amin Honarmandi Shandiz,Lovish Chum,Xiaojian Xu,Qing Jin,Erhan Bas

Main category: cs.CV

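TL;DR: XR-0 is a multi-anatomy X-ray foundation model trained with self-supervised learning on 1.15 million images. It performs strongly across 12 datasets and 20 downstream tasks, advancing scalable AI systems in radiology.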

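Details Motivation: Existing AI foundation models are mostly limited to chest anatomy and generalize poorly to broader clinical tasks, so a general X-ray model covering multiple anatomical regions is needed. Method: XR-0 is trained with self-supervised learning on a large private dataset of 1.15 million images spanning multiple anatomical regions and is evaluated on 12 datasets and 20 downstream tasks, including classification, retrieval, segmentation, localization, visual grounding, and report generation. Result: XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks. The experiments indicate that anatomical diversity and supervision are critical for robust, general-purpose medical vision models. Conclusion: Anatomical diversity and appropriate supervision are key to building general medical imaging models, and XR-0 offers a practical path toward scalable, adaptable AI systems in radiology. Abstract: X-ray imaging is ubiquitous in radiology, yet most existing AI foundation models are limited to chest anatomy and fail to generalize across broader clinical tasks. In this work, we introduce XR-0, a multi-anatomy X-ray foundation model trained using self-supervised learning on a large, private dataset of 1.15 million images spanning diverse anatomical regions and evaluated across 12 datasets and 20 downstream tasks, including classification, retrieval, segmentation, localization, visual grounding, and report generation. XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks. Our results demonstrate that anatomical diversity and supervision are critical for building robust, general-purpose medical vision models, paving the way for scalable and adaptable AI systems in radiology.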

[224] LoRA-fine-tuned Large Vision Models for Automated Assessment of Post-SBRT Lung Injury

M. Bolhassani,B. Veasey,E. Daugherty,S. Keltner,N. Kumar,N. Dunlap,A. Amini

Main category: cs.CV

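TL;DR: This study examines the effectiveness of fine-tuning large vision models (DinoV2 and SwinV2) with Low-Rank Adaptation (LoRA) to diagnose radiation-induced lung injury (RILI) from X-ray CT scans; LoRA matches or exceeds full fine-tuning while significantly reducing computational cost and training time.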

Details Motivation: To diagnose radiation-induced lung injury visible on CT images after SBRT efficiently and robustly, the study explores more parameter-efficient fine-tuning methods. Method: DinoV2 and SwinV2 are fine-tuned with LoRA and compared against full fine-tuning and inference-only (no fine-tuning) baselines; cropped images of two sizes (50 mm³ and 75 mm³) and 2D-to-3D adaptation techniques are used to assess the models' sensitivity to spatial context. Result: LoRA performs as well as or better than traditional full fine-tuning on RILI diagnosis while significantly reducing the number of trainable parameters, computational overhead, and training time. Conclusion: LoRA is an efficient and robust fine-tuning method for large vision models, suited to medical image analysis tasks and particularly advantageous in resource-constrained settings. Abstract: This study investigates the efficacy of Low-Rank Adaptation (LoRA) for fine-tuning large Vision Models, DinoV2 and SwinV2, to diagnose Radiation-Induced Lung Injury (RILI) from X-ray CT scans following Stereotactic Body Radiation Therapy (SBRT). To evaluate the robustness and efficiency of this approach, we compare LoRA with traditional full fine-tuning and inference-only (no fine-tuning) methods. Cropped images of two sizes (50 mm³ and 75 mm³), centered at the treatment isocenter, in addition to different adaptation techniques for adapting the 2D LVMs for 3D data were used to determine the sensitivity of the models to spatial context. Experimental results show that LoRA achieves comparable or superior performance to traditional fine-tuning while significantly reducing computational costs and training times by requiring fewer trainable parameters.
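
Illustration (not from the paper): the abstract attributes LoRA's savings to having fewer trainable parameters. The sketch below shows the general LoRA idea under stated assumptions: a frozen weight matrix W is augmented with a trainable low-rank product B·A scaled by alpha / rank, with B initialized to zero so that training starts from the pretrained behavior. The layer sizes, rank, alpha, and function names are hypothetical and chosen only for illustration; they are not the authors' DinoV2/SwinV2 configuration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return the effective weight W + (alpha / rank) * (B @ A)."""
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Count parameters trained by LoRA (A and B) and by full fine-tuning (W)."""
    return rank * (d_in + d_out), d_out * d_in


# Hypothetical toy sizes, far smaller than a real vision transformer layer.
d_out, d_in, rank, alpha = 8, 6, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
b = [[0.0] * rank for _ in range(d_out)]                               # trainable, zero-initialized

w_eff = lora_weight(w, a, b, alpha, rank)
lora_n, full_n = trainable_params(d_out, d_in, rank)
print(f"LoRA trains {lora_n} parameters; full fine-tuning trains {full_n}")
```

Because B starts at zero, the effective weight initially equals the frozen W. The trainable count grows as rank × (d_in + d_out) rather than d_in × d_out, so the relative saving becomes larger as the layer dimensions grow.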

[225] HoloGarment: 360° Novel View Synthesis of In-the-Wild Garments

Johanna Karras,Yingwei Li,Yasamin Jafarian,Ira Kemelmacher-Shlizerman

Main category: cs.CV

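TL;DR: HoloGarment is a new method trained on real video and synthetic 3D data that generates high-quality 360° novel views of garments from dynamic videos and copes with complex real-world conditions.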

Details Motivation: Existing methods rely on synthetic 3D training data consisting mostly of unoccluded, static objects, which leads to poor generalization on real-world garments. A new method is therefore needed that can handle significant occlusions, complex human poses, and cloth deformations. Method: HoloGarment adopts a novel implicit training paradigm that uses real video and synthetic 3D data to optimize a shared garment embedding space, and achieves dynamic video-to-360° novel view synthesis by constructing a garment "atlas" representation. Result: HoloGarment achieves state-of-the-art performance on novel view synthesis of in-the-wild garments from images and videos, and robustly handles real-world challenges such as wrinkling, pose variation, and occlusion while maintaining photorealism, view consistency, fine texture details, and accurate geometry. Conclusion: By combining large-scale real video data with small-scale synthetic 3D data, HoloGarment builds a shared garment embedding space that bridges the domain gap between real and synthetic data, enabling high-quality garment novel view synthesis on dynamic videos. Abstract: Novel view synthesis (NVS) of in-the-wild garments is a challenging task due to significant occlusions, complex human poses, and cloth deformations. Prior methods rely on synthetic 3D training data consisting of mostly unoccluded and static objects, leading to poor generalization on real-world clothing. In this paper, we propose HoloGarment (Hologram-Garment), a method that takes 1-3 images or a continuous video of a person wearing a garment and generates 360° novel views of the garment in a canonical pose. Our key insight is to bridge the domain gap between real and synthetic data with a novel implicit training paradigm leveraging a combination of large-scale real video data and small-scale synthetic 3D data to optimize a shared garment embedding space. During inference, the shared embedding space further enables dynamic video-to-360° NVS through the construction of a garment "atlas" representation by finetuning a garment embedding on a specific real-world video. The atlas captures garment-specific geometry and texture across all viewpoints, independent of body pose or motion. Extensive experiments show that HoloGarment achieves state-of-the-art performance on NVS of in-the-wild garments from images and videos. Notably, our method robustly handles challenging real-world artifacts -- such as wrinkling, pose variation, and occlusion -- while maintaining photorealism, view consistency, fine texture details, and accurate geometry. Visit our project page for additional results: https://johannakarras.github.io/HoloGarment

[226] Domain-Adaptive Pretraining Improves Primate Behavior Recognition

Felix B. Mueller,Timo Lueddecke,Richard Vogg,Alexander S. Ecker

Main category: cs.CV

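TL;DR: This paper uses self-supervised learning and domain-adaptive pretraining (DAP) to improve primate behavior recognition, significantly raising accuracy and mAP with a DAP stage that requires no labeled data.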

Details Motivation: Video camera traps enable large-scale collection of animal behavior data, but labeling costs are high, so data-efficient machine learning methods are needed to advance animal behavior recognition. Method: A pretrained V-JEPA model is further pretrained on in-domain data via domain-adaptive pretraining (DAP) for the primate behavior datasets PanAf and ChimpACT to improve action recognition. Result: The method outperforms the published state of the art by 6.1 percentage points in accuracy and 6.3 percentage points in mAP on the two datasets, respectively, with most of the gain coming from DAP. Conclusion: DAP effectively improves animal behavior recognition without requiring labeled data and has broad potential for applications in ecology, cognitive science, and conservation. Abstract: Computer vision for animal behavior offers promising tools to aid research in ecology, cognition, and to support conservation efforts. Video camera traps allow for large-scale data collection, but high labeling costs remain a bottleneck to creating large-scale datasets. We thus need data-efficient learning approaches. In this work, we show that we can utilize self-supervised learning to considerably improve action recognition on primate behavior. On two datasets of great ape behavior (PanAf and ChimpACT), we outperform published state-of-the-art action recognition models by 6.1 %pt. accuracy and 6.3 %pt. mAP, respectively. We achieve this by utilizing a pretrained V-JEPA model and applying domain-adaptive pretraining (DAP), i.e. continuing the pretraining with in-domain data. We show that most of the performance gain stems from the DAP. Our method promises great potential for improving the recognition of animal behavior, as DAP does not require labeled samples. Code is available at https://github.com/ecker-lab/dap-behavior

[227] 3D Human Pose and Shape Estimation from LiDAR Point Clouds: A Review

Salma Galaaoui,Eduardo Valle,David Picard,Nermin Samet

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of 3D human pose estimation and mesh recovery from LiDAR point clouds, including a structured taxonomy, dataset comparisons, benchmark tables, and an updated webpage for ongoing research.

Details Motivation: The motivation is to provide a comprehensive review of 3D human pose estimation and mesh recovery from LiDAR point clouds, enabling fair comparisons and promoting progress in the field. Method: The authors present a structured taxonomy to classify existing methods, perform a quantitative comparison of datasets, compile unified evaluation metrics, and establish benchmark tables. Result: The paper results in a detailed analysis of existing methods, benchmark tables on widely used datasets, unified definitions of evaluation metrics, and an updated webpage for organizing studies. Conclusion: The paper concludes by outlining open challenges and future research directions in LiDAR-based 3D human understanding, while also maintaining an updated webpage to organize studies in this field. Abstract: In this paper, we present a comprehensive review of 3D human pose estimation and human mesh recovery from in-the-wild LiDAR point clouds. We compare existing approaches across several key dimensions, and propose a structured taxonomy to classify these methods. Following this taxonomy, we analyze each method's strengths, limitations, and design choices. In addition, (i) we perform a quantitative comparison of the three most widely used datasets, detailing their characteristics; (ii) we compile unified definitions of all evaluation metrics; and (iii) we establish benchmark tables for both tasks on these datasets to enable fair comparisons and promote progress in the field. We also outline open challenges and research directions critical for advancing LiDAR-based 3D human understanding. Moreover, we maintain an accompanying webpage that organizes papers according to our taxonomy and continuously update it with new studies: https://github.com/valeoai/3D-Human-Pose-Shape-Estimation-from-LiDAR

[228] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Yang Zhou,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Haoyu Guo,Zizun Li,Kaijing Ma,Xinyue Li,Yating Wang,Haoyi Zhu,Mingyu Liu,Dingning Liu,Jiange Yang,Zhoujie Fu,Junyi Chen,Chunhua Shen,Jiangmiao Pang,Kaipeng Zhang,Tong He

Main category: cs.CV

TL;DR: This paper introduces OmniWorld, a new large-scale dataset designed to advance 4D world modeling by overcoming the limitations of existing datasets, leading to improved performance in key tasks and promoting the development of general-purpose 4D world models.

Details Motivation: The development of general 4D world models is limited by the lack of high-quality, dynamic, and multi-domain datasets. Existing datasets do not adequately support tasks like 4D geometric reconstruction and future prediction. Method: The authors introduced OmniWorld, a large-scale, multi-domain, multi-modal dataset for 4D world modeling, including the newly collected OmniWorld-Game dataset and curated public datasets. They established a challenging benchmark based on this dataset. Result: OmniWorld provides richer modality coverage, larger scale, and more realistic dynamic interactions compared to existing datasets. Fine-tuning state-of-the-art methods on OmniWorld leads to significant performance improvements in 4D reconstruction and video generation tasks. Conclusion: OmniWorld is expected to accelerate the development of general-purpose 4D world models, enhancing machines' understanding of the physical world. Abstract: The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.

[229] LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin,Xili Dai,Duomin Wang,Xianfang Zeng,Lionel M. Ni,Gang Yu,Heung-Yeung Shum

Main category: cs.CV

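TL;DR: LazyDrag is a novel drag-based image editing method for Multi-Modal Diffusion Transformers that strengthens attention control with an explicit correspondence map, achieving stable full-strength inversion without test-time optimization and significantly improving editing accuracy and generative capability.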

Details Motivation: Existing drag-based editing methods rely on implicit point matching via attention, which weakens inversion strength and incurs costly optimization, limiting diffusion models' capabilities in high-fidelity inpainting and text-guided generation. Method: LazyDrag generates a reliable explicit correspondence map from user drag inputs to strengthen attention control, enabling a stable full-strength inversion process without test-time optimization (TTO). Result: LazyDrag outperforms baselines on DragBench in both drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. Conclusion: LazyDrag offers a new drag-based image editing method that eliminates the reliance on implicit point matching, naturally unifies precise geometric control with text guidance, and significantly improves the model's generative capability. Abstract: The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a "tennis ball", or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.

[230] Character-Centric Understanding of Animated Movies

Zhongrui Gui,Junyu Xie,Tengda Han,Weidi Xie,Andrew Zisserman

Main category: cs.CV

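TL;DR: The paper proposes an audio-visual multimodal method for animated character recognition that builds an audio-visual character bank to robustly recognize characters in animated movies, and applies it to audio description generation and subtitling, improving the accessibility and narrative comprehension of animated content.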

Details Motivation: Conventional face recognition methods struggle with the extreme diversity of animated characters in appearance, motion, and deformation, so a more robust recognition method is needed to support character understanding in animated movies. Method: An audio-visual recognition pipeline automatically builds a character bank containing visual exemplars and voice samples from online sources and uses multimodal information to recognize characters despite long-tailed distributions. Result: The method is validated on the newly introduced CMD-AM dataset (75 animated movies) and outperforms conventional face-detection-based approaches on character recognition, audio description generation, and subtitling for the hearing impaired. Conclusion: The audio-visual multimodal character recognition framework significantly improves character understanding of animated content and advances accessibility technology for audiences with disabilities. Abstract: Animated movies are captivating for their unique character designs and imaginative storytelling, yet they pose significant challenges for existing recognition systems. Unlike the consistent visual patterns detected by conventional face recognition methods, animated characters exhibit extreme diversity in their appearance, motion, and deformation. In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. Central to our approach is the automatic construction of an audio-visual character bank from online sources. This bank contains both visual exemplars and voice (audio) samples for each character, enabling subsequent multi-modal character recognition despite long-tailed appearance distributions. Building on accurate character recognition, we explore two downstream applications: Audio Description (AD) generation for visually impaired audiences, and character-aware subtitling for the hearing impaired. To support research in this domain, we introduce CMD-AM, a new dataset of 75 animated movies with comprehensive annotations. Our character-centric pipeline demonstrates significant improvements in both accessibility and narrative comprehension for animated content over prior face-detection-based approaches. For the code and dataset, visit https://www.robots.ox.ac.uk/~vgg/research/animated_ad/.