cs.CL

[1] Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling

Hyunji Lee,Wenhao Yu,Hongming Zhang,Kaixin Ma,Jiyeon Kim,Dong Yu,Minjoon Seo

Main category: cs.CL

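TL;DR: This paper studies hybrid architectures that combine state space models (SSMs) with attention, analyzes their memory utilization and overall performance, and proposes two levers for improving them: the choice between sequential and parallel integration, and continual training on augmented data.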

Details

Motivation: The design choices behind hybrid models remain poorly understood, particularly how performance differs across context lengths, so a systematic analysis is needed to guide practical architecture design.

Method: Compare sequential and parallel integration of SSM and attention layers, and continually train on datasets augmented with paraphrases.

Result: Sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective on longer contexts. The data augmentation approach improves recall, generalizes across base models, and outperforms architectural modifications aimed at enhancing recall.

Conclusion: Hybrid model performance depends on both architecture and training data. Different use cases call for different structural designs, and data augmentation is an effective way to improve recall.

Abstract: Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.

[2] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence

Lívia Dutra,Arthur Lorenzi,Laís Berno,Franciany Campos,Karoline Biscardi,Kenneth Brown,Marcelo Viridiano,Frederico Belcavello,Ely Matos,Olívia Guaranha,Erik Santos,Sofia Reinach,Tiago Timponi Torrent

Main category: cs.CL

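TL;DR: Proposes a semantic-frame-based method for identifying gender-based violence events in the unstructured text of electronic medical records, helping to mitigate underreporting.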

Details

Motivation: Gender-based violence is severely underreported in medical records, and automated methods are needed to raise notification rates.

Method: Use semantic frames to define fine-grained patterns and search for them in a corpus of 21 million Brazilian Portuguese sentences drawn from electronic medical records.

Result: The method identifies reports of violence with a precision of 0.726, as measured through manual evaluation by linguists.

Conclusion: The transparent, efficient, and transferable method can be adapted to other public health surveillance settings, supporting explainable and ethical use of NLP in public health.

Abstract: We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search for them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients' visits to primary care units. A total of eight patterns are defined and searched for in a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.

[3] Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations

Jean-Philippe Corbeil,Asma Ben Abacha,Jerome Tremblay,Phillip Swazinna,Akila Jeeson Daniel,Miguel Del-Agua,Francois Beaulieu

Main category: cs.CL

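TL;DR: This paper presents the MEDIQA-OE 2025 shared task on extracting medical orders from doctor-patient conversations, the first challenge on turning such conversations into actionable orders for electronic health records, and reports on the large language model approaches that participants tried.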

Details

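Motivation: Automatically converting doctor-patient conversations into actionable medical orders could significantly reduce clinicians' documentation burden and improve patient care, yet the problem remains largely unexplored.

Method: Design and release the MEDIQA-OE 2025 shared task with a purpose-annotated dataset, and invite research teams to compete using closed- and open-weight large language models, comparing how different approaches perform at medical order extraction.

Result: Six teams participated and experimented with a broad range of approaches, including different kinds of large language models, yielding a final task leaderboard.

Conclusion: MEDIQA-OE 2025 advances research on extracting medical orders from doctor-patient conversations and provides a benchmark and direction for future automated clinical documentation.

Abstract: Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants' solutions.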

[4] Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

Jayden Serenari,Stephen Lee

Main category: cs.CL

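TL;DR: This paper proposes LOPSIDED, a framework for protecting user privacy when using large language models. It replaces sensitive personal information with semantically consistent pseudonyms and automatically restores the originals after the response is generated, reducing semantic errors while improving privacy.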

Details

Motivation: As conversational AI systems spread, users may reveal sensitive personal information when interacting with large language models, creating a risk of privacy leaks. A method is needed that protects privacy without harming conversation quality.

Method: The LOPSIDED framework dynamically detects sensitive PII entities in user input and replaces them with semantically consistent pseudonyms. After the model produces its output, the pseudonyms are mapped back to the original information, protecting privacy while preserving contextual integrity.

Result: In experiments on real conversations from ShareGPT, LOPSIDED reduces semantic utility errors by a factor of 5 compared with baseline techniques while also enhancing privacy.

Conclusion: LOPSIDED protects user privacy in interactions with remote large language models without sacrificing response quality, balancing privacy protection against semantic integrity.

Abstract: With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrades response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model's response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.
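
The replace-then-restore round trip described in this entry can be illustrated with a minimal sketch. It is not the LOPSIDED implementation: entity detection and pseudonym selection are assumed to have happened already, and the function names `pseudonymize` and `depseudonymize` and the `replacements` mapping are hypothetical.

```python
import re


def pseudonymize(prompt, replacements):
    """Swap each detected entity for its pseudonym.

    `replacements` maps original entity -> pseudonym and is assumed to come
    from an upstream detector (hypothetical; not shown here).
    Returns the masked prompt and the reverse map needed for restoration.
    """
    reverse = {}
    for original, pseudonym in replacements.items():
        pattern = r"\b" + re.escape(original) + r"\b"
        if re.search(pattern, prompt):
            # A function replacement avoids backslash-escape surprises.
            prompt = re.sub(pattern, lambda _m, p=pseudonym: p, prompt)
            reverse[pseudonym] = original
    return prompt, reverse


def depseudonymize(response, reverse):
    """Map pseudonyms in the model response back to the original entities."""
    for pseudonym, original in reverse.items():
        pattern = r"\b" + re.escape(pseudonym) + r"\b"
        response = re.sub(pattern, lambda _m, o=original: o, response)
    return response


replacements = {"Alice Nguyen": "Maria Lopez", "Seattle": "Denver"}
masked, reverse = pseudonymize(
    "Alice Nguyen is moving to Seattle next month.", replacements
)
restored = depseudonymize(
    "Maria Lopez should look for housing in Denver early.", reverse
)
```

Only the masked prompt would leave the local machine in this sketch; the reverse map stays local, which is what makes the restoration step possible.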

[5] Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

Ayoub Hammal,Pierre Zweigenbaum,Caio Corro

Main category: cs.CL

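TL;DR: Proposes a proxy-based test-time alignment method guided by a small aligned model, reducing the computational cost of aligning large language models.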

Details

Motivation: Large language models acquire most of their generation capabilities during pre-training but still need alignment to meet downstream requirements, and the cost of alignment becomes prohibitive as models grow.

Method: Use proxy-based test-time alignment guided by a small aligned model, formulate the token-level deferral decision as a 0-1 knapsack problem, and derive primal and dual approximations of the optimal decision.

Result: Experiments show benefits in both task performance and speculative decoding speed.

Conclusion: Test-time alignment through a proxy model circumvents the cost of aligning the large model directly while maintaining or improving performance.

Abstract: Several previous works concluded that the largest part of the generation capabilities of large language models (LLMs) is learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in terms of size, the computational cost of alignment procedures increases prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as a token-specific cascading method, where the token-specific deferral rule is reduced to a 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.
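
For readers unfamiliar with the 0-1 knapsack problem that the deferral rule is reduced to, the sketch below solves a small instance exactly by dynamic programming. This is a textbook solver, not the primal or dual approximation derived in the paper. The per-token `gains` and integer `costs` are hypothetical inputs, as is the reading of `budget` as a cap on total deferral cost.

```python
def knapsack_deferral(gains, costs, budget):
    """Exact 0-1 knapsack by dynamic programming.

    gains[i]: assumed benefit of deferring token position i (hypothetical).
    costs[i]: assumed integer cost of deferring position i (hypothetical).
    Returns (best total gain, sorted list of chosen positions).
    """
    n = len(gains)
    # best[i][b] = best gain using the first i positions with budget b.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, cost = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: do not defer position i-1
            if cost <= b and best[i - 1][b - cost] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + gain  # option 2: defer it
    # Walk back through the table to recover which positions were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


total_gain, deferred = knapsack_deferral(
    gains=[3.0, 1.0, 4.0, 2.0], costs=[2, 1, 3, 2], budget=5
)
```

The exact solver needs the whole sequence in advance and runs in time proportional to sequence length times budget, which hints at why approximations are attractive for token-by-token decoding.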

[6] Elastic Architecture Search for Efficient Language Models

Shang Wang

Main category: cs.CL

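TL;DR: This paper proposes ELM, a new neural architecture search method for building efficient compact language models. With a flexible search space and knowledge distillation losses, it outperforms existing methods on masked and causal language modeling.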

Details

Motivation: Large pre-trained language models are increasingly important for natural language understanding, but their heavy computational and memory requirements raise economic and environmental concerns, so more efficient model design methods are needed.

Method: The Elastic Language Model (ELM) introduces a flexible search space with efficient transformer blocks and dynamic modules for adjusting dimension and head number, and adds new knowledge distillation losses that better discriminate between architectural choices.

Result: Experiments on masked language modeling and causal language modeling show that models discovered by ELM significantly outperform existing methods.

Conclusion: By making neural architecture search more flexible and improving its optimization, ELM improves the performance of compact language models and offers a new direction for efficient language model design.

Abstract: As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.

[7] Dataset Creation and Baseline Models for Sexism Detection in Hausa

Fatima Adam Muhammad,Shamsuddeen Muhammad Hassan,Isa Inuwa-Dutse

Main category: cs.CL

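TL;DR: This study builds the first Hausa sexism detection dataset, uses community engagement and a user study to explore how sexism is expressed in Hausa, and evaluates traditional machine learning classifiers and pre-trained multilingual models in few-shot settings.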

Details

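Motivation: Sexism detection in low-resource languages is hampered by limited linguistic resources and cultural differences. Computational approaches have made little progress in these languages, and detection strategies suited to specific cultural contexts are needed.

Method: Build a Hausa sexism dataset through community engagement, qualitative coding, and data augmentation; run a two-stage user study (n=66) with native speakers; and experiment with traditional machine learning classifiers and pre-trained multilingual language models, evaluating few-shot learning for detecting sexism in Hausa.

Result: Cultural nuance, especially clarification-seeking and idiomatic expressions, is hard to capture and leads to many false positives. Multilingual models show limited performance in few-shot settings, underscoring the importance of localized language understanding.

Conclusion: Building sexism detection systems for low-resource languages requires a deep understanding of local culture and language. Existing models alone cannot reliably identify sexism in complex contexts, and future work should strengthen community engagement and culturally adapted modeling.

Abstract: Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluate the effectiveness of few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.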

[8] Quantitative Intertextuality from the Digital Humanities Perspective: A Survey

Siyu Duan

Main category: cs.CL

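TL;DR: This survey reviews quantitative intertextuality research built on natural language processing, summarizes its data, methods, and applications in the humanities and social sciences, and looks ahead to the role of intertextuality in research bridging AI and the humanities.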

Details

Motivation: Move intertextuality studies into the quantitative age by using advanced natural language processing for more precise, diverse, and large-scale analysis of connections between texts.

Method: Review quantitative methods ranging from statistics to deep learning, drawing on data from multiple languages and topics.

Result: The survey organizes the main methods, application areas, and platform tools of current quantitative intertextuality research and shows its wide use in digital humanities and the social sciences.

Conclusion: As computing technology advances, quantitative intertextuality research is expected to play a larger role in interdisciplinary work between AI and the humanities.

Abstract: The connection between texts is referred to as intertextuality in literary theory, which has served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.

[9] Recursive numeral systems are highly regular and easy to process

Ponrawee Prasertsom,Andrea Silvi,Jennifer Culbertson,Moa Johansson,Devdatt Dubhashi,Kenny Smith

Main category: cs.CL

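TL;DR: This paper argues that recursive numeral systems are efficient with respect to regularity and processing complexity. A Minimum Description Length (MDL) approach better explains the differences between natural and unnatural numeral systems and shows that the ad-hoc constraints of earlier work follow naturally from regularity.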

Details

Motivation: Earlier work has not convincingly shown why only natural-language-like numeral systems optimize the trade-off between lexicon size and morphosyntactic complexity, and it relied on ad-hoc constraints to rule out unnatural systems. This paper introduces regularity as the missing factor in characterizing linguistic optimality.

Method: Within the Minimum Description Length (MDL) framework, propose new measures of the regularity and processing complexity of numeral systems, and compare attested natural systems with possible but unattested ones, including the "optimal" recursive systems of previous work.

Result: The MDL-based measures better separate natural from unnatural numeral systems, show that the efficiency of recursive systems stems from their regularity, and show that the ad-hoc constraints used in earlier work follow from regularity.

Conclusion: Studies of optimality in language must account for regularity across sets of forms, and the advantage of recursive numeral systems is best understood in terms of structural regularity and processing efficiency.

Abstract: Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntactic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this trade-off has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including "optimal" recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.

[10] VISTA Score: Verification In Sequential Turn-based Assessment

Ashley Lewis,Andrew Perrault,Eric Fosler-Lussier,Michael White

Main category: cs.CL

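TL;DR: This paper proposes VISTA, a framework that evaluates factuality in dialogue systems through claim-level verification and sequential consistency tracking, improving hallucination detection in multi-turn conversations.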

Details

Motivation: Existing evaluation metrics are limited in multi-turn dialogue and cannot reliably identify hallucinated content that is unsupported by evidence or contradicts the context, so a more dependable evaluation method is needed.

Method: Decompose each assistant turn into atomic factual claims, verify them against trusted sources and the dialogue history, and categorize unverifiable statements (subjective, contradicted, lacking evidence, or abstaining).

Result: Across eight large language models and four dialogue factuality benchmarks, VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines, and its decomposition improves human annotator agreement.

Conclusion: By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent measure of truthfulness in dialogue that aligns better with human judgment.

Abstract: Hallucination, defined here as generating statements unsupported or contradicted by available evidence or conversational context, remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.

[11] LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints

Xiaofan Guo,Yaxuan Luan,Yue Kang,Xiangchen Song,Jinxu Guo

Main category: cs.CL

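TL;DR: This paper proposes a confidence control method that combines multi-granularity memory indexing with uncertainty estimation to improve the coverage, stability, and reliability of retrieval-augmented generation (RAG) in complex knowledge environments.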

Details

Motivation: Existing RAG methods suffer from insufficient coverage, unstable results, and limited reliability in complex knowledge environments, so the generation process needs better controllability and accuracy.

Method: Build a hierarchical memory structure that represents knowledge at multiple granularities, from local details to global context, and add an uncertainty estimation mechanism that filters low-confidence paths during generation. The optimization objective combines generation loss, entropy constraints, and variance regularization.

Result: Experiments show that the method outperforms existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, and is more stable and robust across hyperparameters, environments, and data structures.

Conclusion: Combining multi-granularity indexing with confidence control improves RAG systems and offers a new technical pathway for reliable and controllable use of large models in complex settings.

Abstract: This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments, and proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation. The method builds a hierarchical memory structure that divides knowledge representations into different levels of granularity, enabling dynamic indexing and retrieval from local details to global context, and thus establishing closer semantic connections between retrieval and generation. On this basis, an uncertainty estimation mechanism is introduced to explicitly constrain and filter low-confidence paths during the generation process, allowing the model to maintain information coverage while effectively suppressing noise and false content. The overall optimization objective consists of generation loss, entropy constraints, and variance regularization, forming a unified confidence control framework. In the experiments, comprehensive sensitivity tests and comparative analyses were designed, covering hyperparameters, environmental conditions, and data structures, to verify the stability and robustness of the proposed method across different scenarios. The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, demonstrating the effectiveness of combining multi-granularity indexing with confidence control. This study not only provides a new technical pathway for retrieval-augmented generation but also offers practical evidence for improving the reliability and controllability of large models in complex contexts.
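
The abstract names three terms in the optimization objective but does not give the formula. One plausible way to write such an objective, under the assumption that the terms are combined as a weighted sum, is sketched below. The symbols $\lambda_{1}$ and $\lambda_{2}$ and the exact form of each term are assumptions for illustration, not taken from the paper.

```latex
% Assumed weighted-sum form; the weights \lambda_1, \lambda_2 are hypothetical.
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{gen}}
  + \lambda_{1}\,\mathcal{L}_{\mathrm{ent}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{var}},
\qquad \lambda_{1}, \lambda_{2} \ge 0
```

Here $\mathcal{L}_{\mathrm{gen}}$ stands for the generation loss, $\mathcal{L}_{\mathrm{ent}}$ for the entropy constraint, and $\mathcal{L}_{\mathrm{var}}$ for the variance regularizer.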

[12] Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski,Meriem Boubdir,Klaudia Bałazy,Besmira Nushi,Pablo Ribalta

Main category: cs.CL

TL;DR: Proposes CoDeC, a practical and accurate method for detecting and quantifying training data contamination in large language models.

Details

Motivation: Detect data contamination by distinguishing data memorized during training from data outside the training distribution.

Method: Identify contamination by measuring how in-context learning affects model performance, exploiting the fact that in-context examples affect confidence differently on seen and unseen datasets.

Result: CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and it reveals strong evidence of memorization in open-weight models.

Conclusion: CoDeC is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations for detecting training data contamination.

Abstract: We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
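
The core signal described above, whether in-context examples raise or lower model confidence, can be turned into a number in many ways. The sketch below uses one simple aggregation: the fraction of samples whose confidence drops once context is added. The abstract does not define how CoDeC computes its score, so this aggregation, the function name `contamination_score`, and the input format are all assumptions.

```python
def contamination_score(conf_without_context, conf_with_context):
    """Fraction of samples whose confidence drops when in-context examples
    from the same dataset are prepended.

    Both arguments are per-sample confidences (for example, mean token
    probability of the reference text), assumed to be precomputed by the
    caller. A score near 1 would suggest the dataset was seen in training
    under this assumed reading; a score near 0 would suggest it was not.
    """
    if len(conf_without_context) != len(conf_with_context):
        raise ValueError("both inputs must cover the same samples")
    if not conf_without_context:
        raise ValueError("at least one sample is required")
    drops = sum(
        1
        for base, with_ctx in zip(conf_without_context, conf_with_context)
        if with_ctx < base
    )
    return drops / len(conf_without_context)


score = contamination_score(
    conf_without_context=[0.9, 0.8, 0.7, 0.6],
    conf_with_context=[0.7, 0.85, 0.5, 0.6],
)
```

A fraction is used here instead of a mean difference so that a few extreme samples cannot dominate the result.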

[13] Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models

Jiasen Zheng,Huajun Zhang,Xu Yan,Ran Hao,Chong Peng

Main category: cs.CL

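TL;DR: This paper proposes a fine-tuning method that combines contrastive distillation with noise-robust training to improve the safety alignment and robustness of large language models.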

Details

Motivation: Address the limitations of large language models in safety alignment and in robustness to input perturbations.

Method: Freeze the backbone model, transfer the teacher model's knowledge boundaries to the student through distillation, and introduce noise perturbations and robust optimization constraints, forming a unified objective with a distillation loss, a robustness loss, and a regularization term.

Result: Experiments show that the method significantly outperforms existing baselines on knowledge transfer, robustness, and safety metrics, and performs stably across computation budgets and data noise conditions.

Conclusion: The method enriches the theory of parameter-efficient fine-tuning and offers a new approach to building safer and more trustworthy alignment mechanisms.

Abstract: This paper addresses the limitations of large-scale language models in safety alignment and robustness by proposing a fine-tuning method that combines contrastive distillation with noise-robust training. The method freezes the backbone model and transfers the knowledge boundaries of the teacher model to the student model through distillation, thereby improving semantic consistency and alignment accuracy. At the same time, noise perturbations and robust optimization constraints are introduced during training to ensure that the model maintains stable predictive outputs under noisy and uncertain inputs. The overall framework consists of distillation loss, robustness loss, and a regularization term, forming a unified optimization objective that balances alignment ability with resistance to interference. To systematically validate its effectiveness, the study designs experiments from multiple perspectives, including distillation weight sensitivity, stability analysis under computation budgets and mixed-precision environments, and the impact of data noise and distribution shifts on model performance. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving the best performance across several key metrics. This work not only enriches the theoretical system of parameter-efficient fine-tuning but also provides a new solution for building safer and more trustworthy alignment mechanisms.

[14] Characterizing Selective Refusal Bias in Large Language Models

Adel Khorramrouz,Sharon Levy

Main category: cs.CL

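TL;DR: The study finds that safety guardrails in large language models (LLMs) exhibit "selective refusal" bias across demographic groups: models refuse to generate content targeting some groups more often than others, revealing potential unfairness in safety measures.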

Details

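Motivation: Examine whether LLM safety guardrails introduce or reflect bias across demographic groups, in particular whether requests targeting some groups are refused more readily than others.

Method: Analyze refusal rates for targeted individual and intersectional demographic groups, the types of model responses, and the length of generated refusals to study selective refusal across gender, sexual orientation, nationality, and religion, and test the safety implications through an indirect attack.

Result: LLMs show evidence of selective refusal bias across several attributes, with refusals occurring more often for some groups. An indirect attack that targets previously refused groups further exposes the unevenness of the guardrails.

Conclusion: Current LLM guardrails perform unevenly across demographic groups and may introduce new biases. More equitable and robust safety strategies are needed so that all groups are protected equally.

Abstract: Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.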

[15] Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Rajarshi Haldar,Julia Hockenmaier

Main category: cs.CL

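TL;DR: This paper examines the low intra-rater reliability of large language models (LLMs) used as evaluators of natural language generation (NLG). Their scores are inconsistent and in the worst case almost arbitrary. The paper quantifies this across NLG tasks and benchmarks and discusses how LLM judges can still be used effectively under proper guidelines.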

Details

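Motivation: As NLG is adopted more widely, conventional n-gram and embedding-based metrics struggle to reflect human preferences, so researchers have turned to LLMs as evaluators. The consistency and reliability of their scores remain unclear and call for systematic analysis.

Method: Run repeated experiments on multiple NLG tasks and benchmarks, measure how an LLM judge's scores differ across runs, quantify intra-rater reliability, and analyze the contributing factors to derive usage recommendations.

Result: LLM judges' scores fluctuate considerably across runs, showing low intra-rater reliability and making evaluations unstable or even nearly arbitrary. The inconsistency appears across tasks and datasets.

Conclusion: LLM judges show promise in approximating human preferences, but score inconsistency limits their reliability. Clear usage guidelines must be set and followed for them to be effective in NLG evaluation.

Abstract: As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether LLM judges can still be useful when applied judiciously under proper guidelines.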
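
To make "low intra-rater reliability across runs" concrete, the sketch below computes two simple descriptive statistics over repeated scores from the same judge: the share of items that receive an identical score in every run, and the mean per-item standard deviation. These are illustrative choices, not necessarily the reliability statistics used in the paper. The function name and the example scores are hypothetical.

```python
from statistics import pstdev


def rating_consistency(scores_by_item):
    """Summarize how stable one judge's scores are across repeated runs.

    scores_by_item: list of lists; each inner list holds the scores the same
    judge assigned to one item over several runs (hypothetical data).
    Returns (exact_agreement_rate, mean_per_item_std).
    """
    if not scores_by_item:
        raise ValueError("at least one item is required")
    n_items = len(scores_by_item)
    # Share of items where every run produced the same score.
    exact = sum(1 for runs in scores_by_item if len(set(runs)) == 1) / n_items
    # Average spread of scores within an item (population standard deviation).
    mean_std = sum(pstdev(runs) for runs in scores_by_item) / n_items
    return exact, mean_std


exact_rate, mean_std = rating_consistency([[4, 4, 4], [3, 5, 4], [2, 2, 3]])
```

In this toy input only the first item is scored identically in all three runs, so the exact-agreement rate is one third.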

[16] Probability Distributions Computed by Hard-Attention Transformers

Andy Yang,Anej Svete,Jiaoda Li,Anthony Widjaja Lin,Jonathan Rawski,Ryan Cotterell,David Chiang

Main category: cs.CL

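TL;DR: This paper studies the expressivity of transformers used as language models, characterizing the probability distributions they can express when generating strings autoregressively and probabilistically, and shows how this differs from their expressivity as language recognizers.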

Details

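Motivation: Most expressivity results treat transformers as language recognizers that accept or reject strings, whereas in practice they are mostly used as language models that generate strings autoregressively and probabilistically. Their expressivity in this practical setting needs study.

Method: Through theoretical analysis, characterize the probability distributions that transformer language models can express, and compare expressivity across autoregressive versus non-autoregressive and probabilistic versus non-probabilistic settings.

Result: Making transformer language recognizers autoregressive can sometimes increase their expressivity, and making them probabilistic can break equivalences that hold in the non-probabilistic case.

Conclusion: The paper clarifies what functions transformers can express when used as language models and separates the expressivity of the different settings, giving a theoretical basis for understanding their behavior in practice.

Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.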

[17] Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

Mason Shipton,York Hay Ng,Aditya Khan,Phuong Hanh Hoang,Xiang Lu,A. Seza Doğruöz,En-Shiun Annie Lee

Main category: cs.CL

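TL;DR: This paper extends the URIEL+ linguistic knowledge base by introducing script vectors, integrating Glottolog data, and expanding lineage-based imputation, which reduces feature sparsity and broadens coverage for multilingual research.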

Details

Motivation: URIEL+ suffers from sparse geographic, genetic, and typological features, which limits its usefulness for cross-lingual transfer to low-resource languages.

Method: Introduce script vectors representing the writing systems of 7,488 languages, integrate Glottolog to add 18,710 languages, and expand lineage imputation for 26,449 languages by propagating typological and script features across genealogies.

Result: Script-vector feature sparsity falls by 14%, language coverage grows by up to 19,015 languages (1,007%), imputation quality metrics improve by up to 33%, and performance on cross-lingual transfer tasks improves by up to 6% in certain setups.

Conclusion: The extended URIEL+ is more complete in language coverage and features and is better suited to multilingual research involving low-resource languages.

Abstract: The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.

[18] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Yayue Deng,Jing Ma

Main category: cs.CL

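TL;DR: This paper proposes MemeArena, an agent-based, arena-style evaluation framework for assessing more accurately and without bias how well multimodal large language models (mLLMs) understand multimodal harmfulness in social media memes.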

Details

Motivation: Existing approaches focus mainly on binary classification accuracy and do not probe the nuances of interpreting harmfulness in different contexts, so a finer-grained, context-aware evaluation is needed.

Method: MemeArena simulates diverse interpretive contexts to generate perspective-specific analysis tasks, then integrates the varied viewpoints and reaches consensus among evaluators to compare mLLMs fairly and without bias.

Result: Experiments show that the framework reduces the evaluation biases of judge agents and that its judgments align closely with human preferences, improving the reliability and comprehensiveness of evaluation.

Conclusion: MemeArena provides a more context-aware and impartial way to evaluate mLLMs on multimodal harmfulness understanding, supporting more reliable evaluation of multimodal models.

Abstract: The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.

[19] Identifying the Periodicity of Information in Natural Language

Yulin Ou,Yu Wang,Yang Xu,Hendrik Buschmeier

Main category: cs.CL

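TL;DR: Proposes AutoPeriod of Surprisal (APS), a new method for detecting periodic patterns in the surprisal sequences of natural language. It finds that information in language is markedly periodic and uncovers new periods beyond those of conventional structural units of text.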

Details

Motivation: Investigate whether, and to what degree, natural language exhibits periodic patterns in its encoded information.

Method: Introduce AutoPeriod of Surprisal (APS), which adopts a canonical periodicity detection algorithm to analyze the surprisal sequence of a single document, apply it to several corpora, and confirm newly found periods with harmonic regression models.

Result: A considerable proportion of human language shows strong periodicity in information. New periods are identified that fall outside the distributions of typical structural units in text (such as sentence boundaries and elementary discourse units), and they are confirmed through harmonic regression modeling.

Conclusion: The periodicity of information in language is the joint outcome of structured factors and other driving factors that act over longer distances. The proposed detection method has advantages and shows potential for detecting LLM-generated text.

Abstract: Recent theoretical advances on information density in natural language raise the following question: to what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields two results. First, a considerable proportion of human language demonstrates a strong pattern of periodicity in information. Second, new periods that fall outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential for detecting LLM-generated text are further discussed.
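
As a rough illustration of what detecting a period in a surprisal sequence can mean, the sketch below picks the lag with the highest autocorrelation. It is a simplified stand-in: the abstract does not spell out the algorithm APS adopts, and this sketch performs no significance testing. The function names and the toy sequence are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant sequence has no periodic structure to measure
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag in [2, max_lag] with the highest autocorrelation.

    Ties are resolved in favor of the smallest lag.
    """
    if max_lag < 2 or max_lag >= len(series):
        raise ValueError("max_lag must be at least 2 and shorter than the series")
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Toy per-token surprisal values: one high-surprisal token every four tokens.
toy_surprisal = [5.0, 1.0, 1.0, 1.0] * 6
period = dominant_period(toy_surprisal, max_lag=10)
```

On real text the surprisal values would come from a language model, and multiples of the true period also score highly, which is one reason a dedicated detection algorithm with significance checks is preferable to a bare autocorrelation peak.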

[20] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli,Alireza Salemi,Carrie Ye,Mohamed Abdalla,Hamed Zamani,J Ross Mitchell

Main category: cs.CL

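TL;DR: This paper proposes a new framework for generating long, coherent conversations, builds the BEAM benchmark from it, and introduces the LIGHT framework to improve LLM performance on long-context reasoning tasks.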

Details

Motivation: Existing benchmarks lack narrative coherence, cover narrow domains, and test only simple recall, which makes it hard to evaluate LLMs on tasks that require long-term memory and long-context reasoning. Method: Proposes a framework that automatically generates coherent, topically diverse conversations of up to 10M tokens, together with questions that probe a range of memory abilities; inspired by human cognition, it also proposes LIGHT, a framework comprising a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Result: Experiments show that even LLMs with 1M-token context windows struggle as dialogues lengthen, whereas LIGHT consistently improves performance across models, with an average gain of 3.5%-12.69% over the strongest baselines. An ablation study confirms the contribution of each memory component. Conclusion: LIGHT strengthens the memory and reasoning abilities of LLMs in long-context dialogue, and BEAM offers a more comprehensive way to evaluate those abilities. Abstract: Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.

[21] Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

Rajan Agarwal,Aarush Gupta

Main category: cs.CL

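TL;DR: LLINK is an efficient multilingual alignment method that injects low-resource languages into a frozen decoder as a modality, without modifying the tokenizer or retraining the model, and substantially improves cross-lingual understanding and retrieval.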

Details

Motivation: Existing LLMs underperform on low-resource, non-Latin-script text, mainly because of tokenizer fragmentation and weak cross-lingual coupling. Method: Sentence embeddings from a frozen multilingual encoder are aligned to the decoder's latent space through a lightweight contrastive projector; the vector is then expanded into K soft slots and trained with minimal adapters so that the frozen decoder can consume the signal. Result: LLINK substantially improves bilingual retrieval; in LLM-judged Q&A evaluations it is preferred 81.3% of the time over the base model and 63.6% of the time over direct fine-tuning. It also reduces tokenization inflation and strengthens cross-lingual alignment. Conclusion: Treating low-resource languages as a modality injected into the model offers a practical path to stronger cross-lingual alignment in lightweight LLMs. Abstract: Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.

[22] MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

Kangkun Mao,Jinru Ding,Jiayuan Chen,Mouxiao Bian,Ruiyao Chen,Xinwei Peng,Sijie Ren,Linyang Li,Jie Xu

Main category: cs.CL

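TL;DR: This paper presents MedCalc-Eval, currently the largest benchmark for assessing the medical calculation abilities of LLMs, with more than 700 tasks based on equations and rule-based scoring systems, and develops MedCalc-Env, a reinforcement learning environment that improves model performance on multi-step clinical reasoning.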

Details

Motivation: Existing medical LLM evaluations concentrate on question answering and descriptive reasoning and overlook the quantitative calculation that is critical to clinical decision-making; existing datasets cover few calculation tasks and do not reflect real clinical scenarios. Method: Builds MedCalc-Eval, a benchmark of 700+ medical calculation tasks covering equation-based and rule-based types, and develops MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework that trains multi-step reasoning to improve performance. Result: A Qwen2.5-32B model fine-tuned in MedCalc-Env achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness, though unit conversion, multi-condition logic, and contextual understanding remain challenging. Conclusion: MedCalc-Eval provides a more comprehensive and more challenging benchmark for medical calculation, and combining it with a reinforcement learning environment improves LLM performance on clinical quantitative reasoning. Abstract: As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
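The two task types named in the abstract can be illustrated with their standard textbook definitions. The sketch below is not taken from the MedCalc-Eval code base: function names and argument conventions are assumptions made for illustration, and it is not a clinical tool.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Equation-based task: body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Equation-based task: estimated creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl  # standard correction factor for women

def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Rule-based task: sum of three component scores (total range 3 to 15)."""
    if not (1 <= eye <= 4 and 1 <= verbal <= 5 and 1 <= motor <= 6):
        raise ValueError("component score out of range")
    return eye + verbal + motor

# Example inputs (made up for illustration, not benchmark items).
example_bmi = bmi(70.0, 1.75)
example_crcl = cockcroft_gault(60, 72.0, 1.0, female=False)
example_gcs = glasgow_coma_scale(eye=4, verbal=5, motor=6)
```

The challenges the abstract lists (unit conversion, multi-condition logic) arise before functions like these are ever called: a model first has to extract the inputs from a clinical note and put them in the right units.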

[23] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Deokhyung Kang,Seonjeong Hwang,Daehui Kim,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

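TL;DR: This paper studies why multilingual reasoning models perform worse in low-resource languages and finds that the main cause is the model's failure to correctly understand the meaning of the multilingual input. The authors propose detecting understanding failures and applying a selective translation strategy; experiments show that translating only about 20% of inputs achieves performance close to translating everything.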

Details

Motivation: To explore the root causes of the multilingual reasoning gap, in particular the role played by failures of language understanding. Method: Analyzes understanding failures in multilingual reasoning, evaluates several detection methods, and proposes Selective Translation, which translates the input only when an understanding failure is detected. Result: Understanding failures can be detected effectively; Selective Translation substantially narrows the multilingual reasoning gap and approaches full-translation performance while translating only about 20% of inputs. Conclusion: Understanding failures are the primary cause of the multilingual reasoning gap and can be mitigated through detection and selective translation, offering a workable path toward more equitable multilingual reasoning. Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding: the model's inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
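The control flow of Selective Translation, as the abstract describes it, can be sketched in a few lines. Everything below is a hypothetical illustration: `detect_failure` and `translate_to_english` stand in for the paper's detector and translation system, whose real interfaces are not given in this listing, and the stub implementations exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

def selective_translate(
    questions: List[str],
    detect_failure: Callable[[str], bool],
    translate_to_english: Callable[[str], str],
) -> Tuple[List[str], float]:
    """Translate only the inputs flagged as understanding failures.

    Returns the prompts to pass to the reasoning model and the fraction
    of inputs that were translated.
    """
    prompts, translated = [], 0
    for q in questions:
        if detect_failure(q):
            prompts.append(translate_to_english(q))
            translated += 1
        else:
            prompts.append(q)  # leave the original-language input untouched
    ratio = translated / len(questions) if questions else 0.0
    return prompts, ratio

# Stubs for illustration only: flag inputs marked "hard", "translate" by tagging.
stub_detector = lambda q: q.startswith("hard:")
stub_translator = lambda q: "[EN] " + q

prompts, ratio = selective_translate(
    ["easy: q1", "hard: q2", "easy: q3", "easy: q4", "easy: q5"],
    stub_detector,
    stub_translator,
)
```

Passing the detector and translator as callables keeps the routing logic independent of whichever detection method (the abstract reports supervised approaches work best) is plugged in.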

[24] A Unified Representation Underlying the Judgment of Large Language Models

Yi-Long Lu,Jiajun Song,Wei Wang

Main category: cs.CL

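TL;DR: The study finds a unified evaluative dimension in LLMs, the Valence-Assent Axis (VAA), which jointly encodes subjective value and assent to factual claims and acts as a control signal that steers generation, so that reasoning becomes subordinate to goal-directed justification, leading to systematic bias and hallucination.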

Details

Motivation: To determine whether judgment in LLMs relies on specialized modules or on a unified, domain-general resource, and to reveal how that architecture affects reasoning and judgment. Method: Analyzes decodable neural representations across several LLMs and runs direct intervention experiments to identify the common dimension that dominates evaluative judgments (the VAA) and to test its effect on generation. Result: Diverse evaluative judgments concentrate along a single dimension (the VAA) that jointly encodes subjective valence and assent to factual claims; interventions show that the VAA acts as a control signal that steers generation toward arguments consistent with its evaluative state, even at the cost of factual accuracy. Conclusion: LLMs adopt a convergent architecture whose unified VAA mechanism promotes consistent judgment but lets the target state dominate reasoning, producing systematic bias and hallucination and undermining faithful reasoning. Abstract: A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.

[25] TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

Benedikt Ebing,Christian Goldschmied,Goran Glavaš

Main category: cs.CL

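TL;DR: This paper proposes TransAlign, a new word aligner that uses the encoder of a massively multilingual machine translation model for label projection and substantially outperforms existing word-alignment and non-word-alignment methods on token classification tasks in cross-lingual transfer.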

Details

Motivation: Because most languages and NLP tasks lack sufficient training data, translation-based strategies are widely used for cross-lingual transfer, but existing word alignment methods built on machine translation models perform poorly, so a more effective aligner is needed to improve label projection quality. Method: Proposes TransAlign, which extracts multilingual word alignments from the encoder of a massively multilingual machine translation model and applies them to label projection in translate-test and translate-train settings. Result: TransAlign performs strongly on word alignment and substantially outperforms existing word aligners and state-of-the-art non-word-alignment label projection methods on a range of cross-lingual token classification tasks. Conclusion: The encoder of a multilingual machine translation model can improve word alignment quality, and TransAlign offers a better label projection solution for cross-lingual transfer to low-resource languages. Abstract: In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.

[26] ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations

Zijian Wang,Chang Xu

Main category: cs.CL

TL;DR: This paper proposes ThoughtProbe, a new inference-time framework that uses the hidden reasoning features of LLMs to improve reasoning performance.

Details

Motivation: Existing methods mostly manipulate hidden representations to steer generation, whereas this work uses those representations as discriminative signals to explore the tree-structured response space more effectively. Method: At each node expansion, a classifier scores and ranks candidate paths, and higher-scoring paths are expanded first; once the tree is complete, answers from all branches are collected into a candidate pool, and the final answer is chosen by aggregating the chain-of-thought (CoT) scores of the branches that support each answer. Result: Experiments show significant improvements on multiple arithmetic reasoning benchmarks; the framework both covers and identifies valid reasoning paths. Conclusion: By using the hidden reasoning features of LLMs to guide tree search and answer aggregation, ThoughtProbe markedly improves performance on complex reasoning tasks. Abstract: This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework's comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
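The branch aggregation step can be sketched as follows. This is an illustrative reading of "marginalizes over all supporting branches by aggregating their CoT scores", not the authors' code: summation as the aggregation operator, the function name, and the example scores are all assumptions.

```python
from collections import defaultdict
from typing import Iterable, Tuple

def aggregate_branches(branches: Iterable[Tuple[str, float]]) -> str:
    """Pick the answer whose supporting branches have the largest total score.

    Each item is (final_answer, cot_score) for one completed branch of the
    search tree. Summing per answer is an assumed aggregation rule.
    """
    totals = defaultdict(float)
    for answer, score in branches:
        totals[answer] += score
    if not totals:
        raise ValueError("no completed branches to aggregate")
    return max(totals, key=totals.get)

# Made-up scores: "41" is supported by two moderately scored branches,
# "42" by a single high-scoring one.
best = aggregate_branches([("42", 0.9), ("41", 0.6), ("41", 0.5)])
```

The example shows how aggregation differs from simply taking the top-scoring branch: two branches that agree can outweigh one confident outlier.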

[27] From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle

Tosin Adewumi,Martin Karlsson,Marcus Liwicki,Mikael Sjödahl,Lama Alkhaled,Rihab Gargouri,Nudrat Habib,Franz Hennie

Main category: cs.CL

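TL;DR: This paper systematically surveys applications of natural language processing (NLP) across the entire battery life cycle and proposes a novel technical language processing (TLP) framework suited to the EU's digital battery passport (DBP). Following the PRISMA method, the study assessed 274 papers and selected 66 for in-depth analysis; the results reveal new NLP tasks and potential in stages such as materials discovery, and identify challenges such as the lack of standard benchmarks.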

Details

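Motivation: To give a complete picture of the current state and potential of NLP at every stage of the battery life cycle, filling the gap left by earlier work that covered a single stage or method, and to support emerging needs such as the EU's digital battery passport. Method: Conducts a systematic review under the PRISMA guidelines, retrieving literature from three reputable sources (Google Scholar, IEEE Xplore, and Scopus), screening and analyzing the relevant studies, and including 66 papers in the final review. It also proposes a TLP framework that combines agentic AI with optimized prompts. Result: Identifies emerging NLP tasks in the battery domain and confirms their role in materials discovery and life-cycle management; finds key challenges such as the lack of standard benchmarks; the proposed TLP framework shows potential for addressing them. Conclusion: NLP has broad application prospects across the battery life cycle; standard benchmarks need to be established, and the proposed TLP framework is expected to advance intelligent tooling in the battery domain, particularly for the digital battery passport and general battery predictions. Abstract: We present a comprehensive systematic survey of the application of natural language processing (NLP) along the entire battery life cycle, instead of one stage or method, and introduce a novel technical language processing (TLP) framework for the EU's proposed digital battery passport (DBP) and other general battery predictions. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable databases or search engines, including Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. Consequently, we assessed 274 scientific papers before the critical review of the final 66 relevant papers. We publicly provide artifacts of the review for validation and reproducibility. The findings show that new NLP tasks are emerging in the battery domain, which facilitate materials discovery and other stages of the life cycle. Notwithstanding, challenges remain, such as the lack of standard benchmarks. Our proposed TLP framework, which incorporates agentic AI and optimized prompts, will be apt for tackling some of the challenges.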

[28] Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs

Jiahao Liu,Zijian Wang,Kuo Zhao,Dong Hu

Main category: cs.CL

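TL;DR: This paper proposes IntAttn-Edit, a knowledge editing method that jointly updates MLP and attention modules and uses a knowledge balancing strategy to improve factual knowledge editing in LLMs.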

Details

Motivation: Existing knowledge editing methods focus mainly on MLP modules and ignore attention modules, which leaves residual outdated knowledge and limits editing effectiveness. This work examines the role of attention modules in knowledge storage and aims for more balanced and effective editing. Method: Knowledge localization experiments on advanced LLMs show that attention modules, especially in earlier layers, play a substantial role in storing factual knowledge; building on this, IntAttn-Edit extends the associative memory paradigm to update MLP and attention modules jointly, with a knowledge balancing strategy that allocates update magnitudes in proportion to each module's contribution. Result: Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods; analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings. Conclusion: Attention modules play an important role in factual knowledge storage, and IntAttn-Edit improves overall editing performance by editing MLP and attention modules jointly under a knowledge balancing strategy. Abstract: Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module's measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.
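The proportional allocation behind the knowledge balancing strategy can be illustrated with a small normalization step. This is only a sketch of the stated idea (update magnitude proportional to measured contribution): the function name, the contribution values, and the notion of a single scalar "total magnitude" are assumptions, and the paper's actual update rule is not reproduced here.

```python
from typing import Dict

def allocate_update_magnitudes(
    contributions: Dict[str, float], total_magnitude: float = 1.0
) -> Dict[str, float]:
    """Split a total update budget across modules in proportion to their
    measured contribution to knowledge storage."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive value")
    return {
        name: total_magnitude * value / total
        for name, value in contributions.items()
    }

# Hypothetical localization scores for one layer (not from the paper).
shares = allocate_update_magnitudes({"mlp": 3.0, "attn": 1.0})
```

With these made-up scores the MLP receives three quarters of the update and the attention module one quarter, instead of the MLP-only update that the abstract identifies as the source of residual outdated knowledge.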

[29] Awal -- Community-Powered Language Technology for Tamazight

Alp Öktem,Farida Boudichat

Main category: cs.CL

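TL;DR: This paper presents Awal, a community-driven project for developing language technology resources for Tamazight. The awaldigital.org platform lets users contribute translation and voice data, but participation is held back by limited confidence in written Tamazight and by standardization issues; actual contributions come mainly from linguists and activists, and the modest data volume reflects the limits of crowdsourcing in complex sociolinguistic contexts.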

Details

Motivation: To address the underrepresentation of Tamazight in digital spaces and the scarcity of data for it. Method: Uses the awaldigital.org platform, launched in 2024, to collect translation and voice data through community collaboration, and analyzes 18 months of user engagement. Result: Collected 6,421 translation pairs and 3 hours of speech data; participation was concentrated among linguists and activists, with little engagement from ordinary speakers, exposing the limits of standard crowdsourcing approaches in this language setting. Conclusion: Despite positive community reception, the scale of data contribution remained limited, indicating that crowdsourcing strategies must be adapted to languages such as Tamazight that have complex sociolinguistic contexts. Abstract: This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, the awaldigital.org platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions -- 6,421 translation pairs and 3 hours of speech data -- highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.

[30] Dynamic Affective Memory Management for Personalized LLM Agents

Junfeng Lu,Yueyan Li

Main category: cs.CL

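TL;DR: Proposes a new memory management system built on a Bayesian-inspired memory update algorithm that dynamically updates a memory vector database by minimizing memory entropy, improving personalized AI agents in affective scenarios.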

Details

Motivation: Existing agent systems rely on external memory databases but suffer from memory redundancy, memory staleness, and poor memory-context integration, largely because they lack an effective mechanism for updating memory during interaction. Method: Designs a Bayesian-inspired memory update algorithm that introduces the concept of memory entropy, enabling the agent to maintain a dynamically updated memory vector database on its own. Also proposes DABench, a benchmark focused on emotional expression and emotional change toward objects. Result: Experiments show that the system performs better in personalization, logical coherence, and accuracy. Ablation studies validate the Bayesian-inspired update mechanism's effectiveness in alleviating memory bloat. Conclusion: The work offers new ideas for the design of long-term memory systems, with particular application potential in affective human-computer interaction. Abstract: Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory-context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian-inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system's effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian-inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long-term memory systems.

[31] VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

Xuan Gong,Senmiao Wang,Hanbo Huang,Ruoyu Sun,Shiyu Liang

Main category: cs.CL

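TL;DR: This paper proposes VCORE, an optimization-theoretic reweighting framework for chain-of-thought supervision that uses variance control to allocate supervision adaptively across the tokens of a reasoning trajectory, improving the reasoning generalization of large models on complex tasks such as mathematics and coding.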

Details

Motivation: The standard cross-entropy loss treats all tokens in long chain-of-thought reasoning equally, ignoring their different contributions along the reasoning trajectory, which leads to misallocated supervision and weak generalization. Method: Reformulates CoT supervision as a constrained optimization problem and, from an optimization-theoretic perspective, proposes the VCORE framework, which adaptively adjusts per-token supervision weights by controlling the variance of gradient updates. Result: On Qwen3-series models and LLaMA-3.1-8B-Instruct, VCORE clearly outperforms existing token reweighting methods in both in-domain and out-of-domain settings, achieves substantial gains on mathematical and coding benchmarks, and provides a more effective initialization for a subsequent reinforcement learning stage. Conclusion: Through a principled, adaptive supervision allocation mechanism, VCORE improves the training effectiveness and generalization of LLMs on complex reasoning tasks and offers a new approach to strengthening model reasoning. Abstract: Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.

[32] Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

Chenyang Shao,Sijian Ren,Fengli Xu,Yong Li

Main category: cs.CL

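TL;DR: Proposes an efficient collaborative reasoning framework in which diffusion language models (DLMs) generate candidate thoughts and large language models (LLMs) evaluate their quality, improving performance on complex reasoning tasks while reducing computational overhead.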

Details

Motivation: Because of their autoregressive generation paradigm, LLMs scale inefficiently with test-time computation on reasoning tasks, consuming large amounts of compute for limited gains, so a more efficient reasoning method is needed. Method: Uses DLMs to generate diverse intermediate thoughts efficiently through parallel denoising in a single forward pass, then has LLMs evaluate the quality of those thoughts, forming a collaborative reasoning framework. Result: Across multiple benchmarks, the framework performs strongly on complex reasoning tasks and strikes a better balance between performance and computational efficiency than conventional methods. Conclusion: The study points to a new direction for improving the reasoning efficiency of language models: combining the efficient generation of DLMs with the quality evaluation of LLMs yields collaborative reasoning with high performance and low computational burden. Abstract: In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs' autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at https://anonymous.4open.science/r/Diffuse-Thinking-EC60.

[33] The aftermath of compounds: Investigating Compounds and their Semantic Representations

Swarang Joshi

Main category: cs.CL

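TL;DR: This study compares how well static word vectors (GloVe) and contextualized embeddings (BERT) align with human semantic judgments in the processing of English compound words, finding that BERT captures compositional semantics better than GloVe and that predictability ratings are effective predictors of semantic transparency.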

Details

Motivation: To examine how far computational embedding models reflect human judgments of compound word meaning, advancing computational psycholinguistics. Method: Generates embeddings with GloVe and BERT and combines them with association strength from the Edinburgh Associative Thesaurus, BNC frequency, and LaDEC predictability to compute lexeme meaning dominance (LMD) and semantic transparency (ST), which are compared with human ratings through Spearman correlation and regression analyses. Result: BERT embeddings capture compositional semantics better than GloVe; predictability ratings are strong predictors of semantic transparency in both human and model data. Conclusion: Contextualized embeddings such as BERT are closer to human semantic processing, which offers useful guidance for embedding-based semantic modeling. Abstract: This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman's correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.

[34] Effect of Domain Generalization Techniques in Low Resource Systems

Mahi Aminu,Chisom Chibuike,Fatimo Adebanjo,Omokolade Awosanya,Samuel Oyeneye

Main category: cs.CL

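TL;DR: This study examines two causal domain generalization methods on low-resource natural language tasks: counterfactual data augmentation for improved robustness, and DINER, an invariant causal representation learning framework adapted to a multilingual setting.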

Details

Motivation: Because of distribution shift, conventional machine learning models perform poorly in real-world scenarios, especially in low-resource settings where data is scarce and domain diversity is limited, so cross-domain generalization needs to improve. Method: Applies two causal domain generalization techniques: first, causal data augmentation (CDA), which generates semantically equivalent counterfactual examples and is applied to sentiment classification on the NaijaSenti Twitter corpus; second, invariant causal representation learning (ICRL), obtained by extending the DINER framework to a multilingual setting. Result: Counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while DINER-based causal representation learning improves out-of-distribution performance in multilingual sentiment analysis, with gains that vary across languages. Conclusion: Both causal domain generalization methods improve robustness to unseen domains on low-resource natural language tasks, supporting the potential of causal mechanisms for domain generalization. Abstract: Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.

[35] BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization

Desta Haileselassie Hagos,Legand L. Burge,Anietie Andy,Anis Yazidi,Vladimir Vlassov

Main category: cs.CL

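TL;DR: This paper proposes BiSparse-AAS, a new framework that combines sparse attention, adaptive spans, and bilinear attention to address the scalability of Transformers in long-document summarization.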

Details

Motivation: The quadratic computational complexity of Transformer models limits their use in long-text summarization, so a more efficient and scalable approach is needed. Method: Introduces the BiSparse-AAS framework, which uses sparse attention to reduce computational cost, adaptive spans to adjust attention ranges dynamically, and bilinear attention to model complex token interactions within the selected context. Result: Average ROUGE scores improve by about 68.1% on CNN/DailyMail and 52.6% on XSum, with strong performance on OpenWebText and Gigaword as well. Conclusion: By addressing efficiency, scalability, and long-sequence modeling together, BiSparse-AAS provides a unified and practical solution for real-world text summarization. Abstract: Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.

[36] SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps

Neha Srikanth,Victor Bursztyn,Puneet Mathur,Ani Nenkova

Main category: cs.CL

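TL;DR: This paper proposes SQLSpace, an interpretable and generalizable representation for text-to-SQL examples, and demonstrates its value for benchmark comparison, model performance analysis, and query rewriting.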

Details

Motivation: Text-to-SQL research lacks a unified representation that reveals differences in dataset composition and model behavior, which limits deeper understanding and optimization of models. Method: Designs SQLSpace, a human-interpretable, compact representation that extracts structured features from text-to-SQL examples with minimal human intervention, and applies it to benchmark comparison, fine-grained performance analysis, and query rewriting based on correctness estimation. Result: SQLSpace reveals compositional differences between benchmark datasets, exposes performance patterns that accuracy alone does not show, and supports modeling of query success, which in turn improves model performance. Conclusion: SQLSpace provides an effective analysis framework for text-to-SQL that helps in understanding model behavior more deeply and in improving it. Abstract: We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.

[37] Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

Maria Lizarazo Jimenez,Ana Gabriela Claros,Kieran Green,David Toro-Tobon,Felipe Larios,Sheena Asthana,Camila Wenczenovicz,Kerly Guevara Maldonado,Luis Vilatuna-Andrango,Cristina Proano-Velez,Satya Sai Sri Bandi,Shubhangi Bagewadi,Megan E. Branda,Misk Al Zahidy,Saturnino Luz,Mirella Lapata,Juan P. Brito,Oscar J. Ponce-Ponte

Main category: cs.CL

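TL;DR: This study proposes and develops a Patient-Centered Summaries (PCS) framework intended to make AI-generated clinical summaries attend to patients' values, preferences, and contextual information. Annotation guidelines were developed with input from patients and clinicians, and five open-source LLMs were evaluated on the task; the results show that current models still fall short of human experts in correctness and patient-centeredness.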

Details

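Motivation: Existing LLMs generating clinical summaries focus mostly on patients' biology and overlook their values, preferences, and concerns, which makes patient-centered care hard to achieve. A new standard for AI clinical summarization, Patient-Centered Summaries (PCS), is therefore needed to better integrate personal and contextual information. Method: A mixed-methods design: semi-structured interviews first collected the views of patients and clinicians on the content and structure of clinical summaries and informed annotation guidelines; eight clinicians then used the guidelines to create gold-standard PCS for 88 atrial fibrillation consultations; 16 consultations were used to refine the prompt, and five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated zero-shot and few-shot summaries for the remaining 72 consultations; evaluation used ROUGE-L, BERTScore, and qualitative metrics. Result: Patients emphasized lifestyle routines, social support, recent stressors, and care values; clinicians wanted concise summaries with functional, psychosocial, and emotional context. The best zero-shot results came from Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot result came from Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Model summaries were close to human level in completeness and fluency but behind human experts in correctness and patient-centeredness. Conclusion: Current open-source LLMs have made progress on patient-centered clinical summaries, approaching human performance in fluency and content coverage, but remain weaker in correctness and in genuinely reflecting patient values. Further work on model training and prompting strategies is needed for truly patient-centered AI-assisted clinical documentation. Abstract: Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. 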
Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.

[38] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

Malik H. Altakrori,Nizar Habash,Abdelhakim Freihat,Younes Samih,Kirill Chirkunov,Muhammed AbuOdeh,Radu Florian,Teresa Lynn,Preslav Nakov,Alham Fikri Aji

Main category: cs.CL

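TL;DR: Proposes DialectalArabicMMLU, the first large-scale, multi-domain benchmark for Arabic dialects, with 15K dialectal QA pairs (22K when English and MSA are included), for evaluating how well large models understand dialects.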

Details

Motivation: Existing benchmarks focus mainly on Modern Standard Arabic (MSA), while the Arabic dialects used widely in everyday communication are underrepresented, so more comprehensive and inclusive evaluation resources are needed. Method: Building on the MMLU-Redux framework, 3K question-answer pairs were manually translated and adapted into five major Arabic dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding 15K dialectal QA pairs across 32 academic and professional domains. Result: Evaluation of 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) shows substantial performance variation across dialects, revealing clear gaps in dialectal generalization. Conclusion: DialectalArabicMMLU is the first unified, human-curated benchmark for Arabic dialects; it enables systematic evaluation of dialectal understanding and supports more inclusive model development. Abstract: We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

[39] Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

Yinghao Luo,Lang Zhou,Amrish Jhingoer,Klaske Vliegenthart Jongbloed,Carlijn Jordans,Ben Werkhoven,Tom Seinen,Erik van Mulligen,Casper Rokx,Yunlei Li

Main category: cs.CL

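TL;DR: This study examines how further pre-training on domain-specific corpora affects medical NLP performance in low-resource languages; domain adaptation significantly improved results, and cross-lingual transfer was observed.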

Details

Motivation: Medical NLP tools are scarce for low-resource languages. Multilingual BERT could help narrow the language gap, but the area remains underexplored, so the effect of domain adaptation on model performance needs to be studied. Method: For Dutch, Romanian, and Spanish, four further pre-training experiments on medical-domain corpora produced domain-specific models, which were then fine-tuned and evaluated on three downstream tasks (automated patient screening in Dutch clinical notes, and named entity recognition in Romanian and Spanish clinical notes). Result: Domain adaptation significantly improved task performance; the clinical-domain model outperformed the more general biomedical-domain model; cross-lingual transfer was observed; and potential causes of the performance differences were analyzed. Conclusion: Domain adaptation and cross-lingual transfer are feasible for medical NLP in low-resource languages and can guide the development of multilingual medical NLP systems where training data is scarce. Abstract: In multilingual healthcare applications, the availability of domain-specific natural language processing (NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising means of mitigating the language gap, medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: automated patient screening in Dutch clinical notes, and named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. We also observed evidence of cross-lingual transferability and conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. 
Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.

[40] Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization

Inacio Vieira,Antonio Castaldo,James O'Doherty,Sheila Castilho

Main category: cs.CL

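TL;DR: Proposes using CPO for data-efficient domain adaptation: the base model's raw output serves as the "rejected" sample and the human-approved translation memory entry as the "chosen" sample, forming preference pairs for training. With only 14.7k samples, performance approaches that of SFT on more than 160k samples.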

Details

Motivation: Conventional SFT for domain adaptation is data-hungry and costly, so a more efficient way to use limited human-approved data for a specific domain is needed. Method: Uses CPO (Contrastive Preference Optimization): translations generated by the base model itself are the 'rejected' samples and human-approved translation memory (TM) entries are the 'chosen' samples, and the resulting preference data is used for training. Result: In English-Brazilian Portuguese and English-Korean translation, only 14.7k preference pairs bring the model close to the performance of SFT on 160k+ samples, a significant gain in data efficiency. Conclusion: CPO uses data efficiently, suits generative tasks that have a golden reference, achieves effective domain adaptation with very little annotated data, and can be extended to other generative tasks. Abstract: LLMs often require adaptation to domain-specific requirements, a process that can be expensive when relying solely on SFT. We present an empirical study on applying CPO to simulate a post-editing workflow for data-efficient domain adaptation. Our approach synthesizes preference pairs by treating the base model's own raw output as the 'rejected' translation and the human-approved TM entry as the 'chosen' one. This method provides direct feedback on the model's current knowledge, guiding it to align with domain-specific standards. Experiments in English-Brazilian Portuguese and English-Korean show that, by using just 14.7k preference pairs, the model achieves performance close to that of a model trained on 160k+ samples with SFT, demonstrating significant data efficiency. Although we showcase its effectiveness in MT, this application of CPO naturally generalizes to other generative tasks where a model's initial drafts can serve as a contrastive signal against a golden reference.
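
The pair construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `translate` callable stands in for the base model, the `prompt`/`chosen`/`rejected` field names are a hypothetical record layout, and dropping pairs whose draft already equals the reference is an added assumption that the abstract does not mention.

```python
from typing import Callable, Dict, List, Tuple


def build_preference_pairs(
    tm_entries: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each human-approved TM target ('chosen') with the base model's raw draft ('rejected')."""
    pairs = []
    for source, approved in tm_entries:
        draft = translate(source)  # hypothetical call into the base model
        # Assumption: a draft identical to the reference carries no contrastive signal, so skip it.
        if draft.strip() == approved.strip():
            continue
        pairs.append({"prompt": source, "chosen": approved, "rejected": draft})
    return pairs
```

In use, `tm_entries` would hold (source, approved target) tuples from the translation memory, and the returned records would be passed to whatever preference-optimization trainer is in use.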

[41] MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

Qi Luo,Xiaonan Li,Yuxin Wang,Tingshuo Fan,Yuan Li,Xinchi Chen,Xipeng Qiu

Main category: cs.CL

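TL;DR: Proposes MARAG-R1, a reinforcement-learned multi-tool RAG framework that dynamically combines several retrieval mechanisms to improve LLM performance on corpus-level reasoning tasks.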

Details

Motivation: Existing RAG systems rely on a single retriever with a fixed number of retrieved items, which limits access to comprehensive external information and makes complex tasks that need corpus-level reasoning hard to support. Method: Provides four retrieval tools (semantic search, keyword search, filtering, and aggregation) and uses two-stage training, supervised fine-tuning followed by reinforcement learning, so that the LLM learns to coordinate these tools dynamically. Result: Substantially outperforms strong baselines on GlobalQA, HotpotQA, and 2WikiMultiHopQA and sets a new state of the art on corpus-level reasoning tasks. Conclusion: MARAG-R1 overcomes the bottleneck of conventional single-retriever RAG, giving broader and more precise information access and improving the LLM's adaptability to new knowledge and its reasoning ability. Abstract: Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; however, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools -- semantic search, keyword search, filtering, and aggregation -- and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.

[42] SpecAttn: Speculating Sparse Attention

Harsh Shah

Main category: cs.CL

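TL;DR: SpecAttn is a new training-free sparse attention method that uses the draft model's attention weights from speculative decoding to cut redundant computation in pre-trained Transformers, substantially reducing KV cache accesses while keeping output quality good.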

Details

Motivation: LLM inference is bottlenecked by the quadratic complexity of self-attention, especially as context length grows, and an efficient, training-free sparse attention method is needed to relieve this. Method: SpecAttn builds on existing speculative decoding: it uses the draft model's attention weights to identify tokens that matter for the target model, combining KL-divergence-based layer alignment, a sorting-free top-p token selection algorithm, and dynamic KV cache pruning. Result: On PG-19, KV cache accesses drop by more than 75% with only a 15.29% increase in perplexity, significantly outperforming existing sparse attention methods. Conclusion: SpecAttn shows that speculative execution can be extended to provide approximate verification, greatly improving inference efficiency without significant performance loss. Abstract: Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
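
To make the top-p selection criterion concrete, the sketch below keeps the smallest set of tokens whose draft attention mass reaches a threshold `p`. It is only an illustration of the criterion: it sorts the weights on the CPU for clarity, whereas the abstract describes a GPU-optimized sorting-free algorithm, and the function name and default threshold are assumptions.

```python
from typing import List


def top_p_token_indices(attn_weights: List[float], p: float = 0.9) -> List[int]:
    """Return indices of the smallest token set whose normalized attention mass reaches p."""
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    # Visit tokens from highest to lowest weight (a plain sort, unlike the paper's sorting-free kernel).
    order = sorted(range(len(attn_weights)), key=lambda i: attn_weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += attn_weights[i] / total
        if mass >= p:
            break
    return sorted(kept)  # positions of KV cache entries to keep
```

Tokens outside the returned set would be the candidates for KV cache pruning in the target model.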

[43] Culture Cartography: Mapping the Landscape of Cultural Knowledge

Caleb Ziems,William Held,Jane Yu,Amir Goldberg,David Grusky,Diyi Yang

Main category: cs.CL

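TL;DR: Proposes CultureCartography, a mixed-initiative method in which an LLM seeds annotation with questions it answers with low confidence and humans fill in the gaps; it surfaces culture-specific knowledge effectively and improves model performance on culture-related tasks.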

Details

Motivation: LLMs may not acquire culture-specific knowledge during pre-training, so knowledge that is both salient to local users and unknown to the model has to be found. Method: Proposes CultureCartography: the LLM initiates annotation with questions for which it has low-confidence answers, and humans can edit and supplement the answers, giving human-machine collaborative knowledge exploration; the tool CultureExplorer implements this workflow. Result: Compared with conventional question-answer annotation, CultureExplorer more effectively produces cultural knowledge that leading models such as DeepSeek R1 and GPT-4o are missing; fine-tuning Llama-3.1-8B on this data raises accuracy on culture benchmarks by up to 19.2%. Conclusion: Mixed-initiative collaboration uncovers culture-specific knowledge more effectively and significantly improves LLM performance in cross-cultural settings. Abstract: To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.

[44] Continuous Autoregressive Language Models

Chenze Shao,Darren Li,Fandong Meng,Jie Zhou

Main category: cs.CL

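TL;DR: This paper proposes Continuous Autoregressive Language Models (CALM), which compress K discrete tokens into one continuous vector and so shift generation from token-by-token to vector-by-vector, sharply reducing the number of generation steps and the compute cost.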

Details

Motivation: Conventional LLMs are limited by the inefficiency of token-by-token generation; breaking this bottleneck requires raising the amount of semantic information produced per step. Method: A high-fidelity autoencoder compresses K tokens into a single continuous vector, language modeling is performed in the continuous vector space, and a likelihood-free framework is built for training, evaluation, and sampling. Result: Experiments show that CALM keeps strong performance while greatly reducing compute; generation steps drop by a factor of K, and reconstruction accuracy exceeds 99.9%. Conclusion: Next-vector prediction is an efficient and scalable new paradigm for language models and opens a path towards ultra-efficient language models. Abstract: The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.

cs.CV [Back]

[45] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Fenfen Lin,Yesheng Liu,Haiyu Xu,Chen Yue,Zheqi He,Mingxuan Zhao,Miguel Hu Chen,Jiakang Liu,JG Yao,Xi Yang

Main category: cs.CV

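TL;DR: This paper proposes MeasureBench, a benchmark for visual measurement reading that covers real and synthetic images, together with an extensible data synthesis pipeline. Evaluation shows that current vision-language models perform poorly on measurement reading and are especially weak at fine-grained spatial grounding such as locating pointers.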

Details

Motivation: Current vision-language models do poorly on measurement reading, a task humans complete with ease, and lack understanding of fine spatial relations; a dedicated benchmark is needed to evaluate and improve visual numeracy. Method: Builds the MeasureBench benchmark from real and synthetic images of measuring instruments; develops a procedural data synthesis pipeline with controllable appearance parameters (pointers, scales, lighting, and so on); evaluates a range of mainstream VLMs and tries reinforcement learning on synthetic data. Result: Even advanced VLMs perform poorly on measurement reading, with pointer localization errors throughout; reinforcement learning helps on synthetic data but generalizes poorly to real images; the results expose a fundamental limitation of current VLMs in fine-grained spatial grounding. Conclusion: Current vision-language models remain clearly deficient in precise spatial perception and visual numeracy. Future work needs to strengthen localization and parsing of fine visual structure to close the gap between recognizing numbers and measuring the world. Abstract: Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising ones for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.

[46] PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT

Rochak Dhakal,Chen Zhao,Zixin Shi,Joyce H. Keyak,Tadashi S. Kaneko,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Weihua Zhou

Main category: cs.CV

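TL;DR: Proposes a domain-adaptive Transformer segmentation framework built on 3D TransUNet that uses adversarial and statistical alignment to address domain shift in multi-center QCT images, achieving robust proximal femur segmentation.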

Details

Motivation: Domain shift caused by differences in scanners, reconstruction parameters, and patient populations across institutions makes deep learning models generalize poorly in practice, limiting the reproducibility of multi-center osteoporosis research and quantitative analysis. Method: Adds a Gradient Reversal Layer (GRL) for adversarial alignment to a 3D TransUNet and combines it with Maximum Mean Discrepancy (MMD) for statistical alignment, reducing distribution differences between institutions and improving cross-domain generalization. Result: The model was trained and validated on a large multi-center cohort of 1,024 scans from Tulane University and 384 scans from Rochester, and is reported to show excellent segmentation performance and cross-site stability. Conclusion: The proposed dual alignment mechanism mitigates domain shift, supports reliable cross-institution QCT analysis, and advances multi-center orthopedic imaging research. Abstract: Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT image scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
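
The statistical alignment term can be illustrated with a small, dependency-free estimate of squared Maximum Mean Discrepancy. This is a sketch under stated assumptions: the RBF kernel, the `gamma` bandwidth, and the biased estimator are common choices, not details taken from the paper, and in training the inputs would be feature vectors from the two institutions rather than toy lists.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The value is zero when both samples are identical and grows as the two feature distributions separate, which is why minimizing it pulls the two sites' features together.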

[47] DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting

Moonsoo Jeong,Dongbeen Kim,Minseong Kim,Sungkil Lee

Main category: cs.CV

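TL;DR: Proposes DC4GS, an adaptive density control method driven by Directional Consistency (DC), which uses the angular consistency of gradients to guide primitive splitting in 3D Gaussian Splatting, reducing redundant splits and improving reconstruction accuracy.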

Details

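Motivation: Conventional adaptive density control splits primitives based only on the magnitude of positional gradients and ignores the directional information of local structure, so splits are imprecise and redundant. Method: Introduces Directional Consistency (DC) into adaptive density control, using the angular coherence of gradients to decide whether a split is needed and to choose split positions so that sub-primitives align better with local structure. Result: Compared with the conventional method, DC4GS reduces the number of primitives by up to 30% in experiments while significantly improving reconstruction fidelity. Conclusion: By incorporating directional consistency, DC4GS improves the representation efficiency and reconstruction quality of 3D Gaussian Splatting and reduces redundant primitives. Abstract: We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives align with the local structures better than under the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (by up to 30% in our experiments) compared with the existing ADC, and also greatly enhances reconstruction fidelity.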
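
One simple way to quantify the angular coherence of a set of gradients is the mean resultant length of their unit vectors, sketched below. This is an assumed stand-in chosen for illustration; the abstract does not give the paper's exact DC definition or how the score enters the split decision.

```python
import math
from typing import List, Sequence


def angular_coherence(grads: Sequence[Sequence[float]]) -> float:
    """Mean resultant length of unit-normalized gradients.

    Returns 1.0 when all gradients point the same way and approaches 0.0
    when their directions cancel out. Zero-length gradients are ignored.
    """
    units: List[List[float]] = []
    for g in grads:
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0:
            units.append([c / norm for c in g])
    if not units:
        return 0.0
    dim = len(units[0])
    mean = [sum(u[d] for u in units) / len(units) for d in range(dim)]
    return math.sqrt(sum(m * m for m in mean))
```

Unlike a magnitude-only criterion, this score separates a primitive whose gradients agree in direction from one pulled in conflicting directions, even when the gradient magnitudes are the same.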

[48] Scale-Aware Curriculum Learning for Data-Efficient Lung Nodule Detection with YOLOv11

Yi Luo,Yike Guo,Hamed Hooshangnejad,Kai Ding

Main category: cs.CV

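TL;DR: Proposes SACL, a new adaptive curriculum learning method that improves lung nodule detection when data is limited; experiments show it outperforms the conventional approach across different data scales.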

Details

Motivation: Existing deep learning methods perform poorly in clinical settings with limited annotated data, and conventional static curriculum learning strategies fail when data is scarce. Method: Proposes Scale Adaptive Curriculum Learning (SACL), with three mechanisms (adaptive epoch scheduling, hard sample injection, and scale-aware optimization), evaluated on the LUNA25 dataset with YOLOv11. Result: On the full dataset SACL performs on par with static curriculum learning, but with 10%, 20%, and 50% of the training data it improves mAP50 over the baseline by 4.6%, 3.5%, and 2.0% respectively. Conclusion: SACL improves training under data scarcity without modifying the model architecture and offers a practical lung nodule detection solution for healthcare settings. Abstract: Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms: (1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.

[49] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

Md. Mehedi Hassan,Shafqat Alam,Shahriar Ahmed Seam,Maruf Ahmed

Main category: cs.CV

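TL;DR: Proposes the unified multi-stream SYNAPSE-Net framework for automated segmentation of heterogeneous brain lesions in multi-modal MRI, combining CNN encoders, a Swin Transformer bottleneck, and dynamic cross-modal attention fusion, with state-of-the-art results on several public datasets.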

Details

Motivation: Existing deep learning models are mostly specialized 'point solutions' with poor generalization and high performance variance, which limits their clinical reliability, so a unified framework that is both general and robust is needed. Method: Designs a hybrid architecture with multi-stream CNN encoders, a Swin Transformer bottleneck, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder, trained with a variance reduction strategy that combines pathology-specific data augmentation and difficulty-aware sampling. Result: Leading results on three public datasets: DSC of 0.831 (HD95 = 3.03) on WMH, the best boundary accuracy on ISLES 2022 (HD95 = 9.69), and the highest tumor core DSC on BraTS 2020 (0.8651). Conclusion: The proposed unified adaptive framework shows excellent segmentation performance and robustness across multiple brain pathologies and has good prospects for clinical use. Abstract: Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized 'point solutions' that lack generalization and show high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology-specific data augmentation and a difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with the HD95 value of 3.03 in the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. 
The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.

[50] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

Anam Fatima,Yi Yu,Janak Kapuriya,Julien Lalanne,Jainendra Shukla

Main category: cs.CV

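TL;DR: Proposes the Semantic Frame Aggregation-based Transformer (SFAT) for generating live video comments, which combines visual-text multimodal information and weights video frames by their semantic relevance to the viewer conversation.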

Details

Motivation: Existing methods overlook prioritizing the video frames most relevant to viewer interaction when generating live comments, so the comments lack contextual relevance. Method: Proposes the SFAT model, which uses CLIP's visual-text multimodal knowledge, assigns weights to video frames by semantic relevance, and emphasizes important frames through a weighted sum; a comment decoder with cross-attention fuses chat and video context to generate comments. Result: SFAT is shown to be effective and to outperform existing methods on a newly built large-scale English multimodal dataset (from Twitch, covering 11 categories, 438 hours, and 3.2 million comments). Conclusion: SFAT generates contextually relevant live comments more effectively, and the new dataset is an important resource for research on English live commenting. Abstract: Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
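
The frame-weighting idea can be sketched as a relevance-weighted sum of frame embeddings. This is a minimal illustration rather than the SFAT implementation: cosine similarity, softmax normalization, and the `temperature` parameter are assumptions, and the embeddings are taken to live in a shared space such as the one CLIP provides.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def aggregate_frames(
    frame_embs: Sequence[Sequence[float]],
    chat_emb: Sequence[float],
    temperature: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Weight each frame by softmax(similarity to the chat embedding) and return (weights, weighted sum)."""
    sims = [cosine(f, chat_emb) / temperature for f in frame_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(chat_emb)
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_embs)) for d in range(dim)]
    return weights, pooled
```

Frames whose embeddings align with the ongoing chat receive larger weights, so the pooled vector is dominated by the frames viewers are actually talking about.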

[51] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

Arghavan Rezvani,Xiangyi Yan,Anthony T. Wu,Kun Han,Pooya Khosravi,Xiaohui Xie

Main category: cs.CV

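TL;DR: Proposes MoME, a vision-language mixture-of-experts model for medical image segmentation that combines multi-scale visual features with text embeddings and performs strongly across 10 datasets.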

Details

Motivation: To bring the Mixture of Experts (MoE) paradigm, which has been successful in LLMs, to medical vision-language tasks and improve medical image analysis. Method: Designs the MoME architecture, which uses multi-scale visual features and text embeddings for dynamic expert selection and integrates vision-language foundation models for medical image segmentation. Result: On 10 datasets comprising 3,410 CT scans, MoME achieves competitive precision on a comprehensive medical image segmentation benchmark. Conclusion: With its novel vision-language mixture-of-experts architecture, MoME obtains robust medical image analysis results and shows the potential of combining textual information with foundation models. Abstract: In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.

[52] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

Yana Wei,Zeen Chi,Chongyu Wang,Yu Wu,Shipeng Yan,Yongfei Liu,Xuming He

Main category: cs.CV

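TL;DR: This paper proposes a new exemplar-free incremental relation distillation (IRD) framework for human-object interaction detection in open-world settings, which mitigates catastrophic forgetting, interaction drift, and the difficulty of zero-shot HOI combinations.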

Details

Motivation: Conventional closed-set HOI detection models struggle with the continuously changing interactions of open-world settings, so an incremental detection method that learns new knowledge step by step is needed. Method: The IRD framework decouples the learning of objects and relations and introduces two distinct relation distillation losses that learn invariant relation features across different HOI combinations sharing the same relation, enabling exemplar-free incremental learning. Result: Experiments on HICO-DET and V-COCO show that the method outperforms state-of-the-art approaches in mitigating forgetting, handling interaction drift, and generalizing to zero-shot HOIs. Conclusion: The proposed IRD framework is an effective solution for continually learning human-object interactions in open-world settings, with good robustness and generalization. Abstract: In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans' ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at https://github.com/weiyana/ContinualHOI.

[53] VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video

Philipp V. Rouast

Main category: cs.CV

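TL;DR: VitalLens 2.0 is a new deep learning model that accurately estimates heart rate, respiratory rate, and heart rate variability from face video, reaching state-of-the-art performance.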

Details

Motivation: To improve the accuracy of remote photoplethysmography (rPPG) so that multiple physiological signals (HR, RR, and HRV) can be estimated robustly. Method: A new model architecture trained on large-scale, diverse data covering 1,413 individuals. Result: On a test set of 422 individuals, the MAE is 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD, significantly outperforming previous methods. Conclusion: VitalLens 2.0 is a significant advance in rPPG-based physiological signal estimation and is now available to developers through an API. Abstract: This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
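
For readers unfamiliar with the HRV metrics named above, the following is a minimal sketch of how SDNN and RMSSD are commonly computed from a list of inter-beat intervals in milliseconds. It is not taken from the VitalLens report: the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for SDNN is an assumption, since conventions differ.

```python
import math

def sdnn(ibi_ms):
    """SDNN: standard deviation of inter-beat intervals (ms).
    Assumption: sample standard deviation (n - 1 denominator)."""
    n = len(ibi_ms)
    mean = sum(ibi_ms) / n
    return math.sqrt(sum((x - mean) ** 2 for x in ibi_ms) / (n - 1))

def rmssd(ibi_ms):
    """RMSSD: root mean square of successive interval differences (ms)."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def mean_hr_bpm(ibi_ms):
    """Mean heart rate in beats per minute from intervals in ms."""
    return 60000.0 / (sum(ibi_ms) / len(ibi_ms))

# Hypothetical intervals, for illustration only.
intervals = [800.0, 810.0, 790.0, 800.0]
print(sdnn(intervals), rmssd(intervals), mean_hr_bpm(intervals))
```

Both metrics depend on precise beat timing rather than on the average rate alone, which is one reason HRV estimation from video is harder than HR estimation.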

[54] AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

Mario Camarena,Het Patel,Fatemeh Nazari,Evangelos Papalexakis,Mohamadhossein Noruzoliaee,Jia Chen

Main category: cs.CV

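TL;DR: This paper presents AD-SAM, a vision foundation model fine-tuned for semantic segmentation in autonomous driving. A dual-encoder and deformable decoder improve segmentation accuracy on complex road scenes; AD-SAM clearly outperforms SAM, G-SAM, and DeepLabV3 on Cityscapes and BDD100K and shows stronger generalization, learning efficiency, and data efficiency.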

Details

Motivation: To improve the spatial and geometric modeling of complex road scenes in semantic segmentation for autonomous driving, and to overcome the loss of detail, blurred boundaries, and low training efficiency of existing foundation models (such as SAM) on AD tasks. Method: AD-SAM uses a dual-encoder that fuses global semantic information from SAM's ViT-H with local spatial detail from ResNet-50, aligns multi-scale heterogeneous features with a deformable fusion module, decodes progressively in multiple stages with deformable attention, and is optimized with a hybrid loss composed of Focal, Dice, Lovasz-Softmax, and Surface losses. Result: AD-SAM reaches 68.1 mIoU on Cityscapes and 59.5 mIoU on BDD100K, improving on SAM, G-SAM, and DeepLabV3 by up to 22.9 and 19.2 mIoU respectively; its cross-domain retention score is 0.87, it learns twice as fast as the benchmark models, and it reaches 0.607 mIoU with only 1000 samples. Conclusion: With targeted architectural and optimization improvements to a foundation model, AD-SAM delivers more reliable and scalable autonomous driving perception, combining high accuracy, strong generalization, fast convergence, and data efficiency, with potential for practical use. Abstract: This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to the spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
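
The headline numbers above are mean Intersection over Union (mIoU) scores. As a reminder of what that metric measures, here is a minimal sketch that computes mIoU from a class confusion matrix. The function name and the tiny two-class matrix are hypothetical, and skipping classes that never occur is one common convention rather than something stated in the abstract.

```python
def mean_iou(conf):
    """Mean IoU from a square confusion matrix.

    conf[i][j] is the number of pixels whose true class is i and whose
    predicted class is j. Classes with no true or predicted pixels are
    skipped (an assumed convention).
    """
    num_classes = len(conf)
    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                                  # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp   # wrongly labelled as c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Hypothetical 2-class example: IoU is 3/5 for class 0 and 5/7 for class 1.
example = [[3, 1],
           [1, 5]]
print(mean_iou(example))
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is why class-balancing loss terms matter for this metric.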

[55] Hierarchical Transformers for Unsupervised 3D Shape Abstraction

Aditya Vora,Lily Goli,Andrea Tagliasacchi,Hao Zhang

Main category: cs.CV

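TL;DR: Proposes HiT, a hierarchical neural field representation that uses a hierarchical transformer and a compressed codebook to learn general hierarchies of 3D shapes in an unsupervised setting.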

Details

Motivation: Existing methods are usually tied to a fixed hierarchical structure (such as a binary tree) and cannot express complex, varied shape hierarchies, so a more flexible method that learns the hierarchy from data is needed. Method: A hierarchical transformer (HiT) in which each level learns the parent-child relationships of the tree hierarchy through a compressed codebook; no specific structure is imposed beyond a limit on the total number of nodes per level, which allows common substructures to be discovered automatically across categories. Result: On an unsupervised shape segmentation task over 55 ShapeNet categories, the model segments shapes at multiple levels of granularity and captures meaningful parent-child containment relationships. Conclusion: HiT infers more general and more complex hierarchies from data than prior methods and performs well on cross-category 3D shape representation and segmentation. Abstract: We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and to represent more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.

[56] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding

Haonan Wang,Jingyu Lu,Hongrui Li,Xiaomeng Li

Main category: cs.CV

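TL;DR: This paper presents ZEBRA, the first zero-shot brain visual decoding framework that needs no subject-specific adaptation. Adversarial training separates the subject-related and semantic-related components of fMRI representations, which allows good generalization to unseen subjects.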

Details

Motivation: Most existing fMRI-to-image reconstruction methods rely on subject-specific models or fine-tuning, which limits their scalability and practical use, so a universal and scalable neural decoding method is needed. Method: The ZEBRA framework uses adversarial training to decompose fMRI representations into subject-related and semantic-related components and extracts subject-invariant semantic representations, enabling zero-shot decoding. Result: Experiments show that ZEBRA significantly outperforms zero-shot baselines and is comparable to fully fine-tuned models on several metrics, without any additional data or retraining. Conclusion: ZEBRA offers an effective path toward scalable, practical universal neural decoding and moves fMRI decoding closer to real-world use. Abstract: Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.

[57] WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

Zhicong Sun,Jacqueline Lo,Jinxing Hu

Main category: cs.CV

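TL;DR: This paper presents WildfireX-SLAM, a large, comprehensive synthetic dataset for SLAM in wildfire and forest environments. Generated with Unreal Engine 5, it contains 5.5k low-altitude RGB-D aerial images covering 16 km² of forest, offers flexible control over environmental factors such as lighting, weather, and fire conditions, and supports research on forest mapping and emergency response.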

Details

Motivation: 3DGS-based SLAM for large-scale forest scenes has great practical value, but progress is held back by the lack of high-quality real-world datasets, and field collection is costly and technically difficult. A controllable, high-quality synthetic dataset is therefore needed. Method: Building on the Unreal Engine 5 Electric Dreams Environment Sample Project, the authors develop a data collection pipeline that generates aerial and ground views, records multiple data modalities (such as ground-truth camera poses) from an unmanned aerial vehicle perspective, and allows flexible control of lighting, weather, and the type and condition of wildfire. Result: The resulting WildfireX-SLAM dataset contains 5.5k low-altitude RGB-D images covering 16 km² of forest; a 3DGS-SLAM benchmark built on it reveals the key challenges these methods face in forest environments. Conclusion: WildfireX-SLAM provides a high-quality, controllable synthetic data foundation for SLAM research in forest and wildfire environments, helps advance 3DGS-SLAM for wildfire emergency response and forest management, and can serve as a basis for future algorithm improvements. Abstract: 3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from an unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km². On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.

[58] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

Tong Shen,Jingai Yu,Dong Zhou,Dong Li,Emad Barsoum

Main category: cs.CV

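TL;DR: Proposes E-MMDiT, an efficient and lightweight multimodal diffusion model that achieves fast image generation with only 304M parameters and modest training resources.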

Details

Motivation: Existing diffusion models usually need large-scale data and heavy compute, or suffer from complex structure and high latency, so a more efficient, lightweight, and easily reproducible model is needed. Method: A highly compressive visual tokenizer and a multi-path compression module reduce the token count; Position Reinforcement and Alternating Subregion Attention (ASA) maintain spatial coherence while lowering computational cost; an AdaLN-affine module provides efficient modulation. Result: Trained for 1.5 days on 25M public data on a single node of 8 AMD MI300X GPUs, the 512px model reaches 0.66 on GenEval, rising to 0.72 with post-training techniques. Conclusion: E-MMDiT is an efficient, lightweight, and reproducible diffusion model that helps make generative AI models more widely accessible. Abstract: Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and reaches 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
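
To make the token-reduction argument concrete, the sketch below counts query-key score computations for full self-attention versus attention restricted to equal-sized subregions. It is a back-of-the-envelope illustration only: the function names and the example token counts are hypothetical, and it does not reproduce how ASA actually partitions or alternates subregions.

```python
def full_attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over all tokens."""
    return num_tokens * num_tokens

def subregion_attention_pairs(num_tokens, num_regions):
    """Pairs scored when attention is restricted to equal-sized subregions.

    Assumption: tokens split evenly into `num_regions` groups and each
    token only attends within its own group.
    """
    if num_tokens % num_regions != 0:
        raise ValueError("num_tokens must be divisible by num_regions")
    per_region = num_tokens // num_regions
    return num_regions * per_region * per_region

# Hypothetical sizes: 1024 tokens, split into 4 subregions.
print(full_attention_pairs(1024))          # 1048576
print(subregion_attention_pairs(1024, 4))  # 262144
```

Under these assumptions, splitting N tokens into k subregions cuts the pair count from N^2 to N^2 / k, and halving the token count through stronger compression cuts it by a factor of four, which is the intuition behind centering the design on token reduction.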

[59] Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features

Xingtao Ling,Yingying Zhu

Main category: cs.CV

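TL;DR: This paper proposes the CVCAM and MHSAM modules for cross-view object geo-localization, which improve localization accuracy through multiple rounds of cross-view interaction and multi-scale spatial feature extraction, and releases a new G2D dataset for ground-to-drone localization.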

Details

Motivation: Existing cross-view object localization methods fail to transfer information effectively between views and do not refine the spatial relationship feature maps, which makes the model susceptible to edge noise and degrades localization. Method: A Cross-view and Cross-attention Module (CVCAM) performs multiple interactions between the two views to strengthen learning of contextual information; a Multi-head Spatial Attention Module (MHSAM) extracts multi-scale spatial features with convolutional kernels of different sizes; and a new G2D dataset is built for the Ground-to-Drone localization task. Result: Experiments on the CVOGL and G2D datasets show that the method clearly improves localization accuracy and surpasses the current state of the art. Conclusion: The proposed CVCAM and MHSAM modules improve cross-view object localization and suppress irrelevant noise, and the new G2D dataset provides important support for related research. Abstract: Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap in the "Ground-to-Drone" localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.

[60] HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition

Jiacheng Hong,Kunzhen Wu,Mingrui Yu,Yichao Gu,Shengze Xue,Shuangjiu Xiao,Deli Dong

Main category: cs.CV

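TL;DR: Proposes HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition, which uses a Progressive Hierarchical Spatial-Semantic Graph (PHiSSG) for controllable and efficient 3D scene generation.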

Details

Motivation: Most existing 3D scene generation methods work in a single step and struggle to balance scene complexity against user input. Inspired by human cognition, the authors seek a generation process that goes from global to local, focuses on key elements, and completes the scene through semantic association. Method: The HiGS framework and the PHiSSG structure let users expand a scene iteratively by selecting key semantic objects, while a graph dynamically maintains spatial relationships and semantic dependencies, supporting recursive layout optimization and object-level consistency. Result: Experiments show that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, and generates structurally coherent 3D scenes more effectively. Conclusion: HiGS offers a controllable and extensible paradigm for 3D scene generation, with hierarchical multi-step generation improving both output quality and user interaction. Abstract: Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.

[61] AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification

Yuanhao Tang,Xuechao Zou,Zhengpei Hu,Junliang Xing,Chengkun Zhang,Jianqiang Huang

Main category: cs.CV

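TL;DR: Proposes AFM-Net, a remote sensing scene classification framework that combines a CNN with Mamba. A hierarchical fusion mechanism jointly represents local and global features, and a Mixture-of-Experts classifier improves recognition accuracy, giving leading results on several datasets.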

Details

Motivation: Remote sensing scene classification is made difficult by the complex spatial structure and pronounced multi-scale character of ground objects, and existing methods hit a bottleneck when trying to fuse local texture with global context efficiently. Method: AFM-Net has a CNN branch that extracts hierarchical visual priors and a Mamba branch for efficient global sequence modeling; a hierarchical fusion mechanism progressively aggregates multi-scale features with dynamic cross-level interaction; and a Mixture-of-Experts classifier adaptively routes features to improve fine-grained recognition. Result: Accuracy reaches 93.72%, 95.54%, and 96.92% on AID, NWPU-RESISC45, and UC Merced respectively, ahead of current mainstream methods and with a better balance between performance and efficiency. Conclusion: With effective dual-path feature fusion and an adaptive classification strategy, AFM-Net clearly improves the accuracy and efficiency of remote sensing scene classification and is practical and extensible. Abstract: Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.

[62] How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

Yanfan Zhu,Juming Xiong,Ruining Deng,Yu Wang,Yaohong Wang,Shilin Zhao,Mengmeng Yin,Yuqing Liu,Haichun Yang,Yuankai Huo

Main category: cs.CV

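TL;DR: This study examines whether existing deep learning models can approximate Banff lesion scores through a modular, rule-based framework, and finds that current AI models are limited by structural omission, hallucination, and detection ambiguity.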

Details

Motivation: The Banff Classification is semi-quantitative, has complex criteria, and shows high inter-observer variability, which makes it hard to replicate computationally; it is therefore worth examining how closely AI models can approximate its scores. Method: Each Banff indicator (such as g, ptc, and v) is decomposed into structural and inflammatory components; existing segmentation and detection tools produce outputs that are mapped to Banff scores with heuristic rules aligned with expert guidelines, and the results are compared against expert annotations. Result: Scores match for some indicators, but intermediate representations are often inconsistent; key failure modes include structural omission, hallucination, and detection ambiguity. Conclusion: Current AI pipelines remain limited in replicating expert-level grading computationally; modular evaluation and a computational Banff grading standard are needed to guide future models for transplant pathology. Abstract: The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in replicating computational expert-level grading, and emphasize the importance of modular evaluation and a computational Banff grading standard in guiding future model development for transplant pathology.

[63] Generating Accurate and Detailed Captions for High-Resolution Images

Hankyeol Lee,Gawon Seo,Kyounggyu Lee,Dogun Kim,Kyungwoo Song,Jiyoung Jung

Main category: cs.CV

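TL;DR: Proposes a multi-stage pipeline that combines vision-language models, large language models, and object detection systems to improve caption quality for high-resolution images, reducing hallucinations and adding detail.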

Details

Motivation: Vision-language models are usually pre-trained on low-resolution images, so they lose detail and miss important objects when handling high-resolution images, which leads to inaccurate or incomplete captions. Method: A vision-language model first generates an initial caption; a large language model then identifies key objects and predicts other objects likely to co-occur with them; object detection systems verify whether those objects are present; and newly detected objects missing from the initial caption receive focused, region-specific captioning to refine the final output. Result: Experiments on a dataset of high-resolution images show that the method yields more detailed and reliable captions and clearly reduces hallucinations, as measured by pairwise comparison, quantitative scoring, and a hallucination detection benchmark. Conclusion: The proposed multimodal pipeline improves caption quality for high-resolution images, adding detail while preserving accuracy, and offers a workable way around the resolution limits of existing vision-language models. Abstract: Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.

Xiaozhi Li,Huijun Di,Jian Li,Feng Liu,Wei Liang

Main category: cs.CV

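TL;DR: This paper proposes M^3Detection, a unified 3D object detection framework for multi-frame fusion of camera and 4D imaging radar, which improves detection through multi-level feature fusion.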

Details

Motivation: Most existing camera-radar fusion methods are limited to single-frame inputs, which gives incomplete scene information; combined with image degradation and radar sparsity, this limits detection performance. A more robust and efficient multi-modal, multi-frame fusion method is needed. Method: M^3Detection reuses intermediate features from a baseline detector and uses a tracker to produce reference trajectories. The second stage has a radar-guided global-level inter-object feature aggregation module and a local-level inter-grid feature aggregation module that expands features along the trajectories, followed by a trajectory-level spatiotemporal reasoning module that enhances features across frames. Result: The method achieves state-of-the-art 3D detection on the VoD and TJ4DRadSet datasets, validating its effectiveness for multi-frame camera-4D radar fusion. Conclusion: With a multi-level, cross-modal, multi-frame feature fusion strategy, M^3Detection clearly improves 3D object detection in complex environments while remaining computationally efficient. Abstract: Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.

[65] DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model

Yucheng Xing,Jinxing Yin,Xiaodong Liu

Main category: cs.CV

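TL;DR: This paper proposes DANCER, a new diffusion-based framework for generating realistic single-person dance videos. With an appearance enhancement module, a pose rendering module, and the large self-collected TikTok-3K dataset, it outperforms existing methods on real-world datasets.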

Details

Motivation: Human motion has many degrees of freedom, so generating continuous, high-quality video of human actions such as dancing is difficult, and existing methods still fall short in preserving detail and motion coherence. Method: Building on the stable video diffusion model, the authors design an Appearance Enhancement Module (AEM) to preserve details of the reference image and a Pose Rendering Module (PRM) to capture pose conditions from extra domains; they also build TikTok-3K, a large dataset of 3000 videos, for training. Result: Experiments on several real-world datasets show that the method outperforms the current state of the art in video quality, motion coherence, and detail preservation, as confirmed by both qualitative and quantitative evaluation. Conclusion: DANCER improves diffusion-based single-person dance video generation; its modular design and high-quality dataset increase the realism and continuity of the results and offer a new solution for human motion video generation. Abstract: Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attention has been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet, and generate a novel dataset, TikTok-3K, to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.

[66] H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

Mingyu Sung,Il-Min Kim,Sangseok Yun,Jae-Mo Kang

Main category: cs.CV

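TL;DR: Proposes H2-Cache, a hierarchical dual-stage caching mechanism for diffusion models. By caching structure and detail separately, it speeds up inference substantially (up to 5.08x) while keeping generation quality high.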

Details

Motivation: Existing caching methods for accelerating diffusion model inference face a trade-off between speed and generation quality, often causing quality degradation or high computational overhead. Method: H2-Cache builds on the insight that denoising can be split into a structure-defining stage and a detail-refining stage, and caches each stage under its own threshold in a dual-threshold system; a lightweight pooled feature summarization (PFS) technique provides efficient similarity estimation. Result: Experiments on the Flux architecture show up to 5.08x acceleration with image quality nearly identical to the baseline, outperforming existing caching methods both quantitatively and qualitatively. Conclusion: H2-Cache effectively resolves the speed-quality dilemma in diffusion model inference and offers an efficient, practical solution for deploying high-fidelity diffusion models. Abstract: Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-Cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-Cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
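
The dual-threshold idea can be illustrated with a toy cache. The sketch below is not the H2-Cache implementation: the class and function names, the average pooling used as the feature summary, the relative L1 change used as the similarity test, and the choice to key the second stage on the first stage's output are all assumptions made for illustration.

```python
def pooled_summary(feature, pool=4):
    """Average-pool a flat feature vector into chunks of `pool` values
    (a stand-in for a cheap pooled feature summary)."""
    chunks = [feature[i:i + pool] for i in range(0, len(feature), pool)]
    return [sum(c) / len(c) for c in chunks]

def relative_change(new, old):
    """Relative L1 change of `new` with respect to the cached `old`."""
    num = sum(abs(a - b) for a, b in zip(new, old))
    den = sum(abs(b) for b in old) + 1e-8
    return num / den

class DualStageCache:
    """Two stages, each reused when its input summary changed less than
    that stage's own threshold."""

    def __init__(self, structure_fn, detail_fn, tau_structure, tau_detail):
        self.structure_fn, self.detail_fn = structure_fn, detail_fn
        self.tau_structure, self.tau_detail = tau_structure, tau_detail
        self._s_key = self._s_out = None
        self._d_key = self._d_out = None

    @staticmethod
    def _hit(key, cached_key, tau):
        return cached_key is not None and relative_change(key, cached_key) < tau

    def step(self, x):
        key = pooled_summary(x)
        if not self._hit(key, self._s_key, self.tau_structure):
            # Structure stage changed enough: recompute and refresh the cache.
            self._s_key, self._s_out = key, self.structure_fn(x)
        h = self._s_out
        key = pooled_summary(h)
        if not self._hit(key, self._d_key, self.tau_detail):
            # Detail stage changed enough: recompute and refresh the cache.
            self._d_key, self._d_out = key, self.detail_fn(h)
        return self._d_out
```

In this toy version the cached key is refreshed only on a miss, so small changes are always measured against the last fully computed step; whether the real method does the same is not stated in the abstract.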

[67] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

Guanchong Huang,Song Fang

Main category: cs.CV

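TL;DR: This paper proposes SilhouetteTell, a new video identification attack that uses the spatial and temporal information of subtitle silhouettes to infer which online or offline video a user is watching; experiments show it is highly effective across a range of settings.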

Details

Motivation: Video identification attacks can seriously threaten user privacy by exposing sensitive information such as interests, beliefs, and political leanings. Existing techniques mostly rely on network traffic analysis and cannot handle offline videos, which motivates a new kind of attack. Method: Based on the observation that a subtitle's content determines its on-screen silhouette, the method combines the spatial features of subtitle silhouettes with the time differences between consecutive subtitles into a spatiotemporal feature, then exploits the spatiotemporal correlation between recorded subtitle silhouettes and the subtitle file to infer the video. Result: Comprehensive experiments on off-the-shelf smartphones show that SilhouetteTell infers video titles and clips with high efficacy and remains effective at distances of up to 40 meters. Conclusion: SilhouetteTell is an effective video identification attack that works for both online and offline videos and reveals a new privacy threat based on subtitle silhouettes. Abstract: Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.

[68] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

Guozheng Zheng,Jian Guan,Mingjie Xie,Xuanjia Zhao,Congyi Fan,Shiheng Zhang,Pengming Feng

Main category: cs.CV

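TL;DR: Proposes a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy for cross-view geo-localization between drone and satellite imagery, which eases the difficulties caused by viewpoint gaps and hard negatives.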

Details

Motivation: Severe viewpoint gaps and hard negatives that look similar but are geographically mismatched hurt existing cross-view geo-localization methods, and static weighting strategies are sensitive to distribution shifts and tend to emphasize hard samples too early, which causes noisy gradients and unstable convergence. Method: At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates the relative difficulty of negatives and assigns fine-grained weights; at the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism uses a training-progress signal to suppress noisy gradients early in training and gradually strengthen hard-negative mining as training proceeds. Result: Experiments on the University-1652 and SUES-200 benchmarks show that the method outperforms the current state of the art, with good robustness and consistent gains. Conclusion: Through dual-level dynamic reweighting, DPHR improves cross-view geo-localization, particularly in handling hard negatives and in training stability. Abstract: Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
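
The interplay between sample-level hardness weights and a batch-level progress schedule can be sketched as follows. This is an illustrative guess, not the DPHR formulation: the function names, the ratio of exponentiated similarities used as the hardness score, the mean-one normalization, and the linear ramp are all assumptions, since the abstract gives no formulas.

```python
import math

def hardness_weights(pos_sim, neg_sims):
    """Per-negative weights from a similarity ratio (assumed form).

    Each negative is scored by exp(neg) / exp(pos): the closer it is to
    the positive's similarity, the harder it is. Weights are normalized
    to have mean 1 so the overall loss scale is unchanged.
    """
    ratios = [math.exp(s - pos_sim) for s in neg_sims]
    mean = sum(ratios) / len(ratios)
    return [r / mean for r in ratios]

def progress_factor(step, total_steps):
    """Training-progress signal in [0, 1] (assumed linear ramp)."""
    return min(1.0, max(0.0, step / total_steps))

def progressive_weights(pos_sim, neg_sims, step, total_steps):
    """Blend uniform weights (early) with hardness weights (late)."""
    lam = progress_factor(step, total_steps)
    return [(1.0 - lam) + lam * w for w in hardness_weights(pos_sim, neg_sims)]

# Hypothetical similarities: one positive, three negatives.
print(progressive_weights(0.9, [0.2, 0.5, 0.85], step=0, total_steps=100))
print(progressive_weights(0.9, [0.2, 0.5, 0.85], step=100, total_steps=100))
```

At step 0 every negative gets weight 1, so noisy hard samples cannot dominate early gradients; by the final step the hardest negative receives the largest weight, matching the qualitative behavior the abstract describes.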

[69] Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

Zixuan Hu,Yongxian Wei,Li Shen,Zhenyi Wang,Lei Li,Chun Yuan,Dacheng Tao

Main category: cs.CV

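TL;DR: Proposes a sparse model inversion (SMI) strategy that selectively inverts semantic foregrounds while ignoring noisy backgrounds and spurious correlations, significantly accelerating ViT-based high-resolution image inversion while maintaining or even improving downstream task performance.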

Details

Motivation: Existing dense inversion methods are inefficient when inverting high-resolution images, mainly because of redundant inversion of noisy backgrounds and the "hallucination" phenomenon of inverting spurious correlations. Method: A plug-and-play sparse model inversion strategy that selectively inverts only semantic foregrounds, avoiding unnecessary optimization of backgrounds and potential spurious correlations. Result: Achieves up to a 3.79× speedup over existing methods while maintaining or improving performance in data-free model quantization and data-free knowledge transfer. Conclusion: The proposed sparse inversion strategy effectively addresses the efficiency problem in high-resolution model inversion and offers a more efficient solution for data-free applications of large-scale vision models. Abstract: Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations, a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79× faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.

Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Yoichi Sato

Main category: cs.CV

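TL;DR: This paper proposes the Multimodal Interactive Veracity Assessment (MIVA) task and builds a dataset from the game Werewolf with synchronized video, text, and ground-truth labels to evaluate multimodal large models on deception detection, finding that existing models such as GPT-4o still have significant limitations.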

Details

Motivation: As AI systems become increasingly integrated into human life, giving them the ability to distinguish lies from truth is crucial. However, multimodal deception detection in dynamic multi-party conversations remains under-studied, and reliable datasets and systematic evaluation are lacking. Method: Introduces the MIVA task and builds a multimodal dataset based on the game Werewolf, containing synchronized video, transcribed text, and a ground-truth label for every statement, then benchmarks several advanced multimodal large language models. Result: Experiments show that even state-of-the-art models (such as GPT-4o) perform poorly on the task, with a clear performance gap; models struggle to align language with visual social cues and are overly conservative in their judgments. Conclusion: Existing MLLMs still have major shortcomings in deception detection, and more perceptive AI systems that better understand multimodal social signals are urgently needed. Abstract: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

Jiaxin Zhang,Zehong Zhu,Junye Deng,Yunqin Li,Bowen Wang

Main category: cs.CV

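TL;DR: This paper proposes a Hierarchical Graph Neural Network (HGNN) model that fuses multi-source data for in-depth analysis of village spatial morphology, significantly improving performance on multimodal fusion and classification tasks.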

Details

Motivation: Existing studies mostly adopt a single-discipline perspective and qualitative methods, and are limited by a lack of digital infrastructure and insufficient data, making it hard to reveal the characteristics of village spatial morphology and its evolution mechanisms. Method: Builds an HGNN model with input nodes and communication nodes, static input edges and dynamic communication edges; combines GCN and GAT to fuse multimodal features under a two-stage feature update mechanism; and introduces a relational pooling mechanism with a joint training strategy across 17 subtypes. Result: The method clearly outperforms existing approaches on multimodal fusion and classification tasks, raising mean accuracy/F1 from 0.71/0.83 to 0.82/0.90, with a 6% gain on parcel tasks. Conclusion: The proposed method provides scientific evidence and technical support for exploring village spatial patterns and their generative logic. Abstract: Village areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of village spatial morphology. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying village spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all sub-types lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring village spatial patterns and generative logic.

[72] Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness

Ren Tasai,Guang Li,Ren Togo,Takahiro Ogawa,Kenji Hirata,Minghui Tang,Takaaki Yoshimura,Hiroyuki Sugimori,Noriko Nishioka,Yukie Shimizu,Kohsuke Kudo,Miki Haseyama

Main category: cs.CV

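TL;DR: Proposes a new continual self-supervised learning (CSSL) framework that simultaneously learns diverse features from chest CT images obtained with multiple window settings while preserving data privacy.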

Details

Motivation: In medical image diagnosis, scarce annotated data and domain shifts in dynamic healthcare environments (such as different window settings) lead to poor model generalization, and conventional methods struggle to reuse past data because of privacy constraints. Method: Performs continual pretraining on unlabeled images, introduces a latent replay-based mechanism to mitigate catastrophic forgetting, and applies a feature distillation technique combining Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE) to improve robustness to domain shift. Result: Validated on chest CT images obtained under two different window settings, the approach outperforms other existing methods. Conclusion: The CSSL framework effectively addresses domain shift and data privacy challenges, enabling continual learning without access to past data and improving model generalization and practicality. Abstract: We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.

[73] SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

Renjie Ji,Xue Wang,Chao Niu,Wen Zhang,Yong Mei,Kun Tan

Main category: cs.CV

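TL;DR: Proposes SpecAware, a hyperspectral foundation model that fuses sensor meta-attributes with image content to enable joint multi-sensor learning, improving land-cover classification, change detection, and scene classification.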

Details

Motivation: Existing hyperspectral foundation models overlook the role of sensor meta-attributes, struggle with multi-sensor data, and lack generalization. Method: Designs a meta-content aware module and a HyperEmbedding module, in which a hypernetwork dynamically generates matrix factors for encoding, enabling adaptive feature extraction across different sensors and spectral channels. Result: Experiments on six datasets show that SpecAware outperforms existing methods on semantic segmentation, change detection, and scene classification. Conclusion: By perceiving spectral content and sensor information, SpecAware establishes a unified multi-sensor hyperspectral learning framework and significantly improves transferability and generalization. Abstract: Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.

[74] Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

Mahmoud El Hussieni,Bahadır K. Güntürk,Hasan F. Ateş,Oğuz Hanoğlu

Main category: cs.CV

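TL;DR: This paper studies YOLOv11's performance on joint building instance segmentation and discrete height classification from satellite imagery, using the DFC2023 Track 2 dataset to demonstrate strong performance in complex urban environments.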

Details

Motivation: Accurate building instance segmentation and height classification are critical for urban planning, 3D modeling, and infrastructure monitoring, yet existing methods still face challenges in handling complex scenes and balancing multiple tasks. Method: Uses the YOLOv11 model with its improved multi-scale feature fusion architecture, trained and evaluated on the DFC2023 Track 2 dataset with precision, recall, F1 score, and mAP as metrics. Result: YOLOv11 reaches 60.4% mAP@50 and 38.3% mAP@50-95 on instance segmentation, with robust height classification across five tiers; it outperforms previous methods especially on rare classes such as high-rise buildings, and has faster inference. Conclusion: YOLOv11 performs well on multitask building extraction and height classification, combining high accuracy with efficiency; it is suited to large-scale real-time urban mapping and is a useful tool for remote sensing and geospatial intelligence. Abstract: Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset, which includes over 125,000 annotated buildings across 12 cities, we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4% mAP@50 and 38.3% mAP@50-95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.

[75] MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

Jingnan Gao,Zhe Wang,Xianze Fang,Xingyu Ren,Zhuo Chen,Shengqi Liu,Yuhao Cheng,Jiangjing Lyu,Xiaokang Yang,Yichao Yan

Main category: cs.CV

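TL;DR: MoRE is a dense 3D visual foundation model based on a Mixture-of-Experts architecture that improves scalability and adaptability through dynamic routing, achieving state-of-the-art performance on multiple geometric tasks.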

Details

Motivation: The complexity of geometric supervision and the diversity of 3D data make further scaling of 3D models challenging, so a more scalable and adaptable approach is needed. Method: Proposes MoRE, which uses a Mixture-of-Experts architecture for dynamic feature routing, combines a confidence-based depth refinement module with dense semantic feature fusion, and is trained with tailored loss functions to support multi-task learning. Result: MoRE achieves state-of-the-art performance on multiple benchmarks and effectively supports downstream applications without extra computation. Conclusion: Through expert specialization, depth refinement, and semantic-geometric fusion, MoRE significantly improves the robustness, scalability, and multi-task adaptability of 3D vision models. Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.

[76] Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting

Tianli Liao,Ran Wang,Siqing Zhang,Lei Li,Guangen Liu,Chenyang Zhao,Heling Cao,Peng Li

Main category: cs.CV

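TL;DR: This paper proposes Object-IR, a self-supervised image retargeting method that learns a mesh-deformation-based optimization framework to achieve high-quality retargeting while preserving the appearance and geometry of semantically important objects.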

Details

Motivation: Eliminating geometric distortion in semantically important regions is a hard problem in image retargeting; existing methods depend on manually annotated data and struggle to balance visual fidelity with geometric consistency. Method: Reformulates image retargeting as a learning-based mesh warping optimization problem, uses a CNN to predict mesh grid motion, and designs a comprehensive objective with object-consistent loss, geometric-preserving loss, and boundary loss, enabling self-supervised training without annotated data. Result: Achieves state-of-the-art performance on the RetargetMe benchmark, surpassing existing methods in quantitative metrics and subjective visual quality; average inference time is only 0.009 s (1024x683 resolution), running in real time on consumer-grade GPUs. Conclusion: Object-IR's self-supervised mesh deformation strategy effectively mitigates geometric distortion of key objects, combines efficiency with superior retargeting quality, and advances unsupervised image retargeting. Abstract: Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input's geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at https://github.com/tlliao/Object-IR.

[77] Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

Zhidong Yang,Xiuhui Shi,Wei Ba,Zhigang Song,Haijing Luan,Taiyuan Hu,Senlin Lin,Jiguang Wang,Shaohua Kevin Zhou,Rui Yan

Main category: cs.CV

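TL;DR: This paper proposes FuseCPath, a new framework for fusing heterogeneous pathology foundation models (FMs), achieving state-of-the-art performance on multiple cancer datasets.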

Details

Motivation: Existing pathology foundation models show substantial heterogeneity because of different training data and network architectures, leading to unstable feature performance in downstream tasks. Method: Proposes a multi-view clustering method to select representative patches, a cluster-level re-embedding strategy to fuse heterogeneous patch-level features, and a collaborative distillation strategy to fuse slide-level features. Result: Experiments on TCGA lung, bladder, and colorectal cancer datasets show that FuseCPath achieves state-of-the-art performance across multiple tasks. Conclusion: FuseCPath effectively fuses multiple heterogeneous pathology foundation models and improves overall downstream performance. Abstract: Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.

[78] Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation

Alik Pramanick,Mayank Bansal,Utkarsh Srivastava,Suklav Ghosh,Arijit Sur

Main category: cs.CV

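TL;DR: Proposes a two-phase training method combining spatial- and frequency-domain denoising, using the discrete wavelet transform and a transformer to improve the robustness of image classifiers against adversarial attacks.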

Details

Motivation: Deep neural networks are vulnerable to adversarial attacks, which limits their use in security-critical systems, so model robustness needs to be improved. Method: Two-phase training: first, images are processed by a transformer-based denoising network that combines spatial features with wavelet coefficients; then the classifier is retrained on the denoised images. Result: Significantly improves classification accuracy on MNIST, CIFAR-10, and Fashion-MNIST, outperforming conventional denoising and adversarial training methods. Conclusion: The method effectively strengthens defense against adversarial attacks, achieving better denoising and classification by combining frequency-domain analysis with a transformer architecture. Abstract: In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present two-phase training methods to tackle the attack: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier's robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance of denoising-network and adversarial-training approaches. The code is available at https://github.com/Mayank94/Trans-Defense.
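
To make the frequency-analysis step concrete, here is a minimal sketch of a single-level 2D Haar wavelet transform in plain Python. The abstract does not say which wavelet the authors use, so the Haar choice, the function name `haar_dwt2`, and the subband labels are assumptions for illustration; subband naming conventions also differ between libraries.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform.

    img: list of rows (lists of numbers) with even height and width.
    Returns four subbands (LL, LH, HL, HH), each half the size of img.
    LL is the low-frequency approximation; the other three hold detail
    (high-frequency) coefficients.
    """
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, len(img), 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll.append((a + b + c + d) / 2)  # local average
            lh.append((a + b - c - d) / 2)  # top row minus bottom row
            hl.append((a - b + c - d) / 2)  # left column minus right column
            hh.append((a - b - c + d) / 2)  # diagonal difference
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH


def energy(band):
    """Sum of squared coefficients in a subband."""
    return sum(v * v for row in band for v in row)


if __name__ == "__main__":
    clean = [[1, 1], [1, 1]]
    noisy = [[1, 2], [3, 4]]
    for name, image in (("clean", clean), ("noisy", noisy)):
        LL, LH, HL, HH = haar_dwt2(image)
        print(name, "low:", energy(LL), "high:", energy(LH) + energy(HL) + energy(HH))
```

Comparing the detail-band energy of a clean image and its perturbed version is one simple way to check the abstract's observation that high-frequency components are more heavily corrupted.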

[79] C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

Suklav Ghosh,Sonal Kumar,Arijit Sur

Main category: cs.CV

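TL;DR: This paper proposes a new adversarial defense method based on contrastive learning that trains the model on both clean and adversarially perturbed images, significantly improving robustness against a variety of adversarial attacks.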

Details

Motivation: Deep neural networks perform well on computer vision tasks but are vulnerable to adversarial attacks, which limits their deployment in real-world settings, so model robustness needs to be improved. Method: Uses a contrastive loss and jointly optimizes the model parameters and the input perturbations, training on both clean and adversarial samples so that the model learns more robust feature representations. Result: Experiments show that the method significantly improves the model's defense against various adversarial perturbations, supporting the effectiveness of contrastive learning for robustness. Conclusion: Contrastive learning helps extract more informative and perturbation-resistant features, offering a new and effective direction for adversarial robustness research in deep learning. Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model's parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model's robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.

[80] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Yehna Kim,Young-Eun Kim,Seong-Whan Lee

Main category: cs.CV

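TL;DR: Proposes using web-crawled descriptions and a large language model to extract keywords, reducing the need for manual annotation, and aligns description attributes with video content through a spatio-temporal interaction module, achieving strong results in zero-shot action recognition.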

Details

Motivation: Multi-semantic words in action class names can cause semantic ambiguity, so relying on action classes alone to provide semantic context is problematic. Method: Uses web-crawled descriptions, extracts keywords with a large language model, and introduces a spatio-temporal interaction module that focuses on objects and action units to align description attributes with video content. Result: Achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively. Conclusion: The method effectively mitigates the semantic ambiguity caused by multi-semantic words, reduces reliance on manual annotation, and shows good adaptability and performance across downstream tasks. Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.

[81] Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Xiaowen Chu

Main category: cs.CV

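TL;DR: Proposes a new framework for universal video retrieval through the co-design of evaluation, data, and modeling, comprising the new UVRB benchmark, large-scale synthetic data, and the Modality Pyramid training method, significantly improving zero-shot generalization.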

Details

Motivation: The current video retrieval paradigm is constrained by narrow benchmarks and single-task training, and lacks evaluation of and support for multi-dimensional generalization, resulting in limited universal capability. Method: Builds the Universal Video Retrieval Benchmark (UVRB, 16 datasets), designs a scalable synthesis workflow guided by its diagnostics that generates 1.55 million high-quality pairs, and proposes the Modality Pyramid curriculum to train the General Video Embedder (GVE). Result: GVE achieves state-of-the-art zero-shot generalization on UVRB; the analysis finds that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is an important but overlooked scenario. Conclusion: The proposed co-designed framework offers an effective path beyond the limits of current video retrieval toward truly universal video retrieval. Abstract: The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

[82] RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents

Yinglu Li,Zhiying Lu,Zhihang Liu,Chuanbin Liu,Hongtao Xie

Main category: cs.CV

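TL;DR: Proposes RegionRAG, a new multimodal retrieval-augmented generation framework that moves from document-level to region-level retrieval, improving accuracy and efficiency through fine-grained region retrieval.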

Details

Motivation: Existing methods use the entire document as the retrieval unit, which introduces a large amount of irrelevant visual content and hurts model performance. Method: Proposes RegionRAG, trained with a hybrid supervision strategy; at inference it dynamically groups salient image patches into complete semantic regions to enable region-level retrieval. Result: Achieves state-of-the-art results on six benchmarks, improving R@1 by 10.02% on average and question answering accuracy by 3.56% while using only 71.42% of the visual tokens used by prior methods. Conclusion: By refining the retrieval granularity to the region level, RegionRAG effectively reduces interference from redundant information and significantly improves the efficiency and accuracy of multimodal RAG. Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. It improves retrieval accuracy by 10.02% in R@1 on average and increases question answering accuracy by 3.56% while using only 71.42% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
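
The abstract does not describe how salient patches are grouped into regions. As one plausible illustration only, the sketch below thresholds a grid of patch relevance scores and merges 4-connected salient patches into bounding boxes. The function name `group_patches`, the fixed threshold, and the connected-components rule are assumptions, not the RegionRAG pipeline.

```python
from collections import deque


def group_patches(scores, threshold=0.5):
    """Group salient patches into rectangular regions.

    scores: 2D list of per-patch relevance scores (rows x cols).
    Returns a list of (row_min, col_min, row_max, col_max) boxes, one per
    4-connected component of patches whose score is >= threshold.
    """
    rows, cols = len(scores), len(scores[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or scores[r][c] < threshold:
                continue
            # Breadth-first search over neighbouring salient patches.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and scores[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((r0, c0, r1, c1))
    return regions


if __name__ == "__main__":
    grid = [[0.9, 0.8, 0.1],
            [0.1, 0.1, 0.1],
            [0.1, 0.1, 0.7]]
    print(group_patches(grid))
```

Passing only the resulting boxes to the generator, rather than the whole page, is the kind of token saving the abstract reports.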

[83] T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

Raza Imam,Hu Wang,Dwarikanath Mahapatra,Mohammad Yaqub

Main category: cs.CV

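TL;DR: Proposes T^3, a backpropagation-free, test-time task-adaptive model merging method that dynamically computes interpolation coefficients, improving robustness and accuracy on multimodal medical imaging tasks while remaining efficient.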

Details

Motivation: Existing model merging techniques perform poorly in medical imaging: they cannot combine the generalization of pretrained models with the precision of fine-tuned expert models, and they are unreliable under modality shift. Method: Proposes the T^3 framework, which computes a per-sample interpolation coefficient from the Jensen-Shannon divergence between the two models' output distributions, and further designs a batch-wise version, T^3_B, to reduce computational overhead. Result: Validated on four medical modalities, T^3 achieves state-of-the-art Top-1 accuracy and error reduction, clearly outperforming baseline methods. Conclusion: T^3 delivers consistent gains across different clinical tasks and offers an efficient, reliable approach to adaptive deployment of medical vision-language models. Abstract: In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
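
A minimal sketch of the divergence-driven interpolation idea follows. The Jensen-Shannon divergence itself is standard; the mapping from divergence to coefficient (here `1 - JS / ln 2`, so agreement favors the expert) and the function names are assumptions, since the abstract does not give the exact formula.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-mass terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; bounded between 0 and ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def merge_coefficient(p_pretrained, p_expert):
    """Assumed mapping: 1.0 when the models agree, 0.0 when they fully disagree."""
    return 1.0 - js_divergence(p_pretrained, p_expert) / math.log(2)


def interpolate(theta_pretrained, theta_expert, lam):
    """Element-wise interpolation: lam weights the expert, (1 - lam) the pretrained model."""
    return [lam * e + (1 - lam) * g for g, e in zip(theta_pretrained, theta_expert)]


if __name__ == "__main__":
    p_pre = [0.6, 0.3, 0.1]   # pretrained model's class probabilities for one sample
    p_exp = [0.7, 0.2, 0.1]   # fine-tuned expert's class probabilities
    lam = merge_coefficient(p_pre, p_exp)
    print(lam, interpolate([0.0, 2.0], [1.0, 4.0], lam))
```

Under this mapping the merged weights stay close to the expert when both models predict similar distributions and drift toward the pretrained generalist as they diverge, matching the behavior described in the abstract. A batch-wise variant would average the divergence over a batch before computing a single coefficient.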

[84] HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Shaojie Zhang,Pei Fu,Ruoceng Zhang,Jiahui Yang,Anan Du,Xiuwen Xi,Shaokang Wang,Ying Huang,Bin Qin,Zhenbo Luo,Jian Luan

Main category: cs.CV

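TL;DR: This paper proposes HyperClick, a framework that uses uncertainty calibration to improve the reliability and accuracy of graphical user interface (GUI) agents when executing user instructions, addressing the overconfidence of existing models.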

Details

Motivation: Existing GUI agent models, whether trained with supervised or reinforcement fine-tuning, lack awareness of their own capability boundaries, which makes predictions unreliable; in dynamic GUI automation tasks a single error can cause the whole task to fail. Method: Proposes the HyperClick framework with a dual reward mechanism that combines a binary reward for correct actions with truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score, to jointly optimize grounding accuracy and confidence reliability. Result: Experiments on seven challenging benchmarks show that HyperClick maintains state-of-the-art performance while providing well-calibrated confidence and significantly reducing overconfidence. Conclusion: Through explicit confidence calibration and introspective self-criticism, HyperClick improves the reliability and robustness of GUI agents in real-world scenarios and supports safer GUI automation. Abstract: Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
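
The two ingredients named in the abstract can be illustrated with a short sketch. The Brier score is the standard mean squared error between stated confidence and the 0/1 outcome; the truncated Gaussian below (centered on the target box, cut to zero outside it, with an assumed `sigma_scale`) is only one plausible reading of "truncated Gaussian-based spatial confidence", not the paper's definition.

```python
import math


def truncated_gaussian_score(x, y, box, sigma_scale=0.5):
    """Spatial score for a click at (x, y) against a target box (x0, y0, x1, y1).

    Assumed form: a Gaussian centered on the box, with standard deviations
    proportional to the half-width and half-height, truncated to 0 outside
    the box. The box must have positive width and height.
    """
    x0, y0, x1, y1 = box
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx = sigma_scale * (x1 - x0) / 2
    sy = sigma_scale * (y1 - y0) / 2
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


def brier_score(confidences, outcomes):
    """Mean squared gap between predicted confidence and the 0/1 outcome (lower is better)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


if __name__ == "__main__":
    box = (100, 200, 180, 240)
    print(truncated_gaussian_score(140, 220, box))   # click at the box center
    print(truncated_gaussian_score(300, 220, box))   # click outside the box
    print(brier_score([0.9, 0.9], [1, 0]))           # confident on both, right on one
```

An overconfident agent that reports 0.9 on every click is penalized heavily by the Brier score whenever it misses, which is the calibration pressure the framework relies on.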

[85] FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Zhenheng Yang,Yang You

Main category: cs.CV

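TL;DR: FOCUS is a training-free, model-agnostic keyframe selection module. It formulates keyframe selection as a combinatorial pure-exploration problem in multi-armed bandits, efficiently selects query-relevant frames under a strict token budget, and substantially improves long-video understanding accuracy.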

Details

Motivation: Existing keyframe selection methods rely on pre-filtering to reduce inference cost, so they can miss the most informative moments and scale poorly to hour-long videos. Method: Short temporal clips are treated as arms; empirical means and Bernstein confidence radii identify high-value regions, and a two-stage exploration-exploitation strategy first locates high-value temporal regions and then selects the best frames within them. Result: On two long-video question-answering benchmarks, FOCUS yields substantial accuracy gains while processing less than 2% of video frames; for videos longer than 20 minutes, accuracy on LongVideoBench improves by 11.9%. Conclusion: FOCUS offers a simple, general solution for scalable long-video understanding, comes with theoretical guarantees, and applies to a range of multimodal large language models. Abstract: Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is a reduction of a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
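
As a rough illustration of treating clips as bandit arms, the sketch below ranks clips by an optimistic value (empirical mean plus a confidence radius). It uses one common form of the empirical Bernstein bound; the constants, the `delta` parameter, and the helper names are assumptions, since the abstract does not give the exact radius FOCUS uses.

```python
import math


def bernstein_radius(scores, delta=0.05, value_range=1.0):
    """Empirical Bernstein confidence radius for one arm's observed scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    log_term = math.log(3.0 / delta)
    # Variance-dependent term plus a range-dependent term that shrinks as 1/n.
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def rank_clips_by_ucb(clip_scores, delta=0.05):
    """Return clip ids sorted by optimistic value (mean + radius), best first."""
    def ucb(item):
        _, scores = item
        return sum(scores) / len(scores) + bernstein_radius(scores, delta)
    return [clip_id for clip_id, _ in sorted(clip_scores.items(), key=ucb, reverse=True)]
```

Here `clip_scores` is a hypothetical mapping from clip id to the relevance scores of the frames sampled from that clip so far. Clips with few or highly variable samples get a larger radius, so they stay candidates for further exploration.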

[86] Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin,Yu Tian,Yue Zhang

Main category: cs.CV

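TL;DR: This paper proposes S-GRACE, a semantics-guided robust adversarial concept erasure method that introduces semantic guidance in the concept space to generate adversarial samples. It substantially improves the erasure of sensitive concepts in diffusion models, better preserves non-target concepts, and greatly reduces training time.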

Details

Motivation: Existing adversarial-training-based concept erasure methods for diffusion models fail to fit the target concept space adequately and overlook the role of concept semantics, leading to incomplete erasure or interference with other concepts. Method: The problem is analyzed from the perspective of concept space, and the S-GRACE framework uses semantic guidance to generate adversarial samples that fit the target concept space more closely, then performs erasure training on them. Result: Across multiple unlearning scenarios with seven state-of-the-art methods and three adversarial prompt generation strategies, S-GRACE improves erasure performance by 26% on average, cuts training time by 90%, and better preserves non-target concepts. Conclusion: Through semantic guidance, S-GRACE effectively addresses insufficient concept erasure in diffusion models, clearly outperforms existing methods, and offers a new direction for safe and controllable content generation. Abstract: Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

[87] Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba

Wenfeng Huang,Xiangyun Liao,Wei Cao,Wenjing Jia,Weixin Si

Main category: cs.CV

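TL;DR: Proposes FGMamba, a lightweight frequency-aware gated state-space model for medical image super-resolution that combines global dependency modeling with fine-detail enhancement and outperforms existing methods across multiple modalities.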

Details

Motivation: Medical image super-resolution must capture both long-range anatomical structures and fine-grained frequency details while keeping computational overhead low. Method: Two modules are designed: GASM, which fuses state-space modeling with dual-branch spatial and channel attention, and PFFM, which captures high-frequency details through FFT-guided multi-resolution fusion. Result: Validated on five medical imaging modalities, FGMamba achieves better PSNR/SSIM than CNN-based and Transformer-based methods with fewer than 0.75M parameters. Conclusion: Frequency-aware state-space modeling enables efficient and accurate medical image enhancement. Abstract: Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint ($<$0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.

Alvee Hassan,Rusab Sarmun,Muhammad E. H. Chowdhury,M. Murugappan,Md. Sakib Abrar Hossain,Sakib Mahmud,Abdulrahman Alqahtani,Sohaib Bassam Zoghoul,Amith Khandakar,Susu M. Zughaier,Somaya Al-Maadeed,Anwarul Hasan

Main category: cs.CV

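TL;DR: Proposes CASR-Net, a three-stage network for improving coronary artery segmentation. It preprocesses images by combining CLAHE with an improved Ben Graham method, uses a UNet with a DenseNet121 encoder and a Self-ONN decoder, and adds a contour refinement module, yielding substantially better segmentation performance.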

Details

Motivation: Early detection of coronary artery disease (CAD) is critical for reducing mortality, but the poor quality of X-ray angiography images hampers clinical diagnosis, so a robust automated segmentation method is needed to assist clinicians. Method: CASR-Net has three stages: multichannel preprocessing (combining CLAHE with an improved Ben Graham method), a UNet-based segmentation network (DenseNet121 encoder plus Self-ONN decoder), and a contour refinement module that suppresses false positives; it is evaluated with 5-fold cross-validation on two public datasets. Result: Compared with using each preprocessing method alone, the multichannel strategy raises DSC by 0.31-0.89% and IoU by 0.40-1.16%; the full model reaches 61.43% IoU, 76.10% DSC, and 79.36% clDice, outperforming several state-of-the-art models. Conclusion: CASR-Net provides a robust automated solution for coronary artery segmentation that can support clinical diagnosis and treatment planning, and it is particularly suited to preserving the continuity of stenotic vessels. Abstract: Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
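
For readers less familiar with the overlap metrics reported for CASR-Net, here is a minimal sketch of the standard Dice and IoU definitions on flat binary masks. It is not the authors' evaluation code, and the empty-mask convention is an assumption.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area, target_area = sum(pred), sum(target)
    union = pred_area + target_area - intersection
    if union == 0:  # both masks empty: treated here as perfect agreement
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For a single mask pair the two metrics are tied by DSC = 2·IoU / (1 + IoU). Plugging in the reported 61.43% IoU gives roughly 76.1%, in line with the reported DSC, although averages over many images need not satisfy the identity exactly.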

[89] Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection

Zijia An,Boyu Diao,Ruiqi Liu,Libo Huang,Chuanguang Yang,Fei Wang,Zhulin An,Yongjun Xu

Main category: cs.CV

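TL;DR: This paper proposes P²IOD, a parameterized prompt method for incremental object detection that uses neural networks as learnable prompt structures to adaptively consolidate knowledge across tasks and applies a fusion strategy to suppress catastrophic forgetting, achieving state-of-the-art performance on PASCAL VOC and MS COCO.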

Details

Motivation: Existing prompt-pool-based methods assume that classes are disjoint across incremental tasks and overlook the co-occurrence of multiple classes in detection images, which causes confusion in incremental object detection; prompt structures are therefore needed that consolidate knowledge adaptively and suppress forgetting. Method: Exploiting the global evolution property of neural networks, the network itself serves as a parameterized prompt that adaptively consolidates task knowledge, and a parameterized prompt fusion strategy constrains structural updates to prevent catastrophic forgetting. Result: Extensive experiments on PASCAL VOC2007 and MS COCO validate the effectiveness of P²IOD for incremental object detection and show state-of-the-art performance among existing baselines. Conclusion: With its parameterized prompt structure and fusion mechanism, P²IOD resolves the challenges that class co-occurrence poses for incremental object detection and achieves strong continual-learning performance. Abstract: Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, which is unsuitable for IOD because it overlooks the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in the prompt pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging the global evolution properties of neural networks, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompt structure updates, P$^2$IOD further employs a parameterized prompt fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate P$^2$IOD's effectiveness in IOD and show that it achieves state-of-the-art performance among existing baselines.

[90] SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction

Wenfeng Huang,Xiangyun Liao,Yinling Qian,Hao Liu,Yongming Yang,Wenjing Jia,Qiong Wang

Main category: cs.CV

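TL;DR: Proposes SAGS, a self-adaptive alias-free Gaussian splatting framework that addresses aliasing and artifacts in deformable tissue reconstruction from endoscopic videos.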

Details

Motivation: Existing 3D Gaussian splatting methods for deformable tissue reconstruction overlook motion-induced aliasing and artifacts, which degrade visualization quality. Method: An attention-driven, dynamically weighted 4D deformation decoder is introduced and combined with 3D smoothing filters and 2D Mip filters to improve reconstruction quality. Result: On the two public benchmarks EndoNeRF and SCARED, the method outperforms existing approaches in PSNR, SSIM, and LPIPS and delivers better visualization quality. Conclusion: SAGS effectively mitigates artifacts in deformable tissue reconstruction and substantially improves reconstruction quality while keeping rendering efficient. Abstract: Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.

[91] Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

Weiming Chen,Yijia Wang,Zhihan Zhu,Zhihai He

Main category: cs.CV

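TL;DR: Proposes a joint image generation and deep compression method that combines text descriptions with coding latents to reconstruct visual scenes at high quality under ultra-low bitrates, targeting bandwidth-constrained scenarios.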

Details

Motivation: Existing text-to-image generation models can only approximate a visual scene at the semantic level, which falls short of the precision required for remote vision analysis and human interaction, especially under extremely low bandwidth. Method: Image generation is combined with deep image compression: a joint text description and coding latent guide a rectified flow model to generate the visual scene precisely, and both the semantic information and the latent are transmitted at a very low bitrate. Result: Experiments show that the method maintains image reconstruction quality and vision analysis accuracy comparable to existing methods while using much less bandwidth. Conclusion: The proposed method performs strongly in ultra-low-bitrate visual communication, supporting high-precision vision tasks while saving bandwidth, and suits extreme scenarios such as deep space exploration and battlefield intelligence. Abstract: We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far from sufficient for visual communication, remote vision analysis, and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.

[92] MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI

Benjamin Hamm,Yannick Kirchhoff,Maximilian Rokuss,Klaus Maier-Hein

Main category: cs.CV

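TL;DR: This paper presents a classification approach for cancer detection on breast MRI that addresses the scarcity of high-quality segmentation labels, aiming to make early breast cancer detection more accurate and efficient.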

Details

Motivation: Because high-quality breast MRI segmentation labels are scarce, existing methods for whole-body lesion segmentation and multi-time-point analysis are hard to apply effectively to early breast cancer screening, so more robust classification methods are needed. Method: Starting from a set of foundational assumptions, the authors follow an iterative development process, improving the model step by step through repeated experimentation, evaluation, and refinement, with emphasis on performance, robustness, and clinical relevance. Result: The method performed well in the ODELIA Breast MRI Challenge 2025; specific metrics are not stated, but the code is publicly available. Conclusion: The proposed method offers an effective solution for automated breast MRI analysis and helps advance large-scale breast cancer screening. Abstract: The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at https://github.com/MIC-DKFZ/MeisenMeister

[93] Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Yijia Wang,Yiqing Shen,Weiming Chen,Zhihai He

Main category: cs.CV

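TL;DR: Proposes CIELR, a method that uses LLM reasoning to decompose complex image editing instructions into simple operations, avoiding joint fine-tuning of large language models and diffusion models and substantially improving performance on complex editing tasks.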

Details

Motivation: To handle complex instructions, existing image editing methods need to jointly fine-tune large language models and diffusion models, which is computationally expensive, so a more efficient approach is needed. Method: Foundation models are used to build a structured semantic representation of the input image, and an iterative update mechanism progressively refines it, enabling fine-grained scene understanding and complex edits; LLM reasoning decomposes complex instructions into simple editing actions. Result: On the SmartEdit dataset, the method improves PSNR by 9.955 dB over the previous best method; it also performs better on the self-built benchmark CIEBench (86 samples) and effectively keeps regions that should be preserved consistent. Conclusion: CIELR completes complex image editing tasks efficiently without joint fine-tuning, balances accuracy and consistency, and releases a new benchmark to advance the field. Abstract: Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples in public datasets for complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at https://github.com/Jia-shao/Reasoning-Editing.

[94] RzenEmbed: Towards Comprehensive Multimodal Retrieval

Weijian Jian,Yajun Zhang,Dawei Liang,Chunyu Xie,Yixiao He,Dawei Leng,Yuhui Yin

Main category: cs.CV

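TL;DR: RzenEmbed is a unified multimodal embedding framework supporting text, images, videos, and visual documents; with two-stage training and an improved InfoNCE loss, it reaches state-of-the-art performance on the MMEB benchmark.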

Details

Motivation: Existing CLIP-based methods focus mainly on natural images and offer limited support for other key visual modalities such as videos and visual documents, which restricts their generality in cross-modal retrieval. Method: RzenEmbed uses a two-stage training strategy: the first stage learns foundational text and multimodal retrieval; the second stage introduces an improved InfoNCE loss with a hardness-weighted mechanism and a method for reducing the impact of false negatives, and further combines a learnable temperature parameter with model souping. Result: RzenEmbed sets a new state of the art on the MMEB benchmark, with the best overall score and clear gains over prior methods on video and visual document retrieval. Conclusion: RzenEmbed achieves efficient unified embedding learning across multiple visual modalities, markedly improves performance on complex retrieval tasks, and strengthens both discriminative power and instruction following, making it practical and extensible. Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
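
To make the idea of hardness weighting concrete, here is a minimal sketch of an InfoNCE loss for a single query in which more similar (harder) negatives receive larger weights. The exponential weighting, the `alpha` parameter, and the mean-one normalisation are assumptions chosen for illustration; the abstract does not specify RzenEmbed's actual weighting scheme.

```python
import math


def hardness_weighted_info_nce(pos_sim, neg_sims, temperature=0.05, alpha=0.0):
    """InfoNCE for one query; harder (more similar) negatives get larger weights.

    alpha = 0 gives every negative weight 1, i.e. the standard InfoNCE loss.
    """
    raw = [math.exp(alpha * s) for s in neg_sims]
    mean_raw = sum(raw) / len(raw)
    weights = [w / mean_raw for w in raw]  # normalise weights to mean 1
    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(w * math.exp(s / temperature) for w, s in zip(weights, neg_sims))
    return -math.log(pos_term / (pos_term + neg_term))
```

With `alpha > 0` the loss is dominated even more strongly by the negatives closest to the query, so gradients concentrate on the challenging samples in the batch.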

[95] FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning

Kenneth Yang,Wen-Li Wei,Jen-Chun Lin

Main category: cs.CV

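TL;DR: Proposes Feedforward-based Parameter Selection (FPS), a gradient-free parameter selection method that picks an optimal parameter subset in a single forward pass, substantially lowering memory use and speeding up parameter selection while maintaining performance.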

Details

Motivation: Existing parameter-efficient fine-tuning methods suffer from either added inference latency or high peak memory usage, so a more efficient and practical way to fine-tune large pre-trained models is needed. Method: FPS scores parameters by the product of their magnitudes and the corresponding input activations and selects the important ones; it needs only one forward pass and no backpropagation, which makes selection efficient. Result: On 24 visual tasks, FPS reduces peak memory usage by about 9x and speeds up parameter selection by about 2x relative to existing methods while keeping competitive performance. Conclusion: FPS is a genuinely memory-efficient and practical approach to fine-tuning large models and opens a new, effective path for parameter-efficient fine-tuning. Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on $24$ visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly $9 \times$ and accelerating parameter selection by about $2 \times$, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
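
The scoring rule described for FPS (parameter magnitude times the corresponding input activation) can be sketched for a single linear layer as follows. Averaging the absolute activations over a batch and the helper names are assumptions made for this illustration; the abstract does not state how activations are aggregated or how the subset size is chosen.

```python
def fps_scores(weight, activations):
    """Score each weight w[i][j] by |w[i][j]| times the mean |x[j]| over the batch."""
    n_in = len(weight[0])
    mean_abs_input = [
        sum(abs(sample[j]) for sample in activations) / len(activations)
        for j in range(n_in)
    ]
    return [[abs(w) * mean_abs_input[j] for j, w in enumerate(row)] for row in weight]


def select_top_k(scores, k):
    """Return the (row, col) indices of the k highest-scoring weights."""
    flat = [(s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [(i, j) for _, i, j in flat[:k]]
```

Because the scores depend only on the weights and on activations recorded during a forward pass, no gradients have to be stored, which is where the memory saving relative to gradient-based selection comes from.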

[96] Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

Meftun Akarsu,Kerem Catay,Sedat Bin Vedat,Enes Kutay Yarkan,Ilke Senturk,Arda Sar,Dafne Eksioglu

Main category: cs.CV

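TL;DR: Proposes a practical two-stage pipeline for fine-tuning open-source video diffusion transformers to generate cinematic scenes from small datasets.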

Details

Motivation: To adapt open-source video diffusion models efficiently to a specific cinematic style using small datasets, enabling fast domain transfer. Method: In the first stage, LoRA modules fine-tune the cross-attention layers of the Wan2.1 I2V-14B model to learn the visual style; in the second stage, the fine-tuned model generates stylistically consistent keyframes, which the video decoder expands into coherent 720p sequences, with lightweight parallelization and sequence partitioning used to accelerate inference. Result: Quantitative and qualitative evaluation with FVD, CLIP-SIM, and LPIPS, together with an expert user study, shows that the method improves on the base model in cinematic fidelity and temporal stability. Conclusion: The two-stage fine-tuning pipeline achieves efficient cinematic style transfer and supports high-fidelity, temporally stable video generation; the complete training and inference code is released to ease reproduction and cross-domain use. Abstract: We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.

[97] Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wu Wei,Xiaomeng Fan,Yuwei Wu,Zhi Gao,Pengxiang Li,Yunde Jia,Mehrtash Harandi

Main category: cs.CV

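TL;DR: Proposes a method called "Alignment across Trees" that builds tree-like hierarchical features for the image and text modalities and embeds them in hyperbolic manifolds with different curvatures to achieve better multimodal alignment.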

Details

Motivation: Existing image-text alignment methods are asymmetric: they usually extract hierarchical features for text but represent each image with a single feature, which leads to poor alignment. Method: A semantic-aware visual feature extraction framework uses textual cues to guide cross-attention over visual class tokens from intermediate Transformer layers, producing visual features with coarse-to-fine semantics; the feature trees of the two modalities are embedded in hyperbolic manifolds with different curvatures, and an intermediate manifold is learned by minimizing a KL distance to align across the heterogeneous manifolds. Result: On classification tasks across multiple image datasets, especially in few-shot and cross-domain settings, the method consistently outperforms strong baselines. Conclusion: The proposed method resolves asymmetric alignment between modalities and improves multimodal representation learning by constructing hierarchical features and aligning them on hyperbolic manifolds. Abstract: Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
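
To show how curvature changes distances in a hyperbolic embedding, the sketch below computes the geodesic distance on a Poincare ball of curvature -c. The choice of the Poincare ball model is an assumption for illustration; the abstract does not say which hyperbolic model the paper uses, and this is not the paper's KL-based alignment measure.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c (c > 0)."""
    sq_norm_x = sum(v * v for v in x)
    sq_norm_y = sum(v * v for v in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    if c * sq_norm_x >= 1.0 or c * sq_norm_y >= 1.0:
        raise ValueError("points must lie strictly inside the ball of radius 1/sqrt(c)")
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_norm_x) * (1.0 - c * sq_norm_y))
    return math.acosh(arg) / math.sqrt(c)
```

The same pair of coordinates is farther apart under a larger `c`, which is why features embedded in manifolds with different curvatures cannot be compared directly and some form of cross-manifold alignment is needed.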

[98] A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection

Sales Aribe Jr

Main category: cs.CV

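TL;DR: Proposes a hybrid framework that fuses forensic features with deep learning representations to improve the robustness and interpretability of deepfake detection, outperforming existing methods on several benchmark datasets.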

Details

Motivation: Existing deepfake detection methods fall short in generalization, resistance to distortions, and adaptation to new manipulation techniques, so solutions that are both robust and interpretable are urgently needed. Method: Forensic features (noise residuals, JPEG compression traces, and frequency-domain descriptors) are fused with deep representations from CNNs and ViTs to build a hybrid detection framework. Result: F1-scores are 0.96, 0.82, and 0.77 on FaceForensics++, Celeb-DF v2, and DFDC, respectively; performance remains stable under compression, adversarial perturbations, and unseen manipulations; Grad-CAM and forensic heatmaps overlap with the ground-truth manipulated regions in 82% of cases. Conclusion: The hybrid approach combines the adaptability of deep models with the interpretability of forensic cues, offering an effective path toward reliable and transparent deepfake detection systems. Abstract: The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.

[99] Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset

Aditya Parikh,Sneha Das,Aasa Feragen

Main category: cs.CV

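TL;DR: This study audits the fairness of the automated segmentation labels in the breast cancer tumor segmentation dataset MAMA-MIA and finds an age-related bias against younger patients that persists after controlling for confounders such as data source.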

Details

Motivation: Deep learning models hold promise for diagnostic workflows, but their fairness in tasks such as image segmentation has not been adequately evaluated. Unaddressed segmentation bias can create disparities in quality of care for specific populations and can be amplified across clinical decisions and model iterations. Method: The authors conduct a fairness audit of the automated segmentation labels in the MAMA-MIA dataset, assessing how segmentation quality varies with age, ethnicity, and data source while controlling for confounding factors. Result: They find an intrinsic age-related bias against younger patients that remains after controlling for factors such as data source, and they find that aggregating data from multiple sources influences site-specific ethnic biases. Conclusion: Data bias must be investigated at a granular level to ensure that medical image segmentation models are fair and do not disadvantage particular patient groups. Abstract: Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that persists even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.

[100] Mitigating Semantic Collapse in Partially Relevant Video Retrieval

WonJun Moon,MinSeok Jung,Gilhan Park,Tae-Young Kim,Cheol-Ho Cho,Woojin Jun,Jae-Pil Heo

Main category: cs.CV

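TL;DR: This paper proposes a new framework for partially relevant video retrieval (PRVR) that uses Text Correlation Preservation Learning and Cross-Branch Video Alignment (CBVA) to address semantic collapse in the text and video embedding spaces, substantially improving retrieval accuracy.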

Details

Motivation: Existing PRVR methods treat every annotated text-video pair as a positive and ignore the rich semantic variation within and across videos, which causes semantic collapse and limits retrieval on complex multi-event videos. Method: Text Correlation Preservation Learning retains the semantic relationships encoded by the foundation model, and Cross-Branch Video Alignment (CBVA) disentangles hierarchical video representations across temporal scales; order-preserving token merging and adaptive CBVA further strengthen alignment. Result: Experiments on multiple PRVR benchmarks show that the method effectively prevents semantic collapse and clearly outperforms existing methods in retrieval accuracy. Conclusion: By mitigating semantic collapse in the text and video embedding spaces, the proposed framework substantially improves partially relevant video retrieval, particularly for complex videos containing several distinct events. Abstract: Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses these problems, termed semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.

Yanlong Yang,Guanxiong Luo

Main category: cs.CV

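TL;DR: Proposes DeblurSDI, a zero-shot, self-supervised blind deconvolution framework that needs no pre-training and recovers the sharp image and blur kernel through an iterative reverse self-diffusion process.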

Details

Motivation: Traditional methods rely on handcrafted priors, while deep learning models require pre-training on large external datasets; both adapt poorly and struggle with blind deconvolution in real-world scenes. Method: Blind deconvolution is modeled as an iterative reverse self-diffusion process that starts from pure noise; two randomly initialized neural networks optimize the sharp image and the blur kernel, guided by an objective that combines data consistency with a sparsity-promoting L1 norm, and a noise scheduling mechanism stabilizes the optimization. Result: Across a range of severely degraded scenarios, DeblurSDI accurately recovers sharp images and blur kernels and shows better robustness and performance than existing methods. Conclusion: As a training-free zero-shot method, DeblurSDI dynamically learns an instance-specific prior for the input image and delivers excellent deblurring and kernel estimation accuracy in blind deconvolution. Abstract: Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
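
The kind of objective described for DeblurSDI (data consistency plus an L1 sparsity penalty on the kernel) can be sketched in one dimension. The 1-D setting, the squared-error data term, and the `l1_weight` value are assumptions made to keep the example small; the paper works on images and optimizes neural networks rather than raw arrays.

```python
def convolve_valid(signal, kernel):
    """1-D 'valid' convolution (kernel flipped, no padding)."""
    k = len(kernel)
    flipped = kernel[::-1]
    return [
        sum(signal[i + j] * flipped[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def deblur_objective(sharp, kernel, blurred, l1_weight=0.01):
    """Data-consistency error plus an L1 sparsity penalty on the kernel."""
    residual = [a - b for a, b in zip(convolve_valid(sharp, kernel), blurred)]
    data_term = sum(r * r for r in residual)
    return data_term + l1_weight * sum(abs(v) for v in kernel)
```

When the candidate image and kernel reproduce the blurred observation exactly, only the L1 penalty remains; any mismatch between the re-blurred estimate and the observation raises the data term.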

[102] CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging

Aon Safdar,Mohamed Saadeldin

Main category: cs.CV

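TL;DR: This paper proposes CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis, which stays lightweight while performing well across multiple medical datasets.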

Details

Motivation: Vision Transformers show strong potential in medical imaging, but their high computational cost and tendency to overfit on small datasets limit their clinical use. Method: CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation, improving performance and generalization through systematic architectural optimization. Result: On twelve MedMNIST datasets, CoMViT matches or outperforms deeper CNN and ViT variants with only about 4.5M parameters, a 5-20x reduction, and Grad-CAM shows that it consistently attends to clinically relevant regions. Conclusion: Through a principled redesign of the ViT, CoMViT demonstrates the feasibility and potential of building efficient, interpretable models for low-resource medical imaging. Abstract: Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
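The abstract names diagonal masking and temperature scaling without defining them. One common reading, assumed here and not confirmed by the source, is that each token is prevented from attending to itself and that the attention logits are divided by a temperature before the softmax. The sketch below shows only that reading for a single head; `masked_attention_weights` and its arguments are hypothetical names, and in a real model the temperature would be learned or computed rather than passed in.

```python
import math

def masked_attention_weights(scores, temperature=1.0):
    """Row-wise softmax over attention logits with the diagonal masked out.

    scores: square list of lists of raw logits (query i, key j), n >= 2.
    temperature: divisor applied to the logits (assumed scalar here).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Self-attention entries (i == j) get -inf so they receive zero weight.
        row = [(-math.inf if i == j else scores[i][j] / temperature)
               for j in range(n)]
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

logits = [[5.0, 1.0, 2.0],
          [0.5, 9.0, 0.5],
          [1.0, 3.0, 7.0]]
attn = masked_attention_weights(logits, temperature=0.5)
```

Without the mask, the large diagonal logits would dominate each row; with it, every token has to distribute its attention over the other tokens. A lower temperature sharpens that distribution.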

[103] From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

Jianwen Sun,Fanrui Zhang,Yukang Feng,Chuanhao Li,Zizhen Li,Jiaxin Ai,Yifan Chang,Yu Dai,Kaipeng Zhang

Main category: cs.CV

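TL;DR: This paper proposes VisPainter, a scientific illustration generation system built on a multi-agent framework whose modular design provides element-level control, and introduces the VisBench benchmark for evaluating scientific illustration quality.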

Details

Motivation: Existing generative models for scientific illustration either lack semantic structure or are unintuitive to operate, so they fall short of the need for efficient, editable, and iterative creation. Method: Proposes the VisPainter multi-agent framework, consisting of Manager, Designer, and Toolbox modules, which uses the model context protocol to generate diagrams compatible with vector graphics software; also builds the VisBench benchmark, whose seven-dimensional metrics assess illustration quality from four aspects: content, layout, visual perception, and interaction cost. Result: Experiments support the soundness of the architecture and the reliability of the evaluation method, produce a fair ranking of various vision-language models, and quantify the effects of role division, step control, and description on illustration quality. Conclusion: VisPainter enables efficient, intuitive, and editable generation of scientific illustrations and, together with VisBench, provides a systematic evaluation scheme that advances scientific visualization generation. Abstract: Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of manipulation. Neither of these two approaches can well meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control so that any element can be added or modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.

[104] A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning

Casper Fibaek,Abi Riley,Kelsey Doerksen,Do-Hyung Kim,Rochelle Schneider

Main category: cs.CV

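TL;DR: Proposes a multi-tiered human-in-the-loop framework for interactive school mapping that combines machine learning, high-resolution remote sensing imagery, and human validation to improve the accuracy and completeness of educational facility data in developing countries.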

Details

Motivation: In developing regions where data are scarce and infrequently updated, existing educational facility records are often incomplete or wrong, so an efficient and accurate way to find and correct these defects is needed. Method: First, machine learning analyzes population density, land cover, and infrastructure to identify potential gaps; next, very high-resolution (VHR) satellite imagery and deep learning models pinpoint candidate school locations, with globally pre-trained models improving generalization; finally, human operators review and correct the results through an interactive interface. Result: Preliminary evaluation indicates that the multi-tiered approach identifies school locations effectively and improves data quality while being scalable and cost-effective; the medium-resolution imagery analysis was removed because it brought no significant improvement. Conclusion: The human-in-the-loop framework offers a feasible, efficient, and scalable solution for spatially mapping educational resources, supporting education planning and resource allocation. Abstract: This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. This tier identifies potential gaps and "mislabelled" schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.

[105] Referee: Reference-aware Audiovisual Deepfake Detection

Hyemin Boo,Eunsang Lee,Jiyoung Lee

Main category: cs.CV

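TL;DR: Proposes Referee, a reference-aware audiovisual deepfake detection method that uses speaker-specific cues from a single example and, through cross-modal feature matching and alignment of identity-related queries, detects unseen forgeries effectively.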

Details

Motivation: Existing audiovisual deepfake detection methods generalize poorly to unseen forgeries, especially realistic ones produced by advanced generative models. Method: Proposes Referee, which uses speaker-specific cues from only one example and matches and aligns identity-related queries from the reference and target content into cross-modal features, jointly reasoning about audiovisual synchrony and identity consistency. Result: In extensive experiments on the FakeAVCeleb, FaceForensics++, and KoDF datasets, Referee achieves state-of-the-art performance under both cross-dataset and cross-language evaluation protocols. Conclusion: Cross-modal identity verification is important for future deepfake detection, and the reference-aware mechanism in Referee substantially improves generalization to unseen forgeries. Abstract: Deepfakes generated by advanced generative models have rapidly become a serious threat, yet existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.

[106] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu,Cheng Wang,Dingkang Liang,Zongchuang Zhao,Xingyu Jiang,Peng Zhang,Xiang Bai

Main category: cs.CV

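TL;DR: This paper proposes a multi-task approach to underwater scene understanding: it builds NautData, a large-scale dataset of 1.45M image-text pairs, and designs a physics-prior-based vision feature enhancement (VFE) module that is integrated into LLaVA-1.5 and Qwen2.5-VL to form NAUTILUS, a new underwater large multimodal model with substantially improved performance on multiple tasks.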

Details

Motivation: There is no large-scale underwater dataset that supports multi-task instruction tuning, and underwater image degradation seriously harms model performance, holding back automated underwater exploration. Method: Builds the NautData dataset covering eight underwater scene understanding tasks; introduces physical priors derived from underwater imaging models and designs a plug-and-play vision feature enhancement (VFE) module that restores clear information, integrating it into existing large models such as LLaVA-1.5 and Qwen2.5-VL. Result: Experiments on NautData and public underwater datasets show that the VFE module substantially improves the baselines on most tasks, supporting the advantage of NAUTILUS in underwater scene understanding. Conclusion: With a large-scale dataset and physics-guided feature enhancement, NAUTILUS advances multi-task underwater scene understanding and shows good robustness and application potential. Abstract: Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

[107] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Jiawei Gu,Yunzhuo Hao,Huichen Will Wang,Linjie Li,Michael Qizhe Shieh,Yejin Choi,Ranjay Krishna,Yu Cheng

Main category: cs.CV

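TL;DR: Proposes ThinkMorph, a model trained on high-quality interleaved reasoning traces so that text and image thoughts complement each other in multimodal reasoning; it delivers large gains on vision-centric tasks and exhibits emergent intelligence.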

Details

Motivation: To explore how language and vision can cooperate effectively in multimodal reasoning, and to establish that a meaningful interleaved chain of thought should be advanced jointly by complementary rather than isomorphic modalities. Method: Builds ThinkMorph, a model fine-tuned on 24K high-quality interleaved reasoning traces, which learns to generate progressive text-image reasoning steps that concretely manipulate visual content while keeping the verbal logic coherent. Result: Outperforms the base model by 34.7% on average on vision-centric benchmarks, generalizes to out-of-domain tasks, matches or surpasses larger and proprietary vision-language models, and exhibits unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling from diversified multimodal thoughts. Conclusion: A unified model that interleaves complementary multimodal thoughts can effectively advance complex reasoning, pointing to directions for developing the emergent capabilities of multimodal models. Abstract: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

Elena Mulero Ayllón,Linlin Shen,Pierangelo Veltri,Fabrizia Gelardi,Arturo Chiti,Paolo Soda,Matteo Tortora

Main category: cs.CV

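TL;DR: Proposes vMambaX, a lightweight multimodal framework that combines PET and CT images for lung tumor segmentation; a Context-Gated Cross-Modal Perception Module strengthens feature interaction, and the model outperforms baselines on the PCLT20K dataset at lower computational complexity.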

Details

Motivation: Accurate lung tumor segmentation is vital for diagnosis and treatment planning, but effectively fusing the anatomical and functional information from PET and CT remains challenging. Method: Built on the Visual Mamba architecture, the framework introduces a Context-Gated Cross-Modal Perception Module (CGM) that adaptively enhances multimodal feature interaction, emphasizing informative regions and suppressing noise. Result: On the PCLT20K dataset, vMambaX outperforms baseline models while maintaining lower computational complexity. Conclusion: Adaptive cross-modal gating effectively improves multimodal tumor segmentation, and vMambaX is efficient and scalable, making it suitable for advanced lung cancer analysis. Abstract: Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.

[109] Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds

Khandoker Ashik Uz Zaman,Mohammad Zahangir Alam,Mohammed N. M. Ali,Mahdi H. Miraz

Main category: cs.CV

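TL;DR: Proposes a deep-learning-based robust watermarking framework for 3D point clouds that combines SVD and PointNet++ for copyright protection and ownership verification, substantially outperforming traditional methods under a variety of attacks.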

Details

Motivation: 3D point clouds are vulnerable to geometric and non-geometric attacks, and traditional watermarking methods struggle to protect intellectual property, so more robust watermarking is needed. Method: Embeds a binary watermark into the singular values of point cloud blocks using Singular Value Decomposition (SVD) and extracts it with a PointNet++ network, using deep learning to improve resistance to attacks. Result: Validated on the ModelNet40 dataset; under the 70% crop attack, the deep learning approach reaches a bitwise accuracy of 0.83 and an IoU of 0.80, compared with 0.58 and 0.26 for traditional SVD extraction. Conclusion: The proposed method recovers watermarks effectively even under severe distortion, substantially improving the robustness and accuracy of 3D point cloud copyright protection. Abstract: The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of deep learning using the PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods, with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
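The two reported metrics, bitwise accuracy and IoU, can be made concrete for a binary watermark. The abstract does not say how IoU is computed, so the sketch below assumes it is taken over the positions of the 1-bits; the function names and the example bit strings are hypothetical.

```python
def bit_accuracy(original, extracted):
    """Fraction of positions where the extracted bit equals the embedded bit."""
    matches = sum(1 for o, e in zip(original, extracted) if o == e)
    return matches / len(original)

def bit_iou(original, extracted):
    """Intersection over union of the 1-bit positions (assumed definition)."""
    inter = sum(1 for o, e in zip(original, extracted) if o == 1 and e == 1)
    union = sum(1 for o, e in zip(original, extracted) if o == 1 or e == 1)
    return inter / union if union else 1.0

embedded  = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [1, 0, 0, 1, 0, 1, 1, 0]  # two bits flipped by a simulated attack

acc = bit_accuracy(embedded, recovered)
iou = bit_iou(embedded, recovered)
```

Under this definition IoU drops faster than accuracy when bits flip, because correctly recovered 0-bits do not count toward it. That is consistent with the larger gap between the two methods on IoU than on accuracy, though the paper's exact definition may differ.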

[110] MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series

Xue Xia,Randall Balestriero,Tao Zhang,Yixin Zhou,Andrew Ding,Dev Saini,Lorenz Hurni

Main category: cs.CV

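TL;DR: This paper proposes MapSAM2, a unified framework built on a visual foundation model for automatically segmenting historical map images and time series; it treats map tiles or time series as videos and uses the memory attention mechanism to improve segmentation accuracy, performing especially well under limited supervision.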

Details

Motivation: Historical maps are important archives of geographic features, but their stylistic variability and the scarcity of annotated data make automated analysis difficult; building linked spatio-temporal datasets is time-consuming and labor-intensive, so efficient methods are needed to support research such as building dating and environmental change. Method: Treats a set of tiles from a historical map image as video input and uses the temporal modeling capability of a visual foundation model with few-shot fine-tuning; for time series, builds the Siegfried Building Time Series Dataset and proposes generating pseudo time series from single-year maps by simulating temporal transformations to reduce annotation cost. Result: Experiments show that MapSAM2 learns temporal associations effectively and, under limited supervision or with pseudo videos, accurately segments and links buildings across time series, with substantially improved geometric accuracy, particularly for areal features. Conclusion: MapSAM2 provides an efficient and scalable solution for automated analysis of historical maps, advancing map-based time series understanding and spatio-temporal dataset construction. Abstract: Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, and studying environmental changes. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.

[111] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Alexis Joly

Main category: cs.CV

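TL;DR: This paper proposes CroVCA (Cross-View Code Alignment), a simple and unified method for learning binary codes that uses a binary cross-entropy loss and coding-rate maximization to obtain cross-view consistency and code diversity; it achieves state-of-the-art results on several benchmarks in only 5 training epochs.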

Details

Motivation: Existing hashing methods often rely on complex pipelines, multi-term objectives, or paradigm-specific designs, and they train slowly, making it hard to obtain codes that are both compact and discriminative for efficient retrieval; a simple, general, and efficient way to learn binary codes is therefore needed. Method: Proposes the CroVCA framework, which aligns the binary codes of semantically related views with a single binary cross-entropy loss and adds coding-rate maximization as an anti-collapse regularizer; designs HashCoder, a lightweight MLP with batch normalization that can serve as a probing head on frozen embeddings or adapt the encoder through LoRA fine-tuning. Result: Achieves state-of-the-art results on multiple benchmarks in only 5 training epochs; at 16 bits, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. Conclusion: CroVCA is an efficient, flexible, and broadly applicable hashing method that learns high-quality binary codes in very little time and suits large-scale retrieval. Abstract: Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
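The retrieval step that hashing speeds up, nearest-neighbor search by Hamming distance over binary codes, can be sketched in a few lines. This illustrates only the search side and says nothing about how CroVCA learns its codes; the helper names and the 4-bit toy codes are hypothetical.

```python
def to_code(bits):
    """Pack a sequence of 0/1 values into a single integer code."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code

def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")

def search(query, database, k=1):
    """Indices of the k database codes closest to the query (ties by index)."""
    order = sorted(range(len(database)),
                   key=lambda i: (hamming(query, database[i]), i))
    return order[:k]

database = [to_code(b) for b in ([0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1])]
query = to_code([1, 1, 0, 1])
nearest = search(query, database, k=2)
```

Each comparison is one XOR and one bit count, which is why short codes such as the 16-bit setting mentioned in the abstract make search over large databases cheap compared with distances between high-dimensional float embeddings.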

[112] ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

Samarup Bhattacharya,Anubhab Bhattacharya,Abir Chakraborty

Main category: cs.CV

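TL;DR: Proposes ANCHOR, a framework that combines adversarial training with hard-sample mining in contrastive learning to improve model robustness to adversarial attacks.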

Details

Motivation: Adversarial attacks exploit gradient vulnerabilities, misleading model decisions with tiny perturbations and exposing the lack of robustness in neural networks. Method: Uses supervised contrastive learning with explicit hard positive mining so that samples of the same class, together with their augmented and perturbed versions, cluster in the embedding space, which stabilizes the representation. Result: On CIFAR-10, ANCHOR outperforms standard adversarial training methods in both clean accuracy and robust accuracy under the PGD-20 attack. Conclusion: Combining adversarial training with hard-mined contrastive learning helps models learn more structured and robust feature representations, narrowing the gap between accuracy and robustness. Abstract: Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
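For readers unfamiliar with the evaluation setting, PGD-20 with epsilon = 0.031 (roughly 8/255 on a [0, 1] pixel scale) denotes 20 steps of projected gradient ascent inside an L-infinity ball. The sketch below shows that loop on a toy linear loss. The step size of 0.007 and the stand-in gradient are assumptions, since the abstract gives only the iteration count and epsilon, and none of this is the ANCHOR training code.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_linf(x, grad_fn, eps=0.031, step=0.007, iters=20):
    """Projected gradient ascent within an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(iters):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            moved = x_adv[i] + step * sign(g)                 # ascend the loss
            moved = min(max(moved, x[i] - eps), x[i] + eps)   # project onto the ball
            x_adv[i] = min(max(moved, 0.0), 1.0)              # stay a valid pixel
    return x_adv

# Toy loss L(x) = w . x with constant gradient w (hypothetical stand-in for a network).
w = [1.0, -2.0, 0.0]
x = [0.5, 0.5, 0.5]
x_adv = pgd_linf(x, lambda _: w)
```

Because 20 steps of 0.007 exceed epsilon, the projection is what bounds the perturbation: every coordinate with a nonzero gradient ends exactly eps away from its clean value.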

[113] Who Made This? Fake Detection and Source Attribution with Diffusion Features

Simone Bonechi,Paolo Andreini,Barbara Toniella Corradini

Main category: cs.CV

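TL;DR: This paper proposes FRIDA, a lightweight framework that uses internal activation features from a pre-trained diffusion model for deepfake image detection and source generator identification, reaching state-of-the-art cross-generator performance without fine-tuning.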

Details

Motivation: With the rapid progress of generative diffusion models, synthetic images are increasingly hard to tell from real ones, raising concerns about authenticity, copyright, and misinformation; existing supervised detectors generalize poorly to unseen generators and depend on large labeled datasets and frequent retraining. Method: FRIDA uses internal activations (diffusion features) from a pre-trained diffusion model, applying a k-nearest-neighbor classifier for fake detection and a compact neural model for source attribution, with no fine-tuning. Result: Achieves state-of-the-art performance on cross-generator detection and accurately identifies the source generator of an image, indicating that diffusion features naturally contain generator-specific patterns. Conclusion: Diffusion model representations inherently encode generator-specific patterns, and FRIDA offers a simple, interpretable approach to synthetic image forensics. Abstract: The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
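The detection stage described above amounts to a k-nearest-neighbor vote over feature vectors. The sketch below shows that vote on made-up 2D points that stand in for diffusion features; how FRIDA extracts its features, its choice of k, and its distance metric are not given in the abstract, so Euclidean distance, k = 3, and all names here are assumptions.

```python
import math
from collections import Counter

def knn_predict(query, features, labels, k=3):
    """Majority label among the k stored features nearest to the query."""
    ranked = sorted((math.dist(query, f), i) for i, f in enumerate(features))
    votes = Counter(labels[i] for _, i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2D stand-ins for features of known real and generated images.
features = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["real", "real", "real", "fake", "fake", "fake"]

pred_a = knn_predict((0.95, 1.0), features, labels)
pred_b = knn_predict((0.05, 0.1), features, labels)
```

A classifier of this kind has no trainable parameters, so adding examples from a new generator only means appending their features and labels to the reference set, with no retraining.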

[114] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu,Beichen Zhang,Yuhang Zang,Yuhang Cao,Long Xing,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang

Main category: cs.CV

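TL;DR: This paper proposes Spatial-SSRL, a self-supervised reinforcement learning method that derives verifiable signals from ordinary RGB or RGB-D images to build five annotation-free pretext tasks, substantially improving large vision-language models on spatial understanding tasks.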

Details

Motivation: Existing supervised fine-tuning and reinforcement learning methods depend on costly annotation, specialized tools, or constrained environments and are hard to scale, so a scalable, low-cost self-supervised method is needed to improve the spatial understanding of LVLMs. Method: Proposes Spatial-SSRL with five pretext tasks based on the structure of the image itself: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction; these tasks generate verifiable supervision signals for reinforcement learning. Result: On seven image and video spatial understanding benchmarks, Spatial-SSRL improves average accuracy over the Qwen2.5-VL baselines by 4.63% (3B) and 3.89% (7B) while preserving general visual capabilities. Conclusion: With simple, intrinsic self-supervision, Spatial-SSRL enables scalable reinforcement learning and offers a practical route to stronger spatial intelligence in large vision-language models. Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
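To show why such pretext tasks need no annotation, here is a minimal sketch of the first one, shuffled patch reordering, with an exact-match reward. The answer format, the binary reward, and the function names are assumptions made for illustration; the abstract says only that the answers are easy to verify.

```python
import random

def make_shuffle_task(patches, seed):
    """Shuffle patches and record, for each original index, its new position."""
    rng = random.Random(seed)
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    shuffled = [patches[i] for i in perm]  # shuffled[j] is patches[perm[j]]
    answer = [perm.index(i) for i in range(len(patches))]
    return shuffled, answer

def reward(prediction, answer):
    """Verifiable reward: 1.0 only if the predicted ordering is exactly right."""
    return 1.0 if list(prediction) == list(answer) else 0.0

patches = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, answer = make_shuffle_task(patches, seed=0)
restored = [shuffled[j] for j in answer]
```

The ground truth comes from the shuffle itself, so the task can be generated from any image and checked mechanically. The other four tasks listed in the abstract presumably follow the same pattern with different transformations.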

[115] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

John Won,Kyungmin Lee,Huiwon Jang,Dongyoung Kim,Jinwoo Shin

Main category: cs.CV

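TL;DR: Proposes DUal-STream diffusion (DUST), a world-model augmented vision-language-action (VLA) framework that keeps modality streams separate while sharing information across them, resolving the modality conflict between state prediction and action generation and improving performance on both simulated and real-robot tasks.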

Details

Motivation: Existing VLA models struggle to jointly predict next-state observations and action sequences because of the differences between the modalities, making it hard to model the joint distribution of vision and action. Method: Proposes DUST, a multimodal diffusion transformer architecture that maintains separate vision and action streams; introduces independent noise perturbations and a decoupled flow-matching loss, and designs a joint sampling method with asynchronous evolution to support test-time scaling. Result: Gains of up to 6% on simulated benchmarks such as RoboCasa and GR-1, with test-time scaling adding a further 2-5%; a 13% higher success rate on real-world tasks with the Franka Research 3; and significant transfer gains on RoboCasa after pre-training on action-free BridgeV2 videos. Conclusion: By decoupling modality modeling, DUST resolves the modality conflict in VLAs, combines strong performance with scalability, proves effective in both simulation and the real world, and shows potential for large-scale pre-training. Abstract: Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.

[116] Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

Riccardo Brioschi,Aleksandr Alekseev,Emanuele Nevali,Berkay Döner,Omar El Malki,Blagoj Mitrevski,Leandro Kieliger,Mark Collier,Andrii Maksai,Jesse Berent,Claudiu Musat,Efi Kokiopoulou

Main category: cs.CV

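TL;DR: Proposes a new approach to graphic layout generation guided by user sketches: a multimodal transformer takes the sketch and the content assets as input to produce high-quality layouts, large-scale synthetic training data removes the cost of collecting sketches, and the method outperforms existing approaches on several public datasets.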

Details

Motivation: User constraints in existing layout generation methods usually require complex specifications that reduce usability, so a more intuitive and accessible form of constraint is needed to guide layout generation. Method: Proposes a multimodal transformer-based approach that takes a user-provided sketch and the content assets as input to generate layouts, and synthesizes large-scale training sketches with a novel and efficient method. Result: Experiments on three public datasets (PubLayNet, DocLayNet, and SlidesVQA) show that the method outperforms existing constraint-based methods while offering a more intuitive design experience; about 200k synthetic sketches are released to support follow-up research. Conclusion: Sketch-to-layout is a promising and under-explored research direction; the proposed method performs well, improves the user experience, and provides a valuable resource for future work. Abstract: Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.

[117] VessShape: Few-shot 2D blood vessel segmentation by leveraging shape priors from synthetic images

Cesar H. Comin,Wesley N. Galvão

Main category: cs.CV

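TL;DR: This paper proposes VessShape, a method that generates large-scale 2D synthetic datasets with tubular structures to make segmentation models more sensitive to vessel shape, improving cross-domain generalization in few-shot and zero-shot settings.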

Details

Motivation: Scarce annotated data and the tendency of convolutional neural networks to learn texture features cause vessel segmentation models to generalize poorly across imaging modalities, so geometric priors are needed to improve robustness and data efficiency. Method: Proposes VessShape, which generates 2D synthetic images containing procedurally generated tubular geometries with diverse foreground and background textures so that models attend to shape rather than texture; uses a pre-train then fine-tune strategy and evaluates few-shot and zero-shot performance on real data. Result: On two real-world datasets from different domains, fine-tuning on only 4 to 10 samples gives strong segmentation performance, and the model shows notable zero-shot transfer, segmenting vessels in unseen domains without any target-domain training. Conclusion: Pre-training with a strong shape bias is an effective strategy for overcoming data scarcity and improving cross-domain generalization of vessel segmentation models. Abstract: Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.

[118] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

Congzhang Shao,Quan Yuan,Guiyang Luo,Yue Hu,Danni Wang,Yilin Liu,Rui Pan,Bo Chen,Jinglin Li

Main category: cs.CV

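TL;DR: This paper proposes NegoCollab, a heterogeneous collaborative perception method based on a negotiated common representation. A negotiator generates a common representation that accounts for each agent's local representation, and several alignment losses are combined for more effective feature alignment.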

Details

Motivation: Existing collaborative perception methods, when handling heterogeneous agents with fixed and differing perception models, designate one specific agent's representation as the common representation, which makes it hard to align domains that differ substantially and degrades performance. Method: NegoCollab introduces a negotiator during training that derives a negotiated common representation from the agents' local representations; a sender and a receiver perform the two-way transformation between local and common representations, and distribution, structural, and pragmatic alignment losses jointly supervise training. Result: The method effectively narrows the domain gap between each agent and the common representation and improves collaborative perception among heterogeneous agents, particularly where domain differences are large. Conclusion: By building a more balanced common representation through negotiation and combining it with multi-level alignment, NegoCollab substantially improves the robustness and effectiveness of heterogeneous collaborative perception systems. Abstract: Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.

[119] Gaussian Combined Distance: A Generic Metric for Object Detection

Ziqian Guan,Xieyi Fu,Pengjun Huang,Hengyuan Zhang,Hubin Du,Yongtao Liu,Yinglin Wang,Qang Ma

Main category: cs.CV

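TL;DR: This paper proposes the Gaussian Combined Distance (GCD) to address two problems in small object detection: IoU is sensitive to positional deviations, and the Wasserstein Distance lacks scale invariance and converges slowly. GCD is scale invariant and supports joint optimization, substantially improving localization and achieving state-of-the-art detection results on multiple datasets.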

Details

Motivation: Existing IoU-based similarity metrics are sensitive to positional deviations in small object detection, while the Wasserstein Distance lacks scale invariance and optimizes inefficiently, hurting model generalization and precision. Method: The Gaussian Combined Distance (GCD) is proposed as a new similarity metric and loss function; theoretical analysis shows that it is scale invariant and supports joint optimization, improving convergence speed and localization accuracy. Result: GCD is validated on AI-TOD-v2, MS-COCO-2017, and Visdrone-2019, outperforming the Wasserstein Distance across scales, with especially strong results on small objects. Conclusion: GCD is a new similarity metric that outperforms IoU and the Wasserstein Distance, offers scale invariance and favorable optimization properties, works with a variety of detectors, and reaches state-of-the-art performance on small object detection. Abstract: In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model's generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at https://github.com/MArKkwanGuan/mmdet-GCD.
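Illustrative sketch (not from the paper's code): the scale-invariance point can be checked numerically. The snippet below assumes the common modeling choice in which a box (cx, cy, w, h) becomes a Gaussian with mean (cx, cy) and covariance diag(w²/4, h²/4), for which the squared 2-Wasserstein distance has a simple closed form. It does not implement GCD itself, whose formula is not given in the abstract; the function names and the example boxes are hypothetical.

```python
import math

def wasserstein2(box_a, box_b):
    """2-Wasserstein distance between axis-aligned Gaussian boxes (cx, cy, w, h)."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 \
        + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    return math.sqrt(sq)

def iou(box_a, box_b):
    """Intersection over union for boxes given as (cx, cy, w, h)."""
    ax0, ax1 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay0, ay1 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx0, bx1 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by0, by1 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def scale(box, k):
    """Scale the whole scene (centers and sizes) by factor k."""
    return tuple(k * v for v in box)

a = (10.0, 10.0, 4.0, 4.0)
b = (11.0, 10.0, 4.0, 6.0)
for k in (1, 8):
    sa, sb = scale(a, k), scale(b, k)
    print(k, round(iou(sa, sb), 4), round(wasserstein2(sa, sb), 4))
```

Under this Gaussian model, enlarging the scene by a factor of 8 leaves IoU unchanged but multiplies the Wasserstein distance by 8, so the same relative misalignment is scored differently for small and large objects.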

[120] Deep learning denoising unlocks quantitative insights in operando materials microscopy

Samuel Degnan-Morgenstern,Alexander E. Cohen,Rajeev Gopal,Megan Gober,George J. Nelson,Peng Bai,Martin Z. Bazant

Main category: cs.CV

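TL;DR: Proposes an unsupervised deep learning denoising framework that applies across many microscopy modalities and length scales and substantially improves the quantitative analysis capability of operando microscopy.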

Details

Motivation: Measurement noise limits the effective resolution and quantitative accuracy of operando microscopy, creating a need for a general, fidelity-preserving denoising method. Method: A general framework integrating unsupervised deep learning denoising is developed and validated on simulated data for its ability to preserve physical fidelity, reduce bias, and lower uncertainty in PDE-constrained optimization. Result: In STXM, optical microscopy, and neutron radiography experiments, the method reveals nanoscale chemical and structural heterogeneity, enables automated particle segmentation and phase classification, and reduces noise-induced variability by nearly 80%. Conclusion: Deep denoising is a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques. Abstract: Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.

[121] Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

Bo Li,Duyuan Zheng,Xinyang Liu,Qingwen Li,Hong Li,Hongyan Cui,Ge Gao,Chen Liu

Main category: cs.CV

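TL;DR: This paper proposes Sh-ViT, a lightweight and robust model for occluded person re-identification. Built on ViT-Base, it adds a Shuffle module, scenario-adapted augmentation, and knowledge distillation, and performs strongly on both the self-built MyTT dataset and Market1501.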

Details

Motivation: Existing methods handle occlusion, viewpoint changes, and low-quality images in surveillance scenes poorly; most rely on complex modules or work only on clear frontal images, which falls short of practical surveillance needs. Method: Sh-ViT builds on the ViT-Base architecture with three components: 1) a Shuffle module in the final Transformer layer to break spatial correlations; 2) surveillance-oriented data augmentation (geometric transforms, erasing, blur, color adjustment); 3) DeiT-based knowledge distillation to improve learning from few samples. The MyTT dataset, which contains many occlusion cases, is also constructed for evaluation. Result: Sh-ViT reaches 83.2% Rank-1 and 80.1% mAP on MyTT and 94.6% Rank-1 and 87.5% mAP on Market1501, outperforming CNN and ViT baselines and current state-of-the-art methods. Conclusion: Sh-ViT improves robustness to occlusion and blur without external modules, offering an efficient and practical solution for personnel identification in surveillance environments. Abstract: Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
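Illustrative sketch (an assumption, not the paper's module): the abstract does not specify how the Shuffle module operates, so the snippet below only shows one plausible reading of "breaking spatial correlations", namely randomly permuting the order of patch tokens while leaving a leading class token in place. The token layout (class token in row 0), the function name `shuffle_patch_tokens`, and the toy shapes are all hypothetical.

```python
import numpy as np

def shuffle_patch_tokens(tokens, rng):
    """Permute patch tokens; tokens has shape (1 + num_patches, dim), row 0 = class token."""
    cls_token, patches = tokens[:1], tokens[1:]
    perm = rng.permutation(patches.shape[0])   # random new order for the patch tokens
    return np.concatenate([cls_token, patches[perm]], axis=0)

rng = np.random.default_rng(0)
tokens = np.arange(5 * 3, dtype=float).reshape(5, 3)   # 1 class token + 4 patch tokens
shuffled = shuffle_patch_tokens(tokens, rng)
print(shuffled)
```

The permutation keeps every token's content intact and changes only its position, so a layer applied afterwards cannot rely on a fixed spatial arrangement of the patches.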

[122] PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Danyal Maqbool,Changhee Lee,Zachary Huemann,Samuel D. Church,Matthew E. Larson,Scott B. Perlman,Tomas A. Romero,Joshua D. Warner,Meghan Lubner,Xin Tie,Jameson Merkow,Junjie Hu,Steve Y. Cho,Tyler J. Bradshaw

Main category: cs.CV

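TL;DR: This study proposes PETAR-4B, a vision-language model for 3D PET/CT images that combines PET, CT, and lesion contour information to generate spatially grounded radiology reports.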

Details

Motivation: Existing vision-language models in medicine are mostly limited to 2D imaging, whereas 3D PET/CT data involve large volumes, small and dispersed lesions, and complex reports, creating a need for multimodal models that can handle 3D medical images. Method: A large dataset of more than 11,000 lesion-level descriptions with 3D segmentations is constructed, and PETAR-4B is developed on it; the model fuses PET, CT, and lesion mask information to support spatially aware report generation. Result: Automated and human evaluations both show that PETAR substantially outperforms existing methods in PET/CT report generation quality, advancing 3D medical vision-language understanding. Conclusion: PETAR-4B advances the use of vision-language models on 3D medical images and offers a new approach to automated radiology report generation. Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.

[123] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Xiangyu Fan,Zesong Qiu,Zhuguanyu Wu,Fanzhou Wang,Zhiqian Lin,Tianxiang Ren,Dahua Lin,Ruihao Gong,Lei Yang

Main category: cs.CV

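TL;DR: Proposes Phased DMD, a multi-step distillation framework that combines phase-wise distillation with Mixture-of-Experts (MoE), improving generation diversity and model capacity while remaining efficient.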

Details

Motivation: DMD underperforms on complex generative tasks because of limited model capacity, and extending it directly to multi-step distillation increases memory use and computational depth and reduces generation diversity. Method: The SNR range is divided into subintervals; the model is trained progressively through phase-wise distribution matching and score matching within subintervals, with MoE added to increase capacity. Result: Phased DMD is validated on image and video generation models (such as Qwen-Image and Wan2.2), preserving generation diversity better than DMD while maintaining generative capability. Conclusion: Phased DMD is an effective multi-step distillation method that improves generation quality and diversity without sacrificing efficiency. Abstract: Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.

[124] LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar

Soumitra Kundu,Gargi Panda,Saumik Bhattacharya,Aurobinda Routray,Rajlakshmi Guha

Main category: cs.CV

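TL;DR: Proposes LifWavNet, a radar-to-electrocardiogram (ECG) reconstruction method based on a learnable lifting wavelet network. It uses a multi-resolution analysis and synthesis model for non-contact cardiac monitoring and outperforms existing methods.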

Details

Motivation: Enable non-contact, unobtrusive cardiac monitoring and overcome the limitations of fixed-wavelet approaches in radar-to-ECG reconstruction. Method: LifWavNet uses learnable lifting wavelet units for multi-resolution feature extraction and ECG waveform reconstruction, and adds a multi-resolution short-time Fourier transform (STFT) loss to improve consistency in the temporal and spectral domains. Result: On two public datasets, LifWavNet outperforms state-of-the-art methods in ECG reconstruction quality and in estimating vital signs such as heart rate and heart rate variability; visualization of intermediate features adds interpretability. Conclusion: LifWavNet is a robust framework for radar-based non-contact ECG measurement, with good reconstruction accuracy and physiologically meaningful output. Abstract: Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.
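Illustrative sketch (a fixed Haar lifting step, not LifWavNet): the lifting and inverse lifting units mentioned in the abstract follow the general split, predict, update pattern of the lifting scheme. The snippet below shows that pattern with fixed Haar coefficients so that perfect reconstruction is easy to verify. In the paper the predict and update operators are learnable; the fixed operators, the function names, and the toy signal here are assumptions for illustration only.

```python
import numpy as np

def lifting_forward(x):
    """One Haar lifting step on an even-length 1D signal: returns (approx, detail)."""
    even, odd = x[0::2], x[1::2]      # split into even and odd samples
    detail = odd - even               # predict: estimate odd samples from even ones
    approx = even + 0.5 * detail      # update: keep the pairwise mean in the coarse band
    return approx, detail

def lifting_inverse(approx, detail):
    """Undo the lifting step by reversing update and predict, then merging."""
    even = approx - 0.5 * detail
    odd = detail + even
    x = np.empty(even.size + odd.size, dtype=float)
    x[0::2], x[1::2] = even, odd
    return x

signal = np.array([2.0, 4.0, 6.0, 8.0, 5.0, 1.0, 0.0, 3.0])
approx, detail = lifting_forward(signal)
print(approx, detail)
print(np.allclose(lifting_inverse(approx, detail), signal))
```

Each step is inverted by running the same operators in reverse order with opposite signs, so reconstruction stays exact whatever predict and update functions are used. This structural property is what allows the operators to be replaced with learned ones. Applying `lifting_forward` again to `approx` gives the next, coarser resolution level.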

Table of Contents

cs.CL [Back]

[1] Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling

[2] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence

[3] Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations

[4] Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

[5] Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

[6] Elastic Architecture Search for Efficient Language Models

[7] Dataset Creation and Baseline Models for Sexism Detection in Hausa

[8] Quantitative Intertextuality from the Digital Humanities Perspective: A Survey

[9] Recursive numeral systems are highly regular and easy to process

[10] VISTA Score: Verification In Sequential Turn-based Assessment

[11] LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints

[12] Detecting Data Contamination in LLMs via In-Context Learning

[13] Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models

[14] Characterizing Selective Refusal Bias in Large Language Models

[15] Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

[16] Probability Distributions Computed by Hard-Attention Transformers

[17] Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

[18] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

[19] Identifying the Periodicity of Information in Natural Language

[20] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

[21] Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

[22] MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

[23] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

[24] A Unified Representation Underlying the Judgment of Large Language Models

[25] TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

[26] ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations

[27] From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle

[28] Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs

[29] Awal -- Community-Powered Language Technology for Tamazight

[30] Dynamic Affective Memory Management for Personalized LLM Agents

[31] VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

[32] Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

[33] The aftermath of compounds: Investigating Compounds and their Semantic Representations

[34] Effect of Domain Generalization Techniques in Low Resource Systems

[35] BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization

[36] SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps

[37] Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

[38] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

[39] Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

[40] Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization

[41] MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

[42] SpecAttn: Speculating Sparse Attention

[43] Culture Cartography: Mapping the Landscape of Cultural Knowledge

[44] Continuous Autoregressive Language Models

cs.CV [Back]

[45] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

[46] PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT

[47] DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting

[48] Scale-Aware Curriculum Learning for Ddata-Efficient Lung Nodule Detection with YOLOv11

[49] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

[50] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

[51] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

[52] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

[53] VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video

[54] AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

[55] Hierarchical Transformers for Unsupervised 3D Shape Abstraction

[56] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding

[57] WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

[58] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

[59] Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features

[60] HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition

[61] AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification

[62] How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

[63] Generating Accurate and Detailed Captions for High-Resolution Images

[64] M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

[65] DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model

[66] H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

[67] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

[68] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

[69] Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

[70] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

[71] Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

[72] Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness

[73] SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

[74] Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

[75] MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

[76] Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting

[77] Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis