Table of Contents
cs.CL
[1] A Preliminary Study of RAG for Taiwanese Historical Archives
Claire Lin,Bo-Han Feng,Xuanjun Chen,Te-Lun Yang,Hung-yi Lee,Jyh-Shing Roger Jang
Main category: cs.CL
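TL;DR: This paper studies retrieval-augmented generation (RAG) for Taiwanese historical archives and examines how query characteristics and metadata integration strategies affect system performance. Early metadata integration improves both retrieval and answer accuracy, but hallucinations during generation and temporal or multi-hop queries remain challenging.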
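To make "early metadata integration" concrete, here is a minimal sketch under one assumed reading: metadata fields are folded into the chunk text before indexing, rather than being applied after retrieval. The paper's actual pipeline is not described at this level; the `tokens` overlap scorer is a toy stand-in for an embedding model, and the records are invented for illustration.

```python
import re
from typing import Dict, List


def tokens(text: str) -> set:
    """Lowercased word tokens; a toy stand-in for an embedding model."""
    return set(re.findall(r"\w+", text.lower()))


def indexed_text(chunk: Dict[str, str], early: bool) -> str:
    """With early integration, metadata is folded into the text that gets indexed."""
    if early:
        return f"source {chunk['source']} date {chunk['date']} {chunk['text']}"
    return chunk["text"]


def retrieve(query: str, chunks: List[Dict[str, str]], early: bool, k: int = 1) -> List[Dict[str, str]]:
    """Rank chunks by token overlap with the query and return the top k."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(indexed_text(c, early))), reverse=True)
    return ranked[:k]


# Invented toy records, not taken from either dataset.
chunks = [
    {"source": "Gazette", "date": "1950", "text": "The council debated the budget."},
    {"source": "Gazette", "date": "1951", "text": "The council debated the budget."},
]
query = "council budget 1951"
print(retrieve(query, chunks, early=True)[0]["date"])
print(retrieve(query, chunks, early=False)[0]["date"])
```

In this toy case the two chunks have identical body text, so only the metadata-aware index can separate them for a date-specific query.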
Details
Motivation: Although RAG has shown promise on knowledge-intensive tasks, few studies have examined its use on Taiwanese historical archives; this paper aims to fill that gap.
Method: Builds a RAG pipeline over two historical Traditional Chinese datasets (Fort Zeelandia and the Taiwan Provincial Council Gazette) and systematically analyzes how query characteristics and different metadata integration strategies affect retrieval quality, answer generation, and overall system performance.
Result: Integrating metadata at an early stage improves the accuracy of both retrieval and generation, but RAG systems still suffer from hallucinations and struggle with historical questions that are time-dependent or require multi-step reasoning.
Conclusion: Early metadata fusion helps RAG perform better on historical archives, but further research is needed to address hallucinations and complex historical queries.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
[2] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey
Fatemeh Shahhosseini,Arash Marioriyad,Ali Momen,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban,Shaghayegh Haghjooy Javanmard
Main category: cs.CL
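TL;DR: This survey reviews the use of large language models (LLMs) for scientific idea generation, groups existing methods into five families, and uses creativity-theory frameworks to analyze how they balance creativity and scientific soundness.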
Details
Motivation: Scientific idea generation is central to scientific discovery, but how LLMs balance creativity against scientific rigor is still unclear, so a systematic synthesis is needed to guide future work.
Method: Groups existing methods into five families (external knowledge augmentation, prompt-based distributional steering, inference-time scaling, multi-agent collaboration, and parameter-level adaptation) and analyzes them using Boden's taxonomy of creativity and Rhodes' 4Ps framework.
Result: Characterizes each family in terms of the type of creativity it targets and the source of creativity it emphasizes, and maps the current state of research on LLM-driven scientific ideation.
Conclusion: By connecting methodology with creativity theory, the survey outlines directions toward reliable, systematic, and transformative use of LLMs in scientific research.
Abstract: Scientific idea generation lies at the heart of scientific discovery and has driven human progress, whether by solving unsolved problems or proposing novel hypotheses to explain unknown phenomena. Unlike standard scientific reasoning or general creative generation, idea generation in science is a multi-objective and open-ended task, where the novelty of a contribution is as essential as its empirical soundness. Large language models (LLMs) have recently emerged as promising generators of scientific ideas, capable of producing coherent and factual outputs with surprising intuition and acceptable reasoning, yet their creative capacity remains inconsistent and poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, examining how different approaches balance creativity with scientific soundness. We categorize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we employ two complementary frameworks: Boden's taxonomy of Combinatorial, Exploratory and Transformational creativity to characterize the level of ideas each family is expected to generate, and Rhodes' 4Ps framework (Person, Process, Press, and Product) to locate the aspect or source of creativity that each method emphasizes. By aligning methodological advances with creativity frameworks, this survey clarifies the state of the field and outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery.
[3] GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models
Jiarui Feng,Donghong Cai,Yixin Chen,Muhan Zhang
Main category: cs.CL
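TL;DR: Proposes GRIP, a framework that uses carefully designed fine-tuning tasks to make an LLM internalize the relational information in a graph and store it in lightweight LoRA parameters, so the model can perform a variety of graph-related tasks without access to the original graph.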
Details
Motivation: Existing approaches either convert graph-structured data into text sequences or add extra encoding modules; they incur large token overhead, require large-scale post-training, and suffer from poor modality alignment, which makes it hard to adapt LLMs to structured data.
Method: Inspired by in-parameter knowledge injection for test-time adaptation, GRIP uses purpose-built fine-tuning tasks to inject a graph's relational information into the LLM's LoRA parameters, internalizing the knowledge.
Result: Experiments on multiple benchmarks indicate that the method is both more efficient and more effective than existing approaches, and that it can complete a variety of graph-related tasks without the original graph as input.
Conclusion: GRIP offers an efficient and practical way for LLMs to handle complex graph data at inference time without access to the original graph structure.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modeling sequential textual data and generalizing across diverse tasks. However, adapting LLMs to effectively handle structural data, such as knowledge graphs or web data, remains a challenging problem. Some approaches adopt complex strategies to convert graphs into text sequences, resulting in significant token overhead and rendering them impractical for large-scale graphs. Others introduce additional modules to encode graphs into fixed-size token representations for LLMs. However, these methods typically require large-scale post-training on graph-text corpora and complex alignment procedures, yet often yield sub-optimal results due to poor modality alignment. Inspired by in-parameter knowledge injection for test-time adaptation of LLMs, we propose GRIP, a novel framework that equips LLMs with the ability to internalize complex relational information from graphs through carefully designed fine-tuning tasks. This knowledge is efficiently stored within lightweight LoRA parameters, enabling the fine-tuned LLM to perform a wide range of graph-related tasks without requiring access to the original graph at inference time. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach.
[4] REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
Priyanka Mudgal
Main category: cs.CL
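TL;DR: Proposes REFLEX, a reference-free metric for log summarization based on large language model (LLM) judgment, which gives stable and interpretable assessments of relevance, informativeness, and coherence without gold-standard summaries.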
Details
Motivation: Existing metrics such as ROUGE and BLEU rely on surface-level lexical overlap, and high-quality reference summaries are scarce, which makes log summarization systems hard to evaluate.
Method: Uses LLMs as zero-shot evaluators and defines the REFLEX metric to assess log summaries without references along dimensions such as relevance, informativeness, and coherence.
Result: REFLEX gives stable, fine-grained evaluations across multiple log summarization datasets and separates the outputs of different models more effectively than traditional metrics.
Conclusion: REFLEX provides a scalable way to evaluate log summaries in real-world settings where reference data is scarce or unavailable.
Abstract: Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
[5] It Takes Two: A Dual Stage Approach for Terminology-Aware Translation
Akshat Singh Jaswal
Main category: cs.CL
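TL;DR: Introduces DuTerm, a two-stage architecture for terminology-constrained machine translation that combines a terminology-aware NMT model with prompt-based LLM post-editing.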
Details
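Motivation: To improve machine translation quality under terminology constraints and to address the fluency and contextual-consistency problems caused by traditional strict constraint enforcement.
Method: A two-stage architecture: the first stage fine-tunes a terminology-aware NMT model on large-scale synthetic data; the second stage uses a prompt-based large language model for post-editing, enforcing terminology in a flexible, context-driven way.
Result: Evaluation on English-to-German, English-to-Spanish, and English-to-Russian shows that using the LLM as a context-driven corrector consistently yields higher-quality translations than strict terminology constraints.
Conclusion: In terminology-constrained translation, LLMs are better suited to acting as context-driven mutators than as generators, which reveals an important trade-off between quality and constraint adherence.
Abstract: This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.
[6] Motif 2 12.7B technical report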
Junghwan Lim,Sungmin Lee,Dongseok Kim,Taehyun Kim,Eunhwan Park,Jeesoo Lee,Jeongdoo Lee,Junhyeok Lee,Wai Ting Cheung,Dahye Choi,Jaeheui Her,Jaeyeon Huh,Hanbin Jung,Changjin Kang,Beomgyu Kim,Minjae Kim,Taewhan Kim,Youngrok Kim,Hyukjin Kweon,Haesol Lee,Kungyu Lee,Dongpin Oh,Yeongjae Park,Bokki Ryu,Dongjoo Weon
Main category: cs.CL
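TL;DR: Motif-2-12.7B is an efficient large language model that combines architectural innovation with system-level optimization to achieve strong language understanding and instruction generalization under a limited compute budget.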
Details
Motivation: To improve the efficiency and scalability of large language models under a constrained compute budget, and to explore architectural and training optimization as an alternative to simply scaling up model size.
Method: Uses Grouped Differential Attention (GDA) to separate signal and noise-control pathways, together with system-level optimizations such as the MuonClip optimizer, fused PolyNorm activations, and the Parallel Muon algorithm; the model is pre-trained on 5.5 trillion tokens with a curriculum-driven schedule and then goes through three-stage supervised fine-tuning.
Result: The model is competitive on multiple benchmarks, rivals larger models, and achieves significant gains in training throughput and memory efficiency.
Conclusion: Careful architectural scaling and training optimization can deliver high performance without increasing parameter count, offering a viable path toward efficient large models.
Abstract: We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
[7] Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
Xin Liu,Qiyang Song,Qihang Zhou,Haichao Du,Shaowen Xu,Wenbo Jiang,Weijuan Zhang,Xiaoqi Jia
Main category: cs.CL
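TL;DR: This paper studies the role of multi-head self-attention (MHA) in the multilingual processing of large language models (LLMs). It proposes Language Attention Head Importance Scores (LAHIS), an efficient method for identifying the attention heads that matter for multilingual capability, and finds both language-specific and language-general heads. Building on this, the authors design a lightweight adaptation with only 20 tunable parameters that improves multilingual understanding.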
Details
Motivation: MHA is critical to large language models, but its specific role in multilingual capability is unclear. This paper investigates how MHA supports multilingual processing, with the aim of improving multilingual understanding and generation.
Method: Proposes LAHIS, which estimates the importance of each attention head for different languages with a single forward and backward pass; applies it to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1; and designs a lightweight adaptation module that learns a soft mask to modulate the outputs of language-related heads.
Result: Identifies language-specific and language-general heads; language-specific heads support cross-lingual attention transfer and reduce off-target language generation; the lightweight adaptation improves XQuAD accuracy with only 20 tunable parameters.
Conclusion: MHA plays a key role in multilingual processing; LAHIS improves the interpretability of multilingual mechanisms in LLMs, and the lightweight adaptation improves multilingual performance.
Abstract: Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate off-target language generation issues, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
[8] LLM Optimization Unlocks Real-Time Pairwise Reranking
Jingyu Wu,Aditya Shrivastava,Jing Zhu,Alfy Samuel,Anoop Kumar,Daben Liu
Main category: cs.CL
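TL;DR: This paper studies how to optimize LLM-based Pairwise Reranking Prompting (PRP) so that document reranking in retrieval-augmented generation systems becomes more efficient, achieving up to a 166x latency reduction with minimal loss in performance.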
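The sketch below illustrates two of the design choices this entry names, limiting the reranked set and judging each pair in a single order. It is an illustration only: all-pairs win counting is one common way to aggregate pairwise preferences and is an assumption here, as are the `top_k` value and the `toy_judge` function, which stands in for the LLM call.

```python
from typing import Callable, List


def pairwise_rerank(
    query: str,
    docs: List[str],
    prefer_first: Callable[[str, str, str], bool],
    top_k: int = 5,
) -> List[str]:
    """Rerank only the first top_k docs, judging each pair once."""
    head, tail = list(docs[:top_k]), list(docs[top_k:])
    wins = [0] * len(head)
    for i in range(len(head)):
        for j in range(i + 1, len(head)):
            # One-directional: the pair (i, j) is judged in a single order,
            # which halves the number of judge calls versus both orders.
            if prefer_first(query, head[i], head[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    # Stable sort: ties keep the original retrieval order.
    order = sorted(range(len(head)), key=lambda i: -wins[i])
    return [head[i] for i in order] + tail


def toy_judge(query: str, a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM judgment: prefer more query-word overlap."""
    q = set(query.lower().split())
    return len(q & set(a.lower().split())) >= len(q & set(b.lower().split()))


docs = ["cats sleep", "fast reranking with llm", "llm reranking", "unrelated"]
reranked = pairwise_rerank("llm reranking", docs, toy_judge, top_k=3)
print(reranked)

# Speedup implied by the reported per-query latencies (seconds).
speedup = 61.36 / 0.37
print(round(speedup))
```

With `top_k` documents the loop makes `top_k * (top_k - 1) / 2` judge calls, so shrinking the reranked set reduces cost quadratically.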
Details
Motivation: Existing LLM-based pairwise reranking is effective but computationally expensive and slow, which makes it hard to use in real-time settings and calls for optimization.
Method: Applies several optimizations to make reranking more efficient: using smaller models, limiting the number of reranked documents, lowering numerical precision, using one-directional order inference to reduce positional bias, and restricting the number of output tokens.
Result: Per-query latency drops from 61.36 seconds to 0.37 seconds, a speedup of about 166x, with only a slight decrease in Recall@k.
Conclusion: Sound design choices can substantially improve the efficiency of LLM-based reranking and make it more suitable for latency-sensitive, real-world applications.
Abstract: Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance the overall quality of Retrieval-Augmented Generation (RAG) systems remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred due to LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a remarkable latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.
[9] LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives
Ratna Kandala,Katie Hoemann
Main category: cs.CL
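TL;DR: This study compares three Dutch-specific large language models (LLMs) with the traditional lexicon-based tools LIWC and Pattern on valence prediction for Flemish. Despite their more advanced architectures, the LLMs underperform the traditional methods on spontaneous text, with Pattern performing best. The results challenge the assumption that LLMs are generally superior for sentiment analysis and point to the need for better-suited evaluation frameworks for low-resource language variants.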
Details
Motivation: To assess how well large language models perform sentiment analysis in a low-resource language variant (Flemish) and to test whether they outperform traditional lexicon-based methods.
Method: Uses approximately 25,000 spontaneous texts from 102 Dutch-speaking participants, each with a self-assessed valence rating (-50 to +50), to compare the valence predictions of three Dutch LLMs (ChocoLlama-8B-Instruct, Reynaerde-7B-chat, GEITje-7B-ultra) against LIWC and Pattern.
Result: All three Dutch LLMs perform worse than the traditional tools LIWC and Pattern; Pattern performs best, showing an advantage in recognizing emotion in real-world text.
Conclusion: Current LLM fine-tuning approaches may not adequately capture the emotional nuances of everyday language, especially in low-resource settings; lexicon-based methods remain competitive on specific tasks, and evaluation frameworks better tailored to culture and language are needed.
Abstract: Understanding emotional nuances in everyday language is crucial for computational linguistics and emotion research. While traditional lexicon-based tools like LIWC and Pattern have served as foundational instruments, Large Language Models (LLMs) promise enhanced context understanding. We evaluated three Dutch-specific LLMs (ChocoLlama-8B-Instruct, Reynaerde-7B-chat, and GEITje-7B-ultra) against LIWC and Pattern for valence prediction in Flemish, a low-resource language variant. Our dataset comprised approximately 25,000 spontaneous textual responses from 102 Dutch-speaking participants, each providing narratives about their current experiences with self-assessed valence ratings (-50 to +50). Surprisingly, despite architectural advancements, the Dutch-tuned LLMs underperformed compared to traditional methods, with Pattern showing superior performance. These findings challenge assumptions about LLM superiority in sentiment analysis tasks and highlight the complexity of capturing emotional valence in spontaneous, real-world narratives. Our results underscore the need for developing culturally and linguistically tailored evaluation frameworks for low-resource language variants, while questioning whether current LLM fine-tuning approaches adequately address the nuanced emotional expressions found in everyday language use.
[10] Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering
Sai Shridhar Balamurali,Lu Cheng
Main category: cs.CL
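TL;DR: Proposes a lightweight answer-evaluation method based on NLI plus a lexical-match flag that matches GPT-4o on long-form QA at a much lower computational cost, and releases DIVER-QA, a new human-annotated benchmark for assessing evaluation metrics.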
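A minimal sketch of how an NLI score and a lexical-match flag might be combined into a single correctness judgment. The combination rule (a logical OR), the 0.5 threshold, and the `entail_prob` callable (a placeholder for any off-the-shelf NLI model that returns an entailment probability) are all assumptions for illustration; the paper's exact rule is not given in this entry.

```python
import re
from typing import Callable, List


def normalize(text: str) -> List[str]:
    """Lowercase and strip punctuation before comparing token sequences."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def lexical_match(reference: str, candidate: str) -> bool:
    """Flag: does the reference answer appear verbatim (after normalization)?"""
    ref, cand = normalize(reference), normalize(candidate)
    n = len(ref)
    return n > 0 and any(cand[i:i + n] == ref for i in range(len(cand) - n + 1))


def judge_answer(
    reference: str,
    candidate: str,
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> bool:
    """Accept if the flag fires or the candidate entails the reference (assumed rule)."""
    return lexical_match(reference, candidate) or entail_prob(candidate, reference) >= threshold


# Constant scorers stand in for a real NLI model in this toy usage.
print(judge_answer("Paris", "The capital is Paris.", lambda p, h: 0.0))
print(judge_answer("Paris", "It is Lyon.", lambda p, h: 0.0))
print(judge_answer("Paris", "It is the French capital.", lambda p, h: 0.9))
```

The lexical flag catches exact answers cheaply, while the NLI score is meant to cover paraphrases that share no surface tokens with the reference.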
Details
Motivation: Existing methods for evaluating LLM answers have shortcomings: lexical metrics miss semantic nuances, while LLM-as-judge scoring is computationally expensive. An accurate and efficient alternative is needed.
Method: Scores answers with an off-the-shelf natural language inference (NLI) model, augmented by a simple lexical-match flag, to form a lightweight evaluation method, and validates it against human annotations on multiple QA datasets.
Result: On long-form QA the method reaches accuracy comparable to GPT-4o (89.9%) with far fewer parameters and much lower compute; the authors also release DIVER-QA, a human-annotated benchmark of 3,000 samples.
Conclusion: Lightweight NLI-based evaluation strikes a good balance between accuracy and efficiency, is practically useful, and comes with an open resource for future research on evaluation metrics.
Abstract: Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative, off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag, and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
[11] Stress Testing Factual Consistency Metrics for Long-Document Summarization
Zain Muhammad Mujahid,Dustin Wright,Isabelle Augenstein
Main category: cs.CL
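TL;DR: This paper systematically evaluates the reliability of six widely used reference-free factuality metrics on long-document summarization. It finds that existing metrics score semantically equivalent summaries inconsistently and become less reliable on information-dense claims. Using seven factuality-preserving perturbations and datasets from different domains, the study exposes these limitations and suggests directions for improvement: multi-span reasoning, context-aware calibration, and training on meaning-preserving variations.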
Details
Motivation: Existing factuality metrics were designed mainly for short-form summarization; on long documents they run into input-length limits and long-range dependencies and cannot reliably assess factual consistency. Their robustness and applicability in the long-document setting therefore need systematic testing.
Method: Systematically evaluates six widely used reference-free factuality metrics, applies seven factuality-preserving perturbations to summaries (paraphrasing, simplification, synonym replacement, logically equivalent negation, vocabulary reduction, compression, and source text insertion), tests the metrics on three long-document benchmarks spanning science fiction, legal, and scientific domains, and analyzes their sensitivity to retrieval context and claim information density.
Result: Existing short-form metrics give inconsistent scores to semantically equivalent summaries in the long-document setting and become less reliable on information-dense claims whose content resembles many parts of the source document. Expanding the retrieval context improves stability in some domains, but no metric consistently maintains factual alignment.
Conclusion: Factuality metrics built for short-form summarization do not carry over to long documents; multi-span reasoning, context-aware calibration, and training on meaning-preserving variations are needed to improve their robustness and reliability.
Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
[12] CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences
Rhitabrat Pokharel,Yufei Tao,Ameeta Agrawal
Main category: cs.CL
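TL;DR: Proposes CAPO, a new multilingual preference optimization method whose dynamic loss scaling based on a relative reward makes it more robust to noisy and low-margin comparisons; it outperforms existing methods in reward accuracy and alignment.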
Details
Motivation: Existing preference optimization methods such as DPO generalize poorly to multilingual settings, especially when handling noisy or low-confidence preference pairs.
Method: Proposes Confidence-Aware Preference Optimization (CAPO), which introduces a dynamic loss scaling mechanism based on a relative reward and adjusts the learning signal according to the confidence in each preference pair.
Result: CAPO improves reward accuracy by at least 16% over existing methods and widens the gap between preferred and dispreferred responses across languages.
Conclusion: By dynamically modulating the learning signal, CAPO improves the robustness and alignment performance of preference optimization in multilingual settings.
Abstract: Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have proven effective in English, they often fail to generalize robustly to multilingual settings. We propose a simple yet effective alternative, Confidence-Aware Preference Optimization (CAPO), which replaces DPO's fixed treatment of preference pairs with a dynamic loss scaling mechanism based on a relative reward. By modulating the learning signal according to the confidence in each preference pair, CAPO enhances robustness to noisy or low-margin comparisons, typically encountered in multilingual text. Empirically, CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy, and improves alignment by widening the gap between preferred and dispreferred responses across languages.
[13] Critical Confabulation: Can LLMs Hallucinate for Social Good?
Peiqi Sui,Eamon Duede,Hoyt Long,Richard Jean So
Main category: cs.CL
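TL;DR: This paper proposes "critical confabulation", which uses the hallucinatory tendencies of large language models to generate evidence-bound alternative narratives where historical archives have gaps, in order to reconstruct the historical roles of overlooked groups.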
Details
Motivation: Social and political inequality has left gaps in historical archives, and existing methods struggle to recover the stories of "hidden figures"; a new approach is needed that can fill these gaps while preserving historical accuracy.
Method: Builds character-centric timelines from a corpus of novels and designs an open-ended narrative cloze task to simulate archival gaps; uses open models such as OLMo-2 as well as proprietary baselines to generate the masked events under a range of prompts, and evaluates how controllable and useful the resulting hallucinations are.
Result: The experiments show that LLMs have the foundational narrative understanding needed for critical confabulation, and that suitably controlled hallucinations can support knowledge production without sacrificing historical accuracy.
Conclusion: Critical confabulation, inspired by critical fabulation, can turn LLM hallucination into a meaningful tool for filling gaps in knowledge and offers a new path toward redressing historical injustice.
Abstract: LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to "fill-in-the-gap" for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history's "hidden figures". We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs' foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.
[14] Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production
Shiva Upadhye,Richard Futrell
Main category: cs.CL
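TL;DR: In two studies of naturalistic speech corpora, using improved predictability measures and more powerful language models, this work examines how contextual predictability (especially backward predictability) affects word duration and substitution errors in spoken production, and clarifies the functional roles of past and future context in lexical selection and encoding.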
Details
Motivation: The role of forward predictability in production and comprehension has been studied extensively, but the role of backward predictability (how predictable a word is given its future context) is still unclear and may involve forward planning during production; better-motivated measures and analysis frameworks are needed to reveal its function.
Method: Using naturalistic speech corpora, the authors propose a new information-theoretic predictability measure that integrates predictive information from both past and future context. The first study revisits the effect of predictability on word duration; the second models, within a generative framework, the effects of lexical, contextual, and communicative factors on substitution errors and predicts the error words that actually occur.
Result: The proposed alternative to backward predictability yields qualitatively similar effects to the classic ones in both studies; a fine-grained analysis of substitution errors shows that different error types reflect how speakers weigh form, meaning, and contextual information during lexical planning.
Conclusion: Past and future context jointly shape lexical encoding and selection in production. Backward predictability effects support the view that production involves forward planning, and the work connects predictability effects with the mechanisms of sentence planning.
Abstract: Contextual predictability shapes both the form and choice of words in online language production. The effects of the predictability of a word given its previous context are generally well understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly understood backward predictability effect of a word given its future context, which may be related to future planning. Here, in two studies of naturalistic speech corpora, we investigate backward predictability effects using improved measures and more powerful language models, introducing a new principled and conceptually motivated information-theoretic predictability measure that integrates predictability from both the future and the past context. Our first study revisits classic predictability effects on word duration. Our second study investigates substitution errors within a generative framework that independently models the effects of lexical, contextual, and communicative factors on word choice, while predicting the actual words that surface as speech errors. We find that our proposed conceptually-motivated alternative to backward predictability yields qualitatively similar effects across both studies. Through a fine-grained analysis of substitution errors, we further show that different kinds of errors are suggestive of how speakers prioritize form, meaning, and context-based information during lexical planning. Together, these findings illuminate the functional roles of past and future context in how speakers encode and choose words, offering a bridge between contextual predictability effects and the mechanisms of sentence planning.
[15] Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark
Hua Zhou,Bing Ma,Yufei Zhang,Yi Zhao
Main category: cs.CL
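TL;DR: This paper presents CUFEInse v1.0, a professional evaluation benchmark for large models in the insurance domain, covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions. It evaluates 11 mainstream large language models, exposes their bottlenecks in professional scenarios such as actuarial work and compliance, and serves as a reference for optimizing and selecting vertical-domain models.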
Details
Motivation: Large language models lack a systematic, authoritative evaluation benchmark for professional vertical domains such as insurance, which makes it hard to assess key capabilities like actuarial work, compliance, and underwriting and claims reasoning, and holds back both academic research and industrial adoption.
Method: Following the principles of "quantitative-oriented, expert-driven, and multi-validation", the authors build the CUFEInse v1.0 evaluation framework around five core dimensions (insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor) and use it to evaluate 11 mainstream large models.
Result: General-purpose models show clear weaknesses in actuarial capability and compliance adaptation; high-quality domain-specific training improves professional performance but still falls short in business adaptation and compliance; the evaluation identifies common bottlenecks in insurance actuarial work, underwriting and claim settlement reasoning, and compliant marketing copywriting.
Conclusion: CUFEInse v1.0 fills the gap in professional evaluation benchmarks for insurance and gives academia and industry a systematic evaluation tool. Its construction concept and methodology are a useful reference for evaluating large models in other vertical domains, and the paper points to "domain adaptation + reasoning enhancement" as the direction for future development.
Abstract: This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of "quantitative-oriented, expert-driven, and multi-validation," the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool. Its construction concept and methodology also offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of "domain adaptation + reasoning enhancement" for insurance large models.
[16] From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory
Siyu Xia,Zekun Xu,Jiajun Chai,Wentian Fan,Yan Song,Xiaohan Wang,Guojun Yin,Wei Lin,Haifeng Zhang,Jun Wang
Main category: cs.CL
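TL;DR: Proposes a trainable, multi-layered graph memory framework whose memory weights are optimized with reinforcement learning, improving the strategic reasoning of LLM agents.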
Details
Motivation: Existing memory mechanisms for large language models suffer from catastrophic forgetting, limited adaptability, or poor interpretability, which makes it hard to use prior experience to guide decisions.
Method: Builds an agent-centric, multi-layered graph memory framework that abstracts agent trajectories into decision paths in a state machine and distills them into high-level, interpretable meta-cognitive strategies; introduces a reinforcement-based weight optimization procedure that adjusts the utility of each meta-cognition according to reward feedback from downstream tasks; and integrates the strategies into LLM training through meta-cognitive prompting.
Result: The learnable graph memory generalizes well across multiple tasks, improves the strategic reasoning performance of LLM agents, and provides consistent gains during reinforcement learning training.
Conclusion: The proposed graph memory framework addresses the limits of existing memory mechanisms in adaptability, interpretability, and practicality, and offers a new approach to improving the autonomous decision-making of LLM agents in complex, open-ended environments.
Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent's training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents' strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.
[17] AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys
Chenxi Lin,Weikang Yuan,Zhuoren Jiang,Biao Huang,Ruitao Zhang,Jianan Ge,Yueqian Xu,Jianxing Yu
Main category: cs.CL
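TL;DR: This paper presents AlignSurvey, the first benchmark to systematically replicate and evaluate the full social survey pipeline. It uses large language models (LLMs) to reconstruct the process from role modeling to response generation, builds a multi-tiered dataset of cross-national dialogue and survey data, and releases the SurveyLM model family, with the goal of making social surveys more flexible, fair, and comparable across cultures.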
Details
Motivation: Traditional social surveys face fixed question formats, high costs, limited adaptability, and difficulty ensuring cross-cultural equivalence. Existing LLM studies are mostly limited to structured questions, do not cover the full survey process, and may under-represent marginalized groups because of training-data bias. A comprehensive, fair, and scalable benchmark for survey simulation is therefore needed.
Method: Proposes the AlignSurvey benchmark with four tasks that correspond to survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. Builds the Social Foundation Corpus, with 44K+ interview dialogues and 400K+ structured records, plus several entire-pipeline survey datasets (such as ASE). Obtains the SurveyLM models through two-stage fine-tuning of open-source LLMs, and designs task-specific metrics for consistency, fidelity, and fairness at the individual and group levels.
Result: Delivers the AlignSurvey benchmark and accompanying data resources for full-pipeline evaluation. Experiments indicate that SurveyLM aligns well and consistently with human survey responses and is fairer across demographic groups, supporting cross-cultural comparison and transparent research.
Conclusion: AlignSurvey provides a reliable, open, and responsible framework for social survey research with large language models, advances automation and inclusiveness in survey science, and has broad potential for academic research and policymaking.
Abstract: Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.
[18] Planned Event Forecasting using Future Mentions and Related Entity Extraction in News Articles
Neelesh Kumar Shukla,Pranay Sanghvi
Main category: cs.CL
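TL;DR: This paper proposes a system based on topic modeling, word embeddings, and named entity recognition that forecasts social unrest events by analyzing news articles, and introduces Related Entity Extraction to identify the key entities actually involved in an event.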
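As a rough illustration of the time-normalization step described for this entry (converting future date mentions into a standard format), the sketch below resolves a few relative expressions against an article's publication date. The helper name `normalize_future_mention`, the handful of phrases it handles, and the choice of ISO 8601 output are all assumptions made for illustration; the abstract does not say which normalizer the authors used.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_future_mention(mention: str, published: date) -> Optional[str]:
    """Resolve a relative date mention to ISO 8601 (YYYY-MM-DD), or None.

    Hypothetical helper: only 'today', 'tomorrow', and '[next] <weekday>'
    are handled, and a weekday always means its next occurrence strictly
    after the publication date.
    """
    text = mention.strip().lower()
    if text == "today":
        return published.isoformat()
    if text == "tomorrow":
        return (published + timedelta(days=1)).isoformat()
    if text.startswith("next "):
        text = text[len("next "):]
    if text in WEEKDAYS:
        # 1..7 days ahead: the same weekday maps to one week later, not today.
        days_ahead = (WEEKDAYS.index(text) - published.weekday() - 1) % 7 + 1
        return (published + timedelta(days=days_ahead)).isoformat()
    return None


published = date(2024, 1, 1)  # publication date of a hypothetical article
print(normalize_future_mention("next Friday", published))
```

Anchoring every mention to the publication date is the key design choice here: a phrase like "tomorrow" is only meaningful relative to when the article appeared.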
Details
Motivation: In democracies such as India, people freely express their views, which can lead to protests held without permission that disrupt public order, so such events need to be forecast in advance so that authorities can respond.
Method: Uses topic modeling and word2vec to filter relevant news, applies NER to identify entities such as people, organizations, locations, and dates, and normalizes time expressions. Proposes Related Entity Extraction to select the entities actually involved in an event.
Result: Develops a geographically independent, generalized model that identifies features and key entities related to civil unrest events, improving forecasting accuracy.
Conclusion: The model extracts key information about future civil unrest events from news, generalizes well, and can help government agencies with early warning and public-safety management.
Abstract: In democracies like India, people are free to express their views and demands. Sometimes this causes situations of civil unrest such as protests, rallies, and marches. These events may be disruptive in nature and are often held without prior permission from the competent authority. Forecasting these events helps administrative officials take necessary action. Usually, protests are announced well in advance to encourage large participation. Therefore, by analyzing such announcements in news articles, planned events can be forecast beforehand. In this paper, we develop such a system to forecast social unrest events, using topic modeling and word2vec to filter relevant news articles and Named Entity Recognition (NER) methods to identify entities such as people, organizations, locations, and dates. Time normalization is applied to convert future date mentions into a standard format. We also develop a geographically independent, generalized model to identify key features for filtering civil unrest events. There could be many mentions of entities, but only a few may actually be involved in the event. This paper calls such entities Related Entities and proposes a method to extract them, referred to as Related Entity Extraction.
[19] Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification
Chenhao Dang,Jing Ma
Main category: cs.CL
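TL;DR: Proposes Manifold-Correcting Causal Flow (MC^2F), which models the distribution of the clean-data manifold in sentence-embedding space, improving the adversarial robustness of text classification models while preserving or even improving performance on clean data.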
Details
Motivation: Improving a model's robustness to adversarial attacks usually degrades its performance on clean data; this paper aims to resolve that trade-off.
Method: Proposes MC^2F with two modules: a Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold, and a Geodesic Purification Solver projects adversarial samples back onto the manifold along the shortest path to restore a clean representation.
Result: Evaluations on three datasets under multiple adversarial attacks show that MC^2F sets a new state of the art in adversarial robustness while fully preserving, and even slightly improving, classification accuracy on clean data.
Conclusion: Explicitly modeling and correcting adversarial perturbations on the embedding manifold achieves high robustness and high accuracy together, breaking the performance trade-off of traditional defenses.
Abstract: A persistent challenge in text classification (TC) is that enhancing model robustness against adversarial attacks typically degrades performance on clean data. We argue that this challenge can be resolved by modeling the distribution of clean samples in the encoder embedding manifold. To this end, we propose the Manifold-Correcting Causal Flow (MC^2F), a two-module system that operates directly on sentence embeddings. A Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold. It identifies out-of-distribution embeddings, which are then corrected by a Geodesic Purification Solver. This solver projects adversarial points back onto the learned manifold via the shortest path, restoring a clean, semantically coherent representation. We conducted extensive evaluations on TC across three datasets and multiple adversarial attacks. The results demonstrate that our method, MC^2F, not only establishes a new state of the art in adversarial robustness but also fully preserves performance on clean data, even yielding modest gains in accuracy.
[20] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning
Songze Li,Zhiqiang Liu,Zhaoyan Gong,Xiaoke Guo,Zhengke Gui,Huajun Chen,Wen Zhang
Main category: cs.CL
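TL;DR: Proposes the Logits-to-Logic framework, which strengthens and filters logits during autoregressive generation to improve the logic consistency of large language models in structured knowledge reasoning, mitigating Logic Drift and achieving state-of-the-art performance on multiple KGQA benchmarks.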
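This entry names two modules, logits strengthening and logits filtering, without giving formulas. A minimal sketch of what an intervention of this kind could look like at a single decoding step is shown below: token ids that are not valid under the knowledge graph are masked out (filtering), and ids consistent with the current reasoning path receive an additive bonus (strengthening). The function names, the `valid_ids` and `boosted_ids` sets, and the additive `bonus` are assumptions for illustration, not the authors' formulation.

```python
import math


def adjust_logits(logits, valid_ids, boosted_ids, bonus=2.0):
    """Filter and strengthen one step of next-token logits.

    Hypothetical scheme: ids outside `valid_ids` are set to -inf so they can
    never be sampled; ids in `boosted_ids` get `bonus` added to their logit.
    """
    if not valid_ids:
        raise ValueError("at least one token id must remain valid")
    adjusted = []
    for token_id, value in enumerate(logits):
        if token_id not in valid_ids:
            adjusted.append(-math.inf)       # filtering
        elif token_id in boosted_ids:
            adjusted.append(value + bonus)   # strengthening
        else:
            adjusted.append(value)
    return adjusted


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of four tokens; id 3 has the highest raw logit but is invalid.
raw = [1.0, 1.0, 1.0, 5.0]
probs = softmax(adjust_logits(raw, valid_ids={0, 1, 2}, boosted_ids={1}))
print(probs)
```

Because the adjustment happens on the output distribution rather than in the prompt, an invalid continuation cannot be generated no matter how strongly the unmodified model prefers it.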
Details
Motivation: LLMs excel at unstructured text, but representational differences make them prone to Logic Drift in structured knowledge reasoning tasks. Existing methods provide only input-level guidance and cannot fundamentally resolve logical inconsistency in the output.
Method: Proposes the Logits-to-Logic framework, whose two core modules, logits strengthening and logits filtering, act directly on the logits produced during autoregressive generation to correct logical defects in the output.
Result: Experiments on multiple Knowledge Graph Question Answering (KGQA) benchmarks show that the method significantly improves the logic consistency of LLMs and achieves state-of-the-art performance.
Conclusion: By intervening on logits during generation, Logits-to-Logic strengthens the ability of LLMs to maintain logic in structured knowledge reasoning and offers a new approach to the Logic Drift problem.
Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to Logic Drift challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the Logic Drift in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs' logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs' logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
[21] Social Media for Mental Health: Data, Methods, and Findings
Nur Shazwani Kamarudin,Ghazaleh Beigi,Lydia Manikonda,Huan Liu
Main category: cs.CL
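TL;DR: This chapter reviews recent methods for studying mental health problems (such as depression, anxiety, and suicidal thoughts) with social media data, emphasizing the analysis of linguistic, visual, and emotional indicators in user posts to improve medical practice, provide timely support, and influence policymaking.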
Details
Motivation: As virtual communities and forums multiply, people increasingly share psychological distress anonymously on social media, which provides a new data source for mental health research. Using these data helps researchers understand user needs and raise mental health awareness.
Method: Surveys existing research methods, including machine learning, feature engineering, natural language processing, and survey methods; categorizes the social media data used; and analyzes linguistic, visual, and emotional indicators of user expression.
Result: Summarizes current findings from mental health research based on social media data, shows how computational techniques can identify psychological states, and outlines the field's potential for medical support and policy impact.
Conclusion: Social media data are a valuable resource for mental health research. Combined with advanced computational methods, they can enable early intervention, raise public awareness, and point to directions for future research and policymaking.
Abstract: There is an increasing number of virtual communities and forums available on the web. With social media, people can freely communicate and share their thoughts, ask personal questions, and seek peer support, especially those with conditions that are highly stigmatized, without revealing personal identity. We study state-of-the-art research methodologies and findings on mental health challenges such as depression, anxiety, and suicidal thoughts, drawn from the pervasive use of social media data. We also discuss how this novel thinking and these approaches can help to raise awareness of mental health issues in an unprecedented way. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. The main goal of this chapter is to show how this new source of data can be tapped to improve medical practice, provide timely support, and influence government or policymakers. In the context of social media for mental health issues, this chapter categorizes the social media data used; introduces the machine learning, feature engineering, natural language processing, and survey methods deployed; and outlines directions for future research.
[22] Distinct Theta Synchrony across Speech Modes: Perceived, Spoken, Whispered, and Imagined
Jung-Sun Lee,Ha-Na Jo,Eunyeong Ko
Main category: cs.CL
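TL;DR: This study compares theta-band neural synchrony across speech modes. Overt and whispered speech show stronger frontotemporal synchrony, perceived speech is dominated by temporoparietal networks, and imagined speech mainly involves local synchrony in frontal and supplementary motor regions.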
Details
Motivation: Previous studies have mostly focused on a single speech mode and lack an integrated comparison of theta synchrony across modes, so the differences in neural mechanisms between modes need systematic analysis.
Method: Analyzes theta-band neural synchrony across speech modes using connectivity metrics, focusing on region-wise variation.
Result: Overt and whispered speech show broader and stronger frontotemporal synchrony; perceived speech is dominated by posterior and temporal synchrony; imagined speech shows an internally coherent synchronization pattern confined to frontal and supplementary motor regions.
Conclusion: Theta synchrony differs substantially across speech modes in spatial distribution and strength, reflecting both shared and distinct neural dynamics in language processing.
Abstract: Human speech production encompasses multiple modes such as perceived, overt, whispered, and imagined, each reflecting distinct neural mechanisms. Among these, theta-band synchrony has been closely associated with language processing, attentional control, and inner speech. However, previous studies have largely focused on a single mode, such as overt speech, and have rarely conducted an integrated comparison of theta synchrony across different speech modes. In this study, we analyzed differences in theta-band neural synchrony across speech modes based on connectivity metrics, focusing on region-wise variations. The results revealed that overt and whispered speech exhibited broader and stronger frontotemporal synchrony, reflecting active motor-phonological coupling during overt articulation, whereas perceived speech showed dominant posterior and temporal synchrony patterns, consistent with auditory perception and comprehension processes. In contrast, imagined speech demonstrated a more spatially confined but internally coherent synchronization pattern, primarily involving frontal and supplementary motor regions. These findings indicate that the extent and spatial distribution of theta synchrony differ substantially across modes, with overt articulation engaging widespread cortical interactions, whispered speech showing intermediate engagement, and perception relying predominantly on temporoparietal networks. Therefore, this study aims to elucidate the differences in theta-band neural synchrony across various speech modes, thereby uncovering both the shared and distinct neural dynamics underlying language perception and imagined speech.
[23] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker
Matthias De Lange,Jens-Joris Decorte,Jeroen Van Hautte
Main category: cs.CL
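TL;DR: This paper introduces WorkBench, the first unified evaluation suite spanning six work-related tasks, and builds on it to propose Unified Work Embeddings (UWE), a task-agnostic bi-encoder that significantly outperforms generalist embedding models on zero-shot ranking tasks.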
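The abstract for this entry mentions a many-to-many InfoNCE objective, in which one query can have several correct targets. The sketch below shows one common multi-positive formulation, averaging the log-softmax score over each query's positives; whether UWE uses exactly this form is an assumption, as are the function name, the plain-list similarity matrix, and the temperature value.

```python
import math


def multi_positive_info_nce(sim, positives, temperature=0.1):
    """Mean multi-positive InfoNCE loss over all queries.

    sim[i][j]    : similarity between query i and target j
    positives[i] : non-empty set of target indices that are correct for query i
    Assumed form : -mean over positives of log softmax(sim[i] / temperature)
    """
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        peak = max(scaled)
        # log of the softmax denominator, computed stably (log-sum-exp)
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        pos = positives[i]
        total += -sum(scaled[j] - log_denom for j in pos) / len(pos)
    return total / len(sim)


# Two queries, four targets; each query has two correct targets.
sim = [[0.9, 0.8, 0.1, 0.0],
       [0.0, 0.2, 0.7, 0.9]]
aligned = multi_positive_info_nce(sim, [{0, 1}, {2, 3}])
swapped = multi_positive_info_nce(sim, [{2, 3}, {0, 1}])
print(aligned < swapped)
```

Treating every related target as a positive, rather than picking a single one per query, is what lets a loss of this kind use bipartite (many-to-many) training data directly.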
Details
Motivation: NLP tasks in work-related settings involve long-tailed distributions, multi-label targets, and scarce data, and it is unclear how generalist embedding models perform in this domain, so a dedicated evaluation framework and model are needed.
Method: Builds the WorkBench evaluation suite, composes task-specific bipartite graphs from real-world data, and enriches them synthetically through grounding. Proposes the UWE model, trained with a many-to-many InfoNCE objective and a task-agnostic soft late interaction mechanism over token-level embeddings.
Result: UWE ranks effectively on unseen work-domain target spaces in the zero-shot setting, supports low-latency inference by caching target-space embeddings, and significantly outperforms generalist embedding models on macro-averaged MAP and RP@10.
Conclusion: By exploiting cross-task transfer and structured training data, UWE performs strongly on work-related multi-task NLP and offers an effective, efficient solution for semantic matching in work settings.
Abstract: Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
[24] NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation
Maoqi Liu,Quan Fang,Yuhao Wu,Can Zhao,Yang Yang,Kaiquan Cai
Main category: cs.CL
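TL;DR: Proposes the NOTAM-Evolve framework, which combines a knowledge graph with a self-evolving mechanism to achieve deep parsing of NOTAMs and substantially improves structured interpretation accuracy.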
Details
Motivation: Existing systems mostly parse NOTAMs shallowly, struggle to extract the key information needed for decisions, and are limited by scarce human-annotated data, so they cannot support the deep understanding that aviation safety requires.
Method: Proposes the NOTAM-Evolve framework, which uses a knowledge graph-enhanced retrieval module for dynamic knowledge grounding and introduces schema-based inference over static domain rules. A closed-loop self-evolving learning mechanism lets the LLM keep improving its parsing from its own outputs, reducing reliance on human annotation.
Result: On a new benchmark of 10,000 expert-annotated NOTAMs, the method improves absolute accuracy by 30.4% over the base LLM, reaching state-of-the-art performance.
Conclusion: NOTAM-Evolve achieves deep parsing of NOTAMs by combining dynamic data grounding with static rule-based inference, providing an evolvable automated solution for understanding complex aviation text.
Abstract: Accurate interpretation of Notices to Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to shallow parsing, failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as deep parsing, a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a large language model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state of the art on the task of structured NOTAM interpretation.
[25] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Taja Kuzman Pungeršek,Peter Rupnik,Ivan Porupski,Vuk Dinić,Nikola Ljubešić
Main category: cs.CL
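TL;DR: This paper compares fine-tuned BERT-like models with large language models (LLMs) on text classification in South Slavic languages. LLMs perform strongly in zero-shot settings but fall short in output stability, inference speed, and computational cost, so fine-tuned models remain better suited to large-scale automatic text annotation.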
Details
Motivation: To examine how LLMs perform on text classification in less-resourced languages (such as the South Slavic languages) and to assess their strengths and weaknesses relative to traditional fine-tuned models.
Method: Compares the zero-shot and few-shot performance of open-source and closed-source LLMs with fine-tuned BERT-like models on three tasks (sentiment classification, topic classification, genre identification) across three domains (parliamentary speeches, news articles, web texts).
Result: LLMs perform strongly in the zero-shot setting, often matching or surpassing fine-tuned BERT-like models, and perform comparably in South Slavic languages and English, but their outputs are less stable, inference is slower, and costs are higher.
Conclusion: Although LLMs do well at zero-shot text classification, their lower predictability, slower speed, and higher cost mean that fine-tuned BERT-like models remain the more practical choice for large-scale use.
Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
[26] Self-Correction Distillation for Structured Data Question Answering
Yushan Zhu,Wen Zhang,Long Jin,Mengshu Sun,Ling Zhong,Zhiqiang Liu,Juan Li,Lei Liang,Chong Long,Chao Deng,Junlan Feng
Main category: cs.CL
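TL;DR: Proposes a self-correction distillation (SCD) method that uses an error prompt mechanism and a two-stage distillation strategy to improve the performance of small-scale large language models on structured data question answering.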
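The error prompt mechanism (EPM) in this entry is described only as detecting errors and returning customized error messages during inference. One plausible reading is a retry loop in which a failed structured query is fed back to the model together with the error text. The sketch below follows that reading; the loop itself, the callables `generate_query` and `execute_query`, the use of `ValueError` as the failure signal, and the feedback wording are all hypothetical.

```python
def answer_with_error_prompts(question, generate_query, execute_query, max_rounds=3):
    """Generate a structured query, run it, and re-prompt with the error on failure.

    generate_query(question, feedback) -> query string (feedback is None at first)
    execute_query(query)               -> answer, or raises ValueError on a bad query
    Returns the first successful answer, or None if every round fails.
    """
    feedback = None
    for _ in range(max_rounds):
        query = generate_query(question, feedback)
        try:
            return execute_query(query)
        except ValueError as err:
            # Customized error message passed back to the model for the next round.
            feedback = f"The query `{query}` failed: {err}. Please correct it."
    return None


# Stand-in model and executor for demonstration only.
def toy_generate(question, feedback):
    return "get(capital, France)" if feedback else "get(capitol, France)"


def toy_execute(query):
    if "capitol" in query:
        raise ValueError("unknown relation 'capitol'")
    return "Paris"


print(answer_with_error_prompts("What is the capital of France?", toy_generate, toy_execute))
```

In the distillation setting the entry describes, traces of such correction rounds from a large model would be the kind of signal a small model could learn from, though the abstract does not spell out that pipeline.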
Details
Motivation: Small-scale LLMs are prone to errors when generating structured queries, and existing unified frameworks do not support small models well.
Method: Designs an error prompt mechanism (EPM) that detects errors and provides customized error messages, and adopts a two-stage distillation strategy to transfer the query-generation and error-correction capabilities of large-scale models to small-scale models.
Result: Experiments on 5 benchmarks covering 3 structured data types show that the method outperforms other distillation methods on an 8B model and approaches GPT-4 performance on some datasets; large-scale models equipped with EPM also surpass state-of-the-art results on most datasets.
Conclusion: SCD improves the performance of small-scale LLMs on structured QA tasks and shows good generalization and application potential.
Abstract: Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs, since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs' query-generation and error-correction capabilities to small-scale LLMs. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on a small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT-4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.
[27] HyCoRA: Hyper-Contrastive Role-Adaptive Learning for Role-Playing
Shihao Yang,Zhicong Lu,Yong Yang,Bo Lv,Yang Shen,Nayu Liu
Main category: cs.CL
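TL;DR: This paper proposes HyCoRA, a new framework that improves multi-character role-playing ability by balancing the learning of role-specific and shared traits.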
Details
Motivation: Existing multi-character role-playing methods either ignore what is distinctive about each role or overlook what roles have in common, so they struggle to model personality and commonality together.
Method: Proposes the HyCoRA framework, which comprises a Hyper-Half Low-Rank Adaptation structure and a hyper-contrastive learning mechanism. In the former, a lightweight hyper-network generates the role-specific module while a trainable shared module is retained; the latter sharpens the distinction between the traits of different roles.
Result: Shows superior performance on both English and Chinese benchmarks; GPT-4 evaluations and visual analyses confirm that it captures role characteristics.
Conclusion: HyCoRA balances the learning of role-specific and shared traits and substantially improves multi-character role-playing models.
Abstract: Multi-character role-playing aims to equip models with the capability to simulate diverse roles. Existing methods either use one shared parameterized module across all roles or assign a separate parameterized module to each role. However, the role-shared module may ignore distinct traits of each role, weakening personality learning, while the role-specific module may overlook shared traits across multiple roles, hindering commonality modeling. In this paper, we propose HyCoRA, a novel Hyper-Contrastive Role-Adaptive learning framework, which efficiently improves multi-character role-playing ability by balancing the learning of distinct and shared traits. Specifically, we propose a Hyper-Half Low-Rank Adaptation structure, where one half is a role-specific module generated by a lightweight hyper-network, and the other half is a trainable role-shared module. The role-specific module is devised to represent distinct persona signatures, while the role-shared module serves to capture common traits. Moreover, to better reflect distinct personalities across different roles, we design a hyper-contrastive learning mechanism to help the hyper-network distinguish their unique characteristics. Extensive experimental results on available English and Chinese benchmarks demonstrate the superiority of our framework. Further GPT-4 evaluations and visual analyses also verify the capability of HyCoRA to capture role characteristics.
[28] BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
Abdullah Muhammad Moosa,Nusrat Sultana,Mahdi Muhammad Moosa,Md. Miraiz Hossain
Main category: cs.CL
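TL;DR: This study presents BARD10, a new Bangla authorship attribution dataset, and systematically analyzes how stop-word removal affects classical and deep learning models on the task. TF-IDF + SVM performs best, and Bangla stop-words serve as important stylistic markers.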
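The scores reported for this entry (0.997 on BAAD16 and 0.921 on BARD10) are macro-F1 values, meaning the per-author F1 scores are averaged with equal weight regardless of how many texts each author has. A minimal pure-Python sketch of the standard metric follows; the function name and the toy labels are illustrative and unrelated to the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is computed as 2*TP / (2*TP + FP + FN); a class with no true
    or predicted instances contributes 0.0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical authors.
gold = ["author_a", "author_a", "author_b", "author_b"]
pred = ["author_a", "author_b", "author_b", "author_b"]
print(round(macro_f1(gold, pred), 4))
```

Because every author counts equally, macro-F1 penalizes a classifier that does well only on prolific authors, which suits a benchmark described as balanced across ten authors.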
Details
Motivation: To address the lack of a balanced benchmark dataset for Bangla authorship attribution and to examine the stylistic significance of stop-words across models and text types.
Method: Builds the 10-author BARD10 dataset and, together with the BAAD16 dataset, evaluates the effect of stop-word removal under consistent preprocessing with four classifiers: TF-IDF + SVM, Bangla BERT, XGBoost, and an MLP.
Result: TF-IDF + SVM reaches a macro-F1 of 0.997 on BAAD16 and 0.921 on BARD10, outperforming Bangla BERT, which lags by as much as five points. BARD10 authors are sensitive to stop-word removal while BAAD16 authors are comparatively robust, and the authorial signal carried by high-frequency features is weakened in transformer models.
Conclusion: Bangla stop-words are key indicators of writing style; carefully tuned classical machine learning models are effective in short-text settings; and BARD10 connects formal literature with contemporary web dialogue, providing a reproducible benchmark for future long-context or domain-adapted transformers.
Abstract: This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. We methodically assess four representative classifiers, SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. This study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components transmit authorial signatures that are diminished by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
[29] Estranged Predictions: Measuring Semantic Category Disruption with Masked Language Modelling
Yuxuan Liu,Haim Dubossarsky,Ruth Ahnert
Main category: cs.CL
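TL;DR: This study uses masked language modelling (MLM) to quantify how permeable the boundaries are between the conceptual categories human, animal, and machine in science fiction. It finds marked conceptual slippage in science fiction texts, especially cross-category substitution for machine referents, suggesting that science fiction achieves estrangement through controlled perturbation of semantic norms.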
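This entry quantifies conceptual slippage with three metrics (retention rate, replacement rate, and entropy) but the abstract does not define them. The sketch below assumes a plausible reading: for a masked referent of a known category, retention is the share of predicted substitutes that stay in that category, replacement is the share that move to another category, and entropy is the Shannon entropy of the substitutes' category distribution. These definitions and the function name are assumptions, not the paper's formulas.

```python
import math
from collections import Counter


def category_metrics(original, predicted):
    """Return (retention_rate, replacement_rate, entropy_bits).

    original  : category label of the masked referent, e.g. "machine"
    predicted : non-empty list of category labels assigned to the substitutes
    """
    counts = Counter(predicted)
    n = len(predicted)
    retention = counts[original] / n
    replacement = 1.0 - retention
    # Shannon entropy of the category distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return retention, replacement, entropy


# Substitutes proposed for one masked "machine" referent (toy labels).
print(category_metrics("machine", ["machine", "human", "machine", "animal"]))
```

Under this reading, a stable category shows high retention and entropy near zero, while a permeable one spreads its substitutes across several categories.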
Details
Motivation: Inspired by Darko Suvin's theory of estrangement, the paper aims to operationalize a literary theory as computable metrics and to explore how science fiction reshapes ontological categories through semantic deviation at the level of language.
Method: Runs masked-word prediction with RoBERTa on a science fiction corpus (Gollancz SF Masterworks) and a general fiction corpus (NovelTM), classifies the generated substitutes with Gemini, and measures the stability and fluidity of category boundaries with three metrics: retention rate, replacement rate, and entropy.
Result: Machine referents in science fiction show more cross-category substitution and semantic dispersion, whereas human terms stay semantically cohesive and often anchor substitution hierarchies, reflecting a genre-specific restructuring of categories within anthropocentric logic.
Conclusion: Estrangement in science fiction can be seen as a systematic perturbation of semantic norms that probabilistic language models can capture. Used critically, masked language models can serve as interpretive tools that reveal genre-specific ontological assumptions, opening a new methodological path for computational literary studies.
Abstract: This paper examines how science fiction destabilises ontological categories by measuring conceptual permeability across the terms human, animal, and machine using masked language modelling (MLM). Drawing on corpora of science fiction (Gollancz SF Masterworks) and general fiction (NovelTM), we operationalise Darko Suvin's theory of estrangement as computationally measurable deviation in token prediction, using RoBERTa to generate lexical substitutes for masked referents and classifying them via Gemini. We quantify conceptual slippage through three metrics: retention rate, replacement rate, and entropy, mapping the stability or disruption of category boundaries across genres. Our findings reveal that science fiction exhibits heightened conceptual permeability, particularly around machine referents, which show significant cross-category substitution and dispersion. Human terms, by contrast, maintain semantic coherence and often anchor substitutional hierarchies. These patterns suggest a genre-specific restructuring within anthropocentric logics. We argue that estrangement in science fiction operates as a controlled perturbation of semantic norms, detectable through probabilistic modelling, and that MLMs, when used critically, serve as interpretive instruments capable of surfacing genre-conditioned ontological assumptions. This study contributes to the methodological repertoire of computational literary studies and offers new insights into the linguistic infrastructure of science fiction.
[30] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Paula Ontalvilla,Aitor Ormazabal,Gorka Azkune
Main category: cs.CL
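TL;DR: This paper studies the ability of multimodal large language models (MLLMs) to compose skills across modalities and finds a significant composition gap in existing models. Chain-of-thought prompting and a specific fine-tuning strategy help, but further research is needed.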
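The two evaluation settings in this entry can be pictured as follows: in the direct setting the model answers in one call, while in the cascaded setting the two modality-dependent skills are chained by hand. A natural way to express the composition gap, assumed here rather than taken from the paper, is the accuracy of cascaded inference minus the accuracy of direct prompting. All function names and the toy predictions below are hypothetical.

```python
def cascaded(skill_a, skill_b, example):
    """Two-step inference: apply the first skill, then feed its output to the second."""
    return skill_b(skill_a(example))


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def composition_gap(direct_preds, cascaded_preds, gold):
    """Assumed definition: accuracy(cascaded) - accuracy(direct).

    A positive value means the model could solve more items when the
    composition was enforced manually than when prompted directly.
    """
    return accuracy(cascaded_preds, gold) - accuracy(direct_preds, gold)


# Toy predictions on four items.
gold = ["4", "7", "9", "2"]
direct_preds = ["4", "1", "9", "5"]      # 2 of 4 correct
cascaded_preds = ["4", "7", "9", "5"]    # 3 of 4 correct
print(composition_gap(direct_preds, cascaded_preds, gold))
```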
Details
Motivation: As neural networks acquire increasingly complex skills during pretraining, it remains unclear how well they can compose them, so the skill composition of MLLMs on cross-modal tasks needs to be evaluated.
Method: Designs three evaluation tasks that each decompose into two modality-dependent skills and evaluates several open MLLMs under two settings, direct prompting and two-step cascaded inference; explores chain-of-thought prompting and a specific fine-tuning recipe to narrow the composition gap.
Result: All evaluated MLLMs show a significant cross-modality skill composition gap. Chain-of-thought prompting and fine-tuning improve performance, but the gap remains.
Conclusion: Current MLLMs are limited in cross-modal skill composition, existing methods do not fully solve the problem, and further research is needed to improve this ability.
Abstract: Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLMs) and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate this gap, we explore two alternatives: i) chain-of-thought prompting that explicitly instructs MLLMs to compose skills, and ii) a specific fine-tuning recipe to promote skill composition. Although these strategies improve model performance, the models still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
[31] Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition
Raquel Montero,Natalia Moskvina,Paolo Morosi,Tamara Serrano,Elena Pagliarini,Evelina Leivada
Main category: cs.CL
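TL;DR: This paper examines how multimodal large language models handle quantification, analyzing how features of human quantification (quantifier scales, ranges of use, and cognitive biases) are encoded in the models, and comparing models with humans and across languages.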
Details
Motivation: Quantification touches the logical, pragmatic, and numerical domains, and current LLMs perform poorly on it for reasons that remain unclear, so it is necessary to study how features of quantification shared by humans are reflected in the models.
Method: Examines three features of human quantification that are shared across languages but largely unexplored in LLMs: the ordering of quantifiers into scales, ranges of use and prototypicality, and the biases inherent in the human approximate number system. Analyzes how these features appear across model architectures and languages.
Result: Finds clear differences between humans and multimodal LLMs in how quantification is represented, and these differences vary with model type and language.
Conclusion: LLMs show a systematic gap from humans in understanding quantification, and their semantic and pragmatic abilities need further improvement; a cross-linguistic perspective helps assess how robust and general their abilities are.
Abstract: Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logical, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have so far remained unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work thus paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.
[32] Sentence-Anchored Gist Compression for Long-Context LLMs
Dmitrii Tarasov,Elizaveta Goncharova,Kuznetsov Andrey
Main category: cs.CL
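TL;DR: This paper studies context compression for large language models (LLMs) using learned compression tokens, with the goal of reducing the memory and compute overhead of processing long sequences.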
Details
Motivation: LLMs incur high memory and compute costs on long sequences, which limits their use in resource-constrained settings, so effective context compression methods are needed.
Method: Fine-tunes pre-trained LLMs so that they can compress their context by factors of 2x to 8x using learned compression tokens.
Result: Performance does not degrade significantly on short-context or long-context benchmarks. On a 3-billion-parameter LLaMA model, the method matches existing compression techniques while reaching higher compression ratios.
Conclusion: The proposed method compresses LLM context effectively, substantially reducing compute and memory requirements while maintaining performance, and shows practical potential.
Abstract: This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.
[33] On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility
Kushal Tatariya,Wessel Poelman,Miryam de Lhoneux
Main category: cs.CL
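TL;DR: This paper studies the role of positional encodings in modelling different languages and tests the hypothesized trade-off between morphological complexity and word order flexibility. It finds no clear effect of the choice of positional encoding with respect to either property; results depend on the choice of tasks, languages, and metrics.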
Details
Motivation: To investigate whether mainstream language model architectures, and positional encodings in particular, hurt performance on structurally different languages because they were designed around English, and to test the trade-off hypothesis between morphological complexity and word order flexibility.
Method: Pretrains monolingual model variants (with absolute, relative, and no positional encodings) for seven typologically diverse languages, evaluates them on four downstream tasks, and analyzes how the different positional encodings affect language modelling.
Result: Finds no clear interaction between positional encodings and morphological complexity or word order flexibility, challenging earlier findings; the choice of tasks, languages, and metrics proves critical to the stability of conclusions.
Conclusion: Positional encoding design is not a universally decisive factor across language types, and studies of architectural effects need to consider tasks, language diversity, and evaluation methods together.
Abstract: Language model architectures are predominantly first created for English and subsequently applied to other languages. It is an open question whether this architectural bias leads to degraded performance for languages that are structurally different from English. We examine one specific architectural choice, positional encodings, through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis posits a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility, as measured by various proxies. Our results show that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.
[34] Relation as a Prior: A Novel Paradigm for LLM-based Document-level Relation Extraction
Qiankun Pi,Yepeng Sun,Jicang Lu,Qinlong Fan,Ningbo Huang,Shiyu Wang
Main category: cs.CL
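TL;DR: Proposes RelPrior, a new paradigm for LLM-based document-level relation extraction that uses relations as prior information to filter out unrelated entity pairs and to avoid the strict constraints of predefined labels, substantially improving performance.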
Details
Motivation: Existing LLM-based methods hit a performance bottleneck in document-level relation extraction, mainly because unrelated entity pairs introduce noise and because relation labels outside the predefined set are judged as errors.
Method: Proposes the RelPrior paradigm, which uses a binary relation as a prior to filter for related entity pairs and uses predefined relations as a matching prior for triple extraction, rather than predicting relation labels directly.
Result: Experiments on two benchmark datasets show that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
Conclusion: RelPrior mitigates noise and label misjudgment and substantially improves LLM performance on document-level relation extraction.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in document understanding. However, recent research reveals that LLMs still exhibit performance gaps in Document-level Relation Extraction (DocRE), which requires fine-grained comprehension. The commonly adopted "extract entities then predict relations" paradigm in LLM-based methods leads to these gaps for two main reasons: (1) Numerous unrelated entity pairs introduce noise and interfere with relation prediction for truly related entity pairs. (2) Although LLMs have identified semantic associations between entities, relation labels beyond the predefined set are still treated as prediction errors. To address these challenges, we propose a novel Relation as a Prior (RelPrior) paradigm for LLM-based DocRE. For challenge (1), RelPrior utilizes a binary relation as a prior to extract and determine whether two entities are correlated, thereby filtering out irrelevant entity pairs and reducing prediction noise. For challenge (2), RelPrior utilizes predefined relations as a prior to match entities for triple extraction instead of directly predicting relations. Thus, it avoids misjudgment caused by strict predefined relation labeling. Extensive experiments on two benchmarks demonstrate that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
[35] Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?
Kunal Kingkar Das,Manoj Balaji Jagadeeshan,Nallani Chakravartula Sahith,Jivnesh Sandhan,Pawan Goyal
Main category: cs.CL
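TL;DR: This paper studies how large language models (LLMs) perform on poetry-to-prose conversion for Sanskrit, a low-resource, morphologically rich language, and finds that a domain-specifically fine-tuned ByT5-Sanskrit model significantly outperforms instruction-driven LLM approaches.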
Details
Motivation: To examine whether LLMs are suited to complex NLP tasks in low-resource, morphologically complex languages such as Sanskrit, and in particular whether they can surpass specialised models on a language conversion task that requires multi-step reasoning.
Method: Compares instruction-tuned and in-context-prompted LLMs with a specialised encoder-decoder model (ByT5-Sanskrit), using prompt templates grounded in Paninian grammar and classical commentary heuristics.
Result: The domain-specifically fine-tuned ByT5-Sanskrit model significantly outperforms all LLM approaches in both automatic and human evaluation and generalises well to out-of-domain data; the prompting strategies offer a workable alternative when training data is unavailable.
Conclusion: For tasks in low-resource, morphologically complex languages, task-specific models still outperform general-purpose LLMs, which points to limits on the generality of current LLMs in specific language settings.
Abstract: Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall's Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
[36] Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?
Arzu Burcu Güven,Anna Rogers,Rob van der Goot
Main category: cs.CL
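TL;DR: Examines the syntactic properties of the BabyLM corpus and of age groups within CHILDES, finding that syntactic knowledge helps explain model performance on linguistic tasks and that training on the syntactically categorizable subset of the data improves performance more than training on the full noisy corpus.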
Details
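Motivation: To explore the syntactic characteristics of child language development corpora and their effect on model performance, and to assess how effective different curriculum learning strategies are.
Method: Analyzes the syntactic properties of the BabyLM and CHILDES corpora and compares several cognitively inspired curriculum learning approaches on linguistic tasks.
Result: Syntactic differences between age groups in CHILDES are not pronounced; some curricula help with reading tasks, but the main performance gain comes from using the syntactically categorizable subset of the data.
Conclusion: Syntactic structure information helps in understanding model performance, and a curated, syntactically clean subset benefits training more than the full noisy corpus.
Abstract: We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that the syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data, rather than the full noisy corpus.
[37] Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction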
Shivam Rawat,Lucie Flek,Akbar Karimi
Main category: cs.CL
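TL;DR: This paper presents an encoder-based system for extracting key information such as telescopes, instruments, and semantic attributes from astronomy literature; built on SciBERT and fine-tuned on astronomy text in a multi-task setup, it significantly outperforms a GPT baseline.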
Details
Motivation: Astronomy literature is growing rapidly and manual extraction of key information is inefficient, so automated methods are needed to extract entities and contextual information from research papers.
Method: Uses a SciBERT-based multi-task transformer architecture fine-tuned on astronomy text; segments are randomly sampled during training, and majority voting over test segments is used at inference.
Result: The system significantly outperforms an open-weight GPT baseline at classifying telescope references, detecting semantic attributes, and recognizing instrument mentions, while being simple and cheap to implement.
Conclusion: SciBERT-based multi-task fine-tuning performs well at knowledge extraction from astronomy literature and is an efficient, practical automated solution.
Abstract: Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
[38] Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback
Yishan Du,Conrad Borchers,Mutlu Cukurova
Main category: cs.CL
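TL;DR: This paper proposes an embedding-based benchmarking framework for detecting gender bias in the formative feedback of LLMs. Using controlled counterfactual analysis, it finds that models respond asymmetrically to implicit gender cues and that some are also affected by explicit gender information, revealing potential bias in GenAI educational feedback.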
Details
Motivation: As teachers increasingly use generative AI (GenAI), reliable methods are needed to assess the fairness of LLMs in educational settings, particularly with regard to possible gender bias in formative feedback.
Method: Builds gender counterfactuals from 600 authentic student essays in two ways: lexical swaps that change gendered terms within the essay (implicit cues) and changes to the author's gender background in the prompt (explicit cues). Six mainstream LLMs generate feedback; output divergence is quantified with cosine and Euclidean distances over sentence embeddings, significance is assessed with permutation tests, and structural patterns are visualised with dimensionality reduction.
Result: In all models, implicit gender swaps produce larger semantic shifts, more so in the male-to-female direction; only the GPT and Llama models are sensitive to explicit gender cues. Qualitative analysis shows that male cues tend to elicit more autonomy-supportive feedback, whereas female cues more often lead to controlling language.
Conclusion: Even state-of-the-art LLMs show systematic gender bias in educational feedback; fairness auditing standards, reporting norms for counterfactual evaluation, and better prompt design are needed to ensure equitable deployment of educational AI.
Abstract: As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e., GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in the feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues).
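The divergence measurement described in this abstract (cosine and Euclidean distances over sentence embeddings, followed by a permutation test) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the embeddings are plain lists of floats standing in for real sentence embeddings, the function names are hypothetical, and the one-sided mean-difference test is only one plausible choice of test statistic.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _mean(xs):
    return sum(xs) / len(xs)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b).

    group_a and group_b are lists of distances (assumed here to be, e.g.,
    original-vs-counterfactual feedback distances in each swap direction).
    Labels are shuffled n_permutations times to build the null distribution.
    """
    rng = random.Random(seed)
    observed = _mean(group_a) - _mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if _mean(pooled[:n_a]) - _mean(pooled[n_a:]) >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy vectors standing in for sentence embeddings of two feedback texts.
    original = [0.2, 0.7, 0.1]
    counterfactual = [0.3, 0.5, 0.4]
    print(cosine_distance(original, counterfactual))
    print(euclidean_distance(original, counterfactual))
```

In practice the two groups passed to `permutation_test` would hold one distance per essay, so an asymmetry between swap directions shows up as a small p-value.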
We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
[39] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Heyang Liu,Ziyang Cheng,Yuhao Wang,Hongcheng Liu,Yiqi Li,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
Main category: cs.CL
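TL;DR: This paper proposes VocalBench-zh, an evaluation suite for multi-modal large models in Mandarin speech interaction, comprising 10 subsets and more than 10K high-quality instances that cover 12 user-oriented abilities; it is used to systematically evaluate mainstream models and to inform next-generation speech interaction systems.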
Details
Motivation: The lack of comprehensive Mandarin speech-to-speech (S2S) benchmarks makes systematic evaluation and fair comparison of multi-modal LLMs difficult, so an evaluation suite suited to the Mandarin context is needed.
Method: Designs VocalBench-zh, an ability-level evaluation suite with 10 carefully built subsets and more than 10,000 high-quality instances covering 12 user-oriented ability dimensions, and evaluates 14 mainstream models on it.
Result: The experiments reveal challenges shared by current multi-modal large models in Mandarin speech interaction and show that existing approaches still fall short on several ability dimensions.
Conclusion: VocalBench-zh provides a systematic evaluation framework for Mandarin speech interaction systems and supports progress on next-generation multi-modal speech interaction; the code and datasets are to be released publicly.
Abstract: The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to the Mandarin context, consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes and highlights the need for new insights into next-generation speech interactive systems. The evaluation code and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
[40] Prompt Tuning for Natural Language to SQL with Embedding Fine-Tuning and RAG
Jisoo Jang,Tien-Cuong Bui,Yunjun Choi,Wen-Syan Li
Main category: cs.CL
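TL;DR: This paper proposes a prompt-tuning-based error correction framework for natural language to SQL that combines generative pre-trained LLMs with retrieval-augmented generation (RAG) and corrects SQL queries automatically by mimicking the medical diagnostic process, improving accuracy by 12% over existing baselines in experiments.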
Details
Motivation: With natural language interfaces in wide use, natural language queries need to be translated into SQL efficiently and accurately. Existing methods still make many semantic and syntactic errors in complex scenarios and lack a systematic correction mechanism.
Method: Inspired by the medical diagnostic process, proposes a framework that diagnoses error types, identifies their causes, generates fixing instructions, and applies the corrections; it combines prompt tuning, fine-tuning, and RAG, drawing on external knowledge bases to improve accuracy and interpretability.
Result: Experiments show a 12% accuracy improvement over existing baselines, a marked gain in NL-to-SQL performance.
Conclusion: The proposed error correction framework improves the accuracy and robustness of natural language to SQL translation and may advance natural language database interfaces in data-driven environments.
Abstract: This paper introduces Error Correction through Prompt Tuning for NL-to-SQL, leveraging the latest advancements in generative pre-training-based LLMs and RAG. Our work addresses the crucial need for efficient and accurate translation of natural language queries into SQL expressions in various settings with the growing use of natural language interfaces. We explore the evolution of natural language interfaces to databases (NLIDBs) from early rule-based systems to advanced neural network-driven approaches. Drawing inspiration from the medical diagnostic process, we propose a novel framework integrating an error correction mechanism that diagnoses error types, identifies their causes, provides fixing instructions, and applies these corrections to SQL queries. This approach is further enriched by embedding fine-tuning and RAG, which harnesses external knowledge bases for improved accuracy and transparency. Through comprehensive experiments, we demonstrate that our framework achieves a significant 12 percent accuracy improvement over existing baselines, highlighting its potential to revolutionize data access and handling in contemporary data-driven environments.
[41] ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
Marios Koniaris,Argyro Tsipi,Panayiotis Tsanakas
Main category: cs.CL
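TL;DR: This paper presents ParliaBench, a benchmark for parliamentary speech generation that includes a purpose-built dataset of UK Parliament speeches and a comprehensive evaluation framework. With fine-tuned LLMs and new embedding-based metrics (Political Spectrum Alignment and Party Alignment), it reports significant improvements in linguistic quality, semantic coherence, and political authenticity.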
Details
Motivation: Existing LLMs lack specialized training for parliamentary speech generation, and current evaluation methods do not measure political authenticity well, so a dedicated benchmark and evaluation framework are needed to improve the political authenticity and ideological consistency of generated speeches.
Method: Builds a dataset of speeches from the UK Parliament; proposes an evaluation framework that combines computational metrics with LLM-as-a-judge assessments; introduces two new embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning; fine-tunes five LLMs, generates 28k speeches, and evaluates them systematically.
Result: The fine-tuned models show statistically significant improvements on most metrics, and the new metrics discriminate strongly along political dimensions, measuring the political authenticity and ideological consistency of generated content.
Conclusion: ParliaBench offers an effective training and evaluation setup for parliamentary speech generation, demonstrates the importance of domain fine-tuning, and shows that the new embedding-based metrics are effective for assessing political content.
Abstract: Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from the UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics, and our novel metrics demonstrate strong discriminative power for political dimensions.
[42] Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments
Luca Bindini,Simone Giovannini,Simone Marinai,Valeria Nardoni,Kimiya Noor Ali
Main category: cs.CL
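TL;DR: This study examines whether Vision Large Language Models (VLLMs) can understand the hierarchical structure of complex tables in scientific literature without additional processing. It introduces a new benchmark dataset, CHiTab, evaluates several VLLMs through prompt engineering and fine-tuning, and finds that general-purpose VLLMs have some ability to understand table structure.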
Details
Motivation: Existing VLLMs are not designed specifically for parsing table structure, yet tasks such as scientific literature understanding require handling complex tables, so it is worth investigating how well they understand hierarchical table structure.
Method: Builds CHiTab, a subset of complex tables with hierarchical headings drawn from the PubTables-1M dataset; tests how different prompt formats and writing styles affect the models using several prompt engineering strategies; evaluates open-weight VLLMs off the shelf and fine-tunes some of them; and compares against human performance on the task.
Result: Generic VLLMs that were not explicitly designed for the task can infer the hierarchical structure of tables to some extent, and some models improve after fine-tuning; limitations remain, however, and a gap to human performance persists.
Conclusion: VLLMs show early potential for understanding complex table structure but need further work to become reliable; the study offers empirical evidence and direction for integrating structured data understanding into general-purpose VLLMs.
Abstract: This work investigates the ability of Vision Large Language Models (VLLMs) to understand and interpret the structure of tables in scientific articles. Specifically, we explore whether VLLMs can infer the hierarchical structure of tables without additional processing. As a basis for our experiments we use the PubTables-1M dataset, a large-scale corpus of scientific tables. From this dataset, we extract a subset of tables that we introduce as Complex Hierarchical Tables (CHiTab): a benchmark collection of complex tables containing hierarchical headings. We adopt a series of prompt engineering strategies to probe the models' comprehension capabilities, experimenting with various prompt formats and writing styles. Multiple state-of-the-art open-weights VLLMs are evaluated on the benchmark, first in their off-the-shelf versions and then after fine-tuning some models on our task. We also measure human performance on the task for a small set of tables and compare it with that of the evaluated VLLMs. The experiments support our intuition that generic VLLMs, not explicitly designed for understanding the structure of tables, can perform this task. This study provides insights into the potential and limitations of VLLMs to process complex tables and offers guidance for future work on integrating structured data understanding into general-purpose VLLMs.
[43] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates
Shuaimin Li,Liyang Fan,Yufang Lin,Zeyang Li,Xian Wei,Shiwen Ni,Hamid Alinejad-Rokny,Min Yang
Main category: cs.CL
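TL;DR: Proposes ReViewGraph, a framework that uses LLMs to simulate multi-round reviewer-author debates, builds a heterogeneous interaction graph from them, and reasons over it with graph neural networks, significantly improving the accuracy of paper review decisions.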
Details
Motivation: Existing paper review methods rely on surface features or use LLMs directly, which makes them prone to hallucinations, biased scoring, and limited reasoning, and they struggle to capture the argumentative dynamics between reviewers and authors.
Method: Simulates reviewer-author dialogue through LLM-based multi-agent collaboration, extracts diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) to build a heterogeneous interaction graph, and reasons over the debate structure with graph neural networks.
Result: Experiments on three datasets show that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%.
Conclusion: Modeling fine-grained reviewer-author debate structure improves automatic paper reviewing and supports the potential of structured reasoning in academic evaluation.
Abstract: Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
[44] Adaptive Multi-Agent Response Refinement in Conversational Systems
Soyeong Jeong,Aparna Elangovan,Emine Yilmaz,Oleg Rokhlenko
Main category: cs.CL
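TL;DR: Proposes a multi-agent framework for refining conversational responses in which agents are separately responsible for factuality, personalization, and coherence and a dynamic communication strategy improves their collaboration; it significantly outperforms baselines on several challenging datasets.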
Details
Motivation: A single LLM struggles to balance factuality, personalization, and coherence in conversational responses, and relying on users to spot errors and retry is impractical, so response quality needs to be refined automatically.
Method: Builds a multi-agent framework in which each agent reviews and refines one aspect (factuality, personalization, or coherence), with a dynamic communication strategy that adaptively selects and coordinates the relevant agents.
Result: The framework is validated on several challenging conversational datasets and significantly outperforms existing baselines, especially on tasks involving knowledge and user persona.
Conclusion: Collaborative multi-agent refinement improves the response quality of conversational systems, and the dynamic communication strategy makes the system more adaptive to different queries and more efficient in collaboration.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that our approach significantly outperforms relevant baselines, particularly in tasks involving knowledge, the user's persona, or both.
[45] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Zhiheng Xi,Chenyang Liao,Guanyu Li,Yajie Yang,Wenxiang Chen,Zhihao Zhang,Binghai Wang,Senjie Jin,Yuhao Zhou,Jian Guan,Wei Wu,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CL
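TL;DR: This paper proposes AgentPRM, a new process reward model for agent tasks that redefines how decisions are evaluated and is trained efficiently with temporal-difference estimation and Generalized Advantage Estimation, significantly improving the compute efficiency and performance of LLMs on multi-turn decision-making tasks.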
Details
Motivation: LLMs perform poorly on multi-turn decision-making tasks; conventional approaches depend on elaborate prompt engineering or fine-tuning on expert trajectories and lack an effective mechanism for evaluating the decision process.
Method: Proposes AgentPRM, a process reward model built with temporal-difference estimation and Generalized Advantage Estimation that evaluates each decision's contribution to the goal and guides decision-making.
Result: Experiments show AgentPRM is more than 8× as compute-efficient as baselines and improves robustly when test-time compute is scaled up.
Conclusion: AgentPRM improves the decision-making of LLM agents on complex tasks, scales well, and has application potential, particularly in reinforcement learning settings.
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8× more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute.
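For readers unfamiliar with the TD-based estimation and GAE mentioned in this abstract, the textbook recursion can be sketched as follows. This shows only the standard form of Generalized Advantage Estimation; how AgentPRM turns such estimates into training labels is not specified in the abstract, and the function name and default hyperparameters here are assumptions for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_T); the extra last entry
             bootstraps the state after the final step (0.0 if terminal)
    Returns a list of advantages A_0 .. A_{T-1}.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of its successor.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # A three-step episode that is rewarded only at the end.
    print(gae_advantages([0.0, 0.0, 1.0], [0.1, 0.3, 0.6, 0.0]))
```

Setting `lam=0.0` reduces the estimate to the one-step TD error, while `lam=1.0` recovers the discounted return minus the value baseline; intermediate values trade bias against variance.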
Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
[46] DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering
Xinyi Wang,Yiping Song,Zhiliang Tian,Bo Liu,Tingjin Luo,Minlie Huang
Main category: cs.CL
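TL;DR: Proposes a Dual Implicit Process Reward Model (DPRM) for multi-hop question answering that models the Chain of Thought (CoT) and Knowledge Graph (KG) reasoning processes separately and adds a consistency constraint to improve the quality of reasoning paths, significantly outperforming existing methods on multiple datasets.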
Details
Motivation: Existing implicit process reward models cannot handle the structural constraints of knowledge graphs and struggle to capture inconsistencies between the chain of thought and knowledge graph paths, which limits their use in multi-hop question answering.
Method: Designs two implicit process reward models (KG-PRM and CoT-PRM) that each derive step rewards from outcome signals; uses preference pairs to learn the structural constraints of knowledge graphs; and introduces a consistency constraint between CoT and KG reasoning steps so the two models verify each other and optimize jointly.
Result: Outperforms 13 baselines on multiple datasets, with up to a 16.6% improvement in Hit@1.
Conclusion: DPRM combines the reasoning strengths of knowledge graphs and chain of thought, and its dual implicit reward modeling and consistency constraint significantly improve reasoning accuracy in multi-hop question answering.
Abstract: In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Because an implicit PRM is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, it is more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored for plain text scenarios. When adapted to MHQA tasks, they cannot handle the graph structure constraints in KGs or capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards.
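The phrase "derive step-level rewards from outcome signals via reward parameterization" can be illustrated with the log-likelihood-ratio parameterization commonly used for implicit PRMs, in which a step's reward is the scaled sum of log-probability ratios between the trained model and a reference model over that step's tokens. Whether DPRM uses exactly this form is an assumption; the function name, the `beta` value, and the toy log-probabilities below are hypothetical.

```python
def implicit_step_rewards(policy_logprobs, reference_logprobs, beta=0.1):
    """Step rewards under a log-ratio reward parameterization (illustrative).

    policy_logprobs / reference_logprobs: one inner list per reasoning step,
    holding the per-token log-probabilities assigned by the trained model and
    by the frozen reference model to the same tokens.
    Each step's reward is beta * sum(log pi_theta - log pi_ref) over its tokens,
    so the step rewards sum to the sequence-level (outcome) reward.
    """
    rewards = []
    for step_policy, step_ref in zip(policy_logprobs, reference_logprobs):
        log_ratio = sum(p - r for p, r in zip(step_policy, step_ref))
        rewards.append(beta * log_ratio)
    return rewards


if __name__ == "__main__":
    # Two reasoning steps with two tokens and one token respectively (toy numbers).
    policy = [[-1.0, -2.0], [-0.5]]
    reference = [[-2.0, -2.0], [-1.5]]
    steps = implicit_step_rewards(policy, reference, beta=0.5)
    print(steps, sum(steps))
```

The design point is that no step-level labels are needed: training only the sequence-level reward on outcome signals fixes the per-step values as a by-product.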
Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.
[47] The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures
Bernd J. Kröger
Main category: cs.CL
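TL;DR: This paper describes the implementation of the dynamic articulatory model DYNARTmo, which generates continuous articulator movements from speech gestures and their gesture score and simulates the hierarchical control from linguistic representation to articulatory-acoustic realization.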
Details
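Motivation: To build a neurobiologically inspired computational framework for better understanding the hierarchical control of speech production.
Method: Constructs a gesture inventory, coordinates gestures in a gesture score, and translates the score into continuous articulator trajectories that control the DYNARTmo vocal tract model.
Result: The DYNARTmo implementation generates continuous articulator movements and supports simulation from linguistic representation through to articulatory-acoustic output.
Conclusion: DYNARTmo provides a useful computational tool for studying speech production and for understanding the neural control mechanisms involved in articulation.
Abstract: This paper describes the current implementation of the dynamic articulatory model DYNARTmo, which generates continuous articulator movements based on the concept of speech gestures and a corresponding gesture score. The model provides a neurobiologically inspired computational framework for simulating the hierarchical control of speech production from linguistic representation to articulatory-acoustic realization. We present the structure of the gesture inventory, the coordination of gestures in the gesture score, and their translation into continuous articulator trajectories controlling the DYNARTmo vocal tract model.
[48] TurkEmbed: Turkish Embedding Model on NLI & STS Tasks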
Özay Ezerceli,Gizem Gümüşçekiçci,Tuğba Erkoç,Berke Özenç
Main category: cs.CL
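TL;DR: This paper proposes TurkEmbed, a new Turkish embedding model that uses diverse datasets and advanced training techniques such as matryoshka representation learning to outperform existing models on natural language inference and semantic textual similarity tasks.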
Details
Motivation: Existing Turkish embedding models mostly rely on machine-translated datasets, which can lead to inaccurate semantic understanding, so a more accurate and adaptable model is needed.
Method: Combines multiple datasets with advanced training techniques, including matryoshka representation learning, to produce more robust and accurate embeddings.
Result: Significantly improves semantic similarity performance on the Turkish STS-b-TR dataset and surpasses the current state-of-the-art model, Emrecan, by 1-4% on the All-NLI-TR and STS-b-TR benchmarks.
Conclusion: TurkEmbed strengthens the Turkish NLP ecosystem and gives downstream applications a more nuanced understanding of the language.
Abstract: This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
[49] PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints
Tangrui Li,Pei Wang,Hongzheng Wang,Christian Hahm,Matteo Spatola,Justin Shi
Main category: cs.CL
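TL;DR: Proposes PCRLLM, a framework that improves the logical consistency and verifiability of LLMs through single-step inference and explicit logical structure, and that supports multi-model collaboration and the generation of large-scale step-level reasoning data.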
Details
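Motivation: LLM reasoning often lacks logical coherence, and there is little guarantee that the path from premises to conclusions follows explicit inference rules, which raises trustworthiness concerns.
Method: Designs the Proof-Carrying Reasoning with LLMs (PCRLLM) framework, which constrains reasoning to single-step inferences and states premises, rules, and conclusions explicitly in natural language so they can be verified against a target logic; introduces a benchmark schema for generating large-scale step-level reasoning data.
Result: Enables verifiable chained reasoning with chain-level validation even in black-box settings, supports multi-LLM collaboration under formal rules, and provides a scheme for generating reasoning data that combines natural language with formal rigor.
Conclusion: PCRLLM improves the logical soundness, transparency, and trustworthiness of LLM reasoning and offers a new route to verifiable, collaborative reasoning systems.
Abstract: Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.
[50] Interaction Dynamics as a Reward Signal for LLMs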
Sian Gooding,Edward Grefenstette
Main category: cs.CL
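TL;DR: This paper proposes TRACE, a new reward signal based on the geometric properties of a dialogue's embedding trajectory that uses the structural dynamics of a conversation to assess how well agents collaborate in multi-turn dialogue. Experiments show that a reward model relying only on these structural signals performs close to a baseline that analyzes the full transcript, while combining the two performs best, indicating that how an agent interacts matters as much as what it says.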
Details
Motivation: Conventional LLM alignment relies on reward signals derived from text content and overlooks the dynamics of the interaction itself. The paper explores interaction dynamics ("conversational geometry") as a complementary, independent source of signal for aligning agents in multi-turn conversations.
Method: Proposes TRACE, which extracts interaction-dynamics features from the geometric properties of a dialogue's embedding trajectory (such as direction and curvature) and trains a reward model on these structural signals alone; a hybrid model combining text content with the structural signals is built for comparison.
Result: The structure-only TRACE model reaches a pairwise accuracy of 68.20%, close to a strong LLM baseline that analyzes the full transcript (70.04%); the hybrid model performs best at 80.17%.
Conclusion: The manner of interaction is itself an important indicator of conversational success and complements content. TRACE offers a privacy-preserving alignment framework and also serves as a diagnostic tool for identifying the interaction patterns that drive collaboration.
Abstract: The alignment of Large Language Models (LLMs) for multi-turn conversations typically relies on reward signals derived from the content of the text. This approach, however, overlooks a rich, complementary source of signal: the dynamics of the interaction itself. This paper introduces TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a novel reward signal derived from the geometric properties of a dialogue's embedding trajectory, a concept we term 'conversational geometry'. Our central finding is that a reward model trained only on these structural signals achieves a pairwise accuracy (68.20%) comparable to a powerful LLM baseline that analyzes the full transcript (70.04%). Furthermore, a hybrid model combining interaction dynamics with textual analysis achieves the highest performance (80.17%), demonstrating their complementary nature. This work provides strong evidence that for interactive settings, how an agent communicates is as powerful a predictor of success as what it says, offering a new, privacy-preserving framework that not only aligns agents but also serves as a diagnostic tool for understanding the distinct interaction patterns that drive successful collaboration.
[51] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?
Shiyan Zheng,Herun Wan,Minnan Luo,Junhang Huang
Main category: cs.CL
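TL;DR: This paper studies the robustness of social bot detectors to shortcut learning on textual features and proposes mitigation strategies based on LLMs and counterfactual data augmentation.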
Details
Motivation: Existing social bot detectors perform well on benchmarks but lack robustness in real-world scenarios because of unclear ground truth and misleading cues; in particular, the effect of shortcut learning (reliance on spurious correlations) has not been studied enough.
Method: Constructs spurious associations between user labels and superficial textual features to design a series of shortcut scenarios and evaluate detectors; proposes LLM-based counterfactual data augmentation that intervenes at the level of the data distribution and of the model's ability to extract causal information.
Result: Baseline models suffer an average relative accuracy drop of 32% under shortcut scenarios; the proposed methods mitigate the problem at three levels and achieve an average relative performance improvement of 56%.
Conclusion: Social bot detectors are vulnerable to textual shortcut features, and the proposed multi-level counterfactual augmentation strategies improve model robustness.
Abstract: While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model's ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
[52] SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation
Berkcan Kapusuzoglu,Supriyo Chakraborty,Renkun Ni,Stephen Rawls,Sambit Sahu
Main category: cs.CL
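TL;DR: Proposes SPEAR-MM, a model merging framework that adapts LLMs to the financial domain while preserving their general reasoning capabilities, significantly outperforming standard continual pretraining.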
Details
Motivation: LLMs adapted to the financial domain tend to forget important general reasoning capabilities, which harms customer interactions and complex financial analysis, so a method is needed that balances domain adaptation with retention of general capabilities.
Method: Approximates each layer's impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging.
Result: Applied to LLaMA-3.1-8B on financial tasks, SPEAR-MM retains 91.2% of general capabilities (versus 69.7% for the standard approach) and 94% of domain adaptation gains, while cutting computational costs by 90%.
Conclusion: SPEAR-MM preserves key general capabilities during domain adaptation, offers interpretable trade-off control, and suits resource-constrained financial institutions.
Abstract: Large language models (LLMs) adapted to financial domains often suffer from catastrophic forgetting of general reasoning capabilities essential for customer interactions and complex financial analysis. We introduce Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), a practical framework that preserves critical capabilities while enabling domain adaptation. Our method approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging. Applied to LLaMA-3.1-8B for financial tasks, SPEAR-MM achieves 91.2% retention of general capabilities versus 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains. The approach provides interpretable trade-off control and reduces computational costs by 90%, which is crucial for resource-constrained financial institutions.
[53] Structured RAG for Answering Aggregative Questions
Omri Koshorek,Niv Granot,Aviv Alloni,Shahar Admati,Roee Hendel,Ido Weiss,Alan Arazi,Shay-Nitzan Cohen,Yonatan Belinkov
Main category: cs.CL
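TL;DR: Proposes S-RAG, a retrieval-augmented generation approach designed for aggregative queries, which improves performance by building a structured representation of the corpus and issuing formal queries against it.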
Details
Motivation: Existing RAG methods mainly target queries for which only a few documents are relevant, and struggle with complex queries that require aggregating information from a large number of documents. Method: At ingestion time, builds a structured representation of the corpus; at inference time, translates natural-language queries into formal queries over that structure. Result: On two new datasets, HOTELS and WORLD CUP, and on a public benchmark, S-RAG substantially outperforms conventional RAG systems and long-context LLMs. Conclusion: S-RAG effectively addresses the challenge of aggregative queries and promotes research on complex information-integration tasks. Abstract: Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
[54] Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research
Neelavro Saha,Rafi Shahriyar,Nafis Ashraf Roudra,Saadman Sakib,Annajiat Alim Rasel
Main category: cs.CL
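TL;DR: Introduces Bangla-SGP, a new dataset of 1,000 manually annotated Bangla Sign Language sentence-gloss pairs, augmented with about 3,000 additional samples synthesized by a rule-based retrieval-augmented generation method, to advance sentence-level Bangla Sign Language translation research.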
Details
Motivation: For lack of large-scale sentence-level translation datasets, research on Bangla Sign Language (BdSL) translation has long been confined to word- and alphabet-level recognition; a high-quality sentence-level parallel dataset is needed to move the field forward. Method: Builds a dataset of 1,000 sentence-gloss pairs annotated by a professional signer and synthesizes about 3,000 additional pairs with a retrieval-augmented generation (RAG) framework based on syntactic and morphological rules; fine-tunes transformer models such as mBart50, Google mT5, and GPT4.1-nano for sentence-to-gloss translation and evaluates them. Result: Evaluates the sentence-to-gloss translation performance of several transformer models on Bangla-SGP using BLEU scores, and compares their gloss-translation consistency against the RWTH-PHOENIX-2014T benchmark. Conclusion: Bangla-SGP provides the first sentence-level parallel corpus for low-resource Bangla Sign Language translation; combined with rule-driven data augmentation, it effectively enlarges the available data and lays a foundation for future sign language translation research. Abstract: Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word and alphabet level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs which was augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1,000 high-quality Bangla sentences that are manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models, such as mBart50, Google mT5, and GPT4.1-nano, and evaluate their sentence-to-gloss translation performance using BLEU scores. Based on these metrics, we compare the models' gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.
[55] AlphaResearch: Accelerating New Algorithm Discovery with Language Models
Zhaojian Yu,Kaiyue Feng,Yilun Zhao,Shilin He,Xiao-Ping Zhang,Arman Cohan
Main category: cs.CL
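TL;DR: Presents AlphaResearch, a research agent that autonomously discovers new algorithms on open-ended problems, using a dual verification environment (execution-based verification and simulated peer review) to improve the feasibility and novelty of its discoveries; also builds the AlphaResearchComp benchmark, on which the system beats human researchers on 2 of 8 algorithmic problems, with its result on the "packing circles" problem surpassing the best known.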
Details
Motivation: Large language models have made significant progress on complex but easy-to-verify problems, yet they still struggle to explore the unknown, such as discovering new algorithms; open-ended problems in particular lack an effective framework for autonomous research and evaluation. Method: Proposes AlphaResearch, which combines execution-based verification with a simulated real-world peer review environment into a dual research environment, and discovers new algorithms by iterating a loop of proposing ideas, verifying them in the dual environment, and optimizing the proposals; also builds the AlphaResearchComp benchmark of eight open-ended algorithmic problems, curated to be executable, measurable, and reproducible. Result: AlphaResearch beats human researchers on 2 of the 8 problems; on the "packing circles" problem it achieves the best known performance, surpassing human results and strong baselines such as AlphaEvolve. The 6 failure cases are analyzed in depth, exposing the limitations of the current approach. Conclusion: AlphaResearch demonstrates the potential of large language models for autonomous algorithm discovery, supports the value of dual verification and systematic evaluation, and offers a feasible path and useful lessons for AI-driven research automation. Abstract: Large language models have made significant progress in complex but easy-to-verify problems, yet they still struggle with discovering the unknown. In this paper, we present AlphaResearch, an autonomous research agent designed to discover new algorithms on open-ended problems. To synergize the feasibility and innovation of the discovery process, we construct a novel dual research environment by combining execution-based verification with a simulated real-world peer review environment. AlphaResearch discovers new algorithms by iteratively running the following steps: (1) propose new ideas, (2) verify the ideas in the dual research environment, and (3) optimize the research proposals for better performance. To promote a transparent evaluation process, we construct AlphaResearchComp, a new evaluation benchmark that comprises a competition of eight open-ended algorithmic problems, with each problem carefully curated and verified through executable pipelines, objective metrics, and reproducibility checks. AlphaResearch gets a 2/8 win rate in head-to-head comparison with human researchers, demonstrating the possibility of accelerating algorithm discovery with LLMs. Notably, the algorithm discovered by AlphaResearch on the "packing circles" problem achieves the best-of-known performance, surpassing the results of human researchers and strong baselines from recent work (e.g., AlphaEvolve).
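The three-step loop in the abstract above (propose, verify in the dual environment, optimize) can be sketched on a toy problem. This is only an illustration under stated assumptions, not the AlphaResearch implementation: `propose`, `execute`, and `review` are hypothetical stand-ins (an integer perturbation, a fixed objective, and a simple range check in place of LLM idea generation, program execution, and simulated peer review).

```python
def execute(candidate):
    """Execution-based check (toy): run the candidate and return an objective score."""
    return -(candidate - 7) ** 2


def review(candidate):
    """Hypothetical stand-in for simulated peer review: accept only in-range ideas."""
    return 0 <= candidate <= 10


def propose(best):
    """Toy idea generator: perturb the current best candidate."""
    return [best - 1, best + 1]


def research_loop(start, rounds):
    """Iterate propose -> verify (review + execute) -> keep the best candidate."""
    best, best_score = start, execute(start)
    for _ in range(rounds):
        for idea in propose(best):
            if not review(idea):      # rejected by the review gate
                continue
            score = execute(idea)     # accepted ideas are scored by execution
            if score > best_score:    # optimize: keep only improvements
                best, best_score = idea, score
    return best, best_score


result = research_loop(start=0, rounds=20)
```

Both gates must pass before a candidate replaces the current best, which mirrors the idea of combining feasibility (execution) with a separate judgment of the idea itself (review).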
Additionally, we conduct a comprehensive analysis of the remaining challenges of the 6/8 failure cases, providing valuable insights for future research.
[56] Investigating CoT Monitorability in Large Reasoning Models
Shu Yang,Junchao Wu,Xilin Gou,Xuansheng Wu,Derek Wong,Ninhao Liu,Di Wang
Main category: cs.CL
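TL;DR: Presents the first systematic study of the potential and challenges of monitoring model behavior through the chain-of-thought (CoT) of large reasoning models (LRMs), organized around two perspectives, verbalization (how faithfully the reasoning is expressed) and monitor reliability, and proposes MoME, a new CoT-based monitoring paradigm.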
Details
Motivation: Although the detailed reasoning traces of LRMs open a new opportunity for AI safety (CoT monitorability), two fundamental problems remain: whether a model's reasoning faithfully reflects its decision process (faithfulness), and whether the monitor itself can be deceived by long, elaborate reasoning; the effectiveness and limits of CoT monitoring therefore need systematic study. Method: Through empirical studies and correlation analyses in mathematical, scientific, and ethical domains, assesses the relationship between verbalization quality and monitor reliability across LRMs; further studies how different CoT intervention methods affect monitoring; proposes the MoME paradigm, in which an LLM acts as a monitor that analyzes a target model's CoT to judge whether it misbehaves and returns a structured judgment with supporting evidence. Result: Finds a significant correlation between verbalization quality and monitor reliability; different CoT intervention methods affect monitoring to different degrees; MoME can effectively identify misbehavior such as shortcut use or sycophancy, showing stronger monitoring potential than conventional methods. Conclusion: CoT monitoring is promising but limited by the faithfulness of the reasoning and the robustness of the monitor; future work needs more trustworthy reasoning generation and more reliable automated monitoring, and MoME offers a feasible path in that direction. Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models' long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by these two challenges, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains.
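One simple way to make "overly sensitive" and "insufficiently sensitive" concrete is to compare a monitor's flags with ground-truth misbehavior labels. The sketch below is an assumption about how such reliability could be measured, not the paper's metric: `monitor_rates`, `flags`, and `labels` are hypothetical names, and the data is made up purely for illustration.

```python
def monitor_rates(flags, labels):
    """Compare monitor flags with ground-truth labels.

    flags[i]  -- True if the monitor flagged reasoning trace i as misbehavior
    labels[i] -- True if trace i actually contains misbehavior
    """
    pairs = list(zip(flags, labels))
    tp = sum(1 for f, y in pairs if f and y)
    fp = sum(1 for f, y in pairs if f and not y)
    fn = sum(1 for f, y in pairs if not f and y)
    tn = sum(1 for f, y in pairs if not f and not y)
    # A high false-positive rate corresponds to an overly sensitive monitor;
    # a high false-negative rate to an insufficiently sensitive one.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}


# Illustrative, made-up data: six traces, three of which truly misbehave.
flags = [True, True, False, False, True, False]
labels = [True, False, False, True, True, False]
rates = monitor_rates(flags, labels)
```

Reporting the two rates separately keeps the two failure modes apart, which a single accuracy number would blur.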
Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models' misbehavior through their CoT and provide structured judgments along with supporting evidence.
[57] From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL
Amirmohammad Omidi Galdiani,Sepehr Rezaei Melal,Mohammad Norasteh,Arash Yousefi Jordehi,Seyed Abolghasem Mirroshandel
Main category: cs.CL
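TL;DR: Presents a method for building a high-quality semantic role labeling (SRL) dataset from the WSJ portion of the OntoNotes 5.0 corpus and adapting it to opinion role labeling (ORL).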
Details
Motivation: Using SRL to improve ORL in low-resource opinion mining requires a high-quality, reusable annotated dataset. Method: Building on the PropBank framework, designs a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers into coherent spans, and applies rigorous cleaning; it also handles discontinuous arguments and corrects annotation errors. Result: Produces a dataset of 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema, together with detailed algorithm descriptions and statistical analysis. Conclusion: The work gives researchers a reusable resource that helps apply SRL to ORL, especially in low-resource settings. Abstract: This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL's Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
[58] Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Davi Bastos Costa,Felippe Alves,Renato Vicente
Main category: cs.CL
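TL;DR: Builds a benchmark on the Moral Foundations Questionnaire (MFQ) to evaluate the moral responses of large language models under persona role-play, and introduces two metrics, moral susceptibility and moral robustness; model family strongly affects robustness (Claude is the most robust), model size affects susceptibility, and the two metrics are positively correlated.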
Details
Motivation: As large language models increasingly take part in social contexts, understanding how they express and shift moral judgments under role-play becomes important. Method: Uses the Moral Foundations Questionnaire (MFQ) to build a quantitative benchmark that measures the variability of LLM moral judgments across personas, defines and analyzes two metrics, moral susceptibility and moral robustness, and compares them across model families and sizes. Result: Model family is the main factor behind moral robustness (Claude stands out), whereas model size has a clear positive effect on moral susceptibility; robustness and susceptibility are positively correlated, more strongly so at the family level. Moral foundation profiles are also provided for models without a persona and for personas averaged across models. Conclusion: Persona conditioning significantly shapes the moral output of large language models; model families differ markedly in moral stability, larger models are more easily swayed by personas, and moral stability and persona sensitivity vary together rather than in opposition. Abstract: Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in large language models.
[59] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
Tianyu Fu,Yichen You,Zekai Chen,Guohao Dai,Huazhong Yang,Yu Wang
Main category: cs.CL
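TL;DR: Proposes Think-at-Hard (TaH), a dynamic latent thinking method that runs deeper iterations only at hard-to-predict tokens, improving the reasoning ability of large language models.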
Details
Motivation: Existing recurrent transformers run a fixed number of extra iterations at every token, so easy tokens are overthought and errors are introduced; hard tokens should be identified dynamically and only those refined. Method: A lightweight decider network judges token difficulty and triggers latent iterations only at tokens that are likely to be wrong; LoRA modules shift the model objective from general prediction to hard-token refinement, and a duo-causal attention mechanism provides cross-iteration information flow while preserving parallel computation. Result: Reasoning performance improves substantially on five benchmarks: compared with a baseline that iterates twice at every token, accuracy rises by 8.1-11.3% while 94% of tokens skip the second iteration; compared with single-iteration Qwen3 models finetuned on the same data, accuracy rises by 4.0-5.0%; with a small number of additional parameters, the gains grow to 8.5-12.6% and 5.3-5.4%, respectively. Conclusion: By dynamically controlling latent thinking, TaH improves LLM reasoning without increasing the parameter count, balancing efficiency and performance. Abstract: Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration.
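The selective iteration described above can be illustrated with a toy sketch. It is not the TaH implementation: the learned neural decider is replaced here by a hypothetical confidence threshold on the first-pass distribution, `refine` is a placeholder for the LoRA-adapted second pass, and the function names and the 0.6 threshold are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_hard(logits, threshold=0.6):
    """Hypothetical stand-in for the learned decider: a token counts as
    'hard' when the first-pass top probability falls below the threshold."""
    return max(softmax(logits)) < threshold


def selective_iterate(first_pass_logits, refine, threshold=0.6):
    """Apply the second iteration only at hard positions.

    Returns the per-token logits and a mask of positions that were refined.
    """
    outputs, hard_mask = [], []
    for logits in first_pass_logits:
        hard = is_hard(logits, threshold)
        hard_mask.append(hard)
        outputs.append(refine(logits) if hard else logits)
    return outputs, hard_mask


def sharpen(logits):
    """Placeholder refinement step; a real system would rerun the model."""
    return [2.0 * x for x in logits]


# Three toy tokens: confident, uncertain, confident.
tokens = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 5.0, 0.0]]
refined, hard_mask = selective_iterate(tokens, sharpen)
```

Easy tokens pass through untouched, so a confident first-pass prediction can never be revised into an error by the extra iteration, which is the overthinking failure the abstract describes.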
Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
[60] Training Language Models to Explain Their Own Computations
Belinda Z. Li,Zifan Carl Guo,Vincent Huang,Jacob Steinhardt,Jacob Andreas
Main category: cs.CL
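TL;DR: Shows that language models can be fine-tuned to produce natural-language explanations of their own internal computations, and that they explain themselves better than other models do, suggesting a scalable approach to explanation.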
Details
Motivation: Explores whether language models can exploit privileged access to their own internals to learn to describe their internal computations accurately, and how self-explanation compares with explanation by other models. Method: Using existing interpretability techniques as ground truth, fine-tunes language models to generate natural-language explanations of the information encoded by features, the causal structure of internal activations, and the influence of input tokens on outputs. Result: After training on tens of thousands of examples, explainer models generalize non-trivially to new queries, and a model explaining itself generally works better than a different model explaining it, even when the other model is more capable. Conclusion: Language models can learn to reliably explain their internal computations, and such self-explanation offers a scalable complement to existing interpretability methods. Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a *different* model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
cs.CV [Back]
[61] Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
Hari Lee
Main category: cs.CV
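TL;DR: Proposes TbVAD, a text-based explainable video anomaly detection framework that performs anomaly detection and explanation in a language-driven way under weak supervision.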
Details
Motivation: Conventional weakly supervised video anomaly detection relies on visual features and lacks interpretability; the aim is to use language for explainable, knowledge-grounded anomaly detection. Method: Uses a vision-language model to turn video content into fine-grained captions, organizes them into four semantic slots (action, object, context, environment), performs text-based knowledge reasoning, and generates slot-wise anomaly explanations. Result: The method is validated on the UCF-Crime and XD-Violence datasets, achieving interpretable and reliable anomaly detection. Conclusion: By reasoning semantically within the textual domain alone, TbVAD provides clear explanations of why an event is anomalous while maintaining strong performance, and is suited to real-world surveillance scenarios. Abstract: We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
[62] Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM
Joe Kileel,Oscar Mickelin,Amit Singer,Sheng Xu
Main category: cs.CV
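TL;DR: Proposes the method of double moments (MoDM), a new data fusion framework that reconstructs molecular structures from the second-order moments of projection images taken under two different orientation distributions, achieving accurate recovery from second-order statistics alone.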
Details
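Motivation: Conventional cryo-EM reconstruction depends on large numbers of noisy images and complex preprocessing, and lacks methods that effectively exploit multiple datasets collected under different experimental conditions. Method: Analyzes the second-order moments of projection images obtained under a uniform and a non-uniform, unknown orientation distribution, and develops a convex-relaxation-based algorithm that recovers the three-dimensional structure from these second-order statistics. Result: Proves that these moments generically determine the molecular structure uniquely (up to a global rotation and reflection) and achieves accurate reconstruction in experiments. Conclusion: Exploiting the diversity of datasets collected under different experimental conditions can substantially improve reconstruction quality in computational imaging; MoDM opens a route to data fusion that needs neither higher-order statistics nor full particle picking. Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.
[63] Modulo Video Recovery via Selective Spatiotemporal Vision Transformer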
Tianyu Geng,Feng Ji,Wee Peng Tay
Main category: cs.CV
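TL;DR: Presents SSViT, the first deep learning framework for modulo video reconstruction, which combines a selective spatiotemporal vision transformer with a token selection strategy to produce high-quality reconstructions from 8-bit folded videos and reach state-of-the-art performance in modulo video recovery.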
Details
Motivation: Conventional image sensors have limited dynamic range and saturate in high-dynamic-range scenes; modulo cameras address this by folding the irradiance, but recovery requires specialized algorithms, existing HDR methods do not apply to modulo recovery, and deep learning has made slow progress in this area. Method: Proposes the Selective Spatiotemporal Vision Transformer (SSViT), which uses a transformer to capture global dependencies and spatiotemporal relationships, and a token selection strategy to improve efficiency and focus on critical regions, tailored to modulo video reconstruction. Result: Experiments show that SSViT produces high-quality reconstructions from 8-bit folded videos and outperforms existing methods in modulo video recovery, reaching the state of the art. Conclusion: SSViT is the first deep learning framework for modulo video reconstruction; it confirms the effectiveness of transformers for this task and advances the field of modulo recovery. Abstract: Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Despite being introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we present Selective Spatiotemporal Vision Transformer (SSViT), the first deep learning framework for modulo video reconstruction. SSViT employs a token selection strategy to improve efficiency and concentrate on the most critical regions. Experiments confirm that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
[64] Laplacian Score Sharpening for Mitigating Hallucination in Diffusion Models
Barath Chandran. C,Srinivas Anumasa,Dianbo Liu
Main category: cs.CV
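TL;DR: Proposes a post-hoc correction based on the Laplacian of the score function that effectively reduces mode-interpolation hallucinations in unconditional diffusion models.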
Details
Motivation: Diffusion models are prone to hallucinations when generating samples; prior work attributes this to mode interpolation and score smoothing, but no effective method exists to suppress it during sampling. Method: Applies a post-hoc adjustment to the score function during inference that uses the Laplacian (sharpness) of the score, and approximates the high-dimensional Laplacian efficiently with a finite-difference variant of the Hutchinson trace estimator. Result: The method significantly lowers the rate of hallucinated samples on 1D and 2D toy distributions and on high-dimensional image data, and sheds light on the relationship between the Laplacian of the score and uncertainty. Conclusion: The proposed Laplacian correction effectively mitigates hallucinations caused by mode interpolation in diffusion models, improving the plausibility and coherence of generated samples. Abstract: Diffusion models, though successful, are known to suffer from hallucinations that create incoherent or unrealistic samples. Recent works have attributed this to the phenomenon of mode interpolation and score smoothening, but they lack a method to prevent their generation during sampling. In this paper, we propose a post-hoc adjustment to the score function during inference that leverages the Laplacian (or sharpness) of the score to reduce mode interpolation hallucination in unconditional diffusion models across 1D, 2D, and high-dimensional image data. We derive an efficient Laplacian approximation for higher dimensions using a finite-difference variant of the Hutchinson trace estimator. We show that this correction significantly reduces the rate of hallucinated samples across toy 1D/2D distributions and a high-dimensional image dataset. Furthermore, our analysis explores the relationship between the Laplacian and uncertainty in the score.
[65] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
Kwanyoung Kim
Main category: cs.CV
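TL;DR: Proposes ASAG, a new guidance method grounded in optimal transport theory that injects an adversarial cost into self-attention layers to improve both conditional and unconditional generation quality in diffusion models.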
Details
Motivation: Existing guidance methods such as classifier-free guidance rely on manually designed heuristic perturbation functions, lack a principled foundation, and can harm generation quality. Method: Reinterprets attention scores in diffusion models as an optimal transport problem and uses the Sinkhorn algorithm to introduce an adversarial transport cost that weakens pixel-wise similarity between queries and keys, thereby disrupting misleading attention alignments. Result: ASAG delivers consistent improvements in text-to-image generation and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet; it is plug-and-play and requires no retraining. Conclusion: ASAG offers a new theoretical perspective on, and a route to improving, attention in diffusion models, and is a lightweight, general, and effective guidance framework. Abstract: Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
[66] LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration
Tung Vu,Hai Nguyen,Cong Tran
Main category: cs.CV
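TL;DR: Proposes the LiveNeRF framework for high-quality real-time face replacement in settings such as live streaming and video conferencing, and stresses responsible deployment to counter the risk of misuse.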
Details
Motivation: Existing methods are limited in real-time performance and visual quality, and fall short of the needs of practical applications. Method: Proposes the LiveNeRF framework, which combines neural radiance fields with real-time optimization strategies to achieve face replacement at 33 FPS. Result: Visual quality improves markedly while 33 FPS is maintained, supporting real-time interactive applications such as live streaming and video conferencing. Conclusion: LiveNeRF achieves efficient, high-quality face replacement with broad application prospects, but must be paired with user consent and detection technology to guard against misuse. Abstract: Face replacement technology enables significant advancements in entertainment, education, and communication applications, including dubbing, virtual avatars, and cross-cultural content adaptation. Our LiveNeRF framework addresses critical limitations of existing methods by achieving real-time performance (33 FPS) with superior visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media. The technology particularly benefits content creators, educators, and individuals with speech impairments through accessible avatar communication. While acknowledging potential misuse in unauthorized deepfake creation, we advocate for responsible deployment with user consent verification and integration with detection systems to ensure positive societal impact while minimizing risks.
[67] TrackStudio: An Integrated Toolkit for Markerless Tracking
Hristo Dimitrov,Giulia Dominijanni,Viktorija Pavalkyte,Tamar R. Makin
Main category: cs.CV
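TL;DR: TrackStudio is a modular GUI tool that requires no programming skills and integrates 2D/3D motion tracking, calibration, preprocessing, feature extraction, and visualisation, enabling non-experts to perform markerless motion capture in diverse settings.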
Details
Motivation: Existing markerless motion tracking tools are powerful but have a high barrier to entry and demand substantial technical expertise; accessible, integrated solutions for non-expert users are lacking. Method: Combines established open-source tools into a plug-and-play, modular GUI pipeline that supports automatic 2D and 3D tracking, calibration, data preprocessing, feature extraction, and visualisation, accompanied by a detailed user guide and documentation of common pitfalls. Result: Tested with 76 participants across three environments using low-cost webcams or high-resolution cameras, the toolkit achieves average inter-frame correlations above 0.98 and average triangulation errors below 13.6 mm for hand tracking, showing stable and consistent performance; it also extends to face and other body regions. Conclusion: TrackStudio offers a practical and convenient route into markerless motion tracking for researchers or lay users who need reliable performance without specialist expertise. Abstract: Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions for body position, lighting, and space and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6 mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions.
TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.
[68] Predicting Coronary Artery Calcium Severity based on Non-Contrast Cardiac CT images using Deep Learning
Lachlan Nguyen,Aidan Cousins,Arcot Sowmya,Hugh Dixson,Sonit Singh
Main category: cs.CV
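TL;DR: Develops a deep learning convolutional neural network that automatically classifies the coronary artery calcium score in non-contrast cardiac CT images into six clinical categories, achieving high accuracy (96.5%) and strong agreement (Cohen's kappa 0.962), and showing good potential for clinical use.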
Details
Motivation: Coronary artery calcium scoring currently depends on time-consuming semiautomatic analysis that requires radiologists, which limits its wider use; an efficient automated alternative is needed. Method: Trains and validates a deep learning convolutional neural network (CNN) on non-contrast cardiac CT images from 68 patient scans together with their semiautomatic CAC scores; the dataset is split into training, validation, and test sets, and the semiautomatic score serves as the reference label for classification. Result: On the six-class task the model reaches an overall accuracy of 96.5% and a Cohen's kappa of 0.962, indicating high agreement; of the 32 misclassified cases, 26 overestimate the calcium score, and the model generalises well. Conclusion: The CNN classifies the coronary artery calcium score into six categories accurately and consistently, with performance close to current semiautomatic practice, and is a viable basis for automated clinical risk stratification. Abstract: Cardiovascular disease causes high rates of mortality worldwide. Coronary artery calcium (CAC) scoring is a powerful tool to stratify the risk of atherosclerotic cardiovascular disease. Current scoring practices require time-intensive semiautomatic analysis of cardiac computed tomography by radiologists and trained radiographers. The purpose of this study is to develop a deep learning convolutional neural network (CNN) model to classify the calcium score in cardiac, non-contrast computed tomography images into one of six clinical categories. A total of 68 patient scans were retrospectively obtained together with their respective reported semiautomatic calcium score using an ECG-gated GE Discovery 570 Cardiac SPECT/CT camera. The dataset was divided into training, validation and test sets. Using the semiautomatic CAC score as the reference label, the model demonstrated high performance on a six-class CAC scoring categorisation task. Of the scans analysed, the model misclassified 32 cases, tending towards overestimating the CAC in 26 out of 32 misclassifications. Overall, the model showed high agreement (Cohen's kappa of 0.962), an overall accuracy of 96.5% and high generalisability. The results suggest that the model outputs were accurate and consistent with current semiautomatic practice, with good generalisability to test data. The model demonstrates the viability of a CNN model to stratify the calcium score into an expanded set of six clinical categories.
[69] FlowFeat: Pixel-Dense Embedding of Motion Profiles
Nikita Araslanov,Anna Sonnweber,Daniel Cremers
Main category: cs.CV
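TL;DR: Proposes FlowFeat, a high-resolution, multi-task image feature representation trained in a self-supervised way through a novel distillation technique that combines optical flow networks and video data, substantially improving existing models on several dense prediction tasks.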
Details
Motivation: State-of-the-art networks such as transformers produce low-resolution feature maps that are ill-suited to dense prediction tasks, so a higher-resolution, general-purpose feature representation is needed. Method: Proposes FlowFeat, which uses a new distillation technique to embed a distribution of plausible apparent motions (motion profiles), and combines optical flow networks with diverse video data into an effective self-supervised training framework that statistically approximates apparent motion. Result: FlowFeat significantly strengthens the representations of five state-of-the-art encoders and of alternative upsampling strategies on three dense tasks (video object segmentation, monocular depth estimation, and semantic segmentation), offering high spatial detail, rich geometric and semantic cues, and high temporal consistency. Conclusion: FlowFeat provides a reliable and versatile high-resolution image representation that advances dense image representation; it is cheap to train and robust to errors in optical flow estimation. Abstract: Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.
[70] Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling
Jiale Liu,Haoming Zhou,Yishu Zhu,Bingzhi Chen,Yuncheng Jiang
Main category: cs.CV
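TL;DR: Proposes a unified approach that combines significance-aware and granularity-aware modeling with region-level uncertainty modeling to improve the robustness and interpretability of fine-grained image-text alignment.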
Details
Motivation: 现有方法在复杂场景中泛化能力差,缺乏对模态内重要性的评估机制和细粒度不确定性建模,难以捕捉区域与词之间的一对多或多对一关系。 Method: 引入模态特定偏差来识别显著特征,避免依赖脆弱的跨模态注意力,并将区域特征表示为高斯混合分布以捕捉细粒度不确定性。 Result: 在Flickr30K和MS-COCO数据集上取得SOTA性能,显著提升不同骨干网络下的对齐效果。 Conclusion: 所提方法有效解决了细粒度图文对齐中的显著性建模和不确定性建模问题,增强了模型的鲁棒性与可解释性。 Abstract: Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.[71] UltraGS: Gaussian Splatting for Ultrasound Novel View Synthesis
Yuezhe Yang,Wenjie Cai,Dexin Yang,Yufang Dong,Xingbo Dong,Zhe Jin
Main category: cs.CV
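TL;DR: Proposes UltraGS, a Gaussian Splatting framework optimized for novel view synthesis in ultrasound imaging. It combines depth-aware Gaussian splatting with SH-DARS, a rendering function built on spherical harmonics and ultrasound wave physics, and releases a clinical ultrasound dataset. Experiments report state-of-the-art results on several metrics together with real-time synthesis.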
Details
Motivation: Ultrasound imaging has a limited field of view, which makes novel view synthesis difficult, and existing methods struggle to model tissue structure and echo physics accurately.
Method: Proposes a depth-aware Gaussian splatting strategy in which each Gaussian has a learnable field of view; designs SH-DARS, a rendering function that combines low-order spherical harmonics with ultrasound wave physics (attenuation, reflection, and scattering); and builds and releases a clinical ultrasound examination dataset for evaluation.
Result: Experiments on three datasets show that UltraGS outperforms existing methods in PSNR (up to 29.55), SSIM (up to 0.89), and MSE (as low as 0.002), with real-time rendering at 64.69 fps.
Conclusion: By combining geometric representation learning with ultrasound physics modeling, UltraGS significantly improves the quality and efficiency of novel view synthesis for ultrasound images and shows potential for clinical use.
Abstract: Ultrasound imaging is a cornerstone of non-invasive clinical diagnostics, yet its limited field of view complicates novel view synthesis. We propose UltraGS, a Gaussian Splatting framework optimized for ultrasound imaging. First, we introduce a depth-aware Gaussian splatting strategy, where each Gaussian is assigned a learnable field of view, enabling accurate depth prediction and precise structural representation. Second, we design SH-DARS, a lightweight rendering function combining low-order spherical harmonics with ultrasound-specific wave physics, including depth attenuation, reflection, and scattering, to model tissue intensity accurately. Third, we contribute the Clinical Ultrasound Examination Dataset, a benchmark capturing diverse anatomical scans under real-world clinical protocols. Extensive experiments on three datasets demonstrate UltraGS's superiority, achieving state-of-the-art results in PSNR (up to 29.55), SSIM (up to 0.89), and MSE (as low as 0.002) while enabling real-time synthesis at 64.69 fps. The code and dataset are open-sourced at: https://github.com/Bean-Young/UltraGS.
[72] VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics
Daniel Cher,Brian Wei,Srikumar Sastry,Nathan Jacobs
Main category: cs.CV
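TL;DR: VectorSynth is a diffusion-based framework that synthesizes pixel-accurate satellite images from polygonal geographic annotations with semantic attributes, and supports interactive workflows that mix language prompts with geometry-aware conditioning.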
Details
Motivation: 现有文本或布局条件生成模型难以实现细粒度、空间对齐的卫星图像合成,缺乏对地理语义与空间结构的精确建模。 Method: 提出 VectorSynth 框架,通过视觉-语言对齐模块将多边形语义转换为像素级嵌入,并指导条件图像生成过程;使用扩散模型实现高保真、结构合理的图像合成。 Result: 在语义保真度和结构真实性方面显著优于先前方法,实现了精细的空间对齐能力,并支持交互式地理场景编辑与‘假设’模拟。 Conclusion: VectorSynth 实现了从语义向量地图到高质量卫星图像的精确合成,推动了地图引导内容生成与地理空间仿真应用的发展。 Abstract: We introduce VectorSynth, a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes. Unlike prior text- or layout-conditioned models, VectorSynth learns dense cross-modal correspondences that align imagery and semantic vector geometry, enabling fine-grained, spatially grounded edits. A vision language alignment module produces pixel-level embeddings from polygon semantics; these embeddings guide a conditional image generation framework to respect both spatial extents and semantic cues. VectorSynth supports interactive workflows that mix language prompts with geometry-aware conditioning, allowing rapid what-if simulations, spatial edits, and map-informed content generation. For training and evaluation, we assemble a collection of satellite scenes paired with pixel-registered polygon annotations spanning diverse urban scenes with both built and natural features. We observe strong improvements over prior methods in semantic fidelity and structural realism, and show that our trained vision language model demonstrates fine-grained spatial grounding. The code and data are available at https://github.com/mvrl/VectorSynth.[73] Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs
Yuezhe Yang,Yiyue Guo,Wenjie Cai,Qingqing Ruan,Siying Wang,Xingbo Dong,Zhe Jin,Yong Dai
Main category: cs.CV
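TL;DR: Proposes Auto-US, an intelligent diagnosis agent that integrates ultrasound video with clinical diagnostic text. The authors build the CUV dataset of 495 videos, develop CTU-Net with 86.73% classification accuracy, and use large language models to generate clinically meaningful diagnostic suggestions, which scored above 3 out of 5 and were validated by professional clinicians.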
Details
Motivation: Existing research on AI-assisted ultrasound diagnosis is limited in dataset diversity, diagnostic performance, and clinical applicability, so efficiency and accuracy in practical use need to improve.
Method: Builds the multi-source, multi-category CUV ultrasound video dataset, designs and trains CTU-Net for video classification, and integrates large language models to generate clinical diagnostic suggestions.
Result: CTU-Net reaches 86.73% accuracy in ultrasound video classification; the diagnostic suggestions generated by Auto-US all received clinical scores above 3 out of 5 and were validated by professional clinicians.
Conclusion: Auto-US shows good effectiveness and clinical potential in real-world ultrasound applications and moves AI-assisted diagnosis toward greater clinical utility.
Abstract: AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose Auto-US, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the CUV Dataset of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed CTU-Net, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: https://github.com/Bean-Young/Auto-US.
[74] Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation
Shengqian Zhu,Chengrong Yu,Qiang Wang,Ying Song,Guangjun Li,Jiafei Wu,Xiaogang Xu,Zhang Yi,Junjie Hu
Main category: cs.CV
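TL;DR: Proposes a new method for class incremental medical image segmentation (CIMIS) that uses Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) to preserve old-class knowledge while learning new classes.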
Details
Motivation: 现有方法在处理空间区域和特征通道时采用统一策略,或仅关注全局与局部原型对齐而忽略旧类在新数据中的局部表示,导致旧知识退化。 Method: 提出PGCD利用原型-特征相似性校准不同空间区域的类别特定蒸馏强度;DAPD则同时对齐当前模型中提取的旧类局部原型与全局及局部原型,增强旧类别的分割性能。 Result: 在两个多器官分割基准上的实验表明,该方法优于现有最先进方法,具有更强的鲁棒性和泛化能力。 Conclusion: PGCD和DAPD有效缓解了CIMIS中的知识遗忘问题,在保持旧类知识的同时提升了模型性能。 Abstract: Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class labels. However, existing methods 1) either adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, 2) or focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading information from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms state-of-the-art methods, highlighting its robustness and generalization capabilities.[75] Filtered-ViT: A Robust Defense Against Multiple Adversarial Patch Attacks
Aja Khanal,Ahmed Faid,Apurva Narayan
Main category: cs.CV
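TL;DR: Filtered-ViT is a new vision transformer architecture that integrates SMART-VMF, a spatially adaptive, multi-scale, robustness-aware filtering mechanism. It suppresses corrupted regions while preserving semantic detail when several localized disruptions are present, and is robust to both adversarial and naturally occurring patch-like disruptions.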
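As a rough illustration of the filtering idea named in this entry, the sketch below implements the classical vector median filter: within each window, the output is the pixel vector with the smallest summed distance to all other vectors in that window. This is only the textbook building block, assuming that SMART-VMF relates to it as its name suggests; the spatially adaptive, multi-scale, robustness-aware parts described in the abstract are not reproduced, and the function names and the `radius` parameter are hypothetical.

```python
import numpy as np

def vector_median(vectors: np.ndarray) -> np.ndarray:
    """Return the row of `vectors` ([n, d]) whose summed Euclidean
    distance to all other rows is smallest."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # pairwise distances, [n, n]
    return vectors[np.argmin(dists.sum(axis=1))]

def vector_median_filter(image: np.ndarray, radius: int = 1) -> np.ndarray:
    """Classical vector median filter over an [H, W, C] image.

    Illustrative sketch only, not the SMART-VMF mechanism from the paper.
    """
    h, w, c = image.shape
    # Replicate border pixels so every output pixel has a full window.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.empty_like(image)
    k = 2 * radius + 1
    for y in range(h):
        for x in range(w):
            window = padded[y:y + k, x:x + k].reshape(-1, c)
            out[y, x] = vector_median(window)
    return out
```

Because the output at each pixel is always one of the input vectors, an isolated corrupted pixel is replaced by a neighbouring clean value rather than blended into it.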
Details
Motivation: 现有的防御方法大多假设单一对抗补丁,在面对多个局部干扰时表现不佳,而实际应用中(如医疗影像)常出现多补丁攻击或自然伪影,因此需要更鲁棒的视觉系统。 Method: 提出Filtered-ViT,结合SMART Vector Median Filtering(SMART-VMF),通过空间自适应、多尺度滤波机制选择性抑制 corrupted 区域,保留关键语义信息。 Result: 在ImageNet + LaVAN多补丁攻击下,4个同时存在的1%补丁中实现46.3%的鲁棒准确率,清洁准确率达79.8%,优于现有防御方法;在真实放射影像案例中有效缓解遮挡和扫描噪声,不损害诊断内容。 Conclusion: Filtered-ViT是首个在对抗性和自然发生的补丁状干扰下均展现统一鲁棒性的视觉Transformer,为高风险场景下的可靠视觉系统提供了可行路径。 Abstract: Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, yet they remain vulnerable to small adversarial patches that can trigger misclassifications. Most existing defenses assume a single patch and fail when multiple localized disruptions occur, the type of scenario adversaries and real-world artifacts often exploit. We propose Filtered-ViT, a new vision transformer architecture that integrates SMART Vector Median Filtering (SMART-VMF), a spatially adaptive, multi-scale, robustness-aware mechanism that enables selective suppression of corrupted regions while preserving semantic detail. On ImageNet with LaVAN multi-patch attacks, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy under four simultaneous 1\% patches, outperforming existing defenses. Beyond synthetic benchmarks, a real-world case study on radiographic medical imagery shows that Filtered-ViT mitigates natural artifacts such as occlusions and scanner noise without degrading diagnostic content. This establishes Filtered-ViT as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like disruptions, charting a path toward reliable vision systems in truly high-stakes environments.[76] Beyond Randomness: Understand the Order of the Noise in Diffusion
Song Yan,Min Li,Bi Xinliang,Jian Yang,Yusen Zhang,Guanye Xiong,Yunwei Lan,Tao Zhang,Wei Zhai,Zheng-Jun Zha
Main category: cs.CV
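TL;DR: Proposes a training-free, universal two-step "Semantic Erasure-Injection" method that analyzes and modulates the initial noise of text-driven content generation diffusion models to control the semantics of the generated content.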
Details
Motivation: The initial noise in diffusion generation is conventionally regarded as random and useful only for increasing diversity. This paper finds analyzable semantic patterns in the noise and explores how to use the noise for semantic control to improve generation consistency.
Method: First analyzes the impact of random noise on generation, finding that it carries rich semantic information and that unwanted semantics can be erased in a simple, information-theoretic way; then uses the equivalence between the diffusion generation process and semantic injection to inject target semantics into the cleaned noise, yielding a two-step "Semantic Erasure-Injection" procedure.
Result: The method is consistently effective across text-driven content generation models based on both DiT and UNet architectures; it erases and injects semantics effectively and improves the controllability and consistency of generated content.
Conclusion: The initial noise is not purely random but a semantic carrier that can be analyzed and modulated; the proposed training-free method offers a new perspective and a universal tool for optimizing diffusion model generation.
Abstract: In text-driven content generation (T2C) diffusion models, the semantics of generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of noise lie strong, analyzable patterns. Specifically, this paper first conducts a comprehensive analysis of the impact of random noise on the model's generation. We find that noise not only contains rich semantic information but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory, and that the equivalence between the generation process of a diffusion model and semantic injection can be used to inject semantics into the cleaned noise. Then, we mathematically decipher these observations and propose a simple but efficient training-free and universal two-step "Semantic Erasure-Injection" process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures and presents a novel perspective for optimizing the generation of diffusion models, providing a universal tool for consistent generation.
[77] Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval
Likang Peng,Chao Su,Wenyuan Wu,Yuan Sun,Dezhong Peng,Xi Peng,Xu Wang
Main category: cs.CV
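TL;DR: Proposes SCBCH, a new cross-modal hashing framework that uses semantic-consistent classification and bidirectional soft contrastive learning to handle label noise and semantic overlap in multi-label data, outperforming existing methods on several benchmarks.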
Details
Motivation: 现有跨模态哈希方法依赖完全标注数据,且忽视多标签数据中的语义重叠和标签噪声问题,导致实际应用中性能下降。 Method: 提出SCBCH框架,包含两个模块:CSCC利用跨模态语义一致性评估样本可靠性以减轻噪声标签影响;BSCH基于多标签语义重叠动态生成软对比样本对,实现跨模态的自适应对比学习。 Result: 在四个常用跨模态检索基准上的实验表明,该方法在含噪声的多标签条件下显著优于现有最先进方法,具有良好的鲁棒性和泛化能力。 Conclusion: SCBCH通过建模语义一致性和软对比学习,有效提升了跨模态哈希在真实噪声环境下的检索性能,为处理不完美标注数据提供了新思路。 Abstract: Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.[78] Divide-and-Conquer Decoupled Network for Cross-Domain Few-Shot Segmentation
Runmin Cong,Anpeng Wang,Bin Wan,Cong Zhang,Xiaofei Zhou,Wei Zhang
Main category: cs.CV
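TL;DR: Proposes the Divide-and-Conquer Decoupled Network (DCDNet) for cross-domain few-shot segmentation (CD-FSS), which decouples category-relevant from domain-relevant information to improve generalization and rapid adaptation to unseen domains.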
Details
Motivation: In existing methods, encoder features often entangle domain-relevant and category-relevant information, which limits generalization and rapid adaptation in cross-domain settings.
Method: Proposes DCDNet with three core modules: 1) the ACFD module decouples backbone features into category-relevant private and domain-relevant shared representations via contrastive and adversarial learning; 2) the MGDF module dynamically fuses base, shared, and private features under spatial guidance; 3) in the fine-tuning stage, the CAM module lets shared features guide private features through modulation.
Result: Extensive experiments on four challenging datasets show that DCDNet outperforms existing methods in both cross-domain generalization and few-shot adaptation.
Conclusion: DCDNet effectively addresses feature entanglement and sets a new state of the art for cross-domain few-shot segmentation.
Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to tackle the dual challenge of recognizing novel classes and adapting to unseen domains with limited annotations. However, encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains. To address this issue, we propose a Divide-and-Conquer Decoupled Network (DCDNet). In the training stage, to tackle feature entanglement that impedes cross-domain generalization and rapid adaptation, we propose the Adversarial-Contrastive Feature Decomposition (ACFD) module. It decouples backbone features into category-relevant private and domain-relevant shared representations via contrastive learning and adversarial learning. Then, to mitigate the potential degradation caused by the disentanglement, the Matrix-Guided Dynamic Fusion (MGDF) module adaptively integrates base, shared, and private features under spatial guidance, maintaining structural coherence. In addition, in the fine-tuning stage, to enhance model generalization, the Cross-Adaptive Modulation (CAM) module is placed before the MGDF, where shared features guide private features via modulation, ensuring effective integration of domain-relevant information. Extensive experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods, setting a new state-of-the-art for cross-domain generalization and few-shot adaptation.
[79] Learning Sparse Label Couplings for Multilabel Chest X-Ray Diagnosis
Utkarsh Prakash Srivastava,Kaushik Gupta,Kaushik Nath
Main category: cs.CV
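TL;DR: Proposes a strong baseline for multilabel chest X-ray classification built on SE-ResNeXt101, combined with a Label-Graph Refinement module and several training optimizations, that improves performance without extra annotations.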
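The abstract for this entry describes the Label-Graph Refinement module as taking per-label probabilities and refining logits through a single message-passing step with a sparse, L1-regularized coupling matrix. A minimal NumPy sketch of one way such a step could look is given below. The additive update, the zeroed diagonal, and the names `refine_logits`, `l1_penalty`, and `weight` are assumptions made for illustration; the entry does not give the exact formulation.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def refine_logits(logits: np.ndarray, coupling: np.ndarray) -> np.ndarray:
    """One message-passing step over labels (hypothetical form).

    logits:   [batch, n_labels] raw classifier outputs.
    coupling: [n_labels, n_labels]; coupling[j, i] is the influence of
              label j's probability on label i's logit.
    """
    probs = sigmoid(logits)
    # Assumption: a label does not send a message to itself.
    off_diag = coupling - np.diag(np.diag(coupling))
    return logits + probs @ off_diag

def l1_penalty(coupling: np.ndarray, weight: float = 1e-3) -> float:
    """L1 regularizer that pushes most couplings to zero (sparsity)."""
    return weight * float(np.abs(coupling).sum())
```

With an all-zero coupling matrix the refinement is the identity, so under this formulation the module can only change predictions through couplings it has actually learned.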
Details
Motivation: 针对胸部X光多标签分类中的类别不平衡、标签共现和不对称误判代价问题,需要一个高效且实用的解决方案。 Method: 采用SE-ResNeXt101作为骨干网络,使用Asymmetric Loss、MIS分层、混合精度训练、余弦退火学习率、梯度裁剪和权重指数移动平均;提出轻量级标签图精炼模块,通过可学习的稀疏标签耦合矩阵优化预测结果。 Result: 在数据集上基线模型达到92.64%的macro AUC,加入标签图精炼模块后性能稳定提升,计算开销极小。 Conclusion: 该方法可复现、硬件友好、无需额外标注,为多标签胸部X光分类提供了实用且高效的改进路径。 Abstract: We study multilabel classification of chest X-rays and present a simple, strong pipeline built on SE-ResNeXt101 $(32 \times 4d)$. The backbone is finetuned for 14 thoracic findings with a sigmoid head, trained using Multilabel Iterative Stratification (MIS) for robust cross-validation splits that preserve label co-occurrence. To address extreme class imbalance and asymmetric error costs, we optimize with Asymmetric Loss, employ mixed-precision (AMP), cosine learning-rate decay with warm-up, gradient clipping, and an exponential moving average (EMA) of weights. We propose a lightweight Label-Graph Refinement module placed after the classifier: given per-label probabilities, it learns a sparse, trainable inter-label coupling matrix that refines logits via a single message-passing step while adding only an L1-regularized parameter head. At inference, we apply horizontal flip test-time augmentation (TTA) and average predictions across MIS folds (a compact deep ensemble). Evaluation uses macro AUC averaging classwise ROC-AUC and skipping single-class labels in a fold to reflect balanced performance across conditions. On our dataset, a strong SE-ResNeXt101 baseline attains competitive macro AUC (e.g., 92.64% in our runs). Adding the Label-Graph Refinement consistently improves validation macro AUC across folds with negligible compute. The resulting method is reproducible, hardware-friendly, and requires no extra annotations, offering a practical route to stronger multilabel CXR classifiers.[80] PC-Diffusion: Aligning Diffusion Models with Human Preferences via Preference Classifier
Shaomeng Wang,He Wang,Xiaolu Wei,Longquan Dai,Jinhui Tang
Main category: cs.CV
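TL;DR: Proposes PC-Diffusion, a framework that aligns diffusion models with human preferences through a lightweight preference classifier, avoiding full-model fine-tuning and dependence on a reference model, which lowers computational cost and improves stability.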
Details
Motivation: When applied to diffusion models, existing DPO-like methods are computationally expensive and sensitive to the quality of the reference model, which limits the efficiency and stability of preference alignment.
Method: Designs a lightweight, trainable Preference Classifier that decouples preference learning from the generative model. The classifier directly models the relative preference between samples and applies a preference-guided correction during generation, without fine-tuning the whole diffusion model or relying on a reference model.
Result: Theoretical analysis shows that PC-Diffusion propagates preference-guided distributions consistently across timesteps, that its training objective is equivalent to DPO without needing a reference model, and that it progressively steers generation toward preference-aligned regions. Experiments show preference consistency comparable to DPO at significantly lower training cost.
Conclusion: By decoupling preference learning from the generative model, PC-Diffusion offers an efficient, stable, low-cost scheme for preference alignment in diffusion models that matches DPO-like methods in preference consistency.
Abstract: Diffusion models have achieved remarkable success in conditional image generation, yet their outputs often remain misaligned with human preferences. To address this, recent work has applied Direct Preference Optimization (DPO) to diffusion models, yielding significant improvements. However, DPO-like methods exhibit two key limitations: 1) high computational cost, due to fine-tuning the entire model; and 2) sensitivity to reference model quality, which tends to introduce instability and bias. To overcome these limitations, we propose a novel framework for human preference alignment in diffusion models (PC-Diffusion), using a lightweight, trainable Preference Classifier that directly models the relative preference between samples. By restricting preference learning to this classifier, PC-Diffusion decouples preference alignment from the generative model, eliminating the need for entire model fine-tuning and reference model reliance. We further provide theoretical guarantees for PC-Diffusion: 1) PC-Diffusion ensures that the preference-guided distributions are consistently propagated across timesteps. 2) The training objective of the preference classifier is equivalent to DPO, but does not require a reference model. 3) The proposed preference-guided correction can progressively steer generation toward preference-aligned regions. Empirical results show that PC-Diffusion achieves comparable preference consistency to DPO while significantly reducing training costs and enabling efficient and stable preference-guided generation.
[81] DI3CL: Contrastive Learning With Dynamic Instances and Contour Consistency for SAR Land-Cover Classification Foundation Model
Zhongle Ren,Hui Ding,Kai Wang,Biao Hou,Xingyu Luo,Weibin Li,Licheng Jiao
Main category: cs.CV
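TL;DR: Proposes a general-purpose foundation model for SAR land-cover classification. The Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) framework improves generalization and structural discrimination, and the model is pre-trained on the large-scale SARSense dataset, performing strongly on a range of downstream tasks.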
Details
Motivation: 现有SAR地物分类方法多依赖监督学习和大量标注数据,限制了模型的可扩展性、泛化能力和适用场景适应性,因此需要一个通用的基础模型来推动下游应用的发展。 Method: 提出DI3CL预训练框架,包含动态实例(DI)模块增强全局上下文感知,以及轮廓一致性(CC)模块利用浅层特征引导模型关注地物几何轮廓;并在包含460,532张SAR图像的大规模数据集SARSense上进行自监督预训练。 Result: 在SAR地物分类、水体检测和道路提取等多个任务上实验表明,该方法在迁移学习场景下显著优于现有方法,展现出强健的泛化能力和鲁棒性。 Conclusion: DI3CL作为一种SAR图像理解的基础模型,有效提升了下游任务的性能,为SAR地物分类提供了可扩展、可迁移的新范式。 Abstract: Although significant advances have been achieved in SAR land-cover classification, recent methods remain predominantly focused on supervised learning, which relies heavily on extensive labeled datasets. This dependency not only limits scalability and generalization but also restricts adaptability to diverse application scenarios. In this paper, a general-purpose foundation model for SAR land-cover classification is developed, serving as a robust cornerstone to accelerate the development and deployment of various downstream models. Specifically, a Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) pre-training framework is presented, which incorporates a Dynamic Instance (DI) module and a Contour Consistency (CC) module. DI module enhances global contextual awareness by enforcing local consistency across different views of the same region. CC module leverages shallow feature maps to guide the model to focus on the geometric contours of SAR land-cover objects, thereby improving structural discrimination. Additionally, to enhance robustness and generalization during pre-training, a large-scale and diverse dataset named SARSense, comprising 460,532 SAR images, is constructed to enable the model to capture comprehensive and representative features. To evaluate the generalization capability of our foundation model, we conducted extensive experiments across a variety of SAR land-cover classification tasks, including SAR land-cover mapping, water body detection, and road extraction. The results consistently demonstrate that the proposed DI3CL outperforms existing methods. 
Our code and pre-trained weights are publicly available at: https://github.com/SARpre-train/DI3CL.[82] Revisiting MLLM Based Image Quality Assessment: Errors and Remedy
Zhenchen Tang,Songlin Yang,Bo Peng,Zichuan Wang,Jing Dong
Main category: cs.CV
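TL;DR: Proposes Q-Scorer, a framework that adds a lightweight regression module and IQA-specific score tokens to resolve the mismatch between the discrete outputs of multi-modal large language models and the continuous quality scores required by image quality assessment (IQA), achieving state-of-the-art performance.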
Details
Motivation: 多模态大语言模型(MLLMs)在图像质量评估(IQA)任务中面临离散token输出与连续质量评分需求之间的不匹配问题,现有方法存在转换误差和语义混淆,限制了性能并损害了模型原有能力。 Method: 提出Q-Scorer框架,结合理论分析,引入一个轻量级回归模块和专为IQA设计的分数token,以减少离散到连续转换中的误差,并避免语义混淆对模型的影响。 Result: 在多个IQA基准测试中达到最先进水平,具备良好的混合数据集泛化能力,并能与其他方法结合进一步提升性能。 Conclusion: Q-Scorer有效解决了MLLM在IQA任务中的关键瓶颈,兼顾准确性与兼容性,为基于MLLM的IQA提供了高效可靠的解决方案。 Abstract: The rapid progress of multi-modal large language models (MLLMs) has boosted the task of image quality assessment (IQA). However, a key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks. This discrepancy significantly hinders the performance of MLLM-based IQA methods. Previous approaches that convert discrete token predictions into continuous scores often suffer from conversion errors. Moreover, the semantic confusion introduced by level tokens (e.g., ``good'') further constrains the performance of MLLMs on IQA tasks and degrades their original capabilities for related tasks. To tackle these problems, we provide a theoretical analysis of the errors inherent in previous approaches and, motivated by this analysis, propose a simple yet effective framework, Q-Scorer. This framework incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline. Extensive experiments demonstrate that Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.[83] Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
Haida Feng,Hao Wei,Zewen Xu,Haolin Wang,Chade Li,Yihong Wu
Main category: cs.CV
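TL;DR: Proposes Sparse3DPR, a training-free 3D scene understanding framework that uses the reasoning ability of pre-trained large language models and needs only sparse-view RGB input. A hierarchical plane-enhanced scene graph and task-adaptive subgraph extraction significantly improve accuracy and efficiency.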
Details
Motivation: 现有的无需训练的3D场景理解方法在实际部署中存在准确性和效率不足的问题,而基于训练的方法缺乏灵活性和泛化能力,因此需要一种兼顾高效、准确且无需训练的新框架。 Method: 提出Sparse3DPR,构建分层平面增强场景图作为空间锚点以支持开放词汇推理,并设计任务自适应子图提取方法动态过滤无关信息,减少上下文噪声,提升推理效率与准确性。 Result: 在Space3D-Bench上相比ConceptGraphs实现了28.7%的EM@1提升和78.2%的速度提升,在ScanQA上性能媲美基于训练的方法,并通过真实场景实验验证了其鲁棒性和泛化能力。 Conclusion: Sparse3DPR是一种高效、准确且无需训练的开放场景理解框架,通过结构化场景表示和动态信息筛选,在保持灵活性的同时显著优于现有训练-free方法,并达到与训练-based方法相当的性能。 Abstract: Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address the problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.[84] Cancer-Net PCa-MultiSeg: Multimodal Enhancement of Prostate Cancer Lesion Segmentation Using Synthetic Correlated Diffusion Imaging
Jarett Dewbury,Chi-en Amy Tai,Alexander Wong
Main category: cs.CV
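TL;DR: Synthetic correlated diffusion imaging (CDI$^s$) improves prostate cancer lesion segmentation without additional scan time and can be deployed in clinical workflows immediately.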
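The Dice scores quoted for this entry (0.32 or lower for existing approaches) measure the overlap between a predicted lesion mask and the reference mask. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on binary masks. The function name and the `eps` smoothing term are illustrative choices, not taken from the paper.

```python
import numpy as np

def dice_score(pred, target, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks of the same shape.

    `eps` avoids division by zero when both masks are empty, in which
    case the score is defined here as 1.0.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```

A score of 1.0 means the masks coincide exactly and 0.0 means they do not overlap at all, which puts the reported baseline figure of 0.32 in context.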
Details
Motivation: 现有深度学习方法在前列腺癌病灶分割上表现有限(Dice分数≤0.32),亟需改进分割性能。 Method: 引入合成相关扩散成像(CDI$^s$)作为标准扩散序列的增强模态,并在200名患者数据上评估六种先进分割模型的性能。 Result: 94%的模型配置中CDI$^s$提升了或保持了分割效果,最高实现72.5%的显著相对提升;CDI$^s$ + DWI组合在半数架构中显著改善性能且无退化。 Conclusion: CDI$^s$是一种无需额外扫描时间或模型修改的即插即用式增强方法,可在多种深度学习架构上稳定提升前列腺癌病灶分割性能,具备临床实用价值。 Abstract: Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance, with Dice scores of 0.32 or lower in large patient cohorts. To address this limitation, we investigate synthetic correlated diffusion imaging (CDI$^s$) as an enhancement to standard diffusion-based protocols. We conduct a comprehensive evaluation across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI$^s$, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) sequences. We demonstrate that CDI$^s$ integration reliably enhances or preserves segmentation performance in 94% of evaluated configurations, with individual architectures achieving up to 72.5% statistically significant relative improvement over baseline modalities. CDI$^s$ + DWI emerges as the safest enhancement pathway, achieving significant improvements in half of evaluated architectures with zero instances of degradation. Since CDI$^s$ derives from existing DWI acquisitions without requiring additional scan time or architectural modifications, it enables immediate deployment in clinical workflows. Our results establish validated integration pathways for CDI$^s$ as a practical drop-in enhancement for PCa lesion segmentation tasks across diverse deep learning architectures.[85] Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
Gong Jingyu,Tong Kunkun,Chen Zhuoran,Yuan Chuanhan,Chen Mingang,Zhang Zhizhong,Tan Xin,Xie Yuan
Main category: cs.CV
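TL;DR: Proposes SSOMotion, a human motion synthesis framework based on a unified Scene Semantic Occupancy (SSO) representation. A bi-directional tri-plane decomposition and CLIP encoding provide fine-grained semantic understanding of complex scenes at low computational cost, with strong performance and generalization on several real-scene datasets.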
Details
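Motivation: Existing methods focus mainly on scene structure and neglect semantic understanding, so the human motion they generate in complex 3D scenes is not natural or plausible enough. A motion synthesis method that captures both scene structure and semantics is therefore needed.
Method: Proposes the SSOMotion framework, which uses a unified Scene Semantic Occupancy (SSO) representation; designs a bi-directional tri-plane decomposition to compress the SSO into compact features; maps scene semantics into a unified feature space via CLIP encoding and shared linear dimensionality reduction; and controls motion through frame-wise scene queries combined with the movement direction derived from instructions.
Result: Extensive experiments and ablation studies on cluttered scenes from ShapeNet, PROX, and Replica indicate that the method outperforms existing methods in motion plausibility, scene interaction accuracy, and computational efficiency, and generalizes well.
Conclusion: By introducing a unified semantic occupancy representation and an efficient feature compression mechanism, SSOMotion improves the quality and controllability of human motion synthesis in 3D scenes and confirms the importance of semantic understanding for motion synthesis.
Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, while current methods focus mainly on scene structure but ignore semantic understanding. In this paper, we propose a human motion synthesis framework, termed SSOMotion, that takes a unified Scene Semantic Occupancy (SSO) as its scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. This strategy can derive fine-grained scene semantic structures while significantly reducing redundant computation. We further use these scene hints and the movement direction derived from instructions for motion control via frame-wise scene queries. Extensive experiments and ablation studies conducted on cluttered scenes using ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance while validating its effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
[86] CloudMamba: Grouped Selective State Spaces for Point Cloud Analysis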
Kanglin Qu,Pan Gao,Qun Dai,Zhanzhi Ye,Rui Ye,Yuanhao Sun
Main category: cs.CV
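TL;DR: Proposes CloudMamba, an SSM-based point cloud network. Sequence expanding and merging, a chained Mamba structure, and a grouped selective state space model (GS6) address imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting in S6, giving better results with lower complexity on a range of point cloud tasks.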
Details
Motivation: 现有Mamba在点云分析中的应用受限于点云序列化的不完善、对高阶几何特征感知不足以及S6模型的过拟合问题,因此需要一种更稳定且高效的架构来提升性能。 Method: 提出CloudMamba,包括序列扩展与合并策略以稳定适应Mamba的因果性;设计链式Mamba结构以捕获扫描过程中的高阶几何信息;引入GS6通过参数共享缓解S6的过拟合。 Result: 在多个点云分类、分割等任务上验证了CloudMamba的有效性,取得了优于现有方法的性能,同时具有更低的计算复杂度。 Conclusion: CloudMamba有效解决了Mamba应用于点云分析中的关键挑战,在保持线性复杂度的同时显著提升了模型表现,推动了SSM在点云处理中的发展。 Abstract: Due to the long-range modeling ability and linear complexity property, Mamba has attracted considerable attention in point cloud analysis. Despite some interesting progress, related work still suffers from imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting of the selective state space model (S6) at the core of Mamba. To this end, we resort to an SSM-based point cloud network termed CloudMamba to address the above challenges. Specifically, we propose sequence expanding and sequence merging, where the former serializes points along each axis separately and the latter serves to fuse the corresponding higher-order features causally inferred from different sequences, enabling unordered point sets to adapt more stably to the causal nature of Mamba without parameters. Meanwhile, we design chainedMamba that chains the forward and backward processes in the parallel bidirectional Mamba, capturing high-level geometric information during scanning. In addition, we propose a grouped selective state space model (GS6) via parameter sharing on S6, alleviating the overfitting problem caused by the computational mode in S6. Experiments on various point cloud tasks validate CloudMamba's ability to achieve state-of-the-art results with significantly less complexity.[87] MonoCLUE : Object-Aware Clustering Enhances Monocular 3D Object Detection
Sunghun Yang,Minhyeok Lee,Jungho Lee,Sangyoun Lee
Main category: cs.CV
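TL;DR: Proposes MonoCLUE, which enhances monocular 3D detection through local clustering and a generalized scene memory, achieving state-of-the-art performance on KITTI.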
Details
Motivation: Monocular 3D detection suffers from depth ambiguity and a limited field of view and performs poorly in occluded or truncated scenes in particular; existing methods mostly focus on depth information and overlook key visual cues. Method: Applies K-means to cluster visual features locally and capture part-level object representations, builds a cross-image generalized scene memory to strengthen feature consistency, and fuses both into the object queries to guide attention. Result: Achieves state-of-the-art detection performance on the KITTI benchmark, with stronger robustness in scenes containing occluded and partially visible objects in particular. Conclusion: By combining local clustering with a generalized scene memory, MonoCLUE improves the accuracy and generalization of monocular 3D detection and offers an efficient solution for perception in complex driving scenes. Abstract: Monocular 3D object detection offers a cost-effective solution for autonomous driving but suffers from ill-posed depth and a limited field of view. These constraints cause a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues crucial for robust recognition. We propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and a generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level appearance parts (e.g., bonnet, car roof), improving detection of partially visible objects. The clustered features are propagated across regions to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent representations that generalize across scenes. This improves object-level feature consistency, enabling stable detection across varying environments. Lastly, we integrate both the local cluster features and the generalized scene memory into object queries, guiding attention toward informative regions. By exploiting a unified local clustering and generalized scene memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility, achieving state-of-the-art performance on the KITTI benchmark.
[88] Visual Bridge: Universal Visual Perception Representations Generating
Yilin Gao,Shuguang Dou,Junzhou Li,Zhiheng Yu,Yin Li,Dongsheng Jiang,Shugong Xu
Main category: cs.CV
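TL;DR: Proposes a universal visual perception framework based on flow matching that generates diverse visual representations across multiple tasks and performs well in both zero-shot and fine-tuned settings.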
Details
Motivation: Existing diffusion models are usually confined to a "single-task-single-model" paradigm and lack generalization and scalability in multi-task scenarios; inspired by the cross-domain generalization of large language models, the authors aim to build a unified visual perception framework. Method: Formulates multi-task visual perception as a universal flow-matching problem from image patch tokens to task-specific representations, uses a strong self-supervised foundation model as the anchor, and introduces a multi-scale, circular task embedding mechanism to learn a unified velocity field that enables representation transfer between heterogeneous tasks. Result: Experiments on classification, detection, segmentation, depth estimation, and image-text retrieval show that the model outperforms prior generalist models and several specialist models in both zero-shot and fine-tuned settings, and ablation studies confirm its robustness, scalability, and generalization. Conclusion: The work advances general-purpose visual perception and provides a solid foundation for future research on universal vision modeling. Abstract: Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
[89] Generating Sketches in a Hierarchical Auto-Regressive Process for Flexible Sketch Drawing Manipulation at Stroke-Level
Sicong Zang,Shuhui Gao,Zhijun Fang
Main category: cs.CV
TL;DR: Proposes a hierarchical auto-regressive sketch generation method that allows flexible stroke-level manipulation during generation.
Details
Motivation: Existing methods must fix all stroke conditions before generation begins and cannot adjust them during generation, so they lack flexibility. Method: Uses a three-stage hierarchical auto-regressive process (predicting a stroke embedding, anchoring its position, and translating it into drawing actions) and generates subsequent strokes auto-regressively from what has already been generated. Result: Strokes can be edited and controlled at any point during sketch generation, improving flexibility and controllability. Conclusion: The method supports dynamic, flexible sketch editing and improves on conventional approaches that take all conditions as a one-off input. Abstract: Generating sketches with specific expected patterns, i.e., manipulating sketches in a controllable way, is a popular task. Recent studies control sketch features at stroke level by editing the values of stroke embeddings used as conditions. However, to give the generator a global view of the sketch to be drawn, all of these edited conditions must be collected and fed into the generator simultaneously before generation starts, i.e., no further manipulation is allowed during the sketch generation process. To make sketch drawing manipulation more flexible, we propose a hierarchical auto-regressive sketch generation process. Instead of generating an entire sketch at once, each stroke in a sketch is generated in a three-stage hierarchy: 1) predicting a stroke embedding to represent which stroke is going to be drawn, 2) anchoring the predicted stroke on the canvas, and 3) translating the embedding into a sequence of drawing actions to form the full sketch. Moreover, stroke prediction, anchoring, and translation proceed auto-regressively, i.e., both the recently generated strokes and their positions are considered when predicting the current one, guiding the model to produce an appropriate stroke at a suitable position to benefit the generation of the full sketch. Stroke-level sketch drawing can be manipulated flexibly at any time during generation by adjusting the exposed editable stroke embeddings.
[90] Theoretical Analysis of Power-law Transformation on Images for Text Polarity Detection
Narendra Singh Yadav,Pavan Kumar Perepu
Main category: cs.CV
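TL;DR: This paper gives a theoretical analysis of a phenomenon observed in power-law-transformation-based text polarity detection in images, explaining why the between-class variance changes under different text-background contrasts.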
Details
Motivation: A theoretical explanation and verification is needed for the phenomenon reported in the literature, in which the maximum between-class variance in the histogram statistics of power-law-transformed images changes with text polarity. Method: Through theoretical derivation and analysis, studies how the between-class variance changes after power-law transformation for dark or bright text on different backgrounds, treating text and background as two classes. Result: Provides a theoretical explanation of the known phenomenon, clarifying why the between-class variance increases or decreases after power-law transformation for dark text on a bright background or bright text on a dark background. Conclusion: Provides theoretical support for power-law-transformation-based text polarity detection, improves understanding of the phenomenon, and may help in developing better image binarization methods. Abstract: In several computer vision applications, such as vehicle license plate recognition, CAPTCHA recognition, and printed or handwritten character recognition from images, text polarity detection and binarization are important preprocessing tasks. To analyze an image, it first has to be converted to a simple binary image. This binarization process requires knowledge of the polarity of the text in the image. Text polarity is defined as the contrast of the text with respect to the background: the text is either darker than the background (dark text on a bright background) or vice versa. The binarization process uses this polarity information to convert the original colour or grayscale image into a binary image. In the literature, there is an intuitive approach based on applying a power-law transformation to the original images. In this approach, the authors illustrated an interesting phenomenon in the histogram statistics of the transformed images. Considering text and background as two classes, they observed that the maximum between-class variance increases (decreases) for dark (bright) text on a bright (dark) background, and presented the corresponding empirical results. In this paper, we present a theoretical analysis of this phenomenon.
[91] Exploring the Underwater World Segmentation without Extra Training
Bingyu Li,Tao Huo,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
Main category: cs.CV
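TL;DR: This paper introduces AquaOV255, the first large-scale, fine-grained underwater segmentation dataset, with 255 categories and over 20K images, and establishes UOVSBench, the first underwater open-vocabulary (OV) segmentation benchmark. It also proposes Earth2Ocean, a training-free OV segmentation framework that transfers terrestrial vision-language models to the underwater domain and achieves significant performance gains without any additional underwater training.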
Details
Motivation: Existing datasets and models are largely limited to terrestrial imagery; large-scale datasets and effective models for precise segmentation of underwater organisms are lacking, which hinders marine biodiversity monitoring and ecological assessment. An open-vocabulary segmentation dataset and benchmark dedicated to underwater scenes, together with a general segmentation method that copes with complex underwater environments, are therefore needed. Method: Proposes the Earth2Ocean framework with two core components: a Geometric-guided Visual Mask Generator (GMG), which uses self-similarity geometric priors to improve local structure perception, and a Category-visual Semantic Alignment (CSA) module, which enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. The method transfers across domains without underwater training. Result: Extensive experiments on the newly built UOVSBench benchmark show that Earth2Ocean achieves a significant improvement in average performance while maintaining efficient inference. Conclusion: The study fills a gap in underwater open-vocabulary segmentation, provides the first large-scale dataset (AquaOV255) and benchmark (UOVSBench), and validates the training-free cross-domain transfer framework Earth2Ocean, laying a foundation for future research on underwater visual understanding. Abstract: Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset, containing 255 categories and over 20K images and covering diverse categories for open-vocabulary (OV) evaluation. Furthermore, we establish the first underwater OV segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive evaluation. We also present Earth2Ocean, a training-free OV segmentation framework that transfers terrestrial vision-language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves a significant average performance improvement while maintaining efficient inference.
[92] HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving
Zhiwen Yang,Yuxin Peng
Main category: cs.CV
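TL;DR: Proposes the HD$^2$-SSC framework, which uses a high-dimension semantic decoupling module and a high-density occupancy refinement module to close the dimension and density gaps that camera images face in 3D semantic scene completion.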
Details
Motivation: Existing methods suffer from an input-output dimension gap and an annotation-reality density gap, which degrade 3D scene completion. Method: Designs a High-dimension Semantic Decoupling module that expands 2D features and decouples pixel semantics from occlusions, and a High-density Occupancy Refinement module with a detect-and-refine architecture that uses contextual geometric and semantic structures to increase completion density. Result: Experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets show that the proposed method significantly improves 3D semantic scene completion performance. Conclusion: HD$^2$-SSC bridges the dimension and density gaps and improves the quality of camera-based 3D semantic scene completion. Abstract: Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving, enabling voxelized 3D scene understanding for effective scene perception and decision-making. Existing SSC methods have shown efficacy in improving 3D scene representations, but suffer from an inherent input-output dimension gap and an annotation-reality density gap: the 2D planar view of the input images, with sparsely annotated labels, leads to inferior prediction of real-world dense occupancy in a 3D stereoscopic view. In light of this, we propose the High-Dimension High-Density Semantic Scene Completion (HD$^2$-SSC) framework with expanded pixel semantics and refined voxel occupancies. To bridge the dimension gap, a High-dimension Semantic Decoupling module is designed to expand 2D image features along a pseudo third dimension, decoupling coarse pixel semantics from occlusions, and then to identify focal regions with fine semantics to enrich image features. To mitigate the density gap, a High-density Occupancy Refinement module is devised with a "detect-and-refine" architecture that leverages contextual geometric and semantic structures for enhanced semantic density, completing missing voxels and correcting erroneous ones. Extensive experiments and analyses on the SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of our HD$^2$-SSC framework.
[93] An Image-Based Path Planning Algorithm Using a UAV Equipped with Stereo Vision
Selim Ahmet Iz,Mustafa Unel
Main category: cs.CV
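TL;DR: Proposes an image-based path planning algorithm that uses computer vision and a UAV-generated disparity map to define candidate trajectory waypoints, and validates it through comparison with the A* and PRM algorithms.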
Details
Motivation: Conventional two-dimensional images cannot distinguish craters from hills in the terrain, which makes path safety hard to guarantee, so depth information is needed for safer path planning. Method: Uses a UAV to obtain a disparity map of the terrain, determines candidate trajectory waypoints with computer vision techniques including edge, line, and corner detection and stereo depth reconstruction, and identifies the start and goal points automatically through ArUco markers and circle detection. Result: Comparisons with the A* and PRM algorithms in the V-REP simulation environment and in a physical laboratory setup show that the proposed algorithm plans paths effectively and has application potential. Conclusion: The disparity-map-based image path planning method improves path safety in complex terrain and has prospects for practical use. Abstract: This paper presents a novel image-based path planning algorithm developed using computer vision techniques, together with a comparative analysis against well-known deterministic and probabilistic algorithms, namely A* and the Probabilistic Roadmap (PRM) algorithm. Terrain depth has a significant impact on the safety of the calculated path, yet craters and hills on the surface cannot be distinguished in a two-dimensional image. The proposed method uses a disparity map of the terrain generated with a UAV. Several computer vision techniques, including edge, line, and corner detection methods as well as stereo depth reconstruction, are applied to the captured images, and the resulting disparity map is used to define candidate waypoints of the trajectory. The initial and desired points are detected automatically using ArUco marker pose estimation and circle detection techniques. After presenting the mathematical model and vision techniques, the developed algorithm is compared with well-known algorithms on different virtual scenes created in the V-REP simulation program and on a physical setup created in a laboratory environment. The results are promising and demonstrate the effectiveness of the proposed algorithm.
[94] Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification
Yihang Wu,Ahmad Chaddad
Main category: cs.CV
TL;DR: Proposes FedMedCLIP, a CLIP-based federated learning method that addresses data heterogeneity and communication cost in medical image classification.
Details
Motivation: Deep models perform well in medical imaging but depend on source data for training, which raises privacy risks; federated learning can mitigate this, but it faces data heterogeneity and resource overhead, which become more pronounced when vision-language models are used. Method: Proposes FedMedCLIP, which freezes the CLIP encoders to lower computational overhead, introduces a masked feature adaptation module (FAM) to reduce the communication load, and uses a masked MLP as a private local classifier; it designs an adaptive KL-divergence-based distillation regularization that enables mutual learning between the FAM and the MLP, and combines model compression with ensemble predictions for classification. Result: Experiments on four public medical datasets show that the method outperforms the baselines (e.g., 8% higher on ISIC2019) at a reasonable resource cost (120× faster than FedAVG). Conclusion: FedMedCLIP balances performance and resource overhead for federated medical image classification and has potential for practical deployment. Abstract: Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), a decentralized learning framework that trains a shared model across multiple hospitals (i.e., FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision-language models (VLMs). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification (FedMedCLIP). Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load, while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between the FAM and the MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher than the second-best baseline on ISIC2019) at a reasonable resource cost (e.g., 120× faster than FedAVG).
[95] Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
Sida Huang,Siqi Huang,Ping Luo,Hongyuan Zhang
Main category: cs.CV
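TL;DR: This paper proposes the Laytrol network, a new layout-to-image generation method that inherits the parameters of the pretrained MM-DiT model and uses a dedicated initialization scheme to preserve the base model's knowledge, improving the spatial consistency and visual quality of generated images.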
Details
Motivation: Existing layout-to-image methods usually introduce the layout condition by integrating adapter modules, but the generated images often have low visual quality and a style inconsistent with the base model, indicating a loss of pretrained knowledge. A new method that mitigates this problem is therefore needed. Method: Builds the LaySyn dataset from images synthesized by the base model itself to reduce the distribution shift from the pretraining data, and proposes the Laytrol network, whose parameters are inherited from MM-DiT to preserve the base model's pretrained knowledge. A dedicated initialization scheme keeps the layout encoder's output tokens within the data domain of MM-DiT and initializes the outputs of the layout control network to zero. In addition, an object-level rotary position embedding is applied to the layout tokens to provide coarse positional information. Result: Qualitative and quantitative experiments demonstrate the effectiveness of the method, which markedly improves the spatial consistency and visual quality of generated images. Conclusion: With the LaySyn dataset and the Laytrol network and its dedicated initialization strategy, visual quality and stylistic consistency in layout-to-image generation can be improved while the base model's pretrained knowledge is preserved. Abstract: With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
[96] DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
Seyedehnanita Madani,Rama Chellappa,Vishal M. Patel
Main category: cs.CV
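TL;DR: This paper proposes DiffRegCD, a framework that unifies dense registration and change detection. It estimates correspondences with sub-pixel accuracy through classification rather than regression and uses multi-scale features from a pretrained diffusion model for robustness to illumination and viewpoint changes, outperforming existing methods on multiple datasets.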
Details
Motivation: Existing change detection methods perform poorly under severe image misalignment (e.g., parallax, viewpoint changes, and long temporal gaps); traditional two-stage methods and existing joint frameworks struggle with large displacements, rely on synthetic perturbations or global homography assumptions, and lack real alignment supervision. Method: Proposes the DiffRegCD framework, which reformulates correspondence estimation as a Gaussian-smoothed classification task to achieve sub-pixel accuracy; uses multi-scale features from a frozen pretrained denoising diffusion model for robustness; and trains with supervision obtained by applying controlled affine transformations to standard change detection datasets, which yields paired ground-truth flow and change labels. Result: Experiments on multiple aerial (LEVIR-CD, DSIFN-CD, and others) and ground-level (VL-CMU-CD) datasets show that DiffRegCD outperforms recent baselines under large displacements, long temporal gaps, and geometric variation, with greater stability and accuracy. Conclusion: By combining diffusion-model features with classification-based correspondence estimation, DiffRegCD offers a new paradigm for unified change detection and markedly improves registration and detection in complex real-world scenes. Abstract: Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two-stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression-only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian-smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground-level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification-based correspondence as a strong foundation for unified change detection.
[97] Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation?
Rui-Qing Sun,Ang Li,Zhijing Wu,Tian Lan,Qianyu Lu,Xingshan Yao,Chen Xu,Xian-Ling Mao
Main category: cs.CV
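TL;DR: Proposes ISExplore, an efficient segment selection strategy that achieves talking face generation comparable to or better than using a long video with only a 5-second high-quality reference segment, significantly speeding up data processing and training.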
Details
Motivation: Existing NeRF- or 3DGS-based talking face generation methods must process several minutes of reference video, which is very time-consuming and limits practical use. This paper asks whether long videos are really necessary and finds that a segment's informational quality matters more than its length. Method: Proposes the ISExplore strategy, which automatically selects the most informative 5-second reference video segment along three dimensions (audio feature diversity, lip movement amplitude, and number of camera views) for subsequent model training. Result: Experiments on NeRF and 3DGS methods show that the approach speeds up data processing and training by more than 5x while maintaining high-fidelity output. Conclusion: The quality of the reference video matters far more than its length; a few seconds of high-quality footage suffice for excellent talking face generation, which suggests a new route to efficient personalized modeling. Abstract: Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, and e-commerce live streaming. Currently, TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure that models capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulously processing and fitting several minutes of reference video, which often takes hours. The computational burden of processing and fitting long reference videos severely limits the practical value of these methods. However, is it really necessary to fit minutes of reference video? Our exploratory case studies show that using informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that the informational quality of a video is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.
[98] Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification
Zhenfeng Zhuang,Fangyu Zhou,Liansheng Wang
Main category: cs.CV
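TL;DR: Proposes a multimodal prototype-based multi-instance learning method (MP-MIL) that improves cross-modal fusion and model interpretability in computational pathology through bidirectional interaction and a Stereoscopic Optimal Transport (SOT) algorithm.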
Details
Motivation: Existing methods rely on multi-instance learning to handle whole slide images, but the lack of fine-grained annotation biases instance-level descriptions, and cross-modal guidance is mostly unidirectional, which limits performance. Method: Builds task-specific text prototypes of pathological entities (generated with a frozen LLM) and visual instance-level prototypes, and applies the Stereoscopic Optimal Transport (SOT) algorithm for similarity-based cross-modal alignment, achieving bidirectional information interaction and balanced information compression. Result: Few-shot classification and explainability experiments on three cancer datasets show that the method generalizes better and is more interpretable. Conclusion: By introducing task-specific prototypes and an SOT fusion mechanism, MP-MIL improves the performance and interpretability of multimodal pathology analysis and suits low-resource learning on high-resolution WSIs. Abstract: While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model's reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.
[99] ReIDMamba: Learning Discriminative Features with Visual State Space Model for Person Re-Identification
Hongyang Gu,Qisong Yang,Lei Pu,Siming Han,Yao Ding
Main category: cs.CV
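TL;DR: Proposes ReIDMamba, a pure Mamba-based person re-identification framework that improves feature discrimination with multiple class tokens and a multi-granularity feature extraction module, reduces feature redundancy with a ranking-aware triplet regularization, and improves on existing Transformer methods in parameter count, memory usage, and inference speed.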
Details
Motivation: To address the quadratic growth in computation and memory with sequence length that Transformers incur in person re-identification, while overcoming the local receptive fields and information loss of CNNs. Method: Designs a strong Mamba-based baseline that introduces multiple class tokens to capture fine-grained global features; proposes a multi-granularity feature extractor (MGFE) module with a multi-branch structure and class token fusion; and introduces a ranking-aware triplet regularization (RATR) that combines intra-class and inter-class diversity constraints to increase feature diversity and robustness. Result: ReIDMamba has only one-third of the parameters of TransReID, uses less GPU memory, runs inference faster, and achieves state-of-the-art performance on five mainstream person re-identification benchmarks. Conclusion: ReIDMamba is the first work to apply a pure Mamba architecture to person re-identification; it is both efficient and high-performing and offers a new architectural direction for ReID. Abstract: Extracting robust discriminative features is a critical challenge in person re-identification (ReID). While Transformer-based methods have successfully addressed some limitations of convolutional neural networks (CNNs), such as their local processing nature and the information loss resulting from convolution and downsampling operations, they still face a scalability issue because memory and computational requirements grow quadratically with the length of the input sequence. To overcome this, we propose a pure Mamba-based person ReID framework named ReIDMamba. Specifically, we design a strong Mamba-based baseline that effectively leverages fine-grained, discriminative global features by introducing multiple class tokens. To further enhance robust feature learning within Mamba, we design two novel techniques. First, the multi-granularity feature extractor (MGFE) module, designed with a multi-branch architecture and class token fusion, effectively forms multi-granularity features, enhancing both discrimination ability and fine-grained coverage. Second, the ranking-aware triplet regularization (RATR) is introduced to reduce redundancy in features from multiple branches, enhancing the diversity of multi-granularity features by incorporating both intra-class and inter-class diversity constraints, thus ensuring the robustness of person features. To our knowledge, this is the first work to integrate a purely Mamba-driven approach into ReID research. Our proposed ReIDMamba model has only one-third of the parameters of TransReID, along with lower GPU memory usage and faster inference throughput. Experimental results demonstrate ReIDMamba's superior and promising performance, achieving state-of-the-art results on five person ReID benchmarks. Code is available at https://github.com/GuHY777/ReIDMamba.
[100] Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks
Xiaoye Liang,Lai Jiang,Minglang Qiao,Yichen Guo,Yue Zhang,Xin Deng,Shengxi Li,Yufan Liu,Mai Xu
Main category: cs.CV
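TL;DR: This paper proposes BuIQA, a new task for assessing burst image quality, builds the first large-scale benchmark dataset for it, and presents a unified framework based on task-driven prompt generation and knowledge distillation that performs well across 10 downstream tasks and improves denoising and super-resolution performance.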
Details
Motivation: Burst images are redundant, which raises storage and transmission costs and lowers the efficiency of downstream tasks, so a method that assesses the task-driven quality of each frame is needed to guide image selection. Method: Builds the BuIQA dataset, containing 7,346 sequences, 45,827 images, and 191,572 annotated scores, and proposes a unified framework consisting of a task-driven prompt generation network (with heterogeneous knowledge distillation) and a task-aware quality assessment network. Result: Outperforms existing state-of-the-art methods across 10 downstream scenarios, and selecting high-quality frames yields a 0.33 dB PSNR improvement on denoising and super-resolution. Conclusion: The proposed BuIQA task and framework enable effective quality assessment and selection of burst images, significantly improve downstream task performance, and have practical value. Abstract: In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to increased storage and transmission demands, as well as reduced efficiency of downstream tasks. To address this, we propose a new task, Burst Image Quality Assessment (BuIQA), which evaluates the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of 7,346 burst sequences with 45,827 images and 191,572 annotated quality scores for multiple downstream scenarios. Inspired by the data analysis, a unified BuIQA framework is proposed to achieve efficient adaptation for BuIQA under diverse downstream scenarios. Specifically, a task-driven prompt generation network is developed with heterogeneous knowledge distillation to learn the priors of the downstream task. Then, a task-aware quality assessment network is introduced to assess burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, which outperforms the state of the art. Furthermore, applying our approach to select high-quality burst frames achieves a 0.33 dB PSNR improvement in the downstream tasks of denoising and super-resolution.
[101] Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
Shenao Zhao,Pengpeng Liang,Zhoufan Yang
Main category: cs.CV
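TL;DR: This paper proposes MMAssist, a method that uses image and text features as bridges to improve unsupervised domain adaptation for LiDAR-based 3D object detection. By aligning and fusing the visual and language features of 2D bounding boxes with 3D features, and incorporating pseudo labels generated with an external 2D detector, it significantly improves cross-domain detection and outperforms existing methods on multiple datasets.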
Details
Motivation: Although point clouds and images are often collected together, the role of image data in teacher-student 3D unsupervised domain adaptation has not been fully explored. This paper investigates how multi-modal information (images and text) can be used effectively to improve domain adaptation for 3D detection. Method: Proposes MMAssist: 1) project ground-truth or pseudo labels onto the images to obtain 2D boxes; 2) extract image features with a pre-trained vision backbone; 3) generate text descriptions with a large vision-language model (LVLM) and extract text features; 4) during source- and target-domain training, align the 3D features of predicted boxes with the corresponding image and text features and fuse them with learned weights; 5) align features between the teacher and student branches; 6) use an off-the-shelf 2D detector to help generate better pseudo labels. Result: MMAssist outperforms current state-of-the-art methods on three domain adaptation tasks across three mainstream 3D detection datasets. Conclusion: By introducing image and text modalities as bridges for feature alignment and combining multi-modal feature fusion with pseudo-label refinement, MMAssist improves LiDAR 3D object detection under unsupervised domain adaptation and demonstrates the potential of multi-modal information in 3D domain adaptation. Abstract: Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is common to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground-truth labels or pseudo labels onto the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and of the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features of the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of the point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
[102] Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection
Seyedehanita Madani,Vishal M. Patel
Main category: cs.CV
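TL;DR: Proposes a modular pipeline that improves the spatial and temporal robustness of remote sensing change detection through diffusion-based semantic morphing, dense registration, and residual flow refinement, without modifying existing detection networks.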
Details
Motivation: Address the drop in change detection performance caused by spatial misalignment between remote sensing images acquired far apart in time, and improve robustness and cross-domain generalization in real-world scenarios.
Method: Combines a diffusion model that generates intermediate semantic morphing frames with RoMa, which estimates correspondences frame by frame; dense registration and a lightweight U-Net then refine the flow field to achieve high-fidelity image alignment.
Result: Validated on the LEVIR-CD, WHU-CD, and DSIFN-CD datasets, the method significantly improves registration accuracy and change detection performance and works with multiple backbones.
Conclusion: The method is general and plug-and-play, and effectively strengthens the robustness of existing change detection models to misaligned images.
Abstract: Remote sensing change detection is often challenged by spatial misalignment between bi-temporal images, especially when acquisitions are separated by long seasonal or multi-year gaps. While modern convolutional and transformer-based models perform well on aligned data, their reliance on precise co-registration limits their robustness in real-world conditions. Existing joint registration-detection frameworks typically require retraining and transfer poorly across domains. We introduce a modular pipeline that improves spatial and temporal robustness without altering existing change detection networks. The framework integrates diffusion-based semantic morphing, dense registration, and residual flow refinement. A diffusion module synthesizes intermediate morphing frames that bridge large appearance gaps, enabling RoMa to estimate stepwise correspondences between consecutive frames. The composed flow is then refined through a lightweight U-Net to produce a high-fidelity warp that co-registers the original image pair. Extensive experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection across multiple backbones, demonstrating the generality and effectiveness of the proposed approach.
[103] DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion
Da-Yeong Kim,Yeong-Jun Cho
Main category: cs.CV
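TL;DR: Proposes DANCE, a point cloud completion framework that handles inputs of varying density while preserving the observed geometry. It generates candidate points via ray-based multi-view sampling, refines their positions with a transformer decoder, and adds a semantic classification head for category-consistent completion without external image supervision.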
Details
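Motivation: Existing methods typically assume fixed input/output densities or rely on image-based representations, so they cope poorly with the widely varying sparsity of input point clouds and the limited supervision found in real scenes. A density-insensitive method that uses semantic information for consistent completion is therefore needed.
Method: Proposes the DANCE framework: candidate points are first generated by multi-view ray sampling; a transformer decoder refines their positions and predicts opacity scores that decide whether each point is kept; a lightweight classification head trained directly on geometric features provides class awareness, enabling semantically guided completion without external image supervision.
Result: Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in completion accuracy and structural consistency while being more robust to different input densities and noise levels.
Conclusion: DANCE is a density-agnostic, class-aware point cloud completion framework that completes missing regions while preserving the observed geometry and uses semantic guidance to make results more plausible and consistent, which suits incomplete 3D scans from real scenes.
Abstract: Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.
[104] ChexFract: From General to Specialized - Enhancing Fracture Description Generation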
Nikolay Nechaev,Evgeniia Przhezdzetskaia,Dmitry Umerenkov,Dmitry V. Dylov
Main category: cs.CV
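TL;DR: Develops specialized vision-language models for fracture detection and description using MAIRA-2 and CheXagent encoders. They significantly outperform general-purpose models in chest X-ray report generation, and the best-performing model is publicly released to support research on accurate reporting of rare pathologies.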
Details
Motivation: Existing general-purpose vision-language models fall short when describing rare but important pathologies such as fractures in chest X-rays, so clinically meaningful report generation needs to improve.
Method: Trains fracture-specific vision-language models with MAIRA-2 and CheXagent encoders and analyzes model outputs by fracture type, location, and age.
Result: The fracture-specific models significantly outperform general-purpose models at generating accurate fracture descriptions, and the analysis reveals strengths and limitations of current architectures.
Conclusion: Specialized vision-language models describe fractures in chest X-rays more accurately and can improve automated reporting of rare pathologies; the released model is intended to support follow-up research.
Abstract: Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.
[105] CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
Chae-Yeon Heo,Yeong-Jun Cho
Main category: cs.CV
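TL;DR: Proposes CSF-Net, a semantic-guided image inpainting framework that uses a pretrained Amodal Completion model to generate structure-aware candidates for missing regions as semantic priors and fuses contextual and semantic features with a transformer architecture, improving inpainting quality under large masks.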
Details
Motivation: In large-mask image inpainting, key visual content is missing and contextual cues are limited, so results are prone to structural distortion and semantic inconsistency. Effective semantic priors are needed to guide inpainting.
Method: Uses a pretrained Amodal Completion model to generate structure-aware candidates for the missing regions as semantic priors, and designs a transformer-based Context-Semantic Fusion Network (CSF-Net) that fuses these priors with contextual features into a semantic guidance image, which is then integrated into existing inpainting models.
Result: Experiments on the Places365 and COCOA datasets show that CSF-Net reduces object hallucination and significantly improves the visual realism and semantic consistency of inpainting results across a variety of masking conditions.
Conclusion: By fusing semantic priors with contextual information, CSF-Net offers a general and effective solution for large-mask inpainting that improves performance without changing the architecture of the underlying inpainting model.
Abstract: In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based fusion framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment. The code for CSF-Net is available at https://github.com/chaeyeonheo/CSF-Net.
[106] Hardware-Aware YOLO Compression for Low-Power Edge AI on STM32U5 for Weeds Detection in Digital Agriculture
Charalampos S. Kouzinopoulos,Yuri Manna
Main category: cs.CV
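TL;DR: Presents a low-power edge AI system based on YOLOv8n for real-time weed detection on an STM32 microcontroller. With model compression, it substantially reduces energy consumption while maintaining relatively high accuracy.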
Details
Motivation: Traditional weeding relies on chemical herbicides, which risk environmental contamination and lead to herbicide-resistant weeds. Existing computer-vision-based precision weeding usually depends on high-compute platforms and is hard to deploy widely in resource-constrained agricultural settings.
Method: Uses the YOLOv8n object detector and applies structured pruning, integer quantization, and input image resolution scaling, then deploys the optimized model on the STM32U575ZI microcontroller for real-time weed detection at the edge.
Result: Evaluated on the CropAndWeed dataset with 74 plant species, the system consumes only 51.8 mJ per inference and supports real-time, in-situ weed detection under strict hardware constraints, balancing detection accuracy and energy efficiency.
Conclusion: The low-power edge AI system offers a practical option for weed management in sustainable agriculture, with potential for deployment at scale in resource-constrained environments.
Abstract: Weeds significantly reduce crop yields worldwide and pose major challenges to sustainable agriculture. Traditional weed management methods, primarily relying on chemical herbicides, risk environmental contamination and lead to the emergence of herbicide-resistant species. Precision weeding, leveraging computer vision and machine learning methods, offers a promising eco-friendly alternative but is often limited by reliance on high-power computational platforms. This work presents an optimized, low-power edge AI system for weeds detection based on the YOLOv8n object detector deployed on the STM32U575ZI microcontroller. Several compression techniques are applied to the detection model, including structured pruning, integer quantization and input image resolution scaling in order to meet strict hardware constraints. The model is trained and evaluated on the CropAndWeed dataset with 74 plant species, achieving a balanced trade-off between detection accuracy and efficiency. Our system supports real-time, in-situ weeds detection with a minimal energy consumption of 51.8 mJ per inference, enabling scalable deployment in power-constrained agricultural environments.
[107] Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning
Jialong Qin,Xin Zou,Di Lu,Yibo Yan,Xuming Hu
Main category: cs.CV
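TL;DR: Proposes SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. It dynamically adjusts pruning ratios and performs hierarchical cache pruning through self-calibration, performs strongly on multiple benchmarks, and needs no access to attention scores, so it is compatible with hardware acceleration techniques such as Flash Attention.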
Details
Motivation: Existing VideoLLMs process too many redundant visual tokens, which leads to high computational complexity and poorly scaling KV caches, so a more efficient pruning method is needed.
Method: SharpV dynamically adjusts pruning ratios based on spatial-temporal information and, in the KV cache stage, prunes degraded visual features in a self-calibrating manner according to their similarity to the original visual features, giving a two-stage pruning scheme.
Result: Experiments show SharpV outperforms existing methods on multiple public benchmarks, occasionally surpasses dense models, and achieves efficient hierarchical cache compression.
Conclusion: SharpV is presented as the first two-stage pruning framework that needs no access to attention scores; it offers a new view of information flow in VideoLLMs and is compatible with hardware acceleration.
Abstract: Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Different from most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features in a self-calibrating manner, guided by similarity to original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of information bottleneck, offering a new insight into VideoLLMs' information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is notably the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
[108] EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision
Yifei Cao,Yu Liu,Guolong Wang,Zhu Liu,Kai Wang,Xianjie Zhang,Jizhe Yu,Xun Tu
Main category: cs.CV
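TL;DR: Proposes EAGLE, a framework that uses appearance- and geometry-aware memory for unified 2D-3D egocentric visual query localization, achieving state-of-the-art performance on the Ego4D-VQ benchmark.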
Details
Motivation: Address the difficulty of visual query localization in egocentric vision caused by camera motion, viewpoint changes, and appearance variations.
Method: Designs EAGLE, a framework inspired by avian memory consolidation that combines an appearance-aware meta-learning memory (AMM) with a geometry-aware localization memory (GLM) and unifies 2D and 3D tasks through a visual geometry grounded Transformer (VGGT).
Result: Achieves state-of-the-art performance on the Ego4D-VQ benchmark, with significantly improved retrieval accuracy and 3D back-projection accuracy.
Conclusion: Through structured memory consolidation, EAGLE supports long- and short-term target modeling, enables precise contour delineation and spatial discrimination, and advances visual query localization for embodied AI and AR/VR.
Abstract: Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM), with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
[109] Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving
Jian Wang,Lijun He,Yixing Yong,Haixia Bi,Fan Li
Main category: cs.CV
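TL;DR: Proposes AdvRoad, a method for generating natural-looking, road-like adversarial posters that stealthily attack RGB-camera-based 3D object detection in autonomous driving scenarios.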
Details
Motivation: Posters generated by existing adversarial attacks look unnatural and have fixed content, so they are easy to notice or defend against. A stealthier, more adaptive attack is needed to assess the safety of autonomous driving systems.
Method: A two-stage approach: the first stage generates adversarial samples in a natural road style (Road-Style Adversary Generation), and the second adapts them to the specific scene (Scenario-Associated Adaptation) to raise attack effectiveness while keeping the poster visually inconspicuous.
Result: Experiments show that AdvRoad attacks effectively across multiple 3D detectors, scenes, and spoofing locations and generalizes well; physical experiments demonstrate its feasibility and threat in real environments.
Conclusion: AdvRoad produces natural-looking, hard-to-notice adversarial road posters that induce 3D detectors to perceive non-existent objects without drawing human attention, exposing potential safety vulnerabilities of vision-based autonomous driving systems in the real world.
Abstract: Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations.
Moreover, physical attacks further demonstrate the practical threats in real-world environments.
[110] High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection
Zhiyuan Chen,Yuelin Guo,Zitong Huang,Haoyu He,Renhao Lu,Weizhe Zhang
Main category: cs.CV
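TL;DR: Proposes Cascade HQP-DETR, which combines high-quality synthetic data, high-quality-proposal-guided query encoding, and a cascade denoising algorithm to reach a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007 with only 12 training epochs.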
Details
Motivation: Existing ISOD methods are limited by low-quality synthetic data, DETR models that overfit and converge slowly, and uniform denoising that causes overfitting to pseudo-label noise.
Method: (1) Builds high-quality synthetic datasets (FluxVOC/FluxCOCO) with LLaMA-3, Flux, and Grounding DINO; (2) proposes high-quality-proposal-guided query encoding, initializing queries from SAM-generated proposals and RoI features; (3) designs a cascade denoising algorithm that raises the IoU threshold layer by layer to dynamically adjust training weights.
Result: Reaches 61.04% mAP@0.5 on PASCAL VOC 2007, surpassing strong baselines with only 12 training epochs and showing fast convergence and good generalization.
Conclusion: Cascade HQP-DETR addresses data quality, model overfitting, and training strategy issues in ISOD, and demonstrates efficiency under short training schedules as well as generality of the architecture.
Abstract: Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels.
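One way to read the cascade idea above (IoU thresholds that increase across decoder layers and gate the training weight of each denoising query) is sketched below. The abstract does not give the actual weighting rule, so everything here is an assumption for illustration: the linear threshold schedule, the 0.5 to 0.75 range, the hard 0/1 weights, and the function names.

```python
import numpy as np


def cascade_thresholds(num_layers, start=0.5, stop=0.75):
    """Hypothetical schedule: IoU thresholds rising linearly with depth."""
    return np.linspace(start, stop, num_layers)


def cascade_weights(ious, thresholds):
    """Binary training weights per decoder layer and query.

    ious: (num_layers, num_queries) IoU between each denoised prediction
    and its (pseudo) label. A query contributes to a layer's loss only if
    its IoU reaches that layer's threshold, so deeper layers are
    supervised by increasingly reliable matches.
    """
    ious = np.asarray(ious, dtype=float)
    thr = np.asarray(thresholds, dtype=float)[:, None]
    return (ious >= thr).astype(float)


def weighted_layer_losses(per_query_loss, weights):
    """Average the kept per-query losses for each layer."""
    per_query_loss = np.asarray(per_query_loss, dtype=float)
    kept = weights.sum(axis=1)
    # Guard against layers where no query passes the threshold.
    return (per_query_loss * weights).sum(axis=1) / np.maximum(kept, 1.0)
```

A soft weighting (for example, scaling by how far the IoU exceeds the threshold) would fit the same description; the hard gate is chosen here only because it is the simplest instance.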
Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture's universal applicability.
[111] Multi-modal Deepfake Detection and Localization with FPN-Transformer
Chende Zheng,Ruiqi Suo,Zhoulin Ji,Jingyi Deng,Fangbin Yi,Chenhao Lin,Chao Shen
Main category: cs.CV
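TL;DR: Proposes a multi-modal deepfake detection and localization framework based on an FPN-Transformer. It extracts audio and video features with WavLM and CLIP, builds a multi-scale feature pyramid from R-TLM blocks to analyze cross-modal temporal dependencies, and uses a dual-branch prediction head for forgery probability prediction and temporal offset refinement, scoring 0.7535 on the IJCAI'25 DDL-AV benchmark.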
Details
Motivation: Existing unimodal detection methods struggle to capture cross-modal correlations and have limited ability to temporally localize fine-grained forged segments, so they lack generalization and precision in complex scenarios.
Method: Uses pre-trained self-supervised models (WavLM and CLIP) to extract audio and video features, designs R-TLM blocks to build a multi-scale feature pyramid, models cross-context temporal dependencies with localized attention, and uses a dual-branch prediction head to jointly perform forgery detection and temporal boundary regression.
Result: Achieves a final score of 0.7535 on the IJCAI'25 DDL-AV test set, significantly outperforming existing methods and confirming the model's effectiveness for cross-modal deepfake detection and localization.
Conclusion: The method improves detection accuracy and temporal localization for multi-modal deepfake content, generalizes well across modalities, and offers a new direction for generalized deepfake detection.
Abstract: The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, showing good performance with a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and provide a novel way for generalized deepfake detection.
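The dual-branch output described above (a forgery probability and a pair of temporal offsets per time step) is typically turned into segments by a decoding step. The sketch below shows one common way to do that, not the authors' code: the offset convention (distance in frames to the segment start and end), the score threshold, the greedy temporal NMS, and all function names are assumptions for illustration.

```python
import numpy as np


def temporal_iou(a, b):
    """IoU of two (start, end, score) segments along the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def decode_segments(probs, offsets, fps=25.0, score_thresh=0.5):
    """Convert per-frame predictions into candidate forged segments.

    probs: (T,) forgery probability per frame.
    offsets: (T, 2) assumed distances, in frames, from frame t to the
    start and end of the manipulated segment that contains it.
    Returns a list of (start_seconds, end_seconds, score).
    """
    probs = np.asarray(probs, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    segments = []
    for t in np.flatnonzero(probs >= score_thresh):
        start = max(float(t - offsets[t, 0]), 0.0) / fps
        end = float(t + offsets[t, 1]) / fps
        if end > start:
            segments.append((start, end, float(probs[t])))
    return segments


def temporal_nms(segments, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring segment among overlapping ones."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_thresh for k in kept):
            kept.append(seg)
    return kept
```

With a feature pyramid, the same decoding would run per level with the offsets scaled by that level's stride before the candidates are merged and passed to NMS.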
Our code is available at https://github.com/Zig-HS/MM-DDL
[112] Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric
Zhaolin Wan,Yining Diao,Jingqi Xu,Hao Wang,Zhiyang Li,Xiaopeng Fan,Wangmeng Zuo,Debin Zhao
Main category: cs.CV
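TL;DR: Presents 3DGS-QA, the first subjective quality assessment dataset for 3D Gaussian Splatting (3DGS), together with a no-reference quality prediction model that operates on native 3D Gaussian primitives and effectively assesses perceptual quality under different degradation conditions.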
Details
Motivation: Although 3D Gaussian Splatting (3DGS) excels at real-time, high-quality rendering, its perceptual quality under different reconstruction conditions has not been studied systematically. In particular, the effects of viewpoint sparsity, insufficient training iterations, downsampling, noise, and color distortion on visual experience have not been quantified.
Method: Builds the 3DGS-QA dataset with 225 degraded reconstructions across 15 object types, and proposes a no-reference quality prediction model that extracts spatial and photometric features directly from 3D Gaussian primitives and estimates perceptual quality in a structure-aware framework, without rendered images or ground-truth references.
Result: Experiments show that the proposed model outperforms traditional and learning-based quality assessment methods under a range of degradation conditions, with stronger robustness and accuracy; the dataset also serves as a benchmark for existing QA methods.
Conclusion: The work fills a gap in perceptual quality assessment for 3DGS, and the proposed model and dataset support follow-up research on optimizing and improving the quality of 3DGS content.
Abstract: With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation.
The dataset and code are made publicly available at https://github.com/diaoyn/3DGSQA to facilitate future research in 3DGS quality assessment.
[113] WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation
Gongshu Wang,Zhirui Wang,Kan Yang
Main category: cs.CV
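TL;DR: Proposes WEDepth, a method that improves monocular depth estimation without modifying the structure or weights of vision foundation models, reaching SOTA on multiple datasets with strong zero-shot transfer.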
Details
Motivation: Monocular depth estimation is challenging because recovering a 3D scene from a single 2D image is ill-posed. Existing methods rely on fine-tuning vision foundation models, which changes their structure or weights and limits effective use of their prior knowledge.
Method: Uses the vision foundation model as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels without modifying the model structure or pretrained weights, thereby eliciting and exploiting its inherent priors.
Result: Sets new SOTA performance on the NYU-Depth v2 and KITTI datasets, outperforming diffusion models that need multiple forward passes and methods pre-trained on relative depth, and shows strong zero-shot transfer across scenes.
Conclusion: By keeping the vision foundation model intact and using its multi-level priors, WEDepth achieves high performance and good generalization in monocular depth estimation, offering an efficient and flexible solution for downstream tasks.
Abstract: Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate our method exhibits strong zero-shot transfer capability across diverse scenarios.
[114] ProSona: Prompt-Guided Personalization for Multi-Expert Medical Image Segmentation
Aya Elgebaly,Nikolaos Delopoulos,Juliane Hörner-Rieber,Carolin Rippke,Sebastian Klüter,Luca Boldrini,Lorenzo Placidi,Riccardo Dal Bello,Nicolaus Andratschke,Michael Baumgartl,Claus Belka,Christopher Kurz,Guillaume Landry,Shadi Albarqouni
Main category: cs.CV
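TL;DR: Proposes ProSona, a two-stage framework that learns a continuous latent space of annotation styles and uses natural language prompts for controllable, personalized medical image segmentation, outperforming existing methods on lung nodule and prostate MRI datasets.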
Details
Motivation: Medical image segmentation shows substantial inter-observer variability, especially in lung nodule delineation where experts disagree, and existing methods struggle to model this diversity and deliver personalized segmentations.
Method: Proposes the ProSona framework: in the first stage, a probabilistic U-Net models multiple expert hypotheses and learns a latent space covering diverse annotation styles; in the second stage, a projection mechanism guided by natural language prompts navigates this latent space to generate personalized segmentations, with a multi-level contrastive objective aligning textual and visual representations.
Result: On the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona.
Conclusion: Natural language prompts can give flexible, accurate, and interpretable control over personalized medical image segmentation; by modeling a continuous space of annotation styles, ProSona captures expert variability and semantic differences.
Abstract: Automated medical image segmentation suffers from high inter-observer variability, particularly in tasks such as lung nodule delineation, where experts often disagree. Existing approaches either collapse this variability into a consensus mask or rely on separate model branches for each annotator. We introduce ProSona, a two-stage framework that learns a continuous latent space of annotation styles, enabling controllable personalization via natural language prompts. A probabilistic U-Net backbone captures diverse expert hypotheses, while a prompt-guided projection mechanism navigates this latent space to generate personalized segmentations. A multi-level contrastive objective aligns textual and visual representations, promoting disentangled and interpretable expert styles. Across the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona. These results demonstrate that natural-language prompts can provide flexible, accurate, and interpretable control over personalized medical image segmentation. Our implementation is available online.
[115] Generalized-Scale Object Counting with Gradual Query Aggregation
Jer Pelhan,Alan Lukezic,Matej Kristan
Main category: cs.CV
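TL;DR: Proposes GECO2, an end-to-end few-shot counting and detection method that aggregates exemplar features across scales to handle multi-scale objects and dense small objects, outperforming existing methods in both accuracy and efficiency.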
Details
Motivation: Existing few-shot counting methods perform poorly on multi-scale objects and on dense regions of small objects, mainly because of the trade-off between multi-scale feature fusion and computational cost.
Method: Proposes GECO2, which uses a new dense query representation that gradually aggregates exemplar-specific feature information across scales to produce high-resolution dense queries, enabling unified detection of large and small objects.
Result: GECO2 surpasses current state-of-the-art methods by 10% in counting and detection accuracy, runs 3x faster, and uses less GPU memory.
Conclusion: GECO2 addresses the multi-scale and dense small-object challenges in few-shot counting, combining high accuracy with efficiency and broad application potential.
Abstract: Few-shot detection-based counters estimate the number of instances in the image specified only by a few test-time exemplars. A common approach to localize objects across multiple sizes is to merge backbone features of different resolutions. Furthermore, to enable small object detection in densely populated regions, the input image is commonly upsampled and tiling is applied to cope with the increased computational and memory requirements. Because of these ad-hoc solutions, existing counters struggle with images containing diverse-sized objects and densely populated regions of small objects. We propose GECO2, an end-to-end few-shot counting and detection method that explicitly addresses the object scale issues. A new dense query representation gradually aggregates exemplar-specific feature information across scales that leads to high-resolution dense queries that enable detection of large as well as small objects. GECO2 surpasses state-of-the-art few-shot counters in counting as well as detection accuracy by 10% while running 3x faster with a smaller GPU memory footprint.
[116] Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching
Aditi Singhania,Arushi Jain,Krutik Malani,Riddhi Dhawan,Souymodip Chakraborty,Vineet Batra,Ankit Phogat
Main category: cs.CV
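TL;DR: Proposes a subject-driven image generation method based on a LoRA fine-tuned diffusion model that balances strong identity consistency with high prompt diversity through latent concatenation and masked conditional flow matching, introduces a two-stage distilled data curation framework for large-scale training, and presents the CHARIS evaluation framework for fine-grained quality assessment.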
Details
Motivation: In subject-driven image generation, achieving high-quality results across diverse prompts while keeping the subject's identity consistent is a key challenge, and existing methods often fail to deliver both.
Method: Fine-tunes a diffusion model with LoRA, concatenates reference and target images in latent space, and adds a masked Conditional Flow Matching (CFM) objective. Also proposes a two-stage Distilled Data Curation Framework: the first stage uses data restoration and vision-language-model filtering to build a high-quality seed dataset, and the second stage uses it for parameter-efficient fine-tuning.
Result: The method achieves strong identity preservation and generation diversity without modifying the network architecture; the CHARIS framework evaluates it along five axes: identity consistency, prompt adherence, color fidelity, visual quality, and transformation diversity.
Conclusion: The method addresses the trade-off between identity consistency and prompt diversity and scales well in practice, and CHARIS provides a fine-grained evaluation standard for future subject-driven generation work.
Abstract: Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
[117] I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks
Ruichen Ma,Liwei Meng,Guanchao Qiao,Ning Ning,Yang Liu,Shaogang Hu
Main category: cs.CV
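TL;DR: Proposes I2E, a framework that efficiently converts static images into high-fidelity event streams by simulating microsaccadic eye movements, substantially improving data availability and performance for spiking neural network training.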
Details
Motivation: Spiking neural networks (SNNs) are held back by a shortage of event-stream data, and existing data generation methods are inefficient, which hinders SNN development and adoption.
Method: Proposes the I2E algorithmic framework, which simulates microsaccadic eye movements with a highly parallelized convolution to generate event streams, converting over 300x faster than prior methods and supporting on-the-fly data augmentation.
Result: Effectiveness is demonstrated on large-scale benchmarks: an SNN trained on I2E-ImageNet reaches a state-of-the-art accuracy of 60.50%, and pre-training on synthetic I2E data followed by fine-tuning reaches 92.5% accuracy on CIFAR10-DVS.
Conclusion: I2E gives SNNs a scalable data generation solution and shows that synthetic event data can serve as a high-fidelity proxy for real sensor data, supporting progress in neuromorphic computing systems.
Abstract: Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework's effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.
[118] Radar-APLANC: Unsupervised Radar-based Heartbeat Sensing via Augmented Pseudo-Label and Noise Contrast
Ying Wang,Zhaodong Sun,Xu Cheng,Zuxian He,Xiaobai Li
Main category: cs.CV
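TL;DR: Proposes Radar-APLANC, an unsupervised framework for radar-based heartbeat sensing that uses augmented pseudo-labels and noise contrast to match existing supervised methods without costly ground-truth signals.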
Details
Motivation: Traditional radar-based heartbeat detection degrades under noise, while learning-based methods are more robust but depend on expensive labeled data. An efficient unsupervised method that needs no ground-truth physiological signals is therefore required.
Method: Uses the heartbeat range and the noise range in the radar range matrix to build positive and negative samples respectively, and designs a Noise-Contrastive Triplet (NCT) loss trained with pseudo-labels generated by a traditional method, together with an adaptive, noise-aware pseudo-label augmentation strategy.
Result: Experiments on the Equipleth dataset and a self-collected radar dataset show that the unsupervised method performs close to state-of-the-art supervised methods.
Conclusion: Radar-APLANC offers an efficient, low-cost unsupervised solution for non-contact heartbeat sensing that reduces dependence on labeled data and improves noise robustness.
Abstract: Frequency Modulated Continuous Wave (FMCW) radars can measure subtle chest wall oscillations to enable non-contact heartbeat sensing. However, traditional radar-based heartbeat sensing methods face performance degradation due to noise. Learning-based radar methods achieve better noise robustness but require costly labeled signals for supervised training. To overcome these limitations, we propose the first unsupervised framework for radar-based heartbeat sensing via Augmented Pseudo-Label and Noise Contrast (Radar-APLANC). We propose to use both the heartbeat range and noise range within the radar range matrix to construct the positive and negative samples, respectively, for improved noise robustness. Our Noise-Contrastive Triplet (NCT) loss only utilizes positive samples, negative samples, and pseudo-label signals generated by the traditional radar method, thereby avoiding dependence on expensive ground-truth physiological signals. We further design a pseudo-label augmentation approach featuring adaptive noise-aware label selection to improve pseudo-label signal quality. Extensive experiments on the Equipleth dataset and our collected radar dataset demonstrate that our unsupervised method achieves performance comparable to state-of-the-art supervised methods. Our code, dataset, and supplementary materials can be accessed from https://github.com/RadarHRSensing/Radar-APLANC.
[119] CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
Cameron Braunstein,Mariya Toneva,Eddy Ilg
Main category: cs.CV
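TL;DR: This study examines whether latent diffusion models such as Stable Diffusion hold human-interpretable semantic representations during text-to-image generation, and finds that the semantic information comes mainly from CLIP's text encoding rather than from the diffusion process itself.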
Details
Motivation: To understand whether diffusion models genuinely grasp the semantics of the images they generate, where in the model that semantic information originates, and how strongly it is represented. Method: Probes Stable Diffusion with simple regression layers that predict semantic attributes of objects, and evaluates the predictions against human annotations. Result: Accurate decoding of semantic information depends mainly on CLIP's text encoding; decoding accuracy differs markedly across groups of specific semantic attributes; and attributes become harder to distinguish during the inverse diffusion process. Conclusion: The separately trained CLIP vision-language model determines the human-like semantic representation, while the diffusion process mainly acts as a visual decoder. Abstract: Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another during the inverse diffusion process, further demonstrating the strongest semantic representation of object attributes in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.[120] Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis
Aditi Singhania,Krutik Malani,Riddhi Dhawan,Arushi Jain,Garv Tandon,Nippun Sharma,Souymodip Chakraborty,Vineet Batra,Ankit Phogat
Main category: cs.CV
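TL;DR: Proposes Beyond the Pixels, a hierarchical evaluation framework for fine-grained assessment of identity preservation in generative models; it guides vision-language models through structured reasoning, markedly improves agreement with human judgments, and is released with a new benchmark.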
Details
Motivation: Existing metrics for identity preservation in generative models rely on global embeddings or coarse prompting, so they struggle to capture fine-grained identity changes and offer little diagnostic insight. Method: Decomposes identity assessment into feature-level transformations using a (type, style) -> attribute -> feature decision tree, replaces abstract similarity scores with prompts for concrete transformations, and uses vision-language models for structured reasoning. Result: The framework is validated on four state-of-the-art generative models and aligns closely with human judgments; a new benchmark of 1,078 image-prompt pairs is built, covering diverse subject types and multiple transformation axes. Conclusion: The hierarchical framework evaluates identity preservation in generated images more accurately and reliably, reduces hallucinations, provides interpretable diagnostics, and sets a new standard for evaluating future generative models. Abstract: Evaluating identity preservation in generative models remains a critical yet unresolved challenge. Existing metrics rely on global embeddings or coarse VLM prompting, failing to capture fine-grained identity changes and providing limited diagnostic insight. We introduce Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations. Our approach guides VLMs through structured reasoning by (1) hierarchically decomposing subjects into a (type, style) -> attribute -> feature decision tree, and (2) prompting for concrete transformations rather than abstract similarity scores. This decomposition grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency. We validate our framework across four state-of-the-art generative models, demonstrating strong alignment with human judgments in measuring identity consistency. Additionally, we introduce a new benchmark specifically designed to stress-test generative models. It comprises 1,078 image-prompt pairs spanning diverse subject types, including underrepresented categories such as anthropomorphic and animated characters, and captures an average of six to seven transformation axes per prompt.[121] StableMorph: High-Quality Face Morph Generation with Stable Diffusion
Wassim Kabbani,Kiran Raja,Raghavendra Ramachandra,Christoph Busch
Main category: cs.CV
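TL;DR: Proposes StableMorph, a new diffusion-based method for generating face morphs that produces high-quality, artifact-free, realistic morphed images, markedly increases their ability to fool face recognition systems, and gives morphing attack detection research a test benchmark closer to real threats.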
Details
Motivation: Existing morph generation methods often produce blurry, artifact-ridden, or poorly structured images that are easy to detect and do not represent the most dangerous real-world attacks, so higher-quality morphs are needed to evaluate and develop morphing attack detection systems. Method: Proposes StableMorph, which uses modern diffusion-based image synthesis to generate full-head, highly detailed, artifact-free morphed face images with a high degree of control over visual attributes. Result: Experiments show that StableMorph images rival or exceed the quality of genuine face images, fool face recognition systems effectively, and pose a greater challenge to existing morphing attack detection methods. Conclusion: StableMorph sets a new standard for morph quality, improves the evaluation of biometric security, and supports the development of more robust morphing attack detection systems. Abstract: Face morphing attacks threaten the integrity of biometric identity systems by enabling multiple individuals to share a single identity. To develop and evaluate effective morphing attack detection (MAD) systems, we need access to high-quality, realistic morphed images that reflect the challenges posed in real-world scenarios. However, existing morph generation methods often produce images that are blurry, riddled with artifacts, or poorly constructed, making them easy to detect and not representative of the most dangerous attacks. In this work, we introduce StableMorph, a novel approach that generates highly realistic, artifact-free morphed face images using modern diffusion-based image synthesis. Unlike prior methods, StableMorph produces full-head images with sharp details, avoids common visual flaws, and offers unmatched control over visual attributes. Through extensive evaluation, we show that StableMorph images not only rival or exceed the quality of genuine face images but also maintain a strong ability to fool face recognition systems, posing a greater challenge to existing MAD solutions and setting a new standard for morph quality in research and operational testing. StableMorph improves the evaluation of biometric security by creating more realistic and effective attacks and supports the development of more robust detection systems.[122] Introducing Nylon Face Mask Attacks: A Dataset for Evaluating Generalised Face Presentation Attack Detection
Manasa,Sushrut Patwardhan,Narayan Vetrekar,Pavan Kumar,R. S. Gad,Raghavendra Ramachandra
Main category: cs.CV
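TL;DR: Introduces the Nylon Face Mask (NFM), a new 3D presentation attack instrument, and builds a large-scale dataset of such attacks to evaluate the robustness of existing face presentation attack detection methods under unseen attack conditions.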
Details
Motivation: Face recognition systems are vulnerable to presentation attacks (PAs), and new 3D spoofing instruments such as Nylon Face Masks (NFMs), with their photorealistic appearance and elastic structure, seriously threaten current anti-spoofing techniques and call for targeted study. Method: Designs and releases a new dataset captured with an iPhone 11 Pro, containing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four presentation scenarios (involving both humans and mannequins), and benchmarks five state-of-the-art PAD methods on it. Result: Existing PAD methods degrade markedly on NFM attacks and vary widely in performance, indicating that current techniques generalize poorly to this new kind of 3D spoofing attack. Conclusion: NFMs are a realistic and serious new presentation attack instrument against which existing PAD methods offer limited defense, underscoring the urgent need for anti-spoofing techniques that generalize better. Abstract: Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim's facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.[123] LatentPrintFormer: A Hybrid CNN-Transformer with Spatial Attention for Latent Fingerprint identification
Arnab Maity,Manasa,Pavan Kumar C,Raghavendra Ramachandra
Main category: cs.CV
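TL;DR: Proposes LatentPrintFormer, a new latent fingerprint identification method that combines a CNN and a Transformer, strengthens feature extraction with a spatial attention mechanism, and outperforms existing methods on public datasets.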
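The matching step named in this entry (cosine similarity over embeddings in a closed-set identification setting) can be sketched in plain Python as below. The three-dimensional embeddings, the identity names, and the helper names `rank_gallery` and `hit_at_rank` are hypothetical; the paper itself uses 512-dimensional embeddings produced by its network, which this sketch does not reproduce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(probe, gallery):
    """Return gallery identities ordered from most to least similar to the probe."""
    scored = [(cosine_similarity(probe, emb), identity)
              for identity, emb in gallery.items()]
    scored.sort(reverse=True)
    return [identity for _, identity in scored]


def hit_at_rank(probe, true_identity, gallery, k):
    """True if the correct identity appears among the top-k gallery candidates."""
    return true_identity in rank_gallery(probe, gallery)[:k]


# Hypothetical toy gallery: one enrolled embedding per identity (closed set).
gallery = {
    "subject_a": [1.0, 0.0, 0.0],
    "subject_b": [0.0, 1.0, 0.0],
    "subject_c": [0.7, 0.7, 0.0],
}
probe = [0.9, 0.2, 0.0]  # embedding of a latent print to identify
ranking = rank_gallery(probe, gallery)
```

A Rank-k identification rate is then the fraction of probes for which `hit_at_rank(..., k)` is true, averaged over a test set.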
Details
Motivation: Latent fingerprint identification remains challenging because of low image quality, background noise, and partial impressions. Method: Combines EfficientNet-B0 and Swin Tiny as backbones, adds a spatial attention module to emphasize high-quality ridge regions and suppress noise, then fuses the features and projects them into a 512-dimensional embedding space for matching. Result: Experiments on two public datasets show that the method consistently outperforms three state-of-the-art techniques in Rank-10 identification rate. Conclusion: LatentPrintFormer effectively improves latent fingerprint identification performance and shows strong robustness and application potential. Abstract: Latent fingerprint identification remains a challenging task due to low image quality, background noise, and partial impressions. In this work, we propose a novel identification approach called LatentPrintFormer. The proposed model integrates a CNN backbone (EfficientNet-B0) and a Transformer backbone (Swin Tiny) to extract both local and global features from latent fingerprints. A spatial attention module is employed to emphasize high-quality ridge regions while suppressing background noise. The extracted features are fused and projected into a unified 512-dimensional embedding, and matching is performed using cosine similarity in a closed-set identification setting. Extensive experiments on two publicly available datasets demonstrate that LatentPrintFormer consistently outperforms three state-of-the-art latent fingerprint recognition techniques, achieving higher identification rates across Rank-10.[124] Foam Segmentation in Wastewater Treatment Plants: A Federated Learning Approach with Segment Anything Model 2
Mehmet Batuhan Duman,Alejandro Carnero,Cristian Martín,Daniel Garrido,Manuel Díaz
Main category: cs.CV
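TL;DR: Proposes a framework that combines Federated Learning (FL) with Segment Anything Model 2 (SAM2) to segment and monitor foam in wastewater treatment plants automatically and in real time while preserving privacy.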
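The server-side step in this entry, aggregating client model weights without seeing client data, can be sketched as a sample-size-weighted average in the style of federated averaging. This is a minimal plain-Python illustration under assumptions: the digest does not say which aggregation rule the authors use, it does not call the Flower API, and the function name, the flat weight lists, and the client sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client weight vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Hypothetical round: two plants send flattened weights and local dataset sizes.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]  # the second client holds three times as many images
global_weights = federated_average(client_weights, client_sizes)
```

Only weight vectors and sample counts cross the network in this scheme, which is what lets the raw images stay on each plant's own node.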
Details
Motivation: Foam in wastewater treatment plants reduces treatment efficiency and raises costs, and existing machine learning methods are hard to deploy because labeled data are scarce and plants keep their data in silos. Method: Adopts the federated learning paradigm, fine-tuning SAM2 on multiple distributed clients with the Flower framework while a Fog server aggregates the model weights; real, synthetic, and public datasets are used for training and validation. Result: The framework speeds up training convergence with limited data, improves segmentation performance, and generalizes well. Conclusion: The method provides a scalable, privacy-preserving, and efficient solution for image segmentation in distributed, sensitive industrial settings, and demonstrates the practical potential of combining foundation models with federated learning. Abstract: Foam formation in Wastewater Treatment Plants (WTPs) is a major challenge that can reduce treatment efficiency and increase costs. The ability to automatically examine changes in real-time with respect to the percentage of foam can be of great benefit to the plant. However, large amounts of labeled data are required to train standard Machine Learning (ML) models. The development of these systems is slow due to the scarcity and heterogeneity of labeled data. Additionally, the development is often hindered by the fact that different WTPs do not share their data due to privacy concerns. This paper proposes a new framework to address these challenges by combining Federated Learning (FL) with the state-of-the-art base model for image segmentation, Segment Anything Model 2 (SAM2). The FL paradigm enables collaborative model training across multiple WTPs without centralizing sensitive operational data, thereby ensuring privacy. The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2's strong pre-trained weights for initialization. The methodology involves fine-tuning SAM2 on distributed clients (edge nodes) using the Flower framework, where a central Fog server orchestrates the process by aggregating model weights without accessing private data. The model was trained and validated using various data collections, including real-world images captured at a WTP in Granada, Spain, a synthetically generated foam dataset, and images from publicly available datasets to improve generalization. This research offers a practical, scalable, and privacy-aware solution for automatic foam tracking in WTPs. 
The findings highlight the significant potential of integrating large-scale foundational models into FL systems to solve real-world industrial challenges characterized by distributed and sensitive data.[125] OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition
Lixu Sun,Nurmemet Yolwas,Wushour Silamu
Main category: cs.CV
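TL;DR: Proposes OTSNet, a neurocognitive-inspired three-stage framework for scene text recognition that unifies visual and linguistic modeling through an Observation-Thinking-Spelling pipeline and reaches state-of-the-art recognition accuracy on multiple benchmarks.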
Details
Motivation: Existing scene text recognition methods optimize the visual and linguistic modalities separately, which causes cross-modal misalignment, leaves them vulnerable to background distractors and text deformation, and makes irregular text hard to recognize accurately. Method: Proposes OTSNet with three modules: a Dual Attention Macaron Encoder (DAME) that suppresses irrelevant regions and enhances key features; a Position-Aware Module and Semantic Quantizer that jointly model spatial context and abstract glyph-level semantics; and a Multi-Modal Collaborative Verifier that performs cross-modal feature fusion and self-correction. Result: Achieves 83.5% average accuracy on the Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, with the best performance in 9 of 14 evaluation scenarios. Conclusion: By mimicking human visual cognition, OTSNet mitigates cross-modal misalignment and attention bias and markedly improves the robustness and accuracy of irregular text recognition in complex environments. Abstract: Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.[126] PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging Conditions
Luoping Cui,Hanqing Liu,Mingjie Liu,Endian Lin,Donghong Jiang,Yuhao Wang,Chuang Zhu
Main category: cs.CV
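TL;DR: Proposes PEOD, the first large-scale, high-resolution Event-RGB object detection dataset, with 130+ spatiotemporally aligned sequences and 340k manually annotated boxes, 57% of the data covering challenging conditions such as low light, overexposure, and high-speed motion; a benchmark of 14 methods reveals the limits of existing fusion methods under extreme illumination.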
Details
Motivation: Existing Event-RGB datasets cover extreme conditions sparsely and have low spatial resolution (<= 640 x 480), so they cannot fully evaluate detectors in challenging scenarios. Method: Builds PEOD, a large-scale, pixel-aligned, high-resolution (1280 x 720) Event-RGB dataset with 130+ spatiotemporally aligned sequences and 340k manually annotated bounding boxes, 57% of it captured under low light, overexposure, and high-speed motion, and benchmarks 14 detection methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion). Result: Fusion models perform best on the full test set and the normal subset; on the illumination challenge subset, however, the best event-based model outperforms all fusion models, while fusion models still beat RGB-only models, showing the limits of current fusion methods when the frame modality is severely degraded. Conclusion: PEOD provides a realistic, high-quality benchmark for multimodal perception, advances research on object detection in complex scenes, and exposes the weaknesses of existing fusion methods under extreme illumination. Abstract: Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (<= 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporal-aligned sequences and 340k manual bounding boxes, with 57% of data captured under low-light, overexposure, and high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and normal subset, fusion-based models achieve the best performance. However, on the illumination challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.[127] Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation
Jun Sun,Xinxin Zhang,Simin Hong,Jian Zhu,Xiang Gao
Main category: cs.CV
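TL;DR: Proposes Boomda, an efficient algorithm for heterogeneous multimodal domain adaptation that learns per-modality representations with the information bottleneck and correlation alignment, and uses multi-objective optimization to balance domain alignment across modalities.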
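Correlation alignment, which this entry uses to match source and target domains in representation space, is commonly written as the squared Frobenius distance between the two feature covariance matrices, scaled by 1/(4d^2). The plain-Python sketch below follows that common formulation as an assumption; the digest does not state the exact loss Boomda uses, and the function names and toy feature batches are hypothetical.

```python
def covariance(features):
    """Unbiased sample covariance matrix of a list of equal-length feature rows."""
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in features:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    return [[value / (n - 1) for value in cov_row] for cov_row in cov]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, divided by 4 * d^2."""
    d = len(source[0])
    cov_s = covariance(source)
    cov_t = covariance(target)
    squared = sum((cov_s[i][j] - cov_t[i][j]) ** 2
                  for i in range(d) for j in range(d))
    return squared / (4 * d * d)


# Hypothetical two-dimensional feature batches for one modality.
source_feats = [[0.0, 0.0], [2.0, 2.0]]    # positively correlated features
target_feats = [[0.0, 0.0], [2.0, -2.0]]   # negatively correlated features
loss = coral_loss(source_feats, target_feats)
```

In a multimodal setting one such term would be computed per modality, and balancing those per-modality terms is the part the entry frames as a multi-objective problem.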
Details
Motivation: Multimodal learning suffers from scarce annotated data, and unsupervised domain adaptation remains underexplored in multimodal settings, especially when different modalities undergo different domain shifts between the source and target domains. Method: Uses the information bottleneck method to learn a representation for each modality independently, matches the source and target domains in representation space with correlation alignment, and formulates multimodal domain alignment as a multi-objective optimization problem that reduces to a quadratic program and, after further approximation, to a closed-form solution. Result: Boomda outperforms existing methods across experiments, confirming its effectiveness and efficiency for multimodal domain adaptation. Conclusion: By balancing domain alignment across modalities, Boomda achieves efficient multimodal unsupervised domain adaptation and offers a new solution for heterogeneous multimodal learning. Abstract: Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features Balanced multi-objective optimization for multimodal domain adaptation, termed Boomda. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is available at: https://github.com/sunjunaimer/Boomda.git.[128] Non-Aligned Reference Image Quality Assessment for Novel View Synthesis
Abhijay Ghildyal,Rajesh Sureddi,Nabajeet Barman,Saman Zadtootaghaj,Alan Bovik
Main category: cs.CV
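TL;DR: Proposes a Non-Aligned Reference (NAR-IQA) framework for assessing the quality of novel view synthesis (NVS) images; without pixel-aligned references, it achieves superior performance using contrastive learning and LoRA-enhanced DINOv2 features, and training on a large-scale synthetic-distortion dataset improves generalization.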
Details
Motivation: Full-reference (FR-IQA) methods fail when the reference image is misaligned, and no-reference (NR-IQA) methods generalize poorly, so NVS image quality is hard to assess accurately; a framework is needed that can use references that are not aligned but share part of the scene content. Method: Builds a contrastive-learning NAR-IQA framework that extracts features with LoRA-enhanced DINOv2 and takes supervision from existing IQA methods; it is trained on a large-scale dataset of synthetic distortions targeting Temporal Regions of Interest (TROI), avoiding overfitting to real NVS samples to improve generalization. Result: The model outperforms existing FR-IQA, NR-IQA, and NAR-IQA methods on both aligned and non-aligned references and correlates strongly with ratings from a newly conducted user study. Conclusion: The proposed NAR-IQA framework addresses the quality-assessment problem caused by non-aligned references in NVS and shows good generalization and practical promise. Abstract: Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model's generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings. For dataset and code, please visit our project page: https://stootaghaj.github.io/nova-project/ [129] LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping
Chenying Liu,Wei Huang,Xiao Xiang Zhu
Main category: cs.CV
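TL;DR: Proposes LandSegmenter, a general foundation-model framework for land use and land cover (LULC) mapping; by building the large-scale multi-modal dataset LAS, designing a remote-sensing-specific adapter and a text encoder, and introducing a confidence-guided fusion strategy, it reduces reliance on labeled data while achieving strong zero-shot transfer performance.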
Details
Motivation: Existing LULC models are tied to a specific modality and a fixed class taxonomy and depend on large amounts of labeled data, so they generalize poorly, while task-agnostic foundation models still need fine-tuning; a general LULC model is needed that works across modalities and taxonomies with less annotation. Method: Proposes the LandSegmenter framework: (1) builds LAS, a large-scale multi-source, multi-modal dataset based on weak labels; (2) adds a remote-sensing-specific adapter to extract cross-modal features, combined with a text encoder to enhance semantic awareness; (3) applies a class-wise confidence-guided fusion strategy at the output to improve zero-shot performance. Result: Experiments on six precisely annotated datasets with different modalities and taxonomies show that LandSegmenter is competitive or superior in both transfer learning and zero-shot settings, with especially strong zero-shot transfer to unseen datasets. Conclusion: Building a task-specific foundation model through weak supervision, as LandSegmenter does, is feasible and effective, and offers a new way to cut annotation costs and improve generalization in remote sensing. Abstract: Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter's zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. 
Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.[130] Multi-Granularity Mutual Refinement Network for Zero-Shot Learning
Ning Wang,Long Yu,Cong Hua,Guangming Zhu,Lin Mei,Syed Afaq Ali Shah,Mohammed Bennamoun,Liang Zhang
Main category: cs.CV
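TL;DR: Proposes a Multi-Granularity Mutual Refinement Network (Mg-MRN) that improves visual-semantic representation in zero-shot learning through decoupled multi-granularity feature learning and cross-granularity feature interaction.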
Details
Motivation: Existing zero-shot learning methods usually overlook the intrinsic interactions between local region features, which limits the learning of transferable visual features. Method: Designs a multi-granularity feature extraction module to mine region-level discriminative features, and a cross-granularity feature fusion module to strengthen the interactions between region features at different granularities. Result: Experiments on three popular zero-shot learning benchmark datasets show that the method is competitive and markedly improves recognition performance. Conclusion: By modeling the interactions among multi-granularity region features, Mg-MRN enhances the discriminability and transferability of visual features and improves zero-shot classification. Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at https://github.com/NingWang2049/Mg-MRN.[131] KPLM-STA: Physically-Accurate Shadow Synthesis for Human Relighting via Keypoint-Based Light Modeling
Xinhui Yin,Qifei Li,Yilin Guo,Hongxia Xie,Xiaoli Zhang
Main category: cs.CV
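TL;DR: Proposes a new shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA) for realistic shadow generation in image composition, markedly improving the appearance realism and geometric precision of shadows.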
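The kind of explicit geometry this entry refers to (computing shadow angle, length, and position) can be illustrated with the basic right-triangle relation between an object's height, the light's elevation, and the shadow it casts on flat ground. The sketch below is that textbook relation only, not the paper's Shadow Triangle Algorithm, whose formulas are not given in this digest; the function name, the azimuth convention (direction of the light source measured from the x-axis in the ground plane), and the sample values are assumptions.

```python
import math


def shadow_on_ground(height, light_elevation_deg, light_azimuth_deg):
    """Shadow length and ground offset (dx, dy) for a vertical segment on flat ground.

    height: segment height above the ground plane.
    light_elevation_deg: angle of the light above the horizon, in (0, 90).
    light_azimuth_deg: ground-plane direction of the light source from the x-axis.
    """
    elevation = math.radians(light_elevation_deg)
    azimuth = math.radians(light_azimuth_deg)
    # Right triangle: tan(elevation) = height / shadow length.
    length = height / math.tan(elevation)
    # The shadow falls on the side opposite the light source.
    dx = -length * math.cos(azimuth)
    dy = -length * math.sin(azimuth)
    return length, dx, dy


# Hypothetical example: a 1.7-unit-tall keypoint lit from 45 degrees elevation.
length, dx, dy = shadow_on_ground(1.7, 45.0, 0.0)
```

Applying such a projection to each body keypoint would give a set of ground points that outline a shadow; a lower light elevation lengthens every offset.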
Details
Motivation: Shadows produced by existing diffusion models in image composition lack sufficient appearance realism and geometric precision, especially under complex human poses, so a more precise and physically plausible shadow generation method is needed. Method: Proposes KPLM, which models the human body with nine keypoints and one bounding block to achieve physically plausible shadow projection, combined with STA, which computes shadow angles, lengths, and spatial positions through explicit geometric formulations to improve geometric accuracy. Result: Experiments show state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, with effective generalization to multi-directional relighting scenarios such as those supported by IC-Light. Conclusion: The combined KPLM and STA framework markedly improves the visual realism and geometric precision of shadows in image composition and offers a more reliable shadow generation solution for diffusion-driven image editing. Abstract: Image composition aims to seamlessly integrate a foreground object into a background, where generating realistic and geometrically accurate shadows remains a persistent challenge. While recent diffusion-based methods have outperformed GAN-based approaches, existing techniques, such as the diffusion-based relighting framework IC-Light, still fall short in producing shadows with both high appearance realism and geometric precision, especially in composite images. To address these limitations, we propose a novel shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA). KPLM models articulated human bodies using nine keypoints and one bounding block, enabling physically plausible shadow projection and dynamic shading across joints, thereby enhancing visual realism. STA further improves geometric accuracy by computing shadow angles, lengths, and spatial positions through explicit geometric formulations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, and generalizes effectively to multi-directional relighting scenarios such as those supported by IC-Light.[132] Distributed Zero-Shot Learning for Visual Recognition
Zhi Chen,Yadan Luo,Zi Huang,Jingjing Li,Sen Wang,Xin Yu
Main category: cs.CV
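TL;DR: Proposes a Distributed Zero-Shot Learning (DistZSL) framework that handles data heterogeneity with a cross-node attribute regularizer and a global attribute-to-visual consensus, improving zero-shot learning performance in distributed settings.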
Details
Motivation: To counter the effect of data heterogeneity across distributed nodes on zero-shot learning and to fully exploit decentralized data to learn an effective model for unseen classes. Method: Introduces a cross-node attribute regularizer to stabilize the attribute feature space, and a global attribute-to-visual consensus mechanism to keep the mapping between attribute and visual features consistent across nodes. Result: Experiments show that DistZSL outperforms state-of-the-art methods when learning from distributed data and markedly improves zero-shot learning performance. Conclusion: The proposed DistZSL framework handles data heterogeneity effectively and achieves high-performance distributed zero-shot learning through stable V2A relationships. Abstract: In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space would be stable during learning, and thus facilitate the establishment of visual-to-attribute (V2A) relationships. Then, we introduce the global attribute-to-visual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.[133] VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Samet Hicsonmez,Abd El Rahman Shabayek,Djamila Aouada
Main category: cs.CV
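TL;DR: Proposes VLMDiff, a new unsupervised multi-class visual anomaly detection framework that combines a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM), using VLM-generated descriptions as conditioning for the LDM to improve anomaly localization and detection in multi-class real-world images; it markedly outperforms existing diffusion-based methods on the Real-IAD and COCO-AD datasets.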
Details
Motivation: Existing diffusion-based anomaly detection methods rely on synthetic noise, generalize poorly, and need a model per class, making them hard to scale to multi-class real-world scenes; a scalable, efficient multi-class method that needs no manual annotation is required. Method: Uses a pre-trained vision-language model (VLM) with a simple prompt to generate detailed descriptions of normal images, and uses these descriptions as additional conditioning when training a latent diffusion model (LDM), which learns a robust representation of multi-class normal images for unsupervised anomaly detection and localization. Result: Improves the pixel-level PRO metric by up to 25 points on Real-IAD and 8 points on COCO-AD, markedly outperforming state-of-the-art diffusion-based methods. Conclusion: By combining a VLM with an LDM, VLMDiff achieves scalable, efficient unsupervised multi-class visual anomaly detection without manual annotation or additional training, and shows good practical potential. Abstract: Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.[134] WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting
Kaitao Huang,Yan Yan,Jing-Hao Xue,Hanzi Wang
Main category: cs.CV
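TL;DR: Proposes WarpGAN, a new 3D GAN inversion method that adds depth-map-based warping and an image inpainting strategy (SVINet), reconstructing occluded regions with higher quality and better consistency when generating multiple views from a single-view image.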
Details
Motivation: Existing 3D GAN inversion methods focus on reconstructing visible regions and leave occluded regions to the generative prior, so quality suffers because the latent code carries too little information; a method that recovers occluded regions better is needed. Method: Uses a 3D GAN inversion encoder to map a single-view image to a latent code; warps to a novel view using the depth map generated by the 3D GAN; and designs SVINet, which combines the symmetry prior with multi-view image correspondence to inpaint occluded regions in the warped image. Result: Quantitative and qualitative experiments on multiple datasets show that the method surpasses state-of-the-art methods in both occluded-region quality and multi-view consistency. Conclusion: By combining warping with prior-based inpainting, WarpGAN markedly improves occluded-region generation in single-image 3D GAN inversion and offers a better solution for single-image novel view synthesis. Abstract: 3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to 3D GAN. Then, we perform warping to a novel view using the depth map generated by 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.[135] Pixel-level Quality Assessment for Oriented Object Detection
Yunhui Zhu,Buliao Huang
Main category: cs.CV
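TL;DR: Proposes PQA, a quality assessment framework based on pixel-level spatial consistency that addresses the structural coupling and overestimation problems of box-level IoU prediction in oriented object detection, markedly improving detection performance.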
Details
Motivation: Existing methods assess localization quality by predicting the IoU between the predicted box and the ground-truth box. Because the two estimates are structurally coupled, a poorly localized box can still receive a high IoU score, which hurts detection performance.
Method: A Pixel-level Quality Assessment (PQA) framework evaluates localization quality by measuring, for each pixel, how consistent its relative position to the predicted box is with its relative position to the ground-truth box. A new integration metric aggregates this pixel-level consistency into a single quality score.
Result: On HRSC2016 and DOTA, PQA improves a range of oriented detectors, e.g., +5.96% AP$_{50:95}$ for Rotated RetinaNet and +2.32% for STD.
Conclusion: By removing the similarity bias in box-level IoU prediction, PQA gives a more accurate estimate of localization quality and applies broadly to oriented object detectors with consistent gains.
Abstract: Modern oriented object detectors typically predict a set of bounding boxes and select the top-ranked ones based on estimated localization quality. Achieving high detection performance requires that the estimated quality closely aligns with the actual localization accuracy. To this end, existing approaches predict the Intersection over Union (IoU) between the predicted and ground-truth (GT) boxes as a proxy for localization quality. However, box-level IoU prediction suffers from a structural coupling issue: since the predicted box is derived from the detector's internal estimation of the GT box, the predicted IoU, which is based on their similarity, can be overestimated for poorly localized boxes. To overcome this limitation, we propose a novel Pixel-level Quality Assessment (PQA) framework, which replaces box-level IoU prediction with the integration of pixel-level spatial consistency. PQA measures the alignment between each pixel's relative position to the predicted box and its corresponding position to the GT box. By operating at the pixel level, PQA avoids directly comparing the predicted box with the estimated GT box, thereby eliminating the inherent similarity bias in box-level IoU prediction. Furthermore, we introduce a new integration metric that aggregates pixel-level spatial consistency into a unified quality score, yielding a more accurate approximation of the actual localization quality.
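The abstract does not give formulas for pixel-level spatial consistency or for the integration metric, so the following Python sketch is only one hypothetical reading of the idea. It maps each pixel into the normalized local frame of two oriented boxes and scores how closely the two relative positions agree. The box encoding `(cx, cy, w, h, theta)`, the exponential consistency function, and the plain mean used for aggregation are all assumptions made for illustration, not the paper's definitions.

```python
import math

def to_box_frame(px, py, box):
    """Express a pixel in the local frame of an oriented box.

    box = (cx, cy, w, h, theta) is an assumed encoding; the result is
    normalized so the box spans [-1, 1] along both of its own axes.
    """
    cx, cy, w, h, theta = box
    dx, dy = px - cx, py - cy
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate by -theta to undo the box orientation, then scale by half-extent.
    u = (dx * cos_t + dy * sin_t) / (w / 2)
    v = (-dx * sin_t + dy * cos_t) / (h / 2)
    return u, v

def pixel_consistency(px, py, pred_box, gt_box):
    """Hypothetical per-pixel consistency in (0, 1]; 1 means identical relative position."""
    up, vp = to_box_frame(px, py, pred_box)
    ug, vg = to_box_frame(px, py, gt_box)
    return math.exp(-math.hypot(up - ug, vp - vg))

def quality_score(pred_box, gt_box, pixels):
    """Aggregate per-pixel consistency with a plain mean (an assumed integration rule)."""
    scores = [pixel_consistency(x, y, pred_box, gt_box) for x, y in pixels]
    return sum(scores) / len(scores)

# Usage: a small pixel grid, one ground-truth box, and a shifted, rotated prediction.
pixels = [(x, y) for x in range(4, 17, 2) for y in range(6, 15, 2)]
gt = (10.0, 10.0, 8.0, 4.0, 0.3)
pred = (11.0, 10.5, 8.0, 4.0, 0.5)
perfect = quality_score(gt, gt, pixels)
shifted = quality_score(pred, gt, pixels)
```

The score never compares the two boxes directly; it compares where each pixel falls relative to each box, which is the property the abstract emphasizes.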
Extensive experiments on HRSC2016 and DOTA demonstrate that PQA can be seamlessly integrated into various oriented object detectors, consistently improving performance (e.g., +5.96% AP$_{50:95}$ on Rotated RetinaNet and +2.32% on STD).
[136] UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Zhen Yang,Wenyi Hong,Mingde Xu,Xinyue Fan,Weihan Wang,Jiele Cheng,Xiaotao Gu,Jie Tang
Main category: cs.CV
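TL;DR: Proposes UI2Code$^\text{N}$, a visual language model trained in multiple stages that supports interactive UI-to-code generation, editing, and polishing, significantly improving open-source performance on UI coding tasks.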
Details
Motivation: Existing automatic UI coding methods are weak in multimodal coding capability and make little use of iterative visual feedback, so they do not reflect real development workflows.
Method: UI2Code$^\text{N}$ is trained with staged pretraining, fine-tuning, and reinforcement learning. It unifies UI-to-code generation, UI editing, and UI polishing, and adds test-time scaling to support multi-turn interactive feedback.
Result: The model is the best open-source model on UI-to-code and UI polishing benchmarks and approaches closed-source models such as Claude-4-Sonnet and GPT-5.
Conclusion: The interactive UI-to-code paradigm and UI2Code$^\text{N}$ improve multimodal code generation and move automated UI development toward practical use.
Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
[137] UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets
Arnav Aditya,Nitin Kumar,Saurabh Shigwan
Main category: cs.CV
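TL;DR: Proposes a loss function that penalizes open-space regions using auxiliary datasets to identify samples of unknown classes, outperforming existing methods on several medical image datasets.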
Details
Motivation: Medical data are limited and annotation is costly, especially for emerging or rare diseases, so algorithms cannot cover every possible class. Effective open-set recognition is needed to decide whether a sample belongs to an unknown class not seen during training.
Method: Building on the observation that deep network features cluster around class means that form the vertices of a regular simplex, a new loss function penalizes open-space regions using auxiliary datasets, which improves the rejection of unknown classes.
Result: The method significantly outperforms state-of-the-art techniques on BloodMNIST, OCTMNIST, DermaMNIST, TissueMNIST, and a publicly available skin dataset.
Conclusion: Combined with auxiliary datasets, the proposed loss function effectively improves open-set recognition in medical image analysis and has strong practical value.
Abstract: Driven by advancements in deep learning, computer-aided diagnoses have made remarkable progress. However, outside controlled laboratory settings, algorithms may encounter several challenges. In the medical domain, these difficulties often stem from limited data availability due to ethical and legal restrictions, as well as the high cost and time required for expert annotations, especially in the face of emerging or rare diseases. In this context, open-set recognition plays a vital role by identifying whether a sample belongs to one of the known classes seen during training or should be rejected as an unknown. Recent studies have shown that features learned in the later stages of deep neural networks cluster around their class means, which themselves are arranged as individual vertices of a regular simplex [32]. The proposed method introduces a loss function designed to reject samples of unknown classes effectively by penalizing open space regions using auxiliary datasets. This approach achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, TissueMNIST) and a publicly available skin dataset [29], outperforming state-of-the-art techniques.
[138] Twist and Compute: The Cost of Pose in 3D Generative Diffusion
Kyle Fogarty,Jack Foster,Boqiao Zhang,Jing Yang,Cengiz Öztireli
Main category: cs.CV
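TL;DR: Large image-to-3D generative models show a strong canonical-view bias, so performance degrades when the input is rotated. A lightweight CNN that corrects input orientation restores performance, raising the question of whether model design should pursue modularity and symmetry awareness.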
Details
Motivation: To expose how opaque the inductive biases of large-scale image-to-3D generative models are, in particular their reliance on canonical views.
Method: Controlled experiments apply simple 2D rotations to the inputs of Hunyuan3D 2.0, and a lightweight CNN detects and corrects input orientation.
Result: Performance degrades under rotated inputs but is restored once the lightweight CNN corrects the orientation.
Conclusion: Scale alone may not solve viewpoint generalization; future work should explore modular, symmetry-aware designs.
Abstract: Despite their impressive results, large-scale image-to-3D generative models remain opaque in their inductive biases. We identify a significant limitation in image-conditioned 3D generative models: a strong canonical view bias. Through controlled experiments using simple 2D rotations, we show that the state-of-the-art Hunyuan3D 2.0 model can struggle to generalize across viewpoints, with performance degrading under rotated inputs. We show that this failure can be mitigated by a lightweight CNN that detects and corrects input orientation, restoring model performance without modifying the generative backbone. Our findings raise an important open question: Is scale enough, or should we pursue modular, symmetry-aware designs?
[139] Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone
Rizal Khoirul Anam
Main category: cs.CV
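TL;DR: Presents and evaluates a decoupled multimodal food recognition system that combines an EfficientNet-B4 visual backbone with the Gemini LLM for automated nutritional analysis and recipe generation, validated on a custom Chinese food dataset.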
Details
Motivation: Existing digital diet applications lack a systematic evaluation that covers both accuracy and generation quality, and public datasets carry cultural bias, which makes accurate recognition and nutritional analysis of Chinese food difficult.
Method: A decoupled multimodal pipeline compares combinations of vision backbones (EfficientNet-B4, VGG-16, and others) and language models of different sizes (Gemini vs. Gemma), and introduces a "Semantic Error Propagation" (SEP) measure to quantify how visual errors affect the generated output.
Result: EfficientNet-B4 reaches 89.0% Top-1 accuracy and Gemini scores 9.2/10 on factual accuracy. Overall system performance is limited by the perceptive accuracy of the visual front end, and misclassification among semantically similar classes is the most prominent problem.
Conclusion: Recognition accuracy of the visual module is the key bottleneck of a multimodal food analysis system. Future work should improve fine-grained discrimination of similar dishes to reduce semantic error propagation.
Abstract: The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google's Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for "Semantic Error Propagation" (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system's overall utility is fundamentally bottlenecked by the visual front-end's perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
[140] 2D Representation for Unguided Single-View 3D Super-Resolution in Real-Time
Ignasi Mas,Ivan Huerta,Ramon Morros,Javier Ruiz-Hidalgo
Main category: cs.CV
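TL;DR: Proposes 2Dto3D-SR, a framework that recasts single-view 3D super-resolution as a 2D representation problem, enabling real-time 3D super-resolution without high-resolution RGB guidance.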
Details
Motivation: Conventional 3D super-resolution methods depend on high-resolution RGB images or complex point-cloud processing, which limits their use when resources are constrained or RGB data are unavailable. A simpler, efficient method that needs no RGB guidance is required.
Method: The 2Dto3D-SR framework uses the Projected Normalized Coordinate Code (PNCC) to encode 3D geometry as a regular 2D image, so that mature 2D image super-resolution models can be applied directly. Swin Transformer and Vision Mamba architectures provide a high-accuracy and a high-efficiency variant, respectively.
Result: The Swin Transformer variant achieves state-of-the-art accuracy, while the Vision Mamba variant runs at real-time speed with competitive results.
Conclusion: The method offers a simple, flexible, and practical approach to 3D super-resolution, especially where high-resolution RGB data cannot be obtained.
Abstract: We introduce 2Dto3D-SR, a versatile framework for real-time single-view 3D super-resolution that eliminates the need for high-resolution RGB guidance. Our framework encodes 3D data from a single viewpoint into a structured 2D representation, enabling the direct application of existing 2D image super-resolution architectures. We utilize the Projected Normalized Coordinate Code (PNCC) to represent 3D geometry from a visible surface as a regular image, thereby circumventing the complexities of 3D point-based or RGB-guided methods. This design supports lightweight and fast models adaptable to various deployment environments. We evaluate 2Dto3D-SR with two implementations: one using Swin Transformers for high accuracy, and another using Vision Mamba for high efficiency. Experiments show the Swin Transformer model achieves state-of-the-art accuracy on standard benchmarks, while the Vision Mamba model delivers competitive results at real-time speeds. This establishes our geometry-guided pipeline as a surprisingly simple yet viable and practical solution for real-world scenarios, especially where high-resolution RGB data is inaccessible.
[141] Accurate and Efficient Surface Reconstruction from Point Clouds via Geometry-Aware Local Adaptation
Eito Ogawa,Taiga Hayami,Hiroshi Watanabe
Main category: cs.CV
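TL;DR: Proposes a method that adaptively adjusts the spacing and size of local regions based on point-cloud curvature to improve the accuracy and efficiency of surface reconstruction.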
Details
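Motivation: Existing local-region reconstruction methods usually use fixed spacing and size, so they cannot adapt to changes in geometric complexity, which limits adaptability and accuracy.
Method: The distribution density and size of local regions are adjusted according to the curvature of the input point cloud, with smaller and denser regions placed in high-curvature areas to improve the reconstruction of detail.
Result: Compared with conventional uniform sampling, the method significantly improves reconstruction accuracy while preserving computational efficiency, particularly in regions with complex geometry.
Conclusion: The curvature-driven adaptive sampling strategy improves the accuracy and generalization of local-region reconstruction and suits practical applications such as infrastructure inspection.
Abstract: Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.
[142] Remodeling Semantic Relationships in Vision-Language Fine-Tuning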
Xiangyang Wu,Liu Liu,Baosheng Yu,Jiayan Qiu,Zhenwei Shi
Main category: cs.CV
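TL;DR: Proposes a semantics- and relationship-based vision-language fine-tuning method that uses multilevel visual feature extraction, semantic grouping, and inheritable cross-attention, outperforming existing methods on visual question answering and image captioning.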
Details
Motivation: Existing vision-language fine-tuning methods overlook the semantic relationship information in the textual context, which leads to poor modality alignment.
Method: Multilevel semantic features are extracted from different vision encoders, visual features are projected to group related semantics, and inheritable cross-attention fuses image and text features while globally discarding redundant, weakly correlated visual relationships.
Result: The method is validated on eight foundation models and two downstream tasks (visual question answering and image captioning), where it outperforms existing methods.
Conclusion: The method improves multimodal alignment and fusion and markedly improves performance on vision-language understanding tasks.
Abstract: Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features to group related semantics, which are more likely to have relationships. Finally, we fuse the visual features with the textual features using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
[143] Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning
Chenyu Hu,Xiaotong Li,Hao Zhu,Biao Hou
Main category: cs.CV
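TL;DR: Proposes DiPVNet, a point cloud network that achieves rotation invariance and directional perception through a learnable local dot-product operator and a global direction-aware Fourier transform, with strong results on classification and segmentation.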
Details
Motivation: Arbitrary rotations disrupt the intrinsic directional characteristics of point clouds, and existing methods do not fully exploit multiscale directional information, which limits representation learning.
Method: An atomic dot-product operator underlies a Learnable Local Dot-Product (L2DP) operator at the local level. At the global level, generalized harmonic analysis yields a direction-aware spherical Fourier transform (DASFT) that forms a global directional response spectrum. Both operators are proven rotation invariant.
Result: In challenging scenarios with noise and large-angle rotations, DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation.
Conclusion: DiPVNet combines rotation invariance with directional perception, and its multiscale directional modeling markedly improves point cloud representation learning.
Abstract: Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud orientations, posing a long-standing challenge for effective representation learning. The core of this issue is the disruption of the point cloud's intrinsic directional characteristics caused by rotational perturbations. Recent methods attempt to implicitly model rotational equivariance and invariance, preserving directional information and propagating it into deep semantic spaces. Yet, they often fall short of fully exploiting the multiscale directional nature of point clouds to enhance feature representations. To address this, we propose the Direction-Perceptive Vector Network (DiPVNet). At its core is an atomic dot-product operator that simultaneously encodes directional selectivity and rotation invariance, endowing the network with both rotational symmetry modeling and adaptive directional perception. At the local level, we introduce a Learnable Local Dot-Product (L2DP) Operator, which enables interactions between a center point and its neighbors to adaptively capture the non-uniform local structures of point clouds. At the global level, we leverage generalized harmonic analysis to prove that the dot-product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT). This leads to the construction of a global directional response spectrum for modeling holistic directional structures. We rigorously prove the rotation invariance of both operators.
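The rotation invariance claimed for dot-product operators rests on a standard identity: for any rotation $R$, $(Ra)\cdot(Rb) = a\cdot b$. The Python sketch below checks this numerically on the pairwise dot products of neighbor offsets around a center point. It illustrates only the underlying identity; the function `neighbor_dot_features` is a hypothetical stand-in and is not the L2DP or DASFT operator, whose learnable parts the abstract does not specify.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(v, axis, angle):
    """Rotate v about a unit-length axis using Rodrigues' rotation formula."""
    c, s = math.cos(angle), math.sin(angle)
    k_cross_v = cross(axis, v)
    k_dot_v = dot(axis, v)
    return tuple(v[i] * c + k_cross_v[i] * s + axis[i] * k_dot_v * (1 - c)
                 for i in range(3))

def neighbor_dot_features(center, neighbors):
    """Pairwise dot products of neighbor offsets from the center (hypothetical feature)."""
    offsets = [tuple(n[i] - center[i] for i in range(3)) for n in neighbors]
    return [dot(a, b) for a in offsets for b in offsets]

# Usage: rotate the whole neighborhood and compare features before and after.
center = (0.5, -1.0, 2.0)
neighbors = [(1.0, 0.0, 2.5), (0.0, -1.5, 1.0), (2.0, 1.0, 0.0)]
inv_sqrt3 = 1.0 / math.sqrt(3.0)
axis = (inv_sqrt3, inv_sqrt3, inv_sqrt3)
angle = 1.1

rotated_center = rotate(center, axis, angle)
rotated_neighbors = [rotate(n, axis, angle) for n in neighbors]

before = neighbor_dot_features(center, neighbors)
after = neighbor_dot_features(rotated_center, rotated_neighbors)
max_diff = max(abs(a - b) for a, b in zip(before, after))
```

Because offsets rotate with the cloud, every entry of `before` matches the corresponding entry of `after` up to floating-point error, while the raw coordinates themselves change.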
Extensive experiments on challenging scenarios involving noise and large-angle rotations demonstrate that DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks. Our code is available at https://github.com/wxszreal0/DiPVNet.
[144] NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary sEgmentation
Kunal Mahatha,Jose Dolz,Christian Desrosiers
Main category: cs.CV
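TL;DR: Proposes NERVE, a training-free open-vocabulary semantic segmentation method that combines the neighbourhood structure of self-attention with an entropy-guided random walk to fuse global and local information, significantly improving zero-shot segmentation.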
Details
Motivation: Existing open-vocabulary semantic segmentation methods are computationally expensive, fuse attention maps ineffectively, and depend on fixed Gaussian kernels, so they struggle with arbitrarily shaped objects and efficient spatial smoothing.
Method: The method exploits the neighbourhood structure of the self-attention layers of a Stable Diffusion model, selects attention maps with an entropy-based criterion, and replaces fixed Gaussian kernels with a random-walk strategy for affinity refinement, fusing global and fine-grained local information.
Result: State-of-the-art zero-shot segmentation performance on 7 popular semantic segmentation benchmarks, without post-processing such as CRF or PAMR.
Conclusion: NERVE provides a strong baseline for training-free open-vocabulary semantic segmentation; its neighbourhood structure and uncertainty-aware random walk improve segmentation accuracy and generalization.
Abstract: Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
[145] LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning
Fengyi Fu,Mengqi Huang,Lei Zhang,Zhendong Mao
Main category: cs.CV
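TL;DR: Proposes LayerEdit, a training-free multi-object image editing framework whose layered "decompose, edit, fuse" pipeline resolves the editing leakage and editing constraints caused by attention entanglement between objects in text-driven editing, enabling conflict-aware layered editing and structure-coherent fusion.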
Details
Motivation: Existing methods ignore inter-object interactions and the attention entanglement in conflict regions during multi-object editing, which causes cross-object leakage or restricts edits within an object and prevents disentangled fine-grained control.
Method: LayerEdit has three modules: (1) a conflict-aware layer decomposition module that uses an attention-aware IoU scheme and time-dependent region removal for better layer separation; (2) an object-layered editing module that disentangles semantic and structural modifications through intra-layer text guidance and cross-layer geometric mapping; (3) a transparency-guided layer fusion module that learns precise transparency for structure-coherent fusion.
Result: Experiments show that LayerEdit outperforms existing methods in complex multi-object scenes, with strong intra-object controllability and inter-object coherence, avoiding editing leakage and improving editing precision.
Conclusion: LayerEdit is presented as the first training-free, layer-disentangled multi-object image editing method; conflict-aware decomposition and transparency-guided fusion markedly improve the precision and coherence of text-driven editing.
Abstract: Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We thereby propose a novel multi-layer disentangled editing framework, LayerEdit, a training-free method which, for the first time, through precise object-layered decomposition and coherent fusion, enables conflict-free object-layered editing. Specifically, LayerEdit introduces a novel "decompose-editing-fusion" framework, consisting of: (1) a Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removal to enhance conflict awareness and suppression for layer decomposition; (2) an Object-layered Editing module, which establishes coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications; (3) a Transparency-guided Layer Fusion module, which facilitates structure-coherent inter-object layer fusion through precise transparency guidance learning.
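The abstract names an "attention-aware IoU scheme" for finding conflict regions but does not define it. As a rough illustration only, the Python sketch below computes a soft IoU between two per-object attention maps and marks pixels that both objects attend to strongly. The soft-IoU formula (sum of element-wise minima over sum of element-wise maxima), the `threshold` parameter, and both function names are assumptions for this sketch, not LayerEdit's actual scheme.

```python
def soft_iou(attn_a, attn_b):
    """Soft IoU of two same-sized attention maps with values in [0, 1].

    Intersection is the sum of element-wise minima, union the sum of maxima.
    """
    inter = 0.0
    union = 0.0
    for row_a, row_b in zip(attn_a, attn_b):
        for a, b in zip(row_a, row_b):
            inter += min(a, b)
            union += max(a, b)
    return inter / union if union > 0 else 0.0

def conflict_mask(attn_a, attn_b, threshold=0.5):
    """Mark pixels where both objects' attention is at least `threshold` (hypothetical rule)."""
    return [[a >= threshold and b >= threshold for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(attn_a, attn_b)]

# Usage: two 2x2 attention maps that overlap only in the bottom-left pixel.
attn_cat = [[1.0, 0.0],
            [0.5, 0.0]]
attn_dog = [[0.0, 1.0],
            [0.5, 0.0]]
overlap = soft_iou(attn_cat, attn_dog)     # 0.5 / 2.5
mask = conflict_mask(attn_cat, attn_dog)
```

A high soft IoU between two objects' maps would suggest entangled attention, and the mask localizes where a decomposition step might need to separate the layers.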
Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios. Code is available at: https://github.com/fufy1024/LayerEdit.
[146] Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
Jae Joong Lee,Bedrich Benes
Main category: cs.CV
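TL;DR: Proposes Top2Ground, a diffusion model that generates photorealistic ground-view images directly from aerial images without intermediate representations such as depth maps or 3D voxels.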
Details
Motivation: Generating ground-level images from aerial views is very challenging because of the large viewpoint disparity, severe occlusions, and limited field of view. Existing methods usually rely on intermediate representations, which limits generation quality and generalization.
Method: A diffusion-based method conditions the denoising process on a joint representation of VAE-encoded spatial features (from the aerial RGB image and an estimated height map) and CLIP semantic embeddings, keeping the output consistent in both geometric structure and semantic content.
Result: Evaluated on CVUSA, CVACT, and Auto Arborist, the method improves SSIM by 7.3% on average and handles both wide and narrow fields of view robustly, showing strong generalization.
Conclusion: Top2Ground achieves high-quality aerial-to-ground image generation without intermediate representations and shows strong performance and broad applicability in cross-view image synthesis.
Abstract: Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three benchmark datasets and robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
[147] ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation
Yue Min,Shaobo Wang,Jiaze Li,Tianle Niu,Junxin Fan,Yongliang Miao,Lijin Yang,Linfeng Zhang
Main category: cs.CV
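TL;DR: Proposes ImageBindDC, a multimodal data condensation framework built on ImageBind's unified feature space. A Characteristic Function (CF) loss in the Fourier domain provides precise statistical alignment and preserves distributional consistency at the uni-modal, cross-modal, and joint-modal levels, significantly improving multimodal data condensation.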
Details
Motivation: Existing data condensation methods struggle to preserve complex inter-modal dependencies in multimodal settings, which degrades performance. A method that retains the structure of multimodal data is needed.
Method: ImageBindDC works in the unified feature space of ImageBind and introduces a Fourier-domain Characteristic Function (CF) loss that achieves infinite moment matching. Its objective enforces three levels of distributional consistency: uni-modal, cross-modal, and joint-modal alignment.
Result: On NYU-v2, training with only 5 condensed samples per class matches training on the full dataset, an 8.2% improvement over the previous best method, with condensation time reduced by more than 4$\times$.
Conclusion: Multi-level distribution alignment in a unified feature space markedly improves the effectiveness and efficiency of multimodal data condensation and offers a new direction for efficient multimodal learning.
Abstract: Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training, yet while successful in unimodal settings, they often fail in multimodal scenarios where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution-matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance comparable to one trained on the full dataset, achieving a new state-of-the-art with an 8.2% absolute improvement over the previous best method and more than 4$\times$ less condensation time.
[148] Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
Nan Bao,Yifan Zhao,Lin Zhu,Jia Li
Main category: cs.CV
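TL;DR: Proposes the Edge-awareness Semantic Concordance (ESC) framework, which uses edge cues from the event and RGB modalities to achieve robust semantic segmentation under extreme conditions.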
Details
Motivation: Under extreme conditions, existing methods degrade badly because RGB information is lost and the heterogeneity of the event and RGB modalities causes feature mismatch.
Method: The framework consists of Edge-awareness Latent Re-coding and Re-coded Consolidation and Uncertainty Optimization, which use a pre-established edge dictionary and uncertainty indicators to align multimodal features.
Result: On the proposed DERS-XS dataset, the method improves on the state of the art by 2.55% mIoU and is more robust under spatial occlusion.
Conclusion: The method addresses heterogeneous event-RGB fusion under extreme conditions and markedly improves the robustness and accuracy of semantic segmentation.
Abstract: Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant information loss of RGB, severely damaging segmentation results. Several studies exploit the high-speed and high-dynamic event modality as a complement, but event and RGB are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization of existing multi-modality methods. Unlike these studies, we delve into the edge secret of both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the multi-modality heterogeneous features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by the re-coded distribution, and transfers event-RGB distributions into re-coded features by utilizing a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which utilize re-coded edge features and uncertainty indicators to solve the heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic and one real-world event-RGB semantic segmentation datasets for extreme scenario comparisons. Experimental results show that our method outperforms the state-of-the-art by 2.55% mIoU on our proposed DERS-XS, and possesses superior resilience under spatial occlusion.
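The headline result is reported in mIoU (mean Intersection over Union), the standard semantic segmentation metric. For readers unfamiliar with it, the Python sketch below shows one common way to compute it from flattened per-pixel label lists. Skipping classes that appear in neither prediction nor ground truth is an assumed convention here; the paper's exact evaluation protocol is not described in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of per-pixel class ids.

    Classes absent from both pred and target are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Usage: four pixels, two classes present, a third class that never occurs.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=3)
# Class 0: IoU = 1/2; class 1: IoU = 2/3; class 2 is skipped.
```

Since each class contributes equally regardless of pixel count, gains on rare classes move mIoU as much as gains on common ones.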
Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.
[149] SWAN - Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
Sweta Banerjee,Timo Gosch,Sara Hester,Viktoria Weiss,Thomas Conrad,Taryn A. Donovan,Nils Porsche,Jonas Ammeling,Christoph Stroblberger,Robert Klopfleisch,Christopher Kaltenecker,Christof A. Bertram,Katharina Breininger,Marc Aubreville
Main category: cs.CV
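TL;DR: SWAN is an open-source, swipe-based image annotation tool that speeds up annotation of large-scale histopathology image datasets. It runs on desktop and mobile and offers real-time metadata capture and flexible class mapping.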
Details
Motivation: Conventional folder-based annotation is slow, fatiguing, and hard to scale, which holds back deep learning models for clinically relevant tasks such as mitotic figure classification.
Method: SWAN, an open-source web application, uses swipe gestures for intuitive image patch classification. A pilot study with four pathologists compared it with a conventional folder-sorting workflow.
Result: With SWAN, pairwise percent agreement between annotators ranged from 86.52% to 93.68% (Cohen's Kappa = 0.61-0.80), comparable to the conventional method. Participants rated the tool as highly usable and valued the convenience of annotating on mobile devices.
Conclusion: SWAN can accelerate annotation while maintaining annotation quality, offering a scalable and user-friendly alternative to conventional annotation workflows.
Abstract: The annotation of large-scale histopathology image datasets remains a major bottleneck in developing robust deep learning models for clinically relevant tasks, such as mitotic figure classification. Folder-based annotation workflows are usually slow, fatiguing, and difficult to scale. To address these challenges, we introduce SWipeable ANnotations (SWAN), an open-source, MIT-licensed web application that enables intuitive image patch classification using a swiping gesture. SWAN supports both desktop and mobile platforms, offers real-time metadata capture, and allows flexible mapping of swipe gestures to class labels. In a pilot study with four pathologists annotating 600 mitotic figure image patches, we compared SWAN against a traditional folder-sorting workflow. SWAN enabled rapid annotations with pairwise percent agreement ranging from 86.52% to 93.68% (Cohen's Kappa = 0.61-0.80), while for the folder-based method, the pairwise percent agreement ranged from 86.98% to 91.32% (Cohen's Kappa = 0.63-0.75) for the task of classifying atypical versus normal mitotic figures, demonstrating high consistency between annotators and comparable performance. Participants rated the tool as highly usable and appreciated the ability to annotate on mobile devices. These results suggest that SWAN can accelerate image annotation while maintaining annotation quality, offering a scalable and user-friendly alternative to conventional workflows.
[150] MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image Autoencoders
Kunjing Yang,Zhiwei Wang,Minru Bai
Main category: cs.CV
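TL;DR: Proposes MAUGIF, a mechanism-aware unsupervised general image fusion method based on dual cross-image autoencoders, with decoder architectures tailored to each fusion task to improve performance and interpretability.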
Details
Motivation: Existing image fusion methods are either too task-specific or apply a uniform strategy that ignores the different fusion mechanisms of different tasks, so they lack generality and adaptability.
Method: Fusion tasks are classified as additive or multiplicative. Dual encoders extract shared content and modality-specific features, and dual decoders inject and reconstruct features according to the fusion mechanism.
Result: Experiments on a variety of image fusion tasks confirm the method's effectiveness and generalization, with better fusion results than existing general and task-specific methods.
Conclusion: Through its mechanism-aware architecture, MAUGIF delivers high-performing, interpretable general image fusion that applies to many kinds of fusion tasks.
Abstract: Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at https://anonymous.4open.science/r/MAUGIF.
[151] SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer
Kaiyi Xu,Junchao Gong,Zhiwang Zhou,Zhangrui Li,Yuandong Pu,Yihao Liu,Ben Fei,Fenghua Ling,Wenlong Zhang,Lei Bei
Main category: cs.CV
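TL;DR: Introduces SynWeather, the first dataset for unified multi-region, multi-variable weather observation data synthesis, and SynWeatherDiff, a model built on the Diffusion Transformer framework, to address the limits of existing methods in cross-variable, cross-region modeling and their over-smoothed outputs.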
Details
Motivation: Existing weather data synthesis methods are usually confined to single-variable, single-region tasks and rely on deterministic modeling. They cannot unify variables and regions, overlook cross-variable complementarity, and often produce over-smoothed results. A more general and flexible multi-variable, multi-region synthesis approach is needed.
Method: The SynWeather dataset covers the Continental United States, Europe, East Asia, and Tropical Cyclone regions with several high-resolution weather variables. SynWeatherDiff, built on the Diffusion Transformer, uses a probabilistic generative framework for multi-variable, multi-region weather data synthesis.
Result: Experiments on SynWeather show that SynWeatherDiff outperforms both task-specific and general models across tasks, mitigates over-smoothing, and improves cross-variable and cross-region synthesis quality.
Conclusion: SynWeather provides a new benchmark for multi-variable, multi-region weather data synthesis, and SynWeatherDiff demonstrates the potential of diffusion models for complex weather generation tasks.
Abstract: With the advancement of meteorological instruments, abundant data has become available. Current approaches typically focus on single-variable, single-region tasks and primarily rely on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address these challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions (the Continental United States, Europe, East Asia, and Tropical Cyclone regions) and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.
[152] SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
Laura Bragagnolo,Leonardo Barcellona,Stefano Ghidoni
Main category: cs.CV
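TL;DR: Proposes SkelSplat, a multi-view 3D human pose estimation framework based on differentiable Gaussian rendering that fuses arbitrary views without 3D ground-truth supervision; it outperforms methods without 3D supervision on several datasets, substantially reduces cross-dataset error, and is robust to occlusion.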
Details
Motivation: 现有基于多视角的3D人体姿态估计方法依赖大量标注数据,泛化能力差,尤其在测试场景与训练不一致时表现不佳,且通常需要3D真值监督。 Method: 提出SkelSplat框架,将人体骨架建模为每个关节点对应的3D高斯分布,通过可微分渲染进行优化;引入新颖的一对一热编码方案,实现各关节独立优化,从而支持任意相机视角的无缝融合,无需3D真值监督。 Result: 在Human3.6M和CMU数据集上优于其他无需3D真值监督的方法,跨数据集误差最多降低47.8%,并在Human3.6M-Occ和Occlusion-Person上表现出对遮挡的鲁棒性,无需特定场景微调。 Conclusion: SkelSplat通过基于可微分高斯渲染的无监督多视角融合策略,有效提升了3D人体姿态估计的泛化能力和遮挡鲁棒性,为实际应用提供了更实用的解决方案。 Abstract: Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth in Human3.6M and CMU, while reducing the cross-dataset error up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.[153] NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos
Qingshan Xu,Jiao Liu,Shangshu Yu,Yuxuan Wang,Yuan Zhou,Junbao Zhou,Jiequan Cui,Yew-Soon Ong,Hanwang Zhang
Main category: cs.CV
TL;DR: This paper proposes NeuSpring, a neural spring field method for reconstructing and simulating physical digital twins of deformable objects from videos.
Details
Motivation: Existing methods model the current state well but generalize poorly to future prediction because they ignore the intrinsic physical properties of deformable objects. Method: Building on spring-mass models, the paper proposes two innovations, a piecewise topology solution and a neural spring field, which respectively model multi-region spring connection topologies efficiently and represent spring physical properties across frames. Result: Experiments on real-world datasets show that NeuSpring outperforms existing methods in both current state modeling and future prediction, improving Chamfer distance by 20% and 25%, respectively. Conclusion: By accounting for material heterogeneity and spatial associativity, NeuSpring significantly improves physical learning and simulation accuracy for deformable objects. Abstract: In this paper, we aim to create physical digital twins of deformable objects under interaction. Existing methods focus more on the physical learning of current state modeling, but generalize poorly to future prediction. This is because existing methods ignore the intrinsic physical properties of deformable objects, resulting in limited physical learning in current state modeling. To address this, we present NeuSpring, a neural spring field for the reconstruction and simulation of deformable objects from videos. Built upon spring-mass models for realistic physical simulation, our method consists of two major innovations: 1) a piecewise topology solution that efficiently models multi-region spring connection topologies using zero-order optimization, which considers the material heterogeneity of real-world objects; and 2) a neural spring field that represents spring physical properties across different frames using a canonical coordinate-based neural network, which effectively leverages the spatial associativity of springs for physical learning. Experiments on real-world datasets demonstrate that our NeuSpring achieves superior reconstruction and simulation performance for current state modeling and future prediction, with Chamfer distance improved by 20% and 25%, respectively.[154] Mitigating Negative Flips via Margin Preserving Training
Simone Ricci,Niccolò Biondi,Federico Pernici,Alberto Del Bimbo
Main category: cs.CV
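TL;DR: Proposes a new method for image classification that reduces negative flips by preserving the original model's margins and introducing a logit margin-calibration term, combined with a double-source focal distillation loss to maintain accuracy on new classes.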
Details
Motivation: 随着训练类别数量增加,更新模型时容易导致之前正确分类的样本被错误分类(负翻转),影响系统一致性。 Method: 引入显式的logit margin校准项以扩大原有类别与新类别之间的相对边界,并采用双源焦点蒸馏损失,结合旧模型和独立训练的新模型进行知识蒸馏。 Result: 在多个图像分类基准上的实验表明,该方法能持续降低负翻转率,同时保持较高的整体准确率。 Conclusion: 所提方法有效平衡了模型更新过程中对旧类别的一致性保护与新类别的学习性能,显著减少了负翻转现象。 Abstract: Minimizing inconsistencies across successive versions of an AI system is as crucial as reducing the overall error. In image classification, such inconsistencies manifest as negative flips, where an updated model misclassifies test samples that were previously classified correctly. This issue becomes increasingly pronounced as the number of training classes grows over time, since adding new categories reduces the margin of each class and may introduce conflicting patterns that undermine their learning process, thereby degrading performance on the original subset. To mitigate negative flips, we propose a novel approach that preserves the margins of the original model while learning an improved one. Our method encourages a larger relative margin between the previously learned and newly introduced classes by introducing an explicit margin-calibration term on the logits. However, overly constraining the logit margin for the new classes can significantly degrade their accuracy compared to a new independently trained model. To address this, we integrate a double-source focal distillation loss with the previous model and a new independently trained model, learning an appropriate decision margin from both old and new data, even under a logit margin calibration. Extensive experiments on image classification benchmarks demonstrate that our approach consistently reduces the negative flip rate with high overall accuracy.[155] The Impact of Longitudinal Mammogram Alignment on Breast Cancer Risk Assessment
Solveig Thrun,Stine Hansen,Zijun Sun,Nele Blum,Suaiba A. Salahuddin,Xin Wang,Kristoffer Wickstrøm,Elisabeth Wetzer,Robert Jenssen,Maik Stille,Michael Kampffmeyer
Main category: cs.CV
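TL;DR: This study compares several alignment methods for longitudinal breast cancer risk modeling and finds that image-based registration outperforms feature-level and implicit alignment in both predictive accuracy and deformation field quality.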
Details
Motivation: 准确的时空对齐是利用历史乳腺X线图像进行深度学习风险预测的关键挑战,现有方法在对齐效果上存在局限。 Method: 系统评估了基于图像的配准、带或不带正则化的特征级对齐以及隐式对齐方法,在两个大规模数据集上比较其预测性能和形变场质量。 Result: 基于图像的配准在所有指标上表现最佳;特征空间中应用图像配准的形变场可实现最优风险预测性能;正则化虽提升形变质量但降低预测性能。 Conclusion: 基于图像的形变场在纵向风险建模中至关重要,能显著提升预测准确性与鲁棒性,有助于实现个性化筛查和早期干预。 Abstract: Regular mammography screening is crucial for early breast cancer detection. By leveraging deep learning-based risk models, screening intervals can be personalized, especially for high-risk individuals. While recent methods increasingly incorporate longitudinal information from prior mammograms, accurate spatial alignment across time points remains a key challenge. Misalignment can obscure meaningful tissue changes and degrade model performance. In this study, we provide insights into various alignment strategies, image-based registration, feature-level (representation space) alignment with and without regularization, and implicit alignment methods, for their effectiveness in longitudinal deep learning-based risk modeling. Using two large-scale mammography datasets, we assess each method across key metrics, including predictive accuracy, precision, recall, and deformation field quality. Our results show that image-based registration consistently outperforms the more recently favored feature-based and implicit approaches across all metrics, enabling more accurate, temporally consistent predictions and generating smooth, anatomically plausible deformation fields. Although regularizing the deformation field improves deformation quality, it reduces the risk prediction performance of feature-level alignment. Applying image-based deformation fields within the feature space yields the best risk prediction performance. These findings underscore the importance of image-based deformation fields for spatial alignment in longitudinal risk modeling, offering improved prediction accuracy and robustness. 
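As a rough illustration of what applying an image-based deformation field within feature space can involve, the following minimal sketch backward-warps a small 2D feature map with a per-pixel displacement field using bilinear sampling. The function name `warp_feature_map`, the (dy, dx) displacement convention, and the border clamping are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import List, Tuple

Grid = List[List[float]]
Flow = List[List[Tuple[float, float]]]


def warp_feature_map(feature: Grid, flow: Flow) -> Grid:
    """Backward-warp a 2D map: output[y][x] samples feature at (y + dy, x + dx).

    Hypothetical helper: bilinear interpolation, sample positions clamped
    to the map borders.
    """
    height, width = len(feature), len(feature[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = flow[y][x]
            # Clamp the sampling position so it stays inside the map.
            sy = min(max(y + dy, 0.0), height - 1.0)
            sx = min(max(x + dx, 0.0), width - 1.0)
            y0, x0 = int(sy), int(sx)
            y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
            wy, wx = sy - y0, sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = feature[y0][x0] * (1.0 - wx) + feature[y0][x1] * wx
            bottom = feature[y1][x0] * (1.0 - wx) + feature[y1][x1] * wx
            warped[y][x] = top * (1.0 - wy) + bottom * wy
    return warped


prior_features = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
# A displacement field that samples one column to the right everywhere.
shift_right = [[(0.0, 1.0)] * 3 for _ in range(2)]
aligned = warp_feature_map(prior_features, shift_right)
```

In practice the displacement field would come from an image registration step and would be resampled to the resolution of the feature map before warping; both steps are outside this sketch.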
This approach has strong potential to enhance personalized screening and enable earlier interventions for high-risk individuals. The code is available at https://github.com/sot176/Mammogram_Alignment_Study_Risk_Prediction.git, allowing full reproducibility of the results.[156] Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter
Zhiyang Chen,Chen Zhang,Hao Fang,Runmin Cong
Main category: cs.CV
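TL;DR: This paper proposes DiveSeg, a DINO-based underwater instance segmentation framework with two key components, the AquaStyle Aligner and the ObjectPrior Prompter, achieving state-of-the-art performance on the UIIS and USIS10K datasets.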
Details
Motivation: 水下实例分割在海洋资源勘探和生态保护中至关重要,但存在颜色失真、对比度低等挑战,现有方法难以有效应对,因此需要一种能适应水下环境并具备实例级推理能力的新方法。 Method: 基于DINO视觉基础模型,设计AquaStyle Aligner以融入水下色彩风格特征,并提出ObjectPrior Prompter引入基于二值分割的对象先验提示,从而增强实例分割中的对象级和实例级推理能力。 Result: 在UIIS和USIS10K数据集上的实验表明,DiveSeg在水下实例分割任务上达到了最先进的性能。 Conclusion: DINO可作为有效的特征学习器用于水下实例分割,所提出的DiveSeg框架通过风格对齐和对象先验提示显著提升了在复杂水下环境中的分割性能。 Abstract: Underwater instance segmentation (UIS), integrating pixel-level understanding and instance-level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large-scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) The AquaStyle Aligner, designed to embed underwater color style features into the DINO fine-tuning process, facilitating better adaptation to the underwater domain. (2) The ObjectPrior Prompter, which incorporates binary segmentation-based prompts to deliver object-level priors, provides essential guidance for instance segmentation task that requires both object- and instance-level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves the state-of-the-art performance. Code: https://github.com/ettof/Diveseg.[157] Towards Open-Set Myoelectric Gesture Recognition via Dual-Perspective Inconsistency Learning
Chen Liu,Can Han,Weishi Xu,Yaqi Wang,Dahong Qian
Main category: cs.CV
TL;DR: 提出了一种基于扩散模型的稀疏感知语义引导数据增强方法(SASG-DA),用于提升sEMG手势识别中数据增强的保真性与多样性,显著改善模型泛化性能。
Details
Motivation: sEMG gesture recognition suffers from scarce training data, which leads deep learning models to overfit and generalize poorly, and existing data augmentation methods fall short in faithfulness and targeted diversity. Method: Proposes SASG-DA, comprising a Semantic Representation Guidance (SRG) mechanism, a Gaussian Modeling Semantic Sampling (GMSS) strategy, and a Sparse-Aware Semantic Sampling strategy, which use fine-grained semantic information to guide a diffusion model toward samples that are both faithful and diverse. Result: Experiments on the Ninapro DB2, DB4, and DB7 datasets show that SASG-DA significantly outperforms existing data augmentation methods and improves classification performance and generalization. Conclusion: Through semantic guidance and sparse-aware sampling, SASG-DA enhances targeted diversity while keeping generated samples faithful, effectively mitigating the overfitting caused by sEMG data scarcity. Abstract: Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two factors critical to its effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Sampling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. 
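The idea of modeling the semantic representation distribution as a Gaussian and then favoring underrepresented regions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SASG-DA implementation: the diagonal-covariance fit, the nearest-neighbor distance used as a sparsity score, and every function name below are hypothetical choices made for this example.

```python
import math
import random
from typing import List

Vector = List[float]


def fit_diagonal_gaussian(vectors: List[Vector]):
    """Per-dimension mean and standard deviation of the semantic vectors."""
    count, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
    std = [
        math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vectors) / count)
        for d in range(dim)
    ]
    return mean, std


def sample_gaussian(mean: Vector, std: Vector, count: int, rng: random.Random):
    """Draw candidate semantic vectors from the fitted diagonal Gaussian."""
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(count)]


def nearest_distance(point: Vector, references: List[Vector]) -> float:
    """Distance to the closest real vector; large values mean a sparse region."""
    return min(math.dist(point, ref) for ref in references)


def sparse_aware_select(candidates, references, k: int):
    """Keep the k candidates that lie farthest from any existing real vector."""
    ranked = sorted(
        candidates, key=lambda c: nearest_distance(c, references), reverse=True
    )
    return ranked[:k]


rng = random.Random(0)
real_vectors = [[0.0, 0.0], [0.1, -0.1], [0.2, 0.1], [2.0, 2.0]]
mean, std = fit_diagonal_gaussian(real_vectors)
candidates = sample_gaussian(mean, std, 50, rng)
conditions = sparse_aware_select(candidates, real_vectors, 5)
```

In a full pipeline the selected vectors would serve as generation conditions for the diffusion model; that step is omitted here.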
Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.[158] VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan,Anupam Pandey,Deepjyoti Bodo,Asif Ekbal
Main category: cs.CV
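TL;DR: This paper proposes VideoChain, a new Multi-hop Video Question Generation (MVQG) framework that generates questions requiring reasoning across multiple temporally separated video segments.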
Details
Motivation: 现有的视频问答生成局限于单一视频片段的零跳问题,缺乏对多跳推理的支持,而多跳问题能更有效地评估模型的推理能力。 Method: 基于改进的BART模型并融合视频嵌入,构建模块化架构;利用TVQA+数据集自动构建大规模MVQ-60数据集,通过合并零跳问答对实现多跳问题生成。 Result: 在标准生成指标上表现优异:ROUGE-L为0.6454,ROUGE-1为0.6854,BLEU-1为0.6711,BERTScore-F1为0.7967,语义相似度达0.8110。 Conclusion: VideoChain能有效生成连贯、上下文相关且需复杂推理的多跳视频问题,推动了视频问答生成向更高层次的推理发展。 Abstract: Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain's strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model's ability to generate coherent, contextually grounded, and reasoning-intensive questions.[159] Extreme Model Compression with Structured Sparsity at Low Precision
Dan Liu,Nikita Dvornik,Xue Liu
Main category: cs.CV
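TL;DR: This paper proposes SLOPE, a framework that unifies structured sparsity and low-bit quantization, substantially reducing model size while maintaining high accuracy.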
Details
Motivation: Deep neural networks are hard to deploy on resource-constrained devices; existing sparsification and quantization methods are usually studied in isolation, and combining them directly severely degrades model performance. Method: Proposes a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment, so that sparsification and quantization can be combined effectively. Result: Achieves about 20x model compression on models such as ResNet-18 while retaining about 99% of the original accuracy, and outperforms existing methods on classification, detection, and segmentation tasks. Conclusion: SLOPE effectively combines structured sparsity with low-precision quantization and offers a new approach to efficient model compression. Abstract: Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. Although both are effective individually, they are typically studied in isolation due to their compounded negative impact on model accuracy when combined. In this work, we introduce SLOPE (Structured Sparsity at Low Precision), a unified framework that effectively combines structured sparsity and low-bit quantization in a principled way. We show that naively combining sparsity and quantization severely harms performance due to the compounded impact of both techniques. To address this, we propose a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching. On ResNet-18, SLOPE achieves $\sim20\times$ model size reduction while retaining $\sim$99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on models such as ResNet-18, ViT-Small, and Mask R-CNN.[160] Retrospective motion correction in MRI using disentangled embeddings
Qi Wang,Veronika Ecker,Marcel Früh,Sergios Gatidis,Thomas Küstner
Main category: cs.CV
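TL;DR: Proposes an MRI motion artifact correction method based on a hierarchical vector-quantized variational auto-encoder that disentangles motion features to generalize correction to unseen motion patterns.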
Details
Motivation: Existing MRI motion correction methods struggle to generalize across motion types and body regions; machine learning methods in particular are usually confined to specific applications and datasets. Method: Uses a hierarchical vector-quantized variational auto-encoder to learn a disentangled embedding of motion-to-clean image features, captures motion patterns with multi-resolution codebooks, and trains an auto-regressive model on the prior distribution of motion-free images to guide correction. Result: The method is validated on simulated whole-body motion artifacts, where it corrects robustly across varying motion severity and generalizes well to unseen motion patterns. Conclusion: The method corrects across motion types and anatomical regions without artifact-specific training, broadening the applicability of machine learning to MRI motion correction. Abstract: Physiological motion can affect the diagnostic quality of magnetic resonance imaging (MRI). While various retrospective motion correction methods exist, many struggle to generalize across different motion types and body regions. In particular, machine learning (ML)-based corrections are often tailored to specific applications and datasets. We hypothesize that motion artifacts, though diverse, share underlying patterns that can be disentangled and exploited. To address this, we propose a hierarchical vector-quantized (VQ) variational auto-encoder that learns a disentangled embedding of motion-to-clean image features. A codebook is deployed to capture a finite collection of motion patterns at multiple resolutions, enabling coarse-to-fine correction. An auto-regressive model is trained to learn the prior distribution of motion-free images and is used at inference to guide the correction process. Unlike conventional approaches, our method does not require artifact-specific training and can generalize to unseen motion patterns. We demonstrate the approach on simulated whole-body motion artifacts and observe robust correction across varying motion severity. Our results suggest that the model effectively disentangles physical motion in the simulated motion-affected scans, thereby improving the generalizability of ML-based MRI motion correction. Disentangling the motion features sheds light on the method's potential application across anatomical regions and motion types.[161] A Circular Argument : Does RoPE need to be Equivariant for Vision?
Chase van de Geijn,Timo Lüddecke,Polina Turishcheva,Alexander S. Ecker
Main category: cs.CV
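TL;DR: This paper studies the generalization of Rotary Positional Encodings (RoPE) to higher-dimensional data, proposes Spherical RoPE, and shows empirically that the importance of relative positional encoding may be overestimated, particularly in computer vision.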
Details
Motivation: 尽管RoPE在自然语言处理中表现出色,但其成功归因于位置等变性。本文旨在探讨这种等变性是否真正关键,并尝试将其推广到更高维度的数据如图像和视频中。 Method: 数学上证明RoPE是一维数据中最通用的等变位置嵌入解;提出Mixed RoPE作为M维数据的推广形式,并进一步提出假设非交换生成元的Spherical RoPE。 Result: 实验表明Spherical RoPE在学习行为上优于或相当于其等变对应方法,说明严格等变性对性能影响有限。 Conclusion: 相对位置编码可能并非RoPE性能优越的关键因素,尤其在计算机视觉领域,这为未来更高效、泛化能力更强的位置编码设计提供了新方向。 Abstract: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators -- a property necessary for RoPE's equivariance. However, we question whether strict equivariance plays a large role in RoPE's performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but assumes non-commutative generators. Empirically, we find Spherical RoPE to have the equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as is commonly believed, at least within computer vision. We expect this discovery to facilitate future work in positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.[162] Text-based Aerial-Ground Person Retrieval
Xinyu Zhou,Yu Wu,Jiayao Ma,Wenhao Wang,Min Cao,Mang Ye
Main category: cs.CV
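TL;DR: This paper introduces Text-based Aerial-Ground Person Retrieval (TAG-PR) and addresses text-image matching across heterogeneous views by building the new TAG-PEDES dataset and proposing the TAG-CLIP framework.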
Details
Motivation: 传统文本到行人的检索主要关注地面视角,而实际应用中存在大量空中视角图像,因此需要一种能处理大视角差异的跨模态检索方法。 Method: 提出TAG-CLIP框架,采用分层路由的专家混合模块学习视角特定与视角无关特征,并通过视角解耦策略提升跨模态对齐;同时构建了带有自动文本标注的TAG-PEDES数据集。 Result: 在TAG-PEDES和现有T-PR基准上验证了TAG-CLIP的有效性,表现出优越的跨视角文本-图像检索性能。 Conclusion: TAG-PR为跨视角文本-图像行人检索提供了新的研究方向,所提方法有效缓解了视角差异带来的挑战。 Abstract: This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.[163] RAPTR: Radar-based 3D Pose Estimation using Transformer
Sorachi Kato,Ryoma Yataka,Pu Perry Wang,Pedro Miraldo,Takuya Fujihashi,Petros Boufounos
Main category: cs.CV
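TL;DR: This paper proposes RAPTR, a weakly supervised radar-based 3D human pose estimation method that uses only easily obtained 3D bounding box and 2D keypoint labels and significantly reduces joint position error through a two-stage decoder architecture.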
Details
Motivation: Conventional radar-based 3D human pose estimation depends on fine-grained, costly 3D keypoint annotations, which are especially hard to obtain in complex indoor environments with occlusion, clutter, or multiple people, so a more scalable weakly supervised method is needed. Method: Proposes RAPTR, which uses a two-stage pose decoder architecture: the first stage estimates initial 3D poses from 3D bounding box labels with a 3D template loss to mitigate depth ambiguity; the second stage refines the initial poses using 2D keypoint labels and a 3D gravity loss, with pseudo-3D deformable attention. Result: On the HIBER and MMVR indoor radar datasets, RAPTR reduces joint position error by 34.3% and 76.9%, respectively, compared with existing methods. Conclusion: RAPTR achieves efficient and accurate 3D human pose estimation under weak supervision, substantially reducing the dependence on fine-grained annotations, and is practical and scalable. Abstract: Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels, which are considerably easier and more scalable to collect. Our RAPTR is characterized by a two-stage pose decoder architecture with a pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.[164] Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
Difei Gu,Yunhe Gao,Mu Zhou,Dimitris Metaxas
Main category: cs.CV
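TL;DR: Anatomy-VLM is a fine-grained vision-language model that improves disease diagnosis accuracy in medical imaging through multi-scale anatomical feature localization and structured knowledge fusion, and offers expert-level clinical interpretation.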
Details
Motivation: 现有视觉-语言模型将医学图像视为整体,忽略对疾病诊断至关重要的细微解剖细节,难以实现专家级诊断。 Method: 设计一个模型编码器来定位关键解剖特征,结合结构化医学知识进行上下文感知的解释,并通过多尺度信息对齐生成可解释的疾病预测。 Result: Anatomy-VLM 在分布内和分布外数据集上均表现出色,在下游分割任务中验证了其捕捉解剖与病理知识的能力,并支持零样本按解剖部位解释。 Conclusion: Anatomy-VLM 通过模拟临床工作流,实现了更准确、可解释的医学影像诊断,展现出强大的临床应用潜力。 Abstract: Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identify anatomical structures as important region of interests (ROIs). Inspired from this human-centric workflow, we introduce Anatomy-VLM, a fine-grained, vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually-aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically-interpretable disease prediction. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, the Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, providing its strong expert-level clinical interpretation capabilities.[165] OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild
Yuncheng Guo,Junyan Ye,Chenjue Zhang,Hengrui Kang,Haohuan Fu,Conghui He,Weijia Li
Main category: cs.CV
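TL;DR: This paper proposes OmniAID, a universal AI-generated image detection framework based on a decoupled Mixture-of-Experts (MoE) architecture; by separating content-dependent flaws from content-agnostic universal artifacts, it generalizes robustly across generative models and semantic content, and a new large-scale dataset, Mirage, is introduced to validate it.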
Details
Motivation: 现有AIGI检测方法学习到的是纠缠的伪造表示,混淆了内容依赖性缺陷与通用伪影,且受限于过时的基准测试,难以在真实场景中泛化。 Method: 提出OmniAID框架,采用解耦的混合专家(MoE)架构,包含可路由的特定语义专家(处理不同内容域的语义缺陷)和固定的通用伪影专家(捕捉生成方式相关的共性特征),并通过两阶段训练策略:先独立训练各专家以确保专业化,再训练轻量级门控网络进行输入路由。 Result: 在传统基准和新提出的Mirage数据集上进行的大量实验表明,OmniAID显著优于现有的单一结构检测器,在跨模型和跨内容的生成图像检测任务中表现出更强的泛化能力和鲁棒性。 Conclusion: 通过显式解耦‘生成内容’与‘生成方式’,OmniAID为AI生成图像检测建立了新的鲁棒标准,具备应对现代真实世界威胁的能力,同时Mirage数据集为未来研究提供了更贴近现实的评估平台。 Abstract: A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current state-of-the-art methods learn a single, entangled forgery representation--conflating content-dependent flaws with content-agnostic artifacts--and are further constrained by outdated benchmarks. To overcome these limitations, we propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture. The core of our method is a hybrid expert system engineered to decouple: (1) semantic flaws across distinct content domains, and (2) these content-dependent flaws from content-agnostic universal artifacts. This system employs a set of Routable Specialized Semantic Experts, each for a distinct domain (e.g., human, animal), complemented by a Fixed Universal Artifact Expert. This architecture is trained using a bespoke two-stage strategy: we first train the experts independently with domain-specific hard-sampling to ensure specialization, and subsequently train a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. To address outdated benchmarks and validate real-world applicability, we introduce Mirage, a new large-scale, contemporary dataset. 
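The hybrid expert system described above, with routable semantic experts weighted by a lightweight gate plus one always-active artifact expert, can be sketched as follows. This is a toy illustration only: the additive combination of the two branches, the linear gate, and the names `detect`, `gate_weights`, and the lambda experts are assumptions, not OmniAID's actual design or code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def detect(
    features: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    semantic_experts: Sequence[Expert],
    artifact_expert: Expert,
) -> Tuple[float, List[float]]:
    """Return a fake-vs-real logit and the routing weights.

    Assumed combination: gated mixture of semantic experts plus the
    output of a fixed artifact expert that sees every input.
    """
    # Linear gate: one score per semantic expert (hypothetical choice).
    gate_scores = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    routing = softmax(gate_scores)
    semantic_logit = sum(
        p * expert(features) for p, expert in zip(routing, semantic_experts)
    )
    # The universal artifact expert is never routed; it always contributes.
    return semantic_logit + artifact_expert(features), routing


# Stand-in experts: each maps a feature vector to a scalar logit.
human_expert = lambda f: f[0]
animal_expert = lambda f: f[1]
universal_artifact_expert = lambda f: 0.5

logit, routing = detect(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [human_expert, animal_expert],
    universal_artifact_expert,
)
```

Under the two-stage strategy in the abstract, the experts would be trained first and only the gate parameters (here `gate_weights`) would be learned in the second stage.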
Extensive experiments, using both traditional benchmarks and our Mirage dataset, demonstrate our model surpasses existing monolithic detectors, establishing a new, robust standard for AIGI authentication against modern, in-the-wild threats.[166] Cross-pyramid consistency regularization for semi-supervised medical image segmentation
Matus Bojko,Maros Kollar,Marek Jakab,Wanda Benesova
Main category: cs.CV
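TL;DR: Proposes a hybrid consistency learning approach based on Cross-Pyramid Consistency Regularization (CPCR) for semi-supervised medical image segmentation, using a Dual Branch Pyramid Network (DBPNet) to exploit unlabeled data effectively.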
Details
Motivation: 在半监督学习中,如何有效利用大量未标注医学图像数据提升模型性能是一个关键挑战,现有方法在特征层次一致性和知识蒸馏方面仍有不足。 Method: 设计了一个双分支金字塔网络(DBPNet),包含两个略有不同的解码器,生成多尺度的预测金字塔;提出CPCR学习策略,结合一致性学习、不确定性最小化,并引入新的金字塔级跨解码器软标签正则化项,实现深层特征的知识蒸馏。 Result: 在公开基准数据集上,DBPNet结合CPCR优于五种最先进的自监督方法,且性能与近期方法相当。 Conclusion: 所提出的CPCR方法能更有效地挖掘未标注数据中的潜在信息,提升了半监督医学图像分割的性能,具有较强的竞争力和应用潜力。 Abstract: Semi-supervised learning (SSL) enables training of powerful models with the assumption of limited, carefully labelled data and a large amount of unlabeled data to support the learning. In this paper, we propose a hybrid consistency learning approach to effectively exploit unlabeled data for semi-supervised medical image segmentation by leveraging Cross-Pyramid Consistency Regularization (CPCR) between two decoders. First, we design a hybrid Dual Branch Pyramid Network (DBPNet), consisting of an encoder and two decoders that differ slightly, each producing a pyramid of perturbed auxiliary predictions across multiple resolution scales. Second, we present a learning strategy for this network named CPCR that combines existing consistency learning and uncertainty minimization approaches on the main output predictions of decoders with our novel regularization term. More specifically, in this term, we extend the soft-labeling setting to pyramid predictions across decoders to support knowledge distillation in deep hierarchical features. Experimental results show that DBPNet with CPCR outperforms five state-of-the-art self-supervised learning methods and has comparable performance with recent ones on a public benchmark dataset.[167] Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification
Anh Mai Vu,Tuan L. Vo,Ngoc Lam Quang Bui,Nam Nguyen Le Binh,Akash Awasthi,Huy Quoc Vo,Thanh-Huy Nguyen,Zhu Han,Chandra Mohan,Hien Van Nguyen
Main category: cs.CV
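TL;DR: Proposes Contrastive Integrated Gradients (CIG), a new attribution method that improves interpretability in Whole Slide Image (WSI) analysis by computing contrastive gradients in logit space to highlight class-discriminative regions, validated on several cancer datasets.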
Details
Motivation: 现有的归因方法如Integrated Gradients在高分辨率WSI上应用时可能忽略关键的类别判别信号,难以区分肿瘤亚型,因此需要一种更具判别性和可解释性的归因方法。 Method: 提出了Contrastive Integrated Gradients (CIG),通过对比目标类别与参考类别的特征重要性,在logit空间中生成更清晰的归因图;同时满足集成归因的公理要求,并设计了MIL-AIC和MIL-SIC两个指标评估归因质量。 Result: 在CAMELYON16、TCGA-RCC和TCGA-Lung三个不同癌种的数据集上实验表明,CIG在定量指标(MIL-AIC、MIL-SIC)和可视化效果上均优于现有方法,归因结果更贴近真实肿瘤区域。 Conclusion: CIG是一种理论严谨且有效的归因方法,能够提升WSI分析中模型预测的可解释性与可信度,具有在AI辅助病理诊断中广泛应用的潜力。 Abstract: Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). 
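One way to read "contrastive gradients in logit space" is to run standard Integrated Gradients on the difference between a target-class logit and a reference-class logit. The sketch below does exactly that on a toy quadratic model; it is an assumption-laden illustration, not the paper's definition of CIG, and the model, the function names, and the midpoint Riemann sum are all hypothetical choices.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def logits(x: Sequence[float], weights: Matrix) -> List[float]:
    """Toy model (assumed): logit_c = sum_i weights[c][i] * x_i ** 2."""
    return [sum(w * xi * xi for w, xi in zip(row, x)) for row in weights]


def contrast_gradient(x, weights: Matrix, target: int, reference: int) -> List[float]:
    """Analytic gradient of (logit_target - logit_reference) for the toy model."""
    return [
        2.0 * (weights[target][i] - weights[reference][i]) * x[i]
        for i in range(len(x))
    ]


def contrastive_ig(x, baseline, weights, target, reference, steps: int = 50):
    """Integrated Gradients of the logit difference along a straight path."""
    totals = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = contrast_gradient(point, weights, target, reference)
        totals = [t + g for t, g in zip(totals, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


weights = [[0.5, -1.0, 2.0], [1.0, 0.5, 0.0]]
x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = contrastive_ig(x, baseline, weights, target=0, reference=1)
```

A useful sanity check for any variant of this kind is the completeness axiom: the attributions should sum to the change in the contrasted logit difference between the baseline and the input.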
Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics.[168] Generalizable Blood Cell Detection via Unified Dataset and Faster R-CNN
Siddharth Sahay
Main category: cs.CV
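TL;DR: This paper presents a comprehensive methodology for automated classification and detection of peripheral blood cells in microscopic images and, within the Faster R-CNN framework, compares transfer learning with training from scratch; transfer learning markedly improves convergence speed and model stability.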
Details
Motivation: To address the scarcity and heterogeneity of peripheral blood cell image data and improve the accuracy and deployability of automated hematological diagnosis systems. Method: Builds a unified data preprocessing pipeline that merges four public datasets, and uses a Faster R-CNN framework with a ResNet-50-FPN backbone to compare two training strategies: random initialization and transfer learning from COCO pre-trained weights. Result: The transfer learning regimen (Regimen 2) converges faster and is more stable than the baseline, reaching a final validation loss of 0.08666, a substantial improvement over the baseline. Conclusion: Transfer learning effectively improves the training efficiency and performance of PBC detection models, and the proposed unified data processing and modeling approach provides a reliable foundation for automated hematological diagnosis systems. Abstract: This paper presents a comprehensive methodology and comparative performance analysis for the automated classification and object detection of peripheral blood cells (PBCs) in microscopic images. To address the critical challenge of data scarcity and heterogeneity, a robust data pipeline was first developed to standardize and merge four public datasets (PBC, BCCD, Chula, Sickle Cell) into a unified resource. A state-of-the-art Faster R-CNN object detection framework with a ResNet-50-FPN backbone was then employed. Comparative training rigorously evaluated a randomly initialized baseline model (Regimen 1) against a Transfer Learning Regimen (Regimen 2), initialized with weights pre-trained on the Microsoft COCO dataset. The results demonstrate that the Transfer Learning approach achieved significantly faster convergence and superior stability, culminating in a final validation loss of 0.08666, a substantial improvement over the baseline. This validated methodology establishes a robust foundation for building high-accuracy, deployable systems for automated hematological diagnosis.[169] Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Da Li,Yuxiao Luo,Keping Bi,Jiafeng Guo,Wei Yuan,Biao Yang,Yan Wang,Fan Yang,Tingting Gao,Guorui Zhou
Main category: cs.CV
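TL;DR: This paper proposes CoMa, a compressed pre-training method that serves as a warm-up stage for contrastive learning; with only a small amount of data it turns a vision-language model (VLM) into a competitive embedding model, improving both efficiency and effectiveness.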
Details
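Motivation: Existing vision-language models perform well in multimodal representation learning but typically need large-scale contrastive learning to optimize semantic preservation and task-discriminative features simultaneously. The authors argue that these two objectives can be decoupled: first build a comprehensive understanding of the input, then apply contrastive learning to improve downstream performance. They therefore propose a more efficient pre-training strategy. Method: Proposes CoMa, which introduces a compressed pre-training phase as a warm-up step before contrastive learning. This phase uses a small amount of data to preliminarily optimize the VLM so that it better captures the semantic content of the input, providing a better initialization for the subsequent contrastive learning. Result: Experiments show that with only a small amount of pre-training data, CoMa turns a VLM into a competitive embedding model. Among VLMs of comparable size, CoMa achieves new state-of-the-art results on the MMEB benchmark. Conclusion: By decoupling semantic understanding from discriminative learning, CoMa shows that lightweight pre-training can effectively improve a VLM's performance as an embedding model, with clear gains in both efficiency and effectiveness. Abstract: Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input helps the embedding model achieve superior performance in downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform a VLM into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on the MMEB benchmark, realizing optimization in both efficiency and effectiveness.
[170] Fast Multi-Organ Fine Segmentation in CT Images with Hierarchical Sparse Sampling and Residual Transformer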
Xueqi Guo,Halid Ziya Yerebakan,Yoshihisa Shinagawa,Kritika Iyer,Gerardo Hermosillo Valadez
Main category: cs.CV
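TL;DR: Proposes a fast multi-organ segmentation framework based on hierarchical sparse sampling and a Residual Transformer that substantially reduces computational cost while maintaining high accuracy, segmenting in about 2.24 seconds on CPU and showing potential for real-time fine organ segmentation.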
Details
Motivation: Voxel-by-voxel segmentation of 3D medical images is computationally expensive, and existing fast classifiers trade off speed against accuracy, so a more efficient and accurate multi-organ segmentation method is needed. Method: Uses a hierarchical sparse sampling strategy to cut computation while preserving multi-level context, combined with a Residual Transformer network that extracts and fuses multi-level information from the sparse features, achieving efficient segmentation at low computational cost. Result: On an internal dataset of 10,253 CT images and the public TotalSegmentator dataset, the method outperforms the existing fast organ classifier on qualitative and quantitative measures, with a segmentation time of about 2.24 seconds on CPU. Conclusion: The proposed framework greatly shortens segmentation time while maintaining good performance, shows potential for real-time multi-organ segmentation, and suits clinical automation pipelines. Abstract: Multi-organ segmentation of 3D medical images is fundamental, with meaningful applications in various clinical automation pipelines. Although deep learning has achieved superior performance, the time and memory consumption of segmenting the entire 3D volume voxel by voxel using neural networks can be huge. Classifiers have been developed as an alternative in cases with certain points of interest, but the trade-off between speed and accuracy remains an issue. Thus, we propose a novel fast multi-organ segmentation framework using hierarchical sparse sampling and a Residual Transformer. Compared with whole-volume analysis, the hierarchical sparse sampling strategy reduces computation time while preserving a meaningful hierarchical context across multiple resolution levels. The Residual Transformer segmentation network extracts and combines information from different levels of the sparse descriptor while maintaining a low computational cost. On an internal dataset containing 10,253 CT images and the public dataset TotalSegmentator, the proposed method improved qualitative and quantitative segmentation performance compared to the current fast organ classifier, running in ~2.24 seconds on CPU hardware. These results suggest the potential for real-time fine organ segmentation.
[171] CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing
Leonie Bossemeyer,Samuel Heinrich,Grant Van Horn,Oisin Mac Aodha
Main category: cs.CV
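TL;DR: CleverBirds is a large-scale knowledge tracing benchmark for fine-grained bird species recognition, built from data collected by the eBird platform and containing over 17 million multiple-choice questions answered by more than 40,000 participants, designed to study how human visual expertise develops.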
Details
Motivation: Accurately inferring a human learner's knowledge state is a key step toward understanding visual learning, yet modeling the development of expert skill remains challenging. Method: Uses large-scale data collected by the eBird citizen-science platform to build a knowledge tracing benchmark of fine-grained bird identification tasks, and analyzes knowledge tracing performance across participant subgroups and question types. Result: CleverBirds is among the largest benchmarks of its kind, covering over 10,000 bird species with an average of about 400 questions answered per participant, and shows that different forms of contextual information offer varying degrees of predictive benefit. Conclusion: The dataset opens new avenues for long-term study of visual expertise development and supports the development and evaluation of knowledge tracing methods across individuals and over time. Abstract: Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner's knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners' knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.
[172] UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Zhengyang Liang,Daoan Zhang,Huichi Zhou,Rui Huang,Bobo Li,Yuechen Zhang,Shengqiong Wu,Xiaohan Wang,Jiebo Luo,Lizi Liao,Hao Fei
Main category: cs.CV
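TL;DR: This paper presents UniVA, an open-source multi-agent framework that combines video understanding, segmentation, editing, and generation to support complex, iterative video workflows.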
Details
Motivation: Existing AI models usually handle video generation or understanding in isolation and struggle to meet the complex, iterative workflow demands of real-world applications. A general framework that unifies multiple video-processing capabilities is therefore needed. Method: UniVA adopts a Plan-and-Act dual-agent architecture: a planner agent interprets user intent and decomposes it into structured steps, and executor agents carry out the operations through modular, MCP-based tool servers. The system also introduces a multi-level memory (global knowledge, task context, and user preferences) to support long-horizon reasoning and contextual continuity. Result: UniVA enables iterative workflows that freely combine text/image/video-conditioned generation, multi-round editing, object segmentation, and compositional synthesis, with full traceability and self-reflection. The paper also introduces UniVA-Bench for evaluating such multi-step video tasks. Conclusion: UniVA is an open, general-purpose video intelligence framework that advances interactive, agentic, and general-purpose video AI systems; its open-source release should help catalyze next-generation multimodal AI research. Abstract: While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems.
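The Plan-and-Act split and the three memory levels described in this abstract can be illustrated with a toy sketch. This is not UniVA's code or API: every name here (`Memory`, `plan`, `act`, `run`, the `TOOLS` registry, and the string-returning tool stubs) is hypothetical, the planner is a trivial keyword matcher rather than a language model, and the tools are plain local functions rather than MCP-based tool servers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical container for the three memory levels named in the abstract."""
    global_knowledge: Dict[str, str] = field(default_factory=dict)
    task_context: List[str] = field(default_factory=list)  # trace of executed steps
    user_preferences: Dict[str, str] = field(default_factory=dict)


# A tool takes the current artifact plus memory and returns a new artifact.
Tool = Callable[[str, Memory], str]


def generate(artifact: str, mem: Memory) -> str:
    # Stub: a real tool would call a video generation model.
    style = mem.user_preferences.get("style", "default")
    return f"video({artifact}, style={style})"


def edit(artifact: str, mem: Memory) -> str:
    return f"edited({artifact})"  # stub for a video editing tool


def segment(artifact: str, mem: Memory) -> str:
    return f"segmented({artifact})"  # stub for an object segmentation tool


# Hypothetical local registry standing in for modular tool servers.
TOOLS: Dict[str, Tool] = {"generate": generate, "edit": edit, "segment": segment}


def plan(request: str) -> List[str]:
    """Toy planner: keep known tool names in the order they appear in the request."""
    words = request.lower().replace(",", " ").split()
    return [w for w in words if w in TOOLS]


def act(steps: List[str], artifact: str, mem: Memory) -> str:
    """Toy executor: run each planned step and log it to the task context."""
    for step in steps:
        artifact = TOOLS[step](artifact, mem)
        mem.task_context.append(f"{step} -> {artifact}")
    return artifact


def run(request: str, prompt: str, mem: Memory) -> str:
    return act(plan(request), prompt, mem)
```

Calling `run("generate then edit then segment", "a cat", Memory(user_preferences={"style": "noir"}))` chains the three stub tools in order, and `task_context` then holds one entry per executed step, which is the kind of record that makes a workflow traceable.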
Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
[173] Large Sign Language Models: Toward 3D American Sign Language Translation
Sen Zhang,Xiaoxiao He,Di Liu,Zhaoyang Xia,Mingyu Zhao,Chaowei Tan,Vivian Li,Bo Liu,Dimitris N. Metaxas,Mubbasir Kapadia
Main category: cs.CV
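TL;DR: Proposes Large Sign Language Models (LSLM), which use Large Language Models (LLMs) as the backbone to translate directly from 3D American Sign Language data, improving digital communication accessibility for hearing-impaired people.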
Details
Motivation: Existing sign language recognition methods mostly rely on 2D video and struggle to capture spatial, gestural, and depth information; improving translation accuracy and robustness, and extending LLMs to multimodal human language, calls for a translation framework based on 3D data. Method: Takes 3D sign language data as input and uses LLMs as the backbone to translate directly from 3D gesture features to text, and also explores an instruction-guided setting in which external prompts modulate the translation. Result: Validates the effectiveness of LSLM on 3D ASL translation, supporting more accurate and flexible translation and demonstrating the potential of LLMs to process embodied multimodal language. Conclusion: LSLM provides a foundation for understanding diverse forms of language and is an important step toward inclusive multimodal intelligent systems. Abstract: We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals' virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.
[174] 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation
Yunhong He,Zhengqing Yuan,Zhengzhong Tu,Yanfang Ye,Lichao Sun
Main category: cs.CV
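TL;DR: Proposes 3D4D, an interactive 4D visualization framework that combines WebGL with Supersplat rendering to turn static images and text into coherent 4D scenes, supporting efficient real-time multi-modal interaction.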
Details
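Motivation: To enable intuitive, interactive exploration of complex 4D environments and overcome the limits of traditional static visualization in dynamics and user engagement. Method: Designs and implements the 3D4D framework with four core modules, integrating WebGL with Supersplat rendering and adopting a foveated rendering strategy to improve rendering efficiency. Result: The framework efficiently produces real-time 4D visualizations, lets users adaptively explore 4D scenes, and improves the multi-modal interaction experience. Conclusion: 3D4D offers an efficient, flexible solution for interactive visualization of 4D content with broad application potential. Abstract: We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.
[175] RePose-NeRF: Robust Radiance Fields for Mesh Reconstruction under Noisy Camera Poses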
Sriram Srinivasan,Gautam Ramachandra
Main category: cs.CV
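TL;DR: Proposes a robust framework that reconstructs high-quality, editable 3D meshes directly from multi-view images with noisy extrinsic parameters, combining camera pose refinement with implicit scene representation learning to improve practicality in robotic applications.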
Details
Motivation: Existing NeRF methods depend on accurate camera poses and use implicit volumetric representations, which makes them hard to apply robustly in real-world scenes and incompatible with standard 3D software. Method: Jointly refines camera poses while learning an implicit scene representation, from which high-fidelity 3D meshes are generated. Result: Experiments on standard benchmarks confirm the method's accuracy and robustness under pose uncertainty; the resulting meshes are compatible with common 3D graphics and robotics tools. Conclusion: The method bridges the gap between neural implicit representations and practical robotic applications, enabling efficient, editable 3D reconstruction. Abstract: Accurate 3D reconstruction from multi-view images is essential for downstream robotic tasks such as navigation, manipulation, and environment understanding. However, obtaining precise camera poses in real-world settings remains challenging, even when calibration parameters are known. This limits the practicality of existing NeRF-based methods that rely heavily on accurate extrinsic estimates. Furthermore, their implicit volumetric representations differ significantly from the widely adopted polygonal meshes, making rendering and manipulation inefficient in standard 3D software. In this work, we propose a robust framework that reconstructs high-quality, editable 3D meshes directly from multi-view images with noisy extrinsic parameters. Our approach jointly refines camera poses while learning an implicit scene representation that captures fine geometric detail and photorealistic appearance. The resulting meshes are compatible with common 3D graphics and robotics tools, enabling efficient downstream use. Experiments on standard benchmarks demonstrate that our method achieves accurate and robust 3D reconstruction under pose uncertainty, bridging the gap between neural implicit representations and practical robotic applications.
[176] Vision Transformer Based User Equipment Positioning
Parshwa Shah,Dhaval K. Patel,Brijesh Soni,Miguel López-Benítez,Siddhartan Govindasamy
Main category: cs.CV
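TL;DR: Proposes an attention-based Vision Transformer (ViT) architecture for user equipment positioning from the Angle Delay Profile (ADP) derived from the Channel State Information (CSI) matrix, improving on existing methods by about 38%.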
Details
Motivation: Existing deep learning positioning models give equal attention to the entire input and are poorly suited to non-sequential data such as instantaneous CSI, which limits their UE positioning performance. Method: Adopts an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) extracted from the CSI matrix, suits non-sequential input, and is validated on the DeepMIMO and ViWi ray-tracing datasets. Result: RMSE is 0.55 m in the DeepMIMO indoor scenario, 13.59 m outdoors, and 3.45 m in the ViWi outdoor blockage scenario; the error-distance distribution is better than that of the compared methods, and overall performance exceeds existing schemes by about 38%. Conclusion: The proposed ViT architecture effectively exploits ADP features for high-accuracy UE positioning, is particularly suited to non-sequential CSI input, and significantly outperforms existing methods across several scenarios. Abstract: Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they give equal attention to the entire input; ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) from the CSI matrix. Our approach, validated on the DeepMIMO and ViWi ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55 m indoors and 13.59 m outdoors in DeepMIMO, and 3.45 m in ViWi's outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by approximately 38%. It also performs substantially better than the other approaches we considered in terms of the distribution of error distance.
[177] SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
Shanaka Liyanaarachchi,Chathurya Wijethunga,Shihab Aaquil Ahamed,Akthas Absar,Ranga Rodrigo
Main category: cs.CV
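TL;DR: Proposes SENCA-st, a new architecture that uses a shared encoder and a neighborhood cross-attention mechanism to effectively integrate spatial transcriptomics and histopathology image data, markedly improving detection of tumor heterogeneity and tumor microenvironment regions.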