cs.CL
[1] MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch
Nikolay Banar,Ehsan Lotfi,Jens Van Nooten,Cristina Arhiliuc,Marija Kliocaite,Walter Daelemans
Main category: cs.CL
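TL;DR: This paper introduces new resources for evaluating and generating Dutch embeddings: the MTEB-NL benchmark, a training dataset, and the E5-NL embedding models.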
Details
Motivation: Dutch is underrepresented in multilingual resources and lacks dedicated evaluation benchmarks and training data, which limits the development of Dutch embedding models.
Method: The authors build the MTEB-NL evaluation benchmark from existing and newly created datasets; assemble a training set from publicly available Dutch retrieval datasets plus synthetic data generated by large language models; and fine-tune E5-based models to obtain the E5-NL embedding models.
Result: They release the MTEB-NL benchmark, the augmented training dataset, and a series of compact, efficient E5-NL embedding models, all publicly available through the Hugging Face Hub and the MTEB package.
Conclusion: The new resources fill a gap in Dutch embedding research and support further development of Dutch text embeddings.
Abstract: Recently, embedding resources, including models, benchmarks, and datasets, have been widely released to support a variety of languages. However, the Dutch language remains underrepresented, typically comprising only a small fraction of the published multilingual resources. To address this gap and encourage the further development of Dutch embeddings, we introduce new resources for their evaluation and generation. First, we introduce the Massive Text Embedding Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and newly created ones, covering a wide range of tasks. Second, we provide a training dataset compiled from available Dutch retrieval datasets, complemented with synthetic data generated by large language models to expand task coverage beyond retrieval. Finally, we release a series of E5-NL models, compact yet efficient embedding models that demonstrate strong performance across multiple tasks. We make our resources publicly available through the Hugging Face Hub and the MTEB package.
[2] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
Matteo Marcuzzo,Alessandro Zangari,Andrea Albarelli,Jose Camacho-Collados,Mohammad Taher Pilehvar
Main category: cs.CL
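TL;DR: MORABLES is a human-verified benchmark built from fables and short stories for evaluating the moral reasoning of large language models. Results show that even larger models are susceptible to adversarial manipulation and rely mainly on superficial patterns rather than genuine reasoning.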
Details
Motivation: As LLMs excel on standard reading comprehension tasks, attention is shifting to evaluating complex abstract reasoning, especially deeper comprehension and moral reasoning.
Method: The authors build MORABLES, a benchmark of fables and short stories drawn from historical literature, with multiple-choice questions that test moral inference and adversarial variants that probe model robustness.
Result: Larger models perform better but remain vulnerable to adversarial manipulation and often rely on shallow patterns. The best models contradict their own answers in roughly 20% of cases depending on how the moral choice is framed, and reasoning-enhanced models do not close this gap.
Conclusion: Current LLMs still have fundamental shortcomings in moral reasoning; performance gains are driven mainly by scale rather than genuine reasoning ability, which exposes their brittleness in deep semantic understanding.
Abstract: As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.
[3] LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
Anu Pradhan,Alexandra Ortan,Apurv Verma,Madhavan Seshadri
Main category: cs.CL
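TL;DR: This paper examines whether LLMs can serve as judges of retrieval-augmented generation systems in the legal domain. It proposes Gwet's AC2 and rank correlation coefficients as more reliable agreement metrics, and the Wilcoxon signed-rank test with Benjamini-Hochberg correction for statistical comparison between systems, yielding an efficient, low-cost automated evaluation framework that meets the precision demands of legal applications.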
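The Benjamini-Hochberg correction named in this entry can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `benjamini_hochberg` and the p-values are made up for the example, and in practice each p-value would come from a paired test such as a Wilcoxon signed-rank test on per-query scores of two systems.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted_p_values, reject_flags) in the original input order.

    Hypothetical helper for illustration; implements the standard
    Benjamini-Hochberg step-up adjustment for controlling the false
    discovery rate across several system comparisons.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    reject = [q <= alpha for q in adjusted]
    return adjusted, reject


# Made-up p-values for five pairwise system comparisons (assumption).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted, reject = benjamini_hochberg(p_values)
print(adjusted, reject)
```

With these illustrative inputs, only the first two comparisons remain significant at `alpha=0.05` after correction, even though four of the raw p-values are below 0.05.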
Details
Motivation: With the rise of generative AI, traditional recommendation-system evaluation methods struggle to capture nuanced quality in specialized domains such as legal research, so a reliable and efficient evaluation approach is needed to relieve the evaluation bottleneck.
Method: Through systematic experiments, the authors compare how well different metrics (Krippendorff's alpha, Gwet's AC2, rank correlation coefficients) capture agreement between LLM and human assessments, and use the Wilcoxon Signed-Rank Test with Benjamini-Hochberg correction to test for significant performance differences between systems.
Result: Traditional agreement metrics such as Krippendorff's alpha can be misleading in AI evaluation, whereas Gwet's AC2 and rank correlation coefficients reflect LLM-human agreement more accurately. The Wilcoxon Signed-Rank Test with multiple-testing correction gives reliable system comparisons.
Conclusion: LLM-as-a-Judge has practical potential for evaluating recommendation systems in the legal domain; combined with suitable reliability metrics and statistical tests, it supports an automated evaluation framework that is both efficient and rigorous.
Abstract: The evaluation bottleneck in recommendation systems has become particularly acute with the rise of Generative AI, where traditional metrics fall short of capturing nuanced quality dimensions that matter in specialized domains like legal research. Can we trust Large Language Models to serve as reliable judges of their own kind? This paper investigates LLM-as-a-Judge as a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts, where the stakes of recommendation quality are exceptionally high. We tackle two fundamental questions that determine practical viability: which inter-rater reliability metrics best capture the alignment between LLM and human assessments, and how do we conduct statistically sound comparisons between competing systems? Through systematic experimentation, we discover that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Instead, Gwet's AC2 and rank correlation coefficients emerge as more robust indicators for judge selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections provides the statistical rigor needed for reliable system comparisons. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications, transforming what was once a human-intensive bottleneck into an automated, yet statistically principled, evaluation framework.
[4] SENTRA: Selected-Next-Token Transformer for LLM Text Detection
Mitchell Plyler,Yilun Zhang,Alexander Tuzhilin,Saoud Khalifah,Sen Tian
Main category: cs.CL
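TL;DR: This paper proposes SElected-Next-Token tRAnsformer (SENTRA), a general-purpose, supervised, Transformer-based detector of LLM-generated text. It uses selected-next-token-probability sequences and contrastive pre-training, and significantly outperforms existing baselines in out-of-domain settings.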
Details
Motivation: As LLMs grow more capable and widespread, the risk of misuse grows with them, and effective means of identifying undeclared LLM-generated text are needed.
Method: The authors propose SElected-Next-Token tRAnsformer (SENTRA), a Transformer-based encoder that takes selected-next-token-probability sequences as input and is contrastively pre-trained on large amounts of unlabeled data to improve detection.
Result: Experiments on three popular public datasets covering 24 text domains show that SENTRA significantly outperforms popular baselines in the out-of-domain setting, indicating good generalization.
Conclusion: SENTRA is an efficient, general-purpose method for detecting LLM-generated text, with robustness and broad applicability in practice.
Abstract: LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.
[5] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
Wen-wai Yim,Asma Ben Abacha,Zixuan Yu,Robert Doerning,Fei Xia,Meliha Yetisgen
Main category: cs.CL
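TL;DR: This paper presents MORQA, a new multilingual benchmark for assessing natural language generation (NLG) evaluation metrics in the medical domain. It comprises three medical visual and text-based QA datasets in English and Chinese, with 2-4 gold-standard answers per question written by medical professionals, plus human ratings. LLM-based evaluators such as GPT-4 and Gemini correlate significantly better with expert judgments than traditional metrics such as BLEU and ROUGE, with semantic sensitivity and robustness to reference-answer variability identified as key factors.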
Details
Motivation: Traditional automatic metrics struggle to distinguish output quality in medical NLG tasks, especially in open-ended question answering where several valid answers may exist, so evaluation methods closer to expert human judgment are needed.
Method: The authors build MORQA, a multilingual medical QA benchmark with multiple expert-written gold-standard answers and human ratings, covering three datasets in English and Chinese, and systematically measure how well traditional metrics and LLM-based evaluators correlate with expert ratings.
Result: LLM-based evaluators (such as GPT-4 and Gemini) correlate significantly better with expert ratings than traditional automatic metrics, showing stronger semantic understanding and better tolerance of variation among reference answers.
Conclusion: In medical NLG evaluation, LLM-based evaluators come closer to expert human judgment than traditional metrics, and future work should develop human-aligned, semantically sensitive evaluation methods. All data will be publicly released to support follow-up research.
Abstract: Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs' sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.
[6] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Jiayi He,Yangmin Huang,Qianyun Du,Xiangying Zhou,Zhiyang He,Jiaxue Hu,Xiaodong Tao,Lixian Lai
Main category: cs.CL
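TL;DR: This paper presents MedFact, a new benchmark for Chinese medical fact-checking with 2,116 expert-annotated instances covering 13 medical specialties and multiple error types. It is built with a hybrid AI-human framework. Evaluation shows that current LLMs still lag well behind humans at precisely localizing errors and exhibit an "over-criticism" phenomenon.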
Details
Motivation: Existing medical fact-checking benchmarks cover narrow data domains and fail to reflect the complexity of real-world medical information, so a more comprehensive and challenging benchmark is needed to test the factual reliability of LLMs in healthcare.
Method: The authors propose the MedFact benchmark, built with a hybrid human-AI framework in which iterative expert feedback refines an AI-driven, multi-criteria filtering process. They evaluate 20 leading LLMs on veracity classification and error localization and compare them with a human expert baseline.
Result: Models can often judge whether a text contains an error but perform poorly at localizing it precisely, falling well short of human performance. Models also show a widespread "over-criticism" tendency, misjudging correct information as erroneous, and this worsens with advanced reasoning techniques such as multi-agent collaboration.
Conclusion: MedFact provides a high-quality, multi-level benchmark for Chinese medical fact-checking, exposes key challenges for deploying current LLMs in medical applications, and should help drive the development of more factually reliable and medically aware models.
Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
[7] Topic Coverage-based Demonstration Retrieval for In-Context Learning
Wonbin Kweon,SeongKu Kang,Runchu Tian,Pengcheng Jiang,Jiawei Han,Hwanjo Yu
Main category: cs.CL
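TL;DR: This paper proposes TopicK, a topic coverage-based retrieval framework that improves in-context learning by selecting demonstrations that comprehensively cover the topic-level knowledge required by the test input and the model.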
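The selection idea in this entry (iteratively pick demonstrations that add required topics not yet covered, favoring topics the model knows poorly) can be illustrated with a greedy loop. This is a rough sketch under stated assumptions, not the paper's algorithm: the function `select_demonstrations`, the gain formula `relevance * (1 - knowledge)`, and all topic names and scores below are hypothetical.

```python
def select_demonstrations(required, model_knowledge, candidates, k):
    """Greedily pick up to k demonstrations by weighted topic coverage.

    required:        dict topic -> relevance to the test input (0..1)
    model_knowledge: dict topic -> estimated model knowledge (0..1)
    candidates:      dict demo_id -> set of topics the demo covers
    The gain formula is an illustrative assumption, not the paper's score.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)

    def gain(topics):
        # Reward only required topics that are still uncovered,
        # weighted up when the model's knowledge of the topic is low.
        return sum(
            required[t] * (1.0 - model_knowledge.get(t, 0.0))
            for t in topics
            if t in required and t not in covered
        )

    for _ in range(k):
        best = max(remaining, key=lambda d: gain(remaining[d]), default=None)
        if best is None or gain(remaining[best]) <= 0.0:
            break  # nothing left that adds new required coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Made-up topics and scores for a single test input (assumption).
required = {"dosage": 1.0, "side_effects": 0.8, "pricing": 0.3}
model_knowledge = {"dosage": 0.9, "side_effects": 0.2, "pricing": 0.5}
candidates = {
    "demo_a": {"dosage"},
    "demo_b": {"side_effects", "pricing"},
    "demo_c": {"side_effects"},
}
chosen = select_demonstrations(required, model_knowledge, candidates, k=2)
print(chosen)
```

In this toy run, `demo_c` is skipped in the second round because its only topic was already covered by `demo_b`, which is the redundancy that pure similarity-based retrieval would not avoid.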
Details
Motivation: Existing methods retrieve demonstrations based solely on embedding similarity or generation probability, which often yields irrelevant or redundant examples that fail to meet the fine-grained knowledge requirements of the test input.
Method: TopicK estimates the topics required by the input and assesses the model's knowledge of those topics, then iteratively selects demonstrations that introduce required topics that are not yet covered and on which the model's knowledge is weak.
Result: Experiments across multiple datasets and both open- and closed-source LLMs show that TopicK significantly improves in-context learning performance.
Conclusion: Through its topic coverage mechanism, TopicK improves the relevance and diversity of demonstrations and strengthens LLM performance in in-context learning.
Abstract: The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
[8] Does Language Model Understand Language?
Suvojit Acharjee,Utathya Aich,Asfak Ali
Main category: cs.CL
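TL;DR: This study evaluates how well several state-of-the-art language models understand fine-grained linguistic phenomena such as tense, negation, voice, and modality in English and Bengali. It introduces the LUCID dataset and a new metric, HCE accuracy, and finds that Compound-Beta is the most balanced model in cross-lingual settings, with the highest agreement with human judgments.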
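HCE accuracy is described in this entry as how often a model's prediction falls within one standard deviation of the mean human rating. A small sketch of that definition follows. The function name `hce_accuracy`, the use of the population standard deviation, and the scores are assumptions made for illustration; the paper may compute the deviation differently.

```python
from statistics import mean, pstdev


def hce_accuracy(model_scores, human_ratings):
    """Fraction of items whose model score lies within one standard
    deviation of the mean human rating for that item.

    model_scores:  list of floats, one prediction per item
    human_ratings: list of lists, the human ratings for each item
    Uses the population standard deviation (an assumption).
    """
    hits = 0
    for pred, ratings in zip(model_scores, human_ratings):
        mu = mean(ratings)
        sigma = pstdev(ratings)
        if abs(pred - mu) <= sigma:
            hits += 1
    return hits / len(model_scores)


# Made-up similarity scores for three sentence pairs (assumption).
model_scores = [4.0, 2.0, 3.5]
human_ratings = [[4, 5, 3], [1, 1, 2], [5, 5, 4]]
accuracy = hce_accuracy(model_scores, human_ratings)
print(accuracy)
```

Unlike mean absolute error, this measure does not penalize a prediction for missing the mean as long as it stays inside the spread of the human raters, which is how the entry frames its tolerance for variability.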
Details
Motivation: Language models still fall short on subtle linguistic phenomena, and in the context of United Nations Sustainable Development Goal 4 (quality education), linguistic clarity is critical in educational technology, so model performance on key language comprehension tasks needs systematic evaluation.
Method: The authors propose a new evaluation framework, Route for Evaluation of Cognitive Inference in Systematic Environments, and build the LUCID dataset of carefully crafted English and Bengali sentence pairs. They evaluate SOTA models including MISTRAL-SABA-24B, the LLaMA series, Gemma2-9B, and Compound-Beta using Pearson correlation, Spearman correlation, Mean Absolute Error, and the newly proposed HCE accuracy, which measures how often model predictions fall within one standard deviation of the mean human rating.
Result: Compound-Beta achieves the highest Pearson correlation in English and performs robustly on mixed-language data, with the highest overall correlations and the lowest MAE. Its HCE accuracy indicates that its predictions fall closest to the range of variability in human judgments.
Conclusion: Compound-Beta is currently the most balanced model for understanding complex linguistic phenomena in cross-lingual settings, and metrics such as HCE that better reflect human tolerance in language interpretation can support the reliable use of language models in demanding settings such as education.
Abstract: Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena such as tense, negation, voice, and modality, which are central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct an evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce a new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, and voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as a novel, linguistically inspired metric, HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human-like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross-lingual scenarios.
[9] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction
Sumanta Bhattacharyya,Sara Riaz,Pedram Rooshenas
Main category: cs.CL
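TL;DR: The paper proposes Reason-Refine-then-Align (R2tA), which refines reasoning traces generated by large models and uses them as supervision to train small task-specific reasoning models, enabling efficient and reproducible adaptation in data-scarce settings.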
Details
Motivation: Training a small task-specific reasoning model is hard when high-quality human labels or direct supervision are lacking. LLMs with reasoning capabilities, however, produce abundant intermediate reasoning that can be used to build effective supervision signals.
Method: R2tA has three steps. First, an open-source base model generates initial reasoning and responses on task inputs. Second, these reasoning traces are refined to fix hallucinations and inconsistencies, forming a high-fidelity dataset. Third, a two-stage alignment, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), aligns the model's intermediate reasoning with human-validated conceptual preferences and conditions the final output on that aligned reasoning.
Result: On the structurally complex task of evaluating extended entity relationship diagrams (EERDs) in database design, prompt-only methods miss or hallucinate errors, which R2tA is designed to avoid. The authors curate a dataset of 600 EERD variants (train/test split of 450/150) covering 11 error categories. Empirical results suggest the method offers a scalable, low-cost path to LLM adaptation in data-scarce domains.
Conclusion: R2tA offers a practical, cost-effective way to train small specialized reasoning models for tasks with scarce labeled data, with potential for building reproducible AI tools in education and other fields.
Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model's intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
[10] FunAudio-ASR Technical Report
Keyu An,Yanni Chen,Chong Deng,Changfeng Gao,Zhifu Gao,Bo Gong,Xiangang Li,Yabin Li,Xiang Lv,Yunjie Ji,Yiheng Jiang,Bin Ma,Haoneng Luo,Chongjia Ni,Zexu Pan,Yiping Peng,Zhendong Peng,Peiyao Wang,Hao Wang,Wen Wang,Wupeng Wang,Biao Tian,Zhentao Tan,Nan Yang,Bin Yuan,Jieping Ye,Jixing Yu,Qinglin Zhang,Kun Zou,Han Zhao,Shengkui Zhao,Jingren Zhou
Main category: cs.CL
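TL;DR: This paper presents FunAudio-ASR, a large-scale LLM-based speech recognition system that combines data scale, model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance in real-world application scenarios.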
Details
Motivation: Existing LLM-based speech recognition systems perform well on open-source benchmarks but often underperform on real industry datasets, and they are prone to hallucination, which degrades user experience.
Method: FunAudio-ASR combines massive data, large model capacity, deep integration with large language models, and reinforcement learning, and is optimized for practical requirements such as streaming, noise robustness, code-switching, and hotword customization.
Result: Experiments show that FunAudio-ASR clearly outperforms other LLM-based ASR systems on real industry evaluation sets, achieving SOTA performance and demonstrating effectiveness and robustness in complex practical scenarios.
Conclusion: Through production-oriented optimizations, FunAudio-ASR improves speech recognition performance across diverse real-world scenarios while remaining practical, offering an effective solution for industrial-grade ASR systems.
Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
[11] A comparison of pipelines for the translation of a low resource language based on transformers
Chiara Bonfanti,Michele Colombino,Giulia Coucourde,Faeze Memari,Stefano Pinardi,Rosa Meo
Main category: cs.CL
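TL;DR: This paper compares three pipelines for training Transformer-based neural networks for French-to-Bambara translation and finds that the simplest pipeline achieves the best translation accuracy on several datasets.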
Details
Motivation: To improve machine translation for Bambara, a low-resource language, by exploring how effective different training strategies are.
Method: The first pipeline trains a simple Transformer model; the second fine-tunes LLaMA3 instructor models; the third uses language distillation combined with LaBSE and a BERT extension.
Result: The first pipeline performs best, reaching 10% BLEU and 21% chrF on Bayelemagaba and 33.81% BLEU and 41% chrF on the newly created Yiri dataset. Instructor models perform better on single datasets than on aggregated collections.
Conclusion: Despite its simpler structure, the first pipeline performs best under low-resource conditions, showing that complex models do not necessarily beat basic ones.
Abstract: This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandè language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.
[12] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
Vijay Govindarajan,Pratik Patel,Sahil Tripathi,Md Azizul Hoque,Gautam Siddharth Kashyap
Main category: cs.CL
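TL;DR: The paper proposes a zero-shot automated audio captioning (AAC) system that uses a pre-trained audio CLIP model to extract features and build a structured prompt, which guides a large language model to generate captions aligned with the audio content. Using MAGIC search with the WavCaps model raises the NLG mean score by 35%.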
Details
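Motivation: Automated audio captioning (AAC) is held back by limited datasets, and existing methods depend on large amounts of training data, which restricts their use. A method that produces high-quality audio captions without extensive training is therefore needed.
Method: A pre-trained audio CLIP model extracts auditory features and generates a structured prompt that guides a large language model (LLM) in caption generation. The audio CLIP model also refines token selection so that the generated text stays aligned with the audio content, improving on traditional greedy decoding.
Result: Using MAGIC search with the WavCaps model raises the NLG mean score from 4.7 to 7.3 (reported as a 35% improvement). Performance depends heavily on the audio-text matching model and keyword selection: a single keyword prompt works best, and performance drops by 50% when no keyword list is used.
Conclusion: The proposed zero-shot AAC system uses pre-trained models to generate high-quality audio captions with less reliance on large annotated datasets; keyword selection and audio-text matching quality are critical to its performance.
Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.
[13] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving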
Mukai Li,Linfeng Song,Zhenwen Liang,Jiahao Xu,Shansan Gong,Qi Liu,Haitao Mi,Dong Yu
Main category: cs.CL
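TL;DR: This paper systematically compares the efficiency of different test-time scaling strategies in automated theorem proving and proposes the EconRL pipeline, which uses dynamic CoT switching and diverse parallel-scaled reinforcement learning to cut computational cost substantially while maintaining performance.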
Details
Motivation: In automated theorem proving, current LLMs incur heavy computational overhead from reflective chain-of-thought reasoning and increased sampling passes, and existing analyses neglect the cost differences between scaling strategies.
Method: The authors propose a dynamic Chain-of-Thought switching mechanism and diverse parallel-scaled reinforcement learning with trainable prefixes, and integrate both into a unified EconRL pipeline.
Result: Experiments on miniF2F and ProofNet show that EconProver achieves performance comparable to baseline methods with only 12% of the computational cost.
Conclusion: The proposed EconRL pipeline effectively reduces token usage and sampling passes, enabling deployment of lightweight automated theorem proving models without sacrificing performance.
Abstract: Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
[14] Positional Encoding via Token-Aware Phase Attention
Yu Wang,Sheng Shen,Rémi Munos,Hongyuan Zhan,Yuandong Tian
Main category: cs.CL
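TL;DR: This paper proposes TAPA, a new positional encoding method that addresses the distance-dependent bias of RoPE in long-range modeling, with better extrapolation and lower perplexity.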
Details
Motivation: RoPE has an intrinsic distance-dependent bias that limits its ability to model long contexts, and existing extension methods require adjustments after pretraining.
Method: The authors introduce a learnable phase function into the attention mechanism and propose Token-Aware Phase Attention (TAPA) as a new positional encoding method.
Result: TAPA preserves long-range token interactions, extends to longer contexts with light fine-tuning, extrapolates to unseen lengths, and achieves significantly lower perplexity on long-context tasks.
Conclusion: TAPA outperforms RoPE and its variants, and is a more effective, flexible, and easily extended positional encoding method.
Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than the RoPE family.
[15] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition
Li Fu,Yu Xin,Sunlu Zeng,Lu Fan,Youzheng Wu,Xiaodong He
Main category: cs.CL
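TL;DR: This paper proposes a Pronunciation-Aware Contextualized (PAC) framework that addresses two key problems in LLM-based speech recognition systems: effective pronunciation modeling and robust homophone discrimination.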
Details
Motivation: In LLM-based automatic speech recognition systems, modeling pronunciation effectively and distinguishing homophones are key challenges for recognizing rare or long-tail words.
Method: The approach uses a two-stage learning paradigm. The first stage introduces pronunciation-guided context learning with an interleaved grapheme-phoneme modeling strategy. The second stage proposes pronunciation-discriminative reinforcement learning with perturbed label sampling to strengthen homophone discrimination.
Result: On Librispeech and AISHELL-1, PAC reduces relative word error rate by 30.2% and 53.8% compared to pre-trained LLM-based ASR baselines, and reduces biased word error rate for long-tail words by 31.8% and 60.5%, respectively.
Conclusion: The PAC framework substantially improves pronunciation modeling and homophone discrimination in LLM-based speech recognition, with particularly strong gains on long-tail words.
Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for rare or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
[16] Don't Change My View: Ideological Bias Auditing in Large Language Models
Paul Kröger,Emilio Barkett
Main category: cs.CL
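TL;DR: This paper proposes a model-agnostic statistical method for detecting whether an LLM has been intentionally steered toward a particular ideological position. It identifies potential steering by analyzing shifts in the output distribution across thematically related prompts, and is suitable for auditing proprietary black-box systems.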
Details
Motivation: LLMs are widely embedded in products that shape public opinion. If their behavior can be deliberately steered toward specific ideologies, those who control the systems could gain disproportionate influence over public discourse, so methods for detecting such steering are urgently needed.
Method: The authors adapt a previously proposed statistical method that detects ideological shifts by analyzing changes in the distribution of model outputs across thematically related prompts. It requires no access to model internals and is therefore model-agnostic.
Result: Experiments validate the method's effectiveness at detecting ideological steering, demonstrate its practical applicability, and show that it can support independent post hoc audits of LLMs.
Conclusion: The proposed method can detect attempts at ideological steering in LLMs and is especially suited to auditing commercial black-box models whose internals are inaccessible, providing a technical tool for protecting the fairness of public discourse.
Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
[17] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations
Yougen Zhou,Qin Chen,Ningning Zhou,Jie Zhou,Xingjiao Wu,Liang He
Main category: cs.CL
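TL;DR: The paper proposes a reinforcement learning method with a dual reward function to address the preference bias of LLMs in strategy planning for emotional support conversation.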
Details
Motivation: LLMs plan strategies with low accuracy in emotional support conversation and show a marked preference bias toward specific strategies, the root causes of which have not been studied in depth.
Method: The authors expose the source of the bias by identifying the knowledge boundaries of LLMs in strategy planning, and propose a reinforcement learning method with a dual reward that combines accuracy with entropy-based confidence to mitigate the bias.
Result: Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that the method outperforms the baselines and improves strategy planning performance.
Conclusion: The proposed knowledge-boundary-based dual-reward reinforcement learning method effectively reduces the strategy preference bias of LLMs in emotional support conversation and improves dialogue quality.
Abstract: Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.
[18] Chat-Driven Text Generation and Interaction for Person Retrieval
Zequn Xie,Chuxin Wang,Sihang Cai,Yeqiang Wang,Shulei Wang,Tao Jin
Main category: cs.CL
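TL;DR: Proposes an annotation-free framework for generating and interacting with text descriptions, which uses a multi-turn dialogue mechanism to improve text-based person retrieval.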
Details
Motivation: Textual annotations are difficult and time-consuming to obtain, which limits the scalability and practicality of text-based person retrieval systems.
Method: Introduces Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI) modules: MTG uses MLLMs in simulated dialogues to generate fine-grained, diverse pseudo-labels; MTI refines user queries at inference time through dynamic, dialogue-based reasoning to handle vague or incomplete descriptions.
Result: Achieves competitive or superior retrieval accuracy on large-scale databases while markedly improving robustness to vague queries and overall usability.
Conclusion: The proposed annotation-free framework removes the text annotation bottleneck and offers a practical path to deploying text-based person retrieval systems.
Abstract: Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
[19] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
Shaz Furniturewala,Arkaitz Zubiaga
Main category: cs.CL
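TL;DR: This study uses mechanistic interpretability techniques to identify vulnerable components in toxicity classifiers and proposes a new strategy that suppresses these vulnerable circuits to make the models more robust to adversarial attacks, while also exposing fairness and robustness gaps across demographic groups.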
Details
Motivation: As LLM-generated content grows, traditional moderation systems trained on human-written text perform poorly on LLM-generated text and under adversarial attack, and existing defences are mostly reactive rather than proactive.
Method: Focuses on fine-tuned BERT and RoBERTa toxicity classifiers, uses adversarial attack techniques to identify vulnerable circuits, analyzes these structural weaknesses with mechanistic interpretability methods, and then suppresses the vulnerable components. Experiments use diverse datasets spanning a variety of minority groups.
Result: Models contain distinct attention heads that are either crucial for performance or vulnerable to attack, and suppressing the vulnerable heads improves performance on adversarial input. Different heads are responsible for vulnerability across different demographic groups, which exposes fairness gaps between groups.
Conclusion: Identifying and suppressing vulnerable circuits in toxicity classifiers makes the models more robust to adversarial attacks and can also guide the development of fairer, more inclusive detection models.
Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.
[20] Case-Based Decision-Theoretic Decoding with Quality Memories
Hiroyuki Deguchi,Masaaki Nagata
Main category: cs.CL
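TL;DR: Proposes CBDT, a new decoding method that uses examples of domain data to estimate expected utility; it generates higher-quality text than MAP decoding, and combined with MBR decoding it outperforms MBR alone on several translation and image captioning tasks.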
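For readers unfamiliar with the expected-utility rule that this entry builds on, here is a minimal sketch of standard MBR selection: each candidate is scored by its average utility against a pool of pseudo-references, and the highest-scoring candidate is returned. The unigram F1 utility and the function names are illustrative assumptions, not the paper's metric. According to the abstract, CBDT instead estimates the expected utility from examples of domain data; how it weights those examples is not stated in this digest and is not shown here.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace-separated tokens (an assumption)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(hypotheses, pseudo_references, utility=unigram_f1):
    """Return the hypothesis with the highest average utility."""
    def expected_utility(hyp):
        total = sum(utility(hyp, ref) for ref in pseudo_references)
        return total / len(pseudo_references)
    return max(hypotheses, key=expected_utility)

# In plain MBR the model's own samples serve as both candidates and references.
samples = ["the cat sat down", "the cat sat", "a dog ran", "the cat"]
best = mbr_select(samples, samples)
```

In this toy pool the candidate closest to the consensus of the samples wins, which is the behaviour that makes MBR sensitive to what the model happens to sample.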
Details
Motivation: MBR decoding depends on texts sampled from the model, so it struggles to capture the relevant knowledge or information for out-of-domain data; a more effective way to improve the quality of generated text is needed.
Method: Proposes case-based decision-theoretic (CBDT) decoding, which estimates expected utility from examples of domain data and can be combined with MBR.
Result: CBDT decoding outperforms MAP decoding, and the combination of MBR and CBDT outperforms MBR alone, on seven domain De-En and Ja↔En translation tasks and on MSCOCO and nocaps image captioning.
Conclusion: CBDT decoding is an effective decision rule for text generation that improves generation quality across domains, especially when combined with MBR.
Abstract: Minimum Bayes risk (MBR) decoding is a decision rule of text generation, which selects the hypothesis that maximizes the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures out-of-domain knowledge or information. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but also the combination of MBR and CBDT decoding outperformed MBR decoding in seven domain De--En and Ja$\leftrightarrow$En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
[21] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events
Biswadip Mandal,Anant Khandelwal,Manish Gupta
Main category: cs.CL
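TL;DR: Presents HistoryBank, a multilingual database of more than 10 million historical events, and builds a benchmark covering 6 temporal question answering tasks to evaluate the multilingual temporal reasoning of large language models.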
Details
Motivation: Existing temporal reasoning datasets are limited in scale, lack multilingual coverage, and lean toward contemporary events, so they cannot properly evaluate how well LLMs reason about the timing of historical events.
Method: Extracts more than 10 million historical events from Wikipedia timeline pages and infoboxes to build the multilingual HistoryBank database, designs a benchmark covering six types of temporal question answering tasks, and evaluates several popular language models on it.
Result: GPT4o performs best across all languages and answer types; Gemma-2 outperforms the other small language models; the database covers 10 languages, with greater historical depth and linguistic breadth than prior resources.
Conclusion: HistoryBank is a valuable resource for multilingual temporal understanding of historical events and should help advance temporal reasoning and related NLP tasks.
Abstract: Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.
[22] Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
Omri Suissa,Muhiim Ali,Shengmai Chen,Yinuo Cai,Shekhar Pradhan
Main category: cs.CL
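TL;DR: Investigates how well vision-language models (VLMs) abstract concepts from images and proposes a new contrastive learning approach: with a grouped image-caption dataset (MAGIC) and a two-level contrastive loss, the resulting model (CLEAR GLASS) develops concept abstraction as an emergent capacity without being directly exposed to higher-level concept labels. Experiments show improved abstract concept recognition compared with existing state-of-the-art models.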
Details
Motivation: To explore whether vision-language models have a human-like capacity to abstract concepts from images, and to improve how such models understand and represent higher-level semantic concepts in images.
Method: Builds a grouped image-caption dataset, MAGIC, and trains with a two-level contrastive loss: the outer loss contrasts text-image groups, while the inner loss measures distances between image-caption instances within a group. This pushes the model to learn the semantics shared within each group and thereby capture higher-level concepts implicitly.
Result: The CLEAR GLASS model outperforms current state-of-the-art models on abstract concept recognition, showing an emergent concept abstraction capacity that arises from the training method.
Conclusion: A grouped contrastive learning strategy can improve a VLM's understanding of abstract concepts without explicit higher-level concept labels, which indicates that concept abstraction can be induced through training.
Abstract: Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM (the CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.
[23] ConvergeWriter: Data-Driven Bottom-Up Article Construction
Binquan Ji,Jiaqi Wang,Ruiting Li,Xingchen Han,Yiyang Qi,Shichao Wang,Yifei Lu,Yuantao Han,Feiliang Ren
Main category: cs.CL
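TL;DR: Proposes a "bottom-up," data-driven framework that retrieves first and clusters second to generate long-form text grounded in a knowledge base, keeping the content faithful and the structure coherent.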
Details
Motivation: Existing "top-down" methods tend to lose touch with external knowledge when generating long-form text, which leads to fragmented content and factual errors and makes accuracy and structure hard to guarantee.
Method: Follows a "retrieve knowledge first, then cluster for structure" strategy: it first performs iterative retrieval from the knowledge base, then uses an unsupervised clustering algorithm to organize the documents into "knowledge clusters," and finally generates a hierarchical outline and the final text from those clusters.
Result: Experiments on 14B and 32B parameter models show performance comparable to or exceeding state-of-the-art baselines, with higher fidelity and structural coherence expected especially in knowledge-constrained scenarios.
Conclusion: The proposed bottom-up framework improves the reliability and structure of long-form generation and offers a new paradigm for LLM applications in high-stakes, knowledge-intensive domains.
Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
[24] Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
Kurt Micallef,Nizar Habash,Claudia Borg
Main category: cs.CL
TL;DR: Explores how Arabic resources can support Maltese natural language processing through cross-lingual augmentation techniques.
Details
Motivation: Although Maltese has Semitic roots, its orthography is based on the Latin script and differs considerably from Arabic, and the language lacks sufficient NLP resources. The paper therefore studies how existing Arabic resources can be used to improve Maltese NLP.
Method: Applies several strategies for aligning Arabic text with Maltese, including different transliteration schemes and machine translation approaches, and proposes new transliteration systems that better match Maltese orthography.
Result: The evaluation shows that Arabic-based data augmentation can significantly improve monolingual and multilingual models on Maltese NLP tasks.
Conclusion: Arabic resources, combined with suitable alignment techniques, can effectively support Maltese NLP tasks and offer a workable augmentation path for low-resource languages.
Abstract: Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
[25] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing,Zimu Wang,Wei Wang,Haiyang Zhang
Main category: cs.CL
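TL;DR: Presents the first systematic evaluation of large vision-language models (LVLMs) on multimedia event extraction (M2E2), covering text-only, image-only, and cross-media subtasks and comparing few-shot prompting with LoRA fine-tuning. Few-shot LVLMs do well on visual tasks but are weaker on textual tasks; fine-tuning substantially improves performance, and combining modalities performs best. An error analysis reveals persistent challenges in semantic precision, localization, and cross-modal grounding.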
Details
Motivation: Although large vision-language models (LVLMs) perform well on cross-modal tasks, their use in multimedia event extraction (M2E2) remains underexplored. The paper aims to fill this gap by systematically evaluating LVLMs across modalities and settings.
Method: Evaluates representative LVLMs (such as DeepSeek-VL2 and the Qwen-VL series) on the M2E2 dataset, covering text-only, image-only, and cross-media subtasks under both few-shot prompting and LoRA-based fine-tuning.
Result: (1) Few-shot LVLMs perform well on visual tasks but poorly on textual tasks; (2) fine-tuning with LoRA substantially improves performance; (3) combining modalities shows strong synergy and gives the best performance. The error analysis shows that semantic precision, localization, and cross-modal grounding remain the main challenges.
Conclusion: LVLMs show promise for M2E2, especially in cross-modal settings, but fine-tuning and better alignment mechanisms are needed to overcome current limitations; future work should focus on multimodal semantic understanding and precise grounding.
Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.
[26] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Yubo Zhu,Dongrui Liu,Zecheng Lin,Wei Tong,Sheng Zhong,Jing Shao
Main category: cs.CL
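TL;DR: Proposes a new method for estimating question difficulty from an LLM's hidden representations: by modeling generation as a Markov chain and defining a state value function, it estimates difficulty efficiently and accurately without generating any output.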
Details
Motivation: Existing methods rely on repeated sampling, auxiliary models, or fine-tuning, which is computationally expensive and generalizes poorly; a more efficient and general difficulty estimator is needed.
Method: Models token-level generation as a Markov chain, defines a value function over the target LLM's hidden states, and estimates output quality from the initial hidden state as a measure of question difficulty.
Result: Outperforms existing baselines on both textual and multimodal tasks, and can guide adaptive reasoning strategies such as Self-Consistency, Best-of-N, and Self-Refine while reducing the number of generated tokens.
Conclusion: The method estimates LLM-perceived question difficulty efficiently and accurately, improves inference efficiency, and is general enough to apply broadly.
Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
[27] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
Shiyu Li,Yang Tang,Ruijie Liu,Shi-Zhe Chen,Xi Chen
Main category: cs.CL
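TL;DR: Presents Conan-embedding-v2, a 1.4B-parameter LLM trained from scratch; it narrows the data gap with news data, multilingual pairs, and a cross-lingual retrieval dataset, and addresses the training gap between LLMs and embedding models with a soft-masking mechanism and dynamic hard negative mining, significantly improving text embedding performance.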
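One way to picture a gradual transition from a causal mask to a bidirectional mask is sketched below. This is a guess at the general idea, not the paper's formulation: the sketch assumes the mask is a matrix of weights in which future positions receive a value `alpha` that ramps linearly from 0 (causal) to 1 (bidirectional). The functions `soft_mask` and `linear_alpha` and the linear schedule are hypothetical.

```python
def soft_mask(seq_len, alpha):
    """Mask weights: 1.0 for current and past positions, alpha for future ones."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[1.0 if j <= i else alpha for j in range(seq_len)]
            for i in range(seq_len)]

def linear_alpha(step, total_steps):
    """Hypothetical schedule: ramp alpha from 0 to 1 over training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))

causal = soft_mask(3, linear_alpha(0, 10))          # start of training
halfway = soft_mask(3, linear_alpha(5, 10))         # future tokens half visible
bidirectional = soft_mask(3, linear_alpha(10, 10))  # end of training
```

With `alpha` at 0 the matrix is the usual lower-triangular causal mask, and with `alpha` at 1 every position attends to every other position.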
Details
Motivation: Existing LLMs fine-tuned with LoRA for text embedding are held back by the data and training gaps between LLMs and embedding models and cannot reach their full potential.
Method: (1) Adds news data and multilingual parallel data during pretraining; (2) builds a cross-lingual retrieval dataset to strengthen multilingual embedding alignment; (3) proposes a soft-masking mechanism that gradually transitions from a causal mask to a bidirectional mask; (4) uses dynamic hard negative mining to improve sentence-level contrastive learning.
Result: Conan-embedding-v2 achieves SOTA performance on both MTEB and Chinese MTEB (leaderboards as of May 19, 2025) with only 1.4B parameters.
Conclusion: Training from scratch, combined with soft masking and dynamic hard negative mining, effectively bridges the gap between LLMs and embedding models and suggests a new direction for designing efficient text embedding models.
Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
[28] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
Caiqi Zhang,Chang Shu,Ehsan Shareghi,Nigel Collier
Main category: cs.CL
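TL;DR: Proposes training-free, graph-based confidence estimation methods designed for reasoning tasks, which estimate confidence from graph properties of reasoning paths such as centrality, path convergence, and path weighting.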
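As a rough illustration of the graph view described in the TL;DR, the sketch below merges several sampled reasoning paths into one directed graph and derives two simple signals: the share of paths that converge on each final answer, and the in-degree of an answer node as a crude centrality proxy. The path representation (lists of step labels ending in an answer) and all function names are assumptions made for this example; the paper's actual estimators are not specified in this digest.

```python
from collections import defaultdict

def build_graph(paths):
    """Merge sampled reasoning paths into a directed graph: node -> successors."""
    graph = defaultdict(set)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            graph[src].add(dst)
    return graph

def convergence_confidence(paths):
    """Score each final answer by the share of sampled paths ending on it."""
    counts = defaultdict(int)
    for path in paths:
        counts[path[-1]] += 1
    return {answer: n / len(paths) for answer, n in counts.items()}

def in_degree(graph, node):
    """Number of distinct predecessor nodes pointing at `node`."""
    return sum(1 for successors in graph.values() if node in successors)

# Three hypothetical reasoning paths for the same question.
paths = [
    ["Q", "s1", "s2", "A=5"],
    ["Q", "s3", "A=5"],
    ["Q", "s1", "s4", "A=7"],
]
graph = build_graph(paths)
confidence = convergence_confidence(paths)
```

Here two different routes reach "A=5", so it gets both the higher convergence score and the higher in-degree, which matches the intuition behind the paper's title.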
Details
Motivation: Existing confidence estimation methods mainly target factual QA and generalize poorly to reasoning tasks, so a more effective estimator for reasoning is needed.
Method: Models reasoning paths as directed graphs and estimates confidence from graph properties such as centrality, path convergence, and path weighting, with no additional training.
Result: Experiments with two LLMs on three reasoning datasets show improved confidence estimation and better performance on two downstream tasks.
Conclusion: The proposed graph-based confidence estimation methods are training-free and effective, and they better support the reliable deployment of LLMs on reasoning tasks.
Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
[29] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework
Heng Zhang,Chengzhi Zhang
Main category: cs.CL
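TL;DR: Proposes an end-to-end framework that mines full-text academic papers to generate structured research workflows, combining PU learning, prompt learning, and few-shot classification to automatically reconstruct and visualize research workflows in the NLP domain.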
Details
Motivation: Existing methods extract only fragmented research steps and fail to capture complete research workflows, which hinders research reproducibility and the development of "AI for Science."
Method: Takes a paragraph-centric approach: it first uses SciBERT-based Positive-Unlabeled learning to identify paragraphs that describe workflows, then uses Flan-T5 with prompt learning to generate workflow phrases, then uses ChatGPT with few-shot learning to classify the phrases into data preparation, data processing, and data analysis stages, and finally maps them to document locations to produce visual flowcharts.
Result: In the NLP domain, the framework identifies workflow paragraphs with high accuracy (F1 = 0.9772), generates phrases with ROUGE-1 = 0.4543, and reaches a classification precision of 0.958. It produces readable visual flowcharts and reveals how NLP methodology has shifted over the past two decades.
Conclusion: The framework offers an effective technical route to automated research workflow generation and a new, process-oriented perspective for analyzing how scientific paradigms evolve.
Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
[30] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
Yuval Weiss,David Demitri Africa,Paula Buttery,Richard Diehl Martinez
Main category: cs.CL
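TL;DR: Presents the first systematic study of ReLoRA for pretraining small language models (SLMs), finding that it generally underperforms standard training and that the gap widens as model size grows, which points to the limits of low-rank update strategies in SLM pretraining.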
Details
Motivation: To explore whether parameter-efficient methods such as LoRA are suitable for pretraining small language models, which offer lower computational and environmental costs.
Method: Runs ablation experiments on small language models with 11M to 66M parameters, evaluating ReLoRA on loss, perplexity, and BLiMP, and analyzes its learning dynamics.
Result: ReLoRA underperforms standard training on every metric, and the gap grows with model size; further analysis shows that ReLoRA reinforces the rank deficiencies found in small models.
Conclusion: Low-rank update strategies do not transfer easily to the pretraining of small language models, and pretraining in the low-compute regime needs more research.
Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.
[31] Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou,Xingyue Wen,Tianyi Hu,Qian Janice Wang,Daniel Hershcovich
Main category: cs.CL
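TL;DR: Introduces the new problem of cross-cultural wine review adaptation, compiles the first parallel corpus of professional Chinese and English reviews, and uses automatic and human evaluation to analyze where existing models fall short on culturally nuanced expression.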
Details
Motivation: As large language models advance, incorporating regional taste preferences and culture-specific descriptors into translation has become a new challenge, and existing methods struggle to capture cultural detail.
Method: Collects 8k Chinese and 16k English professional wine reviews to build a parallel corpus; benchmarks neural machine translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation; and proposes three culture-oriented criteria: Cultural Proximity, Cultural Neutrality, and Cultural Genuineness.
Result: Current models perform poorly at translating wine descriptions across cultures and have particular difficulty conveying culture-related content and flavor descriptors accurately.
Conclusion: Translation models still face significant challenges with culturally sensitive content, and further research is needed to improve cultural adaptation.
Abstract: Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria -- Cultural Proximity, Cultural Neutrality, and Cultural Genuineness -- to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.
[32] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data
Jian Gao,Fufangchen Zhao,Yiyang Zhang,Danfeng Yan
Main category: cs.CL
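TL;DR: Proposes SitLLM, a lightweight multimodal framework that combines flexible pressure sensing with large language models for fine-grained sitting posture understanding and personalized health feedback generation.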
Details
Motivation: Existing sitting posture monitoring systems recognize posture only coarsely, lack semantic expressiveness, struggle to give personalized feedback, and pay too little attention to long-term musculoskeletal health.
Method: SitLLM has three modules: a Gaussian-Robust Sensor Embedding Module (partitions pressure maps into patches and injects noise for more robust features), a Prompt-Driven Cross-Modal Alignment Module (maps sensor embeddings into the LLM's semantic space via multi-head cross-attention), and a Multi-Context Prompt Module (fuses feature-level, structure-level, statistical-level, and semantic-level information to guide instruction comprehension).
Result: The framework achieves fine-grained posture recognition and semantic understanding and generates personalized health advice, with good robustness and context awareness.
Conclusion: SitLLM effectively integrates pressure sensing with large language models, improving the semantic expressiveness and personalized feedback of posture monitoring and offering a new approach to intelligent health monitoring.
Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM's semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.
[33] Multi-Model Synthetic Training for Mission-Critical Small Language Models
Nolan Platt,Pragyansmita Nayak
Main category: cs.CL
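TL;DR: Proposes using large language models as one-time teachers to generate synthetic data: 3.2 billion vessel tracking records are turned into 21,543 question-answer pairs for fine-tuning the smaller Qwen2.5-7B model, which reaches 75% accuracy on maritime intelligence tasks at a 261x lower cost.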
Details
Motivation: Domain-specific training data is scarce and complex, which limits the use of large language models in specialized fields, so a low-cost, reproducible way to improve small models on specialized tasks is needed.
Method: Uses large models such as GPT-4o and o3-mini as teachers to generate 21,543 synthetic question-answer pairs from 3.2 billion AIS records, which are used to fine-tune Qwen2.5-7B; multi-model generation is used to prevent overfitting.
Result: The fine-tuned Qwen2.5-7B model reaches 75% accuracy on maritime tasks, with inference far cheaper than using a large model directly, for a 261x cost reduction.
Conclusion: Properly fine-tuned small language models can approach the accuracy of large models at much lower cost, and the approach offers an efficient, reproducible framework for specialized domains that lack annotated data.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models -- when fine-tuned properly -- can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.
[34] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
Francesco Pappone,Ruggero Marino Lazzaroni,Federico Califano,Niccolò Gentile,Roberto Marras
Main category: cs.CL
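TL;DR: Proposes a GRPO method with a semantic reward, in which a small encoder model supplies a dense reward signal via cosine similarity, to improve the faithfulness and clarity of explanations that large language models generate for the Italian medical-school entrance examinations.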
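The reward described in the TL;DR can be sketched in a few lines, under some assumptions. The vectors below stand in for embeddings that an encoder-only model would produce for each explanation. Each sampled explanation is rewarded by its cosine similarity to the reference explanation, and the rewards within one sampled group are then standardized into group-relative advantages, as in the usual GRPO formulation. The function names and toy vectors are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy stand-ins for encoder embeddings of a reference and three candidates.
reference = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
rewards = [cosine_similarity(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, an explanation is reinforced only to the extent that it is semantically closer to the reference than its siblings from the same prompt.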
Details
Motivation: Standard reinforcement learning relies on expensive LLM-as-a-judge evaluation or on keyword-based metrics that miss semantics, which makes it hard to align models with complex goals such as pedagogical soundness.
Method: Introduces a small encoder-only transformer as a semantic reward model within the GRPO framework; the reward is the cosine similarity between a generated explanation and a ground-truth reference, applied after domain-adaptive continued pre-training and supervised fine-tuning.
Result: Compared with a strong supervised fine-tuning baseline, the method significantly improves the faithfulness and clarity of generated explanations.
Conclusion: Semantic reward modeling with a lightweight encoder model enables fine-grained reward shaping in complex generation tasks.
Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
[35] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
Sijia Cui,Shuai Xu,Aiyao He,Yanna Wang,Bo Xu
Main category: cs.CL
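TL;DR: Proposes PLAP (Plan with Language, Act with Parameter), a planning framework that pairs language models with a library of parameterized skills to improve the grounding of LLM agents in complex, adversarial long-horizon environments, and validates it in the game MicroRTS.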
Details
Motivation: 现有方法在生成可靠低级动作或依赖专家经验将高级任务转化为具体动作序列方面存在不足,难以让大语言模型智能体有效适应复杂、对抗性的长视野环境。 Method: PLAP框架包含三个核心组件:(1)包含环境特定参数化技能的技能库;(2)由大语言模型驱动的技能规划器;(3)将参数化技能转换为可执行动作序列的技能执行器。该方法在MicroRTS环境中实现并评估。 Result: 实验表明,GPT-4o驱动的PLAP在零样本设置下优于80%的基线智能体,而Qwen2-72B驱动的PLAP在精心设计的少样本示例下超越顶级脚本智能体CoacAI。同时构建了针对6个闭源和2个开源大语言模型的综合评估指标,并发布了长视野技能规划能力的LLM排行榜。 Conclusion: PLAP框架有效提升了大语言模型智能体在复杂长视野环境中的接地能力和任务执行性能,展示了语言规划与参数化动作执行结合的潜力。 Abstract: Recent advancements in Large Language Models(LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. 
Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.
[36] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals
Jinxin Li,Gang Tu,ShengYu Cheng,Junjie Hu,Jinting Wang,Rui Chen,Zhilong Zhou,Dongbo Shan
Main category: cs.CL
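TL;DR: HSAD is a hallucination detection framework based on hidden-signal analysis: it models the temporal dynamics of hidden representations during autoregressive generation and applies frequency-domain analysis to detect LLM hallucinations.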
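The spectral feature this entry describes, the strongest non-DC frequency component of a hidden-layer signal, can be illustrated with a minimal sketch. How HSAD actually samples activations into a signal is not specified here, so the toy signal, the function names `dft_magnitudes` and `strongest_non_dc`, and the direct O(N^2) DFT (used in place of a library FFT to stay dependency-free) are all assumptions for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N//2 using a direct O(N^2) discrete Fourier transform."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        magnitudes.append(abs(acc))
    return magnitudes


def strongest_non_dc(signal):
    """Return (frequency_bin, magnitude) of the largest component, skipping bin 0 (DC)."""
    if len(signal) < 2:
        raise ValueError("need at least two samples to have a non-DC component")
    magnitudes = dft_magnitudes(signal)
    k = max(range(1, len(magnitudes)), key=lambda i: magnitudes[i])
    return k, magnitudes[k]


# Hypothetical "hidden-layer signal": a constant offset (the DC term)
# plus one oscillation at frequency bin 3.
signal = [5.0 + math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
peak_bin, peak_mag = strongest_non_dc(signal)
```

Skipping bin 0 matters because the constant offset dominates the spectrum; without that exclusion the feature would only report the mean activation level rather than any temporal structure.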
Details
Motivation: Existing hallucination detection methods are either constrained by external knowledge coverage or unable to capture deviations in reasoning dynamics, which limits their effectiveness and robustness.
Method: HSAD builds hidden-layer signals by sampling activations across layers, applies the Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as the spectral feature; it uses the autoregressive nature of LLMs to identify optimal observation points for detection.
Result: On multiple benchmarks, including TruthfulQA, HSAD improves on prior state-of-the-art methods by more than 10 percentage points.
Conclusion: By combining reasoning-process modeling with frequency-domain analysis, HSAD establishes a new, more robust paradigm for LLM hallucination detection.
Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
[37] The Few-shot Dilemma: Over-prompting Large Language Models
Yongjian Tang,Doruk Tuncel,Christian Koerner,Thomas Runkler
Main category: cs.CL
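TL;DR: Studies "over-prompting", where too many examples in a prompt degrade LLM performance; proposes a framework combining three few-shot selection methods, evaluates it on multiple models, and outperforms existing methods by 1% on software requirement classification.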
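One of the three selection methods named in this entry is TF-IDF similarity. The sketch below shows one plausible way to pick the k pool examples closest to a query; it is not the authors' implementation. The smoothed IDF formula, whitespace tokenization, the `select_few_shot` name, and the toy requirement sentences are assumptions, and the stratification step mentioned in the abstract is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot(query, pool, k):
    """Return the k pool examples whose TF-IDF vectors are most similar to the query."""
    vectors = tfidf_vectors([ex["text"] for ex in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]


# Made-up requirement sentences, for illustration only.
pool = [
    {"text": "The system shall export reports as PDF", "label": "functional"},
    {"text": "The system shall respond within two seconds", "label": "non-functional"},
    {"text": "Users shall reset passwords by email", "label": "functional"},
]
shots = select_few_shot("The app shall respond within one second", pool, k=1)
```

Because `k` is an explicit parameter, the same function can be called with increasing values to look for the point where adding examples stops helping, which is the kind of sweep the entry describes.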
Details
Motivation: Challenges the conventional view that more relevant examples always improve LLM performance, and investigates why excessive examples can instead degrade it.
Method: Builds a few-shot selection framework covering random sampling, semantic embedding, and TF-IDF vectors, and evaluates how the number and selection of examples affect performance across multiple LLMs.
Result: Excessive domain-specific examples paradoxically reduce the performance of certain LLMs; gradually increasing the number of TF-IDF-selected, stratified examples identifies the optimal quantity for each model.
Conclusion: The combined approach achieves better performance with fewer examples, avoids over-prompting, and surpasses the state of the art by 1% in classifying functional and non-functional requirements.
Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
[38] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin
Main category: cs.CL
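TL;DR: Introduces a new benchmark for evaluating how well LLMs predict continuous Big Five personality traits in natural conversational settings; correlations between model predictions and ground-truth trait scores are low (all below 0.26), highlighting the limits of LLMs in understanding psychological traits.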
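The headline number in this entry is a Pearson correlation between predicted and ground-truth trait scores. As a reminder of what that metric computes, here is a minimal sketch; the `pearson_r` helper and the score lists are made up for illustration and are not taken from the benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model predictions and questionnaire scores for one trait.
predicted = [3.1, 2.8, 3.5, 3.0, 3.3]
observed = [2.0, 4.1, 3.9, 2.5, 3.2]
r = pearson_r(predicted, observed)
```

A value near 0 means predictions carry almost no linear information about the true scores, which is why correlations below 0.26 are read as weak alignment.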
Details
Motivation: Although LLMs are widely deployed in settings that require psychological understanding, their ability to recognize human personality traits has not been adequately explored in ecologically valid conversational settings; in particular, evaluation against real, continuous personality scores is still missing.
Method: Builds a new dataset of semi-structured interview transcripts paired with continuous Big Five scores and systematically evaluates three approaches: zero-shot and chain-of-thought prompting with GPT-4.1 Mini; LoRA fine-tuning of RoBERTa and Meta-LLaMA; and regression on static embeddings from BERT and text-embedding-3-small.
Result: All Pearson correlations between model predictions and ground-truth personality scores are below 0.26; chain-of-thought prompting offers limited gains over zero-shot, suggesting that personality inference depends more on latent semantic representations than on explicit reasoning.
Conclusion: Current LLMs are significantly limited in understanding complex human personality traits; future work should develop trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning.
These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
[39] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
Ali Salamatian,Amirhossein Abaskohi,Wan-Cyuan Fan,Mir Rayat Imtiaz Hossain,Leonid Sigal,Giuseppe Carenini
Main category: cs.CL
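TL;DR: Presents the ChartGaze dataset, which uses human eye-tracking data to refine the attention of vision-language models in chart question answering, improving accuracy and interpretability.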
Details
Motivation: Existing vision-language models often attend to irrelevant regions in chart question answering, which reduces accuracy and interpretability; an attention mechanism closer to human perception is needed.
Method: Builds the ChartGaze dataset of human eye-tracking data, systematically compares human and model attention, and proposes a gaze-guided attention refinement that aligns model attention with human fixations.
Result: The method improves accuracy by up to 2.56 percentage points across multiple models and also improves the agreement between model attention and human gaze.
Conclusion: Human eye-tracking data can improve reasoning quality and interpretability in chart question answering and provides a more cognitively plausible attention supervision signal for vision-language models.
Abstract: Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
[40] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
Zile Qiao,Guoxin Chen,Xuanzhong Chen,Donglei Yu,Wenbiao Yin,Xinyu Wang,Zhen Zhang,Baixuan Li,Huifeng Yin,Kuan Li,Rui Min,Minpeng Liao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Main category: cs.CL
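TL;DR: Proposes the WebResearcher framework, which uses an iterative deep-research paradigm and a scalable data synthesis engine, WebFrontier, to let AI agents autonomously discover and synthesize knowledge; it significantly improves tool-use capabilities and reaches state-of-the-art results on multiple benchmarks.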
Details
Motivation: Existing mono-contextual deep-research methods suffer from context suffocation and noise contamination and struggle with active knowledge construction, so a more efficient and scalable framework is needed to support autonomous research by AI agents.
Method: Reformulates deep research as a Markov Decision Process and designs the iterative WebResearcher framework, which combines evolving report consolidation with focused workspace management; uses tool-augmented complexity escalation to generate high-quality training data (WebFrontier); and supports parallel multi-agent exploration.
Result: Across 6 challenging benchmarks, WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems, and its training data significantly improves the tool-use capabilities of traditional mono-contextual methods.
Conclusion: WebResearcher offers an efficient, scalable approach to AI-driven autonomous research and advances the shift from passive knowledge recall to active knowledge construction.
Abstract: Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
[41] Scaling Agents via Continual Pre-training
Liangcai Su,Zhen Zhang,Guangyu Li,Zhuo Chen,Chenxi Wang,Maojia Song,Xinyu Wang,Kuan Li,Jialong Wu,Xuanzhong Chen,Zile Qiao,Zhongwang Zhang,Huifeng Yin,Shihao Cai,Runnan Fang,Zhengwei Tao,Wenbiao Yin,Chenxiong Qian,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Main category: cs.CL
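TL;DR: Proposes Agentic Continual Pre-training (Agentic CPT) to address the weak performance of existing LLMs on agentic tasks, and builds on it to develop AgentFounder, a deep research agent model that achieves state-of-the-art results on multiple benchmarks.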
Details
Motivation: Existing post-training approaches underperform when building agentic systems capable of autonomous tool use and multi-step reasoning. The main cause is the lack of strong agentic foundation models, which forces models to learn diverse agentic behaviors while simultaneously aligning to expert demonstrations, creating optimization conflicts.
Method: Introduces Agentic Continual Pre-training (Agentic CPT) into the training pipeline for deep research agents to build a strong agentic foundation model, and develops AgentFounder on top of it.
Result: AgentFounder-30B achieves state-of-the-art results on 10 benchmarks, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE, while retaining strong tool-use ability.
Conclusion: Agentic CPT is an effective route to high-performing agent models, and AgentFounder's results support the importance of a robust agentic foundation model.
Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agent training pipeline to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
[42] Towards General Agentic Intelligence via Environment Scaling
Runnan Fang,Shihao Cai,Baixuan Li,Jialong Wu,Guangyu Li,Wenbiao Yin,Xinyu Wang,Xiaobin Wang,Liangcai Su,Zhen Zhang,Shibin Wu,Zhengwei Tao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Main category: cs.CL
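TL;DR: Proposes AgentScaler, a scalable framework that automatically generates diverse simulated environments and uses a two-phase fine-tuning strategy to significantly improve the function-calling ability of LLM agents.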
Details
Motivation: Applying LLMs in real-world scenarios requires strong function-calling ability, but existing methods are limited by the diversity of training environments and struggle to develop general agentic intelligence.
Method: Designs a scalable framework that automatically constructs heterogeneous, fully simulated environments, and adopts a two-phase fine-tuning strategy: first endowing agents with fundamental capabilities, then specializing them for specific domains.
Result: Experiments on agentic benchmarks including tau-bench, tau2-Bench, and ACEBench show that AgentScaler significantly improves function-calling ability.
Conclusion: Training on scaled-up, diverse environments combined with staged fine-tuning effectively improves the general agentic intelligence of LLMs.
Abstract: Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which requires agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
[43] WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Zijian Li,Xin Guan,Bo Zhang,Shen Huang,Houquan Zhou,Shaopeng Lai,Ming Yan,Yong Jiang,Pengjun Xie,Fei Huang,Jun Zhang,Jingren Zhou
Main category: cs.CL
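TL;DR: Proposes WebWeaver, a dual-agent framework that iteratively interleaves dynamic planning with evidence acquisition and uses hierarchical retrieval and writing, significantly improving performance on open-ended deep research tasks.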
Details
Motivation: Existing methods use static research pipelines and one-shot generation, which lead to long-context failures, missed information, and hallucinations, and cope poorly with open-ended deep research tasks.
Method: Proposes WebWeaver, a framework with two agents, a planner and a writer: the planner dynamically refines the outline and accumulates evidence in a memory bank; the writer retrieves evidence and writes section by section according to the outline, achieving focused synthesis.
Result: Achieves state-of-the-art results on major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym.
Conclusion: Adaptive planning and focused synthesis offer an effective path to high-quality, reliable, well-structured research reports and confirm the advantages of a human-like iterative research process.
Abstract: This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.
[44] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Xixi Wu,Kuan Li,Yida Zhao,Liwen Zhang,Litu Ou,Huifeng Yin,Zhongwang Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Minhao Cheng,Shuai Wang,Hong Cheng,Jingren Zhou
Main category: cs.CL
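TL;DR: Proposes the ReSum paradigm, which enables indefinite exploration through periodic context summarization, together with the ReSum-GRPO training method, to improve LLM-based web agents on complex queries.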
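The core idea in this entry, replacing a growing interaction history with a compact summary whenever it exceeds a budget, can be sketched as a simple control loop. This is an illustrative outline under stated assumptions, not the ReSum implementation: `resum_loop`, the character-based budget, and the stub `toy_act` and `toy_summarize` functions are hypothetical stand-ins for an LLM agent step and an LLM summarizer.

```python
from typing import Callable, List, Optional, Tuple


def resum_loop(
    question: str,
    act: Callable[[List[str]], Tuple[str, Optional[str]]],
    summarize: Callable[[List[str]], str],
    max_context_chars: int,
    max_steps: int,
) -> Optional[str]:
    """Run an agent loop that compresses its history when the budget is exceeded."""
    context = [question]
    for _ in range(max_steps):
        observation, answer = act(context)
        if answer is not None:
            return answer
        context.append(observation)
        if sum(len(item) for item in context) > max_context_chars:
            # Replace the full history with the question plus one compact state.
            context = [question, summarize(context)]
    return None


def toy_act(context):
    # Stub agent: each step "discovers" one fact and answers once three are known.
    known = sum(item.count("fact") for item in context[1:])
    if known >= 3:
        return "", "done"
    return f"fact{known + 1} " + "x" * 40, None


def toy_summarize(context):
    # Stub summarizer: keep only the discovered facts and drop the filler text.
    facts = [word for item in context[1:] for word in item.split() if word.startswith("fact")]
    return " ".join(facts)


answer = resum_loop("q", toy_act, toy_summarize, max_context_chars=80, max_steps=10)
```

The summary has to preserve whatever the next step depends on (here, the discovered facts); otherwise the agent would repeat work after each compression.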
Details
Motivation: Existing ReAct-based LLM web agents are limited by the context window; on complex queries involving multiple entities, intertwined relationships, and high uncertainty, they tend to exhaust the context budget before completing the task.
Method: Proposes the ReSum paradigm, which periodically compresses the interaction history into a compact reasoning state to get around context limits, and designs ReSum-GRPO, which combines GRPO with segmented trajectory training and advantage broadcasting so that agents adapt to summary-conditioned reasoning.
Result: Experiments on three benchmarks show that ReSum improves on ReAct by 4.5% on average, with further gains of up to 8.2% after ReSum-GRPO training; WebResummer-30B, trained on only 1K samples, reaches 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
Conclusion: ReSum and its training method effectively ease context-length limits and significantly improve web agents on complex tasks, with good practicality and room to scale.
Abstract: Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
[45] Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Millicent Li,Alberto Mario Ceballos Arroyo,Giordano Rogers,Naomi Saphra,Byron C. Wallace
Main category: cs.CL
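TL;DR: Questions whether existing activation verbalization methods truly reveal the internal mechanisms of the target LLM or merely reflect the input or the verbalizer model's own knowledge. The study finds that existing datasets cannot evaluate these methods effectively and that verbalizations often reflect the verbalizer's knowledge rather than the target model's activations, so more rigorous benchmarks and experimental controls are needed.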
Details
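Motivation: Existing activation verbalization methods claim to explain the internal representations and operations of large models, but their validity is in doubt: they may not capture the target model's internal state at all and may only reproduce input information or the verbalizer model's prior knowledge. A systematic evaluation of their actual explanatory power is therefore needed.
Method: The authors evaluate popular activation verbalization methods on several datasets used in prior work and design controlled experiments that test whether similar performance can be reached from the input alone, without access to the target model's internal activations; they also analyze where verbalizations come from, to determine whether they reflect the target model's activations or the verbalizer's own knowledge.
Result: Many verbalization methods still perform well on the benchmark tasks without access to the target model's internal activations, which indicates that current datasets are unsuitable for evaluating such methods; the controlled experiments further show that verbalizations often reflect the verbalizer's own parametric knowledge rather than the target model's actual activations.
Conclusion: Current activation verbalization methods may not provide real insight into the target model's internal workings, and their output is strongly influenced by the verbalizer's own knowledge. Future work needs more targeted benchmarks and strict experimental controls to ensure that interpretability methods are valid and reliable.
Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
cs.CV [Back]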
[46] Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction
Mustafa Khanbhai,Giulia Di Nardo,Jun Ma,Vivienne Freitas,Caterina Masino,Ali Dolatabadi,Zhaoxun "Lorenz" Liu,Wey Leong,Wagner H. Souza,Amin Madani
Main category: cs.CV
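TL;DR: Proposes a human-in-the-loop machine learning approach built on U-Mamba to improve the generalization of 3D anatomical segmentation and reconstruction; it achieves high-accuracy segmentation on breast MRI data and supports clinical preoperative planning and patient education.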
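Segmentation quality in this entry is reported as a Dice similarity coefficient (DSC), which measures the overlap between a predicted mask and the ground truth. A minimal sketch of the metric for binary masks follows; the `dice_coefficient` helper, the tiny masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two equally sized binary 2D masks."""
    pairs = [(p, t) for row_p, row_t in zip(pred, truth) for p, t in zip(row_p, row_t)]
    intersection = sum(1 for p, t in pairs if p and t)
    total = sum(1 for p, _ in pairs if p) + sum(1 for _, t in pairs if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
dsc = dice_coefficient(pred, truth)
```

DSC ranges from 0 (no overlap) to 1 (identical masks). Small structures are penalized more heavily by a few misplaced pixels, which is one common reason tumor scores sit below whole-organ scores.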
Details
Motivation: Traditional models generalize poorly across datasets and cannot meet the need for accurate anatomical segmentation in preoperative planning, so more general 3D anatomical reconstruction algorithms are needed.
Method: Uses a three-phase pipeline: image anonymization and manual annotation; co-registration and automated segmentation (whole breast, fibroglandular tissue, and tumors); and 3D visualization with ITK-SNAP. A human-in-the-loop process refines the U-Mamba segmentations to improve generalization across imaging scenarios.
Result: U-Mamba achieves strong segmentation performance on T1-weighted images (DSC 0.97 for whole organs, 0.96 for fibroglandular tissue, and 0.82 for tumors) and produces accurate 3D reconstructions; clinician interviews indicate improved preoperative planning, intraoperative navigation, and clinician-patient communication.
Conclusion: The human-in-the-loop approach improves generalization across multi-source data, gives clinicians a reliable 3D visualization tool, and supports personalized treatment decisions and patient engagement.
Abstract: Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2023) through three phases: anonymization and manual segmentation of T1-weighted and dynamic contrast-enhanced sequences; co-registration and segmentation of whole breast, fibroglandular tissue, and tumors; and 3D visualization using ITK-SNAP. A human-in-the-loop approach refined segmentations using U-Mamba, designed to generalize across imaging scenarios. Dice similarity coefficient assessed overlap between automated segmentation and ground truth. Clinical relevance was evaluated through clinician and patient interviews. U-Mamba showed strong performance with DSC values of 0.97 (±0.013) for whole organs, 0.96 (±0.024) for fibroglandular tissue, and 0.82 (±0.12) for tumors on T1-weighted images. The model generated accurate 3D reconstructions enabling visualization of complex anatomical features. Clinician interviews indicated improved planning, intraoperative navigation, and decision support. Integration of 3D visualization enhanced patient education, communication, and understanding.
This human-in-the-loop machine learning approach successfully generalizes algorithms for 3D reconstruction and anatomical segmentation across patient datasets, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education, facilitating shared decision-making and empowering informed patient choices across medical applications.
[47] RU-Net for Automatic Characterization of TRISO Fuel Cross Sections
Lu Cai,Fei Xu,Min Xian,Yalei Tang,Shoukun Sun,John Stempien
Main category: cs.CV
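TL;DR: Uses convolutional neural networks (CNNs) to automatically segment microscopic images of irradiated TRISO particles, accelerating data analysis and improving objectivity.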
Details
Motivation: Manual analysis of TRISO particle cross-section images is time-consuming and subjective, which makes statistically meaningful results hard to obtain; an automated, objective method is needed to identify irradiation-induced morphological changes.
Method: Builds a TRISO layer dataset of more than 2,000 microscopic images with annotations and segments the images with several CNN models: RU-Net (developed in this study), U-Net, ResNet, and Attention U-Net.
Result: The RU-Net-based model performs best in terms of Intersection over Union (IoU) and segments the individual TRISO layers efficiently and accurately.
Conclusion: CNN models can substantially speed up the analysis of TRISO particle cross sections, reduce manual effort, and improve the objectivity and reproducibility of the results, providing a useful tool for nuclear fuel performance evaluation.
Abstract: During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU).
Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results.
[48] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture
Abigail R. Cohen,Yuming Sun,Zhihao Qin,Harsh S. Muriki,Zihao Xiao,Yeonju Lee,Matthew Housley,Andrew F. Sharkey,Rhuanito S. Ferrarezi,Jing Li,Lu Gan,Yongsheng Chen
Main category: cs.CV
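TL;DR: Proposes a tiered pipeline based on multispectral imaging for crop anomaly detection and status estimation that trades off efficiency against accuracy and supports edge diagnostics and agricultural sustainability.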
Details
Motivation: Current nutrient management methods require lengthy analyses that prevent real-time optimization; imaging is fast but computationally expensive, which limits its use in resource-constrained settings.
Method: Uses an autoencoder (AE) for early anomaly detection and compares two status estimation modules, a Random Forest (RF) on vegetation-index features and a Vision Transformer (ViT) on raw images, using multispectral imaging under different fertilizer treatments.
Result: The AE achieves 73% net detection of T3 samples 9 days after transplanting at low energy cost; ViT outperforms RF on phosphorus and calcium estimation (R² 0.61 vs. 0.58 and 0.48 vs. 0.35) but uses more energy.
Conclusion: The modular pipeline can be deployed flexibly under different accuracy and energy requirements and offers a practical approach to edge-side agricultural diagnostics and sustainable resource management.
Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R² 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.
[49] Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
Yuriel Ryan,Rui Yang Tan,Kenny Tsu Wei Choo,Roy Ka-Wei Lee
Main category: cs.CV
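TL;DR: Introduces PixelHumor, a benchmark dataset of 2,800 multi-panel comics for evaluating how well large multimodal models (LMMs) understand multimodal humor and narrative sequences. Experiments show that current state-of-the-art models perform far below humans on panel sequencing, highlighting their weakness in integrating visual and textual cues.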
Details
Motivation: Understanding humor is a core part of social intelligence, yet current large multimodal models still face significant challenges with multimodal humor, and effective means of evaluation are lacking.
Method: Builds PixelHumor, a dataset of 2,800 annotated multi-panel comics, to systematically evaluate LMMs on multimodal humor understanding and narrative sequence recognition.
Result: Current state-of-the-art LMMs reach only 61% accuracy on panel sequencing, far below human performance, exposing weaknesses in visual-textual integration and in understanding narrative coherence.
Conclusion: PixelHumor provides a rigorous framework for evaluating multimodal contextual and narrative reasoning and should help drive the development of stronger LMMs capable of natural, socially aware interaction.
Abstract: Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs' ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models' integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.
[50] OnlineHOI: Towards Online Human-Object Interaction Generation and Perception
Yihong Ji,Yunze Liu,Yiyao Zhuo,Weijiang Yu,Fei Ma,Joshua Huang,Fei Yu
Main category: cs.CV
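TL;DR: Proposes the tasks of online HOI generation and perception and achieves state-of-the-art performance with a Mamba-based framework that adds a memory mechanism.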
Details
Motivation: Existing methods model human-object interaction in an offline setting and adapt poorly to realistic online scenarios, where only current and historical information is available.
Method: Proposes the OnlineHOI framework, built on the Mamba architecture with a memory mechanism that integrates historical information effectively.
Result: Achieves state-of-the-art performance on the Core4D and OAKINK2 online generation tasks and on the HOI4D online perception task.
Conclusion: The OnlineHOI framework performs well on online human-object interaction generation and perception and is suited to real, dynamic scenarios.
Abstract: The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba's powerful modeling capabilities for streaming data and the memory mechanism's efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.
[51] EfficientNet-Based Multi-Class Detection of Real, Deepfake, and Plastic Surgery Faces
Li Kun,Milena Radenkovic
Main category: cs.CV
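TL;DR: While deepfake technology drives technological progress, it also poses serious threats to privacy, national security, and social stability.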
Details
Motivation: Examines how deep learning, for all the convenience it brings, has given rise to deepfake technology with negative effects on society.
Method: Analyzes the applications of deepfake technology and its potential harms to politics, the economy, and personal privacy.
Result: Shows that deepfake technology can spread false information, sabotage elections, damage personal reputations, and undermine the security of facial recognition systems.
Conclusion: Regulation of and defenses against deepfake technology need to be strengthened to reduce its negative impact on society.
Abstract: Currently, deep learning has been utilised to tackle several difficulties in our everyday lives. It not only exhibits progress in computer vision but also constitutes the foundation for several revolutionary technologies. Nonetheless, similar to all phenomena, the use of deep learning in diverse domains has produced a multifaceted interaction of advantages and disadvantages for human society. Deepfake technology has advanced, significantly impacting social life. However, developments in this technology can affect privacy, the reputations of prominent personalities, and national security via software development. It can produce indistinguishable counterfeit photographs and films, potentially impairing the functionality of facial recognition systems, so presenting a significant risk. The improper application of deepfake technology produces several detrimental effects on society. Face-swapping programs mislead users by altering persons' appearances or expressions to fulfil particular aims or to appropriate personal information. Deepfake technology permeates daily life through such techniques. Certain individuals endeavour to sabotage election campaigns or subvert prominent political figures by creating deceptive pictures to influence public perception, causing significant harm to a nation's political and economic structure.
[52] A Modern Look at Simplicity Bias in Image Classification Tasks
Xiaoguang Chang,Teng Wang,Changyin Sun
Main category: cs.CV
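TL;DR: Studies the relationship between simplicity bias (SB) in CLIP models and their performance on image classification tasks, proposes a finer-grained frequency-aware measure, and validates it under different SB-modulation methods; experiments show that SB correlates positively with OOD generalization, while its relationship with adversarial robustness is more complicated.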
Details
Motivation: To understand how simplicity bias in large models such as CLIP affects their performance across image classification tasks, given that existing complexity measures work for small models but are hard to apply to large ones.
Method: Proposes a frequency-aware complexity measure that captures finer-grained differences in simplicity bias and validates it experimentally on CLIP models under two recent SB-modulation methods.
Result: The new measure is more informative and consistent than existing ones; a stronger simplicity bias helps OOD generalization but fares worse on adversarial robustness, and the effect of SB varies across tasks.
Conclusion: A model's inductive biases should be matched to the characteristics of the target task; simplicity bias is not always beneficial and depends on the task.
Abstract: The simplicity bias (SB) of neural networks, i.e., their tendency to represent simple functions, is a key factor in their generalization capabilities. Recent studies show that an excessive SB may harm performance on complex tasks, and the need for this bias varies across tasks. Many of these studies focus on simple models or synthetic tasks. It remains challenging to measure the SB in large models and little is known about the relevance of the SB to various image classification tasks. In this paper, we investigate the relationship between the SB in CLIP models and their performance across image classification tasks. First, we theoretically analyze the potential limitation of existing measures of complexity that have been used to characterize small models. To address this, we propose a frequency-aware measure capturing finer-grained SB differences. We validate this measure on CLIP models subjected to two recent SB-modulation methods, demonstrating that it is more informative and consistent than previous measures. Second, we examine the relation between the SB of those models and their performance across a range of image classification tasks, including zero-shot and fine-tuning settings. These experiments reveal a range of behaviors. For example, a stronger SB correlates with a better performance on OOD generalization than on adversarial robustness. These results highlight the benefits of aligning a model's inductive biases with the characteristics of the target task.
[53] GraphDerm: Fusing Imaging, Physical Scale, and Metadata in a Population-Graph Classifier for Dermoscopic Lesions
Mehdi Yousefzadeh,Parsa Esfahanian,Sara Rashidifar,Hossein Salahshoor Gavalan,Negar Sadat Rafiee Tabatabaee,Saeid Gorgin,Dara Rahmati,Maryam Daneshpazhooh
Main category: cs.CV
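TL;DR: This paper presents GraphDerm, a population-graph framework that fuses imaging, millimeter-scale calibration, and patient metadata for multiclass dermoscopic classification; to the authors' knowledge it is the first ISIC-scale application of graph neural networks (GNNs) to dermoscopy.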
Details
Motivation: Existing image-based AI models for dermoscopy often ignore patient metadata (age, sex, lesion site) and physical scale information, which limits the use of geometric features and hurts melanoma classification performance. Method: Builds the GraphDerm framework: (1) U-Nets segment the lesion and the ruler; (2) a 1D-CNN regresses the pixels-per-millimeter ratio from the ruler mask; (3) real-scale geometric features (area, perimeter, etc.) are extracted; (4) EfficientNet-B3 provides node features, while edges encode metadata and geometry similarity; (5) a spectral graph neural network performs semi-supervised node classification and is compared against an image-only ANN baseline. Result: Ruler and lesion segmentation reach Dice scores of 0.904 and 0.908, and scale regression attains an MAE of 1.5 px; the graph model reaches an AUC of 0.9812, and a sparse variant using only about 25% of the edges still achieves 0.9788, clearly above the image-only baseline (0.9440); per-class AUCs generally fall between 0.97 and 0.99. Conclusion: Integrating calibrated scale, lesion geometry, and metadata into a population graph substantially improves classification on ISIC-2019. The sparse graph comes close to the best accuracy, indicating potential for efficient deployment; future work will refine learned edge semantics and validate on broader datasets. Abstract: Introduction. Dermoscopy aids melanoma triage, yet image-only AI often ignores patient metadata (age, sex, site) and the physical scale needed for geometric analysis. We present GraphDerm, a population-graph framework that fuses imaging, millimeter-scale calibration, and metadata for multiclass dermoscopic classification, to the best of our knowledge the first ISIC-scale application of GNNs to dermoscopy. Methods. We curate ISIC 2018/2019, synthesize ruler-embedded images with exact masks, and train U-Nets (SE-ResNet-18) for lesion and ruler segmentation. Pixels-per-millimeter are regressed from the ruler-mask two-point correlation via a lightweight 1D-CNN. From lesion masks we compute real-scale descriptors (area, perimeter, radius of gyration). Node features use EfficientNet-B3; edges encode metadata/geometry similarity (fully weighted or thresholded). A spectral GNN performs semi-supervised node classification; an image-only ANN is the baseline. Results. Ruler and lesion segmentation reach Dice 0.904 and 0.908; scale regression attains MAE 1.5 px (RMSE 6.6). The graph attains AUC 0.9812, with a thresholded variant using about 25% of edges preserving AUC 0.9788 (vs. 0.9440 for the image-only baseline); per-class AUCs typically fall in the 0.97-0.99 range. Conclusion. Unifying calibrated scale, lesion geometry, and metadata in a population graph yields substantial gains over image-only pipelines on ISIC-2019. Sparser graphs retain near-optimal accuracy, suggesting efficient deployment.
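To illustrate the real-scale descriptors named in the GraphDerm abstract (area, perimeter, radius of gyration), here is a minimal sketch of how they could be derived from a binary lesion mask once a pixels-per-millimeter factor is known. The function name `real_scale_descriptors`, the list-of-rows mask format, and the edge-counting perimeter estimate are assumptions made for illustration; the abstract does not specify how the authors compute these quantities.

```python
import math


def real_scale_descriptors(mask, px_per_mm):
    """Convert pixel-space lesion geometry to millimeters.

    mask: list of rows containing 0/1 values (hypothetical input format).
    px_per_mm: calibration factor, e.g. as regressed from a ruler.
    """
    pixels = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no lesion pixels")
    n = len(pixels)

    # Area: pixel count divided by the squared scale factor.
    area_mm2 = n / px_per_mm ** 2

    # Perimeter (crude estimate): count pixel edges that border background.
    filled = set(pixels)
    exposed_edges = sum(
        1
        for (r, c) in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if (r + dr, c + dc) not in filled
    )
    perimeter_mm = exposed_edges / px_per_mm

    # Radius of gyration: RMS distance of lesion pixels from their centroid.
    cr = sum(r for r, _ in pixels) / n
    cc = sum(c for _, c in pixels) / n
    rg_px = math.sqrt(
        sum((r - cr) ** 2 + (c - cc) ** 2 for r, c in pixels) / n
    )

    return {
        "area_mm2": area_mm2,
        "perimeter_mm": perimeter_mm,
        "radius_of_gyration_mm": rg_px / px_per_mm,
    }
```

The point of the sketch is only that the same mask yields different physical measurements under different calibrations, which is why a scale estimate is needed before geometric features can be compared across images.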
Scale-aware, graph-based AI is a promising direction for dermoscopic decision support; future work will refine learned edge semantics and evaluate on broader curated benchmarks.
[54] PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
Wanru Zhuang,Wenbo Li,Zhibin Lan,Xu Han,Peng Li,Jinsong Su
Main category: cs.CV
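TL;DR: This paper proposes position-aware text image machine translation (PATIMT) to support fine-grained, layout-preserving translation, and builds the PATIMTBench benchmark covering 10 real-world scenarios for evaluation.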
Details
Motivation: Existing text image machine translation (TIMT) work only translates all the text in an image, provides no position information, and covers limited scenarios, so it falls short of practical needs. Method: Defines the PATIMT task with two sub-tasks, region-specific translation and full-image translation with grounding; builds an adaptive OCR refinement pipeline to produce high-quality annotations; constructs a test set of 1,200 manually annotated instances; and fine-tunes large vision-language models. Result: On the newly built PATIMTBench, fine-tuned compact large vision-language models achieve state-of-the-art performance on both sub-tasks, and the results highlight the scalability and generalizability of the training data. Conclusion: PATIMT offers a more practical direction for text image translation, and the benchmark and dataset provide an important resource for follow-up research on fine-grained translation. Abstract: Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMTBench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.
[55] Domain Adaptive SAR Wake Detection: Leveraging Similarity Filtering and Memory Guidance
He Gao,Baoxiang Huang,Milena Radenkovic,Borui Li,Ge Chen
Main category: cs.CV
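TL;DR: Proposes SimMemDA, a similarity-guided and memory-guided domain adaptation framework for unsupervised cross-modal ship wake detection, which improves detection in SAR images through instance-level feature similarity filtering and a feature memory mechanism.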
Details
Motivation: Because SAR imaging is complex, wake features appear abstract and noisy and are hard to annotate accurately; optical images are visually clearer, but models trained on optical data degrade on SAR images because of domain shift, so optical-to-SAR cross-modal domain adaptation is needed. Method: First uses WakeGAN to convert optical images into SAR-style pseudo-images; then applies instance-level feature similarity filtering to select source samples whose distribution is close to the target domain, reducing negative transfer; introduces a feature-confidence memory bank with a K-nearest-neighbor confidence-weighted fusion strategy to dynamically calibrate target-domain pseudo-labels; and finally optimizes the model with region-mixed training that combines source ground-truth labels with the calibrated pseudo-labels. Result: Experiments show that SimMemDA significantly improves the accuracy and robustness of cross-modal ship wake detection and effectively mitigates domain shift, supporting the effectiveness and feasibility of the method. Conclusion: By combining style transfer, feature similarity filtering, memory-guided pseudo-label refinement, and region-mixed training, SimMemDA achieves unsupervised domain-adaptive wake detection from optical to SAR images, offering an effective solution for cross-modal SAR image interpretation. Abstract: Synthetic Aperture Radar (SAR), with its all-weather and wide-area observation capabilities, serves as a crucial tool for wake detection. However, due to its complex imaging mechanism, wake features in SAR images often appear abstract and noisy, posing challenges for accurate annotation. In contrast, optical images provide more distinct visual cues, but models trained on optical data suffer from performance degradation when applied to SAR images due to domain shift. To address this cross-modal domain adaptation challenge, we propose a Similarity-Guided and Memory-Guided Domain Adaptation (termed SimMemDA) framework for unsupervised domain adaptive ship wake detection via instance-level feature similarity filtering and feature memory guidance. Specifically, to alleviate the visual discrepancy between optical and SAR images, we first utilize WakeGAN to perform style transfer on optical images, generating pseudo-images close to the SAR style. Then, an instance-level feature similarity filtering mechanism is designed to identify and prioritize source samples with target-like distributions, minimizing negative transfer. Meanwhile, a Feature-Confidence Memory Bank combined with a K-nearest neighbor confidence-weighted fusion strategy is introduced to dynamically calibrate pseudo-labels in the target domain, improving the reliability and stability of pseudo-labels. Finally, the framework further enhances generalization through region-mixed training, strategically combining source annotations with calibrated target pseudo-labels.
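The K-nearest-neighbor confidence-weighted fusion described in the SimMemDA abstract can be pictured roughly as follows. This is a generic sketch, not the authors' implementation: the memory-bank entry layout (`feat`, `probs`, `conf`), the cosine similarity, the weighting rule (similarity times stored confidence), and the mixing factor `alpha` are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def calibrate_pseudo_label(query_feat, query_probs, bank, k=3, alpha=0.5):
    """Blend a prediction with its k most similar memory-bank entries.

    bank: list of dicts with keys "feat", "probs", "conf" (hypothetical layout).
    alpha: weight kept on the model's own prediction (assumed hyperparameter).
    """
    # Pick the k entries whose features are closest to the query.
    neighbours = sorted(
        bank, key=lambda e: cosine(query_feat, e["feat"]), reverse=True
    )[:k]

    # Weight each neighbour by (non-negative) similarity times its confidence.
    weights = [
        max(cosine(query_feat, e["feat"]), 0.0) * e["conf"] for e in neighbours
    ]
    total = sum(weights)
    if total == 0:
        return list(query_probs)  # nothing reliable to fuse with

    fused = [
        sum(w * e["probs"][i] for w, e in zip(weights, neighbours)) / total
        for i in range(len(query_probs))
    ]
    return [alpha * p + (1 - alpha) * f for p, f in zip(query_probs, fused)]
```

With an uncertain prediction such as `[0.5, 0.5]` and a single confident neighbour labeled `[1.0, 0.0]`, the calibrated output shifts toward the neighbour's class, which is the stabilizing effect the abstract attributes to the memory bank.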
Experimental results demonstrate that the proposed SimMemDA method can improve the accuracy and robustness of cross-modal ship wake detection tasks, validating the effectiveness and feasibility of the proposed method.
[56] Uncertainty-Aware Hourly Air Temperature Mapping at 2 km Resolution via Physics-Guided Deep Learning
Shengjie Kris Liu,Siqin Wang,Lu Zhang
Main category: cs.CV
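TL;DR: Proposes Amplifier Air-Transformer, a data-driven, physics-guided deep learning method that generates hourly near-surface air temperature data at 2 km resolution over the contiguous United States by combining GOES-16 and ERA5 data, achieving high-accuracy temperature reconstruction and prediction.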
Details
Motivation: Neither weather stations nor satellites alone provide spatiotemporally continuous near-surface air temperature data, so a seamless monitoring method is needed that fuses multiple sources and offers both high spatiotemporal resolution and high accuracy. Method: Uses the Amplifier Air-Transformer deep learning model: a first neural network, which encodes the annual temperature cycle, amplifies ERA5 information, and captures spatiotemporal variations, reconstructs cloud-obscured GOES-16 surface temperature; a second neural network converts the reconstructed surface temperature into air temperature based on Earth surface properties; deep ensemble learning estimates predictive uncertainty. Result: Trained and tested on 77.7 billion pixels and 155 million weather station records, the approach attains an accuracy of 1.93 °C (station-based validation) for hourly 2 km air temperature over the contiguous United States for 2018-2024, enabling continuous air temperature mapping at high spatiotemporal resolution. Conclusion: The method combines data-driven learning with physical prior knowledge to deliver accurate, high-resolution, continuous near-surface air temperature monitoring, and it can be extended to other satellite data sources. Abstract: Near-surface air temperature is a key physical property of the Earth's surface. Although weather stations offer continuous monitoring and satellites provide broad spatial coverage, no single data source offers seamless data in a spatiotemporal fashion. Here, we propose a data-driven, physics-guided deep learning approach to generate hourly air temperature data at 2 km resolution over the contiguous United States. The approach, called Amplifier Air-Transformer, first reconstructs GOES-16 surface temperature data obscured by clouds. It does so through a neural network encoded with the annual temperature cycle, incorporating a linear term to amplify ERA5 temperature values at finer scales and convolutional layers to capture spatiotemporal variations. Then, another neural network transforms the reconstructed surface temperature into air temperature by leveraging its latent relationship with key Earth surface properties. The approach is further enhanced with predictive uncertainty estimation through deep ensemble learning to improve reliability. The proposed approach is built and tested on 77.7 billion surface temperature pixels and 155 million air temperature records from weather stations across the contiguous United States (2018-2024), achieving hourly air temperature mapping accuracy of 1.93 °C in station-based validation. The proposed approach streamlines surface temperature reconstruction and air temperature prediction, and it can be extended to other satellite sources for seamless air temperature monitoring at high spatiotemporal resolution.
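As a rough illustration of how a deep ensemble turns several model outputs into one prediction with an uncertainty estimate, the sketch below combines per-member means and variances using the standard equal-weight Gaussian-mixture formula. Whether the Amplifier Air-Transformer ensemble members output a variance, and how the authors aggregate them, is not stated in the abstract; the function name `combine_ensemble` and its inputs are assumptions.

```python
import math


def combine_ensemble(means, variances):
    """Aggregate ensemble members into one mean and standard deviation.

    means, variances: per-member predicted mean and variance for one pixel
    (hypothetical inputs, e.g. air temperature in degrees Celsius).
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and aligned")
    m = len(means)

    # Ensemble mean is the average of the member means.
    mean = sum(means) / m

    # Total variance = average member variance + spread of member means,
    # written here as E[sigma^2 + mu^2] - (E[mu])^2.
    second_moment = sum(v + mu * mu for mu, v in zip(means, variances)) / m
    variance = max(second_moment - mean * mean, 0.0)

    return mean, math.sqrt(variance)
```

For two members predicting 20.0 and 22.0 with unit variance each, the combined variance is 2.0: half comes from the members' own noise estimates and half from their disagreement.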
The generated data of this study can be downloaded at https://doi.org/10.5281/zenodo.15252812, and the project webpage can be found at https://skrisliu.com/HourlyAirTemp2kmUSA/.
[57] DS@GT AnimalCLEF: Triplet Learning over ViT Manifolds with Nearest Neighbor Classification for Animal Re-identification
Anthony Miyaguchi,Chandrasekaran Maruthaiyannan,Charles R. Clark
Main category: cs.CV
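TL;DR: This paper finds that, in animal re-identification, the effectiveness of post-hoc metric learning depends on the initial quality and domain specificity of the backbone embeddings; a domain-specific model (MegaDescriptor) benefits more from fine-tuning than a general-purpose one (DINOv2), indicating that domain-specific pre-training is crucial for limited-data, fine-grained tasks.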
Details
Motivation: To explore how general-purpose and domain-specific backbones differ under metric learning in data-limited animal re-identification, and to analyze how the plasticity of the feature space affects performance. Method: Compares DINOv2 and MegaDescriptor as backbones; uses a K-nearest-neighbor classifier with thresholding to identify known individuals or flag new ones; adds a triplet-learning projection head for fine-tuning; and evaluates the change in BAKS and BAUS. Result: Triplet learning improves MegaDescriptor by 0.13 points but DINOv2 by only 0.03; DINOv2's validation loss stagnates and visualizations show that its feature space is hard to reshape, indicating that general-purpose models are difficult to optimize for fine-grained tasks. Conclusion: For specialized, data-limited re-identification tasks, post-hoc metric learning alone does little to improve general-purpose models; domain-specific pre-trained models should be preferred. Abstract: This paper details the DS@GT team's entry for the AnimalCLEF 2025 re-identification challenge. Our key finding is that the effectiveness of post-hoc metric learning is highly contingent on the initial quality and domain-specificity of the backbone embeddings. We compare a general-purpose model (DINOv2) with a domain-specific model (MegaDescriptor) as a backbone. A K-Nearest Neighbor classifier with robust thresholding then identifies known individuals or flags new ones. While a triplet-learning projection head improved the performance of the specialized MegaDescriptor model by 0.13 points, it yielded minimal gains (0.03) for the general-purpose DINOv2 on averaged BAKS and BAUS. We demonstrate that the general-purpose manifold is more difficult to reshape for fine-grained tasks, as evidenced by stagnant validation loss and qualitative visualizations. This work highlights the critical limitations of refining general-purpose features for specialized, limited-data re-ID tasks and underscores the importance of domain-specific pre-training. The implementation for this work is publicly available at github.com/dsgt-arc/animalclef-2025.
[58] GhostNetV3-Small: A Tailored Architecture and Comparative Study of Distillation Strategies for Tiny Images
Florian Zager,Hamza A. A. Gardi
Main category: cs.CV
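TL;DR: This paper proposes GhostNetV3-Small, a lightweight model for low-resolution images that outperforms the original GhostNetV3 on CIFAR-10; none of the knowledge distillation methods tested improved performance, suggesting that architectural adaptation is more effective than distillation.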
Details
Motivation: To address the high computational cost of deep neural networks on resource-constrained edge devices by exploring model compression and adaptation for efficient inference. Method: Designs the GhostNetV3-Small architecture for low-resolution inputs and compares distillation techniques including traditional knowledge distillation, teacher assistants, and teacher ensembles. Result: GhostNetV3-Small reaches 93.94% accuracy on CIFAR-10, significantly outperforming the original model; however, all distillation strategies reduce its accuracy. Conclusion: For small-scale image classification, architectural adaptation is more effective than knowledge distillation; further research is needed on model design and advanced distillation methods for low-resolution settings. Abstract: Deep neural networks have achieved remarkable success across a range of tasks; however, their computational demands often make them unsuitable for deployment on resource-constrained edge devices. This paper explores strategies for compressing and adapting models to enable efficient inference in such environments. We focus on GhostNetV3, a state-of-the-art architecture for mobile applications, and propose GhostNetV3-Small, a modified variant designed to perform better on low-resolution inputs such as those in the CIFAR-10 dataset. In addition to architectural adaptation, we provide a comparative evaluation of knowledge distillation techniques, including traditional knowledge distillation, teacher assistants, and teacher ensembles. Experimental results show that GhostNetV3-Small significantly outperforms the original GhostNetV3 on CIFAR-10, achieving an accuracy of 93.94%. Contrary to expectations, all examined distillation strategies led to reduced accuracy compared to baseline training. These findings indicate that architectural adaptation can be more impactful than distillation in small-scale image classification tasks, highlighting the need for further research on effective model design and advanced distillation techniques for low-resolution domains.
[59] From Orthomosaics to Raw UAV Imagery: Enhancing Palm Detection and Crown-Center Localization
Rongkun Zhu,Kangning Cui,Wei Tang,Rui-Feng Wang,Sarra Alqahtani,David Lutz,Fan Yang,Paul Fine,Jordan Karubian,Robert Plemmons,Jean-Michel Morel,Victor Pauca,Miles Silman
Main category: cs.CV
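TL;DR: This study examines raw UAV imagery for palm detection and crown-center localization in tropical forests. Raw imagery performs better in deployment-relevant scenarios, while orthomosaics are better suited to cross-domain generalization; adding crown-center annotations markedly improves localization accuracy.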
Details
Motivation: Accurate individual-tree mapping is essential for ecological monitoring and forest management, but the commonly used UAV orthomosaics suffer from stitching artifacts and heavy preprocessing, which limits field use. Alternatives better suited to on-site deployment are therefore needed. Method: Uses state-of-the-art object detectors and keypoint models to compare raw UAV imagery with orthomosaics in within-domain and cross-domain transfer, and analyzes how crown-center annotations improve localization accuracy. Result: Raw imagery outperforms orthomosaics in deployment-relevant scenarios, whereas orthomosaics are more robust for cross-domain generalization; adding crown-center annotations during training further improves localization accuracy. Conclusion: Raw UAV imagery is better suited to field-deployed tree detection, and combining it with crown-center annotations yields more precise tree positions, offering practical guidance for UAV-based biodiversity and conservation monitoring. Abstract: Accurate mapping of individual trees is essential for ecological monitoring and forest management. Orthomosaic imagery from unmanned aerial vehicles (UAVs) is widely used, but stitching artifacts and heavy preprocessing limit its suitability for field deployment. This study explores the use of raw UAV imagery for palm detection and crown-center localization in tropical forests. Two research questions are addressed: (1) how detection performance varies across orthomosaic and raw imagery, including within-domain and cross-domain transfer, and (2) to what extent crown-center annotations improve localization accuracy beyond bounding-box centroids. Using state-of-the-art detectors and keypoint models, we show that raw imagery yields superior performance in deployment-relevant scenarios, while orthomosaics retain value for robust cross-domain generalization. Incorporating crown-center annotations in training further improves localization and provides precise tree positions for downstream ecological analyses. These findings offer practical guidance for UAV-based biodiversity and conservation monitoring.
[60] DYNAMO: Dependency-Aware Deep Learning Framework for Articulated Assembly Motion Prediction
Mayank Patel,Rahul Jain,Asim Unmesh,Karthik Ramani
Main category: cs.CV
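TL;DR: This paper introduces MechBench, a dataset of 693 synthetic gear assemblies for studying the inference of coupled motion from static geometry, and proposes DYNAMO, a model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds.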
Details
Motivation: Existing methods struggle to infer, from geometry alone, the complex motion that arises from geometric coupling in mechanical assemblies such as gears, so a new method that handles such coupled motion is needed. Method: Builds the MechBench dataset of gear assemblies with ground-truth motion trajectories, and proposes DYNAMO, a dependency-aware neural network that predicts each part's SE(3) motion sequence directly from CAD point clouds. Result: Experiments show that DYNAMO outperforms strong baselines across varied gear configurations, producing accurate and temporally consistent motion predictions. Conclusion: Together, MechBench and DYNAMO establish a data-driven learning framework for analyzing coupled mechanical motion in CAD assemblies, advancing 3D perception and design automation. Abstract: Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.
[61] Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions
Rui-Feng Wang,Mingrui Xu,Matthew C Bauer,Iago Beffart Schardong,Xiaowen Ma,Kangning Cui
Main category: cs.CV
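TL;DR: Proposes Cott-ADNet, a lightweight real-time detection network for recognizing cotton bolls and flowers under complex field conditions. Built on YOLOv11n with two new modules, it achieves efficient multi-scale feature modeling at low computational cost; experiments show strong accuracy and efficiency, making it suitable for automated cotton harvesting and phenotypic analysis.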
Details
Motivation: Cotton harvesting suffers yield losses because it relies on manual labor, is inefficient, and can miss the optimal harvest window; accurately recognizing cotton bolls and their maturity is essential for automated harvesting, yield estimation, and breeding research. Method: Builds on the YOLOv11n framework with improved convolutional designs for better spatial representation and robustness; introduces a NeLU-enhanced Global Attention Mechanism to capture weak and low-contrast features, and a Dilated Receptive Field SPPF module to enlarge the receptive field at low cost for effective multi-scale context modeling. Result: Evaluated with a curated dataset of 4,966 images and 1,216 external validation images, Cott-ADNet achieves 91.5% precision, 89.8% recall, 93.3% mAP50, 71.3% mAP, and a 90.6% F1 score with only 7.5 GFLOPs, and remains stable under multi-scale and rotational variations. Conclusion: Cott-ADNet is an efficient and accurate model for recognizing cotton bolls and flowers, suitable for real-time field deployment, and provides reliable technical support for automated cotton harvesting and high-throughput phenotypic analysis. Abstract: Cotton is one of the most important natural fiber crops worldwide, yet harvesting remains limited by labor-intensive manual picking, low efficiency, and yield losses from missing the optimal harvest window. Accurate recognition of cotton bolls and their maturity is therefore essential for automation, yield estimation, and breeding research. We propose Cott-ADNet, a lightweight real-time detector tailored to cotton boll and flower recognition under complex field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial representation and robustness through improved convolutional designs, while introducing two new modules: a NeLU-enhanced Global Attention Mechanism to better capture weak and low-contrast features, and a Dilated Receptive Field SPPF to expand receptive fields for more effective multi-scale context modeling at low computational cost. We curate a labeled dataset of 4,966 images, and release an external validation set of 1,216 field images to support future research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations. These results demonstrate Cott-ADNet as an accurate and efficient solution for in-field deployment, and thus provide a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis.
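As a quick consistency check on the Cott-ADNet numbers, the reported F1 score can be recomputed from the reported precision and recall. The sketch assumes the F1 score is the harmonic mean of the aggregate precision and recall; the abstract does not say whether it was instead averaged per class, so this is only a sanity check. The function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: 91.5% precision, 89.8% recall.
f1 = f1_score(0.915, 0.898)
print(f"{100 * f1:.1f}%")  # prints "90.6%", matching the reported F1 score
```

Under that assumption the three reported figures are mutually consistent.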
Code and dataset are available at https://github.com/SweefongWong/Cott-ADNet.
[62] Deep learning for 3D point cloud processing -- from approaches, tasks to its implications on urban and environmental applications
Zhenxin Zhang,Zhihua Xu,Yuwei Cao,Ningli Xu,Shuye Wang,Shen'ao Cui,Zhen Li,Rongjun Qin
Main category: cs.CV
TL;DR: 本文对用于点云处理的深度学习方法和数据集进行了元综述,重点关注场景补全、配准、语义分割和建模等关键任务,并分析了这些方法在城市与环境应用中的实际价值与现存差距。
Details
Motivation: Existing surveys mostly track updates in network architecture and overlook the practical value of point cloud processing methods, especially their usability with large-scale data, complex scenes, varying point density, and multimodal data. Method: Conducts a systematic meta review of mainstream deep learning methods and commonly used datasets for point cloud processing, set against typical urban and environmental application scenarios, and analyzes the gap between algorithm performance and practical needs. Result: Identifies several key gaps in the practical deployment of current deep learning methods, including challenges in data scale, scene diversity, varying point density, and modality adaptation. Conclusion: Future research needs to link algorithm design more closely with practical applications, improving the usability and robustness of deep learning methods in real-world point cloud processing tasks. Abstract: Point cloud processing, a fundamental task in geomatics and computer vision, supports tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, disaster response, etc. Due to the rapid development of deep learning, point cloud processing algorithms are now almost exclusively dominated by learning-based approaches, most of which have yet to transition into real-world practice. Existing surveys primarily focus on the ever-updating network architecture to accommodate unordered point clouds, largely ignoring their practical value in typical point cloud processing applications, in which extra-large data volumes, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta review on deep learning approaches and datasets that cover a selection of critical tasks of point cloud processing in use such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks in both the algorithmic and practical aspects of the surveyed methods.
[63] Two-Stage Decoupling Framework for Variable-Length Glaucoma Prognosis
Yiran Song,Yikai Zhang,Silvia Orengo-Nania,Nian Wang,Fenglong Ma,Rui Zhang,Yifan Peng,Mingquan Lin
Main category: cs.CV
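TL;DR: Proposes a Two-Stage Decoupling Framework (TSDF) for variable-length glaucoma prognosis that combines self-supervised learning with an attention-based temporal aggregation module, improving performance and robustness across dataset scales and clinical settings.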
Details
Motivation: Existing methods are limited by fixed-length inputs and small datasets, making effective glaucoma prognosis difficult. Method: In the first stage, self-supervised learning is used for feature representation, pooling multiple glaucoma datasets of different sizes; in the second stage, an attention-based temporal aggregation module handles variable-length sequential inputs. Result: Experiments on two benchmark datasets, OHTS and GRAPE, show that the method significantly improves performance across data scales and clinical settings while keeping the parameter count small. Conclusion: By decoupling training and flexibly handling variable-length sequences, TSDF improves the accuracy and generalization of glaucoma prognosis. Abstract: Glaucoma is one of the leading causes of irreversible blindness worldwide. Glaucoma prognosis is essential for identifying at-risk patients and enabling timely intervention to prevent blindness. Many existing approaches rely on historical sequential data but are constrained by fixed-length inputs, limiting their flexibility. Additionally, traditional glaucoma prognosis methods often employ end-to-end models, which struggle with the limited size of glaucoma datasets. To address these challenges, we propose a Two-Stage Decoupling Framework (TSDF) for variable-length glaucoma prognosis. In the first stage, we employ a feature representation module that leverages self-supervised learning to aggregate multiple glaucoma datasets for training, disregarding differences in their supervisory information. This approach enables datasets of varying sizes to learn better feature representations. In the second stage, we introduce a temporal aggregation module that incorporates an attention-based mechanism to process sequential inputs of varying lengths, ensuring flexible and efficient utilization of all available data. This design significantly enhances model performance while maintaining a compact parameter size. Extensive experiments on two benchmark glaucoma datasets, the Ocular Hypertension Treatment Study (OHTS) and the Glaucoma Real-world Appraisal Progression Ensemble (GRAPE), which differ significantly in scale and clinical settings, demonstrate the effectiveness and robustness of our approach.
[64] Image Tokenizer Needs Post-Training
Kai Qiu,Xiang Li,Hao Chen,Jason Kuen,Xiaohao Xu,Jiuxiang Gu,Yinyi Luo,Bhiksha Raj,Zhe Lin,Marios Savvides
Main category: cs.CV
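TL;DR: Proposes a new tokenizer training scheme, comprising main training and post-training, to improve image generation quality in a discrete latent space.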
Details
Motivation: Existing tokenizers focus only on the reconstruction task and ignore errors in the generation process, leading to a significant discrepancy between the reconstruction and generation distributions. Method: In main training, introduces a latent perturbation strategy to simulate sampling noise and proposes a plug-and-play tokenizer training scheme; in post-training, optimizes the decoder to reduce the distribution difference between generated and reconstructed tokens. Result: With a generator of about 400M parameters, the method reaches 1.60 gFID after main training and 1.36 gFID after post-training, and its effectiveness is validated across multiple tokenizers and generative models. Conclusion: The tokenizer training scheme significantly improves generation quality and convergence speed, and the proposed pFID metric effectively relates tokenizer performance to generation quality. Abstract: Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training.
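One simple way to picture the latent perturbation idea is to randomly swap a fraction of discrete token indices for other codebook entries before decoding, so the decoder learns to tolerate "unexpected" tokens. The sketch below does exactly that with uniform random replacement; this replacement rule, the function name `perturb_tokens`, and the `rate` parameter are assumptions made for illustration, since the abstract does not describe the authors' actual perturbation scheme.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng):
    """Replace each token index with a random codebook index with prob. `rate`.

    tokens: sequence of integer codebook indices for one image.
    codebook_size: number of entries in the discrete codebook.
    rate: per-token replacement probability (hypothetical hyperparameter).
    rng: a random.Random instance, passed in for reproducibility.
    """
    perturbed = list(tokens)
    for i in range(len(perturbed)):
        if rng.random() < rate:
            # Uniform resampling; may occasionally re-draw the same index.
            perturbed[i] = rng.randrange(codebook_size)
    return perturbed


rng = random.Random(0)
clean = [3, 17, 42, 8, 99, 5]
noisy = perturb_tokens(clean, codebook_size=1024, rate=0.3, rng=rng)
```

During training, the decoder would then reconstruct the image from `noisy` rather than `clean`, which is the sense in which such a perturbation simulates sampling noise.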
Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
[65] Towards Foundational Models for Single-Chip Radar
Tianshu Huang,Akarsh Prabhakara,Chuhan Chen,Jay Karhade,Deva Ramanan,Matthew O'Toole,Anthony Rowe
Main category: cs.CV
TL;DR: 本文收集了目前最大的原始雷达数据集(100万样本,29小时),并训练了一个适用于4D单芯片雷达的基础模型GRT,能够以接近高分辨率传感器的精度预测3D占据和语义分割。
Details
Motivation: mmWave radar is inexpensive and robust but has poor angular resolution, and existing learning-based methods are limited by small datasets and the lack of a standardized foundational model. Method: Collects a large raw radar dataset (1M samples), proposes the Generalizable Radar Transformer (GRT) as a foundational model, and trains it on raw radar data to avoid the information loss of commonly used lossy representations. Result: GRT generalizes well across diverse settings, supports fine-tuning for multiple tasks, and exhibits logarithmic data scaling of 20% per 10× data; using raw data is equivalent to a 10× increase in training data; an estimated 100M samples (3000 hours) would be needed to fully exploit the model's potential. Conclusion: A foundational model trained on large-scale raw data, GRT substantially improves the perception capability of single-chip mmWave radar and offers a scalable framework and data direction for future radar learning. Abstract: mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per 10× data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a 10× increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.
[66] Evaluating Robustness of Vision-Language Models Under Noisy Conditions
Purushoth,Alireza
Main category: cs.CV
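TL;DR: This study presents a comprehensive framework for evaluating the robustness of vision-language models under noisy conditions, and finds that performance is significantly affected by caption descriptiveness, model size, and noise type.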
Details
Motivation: Although vision-language models perform well on multimodal tasks, their robustness under noisy conditions is unclear and calls for systematic evaluation. Method: Builds an evaluation framework with controlled perturbations such as lighting variation, motion blur, and compression artifacts, and quantifies semantic alignment using lexical metrics (e.g., BLEU, CIDEr) and neural similarity measures based on sentence embeddings. Result: Experiments show that (1) the descriptiveness of ground-truth captions significantly affects model performance; (2) larger models such as LLaVA have stronger semantic understanding but do not outperform smaller models in every case; and (3) JPEG compression and motion blur substantially degrade performance. Conclusion: The study reveals trade-offs among model size, dataset characteristics, and noise robustness, and provides a standardized benchmark for future multimodal learning. Abstract: Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains poorly understood. In this study, we present a comprehensive framework to assess the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We used both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures using sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.
[67] Instance-Guided Class Activation Mapping for Weakly Supervised Semantic Segmentation
Ali Torabi,Sanjog Gaihre,MD Mahbubur Rahman,Yaqoob Majeed
Main category: cs.CV
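TL;DR: This paper proposes IG-CAM, a new weakly supervised semantic segmentation method that uses instance guidance and influence functions to generate high-quality, boundary-aware localization maps, achieving state-of-the-art performance on PASCAL VOC 2012.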
Details
Motivation: Existing weakly supervised semantic segmentation methods localize object boundaries imprecisely and tend to focus only on the most discriminative regions, failing to cover whole objects. Method: Proposes IG-CAM with three key innovations: instance-guided refinement, influence function integration, and multi-scale boundary enhancement. It uses ground-truth segmentation masks to guide CAM generation, applies influence functions to improve feature robustness, and sharpens boundaries with a multi-scale strategy. Result: Achieves 82.3% mIoU on PASCAL VOC 2012 (86.6% after CRF), significantly outperforming previous methods; qualitative and quantitative experiments indicate superior localization accuracy and generalization. Conclusion: IG-CAM sets a new benchmark for weakly supervised semantic segmentation, achieving precise object coverage and boundary extraction without pixel-level annotations, and has practical value. Abstract: Weakly Supervised Semantic Segmentation (WSSS) addresses the challenge of training segmentation models using only image-level annotations, eliminating the need for expensive pixel-level labeling. While existing methods struggle with precise object boundary localization and often focus only on the most discriminative regions, we propose IG-CAM (Instance-Guided Class Activation Mapping), a novel approach that leverages instance-level cues and influence functions to generate high-quality, boundary-aware localization maps. Our method introduces three key innovations: (1) Instance-Guided Refinement that uses ground truth segmentation masks to guide CAM generation, ensuring complete object coverage rather than just discriminative parts; (2) Influence Function Integration that captures the relationship between training samples and model predictions, leading to more robust feature representations; and (3) Multi-Scale Boundary Enhancement that employs progressive refinement strategies to achieve sharp, precise object boundaries. IG-CAM achieves state-of-the-art performance on the PASCAL VOC 2012 dataset with an mIoU of 82.3% before post-processing, which further improves to 86.6% after applying Conditional Random Field (CRF) refinement, significantly outperforming previous WSSS methods. Our approach demonstrates superior localization accuracy, with complete object coverage and precise boundary delineation, while maintaining computational efficiency. Extensive ablation studies validate the contribution of each component, and qualitative comparisons across 600 diverse images showcase the method's robustness and generalization capability.
The results establish IG-CAM as a new benchmark for weakly supervised semantic segmentation, offering a practical solution for scenarios where pixel-level annotations are unavailable or prohibitively expensive.
[68] Artist-Created Mesh Generation from Raw Observation
Yao He,Youngjoong Kwon,Wenxiao Cai,Ehsan Adeli
Main category: cs.CV
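TL;DR: Proposes an end-to-end framework that recasts 3D point cloud denoising and completion as a 2D inpainting task and directly generates high-quality, artist-style meshes.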
Details
Motivation: Existing methods usually assume clean, complete point clouds as input and are hard to apply to the noisy or incomplete data captured by real sensors, which limits their use in real-world scenarios. Method: Recasts 3D point cloud refinement as a 2D image inpainting task, uses powerful generative models for point cloud completion and denoising, and generates high-quality artist-style meshes end to end. Result: Preliminary experiments on ShapeNet show that the method can produce clean, complete meshes. Conclusion: The framework handles noisy and incomplete real-world point clouds and generates high-quality artist-style meshes end to end, showing good application prospects. Abstract: We present an end-to-end framework for generating artist-style meshes from noisy or incomplete point clouds, such as those captured by real-world sensors like LiDAR or mobile RGB-D cameras. Artist-created meshes are crucial for commercial graphics pipelines due to their compatibility with animation and texturing tools and their efficiency in rendering. However, existing approaches often assume clean, complete inputs or rely on complex multi-stage pipelines, limiting their applicability in real-world scenarios. To address this, we propose an end-to-end method that refines the input point cloud and directly produces high-quality, artist-style meshes. At the core of our approach is a novel reformulation of 3D point cloud refinement as a 2D inpainting task, enabling the use of powerful generative models. Preliminary results on the ShapeNet dataset demonstrate the promise of our framework in producing clean, complete meshes.
[69] Axis-Aligned 3D Stalk Diameter Estimation from RGB-D Imagery
Benjamin Vail,Rahul Harsha Cheppally,Ajay Sharda,Sidharth Rai
Main category: cs.CV
TL;DR: Proposes a geometry-aware computer vision pipeline based on RGB-D imagery for estimating stalk diameter in high-throughput crop breeding.
Details
Motivation: Traditional stalk diameter measurement is time-consuming, error-prone, and hard to scale, which limits high-throughput phenotyping.
Method: Combines deep learning instance segmentation, 3D point cloud reconstruction, and PCA-based axis-aligned slicing for robust stalk diameter estimation.
Result: The method mitigates the effects of curvature, occlusion, and image noise, yielding accurate and scalable stalk diameter estimates.
Conclusion: The method offers a reliable, scalable solution for high-throughput phenotyping in crop breeding and agronomic research.
Abstract: Accurate, high-throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor-intensive, error-prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry-aware computer vision pipeline for estimating stalk diameter from RGB-D imagery. Our method integrates deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high-throughput phenotyping in breeding and agronomic research.
[70] Neural Collapse-Inspired Multi-Label Federated Learning under Label-Distribution Skew
Can Peng,Yuyuan Liu,Yingyu Yang,Pramit Saha,Qianye Yang,J. Alison Noble
Main category: cs.CV
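TL;DR: Proposes a federated learning method grounded in Neural Collapse theory to address the performance degradation caused by heterogeneous data distributions in multi-label settings. Feature disentanglement and a shared NC structure align feature distributions across clients, improving performance in multi-label federated learning.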
Details
Motivation: Most existing federated learning research targets single-label classification, while real applications such as medical imaging often involve multi-label data. Label co-occurrence, inter-label dependencies, and discrepancies between local and global label relationships make multi-label federated learning harder and call for a dedicated solution.
Method: Introduces a feature disentanglement module to extract semantically specific features and uses a predefined shared Neural Collapse structure to guide the clustering of the disentangled class-wise features; regularisation losses encourage compact clustering in the latent feature space, aligning feature distributions across clients.
Result: Experiments on four benchmark datasets under eight settings show that the method outperforms existing approaches and significantly improves multi-label federated learning performance.
Conclusion: The method addresses the data heterogeneity challenge in multi-label federated learning; guiding feature learning with Neural Collapse yields high-quality, well-clustered representations, and the approach is practical and extensible.
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, the performance of deep learning often deteriorates in FL due to decentralized and heterogeneous data. This challenge is further amplified in multi-label scenarios, where data exhibit complex characteristics such as label co-occurrence, inter-label dependency, and discrepancies between local and global label relationships. While most existing FL research primarily focuses on single-label classification, many real-world applications, particularly in domains such as medical imaging, often involve multi-label settings. In this paper, we address this important yet underexplored scenario in FL, where clients hold multi-label data with skewed label distributions. Neural Collapse (NC) describes a geometric structure in the latent feature space where features of each class collapse to their class mean with vanishing intra-class variance, and the class means form a maximally separated configuration. Motivated by this theory, we propose a method to align feature distributions across clients and to learn high-quality, well-clustered representations. To make the NC-structure applicable to multi-label settings, where image-level features may contain multiple semantic concepts, we introduce a feature disentanglement module that extracts semantically specific features. The clustering of these disentangled class-wise features is guided by a predefined shared NC structure, which mitigates potential conflicts between client models due to diverse local data distributions.
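The maximally separated configuration that Neural Collapse describes is commonly realised as a simplex equiangular tight frame (ETF). As a rough illustration only, the sketch below builds such a frame with NumPy. It is an assumption that the paper's predefined shared structure takes this form; the function name `simplex_etf`, the feature dimension, and the use of a common seed so that every client derives the same frame are hypothetical choices, not details taken from the abstract.

```python
import numpy as np


def simplex_etf(num_classes, feat_dim, seed=0):
    """Return a (feat_dim, num_classes) matrix of unit-norm class prototypes.

    Every pair of columns has cosine similarity -1 / (num_classes - 1),
    the most negative value that all pairs can share at once.
    Hypothetical helper: a shared seed lets all clients build the same frame.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if feat_dim < num_classes:
        raise ValueError("this construction needs feat_dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Reduced QR gives feat_dim x num_classes with orthonormal columns.
    basis, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    k = num_classes
    # Subtract the column mean, then rescale so each prototype has norm 1.
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * basis @ centering


prototypes = simplex_etf(num_classes=5, feat_dim=16)
gram = prototypes.T @ prototypes  # pairwise cosine similarities
```

Fixing the targets in advance means no client's local label skew can move the shared class prototypes, which is one plausible reading of how a predefined structure reduces conflicts between client models.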
In addition, we design regularisation losses to encourage compact clustering in the latent feature space. Experiments conducted on four benchmark datasets across eight diverse settings demonstrate that our approach outperforms existing methods, validating its effectiveness in this challenging FL scenario.
[71] Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
Yingxin Lai,Zitong Yu,Jun Wang,Linlin Shen,Yong Xu,Xiaochun Cao
Main category: cs.CV
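TL;DR: Proposes a new face forgery detection approach built on the multi-agent framework Agent4FaceForgery. By simulating human forgery behavior and the text-image interactions found on social media, it generates training data closer to real-world conditions and significantly improves detector performance.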
Details
Motivation: Existing face forgery detectors perform well on offline benchmarks but poorly in practice, mainly because their training data lacks ecological validity. The paper aims to narrow the gap between the lab and the real world by simulating realistic forgery creation and complex text-image interactions.
Method: Proposes Agent4FaceForgery, an LLM-driven multi-agent framework in which each agent has a profile and a memory module and simulates human forgery intents and iterative processes; the agents interact in a simulated social environment to generate data with fine-grained text-image consistency labels; an Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity.
Result: Detectors of several architectures trained on the generated data improve significantly, supporting the method's effectiveness for generalization and real-world applicability.
Conclusion: By simulating forgery behavior in a realistic social environment, Agent4FaceForgery offers a data generation paradigm with higher ecological validity, giving face forgery detection a new research direction and a practical tool.
Abstract: Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve this, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.
[72] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection
Wenxuan Ji,Haichao Shi,Xiao-Yu Zhang
Main category: cs.CV
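TL;DR: Proposes MGNM, a multimodal graph network model based on graph neural networks, to enhance human-object interaction detection; it achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks.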
Details
Motivation: The Transformer architecture does not explicitly model the relational structure in HOI detection, which hampers interaction recognition; GNNs are better suited to the task.
Method: Designs a multimodal graph network framework with a four-stage graph structure and introduces an interaction mechanism over multi-level vision and language features to strengthen information propagation between human-object pairs.
Result: Achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks, shows a significant gain when combined with a more advanced object detector, and keeps a good balance between rare and non-rare classes.
Conclusion: By explicitly modeling relational structure and multi-level feature interaction, MGNM improves HOI detection and supports the case for GNNs on this task.
Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
[73] VQT-Light: Lightweight HDR Illumination Map Prediction with Richer Texture
Kunliang Xie
Main category: cs.CV
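TL;DR: Proposes VQT-Light, a new framework based on VQVAE and ViT for accurate and fast lighting estimation; through discrete feature extraction and global context modeling, it outperforms existing methods in texture detail, inference speed, and fidelity.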
Details
Motivation: Existing lighting estimation methods fall short in recovering the detailed texture of illumination maps, in running speed, and in fidelity, and struggle to balance performance with quality.
Method: Uses VQVAE to extract discrete features of the illumination map, avoiding posterior collapse, and replaces CNNs with ViT to capture the global context and dependencies of the input image, formulating lighting estimation as a multiclass classification task.
Result: The model runs at 40 FPS, improves multiple evaluation metrics, and produces illumination maps with richer texture and higher fidelity.
Conclusion: VQT-Light stays lightweight and efficient while significantly improving lighting estimation quality; experiments show it outperforms current state-of-the-art methods.
Abstract: Accurate lighting estimation is a significant yet challenging task in computer vision and graphics. However, existing methods either struggle to restore detailed textures of the illumination map, or face challenges in running speed and texture fidelity. To tackle this problem, we propose a novel framework (VQT-Light) based on the VQVAE and ViT architectures. VQT-Light includes two modules: feature extraction and lighting estimation. First, we take advantage of VQVAE to extract discrete features of the illumination map rather than continuous features to avoid "posterior collapse". Second, we capture the global context and dependencies of the input image through ViT rather than CNNs to improve the prediction of illumination outside the field of view. Combining the above two modules, we formulate lighting estimation as a multiclass classification task, which plays a key role in our pipeline. As a result, our model predicts light maps with richer texture and better fidelity while remaining lightweight and fast. VQT-Light achieves an inference speed of 40 FPS and improves multiple evaluation metrics. Qualitative and quantitative experiments demonstrate that the proposed method achieves superior results compared to existing state-of-the-art methods.
[74] Adaptive Sampling Scheduler
Qi Wang,Shuliang Zhu,Jinjia Zhou
Main category: cs.CV
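TL;DR: Proposes an adaptive sampling scheduler applicable to various consistency distillation frameworks; by dynamically selecting target timesteps, optimizing the alternating sampling strategy, and introducing smoothing clipping and color balancing, it improves the generation performance and flexibility of diffusion models.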
Details
Motivation: Existing consistency distillation methods choose target timesteps with preset deterministic or stochastic strategies, so a sampling scheduler has to be designed specifically for each distillation process, which limits flexibility and sampling potential in practice.
Method: Proposes an adaptive sampling scheduler with three innovations: (i) dynamic target timestep selection based on computed importance; (ii) optimized alternating sampling of forward denoising and backward noise addition along the solution trajectory; (iii) smoothing clipping and color balancing to improve generation stability and quality at high guidance scales.
Result: Experiments across several consistency distillation frameworks show that the method significantly improves generation performance and adapts flexibly.
Conclusion: The adaptive sampling scheduler overcomes the limitations of conventional scheduling strategies and broadens the applicability and performance of diffusion models in complex generation scenarios.
Abstract: Consistency distillation methods have evolved into effective techniques that significantly accelerate the sampling process of diffusion models. Although existing methods have achieved remarkable results, the selection of target timesteps during distillation mainly relies on deterministic or stochastic strategies, which often require sampling schedulers to be designed specifically for different distillation processes. Moreover, this pattern severely limits flexibility, thereby restricting the full sampling potential of diffusion models in practical applications. To overcome these limitations, this paper proposes an adaptive sampling scheduler that is applicable to various consistency distillation frameworks. The scheduler introduces three innovative strategies: (i) dynamic target timestep selection, which adapts to different consistency distillation frameworks by selecting timesteps based on their computed importance; (ii) optimized alternating sampling along the solution trajectory by guiding forward denoising and backward noise addition based on the proposed timestep importance, enabling more effective exploration of the solution space to enhance generation performance; and (iii) utilization of smoothing clipping and color balancing techniques to achieve stable and high-quality generation results at high guidance scales, thereby expanding the applicability of consistency distillation models in complex generation scenarios. We validated the effectiveness and flexibility of the adaptive sampling scheduler across various consistency distillation methods through comprehensive experimental evaluations. Experimental results consistently demonstrated significant improvements in generative performance, highlighting the strong adaptability achieved by our method.
[75] DisorientLiDAR: Physical Attacks on LiDAR-based Localization
Yizhen Lao,Yu Zhang,Ziting Wang,Chengbo Wang,Yifei Xue,Wanpeng Shao
Main category: cs.CV
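TL;DR: Proposes DisorientLiDAR, a novel adversarial attack framework targeting LiDAR-based localization, which degrades localization accuracy by reverse-engineering models to identify and remove critical keypoints, and validates its effectiveness in the real world.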
Details
Motivation: Deep learning models are vulnerable to adversarial attacks, yet attacks on the localization systems of autonomous vehicles have received little study, and LiDAR-based localization in particular faces serious security challenges.
Method: Reverse-engineers localization models (e.g., feature extraction networks) to identify critical keypoints and strategically removes them, thereby disrupting LiDAR-based localization.
Result: On KITTI, removing regions that contain the Top-K keypoints significantly degrades the registration accuracy of HRegNet, D3Feat, and GeoTransformer; on the Autoware platform it induces noticeable localization drift; in the physical world, hiding critical regions with near-infrared absorptive materials reproduces the attack effects.
Conclusion: DisorientLiDAR can effectively attack LiDAR-based localization systems and can be carried out in the physical world, exposing security weaknesses in current localization models; the attack is realistic and general.
Abstract: Deep learning models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Even though this poses a serious security challenge for the localization of self-driving cars, attacks on it have been little explored, as most adversarial attacks have been applied to 3D perception. In this work, we propose a novel adversarial attack framework called DisorientLiDAR targeting LiDAR-based localization. By reverse-engineering localization models (e.g., feature extraction networks), adversaries can identify critical keypoints and strategically remove them, thereby disrupting LiDAR-based localization. Our proposal is first evaluated on three state-of-the-art point-cloud registration models (HRegNet, D3Feat, and GeoTransformer) using the KITTI dataset. Experimental results demonstrate that removing regions containing Top-K keypoints significantly degrades their registration accuracy. We further validate the attack's impact on the Autoware autonomous driving platform, where hiding merely a few critical regions induces noticeable localization drift. Finally, we extend our attacks to the physical world by hiding critical regions with near-infrared absorptive materials, successfully replicating the attack effects observed in KITTI data. This step moves closer to a realistic physical-world attack and demonstrates the veracity and generality of our proposal.
[76] Exploring Spectral Characteristics for Single Image Reflection Removal
Pengbo Guo,Chengxu Liu,Guoshuai Zhao,Xingsong Hou,Jialie Shen,Xueming Qian
Main category: cs.CV
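TL;DR: Proposes a new reflection removal method based on spectral learning; by building a spectral codebook and designing a spectrum-aware Transformer, it outperforms existing methods on multiple benchmarks.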
Details
Motivation: Reflection and transmission components overlap in the image, and existing methods ignore variations in the spectral properties of reflected light, which makes reflection removal difficult.
Method: Proposes a Spectral Codebook to reconstruct the optical spectrum of the reflection image, designs two spectral prior refinement modules, and introduces a Spectrum-Aware Transformer to jointly recover the transmitted content.
Result: Experiments on three different reflection removal benchmarks show that the method outperforms current state-of-the-art methods both quantitatively and qualitatively.
Conclusion: By exploiting spectral information, the method distinguishes and removes reflections effectively, improving the quality and generalization of image restoration.
Abstract: Eliminating reflections caused by incident light interacting with a reflective medium remains an ill-posed problem in the image restoration area. The primary challenge arises from the overlapping of reflection and transmission components in the captured images, which complicates the task of accurately distinguishing and recovering the clean background. Existing approaches typically address reflection removal solely in the image domain, ignoring the spectral property variations of reflected light, which hinders their ability to effectively discern reflections. In this paper, we start with a new perspective on spectral learning, and propose the Spectral Codebook to reconstruct the optical spectrum of the reflection image. The reflections can be effectively distinguished by perceiving the wavelength differences between different light sources in the spectrum. To leverage the reconstructed spectrum, we design two spectral prior refinement modules to re-distribute pixels in the spatial dimension and adaptively enhance the spectral differences along the wavelength dimension. Furthermore, we present the Spectrum-Aware Transformer to jointly recover the transmitted content in spectral and pixel domains. Experimental results on three different reflection benchmarks demonstrate the superiority and generalization ability of our method compared to state-of-the-art models.
[77] Maps for Autonomous Driving: Full-process Survey and Frontiers
Pengxin Chen,Zhipeng Luo,Xiaoqi Jiang,Zhangcai Yin,Jonathan Li
Main category: cs.CV
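TL;DR: Divides the evolution of autonomous-driving maps into three stages (HD maps, Lite maps, and Implicit maps), systematically reviews each stage's production workflow, technical challenges, and solutions, and discusses frontier research on map representations and their integration into end-to-end autonomous driving frameworks.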
Details
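Motivation: As autonomous driving advances, traditional HD maps suffer from high update costs and large storage overhead, creating a need for more efficient and flexible map representations and production methods.
Method: Through a literature survey, divides map development into three stages, analyzes each stage's production workflow, key technical challenges, and the solutions proposed by academia, and discusses how new map representations can be combined with end-to-end autonomous driving systems.
Result: Summarizes the technical evolution from HD maps to Implicit maps, reviews representative methods and optimization strategies at each stage, and highlights the trend toward lighter and implicit maps.
Conclusion: Implicit maps combined with learning-based representations could reduce map dependence and improve system generalization, making them an important direction for future autonomous-driving maps.
Abstract: Maps have always been an essential component of autonomous driving. With the advancement of autonomous driving technology, both the representation and production process of maps have evolved substantially. The article categorizes the evolution of maps into three stages: High-Definition (HD) maps, Lightweight (Lite) maps, and Implicit maps. For each stage, we provide a comprehensive review of the map production workflow, highlighting the technical challenges involved and summarizing relevant solutions proposed by the academic community. Furthermore, we discuss cutting-edge research advances in map representations and explore how these innovations can be integrated into end-to-end autonomous driving frameworks.
[78] CIARD: Cyclic Iterative Adversarial Robustness Distillation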
Liming Lu,Shuchao Pang,Xu Zheng,Xiang Gu,Anan Du,Yunhuai Liu,Yongbin Zhou
Main category: cs.CV
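TL;DR: Proposes CIARD, a new cyclic iterative adversarial robustness distillation method that uses a multi-teacher framework and continuous adversarial retraining to balance the student model's robustness against its clean-sample performance.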
Details
Motivation: Existing adversarial robustness distillation methods improve the student model's robustness but often degrade its performance on clean samples, mainly because the dual-teacher framework has conflicting optimization objectives and iteratively generated adversarial examples degrade the teacher model.
Method: Proposes CIARD with two innovations: a multi-teacher framework with contrastive push-loss alignment that eases the conflict between dual-teacher objectives, and continuous adversarial retraining that dynamically maintains the teacher's robustness.
Result: On CIFAR-10, CIFAR-100, and Tiny-ImageNet, CIARD improves adversarial defense rates by 3.53% on average and clean-sample accuracy by 5.87%, significantly outperforming existing methods.
Conclusion: CIARD addresses the trade-off between robustness and clean performance and sets a new benchmark for adversarial robustness distillation of lightweight models.
Abstract: Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Though existing ARD approaches enhance the student model's robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem inherent in existing methods with a dual-teacher framework as: 1. The divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and 2. The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: a. A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and b. Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53 improvement in adversarial defense rates across various attack scenarios and a 5.87 increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/eminentgu/CIARD
[79] Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Jinjie Shen,Yaxiong Wang,Lechao Cheng,Nan Pu,Zhun Zhong
Main category: cs.CV
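TL;DR: Proposes a semantically aligned approach to multimodal forgery detection: it builds SAMM, the first semantically coordinated multimodal manipulation dataset, and proposes RamDG, a retrieval-augmented detection framework that significantly outperforms existing methods in detection accuracy.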
Details
Motivation: The cross-modal semantic inconsistencies in existing datasets do not match real-world manipulation patterns, so models perform poorly in practice; a semantically coordinated manipulation dataset closer to reality is needed.
Method: Builds the SAMM dataset with a two-stage pipeline: first apply state-of-the-art image editing, then generate plausible text descriptions that are semantically consistent with the visual edits. On this basis, proposes the RamDG framework, which retrieves contextual evidence from external knowledge bases and combines it with the input for forgery detection and grounding.
Result: RamDG achieves 2.06% higher detection accuracy on SAMM than existing state-of-the-art methods.
Conclusion: Semantically coordinated multimodal manipulations are closer to real scenarios; RamDG improves detection performance and gives media forensics a new research direction and data foundation.
Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
[80] MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization
YiTong Liu,TianZhu Liu,YanFeng GU
Main category: cs.CV
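TL;DR: Proposes an EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method for cross-view geo-localization, which uses a multi-frequency branch-wise block and a frequency-aware spatial attention module to improve the consistency and robustness of feature representations.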
Details
Motivation: Existing cross-view geo-localization methods neglect spatial and semantic information and struggle with the appearance differences and feature extraction difficulties caused by viewpoint changes.
Method: Designs the MFB block to capture low-frequency structural features and high-frequency edge features at multiple scales, and introduces the FSA module to adaptively focus on key frequency regions while suppressing background noise and viewpoint-variation interference.
Result: Experiments on benchmarks including University-1652, SUES-200, and Dense-UAV show competitive performance on drone localization and drone navigation tasks.
Conclusion: MFAF improves the accuracy and robustness of cross-view image matching and has potential for practical geo-localization applications.
Abstract: Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to the significant appearance variations of objects observed from variable views, along with the difficulty in extracting discriminative features. Existing approaches often rely on extracting features through feature map segmentation while neglecting spatial and semantic information. To address these issues, we propose the EVA02-based Multi-scale Frequency Attention Fusion (MFAF) method. The MFAF method consists of the Multi-Frequency Branch-wise Block (MFB) and the Frequency-aware Spatial Attention (FSA) module. The MFB block effectively captures both low-frequency structural features and high-frequency edge details across multiple scales, improving the consistency and robustness of feature representations across various viewpoints. Meanwhile, the FSA module adaptively focuses on the key regions of frequency features, significantly mitigating the interference caused by background noise and viewpoint variability. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and Dense-UAV, demonstrate that the MFAF method achieves competitive performance in both drone localization and drone navigation tasks.
[81] A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks
Gordon Hung,Ivan Felipe Rodriguez
Main category: cs.CV
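TL;DR: Evaluates YOLOv8 through YOLOv11 on underwater image detection using two public marine datasets (coral disease and fish species). Accuracy saturates after YOLOv9 while inference speed keeps improving; the lightweight YOLOv10 offers the best balance between accuracy and speed and suits deployment on AUVs. A reproducible benchmark and codebase are provided.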
Details
Motivation: Underwater imagery suffers from light attenuation, turbidity, and class imbalance, and AUVs have limited compute. YOLO models are mostly benchmarked on terrestrial scenes, so their behavior in marine environments is unclear, and the latest YOLO versions need a systematic evaluation on underwater vision tasks.
Method: Curates two public underwater datasets (Coral Disease: 4,480 images, 18 classes; Fish Species: 7,500 images, 20 classes), sets up four training-data fractions (25% to 100%) with fixed validation and test sets, trains YOLOv8-s through YOLOv11-s with identical hyperparameters, evaluates precision, recall, mAP, inference time, and FPS, and analyzes feature usage with Grad-CAM.
Result: On both datasets, accuracy saturates after YOLOv9, suggesting that later improvements target efficiency more than accuracy; inference speed improves markedly, and YOLOv10-s gives the best speed-accuracy trade-off while keeping high accuracy, making it suitable for embedded AUV deployment; an open, reproducible benchmark and code are provided.
Conclusion: Among recent YOLO releases, YOLOv10-s shows the best balance of efficiency and performance on underwater vision tasks and is recommended for resource-constrained AUV platforms; the study fills the gap in systematic evaluation of YOLO in marine environments and advances standardized benchmarks for underwater computer vision.
Abstract: Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25%, 50%, 75%, 100% of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy.
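The data-fraction protocol and the latency-to-FPS relation mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the abstract does not say how the 25%, 50%, 75%, and 100% subsets were drawn, so the nested, seeded sampling here is one plausible choice, and the helper names `nested_training_subsets` and `fps_from_latency_ms` as well as the pool of 1,000 image ids are hypothetical.

```python
import random


def nested_training_subsets(image_ids, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Map each fraction to a list of training ids.

    Assumption: subsets are nested, so the 25% regime is contained in the
    50% regime and so on. Validation and test ids are expected to have been
    split off beforehand and are never passed to this function.
    """
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return {f: shuffled[: round(f * len(shuffled))] for f in fractions}


def fps_from_latency_ms(latency_ms):
    """Convert per-image inference time in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Hypothetical training pool of 1,000 image ids.
subsets = nested_training_subsets(range(1000))
sizes = {fraction: len(ids) for fraction, ids in subsets.items()}
```

Nesting the subsets keeps the comparison across regimes clean: any change in accuracy between two fractions comes from the added images rather than from a different random draw.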
Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.
[82] StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo
Xianda Guo,Chenming Zhang,Ruilin Wang,Youmin Zhang,Wenzhao Zheng,Matteo Poggi,Hao Zhao,Qin Zou,Long Chen
Main category: cs.CV
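TL;DR: Presents StereoCarla, a high-fidelity synthetic stereo dataset built on the CARLA simulator and designed for autonomous driving scenarios; its diverse camera configurations and environmental conditions significantly improve the cross-domain generalization of stereo matching models.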
Details
Motivation: Existing stereo matching models generalize poorly because their training data lacks diversity, especially in complex real-world settings such as autonomous driving.
Method: Builds the StereoCarla stereo dataset on the CARLA simulation platform, covering multiple baselines, viewpoints, and sensor placements as well as variations in lighting, weather, and road geometry, and validates it with cross-domain experiments on several standard datasets.
Result: On the KITTI2012, KITTI2015, Middlebury, and ETH3D benchmarks, models trained on StereoCarla outperform models trained on 11 existing datasets, and adding StereoCarla to multi-dataset training substantially improves generalization accuracy.
Conclusion: StereoCarla gives stereo matching for autonomous driving a diverse and controllable benchmark platform and helps in developing more robust depth perception systems.
Abstract: Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles.
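One reason baseline diversity matters for generalization is the standard rectified-stereo relation between depth and disparity, d = f * B / Z: the same scene produces very different disparity ranges under different baselines. The sketch below illustrates that relation only. The focal length, baselines, and depth are made-up illustrative values, not StereoCarla's actual camera parameters, and the function names are hypothetical.

```python
import math


def disparity_px(depth_m, focal_px, baseline_m):
    """Disparity in pixels of a point at depth_m for a rectified stereo pair."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity_m(disp_px, focal_px, baseline_m):
    """Invert the relation: depth in metres from a positive disparity."""
    if disp_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disp_px


FOCAL_PX = 1000.0  # assumed focal length in pixels (illustrative)
DEPTH_M = 20.0     # an object 20 m ahead of the cameras

# The same object seen with a narrow and a wide baseline (illustrative values).
narrow = disparity_px(DEPTH_M, FOCAL_PX, baseline_m=0.1)
wide = disparity_px(DEPTH_M, FOCAL_PX, baseline_m=0.5)
```

A model trained only on one baseline sees a narrow slice of the disparity distribution, so covering several baselines in the training data is a plausible route to the cross-domain gains the abstract reports.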
Code is available at https://github.com/XiandaGuo/OpenStereo, and data is available at https://xiandaguo.net/StereoCarla.
[83] SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes
Wenzhuo Jin,Qianfeng Yang,Xianhao Wu,Hongming Chen,Pengpeng Li,Xiang Chen
Main category: cs.CV
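TL;DR: Presents SmokeBench, a real-world benchmark dataset for surveillance image desmoking that contains paired images across varied scenes and smoke concentrations, intended to advance desmoking algorithms for early-stage fires.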
Details
Motivation: Smoke produced in the early stage of a fire severely reduces the visibility of surveillance systems and limits emergency response, so effective image desmoking is needed to recover clear scene information.
Method: Builds SmokeBench, a large-scale dataset of real-world paired smoke and smoke-free images, and systematically benchmarks a range of desmoking methods on it.
Result: Provides precisely aligned smoke-degraded and clean image pairs that support supervised learning and rigorous evaluation; experiments measure the performance of existing desmoking methods on the dataset.
Conclusion: SmokeBench gives image desmoking in real fire scenes an important foundation and helps advance robust, practical desmoking algorithms; the dataset has been publicly released.
Abstract: Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scene setups and smoke concentrations. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.
[84] RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation
Siju Ma,Changsiyu Gong,Xiaofeng Fan,Yong Ma,Chengjie Jiang
Main category: cs.CV
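TL;DR: Proposes RIS-FUSION, a new text-driven infrared and visible image fusion framework that jointly optimizes fusion and referring image segmentation (RIS) to strengthen the text's guidance of the fusion result.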
Details
Motivation: Existing text-driven fusion methods lack an effective way to supervise and evaluate the text's contribution. RIS and text-driven fusion share a common goal, highlighting the object the text refers to, so RIS can be used to improve the semantic consistency of fusion.
Method: Proposes the cascaded RIS-FUSION framework, whose core LangGatedFusion module injects textual features into the fusion backbone to improve semantic alignment; also builds MM-RIS, a large-scale multimodal RIS benchmark with 12.5k training and 3.5k testing triplets (image pair, mask, referring expression).
Result: RIS-FUSION outperforms existing methods by more than 11% in mIoU, reaching state-of-the-art performance.
Conclusion: By unifying image fusion and referring image segmentation, RIS-FUSION strengthens the text's guidance of the fusion process and supports cross-task joint optimization for text-driven fusion.
Abstract: Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
[85] Learning by Imagining: Debiased Feature Augmentation for Compositional Zero-Shot Learning
Haozhe Zhang,Chenchen Jing,Mingyu Liu,Qingsheng Wang,Hao Chen
Main category: cs.CV
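TL;DR: Proposes Debiased Feature Augmentation (DeFA), a new method that augments features through a disentangle-and-reconstruct framework and a debiasing strategy to address the challenges of compositional zero-shot learning.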
Details
Motivation: Because attributes and objects are entangled and real-world data is commonly long-tailed, learning generalizable compositional representations remains challenging in compositional zero-shot learning (CZSL).
Method: DeFA combines a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy, using prior knowledge of seen attributes and objects to synthesize high-fidelity composition features that support compositional generalization.
Result: Extensive experiments on three widely used datasets show that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.
Conclusion: DeFA improves generalization in compositional zero-shot learning and addresses attribute-object entanglement and long-tailed data distributions.
Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by learning prior knowledge of seen primitives, i.e., attributes and objects. Learning generalizable compositional representations in CZSL remains challenging due to the entangled nature of attributes and objects as well as the prevalence of long-tailed distributions in real-world data. Inspired by neuroscientific findings that imagination and perception share similar neural processes, we propose a novel approach called Debiased Feature Augmentation (DeFA) to address these challenges. The proposed DeFA integrates a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy. DeFA explicitly leverages the prior knowledge of seen attributes and objects by synthesizing high-fidelity composition features to support compositional generalization. Extensive experiments on three widely used datasets demonstrate that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.
[86] AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Heng Zhang,Haichuan Hu,Yaomin Shen,Weihao Yu,Yilei Yuan,Haochen You,Guo Cheng,Zijian Zhang,Lubin Gan,Huihui Wei,Hao Zhang,Jin Huang
Main category: cs.CV
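TL;DR: Proposes AsyMoE, an architecture that uses three groups of experts to address the modal asymmetry between visual and linguistic processing, improving accuracy while reducing the number of activated parameters.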
Details
Motivation: Because visual and linguistic processing are asymmetric, existing MoE methods struggle to balance modality-specific features with cross-modal interactions, and language experts in deeper layers lose contextual grounding.
Method: Designs three expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts that suppress parametric biases and maintain contextual grounding.
Result: AsyMoE improves accuracy by 26.58% over vanilla MoE and by 15.45% over modality-specific MoE, with 25.45% fewer activated parameters than dense models.
Conclusion: AsyMoE models the asymmetry between visual and linguistic processing and improves both the efficiency and the performance of large multimodal models.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
[87] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer
Pukun Zhao,Longxiang Wang,Miaowei Wang,Chen Chen,Fanqing Zhou,Haojian Huang
Main category: cs.CV
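TL;DR: Proposes two dynamic spatial reasoning benchmarks (locally observable maze navigation and match-2 elimination) that evaluate spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are coupled, and introduces a subjective experience-based memory mechanism. Experiments show that mainstream models have key limitations in dynamic spatial reasoning and long-term memory.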
Details
Motivation: Existing spatial reasoning benchmarks mostly target static or globally observable environments and do not reflect the challenges of long-horizon reasoning and memory use under partial observability and dynamic change.
Method: Designs two dynamic spatial tasks, locally observable maze navigation and match-2 elimination, in which every action changes the structure of the environment and the model must continually update its cognition and strategy. Also proposes a subjective experience-based memory mechanism for cross-task experience transfer and validation.
Result: Mainstream models perform poorly on the new benchmarks, exposing key deficiencies in dynamic spatial reasoning and long-term memory, while the proposed memory mechanism helps cross-task performance.
Conclusion: The proposed benchmarks expose the limitations of existing models in dynamic, partially observable environments and provide a comprehensive evaluation platform for future methods.
Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination, that systematically evaluate models' abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.
[88] SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation
Jingdong Zhang,Weikai Chen,Yuan Liu,Jionghao Wang,Zhengming Yu,Zhuowen Shen,Bo Yang,Wenping Wang,Xin Li
Main category: cs.CV
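TL;DR: SPGen is a single-view 3D generative model based on Spherical Projection (SP). It maps geometry onto a sphere and unwraps it into a multi-layer 2D representation, enabling consistent, flexible, and efficient 3D generation in the image domain with improved geometric quality and computational efficiency.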
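To make the idea of a multi-layer spherical projection concrete, the following is a minimal sketch and not SPGen's implementation. It assumes the bounding sphere is centered at the origin, that each pixel of an equirectangular grid stores distances from the center, and that layer k holds the k-th nearest surface hit along that direction. The function name `spherical_projection`, the grid size, and the sample points are all hypothetical.

```python
import math

def spherical_projection(points, height, width, layers):
    """Bin 3-D points by direction from the origin and keep sorted radii per pixel."""
    hits = [[[] for _ in range(width)] for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the center has no direction, so it cannot be projected
        theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
        phi = math.atan2(y, x) % (2.0 * math.pi)       # azimuth in [0, 2*pi)
        row = min(int(theta / math.pi * height), height - 1)
        col = min(int(phi / (2.0 * math.pi) * width), width - 1)
        hits[row][col].append(r)

    # maps[k][row][col] is the k-th smallest radius seen in that direction (0.0 = empty).
    maps = [[[0.0] * width for _ in range(height)] for _ in range(layers)]
    for row in range(height):
        for col in range(width):
            for k, r in enumerate(sorted(hits[row][col])[:layers]):
                maps[k][row][col] = r
    return maps

# Hypothetical toy input: two nested surface points along one direction, one elsewhere.
points = [(1.0, 0.5, 0.2), (2.0, 1.0, 0.4), (0.2, 0.1, 1.0)]
sp_maps = spherical_projection(points, height=4, width=8, layers=2)
```

In this toy case the two collinear points fall into the same pixel and occupy layers 0 and 1, which is one simple way nested internal structure could be stored as stacked 2D images.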
Details
Motivation: Existing single-view 3D generative models rely on multiview diffusion priors, which are prone to inter-view inconsistencies and cannot faithfully represent complex internal structure or nontrivial topologies. A new method is needed to overcome these problems.
Method: Projects geometry onto a bounding sphere and unwraps it into a compact multi-layer 2D Spherical Projection (SP) representation. Generation uses a 2D diffusion model in the image domain; the single-viewpoint nature of the SP mapping ensures consistency, and the maps can be lifted directly to watertight or open 3D surfaces.
Result: Experiments show that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency, with more consistent generation, good modeling of internal structure, and efficient finetuning.
Conclusion: By introducing the spherical projection representation, SPGen achieves high-quality, consistent, and flexible single-view 3D generation in the image domain and offers an efficient, extensible approach to 3D content generation.
Abstract: Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.
[89] Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
Yunhan Zhao,Xiang Zheng,Xingjun Ma
Main category: cs.CV
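TL;DR: Proposes Defense2Attack, a new jailbreak method for vision-language models that uses defensive patterns to guide jailbreak prompt design and achieves superior attack performance in a single attempt.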
Details
Motivation: Vision-Language Models (VLMs) are highly capable but vulnerable to jailbreak attacks, and current jailbreak methods still leave room for improvement in effectiveness and efficiency.
Method: Proposes Defense2Attack, which has three components: a visual optimizer that embeds universal adversarial perturbations with affirmative semantics, a textual optimizer that uses a defense-styled prompt, and a red-team suffix generator that strengthens the jailbreak through reinforcement fine-tuning.
Result: Experiments on four VLMs and four safety benchmarks show that the method outperforms state-of-the-art attacks in a single attempt.
Conclusion: Incorporating a weak defense mechanism can significantly improve jailbreak effectiveness and efficiency, offering a new perspective on attacking VLMs.
Abstract: Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.
[90] Effective Gaussian Management for High-fidelity Object Reconstruction
Jiateng Liu,Hao Gao,Jiu-Cheng Xie,Chi-Man Pun,Jian Xiong,Haolun Li,Feng Xu
Main category: cs.CV
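TL;DR: Proposes a Gaussian management approach that dynamically activates spherical harmonics or normals and adaptively adjusts the Gaussian representation, achieving high-quality object reconstruction with higher efficiency and fewer parameters.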
Details
Motivation: Address the gradient conflicts that dual supervision causes in existing Gaussian Splatting methods, and improve reconstruction accuracy and representation efficiency.
Method: Introduces a new densification strategy, guided by a surface reconstruction module, that dynamically activates spherical harmonics or normals. Also designs a lightweight Gaussian representation that adapts the spherical harmonic order to gradient magnitudes and performs task-decoupled pruning.
Result: Outperforms current state-of-the-art methods in both reconstruction quality and efficiency while using significantly fewer parameters.
Conclusion: The proposed Gaussian management approach is model-agnostic and can be integrated into other frameworks, improving performance while reducing model size.
Abstract: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussians with minimal impact on a reconstruction task without sacrificing others, which balances the representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
[91] Modelling and analysis of the 8 filters from the "master key filters hypothesis" for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory
Tony Lindeberg,Zahra Babaiee,Peyman M. Kiasari
Main category: cs.CV
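TL;DR: The paper analyses and models 8 "master key filters" extracted from depthwise-separable networks based on the ConvNeXt architecture. It finds that the learned filters are well approximated by discrete scale-space filters (difference operators applied to Gaussian kernels) and that networks retain good predictive performance when the learned filters are replaced by the idealized ones.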
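As background for the phrase "discrete scale-space filters", here is a minimal 1-D sketch. It assumes the standard definition of the discrete analogue of the Gaussian kernel, T(n; t) = exp(-t) I_n(t), with I_n the modified Bessel function of integer order, and applies a central difference to obtain a first-order derivative approximation. The scale value `t = 1.0`, the truncation radius, and the helper names are illustrative choices, not values or code from the paper.

```python
import math

def bessel_i(order, t, terms=40):
    """Modified Bessel function I_order(t) for integer order, via its power series."""
    order = abs(order)
    return sum(
        (t / 2.0) ** (2 * k + order) / (math.factorial(k) * math.factorial(k + order))
        for k in range(terms)
    )

def discrete_gaussian(t, radius):
    """Discrete analogue of the Gaussian kernel, T(n; t) = exp(-t) * I_n(t)."""
    return [math.exp(-t) * bessel_i(n, t) for n in range(-radius, radius + 1)]

def central_difference(kernel):
    """Central difference (k[n+1] - k[n-1]) / 2 of a 1-D kernel, zero-padded at the ends."""
    padded = [0.0] + kernel + [0.0]
    return [(padded[i + 1] - padded[i - 1]) / 2.0 for i in range(1, len(padded) - 1)]

# Hypothetical scale parameter and truncation radius, chosen only for illustration.
t = 1.0
radius = 8
smoothing = discrete_gaussian(t, radius)
first_derivative = central_difference(smoothing)

mass = sum(smoothing)  # close to 1 when the radius is large relative to t
variance = sum(n * n * w for n, w in zip(range(-radius, radius + 1), smoothing))  # close to t
```

A separable 2-D filter of the kind discussed in the entry could then be formed as an outer product of two such 1-D kernels, possibly with different scale parameters in the two coordinate directions.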
Details
Motivation: Investigate whether the receptive fields learned in depthwise-separable convolutional networks can be approximated by interpretable, idealized scale-space models, to improve understanding of the networks' internal mechanisms.
Method: First computes spatial spread measures (weighted means and variances) of the learned filters, then models the clustered "master key filters" as difference operators applied to discrete Gaussian smoothing, with either different or equal scale parameters in the two coordinate directions. Models are fitted by matching spatial variances or by minimizing l1/l2 norms.
Result: The idealized receptive field models show good qualitative similarity to the learned filters and fit well under both the equal-variance and the norm-minimization criteria. Networks with the learned filters replaced by idealized ones retain good performance.
Conclusion: The filters learned in depthwise-separable networks can be well approximated by discrete scale-space filters, which supports a clear mathematical interpretation of their structure and provides a theoretical basis for designing more efficient and interpretable networks.
Abstract: This paper presents the results of analysing and modelling a set of 8 "master key filters", which have been extracted by applying a clustering approach to the receptive fields learned in depthwise-separable deep networks based on the ConvNeXt architecture. For this purpose, we first compute spatial spread measures in terms of weighted mean values and weighted variances of the absolute values of the learned filters, which support the working hypotheses that: (i) the learned filters can be modelled by separable filtering operations over the spatial domain, and that (ii) the spatial offsets of those learned filters that are non-centered are rather close to half a grid unit. Then, we model the clustered "master key filters" in terms of difference operators applied to a spatial smoothing operation in terms of the discrete analogue of the Gaussian kernel, and demonstrate that the resulting idealized models of the receptive fields show good qualitative similarity to the learned filters. This modelling is performed in two different ways: (i) using possibly different values of the scale parameters in the coordinate directions for each filter, and (ii) using the same value of the scale parameter in both coordinate directions. Then, we perform the actual model fitting by either (i) requiring spatial spread measures in terms of spatial variances of the absolute values of the receptive fields to be equal, or (ii) minimizing the discrete $l_1$- or $l_2$-norms between the idealized receptive field models and the learned filters. Complementary experimental results then demonstrate that the idealized models of receptive fields have good predictive properties for replacing the learned filters by idealized filters in depthwise-separable deep networks, thus showing that the learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters.
[92] What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment
Rishab Parthasarathy,Jasmine Collins,Cory Stephenson
Main category: cs.CV
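TL;DR: Studies how humans and multimodal large language models (LLMs) weigh different image attributes (such as aesthetics, anatomical accuracy, and composition) when judging the quality of generated images. Humans judge every attribute reliably, whereas LLMs are weaker on some of them.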
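One way to picture a correlation between pairs of image quality attributes is sketched below. This is an illustration under stated assumptions, not the paper's procedure: it assumes binary preference votes per image pair and uses the Pearson coefficient, whereas the entry does not say which correlation measure the authors use. The attribute names come from the entry; the vote values are made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical votes over six image pairs: 1 = first image preferred, 0 = second.
votes = {
    "aesthetics": [1, 1, 0, 1, 0, 1],
    "lack of artifacts": [1, 1, 0, 1, 0, 0],
    "anatomical accuracy": [0, 1, 0, 1, 1, 0],
}

# One coefficient per unordered pair of attributes.
correlations = {
    (a, b): pearson(votes[a], votes[b]) for a, b in combinations(votes, 2)
}
```

Running the same computation once on human votes and once on LLM votes would give two tables of coefficients that can be compared attribute pair by attribute pair.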
Details
Motivation: Understand how multimodal LLMs use human-relevant image concepts (such as style and composition) when assessing image quality, and expose where their judgments differ from those of humans.
Method: Builds a dataset of human preferences from synthetically generated image pairs, analyses the inter-task correlation between image quality attributes for humans and for LLMs, and studies the independent effect of each attribute using controlled synthetic datasets.
Result: The relationships between image quality attributes are relatively strong in human judgments and much weaker for LLMs. Humans easily judge all attributes, but LLMs have difficulty with attributes such as anatomical accuracy.
Conclusion: Multimodal LLMs perceive key visual attributes differently from humans when assessing image quality, which points to limits in how well current models mimic human aesthetic and fine-detail judgment.
Abstract: Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study what attributes of an image--specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style--are important for both LLMs and humans to make judgments on image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g. high vs. low aesthetic image), however we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
[93] Recurrent Cross-View Object Geo-Localization
Xiaohan Zhang,Si-Yuan Cao,Xiaokai Bai,Yiming Li,Zhangkai Shen,Zhe Wu,Xiaoxi Hu,Hui-liang Shen
Main category: cs.CV
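TL;DR: Proposes ReCOT, a Transformer-based recurrent method for cross-view object geo-localization that uses learnable tokens and iterative refinement to reach state-of-the-art performance with 60% fewer parameters.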
Details
Motivation: Existing CVOGL methods are vulnerable to feature noise and lack an error-correction mechanism, so a more robust and efficient localization framework is needed.
Method: Reformulates CVOGL as a recurrent localization task, uses learnable tokens to encode the query intent, and combines SAM-based knowledge distillation with a Reference Feature Enhancement Module (RFEM) to refine the predicted location iteratively.
Result: Achieves SOTA performance on standard CVOGL benchmarks while reducing model parameters by 60%.
Conclusion: Through its recurrent mechanism and feature enhancement strategy, ReCOT improves the accuracy and efficiency of cross-view object localization, with greater robustness and fewer parameters.
Abstract: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.
[94] A-TDOM: Active TDOM via On-the-Fly 3DGS
Yiwei Xu,Xiang Wang,Yifei Yu,Wentian Gan,Luca Morelli,Giulio Perda,Xiongwu Xiao,Zongqian Zhan,Xin Wang,Fabio Remondino
Main category: cs.CV
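TL;DR: Proposes A-TDOM, a near real-time True Digital Orthophoto Map (TDOM) generation method based on On-the-Fly 3DGS optimization, which completes optimization within seconds of each new image while maintaining good rendering quality and geometric accuracy.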
Details
Motivation: Traditional TDOM generation relies on a complex offline photogrammetric pipeline, which introduces delays, and its quality suffers from inaccurate camera poses, inaccurate DSMs, and scene occlusions, so it cannot meet real-time needs.
Method: Uses On-the-Fly SfM to compute the pose and sparse point cloud of each new image, integrates new Gaussians into previously unseen or coarsely reconstructed regions for real-time 3DGS optimization, and combines this with orthogonal splatting to render immediately after each update.
Result: Initial experiments on multiple benchmarks show that A-TDOM generates TDOM in near real-time, with 3DGS optimization taking only seconds per new image while maintaining acceptable rendering quality and geometric accuracy.
Conclusion: Through online 3DGS optimization, A-TDOM achieves efficient, near real-time TDOM generation and overcomes the latency and quality problems of traditional methods.
Abstract: True Digital Orthophoto Map (TDOM) serves as a crucial geospatial product in various fields such as urban management, city planning, land surveying, etc. However, traditional TDOM generation methods generally rely on a complex offline photogrammetric pipeline, resulting in delays that hinder real-time applications. Moreover, the quality of TDOM may degrade due to various challenges, such as inaccurate camera poses or Digital Surface Model (DSM) and scene occlusions. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and sparse point cloud are computed via On-the-Fly SfM. Then new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. By integrating with orthogonal splatting, A-TDOM can render just after each update of a new 3DGS field. Initial experiments on multiple benchmarks show that the proposed A-TDOM is capable of actively rendering TDOM in near real-time, with 3DGS optimization for each new image in seconds while maintaining acceptable rendering quality and TDOM geometric accuracy.
[95] DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation
Yican Zhao,Ce Wang,You Hao,Lei Li,Tianli Liao
Main category: cs.CV
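TL;DR: Proposes DyGLNet, an efficient network for medical image segmentation that fuses global and local features and uses a dynamic upsampling mechanism. It outperforms existing methods on several public datasets, especially in boundary accuracy and small-object segmentation, at lower computational complexity.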
Details
Motivation: Medical image segmentation faces multi-scale lesion variability, blurred tissue boundaries, and high computational cost, so a model is needed that keeps accuracy high while improving efficiency.
Method: Proposes DyGLNet with an SHDCBlock module (single-head self-attention combined with multi-scale dilated convolutions) that jointly models local detail and global context, and a dynamic adaptive upsampling module, DyFusionUp, that reconstructs feature maps with high fidelity from learnable offsets. A lightweight design reduces computational overhead.
Result: Experiments on seven public medical image datasets show that DyGLNet outperforms existing methods in segmentation accuracy, boundary recovery, and small-object recognition, with lower computational complexity.
Conclusion: Through dynamic feature fusion and a lightweight design, DyGLNet achieves efficient and accurate medical image segmentation and shows potential for clinical use.
Abstract: Medical image segmentation grapples with challenges including multi-scale lesion variability, ill-defined tissue boundaries, and computationally intensive processing demands. This paper proposes the DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model innovatively designs a hybrid feature extraction module (SHDCBlock), combining single-head self-attention and multi-scale dilated convolutions to model local details and global context collaboratively. We further introduce a dynamic adaptive upsampling module (DyFusionUp) to realize high-fidelity reconstruction of feature maps based on learnable offsets. Then, a lightweight design is adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly excelling in boundary accuracy and small-object segmentation. Meanwhile, it exhibits lower computation complexity, enabling an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.
[96] BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers
Mohammed Al-Habib,Zuping Zhang,Abdulrahman Noman
Main category: cs.CV
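TL;DR: Proposes Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach (pre-training and meta-fine-tuning) that improves Vision Transformer performance on few-shot classification.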
Details
Motivation: Existing methods fall short in handling token-level interactions, generalizing from limited data, and providing inductive bias, which limits Vision Transformers in few-shot learning.
Method: Uses a two-stage framework. The pre-training stage applies Masked Image Modeling (MIM) to obtain transferable patch representations. The meta-fine-tuning stage introduces a Bi-Level Adaptive Token Refinement module comprising token clustering, uncertainty-aware token weighting, and a bi-level attention mechanism, combined with graph token propagation and a class separation penalty to strengthen semantic consistency and discriminative ability.
Result: Experiments on three benchmark few-shot datasets show that BATR-FST outperforms existing methods in both 1-shot and 5-shot settings.
Conclusion: BATR-FST improves the token representations and classification performance of Vision Transformers in few-shot learning, with stronger local feature refinement and global context integration.
Abstract: Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by challenges in refining token-level interactions, struggling with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) provides Vision Transformers (ViTs) with transferable patch-level representations by recreating masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation ensures semantic consistency between support and query instances, while a Class Separation Penalty preserves different class borders, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves the few-shot classification via transformers.
[97] CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT
Zhifang Gong,Shuo Gao,Ben Zhao,Yingjing Xu,Yijun Yang,Shenghong Ju,Guangquan Zhou
Main category: cs.CV
TL;DR: Proposes a Mamba-based method for automatic analysis of multi-phase CT images to classify pancreatic tumor subtypes accurately.
Details
Motivation: The high heterogeneity and variability of pancreatic tumors make precise subtyping difficult, and existing methods do not make effective use of the contextual information across multiple CT phases.
Method: Proposes a dual hierarchical contrast-enhanced-aware Mamba module that combines spatial and temporal sampling sequences, and adds a similarity-guided refinement module and a multi-granularity fusion mechanism to model the spatio-temporal features of lesions in multi-phase CT data.
Result: Experiments on 270 clinical cases give an accuracy of 97.4% and an AUC of 98.6% in distinguishing pancreatic ductal adenocarcinoma (PDAC) from pancreatic neuroendocrine tumors (PNETs).
Conclusion: The method integrates the spatio-temporal information of multi-phase CT, improves the accuracy and efficiency of pancreatic tumor subtype classification, and shows potential for clinical use.
Abstract: Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists' diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool.
[98] Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection
Zhehao Li,Yucheng Qian,Chong Wang,Yinghao Lu,Zhihao Yang,Jiafei Wu
Main category: cs.CV
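TL;DR: Proposes a new contextualized representation learning network that strengthens context modeling in human-object interaction detection by introducing triplet structures with auxiliary entities (such as tools) and a learnable prompt mechanism.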
Details
Motivation: Existing two-stage methods for human-object interaction detection are limited by incomplete context modeling and struggle to capture complex interactions.
Method: Proposes the Contextualized Representation Learning Network, which combines affordance-guided reasoning with contextual prompts, models tool-dependent interactions with <human, tool, object> triplets, and fuses instance categories with visual features through an attention mechanism.
Result: Outperforms existing methods in most scenarios on the HICO-Det and V-COCO datasets.
Conclusion: Introducing the functional role of auxiliary objects and a mechanism for aligning language and visual context significantly improves the accuracy and robustness of complex interaction detection.
Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning Network that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures
[99] Double Helix Diffusion for Cross-Domain Anomaly Image Generation
Linchun Wu,Qin Zou,Xianbiao Qi,Bo Du,Zhongyuan Wang,Qingquan Li
Main category: cs.CV
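TL;DR: Proposes Double Helix Diffusion (DH-Diff), a new cross-domain generative framework that synthesizes high-fidelity anomaly images together with their pixel-level annotation masks, addressing the structural-consistency and feature-disentanglement limits of existing methods.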
Details
Motivation: Real anomaly samples are scarce, which makes visual anomaly detection in manufacturing difficult. Existing synthetic data methods suffer from structural inconsistency and feature entanglement, which limit the quality and usefulness of the generated data.
Method: Designs the DH-Diff framework, inspired by the double helix, with modules for feature separation, connection, and merging. A domain-decoupled attention mechanism enhances image and annotation features independently, and a semantic score map alignment module ensures structural authenticity. Text prompts and graphical guidance provide flexible control.
Result: Experiments show that DH-Diff significantly outperforms state-of-the-art methods in the diversity and authenticity of generated data and improves downstream anomaly detection performance.
Conclusion: DH-Diff efficiently generates anomaly images that are structurally plausible and precisely annotated, offering a reliable solution for visual anomaly detection when data are scarce.
Abstract: Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.
[100] Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation
Julien Walther,Rémi Giraud,Michaël Clément
Main category: cs.CV
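TL;DR: Proposes SPAM (SuperPixel Anything Model), a general framework that produces accurate yet regular superpixels by combining deep learned features with a large-scale pretrained model for semantic-agnostic segmentation.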
Details
Motivation: Traditional superpixel methods rely on low-level features, while deep learning methods use high-level features but sacrifice superpixel regularity, which makes the segmentation less interpretable. A method is needed that keeps both accuracy and regularity.
Method: Proposes the SPAM framework, which trains a model to extract image features for superpixel generation and, at inference, uses a large-scale pretrained model for semantic-agnostic segmentation so that superpixels align better with object boundaries. It accepts any prior high-level segmentation and can focus interactively on specific objects.
Result: Experiments show that SPAM outperforms state-of-the-art methods both qualitatively and quantitatively, resolves uncertainty regions, and produces regular superpixels that follow object boundaries more closely.
Conclusion: SPAM is a robust superpixel segmentation tool that balances accuracy and regularity and suits a range of computer vision applications.
Abstract: Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.
[101] Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset Generation
Biwen Lei,Yang Li,Xinhai Liu,Shuhui Yang,Lixin Xu,Jingwei Huang,Ruining Tang,Haohan Weng,Jian Liu,Jing Xu,Zhen Zhou,Yiling Zhu,Jiankai Xing,Jiachen Xu,Changfeng Ma,Xinhao Yan,Yunhan Yang,Chunshi Wang,Duoteng Xu,Xueqi Ma,Yuguang Chen,Jing Li,Mingxin Yang,Sheng Zhang,Yifei Feng,Xin Huang,Di Luo,Zebin He,Puhua Jiang,Changrong Hu,Zihan Qin,Shiwei Miao,Haolin Liu,Yunfei Zhao,Zeqiang Lai,Qingxiang Lin,Zibo Zhao,Kunhong Li,Xianghui Yang,Huiwen Shi,Xin Yang,Yuxuan Wang,Zebin Yao,Yihang Lian,Sicong Liu,Xintong Han,Wangchen Qin,Caisheng Ouyang,Jianyin Liu,Tianwen Yuan,Shuai Jiang,Hong Duan,Yanqi Niu,Wencong Lin,Yifu Sun,Shirui Huang,Lin Niu,Gu Gong,Guojian Xiao,Bojian Zheng,Xiang Yuan,Qi Chen,Jie Xiao,Dongyang Zheng,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Lifu Wang,Qinglin Lu,Jie Liu,Liang Dong,Fan Jiang,Ruibin Chen,Lei Wang,Chao Zhang,Jiaxin Lin,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Yinhe Wu,Jiayao Du,Jupeng Chen,Xinyue Mao,Dongyuan Guo,Yixuan Tang,Yulin Tsai,Yonghao Tan,Jiaao Yu,Junlin Yu,Keren Zhang,Yifan Li,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Zhuo Chen,Chunchao Guo
Main category: cs.CV
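TL;DR: Hunyuan3D Studio is an end-to-end AI-powered content creation platform that turns a concept image or text description into a high-quality 3D model with optimized geometry and high-fidelity PBR textures, substantially speeding up 3D asset generation for game development.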
Details
Motivation: Traditional 3D asset creation is laborious and depends on specialists, so automated tools are needed to lower the barrier to entry and speed up game development.
Method: Integrates several advanced neural modules (such as Part-level 3D Generation, Polygon Generation, and Semantic UV) into a unified, easy-to-use system that generates game-ready 3D assets from a single image or text description.
Result: The system produces 3D assets that are visually compelling and meet the technical requirements of modern game engines, substantially shortening iteration time and raising production efficiency.
Conclusion: Hunyuan3D Studio connects creative intent to technical asset and represents a significant advance in AI-assisted content creation for games and interactive media.
Abstract: The creation of high-quality 3D assets, a cornerstone of modern game development, has long been characterized by labor-intensive and specialized workflows. This paper presents Hunyuan3D Studio, an end-to-end AI-powered content creation platform designed to revolutionize the game production pipeline by automating and streamlining the generation of game-ready 3D assets. At its core, Hunyuan3D Studio integrates a suite of advanced neural modules (such as Part-level 3D Generation, Polygon Generation, Semantic UV, etc.) into a cohesive and user-friendly system. This unified framework allows for the rapid transformation of a single concept image or textual description into a fully-realized, production-quality 3D model complete with optimized geometry and high-fidelity PBR textures. We demonstrate that assets generated by Hunyuan3D Studio are not only visually compelling but also adhere to the stringent technical requirements of contemporary game engines, significantly reducing iteration time and lowering the barrier to entry for 3D content creation. By providing a seamless bridge from creative intent to technical asset, Hunyuan3D Studio represents a significant leap forward for AI-assisted workflows in game development and interactive media.
[102] SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention
Yuan Cao,Dong Wang
Main category: cs.CV
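TL;DR: Proposes SAGA, an efficient linear attention mechanism that uses input-adaptive learnable gates to aggregate KV information selectively, lowering computational complexity while improving expressiveness and performance.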
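The reordering that linear attention depends on can be checked with a small sketch. This shows only the associativity of the unnormalized product, (QK^T)V = Q(K^T V); it leaves out the softmax or feature map and the normalization used in real attention layers, and it does not implement SAGA's gates. The helper names and the toy matrices are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quadratic_order(q, k, v):
    # (Q K^T) V: forms an N x N score matrix first, so cost grows as N^2.
    return matmul(matmul(q, transpose(k)), v)

def linear_order(q, k, v):
    # Q (K^T V): forms a d x d summary of keys and values first, so cost grows as N.
    return matmul(q, matmul(transpose(k), v))

# Hypothetical toy inputs with N = 4 tokens and d = 2 feature dimensions.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
K = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4], [0.1, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

out_quadratic = quadratic_order(Q, K, V)
out_linear = linear_order(Q, K, V)
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
```

The d x d matrix K^T V is the compressed KV feature map; every token's keys and values are summed into it with equal weight, which is the uniform compression that the entry says SAGA's gates are meant to replace.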
Details
Motivation: Existing linear attention methods compress historical key-value information uniformly, which causes feature redundancy, loss of directional alignment, and low-rank feature maps, and therefore hurts model performance.
Method: Introduces input-adaptive learnable gates that selectively modulate how information is aggregated into the KV feature map, and computes the gates with an efficient Hadamard-product decomposition that adds no memory overhead.
Result: At a resolution of 1280×1280, throughput improves by 1.76× and peak GPU memory falls by 2.69× compared with PVT-T. Top-1 accuracy on ImageNet improves by up to 4.4%.
Conclusion: SAGA alleviates the low-rank limitation of conventional linear attention, balances computational efficiency with expressiveness, and improves performance on high-resolution vision tasks.
Abstract: While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
[103] Data Scaling Laws for Radiology Foundation Models
Maximilian Ilse,Harshita Sharma,Anton Schwaighofer,Sam Bond-Taylor,Fernando Pérez-García,Olesya Melnichenko,Anne-Marie G. Sykes,Kelly K. Horst,Ashish Khandelwal,Maxwell Reynolds,Maria T. Wetscherek,Noel C. F. Codella,Javier Alvarez-Valle,Korfiatis Panagiotis,Valentina Salvatelli
Main category: cs.CV
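TL;DR: This study systematically examines continual pretraining of two medical image encoders (MI2 and RAD-DINO) on up to 3.5M chest X-rays. MI2 performs better on radiology-findings tasks, while RAD-DINO is stronger on lines-and-tubes tasks. Adding supervision from structured labels improves performance, and for some tasks as few as 30k in-domain samples suffice to surpass open-weights foundation models, underscoring the value of continual pretraining on institution-specific data.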
Details
Motivation: 医学影像基础模型受限于较小的数据集,导致对数据规模和预训练范式影响的理解不足;本文旨在探究在固定计算和评估条件下,不同编码器范式在大规模单机构数据上的持续预训练效果,并弥补以往研究偏重放射学发现而忽视其他临床任务(如导管检测)的偏差。 Method: 对代表CLIP和DINOv2两大范式的MI2和RAD-DINO模型,在最多350万张单机构胸部X光图像上进行持续预训练,保持计算资源和评估协议一致;评估任务包括分类(放射学发现、导管)、分割(导管)和报告生成,并特别引入导管类任务以评估模型对细长结构连续性的建模能力;同时探索结合报告文本与结构化标签(UniCL)的监督方式。 Result: MI2在放射学发现相关任务上扩展性更好,RAD-DINO在导管相关任务上表现更强;意外发现MI2结合报告和结构化标签进行持续预训练可进一步提升性能;某些任务上仅需3万机构内样本即可超越开放权重的基础模型。 Conclusion: 机构特定的持续预训练能显著提升医学影像模型性能,利用内部数据进行针对性优化是一种高效可行的路径,且不同架构在不同临床任务上具有互补优势,结构化监督信号在大规模预训练中具有重要价值。 Abstract: Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model's ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. 
These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.
[104] Exploring Metric Fusion for Evaluation of NeRFs
Shreyas Shivakumara,Gabriel Eilertsen,Karljohan Lundin Palmerius
Main category: cs.CV
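TL;DR: This paper proposes fusing two image quality metrics, DISTS and VMAF, to assess the subjective quality of NeRF-generated images more accurately, and validates the robustness and generalizability of the fusion across multiple datasets and configurations.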
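To make the idea of normalizing and then fusing two metrics concrete, here is a minimal sketch under stated assumptions. The entry does not say which normalization or fusion strategies the authors used, so min-max scaling and a plain average are stand-ins chosen for illustration. It also assumes DISTS is a distance (lower is better) and VMAF a quality score (higher is better), which is why the DISTS term is flipped. The function names and sample scores are hypothetical.

```python
def min_max(values):
    """Scale a list of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores identical: no ranking information to preserve.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(dists_scores, vmaf_scores):
    """Average the two normalized metrics so that higher means better quality."""
    # Flip DISTS so that both terms point in the same direction.
    quality_from_dists = [1.0 - d for d in min_max(dists_scores)]
    quality_from_vmaf = min_max(vmaf_scores)
    return [(a + b) / 2.0 for a, b in zip(quality_from_dists, quality_from_vmaf)]


# Made-up scores for three rendered images, best to worst.
fused = fuse([0.1, 0.3, 0.5], [90.0, 60.0, 30.0])
```

The fused scores would then be correlated with subjective ratings in the same way as the individual metrics.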
Details
Motivation: 由于NeRF生成结果存在独特伪影,现有单一指标难以全面评估其视觉质量,因此需要结合基于不同感知机制的指标以提升与主观评分的相关性。 Method: 采用DISTS和VMAF两个成功指标,实验比较了两种归一化策略和两种融合策略对与主观评分相关性的影响。 Result: 融合方法在Synthetic和Outdoor两个不同数据集上均表现出优于单一指标的性能,且在三种不同配置下具有稳定的相关性提升。 Conclusion: 结合不同感知机制的质量评估指标可有效克服单一指标的局限性,所提出的融合框架在评估NeRF渲染质量方面更具鲁棒性和通用性。 Abstract: Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating the NeRF-generated outputs, however, remains a challenge due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), based on different perceptual methods, can overcome the limitations of individual metrics and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies to evaluate their impact on the resulting correlation with the subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three different configurations. We present a detailed analysis comparing the correlation coefficients of fusion methods and individual scores with subjective scores to demonstrate the robustness and generalizability of the fusion metrics.[105] Leveraging Large Language Models to Effectively Generate Visual Data for Canine Musculoskeletal Diagnoses
Martin Thißen,Thi Ngoc Diep Tran,Barbara Esteve Ratsch,Ben Joel Schönbein,Ute Trapp,Beate Egner,Romana Piat,Elke Hergenröther
Main category: cs.CV
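TL;DR: This study explores whether large language models (LLMs) can generate synthetic visual training data for canine musculoskeletal diagnoses. A mapping converts visual annotations into text, and guided decoding, chain-of-thought reasoning, and few-shot prompting are used to generate 1,000 synthetic documentations; a model trained only on synthetic data reaches an F1 score of 88% on real-world data.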
Details
Motivation: 由于罕见病例或高成本导致数据收集困难,尤其是在犬类肌肉骨骼状况的视觉记录中,异常情况较少见,因此需要有效方法缓解数据稀缺问题。 Method: 开发了一种将视觉文档划分为200多个代表肌肉或关节区域的映射方法,结合引导解码、思维链推理和少样本提示,利用LLM生成针对髌骨脱位及其他诊断的合成视觉文档。 Result: 生成的合成数据对诊断的位置和严重程度敏感,且不受狗性别影响;仅用合成数据训练的模型在70个真实世界样本上取得了88%的F1分数。 Conclusion: LLM生成的合成数据能有效应对医学领域中罕见疾病的数据稀缺问题,具有临床应用潜力,且该方法可推广至其他领域。 Abstract: It is well-established that more data generally improves AI model performance. However, data collection can be challenging for certain tasks due to the rarity of occurrences or high costs. These challenges are evident in our use case, where we apply AI models to a novel approach for visually documenting the musculoskeletal condition of dogs. Here, abnormalities are marked as colored strokes on a body map of a dog. Since these strokes correspond to distinct muscles or joints, they can be mapped to the textual domain in which large language models (LLMs) operate. LLMs have demonstrated impressive capabilities across a wide range of tasks, including medical applications, offering promising potential for generating synthetic training data. In this work, we investigate whether LLMs can effectively generate synthetic visual training data for canine musculoskeletal diagnoses. For this, we developed a mapping that segments visual documentations into over 200 labeled regions representing muscles or joints. Using techniques like guided decoding, chain-of-thought reasoning, and few-shot prompting, we generated 1,000 synthetic visual documentations for patellar luxation (kneecap dislocation) diagnosis, the diagnosis for which we have the most real-world data. Our analysis shows that the generated documentations are sensitive to location and severity of the diagnosis while remaining independent of the dog's sex. We further generated 1,000 visual documentations for various other diagnoses to create a binary classification dataset. A model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations. 
These results demonstrate the potential of LLM-generated synthetic data, which is particularly valuable for addressing data scarcity in rare diseases. While our methodology is tailored to the medical domain, the insights and techniques can be adapted to other fields.
[106] Cumulative Consensus Score: Label-Free and Model-Agnostic Evaluation of Object Detectors in Deployment
Avinaash Manoharan,Xiangyu Yin,Domenik Helm,Chih-Hong Cheng
Main category: cs.CV
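TL;DR: Proposes the Cumulative Consensus Score (CCS), a label-free metric for continuously monitoring and comparing object detection models in real-world settings.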
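The consistency computation outlined in this entry (IoU overlaps between boxes predicted on augmented views, maximum overlaps normalized and averaged over view pairs) can be sketched roughly as follows. This is not the authors' implementation. It assumes the boxes from every view have already been mapped back to the original image coordinates, and the normalization by the larger box count is a guess, since the entry does not spell it out. All function names are hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_consensus(boxes_a, boxes_b):
    """Sum of best-match IoUs for boxes in A, normalized by the larger box count."""
    if not boxes_a and not boxes_b:
        return 1.0  # both views agree that nothing is present
    if not boxes_a or not boxes_b:
        return 0.0  # one view found objects, the other found none
    total = sum(max(iou(a, b) for b in boxes_b) for a in boxes_a)
    return total / max(len(boxes_a), len(boxes_b))


def consensus_score(views):
    """Average pairwise consensus over all pairs of augmented views."""
    pairs = list(combinations(views, 2))
    if not pairs:
        raise ValueError("need at least two augmented views")
    return sum(pair_consensus(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the detector places the same boxes regardless of augmentation; a low score flags an image where predictions are unstable, which is the case-level signal the entry describes.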
Details
Motivation: Evaluating object detection models in deployment is challenging because ground-truth annotations are rarely available. Method: Test-time data augmentation generates different views of each image; predicted boxes are collected across views, IoU overlaps are computed, and the maximum overlaps are normalized and averaged into a spatial consistency score. Result: On Open Images and KITTI, CCS achieves over 90% congruence with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost. Conclusion: CCS is a model-agnostic, interpretable approach suited to DevOps-style monitoring of object detectors. Abstract: Evaluating object detection models in deployment is challenging because ground-truth annotations are rarely available. We introduce the Cumulative Consensus Score (CCS), a label-free metric that enables continuous monitoring and comparison of detectors in real-world settings. CCS applies test-time data augmentation to each image, collects predicted bounding boxes across augmented views, and computes overlaps using Intersection over Union. Maximum overlaps are normalized and averaged across augmentation pairs, yielding a measure of spatial consistency that serves as a proxy for reliability without annotations. In controlled experiments on Open Images and KITTI, CCS achieved over 90% congruence with F1-score, Probabilistic Detection Quality, and Optimal Correction Cost. The method is model-agnostic, working across single-stage and two-stage detectors, and operates at the case level to highlight under-performing scenarios. Altogether, CCS provides a robust foundation for DevOps-style monitoring of object detectors.
[107] Few to Big: Prototype Expansion Network via Diffusion Learner for Point Cloud Few-shot Semantic Segmentation
Qianguang Zhao,Dongli Wang,Yan Zhou,Jianxun Li,Richard Irampa
Main category: cs.CV
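TL;DR: This paper proposes the Prototype Expansion Network (PENet) for few-shot 3D point cloud semantic segmentation. It uses the pre-trained encoder of a diffusion model to produce generalizable features, combines an Intrinsic Learner and a Diffusion Learner in a dual-stream architecture, and improves segmentation performance with a prototype assimilation module and a calibration mechanism.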
Details
Motivation: 现有基于原型的方法在少样本3D点云语义分割中受限于类内多样性和集合间不一致性两个问题,难以充分表示类别变化并与查询空间对齐。 Method: 提出PENet框架,包含双流学习结构(Intrinsic Learner和Diffusion Learner)生成互补原型,通过Prototype Assimilation Module中的推拉交叉引导注意力块实现原型与查询空间的对齐,并引入Prototype Calibration Mechanism防止语义漂移。 Result: 在S3DIS和ScanNet数据集上的实验表明,PENet在多种少样本设置下显著优于当前最先进的方法。 Conclusion: PENet通过融合扩散模型生成的广义特征和监督学习特征,构建大容量原型,有效缓解了类内多样性和跨集合不一致问题,显著提升了少样本3D点云语义分割性能。 Abstract: Few-shot 3D point cloud semantic segmentation aims to segment novel categories using a minimal number of annotated support samples. While existing prototype-based methods have shown promise, they are constrained by two critical challenges: (1) Intra-class Diversity, where a prototype's limited representational capacity fails to cover a class's full variations, and (2) Inter-set Inconsistency, where prototypes derived from the support set are misaligned with the query feature space. Motivated by the powerful generative capability of diffusion model, we re-purpose its pre-trained conditional encoder to provide a novel source of generalizable features for expanding the prototype's representational range. Under this setup, we introduce the Prototype Expansion Network (PENet), a framework that constructs big-capacity prototypes from two complementary feature sources. PENet employs a dual-stream learner architecture: it retains a conventional fully supervised Intrinsic Learner (IL) to distill representative features, while introducing a novel Diffusion Learner (DL) to provide rich generalizable features. The resulting dual prototypes are then processed by a Prototype Assimilation Module (PAM), which adopts a novel push-pull cross-guidance attention block to iteratively align the prototypes with the query space. Furthermore, a Prototype Calibration Mechanism (PCM) regularizes the final big capacity prototype to prevent semantic drift. 
Extensive experiments on the S3DIS and ScanNet datasets demonstrate that PENet significantly outperforms state-of-the-art methods across various few-shot settings.
[108] Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
Qifei Jia,Yu Liu,Yajie Chai,Xintong Yao,Qiming Lu,Yasen Zhang,Runyu Shi,Ying Huang,Guoquan Zhang
Main category: cs.CV
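TL;DR: Proposes Lego-Edit, which uses a multimodal large language model (MLLM) to organize model-level editing tools and applies three-stage progressive reinforcement learning to improve generalization to open-domain user instructions.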
Details
Motivation: 现有图像编辑方法难以泛化到训练域外的多样化真实用户指令,限制了实际应用。 Method: 构建包含多种模型和图像操作函数的模型级工具包,并采用三阶段渐进式强化学习,利用未标注的开放域指令反馈训练MLLM以实现细粒度编辑组合。 Result: 在GEdit-Bench和ImgBench上达到SOTA性能,具备强大的开放域指令推理能力,且能无缝集成新编辑工具而无需额外微调。 Conclusion: Lego-Edit有效提升了指令驱动图像编辑的泛化性和实用性,展示了MLLM在组织和调度编辑工具方面的潜力。 Abstract: Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.[109] Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
Weiming Chen,Zhihan Zhu,Yijia Wang,Zhihai He
Main category: cs.CV
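TL;DR: This paper proposes an efficient high-order inversion method based on a Runge-Kutta solver, together with a Decoupled Diffusion Transformer Attention (DDTA) mechanism, to improve the generation quality and control precision of rectified flow models on image reconstruction and text-guided editing tasks.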
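To illustrate why a higher-order solver helps when an ODE has to be run in one direction and then back again, the sketch below integrates a toy scalar ODE with the classical fourth-order Runge-Kutta scheme and checks the round trip. The entry does not specify which Runge-Kutta variant the authors use, so classical RK4 is an assumption here, and `toy_velocity` is a hypothetical stand-in for a learned velocity network.

```python
def rk4_step(v, x, t, h):
    """One classical fourth-order Runge-Kutta step for dx/dt = v(x, t)."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = v(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(v, x, t_start, t_end, steps):
    """Integrate from t_start to t_end; a negative step size runs the ODE backwards."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = rk4_step(v, x, t, h)
        t += h
    return x


def toy_velocity(x, t):
    # Hypothetical velocity field used only for this illustration.
    return -x + t


x_start = 1.5
x_end = integrate(toy_velocity, x_start, 0.0, 1.0, steps=10)
x_back = integrate(toy_velocity, x_end, 1.0, 0.0, steps=10)
round_trip_error = abs(x_back - x_start)
```

With only ten steps the round trip returns very close to the starting point; a first-order Euler step at the same step count would leave a visibly larger gap, which is the kind of inversion error the entry is concerned with.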
Details
Motivation: Rectified flow模型虽在生成性能上优于DDIM-based扩散模型,但在实际应用中存在反演精度低和多模态注意力纠缠两大问题,影响了图像一致性与语义控制精度。 Method: 提出基于Runge-Kutta微分方程求解器的高阶反演方法以提高反演精度,并设计DDTA机制解耦扩散Transformer中的文本与图像注意力,实现更精确的语义控制。 Result: 在图像重建和文本引导编辑任务上,该方法在保真度和可编辑性方面均达到最先进的性能。 Conclusion: 所提出的高阶反演方法和DDTA机制有效解决了rectified flow模型在实际应用中的关键瓶颈,显著提升了生成质量和控制能力。 Abstract: Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.[110] MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization
Yiyi Zhang,Yuchen Yuan,Ying Zheng,Jialun Pei,Jinpeng Li,Zheng Li,Pheng-Ann Heng
Main category: cs.CV
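TL;DR: Proposes MEJO, a framework that disentangles task-shared and task-specific representations and combines a multimodal large language model with a gradient coordination strategy to address inter-task and intra-task optimization conflicts in surgical triplet recognition.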
Details
Motivation: 解决手术三元组识别中由于任务间表示纠缠和类别不平衡导致的优化冲突问题。 Method: 提出MEJO框架,包括S²D学习方案以分离任务共享和特定表示,利用MLLM增强语义特征,并设计CGL策略平衡类别梯度。 Result: 在CholecT45和CholecT50数据集上实验表明,该方法在处理优化冲突方面优于现有方法。 Conclusion: MEJO框架能有效提升手术三元组识别性能,缓解多任务学习中的优化冲突。 Abstract: Surgical triplet recognition, which involves identifying instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distribution. The mainstream multi-task learning paradigm benefiting from cross-task collaborative promotion has shown promising performance in identifying triples, but two key challenges remain: 1) inter-task optimization conflicts caused by entangling task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM) powered probabilistic prompt pool to dynamically augment visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the temporal-spatial dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative gradients originating from head and tail classes for more coordinated learning behaviors. 
Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.
[111] DialNav: Multi-turn Dialog Navigation with a Remote Guide
Leekyeung Han,Hyunji Min,Gyeom Hwangbo,Jonghyun Choi,Paul Hongsuck Seo
Main category: cs.CV
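TL;DR: Proposes DialNav, a new collaborative embodied dialog task in which a navigation agent (Navigator) and a remote guide (Guide) cooperate through multi-turn dialog to reach a goal location, and releases the accompanying RAIN dataset.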
Details
Motivation: 现有工作缺乏对导航中对话双方(尤其是引导者)的全面评估,且未充分考虑引导者需推断导航者位置的现实需求,因此需要一个更综合的具身对话任务框架。 Method: 提出了DialNav任务框架,收集并发布了包含人-人对话与导航轨迹的RAIN数据集,在逼真的环境中构建了导航与对话联合评估的基准,并设计了不同Navigator和Guide模型的实验设置。 Result: 建立了完整的DialNav基准,实验分析了不同模型在导航和对话方面的影响,揭示了当前方法的关键挑战。 Conclusion: DialNav实现了对具身对话任务中双方角色的综合评估,强调了沟通的重要性,所发布数据集和代码为未来研究提供了重要资源。 Abstract: We introduce DialNav, a novel collaborative embodied dialog task, where a navigation agent (Navigator) and a remote guide (Guide) engage in multi-turn dialog to reach a goal location. Unlike prior work, DialNav aims for holistic evaluation and requires the Guide to infer the Navigator's location, making communication essential for task success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, human-human dialog paired with navigation trajectories in photorealistic environments. We design a comprehensive benchmark to evaluate both navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster future research in embodied dialog.[112] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models
Jianfei Zhao,Feng Zhang,Xin Sun,Lingxing Kong,Zhixing Tan,Chong Feng
Main category: cs.CV
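TL;DR: Proposes Cross-Layer Vision Smoothing (CLVS), which introduces a vision memory to smooth the attention distribution across layers, strengthening the sustained focus of large vision-language models on key objects and markedly improving relation and attribute understanding.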
Details
Motivation: 大视觉语言模型虽能准确定位图像中的关键对象,但对其注意力往往短暂;假设持续关注关键对象可提升模型视觉能力。 Method: 设计一种视觉记忆机制,在首层以位置无偏的视觉注意力初始化记忆,后续层中注意力与先前层的记忆联合计算并迭代更新记忆,利用不确定性判断视觉理解完成度以适时终止平滑过程。 Result: 在三个大视觉语言模型和四个基准上的实验验证了CLVS的有效性和通用性,在多种视觉理解任务上达到最先进性能,尤其在关系和属性理解方面提升显著。 Conclusion: CLVS通过维持对关键对象的平滑、持续注意力,有效增强了LVLM的视觉理解能力,具有良好的通用性和应用潜力。 Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs' visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model's visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.[113] MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion
Guihui Li,Bowei Dong,Kaizhi Dong,Jiayi Li,Haiyong Zheng
Main category: cs.CV
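TL;DR: This paper proposes MSGFusion, an infrared and visible image fusion framework guided by multimodal scene graphs. By combining structured scene graphs derived from text and vision, it explicitly models entities, attributes, and spatial relations, improving the semantic consistency and detail preservation of fused images.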
Details
Motivation: 现有基于深度学习的图像融合方法依赖低层视觉特征,难以捕捉高层语义信息;而使用非结构化文本引导的方法无法精细建模语义结构,限制了融合性能。 Method: 提出MSGFusion框架,利用从文本和视觉中提取的结构化场景图,通过场景图表示、分层聚合和图驱动融合模块,协同优化高层语义与低层细节的融合过程。 Result: 在多个公开数据集上实验表明,MSGFusion在细节保持、结构清晰度、语义一致性和下游任务(如低光目标检测、语义分割和医学图像融合)中的泛化能力均显著优于现有最先进方法。 Conclusion: MSGFusion通过引入结构化的多模态场景图有效提升了红外与可见光图像融合的质量和语义表达能力,为复杂环境下的多模态融合提供了新思路。 Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.[114] AREPAS: Anomaly Detection in Fine-Grained Anatomy with Reconstruction-Based Semantic Patch-Scoring
Branko Mitic,Philipp Seeböck,Helmut Prosch,Georg Langs
Main category: cs.CV
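TL;DR: Proposes a new generative anomaly detection method that uses image-to-image translation and patch similarity scoring to achieve better anomaly segmentation in chest CT and brain MRI.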
Details
Motivation: 现有生成式异常检测方法难以处理肺部解剖结构中的正常细粒度组织变异,限制了其在医学影像中的应用。 Method: 该方法包括两个步骤:首先使用图像到图像翻译生成无异常的重建图像,然后通过比较原始图像与生成图像之间的局部补丁相似性进行精确的异常定位。 Result: 在胸部CT感染病灶检测和脑部T1加权MRI缺血性卒中病灶分割任务中,该方法相比其他先进重建方法的DICE分数分别相对提升了1.9%和4.4%。 Conclusion: 所提方法能有效应对正常组织细粒度变异带来的挑战,在多种医学影像模态中表现出良好的泛化能力和更高的异常分割精度。 Abstract: Early detection of newly emerging diseases, lesion severity assessment, differentiation of medical conditions and automated screening are examples for the wide applicability and importance of anomaly detection (AD) and unsupervised segmentation in medicine. Normal fine-grained tissue variability such as present in pulmonary anatomy is a major challenge for existing generative AD methods. Here, we propose a novel generative AD approach addressing this issue. It consists of an image-to-image translation for anomaly-free reconstruction and a subsequent patch similarity scoring between observed and generated image-pairs for precise anomaly localization. We validate the new method on chest computed tomography (CT) scans for the detection and segmentation of infectious disease lesions. To assess generalizability, we evaluate the method on an ischemic stroke lesion segmentation task in T1-weighted brain MRI. Results show improved pixel-level anomaly segmentation in both chest CTs and brain MRIs, with relative DICE score improvements of +1.9% and +4.4%, respectively, compared to other state-of-the-art reconstruction-based methods.[115] T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking
Hojat Ardi,Amir Jahanshahi,Ali Diba
Main category: cs.CV
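TL;DR: Proposes T-SiamTPN, a temporal-aware Siamese tracking framework that adds temporal feature fusion and attention mechanisms, markedly improving the accuracy and robustness of aerial object tracking while preserving computational efficiency.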
Details
Motivation: 现有跟踪器多关注空间线索,忽视时序依赖,且相关操作难以应对非线性外观变化,导致在遮挡和长时跟踪中性能受限。 Method: 在SiamTPN基础上引入显式时序建模,采用时序特征融合和基于注意力的交互机制,增强时序一致性和特征表达能力。 Result: 相比基线模型,成功率提升13.7%,精度提升14.7%;在Jetson Nano上实现实时7.1 FPS,具备低运行开销。 Conclusion: 时序建模对Siamese跟踪框架至关重要,T-SiamTPN是一种高效、强健的空中目标跟踪解决方案。 Abstract: Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. 
Code is available at: https://github.com/to/be/released
[116] A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation
Melika Sabaghian,Mohammad Ali Keyvanrad,Seyyedeh Mahila Moghadami
Main category: cs.CV
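TL;DR: Proposes a three-stage compression pipeline combining sparsity-aware training, structured channel pruning, and channel-wise knowledge distillation, which substantially compresses YOLOv8 while retaining high detection accuracy, enabling efficient real-time deployment on resource-constrained devices.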
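The second stage of this pipeline, pruning channels according to their batch normalization scaling factors, can be illustrated with a small selection routine. This is a sketch under assumptions rather than the authors' procedure: the entry does not state how the threshold is chosen, so a fixed prune ratio over absolute scaling factors is used here, and `select_channels` and the sample values are hypothetical.

```python
def select_channels(gammas, prune_ratio):
    """Return the indices of channels to keep, dropping those with the smallest |gamma|.

    gammas: batch normalization scaling factors, one per channel.
    prune_ratio: fraction of channels to remove, in [0, 1).
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_prune = int(len(gammas) * prune_ratio)
    # Rank channels from least to most important by absolute scaling factor.
    order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    pruned = set(order[:n_prune])
    # Preserve the original channel order among the survivors.
    return [i for i in range(len(gammas)) if i not in pruned]


# Made-up scaling factors for a four-channel layer.
kept = select_channels([0.9, 0.01, -0.5, 0.02], prune_ratio=0.5)
```

Sparsity-aware training in the first stage is what pushes many scaling factors toward zero, so that a ranking like this separates redundant channels cleanly; the distillation stage then recovers accuracy lost by removing them.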
Details
Motivation: Efficient deployment of deep learning models on resource-constrained devices requires substantial compression without a significant loss in performance, especially for aerial object detection, where small and medium objects must be detected. Method: A three-stage compression strategy: (1) sparsity-aware training introduces dynamic sparsity; (2) structured channel pruning removes channels based on batch normalization scaling factors; (3) channel-wise knowledge distillation (CWD) with an adjustable temperature and loss weighting scheme recovers the accuracy lost to pruning. Result: On VisDrone, YOLOv8m parameters fall by 73.51% (25.85M→6.85M), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while AP50 drops by only 2.7% to 47.9 and inference speed rises from 26 FPS to 45 FPS; with TensorRT, speed rises further to 68 FPS while AP50 dips slightly to 47.6. Conclusion: The method substantially reduces model size and speeds up inference while maintaining good detection performance, making it suitable for high-throughput deployment on resource-constrained edge devices. Abstract: Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.
[117] MATTER: Multiscale Attention for Registration Error Regression
Shipeng Liu,Ziliang Xiong,Khac-Hoang Ngo,Per-Erik Forssén
Main category: cs.CV
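TL;DR: This paper proposes a regression-based method for validating point cloud registration quality. Multiscale feature extraction and attention-based aggregation yield fine-grained, robust estimates of registration error and significantly improve a downstream mapping task.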
Details
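Motivation: Existing methods treat point cloud registration validation as a classification task and therefore cannot provide a fine-grained quality assessment; their performance is also limited on point clouds with uneven spatial density, so a more precise and robust validation method is needed. Method: Uses a regression model for registration quality validation and introduces multiscale feature extraction with attention-based feature aggregation to better capture misalignment at different scales. Result: Achieves accurate and robust registration error estimation on multiple datasets, especially on point clouds with heterogeneous spatial densities; when used to guide a mapping task, it significantly improves mapping quality over the existing classification-based method. Conclusion: Regression is better suited than classification to point cloud registration quality validation, and the proposed feature extraction and aggregation strategy improves both estimation accuracy and downstream task performance. Abstract: Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.
[118] 4DRadar-GS: Self-Supervised Dynamic Driving Scene Reconstruction with 4D Radar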
Xiao Tang,Guirong Zhuo,Cong Wang,Boyuan Zheng,Minqing Huang,Lianqing Zheng,Long Chen,Shouyi Lu
Main category: cs.CV
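TL;DR: Proposes 4DRadar-GS, a 4D radar-augmented self-supervised 3D reconstruction framework for dynamic driving scenes that achieves state-of-the-art performance.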
Details
Motivation: 现有方法在动态物体重建上因运动估计不准确和时间一致性弱而表现不佳,尤其在缺乏标注数据时。 Method: 提出4D雷达辅助的高斯初始化方案和速度引导的点跟踪模型(VGPT),结合场景流监督进行联合训练,提升动态物体的重建精度和时间一致性。 Result: 在OmniHD-Scenes数据集上达到最先进的动态场景3D重建性能。 Conclusion: 4DRadar-GS有效提升了动态驾驶场景中3D重建的准确性与时间一致性,优于现有自监督方法。 Abstract: 3D reconstruction and novel view synthesis are critical for validating autonomous driving systems and training advanced perception models. Recent self-supervised methods have gained significant attention due to their cost-effectiveness and enhanced generalization in scenarios where annotated bounding boxes are unavailable. However, existing approaches, which often rely on frequency-domain decoupling or optical flow, struggle to accurately reconstruct dynamic objects due to imprecise motion estimation and weak temporal consistency, resulting in incomplete or distorted representations of dynamic scene elements. To address these challenges, we propose 4DRadar-GS, a 4D Radar-augmented self-supervised 3D reconstruction framework tailored for dynamic driving scenes. Specifically, we first present a 4D Radar-assisted Gaussian initialization scheme that leverages 4D Radar's velocity and spatial information to segment dynamic objects and recover monocular depth scale, generating accurate Gaussian point representations. In addition, we propose a Velocity-guided PointTrack (VGPT) model, which is jointly trained with the reconstruction pipeline under scene flow supervision, to track fine-grained dynamic trajectories and construct temporally consistent representations. Evaluated on the OmniHD-Scenes dataset, 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction.[119] Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings
Abdalla Arafa,Didier Stricker
Main category: cs.CV
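TL;DR: Proposes an object-level Gaussian representation built by aggregating multiview CLIP features, enabling 3D open-vocabulary object extraction and semantic understanding without differentiable rendering.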
Details
Motivation: The fuzziness of 3D Gaussian Splatting and the averaging effect of alpha blending make it hard to carry semantic information accurately into 3D scene understanding, which limits applications in AR/VR and robotics. Method: Starting from predecomposed object-level Gaussians, multiview CLIP features are aggregated into a "bag of embeddings" for each object, giving an object-level semantic representation that bypasses differentiable rendering for semantic learning. Result: Strong performance on 3D open-vocabulary object extraction, with 2D open-vocabulary segmentation performance close to the current state of the art. Conclusion: The method addresses semantic ambiguity and cross-object confusion in 3D scenes, and supports open-vocabulary retrieval, 2D segmentation, and 3D object extraction with good task adaptability. Abstract: Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive "bags of embeddings" that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.[120] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain
Yuqi Xie,Shuhan Ye,Chong Wang,Jiazhen Xu,Le Shen,Yuanbin Qian,Jiangbo Qian
Main category: cs.CV
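TL;DR: Proposes Time-step Mixup knowledge transfer (TMKT), which mixes RGB and DVS inputs at different time-steps to achieve fine-grained cross-modal knowledge transfer from RGB to event data in spiking neural networks, and adds modality-aware auxiliary learning objectives to support label mixing, alleviating modality shift and improving spiking image classification.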
Details
Motivation: Because event data is limited and DVS outputs are sparse, existing methods that transfer semantic knowledge from RGB datasets to DVS often overlook the significant distribution gap between the two modalities, which limits training effectiveness. Method: Time-step Mixup knowledge transfer (TMKT) exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at multiple time-steps; modality-aware auxiliary learning objectives support label mixing in cross-modal scenarios and strengthen the model's cross-modal discrimination. Result: Experiments on multiple datasets show clear gains for spiking neural networks on image classification, with smoother knowledge transfer and reduced modality shift during training. Conclusion: With its fine-grained time-step mixing strategy and modality-aware objectives, TMKT bridges the gap between the RGB and DVS modalities and offers an effective training approach for efficient event-camera vision. Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior works have attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model's ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.[121] MMMS: Multi-Modal Multi-Surface Interactive Segmentation
Robin Schön,Julian Lorenz,Katja Ludwig,Daniel Kienzle,Rainer Lienhart
Main category: cs.CV
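TL;DR: Presents a click-based interactive Multi-Modal Multi-Surface segmentation method (MMMS) that combines non-RGB modalities with a black-box RGB backbone, improving segmentation accuracy in complex scenes and reducing the number of clicks required.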
Details
Motivation: Segmenting multiple entangled surfaces in the same image is hard for existing methods, and suitable evaluation metrics are lacking. How to use multi-modal data to improve interactive segmentation also remains underexplored. Method: A new network architecture takes an RGB image, several non-RGB modalities, an initial erroneous mask, and encoded clicks as input and predicts an improved segmentation mask. Interaction information is integrated after feature extraction and multi-modal fusion, and an extended evaluation metric is designed for the multi-surface setting. Result: With multi-modal input, NoC@90 drops by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. The RGB-only baseline is also competitive, and in some cases superior, in classical single-mask interactive segmentation. Conclusion: MMMS makes effective use of multi-modal information, noticeably lowers user interaction cost in multi-surface interactive segmentation, and is compatible with black-box backbones, which makes it practical to deploy. Abstract: In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.[122] ICDAR 2025 Competition on FEw-Shot Text line segmentation of ancient handwritten documents (FEST)
Silvia Zottin,Axel De Nardin,Giuseppe Branca,Claudio Piciarelli,Gian Luca Foresti
Main category: cs.CV
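TL;DR: The FEST competition poses a few-shot text line segmentation challenge for ancient handwritten documents, using only three annotated images per manuscript, to promote robust and adaptable segmentation methods when annotated data is scarce.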
Details
Motivation: Historical handwritten documents have irregular handwriting, faded ink, and complex layouts, and large annotated datasets are lacking, so conventional supervised learning is impractical and few-shot text line segmentation methods are needed. Method: The FEST competition provides the U-DIADS-TL dataset and requires participants to train with only three annotated images per manuscript, developing text line segmentation systems that cope with varied layouts, degradation levels, and non-standard formatting. Result: The competition advanced few-shot text line segmentation for realistic historical documents and supports humanities scholars in using automated analysis tools with minimal annotation cost. Conclusion: FEST provides an effective evaluation platform for text line segmentation of ancient handwritten documents and promotes low-resource document analysis in the humanities. Abstract: Text line segmentation is a critical step in handwritten document image analysis. Segmenting text lines in historical handwritten documents, however, presents unique challenges due to irregular handwriting, faded ink, and complex layouts with overlapping lines and non-linear text flow. Furthermore, the scarcity of large annotated datasets renders fully supervised learning approaches impractical for such materials. To address these challenges, we introduce the Few-Shot Text Line Segmentation of Ancient Handwritten Documents (FEST) Competition. Participants are tasked with developing systems capable of segmenting text lines in the U-DIADS-TL dataset, using only three annotated images per manuscript for training. The competition dataset features a diverse collection of ancient manuscripts exhibiting a wide range of layouts, degradation levels, and non-standard formatting, closely reflecting real-world conditions. By emphasizing few-shot learning, the FEST competition aims to promote the development of robust and adaptable methods that can be employed by humanities scholars with minimal manual annotation effort, thus fostering broader adoption of automated document analysis tools in historical research.[123] SHREC 2025: Protein surface shape retrieval including electrostatic potential
Taher Yacoub,Camille Depenveiller,Atsushi Tatsuma,Tin Barisin,Eugen Rusakov,Udo Gobel,Yuxu Peng,Shiqiang Deng,Yuki Kagaya,Joon Hong Park,Daisuke Kihara,Marco Guerra,Giorgio Palmieri,Andrea Ranieri,Ulderico Fugacci,Silvia Biasotti,Ruiwen He,Halim Benhabiles,Adnane Cabani,Karim Hammoudi,Haotian Li,Hao Huang,Chunyan Li,Alireza Tehrani,Fanwang Meng,Farnaz Heidar-Zadeh,Tuan-Anh Yang,Matthieu Montes
Main category: cs.CV
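TL;DR: Reports on the SHREC 2025 protein surface shape retrieval track, which evaluated 15 methods submitted by 9 teams on a large dataset of 11,555 protein surfaces, with a focus on how well methods that combine electrostatic potential with molecular surface shape perform.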
Details
Motivation: To evaluate how different methods perform in protein surface shape retrieval, and in particular whether additional molecular surface descriptors such as electrostatic potential improve retrieval, especially for classes with limited data. Method: The 15 submitted methods were evaluated on a large dataset of 11,555 protein surfaces with calculated electrostatic potential, using Accuracy, Balanced accuracy, F1 score, Precision, and Recall. Result: Methods that used electrostatic potential together with molecular surface shape achieved the best retrieval performance, and this advantage held for classes with limited data. Conclusion: Incorporating complementary information such as electrostatic potential clearly improves protein surface shape retrieval, underlining the value of using multiple molecular surface descriptors. Abstract: This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.[124] Improving Accuracy and Efficiency of Implicit Neural Representations: Making SIREN a WINNER
Hemanth Chandravamsi,Dhanush V. Shenoy,Steven H. Frankel
Main category: cs.CV
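TL;DR: Proposes WINNER, which perturbs SIREN weight initialization with Gaussian noise scaled adaptively by the spectral centroid of the target signal, resolving the "spectral bottleneck" that arises when SIREN's frequency support does not match the target, and performs strongly on audio, image, and 3D shape fitting.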
Details
Motivation: When its weights are not initialized appropriately, SIREN struggles to fit signals outside its frequency support and exhibits a "spectral bottleneck" that causes training to fail. Method: WINNER adds adaptive Gaussian noise to the uniformly initialized weights of SIREN, with the noise scale set by the spectral centroid of the target signal and without introducing extra trainable parameters. Result: State-of-the-art audio fitting, and significant gains over base SIREN on image and 3D shape fitting. Conclusion: WINNER mitigates SIREN's spectral bottleneck, improves its fitting ability, and points to new directions for target-aware adaptive initialization of deep networks. Abstract: We identify and address a fundamental limitation of sinusoidal representation networks (SIRENs), a class of implicit neural representations. SIRENs (Sitzmann et al., 2020), when not initialized appropriately, can struggle at fitting signals that fall outside their frequency support. In extreme cases, when the network's frequency support misaligns with the target spectrum, a 'spectral bottleneck' phenomenon is observed, where the model yields to a near-zero output and fails to recover even the frequency components that are within its representational capacity. To overcome this, we propose WINNER - Weight Initialization with Noise for Neural Representations. WINNER perturbs uniformly initialized weights of base SIREN with Gaussian noise - whose noise scales are adaptively determined by the spectral centroid of the target signal. Similar to random Fourier embeddings, this mitigates 'spectral bias' but without introducing additional trainable parameters. Our method achieves state-of-the-art audio fitting and significant gains in image and 3D shape fitting tasks over base SIREN. Beyond signal fitting, WINNER suggests new avenues in adaptive, target-aware initialization strategies for optimizing deep neural network training. For code and data visit cfdlabtechnion.github.io/siren_square/.[125] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Xu Zheng,Chenfei Liao,Ziqiao Weng,Kaiyu Lei,Zihao Dongfang,Haocong He,Yuanhuiyi Lyu,Lutao Jiang,Lu Qi,Li Chen,Danda Pani Paudel,Kailun Yang,Linfeng Zhang,Luc Van Gool,Xuming Hu
Main category: cs.CV
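TL;DR: Surveys the rapid development of omnidirectional vision in the embodied AI era, driven by growing industrial demand and academic interest. It covers recent breakthroughs in omnidirectional generation, perception, and understanding together with related datasets, and proposes an ideal panoramic system architecture, PANORAMA, with four key subsystems. It also discusses emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with a future roadmap and open challenges.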
Details
Motivation: Omnidirectional vision gives more complete environmental awareness than traditional pinhole vision, but foundational research on it has long lagged behind. As robotics, industrial inspection, and environmental monitoring demand better environment understanding, omnidirectional vision needs to advance to support more reliable and complete perception systems. Method: The work integrates recent progress from academia and industry, summarizes technical breakthroughs in omnidirectional generation, perception, and understanding, proposes an idealized panoramic system architecture named PANORAMA, and analyzes the field's trends, challenges, and opportunities. Result: It proposes the PANORAMA architecture with four key subsystems, systematically reviews current technical progress, representative datasets, and applications, and identifies directions and open challenges for future embodied AI systems. Conclusion: Omnidirectional vision is developing rapidly and has broad prospects in the embodied AI era. Building robust, general-purpose omnidirectional AI systems requires cross-disciplinary collaboration on perception completeness, model generalization, and system integration; future research should focus on unified framework design, large-scale data support, and real-world deployment. Abstract: Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.[126] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Sicong Li,Qingming Huang
Main category: cs.CV
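TL;DR: Proposes a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework for detecting user mistakes in egocentric video, with particular strength on rare and ambiguous mistake instances.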
Details
Motivation: Mistakes in egocentric video are often subtle and infrequent, which leads to class imbalance and makes detection difficult, so a method that can reliably identify these rare mistakes is needed. Method: In the first stage, features are extracted with a frozen ViViT model and a LoRA-tuned ViViT model and combined by a feature-level expert module. In the second stage, three classifiers are trained with different objectives (reweighted cross-entropy, AUC loss, and label-aware loss with sharpness-aware minimization), and their predictions are fused by a classification-level expert module. Result: The method performs strongly on rare and ambiguous mistakes, with clear gains in detection performance and calibration under class imbalance. Conclusion: By fusing experts across two stages and multiple objectives, DR-MoE improves mistake detection from egocentric video and is especially suited to infrequent and hard-to-judge mistakes. Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.[127] Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection
Yue Zhou,Xinan He,Kaiqing Lin,Bing Fan,Feng Ding,Jinhua Zeng,Bin Li
Main category: cs.CV
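TL;DR: Proposes a simple linear classifier on a modern Vision Foundation Model (VFM) for detecting AI-generated images. It clearly outperforms specialized detectors in real-world settings, and the analysis shows both the VFM's semantic alignment with synthetic images and its dependence on exposure to such data during training.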
Details
Motivation: Existing detectors for AI-generated images do well on curated benchmarks but fail badly in real-world scenarios, so more robust and generalizable detection is needed. Method: A simple linear classifier is trained on a modern Vision Foundation Model (VFM); text-image similarity analysis probes how well the model aligns images with forgery-related concepts, and performance is evaluated on a dataset that is temporally separated from the training data. Result: In-the-wild accuracy improves by more than 20%. Recent VLMs align synthetic images with concepts such as 'AI-generated', but performance drops sharply on new data collected after the pre-training cut-off date. Conclusion: 1) The raw 'firepower' of an updated VFM is far more effective than a hand-crafted static detector. 2) Evaluating true generalization requires test data that is independent of the model's entire training history, including pre-training. Abstract: While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on 'in-the-wild' benchmarks. Instead of crafting another specialized 'knife' for this problem, we bring a 'gun' to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively 'outguns' bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM's 'firepower': First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., 'AI-generated'), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM's pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world 'gunfight' of AI-generated image detection, the raw 'firepower' of an updated VFM is far more effective than the 'craftsmanship' of a static detector. 2) True generalization evaluation requires test data to be independent of the model's entire training history, including pre-training.[128] Drone Detection Using a Low-Power Neuromorphic Virtual Tripwire
Anton Eldeborg Lundin,Rasmus Winzell,Hanna Hamrell,David Gustafsson,Hannes Ovrén
Main category: cs.CV
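TL;DR: Proposes a drone detection system based on spiking neural networks and neuromorphic cameras that delivers fully automated, low-power, energy-efficient detection.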
Details
Motivation: Small drones are a growing threat to military and civilian facilities, so early, automated detection is urgently needed. Method: Spiking neural networks and neuromorphic cameras (event cameras) are used, with the detection model deployed on a neuromorphic chip to form a fully neuromorphic system trained with the help of synthetic data. Result: The system is several orders of magnitude more energy efficient than an edge-GPU solution and can run on battery power for over a year; multiple detection units can form a virtual tripwire that detects when and where drones enter; the model most likely relies on the drone's shape rather than the temporal characteristics of its propellers. Conclusion: With its small size and low power consumption, the system is well suited to deployment in contested areas or locations without power infrastructure. Abstract: Small drones are an increasing threat to both military personnel and civilian infrastructure, making early and automated detection crucial. In this work we develop a system that uses spiking neural networks and neuromorphic cameras (event cameras) to detect drones. The detection model is deployed on a neuromorphic chip making this a fully neuromorphic system. Multiple detection units can be deployed to create a virtual tripwire which detects when and where drones enter a restricted zone. We show that our neuromorphic solution is several orders of magnitude more energy efficient than a reference solution deployed on an edge GPU, allowing the system to run for over a year on battery power. We investigate how synthetically generated data can be used for training, and show that our model most likely relies on the shape of the drone rather than the temporal characteristics of its propellers. The small size and low power consumption allow easy deployment in contested areas or locations that lack power infrastructure.[129] Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image
Gaofeng Liu,Hengsen Li,Ruoyu Gao,Xuetong Li,Zhiyuan Ma,Tao Fang
Main category: cs.CV
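TL;DR: Proposes Dream3DAvatar, an efficient, text-controllable two-stage framework for generating 3D avatars from a single image. Several adapter modules enforce geometric, pose, and identity consistency, and the output is a high-quality 3D Gaussian Splatting representation.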
Details
Motivation: Reconstructing a complete 3D avatar from a single image suffers from insufficient information: the geometry and texture of occluded regions are hard to control, and existing methods remain limited in detail recovery and text controllability. Method: A two-stage framework. The first stage uses an adapted SDXL model with Pose-Adapter and ID-Adapter-G to generate multi-view images, with BLIP2 providing richer text descriptions. The second stage uses a Transformer with a multi-view feature fusion module to reconstruct a 3DGS representation from the generated images, and ID-Adapter-R improves recovery of facial detail. Result: Experiments show that the method generates realistic, animation-ready 3D avatars without post-processing and outperforms existing baselines on multiple metrics. Conclusion: Dream3DAvatar addresses the occlusion and control problems of single-image 3D avatar generation, producing high-fidelity, text-controllable, identity-consistent avatars. Abstract: With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splat representations (3DGS) from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. 
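The gating idea mentioned for ID-Adapter-R can be pictured with a small sketch. The abstract does not give the module's actual form, so everything here is an assumption made for illustration: a per-channel sigmoid gate, computed from both inputs, decides how much of the facial feature is added to the reconstruction feature. The function name `gated_fuse` and its weight parameters are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(recon_feat, face_feat, w_recon, w_face, bias):
    """Fuse face_feat into recon_feat with a per-channel gate (hypothetical form).

    gate_i = sigmoid(w_recon_i * recon_i + w_face_i * face_i + bias_i)
    out_i  = recon_i + gate_i * face_i
    """
    fused = []
    for r, f, wr, wf, b in zip(recon_feat, face_feat, w_recon, w_face, bias):
        gate = sigmoid(wr * r + wf * f + b)  # in (0, 1): how much face detail to admit
        fused.append(r + gate * f)           # residual add keeps the original feature
    return fused
```

With all gate weights at zero the gate sits at 0.5, so half of each facial feature is added; a strongly negative bias closes the gate and leaves the reconstruction feature almost unchanged.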
Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.[130] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Yan Chen,Long Li,Teng Xi,Long Zeng,Jingdong Wang
Main category: cs.CV
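TL;DR: Proposes a two-stage reinforcement learning framework that jointly improves the perception and reasoning of vision-language models; the resulting model, PeBR-R1, shows superior performance on seven benchmark datasets.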
Details
Motivation: Directly transplanting reinforcement learning methods from large language models to vision-language models works poorly, because a VLM must first understand the visual input accurately before it can reason, which makes the task more complex. Method: A two-stage reinforcement learning framework: the first stage improves visual perception through coarse- and fine-grained visual understanding, and the second stage focuses on reasoning; dataset-level sampling is used to mitigate the vanishing advantage issue. Result: Experiments on seven benchmark datasets show clear gains in VLM perception and reasoning, with PeBR-R1 outperforming existing models across diverse visual reasoning tasks. Conclusion: The two-stage framework improves perception and reasoning together and offers a better RL recipe for training VLMs. Abstract: Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.[131] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
Xu Li,Yuxuan Liang,Xiaolei Chen,Yi Zheng,Haotian Chen,Bin Li,Xiangyang Xue
Main category: cs.CV
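TL;DR: Proposes HERO, a high-resolution visual token early dropping framework that combines content-adaptive token budget allocation with function-aware token selection, giving efficient and accurate inference for high-resolution vision-language models without any training.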
Details
Motivation: High-Resolution Large Vision-Language Models (HR-LVLMs) encode images tile by tile, which sharply increases the number of visual tokens and brings substantial computational and memory overhead, so inference efficiency needs to improve. Method: Based on an empirical analysis of visual token utilization in HR-LVLMs, the HERO framework combines tile-level importance estimation with function-aware token selection, allocating the token budget adaptively and retaining tokens that play complementary roles. Result: HERO achieves superior efficiency-accuracy trade-offs across multiple benchmarks and model scales, significantly reducing the number of visual tokens while maintaining or even improving performance. Conclusion: HERO is an effective training-free solution for efficient inference in high-resolution vision-language models, and it clarifies how the roles of visual tokens differ across stages and how this can be exploited. Abstract: By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.[132] TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Qianqi Lu,Yuxiang Xie,Jing Zhang,Shiwei Zou,Yan Chen,Xidao Luan
Main category: cs.CV
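TL;DR: Proposes TFANet, a three-stage image-text feature alignment network that addresses multimodal misalignment and loss of language semantics in referring image segmentation.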
Details
Motivation: In complex scenes, existing methods often mislocalize targets or segment them incompletely because of multimodal misalignment and loss of language semantics, especially when several visually similar objects are present. Method: A three-stage framework: the Knowledge Plus Stage (KPS) uses the Multiscale Linear Cross-Attention Module (MLAM) for bidirectional semantic exchange across scales; the Knowledge Fusion Stage (KFS) uses the Cross-modal Feature Scanning Module (CFSM) to capture long-range dependencies; the Knowledge Intensification Stage (KIS) introduces the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation. Result: TFANet outperforms existing methods on multiple benchmark datasets, with more accurate localization and more complete segmentation in complex scenes in particular. Conclusion: Through staged, multi-level alignment, TFANet improves both multimodal alignment quality and semantic completeness in referring image segmentation. Abstract: Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. 
Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.[133] Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Yan Xingyang,Huang Xiaohong,Zhang Zhao,You Tian,Xu Ziheng
Main category: cs.CV
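TL;DR: Presents LLFDisc, a U-shaped deep enhancement network that combines cross-attention and gating mechanisms for frequency-aware enhancement. With a KL-Divergence-based distribution-aware loss and an improved perceptual loss, it achieves state-of-the-art performance on multiple benchmarks.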
Details
Motivation: Traditional fitting of Fourier frequency information uses pixel-wise loss functions, which tend to over-focus on local information and lose global information, so a more robust way to align Fourier-domain information is needed. Method: The LLFDisc network combines cross-attention and gating mechanisms; a distribution-aware loss fits Fourier-domain information directly with a closed-form KL-Divergence objective; the VGG-based perceptual loss is enhanced by embedding KL-Divergence on deep features to improve structural fidelity. Result: Experiments on multiple benchmarks show state-of-the-art results in both qualitative and quantitative evaluations. Conclusion: By aligning Fourier-domain information robustly and improving structural fidelity, LLFDisc addresses the loss of global information in traditional methods and clearly improves enhancement quality. Abstract: In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. Traditional Fourier frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc[134] Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling
Yunyao Lu,Yihang Wu,Ahmad Chaddad,Tareef Daqqaq,Reem Kateb
Main category: cs.CV
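TL;DR: Proposes a semi-supervised 3D medical image segmentation framework based on a dual-network architecture. A Cross Consistency Enhancement module and self-supervised contrastive learning reduce pseudo-label noise and prediction uncertainty, and the method clearly outperforms existing approaches on several datasets.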
Details
Motivation: Existing semi-supervised medical image segmentation methods suffer from noisy pseudo-labels and insufficient supervision in the feature space, and relying on large amounts of labeled data is impractical, so more robust semi-supervised methods are needed. Method: A dual-network architecture with a Cross Consistency Enhancement module that combines cross pseudo supervision and entropy-filtered supervision, an uncertainty-aware dynamic weighting strategy based on KL divergence, and self-supervised contrastive learning that aligns uncertain voxel features with reliable class prototypes. Result: On three 3D segmentation datasets (Left Atrial, NIH Pancreas, and BraTS-2019) the method outperforms the state of the art, for example reaching an 89.95% Dice score on Left Atrial with only 10% labeled data, and ablations confirm the contribution of each module. Conclusion: The framework improves semi-supervised 3D medical image segmentation by reducing pseudo-label noise and strengthening feature-space supervision, which lowers prediction uncertainty and suggests good practical potential. Abstract: Despite the remarkable performance of supervised medical image segmentation models, relying on a large amount of labeled data is impractical in real-world situations. Semi-supervised learning approaches aim to alleviate this challenge using unlabeled data through pseudo-label generation. Yet, existing semi-supervised segmentation methods still suffer from noisy pseudo-labels and insufficient supervision within the feature space. To solve these challenges, this paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels, while we design a dynamic weighting strategy to adjust the contributions of pseudo-labels using an uncertainty-aware mechanism (i.e., Kullback-Leibler divergence). In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes by effectively differentiating between trustworthy and uncertain predictions, thus reducing prediction uncertainty. Extensive experiments are conducted on three 3D segmentation datasets, Left Atrial, NIH Pancreas and BraTS-2019. The proposed approach consistently exhibits superior performance across various settings (e.g., 89.95% Dice score on Left Atrial with 10% labeled data) compared to the state-of-the-art methods. 
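The uncertainty-aware weighting described above can be sketched in a few lines. The abstract only says that pseudo-label contributions are adjusted with a Kullback-Leibler divergence, so the specific rule below is an assumption made for illustration: for one voxel, the weight is exp(-KL) between the two networks' class probabilities, so voxels where the networks disagree contribute less. The names `pseudo_label_weight` and `weighted_pseudo_loss` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete class distributions, clamped at zero."""
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return max(0.0, kl)


def pseudo_label_weight(p_self, p_other):
    """Weight of at most 1: equal to 1 when both networks agree, smaller as they diverge."""
    return math.exp(-kl_divergence(p_self, p_other))


def weighted_pseudo_loss(p_self, p_other):
    """Cross-entropy of one network against the other's hard pseudo-label,
    down-weighted by how much the two predictions disagree (assumed rule)."""
    label = max(range(len(p_other)), key=lambda c: p_other[c])
    cross_entropy = -math.log(p_self[label] + 1e-8)
    return pseudo_label_weight(p_self, p_other) * cross_entropy
```

In a dual-network setup this would be applied in both directions, each network taking the other's prediction as `p_other`, and averaged over unlabeled voxels.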
Furthermore, the usefulness of the proposed modules is further validated via ablation experiments.[135] A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control
Jonas Werheid,Shengjie He,Aymen Gannouni,Anas Abdelrazeq,Robert H. Schmitt
Main category: cs.CV
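TL;DR: Proposes a synthetic data generation approach based on CAD data and object detection algorithms for efficient, low-cost visual assembly quality control, aimed at resource-constrained small- and medium-sized enterprises.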
Details
Motivation: Small- and medium-sized enterprises (SMEs) lack the resources for large-scale data collection and annotation, and conventional computer vision approaches are too costly for them to adopt. Method: Simulated scenes are generated from CAD data, and object detection models are trained on the synthetic data, enabling assembly control without large amounts of annotated real images. Result: mAP@0.5:0.95 reaches up to 99.5% on synthetic data and up to 93% on real test data, supporting the method's effectiveness and cross-domain transfer. Conclusion: The approach substantially reduces the time and cost of data generation and gives SMEs an integrable, data-efficient solution for visual assembly control. Abstract: Quality control of assembly processes is essential in manufacturing to ensure not only the quality of individual components but also their proper integration into the final product. To assist in this matter, automated assembly control using computer vision methods has been widely implemented. However, the costs associated with image acquisition, annotation, and training of computer vision algorithms pose challenges for integration, especially for small- and medium-sized enterprises (SMEs), which often lack the resources for extensive training, data collection, and manual image annotation. Synthetic data offers the potential to reduce manual data collection and labeling. Nevertheless, its practical application in the context of assembly quality remains limited. In this work, we present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95) of up to 99.5% for correctly identifying instances of synthetic planetary gear system components within our simulated training data, and up to 93% when transferred to real-world camera-captured testing data. This research highlights the effectiveness of synthetic data generation within an adaptable pipeline and underscores its potential to support SMEs in implementing resource-efficient visual assembly control solutions.[136] Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection -- The 2024 Global Deepfake Image Detection Challenge
Kohou Wang,Huan Hu,Xiang Liu,Zezhou Chen,Ping Chen,Zhaoxiang Liu,Shiguo Lian
Main category: cs.CV
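TL;DR: Proposes a facial forgery detection method based on the Hierarchical Deep Fusion Framework (HDFF), which integrates four pre-trained models through multi-stage fine-tuning on the MultiFFDI dataset; it scored 0.96852 in the competition, ranking 20th.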
Details
Motivation: The rapid progress of deepfake technology poses a serious threat to digital security and authenticity, and existing methods generalize poorly across manipulation techniques, so more robust and general detection models are needed. Method: Proposes the Hierarchical Deep Fusion Framework (HDFF), which integrates four different pre-trained sub-models (Swin-MLP, CoAtNet, EfficientNetV2, and DaViT), fine-tunes them in multiple stages, concatenates their features, and trains a final classifier for forgery detection. Result: The method scored 0.96852 on the competition's private leaderboard, ranking 20th out of 184 teams, which supports the effectiveness of hierarchical fusion for complex image classification tasks. Conclusion: By fusing feature representations from multiple heterogeneous models, HDFF markedly improves facial forgery detection, indicating that ensemble learning with hierarchical feature fusion is an effective way to handle diverse deepfake techniques. Abstract: The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models, Swin-MLP, CoAtNet, EfficientNetV2, and DaViT, which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition's private leaderboard, securing the 20th position out of 184 teams, demonstrating the efficacy of hierarchical fusion for complex image classification tasks.[137] Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving
Ruibo Li,Hanyu Shi,Zhe Wang,Guosheng Lin
Main category: cs.CV
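TL;DR: Proposes a new weakly and self-supervised approach to class-agnostic motion prediction from LiDAR point clouds that replaces motion annotations with foreground/background or non-ground/ground masks, sharply reducing annotation cost while outperforming existing self-supervised methods and rivaling some supervised ones.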
Details
Motivation: Understanding motion in dynamic environments is critical for autonomous driving, yet existing motion prediction methods depend on large amounts of annotated data. This work aims to reduce the reliance on precise motion annotations through weak and self-supervision, enabling efficient, low-cost motion prediction. Method: Proposes a new weakly supervised paradigm that uses fully or partially annotated (1%, 0.1%, or as little as 0.01%) foreground/background masks, or non-ground/ground masks, as the supervision signal; designs two weakly supervised methods and one fully self-supervised method; and introduces a Robust Consistency-aware Chamfer Distance loss that combines multi-frame information with robust penalty functions to suppress outliers in self-supervised learning. Result: Experiments show that the proposed weakly and self-supervised models outperform existing self-supervised methods, and the weakly supervised models approach the performance of some supervised methods while lowering annotation cost. Conclusion: The approach effectively balances annotation cost and model performance and confirms the feasibility and benefit of using scene-structure priors (such as foreground/background separation) for weakly and self-supervised motion prediction. Abstract: Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones.
This demonstrates that our approaches effectively balance annotation effort and performance.[138] Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline
Zhihao Zhang,Chunyu Lin,Lang Nie,Jiyuan Wang,Yao Zhao
Main category: cs.CV
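TL;DR: Proposes SS-PSD, a semi-supervised method for parking slot detection in automatic parking systems, and builds CRPS-D, a large-scale parking slot detection dataset covering complex scenes, significantly improving detection performance.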
Details
Motivation: Existing parking slot detection datasets are limited in scale and lack real-world noise, and manual annotation is costly and error-prone, so they fall short of practical needs. Method: Builds CRPS-D, a large-scale dataset covering varied lighting, weather conditions, and slanted slots; proposes SS-PSD, a semi-supervised method based on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Result: Experiments show that SS-PSD outperforms current state-of-the-art methods on both the proposed dataset and an existing dataset, with larger gains as more unlabeled data is used. Conclusion: This is the first application of semi-supervised learning to parking slot detection; it makes effective use of unlabeled data to improve performance and provides a high-quality dataset and baseline for future research. Abstract: As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird's-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., lighting and occlusion). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme.
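The abstract names a teacher-student model with confidence-guided mask consistency but does not give its details. As a rough illustration only, the sketch below shows one common way such schemes are set up: a teacher whose weights track the student through an exponential moving average, and a consistency loss computed only where the teacher is confident. The function names, the EMA update, the squared-error loss, and the 0.7 threshold are all assumptions made for this example, not SS-PSD's actual design.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the matching student weight (assumed EMA teacher)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def masked_consistency_loss(teacher_conf, student_conf, threshold=0.7):
    """Mean squared error over locations where the teacher is confident.

    Locations whose teacher confidence is below `threshold` are masked out,
    so unreliable pseudo-labels do not contribute to the loss.
    """
    kept = [(t, s) for t, s in zip(teacher_conf, student_conf) if t >= threshold]
    if not kept:
        return 0.0  # nothing confident enough to supervise the student
    return sum((t - s) ** 2 for t, s in kept) / len(kept)


# Toy values: two weights and three prediction locations.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
loss = masked_consistency_loss([0.9, 0.2, 0.8], [0.7, 0.9, 0.8], threshold=0.7)
```

In this toy call the second location is ignored because the teacher's confidence (0.2) is below the threshold, so only the first and third locations contribute to the loss.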
The relevant source code and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.[139] MSDNet: Efficient 4D Radar Super-Resolution via Multi-Stage Distillation
Minqing Huang,Shouyi Lu,Boyuan Zheng,Ziyao Li,Xiao Tang,Guirong Zhuo
Main category: cs.CV
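TL;DR: Proposes MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features through reconstruction-guided and diffusion-guided feature distillation, achieving high-quality, low-latency point cloud super-resolution.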
Details
Motivation: Existing 4D radar super-resolution methods have high training cost, high inference latency, and poor generalization, making it hard to balance accuracy and efficiency. Method: Uses multi-stage feature distillation: the first stage aligns and densifies the student's features through feature reconstruction; the second stage treats the first-stage output as a noisy version of the teacher's features and refines it with a lightweight diffusion network; a noise adapter adaptively aligns the feature noise level with a predefined diffusion timestep. Result: Experiments on VoD and an in-house dataset show that MSDNet achieves high-fidelity reconstruction with low-latency inference and consistently improves downstream task performance. Conclusion: MSDNet effectively balances accuracy and efficiency in 4D radar point cloud super-resolution and has good potential for practical use. Abstract: 4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student's features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation, which treats the stage-one distilled features as a noisy version of the teacher's representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks. The code will be publicly available upon publication.[140] TexTAR : Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images
Rohan Kumar,Jyothi Swaroopa Jinka,Ravi Kiran Sarvadevabhatla
Main category: cs.CV
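TL;DR: Proposes TexTAR, a new Transformer-based multi-task model for Textual Attribute Recognition (TAR) that combines a context-aware mechanism with a 2D RoPE-style structure, outperforms existing methods in multilingual, multi-domain settings, and is released together with the new MMTAD dataset.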
Details
Motivation: Existing textual attribute recognition methods fall short in computational efficiency and in adapting to complex, multilingual settings, and they struggle to capture the context needed for accurate recognition. Method: Proposes TexTAR, which uses a multi-task learning framework and a 2D RoPE-style positional encoding mechanism to strengthen context awareness; designs a new data selection pipeline; and builds MMTAD, a multilingual, multi-domain dataset covering several types of real-world documents. Result: Experiments on multiple benchmarks show that TexTAR outperforms existing methods on textual attribute recognition, confirming that context awareness improves recognition performance. Conclusion: Modeling contextual information markedly improves the accuracy and robustness of textual attribute recognition, and TexTAR offers an efficient, scalable solution for the task. Abstract: Recognizing textual attributes such as bold, italic, underline and strikeout is essential for understanding text semantics, structure, and visual presentation. These attributes highlight key information, making them crucial for document analysis. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our novel data selection pipeline enhances context awareness, and our architecture employs a 2D RoPE (Rotary Positional Embedding)-style mechanism to incorporate input context for more accurate attribute predictions. We also introduce MMTAD, a diverse, multilingual, multi-domain dataset annotated with text attributes across real-world documents such as legal records, notices, and textbooks. Extensive evaluations show TexTAR outperforms existing methods, demonstrating that contextual awareness contributes to state-of-the-art TAR performance.[141] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)
Zhihao He,Tianyao He,Tieyuan Chen,Yun Xu,Huabin Liu,Chaofan Gan,Gui Zou,Weiyao Lin
Main category: cs.CV
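TL;DR: Proposes a multi-video collaborative framework that represents video knowledge as spatio-temporal graphs and fuses related information to improve the reasoning ability of video language models.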
Details
Motivation: Existing video language models make reasoning errors and hallucinate when processing a single video because its spatio-temporal information is incomplete, so multiple related videos are needed to strengthen reasoning. Method: Designs a Video Structuring Module that represents each video as a spatio-temporal graph, a Graph Fusion Module that integrates structured knowledge from multiple videos, and a multi-modal structured prompt that is fed to the large language model. Result: Experiments indicate that the framework effectively improves the reasoning accuracy and efficiency of video language models. Conclusion: The proposed multi-video collaborative framework offers a promising direction for improving video language models. Abstract: Despite the rapid progress of video language models, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video's knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.[142] WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory
Ruifei Ding,Zhe Chen,Wen Fan,Chen Long,Huijuan Xiao,Yelu Zeng,Zhen Dong,Bisheng Yang
Main category: cs.CV
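TL;DR: Introduces WHU-STree, a cross-city, multi-modal, richly annotated street tree dataset that supports a range of street tree inventory tasks and promotes research on multi-modal fusion and cross-domain applications.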
Details
Motivation: Traditional street tree surveys are time-consuming and labor-intensive, and existing mobile mapping datasets suffer from small scenes, sparse annotation, or a single modality, which limits progress in automated urban street tree inventory. Method: Collects synchronized point clouds and high-resolution images in two different cities to build WHU-STree, a multi-modal dataset with 21,007 trees, 50 species, and 2 morphological parameters that supports more than 10 street tree inventory tasks, and benchmarks tree species classification and individual tree segmentation. Result: Experiments confirm the significant potential of multi-modal data fusion for street tree analysis, demonstrate generalization across cities, and identify key challenges and future directions such as multi-modal fusion, multi-task collaboration, and cross-domain adaptation. Conclusion: WHU-STree provides a high-quality data foundation for intelligent urban street tree management and advances the use of multi-modal deep learning for urban ecological asset inventory. Abstract: Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scenes, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging these characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks--tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment.
In particular, we identify key challenges and outline potential future work for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Models for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.[143] More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
Yingtai Li,Haoran Lai,Xiaoqian Zhou,Shuai Ming,Wenxin Ma,Wei Wei,Shaohua Kevin Zhou
Main category: cs.CV
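TL;DR: Proposes using large language models (LLMs) to automatically extract diagnostic labels from radiology reports, producing a low-cost, large-scale "silver-standard" dataset for medical contrastive vision-language pre-training. Experiments show that a vision encoder trained on this dataset performs comparably to one trained on labels extracted by specialized models and achieves state-of-the-art results on several downstream tasks, illustrating the potential of LLMs to improve the performance and scalability of medical AI systems.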
Details
Motivation: Existing medical vision-language pre-training depends on high-quality annotated data, but manual annotation is expensive and hard to scale. Earlier methods extract labels with BERT-style models but remain limited by model capacity and annotation scale. This work explores how LLMs can generate large-scale supervision efficiently and cheaply to advance medical vision-language alignment. Method: First uses modern LLMs to extract diagnostic labels directly from radiology reports, reaching high precision (>96% AUC) without complex prompt engineering, to build a low-cost, large-scale "silver-standard" dataset. Then performs supervised pre-training on this dataset and combines a 3D ResNet-18 with the standard CLIP framework for contrastive learning to achieve cross-modal alignment. Result: A vision encoder trained on the "silver-standard" dataset performs comparably to one trained on labels extracted by specialized BERT-based models; it reaches 83.8% AUC (CT-RATE) and 77.3% AUC (RAD-ChestCT) on zero-shot diagnosis and substantially improves cross-modal retrieval (image-image MAP@50=53.7%, report-image Recall@100=52.2%). Conclusion: LLMs can efficiently build the supervision needed for high-quality medical vision-language pre-training, sharply lowering annotation cost and improving model performance and scalability, which offers a practical new paradigm for developing medical AI systems. Abstract: The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems.
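For readers unfamiliar with the "vanilla CLIP training" mentioned above, the following is a minimal sketch of the standard symmetric contrastive (InfoNCE) objective used by CLIP-style models, written with the Python standard library only. It is a generic illustration, not the authors' code: the function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions for this example, and the inputs are assumed to be L2-normalised already.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired, L2-normalised embeddings."""
    n = len(image_emb)
    # Pairwise similarity logits: entry [i][j] compares image i with report j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_emb] for img in image_emb]
    # Image -> report direction: row i should select column i.
    i2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Report -> image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: matching pairs give a small loss, mismatched pairs a large one.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own report embedding and high when the pairing is swapped, which is the behaviour contrastive pre-training optimises for.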
Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.[144] Road Obstacle Video Segmentation
Shyam Nandan Rai,Shyamgopal Karthik,Mariana-Iuliana Georgescu,Barbara Caputo,Carlo Masone,Zeynep Akata
Main category: cs.CV
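TL;DR: Studies the temporal nature of road obstacle segmentation for autonomous driving, proposes four evaluation benchmarks for road obstacle video segmentation, and introduces two strong baselines based on vision foundation models that set a new state of the art on long video sequences.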
Details
Motivation: Existing methods mostly segment road obstacles on single frames and ignore the correlation between consecutive frames, producing inconsistent predictions that affect the safety and stability of autonomous driving. Method: Curates and adapts four evaluation benchmarks for road obstacle video segmentation, systematically evaluates 11 state-of-the-art image and video segmentation methods, and proposes two strong baselines based on vision foundation models that exploit temporal information to improve segmentation consistency and performance. Result: The proposed approach sets a new state of the art on long-sequence video segmentation, clearly outperforming existing methods and confirming the importance of temporal modeling for road obstacle segmentation. Conclusion: Road obstacle segmentation is inherently temporal and should be modeled at the video level; the benchmarks and baselines established here provide an important reference for future research. Abstract: With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied to individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.[145] Vi-SAFE: A Spatial-Temporal Framework for Efficient Violence Detection in Public Surveillance
Ligang Chang,Shengkai Xu,Liangchang Shen,Binhan Xu,Junqiao Wang,Tianyu Shi,Yanhui Du
Main category: cs.CV
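TL;DR: Proposes Vi-SAFE, a spatial-temporal violence detection framework for public surveillance that combines an optimized YOLOv8 with TSN and reaches 88% accuracy on the RWF-2000 dataset, outperforming existing methods.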
Details
Motivation: Public surveillance involves small targets, complex environments, and strict real-time requirements, and existing violence detection methods fall short in accuracy and efficiency. Method: Proposes the Vi-SAFE framework: YOLOv8 is optimized with GhostNetV3, an EMA attention mechanism, and pruning to serve as a lightweight detector that extracts human regions; TSN performs temporal modeling for binary classification of violent behavior; YOLOv8 and TSN are trained independently on pedestrian and violence datasets, respectively. Result: On the RWF-2000 dataset, Vi-SAFE reaches an accuracy of 0.88, well above TSN alone (0.77), and outperforms existing methods in both accuracy and computational efficiency. Conclusion: Through its lightweight design and separate spatial and temporal modeling, Vi-SAFE markedly improves violence detection in complex scenes while remaining real-time, making it suitable for practical public safety surveillance systems. Abstract: Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at https://anonymous.4open.science/r/Vi-SAFE-3B42/README.md.[146] End4: End-to-end Denoising Diffusion for Diffusion-Based Inpainting Detection
Fei Wang,Xuecheng Wu,Zheng Zhang,Danlei Huang,Yuheng Huang,BoWang
Main category: cs.CV
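TL;DR: Proposes End4, a new detection method based on end-to-end denoising diffusion for identifying images inpainted by diffusion models, with good generalization and robustness.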
Details
Motivation: Existing methods struggle to detect images produced by diffusion-based inpainting, which creates a risk of misuse. Method: Designs an end-to-end denoising reconstruction model that improves the alignment between the latent spaces of the reconstruction and detection processes, and introduces a Scale-aware Pyramid-like Fusion Module (SPFM) to strengthen the discriminability of local features. Result: The method is validated on a benchmark covering five distinct masked regions; End4 generalizes to unseen masking patterns and remains robust under various perturbations. Conclusion: End4 markedly improves the detection of diffusion-inpainted images and offers an effective response to the misuse of generative models. Abstract: The powerful generative capabilities of diffusion models have significantly advanced the field of image synthesis, enhancing both full image generation and inpainting-based image editing. Despite their remarkable advancements, diffusion models also raise concerns about potential misuse for malicious purposes. However, existing approaches struggle to identify images generated by diffusion-based inpainting models, even when similar inpainted images are included in their training data. To address this challenge, we propose a novel detection method based on End-to-end denoising diffusion (End4). Specifically, End4 designs a denoising reconstruction model to improve the alignment degree between the latent spaces of the reconstruction and detection processes, thus reconstructing features that are more conducive to detection. Meanwhile, it leverages a Scale-aware Pyramid-like Fusion Module (SPFM) that refines local image features under the guidance of attention pyramid layers at different scales, enhancing feature discriminability. Additionally, to evaluate detection performance on inpainted images, we establish a comprehensive benchmark comprising images generated from five distinct masked regions. Extensive experiments demonstrate that our End4 effectively generalizes to unseen masking patterns and remains robust under various perturbations. Our code and dataset will be released soon.[147] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Hugo Carlesso,Josiane Mothe,Radu Tudor Ionescu
Main category: cs.CV
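TL;DR: Proposes CMTSSL, a curriculum multi-task self-supervised learning framework for lightweight hyperspectral image analysis that combines masked image modeling with decoupled spatial and spectral jigsaw tasks and uses a curriculum strategy to improve feature representations; it is validated on segmentation across several benchmark datasets with extremely lightweight models.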
Details
Motivation: Hyperspectral data is high-dimensional and satellite downlinks are slow, so compact, efficient models are needed for onboard processing and to avoid transmitting redundant data; existing methods struggle to combine computational efficiency with joint spatial-spectral modeling. Method: Proposes the CMTSSL framework, which combines masked image modeling with decoupled spatial and spectral jigsaw tasks and uses a curriculum learning strategy that gradually increases data complexity, training lightweight models in a unified, efficient way to jointly capture spectral continuity, spatial structure, and global semantic features. Result: On four public benchmark datasets, CMTSSL yields consistent gains on downstream segmentation, with models more than 16,000x lighter than some state-of-the-art models. Conclusion: CMTSSL learns generalizable hyperspectral representations at very low model complexity and has potential for real onboard satellite processing systems. Abstract: Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g., cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications.
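The abstract describes spectral jigsaw puzzle solving under a curriculum that progressively increases data complexity, without giving the exact recipe. The sketch below illustrates the general idea under stated assumptions: one pixel's spectrum is cut into contiguous band groups that are shuffled, and the number of groups grows linearly with the epoch. The function names, the linear schedule, and the range of 2 to 8 groups are hypothetical choices for illustration, not CMTSSL's actual settings.

```python
import random


def spectral_jigsaw(pixel_spectrum, num_groups, rng):
    """Shuffle contiguous band groups of one pixel's spectrum.

    Returns the shuffled spectrum and the group order that was applied;
    a pretext head would be trained to predict that order.
    """
    if len(pixel_spectrum) % num_groups != 0:
        raise ValueError("number of bands must be divisible by num_groups")
    size = len(pixel_spectrum) // num_groups
    groups = [pixel_spectrum[i * size:(i + 1) * size] for i in range(num_groups)]
    order = list(range(num_groups))
    rng.shuffle(order)
    shuffled = [band for g in order for band in groups[g]]
    return shuffled, order


def curriculum_groups(epoch, total_epochs, min_groups=2, max_groups=8):
    """Assumed schedule: linearly grow the number of shuffled groups over training."""
    frac = epoch / max(total_epochs - 1, 1)
    return min_groups + round(frac * (max_groups - min_groups))


# Toy spectrum with 16 bands; the first epoch uses the easiest puzzle (2 groups).
rng = random.Random(0)
spectrum = list(range(16))
shuffled, order = spectral_jigsaw(spectrum, curriculum_groups(0, 10), rng)
```

A spatial jigsaw can be built the same way by shuffling image patches instead of band groups, which is one way to keep the two puzzles decoupled.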
Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.[148] Intelligent Vacuum Thermoforming Process
Andi Kuswoyo,Christos Margadji,Sebastian W. Pattinson
Main category: cs.CV
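TL;DR: Proposes a vision-based quality control system that uses a k-Nearest Neighbour algorithm to predict and optimise vacuum thermoforming process parameters, improving part quality with little data.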
Details
Motivation: Variations in material properties and tooling configurations make quality control in vacuum thermoforming difficult, calling for an efficient optimisation method with low data requirements. Method: Builds a visual dataset of vacuum-formed samples produced under different process parameters, expands it with image augmentation, and uses a k-Nearest Neighbour algorithm to map low-quality parts to high-quality counterparts and adjust the process parameters accordingly. Result: The model performs well in adjusting heating power, heating time, and vacuum time, reducing defects and improving production efficiency. Conclusion: The vision-based quality control system optimises process parameters with relatively little data and improves the consistency and efficiency of vacuum thermoforming. Abstract: Ensuring consistent quality in vacuum thermoforming presents challenges due to variations in material properties and tooling configurations. This research introduces a vision-based quality control system to predict and optimise process parameters, thereby enhancing part quality with minimal data requirements. A comprehensive dataset was developed using visual data from vacuum-formed samples subjected to various process parameters, supplemented by image augmentation techniques to improve model training. A k-Nearest Neighbour algorithm was subsequently employed to identify adjustments needed in process parameters by mapping low-quality parts to their high-quality counterparts. The model exhibited strong performance in adjusting heating power, heating time, and vacuum time to reduce defects and improve production efficiency.[149] ResidualViT for Efficient Temporally Dense Video Encoding
Mattia Soldan,Fabian Caba Heilbron,Bernard Ghanem,Josef Sivic,Bryan Russell
Main category: cs.CV
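TL;DR: Proposes the ResidualViT architecture and a lightweight distillation strategy for efficiently computing frame-level video features, significantly reducing computational cost and speeding up inference across several tasks.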
Details
Motivation: Video understanding tasks need frame-level features at high temporal resolution, but computing them is expensive, so redundant computation must be reduced. Method: Designs the ResidualViT architecture with learnable residual connections and a token reduction module, and uses a lightweight distillation strategy to approximate the features of the foundation model. Result: Across four tasks and five datasets, computational cost drops by up to 60% and inference is up to 2.5x faster while accuracy stays close to that of the original model. Conclusion: ResidualViT effectively balances computational efficiency and model performance and suits a variety of temporally dense video understanding tasks. Abstract: Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.[150] RadGame: An AI-Powered Platform for Radiology Education
Mohammed Baharoon,Siavash Raissi,John S. Jun,Thibault Heintz,Mahmoud Alabbad,Ali Alburkani,Sung Eun Kim,Kent Kleinschmidt,Abdulrahman O. Alhumaydhi,Mohannad Mohammed G. Alghamdi,Jeremy Francis Palacio,Mohammed Bukhaytan,Noah Michael Prudlo,Rithvik Akula,Brady Chrisler,Benjamin Galligos,Mohammed O. Almutairi,Mazeen Mohammed Alanazi,Nasser M. Alrashdi,Joel Jihwan Hwang,Sri Sai Dinesh Jaliparthi,Luke David Nelson,Nathaniel Nguyen,Sathvik Suryadevara,Steven Kim,Mohammed F. Mohammed,Yevgeniy R. Semenov,Kun-Hsing Yu,Abdulrhman Aljouie,Hassan AlOmaish,Adam Rodman,Pranav Rajpurkar
Main category: cs.CV
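TL;DR: RadGame is a radiology education platform that combines gamification with AI feedback to improve two core skills: localizing abnormalities and generating reports.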
Details
Motivation: Traditional radiology training relies on passive learning or limited real-time supervision and lacks a scalable mechanism for immediate feedback. Method: Uses public datasets and vision-language models to provide automated, structured AI feedback through two modules, RadGame Localize and RadGame Report. Result: Participants using RadGame improved localization accuracy by 68% (versus 17% with traditional methods) and report-writing accuracy by 31% (versus 4%). Conclusion: AI-driven gamified platforms can improve the scalability and feedback quality of radiology teaching and have broad potential in medical education. Abstract: We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user-missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist's written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.[151] Image Realness Assessment and Localization with Multimodal Features
Lovish Kaushik,Agnij Biswas,Somdyuti Paul
Main category: cs.CV
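TL;DR: Proposes a multimodal framework based on vision-language models for quantifying the realness of AI-generated images and identifying locally inconsistent regions.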
Details
Motivation: A reliable way to quantify the perceptual realness of AI-generated images and to identify visually inconsistent regions is needed to make generative models more usable and to provide feedback during training. Method: Uses vision-language models trained on large datasets to generate textual descriptions of visual inconsistencies, which stand in for human annotations in both overall realness assessment and local inconsistency identification. Result: The method improves objective realness prediction and produces dense realness maps that effectively separate realistic from unrealistic regions. Conclusion: The proposed multimodal framework effectively assesses the realness of AI-generated images and localizes inconsistent regions, with potential use as training feedback for generative models. Abstract: A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.[152] StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance
Zefan Qu,Zhenwei Wang,Haoyuan Wang,Ke Xu,Gerhard Hancke,Rynson W. H. Lau
Main category: cs.CV
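TL;DR: Proposes StyleSculptor, a training-free 3D asset generation method that uses a content image and style images to achieve fine-grained style control over texture, geometry, or both.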
Details
Motivation: Generating style-controllable 3D assets remains challenging for existing methods, which lack flexibility and precision, especially when texture and geometry styles must be controlled together. Method: Proposes StyleSculptor, which comprises a Style Disentangled Attention (SD-Attn) module and a Style Guided Control (SGC) mechanism; SD-Attn dynamically fuses content and style features through a cross-3D attention mechanism and uses feature variance to inject style features selectively and avoid content leakage; SGC supports geometry-only, texture-only, or combined stylization as well as intensity adjustment. Result: Experiments show that StyleSculptor outperforms existing baselines in producing high-fidelity 3D assets and achieves fine-grained style control. Conclusion: StyleSculptor offers a new training-free approach to style-controllable 3D asset generation with flexible, stable style transfer, suited to practical settings such as games and virtual reality. Abstract: Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process.
Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.[153] 3D Aware Region Prompted Vision Language Model
An-Chieh Cheng,Yang Fu,Yukang Chen,Zhijian Liu,Xiaolong Li,Subhashree Radhakrishnan,Song Han,Yao Lu,Jan Kautz,Pavlo Molchanov,Hongxu Yin,Xiaolong Wang,Sifei Liu
Main category: cs.CV
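TL;DR: Proposes SR-3D, a vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space, supports flexible region prompting, and achieves state-of-the-art performance on 2D and 3D tasks.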