cs.CV [Total: 134]
cs.GR [Total: 6]
cs.CL [Total: 66]
cs.RO [Total: 3]
physics.med-ph [Total: 1]
cs.HC [Total: 2]
q-bio.NC [Total: 1]
cs.AI [Total: 1]
eess.AS [Total: 1]
astro-ph.HE [Total: 1]
cs.MM [Total: 2]
cs.IR [Total: 2]
cond-mat.mtrl-sci [Total: 1]
eess.IV [Total: 8]
cs.LG [Total: 11]
cs.DC [Total: 1]
cs.SE [Total: 2]
cs.CR [Total: 2]

cs.CV

[1] A Decade of You Only Look Once (YOLO) for Object Detection

Leo Thomas Ramos,Angel D. Sappa

Main category: cs.CV

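TL;DR: This paper reviews ten years of development of the YOLO framework, summarizes its evolution from a simple detector into a diverse family of architectures, and discusses its applications, evaluation, and future directions.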

Details

Motivation: To mark the tenth anniversary of the YOLO framework and analyze its impact on real-time object detection and its development trajectory.

Method: A technical overview of the main versions and key architectural trends and a survey of application areas, combined with evaluation practices and ethical considerations.

Result: YOLO has grown into an efficient, modular framework that adapts across domains and is widely applied in many fields.

Conclusion: YOLO has great potential for further development; its technical evolution and ethical implications warrant continued attention.

Abstract: This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions, highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework's continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO's trajectory and ongoing transformation.

[2] Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Zhikai Wang,Jiashuo Sun,Wenqi Zhang,Zhiqiang Hu,Xin Li,Fan Wang,Deli Zhao

Main category: cs.CV

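TL;DR: VCBENCH is a new multimodal mathematical reasoning benchmark for evaluating how large vision-language models (LVLMs) solve visually dependent math problems; it finds that existing models perform poorly.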

Details

Motivation: Current benchmarks focus mainly on knowledge evaluation and neglect models' ability to reason about basic mathematical and visual concepts, which hinders progress toward AGI.

Method: Proposes the VCBENCH benchmark, containing 1,720 problems and 6,697 images across six cognitive domains, and uses it to evaluate 26 LVLMs.

Result: The evaluation shows poor model performance, with even the best models not exceeding 50% accuracy, exposing the challenge of integrating vision and mathematics.

Conclusion: VCBENCH reveals LVLMs' shortcomings in visual mathematical reasoning and points to directions for improvement.

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems, which rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge, all of which are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.

[3] Co-Training with Active Contrastive Learning and Meta-Pseudo-Labeling on 2D Projections for Deep Semi-Supervised Learning

David Aparco-Cardenas,Jancarlo F. Gomes,Alexandre X. Falcão,Pedro J. de Rezende

Main category: cs.CV

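TL;DR: The paper proposes active-DeepFA, a method that combines contrastive learning, meta-pseudo-labeling, and active learning to train non-pretrained CNN architectures when labeled data are scarce.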

Details

Motivation: Labeled data are scarce and annotation is costly; existing methods depend on pre-trained features and large validation sets, and randomly sampling the labeled data may overlook more informative samples.

Method: active-DeepFA co-trains two networks to reduce confirmation bias from pseudo-labels, and combines contrastive learning, label propagation, and active learning to refine the model progressively.

Result: On three biological image datasets, it surpasses the baselines and six other SoTA methods using only 5% labeled data, and achieves comparable results with 3% labeled data.

Conclusion: active-DeepFA markedly improves performance when labeled data are scarce while reducing annotation cost.

Abstract: A major challenge in training deep learning (DL) models is the limited availability of accurately labeled data. This shortcoming is most pronounced in areas where data annotation is a time-consuming and error-prone task. Semi-supervised learning (SSL) tackles this challenge by capitalizing on scarce labeled and abundant unlabeled data; however, state-of-the-art (SoTA) methods typically depend on pre-trained features and large validation sets to learn effective representations for classification tasks. In addition, the reduced set of labeled data is often randomly sampled, neglecting the selection of more informative samples. Here, we present active-DeepFA, a method that effectively combines contrastive learning (CL), teacher-student-based meta-pseudo-labeling, and active learning (AL) to train non-pretrained CNN architectures for image classification in scenarios where labeled data are scarce and unlabeled data are abundant. It integrates DeepFA into a co-training setup that implements two cooperative networks to mitigate confirmation bias from pseudo-labels. The method starts with a reduced set of labeled samples by warming up the networks with supervised CL. Afterward and at regular epoch intervals, label propagation is performed on the 2D projections of the networks' deep features. Next, the most reliable pseudo-labels are exchanged between networks in a cross-training fashion, while the most meaningful samples are annotated and added into the labeled set. The networks independently minimize an objective loss function comprising supervised contrastive, supervised and semi-supervised loss components, enhancing the representations towards image classification. Our approach is evaluated on three challenging biological image datasets using only 5% of labeled samples, improving baselines and outperforming six other SoTA methods. In addition, it reduces annotation effort by achieving comparable results to those of its counterparts with only 3% of labeled data.
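The cross-training exchange described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names are hypothetical, and a plain per-sample confidence score stands in for the reliability that the paper derives from label propagation on 2D projections, which the abstract does not detail.

```python
def select_reliable(preds, k):
    """Return {sample_id: label} for the k most confident predictions.

    preds maps sample_id -> (label, confidence). The confidence score is a
    stand-in for whatever reliability measure the actual method uses.
    """
    ranked = sorted(preds.items(), key=lambda item: item[1][1], reverse=True)
    return {sample_id: label for sample_id, (label, _conf) in ranked[:k]}


def cross_exchange(preds_a, preds_b, k):
    """Give each network the other network's k most reliable pseudo-labels."""
    labels_for_a = select_reliable(preds_b, k)  # network A learns from B
    labels_for_b = select_reliable(preds_a, k)  # network B learns from A
    return labels_for_a, labels_for_b


# Hypothetical predictions of two networks on three unlabeled samples.
preds_a = {"x1": (0, 0.95), "x2": (1, 0.55), "x3": (1, 0.80)}
preds_b = {"x1": (0, 0.60), "x2": (1, 0.90), "x3": (0, 0.70)}
for_a, for_b = cross_exchange(preds_a, preds_b, k=2)
print(for_a)  # pseudo-labels handed to network A
print(for_b)  # pseudo-labels handed to network B
```

Because each network is supervised by the other's pseudo-labels rather than its own, a confident mistake made by one network is not automatically fed back into that same network, which is the usual argument for why co-training reduces confirmation bias.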

[4] SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models

Nader Zantout,Haochen Zhang,Pujith Kachana,Jinkai Qiu,Ji Zhang,Wenshan Wang

Main category: cs.CV

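TL;DR: SORT3D combines rich object attributes from 2D data, the sequential reasoning ability of large language models, and a heuristics-based spatial reasoning toolbox to achieve zero-shot generalization without text-to-3D training data, and performs strongly on complex tasks.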

Details

Motivation: To address the diversity and complexity of object-referential language in 3D scenes and the lack of large-scale natural language training data.

Method: Combines 2D object attributes, a heuristics-based spatial reasoning toolbox, and the sequential reasoning ability of large language models, without training on text-to-3D data.

Result: Achieves state-of-the-art performance on two benchmarks and zero-shot navigation in real environments.

Conclusion: SORT3D is an efficient solution that generalizes well to object-reference tasks in complex 3D scenes.

Abstract: Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on an autonomous vehicle and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D .

[5] HierSum: A Global and Local Attention Mechanism for Video Summarization

Apoorva Beedu,Irfan Essa

Main category: cs.CV

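TL;DR: HierSum is a hierarchical video summarization method that combines local cues from subtitles with video-level global context and uses the "most replayed" statistic to improve summaries, outperforming existing methods on several datasets.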

Details

Motivation: For instructional video summarization, to improve the accuracy and usefulness of summaries through key-step segmentation and context integration.

Method: Proposes HierSum, which integrates local information from subtitles with video-level global context and uses the "most replayed" statistic as a supervisory signal.

Result: On datasets including TVSum and BLiSS, HierSum outperforms existing methods in F1-score and rank correlation.

Conclusion: Training on the newly curated dataset significantly improves summarization, validating the method.

Abstract: Video summarization creates an abridged version (i.e., a summary) that provides a quick overview of the video while retaining pertinent information. In this work, we focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments, each corresponding to essential steps in the video. We propose HierSum, a hierarchical approach that integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions. Our approach utilizes the "most replayed" statistic as a supervisory signal to identify critical segments, thereby improving the effectiveness of the summary. We evaluate on benchmark datasets such as TVSum, BLiSS, Mr.HiSum, and the WikiHow test set, and show that HierSum consistently outperforms existing methods in key metrics such as F1-score and rank correlation. We also curate a new multi-modal dataset using WikiHow and EHow videos and associated articles containing step-by-step instructions. Through extensive ablation studies, we demonstrate that training on this dataset significantly enhances summarization on the target datasets.

[6] A Review of 3D Object Detection with Vision-Language Models

Ranjan Sapkota,Konstantinos I Roumeliotis,Rahul Harsha Cheppally,Marco Flores Calero,Manoj Karkee

Main category: cs.CV

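TL;DR: This paper systematically reviews 3D object detection with vision-language models (VLMs), analyzing more than 100 papers, comparing traditional methods with modern VLM frameworks, and discussing current challenges and future directions.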

Details

Motivation: To study the combination of 3D object detection with vision-language models, address the shortcomings of traditional methods in spatial reasoning and data complexity, and advance multimodal AI.

Method: Analyzes more than 100 papers, compares traditional methods based on point clouds and voxel grids with modern frameworks such as CLIP and 3D LLMs, and summarizes key architectures, pretraining strategies, and prompt engineering methods.

Result: Modern VLM frameworks support open-vocabulary detection and zero-shot generalization but face limited 3D-language datasets and high computational demands.

Conclusion: Better datasets and greater computational efficiency are needed to advance 3D object detection with vision-language models.

Abstract: This review provides a comprehensive survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to 3D object detection with vision-language models. We begin by outlining the unique challenges of 3D object detection with vision-language models, emphasizing differences from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective 3D object detection with vision-language models. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance 3D object detection with vision-language models.

Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI

[7] Dream-Box: Object-wise Outlier Generation for Out-of-Distribution Detection

Brian K. S. Isaac-Medina,Toby P. Breckon

Main category: cs.CV

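TL;DR: The paper proposes Dream-Box, which uses diffusion models to generate object-wise outliers in pixel space for training an object detector to perform OOD detection, while also making the outliers visualizable.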

Details

Motivation: To address the lack of visualization support in traditional OOD detection methods for object detection, and to explore the potential of diffusion models for generating outliers in pixel space.

Method: Uses diffusion models to generate object-wise outliers in pixel space to train an object detector, enabling both OOD detection and outlier visualization.

Result: Dream-Box matches traditional methods in OOD detection performance and is the first to provide concrete visualization of generated OOD objects.

Conclusion: Dream-Box offers a new approach to OOD detection that combines performance with visualization and opens a new direction for object detection.

Abstract: Deep neural networks have demonstrated great generalization capabilities for tasks whose training and test sets are drawn from the same distribution. Nevertheless, out-of-distribution (OOD) detection remains a challenging task that has received significant attention in recent years. Specifically, OOD detection refers to the detection of instances that do not belong to the training distribution, while still having good performance on the in-distribution task (e.g., classification or object detection). Recent work has focused on generating synthetic outliers and using them to train an outlier detector, generally achieving better OOD detection than traditional OOD methods. In this regard, outliers can be generated either in feature or pixel space. Feature-space methods have shown strong performance on both classification and object detection, but the training outliers cannot be visualized, which makes further analysis of OOD failure modes challenging. On the other hand, pixel-space outlier generation techniques enabled by diffusion models have been used for image classification, providing improved OOD detection performance and outlier visualization, although their adaptation to the object detection task is as yet unexplored. We therefore introduce Dream-Box, a method that provides a link to object-wise outlier generation in the pixel space for OOD detection. Specifically, we use diffusion models to generate object-wise outliers that are used to train an object detector for an in-distribution task and OOD detection. Our method achieves comparable performance to previous traditional methods while being the first technique to provide concrete visualization of generated OOD objects.

[8] Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos

Rezowan Shuvo,M S Mekala,Eyad Elyan

Main category: cs.CV

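TL;DR: The paper proposes a Multi-Stage Boundary-Aware Transformer Network (MSBATN) for action segmentation in surgical videos; hierarchical sliding window attention and a unified loss function significantly improve segmentation quality.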

Details

Motivation: Understanding actions in surgical workflows is essential for post-operative evaluation, but variation in surgeons' styles and ambiguous action boundaries make accurate segmentation difficult for traditional methods.

Method: Proposes MSBATN, which combines hierarchical sliding window attention with a boundary voting mechanism and treats action classification and boundary detection as distinct yet interdependent tasks.

Result: Experiments on three surgical datasets show that MSBATN achieves the best F1 scores and comparable performance on other metrics.

Conclusion: By improving the joint optimization of boundary detection and action classification, MSBATN significantly improves the accuracy of surgical action segmentation.

Abstract: Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions performed in surgical settings poses challenges, as individual surgeons have their unique approaches shaped by their expertise, leading to significant variability. To tackle this complex problem, we focused on segmentation with precise boundaries, a demanding task due to the inherent variability in action durations and the subtle transitions often observed in untrimmed videos. These transitions, marked by ambiguous starting and ending points, complicate the segmentation process. Traditional models, such as MS-TCN, which depend on large receptive fields, frequently face challenges of over-segmentation (resulting in fragmented segments) or under-segmentation (merging distinct actions). Both of these issues negatively impact the quality of segmentation. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments using three challenging surgical datasets demonstrate the superior performance of the proposed method, achieving state-of-the-art results in F1 scores at thresholds of 25% and 50%, while also delivering comparable performance in other metrics.
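The "F1 scores at thresholds of 25% and 50%" can be made concrete with a sketch of the segmental F1 metric as it is commonly defined in the action segmentation literature: a predicted segment counts as a true positive when its IoU with an unmatched ground-truth segment of the same label reaches the threshold. This is an assumed definition; the abstract does not state the exact evaluation protocol, and the function names are hypothetical.

```python
def to_segments(frames):
    """Collapse per-frame labels into (label, start, end) runs, end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))
            start = i
    return segments


def segmental_f1(pred_frames, true_frames, iou_threshold):
    """Segment-level F1 with one-to-one matching at the given IoU threshold."""
    pred = to_segments(pred_frames)
    true = to_segments(true_frames)
    matched = [False] * len(true)
    tp = fp = 0
    for label, ps, pe in pred:
        best_iou, best_idx = 0.0, -1
        for idx, (t_label, ts, te) in enumerate(true):
            if t_label != label or matched[idx]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)  # exact when segments overlap
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Ground truth: two actions. Prediction: the first action is fragmented.
true_frames = [0] * 10 + [1] * 10
over_segmented = [0] * 4 + [1] * 2 + [0] * 4 + [1] * 10
f1_at_25 = segmental_f1(over_segmented, true_frames, 0.25)
f1_at_50 = segmental_f1(over_segmented, true_frames, 0.50)
print(round(f1_at_25, 3), round(f1_at_50, 3))
```

The example prediction is over-segmented, so the extra fragments become false positives and the score drops as the threshold tightens. Frame-wise accuracy barely registers such fragmentation, which is why segment-level scores are reported alongside it.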

[9] PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

Manuel Weber,Carly Beneke

Main category: cs.CV

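TL;DR: PyViT-FUSE is a foundation model for earth observation data that fuses multi-modal image inputs through an attention mechanism, processes them with pyramidal vision transformers, and is trained with self-supervised learning.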

Details

Motivation: The need to process multi-modal earth observation data, in particular to fuse input bands of different resolutions.

Method: Fuses multi-modal inputs with an attention mechanism, uses vision transformers (ViT) with a pyramidal structure, and trains with self-supervised learning based on the SwAV algorithm.

Result: The model produces a single representation, demonstrates the interpretability of the fusion mechanism through visualization of attention scores, and is applicable to downstream tasks.

Conclusion: PyViT-FUSE is an effective, interpretable, and broadly applicable approach to multi-modal earth observation data.

Abstract: We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualization of the attention scores and the model's applicability to downstream tasks.

[10] Depth as Points: Center Point-based Depth Estimation

Zhiheng Tu,Xinjian Huang,Yong He,Ruiyang Zhou,Bo Du,Weitao Wu

Main category: cs.CV

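TL;DR: The paper presents an efficient method for generating virtual datasets and uses it to build the VirDepth dataset; it also designs CenterDepth, a lightweight depth estimation architecture that combines global semantics with multi-scale features and significantly improves depth estimation performance and efficiency.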

Details

Motivation: To address the complex data collection and high computational and hardware demands of perception tasks in autonomous driving.

Method: Develops an efficient virtual dataset generation method and builds the VirDepth dataset; designs the CenterDepth architecture, combining the Center FC-CRFs algorithm with multi-scale feature aggregation.

Result: Experiments show strong results in both computational speed and prediction accuracy.

Conclusion: The proposed method provides an efficient, high-performance solution for autonomous driving perception tasks.

Abstract: The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection and imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.

[11] IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic

Hassan Wasswa,Timothy Lynar,Aziida Nanyonga,Hussein Abbass

Main category: cs.CV

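TL;DR: The paper proposes a new preprocessing method that applies the vision transformer (ViT) to IoT botnet attack detection by transforming network flow data into 2D image shapes, and extends ViT to support multiple classifiers.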

Details

Motivation: Existing tools cannot capture sequential and spatial patterns in IoT network flow data, which limits the use of transformer models.

Method: Extracts features from .pcap files, transforms the data into 1-channel 2D image shapes, and extends ViT to support classifiers such as DNN, LSTM, and BLSTM.

Result: On two IoT attack datasets, DNN, LSTM, and BLSTM perform well in precision, recall, and F1-score.

Conclusion: The method successfully applies ViT to IoT attack detection and shows that several classifiers are competitive.

Abstract: Despite the demonstrated effectiveness of transformer models in NLP, and image and video classification, the available tools for extracting features from captured IoT network flow packets fail to capture sequential patterns, and the extracted features lack spatial patterns, which limits the application of transformer models. This work introduces a novel preprocessing method to adapt transformer models, the vision transformer (ViT) in particular, for IoT botnet attack detection using network flow packets. The approach involves feature extraction from .pcap files and transforming each instance into a 1-channel 2D image shape, enabling ViT-based classification. The ViT model was also extended to allow the use of any classifier besides the Multilayer Perceptron (MLP) deployed in the original ViT paper. Models including the conventional feed forward Deep Neural Network (DNN), LSTM and Bidirectional-LSTM (BLSTM) demonstrated competitive performance in terms of precision, recall, and F1-score for multiclass-based attack detection when evaluated on two IoT attack datasets.
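One way to picture "transforming each instance into a 1-channel 2D image shape" is the sketch below, which zero-pads a flat feature vector to the next perfect square and reshapes it. The padding scheme, the function name, and the example features are assumptions for illustration; the abstract does not describe the layout the authors actually use.

```python
import math


def features_to_image(features, pad_value=0.0):
    """Reshape a flat feature vector into a 1 x side x side nested list.

    Assumption: the vector is padded with pad_value up to the next perfect
    square so that it fills a square single-channel "image".
    """
    n = len(features)
    side = math.isqrt(n)
    if side * side < n:
        side += 1
    padded = list(features) + [pad_value] * (side * side - n)
    rows = [padded[r * side:(r + 1) * side] for r in range(side)]
    return [rows]  # the leading axis is the single channel


# Ten hypothetical per-flow features become a 1 x 4 x 4 image.
flow_features = [0.1 * i for i in range(10)]
image = features_to_image(flow_features)
print(len(image), len(image[0]), len(image[0][0]))
```

A tensor of this shape can then be cut into patches like any grayscale image, which is what lets a standard ViT patch-embedding layer consume tabular flow features.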

[12] CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval

Hang Yu,Jiahao Wen,Zhedong Zheng

Main category: cs.CV

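TL;DR: This paper proposes a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to improve model generalization in text-based person retrieval.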

Details

Motivation: Because of high annotation costs and privacy concerns, researchers usually rely on synthesized data for pretraining and fine-tuning, but such data carry domain biases that limit model scalability.

Method: Develops tasks that reflect the diversity of real-world scenarios, introduces a dynamic error sample memory unit, and adopts an adaptive dual-speed update strategy.

Result: Surpasses existing methods on several real-world benchmarks and shows robustness to synthetic data and noisy text.

Conclusion: The CAMeL framework significantly improves model generalization and scalability.

Abstract: Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history of errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.

[13] Video CLIP Model for Multi-View Echocardiography Interpretation

Ryo Takizawa,Satoshi Kodera,Tempei Kabayama,Ryo Matsuoka,Yuta Ando,Yuto Nakamura,Haruki Settai,Norihiko Takeda

Main category: cs.CV

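TL;DR: This paper proposes a video-language model with multi-view video input to improve the accuracy of automated interpretation of echocardiographic videos.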

Details

Motivation: Existing vision-language models based on single-frame images are less accurate for conditions that depend on cardiac motion, and the multi-view nature of echocardiographic videos is underused.

Method: Develops a video-language model that takes full video sequences from five different views as input, trained on pairs of echocardiographic videos and clinical reports from 60,747 cases.

Result: Experiments show that the model achieves higher diagnostic accuracy than models trained on single-view videos or still images.

Conclusion: A video-language model with multi-view video input significantly improves automated echocardiography interpretation.

Abstract: Echocardiography involves recording videos of the heart using ultrasound, enabling clinicians to evaluate its condition. Recent advances in large-scale vision-language models (VLMs) have garnered attention for automating the interpretation of echocardiographic videos. However, most existing VLMs proposed for medical interpretation thus far rely on single-frame (i.e., image) inputs. Consequently, these image-based models often exhibit lower diagnostic accuracy for conditions identifiable through cardiac motion. Moreover, echocardiographic videos are recorded from various views that depend on the direction of ultrasound emission, and certain views are more suitable than others for interpreting specific conditions. Incorporating multiple views could potentially yield further improvements in accuracy. In this study, we developed a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports from 60,747 cases. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained with only single-view videos or with still images.

[14] Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning

Yifan Xie,Fei Ma,Yi Bin,Ying He,Fei Yu

Main category: cs.CV

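TL;DR: The paper proposes a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation, improving model performance by jointly optimizing error and uncertainty.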

Details

Motivation: Existing systems pay little attention to learning visual uncertainty, which leads to inconsistent visual quality and unreliable performance.

Method: Designs an uncertainty module that predicts an error map and an uncertainty map, and matches their distributions through a KL divergence term approximated with a histogram technique.

Result: Experiments show higher fidelity and better audio-lip synchronization in talking face video generation.

Conclusion: By jointly learning error and uncertainty, JULNet significantly improves the quality and robustness of generated videos.

Abstract: Talking face video generation with arbitrary speech audio is a significant challenge within the realm of digital human technology. Previous studies have emphasized the significance of audio-lip synchronization and visual quality. Currently, limited attention has been given to the learning of visual uncertainty, which creates several issues in existing systems, including inconsistent visual quality and unreliable performance across different input conditions. To address the problem, we propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation, which incorporates a representation of uncertainty that is directly related to visual error. Specifically, we first design an uncertainty module to individually predict the error map and uncertainty map after obtaining the generated image. The error map represents the difference between the generated image and the ground truth image, while the uncertainty map is used to predict the probability of incorrect estimates. Furthermore, to match the uncertainty distribution with the error distribution through a KL divergence term, we introduce a histogram technique to approximate the distributions. By jointly optimizing error and uncertainty, the performance and robustness of our model can be enhanced. Extensive experiments demonstrate that our method achieves superior fidelity and audio-lip synchronization in talking face video generation compared to previous methods.
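The idea of matching the uncertainty distribution to the error distribution via histograms and a KL term can be sketched as follows. The bin count, the smoothing constant, the direction of the divergence, and the assumption that both maps are flattened and scaled to [0, 1] are all illustrative choices, not details from the paper. A training loss would also need a differentiable (soft) histogram, which this plain-Python version is not.

```python
import math


def histogram(values, bins=4, eps=1e-6):
    """Normalized histogram of values assumed to lie in [0, 1].

    eps smooths empty bins so that the KL divergence stays finite.
    """
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)  # v == 1.0 falls in the last bin
        counts[idx] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical flattened maps: per-pixel error and predicted uncertainty.
error_map = [0.05, 0.10, 0.60, 0.90, 0.15, 0.70]
good_uncertainty = [0.10, 0.05, 0.55, 0.95, 0.20, 0.65]
poor_uncertainty = [0.90, 0.95, 0.85, 0.80, 0.99, 0.75]

p_err = histogram(error_map)
kl_good = kl_divergence(p_err, histogram(good_uncertainty))
kl_poor = kl_divergence(p_err, histogram(poor_uncertainty))
print(kl_good, kl_poor)
```

An uncertainty map whose values are distributed like the errors yields a divergence near zero, while one that is uniformly high is penalized. Note that matching histograms only aligns the two distributions overall; it does not by itself force high uncertainty to sit on the same pixels as high error.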

[15] Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Shahad Albastaki,Anabia Sohail,Iyyakutti Iyappan Ganapathi,Basit Alawode,Asim Khan,Sajid Javed,Naoufel Werghi,Mohammed Bennamoun,Arif Mahmood

Main category: cs.CV

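TL;DR: The paper proposes a multi-resolution vision-language model that improves performance on computational pathology tasks through cross-resolution alignment.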

Details

Motivation: Single-resolution images carry limited information for tasks such as cancer subtype classification, so a multi-resolution approach is needed.

Method: Extracts patches from multi-resolution WSIs, generates textual descriptions, and enhances feature representation through cross-resolution alignment.

Result: Pre-trained on the TCGA dataset, the model outperforms existing methods.

Conclusion: The multi-resolution approach significantly improves performance on computational pathology tasks.

Abstract: In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Supported by novel loss functions, our model aims to capture a broader range of information, enrich feature representation, improve discriminative ability, and enhance generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP

[16] Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras

Yunzhong Zhang,Bo Xiong,You Zhou,Changqing Su,Zhen Cheng,Zhaofei Yu,Xun Cao,Tiejun Huang

Main category: cs.CV

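TL;DR: This paper proposes Spike Imaging Velocimetry (SIV), a spike-camera-based deep learning framework for particle image velocimetry (PIV) in highly turbulent, complex flow fields; it combines a DPHT module and a GE module, releases the PSSD dataset, and outperforms existing methods.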

Details

Motivation: Traditional PIV methods fall short in highly turbulent, complex flow fields, so more accurate, non-intrusive measurement methods are needed.

Method: Proposes the SIV framework, which uses a DPHT module to preserve detail and a GE module to extract features of complex flow fields, and builds the PSSD dataset.

Result: SIV outperforms existing baseline methods on the PSSD dataset.

Conclusion: SIV offers an efficient solution for velocimetry in complex flow fields, and the dataset and implementation are open-sourced.

Abstract: The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.

[17] PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance

Jiaxu Leng,Zhanjie Wu,Mingpi Tan,Mengjingcheng Mo,Jiankang Zheng,Qingqing Li,Ji Gan,Xinbo Gao

Main category: cs.CV

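TL;DR: PiercingEye is a dual-space learning framework that combines Euclidean and hyperbolic geometries and uses hierarchical modeling and a cross-space attention mechanism to improve discriminative feature representation for video violence detection.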

Details

Motivation: Existing weakly supervised video violence detection methods rely on Euclidean representation learning, struggle to distinguish visually similar but semantically different events, and lack ambiguous training samples.

Method: PiercingEye combines Euclidean and hyperbolic geometries, adopts a layer-sensitive hyperbolic aggregation strategy and a cross-space attention mechanism, and uses large language models to generate ambiguous event descriptions for supervision.

Result: On the XD-Violence and UCF-Crime benchmarks, PiercingEye achieves state-of-the-art performance, with especially strong results on the ambiguous event subset.

Conclusion: Through dual-space learning and logic-guided supervision, PiercingEye significantly improves fine-grained violence detection.

Abstract: Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.
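Why hyperbolic geometry suits hierarchies can be seen from the distance function of the Poincaré ball, a widely used model of hyperbolic space. Whether PiercingEye uses this particular model is not stated in the abstract, so the choice is an assumption and the sketch only contrasts the two geometries.

```python
import math


def squared_norm(x):
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff_sq / denom)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_edge_a = (0.0, 0.95)
near_edge_b = (0.0, -0.95)

print(poincare_distance(origin, near_center))        # equals ln(3)
print(euclidean_distance(near_edge_a, near_edge_b))  # 1.9
print(poincare_distance(near_edge_a, near_edge_b))   # much larger than 1.9
```

Distances grow rapidly toward the boundary of the ball, so many leaf-level concepts can be placed near the edge while staying far apart from each other, with shared parent concepts nearer the center. This is the usual motivation for embedding hierarchies in hyperbolic rather than Euclidean space.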

[18] WLTCL: Wide Field-of-View 3-D LiDAR Truck Compartment Automatic Localization System

Guodong Sun,Mingjing Li,Dingjie Liu,Mingxuan Liu,Bo Wu,Yang Zhang

Main category: cs.CV

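TL;DR: The paper proposes an automatic truck compartment localization system based on wide field-of-view 3D LiDAR, addressing existing methods' poor adaptability to compartments of different sizes, lack of a unified coordinate system, and low reliability.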

Details

Motivation: Precise automatic localization of truck compartments is key to logistics automation, but existing methods adapt poorly to compartments of different sizes and are unreliable in cluttered environments.

Method: Uses wide field-of-view 3D LiDAR to generate high-density point clouds, segments the compartment point cloud with parking area constraints, and locates key points from geometric features.

Result: The system shows high localization accuracy and low computational resource consumption on the authors' collected data and on public datasets.

Conclusion: The system has potential for application and adoption in logistics automation.

Abstract: As an essential component of logistics automation, the automated loading system is becoming a critical technology for enhancing operational efficiency and safety. Precise automatic positioning of the truck compartment, which serves as the loading area, is the primary step in automated loading. However, existing methods have difficulty adapting to truck compartments of various sizes, do not establish a unified coordinate system for LiDAR and mobile manipulators, and often exhibit reliability issues in cluttered environments. To address these limitations, our study focuses on achieving precise automatic positioning of key points in large, medium, and small fence-style truck compartments in cluttered scenarios. We propose an innovative wide field-of-view 3-D LiDAR vehicle compartment automatic localization system. For vehicles of various sizes, this system leverages the LiDAR to generate high-density point clouds within an extensive field-of-view range. By incorporating parking area constraints, our vehicle point cloud segmentation method more effectively segments vehicle point clouds within the scene. Our compartment key point positioning algorithm utilizes the geometric features of the compartments to accurately locate the corner points, providing stackable spatial regions. Extensive experiments on our collected data and public datasets demonstrate that this system offers reliable positioning accuracy and reduced computational resource consumption, supporting its application and adoption in relevant fields.

[19] Exploiting Multiple Representations: 3D Face Biometrics Fusion with Application to Surveillance

Simone Maurizio La Cava,Roberto Casula,Sara Concas,Giulia Orrù,Ruben Tolosana,Martin Drahansky,Julian Fierrez,Gian Luca Marcialis

Main category: cs.CV

TL;DR: Investigates how multiple 3D face reconstruction algorithms can improve face recognition in challenging scenarios, and how fusion methods can make biometric recognition more robust.

Details

Motivation: 3D face reconstruction algorithms designed for different application scenarios each have limitations; the study aims to improve the generalization and robustness of face recognition systems by fusing several of them. Method: Parametric and non-parametric score-level fusion methods are applied across multiple 3D face reconstruction algorithms, with a comprehensive analysis under varying conditions (distance, camera setup, intra-dataset and cross-dataset). Result: Experiments show that the distinct information provided by different 3D face reconstruction algorithms alleviates the multi-scenario generalization problem, and that fusion strategies can significantly improve system reliability. Conclusion: The study shows the potential of fusion in 3D face recognition systems, offers key insights for real-world applications, and may extend to other face biometrics tasks. Abstract: 3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to the limits and characteristics of the different application scenarios. In this study, we investigate how multiple state-of-the-art 3DFR algorithms can be used to generate a better representation of subjects, with the final goal of improving the performance of face recognition systems in challenging uncontrolled scenarios. We also explore how different parametric and non-parametric score-level fusion methods can exploit the unique strengths of multiple 3DFR algorithms to enhance biometric recognition robustness. With this goal, we propose a comprehensive analysis of several face recognition systems across diverse conditions, such as varying distances and camera setups, intra-dataset and cross-dataset, to assess the robustness of the proposed ensemble method. The results demonstrate that the distinct information provided by different 3DFR algorithms can alleviate the problem of generalizing over multiple application scenarios. In addition, the present study highlights the potential of advanced fusion strategies to enhance the reliability of 3DFR-based face recognition systems, providing the research community with key insights to exploit them in real-world applications effectively. Although the experiments are carried out in a specific face verification setup, our proposed fusion-based 3DFR methods may be applied to other tasks around face biometrics that are not strictly related to identity recognition.
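
Score-level fusion, as mentioned in the abstract above, combines the match scores of several recognition systems into one score per comparison. The sketch below is a generic illustration rather than the paper's method. It assumes min-max normalization per system, a fixed mean rule as an example of a rule with no trained parameters, and a weighted sum as an example of a rule whose weights would normally be learned; the weights and function names are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one system's scores to [0, 1] so systems become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: all scores identical
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_mean(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Fixed rule: average the normalized scores of all systems per comparison."""
    normalized = [min_max_normalize(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normalized)]


def fuse_weighted(
    score_lists: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sum of normalized scores; weights are assumed to be given."""
    normalized = [min_max_normalize(s) for s in score_lists]
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*normalized)]
```

Each inner list holds one system's scores for the same ordered set of comparisons, so `zip(*normalized)` walks through the comparisons one at a time.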

[20] Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

Yufeng Wu,Xin Liao,Baowei Wang,Han Fang,Xiaoshuai Wu,Guiling Wang

Main category: cs.CV

TL;DR: Proposes Simulation-to-Real (S2R), an unsupervised noise layer that addresses noise simulation in screen-camera (SC) image watermarking and significantly improves watermark robustness and generalization.

Details

Motivation: Existing approaches (heuristic mathematical modeling and supervised neural networks) cannot effectively approximate SC noise, leaving watermarks insufficiently robust. Mathematical simulation is biased, while supervised networks need paired data and struggle to learn all features of the noise. Method: S2R uses unsupervised learning to model the discrepancy between the simulated noise distribution and the real SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Result: Experiments show that S2R outperforms existing techniques in robustness and generalization. Conclusion: By simplifying the noise-distribution learning task, S2R effectively addresses SC noise simulation and offers a better solution for watermarking. Abstract: Unauthorized screen capturing and dissemination pose severe security threats such as data leakage and information theft. Several studies propose robust watermarking methods to track the copyright of Screen-Camera (SC) images, facilitating post-hoc certification against infringement. These techniques typically employ heuristic mathematical modeling or supervised neural network fitting as the noise layer, to enhance watermarking robustness against SC. However, both strategies cannot fundamentally achieve an effective approximation of SC noise. Mathematical simulation suffers from biased approximations due to the incomplete decomposition of the noise and the absence of interdependence among the noise components. Supervised networks require paired data to train the noise-fitting model, and it is difficult for the model to learn all the features of the noise. To address the above issues, we propose Simulation-to-Real (S2R). Specifically, an unsupervised noise layer employs unpaired data to learn the discrepancy between the modeling simulated noise distribution and the real-world SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Learning this transformation from simulation to reality is inherently simpler, as it primarily involves bridging the gap in noise distributions, instead of the complex task of reconstructing fine-grained image details. Extensive experimental results validate the efficacy of the proposed method, demonstrating superior watermark robustness and generalization compared to those of state-of-the-art methods.

[21] Kinship Verification through a Forest Neural Network

Ali Nazari,Mohsen Ebrahimi Moghaddam,Omidreza Borzoei

Main category: cs.CV

TL;DR: Proposes a graph-neural-network-based approach to kinship verification with a new loss combination that incorporates center loss, achieving the best result on KinFaceW-II.

Details

Motivation: Early methods used face representations for kinship verification with lower accuracy; joint-representation methods are more accurate but computationally complex. Method: Uses graph neural network concepts together with face representations, designs the structure of the classification module, and introduces a new loss combination that incorporates center loss. Result: An average improvement of about 1.6 on KinFaceW-II, and close to the best result on KinFaceW-I. Conclusion: The method performs well for kinship verification, especially on KinFaceW-II. Abstract: Early methods used face representations in kinship verification, which are less accurate than joint representations of parents' and children's facial images learned from scratch. We propose an approach featuring graph neural network concepts to utilize face representations and have comparable results to joint representation algorithms. Moreover, we designed the structure of the classification module and introduced a new combination of losses to engage the center loss gradually in training our network. Additionally, we conducted experiments on KinFaceW-I and II, demonstrating the effectiveness of our approach. We achieved the best result on KinFaceW-II, an average improvement of nearly 1.6 for all kinship types, and we were near the best on KinFaceW-I. The code is available at https://github.com/ali-nazari/Kinship-Verification
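
The abstract above mentions engaging the center loss gradually during training. As a rough illustration only, the sketch below computes the standard center loss (half the mean squared distance between each feature and its class center) and scales it with a linear warm-up weight. The linear schedule, the parameter names `warmup_epochs` and `max_weight`, and both function names are assumptions; the paper's actual loss combination is not specified in the abstract.

```python
from typing import Dict, Hashable, Sequence


def center_loss(
    features: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    centers: Dict[Hashable, Sequence[float]],
) -> float:
    """Half the mean squared Euclidean distance of each feature to its class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))
    return total / len(features)


def ramp_weight(epoch: int, warmup_epochs: int, max_weight: float) -> float:
    """Hypothetical linear warm-up: 0 at epoch 0, max_weight from warmup_epochs on."""
    return max_weight * min(1.0, epoch / warmup_epochs)
```

Under these assumptions, a training loop would add `ramp_weight(epoch, warmup_epochs, max_weight) * center_loss(...)` to the classification loss, so the center term has little influence in early epochs.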

[22] R-Sparse R-CNN: SAR Ship Detection Based on Background-Aware Sparse Learnable Proposals

Kamirul Kamirul,Odysseas Pappas,Alin Achim

Main category: cs.CV

TL;DR: R-Sparse R-CNN is a new method for oriented ship detection in SAR images; it improves detection accuracy with sparse learnable proposals and background-aware proposals (BAPs), combined with Dual-Context Pooling (DCP) and an interaction module, and clearly outperforms existing methods.

Details

Motivation: Address the challenge of ship detection in complex environments in SAR images by integrating background information to improve accuracy. Method: Sparse proposals (BAPs) streamline the pipeline; DCP jointly extracts ship and background features; a transformer-based Interaction Module models their relationships. Result: Gains of up to 12.8% and 11.9% on the SSDD and RSDD-SAR inshore datasets, respectively. Conclusion: R-Sparse R-CNN is an efficient and accurate framework for ship detection in SAR images. Abstract: We introduce R-Sparse R-CNN, a novel pipeline for oriented ship detection in Synthetic Aperture Radar (SAR) images that leverages sparse learnable proposals enriched with background contextual information, termed background-aware proposals (BAPs). The adoption of sparse proposals streamlines the pipeline by eliminating the need for proposal generators and post-processing for overlapping predictions. The proposed BAPs enrich object representation by integrating ship and background features, allowing the model to learn their contextual relationships for more accurate distinction of ships in complex environments. To complement BAPs, we propose Dual-Context Pooling (DCP), a novel strategy that jointly extracts ship and background features in a single unified operation. This unified design improves efficiency by eliminating redundant computation inherent in separate pooling. Moreover, by ensuring that ship and background features are pooled from the same feature map level, DCP provides aligned features that improve contextual relationship learning. Finally, as a core component of contextual relationship learning in R-Sparse R-CNN, we design a dedicated transformer-based Interaction Module. This module interacts pooled ship and background features with corresponding proposal features and models their relationships. Experimental results show that R-Sparse R-CNN delivers outstanding accuracy, surpassing state-of-the-art models by margins of up to 12.8% and 11.9% on SSDD and RSDD-SAR inshore datasets, respectively. These results demonstrate the effectiveness and competitiveness of R-Sparse R-CNN as a robust framework for oriented ship detection in SAR imagery. The code is available at: www.github.com/ka-mirul/R-Sparse-R-CNN.

[23] 3DPyranet Features Fusion for Spatio-temporal Feature Learning

Ihsan Ullah,Alfredo Petrosino

Main category: cs.CV

TL;DR: Proposes 3DPyraNet, a 3D pyramidal neural network, and its variant 3DPyraNet-F for spatio-temporal feature learning; they reduce parameters and computational cost and perform well in video action recognition.

Details

Motivation: Deep variants of conventional CNNs add parameters and partly lose the advantage of having few parameters, so a new network structure is needed to stay efficient. Method: 3DPyraNet uses a new weighting scheme to learn spatio-temporal features; 3DPyraNet-F extracts features and feeds them to a linear SVM classifier. Result: 3DPyraNet performs well in real-world environments, and 3DPyraNet-F outperforms existing methods on three benchmark datasets. Conclusion: 3DPyraNet and its variant improve video action recognition while reducing parameters and computational cost. Abstract: A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While CNNs have proven effective in recognizing handwritten characters, traffic sign boards, and the like, their deep variants have recently proven effective in similar as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly reducing the main advantage of CNNs, which is having fewer parameters. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme which learns features from both spatial and temporal dimensions, analyzing multiple adjacent frames and keeping a biologically plausible structure. It keeps the spatial topology of the input image and presents fewer parameters and lower computational and memory costs compared to both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and provides it as input to a linear SVM classifier in a way that enhances the recognition of human actions and dynamic scenes from videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Further, 3DPyraNet-F clearly outperforms the state-of-the-art on three benchmark datasets and shows comparable results on the fourth.

[24] MediAug: Exploring Visual Augmentation in Medical Imaging

Xuyin Qi,Zeyu Zhang,Canxuan Gang,Hao Zhang,Lei Zhang,Zhiwei Zhang,Yang Zhao

Main category: cs.CV

TL;DR: Proposes the MediAug framework, which systematically evaluates six mix-based augmentation methods in medical imaging; MixUp and SnapMix work best for brain tumor classification with ResNet-50 and ViT-B, respectively, while YOCO and CutMix work best for eye disease classification.

Details

Motivation: Address two challenges for data augmentation in medical imaging: the domain gap between natural and medical images, and the fragmented, limited scope of existing studies. Method: A unified evaluation framework, MediAug, combines six mix-based augmentation methods (MixUp, YOCO, CropMix, CutMix, AugMix, SnapMix) with two backbones (ResNet-50, ViT-B) on brain tumor MRI and eye disease fundus datasets. Result: For brain tumor classification, MixUp reaches 79.19% accuracy with ResNet-50 and SnapMix reaches 99.44% with ViT-B; for eye disease classification, YOCO reaches 91.60% with ResNet-50 and CutMix reaches 97.94% with ViT-B. Conclusion: MediAug provides a comprehensive, reproducible benchmark for data augmentation in medical imaging, shows that mix-based augmentation is effective, and reveals which methods suit which tasks and architectures. Abstract: Data augmentation is essential in medical imaging for improving classification accuracy, lesion detection, and organ segmentation under limited data conditions. However, two significant challenges remain. First, a pronounced domain gap between natural photographs and medical images can distort critical disease features. Second, augmentation studies in medical imaging are fragmented and limited to single tasks or architectures, leaving the benefits of advanced mix-based strategies unclear. To address these challenges, we propose a unified evaluation framework with six mix-based augmentation methods integrated with both convolutional and transformer backbones on brain tumour MRI and eye disease fundus datasets. Our contributions are threefold. (1) We introduce MediAug, a comprehensive and reproducible benchmark for advanced data augmentation in medical imaging. (2) We systematically evaluate MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix with ResNet-50 and ViT-B backbones. (3) We demonstrate through extensive experiments that MixUp yields the greatest improvement on the brain tumor classification task for ResNet-50 with 79.19% accuracy and SnapMix yields the greatest improvement for ViT-B with 99.44% accuracy, and that YOCO yields the greatest improvement on the eye disease classification task for ResNet-50 with 91.60% accuracy and CutMix yields the greatest improvement for ViT-B with 97.94% accuracy. Code will be available at https://github.com/AIGeeksGroup/MediAug.
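
Of the six methods evaluated above, MixUp is the simplest to state: two samples and their one-hot labels are blended with the same coefficient drawn from a Beta distribution. The sketch below shows that blending on flat feature lists; it is a generic illustration and not taken from the MediAug code. The default `alpha=0.2` and the `rng` parameter are assumptions made for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    alpha: float = 0.2,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y, lam
```

With a small `alpha` most draws of `lam` fall near 0 or 1, so a mixed sample usually stays close to one of its two sources.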

[25] VISUALCENT: Visual Human Analysis using Dynamic Centroid Representation

Niaz Ahmad,Youngmoon Lee,Guanghui Wang

Main category: cs.CV

TL;DR: VISUALCENT is a unified human pose and instance segmentation framework that uses centroid-based bottom-up keypoint detection and a dynamic centroid technique to improve the generalizability and scalability of multi-person visual analysis.

Details

Motivation: Address generalizability and scalability limitations in multi-person visual analysis. Method: A centroid-based bottom-up keypoint detection paradigm, combining keypoint heatmaps with a dynamic centroid (MaskCentroid) for pixel clustering. Result: Strong results on the COCO and OCHuman datasets, with mAP scores and frame rate better than existing methods. Conclusion: VISUALCENT offers advantages in accuracy and real-time performance and is suited to complex scenes. Abstract: We introduce VISUALCENT, a unified human pose and instance segmentation framework to address generalizability and scalability limitations in multi-person visual human analysis. VISUALCENT leverages a centroid-based bottom-up keypoint detection paradigm and uses a Keypoint Heatmap incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid to swiftly cluster pixels to a specific human instance during rapid changes in human body movement or in significantly occluded environments. Experimental results on COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real-time performance advantages, outperforming existing methods in mAP scores and execution frame rate. The implementation is available on the project page.

[26] Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

Mohammad Mahdi Abootorabi,Omid Ghahroodi,Pardis Sadat Zahraei,Hossein Behzadasl,Alireza Mirrokni,Mobina Salimipanah,Arash Rasouli,Bahar Behzadipour,Sara Azarnoush,Benyamin Maleki,Erfan Sadraiye,Kiarash Kiani Feriz,Mahdi Teymouri Nahad,Ali Moghadasi,Abolfazl Eshagh Abianeh,Nizi Nazar,Hamid R. Rabiee,Mahdieh Soleymani Baghshah,Meisam Ahmadi,Ehsaneddin Asgari

Main category: cs.CV

TL;DR: Generative AI is reshaping animation; this survey reviews its applications to character animation, including facial animation, expression rendering, and motion synthesis, and provides background, datasets, and future research directions.

Details

Motivation: Generative AI for character animation is advancing quickly, so a comprehensive survey is needed to integrate scattered research and give researchers and developers a clear view of the field. Method: Analyzes leading research and practice in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, and related areas, together with commonly used datasets and evaluation metrics, and provides a comprehensive background section. Result: Reviews recent progress of generative AI in character animation, summarizes research, practical deployments, and emerging trends in each area, and provides open-source resources. Conclusion: The survey is a comprehensive reference for researchers and developers working on generative AI for character animation, and identifies open challenges and future research directions. Abstract: Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.

[27] Dual-Branch Residual Network for Cross-Domain Few-Shot Hyperspectral Image Classification with Refined Prototype

Anyong Qin,Chaoqi Yuan,Qiang Li,Feng Yang,Tiecheng Song,Chenqiang Gao

Main category: cs.CV

TL;DR: Proposes a dual-branch residual network that combines spatial and spectral features, with refined prototypes and a kernel probability matching strategy, to improve HSI classification in few-shot scenarios.

Details

Motivation: Address the high computational cost, weak generalization, and poor cross-dataset adaptability of 3D CNNs in HSI classification. Method: A dual-branch residual network integrates spatial and spectral features, with refined prototypes and a kernel probability matching strategy. Result: Outperforms other methods on four public HSI datasets. Conclusion: The method improves few-shot HSI classification performance and cross-dataset adaptability. Abstract: Convolutional neural networks (CNNs) are effective for hyperspectral image (HSI) classification, but their 3D convolutional structures introduce high computational costs and limited generalization in few-shot scenarios. Domain shifts caused by sensor differences and environmental variations further hinder cross-dataset adaptability. Metric-based few-shot learning (FSL) prototype networks mitigate this problem, yet their performance is sensitive to prototype quality, especially with limited samples. To overcome these challenges, a dual-branch residual network that integrates spatial and spectral features via parallel branches is proposed in this letter. Additionally, more robust refined prototypes are obtained through a regulation term. Furthermore, a kernel probability matching strategy aligns source and target domain features, alleviating domain shift. Experiments on four publicly available HSI datasets illustrate that the proposal achieves superior performance compared to other methods.
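
The metric-based prototype networks mentioned in the abstract above classify a query by its distance to per-class prototypes, which is why prototype quality matters when only a few support samples exist. The sketch below shows only that baseline idea, with a prototype taken as the plain mean of the support embeddings; the paper's refinement through a regulation term is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def class_prototypes(
    embeddings: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """Prototype of each class = mean of its support embeddings."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_prototype(
    query: Sequence[float], prototypes: Dict[Hashable, Sequence[float]]
) -> Hashable:
    """Assign the query to the class whose prototype is closest (Euclidean)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))
```

With one or two support samples per class, a single outlier shifts the mean noticeably, which is the sensitivity the abstract refers to.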

[28] HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Qiuhui Chen,Jintao Wang,Gang Wang,Yi Hong

Main category: cs.CV

TL;DR: HoloDx is a framework that combines domain knowledge with multimodal clinical data, dynamically integrating expert knowledge and LLMs to improve the accuracy and interpretability of Alzheimer's disease diagnosis.

Details

Motivation: Existing methods struggle to fully exploit multimodal information and dynamic domain knowledge; HoloDx aims to address this. Method: HoloDx uses a knowledge injection module and a memory injection module, which dynamically integrate domain knowledge and subject-specific information through knowledge-aware gated cross-attention and prototypical memory attention, respectively. Result: On five AD datasets, HoloDx outperforms existing methods, with higher diagnostic accuracy and better generalization. Conclusion: By combining domain knowledge with data-driven methods, HoloDx significantly improves the performance and interpretability of AD diagnosis. Abstract: Accurate diagnosis of Alzheimer's disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.

[29] Learning to Drive from a World Model

Mitchell Goff,Greg Hogan,George Hotz,Armand du Parc Locmaria,Kacper Raczy,Harald Schäfer,Adeeb Shihadeh,Weixing Zhang,Yassine Yousfi

Main category: cs.CV

TL;DR: Proposes an end-to-end training architecture that uses real driving data to train a driving policy in a simulator, without hand-coded driving rules.

Details

Motivation: Existing self-driving systems rely on hand-coded perception outputs and driving rules, whereas learning directly from human driving data can simplify the architecture and scale better. Method: Two simulation methods are used to train the driving policy: reprojective simulation and a learned world model. Result: Both methods train driving policies without hand-coded rules, which perform well in closed-loop simulation and in a real-world advanced driver-assistance system. Conclusion: End-to-end learning shows promise for autonomous driving, simplifying the system and improving scalability. Abstract: Most self-driving systems rely on hand-coded perception outputs and engineered driving rules. Learning directly from human driving data with an end-to-end method can allow for a training architecture that is simpler and scales well with compute and data. In this work, we propose an end-to-end training architecture that uses real driving data to train a driving policy in an on-policy simulator. We show two different methods of simulation, one with reprojective simulation and one with a learned world model. We show that both methods can be used to train a policy that learns driving behavior without any hand-coded driving rules. We evaluate the performance of these policies in a closed-loop simulation and when deployed in a real-world advanced driver-assistance system.

[30] MIA-Mind: A Multidimensional Interactive Attention Mechanism Based on MindSpore

Zhenkai Qin,Jiaquan Liang,Qiao Fang

Main category: cs.CV

TL;DR: MIA-Mind is a lightweight multidimensional interactive attention mechanism that improves deep learning performance by jointly modeling spatial and channel features.

Details

Motivation: Existing attention mechanisms model channel and spatial features independently, ignoring their interdependence and limiting effectiveness. Method: MIA-Mind, built on the MindSpore framework, jointly models spatial and channel features with a cross-attentive fusion strategy. Result: Accuracies of 82.9%, 78.7%, and 91.9% on CIFAR-10, ISBI2012, and CIC-IDS2017, respectively. Conclusion: MIA-Mind is lightweight and generalizes across tasks; future work will extend it to large-scale datasets and distributed deployment. Abstract: Attention mechanisms have significantly advanced deep learning by enhancing feature representation through selective focus. However, existing approaches often independently model channel importance and spatial saliency, overlooking their inherent interdependence and limiting their effectiveness. To address this limitation, we propose MIA-Mind, a lightweight and modular Multidimensional Interactive Attention Mechanism, built upon the MindSpore framework. MIA-Mind jointly models spatial and channel features through a unified cross-attentive fusion strategy, enabling fine-grained feature recalibration with minimal computational overhead. Extensive experiments are conducted on three representative datasets: on CIFAR-10, MIA-Mind achieves an accuracy of 82.9%; on ISBI2012, it achieves an accuracy of 78.7%; and on CIC-IDS2017, it achieves an accuracy of 91.9%. These results validate the versatility, lightweight design, and generalization ability of MIA-Mind across heterogeneous tasks. Future work will explore the extension of MIA-Mind to large-scale datasets, the development of adaptive attention fusion strategies, and distributed deployment to further enhance scalability and robustness.

[31] Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction

Xiaoran Xu,Jiangang Yang,Wenyue Chong,Wenhui Shi,Shichu Sun,Jing Xing,Jian Liu

Main category: cs.CV

TL;DR: Proposes a new cross-modal feature learning method that improves single-domain generalized object detection (S-DGOD) through dynamic interaction between fine-grained textual and visual features.

Details

Motivation: Address the weakness of existing S-DGOD methods in learning fine-grained region- and object-level features, and handle the diverse domain shifts in multimedia applications. Method: A Cross-modal and Region-aware Feature Interaction mechanism, plus a Cross-domain Proposal Refining and Mixing strategy that strengthens the detector's localization ability in unseen scenarios. Result: New state-of-the-art results on S-DGOD benchmarks, with mPC gains of 8.8% on Cityscapes-C and 7.9% on DWD. Conclusion: Fine-grained cross-modal interaction and cross-domain proposal refinement significantly improve S-DGOD performance. Abstract: Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level (e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8% mPC on Cityscapes-C and +7.9% mPC on DWD over baselines, demonstrating its efficacy.

[32] Towards Latency-Aware 3D Streaming Perception for Autonomous Driving

Jiaqi Peng,Tai Wang,Jiangmiao Pang,Yuan Shen

Main category: cs.CV

TL;DR: Proposes a new benchmark that accounts for runtime latency on edge devices, and the LASP framework, which improves 3D perception through latency-aware history integration and predictive detection modules.

Details

Motivation: Existing 3D perception algorithms are hard to deploy on edge devices because of runtime latency; the impact of latency on performance needs to be addressed. Method: The LASP framework, comprising latency-aware history integration and a latency-aware predictive detection module. Result: On the Jetson AGX Orin, online performance reaches about 80% of the offline evaluation without acceleration techniques. Conclusion: LASP copes with different latency levels and makes 3D perception more practical on edge devices. Abstract: Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical features regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.
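
The predictive detection component described above compensates a detection for the time that passed while it was being computed. As a deliberately simplified sketch, the code below shifts a detected object center along an estimated velocity by the measured latency. The constant-velocity assumption and the function name `compensate_for_latency` are illustrative choices; the paper's module works from a predicted trajectory, which is not reproduced here.

```python
from typing import Sequence, Tuple


def compensate_for_latency(
    center: Sequence[float],
    velocity: Sequence[float],
    latency_s: float,
) -> Tuple[float, ...]:
    """Move a detected center to where the object should be after latency_s seconds.

    Assumes constant velocity over the latency interval (a simplification).
    """
    return tuple(c + v * latency_s for c, v in zip(center, velocity))
```

For example, an object detected at x = 10 m moving at 5 m/s, reported 0.2 s late, would be placed at roughly x = 11 m under this assumption.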

Zhongxuan Li

Main category: cs.CV

TL;DR: Reviews key approaches to blind source separation (BSS), from classical independent component analysis (ICA) to sparsity-based methods, proposes an improved block-sparse dictionary learning algorithm (SAC+BK-SVD), and validates it experimentally.

Details

Motivation: Classical ICA relies on the assumption that the sources are mutually independent, which is limiting in practice. Sparse representation theory offers a new approach to BSS; the report explores sparsity-based methods and improves on existing techniques. Method: Introduces sparse representation and decomposition theory, presents a block coordinate relaxation MCA algorithm and its variants (MMCA and GMCA), and improves K-SVD into SAC+BK-SVD for block-sparse dictionary learning. Result: Experiments show that SAC+BK-SVD outperforms standard K-SVD in blind image separation. Conclusion: Sparse representation and block-sparse dictionary learning provide an effective solution for BSS, and the improved algorithm has a clear advantage in image separation. Abstract: Blind source separation (BSS) is a key technique in array processing and data analysis, aiming to recover unknown sources from observed mixtures without knowledge of the mixing matrix. Classical independent component analysis (ICA) methods rely on the assumption that sources are mutually independent. To address limitations of ICA, sparsity-based methods have been introduced, which decompose source signals sparsely in a predefined dictionary. Morphological Component Analysis (MCA), based on sparse representation theory, assumes that a signal is a linear combination of components with distinct geometries, each sparsely representable in one dictionary and not in others. This approach has recently been applied to BSS with promising results. This report reviews key approaches derived from classical ICA and explores sparsity-based methods for BSS. It introduces the theory of sparse representation and decomposition, followed by a block coordinate relaxation MCA algorithm, whose variants are used in Multichannel MCA (MMCA) and Generalized MCA (GMCA). A local dictionary learning method using K-SVD is then presented. Finally, we propose an improved algorithm, SAC+BK-SVD, which enhances K-SVD by learning a block-sparsifying dictionary that clusters and updates similar atoms in blocks. The implementation includes experiments on image segmentation and blind image source separation using the discussed techniques. We also compare the proposed block-sparse dictionary learning algorithm with K-SVD. Simulation results demonstrate that our method yields improved blind image separation quality.

[34] DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning

Jialang Lu,Huayu Zhao,Huiyu Zhai,Xingxing Yang,Shini Han

Main category: cs.CV

TL;DR: Proposes DeepSPG, a deep semantic prior-guided framework based on Retinex image decomposition that explores semantic information through a pre-trained semantic segmentation model and multimodal learning, significantly improving low-light image enhancement.

Details

Motivation: Existing low-light image enhancement methods ignore the semantic information of different regions, especially extremely dark regions that suffer severe information loss. Method: Combines image-level and text-level semantic priors, designs a multi-scale semantic-aware structure, and guides the enhancement process with pre-trained models. Result: Outperforms existing methods on five benchmark datasets. Conclusion: With multimodal semantic prior guidance, DeepSPG significantly improves low-light image enhancement. Abstract: There has long been a belief that high-level semantics learning can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brutal mapping between low-light and normal-light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both image-level semantic prior and text-level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image-level semantic prior guidance by leveraging hierarchical semantic features from a pre-trained semantic segmentation model; a text-level semantic prior guidance by integrating natural language semantic constraints via a pre-trained vision-language model; a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Eventually, our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at https://github.com/Wenyuzhy/DeepSPG.

Huiling Zheng,Xian Zhong,Bin Liu,Yi Xiao,Bihan Wen,Xiaofeng Li

Main category: cs.CV

TL;DR: Proposes a frequency-domain Phase-Amplitude Decoupling (PAD) method for SAR and RGB image fusion in land cover classification, addressing modality heterogeneity and the underuse of spectral complementarity.

Details

Motivation: Existing methods struggle to decouple shared structural features from modality-specific radiometric attributes, causing feature conflicts and information loss. Method: The PAD framework separates phase (modality-shared) and amplitude (modality-specific) components in the Fourier domain, and comprises Phase Spectrum Correction (PSC) and Amplitude Spectrum Fusion (ASF). Result: State-of-the-art performance on the WHU-OPT-SAR and DDHR-SK datasets. Conclusion: PAD offers a new paradigm for multi-modal fusion in remote sensing; the code will be released. Abstract: The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and the underutilization of spectral complementarity. Existing methods often fail to decouple shared structural features from modality-specific radiometric attributes, leading to feature conflicts and information loss. To address this issue, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-specific) components in the Fourier domain. Specifically, PAD consists of two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features through convolution-guided scaling to enhance geometric consistency, and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high-frequency details and low-frequency structures using frequency-adaptive multilayer perceptrons. This approach leverages SAR's sensitivity to morphological features and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance. Our work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
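
The decoupling step described above rests on a basic property of the Fourier transform: every complex coefficient can be split into an amplitude and a phase and put back together without loss. The sketch below demonstrates that split on a 1-D signal with a naive O(n^2) DFT so that it has no dependencies. It is an illustration of the underlying operation only: the paper works on 2-D features and adds the learned PSC and ASF modules, none of which appear here, and all function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum: Sequence[complex]) -> List[float]:
    """Naive inverse DFT; returns the real part of the reconstruction."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def decouple(spectrum: Sequence[complex]) -> Tuple[List[float], List[float]]:
    """Split each coefficient into amplitude |X| and phase arg(X)."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude: Sequence[float], phase: Sequence[float]) -> List[complex]:
    """Rebuild complex coefficients from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
```

Calling `inverse_dft(recombine(*decouple(dft(x))))` returns `x` up to floating-point error, which is what makes it possible to process the two components separately and merge them afterwards.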

[36] RadioFormer: A Multiple-Granularity Radio Map Estimation Transformer with 1‱ Spatial Sampling

Zheng Fang,Kangjun Liu,Ke Chen,Qingyu Liu,Jianguo Zhang,Lingyang Song,Yaowei Wang

Main category: cs.CV

TL;DR: Proposes RadioFormer, a multiple-granularity transformer for radio map estimation under spatially sparse observations, improving performance with a dual-stream self-attention module and multi-scale representations.

Details

Motivation: Existing deep vision models such as U-Net need sufficient spatial observations (0.01% to 1% of pixels) for radio map estimation, but real-world observations can be extremely sparse, which degrades performance. Method: RadioFormer uses a dual-stream self-attention (DSA) module to learn pixel-wise signal power correlations and patch-wise building geometries separately, and a cross-stream cross-attention (CCA) module to integrate them into multi-scale representations. Result: On the public RadioMapSeer dataset, RadioFormer outperforms existing methods while keeping the lowest computational cost, and shows strong generalization and zero-shot performance. Conclusion: RadioFormer is a more practical solution for radio map estimation with very few observation nodes. Abstract: The task of radio map estimation aims to generate a dense representation of electromagnetic spectrum quantities, such as the received signal strength at each grid point within a geographic region, based on measurements from a subset of spatially distributed nodes (represented as pixels). Recently, deep vision models such as the U-Net have been adapted to radio map estimation, whose effectiveness can be guaranteed with sufficient spatial observations (typically 0.01% to 1% of pixels) in each map, to model local dependency of observed signal power. However, such a setting of sufficient measurements can be less practical in real-world scenarios, where extreme sparsity in spatial sampling can be widely encountered. To address this challenge, we propose RadioFormer, a novel multiple-granularity transformer designed to handle the constraints posed by spatial sparse observations. Our RadioFormer, through a dual-stream self-attention (DSA) module, can respectively discover the correlation of pixel-wise observed signal power and also learn patch-wise buildings' geometries in a style of multiple granularities, which are integrated into multi-scale representations of radio maps by a cross stream cross-attention (CCA) module. Extensive experiments on the public RadioMapSeer dataset demonstrate that RadioFormer outperforms state-of-the-art methods in radio map estimation while maintaining the lowest computational cost. Furthermore, the proposed approach exhibits exceptional generalization capabilities and robust zero-shot performance, underscoring its potential to advance radio map estimation in a more practical setting with very limited observation nodes.

[37] IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos

Yuan Li,Ziqian Bai,Feitong Tan,Zhaopeng Cui,Sean Fanello,Yinda Zhang

Main category: cs.CV

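TL;DR: Proposes a 3D-aware diffusion-based method that generates photorealistic talking head videos directly from a single identity image and explicit control signals, without explicit 3D reconstruction or multi-view training data.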

Details

Motivation: Existing methods usually require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians); this work aims to generate the final output directly through a single denoising process to improve efficiency. Method: Generates multiplane images (MPIs) to ensure geometric consistency, and introduces a training mechanism that reconstructs the MPI randomly in either the target or the reference camera space so that image details and 3D information are learned together. Result: Experiments show that the method is competitive in avatar quality and novel-view rendering, without explicit 3D reconstruction or high-quality multi-view training data. Conclusion: The method generates high-quality 3D-aware videos directly through a single denoising process, simplifying the pipeline and improving efficiency. Abstract: We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
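As a rough illustration of why a multiplane image can be rendered without a separate 3D reconstruction step, the sketch below blends the planes of one pixel with the standard back-to-front "over" operator. It is a generic MPI compositing example, not the authors' renderer; the function name `composite_mpi_pixel` and the sample planes are assumptions made for illustration.

```python
def composite_mpi_pixel(planes):
    """Blend one pixel of a multiplane image with the standard "over" operator.

    planes: list of ((r, g, b), alpha) tuples ordered from the farthest
    plane to the nearest one; alpha is in [0, 1].
    """
    out = (0.0, 0.0, 0.0)  # start from a black background
    for color, alpha in planes:
        # The nearer plane covers a fraction `alpha` of what lies behind it.
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(color, out))
    return out


# Hypothetical two-plane pixel: an opaque red plane behind a half-transparent blue one.
far_red = ((1.0, 0.0, 0.0), 1.0)
near_blue = ((0.0, 0.0, 1.0), 0.5)
pixel = composite_mpi_pixel([far_red, near_blue])
print(pixel)
```

Rendering a new viewpoint from an MPI additionally warps each plane according to its depth before this blend; that warp is omitted here.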

[38] Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving

Mi Zheng,Guanglei Yang,Zitong Huang,Zhenhua Guo,Kevin Han,Wangmeng Zuo

Main category: cs.CV

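TL;DR: The paper proposes a new framework named SOTA for anomaly detection in autonomous driving scenes, improving performance through a Semantic Fusion Block and a Scene-understanding Guided Prompt-Context Adaptor.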

Details

Motivation: Current road scene segmentation methods are trained on closed-set data and detect out-of-distribution (OOD) objects poorly, and existing anomaly detection methods overlook object attributes and environmental constraints. Method: The SOTA framework uses a Semantic Fusion Block (SFB) to enhance the segmentation of objectiveness and a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA) to filter out irrelevant anomalies. Result: Experiments on multiple benchmark datasets show that SOTA significantly improves OOD detection performance and yields robust, accurate segmentation. Conclusion: The SOTA framework addresses the shortcomings of existing methods and offers a better solution for anomaly detection in autonomous driving scenes. Abstract: With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.

[39] LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition

Zhangshuo Qi,Luqi Cheng,Zijie Zhou,Guangming Xiong

Main category: cs.CV

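TL;DR: LRFusionPR proposes a dual-branch network that fuses LiDAR and radar data, improving the accuracy and robustness of place recognition for autonomous driving through cross-modality feature interaction and knowledge distillation.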

Details

Motivation: Place recognition is critical for autonomous driving in GPS-denied environments. LiDAR and radar each have strengths, but fusing them remains challenging because radar data is noisy and sparse. Method: Uses a dual-branch network that fuses LiDAR and radar data in a unified polar BEV representation, applies cross-attention for feature interaction, and uses knowledge distillation to improve the robustness of the radar branch. Result: Experiments on multiple datasets show that LRFusionPR achieves accurate place recognition and remains robust under varying weather conditions. Conclusion: By fusing LiDAR and radar effectively, LRFusionPR significantly improves place recognition performance and provides a reliable solution for autonomous driving. Abstract: In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird's eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at https://github.com/QiZS-BIT/LRFusionPR.
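To make the "unified polar coordinate BEV representation" more concrete, here is a minimal sketch that bins 2D sensor points into a ring-by-sector occupancy grid. The grid resolution, range limit, and the function name `polar_bev` are assumptions for illustration; the abstract does not specify how LRFusionPR builds its BEV input.

```python
import math


def polar_bev(points, num_rings, num_sectors, max_range):
    """Count points per (ring, sector) cell of a polar bird's-eye-view grid.

    points: iterable of (x, y) coordinates in the sensor frame, in metres.
    Returns a list of `num_rings` rows, each with `num_sectors` counts.
    """
    grid = [[0] * num_sectors for _ in range(num_rings)]
    for x, y in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue  # drop returns beyond the grid
        theta = math.atan2(y, x) % (2.0 * math.pi)  # angle in [0, 2*pi)
        ring = min(int(r / max_range * num_rings), num_rings - 1)
        sector = int(theta / (2.0 * math.pi) * num_sectors) % num_sectors
        grid[ring][sector] += 1
    return grid


# Hypothetical points: three inside a 4 m range, one far outside it.
grid = polar_bev([(1, 0), (-1, 1), (1, -3), (10, 10)],
                 num_rings=2, num_sectors=4, max_range=4.0)
print(grid)
```

Because both LiDAR and radar returns can be binned into the same grid shape, a representation of this kind gives the two modalities a common layout before any learned fusion.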

[40] Adaptive Dual-domain Learning for Underwater Image Enhancement

Lingtao Peng,Liheng Bian

Main category: cs.CV

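TL;DR: This paper proposes SS-UIE, an underwater image enhancement method based on spatial-spectral dual-domain adaptive learning. It combines a spatial multi-scale cycle selective scan module with a spectral self-attention module to handle the inconsistent degradation across spatial regions and spectral bands that existing methods ignore, and uses a frequency-wise loss to improve the reconstruction of high-frequency details.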

Details

Motivation: Existing learning-based underwater image enhancement methods do not account for inconsistent degradation levels across spatial regions and spectral bands at the same time, and they do not treat regions with high-frequency details differently, which hurts reconstruction. Method: Proposes SS-UIE, which combines a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module into a Spatial-Spectral block (SS-block), and introduces a Frequency-Wise Loss (FWL) to improve the reconstruction of high-frequency details. Result: Experiments show that SS-UIE outperforms existing methods in both performance and computational cost. Conclusion: Through dual-domain adaptive learning and a frequency-wise loss, SS-UIE significantly improves the quality and efficiency of underwater image enhancement. Abstract: Recently, learning-based Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that the regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model's attention on the regions with high-frequency details. Extensive experiments validate that the SS-UIE technique outperforms state-of-the-art UIE methods while requiring lower computational and memory costs.

[41] FlexPara: Flexible Neural Surface Parameterization

Yuming Zhao,Qijian Zhang,Junhui Hou,Jiazhi Xia,Wenping Wang,Ying He

Main category: cs.CV

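TL;DR: FlexPara is an unsupervised neural optimization framework for global and multi-chart surface parameterization that maps 3D surface points to adaptively deformed 2D UV coordinates.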

Details

Motivation: Conventional parameterization methods require high-quality mesh triangulation and are limited to simple topologies, so they lack flexibility and controllability. Method: Designs a series of geometrically interpretable sub-networks (cutting, deforming, unwrapping, and wrapping) and builds a bi-directional cycle mapping framework that needs no manually specified cutting seams. Result: Experiments demonstrate the universality, superiority, and potential of FlexPara. Conclusion: FlexPara provides a flexible and controllable solution for surface parameterization; the code will be made public. Abstract: Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara

[42] CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

Tuan Nguyen,Naseem Khan,Issa Khalil

Main category: cs.CV

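TL;DR: CapsFake is a novel multimodal capsule network that detects instruction-guided deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities.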

Details

Motivation: The rapid evolution of deepfake technology, especially instruction-guided image editing, threatens the integrity of digital images, and existing defenses struggle to detect such edits. Method: Proposes CapsFake, which uses a competitive routing mechanism to dynamically aggregate local features and precisely identify manipulated regions. Result: Performs strongly on multiple datasets, improving detection accuracy by up to 20% and reaching a 96% detection rate under adversarial attacks. Conclusion: CapsFake provides a powerful framework for countering sophisticated image manipulations. Abstract: The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Generated conditionally from real images and textual prompts, these edits are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness, achieving detection rates above 94% under natural perturbations and 96% against adversarial attacks, with excellent generalization to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.

[43] CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis

Alexander Baumann,Leonardo Ayala,Silvia Seidlitz,Jan Sellner,Alexander Studier-Fischer,Berkin Özdemir,Lena Maier-Hein,Slobodan Ilic

Main category: cs.CV

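TL;DR: CARL is a camera-agnostic representation learning method for RGB, multispectral, and hyperspectral imaging that addresses the poor model generalization caused by differences in channel dimensionality and wavelengths across spectral cameras.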

Details

Motivation: Spectral imaging is used across many domains, but differences in channel dimensionality and wavelengths between cameras limit the generality of AI methods, so models transfer poorly across cameras. Method: Proposes CARL, which uses wavelength positional encoding and a self-attention-cross-attention mechanism to compress spectral information into learned query representations, and pre-trains with a spectral self-supervised JEPA-inspired strategy. Result: In large-scale experiments on medical imaging, autonomous driving, and satellite imaging, CARL shows unique robustness to spectral heterogeneity and outperforms other methods. Conclusion: The scalability and versatility of CARL position it as a backbone for future spectral foundation models. Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic embedding, we introduce wavelength positional encoding and a self-attention-cross-attention mechanism to compress spectral information into learned query representations. Spectral-spatial pre-training is achieved with a novel spectral self-supervised JEPA-inspired strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
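One way to picture a wavelength positional encoding is to reuse the familiar sinusoidal position encoding but index it by each channel's wavelength instead of its channel number. The sketch below does exactly that; it is a swapped-in, generic technique, and the function name `wavelength_encoding`, the nanometre unit, and the `max_period` constant are assumptions rather than details of CARL.

```python
import math


def wavelength_encoding(wavelength_nm, dim, max_period=10000.0):
    """Sinusoidal encoding of a channel's centre wavelength (assumed in nm).

    Returns a list of `dim` values (dim must be even): interleaved
    sin/cos pairs at geometrically spaced frequencies.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(dim // 2):
        frequency = 1.0 / (max_period ** (2 * i / dim))
        encoding.append(math.sin(wavelength_nm * frequency))
        encoding.append(math.cos(wavelength_nm * frequency))
    return encoding


# Hypothetical cameras with different channel counts share one encoder:
rgb_camera = [wavelength_encoding(w, dim=8) for w in (450.0, 550.0, 650.0)]
four_band_camera = [wavelength_encoding(w, dim=8) for w in (480.0, 560.0, 655.0, 865.0)]
print(len(rgb_camera), len(four_band_camera))
```

The design point is that a channel is described by what it measures rather than by its index, so images with three channels or with hundreds can feed the same downstream attention layers.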

[44] Unsupervised 2D-3D lifting of non-rigid objects using local constraints

Shalini Maiti,Lourdes Agapito,Benjamin Graham

Main category: cs.CV

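TL;DR: High-capacity models trained with an unsupervised loss, combined with local low-rank constraints, significantly improve the accuracy of 3D shape prediction for non-rigid objects.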

Details

Motivation: Predicting the 3D shape of non-rigid objects from 2D keypoints is ill-posed; traditional methods rely on low-rank constraints and specialized models that are hard to train and limit reconstruction quality. Method: Uses a high-capacity model with an unsupervised loss and applies low-rank constraints to local subsets of the shape, balancing model capacity against the constraints. Result: Reduces the reconstruction error on the S-Up3D dataset by over 70%, setting a new state of the art. Conclusion: Combining high-capacity models with local low-rank constraints is an effective way to improve 3D shape prediction for non-rigid objects. Abstract: For non-rigid objects, predicting the 3D shape from 2D keypoint observations is ill-posed due to occlusions, and the need to disentangle changes in viewpoint and changes in shape. This challenge has often been addressed by embedding low-rank constraints into specialized models. These models can be hard to train, as they depend on finding a canonical way of aligning observations, before they can learn detailed geometry. These constraints have limited the reconstruction quality. We show that generic, high capacity models, trained with an unsupervised loss, allow for more accurate predicted shapes. In particular, applying low-rank constraints to localized subsets of the full shape allows the high capacity to be suitably constrained. We reduce the state-of-the-art reconstruction error on the S-Up3D dataset by over 70%.

De Cheng,Lingfeng He,Nannan Wang,Dingwen Zhang,Xinbo Gao

Main category: cs.CV

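TL;DR: Proposes a framework named SALCR that addresses cross-modality variations in unsupervised visible-infrared person re-identification through semantic-aligned learning and collaborative refinement.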

Details

Motivation: Existing methods overlook the differences that fine-grained patterns introduce into cross-modality feature representations and pseudo-label distributions, so modality-shared learning is insufficient when only global features are optimized. Method: The SALCR framework comprises a bi-directional pseudo-label unification module (DAGI), a Fine-Grained Semantic-Aligned Learning module (FGSAL), and a Global-Part Collaborative Refinement module (GPCR). Result: Experiments show that the method outperforms existing techniques. Conclusion: Through fine-grained semantic alignment and dynamic optimization, SALCR effectively improves cross-modality person re-identification. Abstract: Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up an optimization objective for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. An optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance compared with state-of-the-art methods. Our code is available at https://github.com/FranklinLingfeng/code-for-SALCR.

[46] ODExAI: A Comprehensive Object Detection Explainable AI Evaluation

Loc Phuc Truong Nguyen,Hung Truong Thanh Nguyen,Hung Cao

Main category: cs.CV

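TL;DR: The paper proposes the ODExAI framework for systematically evaluating XAI methods for object detection models, focusing on localization accuracy, faithfulness to the model, and computational complexity.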

Details

Motivation: There is currently no standard for evaluating XAI techniques for object detection models, which hinders the comparison and selection of methods. Method: Introduces the ODExAI framework, which evaluates XAI methods along three core dimensions, and benchmarks them on models such as YOLOX and Faster R-CNN with standard datasets. Result: Region-based methods (e.g., D-CLOSE) localize well (PG = 88.49%) and are highly faithful (OA = 0.863) but computationally expensive (71.42s); CAM-based methods (e.g., G-CAME) localize better (PG = 96.13%) and run fast (0.54s) but are less faithful (OA = 0.549). Conclusion: Existing XAI methods involve critical trade-offs and should be chosen according to task requirements; ODExAI provides a standardized tool for evaluation. Abstract: Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comprehensive framework designed to assess XAI methods in object detection based on three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity. We benchmark a set of XAI methods across two widely used object detectors (YOLOX and Faster R-CNN) and standard datasets (MS-COCO and PASCAL VOC). Empirical results demonstrate that region-based methods (e.g., D-CLOSE) achieve strong localization (PG = 88.49%) and high model faithfulness (OA = 0.863), though with substantial computational overhead (Time = 71.42s). On the other hand, CAM-based methods (e.g., G-CAME) achieve superior localization (PG = 96.13%) and significantly lower runtime (Time = 0.54s), but at the expense of reduced faithfulness (OA = 0.549). These findings demonstrate critical trade-offs among existing XAI approaches and reinforce the need for task-specific evaluation when deploying them in object detection pipelines. Our implementation and evaluation benchmarks are publicly available at: https://github.com/Analytics-Everywhere-Lab/odexai.
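The PG figures above presumably refer to the pointing game, a common localization metric for saliency maps: an explanation scores a hit when its highest-valued pixel falls inside the object's bounding box. The sketch below implements that common formulation; ODExAI's exact protocol (for example, any tolerance radius) may differ, and the function names and sample maps are assumptions for illustration.

```python
def pointing_game_hit(saliency, box):
    """Return True if the saliency peak lies inside the box.

    saliency: 2D list indexed as saliency[row][col].
    box: (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive.
    """
    peak_value, peak_row, peak_col = float("-inf"), 0, 0
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            if value > peak_value:  # the first maximum wins on ties
                peak_value, peak_row, peak_col = value, r, c
    x_min, y_min, x_max, y_max = box
    return x_min <= peak_col <= x_max and y_min <= peak_row <= y_max


def pointing_game_accuracy(saliency_maps, boxes):
    """Fraction of explanations whose peak lands inside the paired box."""
    hits = sum(pointing_game_hit(s, b) for s, b in zip(saliency_maps, boxes))
    return hits / len(boxes)


# Hypothetical 2x2 saliency maps paired with one-pixel boxes.
maps = [[[0.1, 0.2], [0.9, 0.3]], [[0.8, 0.1], [0.2, 0.3]]]
boxes = [(0, 1, 0, 1), (1, 1, 1, 1)]
print(pointing_game_accuracy(maps, boxes))
```

A metric like this only measures where the explanation points, which is why a method can score well on PG while scoring lower on faithfulness.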

Songsong Xiong,Hamidreza Kasaei

Main category: cs.CV

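TL;DR: Proposes a Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) that improves the accuracy of robotic 3D object recognition through the Globally Entropy-based Embeddings Fusion (GEEF) method.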

Details

Motivation: In complex and variable human-centered environments (such as restaurants, homes, and warehouses), robotic 3D object recognition is challenging and needs more efficient methods. Method: Combines pre- and mid-level convolutional encoders with local and global transformers, and uses the GEEF method to fuse multi-view data. Result: Achieves 95.6% recognition accuracy on the ModelNet40 dataset, and 5-fold cross-validation on the OmniObject3D dataset confirms its robustness. Conclusion: LM-MCVT performs well on both synthetic and real-world data and outperforms existing methods. Abstract: In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multiple views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method's robustness in 3D object recognition across synthetic and real-world 3D data.

[48] OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion

Shuhao Kang,Martin Y. Liao,Yan Xia,Olaf Wysocki,Boris Jutzi,Daniel Cremers

Main category: cs.CV

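TL;DR: OPAL is a novel LiDAR place recognition network that uses OpenStreetMap as a lightweight prior and significantly improves recognition performance through a cross-modal visibility mask and an adaptive radial fusion module.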

Details

Motivation: Existing methods rely on dense 3D maps or aerial imagery, which incur large storage overhead and lack real-time adaptability; OPAL aims to solve this problem. Method: Designs a cross-modal visibility mask and an adaptive radial fusion module to combine sparse LiDAR scans with structured OSM data. Result: On the KITTI and KITTI-360 datasets, OPAL improves recall by 15.98% and runs inference 12x faster. Conclusion: With a lightweight OSM prior and efficient module design, OPAL significantly improves the performance and efficiency of LiDAR place recognition. Abstract: LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components: a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multiscale radial features into discriminative global descriptors. Extensive experiments on the augmented KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher recall at the 1 m threshold for top-1 retrieved matches while operating at 12x faster inference speeds compared to state-of-the-art approaches. Code and datasets are publicly available at: https://github.com/WHU-USI3DV/OPAL.
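The reported "recall at the 1 m threshold for top-1 retrieved matches" can be read as follows: for each query, retrieve the single database entry with the closest descriptor, and count a success if that entry's position lies within the distance threshold of the query's true position. The sketch below follows that reading; OPAL's exact evaluation protocol is not given in the abstract, so the function name, the Euclidean descriptor distance, and the toy data are assumptions.

```python
import math


def recall_at_1(query_desc, db_desc, query_pos, db_pos, threshold_m):
    """Fraction of queries whose nearest-descriptor match is within threshold_m.

    query_desc, db_desc: lists of descriptor vectors (equal-length tuples).
    query_pos, db_pos: lists of (x, y) ground-truth positions in metres.
    """
    hits = 0
    for desc, pos in zip(query_desc, query_pos):
        # Top-1 retrieval: database index with the smallest descriptor distance.
        best = min(range(len(db_desc)), key=lambda i: math.dist(desc, db_desc[i]))
        if math.dist(pos, db_pos[best]) <= threshold_m:
            hits += 1
    return hits / len(query_desc)


# Hypothetical toy data: the first query localizes correctly, the second does not.
db_desc = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
db_pos = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
query_desc = [(0.1, 0.0), (4.9, 5.0)]
query_pos = [(0.5, 0.0), (25.0, 0.0)]
print(recall_at_1(query_desc, db_desc, query_pos, db_pos, threshold_m=1.0))
```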

[49] Rendering Anywhere You See: Renderability Field-guided Gaussian Splatting

Xiaofeng Jin,Yan Fang,Matteo Frosi,Jianfei Ge,Jiangjian Xiao,Matteo Matteucci

Main category: cs.CV

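TL;DR: Proposes renderability field-guided Gaussian splatting (RF-GS) to address the rendering instability caused by non-uniform observations in scene view synthesis.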

Details

Motivation: Scene view synthesis is increasingly important for virtual reality, augmented reality, and robotics, but non-uniform observations make rendering quality unstable. Method: Quantifies input inhomogeneity with a renderability field that guides pseudo-view sampling; trains an image restoration model to improve pseudo-view quality; and uses a hybrid data optimization strategy to fuse pseudo-view angle information with source-view textures. Result: Experiments on simulated and real-world data show that the method outperforms existing approaches in rendering stability. Conclusion: RF-GS effectively improves the rendering stability of scene view synthesis and has practical application potential. Abstract: Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360° views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided Gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhance visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses information of pseudo-view angles and source view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.

[50] OpenFusion++: An Open-vocabulary Real-time Scene Understanding System

Xiaofeng Jin,Matteo Frosi,Matteo Matteucci

Main category: cs.CV

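TL;DR: OpenFusion++ is a TSDF-based real-time 3D semantic-geometric reconstruction system that significantly improves semantic accuracy and query responsiveness by fusing confidence maps from foundation models, dynamically updating semantic labels, and using a dual-path encoding framework.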

Details

Motivation: Real-time open-vocabulary scene understanding is essential for applications such as vision-language navigation, embodied intelligence, and augmented reality, but existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. Method: OpenFusion++ refines 3D point clouds by fusing confidence maps from foundation models, dynamically updates global semantic labels via an adaptive cache based on instance area, and uses a dual-path encoding framework that combines object attributes with environmental context to answer queries precisely. Result: Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets show that OpenFusion++ significantly outperforms the baseline in semantic accuracy and query responsiveness. Conclusion: OpenFusion++ addresses the shortcomings of existing methods and provides an efficient solution for real-time 3D semantic-geometric reconstruction. Abstract: Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.

[51] VI3NR: Variance Informed Initialization for Implicit Neural Representations

Chamin Hewa Koneputugodage,Yizhak Ben-Shabat,Sameera Ramasinghe,Stephen Gould

Main category: cs.CV

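TL;DR: This paper proposes a neural network initialization method that applies to any activation function, improves the convergence and accuracy of INRs, and is validated on signals from multiple modalities.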

Details

Motivation: The success of INRs depends on network initialization, but existing initialization methods do not apply to many activation functions, especially those commonly used in INRs. Method: Derives an initialization with stable variance across layers that applies to any activation function and generalizes existing methods. Result: Validates the initialization on multiple signal modalities (image, audio, 3D reconstruction), with particularly strong results for Gaussian INRs. Conclusion: The proposed initialization is broadly applicable and significantly improves INR performance. Abstract: Implicit Neural Representations (INRs) are a versatile and powerful tool for encoding various forms of data, including images, videos, sound, and 3D shapes. A critical factor in the success of INRs is the initialization of the network, which can significantly impact the convergence and accuracy of the learned model. Unfortunately, commonly used neural network initializations are not widely applicable for many activation functions, especially those used by INRs. In this paper, we improve upon previous initialization methods by deriving an initialization that has stable variance across layers, and applies to any activation function. We show that this generalizes many previous initialization methods, and has even better stability for well-studied activations. We also show that our initialization leads to improved results with INR activation functions in multiple signal modalities. Our approach is particularly effective for Gaussian INRs, where we demonstrate that the theory of our initialization matches with task performance in multiple experiments, allowing us to achieve improvements in image, audio, and 3D surface reconstruction.
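The idea of an initialization with stable variance across layers can be illustrated with a textbook variance-propagation argument: if the weights are zero-mean and independent of the activations, the next pre-activation has variance fan_in * Var(w) * E[f(z)^2], so choosing Var(w) = target_var / (fan_in * E[f(z)^2]) keeps the variance constant. The sketch below estimates E[f(z)^2] by Monte Carlo for an arbitrary activation. It is a generic illustration under those stated assumptions, not the derivation used in VI3NR; the function names and sample count are hypothetical.

```python
import math
import random


def second_moment(activation, input_std, samples=100_000, seed=0):
    """Monte Carlo estimate of E[f(z)^2] for z ~ N(0, input_std^2)."""
    rng = random.Random(seed)
    total = sum(activation(rng.gauss(0.0, input_std)) ** 2 for _ in range(samples))
    return total / samples


def weight_std(activation, fan_in, target_std=1.0):
    """Weight standard deviation that keeps pre-activation std at target_std.

    Assumes zero-mean weights independent of the activations, so that
    Var(z_next) = fan_in * Var(w) * E[f(z)^2].
    """
    m2 = second_moment(activation, target_std)
    return target_std / math.sqrt(fan_in * m2)


def relu(z):
    return max(z, 0.0)


def gaussian(z):
    return math.exp(-z * z)


# For ReLU with unit-variance inputs, E[f(z)^2] = 1/2, so the rule
# approximately recovers the well-known sqrt(2 / fan_in) scaling.
print(weight_std(relu, fan_in=100), math.sqrt(2 / 100))
# The same recipe applies unchanged to a Gaussian activation.
print(weight_std(gaussian, fan_in=100))
```

The ReLU case shows how a single variance-matching rule can reproduce an existing initialization as a special case, which is the sense in which such schemes generalize earlier ones.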

Athul M. Mathew,Arshad Ali Khan,Thariq Khalid,Faroq AL-Tam,Riad Souissi

Main category: cs.CV

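TL;DR: Proposes a new method for gaze target detection (GTD) that fuses multimodal information, improving performance through a 3D representation and a depth-infused saliency module, and outperforms existing methods on several datasets.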

Details

Motivation: Gaze target detection requires understanding the complex relationships between a person's head, body, and eyes and the environment, which existing methods struggle to capture fully. Method: Projects the 2D image into a 3D representation, extracts a depth-infused saliency map together with face and depth modalities, and fuses these modalities to predict the gaze target. Result: Outperforms existing methods on the VideoAttentionTarget, GazeFollow, and GOO-Real datasets. Conclusion: The method offers a promising new approach to GTD. Abstract: Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient (attention-grabbing) regions in the image for the subject in consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including the ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.

[53] Optimal Hyperspectral Undersampling Strategy for Satellite Imaging

Vita V. Vlasova,Vladimir G. Kuzmin,Maria S. Varetsa,Natalia A. Ibragimova,Oleg Y. Rogov,Elena V. Lyapuntsova

Main category: cs.CV

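TL;DR: Proposes Iterative Wavelet-based Gradient Sampling (IWGS), a band selection method for hyperspectral image classification that significantly improves classification accuracy and computational efficiency.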

Details

Motivation: Hyperspectral image classification faces high dimensionality, spectral redundancy, and limited labeled data, and needs an efficient band selection method. Method: IWGS iteratively selects the most informative bands by analyzing gradients in the wavelet-transformed domain, using the multi-resolution properties of wavelets to capture subtle spectral variations. Result: On the Houston 2013 and Indian Pines datasets, IWGS outperforms existing methods, reaching an overall accuracy of up to 97.8% on Indian Pines for selected classes. Conclusion: IWGS performs well in resource-constrained environments and is broadly applicable. Abstract: Hyperspectral image (HSI) classification presents significant challenges due to the high dimensionality, spectral redundancy, and limited labeled data typically available in real-world applications. To address these issues and optimize classification performance, we propose a novel band selection strategy known as Iterative Wavelet-based Gradient Sampling (IWGS). This method incrementally selects the most informative spectral bands by analyzing gradients within the wavelet-transformed domain, enabling efficient and targeted dimensionality reduction. Unlike traditional selection methods, IWGS leverages the multi-resolution properties of wavelets to better capture subtle spectral variations relevant for classification. The iterative nature of the approach ensures that redundant or noisy bands are systematically excluded while maximizing the retention of discriminative features. We conduct comprehensive experiments on two widely-used benchmark HSI datasets: Houston 2013 and Indian Pines. Results demonstrate that IWGS consistently outperforms state-of-the-art band selection and classification techniques in terms of both accuracy and computational efficiency. These improvements make our method especially suitable for deployment in edge devices or other resource-constrained environments, where memory and processing power are limited. In particular, IWGS achieved an overall accuracy up to 97.8% on Indian Pines for selected classes, confirming its effectiveness and generalizability across different HSI scenarios.

[54] Marine Snow Removal Using Internally Generated Pseudo Ground Truth

Alexandra Malyugina,Guoxi Huang,Eduardo Ruiz,Benjamin Leslie,Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Proposes a new method that improves underwater video quality by generating paired datasets, addressing the problem of marine snow noise.

Details

Motivation: Underwater videos are degraded by light absorption, scattering, and noise such as marine snow, and existing methods perform poorly because paired training data is lacking. Method: Proposes a new framework that generates paired datasets (snowy and snow-free videos) from raw underwater videos for supervised training. Result: The generated paired datasets effectively improve underwater image restoration, especially when ground truth is unavailable. Conclusion: The method offers a new approach to underwater video enhancement and addresses the shortage of data. Abstract: Underwater videos often suffer from degraded quality due to light absorption, scattering, and various noise sources. Among these, marine snow, which consists of suspended organic particles appearing as bright spots or noise, significantly impacts machine vision tasks, particularly those involving feature matching. Existing methods for removing marine snow are ineffective due to the lack of paired training data. To address this challenge, this paper proposes a novel enhancement framework that introduces a new approach for generating paired datasets from raw underwater videos. The resulting dataset consists of paired images of generated snowy and snow-free underwater videos, enabling supervised training for video enhancement. We describe the dataset creation process, highlight its key characteristics, and demonstrate its effectiveness in enhancing underwater image restoration in the absence of ground truth.

[55] FusionNet: Multi-model Linear Fusion Framework for Low-light Image Enhancement

Kangbiao Shi,Yixu Feng,Tao Hu,Yu Cao,Peng Wu,Yijin Liang,Yanning Zhang,Qingsen Yan

Main category: cs.CV

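TL;DR: FusionNet is a new multi-model linear fusion framework that operates in parallel to capture global and local features, addresses the parameter explosion, optimization instability, and feature misalignment of existing fusion strategies, and took 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge.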

Details

Motivation: Existing low-light image enhancement methods face parameter explosion, optimization instability, and feature misalignment when fusing different architectures and color spaces, which limits further gains. Method: Proposes FusionNet, which uses a linear fusion strategy backed by Hilbert space theoretical guarantees and captures global and local features across multiple color spaces in parallel. Result: Significantly outperforms existing methods on synthetic and real-world datasets, both quantitatively and qualitatively. Conclusion: FusionNet's linear fusion strategy addresses these problems and delivers robust enhancement under diverse low-light conditions. Abstract: The advent of Deep Neural Networks (DNNs) has driven remarkable progress in low-light image enhancement (LLIE), with diverse architectures (e.g., CNNs and Transformers) and color spaces (e.g., sRGB, HSV, HVI) yielding impressive results. Recent efforts have sought to leverage the complementary strengths of these paradigms, offering promising solutions to enhance performance across varying degradation scenarios. However, existing fusion strategies are hindered by challenges such as parameter explosion, optimization instability, and feature misalignment, limiting further improvements. To overcome these issues, we introduce FusionNet, a novel multi-model linear fusion framework that operates in parallel to effectively capture global and local features across diverse color spaces. By incorporating a linear fusion strategy underpinned by Hilbert space theoretical guarantees, FusionNet mitigates network collapse and reduces excessive training costs. Our method achieved 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both quantitative and qualitative results, delivering robust enhancement under diverse low-light conditions.

[56] Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography

Ni Yao,Xiangyu Liu,Danyang Sun,Chuang Han,Yanting Li,Jiaofen Nan,Chengyang Li,Fubao Zhu,Weihua Zhou,Chen Zhao

Main category: cs.CV

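TL;DR: Proposes MGFA-Net, a novel U-shaped dual-encoder architecture for coronary artery segmentation and stenosis detection that combines myocardial region guidance with multi-scale feature fusion and significantly improves performance.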

Details

Motivation: Coronary artery disease is a leading cause of death worldwide, and existing methods struggle with low contrast, morphological variability, and small vessel segmentation. Method: Uses a myocardial region-guided module, a residual feature extraction encoding module, and a multi-scale feature fusion module, with Monte Carlo dropout to quantify prediction uncertainty. Result: Dice score of 85.04%, accuracy of 84.24%, HD95 of 6.1294 mm, and a 5.46% improvement in true positive rate for stenosis detection over 3D U-Net. Conclusion: By incorporating anatomical prior knowledge, MGFA-Net provides an automated and clinically interpretable CAD assessment tool for precision medicine. Abstract: Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net, a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net. The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine.
Our code is publicly available at http://github.com/chenzhao2023/MGFA_CCTA

[57] Platonic Grounding for Efficient Multimodal Language Models

Moulik Choraria,Xinbo Wu,Akhil Bhimaraju,Nitesh Sekhar,Yue Wu,Xu Zhang,Prateek Singhal,Lav R. Varshney

Main category: cs.CV

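TL;DR: Proposes a modification to multimodal frameworks that exploits the implicit alignment of pretrained models to significantly reduce training and inference compute while maintaining or improving performance.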

Details

Motivation: As the data and parameter scale of Transformer models grows, performance gains diminish while training costs rise sharply. In multimodal learning in particular, inference cost is critical to a model's practical viability, so more efficient finetuning and inference methods are needed. Method: Inspired by the implicit alignment found in the deeper layers of pretrained models, the authors propose a simple modification to multimodal frameworks that exploits this alignment to reduce compute requirements. Result: Experiments show that the method maintains or improves baseline performance while significantly reducing training and inference compute. Conclusion: The work offers an efficient approach to multimodal learning and has implications for efficiently combining pretrained models into larger systems. Abstract: The hyperscaling of data and parameter count in Transformer-based models is yielding diminishing performance improvement, especially when weighed against training costs. Such plateauing indicates the importance of methods for more efficient finetuning and inference, while retaining similar performance. This is especially relevant for multimodal learning paradigms, where inference costs of processing multimodal tokens can determine the model's practical viability. At the same time, research on representations and mechanistic interpretability has improved our understanding of the inner workings of Transformer-based models; one such line of work reveals an implicit alignment in the deeper layers of pretrained models, across modalities. Taking inspiration from this, we motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. We demonstrate that our approach maintains and, in some cases, even improves performance of baseline methods while achieving significant gains in both training and inference-time compute. Our work also has implications for combining pretrained models into larger systems efficiently.

[58] Enhancing seeding efficiency using a computer vision system to monitor furrow quality in real-time

Sidharth Rai,Aryan Dalal,Riley Slichter,Ajay Sharda

Main category: cs.CV

TL;DR: Proposes a computer vision-based method for evaluating row cleaner performance to improve seeding efficiency in precision agriculture.

Details

Motivation: In precision agriculture, seed sowing is hindered by residue accumulation, low soil temperatures, and hair pinning, and there is no quantitative method for assessing row cleaner performance. Method: Develops a computer vision method in which a video acquisition system captures trench conditions after row cleaner operation and a segmentation model analyzes key elements such as soil, straw, and machinery. Result: The method objectively quantifies row cleaner performance and provides a basis for row cleaner selection. Conclusion: The method has the potential to improve row cleaner selection and seeding efficiency in precision agriculture. Abstract: Effective seed sowing in precision agriculture is hindered by challenges such as residue accumulation, low soil temperatures, and hair pinning (crop residue pushed in the trench by furrow opener), which obstruct optimal trench formation. Row cleaners are employed to mitigate these issues, but there is a lack of quantitative methods to assess trench cleanliness. In this study, a novel computer vision-based method was developed to evaluate row cleaner performance. Multiple air seeders were equipped with a video acquisition system to capture trench conditions after row cleaner operation, enabling an effective comparison of the performance of each row cleaner. The captured data were used to develop a segmentation model that analyzed key elements such as soil, straw, and machinery. Using the results from the segmentation model, an objective method was developed to quantify row cleaner performance. The results demonstrated the potential of this method to improve row cleaner selection and enhance seeding efficiency in precision agriculture.

[59] Improving Small Drone Detection Through Multi-Scale Processing and Data Augmentation

Rayson Laroca,Marcelo dos Santos,David Menotti

Main category: cs.CV

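TL;DR: Proposes a multi-scale drone detection method built on YOLOv11 that combines data augmentation and post-processing to distinguish drones from birds in complex environments.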

Details

Motivation: In modern surveillance, drones are hard to distinguish from birds, so an effective detection method is needed. Method: Uses multi-scale input processing, copy-paste data augmentation, and post-processing based on frame-to-frame consistency. Result: Ranked in the top 3 of the Drone-vs-Bird Detection Grand Challenge held at IJCNN 2025. Conclusion: The method detects drones effectively in complex environments. Abstract: Detecting small drones, often indistinguishable from birds, is crucial for modern surveillance. This work introduces a drone detection methodology built upon the medium-sized YOLOv11 object detection model. To enhance its performance on small targets, we implemented a multi-scale approach in which the input image is processed both as a whole and in segmented parts, with subsequent prediction aggregation. We also utilized a copy-paste data augmentation technique to enrich the training dataset with diverse drone and bird examples. Finally, we implemented a post-processing technique that leverages frame-to-frame consistency to mitigate missed detections. The proposed approach attained a top-3 ranking in the 8th WOSDETC Drone-vs-Bird Detection Grand Challenge, held at the 2025 International Joint Conference on Neural Networks (IJCNN), showcasing its capability to detect drones in complex environments effectively.
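The multi-scale step described in the abstract (process the whole image and its segmented parts, then aggregate predictions) might be sketched roughly as below. This is only an illustration under stated assumptions: the 2x2 grid, the pixel overlap, the box format, and the use of non-maximum suppression for aggregation are choices made here, not details confirmed by the abstract, and the detector itself is left out.

```python
# Hypothetical sketch: tile an image, map per-tile boxes back to full-image
# coordinates, and merge them with the whole-image boxes via greedy NMS.
# Box format (assumed): (x1, y1, x2, y2, score).

def make_tiles(width, height, rows=2, cols=2, overlap_px=5):
    """Return (x1, y1, x2, y2) windows that cover the image with some overlap."""
    tile_w, tile_h = width // cols, height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x1 = max(0, c * tile_w - overlap_px)
            y1 = max(0, r * tile_h - overlap_px)
            x2 = width if c == cols - 1 else (c + 1) * tile_w + overlap_px
            y2 = height if r == rows - 1 else (r + 1) * tile_h + overlap_px
            tiles.append((x1, y1, x2, y2))
    return tiles

def to_full_image(boxes, tile):
    """Shift tile-local boxes by the tile's top-left corner."""
    ox, oy = tile[0], tile[1]
    return [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, s) for x1, y1, x2, y2, s in boxes]

def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box among overlapping ones."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept

def aggregate(full_image_boxes, tile_results, iou_threshold=0.5):
    """Merge whole-image boxes with (tile, tile_local_boxes) pairs."""
    merged = list(full_image_boxes)
    for tile, boxes in tile_results:
        merged.extend(to_full_image(boxes, tile))
    return nms(merged, iou_threshold)
```

In use, a detector would be run once on the full frame and once per window from `make_tiles`, and `aggregate` would combine the outputs so that a small drone found in both passes is reported once.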

[60] MERA: Multimodal and Multiscale Self-Explanatory Model with Considerably Reduced Annotation for Lung Nodule Diagnosis

Jiahao Lu,Chong Yin,Silvia Ingala,Kenny Erleben,Michael Bachmann Nielsen,Sune Darkner

Main category: cs.CV

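TL;DR: MERA is a multimodal, multiscale self-explanatory model for lung nodule diagnosis that considerably reduces annotation requirements, combines unsupervised and weakly supervised learning strategies, and provides multilevel explanations.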

Details

Motivation: Early detection of lung cancer is critical, but existing XAI systems struggle to provide clear explanations when labelled data is limited. Method: MERA combines self-supervised learning with a Vision Transformer architecture for unsupervised feature extraction, and uses semi-supervised active learning for hierarchical prediction in the latent space. Result: On the LIDC dataset, MERA matches or exceeds the diagnostic accuracy of fully annotated methods using only 1% annotated samples. Conclusion: MERA's design improves the trustworthiness and transparency of AI diagnosis and lowers the barrier to deploying AI in medical domains. Abstract: Lung cancer, a leading cause of cancer-related deaths globally, emphasises the importance of early detection for better patient outcomes. Pulmonary nodules, often early indicators of lung cancer, necessitate accurate, timely diagnosis. Despite Explainable Artificial Intelligence (XAI) advances, many existing systems struggle to provide clear, comprehensive explanations, especially with limited labelled data. This study introduces MERA, a Multimodal and Multiscale self-Explanatory model designed for lung nodule diagnosis with considerably Reduced Annotation requirements. MERA integrates unsupervised and weakly supervised learning strategies (self-supervised learning techniques and Vision Transformer architecture for unsupervised feature extraction) and a hierarchical prediction mechanism leveraging sparse annotations via semi-supervised active learning in the learned latent space. MERA explains its decisions on multiple levels: model-level global explanations via semantic latent space clustering, instance-level case-based explanations showing similar instances, local visual explanations via attention maps, and concept explanations using critical nodule attributes. Evaluations on the public LIDC dataset show MERA's superior diagnostic accuracy and self-explainability. With only 1% annotated samples, MERA achieves diagnostic accuracy comparable to or exceeding state-of-the-art methods requiring full annotation. The model's inherent design delivers comprehensive, robust, multilevel explanations aligned closely with clinical practice, enhancing trustworthiness and transparency. Demonstrated viability of unsupervised and weakly supervised learning lowers the barrier to deploying diagnostic AI in broader medical domains.
Our complete code is open source and available at https://github.com/diku-dk/credanno.

[61] Mitigating Bias in Facial Recognition Systems: Centroid Fairness Loss Optimization

Jean-Rémy Conti,Stéphan Clémençon

Main category: cs.CV

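TL;DR: Proposes a post-processing method that improves the fairness of pre-trained facial recognition models by optimizing a centroid-based regression loss while preserving global accuracy.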

Details

Motivation: Societal demand for fair AI systems is growing, and the disparity in error rates of facial recognition systems across population groups has drawn regulatory attention, so fairness needs to improve. Method: A post-processing approach that optimizes a centroid-based regression loss to adjust the output scores of pre-trained models. Result: Experiments show that the method significantly improves fairness while preserving global accuracy. Conclusion: The method provides an efficient and effective way to design fair facial recognition systems. Abstract: The urgent societal demand for fair AI systems has put pressure on the research community to develop predictive models that are not only globally accurate but also meet new fairness criteria, reflecting the lack of disparate mistreatment with respect to sensitive attributes (e.g. gender, ethnicity, age). In particular, the variability of the errors made by certain Facial Recognition (FR) systems across specific segments of the population compromises the deployment of the latter, and was judged unacceptable by regulatory authorities. Designing fair FR systems is a very challenging problem, mainly due to the complex and functional nature of the performance measure used in this domain (i.e. ROC curves) and because of the huge heterogeneity of the face image datasets usually available for training. In this paper, we propose a novel post-processing approach to improve the fairness of pre-trained FR models by optimizing a regression loss which acts on centroid-based scores. Beyond the computational advantages of the method, we present numerical experiments providing strong empirical evidence of the gain in fairness and of the ability to preserve global accuracy.

[62] HumMorph: Generalized Dynamic Human Neural Fields from Few Views

Jakub Zadrożny,Hakan Bilen

Main category: cs.CV

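TL;DR: HumMorph is a novel approach to free-viewpoint rendering of dynamic human bodies with explicit pose control that needs only a few observed views and supports fast inference.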

Details

Motivation: Existing methods usually rely on synchronized multi-camera setups to obtain accurate body parameters, whereas HumMorph targets the practical scenario of high-quality rendering from noisily estimated parameters. Method: Builds a coarse representation in the canonical T-pose and complements it with fine-grained pixel-aligned features that supply high-resolution appearance information. Result: Competitive with the state of the art given a single input view, with significantly better visual quality given two views, and more robust to noisy parameters. Conclusion: HumMorph performs better under noisy parameters and offers a more practical solution for free-viewpoint human rendering. Abstract: We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state of the art when only a single input view is available; however, we achieve results with significantly better visual quality given just 2 monocular observations. Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.

Shuo Wang,Weili Shi,Shuai Yang,Jiahao Cui,Qinwei Guo

Main category: cs.CV

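TL;DR: Proposes a dynamic arthroscopic navigation system based on a multi-level memory architecture for anterior cruciate ligament (ACL) reconstruction surgery, significantly improving the real-time performance and accuracy of surgical navigation.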

Details

Motivation: Traditional static matching methods perform poorly in complex surgical scenes (viewpoint changes, instrument occlusion, tissue deformation), so a dynamic, real-time navigation system is needed. Method: The system adopts the three-level architecture of the Atkinson-Shiffrin memory model (sensory memory, working memory, and long-term memory) to extend static image matching to dynamic video sequence tracking. Result: The system runs in real time on standard arthroscopic equipment (25.3 FPS, 39.5 ms latency) and reports accuracy improvements over the static system ranging from about 45% (long sequences) down to 19% (short sequences). Conclusion: The system overcomes the limitations of traditional static methods and provides more precise navigation support for ACL reconstruction surgery. Abstract: This paper presents a dynamic arthroscopic navigation system based on multi-level memory architecture for anterior cruciate ligament (ACL) reconstruction surgery. The system extends our previously proposed markerless navigation method from static image matching to dynamic video sequence tracking. By integrating the Atkinson-Shiffrin memory model's three-level architecture (sensory memory, working memory, and long-term memory), our system maintains continuous tracking of the femoral condyle throughout the surgical procedure, providing stable navigation support even in complex situations involving viewpoint changes, instrument occlusion, and tissue deformation. Unlike existing methods, our system operates in real-time on standard arthroscopic equipment without requiring additional tracking hardware, achieving 25.3 FPS with a latency of only 39.5 ms, representing a 3.5-fold improvement over our previous static system. For extended sequences (1000 frames), the dynamic system maintained an error of 5.3 ± 1.5 pixels, compared to the static system's 12.6 ± 3.7 pixels, an improvement of approximately 45 percent. For medium-length sequences (500 frames) and short sequences (100 frames), the system achieved approximately 35 percent and 19 percent accuracy improvements, respectively. Experimental results demonstrate the system overcomes limitations of traditional static matching methods, providing new technical support for improving surgical precision in ACL reconstruction.

[64] Boosting 3D Liver Shape Datasets with Diffusion Models and Implicit Neural Representations

Khoa Tuan Nguyen,Francesca Tozzi,Wouter Willaert,Joris Vankerschaver,Nikdokht Rashidian,Wesley De Neve

Main category: cs.CV

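TL;DR: Proposes combining diffusion models with implicit neural representations (INRs) to augment and expand existing 3D liver shape datasets, addressing data scarcity and disorganized datasets.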

Details

Motivation: Existing 3D medical shape datasets are disorganized and contain artifacts, which limits the development and training of robust models, especially for accurate 3D reconstruction. Method: Uses the generative capabilities of diffusion models, combined with implicit neural representations (INRs), to generate diverse and realistic 3D liver shapes that augment the dataset. Result: Experiments indicate that the method enhances dataset diversity and offers a scalable solution for 3D liver reconstruction and generation. Conclusion: Diffusion models are suitable for 3D liver shape generation and can also be applied to other downstream tasks in 3D medical imaging. Abstract: While the availability of open 3D medical shape datasets is increasing, offering substantial benefits to the research community, we have found that many of these datasets are, unfortunately, disorganized and contain artifacts. These issues limit the development and training of robust models, particularly for accurate 3D reconstruction tasks. In this paper, we examine the current state of available 3D liver shape datasets and propose a solution using diffusion models combined with implicit neural representations (INRs) to augment and expand existing datasets. Our approach utilizes the generative capabilities of diffusion models to create realistic, diverse 3D liver shapes, capturing a wide range of anatomical variations and addressing the problem of data scarcity. Experimental results indicate that our method enhances dataset diversity, providing a scalable solution to improve the accuracy and reliability of 3D liver reconstruction and generation in medical applications. Finally, we suggest that diffusion models can also be applied to other downstream tasks in 3D medical imaging.

[65] GMAR: Gradient-Driven Multi-Head Attention Rollout for Vision Transformer Interpretability

Sehyeong Jo,Gangjae Jang,Haesol Park

Main category: cs.CV

TL;DR: GMAR is a gradient-based method for scoring the importance of individual attention heads that improves the interpretability of Vision Transformers.

Details

Motivation: The multi-head attention mechanism of the Vision Transformer lacks interpretability, and existing methods do not distinguish the importance of different attention heads. Method: Proposes GMAR, which quantifies the importance of each attention head with gradient-based scores and normalizes them to produce a weighted attention score. Result: Experiments show that GMAR outperforms traditional attention rollout and explains each head's contribution more precisely. Conclusion: GMAR provides a practical framework for enhancing the interpretability of Vision Transformers. Abstract: The Vision Transformer (ViT) has made significant advancements in computer vision, utilizing self-attention mechanisms to achieve state-of-the-art performance across various tasks, including image classification, object detection, and segmentation. Its architectural flexibility and capabilities have made it a preferred choice among researchers and practitioners. However, the intricate multi-head attention mechanism of ViT presents significant challenges to interpretability, as the underlying prediction process remains opaque. A critical limitation arises from an observation commonly noted in transformer architectures: "Not all attention heads are equally meaningful." Overlooking the relative importance of specific heads highlights the limitations of existing interpretability methods. To address these challenges, we introduce Gradient-Driven Multi-Head Attention Rollout (GMAR), a novel method that quantifies the importance of each attention head using gradient-based scores. These scores are normalized to derive a weighted aggregate attention score, effectively capturing the relative contributions of individual heads. GMAR clarifies the role of each head in the prediction process, enabling more precise interpretability at the head level. Experimental results demonstrate that GMAR consistently outperforms traditional attention rollout techniques. This work provides a practical contribution to transformer-based architectures, establishing a robust framework for enhancing the interpretability of Vision Transformer models.
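The idea of weighting heads by normalized gradient scores before rolling attention out across layers can be illustrated with a small pure-Python sketch. It rests on assumptions not stated in the abstract: the L1 norm of each head's gradient is used as the score, scores are normalized to sum to one, and the residual connection is modeled as an equal mix with the identity, as in standard attention rollout. All function names are hypothetical.

```python
# Hypothetical sketch of gradient-weighted attention rollout.
# heads: list of n x n attention matrices (one per head) for one layer.
# grads: list of n x n gradient matrices with the same layout (assumed given).

def head_weights(head_grads):
    """Score each head by the L1 norm of its gradient, normalized to sum to 1."""
    norms = [sum(abs(g) for row in grad for g in row) for grad in head_grads]
    total = sum(norms)
    if total == 0:
        return [1.0 / len(norms)] * len(norms)  # fall back to a plain average
    return [n / total for n in norms]

def weighted_attention(heads, weights):
    """Combine per-head attention maps with the given head weights."""
    n = len(heads[0])
    return [[sum(w * h[i][j] for w, h in zip(weights, heads)) for j in range(n)]
            for i in range(n)]

def add_residual(attn):
    """Model the skip connection as 0.5 * attention + 0.5 * identity."""
    n = len(attn)
    return [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def weighted_rollout(layers_heads, layers_grads):
    """Multiply the head-weighted, residual-adjusted maps layer by layer."""
    n = len(layers_heads[0][0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads, grads in zip(layers_heads, layers_grads):
        layer_attn = add_residual(weighted_attention(heads, head_weights(grads)))
        rollout = matmul(layer_attn, rollout)
    return rollout
```

Setting every head weight to the same value recovers ordinary attention rollout, which makes the two easy to compare on the same model.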

[66] A Real-Time Event-Based Normal Flow Estimator

Dehao Yuan,Cornelia Fermüller

Main category: cs.CV

TL;DR: Presents a real-time, asynchronous, event-based normal flow estimator whose optimized implementation greatly reduces computational cost.

Details

Motivation: The original method has high computational complexity when processing event slices, which makes real-time prediction difficult, so a more efficient implementation is needed. Method: Treats event coordinates as integers and reformulates the representation step as a pooling operation, replacing the adjacency matrix multiplication of the original method and lowering computational complexity. Result: The optimized method produces 4 million normal flow predictions per second on an RTX 3070 and 6 million on an RTX A5000, using only 1 GB of CUDA memory. Conclusion: The method achieves real-time normal flow prediction with much better computational efficiency, and the CUDA implementation and Python interface are released as open source. Abstract: This paper presents a real-time, asynchronous, event-based normal flow estimator. It follows the same algorithm as Learning Normal Flow Directly From Event Neighborhoods, but with a more optimized implementation. The original method treats event slices as 3D point clouds, encodes each event's local geometry into a fixed-length vector, and uses a multi-layer perceptron to predict normal flow. It constructs representations by multiplying an adjacency matrix with a feature matrix, resulting in quadratic time complexity with respect to the number of events. In contrast, we leverage the fact that event coordinates are integers and reformulate the representation step as a pooling operation. This achieves the same effect as the adjacency matrix but with much lower computational cost. As a result, our method supports real-time normal flow prediction on event cameras. Our estimator uses 1 GB of CUDA memory and runs at 4 million normal flows per second on an RTX 3070, or 6 million per second on an RTX A5000. We release the CUDA implementation along with a Python interface at https://github.com/dhyuan99/VecKM_flow_cpp.
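Why integer coordinates let a pooling operation stand in for the adjacency matrix product can be seen in a toy sketch. The assumptions here are loud ones: events are reduced to 2D pixel coordinates (the time axis is dropped), each event carries a single scalar feature, the neighborhood is a square window, and the pooling is done with an integral image. None of these details are taken from the released implementation.

```python
# Hypothetical sketch: neighborhood feature sums computed two ways.
# events: list of integer (x, y) pixel coordinates; feats: one float per event.

def neighborhood_sum_adjacency(events, feats, radius):
    """O(N^2) reference: row i of (adjacency @ features), built pair by pair."""
    out = []
    for xi, yi in events:
        total = 0.0
        for (xj, yj), f in zip(events, feats):
            if abs(xi - xj) <= radius and abs(yi - yj) <= radius:
                total += f
        out.append(total)
    return out

def neighborhood_sum_pooling(events, feats, radius, width, height):
    """O(N + W*H): scatter features onto the pixel grid, then box-sum pool."""
    grid = [[0.0] * width for _ in range(height)]
    for (x, y), f in zip(events, feats):
        grid[y][x] += f  # integer coordinates index the grid directly
    # Integral image: integral[y][x] holds the sum of grid[0:y][0:x].
    integral = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            integral[y + 1][x + 1] = (grid[y][x] + integral[y][x + 1]
                                      + integral[y + 1][x] - integral[y][x])
    out = []
    for x, y in events:
        x1, y1 = max(0, x - radius), max(0, y - radius)
        x2, y2 = min(width, x + radius + 1), min(height, y + radius + 1)
        out.append(integral[y2][x2] - integral[y1][x2]
                   - integral[y2][x1] + integral[y1][x1])
    return out
```

Both functions return the same per-event sums, but the second never compares events pairwise, which is the kind of saving the abstract attributes to the pooling reformulation.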

[67] EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation

Zhe Dong,Yuzhe Sun,Tianzhu Liu,Wangmeng Zuo,Yanfeng Gu

Main category: cs.CV

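TL;DR: EarthMapper is an autoregressive framework for bidirectional translation between satellite images and maps that addresses cross-modal alignment through geographic coordinate embeddings and multi-scale feature alignment, and performs strongly on the CNSatMap dataset.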

Details

Motivation: Bidirectional translation between satellite images and maps has important applications in planning and disaster response, but the lack of pixel-wise alignment and the need for both high-level abstraction and high-quality synthesis are major challenges. Method: Proposes the EarthMapper framework, which combines geographic coordinate embeddings, a GJSA process, an SI mechanism, and a KPAG mechanism to achieve controllable bidirectional translation. Result: On CNSatMap and the New York dataset, EarthMapper outperforms existing methods in visual realism, semantic consistency, and structural fidelity, and supports zero-shot tasks. Conclusion: EarthMapper addresses the challenges of bidirectional translation and shows versatility and strong performance. Abstract: Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth's surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major challenges: first, the absence of precise pixel-wise alignment between the two modalities substantially complicates the translation process; second, it requires achieving both high-level abstraction of geographic features and high-quality visual synthesis, which further elevates the technical complexity. To address these limitations, we introduce EarthMapper, a novel autoregressive framework for controllable bidirectional satellite-map translation. EarthMapper employs geographic coordinate embeddings to anchor generation, ensuring region-specific adaptability, and leverages multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process to unify bidirectional translation in a single training cycle. A semantic infusion (SI) mechanism is introduced to enhance feature-level consistency, while a key point adaptive guidance (KPAG) mechanism is proposed to dynamically balance diversity and precision during inference. We further contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities, enabling robust benchmarking.
Extensive experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance, achieving significant improvements in visual realism, semantic consistency, and structural fidelity over state-of-the-art methods. Additionally, EarthMapper excels in zero-shot tasks like in-painting, out-painting and coordinate-conditional generation, underscoring its versatility.

Yejin Jeong,Donghun Lee

Main category: cs.CV

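TL;DR: Proposes a CLIP-based framework (CLIP-KOA) that combines image and text information with symmetry and consistency losses to improve the accuracy and reliability of knee osteoarthritis (KOA) grade prediction.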

Details

Motivation: Early diagnosis of KOA is crucial, but the existing KL grading system suffers from inter-observer variability and subjectivity, so a more consistent automated diagnostic method is needed. Method: Uses a CLIP framework that integrates image and text information and introduces a Symmetry Loss and a Consistency Loss to keep predictions consistent between original and flipped images. Result: CLIP-KOA reaches 71.86% accuracy on KOA severity prediction, a 2.36% improvement over the standard CLIP model. Conclusion: The study points to a new direction for data-driven medical prediction, improving the reliability of fine-grained diagnosis and exploring multimodal methods for medical image analysis. Abstract: Knee osteoarthritis (KOA) is a prevalent chronic musculoskeletal disorder worldwide, making early diagnosis crucial. Currently, the Kellgren and Lawrence (KL) grading system is widely used to assess KOA severity. However, its high inter-observer variability and subjectivity hinder diagnostic consistency. To address these limitations, automated diagnostic techniques using deep learning have been actively explored in recent years. In this study, we propose a CLIP-based framework (CLIP-KOA) to enhance the consistency and reliability of KOA grade prediction. To achieve this, we introduce a learning approach that integrates image and text information and incorporate Symmetry Loss and Consistency Loss to ensure prediction consistency between the original and flipped images. CLIP-KOA achieves state-of-the-art accuracy of 71.86% on the KOA severity prediction task, and ablation studies show that CLIP-KOA has a 2.36% improvement in accuracy over the standard CLIP model due to our contribution. This study shows a novel direction for data-driven medical prediction, not only to improve the reliability of fine-grained diagnosis but also to explore multimodal methods for medical image analysis. Our code is available at https://github.com/anonymized-link.
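One way to picture a loss that enforces prediction consistency between an image and its horizontal flip is a symmetric KL divergence between the two predicted grade distributions. The abstract does not give the formulas for its Symmetry Loss or Consistency Loss, so the sketch below is purely an assumed stand-in, with hypothetical function names, operating on per-grade logits.

```python
import math

# Hypothetical sketch: penalize disagreement between the grade distribution
# predicted for an image and the one predicted for its flipped copy.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flip_consistency_loss(logits_original, logits_flipped):
    """Symmetric KL between predictions for the original and flipped image."""
    p = softmax(logits_original)
    q = softmax(logits_flipped)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The loss is zero when both views yield the same distribution and grows as they diverge, so adding it to a classification loss would push the model toward flip-invariant grades.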

[69] Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition

Yuki Hirakawa,Ryotaro Shimizu

Main category: cs.CV

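TL;DR: Proposes Masked Language Prompting (MLP), a prompting strategy that masks selected words in a reference caption and uses a large language model to generate diverse yet semantically coherent completions, easing dataset construction for fashion style recognition.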

Details

Motivation: Building datasets for fashion style recognition is hard because style concepts are subjective and ambiguous, and existing methods struggle to balance visual diversity and style consistency. Method: The MLP strategy masks selected words in a reference caption and uses a large language model to generate diverse, semantically coherent completions, producing style-consistent and diverse images without fine-tuning. Result: Experiments on the FashionStyle14 dataset show that MLP outperforms class-name and caption-based baselines on style recognition. Conclusion: MLP improves fashion style recognition under limited supervision, validating its ability to generate style-consistent and diverse data. Abstract: Constructing datasets for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose Masked Language Prompting (MLP), a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.
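The masking half of this prompting strategy is simple enough to sketch. The version below assumes random word selection, a fixed mask ratio, a `[MASK]` token, and a made-up instruction template; the abstract only says that selected words are masked and a large language model fills them in, so every one of these specifics is hypothetical, and the language model call is deliberately left out.

```python
import random

# Hypothetical sketch: mask words in a reference caption and wrap the result
# in an instruction that a language model could complete.

def mask_caption(caption, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Replace a random subset of words with the mask token."""
    words = caption.split()
    if not words:
        return caption
    rng = random.Random(seed)  # seeded so the same caption masks reproducibly
    k = max(1, round(len(words) * mask_ratio))
    masked = set(rng.sample(range(len(words)), k))
    return " ".join(mask_token if i in masked else w for i, w in enumerate(words))

def build_prompt(masked_caption, style):
    """Compose an instruction asking for a style-consistent completion."""
    return (
        f"Fill in each [MASK] so the caption describes an outfit in the "
        f"'{style}' fashion style. Caption: {masked_caption}"
    )
```

Varying `seed` yields different masked positions for the same caption, which is one plausible way to obtain several distinct completions per reference image.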

[70] Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

Sonia Joseph,Praneet Suresh,Lorenz Hufe,Edward Stevinson,Robert Graham,Yash Vadi,Danilo Bzdok,Sebastian Lapuschkin,Lee Sharkey,Blake Aaron Richards

Main category: cs.CV

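TL;DR: Prisma is an open-source framework that aims to accelerate vision mechanistic interpretability research by providing a unified toolkit, pre-trained weights, and analysis tools, and it reports several new findings.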

Details

Motivation: Vision mechanistic interpretability research has progressed slowly because accessible frameworks and pre-trained weights are lacking, and Prisma aims to fill that gap. Method: Prisma provides a unified toolkit for 75+ vision and video transformers, supports training of sparse autoencoders (SAEs) and related models, and offers 80+ pre-trained SAE weights along with analysis tools. Result: The analysis finds that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs and that in some cases SAE reconstructions can decrease model loss. Conclusion: Prisma opens new directions for understanding vision model internals and lowers the barrier to entry in the field. Abstract: Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: https://github.com/Prisma-Multimodal/ViT-Prisma), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.

[71] CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

Weitao Feng,Hang Zhou,Jing Liao,Li Cheng,Wenbo Zhou

Main category: cs.CV

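TL;DR: Proposes CasaGPT, an indoor scene synthesis method based on cuboid decomposition that arranges cuboids with an autoregressive model, reducing object intersections and improving scene quality.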

Details

Motivation: Conventional methods use bounding boxes to determine the placement and scale of 3D objects, which limits quality. Cuboids are a simpler and more effective alternative that allows compact scene generation with fewer object intersections. Method: Uses an autoregressive model to arrange cuboids sequentially, applies rejection sampling to filter out scenes with collisions, and introduces the refined 3DFRONT-NC dataset with noise removed. Result: Outperforms existing methods on the 3D-FRONT and 3DFRONT-NC datasets and improves scene realism. Conclusion: CasaGPT offers an efficient, high-quality solution for 3D scene synthesis and a promising direction for future work. Abstract: We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.
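The rejection step (discard sampled scenes in which objects collide) can be sketched with a cuboid overlap test. The assumptions are explicit: cuboids are axis-aligned and stored as min/max corners, a scene is a list of objects that are each a list of cuboids, and any positive-volume overlap between cuboids of different objects counts as a collision. The actual representation and collision criterion used by the authors are not given in the abstract.

```python
# Hypothetical sketch: filter sampled scenes by cuboid-level collision checks.
# Cuboid format (assumed): (min_x, min_y, min_z, max_x, max_y, max_z).

def cuboids_overlap(a, b):
    """True if two axis-aligned cuboids share positive volume on every axis."""
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))

def scene_has_collision(scene):
    """Check every pair of cuboids that belong to different objects."""
    for i, obj_a in enumerate(scene):
        for obj_b in scene[i + 1:]:
            if any(cuboids_overlap(ca, cb) for ca in obj_a for cb in obj_b):
                return True
    return False

def rejection_filter(scenes):
    """Keep only the sampled scenes that contain no object collisions."""
    return [scene for scene in scenes if not scene_has_collision(scene)]
```

Because an object is a set of cuboids here, an L-shaped sofa and a table tucked into its corner would pass this test even though their bounding boxes intersect, which reflects the motivation the abstract gives for using cuboids instead of boxes.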

[72] Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

Yan Wang,Baoxiong Jia,Ziyu Zhu,Siyuan Huang

Main category: cs.CV

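TL;DR: MPEC is a new masked point-entity contrastive learning method for open-vocabulary 3D semantic segmentation that improves performance through 3D entity-language alignment and consistency across point cloud views.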

Details

Motivation: Open-vocabulary 3D scene understanding is pivotal for physical intelligence because it lets embodied agents interact dynamically in real-world environments. Method: MPEC combines 3D entity-language alignment with point-entity consistency across point cloud views to produce entity-specific feature representations. Result: Achieves state-of-the-art open-vocabulary 3D semantic segmentation on ScanNet and performs strongly in zero-shot scene understanding. Conclusion: MPEC shows the potential of the learned 3D features across a range of scene understanding tasks, with consistent performance gains. Abstract: Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec-3d.github.io/

[73] SynergyAmodal: Deocclude Anything with Text Control

Xinyang Li,Chengjie Yi,Jiawei Lai,Mingbao Lin,Yansong Qu,Shengchuan Zhang,Liujuan Cao

Main category: cs.CV

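TL;DR: SynergyAmodal is a framework built on a three-way data-human-model collaboration: it combines in-the-wild image data, human expertise, and generative priors to synthesize a high-quality deocclusion dataset, then trains a model with zero-shot generalization and text controllability.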

Details

Motivation: Address the scarcity of high-quality deocclusion data that balances diversity, plausibility, and fidelity. Method: (1) A self-supervised learning algorithm exploits in-the-wild image data; (2) human experts take part in iterative filtering and annotation; (3) a text-prompt-conditioned diffusion model is trained on the result. Result: The pipeline yields approximately 16K high-quality amodal data pairs, and the model shows zero-shot generalization and text controllability. Conclusion: SynergyAmodal effectively mitigates data scarcity and improves deocclusion performance. Abstract: Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.

[74] FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding

Rong Gao,Xin Liu,Zhuozhao Hu,Bohao Xing,Baiqiang Xia,Zitong Yu,Heikki Kälviäinen

Main category: cs.CV

TL;DR: The paper introduces FSAnno, a dataset for advancing the understanding of artistic sports such as figure skating, together with the FSBench benchmark.

Details

Motivation: Existing figure skating datasets mostly target single tasks and lack comprehensive annotations covering both technical and artistic performance; artistic sports remain comparatively under-studied. Method: The authors propose the FSAnno dataset and the FSBench benchmark, which include multimodal data and question-answer pairs and support tasks ranging from technical analysis to performance commentary. Result: Initial tests show that existing models are significantly limited in their understanding of artistic sports. Conclusion: FSBench is expected to become a key tool for evaluating and improving model understanding of figure skating. Abstract: Figure skating, known as the "Art on Ice," is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models' understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.

[75] LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning

Peijian Zeng,Feiyan Pang,Zhanbo Wang,Aimin Yang

Main category: cs.CV

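TL;DR: Proposes a mask-free industrial anomaly detection method that uses a dynamic reward function and a chain-of-thought reasoning framework to substantially improve detection accuracy.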

Details

Motivation: Traditional methods depend on large datasets and mask annotations, which are costly and hard to scale; existing vision-language models have similar problems. Industrial datasets are also severely class-imbalanced, with defect samples making up only a small share. Method: A dynamic reward function handles class imbalance, and Chain of Thought (CoT) reasoning combined with Group Relative Policy Optimization (GRPO) enables mask-free anomaly detection. Result: Accuracy improves by 36% on MVTec-AD and 16% on VisA, reaching state-of-the-art performance. Conclusion: The method lowers cost, improves scalability, and produces interpretable outputs for defect localization, advancing industrial anomaly detection. Abstract: Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8% and 11.1% of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing. Code to reproduce the experiment is available at https://github.com/LilaKen/LR-IAD.
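
To make the combination of an imbalance-aware reward and GRPO more concrete, here is a small sketch. The group-relative advantage (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation; the inverse-frequency `weighted_reward` is a hypothetical stand-in, since the abstract does not specify how the paper's reward prioritizes rare defects.

```python
from statistics import mean, pstdev
from typing import List


def weighted_reward(correct: bool, is_defect: bool, defect_fraction: float) -> float:
    """Hypothetical reward: a correct answer on the rarer class earns more."""
    if not correct:
        return 0.0
    class_fraction = defect_fraction if is_defect else 1.0 - defect_fraction
    return 1.0 / class_fraction  # inverse-frequency weight (assumption, not the paper's rule)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a defect fraction of 0.111 (the VisA figure quoted in the abstract), a correctly flagged defect receives a larger reward than a correctly passed normal sample, so it stands out more strongly after group normalization.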

[76] Adversarial Shallow Watermarking

Guobiao Li,Lei Tan,Yuliang Xue,Gaozhi Liu,Zhenxing Qian,Sheng Li,Xinpeng Zhang

Main category: cs.CV

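TL;DR: Proposes ASW, a new watermarking framework that resists unknown distortions using a shallow decoder and adversarial optimization, with no training, encoder, or noise layer.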

Details

Motivation: Existing watermarking methods based on deep neural networks are not robust enough against unknown distortions. Method: ASW uses a shallow decoder with random parameters that is insensitive to distortions, and adversarially optimizes the host image to produce the watermarked image. Result: ASW performs well under both known and unknown distortions and is more robust than existing methods on unknown ones. Conclusion: ASW is an efficient and simple solution for watermarking that resists unknown distortions. Abstract: Recent advances in digital watermarking make use of deep neural networks for message embedding and extraction. They typically follow the "encoder-noise layer-decoder"-based architecture. By deliberately establishing a differentiable noise layer to simulate the distortion of the watermarked signal, they jointly train the deep encoder and decoder to fit the noise layer to guarantee robustness. As a result, they are usually weak against unknown distortions that are not used in their training pipeline. In this paper, we propose a novel watermarking framework to resist unknown distortions, namely Adversarial Shallow Watermarking (ASW). ASW utilizes only a shallow decoder that is randomly parameterized and designed to be insensitive to distortions for watermarking extraction. During the watermark embedding, ASW freezes the shallow decoder and adversarially optimizes a host image until its updated version (i.e., the watermarked image) stably triggers the shallow decoder to output the watermark message. During the watermark extraction, it accurately recovers the message from the watermarked image by leveraging the insensitive nature of the shallow decoder against arbitrary distortions. Our ASW is training-free, encoder-free, and noise layer-free. Experiments indicate that the watermarked images created by ASW have strong robustness against various unknown distortions. Compared to the existing "encoder-noise layer-decoder" approaches, ASW achieves comparable results on known distortions and better robustness on unknown distortions.
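
The embedding loop described above (freeze a randomly parameterized decoder, then update the host image until the decoder emits the message) can be sketched in a few lines. This toy version assumes a single random linear layer as the "shallow decoder" and a hinge-style margin objective; both are illustrative choices, not the paper's architecture or loss, and the sketch places no imperceptibility constraint on the perturbation.

```python
import random
from typing import List


def make_decoder(num_bits: int, num_pixels: int, seed: int) -> List[List[float]]:
    """Random, never-trained linear decoder: one weight row per message bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_pixels)] for _ in range(num_bits)]


def decode(decoder: List[List[float]], image: List[float]) -> List[int]:
    """Read each bit as the sign of the corresponding decoder output."""
    return [1 if sum(w * x for w, x in zip(row, image)) > 0 else 0 for row in decoder]


def embed(decoder, host, bits, margin=1.0, lr=0.01, max_steps=2000):
    """Adjust the image (decoder frozen) until every bit is decoded with a margin."""
    image = list(host)
    targets = [1.0 if b else -1.0 for b in bits]
    for _ in range(max_steps):
        grad = [0.0] * len(image)
        violated = False
        for row, t in zip(decoder, targets):
            score = sum(w * x for w, x in zip(row, image))
            if t * score < margin:  # this bit is not yet decoded confidently
                violated = True
                for i, w in enumerate(row):
                    grad[i] -= t * w  # subgradient of the hinge loss w.r.t. the image
        if not violated:
            break
        image = [x - lr * g for x, g in zip(image, grad)]
    return image
```

The margin is what gives this toy decoder some tolerance to small perturbations of the watermarked image; the paper instead designs the decoder itself to be insensitive to distortions.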

[77] Point2Quad: Generating Quad Meshes from Point Clouds via Face Prediction

Zezeng Li,Zhihui Qi,Weimin Wang,Ziliang Wang,Junyi Duan,Na Lei

Main category: cs.CV

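TL;DR: Point2Quad is the first learning-based method for generating quad-only meshes from point clouds; it tackles the challenges of quad mesh generation by fusing pointwise and facewise features.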

Details

Motivation: Quad meshes are essential in geometric modeling and computational mechanics, but existing learning-based methods mainly target triangle meshes; quad mesh generation is less explored because it must satisfy coplanarity, convexity, and quad-only constraints. Method: Point2Quad generates candidates with k-NN while considering coplanarity and squareness, then uses two encoders to extract geometric and topological features combined with quad-specific characteristics, trains a classifier with a compound loss, and finally refines the result with quad-specific post-processing. Result: Extensive experiments on clean and noisy data show that Point2Quad outperforms baseline methods under comprehensive metrics. Conclusion: Point2Quad provides an effective learning-based approach to quad mesh generation and addresses limitations of existing techniques. Abstract: Quad meshes are essential in geometric modeling and computational mechanics. Although learning-based methods for triangle meshes demonstrate considerable advancements, quad mesh generation remains less explored due to the challenge of ensuring coplanarity, convexity, and quad-only meshes. In this paper, we present Point2Quad, the first learning-based method for quad-only mesh generation from point clouds. The key idea is learning to identify quad meshes with fused pointwise and facewise features. Specifically, Point2Quad begins with a k-NN-based candidate generation considering the coplanarity and squareness. Then, two encoders follow to extract geometric and topological features that address the challenge of quad-related constraints, especially by combining in-depth quadrilateral-specific characteristics. Subsequently, the extracted features are fused to train the classifier with a designed compound loss. The final results are derived after the refinement by a quad-specific post-processing. Extensive experiments on both clean and noisy data demonstrate the effectiveness and superiority of Point2Quad, compared to baseline methods under comprehensive metrics.
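
As a rough illustration of k-NN-based candidate generation with a coplanarity check, the sketch below forms four-point candidates from each point and its nearest neighbours and keeps those whose scalar triple product is near zero. The function names, the tolerance, and the omission of the squareness test and vertex ordering are all simplifying assumptions; the abstract does not give the paper's actual criteria.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn(points: List[Point], index: int, k: int) -> List[int]:
    """Indices of the k points closest to points[index] (brute force)."""
    others = [j for j in range(len(points)) if j != index]
    return sorted(others, key=lambda j: dist(points[index], points[j]))[:k]


def coplanarity_error(a: Point, b: Point, c: Point, d: Point) -> float:
    """|(b-a) . ((c-a) x (d-a))|: zero exactly when the four points are coplanar."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    cross = [v[1] * w[2] - v[2] * w[1], v[2] * w[0] - v[0] * w[2], v[0] * w[1] - v[1] * w[0]]
    return abs(sum(u[i] * cross[i] for i in range(3)))


def candidate_quads(points: List[Point], k: int = 3, tol: float = 1e-6) -> List[Tuple[int, ...]]:
    """Collect near-coplanar four-point candidates built from k-NN neighbourhoods."""
    quads = set()
    for i in range(len(points)):
        for trio in combinations(knn(points, i, k), 3):
            quad = tuple(sorted((i, *trio)))
            if coplanarity_error(*(points[j] for j in quad)) <= tol:
                quads.add(quad)
    return sorted(quads)
```

In a learned pipeline such candidates would then be scored by a classifier; here the geometric filter is the whole example.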

[78] Crowd Detection Using Very-Fine-Resolution Satellite Imagery

Tong Xiao,Qunming Wang,Ping Lu,Tenghai Huang,Xiaohua Tong,Peter M. Atkinson

Main category: cs.CV

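TL;DR: The paper proposes CrowdSat-Net, a point-based convolutional neural network for crowd detection in very-fine-resolution satellite imagery, and builds CrowdSat, the first dataset for this task.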

Details

Motivation: Existing crowd detection methods rely on ground or aerial imagery with limited spatio-temporal coverage, whereas very-fine-resolution satellite imagery opens up new opportunities. Method: The authors propose CrowdSat-Net, which contains two new modules: DCPAN (Dual-Context Progressive Attention Network) and HFGDU (High-Frequency Guided Deformable Upsampler). Result: On the CrowdSat dataset, CrowdSat-Net reaches an F1-score of 66.12% and a precision of 73.23%, outperforming the other methods. Conclusion: The work advances crowd detection through a new network architecture and a new dataset. Abstract: Accurate crowd detection (CD) is critical for public safety and historical pattern analysis, yet existing methods relying on ground and aerial imagery suffer from limited spatio-temporal coverage. The development of very-fine-resolution (VFR) satellite sensor imagery (e.g., ~0.3 m spatial resolution) provides unprecedented opportunities for large-scale crowd activity analysis, but it has never been considered for this task. To address this gap, we propose CrowdSat-Net, a novel point-based convolutional neural network, which features two innovative components: Dual-Context Progressive Attention Network (DCPAN) to improve feature representation of individuals by aggregating scene context and local individual characteristics, and High-Frequency Guided Deformable Upsampler (HFGDU) that recovers high-frequency information during upsampling through frequency-domain guided deformable convolutions. To validate the effectiveness of CrowdSat-Net, we developed CrowdSat, the first VFR satellite imagery dataset designed specifically for CD tasks, comprising over 120k manually labeled individuals from multi-source satellite platforms (Beijing-3N, Jilin-1 Gaofen-04A and Google Earth) across China. In the experiments, CrowdSat-Net was compared with five state-of-the-art point-based CD methods (originally designed for ground or aerial imagery) using CrowdSat and achieved the largest F1-score of 66.12% and Precision of 73.23%, surpassing the second-best method by 1.71% and 2.42%, respectively. Moreover, extensive ablation experiments validated the importance of the DCPAN and HFGDU modules. Cross-regional evaluation further demonstrated the spatial generalizability of CrowdSat-Net. This research advances CD capability by providing both a newly developed network architecture for CD and a pioneering benchmark dataset to facilitate future CD development.

[79] DEEMO: De-identity Multimodal Emotion Recognition and Reasoning

Deng Li,Bohao Xing,Xin Liu,Baiqiang Xia,Bihan Wen,Heikki Kälviäinen

Main category: cs.CV

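TL;DR: The paper proposes DEEMO, a task for emotion understanding from de-identified video and audio inputs, and builds the associated datasets and the DEEMO-LLaMA model, which performs strongly on privacy-preserving emotion recognition and reasoning.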

Details

Motivation: Existing emotion understanding methods rely on identity-sensitive information such as facial expressions and speech, which raises privacy concerns. DEEMO aims to enable emotion understanding from de-identified data and so protect privacy. Method: The authors define the DEEMO task, build datasets covering Non-Facial Body Language (NFBL) and identity-free cues (DEEMO-NFBL and DEEMO-MER), and develop the multimodal large language model DEEMO-LLaMA. Result: DEEMO-LLaMA achieves 74.49% accuracy and 74.45% F1-score on de-identified emotion recognition and significantly outperforms existing models on emotion reasoning. Conclusion: DEEMO advances privacy-preserving emotion understanding and offers a new direction for ethical AI and responsible affective computing. Abstract: Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.

[80] CE-NPBG: Connectivity Enhanced Neural Point-Based Graphics for Novel View Synthesis in Autonomous Driving Scenes

Mohammad Altillawi,Fengyi Shen,Liudi Yang,Sai Manoj Prakhya,Ziyuan Liu

Main category: cs.CV

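TL;DR: CE-NPBG is a new neural point-based method for novel view synthesis in large-scale autonomous driving scenes; it improves rendering quality and scalability through a connectivity graph that links geometry and appearance.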

Details

Motivation: Current point-based methods face scalability and rendering-quality limits on large 3D point cloud maps, mainly because of a visibility mismatch between geometry and appearance. Method: Using posed images and synchronized 3D point clouds, the method builds a connectivity graph between geometry and appearance and optimizes neural descriptors through joint adversarial and point rasterization training. Result: Rendering quality improves significantly and run-time performance is better, with high-quality rendering obtained from only a small subset of the point cloud. Conclusion: By jointly exploiting geometry and appearance, CE-NPBG offers an efficient, high-quality solution for novel view synthesis in large-scale scenes. Abstract: Current point-based approaches encounter limitations in scalability and rendering quality when using large 3D point cloud maps because using them directly for novel view synthesis (NVS) leads to degraded visualizations. We identify the primary issue behind these low-quality renderings as a visibility mismatch between geometry and appearance, stemming from using these two modalities together. To address this problem, we present CE-NPBG, a new approach for novel view synthesis (NVS) in large-scale autonomous driving scenes. Our method is a neural point-based technique that leverages two modalities: posed images (cameras) and synchronized raw 3D point clouds (LiDAR). We first employ a connectivity relationship graph between appearance and geometry, which retrieves points from a large 3D point cloud map observed from the current camera perspective and uses them for rendering. By leveraging this connectivity, our method significantly improves rendering quality and enhances run-time and scalability by using only a small subset of points from the large 3D point cloud map. Our approach associates neural descriptors with the points and uses them to synthesize views. To enhance the encoding of these descriptors and elevate rendering quality, we propose a joint adversarial and point rasterization training. During training, we pair an image-synthesizer network with a multi-resolution discriminator. At inference, we decouple them and use the image-synthesizer to generate novel views. We also integrate our proposal into the recent 3D Gaussian Splatting work to highlight its benefits for improved rendering and scalability.

[81] Category-Level and Open-Set Object Pose Estimation for Robotics

Peter Hönig,Matthias Hirschmanner,Markus Vincze

Main category: cs.CV

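TL;DR: This paper compares datasets, accuracy metrics, and algorithms for category-level 6D pose estimation and analyzes how to bridge it with open-set pose estimation to reach generalization.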

Details

Motivation: Address the challenges that arise in category-level and open-set object pose estimation when texture, shape, and size are unknown, in particular the ambiguity caused by object symmetries. Method: The paper compares datasets, accuracy metrics, and algorithms to analyze the state of category-level 6D pose estimation. Result: It analyzes how category-level and open-set pose estimation can be bridged to reach generalization. Conclusion: It provides actionable recommendations for further progress in category-level and open-set pose estimation. Abstract: Object pose estimation enables a variety of tasks in computer vision and robotics, including scene understanding and robotic grasping. The complexity of a pose estimation task depends on the unknown variables related to the target object. While instance-level methods already excel for opaque and Lambertian objects, category-level and open-set methods, where texture, shape, and size are partially or entirely unknown, still struggle with these basic material properties. Since texture is unknown in these scenarios, it cannot be used for disambiguating object symmetries, another core challenge of 6D object pose estimation. The complexity of estimating 6D poses with such a manifold of unknowns led to various datasets, accuracy metrics, and algorithmic solutions. This paper compares datasets, accuracy metrics, and algorithms for solving 6D pose estimation on the category-level. Based on this comparison, we analyze how to bridge category-level and open-set object pose estimation to reach generalization and provide actionable recommendations.

[82] DG-DETR: Toward Domain Generalized Detection Transformer

Seongmin Hwang,Daeyoung Han,Moongu Jeon

Main category: cs.CV

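TL;DR: DG-DETR is a simple and effective method for end-to-end Transformer detectors that improves cross-domain robustness through domain-agnostic query selection and wavelet decomposition.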

Details

Motivation: Existing domain generalization research focuses mainly on CNN-based detectors and has paid little attention to improving the robustness of DETRs. Method: The authors propose a domain-agnostic query selection strategy and a wavelet decomposition that separates domain-invariant from domain-specific features. Result: Experiments validate the effectiveness of DG-DETR for cross-domain detection. Conclusion: DG-DETR offers a simple, plug-and-play solution for domain generalization in DETRs. Abstract: End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.

[83] SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity

Chengzhi Wu,Yuxin Wan,Hao Fu,Julius Pfrommer,Zeyun Zhong,Junwei Zheng,Jiaming Zhang,Jürgen Beyerer

Main category: cs.CV

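TL;DR: Proposes SAMBLE, a method that learns shape-specific sampling strategies to balance edge detail against global uniformity and improve performance on downstream point cloud tasks.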

Details

Motivation: Existing learning-to-sample methods produce either unrecognizable sampling patterns or biased results, and they ignore the natural variation in point distribution across different shapes. Method: SAMBLE uses a sparse attention map and bin-based learning to learn shape-specific sampling strategies. Result: SAMBLE performs well on multiple downstream point cloud tasks, even when only a few points are sampled. Conclusion: Through shape-specific sampling strategies, SAMBLE balances local detail and global uniformity and improves point cloud processing performance. Abstract: Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either lead to unrecognizable sampling patterns by generating a new point cloud or biased sampled results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.

[84] ShowMak3r: Compositional TV Show Reconstruction

Sangmin Kim,Seunguk Do,Jaesik Park

Main category: cs.CV

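TL;DR: ShowMak3r is a dynamic radiance field reconstruction pipeline for reconstructing and editing scenes from entertainment videos such as TV shows; it addresses occlusion, cluttered stages, and shot changes.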

Details

Motivation: Reconstructing dynamic radiance fields from entertainment videos is challenging, mainly because of actor occlusions, diverse facial expressions, cluttered stages, and small-baseline views or sudden shot changes. Method: ShowMak3r comprises a 3DLocator module (locates actors and estimates poses), a ShotMatcher module (tracks actors across shot changes), and a network that dynamically recovers facial expressions. Result: On the Sitcoms3D dataset, ShowMak3r can reassemble scenes and supports applications such as synthetic shot-making and actor relocation. Conclusion: ShowMak3r provides an effective solution for dynamic scene reconstruction and editing in entertainment videos. Abstract: Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Many challenges make the reconstruction difficult due to (1) actors occluding with each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r

[85] Magnifier: A Multi-grained Neural Network-based Architecture for Burned Area Delineation

Daniele Rege Cambrin,Luca Colomba,Paolo Garza

Main category: cs.CV

TL;DR: The paper proposes Magnifier, a new method that uses a dual encoder (local and global) to improve image segmentation performance when data is limited.

Details

Motivation: In crisis management and remote sensing, image segmentation is crucial for disaster response, but data scarcity and the lack of benchmark datasets limit the training of neural network models. Method: Magnifier extends existing encoder-decoder architectures with a dual encoder (local and global) that extracts information at different granularities from the same input. Result: Magnifier improves average IoU by 2.65% with a limited increase in parameters, and matches or outperforms existing methods with less than half the GFLOPs. Conclusion: Magnifier significantly improves segmentation under data scarcity and offers an efficient solution for remote sensing image analysis. Abstract: In crisis management and remote sensing, image segmentation plays a crucial role, enabling tasks like disaster response and emergency planning by analyzing visual data. Neural networks are able to analyze satellite acquisitions and determine which areas were affected by a catastrophic event. The main obstacles to developing them in this context are data scarcity and the lack of extensive benchmark datasets, which limit the capacity to train large neural network models. In this paper, we propose a novel methodology, namely Magnifier, to improve segmentation performance with limited data availability. The Magnifier methodology is applicable to any existing encoder-decoder architecture, as it extends a model by merging information at different contextual levels through a dual-encoder approach: a local and global encoder. Magnifier analyzes the input data twice using the dual-encoder approach. In particular, the local and global encoders extract information from the same input at different granularities. This allows Magnifier to extract more information than the other approaches given the same set of input images. Magnifier improves the quality of the results by 2.65% on average IoU while leading to a restrained increase in terms of the number of trainable parameters compared to the original model. We evaluated our proposed approach with state-of-the-art burned area segmentation models, demonstrating, on average, comparable or better performance with less than half of the GFLOPs.

[86] Neural network task specialization via domain constraining

Roman Malashin,Daniil Ilyukhin

Main category: cs.CV

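TL;DR: The paper proposes specializing neural networks through task-specific domain constraining to improve performance on a particular data subspace. Experiments show that constraining the class label space alone can raise a generalist network's accuracy, without additional data or changes to the training regime.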

Details

Motivation: The study explores how specialization can improve a neural network's performance on specific tasks without adding data or changing the training regime. Method: It proposes a specialist extraction phase and achieves specialization by constraining the data space and modifying traditional fine-tuning methods. Result: Experiments show that specialization improves a generalist network's accuracy on image classification and object detection. Conclusion: The work lays a foundation for dynamically configurable image analysis systems and shows how certain data domains can be excluded from a generalist network's consideration. Abstract: This paper introduces a concept of neural network specialization via task-specific domain constraining, aimed at enhancing network performance on the data subspace in which the network operates. The study presents experiments on training specialists for image classification and object detection tasks. The results demonstrate that specialization can enhance a generalist's accuracy even without additional data or changing training regimes: solely by constraining the class label space in which the network performs. Theoretical and experimental analyses indicate that effective specialization requires modifying traditional fine-tuning methods and constraining data space to semantically coherent subsets. A specialist extraction phase before tuning the network is proposed for maximal performance gains. We also provide an analysis of the evolution of the feature space during specialization. This study paves the way for future research on more advanced dynamically configurable image analysis systems, where computations depend on the specific input. Additionally, the proposed methods can help improve system performance in scenarios where certain data domains should be excluded from consideration of the generalist network.

[87] Lightweight Adapter Learning for More Generalized Remote Sensing Change Detection

Dou Quan,Rufan Zhou,Shuang Wang,Ning Huyan,Dong Zhao,Yunan Li,Licheng Jiao

Main category: cs.CV

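TL;DR: This paper proposes CANet, a more general change detection network that uses dataset-shared and dataset-specific modules to address the poor generalization of existing methods.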

Details

Motivation: Existing deep learning methods perform well on remote sensing change detection, but models trained for one dataset generalize poorly to others. Method: CANet contains dataset-shared and dataset-specific modules, with a lightweight adapter, an interesting change region mask (ICM), and a separate batch normalization layer per dataset to handle differences in data distribution. Result: CANet performs well across multiple datasets, generalizes strongly, has low training cost (updating only 4.1%-7.7% of parameters), and performs better with limited training data. Conclusion: CANet is an efficient, general change detection method that can be flexibly integrated into existing deep models. Abstract: Deep learning methods have shown promising performance in remote sensing image change detection (CD). However, existing methods usually train a dataset-specific deep network for each dataset. Due to the significant differences in the data distribution and labeling between various datasets, the trained dataset-specific deep network generalizes poorly to other datasets. To solve this problem, this paper proposes a change adapter network (CANet) for a more universal and generalized CD. CANet contains dataset-shared and dataset-specific learning modules. The former explores the discriminative features of images, and the latter designs a lightweight adapter model, to deal with the characteristics of different datasets in data distribution and labeling. The lightweight adapter can quickly generalize the deep network for new CD tasks with a small computation cost. Specifically, this paper proposes an interesting change region mask (ICM) in the adapter, which can adaptively focus on change objects of interest and decrease the influence of labeling differences in various datasets. Moreover, CANet adopts a unique batch normalization layer for each dataset to deal with data distribution differences. Compared with existing deep learning methods, CANet can achieve satisfactory CD performance on various datasets simultaneously. Experimental results on several public datasets have verified the effectiveness and advantages of the proposed CANet on CD. CANet has a stronger generalization ability, smaller training costs (merely updating 4.1%-7.7% of parameters), and better performance under limited training datasets than other deep learning methods, and it can also be flexibly inserted into existing deep models.

[88] Image Generation Method Based on Heat Diffusion Models

Pengfei Zhang,Shouqing Jia

Main category: cs.CV

TL;DR: HDM improves on DDPM by incorporating the discrete form of the two-dimensional heat equation, generating higher-quality images.

Details

Motivation: DDPM generates high-quality images but does not fully exploit relationships between pixels; HDM aims to improve detail preservation and realism through pixel-level operations. Method: HDM integrates the discrete form of the two-dimensional heat equation into DDPM's diffusion and generation formulas so that relationships between neighboring pixels are computed. Result: Experiments show that HDM outperforms models such as DDPM, CDM, LDM, and VQGAN in image quality. Conclusion: HDM improves image generation quality through pixel-level operations and offers a new idea for diffusion models. Abstract: Denoising Diffusion Probabilistic Models (DDPMs) achieve high-quality image generation without adversarial training, but they process images as a whole. Since adjacent pixels are highly likely to belong to the same object, we propose the Heat Diffusion Model (HDM) to further preserve image details and generate more realistic images. HDM is a model that incorporates pixel-level operations while maintaining the same training process as DDPM. In HDM, the discrete form of the two-dimensional heat equation is integrated into the diffusion and generation formulas of DDPM, enabling the model to compute relationships between neighboring pixels during image processing. Our experiments demonstrate that HDM can generate higher-quality samples compared to models such as DDPM, Consistency Diffusion Models (CDM), Latent Diffusion Models (LDM), and Vector Quantized Generative Adversarial Networks (VQGAN).
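
For readers unfamiliar with the discrete two-dimensional heat equation mentioned above, the sketch below shows one explicit finite-difference step, in which each pixel moves toward the mean of its four neighbours. How HDM couples this update with the DDPM forward and reverse formulas is not stated in the abstract, so the example covers only the heat step itself; the replicated-edge boundary handling and the value of `alpha` are assumptions.

```python
from typing import List

Image = List[List[float]]


def heat_step(img: Image, alpha: float = 0.2) -> Image:
    """One explicit step of the 2D heat equation on a pixel grid.

    u[i][j] += alpha * (up + down + left + right - 4 * u[i][j])
    The explicit scheme is stable for alpha <= 0.25.
    """
    h, w = len(img), len(img[0])

    def at(i: int, j: int) -> float:
        # Replicate edge pixels so no heat leaves the image (assumed boundary rule).
        return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    return [
        [
            img[i][j]
            + alpha * (at(i + 1, j) + at(i - 1, j) + at(i, j + 1) + at(i, j - 1) - 4 * img[i][j])
            for j in range(w)
        ]
        for i in range(h)
    ]
```

Applied repeatedly, the step smooths an image while keeping the total intensity constant under the replicated-edge boundary, which is the sense in which neighbouring pixels influence one another.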

[89] DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer

Junpeng Jiang,Gangyi Hong,Miao Zhang,Hengtong Hu,Kun Zhan,Rui Shao,Liqiang Nie

Main category: cs.CV

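TL;DR: DiVE is a diffusion transformer-based generative framework that produces high-quality, temporally coherent, and cross-view consistent multi-view videos, addressing the quality problems of existing generative models in driving scenes.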

Details

Motivation: Collecting multi-view driving scene videos is costly and difficult, and existing generative models suffer from poor quality and weak spatiotemporal consistency, which limits their use in perception tasks. Method: DiVE adopts a diffusion transformer framework, combines a unified cross-attention and a SketchFormer to control multimodal data, and introduces a parameter-free view-inflated attention mechanism to ensure cross-view consistency. Result: On the nuScenes dataset, DiVE achieves state-of-the-art multi-view video generation with high fidelity and spatiotemporal consistency. Conclusion: DiVE addresses the computational latency of high-resolution video synthesis and substantially improves generation speed and performance. Abstract: Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driving scenarios. To address this gap, we propose DiVE, a diffusion transformer-based generative framework meticulously engineered to produce high-fidelity, temporally coherent, and cross-view consistent multi-view videos, aligning seamlessly with bird's-eye view layouts and textual descriptions. DiVE leverages a unified cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. Despite these advancements, synthesizing high-resolution videos under multimodal constraints introduces dual challenges: investigating the optimal classifier-free guidance configuration under intricate multi-condition inputs and mitigating excessive computational latency in high-resolution rendering, both of which remain underexplored in prior research. To resolve these limitations, we introduce two innovations: Multi-Control Auxiliary Branch Distillation, which streamlines multi-condition CFG selection while circumventing high computational overhead, and Resolution Progressive Sampling, a training-free acceleration strategy that staggers resolution scaling to reduce high latency due to high resolution. These innovations collectively achieve a 2.62x speedup with minimal quality degradation. Evaluated on the nuScenes dataset, DiVE achieves SOTA performance in multi-view video generation, yielding photorealistic outputs with exceptional temporal and cross-view coherence.

[90] NSegment: Noisy Segment Improves Remote Sensing Image Segmentation

Yechan Kim,DongHo Yoon,SooYeon Kim,Moongu Jeon

Main category: cs.CV

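TL;DR: NSegment is a simple and effective data augmentation method that addresses annotation errors in remote sensing image segmentation by applying elastic transformations to the segmentation labels only.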

Details

Motivation: Remote sensing image segmentation datasets contain implicit, subtle labeling errors and annotated data is scarce, while conventional remedies are complex and time-consuming. Method: NSegment applies elastic transformations only to the segmentation labels and varies the deformation intensity per sample. Result: Experiments show the method improves the remote sensing segmentation performance of various state-of-the-art models. Conclusion: NSegment is a simple, efficient solution that effectively mitigates annotation inconsistency. Abstract: Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data due to high image acquisition and labeling costs complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address this issue, they tend to increase training time and add implementation complexity. In this letter, we propose NSegment, a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation on various state-of-the-art models.
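The label-only augmentation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the box-blur smoothing, the `alpha_choices` values, and the function names are all hypothetical, and a real pipeline would more likely use a Gaussian-smoothed displacement field on arrays.

```python
import random

def box_blur(field, radius):
    """Smooth a 2D list with a mean filter, replicating values at the borders."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += field[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out

def deform_label(label, alpha, radius=2, rng=random):
    """Warp a label map with a smooth random displacement field of strength alpha."""
    h, w = len(label), len(label[0])
    dx = box_blur([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    dy = box_blur([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)], radius)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            sy = min(max(int(round(y + alpha * dy[y][x])), 0), h - 1)
            sx = min(max(int(round(x + alpha * dx[y][x])), 0), w - 1)
            row.append(label[sy][sx])  # nearest-neighbour lookup keeps class ids valid
        out.append(row)
    return out

def augment(image, label, alpha_choices=(1.0, 5.0, 15.0), rng=random):
    """Leave the image untouched; deform only the label with a per-sample strength."""
    alpha = rng.choice(alpha_choices)  # hypothetical intensity set, drawn per sample
    return image, deform_label(label, alpha, rng=rng)
```

Calling `augment` once per sample in every epoch gives each label a fresh deformation, while the image itself is returned unchanged.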

[91] Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval

Junlong Ren,Gangjian Zhang,Yu Hu,Jian Shu,Hao Wang

Main category: cs.CV

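TL;DR: The paper proposes a new framework for partially relevant video retrieval (PRVR) that improves retrieval by capturing the cross-modal dual nature of the task: inter-sample correlation and intra-sample redundancy.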

Details

Motivation: In PRVR, text queries and video content are semantically asymmetric, and existing methods fail to fully exploit inter-sample correlation and intra-sample redundancy. Method: The framework has three core modules: ICE (Inter Correlation Enhancement), IRM (Intra Redundancy Mining), and TCP (Temporal Coherence Prediction). Result: Experiments on three datasets show the method outperforms existing approaches and achieves state-of-the-art performance. Conclusion: By systematically exploiting the cross-modal dual nature of the task, the framework markedly improves PRVR performance. Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.

Pin-Chi Pan,Soo-Chang Pei

Main category: cs.CV

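TL;DR: The BARIS-ERA framework improves underwater instance segmentation with a boundary-aware decoder and an environmental robust adapter, clearly outperforming Mask R-CNN.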

Details

Motivation: Underwater visual conditions such as light attenuation, scattering, and color distortion degrade model performance, so segmentation accuracy needs improving. Method: Proposes BARIS-Decoder (a boundary-aware refinement decoder) and ERA (Environmental Robust Adapter), which cuts trainable parameters by over 90%. Result: BARIS-ERA surpasses Mask R-CNN by 3.4 mAP with Swin-B and 3.8 mAP with ConvNeXt V2. Conclusion: BARIS-ERA offers an efficient and robust solution for underwater instance segmentation. Abstract: Underwater instance segmentation is challenging due to adverse visual conditions such as light attenuation, scattering, and color distortion, which degrade model performance. In this work, we propose BARIS-Decoder (Boundary-Aware Refinement Decoder for Instance Segmentation), a framework that enhances segmentation accuracy through feature refinement. To address underwater degradations, we introduce the Environmental Robust Adapter (ERA), which efficiently models underwater degradation patterns while reducing trainable parameters by over 90% compared to full fine-tuning. The integration of BARIS-Decoder with ERA-tuning, referred to as BARIS-ERA, achieves state-of-the-art performance, surpassing Mask R-CNN by 3.4 mAP with a Swin-B backbone and 3.8 mAP with ConvNeXt V2. Our findings demonstrate the effectiveness of BARIS-ERA in advancing underwater instance segmentation, providing a robust and efficient solution.

[93] xEdgeFace: Efficient Cross-Spectral Face Recognition for Edge Devices

Anjith George,Sebastien Marcel

Main category: cs.CV

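TL;DR: This paper proposes a lightweight heterogeneous face recognition (HFR) framework built on a hybrid CNN-Transformer architecture, suited to resource-constrained edge devices and outperforming existing methods in both accuracy and computational efficiency.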

Details

Motivation: HFR must match face images across sensing modalities, and existing methods are too computation-heavy to deploy on resource-constrained devices. Method: A lightweight hybrid CNN-Transformer architecture that supports efficient end-to-end training with only a small amount of paired heterogeneous data. Result: Strong performance on multiple HFR and face recognition benchmarks at low computational cost. Conclusion: The method is an efficient, high-performing solution for both heterogeneous and homogeneous scenarios. Abstract: Heterogeneous Face Recognition (HFR) addresses the challenge of matching face images across different sensing modalities, such as thermal to visible or near-infrared to visible, expanding the applicability of face recognition systems in real-world, unconstrained environments. While recent HFR methods have shown promising results, many rely on computation-intensive architectures, limiting their practicality for deployment on resource-constrained edge devices. In this work, we present a lightweight yet effective HFR framework by adapting a hybrid CNN-Transformer architecture originally designed for face recognition. Our approach enables efficient end-to-end training with minimal paired heterogeneous data while preserving strong performance on standard RGB face recognition tasks. This makes it a compelling solution for both homogeneous and heterogeneous scenarios. Extensive experiments across multiple challenging HFR and face recognition benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches while maintaining a low computational overhead.

[94] Explaining Vision GNNs: A Semantic and Visual Analysis of Graph-based Image Classification

Nikolaos Chaidos,Angeliki Dimitriou,Nikolaos Spanos,Athanasios Voulodimos,Giorgos Stamou

Main category: cs.CV

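TL;DR: This paper studies the semantic consistency of graph structures in GNN-based image classifiers, analyzes their explainability, and compares explanations under standard and adversarial settings.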

Details

Motivation: GNNs are efficient on vision tasks, but their explainability remains underexplored. The paper analyzes the semantic consistency and spatial coherence of the graphs across GNN layers to assess how explainable the models are. Method: Quantifies how well inter-layer graph connections reflect semantic similarity and spatial coherence, and analyzes explainability with heatmap visualizations. Result: The models' decision processes can be explained effectively, but their reasoning in deeper layers does not fully align with human perception. Conclusion: Explanations differ between standard and adversarial settings, and the mismatch between deep-layer reasoning and human perception calls for further study. Abstract: Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers' robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models' explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.

[95] ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery

Anush Lakshman Sivaraman,Kojo Adu-Gyamfi,Ibne Farabi Shihab,Anuj Sharma

Main category: cs.CV

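TL;DR: The paper proposes a framework combining generative domain adaptation with efficient contrastive learning to improve weather classification on low-quality traffic camera images, especially at night.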

Details

Motivation: Address the challenge of weather classification on low-quality traffic camera images, especially at night. Method: Uses CycleGAN domain translation to improve image quality, combined with SigLIP-2 contrastive learning to improve classification. Result: The best nighttime accuracy is 85.90% (SigLIP-2 with CycleGAN), and the highest overall accuracy is 97.01% (EVA-02 with CycleGAN). Conclusion: Combining domain adaptation with efficient contrastive learning yields a practical weather classification system. Abstract: Accurate weather classification from low-quality traffic camera imagery remains a challenging task, particularly under adverse nighttime conditions. In this study, we propose a scalable framework that combines generative domain adaptation with efficient contrastive learning to enhance classification performance. Using CycleGAN-based domain translation, we improve the quality of nighttime images, enabling better feature extraction by downstream models. While the baseline EVA-02 model employing CLIP-based contrastive loss achieves an overall accuracy of 96.55%, it exhibits a significant performance gap between daytime (97.21%) and nighttime conditions (63.40%). Replacing CLIP with the lightweight SigLIP-2 (Sigmoid contrastive loss) achieves a competitive overall accuracy of 94.00%, with substantial improvements in nighttime performance (85.90% accuracy). The combination of Vision-SigLIP-2, Text-SigLIP-2, CycleGAN, and contrastive training achieves the best nighttime accuracy (85.90%) among all models tested, while EVA-02 with CycleGAN maintains the highest overall accuracy (97.01%) and per-class accuracies. These findings demonstrate the potential of combining domain adaptation and efficient contrastive learning to build practical, resource-efficient weather classification systems for intelligent transportation infrastructure.
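For readers unfamiliar with the sigmoid contrastive loss mentioned in this abstract, the sketch below shows the pairwise form used by the SigLIP family: every image-text pair is scored independently with a sigmoid rather than normalized with a softmax over the batch. The embeddings, the `temperature` and `bias` defaults, and the function names are illustrative assumptions; this is not the training code of the paper.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sigmoid_contrastive_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: matched pairs (i == j) are positives, all others negatives."""
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            logit = temperature * dot(img_emb[i], txt_emb[j]) + bias
            total += log_sigmoid(z * logit)
    return -total / n
```

Because each pair contributes its own binary term, the loss needs no batch-wide normalization, which is one reason it is described as lightweight.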

[96] Prompt Guiding Multi-Scale Adaptive Sparse Representation-driven Network for Low-Dose CT MAR

Baoshun Shi,Bing Chen,Shaolei Zhang,Huazhu Fu,Zhanli Hu

Main category: cs.CV

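TL;DR: Proposes PMSRNet, a multi-scale adaptive sparse representation-driven network for simultaneous low-dose CT reconstruction and metal artifact reduction (LDMAR), addressing existing methods' weak use of multi-scale information and their model storage cost.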

Details

Motivation: Low-dose CT (LDCT) reduces radiation but degrades image quality and can produce metal artifacts. Existing deep learning methods are limited in their use of multi-scale information and in model storage. Method: PMSRNet builds on multi-scale sparsifying frames and captures multi-scale information with the PSATG and MSFuM modules; the PDuMSRNet framework uses a prompt guiding strategy so that a single model adapts to multiple dose levels. Result: Experiments show the method outperforms existing LDMAR methods across dose levels. Conclusion: PMSRNet and PDuMSRNet perform well on LDMAR and resolve the multi-scale information and model storage issues. Abstract: Low-dose CT (LDCT) is capable of reducing X-ray radiation exposure, but it can degrade image quality and can even yield metal artifacts in the presence of metallic implants. For simultaneous LDCT reconstruction and metal artifact reduction (LDMAR), existing deep learning-based efforts face two main limitations: i) the network design neglects multi-scale and within-scale information; ii) training a distinct model for each dose necessitates significant storage space for multiple doses. To fill these gaps, we propose a prompt guiding multi-scale adaptive sparse representation-driven network, abbreviated as PMSRNet, for the LDMAR task. Specifically, we construct PMSRNet inspired by multi-scale sparsifying frames, and it can simultaneously employ within-scale characteristics and cross-scale complementarity owing to an elaborated prompt guiding scale-adaptive threshold generator (PSATG) and a built multi-scale coefficient fusion module (MSFuM). The PSATG can adaptively capture multiple contextual information to generate more faithful thresholds, achieved by fusing features from local, regional, and global levels. Furthermore, we elaborate a model-interpretable dual-domain LDMAR framework called PDuMSRNet, and train a single model with a prompt guiding strategy for multiple dose levels. We build a prompt guiding module, whose input contains dose level, metal mask and input instance, to provide various guiding information, allowing a single model to accommodate various CT dose settings. Extensive experiments at various dose levels demonstrate that the proposed methods outperform the state-of-the-art LDMAR methods.

[97] SubGrapher: Visual Fingerprinting of Chemical Structures

Lucas Morin,Gerhard Ingmar Meijer,Valéry Weber,Luc Van Gool,Peter W. J. Staar

Main category: cs.CV

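TL;DR: SubGrapher extracts molecular fingerprints directly from chemical structure images and outperforms conventional OCSR models in retrieval performance and robustness.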

Details

Motivation: Automatic extraction of chemical structures from the scientific literature is key to accelerating research, yet molecular information in patent documents is hard to reach through text search. Method: SubGrapher uses learning-based instance segmentation to identify functional groups and carbon backbones and builds a substructure-based fingerprint. Result: SubGrapher beats existing OCSR and fingerprinting methods in retrieval performance and robustness. Conclusion: SubGrapher is an efficient solution for chemical structure retrieval; the data and code will be released. Abstract: Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code will be made publicly available.
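To make the idea of retrieval by substructure fingerprint concrete, here is a small sketch. The abstract does not say how SubGrapher encodes or compares fingerprints, so the count-based fingerprint, the Tanimoto similarity on counts, and the substructure label strings are all assumptions chosen for illustration.

```python
from collections import Counter

def fingerprint(detections):
    """Count detected substructures; `detections` is a list of label strings."""
    return Counter(detections)

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity on counts: shared occurrences over total occurrences."""
    keys = set(fp_a) | set(fp_b)
    inter = sum(min(fp_a[k], fp_b[k]) for k in keys)
    union = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return inter / union if union else 0.0

def retrieve(query, database):
    """Rank database entries (name -> fingerprint) by similarity to the query."""
    return sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
```

A segmentation model would supply the `detections` list for each image; retrieval then reduces to comparing small count vectors and never requires reconstructing the full molecular graph.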

[98] Open-set Anomaly Segmentation in Complex Scenarios

Song Xia,Yi Yu,Henghui Ding,Wenhan Yang,Shifei Liu,Alex C. Kot,Xudong Jiang

Main category: cs.CV

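TL;DR: This paper introduces ComsAmy, a new benchmark for evaluating anomaly segmentation models under complex weather conditions, and proposes DiffEEL, which combines energy-entropy learning with a diffusion-based synthesizer and markedly improves model performance.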

Details

Motivation: Existing anomaly segmentation benchmarks under-evaluate complex weather conditions, so models may perform poorly in real open-world scenes and pose safety risks. Method: Proposes the ComsAmy benchmark and the DiffEEL method, which combines energy-entropy learning with a diffusion-based synthesizer to make anomaly segmentation more robust. Result: On public benchmarks and ComsAmy, DiffEEL improves AUPRC by 4.96% and FPR95 by 9.87% on average. Conclusion: DiffEEL is an effective plug-and-play method that markedly improves anomaly segmentation in complex open-world environments. Abstract: Precise segmentation of out-of-distribution (OoD) objects, herein referred to as anomalies, is crucial for the reliable deployment of semantic segmentation models in open-set, safety-critical applications, such as autonomous driving. Current anomalous segmentation benchmarks predominantly focus on favorable weather conditions, resulting in untrustworthy evaluations that overlook the risks posed by diverse meteorological conditions in open-set environments, such as low illumination, dense fog, and heavy rain. To bridge this gap, this paper introduces ComsAmy, a challenging benchmark specifically designed for open-set anomaly segmentation in complex scenarios. ComsAmy encompasses a wide spectrum of adverse weather conditions, dynamic driving environments, and diverse anomaly types to comprehensively evaluate the model performance in realistic open-world scenarios. Our extensive evaluation of several state-of-the-art anomalous segmentation models reveals that existing methods demonstrate significant deficiencies in such challenging scenarios, highlighting their serious safety risks for real-world deployment. To address this, we propose a novel energy-entropy learning (EEL) strategy that integrates the complementary information from energy and entropy to bolster the robustness of anomaly segmentation under complex open-world environments. Additionally, a diffusion-based anomalous training data synthesizer is proposed to generate diverse and high-quality anomalous images to enhance the existing copy-paste training data synthesizer. Extensive experimental results on both public and ComsAmy benchmarks demonstrate that our proposed diffusion-based synthesizer with energy and entropy learning (DiffEEL) serves as an effective and generalizable plug-and-play method to enhance existing models, yielding an average improvement of around 4.96% in AUPRC and 9.87% in FPR95.
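The two quantities that the EEL strategy draws on, and the FPR95 metric used to report results, can be computed from per-pixel logits as sketched below. These are the standard textbook definitions (energy as the negative log-sum-exp of the logits, entropy of the softmax); how the paper combines them during training is not described in the abstract, so nothing here should be read as the DiffEEL objective. Function names are illustrative.

```python
import math

def logsumexp(logits):
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def energy_score(logits):
    """Free energy of the logits; confident in-distribution pixels tend to score lower."""
    return -logsumexp(logits)

def entropy_score(logits):
    """Shannon entropy of the softmax distribution over classes."""
    lse = logsumexp(logits)
    probs = [math.exp(v - lse) for v in logits]
    return -sum(p * math.log(p) for p in probs if p > 0)

def fpr_at_95_tpr(anomaly_scores, inlier_scores):
    """False positive rate at the threshold that still detects 95% of anomalies.

    Higher scores are assumed to mean "more anomalous".
    """
    ranked = sorted(anomaly_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))
    threshold = ranked[k - 1]
    false_pos = sum(1 for s in inlier_scores if s >= threshold)
    return false_pos / len(inlier_scores)
```

Energy depends on the overall magnitude of the logits while entropy depends only on their relative spread, which is one way to read the abstract's claim that the two signals are complementary.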

[99] A computer vision method to estimate ventilation rate of Atlantic salmon in sea fish farms

Lukas Folkman,Quynh LK Vo,Colin Johnston,Bela Stantic,Kylie A Pitt

Main category: cs.CV

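TL;DR: Develops a computer vision method for monitoring the ventilation rate of Atlantic salmon that works in the production environment of commercial sea fish farms.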

Details

Motivation: Existing intelligent monitoring methods are mostly confined to laboratory settings and have not been shown to work in real sea fish farms, so methods that monitor physiological traits directly are needed. Method: A fish head detection model and a convolutional neural network classify the mouth state, and multiple object tracking is used to estimate ventilation rate. Result: On an independent test set, predicted and ground-truth ventilation rates have a Pearson correlation of 0.82, and pens with fish in respiratory distress are identified accurately. Conclusion: The method is broadly applicable and could transform fish health and welfare monitoring. Abstract: The increasing demand for aquaculture production necessitates the development of innovative, intelligent tools to effectively monitor and manage fish health and welfare. While non-invasive video monitoring has become a common practice in finfish aquaculture, existing intelligent monitoring methods predominantly focus on assessing body condition or fish swimming patterns and are often developed and evaluated in controlled tank environments, without demonstrating their applicability to real-world aquaculture settings in open sea farms. This underscores the necessity for methods that can monitor physiological traits directly within the production environment of sea fish farms. To this end, we have developed a computer vision method for monitoring ventilation rates of Atlantic salmon (Salmo salar), which was specifically designed for videos recorded in the production environment of commercial sea fish farms using the existing infrastructure. Our approach uses a fish head detection model, which classifies the mouth state as either open or closed using a convolutional neural network. This is followed by multiple object tracking to create temporal sequences of fish swimming across the field of view of the underwater video camera to estimate ventilation rates. The method demonstrated high efficiency, achieving a Pearson correlation coefficient of 0.82 between ground truth and predicted ventilation rates in a test set of 100 fish collected independently of the training data. By accurately identifying pens where fish exhibit signs of respiratory distress, our method offers broad applicability and the potential to transform fish health and welfare monitoring in finfish aquaculture.
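The last step of this pipeline, turning a tracked fish's per-frame mouth states into a ventilation rate and then scoring predictions against ground truth, might look like the sketch below. Counting closed-to-open transitions as one ventilation each is an assumption made for illustration; the abstract does not specify the counting rule, and the function names are hypothetical.

```python
import math

def ventilation_rate(mouth_open, fps):
    """Ventilations per minute from a per-frame open (True) / closed (False) track."""
    if len(mouth_open) < 2:
        return 0.0
    openings = sum(
        1 for prev, cur in zip(mouth_open, mouth_open[1:]) if not prev and cur
    )
    duration_min = len(mouth_open) / fps / 60.0
    return openings / duration_min

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In practice the classifier output would be noisy, so some temporal smoothing of `mouth_open` before counting transitions would likely be needed; that step is omitted here.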

[100] The ATLAS of Traffic Lights: A Reliable Perception Framework for Autonomous Driving

Rupert Polley,Nikolai Polley,Dominik Heid,Marc Heinrich,Sven Ochs,J. Marius Zöllner

Main category: cs.CV

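TL;DR: Proposes a modular traffic light perception framework that combines state-of-the-art detection models with a real-time association and decision system, and releases the ATLAS dataset to improve performance.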

Details

Motivation: Address the gaps in existing public datasets' annotations of traffic light states and pictograms, and improve the safety of autonomous vehicles navigating complex urban environments. Method: Integrates state-of-the-art detection models with a real-time association and decision framework, trained and evaluated on the newly released ATLAS dataset. Result: Markedly better accuracy and robustness of traffic light detection on ATLAS, with the framework's reliability validated in real autonomous driving scenarios. Conclusion: The framework and dataset improve the real-time performance and accuracy of traffic light perception and are suitable for real-world deployment. Abstract: Traffic light perception is an essential component of the camera-based perception system for autonomous vehicles, enabling accurate detection and interpretation of traffic lights to ensure safe navigation through complex urban environments. In this work, we propose a modularized perception framework that integrates state-of-the-art detection models with a novel real-time association and decision framework, enabling seamless deployment into an autonomous driving stack. To address the limitations of existing public datasets, we introduce the ATLAS dataset, which provides comprehensive annotations of traffic light states and pictograms across diverse environmental conditions and camera setups. This dataset is publicly available at https://url.fzi.de/ATLAS. We train and evaluate several state-of-the-art traffic light detection architectures on ATLAS, demonstrating significant performance improvements in both accuracy and robustness. Finally, we evaluate the framework in real-world scenarios by deploying it in an autonomous vehicle to make decisions at traffic light-controlled intersections, highlighting its reliability and effectiveness for real-time operation.

[101] RepText: Rendering Visual Text via Replicating

Haofan Wang,Yujia Xu,Yimeng Li,Junchen Li,Chaowei Zhang,Jing Wang,Kejia Yang,Zhibo Chen

Main category: cs.CV

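TL;DR: RepText adapts pre-trained monolingual text-to-image models so that they can accurately render multilingual visual text without understanding its content.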

Details

Motivation: Current text-to-image models render multilingual text poorly, especially non-Latin scripts. Method: Adopts the ControlNet setting, adds language-agnostic glyph and position information, trains with a text perceptual loss alongside the diffusion loss, and at inference initializes from a noisy glyph latent and restricts feature injection with region masks. Result: RepText outperforms open-source methods in experiments and is comparable to closed-source multilingual models. Conclusion: RepText effectively improves multilingual text rendering, though limitations remain. Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, but not a necessary condition. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to really understand them. Specifically, we adopt the setting from ControlNet and additionally integrate language-agnostic glyph and position of rendered text to enable generating harmonized visual text, allowing users to customize text content, font and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase, we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict the feature injection to only the text region to avoid distortion of the background. We conducted extensive experiments to verify the effectiveness of our RepText relative to existing works; our approach outperforms existing open-source methods and achieves comparable results to native multi-language closed-source models. For fairness, we also exhaustively discuss its limitations in the end.

[102] Measuring Train Driver Performance as Key to Approval of Driverless Trains

Rustam Tagiew,Prasannavenkatesh Balaji

Main category: cs.CV

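TL;DR: The paper discusses a simplified route to safety approval of computer vision systems and provides a new dataset to fill the gap in obstacle detection performance data.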

Details

Motivation: With little published obstacle detection performance data, it is hard to quantify obstacle detection performance for computer vision systems in driverless trains; the paper aims to fill this gap. Method: Collects 711 train driver performance measurements from controlled experiments, including reaction time and distance to the obstacle, and releases them as a public, anonymized dataset. Result: A detailed dataset covering different speeds, obstacle sizes, train protection systems, and color contrasts. Conclusion: The dataset offers an unbiased, exhaustive description for research, standardization, and regulation, supporting the development of driverless train technology. Abstract: Points 2.1.4(b), 2.4.2(b) and 2.4.3(b) in Annex I of Implementing Regulation (EU) No. 402/2013 allow a simplified approach for the safety approval of computer vision systems for driverless trains, if they have 'similar' functions and interfaces as the replaced human driver. The human driver is not replaced one-to-one by a technical system - only a limited set of cognitive functions are replaced. However, performance in the most challenging function, obstacle detection, is difficult to quantify due to the deficiency of published measurement results. This article summarizes the data published so far. This article also goes a long way to remedy this situation by providing a new public and anonymized dataset of 711 train driver performance measurements from controlled experiments. The measurements are made for different speeds, obstacle sizes, train protection systems and obstacle color contrasts. The measured values are reaction time and distance to the obstacle. The goal of this paper is an unbiased and exhaustive description of the presented dataset for research, standardization and regulation. Further project-related information including the dataset and source code is available at https://atosense-02371c.usercontent.opencode.de/

[103] CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis

Abhishek Kuriyal,Elliot Vincent,Mathieu Aubry,Loic Landrieu

Main category: cs.CV

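TL;DR: The paper proposes a new domain generalization framework for satellite imagery that trains one expert model per training domain and learns expert similarity to improve test-time performance.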

Details

Motivation: Global variation in terrain appearance degrades satellite image analysis models at test time, and existing methods struggle with this. Method: Trains one expert per training domain, learns expert similarity and encourages similar experts to be consistent, and aggregates predictions through a model selection module. Result: Outperforms existing domain generalization and adaptation methods on four datasets (DynamicEarthNet, MUDS, OSCD, FMoW). Conclusion: The framework effectively addresses domain generalization for satellite images; the code is open source. Abstract: Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts' similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods. Our code is publicly available at https://github.com/Abhishek19009/CoDEx.
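The select-then-aggregate step described in this abstract can be illustrated with a top-k, softmax-weighted average of expert predictions. This is only one plausible reading: the abstract does not state how the selection module scores experts or how predictions are combined, so the `scores` input, the `top_k` parameter, and the weighting rule are assumptions.

```python
import math

def aggregate_experts(scores, expert_probs, top_k=2):
    """Average the class probabilities of the top_k best-scoring experts.

    scores[i] is a suitability score of expert i for the test sample;
    expert_probs[i] is that expert's class-probability vector.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]  # softmax over selected experts
    z = sum(weights)
    n_classes = len(expert_probs[0])
    return [
        sum(w / z * expert_probs[i][c] for w, i in zip(weights, top))
        for c in range(n_classes)
    ]
```

With `top_k=1` this reduces to picking a single expert; larger values trade specialization for robustness when no training domain matches the test location closely.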

[104] Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model

Muzammil Behzad,Guoying Zhao

Main category: cs.CV

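TL;DR: AffectVLM is a vision-language model that uses multi-view integration, a joint representation learning framework, and a novel gradient-friendly loss to make facial emotion understanding semantically richer and visually more comprehensive.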

Details

Motivation: Achieve more comprehensive and semantically rich facial emotion understanding from 3D/4D data. Method: Proposes a joint representation learning framework and a gradient-friendly loss function, combined with augmented textual prompts and mixed view augmentation. Result: Strong performance on multiple benchmarks. Conclusion: Through multi-view integration and joint learning, AffectVLM markedly improves facial emotion understanding. Abstract: In this paper, we introduce AffectVLM, a vision-language model designed to integrate multiple views for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To effectively capture visual features, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representation. Additionally, we introduce augmented textual prompts to enhance the model's linguistic capabilities and employ mixed view augmentation to expand the visual dataset. We also develop a Streamlit app for real-time interactive inference and enable the model for distributed learning. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.

[105] EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia

Valerie Zermatten,Javiera Castillo-Navarro,Pallavi Jain,Devis Tuia,Diego Marcos

Main category: cs.CV

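TL;DR: The paper proposes predicting ecological properties by aligning remote sensing images with species habitat descriptions, and introduces the EcoWikiRS dataset. It handles weak supervision with the WINCEL loss and performs well on zero-shot ecosystem classification.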

Details

Motivation: Predict ecological properties directly from remote sensing images, using species habitat descriptions as the supervision signal and tackling weak supervision in ecology. Method: Proposes WINCEL (a weighted InfoNCE loss) and trains remote sensing vision-language models (RS-VLMs) on the EcoWikiRS dataset, which contains high-resolution aerial images, species observations, and habitat text descriptions. Result: On EUNIS-based zero-shot ecosystem classification, the model performs well and improves the ecological interpretation of remote sensing images. Conclusion: The method is a scalable weakly supervised approach to ecological property prediction; the code and dataset are open source. Abstract: The presence of species provides key insights into the ecological properties of a location such as land cover, climatic conditions or even soil properties. We propose a method to predict such ecological properties directly from remote sensing (RS) images by aligning them with species habitat descriptions. We introduce the EcoWikiRS dataset, consisting of high-resolution aerial images, the corresponding geolocated species observations, and, for each species, the textual descriptions of their habitat from Wikipedia. EcoWikiRS offers a scalable way of supervision for RS vision language models (RS-VLMs) for ecology. This is a setting with weak and noisy supervision, where, for instance, some text may describe properties that are specific only to part of the species' niche or are irrelevant to a specific image. We tackle this by proposing WINCEL, a weighted version of the InfoNCE loss. We evaluate our model on the task of ecosystem zero-shot classification by following the habitat definitions from the European Nature Information System (EUNIS). Our results show that our approach helps in understanding RS images in a more ecologically meaningful manner. The code and the dataset are available at https://github.com/eceo-epfl/EcoWikiRS.
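As a rough picture of what a weighted InfoNCE loss looks like, the sketch below scales each sample's standard InfoNCE term by a weight and normalizes by the weight sum. How WINCEL actually derives its weights is not given in the abstract, so here they are simply an input; the `temperature` default and function name are likewise assumptions.

```python
import math

def weighted_info_nce(sim, weights, temperature=0.07):
    """Weighted InfoNCE over a similarity matrix with positives on the diagonal.

    sim[i][j] is the similarity between image i and text j;
    weights[i] down-weights pairs believed to be noisy or irrelevant.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += weights[i] * (log_z - logits[i])  # -log softmax of the positive
    return total / sum(weights)
```

Setting all weights to 1 recovers plain InfoNCE, so the weighting only changes how much each noisy image-text pair contributes to the gradient.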

[106] STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction

Zhimin Liao,Ping Wei,Shuaijia Chen,Haoxuan Wang,Ziyang Ren

Main category: cs.CV

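TL;DR: The paper proposes a new explicit state-based modeling method that uses occupancy state information to refine 3D features, combining a sparse occlusion-aware attention mechanism with a cascade refinement strategy to improve dynamic 3D scene representation.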

Details

Motivation: Existing implicit learning-based methods struggle to capture local details and weaken spatial discrimination, so a more efficient way to model 3D features is needed. Method: Explicit state-based modeling with sparse occlusion-aware attention and cascade refinement, plus a long-term dynamic interaction model that lowers computational cost. Result: Outperforms existing methods on RayIoU and mAVE while reducing training GPU memory to 8.7GB. Conclusion: Explicit state-based modeling is efficient and performs strongly for 3D scene representation. Abstract: 3D occupancy and scene flow offer a detailed and dynamic representation of a 3D scene. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel method for modeling long-term dynamic interactions, which reduces computational costs and preserves spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and mAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to 8.7GB. Our code is available at https://github.com/lzzzzzm/STCOcc

[107] Hybrid Approach Combining Ultrasound and Blood Test Analysis with a Voting Classifier for Accurate Liver Fibrosis and Cirrhosis Assessment

Kapil Kashyap,Sean Fargose,Chrisil Dabre,Fatema Dolaria,Nilesh Patil,Aniket Kore

Main category: cs.CV

TL;DR: Proposes a hybrid model combining machine learning with clinical data to improve detection of liver fibrosis and cirrhosis, reaching 92.5% accuracy.

Details

Motivation: Conventional diagnosis by liver biopsy is invasive and unsuitable for routine screening. Method: Combines fixed blood test probabilities with predictions from a deep learning model (DenseNet-201) on ultrasound images. Result: The hybrid model reaches 92.5% accuracy. Conclusion: The model improves diagnostic accuracy and supports early intervention in liver disease. Abstract: Liver cirrhosis is an insidious condition involving the substitution of normal liver tissue with fibrous scar tissue and causing major health complications. The conventional method of diagnosis using liver biopsy is invasive and, therefore, inconvenient for use in regular screening. In this paper, we present a hybrid model that combines machine learning techniques with clinical data and ultrasound scans to improve liver fibrosis and cirrhosis detection accuracy. The model integrates fixed blood test probabilities with deep learning model predictions (DenseNet-201) for ultrasonic images. The combined hybrid model achieved an accuracy of 92.5%. The findings establish the viability of the combined model in enhancing diagnosis accuracy and supporting early intervention in liver disease care.
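One simple way to combine a blood test probability vector with an image model's prediction is weighted soft voting, sketched below. The abstract does not give the voting rule or the weights, so the equal default weighting, the class names, and the function name are assumptions made for illustration.

```python
def soft_vote(p_blood, p_ultrasound, w_blood=0.5):
    """Weighted average of two class-probability dicts; returns (label, combined)."""
    combined = {
        c: w_blood * p_blood[c] + (1.0 - w_blood) * p_ultrasound[c]
        for c in p_blood
    }
    label = max(combined, key=combined.get)
    return label, combined
```

For example, a blood test leaning towards fibrosis and an ultrasound model leaning towards cirrhosis are resolved by whichever class has the larger averaged probability, and `w_blood` lets one source be trusted more than the other.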

[108] Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

Hoang Chuong Nguyen,Wei Mao,Jose M. Alvarez,Miaomiao Liu

Main category: cs.CV

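TL;DR: Proposes a new method that models continuous camera motion as time-dependent angular velocity and velocity, reducing reliance on precomputed camera poses and performing better in challenging scenarios.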

Details

Motivation: NeRF needs accurate precomputed camera poses; existing methods rely on good initial poses or depth priors and struggle in challenging cases such as large rotations. Method: Learns continuous camera motion through a time-dependent NeRF, first learning relative motion between cameras and then aggregating it into a world coordinate system, after which the NeRF is fine-tuned to represent the full scene geometry. Result: Superior camera pose and depth estimation on Co3D and Scannet, with comparable novel-view synthesis. Conclusion: Modeling continuous camera motion reduces dependence on external priors and works well in complex scenes. Abstract: Neural Radiance Fields (NeRF) has demonstrated its superior capability to represent 3D geometry but requires accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and Scannet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at https://github.com/HoangChuongNguyen/cope-nerf.
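The idea of recovering camera poses by integrating angular and linear velocity can be sketched with a first-order (Euler) scheme: each step converts the angular velocity into a small rotation with Rodrigues' formula and composes it onto the running pose. This is a generic kinematics illustration under stated assumptions (body-frame velocities, fixed step `dt`, identity pose at the first frame), not the paper's learned, time-dependent model.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula: rotation matrix for the rotation vector w (radians)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    K2 = matmul(K, K)
    if theta < 1e-12:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1-cos(t))/t^2
    else:
        a, b = math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta ** 2
    return [[IDENTITY[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]

def integrate_poses(omegas, velocities, dt):
    """Accumulate per-step velocities into world poses (R, t), starting at identity."""
    R, t = IDENTITY, [0.0, 0.0, 0.0]
    poses = [(R, t)]
    for w, v in zip(omegas, velocities):
        step = matvec(R, [c * dt for c in v])        # translate in the current frame
        t = [t[i] + step[i] for i in range(3)]
        R = matmul(R, exp_so3([c * dt for c in w]))  # then apply the relative rotation
        poses.append((R, t))
    return poses
```

Because every pose is built from relative motions, the world frame is simply whichever time step is chosen as the identity, which matches the abstract's description of aggregating relative motions up to a single reference frame.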

[109] Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Mohamed Hassan,Mohammad Wasil,Sebastian Houben

Main category: cs.CV

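TL;DR: The paper proposes two parameterized cropping methods that make self-labeling in contrastive learning more robust, significantly improving accuracy on CIFAR-10 classification.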

Details

Motivation: Random cropping can produce views that drift semantically from the original image, yielding false labels and degrading contrastive learning. Method: Introduces two parameterized cropping methods that make self-labeling more robust. Result: On CIFAR-10 classification, model accuracy improves by 2.7% to 12.4%. Conclusion: Parameterized cropping effectively improves contrastive learning performance. Abstract: Contrastive learning (CL) approaches have gained great recognition as a very successful subset of self-supervised learning (SSL) methods. SSL enables learning from unlabeled data, a crucial step in the advancement of deep learning, particularly in computer vision (CV), given the plethora of unlabeled image data. CL works by comparing different random augmentations (e.g., different crops) of the same image, thus achieving self-labeling. Nevertheless, randomly augmenting images and especially random cropping can result in an image that is semantically very distant from the original and therefore leads to false labeling, hence undermining the efficacy of the methods. In this research, two novel parameterized cropping methods are introduced that increase the robustness of self-labeling and consequently increase the efficacy. The results show that the use of these methods significantly improves the accuracy of the model by between 2.7% and 12.4% on the downstream task of classifying CIFAR-10, depending on the crop size compared to that of the non-parameterized random cropping method.

[110] HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination

Zhiming Hu,Daniel Haeufle,Syn Schmitt,Andreas Bulling

Main category: cs.CV

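TL;DR: HOIGaze is a new learning-based method for gaze estimation during hand-object interactions (HOI) in extended reality (XR). By exploiting coordinated eye, hand, and head movements, HOIGaze effectively denoises the training data and significantly improves performance.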

Details

Motivation: Conventional gaze estimation methods treat all training samples as equally important, whereas HOIGaze uses eye-hand-head coordination to denoise the training data more effectively. Method: 1) a hierarchical framework that recognises the visually attended hand; 2) cross-modal Transformers that fuse head and hand-object features; 3) an eye-head coordination loss that prioritises training samples from coordinated movements. Result: On the HOT3D and ADT datasets, HOIGaze improves performance by 15.6% and 6.0% on average, respectively, and performs well on eye-based activity recognition. Conclusion: HOIGaze shows how much information eye-hand-head coordination carries and opens a new direction for learning-based gaze estimation. Abstract: We present HOIGaze - a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: The eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training - as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.

[111] AnimateAnywhere: Rouse the Background in Human Image Animation

Xiaoyu Liu,Mingshuai Yao,Yabo Zhang,Xianhui Lin,Peiran Ren,Xiaoming Li,Ming Liu,Wangmeng Zuo

Main category: cs.CV

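TL;DR: The AnimateAnywhere framework uses a background motion learner (BML) to learn background motion from human pose sequences, generating human animations with vivid backgrounds without requiring camera trajectories.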

Details

Motivation: Existing methods neglect background generation, leading to static or inharmonious results, and preparing camera trajectories is impractical. Method: Introduces a BML to learn background motion and applies an epipolar constraint on the 3D attention map to improve accuracy. Result: Experiments show that AnimateAnywhere effectively learns background motion and generates realistic animations. Conclusion: The method produces animations in which background and human motion are harmonious, without requiring camera trajectories. Abstract: Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.

[112] SRMF: A Data Augmentation and Multimodal Fusion Approach for Long-Tail UHR Satellite Image Segmentation

Yulong Guo,Zilun Zhang,Yongheng Shang,Tiancheng Zhao,Shuiguang Deng,Yingchun Yang,Jianwei Yin

Main category: cs.CV

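TL;DR: The paper proposes the SRMF framework, which tackles the long-tail problem in UHR satellite image semantic segmentation through multi-scale cropping and data augmentation, and is the first to fuse text and visual features to improve performance.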

Details

Motivation: To address the long-tail problem in UHR satellite image semantic segmentation, which most existing methods overlook. Method: Multi-scale cropping and a data augmentation strategy, combined with multimodal fusion of text and visual features. Result: mIoU improves by 3.33%, 0.66%, and 0.98% on the URUR, GID, and FBP datasets, respectively. Conclusion: SRMF effectively alleviates the long-tail problem and reaches SOTA performance. Abstract: The long-tail problem presents a significant challenge to the advancement of semantic segmentation in ultra-high-resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi-branch network architectures that emphasize multi-scale feature extraction and fusion, they have often overlooked the importance of addressing the long-tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long-tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long-tail class distribution by incorporating a multi-scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion-based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33%, 0.66%, and 0.98%, respectively, achieving state-of-the-art performance. Code is available at: https://github.com/BinSpa/SRMF.git.

[113] Foundation Model-Driven Framework for Human-Object Interaction Prediction with Segmentation Mask Integration

Juhan Park,Kyungjae Lee,Hyung Jin Chang,Jungchan Cho

Main category: cs.CV

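TL;DR: Seg2HOI is a new framework that combines segmentation-based vision foundation models with the human-object interaction task, enhancing traditional HOI detection by introducing quadruplets that include segmentation masks.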

Details

Motivation: Traditional HOI methods are detection-based and lack segmentation information. Seg2HOI aims to bring in the strengths of segmentation models to improve flexibility and performance on HOI tasks. Method: Seg2HOI inherits the promptable and interactive mechanisms of the vision foundation model and applies them to the HOI task through a decoder, without additional training for these properties. Result: Performance on public datasets is comparable to SOTA methods, zero-shot scenarios are supported, and HOI quadruplets can be generated from text and visual prompts not seen during training. Conclusion: Seg2HOI demonstrates the efficiency and flexibility of segmentation models for HOI tasks across a wide range of scenarios. Abstract: In this work, we introduce the Segmentation to Human-Object Interaction (Seg2HOI) approach, a novel framework that integrates segmentation-based vision foundation models with the human-object interaction task, distinguished from traditional detection-based Human-Object Interaction (HOI) methods. Our approach enhances HOI detection by not only predicting the standard triplets but also introducing quadruplets, which extend HOI triplets by including segmentation masks for human-object pairs. More specifically, Seg2HOI inherits the properties of the vision foundation model (e.g., promptable and interactive mechanisms) and incorporates a decoder that applies these attributes to the HOI task. Despite training only for HOI, without additional training mechanisms for these properties, the framework demonstrates that such features still operate efficiently. Extensive experiments on two public benchmark datasets demonstrate that Seg2HOI achieves performance comparable to state-of-the-art methods, even in zero-shot scenarios. Lastly, we propose that Seg2HOI can generate HOI quadruplets and interactive HOI segmentation from novel text and visual prompts that were not used during training, making it versatile for a wide range of applications by leveraging this flexibility.

[114] CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

Chenhan Jiang,Yihan Zeng,Hang Xu,Dit-Yan Yeung

Main category: cs.CV

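TL;DR: Proposes TCSD, a new text-to-3D generation method that incorporates feedback from multimodal large language models (MLLMs) to address the shortcomings of existing SDS methods in semantic fidelity and multi-object interactions.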

Details

Motivation: Existing SDS methods perform poorly on multi-object interactions and semantic fidelity, and view-independent biases accumulate during optimization, degrading text-3D alignment. Method: Proposes the TCSD objective, which uses the cross-modal understanding of MLLMs to assess and guide text-3D alignment; develops 3DLLaVA-CRITIC for multiview alignment evaluation; introduces an LLM-layout initialization to accelerate optimization. Result: Achieves state-of-the-art performance on multiple benchmarks (such as T$^3$Bench and the TIFA subset); qualitative results show better textual consistency and semantic interactions. Conclusion: TCSD is the first to bring MLLMs into SDS optimization and markedly improves semantic fidelity and alignment in text-to-3D generation. Abstract: Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS's inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed as Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during the optimization. We further develop 3DLLaVA-CRITIC - a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Comprehensive evaluations demonstrate that our framework, CoherenDream, establishes state-of-the-art performance in text-aligned 3D generation across multiple benchmarks, including T$^3$Bench and TIFA subset. Qualitative results showcase the superior performance of CoherenDream in preserving textual consistency and semantic interactions. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.

[115] Towards Ball Spin and Trajectory Analysis in Table Tennis Broadcast Videos via Physically Grounded Synthetic-to-Real Transfer

Daniel Kienzle,Robin Schön,Rainer Lienhart,Shin'Ichi Satoh

Main category: cs.CV

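TL;DR: The paper proposes a new method for inferring the initial spin and 3D trajectory of a table tennis ball from monocular broadcast video, training a neural network on synthetic data only and generalizing without any real data.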

Details

Motivation: Analyzing a table tennis player's technique requires the ball's 3D trajectory and spin, but spin cannot be observed directly in standard broadcast video. Method: Trains a neural network on synthetic data, relying on a physically correct input data representation and targeted augmentations, to infer the 3D trajectory and spin from the 2D trajectory. Result: Reaches 92.0% accuracy in spin classification and a 2D reprojection error of 0.19% of the image diagonal. Conclusion: The authors report the first method to predict spin and trajectory from monocular broadcast video, generalizing to real data with synthetic training data alone. Abstract: Analyzing a player's technique in table tennis requires knowledge of the ball's 3D trajectory and spin. While the spin is not directly observable in standard broadcasting videos, we show that it can be inferred from the ball's trajectory in the video. We present a novel method to infer the initial spin and 3D trajectory from the corresponding 2D trajectory in a video. Without ground truth labels for broadcast videos, we train a neural network solely on synthetic data. Due to the choice of our input data representation, physically correct synthetic training data, and using targeted augmentations, the network naturally generalizes to real data. Notably, these simple techniques are sufficient to achieve generalization. No real data at all is required for training. To the best of our knowledge, we are the first to present a method for spin and trajectory prediction in simple monocular broadcast videos, achieving an accuracy of 92.0% in spin classification and a 2D reprojection error of 0.19% of the image diagonal.
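
The reported metric, 2D reprojection error as a percentage of the image diagonal, can be computed as in the sketch below. This is one plausible formulation: the mean Euclidean pixel distance between reprojected and reference 2D points, normalised by the diagonal. The paper's exact averaging scheme is not given in the abstract, and the function name is hypothetical.

```python
import math


def reprojection_error_pct(pred_pts, ref_pts, width, height):
    """Mean 2D point distance as a percentage of the image diagonal."""
    if len(pred_pts) != len(ref_pts) or not pred_pts:
        raise ValueError("need two non-empty point lists of equal length")
    diagonal = math.hypot(width, height)
    mean_dist = sum(
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(pred_pts, ref_pts)
    ) / len(pred_pts)
    return 100.0 * mean_dist / diagonal
```

Normalising by the diagonal makes the figure comparable across resolutions. For a 1920x1080 frame the diagonal is about 2203 pixels, so 0.19% corresponds to roughly 4 pixels.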

[116] DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images

Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Abdenour Hadid

Main category: cs.CV

TL;DR: DeeCLIP is a new framework based on CLIP-ViT and fusion learning for detecting AI-generated images, with improved robustness and generalization.

Details

Motivation: Existing detection methods struggle to generalize across generative models and are sensitive to minor perturbations; DeeCLIP aims to address both issues. Method: Uses the DeeFuser module to fuse high-level and low-level features, a triplet loss to refine the embedding space, and LoRA for parameter-efficient fine-tuning. Result: Reaches an average accuracy of 89.00% on 19 test subsets, outperforming existing methods. Conclusion: DeeCLIP performs well at detecting AI-generated images, and the code is publicly available. Abstract: This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
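
The triplet loss mentioned in the abstract has a standard form: it pulls an anchor embedding towards a positive (same class, e.g. real vs. real) and pushes it away from a negative (other class) by at least a margin. The sketch below shows that generic form with Euclidean distance. The margin value and the distance choice are assumptions, since the abstract does not state DeeCLIP's settings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that still violate the margin contribute gradient during training.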

[117] Using Fixed and Mobile Eye Tracking to Understand How Visitors View Art in a Museum: A Study at the Bowes Museum, County Durham, UK

Claire Warwick,Andrew Beresford,Soazig Casteau,Hubert P. H. Shum,Dan Smith,Francis Xiatian Zhang

Main category: cs.CV

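TL;DR: The study uses fixed and mobile eye tracking to analyze how museum visitors view art, with the aim of improving exhibition design and visitor engagement.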

Details

Motivation: To understand how museum visitors view art in a physical gallery, in order to improve exhibition design and visitor engagement. Method: Fixed and mobile eye tracking, drawing on an interdisciplinary team (digital humanities, psychology, art history, and computer science). Result: The findings will inform recommendations to the museum on displaying its collections more effectively. Conclusion: Eye tracking provides an evidence base for exhibition design and may improve visitor engagement. Abstract: The following paper describes a collaborative project involving researchers at Durham University, and professionals at the Bowes Museum, Barnard Castle, County Durham, UK, during which we used fixed and mobile eye tracking to understand how visitors view art. Our study took place during summer 2024 and builds on work presented at DH2017 (Bailey-Ross et al., 2017). Our interdisciplinary team included researchers from digital humanities, psychology, art history and computer science, working in collaboration with professionals from the museum. We used fixed and mobile eye tracking to understand how museum visitors view art in a physical gallery setting. This research will enable us to make recommendations about how the Museum's collections could be more effectively displayed, encouraging visitors to engage with them more fully.

[118] Federated Out-of-Distribution Generalization: A Causal Augmentation View

Runhui Zhang,Sijin Zhou,Zhuang Qi

Main category: cs.CV

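TL;DR: The paper proposes FedCAug, a federated causal augmentation method that uses causal data augmentation to break spurious correlations between attributes and categories and improve model performance.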

Details

Motivation: Existing federated learning methods fall short in handling data bias and exploiting contextual information, which limits model performance. Method: Designs a causal region localization module and a causal data augmentation module that generate counterfactual samples and increase data diversity. Result: Experiments on three datasets show that FedCAug markedly reduces the model's reliance on background and outperforms existing methods. Conclusion: FedCAug improves federated learning performance through causal augmentation while protecting data privacy. Abstract: Federated learning aims to collaboratively model by integrating multi-source information to obtain a model that can generalize across all client data. Existing methods often leverage knowledge distillation or data augmentation to mitigate the negative impact of data bias across clients. However, the limited performance of teacher models on out-of-distribution samples and the inherent quality gap between augmented and original data hinder their effectiveness, and they typically fail to leverage the advantages of incorporating rich contextual information. To address these limitations, this paper proposes a Federated Causal Augmentation method, termed FedCAug, which employs causality-inspired data augmentation to break the spurious correlation between attributes and categories. Specifically, it designs a causal region localization module to accurately identify and decouple the background and objects in the image, providing rich contextual information for causal data augmentation. Additionally, it designs a causality-inspired data augmentation module that integrates causal features and within-client context to generate counterfactual samples. This significantly enhances data diversity, and the entire process does not require any information sharing between clients, thereby contributing to the protection of data privacy. Extensive experiments conducted on three datasets reveal that FedCAug markedly reduces the model's reliance on background to predict sample labels, achieving superior performance compared to state-of-the-art methods.

[119] Enhancing breast cancer detection on screening mammogram using self-supervised learning and a hybrid deep model of Swin Transformer and Convolutional Neural Network

Han Chen,Anne L. Martel

Main category: cs.CV

TL;DR: Proposes a new method combining self-supervised learning with a deep hybrid model (HybMNet) to improve the accuracy of breast cancer screening.

Details

Motivation: The scarcity of high-quality labeled medical data is a major obstacle to applying AI to breast cancer diagnosis; HybMNet aims to reduce the reliance on large labeled datasets. Method: Two-stage learning: 1) self-supervised pretraining (EsViT + Swin-T); 2) downstream training (HybMNet combines Swin-T with a CNN and fuses global and local features). Result: Achieves an AUC of 0.864 on CMMD and 0.889 on INbreast. Conclusion: By combining self-supervised learning with a hybrid model, HybMNet improves breast cancer detection performance. Abstract: Purpose: The scarcity of high-quality curated labeled medical training data remains one of the major limitations in applying artificial intelligence (AI) systems to breast cancer diagnosis. Deep models for mammogram analysis and mass (or micro-calcification) detection require training with a large volume of labeled images, which are often expensive and time-consuming to collect. To reduce this challenge, we propose a novel method that leverages self-supervised learning (SSL) and a deep hybrid model, named HybMNet, which combines local self-attention and fine-grained feature extraction to enhance breast cancer detection on screening mammograms. Approach: Our method employs a two-stage learning process: (1) SSL Pretraining: We utilize EsViT, an SSL technique, to pretrain a Swin Transformer (Swin-T) using a limited set of mammograms. The pretrained Swin-T then serves as the backbone for the downstream task. (2) Downstream Training: The proposed HybMNet combines the Swin-T backbone with a CNN-based network and a novel fusion strategy. The Swin-T employs local self-attention to identify informative patch regions from the high-resolution mammogram, while the CNN-based network extracts fine-grained local features from the selected patches. A fusion module then integrates global and local information from both networks to generate robust predictions. The HybMNet is trained end-to-end, with the loss function combining the outputs of the Swin-T and CNN modules to optimize feature extraction and classification performance. Results: The proposed method was evaluated for its ability to detect breast cancer by distinguishing between benign (normal) and malignant mammograms. Leveraging SSL pretraining and the HybMNet model, it achieved AUC of 0.864 (95% CI: 0.852, 0.875) on the CMMD dataset and 0.889 (95% CI: 0.875, 0.903) on the INbreast dataset, highlighting its effectiveness.

[120] CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition

Quynh Phung,Long Mai,Fabian David Caba Heilbron,Feng Liu,Jia-Bin Huang,Cusuh Ham

Main category: cs.CV

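TL;DR: CineVerse is a new framework for cinematic scene composition that uses a two-stage approach to generate coherent, contextually rich movie scenes.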

Details

Motivation: To address the consistency and continuity challenges of traditional multi-shot generation while handling the complex interactions and visual effects of filmmaking. Method: A large language model generates a detailed scene plan, and a text-to-image generation model then synthesizes high-quality keyframes. Result: Experiments show promising improvements in generating visually coherent and contextually rich movie scenes. Conclusion: CineVerse lays groundwork for further exploration in cinematic video synthesis. Abstract: We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. However, our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. In order to learn to generate such content, we first create the CineVerse dataset. We use this dataset to train our proposed two-stage approach. First, we prompt a large language model (LLM) with task-specific instructions to take in a high-level scene description and generate a detailed plan for the overall setting and characters, as well as the individual shots. Then, we fine-tune a text-to-image generation model to synthesize high-quality visual keyframes. Experimental results demonstrate that CineVerse yields promising improvements in generating visually coherent and contextually rich movie scenes, paving the way for further exploration in cinematic video synthesis.

[121] Breast Cancer Detection from Multi-View Screening Mammograms with Visual Prompt Tuning

Han Chen,Anne L. Martel

Main category: cs.CV

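TL;DR: Proposes a Multi-view Visual Prompt Tuning Network (MVPT-NET) for efficient analysis of high-resolution mammograms, integrating multi-view data by selectively tuning a small number of parameters.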

Details

Motivation: Multi-view data gives a more comprehensive picture for breast cancer detection, but conventional methods struggle with high-resolution data. Method: Pretrains a single-view classification model, then integrates multi-view features through prompt tuning, tuning only 7% of the parameters. Result: On a large multi-institution dataset, AUROC reaches 0.852, outperforming conventional approaches. Conclusion: MVPT-NET offers an efficient, scalable solution for high-resolution mammogram analysis. Abstract: Accurate detection of breast cancer from high-resolution mammograms is crucial for early diagnosis and effective treatment planning. Previous studies have shown the potential of using single-view mammograms for breast cancer detection. However, incorporating multi-view data can provide more comprehensive insights. Multi-view classification, especially in medical imaging, presents unique challenges, particularly when dealing with large-scale, high-resolution data. In this work, we propose a novel Multi-view Visual Prompt Tuning Network (MVPT-NET) for analyzing multiple screening mammograms. We first pretrain a robust single-view classification model on high-resolution mammograms and then innovatively adapt multi-view feature learning into a task-specific prompt tuning process. This technique selectively tunes a minimal set of trainable parameters (7%) while retaining the robustness of the pre-trained single-view model, enabling efficient integration of multi-view data without the need for aggressive downsampling. Our approach offers an efficient alternative to traditional feature fusion methods, providing a more robust, scalable, and efficient solution for high-resolution mammogram analysis. Experimental results on a large multi-institution dataset demonstrate that our method outperforms conventional approaches while maintaining detection efficiency, achieving an AUROC of 0.852 for distinguishing between Benign, DCIS, and Invasive classes. This work highlights the potential of MVPT-NET for medical imaging tasks and provides a scalable solution for integrating multi-view data in breast cancer detection.

[122] Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI

Hugo Georgenthum,Cristian Cosentino,Fabrizio Marozzo,Pietro Liò

Main category: cs.CV

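TL;DR: The paper proposes a multimodal framework combining computer vision and large language models to automatically generate surgical video summaries, with strong results in tool detection and contextual summarization.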

Details

Motivation: Automatic summarization of surgical videos is important for procedural documentation, surgical training, and post-operative analysis. The paper aims to develop a practical method at the intersection of AI and medicine. Method: Three stages: 1) visual transformers extract frame-level features; 2) large language models generate frame-level captions, which are combined with temporal features to produce clip-level summaries; 3) a dedicated LLM aggregates them into a full surgical report. Result: Evaluated on CholecT50, tool detection precision reaches 96% and the BERT score for temporal context summarization is 0.74. Conclusion: The method advances AI-assisted surgical reporting and is a step toward more intelligent and reliable clinical documentation. Abstract: The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. This paper presents a novel method at the intersection of artificial intelligence and medicine, aiming to develop machine learning models with direct real-world applications in surgical contexts. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. The approach is structured in three key stages. First, surgical videos are divided into clips, and visual features are extracted at the frame level using visual transformers. This step focuses on detecting tools, tissues, organs, and surgical actions. Second, the extracted features are transformed into frame-level captions via large language models. These are then combined with temporal features, captured using a ViViT-based encoder, to produce clip-level summaries that reflect the broader context of each video segment. Finally, the clip-level descriptions are aggregated into a full surgical report using a dedicated LLM tailored for the summarization task. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos. The results show strong performance, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal context summarization. This work contributes to the advancement of AI-assisted tools for surgical reporting, offering a step toward more intelligent and reliable clinical documentation.

[123] Enhancing Quality for VVC Compressed Videos with Omniscient Quality Enhancement Model

Xiem HoangVan,Hieu Bui Minh,Sang NguyenQuang,Wen-Hsiao Peng

Main category: cs.CV

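TL;DR: The paper proposes a new omniscient video quality enhancement network (OVQE-VVC) to improve the perceptual quality of H.266/VVC compressed video, significantly raising PSNR and saving bitrate.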

Details

Motivation: Although H.266/VVC substantially improves compression performance, meeting the demand for higher perceptual quality at the decoder remains a challenge. AI techniques, especially deep learning-based video quality enhancement, offer a possible solution. Method: The authors propose a modified OVQE model and integrate it into the latest STD-VVC decoder architecture. The model uses spatial-temporal features and cross-frequency information to enhance visual quality. Result: Experiments show that OVQE-VVC significantly improves PSNR (around 0.74 dB and up to 1.2 dB) while saving around 19.6% of bitrate. Conclusion: OVQE-VVC is an effective video quality enhancement method that gives the H.266/VVC decoder a significant performance gain. Abstract: The latest video coding standard H.266/VVC has shown its great improvement in terms of compression performance when compared to its predecessor HEVC standard. Though VVC was implemented with many advanced techniques, it still met the same challenges as its predecessor due to the need for even higher perceptual quality demand at the decoder side as well as the compression performance at the encoder side. The advancement of Artificial Intelligence (AI) technology, notably the deep learning-based video quality enhancement methods, was shown to be a promising approach to improving the perceptual quality experience. In this paper, we propose a novel Omniscient video quality enhancement Network for VVC compressed Videos. The Omniscient Network for compressed video quality enhancement was originally designed for HEVC compressed videos in which not only the spatial-temporal features but also cross-frequencies information were employed to augment the visual quality. Inspired by this work, we propose a modification of the OVQE model and integrate it into the latest STD-VVC (Standard Versatile Video Coding) decoder architecture. As assessed in a rich set of test conditions, the proposed OVQE-VVC solution is able to achieve significant PSNR improvement, notably around 0.74 dB and up to 1.2 dB with respect to the original STD-VVC codec. This also corresponds to around 19.6% of bitrate saving while keeping a similar quality observation.
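
The PSNR gains quoted above use the standard definition, PSNR = 10 * log10(MAX^2 / MSE). A minimal sketch follows, assuming 8-bit samples (MAX = 255) stored as flat lists. Real evaluations typically compute this per frame on the luma channel and average over the sequence, which is not shown here.

```python
import math


def psnr(reference, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("need two non-empty sample lists of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_gain(reference, baseline_decoded, enhanced_decoded):
    """PSNR improvement (dB) of an enhanced decode over a baseline decode."""
    return psnr(reference, enhanced_decoded) - psnr(reference, baseline_decoded)
```

Since PSNR is logarithmic in MSE, a gain of 0.74 dB corresponds to the MSE shrinking by a factor of about 10^(0.074), roughly 1.19.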

[124] Mesh-Learner: Texturing Mesh with Spherical Harmonics

Yunfei Wan,Jianheng Liu,Jiarong Lin,Fu Zhang

Main category: cs.CV

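TL;DR: Mesh-Learner is a 3D reconstruction and rendering framework compatible with traditional rasterization pipelines; it combines meshes with spherical harmonic textures to learn view-dependent radiance, enabling efficient rendering and training.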

Details

Motivation: To address the limited compatibility and scalability of existing methods, in particular their compatibility with rasterization pipelines. Method: Learns end-to-end from meshes and spherical harmonic textures, proposes a new interpolation method for rendering images, and optimizes the textures by back-propagating gradients. Result: Achieves state-of-the-art rendering performance on the Replica and FAST-LIVO2 datasets with moderate GPU memory usage. Conclusion: Mesh-Learner performs well in both compatibility and quality and suits a range of tasks built on rasterization pipelines. Abstract: In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh's view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH Texels at each pixel's sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH Texels in SH textures. Mesh-Learner exploits graphic features of rasterization pipeline (texture sampling, deferred rendering) to render, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train vast, unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing state-of-the-art methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit the society, the code will be available at https://github.com/hku-mars/Mesh-Learner.

[125] Shopformer: Transformer-Based Framework for Detecting Shoplifting via Human Pose

Narges Rashvand,Ghazal Alinezhad Noghre,Armin Danesh Pazho,Babak Rahimi Ardabili,Hamed Tabkhi

Main category: cs.CV

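TL;DR: Shopformer is a transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video, addressing privacy and computational cost concerns.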

Details

Motivation: Traditional surveillance systems are ineffective, and existing AI approaches raise privacy and computational cost concerns, so a more efficient solution is needed. Method: Proposes a custom tokenization strategy that converts pose sequences into compact embeddings for transformer processing. Result: Evaluated on real-world pose data, the method outperforms existing anomaly detection models and offers a privacy-preserving, scalable solution for real-time surveillance. Conclusion: Shopformer is an efficient, privacy-friendly approach to shoplifting detection, and the code is publicly available. Abstract: Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving and scalable solution for real-time retail surveillance. The code base for this work is available at https://github.com/TeCSAR-UNCC/Shopformer.

[126] Mapping of Weed Management Methods in Orchards using Sentinel-2 and PlanetScope Data

Ioannis Kontogiorgakis,Iason Tsardanidis,Dimitrios Bormpoudakis,Ilias Tsoumas,Dimitra A. Loka,Christos Noulas,Alexandros Tsitouras,Charalampos Kontoes

Main category: cs.CV

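TL;DR: Uses satellite remote sensing and machine learning to build an efficient, accurate system for classifying weed management methods in orchards.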

Details

Motivation: Weed management is critical to agricultural production, but traditional on-ground surveys are costly and slow, so more efficient monitoring methods are needed. Method: Combines Sentinel-2 and PlanetScope satellite data with machine learning to classify four weed management methods (mowing, tillage, chemical spraying, and no practice). Result: The study indicates that ML-driven remote sensing can improve the efficiency and accuracy of weed management mapping in orchards. Conclusion: The approach gives policymakers an efficient tool for assessing farmer practices and ensuring policy compliance. Abstract: Effective weed management is crucial for improving agricultural productivity, as weeds compete with crops for vital resources like nutrients and water. Accurate maps of weed management methods are essential for policymakers to assess farmer practices, evaluate impacts on vegetation health, biodiversity, and climate, as well as ensure compliance with policies and subsidies. However, monitoring weed management methods is challenging, as it commonly relies on on-ground field surveys, which are often costly, time-consuming and subject to delays. In order to tackle this problem, we leverage Earth Observation (EO) data and Machine Learning (ML). Specifically, we developed an ML approach for mapping four distinct weed management methods (Mowing, Tillage, Chemical-spraying, and No practice) in orchards using satellite image time series (SITS) data from two different sources: Sentinel-2 (S2) and PlanetScope (PS). The findings demonstrate the potential of ML-driven remote sensing to enhance the efficiency and accuracy of weed management mapping in orchards.

[127] Monitoring digestate application on agricultural crops using Sentinel-2 Satellite imagery

Andreas Kalogeras,Dimitrios Bormpoudakis,Iason Tsardanidis,Dimitra A. Loka,Charalampos Kontoes

Main category: cs.CV

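TL;DR: Uses Sentinel-2 satellite imagery and machine learning models to monitor the application of exogenous organic matter in agriculture and to assess its effects on soil and crops.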

Details

Motivation: The widespread use of exogenous organic matter such as digestate calls for monitoring its effects on soil and crop health, since it can also bring environmental risks such as microplastic contamination and nitrogen losses. Method: Analyzes specific indices (EOMI, NDVI, EVI) over Sentinel-2 satellite image time series and applies machine learning models (Random Forest, k-NN, Gradient Boosting, and a feed-forward neural network) to detect the presence of digestate. Result: The machine learning models detect digestate application well, with F1-scores up to 0.85. Conclusion: Combining remote sensing with machine learning shows potential for scalable, cost-effective monitoring of exogenous organic matter application, supporting precision agriculture and sustainability. Abstract: The widespread use of Exogenous Organic Matter (EOM) in agriculture necessitates monitoring to assess its effects on soil and crop health. This study evaluates optical Sentinel-2 satellite imagery for detecting digestate application, a practice that enhances soil fertility but poses environmental risks like microplastic contamination and nitrogen losses. In the first instance, Sentinel-2 satellite image time series (SITS) analysis of specific indices (EOMI, NDVI, EVI) was used to characterize EOM's spectral behavior after application on the soils of four different crop types in Thessaly, Greece. Furthermore, Machine Learning (ML) models (namely Random Forest, k-NN, Gradient Boosting and a Feed-Forward Neural Network) were used to investigate digestate presence detection, achieving F1-scores up to 0.85. The findings highlight the potential of combining remote sensing and ML for scalable and cost-effective monitoring of EOM applications, supporting precision agriculture and sustainability.
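
Two of the indices named in this entry, NDVI and EVI, have standard textbook definitions, so a minimal Python sketch may help readers who want to reproduce that part of the index time series. It assumes surface reflectance scaled to [0, 1] and the usual Sentinel-2 band mapping (B8 for near-infrared, B4 for red, B2 for blue); the function names and sample values are illustrative and not taken from the paper. EOMI is omitted because its formula is not given here.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    # Avoid division by zero on empty pixels; such pixels get an index of 0.
    safe_denom = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, 0.0, (nir - red) / safe_denom)


def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, offset=1.0):
    """Enhanced Vegetation Index with the commonly used coefficients."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    blue = np.asarray(blue, dtype=float)
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + offset)


# Hypothetical reflectance time series for one field (one value per acquisition).
b8_nir = np.array([0.42, 0.45, 0.30, 0.33])
b4_red = np.array([0.08, 0.07, 0.15, 0.12])
b2_blue = np.array([0.04, 0.04, 0.07, 0.06])

print("NDVI:", ndvi(b8_nir, b4_red))
print("EVI: ", evi(b8_nir, b4_red, b2_blue))
```

Per-field index series like these could then serve as input features for the classifiers listed in the entry, although the paper's actual feature construction is not described in this digest.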

[128] SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Wufei Ma,Yu-Cheng Chou,Qihao Liu,Xingrui Wang,Celso de Melo,Jieneng Chen,Jianwen Xie,Alan Yuille

Main category: cs.CV

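TL;DR: Introduces SpatialReasoner, a large vision-language model that improves 3D spatial reasoning through explicit 3D representations and performs better on unseen questions.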

Details

Motivation: Existing reinforcement-learning-based methods for 3D spatial reasoning handle the problem implicitly, and their generalization to unseen questions has not been verified. Method: Uses explicit 3D representations that are shared across stages (perception, computation, and reasoning), combining visual foundation models with large language models. Result: Performs better on a variety of spatial reasoning benchmarks and generalizes better to unseen questions. Conclusion: Explicit 3D representations combined with vision and language models open new directions for 3D spatial reasoning. Abstract: Recent studies in 3D spatial reasoning explore data-driven approaches and achieve enhanced spatial reasoning performance with reinforcement learning (RL). However, these methods typically perform spatial reasoning in an implicit manner, and it remains underexplored whether the acquired 3D knowledge generalizes to unseen question types at any stage of the training. In this work we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between stages -- 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

[129] LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields

Zhengqin Li,Dilin Wang,Ka Chen,Zhaoyang Lv,Thu Nguyen-Phuoc,Milim Lee,Jia-Bin Huang,Lei Xiao,Cheng Zhang,Yufeng Zhu,Carl S. Marshall,Yufeng Ren,Richard Newcombe,Zhao Dong

Main category: cs.CV

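TL;DR: LIRM is a transformer-based architecture that quickly reconstructs high-quality shape, materials, and radiance fields, addressing the weaknesses of existing LRMs in reconstructing unseen parts and producing relightable 3D content.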

Details

Motivation: Existing LRMs perform poorly at reconstructing unseen parts and generating relightable 3D content; LIRM aims to address these problems. Method: LIRM introduces an update model, a hexa-plane neural SDF representation, and a neural directional-embedding mechanism, and progressively adds input views to improve the reconstruction. Result: LIRM compares favorably with optimization-based dense-view inverse rendering methods in geometry and relighting accuracy while requiring far less inference time. Conclusion: LIRM offers a more practical framework for multi-view 3D reconstruction, improving both reconstruction quality and efficiency. Abstract: We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D content that can be consumed by standard graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.

[130] More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV

Kai Ye,Haidi Tang,Bowen Liu,Pingyang Dai,Liujuan Cao,Rongrong Ji

Main category: cs.CV

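TL;DR: CODrone is an oriented object detection dataset for unmanned aerial vehicles (UAVs) that addresses the limited generalization and practicality of existing datasets; its annotated images, collected in multiple cities under varied lighting conditions, are intended to improve algorithm performance in real applications.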

Details

Motivation: Existing UAV oriented object detection datasets are usually designed for specific tasks, generalize poorly, and do not fully reflect the demands of real flight scenarios. CODrone aims to fill this data gap with a dataset and benchmark closer to practical applications. Method: Identifies four limitations of existing datasets (low resolution, limited categories, single-view imaging, and restricted flight altitudes), proposes corresponding improvements, and collects annotated images from multiple cities under varied lighting conditions. Experiments are run with 22 classical or SOTA methods. Result: As a new benchmark, CODrone evaluates oriented object detection in real-world scenarios and reveals algorithmic bottlenecks and opportunities. Conclusion: CODrone fills the data gap in UAV oriented object detection and provides a benchmark with stronger generalization, supporting practical applications and future algorithm development. Abstract: Applications of unmanned aerial vehicles (UAVs) in logistics, agricultural automation, urban management, and emergency response are highly dependent on oriented object detection (OOD) to enhance visual perception. Although existing datasets for OOD in UAV provide valuable resources, they are often designed for specific downstream tasks. Consequently, they exhibit limited generalization performance in real flight scenarios and fail to thoroughly demonstrate algorithm effectiveness in practical environments. To bridge this critical gap, we introduce CODrone, a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements, ensuring greater applicability and robustness in UAV-based OOD. Based on application requirements, we identify four key limitations in current UAV OOD datasets (low image resolution, limited object categories, single-view imaging, and restricted flight altitudes) and propose corresponding improvements to enhance their applicability and robustness. Furthermore, CODrone contains a broad spectrum of annotated images collected from multiple cities under various lighting conditions, enhancing the realism of the benchmark.
To rigorously evaluate CODrone as a new benchmark and gain deeper insights into the novel challenges it presents, we conduct a series of experiments based on 22 classical or SOTA methods. Our evaluation not only assesses the effectiveness of CODrone in real-world scenarios but also highlights key bottlenecks and opportunities to advance OOD in UAV applications. Overall, CODrone fills the data gap in OOD from the UAV perspective and provides a benchmark with enhanced generalization capability, better aligning with practical applications and future algorithm development.

[131] Mitigating Catastrophic Forgetting in the Incremental Learning of Medical Images

Sara Yavari,Jacob Furst

Main category: cs.CV

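TL;DR: Proposes an incremental learning method that uses knowledge distillation to improve the accuracy and efficiency of deep learning models for prostate cancer detection in T2-weighted MRI.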

Details

Motivation: In medical image analysis, data is stored across separate sites and keeping large datasets is often infeasible; the work addresses this while improving model performance. Method: Uses knowledge distillation, with images generated from past tasks guiding model training on subsequent tasks. Result: Shows improved performance and faster convergence on the PI-CAI dataset and other medical imaging datasets. Conclusion: Knowledge distillation is an effective approach to incremental learning in medical imaging, especially when data is stored in a distributed manner. Abstract: This paper proposes an Incremental Learning (IL) approach to enhance the accuracy and efficiency of deep learning models in analyzing T2-weighted (T2w) MRI medical images for prostate cancer detection using the PI-CAI dataset. We used artificial intelligence and radiology data from multiple health centers, focused on different tasks that looked at prostate cancer detection using MRI (PI-CAI). We utilized Knowledge Distillation (KD), as it employs generated images from past tasks to guide the training of models for subsequent tasks. The approach yielded improved performance and faster convergence of the models. To demonstrate the versatility and robustness of our approach, we evaluated it on the PI-CAI dataset, a diverse set of medical imaging modalities including OCT and PathMNIST, and the benchmark continual learning dataset CIFAR-10. Our results indicate that KD can be a promising technique for IL in medical image analysis in which data is sourced from individual health centers and the storage of large datasets is not feasible. By using generated images from prior tasks, our method enables the model to retain and apply previously acquired knowledge without direct access to the original data.

[132] MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion

Zador Pataki,Paul-Edouard Sarlin,Johannes L. Schönberger,Marc Pollefeys

Main category: cs.CV

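TL;DR: Proposes an improved SfM method that incorporates monocular depth and normal priors, substantially improving performance under extreme viewpoint changes and resolving incorrect associations caused by symmetries.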

Details

Motivation: Traditional SfM systems tend to fail under extreme viewpoint changes, low overlap, or high symmetry, which limits their wider use. The paper addresses this by introducing monocular priors inferred by deep learning. Method: Tightly integrates monocular and multi-view constraints, using depth and normal priors inferred by deep networks to improve SfM. Result: Substantially outperforms existing methods under extreme viewpoint changes while maintaining performance in standard conditions, and effectively rejects incorrect associations caused by symmetries. Conclusion: The approach is the first to reliably reconstruct challenging indoor environments from few images; it is robust to errors in the priors and can readily benefit from future progress in monocular depth and normal estimation. Abstract: While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at https://github.com/cvg/mpsfm.

[133] Learning Streaming Video Representation via Multitask Training

Yibin Yan,Jilan Xu,Shangzhe Di,Yikun Liu,Yudi Shi,Qirui Chen,Zeqian Li,Yifei Huang,Weidi Xie

Main category: cs.CV

TL;DR: StreamFormer is a streaming video backbone that combines causal temporal attention with a pre-trained vision transformer for efficient video stream processing.

Details

Motivation: Real-time video stream processing requires low-latency decisions while preserving historical information. Method: Develops StreamFormer, which incorporates causal temporal attention, and trains it within a multitask visual-language alignment framework. Result: Achieves strong, efficient results on online action detection, video instance segmentation, and video question answering. Conclusion: StreamFormer shows potential for real-time video applications. Abstract: Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
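
The entry above does not say how causal temporal attention is implemented in StreamFormer, so the following is only a generic sketch of the idea in Python with NumPy: each frame's feature attends to itself and to earlier frames, never to later ones. The single-head formulation, the one-vector-per-frame layout, and all names are assumptions made for illustration.

```python
import numpy as np


def causal_temporal_attention(q, k, v):
    """Single-head attention over T frames where frame t sees only frames <= t.

    q, k, v: arrays of shape (T, d), one feature vector per frame.
    Returns the attended features (T, d) and the attention weights (T, T).
    """
    num_frames, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    # Strictly upper-triangular entries correspond to future frames; mask them.
    future = np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 8))  # 5 frames, 8-dimensional features
out, attn = causal_temporal_attention(frames, frames, frames)
print(np.round(attn, 3))
```

Because masked weights are exactly zero, outputs for earlier frames do not change when new frames arrive, which is what lets a streaming model process video frame by frame and cache past keys and values.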

[134] CompleteMe: Reference-based Human Image Completion

Yu-Ju Tsai,Brian Price,Qing Liu,Luis Figueroa,Daniil Pakhomov,Zhihong Ding,Scott Cohen,Ming-Hsuan Yang

Main category: cs.CV

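TL;DR: Proposes CompleteMe, a framework that improves reference-based human image completion with a dual U-Net architecture and a Region-focused Attention (RFA) block, markedly improving detail preservation and semantic consistency.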

Details

Motivation: Existing human image completion methods struggle to preserve unique details such as clothing patterns or accessories, and reference-based inpainting methods capture fine-grained details poorly. Method: Proposes the CompleteMe framework, which combines a dual U-Net architecture with an RFA block that explicitly guides the model toward relevant regions of the reference image. Result: Experiments show better visual quality and semantic consistency than existing techniques. Conclusion: Through the RFA block, CompleteMe improves detail preservation and offers a new solution for reference-based human image completion. Abstract: Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques. Project page: https://liagm.github.io/CompleteMe/

cs.GR [Back]

[135] TransparentGS: Fast Inverse Rendering of Transparent Objects with Gaussians

Letian Huang,Dongwei Ye,Jialin Dan,Chengzhi Tao,Huiwen Liu,Kun Zhou,Bo Ren,Yuanqi Li,Yanwen Guo,Jie Guo

Main category: cs.GR

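TL;DR: Proposes TransparentGS, a fast inverse rendering method for transparent objects based on 3D Gaussian Splatting (3D-GS), which tackles specular reflection and refraction in transparent object recovery.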

Details

Motivation: Existing neural and Gaussian radiance field methods suffer from instability and overfitting when handling specular reflection and refraction, and 3D-GS also struggles to recover transparent objects and their nearby contents. Method: Designs transparent Gaussian primitives to represent transparent objects with a deferred refraction strategy; introduces Gaussian light field probes (GaussProbe) to encode ambient light and nearby contents in a unified way; and proposes a depth-based iterative probe query (IterQuery) algorithm to reduce parallax errors. Result: Experiments show that the method recovers transparent objects from complex environments quickly and accurately, with several applications in computer graphics and vision. Conclusion: With its transparent Gaussian primitives and light field probes, TransparentGS improves the efficiency and accuracy of transparent object recovery. Abstract: The emergence of neural and Gaussian-based radiance field methods has led to considerable advancements in novel view synthesis and 3D object reconstruction. Nonetheless, specular reflection and refraction continue to pose significant challenges due to the instability and incorrect overfitting of radiance fields to high-frequency light variations. Currently, even 3D Gaussian Splatting (3D-GS), as a powerful and efficient tool, falls short in recovering transparent objects with nearby contents due to the existence of apparent secondary ray effects. To address this issue, we propose TransparentGS, a fast inverse rendering pipeline for transparent objects based on 3D-GS. The main contributions are three-fold. Firstly, an efficient representation of transparent objects, transparent Gaussian primitives, is designed to enable specular refraction through a deferred refraction strategy. Secondly, we leverage Gaussian light field probes (GaussProbe) to encode both ambient light and nearby contents in a unified framework. Thirdly, a depth-based iterative probes query (IterQuery) algorithm is proposed to reduce the parallax errors in our probe-based framework. Experiments demonstrate the speed and accuracy of our approach in recovering transparent objects from complex environments, as well as several applications in computer graphics and vision.

[136] REED-VAE: RE-Encode Decode Training for Iterative Image Editing with Diffusion Models

Gal Almog,Ariel Shamir,Ohad Fried

Main category: cs.GR

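TL;DR: REED-VAE introduces a training scheme that addresses the accumulation of noise and artifacts when latent diffusion models are used for iterative image editing, and supports multi-method editing.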

Details

Motivation: When existing latent diffusion models edit the same image iteratively, repeated transitions between pixel and latent spaces accumulate noise and artifacts, limiting their flexibility and practicality. Method: Improves variational autoencoders (VAEs) with a RE-encode decode (REED) training scheme so that image quality is preserved after many iterations. Result: REED-VAE supports a variety of iterative edit operations, including text-based and mask-based editing, and increases the success rate and precision of edits. Conclusion: REED-VAE provides a new benchmark for multi-method image editing and enhances the editability of images. Abstract: While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE

[137] Bernstein Bounds for Caustics

Zhimin Fan,Chen Wang,Yiming Wang,Boxuan Li,Yuxuan Guo,Ling-Qi Yan,Yanwen Guo,Jie Guo

Main category: cs.GR

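TL;DR: Proposes reducing the search domain by sampling to simulate specular light transport efficiently, with sampling probabilities designed by optimizing an upper bound on the variance, enabling fast and reliable rendering of complex caustics.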

Details

Motivation: Systematically simulating specular light transport requires enumerating all possible primitive combinations, which is extremely inefficient, so a way to shrink the search domain is needed. Method: Bounds the range of irradiance contributed by each primitive tuple, samples high-contribution tuples, and designs the sampling probabilities using a bounding property of rational functions on the Bernstein basis. Result: The method renders complex caustic effects efficiently with low variance. Conclusion: The proposed primitive sampling is unbiased, can be combined with a variety of root-finding techniques, and suits complex light transport simulation. Abstract: Systematically simulating specular light transport requires an exhaustive search for primitive tuples containing admissible paths. Given the extreme inefficiency of enumerating all combinations, we propose to significantly reduce the search domain by sampling such tuples. The challenge is to design proper sampling probabilities that keep the noise level controllable. Our key insight is that by bounding the range of irradiance contributed by each primitive tuple at a given position, we can sample a subset of primitive tuples with potentially high contributions. Although low-contribution tuples are assigned a negligible probability, the overall variance remains low. Therefore, we derive vertex position and irradiance bounds for each primitive tuple, introducing a bounding property of rational functions on the Bernstein basis. When formulating position and irradiance expressions into rational functions, we handle non-rational components through remainder variables to maintain validity. Finally, we carefully design the sampling probabilities by optimizing the upper bound of the variance, expressed only using the position and irradiance bound. The proposed primitive sampling is intrinsically unbiased. It can be seamlessly combined with various unbiased and biased root-finding techniques within a local primitive domain. Extensive evaluations show that our method enables fast and reliable rendering of complex caustic effects.
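
The digest does not state the paper's exact bounding property, so the Python sketch below shows only a well-known univariate fact of the same flavor, chosen here as an illustration. If r(t) = sum_i a_i B_i(t) / sum_i b_i B_i(t) on [0, 1] with all b_i > 0, then r(t) is a convex combination of the ratios a_i / b_i and therefore lies between their minimum and maximum. The function names and coefficients are hypothetical.

```python
from math import comb

import numpy as np


def bernstein_basis(degree, t):
    """Evaluate all Bernstein basis polynomials of a given degree at points t.

    Returns an array of shape (len(t), degree + 1).
    """
    t = np.asarray(t, dtype=float)
    return np.stack(
        [comb(degree, i) * t**i * (1.0 - t) ** (degree - i) for i in range(degree + 1)],
        axis=-1,
    )


def rational_bounds(num_coeffs, den_coeffs):
    """Bound a rational function given in Bernstein form on [0, 1].

    Requires strictly positive denominator coefficients; the bounds are the
    extreme values of the coefficient ratios a_i / b_i.
    """
    a = np.asarray(num_coeffs, dtype=float)
    b = np.asarray(den_coeffs, dtype=float)
    if np.any(b <= 0):
        raise ValueError("denominator coefficients must be strictly positive")
    ratios = a / b
    return ratios.min(), ratios.max()


def evaluate_rational(num_coeffs, den_coeffs, t):
    """Evaluate the rational function at points t for comparison with the bounds."""
    basis = bernstein_basis(len(num_coeffs) - 1, t)
    return (basis @ np.asarray(num_coeffs, dtype=float)) / (
        basis @ np.asarray(den_coeffs, dtype=float)
    )


a = [1.0, 3.0, -2.0, 4.0]  # hypothetical numerator coefficients
b = [1.0, 2.0, 0.5, 3.0]   # hypothetical, strictly positive denominator coefficients
lower, upper = rational_bounds(a, b)
values = evaluate_rational(a, b, np.linspace(0.0, 1.0, 101))
print(lower, values.min(), values.max(), upper)
```

Bounds of this kind are conservative but cheap to compute from coefficients alone, which is the property that makes them attractive for discarding or down-weighting candidates before any root finding is attempted.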

[138] CLR-Wire: Towards Continuous Latent Representations for 3D Curve Wireframe Generation

Xueqi Ma,Yilin Liu,Tianlong Gao,Qirui Huang,Hui Huang

Main category: cs.GR

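TL;DR: CLR-Wire is a framework for 3D curve-based wireframe generation that unifies geometry and topology in a continuous latent representation, using an attention-driven VAE and a flow matching model to generate high-quality 3D wireframes.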

Details

Motivation: Conventional methods treat vertices, edges, and faces separately and cannot model geometry and topology jointly, which limits the generation of complex shapes. Method: Uses an attention-driven VAE to encode curves as neural parametric curves together with their topological connectivity, then uses a flow matching model to generate latents from Gaussian noise and decode them into 3D wireframes. Result: Experiments show that CLR-Wire substantially outperforms existing methods in accuracy, novelty, and diversity, and is applicable to CAD design and 3D content generation. Conclusion: CLR-Wire offers an efficient and comprehensive solution that models geometry and topology jointly and suits a range of 3D generation tasks. Abstract: We introduce CLR-Wire, a novel framework for 3D curve-based wireframe generation that integrates geometry and topology into a unified Continuous Latent Representation. Unlike conventional methods that decouple vertices, edges, and faces, CLR-Wire encodes curves as Neural Parametric Curves along with their topological connectivity into a continuous and fixed-length latent space using an attention-driven variational autoencoder (VAE). This unified approach facilitates joint learning and generation of both geometry and topology. To generate wireframes, we employ a flow matching model to progressively map Gaussian noise to these latents, which are subsequently decoded into complete 3D wireframes. Our method provides fine-grained modeling of complex shapes and irregular topologies, and supports both unconditional generation and generation conditioned on point cloud or image inputs. Experimental results demonstrate that, compared with state-of-the-art generative approaches, our method achieves substantial improvements in accuracy, novelty, and diversity, offering an efficient and comprehensive solution for CAD design, geometric reconstruction, and 3D content creation.
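
To make the phrase "progressively map Gaussian noise to these latents" concrete, here is a toy Python sketch of sampling with a flow matching model by Euler integration from t = 0 to t = 1. In a real system the velocity field is a trained network; the closed-form field below, which transports any starting point to a single fixed target along a straight line, is a stand-in chosen only so that the example runs on its own. It is not CLR-Wire's model, and all names are hypothetical.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1, so the toy field is well defined
        x = x + dt * velocity(x, t)
    return x


def point_target_velocity(target):
    """Toy velocity field for the straight path x_t = (1 - t) * x0 + t * target.

    Along that path the velocity is target - x0, which equals
    (target - x_t) / (1 - t). A trained network would replace this function.
    """
    target = np.asarray(target, dtype=float)

    def velocity(x, t):
        return (target - x) / (1.0 - t)

    return velocity


rng = np.random.default_rng(0)
noise = rng.standard_normal(4)                # start from Gaussian noise
latent_target = np.array([0.5, -1.0, 2.0, 0.0])  # hypothetical latent code
sample = euler_sample(point_target_velocity(latent_target), noise)
print(sample)
```

In a pipeline like the one described, the resulting latent would then be passed to the VAE decoder to obtain curves and their connectivity.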

[139] Sketch2Anim: Towards Transferring Sketch Storyboards into 3D Animation

Lei Zhong,Chuan Guo,Yiming Xie,Jiawei Wang,Changjian Li

Main category: cs.GR

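TL;DR: Proposes Sketch2Anim, which uses conditional motion synthesis to convert 2D storyboard sketches into 3D animation automatically, avoiding the time and labor of the traditional workflow.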

Details

Motivation: Traditional 3D animation production relies on 2D storyboard sketches, a tedious process that requires expertise, so automated methods for direct 2D-to-3D conversion are in demand. Method: Sketch2Anim has two modules: (1) a 3D conditional motion generator that uses 3D keyposes, joint trajectories, and action words for fine-grained control; and (2) a neural mapper that aligns 2D sketches with 3D keyposes and trajectories, enabling direct 2D control. Result: The method converts storyboards into high-quality 3D animation and supports direct 3D editing; experiments and a user study demonstrate its effectiveness. Conclusion: Sketch2Anim offers an efficient, flexible solution for converting 2D sketches into 3D animation, filling a gap in the field. Abstract: Storyboarding is widely used for creating 3D animations. Animators use the 2D sketches in storyboards as references to craft the desired 3D animations through a trial-and-error process. The traditional approach requires exceptional expertise and is both labor-intensive and time-consuming. Consequently, there is a high demand for automated methods that can directly translate 2D storyboard sketches into 3D animations. This task is under-explored to date. Inspired by the significant advances in motion diffusion models, we propose to address it from the perspective of conditional motion synthesis. We thus present Sketch2Anim, composed of two key modules for sketch constraint understanding and motion generation. Specifically, due to the large domain gap between the 2D sketch and 3D motion, instead of directly conditioning on 2D inputs, we design a 3D conditional motion generator that simultaneously leverages 3D keyposes, joint trajectories, and action words, to achieve precise and fine-grained motion control. Then, we invent a neural mapper dedicated to aligning user-provided 2D sketches with their corresponding 3D keyposes and trajectories in a shared embedding space, enabling, for the first time, direct 2D control of motion generation. Our approach successfully transfers storyboards into high-quality 3D motions and inherently supports direct 3D animation editing, thanks to the flexibility of our multi-conditional motion generator. Comprehensive experiments and evaluations, together with a user perceptual study, demonstrate the effectiveness of our approach.

[140] Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation

Victoria Yue Chen,Daoye Wang,Stephan Garbin,Sebastian Winberg,Timo Bolkart,Thabo Beeler

Main category: cs.GR

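TL;DR: Proposes a method that fuses 2D and 3D features to accurately segment skin from non-skin regions in 3D face scans, substantially improving registration accuracy.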

Details

Motivation: Existing 2D or 3D segmentation methods separate skin from non-skin regions poorly, which degrades face registration quality. Method: Extracts multi-view image features with a frozen image foundation model, fuses them with 3D geometric features, and predicts the segmentation mask directly on the scan mesh. Result: The resulting segmentations improve registration accuracy over pure 2D and pure 3D segmentation methods by 8.89% and 14.3%, respectively, and the model generalizes well to real data. Conclusion: The method addresses the skin segmentation problem and improves face registration accuracy without requiring real training data. Abstract: Face registration deforms a template mesh to closely fit a 3D face scan, the quality of which commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. Existing image-based (2D) or scan-based (3D) segmentation methods, however, perform poorly. Image-based segmentation outputs multi-view inconsistent masks and cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution compared to images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh to predict a segmentation mask directly on the scan mesh. We show that our segmentations improve the registration accuracy over pure 2D or 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.

cs.CL [Back]

[141] Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages

Alessio Buscemi,Cédric Lothritz,Sergio Morales,Marcos Gomez-Vazquez,Robert Clarisó,Jordi Cabot,German Castignani

Main category: cs.CL

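TL;DR: The MLA-BiTe framework improves on existing bias evaluation methods through multilingual augmented bias testing, supporting comprehensive cross-lingual evaluation.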

Details

Motivation: Large language models (LLMs) perform well at natural language processing but often perpetuate social biases from their training data, so better bias evaluation methods are needed. Method: Introduces the MLA-BiTe framework, which uses automated translation and paraphrasing to test four state-of-the-art LLMs in six languages, including two low-resource languages. Result: The tests cover seven sensitive categories of discrimination and are used to evaluate the effectiveness of MLA-BiTe in multilingual settings. Conclusion: MLA-BiTe provides a systematic tool for multilingual bias evaluation, helping to identify and reduce bias in LLMs more comprehensively. Abstract: Large Language Models (LLMs) have exhibited impressive natural language processing capabilities but often perpetuate social biases inherent in their training data. To address this, we introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing four state-of-the-art LLMs in six languages -- including two low-resource languages -- focusing on seven sensitive categories of discrimination.

[142] Span-Level Hallucination Detection for LLM-Generated Answers

Passant Elchafei,Mervet Abu-Elkheir

Main category: cs.CL

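TL;DR: Presents a span-level hallucination detection framework based on Semantic Role Labeling (SRL) for detecting hallucinated content in LLM-generated answers, with competitive results on the Mu-SHROOM dataset.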

Details

Motivation: Detecting hallucinated spans in LLM-generated answers is crucial for improving factual consistency. Method: Uses Semantic Role Labeling (SRL) to decompose the answer into atomic roles, obtains a reference context through question-based LLM prompting, evaluates semantic alignment with a DeBERTa model, and refines detection with confidence scores. Result: Achieves competitive performance on the Mu-SHROOM dataset, with hallucinated spans verified using GPT-4 and LLaMA. Conclusion: The framework contributes to improving hallucination detection in LLM-generated answers. Abstract: Detecting spans of hallucination in LLM-generated answers is crucial for improving factual consistency. This paper presents a span-level hallucination detection framework for the SemEval-2025 Shared Task, focusing on English and Arabic texts. Our approach integrates Semantic Role Labeling (SRL) to decompose the answer into atomic roles, which are then compared with a retrieved reference context obtained via question-based LLM prompting. Using a DeBERTa-based textual entailment model, we evaluate each role's semantic alignment with the retrieved context. The entailment scores are further refined through token-level confidence measures derived from output logits, and the combined scores are used to detect hallucinated spans. Experiments on the Mu-SHROOM dataset demonstrate competitive performance. Additionally, hallucinated spans have been verified through fact-checking by prompting GPT-4 and LLaMA. Our findings contribute to improving hallucination detection in LLM-generated responses.

[143] Can Third-parties Read Our Emotions?

Jiayi Li,Yingfan Zhou,Pranav Narayanan Venkit,Halima Binte Islam,Sneha Arya,Shomir Wilson,Sarah Rajtmajer

Main category: cs.CL

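TL;DR: Third-party annotations, whether from human annotators or large language models, show significant limitations in capturing authors' private states such as emotions, although LLMs outperform human annotators. Greater demographic similarity between annotators and authors improves annotation quality.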

Details

Motivation: Tests whether third-party annotations accurately reflect authors' private states, such as emotions, and explores ways to improve annotation quality. Method: Runs experiments comparing third-party emotion labels (from humans and LLMs) with authors' own labels, and studies how demographic similarity affects annotation quality. Result: Third-party annotations are limited in reflecting authors' private states, but LLMs outperform human annotators, and demographic similarity improves annotation quality. Conclusion: Calls for refined annotation practices that model authors' private states more accurately, and proposes a framework for evaluating the limitations of third-party annotations. Abstract: Natural Language Processing tasks that aim to infer an author's private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors' private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations, whether provided by human annotators or large language models (LLMs), in faithfully representing authors' private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance, while incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs' performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors' private states.

[144] Spatial Speech Translation: Translating Across Space With Binaural Hearables

Tuochao Chen,Qirui Wang,Runlin He,Shyam Gollakota

Main category: cs.CL

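TL;DR: Proposes a new hearable concept that translates multilingual conversations in real time in noisy environments while preserving each speaker's spatial position and voice characteristics.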

Details

Motivation: Existing translation models fail in the presence of interference, and translated audio loses the spatial cues of the speakers. Method: Combines blind source separation, localization, real-time expressive translation, and binaural rendering, with real-time inference on Apple M2 silicon. Result: Achieves a BLEU score of up to 22.01 under strong interference from other speakers; user studies confirm effective spatial rendering in previously unseen real-world reverberant environments. Conclusion: The work is a first step toward integrating spatial perception into speech translation. Abstract: Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.

[145] Building UD Cairo for Old English in the Classroom

Lauren Levine,Junghyun Min,Amir Zeldes

Main category: cs.CL

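TL;DR: Presents a sample Old English treebank based on the UD Cairo sentences. The data were collected through LLM prompting and searches in authentic Old English text, then annotated by beginners whose annotations were compared and adjudicated. LLM outputs need post-editing, and beginner annotators working together produce good results. Preliminary parsing experiments show that Modern English training data transfers poorly to Old English, though annotated features improve performance.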

Details

Motivation: Build a sample Old English treebank as part of a Historical Linguistics course and explore what LLMs and beginner annotators can contribute to Old English data work. Method: Collect 20 sentences by combining LLM prompting with searches in authentic Old English data, have beginner annotators annotate them, compare and adjudicate the annotations, and run preliminary parsing experiments. Result: LLM outputs require post-editing to reflect authentic syntax; beginner annotators taken together produce good results; Modern English training data performs poorly on Old English, but parsing on annotated features improves performance. Conclusion: LLMs and beginner annotation show promise for Old English data, though parsing methods need further improvement. Abstract: In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world's languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.

[146] EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

Jianyou Wang,Weili Cao,Kaicheng Wang,Xiaoyue Wang,Ashish Dalvi,Gino Prasad,Qishan Liang,Hsuan-lin Her,Ming Wang,Qin Yang,Gene W. Yeo,David E. Neal,Maxim Khan,Christopher D. Rosin,Ramamohan Paturi,Leon Bergen

Main category: cs.CL

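TL;DR: Studies the task of automatically finding evidence relevant to hypotheses in biomedical papers, introduces the EvidenceBench benchmark and its annotation pipeline, and evaluates a range of models, finding that their performance still falls significantly short of expert level.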

Details

Motivation: Researchers investigating scientific hypotheses need to find relevant evidence, but efficient automated methods are lacking. Method: Build EvidenceBench with a pipeline of hypothesis generation and sentence-by-sentence annotation of biomedical papers, and validate the pipeline against multiple sets of human-expert annotations. Result: The evaluated language models and retrieval systems fall significantly short of expert level; the pipeline is also used to build the larger EvidenceBench-100k dataset. Conclusion: The pipeline scales and provides large-scale annotated data for model training and development. Abstract: We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task, which is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench

[147] SynLexLM: Scaling Legal LLMs with Synthetic Data and Curriculum Learning

Ojasw Upadhyay,Abishek Saravankumar,Ayman Ismail

Main category: cs.CL

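TL;DR: SynLexLM is a new approach to pre-training legal LLMs that uses curriculum learning and synthetic data augmentation to address legal data scarcity; it aims to outperform traditional models and fine-tuned versions on legal benchmarks.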

Details

Motivation: General-purpose pre-training may not capture legal nuances, and sufficient legal data is hard to acquire, so an efficient way to pre-train a legal LLM is needed. Method: Curriculum learning (progressing from simple to complex legal texts and queries) combined with synthetic data augmentation (e.g., QA pairs generated with models like Gemini Pro) to address data scarcity. Result: The work is preliminary: it generates synthetic QA pairs reflecting legal reasoning and targets improved performance on legal benchmarks (BigLaw-Bench, EUR-Lex-Sum) over traditional models and fine-tuned versions. Conclusion: SynLexLM could improve legal document analysis and research tools and broaden access to advanced legal AI. Abstract: Large Language Models (LLMs) are powerful but often require extensive fine-tuning and large datasets for specialized domains like law. General-purpose pre-training may not capture legal nuances, and acquiring sufficient legal data is challenging. We introduce SynLexLM, a novel approach to efficiently pre-train a legal LLM. Our method employs curriculum learning, progressing from simple to complex legal texts and queries, combined with synthetic data augmentation using models like Gemini Pro to address data scarcity. We aim to achieve improved performance on legal benchmarks (BigLaw-Bench, EUR-Lex-Sum) compared to traditional models and fine-tuned versions. Preliminary work involves generating synthetic QA pairs reflecting legal reasoning. This work aims to enhance legal document analysis and research tools, potentially democratizing access to advanced legal AI.

[148] Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

Jong Inn Park,Maanas Taneja,Qianwen Wang,Dongyeop Kang

Main category: cs.CL

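TL;DR: SciTalk is a multi-LLM agentic framework that generates scientifically accurate short-form videos through an iterative feedback mechanism. It outperforms simple prompting methods but does not yet match human creators.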

Details

Motivation: Existing methods for generating scientific short-form videos suffer from factual inaccuracies and visual artifacts. Method: A multi-agent framework with agents for content summarization, visual scene planning, and text and layout editing, refined through iterative feedback. Result: Experiments show that SciTalk outperforms simple prompting methods in scientific accuracy and engagement. Conclusion: The framework offers insights into the challenges and benefits of feedback-driven video generation; code and data will be released. Abstract: Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework, grounding videos in various sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing, and incorporates an iterative feedback mechanism where video agents simulate user roles to give feedback on generated videos from previous iterations and refine generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content over the refined loop of video generation. Although preliminary results do not yet match human creators' quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.

[149] Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Yixin Cao,Shibo Hong,Xinze Li,Jiahao Ying,Yubo Ma,Haiyuan Liang,Yantao Liu,Zijun Yao,Xiaozhi Wang,Dan Huang,Wenxuan Zhang,Lifu Huang,Muhao Chen,Lei Hou,Qianru Sun,Xingjun Ma,Zuxuan Wu,Min-Yen Kan,David Lo,Qi Zhang,Heng Ji,Jing Jiang,Juanzi Li,Aixin Sun,Xuanjing Huang,Tat-Seng Chua,Yu-Gang Jiang

Main category: cs.CL

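TL;DR: This survey examines the core challenges of evaluating large language models (LLMs): the shift from task-specific to capability-based evaluation, the shift from manual to automated evaluation, and the evaluation generalization issue.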

Details

Motivation: As LLMs advance rapidly, evaluation methods need to be re-examined to keep up with their expanding capabilities. Method: Analyzes two key transitions, (i) from task-specific to capability-based evaluation and (ii) from manual to automated evaluation, and discusses the evaluation generalization issue. Result: Organizes the challenges from the perspectives of methods, datasets, evaluators, and metrics, and notes that bounded test sets cannot scale with ever-growing model abilities. Conclusion: The authors will maintain a living GitHub repository and invite the community to contribute updates and corrections. Abstract: Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and "LLM-as-a-judge" scoring. Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Because this field evolves quickly, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.

[150] Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning

Abdellah Ghassel,Xianzhi Li,Xiaodan Zhu

Main category: cs.CL

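TL;DR: Proposes an approach that combines fine-tuning with advanced prompting strategies to detect and mitigate dialogue breakdowns in LLM-driven conversational systems, improving both performance and efficiency.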

Details

Motivation: Although large language models (LLMs) excel at many dialogue tasks, they can still produce incoherent or contradictory responses (breakdowns) that undermine user trust, so their ability to handle breakdowns needs study. Method: Combines specialized fine-tuning with advanced prompting strategies (few-shot learning, chain-of-thought reasoning, and analogical prompting), and fine-tunes an 8B model whose classification and calibration are evaluated on English and Japanese dialogue. Result: A 7% accuracy improvement over the base model on the BETOLD dataset, plus a real-time deployment architecture that significantly cuts operational cost and energy consumption. Conclusion: The approach offers a scalable solution for high-impact domains that combines efficiency, interpretability, and reliability, and narrows the gap between small open-source models and large proprietary ones. Abstract: Large language models (LLMs) are rapidly changing various domains. However, their capabilities in handling conversational breakdowns still require an in-depth exploration. This paper addresses the challenge of detecting and mitigating dialogue breakdowns within LLM-driven conversational systems. While powerful models from OpenAI and Anthropic excel in many dialogue tasks, they can still produce incoherent or contradictory responses, commonly referred to as breakdowns, which undermine user trust. To tackle this, we propose an approach that combines specialized fine-tuning with advanced prompting strategies, including few-shot learning, chain-of-thought reasoning, and analogical prompting. In particular, we fine-tune a small 8B model and demonstrate its robust classification and calibration capabilities in English and Japanese dialogue. We also validate its generalization on the BETOLD dataset, achieving a 7% accuracy improvement over its base model. Furthermore, we introduce a real-time deployment architecture that selectively escalates suspicious responses to more resource-intensive frontier models only when breakdowns are detected, significantly cutting operational expenses and energy consumption. Experimental results show our method surpasses prior state-of-the-art specialized classifiers while also narrowing performance gaps between smaller open-source models and large proprietary ones. Our approach offers a scalable solution for robust conversational AI in high-impact domains by combining efficiency, interpretability, and reliability.
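The selective escalation described in the abstract can be pictured as a simple routing rule. The following is only a minimal sketch under stated assumptions: the function names, the 0.5 threshold, and the toy detector and frontier-model stubs are all hypothetical and are not taken from the paper.

```python
from typing import Callable


def answer_with_escalation(
    draft: str,
    breakdown_probability: Callable[[str], float],
    escalate: Callable[[str], str],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> tuple[str, bool]:
    """Return (final_response, was_escalated).

    The cheap detector scores the draft; only drafts flagged as likely
    breakdowns are handed to the more expensive model.
    """
    if breakdown_probability(draft) >= threshold:
        return escalate(draft), True
    return draft, False


# Hypothetical stand-ins for the small detector and the frontier model.
def toy_detector(text: str) -> float:
    return 0.9 if "???" in text else 0.1


def toy_frontier_model(text: str) -> str:
    return "[escalated] " + text.replace("???", "").strip()


ok_reply, ok_flag = answer_with_escalation(
    "Sure, your order ships tomorrow.", toy_detector, toy_frontier_model
)
bad_reply, bad_flag = answer_with_escalation(
    "Your order ??? was never placed ???", toy_detector, toy_frontier_model
)
print(ok_flag, bad_flag)
```

Because the expensive model is called only when the detector fires, the cost of this design scales with the breakdown rate rather than with total traffic, which is presumably where the reported savings come from.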

[151] When2Call: When (not) to Call Tools

Hayley Ross,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara

Main category: cs.CL

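TL;DR: Introduces When2Call, a new benchmark that evaluates language models' tool-calling decisions: when to call a tool, when to ask a follow-up question, and when to admit the question cannot be answered. Existing benchmarks focus mainly on tool-call accuracy and neglect the question of when to call.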

Details

Motivation: Existing benchmarks do not adequately evaluate tool-calling decisions, especially when to call a tool, when to ask a follow-up question, and when to admit a question cannot be answered, which limits the flexibility and reliability of models in practice. Method: Develops the When2Call benchmark for tool-calling decision-making, builds a training set, and uses the benchmark's multiple-choice nature to design a preference optimization training regime. Result: State-of-the-art tool-calling LMs show significant room for improvement on When2Call, and preference optimization improves results considerably more than traditional fine-tuning. Conclusion: When2Call highlights the importance of tool-calling decisions and provides data and direction for future work. Abstract: Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.

[152] Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation

Yi Lu,Wanxu Zhao,Xin Zhou,Chenxin An,Chenglong Wang,Shuo Li,Yuming Yang,Jun Zhao,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

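TL;DR: Proposes DPE, a training-free framework that extends the context window of LLMs by manipulating position indices in RoPE's individual hidden dimensions, significantly surpassing existing methods.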

Details

Motivation: LLMs degrade when inputs exceed the pre-trained length, and existing long-context extension requires expensive further training. Method: Analyze RoPE's hidden dimensions, detect the effective length of each, identify the key dimensions, and adjust their position indices, extending the context window with minimal modification. Result: DPE lets Llama3-8k 8B support a 128k context window; Llama 3.1 70B with DPE gains over 18 points on RULER and even outperforms GPT-4-128K. Conclusion: DPE is an efficient, training-free method for long-context extension that significantly improves model performance. Abstract: Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE's different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions' position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves models' performance within the training length, for example improving Llama3.1 70B by over 18 points on the popular long-context benchmark RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.

[153] Latent Adversarial Training Improves the Representation of Refusal

Alexandra Abbas,Nora Petrova,Helios Ael Lyons,Natalia Perez-Campanero

Main category: cs.CL

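TL;DR: Refusal behavior in language models is encoded mainly in a single direction of the latent space, which makes it vulnerable to attack. LAT, which introduces noise during training, changes this representation and concentrates it in the first two SVD components; this improves robustness to vectors from reference models but exposes a new vulnerability to self-generated vectors.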

Details

Motivation: Understand how LAT's noise-based training changes the latent representation of refusal behavior, in order to assess its effectiveness and limitations. Method: Analyze Llama 2 7B, comparing how LAT, SSFT, and AT affect the refusal representation by applying SVD to activation differences. Result: LAT significantly alters the refusal representation, concentrating it in the first two SVD components (about 75% of the variance); LAT models are more robust to vectors from reference models but more vulnerable to self-generated vectors. Conclusion: LAT's training perturbations yield a more comprehensive representation of refusal behavior, showing both its potential and its limits for improving model safety. Abstract: Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through the analysis of Llama 2 7B, we examine how LAT reorganizes the refusal behavior in the model's latent space compared to SSFT and embedding space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components which explain approximately 75 percent of the activation differences variance - significantly higher than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.

[154] A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification

Junichiro Niimi

Main category: cs.CL

TL;DR: Proposes a simple ensemble strategy that runs multiple inferences to make LLM-based sentiment analysis more robust and accurate.

Details

Motivation: Existing literature largely overlooks the variability and reproducibility of LLM results, whereas human annotation resolves disagreements by majority voting. Method: An ensemble of multiple inferences from medium-sized LLMs. Result: The ensemble is more accurate than a single attempt with a large model, reducing RMSE by 18.6%. Conclusion: The ensemble strategy substantially improves LLM performance on sentiment analysis. Abstract: With the advance of large language models (LLMs), LLMs have been utilized for various tasks. However, the issues of variability and reproducibility of results from each trial of LLMs have been largely overlooked in existing literature, while actual human annotation uses majority voting to resolve disagreements among annotators. Therefore, this study introduces a straightforward ensemble strategy to sentiment analysis using LLMs. As a result, we demonstrate that an ensemble of multiple inferences using medium-sized LLMs produces more robust and accurate results than using a large model with a single attempt, reducing RMSE by 18.6%.
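To illustrate why aggregating repeated inferences can lower RMSE, here is a minimal sketch. It assumes the ratings are numeric and aggregates them with the median; the abstract does not state the paper's actual aggregation rule, and the ratings and gold labels below are made-up illustrative values.

```python
from math import sqrt
from statistics import median


def ensemble_rating(ratings):
    """Aggregate repeated ratings for one text by taking their median (assumed rule)."""
    return median(ratings)


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sqrt(sum(squared) / len(squared))


# Illustrative data: three repeated runs per text, plus hypothetical gold ratings.
runs_per_text = [[5, 4, 5], [2, 4, 2], [4, 3, 4]]
gold = [5, 2, 3]

single_attempt = [runs[1] for runs in runs_per_text]        # one arbitrary run each
ensembled = [ensemble_rating(runs) for runs in runs_per_text]

single_rmse = rmse(single_attempt, gold)
ensemble_rmse = rmse(ensembled, gold)
print(round(single_rmse, 3), round(ensemble_rmse, 3))
```

In this toy data the single run is thrown off by one outlying rating per text, while the median suppresses it; real gains depend on how noisy the individual runs are.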

Junhong Liang,Yu Zhou

Main category: cs.CL

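TL;DR: Proposes MTCSC, a multi-turn Chinese spelling correction framework that uses RAG and a length reflection mechanism to overcome the output-length consistency and domain adaptation limits of traditional methods, significantly improving correction quality.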

Details

Motivation: Traditional Chinese Spelling Correction (CSC) methods fall short on output-length consistency and domain adaptation, which limits their applicability. Method: The MTCSC framework combines RAG with a length reflection mechanism, builds a domain-specific retrieval database, and uses a multi-source combination strategy to keep output length faithful. Result: Experiments on datasets from diverse domains show significant gains over existing methods, especially on domain-specific and variable-length correction tasks. Conclusion: MTCSC addresses the limitations of traditional CSC and offers an effective solution for variable-length and domain-specific correction. Abstract: Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with maintaining consistent output lengths and adapting to domain-specific corrections. Furthermore, existing CSC tasks impose rigid constraints requiring input and output lengths to be identical, limiting their applicability. In this work, we extend traditional CSC to variable-length correction scenarios, including Chinese Splitting Error Correction (CSEC) and ASR N-best Error Correction. To address domain adaptation and length consistency, we propose the MTCSC (Multi-Turn CSC) framework, based on RAG enhanced with a length reflection mechanism. Our approach constructs a retrieval database from domain-specific training data and dictionaries, fine-tuning retrievers to optimize performance for error-containing inputs. Additionally, we introduce a multi-source combination strategy with iterative length reflection to ensure output length fidelity. Experiments across diverse domain datasets demonstrate that our method significantly outperforms current approaches in correction quality, particularly in handling domain-specific and variable-length error correction tasks.

[156] LawFlow : Collecting and Simulating Lawyers' Thought Processes

Debarati Das,Khanh Chi Le,Ritik Sachin Parkar,Karin De Langis,Brendan Madson,Chad M. Berryman,Robin M. Willis,Daniel H. Moses,Brett McDonnell,Daniel Schwarcz,Dongyeop Kang

Main category: cs.CL

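TL;DR: Introduces LawFlow, a dataset of end-to-end legal workflows, compares human and LLM-generated workflows, and offers design suggestions for AI assistance in legal practice.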

Details

Motivation: Current AI for law is limited to isolated subtasks and does not support end-to-end decision-making, so more comprehensive datasets and models are needed. Method: Build the LawFlow dataset from workflows collected from trained law students working on real-world business entity formation scenarios, and compare human and LLM-generated workflows. Result: Human workflows are more modular and adaptive, while LLM workflows are more sequential and less sensitive to downstream implications; legal professionals prefer AI in supportive roles. Conclusion: The study exposes the limitations of LLMs in complex legal workflows and proposes design suggestions that align AI assistance with human goals, pointing the way for future legal AI systems. Abstract: Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Building on these findings, we propose a set of design suggestions, rooted in empirical observations, that align AI assistance with human goals of clarity, completeness, creativity, and efficiency, through hybrid planning, adaptive execution, and decision-point support. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (https://minnesotanlp.github.io/LawFlow-website/).

[157] Dynamic Fisher-weighted Model Merging via Bayesian Optimization

Sanwoo Lee,Jiahao Liu,Qifan Wang,Jingang Wang,Xunliang Cai,Yunfang Wu

Main category: cs.CL

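TL;DR: Proposes Dynamic Fisher-weighted Merging (DF-Merge), which unifies existing model merging strategies and uses Bayesian optimization to adjust the merging coefficients dynamically, significantly improving multi-task model performance.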

Details

Motivation: Existing model merging methods leave a notable performance gap relative to multi-task fine-tuning, so a more general merging framework is needed. Method: DF-Merge uses Bayesian optimization to adjust per-model coefficients dynamically and integrates parameter importance computed from Fisher information. Result: DF-Merge outperforms strong baselines across model sizes and tasks and reaches near-optimal performance in a few iterations. Conclusion: With its unified view of merging and dynamic coefficient adjustment, DF-Merge improves multi-task performance efficiently and practically. Abstract: The fine-tuning of pre-trained language models has resulted in the widespread availability of task-specific models. Model merging offers an efficient way to create multi-task models by combining these fine-tuned models at the parameter level, without the need for training data or joint training on multiple datasets. Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise. Both approaches exhibit their own weaknesses, leading to a notable performance gap compared to multi-task fine-tuning. In this paper, we unify these seemingly distinct strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge). Specifically, candidate models are associated with a set of coefficients that linearly scale their fine-tuned parameters. Bayesian optimization is applied to dynamically adjust these coefficients, aiming to maximize overall performance on validation sets. Each iteration of this process integrates parameter importance based on the Fisher information conditioned by the coefficients. Experimental results show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks. Our analysis shows that the effectiveness of DF-Merge arises from the unified view of merging and that near-optimal performance is achievable in a few iterations, even with minimal validation data.
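As a rough illustration of how model-wise coefficients and parameter-wise Fisher importance can be combined, the sketch below computes a coefficient-scaled, Fisher-weighted average of flat parameter lists. This is a simplified stand-in, not DF-Merge itself: the abstract's coefficients scale the fine-tuned parameters and the Fisher information is conditioned on them, and the Bayesian optimization loop over coefficients is omitted here. All names and numbers are hypothetical.

```python
def fisher_weighted_merge(params, fishers, coefficients, eps=1e-12):
    """Merge models parameter-wise.

    params[m][i]    : i-th parameter of model m
    fishers[m][i]   : diagonal Fisher estimate (importance) of that parameter
    coefficients[m] : model-wise scaling coefficient (would be tuned, e.g. by
                      Bayesian optimization on a validation set)
    """
    n_params = len(params[0])
    merged = []
    for i in range(n_params):
        numerator = sum(
            c * f[i] * p[i] for c, f, p in zip(coefficients, fishers, params)
        )
        denominator = sum(c * f[i] for c, f in zip(coefficients, fishers))
        merged.append(numerator / (denominator + eps))  # eps guards against 0/0
    return merged


# Two toy "models" with two parameters each.
params = [[1.0, 0.0], [3.0, 2.0]]
fishers = [[1.0, 3.0], [1.0, 1.0]]
merged = fisher_weighted_merge(params, fishers, coefficients=[1.0, 1.0])
print(merged)
```

With equal Fisher values the first parameter is a plain average, while the second is pulled toward the model whose Fisher estimate marks it as more important.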

[158] Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Mohammad Akbar-Tajari,Mohammad Taher Pilehvar,Mohammad Mahmoody

Main category: cs.CL

TL;DR: GoAT is a graph-based method for generating adversarial prompts to test the robustness of LLM alignment, and it outperforms existing attacks.

Details

Motivation: Large language models (LLMs) remain prone to adversarial jailbreaks, and identifying these vulnerabilities is crucial for improving robustness. Method: Generate adversarial prompts with the Graph of Thoughts framework, refining attack paths jointly over a dynamic graph structure without access to the target model's parameters. Result: Against robust models such as Llama, GoAT achieves up to five times better jailbreak success rate than existing methods, with fewer queries, and produces high-quality, human-readable prompts. Conclusion: Graph-structured collaborative refinement makes the exploration of adversarial vulnerabilities markedly more effective and provides a new tool for LLM safety testing. Abstract: The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: https://github.com/GoAT-pydev/Graph_of_Attacks.

[159] Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

Zhyar Rzgar K Rostam,Gábor Kertész

Main category: cs.CL

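TL;DR: Studies pre-trained language models (PLMs) for scientific text classification and improves performance through dataset augmentation and a hard-voting strategy.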

Details

Motivation: Efficient text classification is essential for handling the growing volume of academic publications; the study aims to use PLMs to make scientific text classification more accurate and scalable. Method: Fine-tune PLMs including BERT, SciBERT, BioBERT, and BlueBERT, augment the dataset by retrieving unlabeled articles and predicting their labels, and combine predictions by hard voting. Result: Domain-specific models (e.g., SciBERT, BioBERT) outperform general-purpose models (e.g., BERT), and fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly improves accuracy. Conclusion: Dataset augmentation, inference-driven label prediction, hard voting, and fine-tuning are effective for automated academic text classification. Abstract: Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985's main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
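Hard voting over the four PLMs' predictions can be sketched as follows. This is a generic majority vote, offered as an assumption about what the abstract means: the tie-breaking rule (first label seen among the tied ones), the agreement ratio used as a confidence proxy, and the example labels are illustrative choices, not details from the paper.

```python
from collections import Counter


def hard_vote(labels):
    """Return (winning_label, agreement_ratio) for one article.

    Ties are broken in favour of the label that appears first in `labels`,
    which follows from Counter preserving insertion order.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Hypothetical predictions from BERT, SciBERT, BioBERT and BlueBERT for one article.
predictions = ["Medical", "Medical", "Psychology", "Medical"]
label, agreement = hard_vote(predictions)
print(label, agreement)
```

One possible use, again an assumption, is to keep a pseudo-labeled article for the expanded training set only when the agreement ratio exceeds some cut-off.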

[160] KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation

Jiabin Fan,Guoqing Luo,Michael Bowling,Lili Mou

Main category: cs.CL

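TL;DR: Proposes KETCHUP, a K-step return estimation method for reinforcement learning (RL)-based knowledge distillation (KD) in text generation, which uses the Bellman optimality equation to reduce the variance of gradient estimates and improve optimization.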

Details

Motivation: In RL-based knowledge distillation, high-variance gradient estimates hurt optimization, especially when the student model is large. Method: Induce a K-step return by applying the Bellman optimality equation over multiple steps; theoretical analysis shows this reduces gradient-estimate variance. Result: Experiments on three text generation tasks show superior performance on both standard task metrics and LLM-based evaluation. Conclusion: K-step return induction is a promising direction for RL-based KD in LLM research. Abstract: We propose a novel K-step return estimation method (called KETCHUP) for Reinforcement Learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by using the Bellman Optimality Equation for multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, thus leading to improved RL optimization especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.
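For orientation, the textbook K-step (n-step) return sums K rewards and then bootstraps from a value estimate, G = r_t + γ r_{t+1} + ... + γ^{K-1} r_{t+K-1} + γ^K V(s_{t+K}). The sketch below implements that generic quantity; it is not claimed to be KETCHUP's exact estimator, and the reward and value numbers are made up.

```python
def k_step_return(rewards, values, t, k, gamma=1.0):
    """Generic K-step return starting at time step t.

    rewards : per-step rewards r_0 .. r_{T-1}
    values  : value estimates V(s_0) .. V(s_T), so len(values) == len(rewards) + 1
    The sum is truncated at the end of the sequence if t + k runs past it.
    """
    horizon = len(rewards)
    end = min(t + k, horizon)
    total = 0.0
    for offset, step in enumerate(range(t, end)):
        total += (gamma ** offset) * rewards[step]
    # Bootstrap from the value estimate at the state reached after the last reward.
    total += (gamma ** (end - t)) * values[end]
    return total


rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 0.4, 0.3, 0.2, 0.0]  # terminal state has value 0
two_step = k_step_return(rewards, values, t=0, k=2)
print(two_step)
```

Larger K relies more on observed rewards and less on the value estimate, which is the usual trade-off between bias and variance that a K-step formulation lets one tune.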

[161] Calibrating Translation Decoding with Quality Estimation on LLMs

Di Wu,Yibin Lei,Christof Monz

Main category: cs.CL

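TL;DR: Proposes calibrating neural machine translation (NMT) decoding by optimizing the Pearson correlation between hypothesis likelihoods and translation quality, which substantially improves translation quality.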

Details

Motivation: Maximum a posteriori (MAP) decoding in NMT often yields low-quality or pathological hypotheses because the decoding objective is not aligned with real translation quality. Method: Calibrate decoding by directly optimizing the Pearson correlation between hypothesis likelihoods and translation quality. Result: On large language models (LLMs), limited training (2K instances per direction) substantially improves translation quality, and the gains are orthogonal to supervised fine-tuning; the calibrated likelihood also serves as a strong proxy for translation quality. Conclusion: Calibration makes MAP decoding more effective, enables efficient deployment, and yields a state-of-the-art translation model covering 10 languages. Abstract: Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-llm-mt.
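The quantity being optimized, the Pearson correlation between hypothesis likelihoods and quality scores, can be computed as in this minimal sketch. The `1 - r` loss form, the function names, and the sample numbers are assumptions for illustration; the abstract does not give the exact training objective.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def calibration_loss(log_likelihoods, quality_scores):
    """Assumed loss form: lower when likelihoods rank hypotheses as quality does."""
    return 1.0 - pearson(log_likelihoods, quality_scores)


# Hypothetical log-likelihoods and quality scores for three hypotheses of one source.
well_calibrated = calibration_loss([-1.0, -2.0, -3.0], [0.9, 0.7, 0.5])
miscalibrated = calibration_loss([-1.0, -2.0, -3.0], [0.5, 0.7, 0.9])
print(well_calibrated, miscalibrated)
```

When likelihood and quality are perfectly aligned, the most likely hypothesis is also the best one, which is the property MAP decoding needs.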

[162] Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

Anindya Bijoy Das,Shibbir Ahmed,Shahnewaz Karim Sakib

Main category: cs.CL

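TL;DR: This paper studies how effective open-source large language models (LLMs) are at clinical summarization, in particular at extracting key events from discharge reports, and assesses how prevalent hallucinations are in the resulting summaries.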

Details

Motivation: Clinical summarization is essential in healthcare, but summarizing complex medical data precisely requires automated tools. LLMs show promise thanks to their natural language understanding, but their reliability and hallucination behavior need further study. Method: Uses numerical simulations to evaluate how well open-source LLMs extract key events from discharge reports (reasons for admission, in-hospital events, and follow-up actions), and analyzes hallucinations in the summaries. Result: LLMs are reasonably effective at clinical summarization, but hallucinations are prevalent, which directly affects the reliability of the information. Conclusion: Open-source LLMs show promise for clinical summarization but need further work to reduce hallucinations and ensure accurate, reliable information. Abstract: Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. We also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.

[163] ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics

Deeksha Varshney,Keane Ong,Rui Mao,Erik Cambria,Gianmarco Mengaldo

Main category: cs.CL

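TL;DR: The paper proposes EWRA, a method that enhances SLMs with structured reasoning paths derived from LLMs, and builds ExtremeWeatherNews, a dataset of extreme weather news articles, to improve extreme weather analytics.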

Details

Motivation: Accurate assessment of extreme weather events is vital for research and policy, but many regions lack localized, fine-grained data, which limits analysis and decision-making. Method: Proposes EWRA, which enhances SLMs with reasoning paths derived from LLMs, and builds the ExtremeWeatherNews dataset for three tasks: categorizing extreme weather impacts, topic labeling, and emotion analysis. Result: EWRA improves SLM performance on extreme weather analytics, surpassing task-specific models. Conclusion: EWRA and ExtremeWeatherNews together offer a more efficient and practical framework for extreme weather analytics. Abstract: Accurate assessments of extreme weather events are vital for research and policy, yet localized and granular data remain scarce in many parts of the world. This data gap limits our ability to analyze potential outcomes and implications of extreme weather events, hindering effective decision-making. Large Language Models (LLMs) can process vast amounts of unstructured text data, extract meaningful insights, and generate detailed assessments by synthesizing information from multiple sources. Furthermore, LLMs can seamlessly transfer their general language understanding to smaller models, enabling these models to retain key knowledge while being fine-tuned for specific tasks. In this paper, we propose Extreme Weather Reasoning-Aware Alignment (EWRA), a method that enhances small language models (SLMs) by incorporating structured reasoning paths derived from LLMs, and ExtremeWeatherNews, a large dataset of extreme weather event-related news articles. EWRA and ExtremeWeatherNews together form the overall framework, ClimaEmpact, that focuses on addressing three critical extreme-weather tasks: categorization of tangible vulnerabilities/impacts, topic labeling, and emotion analysis. By aligning SLMs with advanced reasoning strategies on ExtremeWeatherNews (and its derived dataset ExtremeAlign used specifically for SLM alignment), EWRA improves the SLMs' ability to generate well-grounded and domain-specific responses for extreme weather analytics. Our results show that the proposed approach guides SLMs to output domain-aligned responses, surpassing the performance of task-specific models and offering enhanced real-world applicability for extreme weather analytics.

[164] Sample-Efficient Language Model for Hinglish Conversational AI

Sakshi Singh,Abhinav Prakash,Aakriti Shah,Chaitanya Sachdeva,Sanjana Dumpala

Main category: cs.CL

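TL;DR: This paper presents a sample-efficient approach to building a conversational chatbot for Hinglish (code-mixed Hindi and English), addressing data scarcity by fine-tuning pre-trained models on synthetic data.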

Details

Motivation: Hinglish is a code-mixed language with inconsistent spelling and little standardized data, which makes conversational tasks hard and calls for an efficient, compute-friendly solution. Method: Evaluates pre-trained cross-lingual models, including Gemma3-4B and Qwen2.5-7B, and improves them with fine-tuning and synthetically generated dialogues. Result: Models with fewer parameters, fine-tuned on high-quality code-mixed data, achieve competitive performance on Hinglish conversation generation while remaining computationally efficient. Conclusion: Combining synthetic data with fine-tuning effectively addresses data scarcity for Hinglish conversation and enables efficient model development. Abstract: This paper presents our process for developing a sample-efficient language model for a conversational Hinglish chatbot. Hinglish, a code-mixed language that combines Hindi and English, presents a unique computational challenge due to inconsistent spelling, lack of standardization, and limited quality of conversational data. This work evaluates multiple pre-trained cross-lingual language models, including Gemma3-4B and Qwen2.5-7B, and employs fine-tuning techniques to improve performance on Hinglish conversational tasks. The proposed approach integrates synthetically generated dialogues with insights from existing Hinglish datasets to address data scarcity. Experimental results demonstrate that models with fewer parameters, when appropriately fine-tuned on high-quality code-mixed data, can achieve competitive performance for Hinglish conversation generation while maintaining computational efficiency.

[165] Efficient Reasoning for LLMs through Speculative Chain-of-Thought

Jikai Wang,Juntao Li,Lijun Wu,Min Zhang

Main category: cs.CL

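TL;DR: The paper proposes SCoT, a method that speeds up reasoning through collaboration between a large and a small model, substantially reducing latency.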

Details

Motivation: Existing efficient-reasoning methods mainly reduce the number of model parameters or shorten the chain of thought, but reasoning cost and latency remain unsolved. Method: SCoT drafts at the thought level with a lightweight draft model, selects the best chain-of-thought draft, and corrects error cases with the target model. Result: Across several datasets, SCoT reduces reasoning latency by 48% to 66% while keeping performance close to that of the target model. Conclusion: With thinking behavior alignment and a draft selection strategy, SCoT markedly improves reasoning efficiency while preserving accuracy. Abstract: Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating average reasoning speed through large and small model collaboration. SCoT conducts thought-level drafting using a lightweight draft model. Then it selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting and the draft selection strategy maintains the prediction accuracy for complex problems. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% for Deepseek-R1-Distill-Qwen-32B while achieving near-target-model-level performance. Our code is available at https://github.com/Jikai0Wang/Speculative_CoT.
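
The draft, select, and correct flow described for SCoT above can be sketched as a small control loop. This is an illustrative sketch only, not the released code: `draft_fn`, `select_fn`, and `target_fn` are hypothetical callables standing in for the draft model, the selection step, and the target model, and the convention that `select_fn` returns `None` when no draft is acceptable is an assumption.

```python
from typing import Callable, List, Optional


def speculative_cot(
    question: str,
    draft_fn: Callable[[str, int], str],
    select_fn: Callable[[str, List[str]], Optional[int]],
    target_fn: Callable[[str], str],
    num_drafts: int = 3,
) -> str:
    """Return a chain of thought for `question`.

    The cheap draft model proposes several candidate chains; the selector
    picks one by index, or returns None to signal that every draft is wrong,
    in which case the expensive target model reasons from scratch.
    """
    drafts = [draft_fn(question, i) for i in range(num_drafts)]
    choice = select_fn(question, drafts)
    if choice is None:
        # Error case: fall back to the target model.
        return target_fn(question)
    return drafts[choice]


# Toy stand-ins for the three components (hypothetical).
accepted = speculative_cot(
    "2 + 2 = ?",
    draft_fn=lambda q, i: f"draft {i}",
    select_fn=lambda q, drafts: 1,
    target_fn=lambda q: "target reasoning",
)
rejected = speculative_cot(
    "2 + 2 = ?",
    draft_fn=lambda q, i: f"draft {i}",
    select_fn=lambda q, drafts: None,
    target_fn=lambda q: "target reasoning",
)
```

The latency saving in such a scheme comes from the common case where a draft is accepted, so the target model only has to judge the drafts rather than generate a long chain itself.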

[166] Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

Qianren Mao,Qili Zhang,Hanwen Hao,Zhentao Han,Runhua Xu,Weifeng Jiang,Qi Hu,Zhijun Chen,Tyler Zhou,Bo Li,Yangqiu Song,Jin Dong,Jianxin Li,Philip S. Yu

Main category: cs.CL

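TL;DR: Proposes FedE4RAG, a framework that combines federated learning and knowledge distillation to address data privacy and data scarcity in private RAG systems.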

Details

Motivation: Private RAG systems face data scarcity and privacy issues that impede their deployment. Method: Combines federated learning (FL) and knowledge distillation with homomorphic encryption to protect data privacy and improve model performance. Result: Experiments confirm that FedE4RAG improves the performance of private RAG systems while protecting data privacy. Conclusion: FedE4RAG offers an efficient, privacy-preserving solution for private RAG systems. Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on a real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
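
The server-side aggregation of client parameters mentioned above can be illustrated with a weighted parameter average. The abstract does not state which aggregation rule FedE4RAG uses, so the sketch below assumes the standard federated averaging (FedAvg) rule; it omits knowledge distillation and homomorphic encryption entirely, and the name `federated_average` is hypothetical.

```python
from typing import List, Sequence


def federated_average(client_params: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Only parameters reach the server; the raw client data never leaves
    the client, which is the privacy argument behind federated learning.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    dim = len(client_params[0])
    if any(len(p) != dim for p in client_params):
        raise ValueError("all clients must share the same parameter shape")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients: the second holds three times as much data, so its
# parameters count three times as much in the global model.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```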

[167] APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

Huajian Xin,Luming Li,Xiaoran Jin,Jacques Fleuriot,Wenda Li

Main category: cs.CL

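TL;DR: The paper introduces the Automated Proof Engineering (APE) paradigm, which uses large language models (LLMs) to automate proof engineering tasks in math libraries, and presents APE-Bench I, the first benchmark built from a real math library (Mathlib4).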

Details

Motivation: Existing benchmarks are limited to static proof tasks and fail to reflect the iterative, engineering-intensive workflows of real math libraries, so a more realistic evaluation is needed. Method: Introduces the APE paradigm and develops the APE-Bench I benchmark and the Eleanstic parallel verification infrastructure, verifying tasks with a combination of the Lean compiler and LLM-as-a-Judge. Result: LLMs perform well on localized edits but degrade substantially on complex proof engineering tasks. Conclusion: The work lays the foundation for agentic workflows in proof engineering; future benchmarks will extend to multi-file coordination, project-scale verification, and autonomous agents. Abstract: Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.

[168] SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Jiaqi Chen,Bang Zhang,Ruotian Ma,Peisong Wang,Xiaodan Liang,Zhaopeng Tu,Xiaolong Li,Kwan-Yee K. Wong

Main category: cs.CL

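TL;DR: The paper proposes Self-Play Critic (SPC), a method that learns to assess the reliability of large language model (LLM) reasoning steps through adversarial self-play games, without manual annotation. SPC performs strongly on several benchmarks.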

Details

Motivation: Assessing the reliability of LLM reasoning steps (such as Chain-of-Thought) is challenging because high-quality step-level supervision is difficult and costly to obtain. Method: SPC trains two models through an adversarial self-play game, a "generator" that produces erroneous steps and a "critic" that judges step correctness, and improves both iteratively with reinforcement learning. Result: Across several benchmarks, SPC progressively improves error detection (e.g., ProcessBench accuracy rises from 70.8% to 77.7%) and significantly improves mathematical reasoning performance. Conclusion: SPC evaluates and improves the reliability of LLM reasoning steps without manual annotation and outperforms existing methods. Abstract: Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.
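
The reward rule stated above (the winner of each confrontation gets a positive reward, the loser a negative one) can be written down in a few lines. This is a sketch under assumptions: the abstract does not give reward magnitudes, so the values of plus and minus 1.0 and the function name `game_rewards` are illustrative, and each round is assumed to involve one deliberately erroneous step from the generator.

```python
from typing import Tuple


def game_rewards(critic_flags_error: bool) -> Tuple[float, float]:
    """Return (generator_reward, critic_reward) for one confrontation.

    The sneaky generator has produced an erroneous step. If the critic
    flags it, the critic wins; if the critic lets it pass, the generator
    has fooled the critic and wins instead. Rewards are zero-sum.
    """
    if critic_flags_error:
        return (-1.0, 1.0)
    return (1.0, -1.0)


# One round where the critic catches the error, one where it is fooled.
caught = game_rewards(critic_flags_error=True)
fooled = game_rewards(critic_flags_error=False)
```

Because the rewards are zero-sum, each model's improvement makes the game harder for the other, which is what drives the iterative self-evolution the abstract describes.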

[169] WuNeng: Hybrid State with Attention

Liu Xiao,Li Zhiyuan,Lin Yueyu

Main category: cs.CL

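TL;DR: The WuNeng architecture combines the RNN-based RWKV-7 with advanced attention mechanisms to improve a language model's expressivity and contextual coherence while remaining efficient.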

Details

Motivation: Improve the expressivity and contextual coherence of large language models without sacrificing computational efficiency. Method: Integrates RWKV-7 state-driven heads with standard multi-head attention, and introduces a cross-head interaction technique and a multi-token state processing mechanism. Result: Expressivity and sequence generation improve markedly with very few additional parameters. Conclusion: According to the authors, WuNeng strikes a new balance between expressivity and computational efficiency and sets a new standard for modern neural architectures. Abstract: The WuNeng architecture introduces a novel approach to enhancing the expressivity and power of large language models by integrating recurrent neural network (RNN)-based RWKV-7 with advanced attention mechanisms, prioritizing heightened contextual coherence over reducing KV cache size. Building upon the hybrid-head concept from Hymba, WuNeng augments standard multi-head attention with additional RWKV-7 state-driven heads, rather than replacing existing heads, to enrich the model's representational capacity. A cross-head interaction technique fosters dynamic synergy among standard, state-driven, and newly introduced middle heads, leveraging concatenation, additive modulation, and gated fusion for robust information integration. Furthermore, a multi-token state processing mechanism harnesses the continuous RWKV-7 state to capture intricate, sequence-wide dependencies, significantly boosting expressivity. Remarkably, these enhancements are achieved with minimal additional parameters, ensuring efficiency while empowering the model to excel in complex reasoning and sequence generation tasks. WuNeng sets a new standard for balancing expressivity and computational efficiency in modern neural architectures.

[170] Dynamic Embedded Topic Models: properties and recommendations based on diverse corpora

Elisabeth Fittschen,Bella Xia,Leib Celnik,Paul Dilley,Tom Lippincott

Main category: cs.CL

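TL;DR: Measures how several implementation choices for the Dynamic Embedded Topic Model affect results on five diachronic corpora, in order to isolate the decisions that matter for its use and further development.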

Details

Motivation: Clarify what makes the Dynamic Embedded Topic Model useful in applied scholarship, including practical scalability of vocabulary size and more flexible modeling of time intervals. Method: Analyzes five diachronic corpora to evaluate how different implementation choices affect model performance. Result: Vocabulary scalability and flexible interval modeling emerge as priorities, while several other factors have no significant or consistent effect on performance. Conclusion: The study gives practical guidance for using and improving the Dynamic Embedded Topic Model and highlights the key decision points. Abstract: We measure the effects of several implementation choices for the Dynamic Embedded Topic Model, as applied to five distinct diachronic corpora, with the goal of isolating important decisions for its use and further development. We identify priorities that will maximize utility in applied scholarship, including the practical scalability of vocabulary size to best exploit the strengths of embedded representations, and more flexible modeling of intervals to accommodate the uneven temporal distributions of historical writing. Of similar importance, we find performance is not significantly or consistently affected by several aspects that otherwise limit the model's application or might consume the resources of a grid search.

[171] Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Dylan Bouchard,Mohit Singh Chauhan

Main category: cs.CL

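TL;DR: Proposes a zero-resource hallucination detection framework that turns several uncertainty quantification techniques into standardized confidence scores and adds a tunable ensemble to optimize performance.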

Details

Motivation: Hallucinations in large language models (LLMs) are an increasingly serious problem in high-stakes domains such as healthcare and finance, and effective detection methods are needed. Method: Combines black-box and white-box uncertainty quantification techniques with LLM-as-a-Judge to produce standardized confidence scores between 0 and 1, and designs a tunable ensemble to optimize performance. Result: The tunable ensemble typically outperforms its individual components and existing hallucination detection methods. Conclusion: Customized hallucination detection strategies can improve the accuracy and reliability of LLMs. Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
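
One simple way to picture a tunable ensemble of confidence scores in [0, 1] is a weighted average whose weights are the tunable part. The sketch below is an assumption about what such an ensemble could look like; it does not use or reproduce the UQLM toolkit's API, and the scorer names and the function `ensemble_confidence` are hypothetical.

```python
from typing import Dict


def ensemble_confidence(scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Combine per-scorer confidence scores into one score in [0, 1].

    `scores` maps a scorer name to its confidence; `weights` maps the same
    names to non-negative weights, which are normalized to sum to 1.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must use the same scorer names")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("each score must lie in [0, 1]")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[name] * weights[name] for name in scores) / total


# Three hypothetical scorers; the judge is trusted twice as much.
confidence = ensemble_confidence(
    {"black_box": 0.8, "white_box": 0.6, "judge": 1.0},
    {"black_box": 1.0, "white_box": 1.0, "judge": 2.0},
)
```

Tuning would then mean searching over the weights (and a decision threshold) on labeled examples from the target use case; that search is left out here.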

[172] VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Mohamed Gado,Towhid Taliee,Muhammad Memon,Dmitry Ignatov,Radu Timofte

Main category: cs.CL

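TL;DR: This paper presents VIST-GPT, a visual storytelling approach built on multimodal models, and evaluates narrative quality with the reference-free metrics RoViST and GROOVIST.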

Details

Motivation: Traditional evaluation metrics (BLEU, METEOR, and similar) are not suitable for visual storytelling, so evaluation methods that align better with human judgment are needed. Method: Develops the VIST-GPT model from multimodal models, in particular transformer-based architectures, using the VIST dataset. Result: VIST-GPT produces visually grounded, contextually appropriate narratives, and its quality is assessed with the RoViST and GROOVIST metrics. Conclusion: The approach and the reference-free metrics offer a more effective solution for visual storytelling. Abstract: Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.

[173] AndroidGen: Building an Android Language Agent under Data Scarcity

Hanyu Lai,Junjie Gao,Xiao Liu,Yifan Xu,Shudan Zhang,Yuxiao Dong,Jie Tang

Main category: cs.CL

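TL;DR: The paper proposes the AndroidGen framework to address data scarcity for LLM agents on mobile devices, collecting task trajectories to train open-source LLMs without manual annotation.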

Details

Motivation: Although LLMs have great potential for NLP tasks, their use on mobile devices is limited by the lack of high-quality data sources and the high cost of human annotation. Method: Develops the AndroidGen framework, which collects task trajectories and trains open-source LLMs without manually labeled data. Result: Evaluations on AndroidWorld, AitW, and popular applications show AndroidGen's improvements and reveal directions for future work. Conclusion: AndroidGen offers an effective way to apply LLMs on mobile devices; code, model, and data are open-source. Abstract: Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.

[174] BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Peilin Zhou,Bruce Leon,Xiang Ying,Can Zhang,Yifan Shao,Qichen Ye,Dading Chong,Zhiling Jin,Chenxuan Xie,Meng Cao,Yuxin Gu,Sixin Hong,Jing Ren,Jian Chen,Chao Liu,Yining Hua

Main category: cs.CL

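TL;DR: BrowseComp-ZH is a high-difficulty benchmark built to evaluate the real-time browsing ability of large language models on the Chinese web, with 289 multi-hop questions across 11 domains. Existing models generally perform poorly on it.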

Details

Motivation: Existing benchmarks such as BrowseComp focus on English and overlook the complexities of other languages, Chinese in particular, so a dedicated Chinese benchmark is needed. Method: Builds 289 multi-hop questions by reverse-engineering them from short, objective, easily verifiable answers, and applies two-stage quality control for question difficulty and answer uniqueness. Result: Of more than 20 state-of-the-art language models and agentic search systems tested, a large number score below 10% accuracy and only a handful exceed 20%; the best system, OpenAI's DeepResearch, reaches just 42.9%. Conclusion: BrowseComp-ZH shows that current models still fall well short in retrieval and reasoning on the Chinese web, and provides an important benchmark for future research. Abstract: As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

[175] Unified Multi-Task Learning & Model Fusion for Efficient Language Model Guardrailing

James O' Neill,Santhosh Subramanian,Eric Lin,Vaikkunth Mugunthan

Main category: cs.CL

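TL;DR: The paper proposes an efficient task-specific data generation method that yields small classifiers which significantly outperform current state-of-the-art models, and further improves generalization through multi-task pretraining and model merging.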

Details

Motivation: Using large language models (LLMs) as guardrails against undesired behaviors brings high latency, high memory consumption, high cost, and non-structured outputs, all of which limit their use. Method: (1) Task-specific data generation to train efficient classifiers; (2) a multi-task pretrained model (MultiTaskGuard); (3) a search-based model merging approach (UniGuard). Result: Across multiple datasets and benchmarks, the proposed methods achieve significantly higher average F1 scores than existing state-of-the-art models and third-party APIs. Conclusion: Task-specific data generation and model merging significantly improve both efficiency and performance, offering a better solution for guardrailing. Abstract: The trend towards large language models (LLMs) for guardrailing against undesired behaviors is increasing and has shown promise for censoring user inputs. However, increased latency, memory consumption, hosting expenses and non-structured outputs can make their use prohibitive. In this work, we show that task-specific data generation can lead to fine-tuned classifiers that significantly outperform current state of the art (SoTA) while being orders of magnitude smaller. Secondly, we show that using a single model, MultiTaskGuard, that is pretrained on a large synthetically generated dataset with unique task instructions further improves generalization. Thirdly, our most performant models, UniGuard, are found using our proposed search-based model merging approach that finds an optimal set of parameters to combine single-policy models and multi-policy guardrail models. On 7 public datasets and 4 guardrail benchmarks we created, our efficient guardrail classifiers improve over the best performing SoTA publicly available LLMs and 3rd party guardrail APIs in detecting unsafe and safe behaviors by an average F1 score improvement of 29.92 points over Aegis-LlamaGuard and 21.62 over gpt-4o, respectively. Lastly, our guardrail synthetic data generation process that uses custom task-specific guardrail poli

[176] Explanatory Summarization with Discourse-Driven Planning

Dongqi Liu,Xi Yu,Vera Demberg,Mirella Lapata

Main category: cs.CL

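TL;DR: This paper presents a plan-based summarization approach that uses discourse frameworks to organize summary content and guide explanatory sentences, significantly improving summary quality.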

Details

Motivation: Current automatic summarization methods do not explicitly model explanatory content, so their summaries match human-written ones poorly. Method: Uses two discourse-driven planning strategies, conditioning the plan either as part of the input or as part of the output prefix. Result: Experiments on three datasets show that the approach outperforms existing methods in summary quality, robustness, and controllability, and reduces hallucination. Conclusion: Discourse-driven planning effectively improves the quality and controllability of summary generation. Abstract: Lay summaries for scientific documents typically include explanations to help readers grasp sophisticated concepts or arguments. However, current automatic summarization methods do not explicitly model explanations, which makes it difficult to align the proportion of explanatory content with human-written summaries. In this paper, we present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences by prompting responses to the plan. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix, respectively. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality, and it improves model robustness and controllability while mitigating hallucination.

[177] ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers

Zhouxiang Fang,Aayush Mishra,Muhan Gao,Anqi Liu,Daniel Khashabi

Main category: cs.CL

TL;DR: The paper proposes ICL CIPHERS, which uses reversible substitution ciphers to study task retrieval versus task learning in LLM in-context learning.

Details

Motivation: Investigate how to disentangle task retrieval from task learning in LLM in-context learning. Method: Introduces ICL CIPHERS, a class of task reformulations based on substitution ciphers, and tests how well LLMs solve tasks under reversible versus irreversible ciphers. Result: LLMs perform better on tasks with reversible ciphers than on the irreversible baseline, which indicates some ability to decode the cipher. Conclusion: ICL CIPHERS offers a new way to quantify "learning" in in-context learning and provides evidence of LLMs' decoding ability. Abstract: Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e. task retrieval (remember learned patterns from pre-training) and task learning (inference-time "learning" from demonstrations). However, disentangling these two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs are substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to the human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains a well-defined task in some abstract sense, despite the transformations. It is a curious question if LLMs can solve ICL CIPHERS with a BIJECTIVE mapping, which requires deciphering the latent cipher. We show that LLMs are better at solving ICL CIPHERS with BIJECTIVE mappings than the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify "learning" in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, we examine LLMs' internal representations and identify evidence of their ability to decode the ciphered inputs.
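
The bijective substitution described above can be illustrated at the word level. The sketch below is an assumption-laden toy, not the paper's tokenizer-level procedure: it permutes a random subset of a small vocabulary, so the mapping is a bijection on the whole vocabulary and can be inverted exactly. The names `make_bijective_cipher`, `apply_cipher`, and `invert_cipher` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def make_bijective_cipher(vocab: Sequence[str], fraction: float,
                          seed: int = 0) -> Dict[str, str]:
    """Pick a subset of the vocabulary and permute it among itself.

    Tokens outside the subset are left unchanged, so the overall mapping
    is one-to-one and therefore reversible.
    """
    unique = sorted(set(vocab))
    rng = random.Random(seed)
    chosen = rng.sample(unique, int(len(unique) * fraction))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    return dict(zip(chosen, shuffled))


def apply_cipher(tokens: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Substitute every token that has an entry in the mapping."""
    return [mapping.get(token, token) for token in tokens]


def invert_cipher(mapping: Dict[str, str]) -> Dict[str, str]:
    """Invert the mapping; valid only because the mapping is bijective."""
    return {target: source for source, target in mapping.items()}


vocab = ["the", "movie", "was", "great", "awful", "plot", "acting", "slow"]
cipher = make_bijective_cipher(vocab, fraction=0.5, seed=42)
sentence = ["the", "movie", "was", "great"]
encoded = apply_cipher(sentence, cipher)
decoded = apply_cipher(encoded, invert_cipher(cipher))
```

A non-bijective baseline could instead map each occurrence to a fresh random token, which destroys the fixed pattern and makes exact decoding impossible; that contrast is what the comparison in the abstract relies on.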

[178] Context Selection and Rewriting for Video-based Educational Question Generation

Mengxia Yu,Bang Nguyen,Olivia Zino,Meng Jiang

Main category: cs.CL

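TL;DR: The paper proposes an educational question generation framework grounded in real classroom content, addressing the difficulty existing methods have with timestamp alignment and target-answer integration when generating questions from videos.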

Details

Motivation: Existing educational question generation datasets are usually based on predefined, edited texts and fail to reflect real classroom content such as lecture speech accompanied by slides. Method: Proposes a framework that uses large language models to dynamically select and rewrite contexts, combining lecture transcripts and video keyframes to produce answer-containing knowledge statements. Result: The approach significantly improves the quality and relevance of the generated questions. Conclusion: With a realistic dataset and a new framework, the paper improves the accuracy and practicality of educational question generation. Abstract: Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address the challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released at https://github.com/mengxiayu/COSER.

[179] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara,Dev Khant,Saket Aryan,Taranjeet Singh,Deshraj Yadav

Main category: cs.CL

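TL;DR: Mem0 is a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves key information from conversations, addressing the difficulty large language models (LLMs) have in staying consistent over long multi-session dialogues. An enhanced variant uses graph-based memory to capture complex relational structures, and both outperform existing memory systems.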

Details

Motivation: 大型语言模型在生成连贯响应方面表现出色，但其固定上下文窗口限制了长时间多轮对话的一致性。Mem0旨在解决这一问题。 Method: 提出Mem0架构及其基于图记忆的增强版，动态管理对话信息。在LOCOMO基准上对比六类基线方法。 Result: Mem0在四种问题类别上均优于现有系统，性能提升显著（如26%相对改进），同时降低计算开销（如91%延迟减少）。 Conclusion: 结构化、持久化的记忆机制对长期对话一致性至关重要，Mem0为高效可靠的LLM驱动AI代理提供了可行方案。 Abstract: Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. 
Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.

[180] Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models

Jacky He,Guiran Liu,Binrong Zhu,Hanlu Zhang,Hongye Zheng,Xiaokai Wang

Main category: cs.CL

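TL;DR: This paper proposes a dynamically optimized RAG architecture that uses a state-aware dynamic knowledge retrieval mechanism to improve semantic understanding and knowledge scheduling efficiency in open-domain question answering and complex generation tasks.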

Details

Motivation: Address the limitations of static RAG structures in context adaptation and knowledge access. Method: Introduces a multi-level perceptive retrieval vector construction strategy and a differentiable document matching path, enabling end-to-end joint training and collaborative optimization of the retrieval and generation modules. Result: Validated on the Natural Questions dataset, with significant gains in BLEU and ROUGE-L scores and stronger robustness and generation consistency on tasks involving semantic ambiguity and multi-document fusion. Conclusion: The method has broad application potential and practical value for building high-quality language generation systems. Abstract: This paper focuses on the dynamic optimization of the Retrieval-Augmented Generation (RAG) architecture. It proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency in large language models for open-domain question answering and complex generation tasks. The method introduces a multi-level perceptive retrieval vector construction strategy and a differentiable document matching path. These components enable end-to-end joint training and collaborative optimization of the retrieval and generation modules. This effectively addresses the limitations of static RAG structures in context adaptation and knowledge access. Experiments are conducted on the Natural Questions dataset. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. Comparative and ablation experiments from multiple perspectives confirm significant improvements in BLEU and ROUGE-L scores. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion. These results highlight its broad application potential and practical value in building high-quality language generation systems.

[181] Systematic Bias in Large Language Models: Discrepant Response Patterns in Binary vs. Continuous Judgment Tasks

Yi-Long Lu,Chunhui Zhang,Wei Wang

Main category: cs.CL

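TL;DR: The study finds that large language models (LLMs) show a systematic negative bias across binary and continuous response formats, with the binary format more likely to yield negative judgments.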

Details

Motivation: Examine LLMs' judgment bias under different response formats (binary versus continuous) in order to assess their reliability. Method: Tests several LLMs under both response formats on a value statement judgment task and a text sentiment analysis task. Result: LLMs lean toward negative judgments in the binary format, and the pattern is consistent across both tasks. Conclusion: The choice of response format in task design can introduce systematic bias and needs careful consideration. Abstract: Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated workflows. However, their reliability remains a concern due to potential biases inherited from their training process. In this study, we examine how different response formats, binary versus continuous, may systematically influence LLMs' judgments. In a value statement judgment task and a text sentiment analysis task, we prompted LLMs to simulate human responses and tested both formats across several models, including both open-source and commercial models. Our findings revealed a consistent negative bias: LLMs were more likely to deliver "negative" judgments in binary formats compared to continuous ones. Control experiments further revealed that this pattern holds across both tasks. Our results highlight the importance of considering response format when applying LLMs to decision tasks, as small changes in task design can introduce systematic biases.

[182] Towards Long Context Hallucination Detection

Siyi Liu,Kishaloy Halder,Zheng Qi,Wei Xiao,Nikolaos Pappas,Phu Mon Htut,Neha Anna John,Yassine Benajiba,Dan Roth

Main category: cs.CL

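TL;DR: This paper proposes a new architecture for detecting LLM hallucinations in long-context inputs, which substantially improves performance through a decomposition and aggregation mechanism.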

Details

Motivation: Although LLMs perform well across many tasks, they are prone to hallucinating on long-context inputs, a problem that remains unsolved. Method: Builds a dataset dedicated to long-context hallucination detection and proposes a new architecture that lets pre-trained encoder models such as BERT process long contexts and detect hallucinations effectively. Result: Experiments show that the architecture significantly outperforms models of similar size and LLM-based models on several metrics, with faster inference. Conclusion: The work offers an initial solution to LLM hallucination in long-context inputs and demonstrates both efficiency and strong performance. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference.

[183] BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Jiageng Wu,Bowen Gu,Ren Zhou,Kevin Xie,Doug Snyder,Yixing Jiang,Valentina Carducci,Richard Wyss,Rishi J Desai,Emily Alsentzer,Leo Anthony Celi,Adam Rodman,Sebastian Schneeweiss,Jonathan H. Chen,Santiago Romero-Brufau,Kueiyu Joshua Lin,Jie Yang

Main category: cs.CL

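TL;DR: BRIDGE is a multilingual clinical benchmark covering 87 tasks. It evaluates 52 LLMs on real-world clinical data and finds that open-source models can match proprietary ones, while medically fine-tuned models built on older architectures perform worse.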

Details

Motivation: Evaluation of LLMs in clinical settings is limited, and existing benchmarks fail to capture the complexity of real-world electronic health record (EHR) data. Method: Presents the BRIDGE benchmark of 87 tasks and systematically evaluates 52 LLMs under different inference strategies. Result: Open-source LLMs perform on par with proprietary models, while medically fine-tuned models built on older architectures perform worse. Conclusion: BRIDGE provides a foundational resource and reference for developing and evaluating LLMs for clinical text understanding. Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.

[184] Conflicts in Texts: Data, Implications and Challenges

Siyi Liu,Dan Roth

Main category: cs.CL

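TL;DR: This paper examines conflicting information in NLP models, groups it into three categories (natural texts, human-annotated data, and model interactions), analyzes its implications, and discusses mitigation strategies.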

Details

Motivation: As NLP models spread through real-world applications, conflicting information can make them unreliable and untrustworthy, so the problem needs to be addressed. Method: Categorizes and analyzes three kinds of conflicting information (natural texts, human-annotated data, and model interactions) and discusses mitigation strategies. Result: Summarizes the implications of conflicting information and outlines future directions for conflict-aware NLP systems. Conclusion: Future NLP systems need to handle and reconcile conflicting information more effectively to improve reliability and trustworthiness. Abstract: As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models' reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.

[185] Detecting Effects of AI-Mediated Communication on Language Complexity and Sentiment

Kristen Sussman,Daniel Carter

Main category: cs.CL

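TL;DR: By analyzing tweets about Trump from 2020 and 2024, the study reports marked shifts in language patterns and sentiment on social media, which it attributes to the influence of AI.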

Details

Motivation: Explore how AI affects language patterns and emotional expression, particularly changes on social media. Method: Compares tweet datasets from 2020 and 2024, using Flesch-Kincaid readability and sentiment polarity scores to analyze changes in text complexity and sentiment. Result: Mean sentiment polarity rose significantly (0.12 vs. 0.04), neutral content fell (54.8% to 39.8%), and positive expression grew (28.6% to 45.9%). Conclusion: The presence of AI in social media is growing, and it appears to affect patterns of language and emotional expression. Abstract: Given the subtle human-like effects of large language models on linguistic patterns, this study examines shifts in language over time to detect the impact of AI-mediated communication (AI-MC) on social media. We compare a replicated dataset of 970,919 tweets from 2020 (pre-ChatGPT) with 20,000 tweets from the same period in 2024, all of which mention Donald Trump during election periods. Using a combination of Flesch-Kincaid readability and polarity scores, we analyze changes in text complexity and sentiment. Our findings reveal a significant increase in mean sentiment polarity (0.12 vs. 0.04) and a shift from predominantly neutral content (54.8% in 2020 to 39.8% in 2024) to more positive expressions (28.6% to 45.9%). These findings suggest not only an increasing presence of AI in social media communication but also its impact on language and emotional expression patterns.
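
For readers unfamiliar with the readability measure named in this abstract, the following is a minimal sketch of the standard Flesch-Kincaid grade-level formula. The abstract does not say which Flesch-Kincaid variant or which tooling the authors used, so the vowel-group syllable counter and the function names here are illustrative assumptions, not the paper's pipeline.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat. The dog ran."))
```

Higher values indicate text that needs more schooling to read; very short, simple sentences can score below zero. Tweets are short and noisy, so a production analysis would likely use a dictionary-based syllable counter instead of this heuristic.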

[186] m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

Meng Xiao,Xunxin Cai,Chengrui Wang,Yuanchun Zhou

Main category: cs.CL

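TL;DR: Proposes a knowledge-driven multi-agent framework for distilling scientific corpora in the biomedical domain, which markedly improves language model performance on biomedical question-answering tasks.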

Details

Motivation: Existing open-source biomedical corpora fall short in quantity and quality for large language models, and the complex hierarchy of biomedical knowledge poses an additional challenge. Method: Uses a collaborative multi-agent architecture in which each agent, guided by the MeSH hierarchy, autonomously extracts, synthesizes, and evaluates high-quality text and generates domain-specific question-answer pairs. Result: Experiments show that models trained with the framework perform strongly on biomedical question answering, even surpassing advanced proprietary models such as GPT-4. Conclusion: Multi-agent collaboration shows considerable potential for biomedical corpus distillation and LLM training. Abstract: The rapid progress of large language models (LLMs) in biomedical research has underscored the limitations of existing open-source annotated scientific corpora, which are often insufficient in quantity and quality. Addressing the challenge posed by the complex hierarchy of biomedical knowledge, we propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for LLM training in the biomedical domain. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. These agents collectively generate and refine domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.

[187] Arabic Metaphor Sentiment Classification Using Semantic Information

Israa Alsiyat

Main category: cs.CL

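TL;DR: The paper discusses testing the Arabic Metaphor Corpus (AMC) with a newly designed sentiment classification tool based on semantic tags, evaluated with F-score, recall, and precision.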

Details

Motivation: Study how Arabic online metaphors affect sentiment, filling the gap in sentiment classification of Arabic metaphors using semantic tags. Method: Designs an automatic tool based on semantic emotional tags for sentiment classification of Arabic metaphors and evaluates it with standard metrics (F-score, recall, precision). Result: The tool was applied to the AMC and shows the impact of Arabic metaphors on sentiment. Conclusion: This is the first study to use semantic tags for sentiment classification of Arabic metaphors, offering a new method for the field. Abstract: In this paper, I discuss the testing of the Arabic Metaphor Corpus (AMC) [1] using newly designed automatic tools for sentiment classification for AMC based on semantic tags. The tool incorporates semantic emotional tags for sentiment classification. I evaluate the tool using standard methods, which are F-score, recall, and precision. The method is to show the impact of Arabic online metaphors on sentiment through the newly designed tools. To the best of my knowledge, this is the first approach to conduct sentiment classification for Arabic metaphors using semantic tags to find the impact of the metaphor.
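
The three evaluation metrics named in this abstract can be sketched as follows for a single target class. This is a generic illustration of the standard definitions; the label names and the function `precision_recall_f1` are hypothetical and do not come from the paper.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one target label, from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)  # true positives
    fp = sum(1 for g, p in pairs if g != positive and p == positive)  # false positives
    fn = sum(1 for g, p in pairs if g == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["pos", "pos", "neg", "neg", "pos"]
    pred = ["pos", "neg", "neg", "pos", "pos"]
    print(precision_recall_f1(gold, pred, positive="pos"))
```

For a multi-class sentiment task, the same function would be called once per label and the results averaged (macro averaging), which is one common convention among several.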

[188] Coreference Resolution for Vietnamese Narrative Texts

Hieu-Dai Tran,Duc-Vu Nguyen,Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: The paper studies coreference resolution in Vietnamese, builds an annotated dataset, and evaluates GPT-3.5-Turbo and GPT-4, finding that GPT-4 performs better.

Details

Motivation: Vietnamese is a low-resource language that lacks annotated datasets, which makes coreference resolution challenging. Method: Builds an annotated dataset from VnExpress narrative texts and evaluates GPT-3.5-Turbo and GPT-4 on it. Result: GPT-4 significantly outperforms GPT-3.5-Turbo in accuracy and consistency. Conclusion: GPT-4 is the more reliable tool for Vietnamese coreference resolution. Abstract: Coreference resolution is a vital task in natural language processing (NLP) that involves identifying and linking different expressions in a text that refer to the same entity. This task is particularly challenging for Vietnamese, a low-resource language with limited annotated datasets. To address these challenges, we developed a comprehensive annotated dataset using narrative texts from VnExpress, a widely-read Vietnamese online news platform. We established detailed guidelines for annotating entities, focusing on ensuring consistency and accuracy. Additionally, we evaluated the performance of large language models (LLMs), specifically GPT-3.5-Turbo and GPT-4, on this dataset. Our results demonstrate that GPT-4 significantly outperforms GPT-3.5-Turbo in terms of both accuracy and response consistency, making it a more reliable tool for coreference resolution in Vietnamese.

[189] VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

Run Luo,Renke Shan,Longze Chen,Ziqiang Liu,Lu Wang,Min Yang,Xiaobo Xia

Main category: cs.CL

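TL;DR: The paper proposes VCM, a self-supervised visual concept modeling framework that uses implicit contrastive learning and vision-language fine-tuning to cut computational cost substantially while maintaining strong performance.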

Details

Motivation: Current large vision-language models (LVLMs) process images inefficiently and lack a visual concept model, which limits their practical use. Method: Proposes the VCM framework, which combines implicit contrastive learning with vision-language fine-tuning and needs no costly concept-level annotations. Result: VCM significantly reduces computational cost (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across tasks. Conclusion: VCM strengthens visual encoders, and experiments validate its efficiency and effectiveness. Abstract: Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.

[190] A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks

Shadan Shukr Sabr,Nazira Sabr Mustafa,Talar Sabah Omar,Salah Hwayyiz Rasool,Nawzad Anwer Omer,Darya Sabir Hamad,Hemin Abdulhameed Shams,Omer Mahmood Kareem,Rozhan Noori Abdullah,Khabat Atar Abdullah,Mahabad Azad Mohammad,Haneen Al-Raghefy,Safar M. Asaad,Sara Jamal Mohammed,Twana Saeed Ali,Fazil Shawrow,Halgurd S. Maghdid

Main category: cs.CL

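TL;DR: The study designs an accurate and comprehensive part-of-speech tagset for the Central Kurdish language (CKL) to support Kurdish natural language processing tasks.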

Details

Motivation: Because resources are scarce, part-of-speech tagging for low-resource languages such as Central Kurdish lacks standardization and coverage, which holds back other NLP tasks. Method: Consolidates POS tags from different studies and from Kurdish linguistic experts into a standardized tagset and makes an initial comparison with the Universal Dependencies framework. Result: The proposed POS tagset annotates Kurdish corpora more accurately and supports related NLP tasks. Conclusion: The study provides a standardized tool for Kurdish NLP tasks that can be further refined and extended. Abstract: The field of natural language processing (NLP) has dramatically expanded within the last decade. Many everyday applications rely on NLP tasks, ranging from machine translation, speech recognition, text generation and recommendations to Part-of-Speech (POS) tagging and Named-Entity Recognition (NER). However, low-resourced languages, such as the Central-Kurdish language (CKL), remain largely unexamined due to a shortage of the resources needed to support their development. The POS tagging task is the base of other NLP tasks; for example, the POS tag set has been used to standardize languages to provide the relationship between words among the sentences, followed by machine translation and text recommendation. Specifically, for the CKL, most of the utilized or provided POS tagsets are neither standardized nor comprehensive. To this end, this study presents an accurate and comprehensive POS tagset for the CKL to provide better performance of the Kurdish NLP tasks. The article also collects most of the POS tags from different studies as well as from Kurdish linguistic experts to standardize part-of-speech tags. The proposed POS tagset is designed to annotate a large CKL corpus and support Kurdish NLP tasks. The initial investigations of this study, via comparison with the Universal Dependencies framework for standard languages, show that the proposed POS tagset can streamline or correct sentences more accurately for Kurdish NLP tasks.

[191] Multimodal Conditioned Diffusive Time Series Forecasting

Chen Su,Yuanhe Tian,Yan Song

Main category: cs.CL

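TL;DR: MCD-TSF is a multimodal conditioned diffusion model for time series forecasting that combines timestamps and text to improve prediction performance.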

Details

Motivation: Existing diffusion models for time series forecasting focus only on single-modality numerical sequences and overlook the rich multimodal information available. Method: Proposes the MCD-TSF model, which jointly uses timestamps and text as extra guidance to establish temporal and semantic correlations in the time series. Result: Experiments on real-world datasets across eight domains show that MCD-TSF achieves state-of-the-art performance. Conclusion: MCD-TSF uses multimodal information to markedly improve the accuracy of time series forecasting. Abstract: Diffusion models achieve remarkable success in processing images and text, and have been extended to special domains such as time series forecasting (TSF). Existing diffusion-based approaches for TSF primarily focus on modeling single-modality numerical sequences, overlooking the rich multimodal information in time series data. To effectively leverage such information for prediction, we propose a multimodal conditioned diffusion model for TSF, namely, MCD-TSF, to jointly utilize timestamps and texts as extra guidance for time series modeling, especially for forecasting. Specifically, timestamps are combined with time series to establish temporal and semantic correlations among different data points when aggregating information along the temporal dimension. Texts serve as supplementary descriptions of time series' history, and are adaptively aligned with data points as well as dynamically controlled in a classifier-free manner. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed MCD-TSF model achieves state-of-the-art performance.

[192] Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

Osma Suominen,Juho Inkinen,Mona Lehtinen

Main category: cs.CL

TL;DR: The Annif system for SemEval-2025 Task 5 combines traditional NLP with LLM techniques for multilingual subject indexing and performs strongly.

Details

Motivation: Explore how to combine traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of multilingual subject indexing. Method: Combines the traditional NLP/ML techniques in the Annif toolkit with LLM-based translation and synthetic data generation, and merges predictions from monolingual models. Result: Ranked first (all-subjects category) and second (tib-core-subjects category) in the quantitative evaluation, and fourth in the qualitative evaluation. Conclusion: Combining traditional and LLM techniques shows promise for multilingual subject indexing. Abstract: This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.

[193] Taming the Titans: A Survey of Efficient LLM Inference Serving

Ranran Zhen,Juntao Li,Yixin Ji,Zhenlin Yang,Tong Liu,Qingrong Xia,Xinyu Duan,Zhefeng Wang,Baoxing Huai,Min Zhang

Main category: cs.CL

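TL;DR: This paper surveys the low-latency and high-throughput challenges of large language model (LLM) inference serving and their solutions, covering instance-level, cluster-level, and emerging-scenario methods.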

Details

Motivation: The parameter scale of LLMs and the heavy computational demands of the attention mechanism cause large memory and compute overhead, which hurts the efficiency of inference serving and calls for optimization methods. Method: Provides a systematic analysis organized by instance-level methods (model placement, request scheduling, etc.), cluster-level methods (GPU cluster deployment, load balancing, etc.), and emerging scenarios (tasks, modules, etc.). Result: Summarizes the key techniques and strategies currently used to optimize LLM inference serving and gives a comprehensive overview of research progress. Conclusion: Future research directions include further optimizing instance-level and cluster-level strategies and exploring new methods for emerging scenarios. Abstract: Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.

[194] LLM-Assisted Automated Deductive Coding of Dialogue Data: Leveraging Dialogue-Specific Characteristics to Enhance Contextual Understanding

Ying Na,Shihui Feng

Main category: cs.CL

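TL;DR: The study proposes an LLM-based method for automated coding of dialogue data that combines role prompts and chain-of-thought prompting with multi-model collaboration and contextual consistency checking, substantially improving coding accuracy.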

Details

Motivation: Dialogue data is key to understanding learning processes, but LLMs struggle with its complex context. The study aims to address these challenges and improve the accuracy of automated coding. Method: 1) Predicts codes from dialogue-specific characteristics (communicative acts and events); 2) uses multiple LLMs for collaborative prediction; 3) performs consistency checking based on the interrelation between events and acts. Result: Contextual consistency checking substantially improved accuracy, and act predictions were more accurate than event predictions. Conclusion: The study provides a new methodological framework for automated coding of dialogue data and addresses the challenges posed by contextual complexity. Abstract: Dialogue data has been a key source for understanding learning processes, offering critical insights into how students engage in collaborative discussions and how these interactions shape their knowledge construction. The advent of Large Language Models (LLMs) has introduced promising opportunities for advancing qualitative research, particularly in the automated coding of dialogue data. However, the inherent contextual complexity of dialogue presents unique challenges for these models, especially in understanding and interpreting complex contextual information. This study addresses these challenges by developing a novel LLM-assisted automated coding approach for dialogue data. The novelty of our proposed framework is threefold: 1) We predict the code for an utterance based on dialogue-specific characteristics -- communicative acts and communicative events -- using separate prompts following the role prompts and chain-of-thoughts methods; 2) We engaged multiple LLMs including GPT-4-turbo, GPT-4o, and DeepSeek in collaborative code prediction; 3) We leveraged the interrelation between events and acts to implement consistency checking using GPT-4o. In particular, our contextual consistency checking provided a substantial accuracy improvement. We also found the accuracy of act predictions was consistently higher than that of event predictions. This study contributes a new methodological framework for enhancing the precision of automated coding of dialogue data and offers a scalable solution for addressing the contextual challenges inherent in dialogue analysis.

[195] Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

Huichi Zhou,Zehao Xu,Munan Zhao,Kaihong Li,Yiqiang Li,Hongtao Wang

Main category: cs.CL

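TL;DR: The paper introduces the Multilingual Moral Reasoning Benchmark (MMRB), which evaluates large language models across five languages and three levels of contextual complexity, and finds that low-resource languages have a stronger impact on multilingual reasoning.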

Details

Motivation: Evaluate the moral reasoning abilities of large language models across languages and levels of contextual complexity, and explore the role of low-resource languages. Method: Tests five languages and three complexity levels with the MMRB benchmark and fine-tunes the LLaMA-3-8B model. Result: Moral reasoning performance degrades as complexity increases, and low-resource languages have a significant impact on multilingual reasoning. Conclusion: Low-resource languages play a critical role in multilingual NLP and deserve more attention. Abstract: In this paper, we introduce the Multilingual Moral Reasoning Benchmark (MMRB) to evaluate the moral reasoning abilities of large language models (LLMs) across five typologically diverse languages and three levels of contextual complexity: sentence, paragraph, and document. Our results show moral reasoning performance degrades with increasing context complexity, particularly for low-resource languages such as Vietnamese. We further fine-tune the open-source LLaMA-3-8B model using curated monolingual data for alignment and poisoning. Surprisingly, low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, highlighting their critical role in multilingual NLP.

[196] Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

Takuya Tamura,Taro Yano,Masafumi Enomoto,Masafumi Oyamada

Main category: cs.CL

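TL;DR: Proposes a lineage-based matrix factorization method (LRMF) for predicting the performance of large language models, which clearly outperforms conventional methods.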

Details

Motivation: Reduce the computational cost and development time spent before fine-tuning or merging large language models, while accounting for lineage relationships between models. Method: Uses Lineage-Regularized Matrix Factorization (LRMF), which encodes lineage relationships between models through graph Laplacian regularization. Result: Across 6 major benchmarks, LRMF improves the correlation of predicted with actual performance by 7-10 percentage points over baselines and handles the cold-start problem effectively. Conclusion: LRMF offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination for large language models. Abstract: Accurately forecasting the performance of Large Language Models (LLMs) before extensive fine-tuning or merging can substantially reduce both computational expense and development time. Although prior approaches like scaling laws account for global factors such as parameter size or training tokens, they often overlook explicit lineage relationships - i.e., which models are derived or merged from which parents. In this work, we propose a novel Lineage-Regularized Matrix Factorization (LRMF) framework that encodes ancestral ties among LLMs via a graph Laplacian regularizer. By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods in both instance-level and benchmark-level performance prediction. Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks, showing that lineage constraints yield up to 7-10 percentage points higher correlation with actual performance compared to baselines. Moreover, LRMF effectively addresses the cold-start problem, providing accurate estimates for newly derived or merged models even with minimal data. This lineage-guided strategy thus offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination in modern LLM development.
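
To make the idea of a graph Laplacian regularizer concrete, here is a minimal sketch of a matrix factorization loss with a lineage penalty. It assumes a generic form (squared reconstruction error plus a penalty on embedding distance between parent and child models); the abstract does not give the exact objective, so the function names, the `lam` weight, and the data layout are illustrative assumptions rather than the authors' LRMF implementation.

```python
def laplacian_penalty(model_emb, edges):
    """Sum of squared distances between embeddings of lineage-linked models.

    For an unweighted, undirected lineage graph with Laplacian L = D - A,
    this equals trace(U^T L U), where the rows of U are the model embeddings.
    """
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(model_emb[i], model_emb[j]))
    return total


def lineage_regularized_loss(scores, model_emb, task_emb, edges, lam=0.1):
    """Squared error on observed (model, task) scores plus the lineage penalty.

    scores: dict mapping (model_index, task_index) to an observed score.
    edges:  list of (parent_index, child_index) pairs (hypothetical lineage data).
    """
    fit = 0.0
    for (m, t), observed in scores.items():
        predicted = sum(a * b for a, b in zip(model_emb[m], task_emb[t]))
        fit += (observed - predicted) ** 2
    return fit + lam * laplacian_penalty(model_emb, edges)


if __name__ == "__main__":
    model_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    task_emb = [[1.0, 0.0]]
    scores = {(0, 0): 1.0}
    print(lineage_regularized_loss(scores, model_emb, task_emb, edges=[(0, 2)], lam=0.5))
```

The penalty pulls the embedding of a derived model toward its parents, which is one plausible reason such a regularizer helps with cold-start models: a new model with few observed scores inherits an estimate from its lineage neighbors.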

[197] To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels

Kyo Gerrits,Ana Guerberof-Arenas

Main category: cs.CL

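TL;DR: The study examines how four conditions (machine translation, post-editing, human translation, and the original source text) affect readers' cognitive load, finding that creative content increases cognitive load and that the effect is strongest for human translation.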

Details

Motivation: Understand how creativity and errors in different translation modalities affect readers' cognitive load. Method: Analyzes the reading experience of eight participants using a questionnaire, an eye-tracker, and retrospective think-aloud interviews. Result: Creative content increases cognitive load, the effect is strongest for human translation, and errors show no significant effect. Conclusion: The effect of translation creativity on cognitive load opens new avenues for research, and the data are publicly available for further study. Abstract: This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at https://github.com/INCREC/Pilot_to_MT_or_not_to_MT

[198] Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language

Anastasia Zhukova,Christian E. Matt,Terry Ruas,Bela Gipp

Main category: cs.CL

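TL;DR: This paper proposes ICL-APT, an efficient method that combines in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data, significantly reducing computing time while maintaining model performance.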

Details

Motivation: Traditional domain-adaptive continual pretraining (DAPT) needs a large amount of domain-related data, which is hard to obtain for non-English domains such as the German-language process industry. Method: Uses ICL and kNN to augment the target data with domain-related texts, reducing the computational resources required. Result: ICL-APT beats traditional DAPT by 3.5 points on average IR metrics and cuts computing time almost fourfold. Conclusion: The method offers an efficient solution for industries with limited computational capacity and is applicable to other low-resource domains. Abstract: Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., language masking. Although popular, it requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that this approach performs better than traditional DAPT by 3.5 of the average IR metrics (e.g., mAP, MRR, and nDCG) and requires almost 4 times less computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.
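
The kNN augmentation step mentioned in this abstract can be illustrated with a minimal sketch: given an embedding for a target document, retrieve the k most similar texts from a pool and attach them as extra context. The abstract does not describe how ICL-APT builds embeddings or assembles the augmented input, so the cosine-similarity retrieval, the toy vectors, and the function names below are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_texts(target_vec, pool, k=2):
    """Return the k pool texts whose vectors are most similar to target_vec.

    pool: list of (text, vector) pairs, e.g. domain-related documents with
    precomputed embeddings (hypothetical data).
    """
    ranked = sorted(pool, key=lambda item: cosine(target_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def augment(target_text, target_vec, pool, k=2):
    """Prepend the retrieved neighbors to the target text as additional context."""
    return "\n".join(knn_texts(target_vec, pool, k) + [target_text])


if __name__ == "__main__":
    pool = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
    print(augment("target", [1.0, 0.0], pool, k=2))
```

Retrieving neighbors once and reusing them as context is cheaper than pretraining on a large domain corpus, which is consistent with the reduced GPU time the abstract reports, though the paper's actual pipeline may differ.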

[199] semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Ke Hong,Lufang Chen,Zhong Wang,Xiuhong Li,Qiuli Mao,Jianping Ma,Chao Xiong,Guanyu Wu,Buhe Han,Guohao Dai,Yun Liang,Yu Wang

Main category: cs.CL

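TL;DR: The paper proposes semi-PD, a new LLM serving system that addresses the storage inefficiency of existing systems through disaggregated computation and unified storage, significantly improving performance at high request rates.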

Details

Motivation: Existing LLM serving systems are either unified or disaggregated. The former suffer from latency interference and scheduling problems, while the latter perform poorly because of storage inefficiency. Method: Proposes the semi-PD system, designed around disaggregated computation and unified storage, and introduces a computation resource controller and a unified memory manager together with a resource adjustment mechanism and a dynamic partitioning algorithm. Result: semi-PD significantly lowers latency and serves more requests on DeepSeek and Llama models, outperforming existing systems. Conclusion: semi-PD resolves the storage inefficiency through its design and offers an efficient solution for high-load LLM serving. Abstract: Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where the prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment.
Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.

[200] GenCLS++: Pushing the Boundaries of Generative Classification in LLMs Through Comprehensive SFT and RL Studies Across Diverse Datasets

Mingqian He,Fei Zhao,Chonggang Lu,Ziyan Liu,Yue Wang,Haofu Qian

Main category: cs.CL

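TL;DR: GenCLS++ is a generative text classification framework that jointly optimizes supervised fine-tuning (SFT) and reinforcement learning (RL); by exploring five strategy dimensions, it substantially improves classification accuracy.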

Details

Motivation: As large language models (LLMs) advance rapidly, traditional discriminative methods overlook their generative strengths, while existing generative classification methods rely on simple SFT alone and lack a systematic exploration of training and inference prompts. Method: GenCLS++ combines SFT and RL, explores five strategy dimensions (in-context learning variants, category definitions, uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding), and applies RL with a rule-based reward after an SFT warm-up. Result: Across seven datasets, GenCLS++ improves average accuracy by 3.46% over the baseline, rising to 4.00% on public datasets. Classification performs better without explicit reasoning steps. Conclusion: GenCLS++ offers a unified framework for generative text classification and shows the limits of explicit reasoning in classification tasks, providing guidance for future LLM applications. Abstract: As a fundamental task in machine learning, text classification plays a crucial role in many areas. With the rapid scaling of Large Language Models (LLMs), particularly through reinforcement learning (RL), there is a growing need for more capable discriminators. Consequently, advances in classification are becoming increasingly vital for enhancing the overall capabilities of LLMs. Traditional discriminative methods map text to labels but overlook LLMs' intrinsic generative strengths. Generative classification addresses this by prompting the model to directly output labels. However, existing studies still rely on simple SFT alone, seldom probing the interplay between training and inference prompts, and no work has systematically leveraged RL for generative text classifiers and unified SFT, RL, and inference-time prompting in one framework. We bridge this gap with GenCLS++, a framework that jointly optimizes SFT and RL while systematically exploring five high-level strategy dimensions (in-context learning variants, category definitions, explicit uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding) during both training and inference. After an SFT "policy warm-up," we apply RL with a simple rule-based reward, yielding sizable extra gains. Across seven datasets, GenCLS++ achieves an average accuracy improvement of 3.46% relative to the naive SFT baseline; on public datasets, this improvement rises to 4.00%. Notably, unlike reasoning-intensive tasks that benefit from explicit thinking processes, we find that classification tasks perform better without such reasoning steps. These insights into the role of explicit reasoning provide valuable guidance for future LLM applications.

[201] Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

Luigia Costabile,Gian Marco Orlando,Valerio La Gatta,Vincenzo Moscato

Main category: cs.CL

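TL;DR: The paper examines the potential of LLM-powered generative agents in crowdsourced fact-checking, finding that they outperform human crowds and are less susceptible to bias.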

Details

Motivation: The spread of online misinformation calls for scalable, reliable fact-checking. Crowdsourced fact-checking is inexpensive but raises concerns about quality and bias. LLMs perform well on fact-checking tasks, yet their role in crowdsourced workflows is unexplored. Method: Simulate crowds of generative agents with diverse demographic and ideological profiles and study their performance on fact-checking tasks, including evidence retrieval, quality assessment, and veracity judgment. Result: Agent crowds outperform human crowds in truthfulness classification, show higher internal consistency, and are less susceptible to social and cognitive biases. Agents rely more systematically on criteria such as Accuracy, Precision, and Informativeness. Conclusion: Generative agents have potential as scalable, consistent, and less biased contributors to crowd-based fact-checking systems. Abstract: The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking, where non-experts evaluate claim veracity, offers a cost-effective alternative to expert verification, despite concerns about variability in quality and bias. Encouraged by promising results in certain contexts, major platforms such as X (formerly Twitter), Facebook, and Instagram have begun shifting from centralized moderation to decentralized, crowd-based approaches. In parallel, advances in Large Language Models (LLMs) have shown strong performance across core fact-checking tasks, including claim detection and evidence evaluation. However, their potential role in crowdsourced workflows remains unexplored. This paper investigates whether LLM-powered generative agents (autonomous entities that emulate human behavior and decision-making) can meaningfully contribute to fact-checking tasks traditionally reserved for human crowds. Using the protocol of La Barbera et al. (2024), we simulate crowds of generative agents with diverse demographic and ideological profiles. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue final veracity judgments. Our results show that agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, and show reduced susceptibility to social and cognitive biases. Compared to humans, agents rely more systematically on informative criteria such as Accuracy, Precision, and Informativeness, suggesting a more structured decision-making process. Overall, our findings highlight the potential of generative agents as scalable, consistent, and less biased contributors to crowd-based fact-checking systems.

[202] TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

Emre Can Acikgoz,Carl Guo,Suvodip Dey,Akul Datta,Takyoung Kim,Gokhan Tur,Dilek Hakkani-Tür

Main category: cs.CL

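TL;DR: TD-EVAL is a two-step evaluation framework for task-oriented dialogue systems that combines fine-grained turn-level analysis with holistic dialogue-level comparisons.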

Details

Motivation: Traditional automatic evaluation methods cannot detect critical intermediate errors in task-oriented dialogue systems, so finer-grained evaluation is needed. Method: TD-EVAL combines turn-level evaluation (conversation cohesion, backend knowledge consistency, and policy compliance) with dialogue-level comparison (TOD Agent Arena). Result: Experiments show that TD-EVAL identifies errors that conventional methods miss and aligns better with human judgments. Conclusion: TD-EVAL offers a new paradigm for evaluating task-oriented dialogue systems, with a plug-and-play framework for future research. Abstract: Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and τ-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.

[203] Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom

Rishika Sen,Sujoy Roychowdhury,Sumit Soman,H. G. Ranjani,Srikhetra Mohanty

Main category: cs.CL

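TL;DR: The study asks whether the teacher, the student, or both need domain adaptation when applying knowledge distillation (KD) to telecom question answering, and experimentally measures how different strategies and vocabularies affect the distilled model.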

Details

Motivation: Examine how domain adaptation of the teacher, the student, or both affects performance when distilling knowledge for a domain-specific task. Method: Apply supervised fine-tuning (SFT) to the teacher, the student, or both, and combine this with different vocabularies and KD algorithms (vanilla KD and DSKD) in experiments that evaluate the distilled model. Result: When teacher and student share a vocabulary, SFT of the teacher improves the distilled model; fine-tuning both performs best on all metrics, though statistical significance depends on the teacher's vocabulary. Conclusion: Fine-tuning both teacher and student yields the best distillation performance, but the choice of vocabulary significantly affects the results. Abstract: Knowledge Distillation (KD) is one of the approaches to reduce the size of Large Language Models (LLMs). An LLM with a smaller number of model parameters (student) is trained to mimic the performance of an LLM of a larger size (teacher model) on a specific task. For domain-specific tasks, it is not clear if the teacher or student model, or both, must be considered for domain adaptation. In this work, we study this problem from the perspective of a telecom domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of teacher only, SFT of student only and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding and LLM-based metrics) is considered. Experimental results show that SFT of teacher improves performance of distilled model when both models have same vocabulary, irrespective of algorithm and metrics. Overall, SFT of both teacher and student results in better performance across all metrics, although the statistical significance of the same depends on the vocabulary of the teacher models.
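As a rough illustration of the vanilla KD objective named in the abstract, the sketch below computes the temperature-softened KL divergence between teacher and student logits for a single token position. The function names, the temperature value, and the T^2 scaling are assumptions taken from the common textbook formulation, not from the paper. The sketch presumes both models score the same vocabulary, which may be one reason the vocabulary setting matters in the study.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: assumes teacher and student share one vocabulary,
    so the two logit vectors line up index by index.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("vanilla KD needs logits over the same vocabulary")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy logits over a three-token vocabulary (made-up numbers).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
print(vanilla_kd_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it is usually mixed with a standard cross-entropy term on the ground-truth labels.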

[204] LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation

Beizhe Hu,Qiang Sheng,Juan Cao,Yang Li,Danding Wang

Main category: cs.CL

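TL;DR: The paper examines how LLM-generated fake news could affect the news ecosystem, reveals a "truth decay" phenomenon in which real news gradually loses its advantage in recommender systems, and discusses possible countermeasures.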

Details

Motivation: LLM-generated fake news poses a potential threat to the news ecosystem, particularly through its effect on news recommendation systems. Method: Develop a simulation pipeline and build a dataset of about 56k diverse generated news items to analyze the effects of LLM-generated fake news in neural news recommendation systems. Result: The study finds a truth decay phenomenon, in which real news gradually loses its ranking advantage to fake news, and explains it from a familiarity perspective. Conclusion: LLM-generated fake news is a threat, and stakeholders are urged to act to preserve the integrity of the news ecosystem. Abstract: Online fake news moderation now faces a new challenge brought by the malicious use of large language models (LLMs) in fake news production. Though existing works have shown LLM-generated fake news is hard to detect from an individual aspect, it remains underexplored how its large-scale release will impact the news ecosystem. In this study, we develop a simulation pipeline and a dataset with ~56k generated news of diverse types to investigate the effects of LLM-generated fake news within neural news recommendation systems. Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news as LLM-generated news is involved in news recommendation. We further provide an explanation about why truth decay occurs from a familiarity perspective and show the positive correlation between perplexity and news ranking. Finally, we discuss the threats of LLM-generated fake news and provide possible countermeasures. We urge stakeholders to address this emerging challenge to preserve the integrity of news ecosystems.

[205] Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

Pritika Rohera,Chaitrali Ginimav,Gayatri Sawant,Raviraj Joshi

Main category: cs.CL

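TL;DR: The study evaluates the factual accuracy of multilingual LLMs in English and Indic languages and finds that the models perform better in English, even for questions rooted in Indic contexts.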

Details

Motivation: Examine the factual accuracy of LLMs in low-resource languages such as Indic languages, an area that existing research has left underexplored. Method: Use the IndicQuest dataset to compare GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B in English and 19 Indic languages. Result: LLMs perform better in English and hallucinate more often in Indic languages. Conclusion: Current LLMs still face challenges in multilingual understanding, especially in low-resource languages. Abstract: Multilingual Large Language Models (LLMs) have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs (GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B) by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.

[206] AutoJudge: Judge Decoding Without Manual Annotation

Roman Garipov,Fedor Velikonivtsev,Ruslan Svirschevski,Vage Egiazarian,Max Ryabinin

Main category: cs.CL

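TL;DR: AutoJudge accelerates LLM inference with task-specific lossy speculative decoding, gaining speed by identifying "unimportant" tokens that do not affect downstream quality.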

Details

Motivation: Standard speculative decoding must match the target model's output distribution token by token, which limits inference speed. AutoJudge relaxes the matching requirement for unimportant tokens to improve efficiency. Method: A semi-greedy search algorithm determines which token mismatches must be corrected, and a lightweight classifier is trained to predict which tokens can be safely accepted. Result: On GSM8K reasoning, AutoJudge accepts up to 1.5x more tokens than standard speculative decoding with under 1% accuracy degradation; it also generalizes to programming tasks. Conclusion: By identifying important tokens, AutoJudge substantially speeds up LLM inference while keeping accuracy high, and it applies to multiple tasks. Abstract: We introduce AutoJudge, a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the generated response, relaxing the guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected to preserve quality, and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We test our approach with Llama 3.2 1B (draft) and Llama 3.1 8B (target) models on zero-shot GSM8K reasoning, where it achieves up to 1.5x more accepted tokens per verification cycle with under 1% degradation in answer accuracy compared to standard speculative decoding and over 2x with small loss in accuracy. When applied to the LiveCodeBench benchmark, our approach automatically detects other, programming-specific important tokens and shows similar speedups, demonstrating its ability to generalize across tasks.
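The acceptance rule described in the abstract can be pictured with a minimal sketch of one verification cycle under greedy decoding. Everything here is an illustrative assumption: `verify_draft` and `is_unimportant` are hypothetical names, the classifier is a stand-in callable rather than the paper's embedding-based model, and tokens are plain strings.

```python
from typing import Callable, List, Sequence


def verify_draft(
    draft_tokens: Sequence[str],
    target_tokens: Sequence[str],
    is_unimportant: Callable[[int, str, str], bool],
) -> List[str]:
    """Return the tokens kept after one verification cycle.

    target_tokens[i] is the token the target model would choose at
    position i given the draft prefix (one parallel forward pass).
    A mismatch is kept only if the classifier judges it unimportant;
    otherwise the target token replaces it and the cycle stops.
    """
    accepted: List[str] = []
    for position, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft == target or is_unimportant(position, draft, target):
            accepted.append(draft)  # exact match or tolerated mismatch
        else:
            accepted.append(target)  # correct the important mismatch
            break  # later draft tokens depended on the rejected one
    return accepted


draft = ["The", "answer", "is", "42"]
target = ["The", "result", "is", "42"]

# Standard (lossless) behaviour: no mismatch is ever tolerated.
strict = verify_draft(draft, target, lambda pos, d, t: False)
# Lossy behaviour with a toy classifier that tolerates one synonym pair.
lenient = verify_draft(
    draft, target, lambda pos, d, t: {d, t} == {"answer", "result"}
)
print(strict)   # ['The', 'result']
print(lenient)  # ['The', 'answer', 'is', '42']
```

With the toy classifier, the whole draft survives the cycle instead of being cut at the first mismatch, which is the source of the additional accepted tokens per cycle that the abstract reports.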

cs.RO [Back]

[207] Dexonomy: Synthesizing All Dexterous Grasp Types in a Grasp Taxonomy

Jiayi Chen,Yubin Ke,Lin Peng,He Wang

Main category: cs.RO

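TL;DR: Proposes an efficient method for generating datasets of diverse grasp types that significantly outperforms existing baselines, and uses it to build a large-scale dataset.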

Details

Motivation: General and diverse grasping skills are essential for intelligent robots, but existing datasets and methods fall short. Method: A two-stage optimization pipeline first optimizes the object to fit the hand template, then locally refines the hand to fit the object, and validates grasps with a contact-aware control strategy. Result: A dataset of 10.7k objects and 9.5M grasps covering 31 grasp types; the generative model achieves an 82.3% success rate in real-world experiments. Conclusion: The method markedly improves the generality and quality of grasp synthesis and provides an important resource for developing robotic grasping skills. Abstract: Generalizable dexterous grasping with suitable grasp types is a fundamental skill for intelligent robots. Developing such skills requires a large-scale and high-quality dataset that covers numerous grasp types (i.e., at least those categorized by the GRASP taxonomy), but collecting such data is extremely challenging. Existing automatic grasp synthesis methods are often limited to specific grasp types or object categories, hindering scalability. This work proposes an efficient pipeline capable of synthesizing contact-rich, penetration-free, and physically plausible grasps for any grasp type, object, and articulated hand. Starting from a single human-annotated template for each hand and grasp type, our pipeline tackles the complicated synthesis problem with two stages: optimize the object to fit the hand template first, and then locally refine the hand to fit the object in simulation. To validate the synthesized grasps, we introduce a contact-aware control strategy that allows the hand to apply the appropriate force at each contact point to the object. Those validated grasps can also be used as new grasp templates to facilitate future synthesis. Experiments show that our method significantly outperforms previous type-unaware grasp synthesis baselines in simulation. Using our algorithm, we construct a dataset containing 10.7k objects and 9.5M grasps, covering 31 grasp types in the GRASP taxonomy. Finally, we train a type-conditional generative model that successfully performs the desired grasp type from single-view object point clouds, achieving an 82.3% success rate in real-world experiments. Project page: https://pku-epic.github.io/Dexonomy.

[208] Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception

Taoyi Wang,Lijian Wang,Yihan Lin,Mingtao Ou,Yuguo Chen,Xinglong Ji,Rong Zhao

Main category: cs.RO

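TL;DR: The paper presents the first quantitative evaluation framework for comparing two classes of brain-inspired vision sensors (BVS) in high-speed, dynamic robotic perception, finding that event-based vision sensors (EVS) and Tianmouc each have strengths and weaknesses across scenarios.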

Details

Motivation: Traditional cameras degrade under high-speed, dynamic conditions because of motion blur; BVS offer an alternative with high temporal resolution and low power, but lack systematic evaluation. Method: Establish a unified testing protocol, including cross-sensor calibration, standardized testing platforms, and quality metrics; evaluate how sensor non-idealities affect the capture of structural information; and run functional benchmarks. Result: EVS performs well in high-speed, sparse scenes but is limited in high-speed, complex scenes; Tianmouc performs consistently across speeds and scenes. Conclusion: The study informs the choice of BVS technology for a given application and supports further development. Abstract: Perception systems in robotics encounter significant challenges in high-speed and dynamic conditions when relying on traditional cameras, where motion blur can compromise spatial feature integrity and task performance. Brain-inspired vision sensors (BVS) have recently gained attention as an alternative, offering high temporal resolution with reduced bandwidth and power requirements. Here, we present the first quantitative evaluation framework for two representative classes of BVSs in variable-speed robotic sensing, including event-based vision sensors (EVS) that detect asynchronous temporal contrasts, and the primitive-based sensor Tianmouc that employs a complementary mechanism to encode both spatiotemporal changes and intensity. A unified testing protocol is established, including cross-sensor calibrations, standardized testing platforms, and quality metrics to address differences in data modality. From an imaging standpoint, we evaluate the effects of sensor non-idealities, such as motion-induced distortion, on the capture of structural information. For functional benchmarking, we examine task performance in corner detection and motion estimation under different rotational speeds. Results indicate that EVS performs well in high-speed, sparse scenarios and in modestly fast, complex scenes, but exhibits performance limitations in high-speed, cluttered settings due to pixel-level bandwidth variations and event rate saturation. In comparison, Tianmouc demonstrates consistent performance across sparse and complex scenarios at various speeds, supported by its global, precise, high-speed spatiotemporal gradient samplings. These findings offer valuable insights into the application-dependent suitability of BVS technologies and support further advancement in this area.

[209] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Chia-Yu Hung,Qi Sun,Pengfei Hong,Amir Zadeh,Chuan Li,U-Xuan Tan,Navonil Majumder,Soujanya Poria

Main category: cs.RO

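TL;DR: NORA is a 3B-parameter vision-language-action model designed to reduce computational overhead while maintaining strong performance, making it suitable for real-time robotic environments.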

Details

Motivation: Existing VLA models perform well in zero-shot scenarios but suffer from visual encoding limitations and high computational overhead, making them ill-suited to real-time robotic environments. Method: NORA uses Qwen-2.5-VL-3B as its backbone, is trained on 970k real-world robot demonstrations, and uses the FAST+ tokenizer, improving visual reasoning and action generation. Result: Experiments show that NORA outperforms existing large-scale VLA models on task performance while significantly reducing computational overhead. Conclusion: NORA offers a more practical solution for real-time robotic autonomy. Abstract: Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.

physics.med-ph [Back]

[210] Innovative Integration of 4D Cardiovascular Reconstruction and Hologram: A New Visualization Tool for Coronary Artery Bypass Grafting Planning

Shuo Wang,Tong Ren,Nan Cheng,Li Zhang,Rong Wang

Main category: physics.med-ph

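TL;DR: The study develops a dynamic cardiovascular holographic visualization tool for preoperative planning of coronary artery bypass grafting (CABG) and validates it through clinical feedback.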

Details

Motivation: CABG planning requires complex spatial visualization, and conventional methods struggle to assess coronary artery depth, calcification, and pericardial adhesions. Method: Using 4D cardiac CT angiography data from 14 CABG candidates, develop a semi-automated workflow covering cardiac structure segmentation, coronary calcium scoring, and pericardial adhesion assessment, and display dynamic holograms on the Looking Glass platform. Result: Thirteen cardiac surgeons rated the tool highly for preoperative planning utility (mean 4.57/5.0), and hologram-based pericardial adhesion scores correlated strongly with intraoperative findings (r=0.786, P<0.001). Conclusion: The study establishes a dynamic holographic visualization framework built on patient data, and clinical feedback confirms its effectiveness for preoperative CABG planning. Abstract: Background: Coronary artery bypass grafting (CABG) planning requires advanced spatial visualization and consideration of coronary artery depth, calcification, and pericardial adhesions. Objective: To develop and evaluate a dynamic cardiovascular holographic visualization tool for preoperative CABG planning. Methods: Using 4D cardiac computed tomography angiography data from 14 CABG candidates, we developed a semi-automated workflow for time-resolved segmentation of cardiac structures, epicardial adipose tissue (EAT), and coronary arteries with calcium scoring. The workflow incorporated methods for cardiac segmentation, coronary calcification quantification, visualization of coronary depth within EAT, and pericardial adhesion assessment through motion analysis. Dynamic cardiovascular holograms were displayed using the Looking Glass platform. Thirteen cardiac surgeons evaluated the tool using a Likert scale. Additionally, pericardial adhesion scores from holograms of 21 patients (including seven undergoing secondary cardiac surgeries) were compared with intraoperative findings. Results: Surgeons rated the visualization tool highly for preoperative planning utility (mean Likert score: 4.57/5.0). Hologram-based pericardial adhesion scoring strongly correlated with intraoperative findings (r=0.786, P<0.001). Conclusion: This study establishes a visualization framework for CABG planning that produces clinically relevant dynamic holograms from patient-specific data, with clinical feedback confirming its effectiveness for preoperative planning.

cs.HC [Back]

[211] Clinical knowledge in LLMs does not translate to human interactions

Andrew M. Bean,Rebecca Payne,Guy Parsons,Hannah Rose Kirk,Juan Ciro,Rafael Mosquera,Sara Hincapié Monsalve,Aruna S. Ekanayaka,Lionel Tarassenko,Luc Rocher,Adam Mahdi

Main category: cs.HC

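TL;DR: Large language models (LLMs) perform well on medical scenarios when tested alone but much worse when interacting with real users; the authors recommend systematic human user testing.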

Details

Motivation: Examine how well LLMs actually work for providing medical advice, particularly when interacting with users. Method: Randomly assign 1,298 participants to use an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control) across ten medical scenarios. Result: Tested alone, the LLMs identify conditions in 94.9% of cases and disposition in 56.3%, but performance drops markedly with human users (under 34.5% for conditions, under 44.2% for disposition), no better than the control group. Conclusion: User interaction is the main challenge for LLMs in medical advice; systematic user testing is recommended before deployment. Abstract: Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.

[212] LINC: Supporting Language Independent Communication and Comprehension to Enhance Contribution in Multilingual Collaborative Meetings

Saramsh Gautam,Mahmood Jasim

Main category: cs.HC

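TL;DR: The study examines communication barriers faced by ESL researchers in multilingual teams, presents the LINC system to support multilingual collaboration, and evaluates it in a user study.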

Details

Motivation: ESL researchers in multilingual teams struggle to participate fully in discussions because of language barriers, which limits collaboration. Method: Survey 64 ESL researchers to derive four design goals, then build LINC (a real-time multilingual communication module and a post-meeting analysis dashboard) and evaluate it with six multilingual triads. Result: LINC helped participants communicate in their preferred language, review key meeting points, and prepare effectively for upcoming meetings. Conclusion: LINC shows promise for multilingual collaboration, though factors beyond language preference also need to be considered. Abstract: Collaborative research often includes contributors with varied perspectives from diverse linguistic backgrounds. However, English as a Second Language (ESL) researchers often struggle to communicate during meetings in English and comprehend discussions, leading to limited contribution. To investigate these challenges, we surveyed 64 ESL researchers who frequently collaborate in multilingual teams and identified four key design goals around participation, comprehension, documentation, and feedback. Guided by these design goals, we developed LINC, a multimodal Language INdependent Collaboration system with two components: a real-time module for multilingual communication during meetings and a post-meeting dashboard for discussion analysis. We evaluated the system through a two-phased study with six triads of multilingual teams. We found that using LINC, participants benefited from communicating in their preferred language, recalled and reviewed actionable insights, and prepared for upcoming meetings effectively. We discuss external factors that impact multilingual meeting participation beyond language preferences and the implications of multimodal systems in facilitating meetings in hybrid multilingual collaborative settings beyond research.

q-bio.NC [Back]

[213] Exploring Visual Complaints through a test battery in Acquired Brain Injury Patients: A Detailed Analysis of the DiaNAH Dataset

Gonçalo Hora de Carvalho

Main category: q-bio.NC

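TL;DR: Using the DiaNAH dataset, the study investigates visual complaints in 948 acquired brain injury patients, applies AutoML to handle missing data, and finds little correlation between subjective visual complaints and objective visual tests.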

Details

Motivation: Examine the complex relationship between subjective visual complaints and objective visual function tests in brain injury patients. Method: Collect visual symptom data with the CVS questionnaire, handle missing data with AutoML, and analyze 40,320 potential symptom combinations. Result: Linear correlation analysis shows minimal to no relationship between subjective complaints and objective test results. Conclusion: Sample size and variability are limited; larger studies are recommended to explore complaint clusters and their implications for visual perception. Abstract: This study investigated visual impairment complaints in a sample of 948 Acquired Brain Injury (ABI) patients using the DiaNAH dataset, emphasizing advanced machine learning techniques for managing missing data. Patients completed a CVS questionnaire capturing eight types of visual symptoms, including blurred vision and altered contrast perception. Due to incomplete data, 181 patients were excluded, resulting in an analytical subset of 767 individuals. To address the challenge of missing data, an automated machine learning (AutoML) approach was employed for data imputation, preserving the distributional characteristics of the original dataset. Patients were grouped according to singular and combined complaint clusters derived from the 40,320 potential combinations identified through the CVS questionnaire. A linear correlation analysis revealed minimal to no direct relationship between patient-reported visual complaints and standard visual perceptual function tests. This study represents an initial systematic attempt to understand the complex relationship between subjective visual complaints and objective visual perceptual assessments in ABI patients. Given the limitations of sample size and variability, further studies with larger populations are recommended to robustly explore these complaint clusters and their implications for visual perception following brain injury.
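The abstract does not say how the 40,320 figure is derived. One observation, offered only as a possible reading and not as the author's stated method, is that 40,320 equals 8!, the number of orderings of the eight symptom types; the number of unordered non-empty symptom subsets would instead be 255. The variable names below are illustrative.

```python
import math

symptom_types = 8  # eight types of visual symptoms, per the abstract

# Orderings of all eight symptom types (8 factorial).
ordered_arrangements = math.factorial(symptom_types)

# Unordered, non-empty sets of co-occurring symptoms, for comparison.
non_empty_subsets = 2 ** symptom_types - 1

print(ordered_arrangements)  # 40320
print(non_empty_subsets)     # 255
```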

cs.AI [Back]

[214] Towards AI-Driven Policing: Interdisciplinary Knowledge Discovery from Police Body-Worn Camera Footage

Anita Srbinovska,Angela Srbinovska,Vivek Senthil,Adrian Martin,John McCluskey,Ernest Fokoué

Main category: cs.AI

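TL;DR: The paper proposes a novel interdisciplinary framework that uses AI and ML to analyze police body-worn camera (BWC) footage, aiming to identify behavioral patterns in police-civilian interactions.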

Details

Motivation: Analyze BWC footage to reveal behavioral dynamics in police-civilian interactions (such as respect, disrespect, escalation, and de-escalation), offering a practical approach for law enforcement and advancing knowledge discovery. Method: Multimodal data analysis that combines video, audio, and natural language processing (NLP) to extract key information from BWC footage. Result: The paper presents computational techniques and a methodology for classifying and analyzing police-civilian interaction behavior. Conclusion: The framework provides a practical tool for law enforcement and advances research on knowledge discovery from BWC data. Abstract: This paper proposes a novel interdisciplinary framework for analyzing police body-worn camera (BWC) footage from the Rochester Police Department (RPD) using advanced artificial intelligence (AI) and statistical machine learning (ML) techniques. Our goal is to detect, classify, and analyze patterns of interaction between police officers and civilians to identify key behavioral dynamics, such as respect, disrespect, escalation, and de-escalation. We apply multimodal data analysis by integrating video, audio, and natural language processing (NLP) techniques to extract meaningful insights from BWC footage. We present our methodology, computational techniques, and findings, outlining a practical approach for law enforcement while advancing the frontiers of knowledge discovery from police BWC data.

eess.AS [Back]

[215] Versatile Framework for Song Generation with Prompt-based Control

Yu Zhang,Wenxiang Guo,Changhao Pan,Zhiyuan Zhu,Ruiqi Li,Jingyu Lu,Rongjie Huang,Ruiyuan Zhang,Zhiqing Hong,Ziyue Jiang,Zhou Zhao

Main category: eess.AS

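TL;DR: VersBand is a multi-task song generation framework that uses models such as VocalBand and AccompBand to generate high-quality, controllable songs, addressing the shortcomings of existing methods in prompt-based control and alignment.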

Details

Motivation: Existing methods struggle to align vocals with accompaniments and to support diverse tasks when generating prompt-controlled songs; VersBand aims to address these problems. Method: VersBand comprises VocalBand (a flow-matching vocal generation model), AccompBand (a flow-based transformer accompaniment model), and LyricBand and MelodyBand, supporting control through multiple prompts. Result: Experiments show that VersBand outperforms baseline models across multiple song generation tasks on both objective and subjective metrics. Conclusion: VersBand achieves high-quality, controllable song generation through a multi-task framework, offering an effective solution for prompt-controlled music creation. Abstract: Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results demonstrate that VersBand performs better over baseline models across multiple song generation tasks using objective and subjective metrics. Audio samples are available at https://VersBand.github.io.

astro-ph.HE [Back]

[216] Validation and Calibration of Semi-Analytical Models for the Event Horizon Telescope Observations of Sagittarius A*

Ali SaraerToosi,Avery Broderick

Main category: astro-ph.HE

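TL;DR: Uses the generative machine learning model ALINET to efficiently produce images of black hole accretion flows, and calibrates physical parameter estimates and their uncertainties against mock EHT data.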

Details

Motivation: Explore the physics of black hole accretion flows at event-horizon scales and address the heavy computational cost of generating synthetic images with conventional methods. Method: Use ALINET to generate RIAF images and use mock EHT data to assess the uncertainty from unmodeled physical effects such as interstellar scattering and source variability. Result: ALINET generates black hole images efficiently and allows physical parameter estimates and their uncertainties to be calibrated. Conclusion: ALINET provides an efficient and reliable tool for studying black hole accretion flows and supports a more precise understanding of black hole physics. Abstract: The Event Horizon Telescope (EHT) enables the exploration of black hole accretion flows at event-horizon scales. Fitting ray-traced physical models to EHT observations requires the generation of synthetic images, a task that is computationally demanding. This study leverages ALINET, a generative machine learning model, to efficiently produce radiatively inefficient accretion flow (RIAF) images as a function of the specified physical parameters. ALINET has previously been shown to be able to interpolate black hole images and their associated physical parameters after training on a computationally tractable set of library images. We utilize this model to estimate the uncertainty introduced by a number of anticipated unmodeled physical effects, including interstellar scattering and intrinsic source variability. We then use this to calibrate physical parameter estimates and their associated uncertainties from RIAF model fits to mock EHT data via a library of general relativistic magnetohydrodynamics models.

cs.MM [Back]

Taoyu Su,Jiawei Sheng,Duohe Ma,Xiaodong Li,Juwei Yue,Mengxiao Song,Yingkai Tang,Tingwen Liu

Main category: cs.MM

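TL;DR: CDMEA is a counterfactual debiasing framework that addresses visual modality bias in multi-modal entity alignment from a causal perspective, significantly improving performance in low-similarity, high-noise, and low-resource data scenarios.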

Details

Motivation: Existing methods rely too heavily on the visual modality, which biases the model toward an image-matching task and overlooks the negative impact the visual modality can have. Method: The CDMEA framework estimates the Total Effect (TE) of the visual and graph modalities and excludes the Natural Direct Effect (NDE) of the visual modality, so that the model predicts based on the Total Indirect Effect (TIE). Result: On 9 benchmark datasets, CDMEA outperforms 14 state-of-the-art methods, especially in low-similarity, high-noise, and low-resource data scenarios. Conclusion: Through causal debiasing, CDMEA makes effective use of multi-modal information and reduces visual modality bias, offering a new approach to multi-modal entity alignment. Abstract: Multi-Modal Entity Alignment (MMEA) aims to retrieve equivalent entities from different Multi-Modal Knowledge Graphs (MMKGs), a critical information retrieval task. Existing studies have explored various fusion paradigms and consistency constraints to improve the alignment of equivalent entities, while overlooking that the visual modality may not always contribute positively. Empirically, entities with low-similarity images usually generate unsatisfactory performance, highlighting the limitation of overly relying on visual features. We believe the model can be biased toward the visual modality, leading to a shortcut image-matching task. To address this, we propose a counterfactual debiasing framework for MMEA, termed CDMEA, which investigates visual modality bias from a causal perspective. Our approach aims to leverage both visual and graph modalities to enhance MMEA while suppressing the direct causal effect of the visual modality on model predictions. By estimating the Total Effect (TE) of both modalities and excluding the Natural Direct Effect (NDE) of the visual modality, we ensure that the model predicts based on the Total Indirect Effect (TIE), effectively utilizing both modalities and reducing visual modality bias. Extensive experiments on 9 benchmark datasets show that CDMEA outperforms 14 state-of-the-art methods, especially in low-similarity, high-noise, and low-resource data scenarios.
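
The relationship TIE = TE - NDE described in the abstract can be illustrated with a minimal sketch. It assumes each candidate entity already has a fused (visual plus graph) score standing in for the TE and a visual-only score standing in for the NDE; the function names and the numbers are hypothetical and are not taken from the CDMEA implementation.

```python
def total_indirect_effect(fused_scores, visual_only_scores):
    """Subtract the visual-only score (NDE stand-in) from the fused score (TE stand-in)."""
    if len(fused_scores) != len(visual_only_scores):
        raise ValueError("score lists must have the same length")
    return [te - nde for te, nde in zip(fused_scores, visual_only_scores)]


def rank_candidates(scores):
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores for three candidate entities.
fused = [0.9, 0.8, 0.3]          # assumed stand-in for the Total Effect
visual_only = [0.7, 0.2, 0.25]   # assumed stand-in for the visual Natural Direct Effect

debiased = total_indirect_effect(fused, visual_only)
biased_ranking = rank_candidates(fused)
debiased_ranking = rank_candidates(debiased)
```

In this toy case candidate 0 leads the fused ranking mostly because of its image similarity; after the visual-only score is subtracted, candidate 1 ranks first.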

[218] WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution

Pietro Bongini,Sara Mandelli,Andrea Montibeller,Mirko Casu,Orazio Pontorno,Claudio Ragaglia,Luca Zanchetta,Mattia Aquilina,Taiba Majid Wani,Luca Guarnera,Benedetta Tondi,Paolo Bestagini,Irene Amerini,Francesco Denatale,Sebastiano Battiato,Mauro Barni

Main category: cs.MM

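TL;DR: The WILD dataset provides a training and benchmarking tool for synthetic image source attribution, comprising a closed set and an open set and supporting evaluation on multiple tasks.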

Details

Motivation: Synthetic image source attribution is challenging because of the large number of generators, the complexity of the techniques, and the scarcity of high-quality datasets. Method: The WILD dataset is built from 10 closed-set and 10 open-set generators, with 10,000 images per set; half of the images are post-processed. Result: The dataset supports closed- and open-set identification and verification, as well as robustness tests against post-processing and adversarial attacks; seven baseline methods are evaluated. Conclusion: WILD provides a practical tool for synthetic image source attribution, and the baseline tests illustrate both its difficulty and its potential. Abstract: Synthetic image source attribution is an open challenge, with an increasing number of image generators being released yearly. The complexity and the sheer number of available generative techniques, as well as the scarcity of high-quality open source datasets of diverse nature for this task, make training and benchmarking synthetic image source attribution models very challenging. WILD is a new in-the-Wild Image Linkage Dataset designed to provide a powerful training and benchmarking tool for synthetic image attribution models. The dataset is built out of a closed set of 10 popular commercial generators, which constitutes the training base of attribution models, and an open set of 10 additional generators, simulating a real-world in-the-wild scenario. Each generator is represented by 1,000 images, for a total of 10,000 images in the closed set and 10,000 images in the open set. Half of the images are post-processed with a wide range of operators. WILD allows benchmarking attribution models in a wide range of tasks, including closed and open set identification and verification, and robust attribution with respect to post-processing and adversarial attacks. Models trained on WILD are expected to benefit from the challenging scenario represented by the dataset itself. Moreover, an assessment of seven baseline methodologies on closed and open set attribution is presented, including robustness tests with respect to post-processing.

cs.IR [Back]

[219] Generative Product Recommendations for Implicit Superlative Queries

Kaustubh D. Dhole,Nikhita Vedula,Saar Kuzi,Giuseppe Castellucci,Eugene Agichtein,Shervin Malmasi

Main category: cs.IR

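TL;DR: The paper studies how large language models (LLMs) can generate implicit attributes for implicit superlative queries and reason over them to improve product recommendations. It proposes the SUPERB annotation schema and evaluates existing retrieval and ranking methods.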

Details

Motivation: Users often look for products with vague or indirect queries (such as "best shoes for trail running"); standard retrieval systems handle such implicit superlative queries poorly because they require identifying and reasoning over complex factors. Method: The paper proposes SUPERB, a four-point annotation schema, pairs it with LLM-generated product annotations, and evaluates several existing retrieval and ranking methods on the new dataset. Result: The empirical evaluation of existing methods yields insights and a discussion of how they could be integrated into real-world e-commerce systems. Conclusion: LLMs can handle implicit superlative queries effectively, and the SUPERB schema offers a new approach to product recommendation that can be further refined and integrated. Abstract: In Recommender Systems, users often seek the best products through indirect, vague, or under-specified queries, such as "best shoes for trail running". Such queries, also referred to as implicit superlative queries, pose a significant challenge for standard retrieval and ranking systems as they lack an explicit mention of attributes and require identifying and reasoning over complex factors. We investigate how Large Language Models (LLMs) can generate implicit attributes for ranking as well as reason over them to improve product recommendations for such queries. As a first step, we propose a novel four-point schema for annotating the best product candidates for superlative queries called SUPERB, paired with LLM-based product annotations. We then empirically evaluate several existing retrieval and ranking approaches on our new dataset, providing insights and discussing their integration into real-world e-commerce production systems.

[220] Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation

Carlo Merola,Jaspinder Singh

Main category: cs.IR

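TL;DR: The paper analyzes the strengths and weaknesses of two advanced techniques for retrieval-augmented generation (RAG), late chunking and contextual retrieval. Contextual retrieval preserves semantic coherence better but is computationally more expensive, while late chunking is more efficient but sacrifices relevance and completeness.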

Details

Motivation: To address how large volumes of external knowledge can be managed effectively in RAG while avoiding the context fragmentation caused by traditional methods. Method: A rigorous analysis of late chunking and contextual retrieval that evaluates their effectiveness and efficiency in optimizing RAG systems. Result: Contextual retrieval preserves semantic coherence more effectively but requires more computational resources; late chunking is more efficient but yields lower relevance and completeness. Conclusion: Each technique has its own trade-offs, and the choice should depend on the specific requirements. Abstract: Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of external knowledge be managed effectively within the input constraints of LLMs? Traditional methods address this by chunking external documents into smaller, fixed-size segments. While this approach alleviates input limitations, it often fragments context, resulting in incomplete retrieval and diminished coherence in generation. To overcome these shortcomings, two advanced techniques, late chunking and contextual retrieval, have been introduced, both aiming to preserve global context. Despite their potential, their comparative strengths and limitations remain unclear. This study presents a rigorous analysis of late chunking and contextual retrieval, evaluating their effectiveness and efficiency in optimizing RAG systems. Our results indicate that contextual retrieval preserves semantic coherence more effectively but requires greater computational resources. In contrast, late chunking offers higher efficiency but tends to sacrifice relevance and completeness.
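
The traditional baseline mentioned in the abstract, splitting a document into fixed-size segments, can be sketched as follows. This is only an illustration of fixed-size chunking with an optional overlap; the function name `chunk_fixed` and its parameters are hypothetical, and it does not implement late chunking or contextual retrieval.

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split a token list into chunks of at most `size` tokens.

    Consecutive chunks share `overlap` tokens. The final chunk may be shorter.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current chunk already reaches the end of the document
        start += step
    return chunks


document = "the model answers questions by retrieving short passages".split()
chunks = chunk_fixed(document, size=3, overlap=1)
```

Because the boundaries fall at fixed token counts rather than at sentence or topic boundaries, a phrase can be cut between two chunks, which is the context fragmentation the abstract refers to.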

cond-mat.mtrl-sci [Back]

[221] Leveraging Modified Ex Situ Tomography Data for Segmentation of In Situ Synchrotron X-Ray Computed Tomography

Tristan Manchester,Adam Anders,Julio Spadotto,Hannah Eccleston,William Beavan,Hugues Arcis,Brian J. Connolly

Main category: cond-mat.mtrl-sci

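TL;DR: Proposes a deep learning-based image segmentation method that trains models on high-quality laboratory data for dynamic studies with synchrotron data, markedly improving segmentation efficiency and accuracy.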

Details

Motivation: In dynamic materials studies with synchrotron X-ray tomography, automated segmentation is challenging because of complex imaging artefacts and limited training data. Method: High-quality laboratory data are transformed and used to train a modified SegFormer architecture for binary segmentation of synchrotron data. Result: The method achieves high segmentation performance on unseen data, reduces processing time from hours to seconds, and remains stable across morphological changes during experiments. Conclusion: The method can be applied broadly to diverse materials systems, accelerating the analysis of time-resolved tomographic data across disciplines. Abstract: In situ synchrotron X-ray computed tomography enables dynamic material studies, but automated segmentation remains challenging due to complex imaging artefacts and limited training data. We present a methodology for deep learning-based segmentation by transforming high-quality ex situ laboratory data to train models for binary segmentation of in situ synchrotron data, demonstrated through copper oxide dissolution studies. Using a modified SegFormer architecture, our approach achieves high segmentation performance on unseen data while reducing processing time from hours to seconds per 3D dataset. The method maintains consistent performance over significant morphological changes during experiments, despite training only on static specimens. This methodology can be readily applied to diverse materials systems, accelerating the analysis of time-resolved tomographic data across scientific disciplines.

eess.IV [Back]

[222] Dual-Modality Computational Ophthalmic Imaging with Deep Learning and Coaxial Optical Design

Boyuan Peng,Jiaju Chen,Yiwei Zhang,Cuiyi Peng,Junyang Li,Jiaming Deng,Peiwu Qin

Main category: eess.IV

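TL;DR: Proposes a compact dual-function optical device that combines fundus photography with refractive error detection and uses a Dense-U-Net algorithm for high-precision pupil localization and refractive estimation.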

Details

Motivation: The growing burden of myopia and retinal diseases calls for more efficient eye screening solutions. Method: A coaxial optical design with dichroic mirrors separates the wavelength-dependent imaging paths, and a Dense-U-Net algorithm performs pupil segmentation for automated alignment and focusing. Result: Experiments show high-precision pupil localization (EDE = 2.8 px, mIoU = 0.931) and refractive estimation error below 5%. Conclusion: The device offers a rapid, intelligent, and scalable ophthalmic screening solution for community health settings. Abstract: The growing burden of myopia and retinal diseases necessitates more accessible and efficient eye screening solutions. This study presents a compact, dual-function optical device that integrates fundus photography and refractive error detection into a unified platform. The system features a coaxial optical design using dichroic mirrors to separate wavelength-dependent imaging paths, enabling simultaneous alignment of fundus and refraction modules. A Dense-U-Net-based algorithm with customized loss functions is employed for accurate pupil segmentation, facilitating automated alignment and focusing. Experimental evaluations demonstrate the system's capability to achieve high-precision pupil localization (EDE = 2.8 px, mIoU = 0.931) and reliable refractive estimation with a mean absolute error below 5%. Despite limitations due to commercial lens components, the proposed framework offers a promising solution for rapid, intelligent, and scalable ophthalmic screening, particularly suitable for community health settings.

[223] Reservoir-enhanced Segment Anything Model for Subsurface Diagnosis

Xiren Zhou,Shikang Liu,Xinyu Yan,Yizhan Fan,Xiangyu Wang,Yu Kang,Jian Cheng,Huanhuan Chen

Main category: eess.IV

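TL;DR: Proposes the Res-SAM framework, which combines visual and electromagnetic-wave properties to detect subsurface anomalies efficiently, with accuracy above 85% and low resource consumption.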

Details

Motivation: Urban subsurface anomalies such as cracks and cavities threaten infrastructure safety, and existing GPR techniques struggle to detect them accurately because of scarce labeled data and complex environments. Method: Res-SAM first locates candidate anomaly regions from visual prompts, then analyzes changes in the electromagnetic waves to extract the regions precisely and determine their categories. Result: Experiments show that Res-SAM achieves accuracy above 85%, outperforms existing methods, and requires only a small amount of non-target data and simple human interaction. Conclusion: Res-SAM provides an efficient, scalable solution for urban subsurface anomaly detection, improving safety while reducing cost. Abstract: Urban roads and infrastructure, vital to city operations, face growing threats from subsurface anomalies like cracks and cavities. Ground Penetrating Radar (GPR) effectively visualizes underground conditions employing electromagnetic (EM) waves; however, accurate anomaly detection via GPR remains challenging due to limited labeled data, varying subsurface conditions, and indistinct target boundaries. Although visually image-like, GPR data fundamentally represent EM waves, with variations within and between waves critical for identifying anomalies. Addressing these, we propose the Reservoir-enhanced Segment Anything Model (Res-SAM), an innovative framework exploiting both visual discernibility and wave-changing properties of GPR data. Res-SAM initially identifies apparent candidate anomaly regions given minimal prompts, and further refines them by analyzing anomaly-induced changing information within and between EM waves in local GPR data, enabling precise and complete anomaly region extraction and category determination. Real-world experiments demonstrate that Res-SAM achieves high detection accuracy (>85%) and outperforms state-of-the-art methods. Notably, Res-SAM requires only minimal accessible non-target data, avoids intensive training, and incorporates simple human interaction to enhance reliability. Our research provides a scalable, resource-efficient solution for rapid subsurface anomaly detection across diverse environments, improving urban safety monitoring while reducing manual effort and computational cost.

[224] Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities

Marco Mezzina,Pieter De Backer,Tom Vercauteren,Matthew Blaschko,Alexandre Mottrie,Tinne Tuytelaars

Main category: eess.IV

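TL;DR: The study examines how temporal context affects experts' classification ability in automated Surgical Phase Recognition (SPR) and compares human and AI performance on Robot-Assisted Partial Nephrectomy (RAPN).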

Details

Motivation: To fill the gap left by earlier work, which did not cover non-linear, long surgical procedures, and to explore how temporal context affects experts' classification ability. Method: Using a custom-made web platform, urologists of varying experience classified single frames and video snippets of RAPN and reported their confidence levels and the visual landmarks they used. AI models with and without temporal context were also trained and compared. Result: Video snippets and specific visual landmarks improved classification accuracy, and experts outperformed novices. The AI models performed comparably to the experts, and adding temporal context improved performance further. Conclusion: SPR is challenging for both experts and computer vision, and temporal information improves performance. Surgical tools and organs are the key visual landmarks and are expected to shape the future of automated SPR. Abstract: Purpose: Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events, functioning as a building block for efficient video review, surgical education as well as skill assessment. Previous research has focused on short and linear surgical procedures and has not explored if temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure. Methods: Urologists of varying expertise were grouped and tasked to indicate the surgical phase for RAPN on both single frames and video snippets using a custom-made web platform. Participants reported their confidence levels and the visual landmarks used in their decision-making. AI architectures with and without temporal context, previously trained and benchmarked on the Cholec80 dataset, were subsequently trained on this RAPN dataset. Results: Video snippets and presence of specific visual landmarks improved phase classification accuracy across all groups. Surgeons displayed high confidence in their classifications and outperformed novices, who struggled discriminating phases. The performance of the AI models is comparable to the surgeons in the survey, with improvements when temporal context was incorporated in both cases. Conclusion: SPR is an inherently complex task for expert surgeons and computer vision, where both perform equally well when given the same context. Performance increases when temporal information is provided. Surgical tools and organs form the key landmarks for human interpretation and are expected to shape the future of automated SPR.

[225] Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction

Ehsan Karami,Hamid Soltanian-Zadeh

Main category: eess.IV

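TL;DR: Replacing batch normalization with instance normalization, adding data augmentation, and applying a contrastive loss improves the generalization of a knee osteoarthritis prediction model.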

Details

Motivation: To address the limited generalizability of MRI-based deep learning models across imaging data from different sources. Method: A baseline model is improved with instance normalization, data augmentation, and a contrastive loss, then trained and evaluated on FS-IW-TSE and DESS images from the OAI database. Result: Classification accuracy improves significantly in both the source and target domains and exceeds that of the baseline model. Conclusion: The approach effectively improves the model's generalization across different imaging data. Abstract: Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we have shown that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves model generalization in a baseline deep learning model for KOA prediction. We trained and evaluated our model using MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrate a statistically significant improvement in classification accuracy across both domains, with our approach outperforming the baseline model.
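
One intuition for why instance normalization may help across imaging sources is that it normalizes each sample with its own statistics, so a global intensity scaling between scanners or sequences largely cancels out. The sketch below illustrates this on a flattened single-channel feature map; it is an assumption-laden toy (no learnable affine parameters, hypothetical function name), not the model described in the abstract.

```python
import math


def instance_norm(values, eps=1e-5):
    """Normalize one sample's values with that sample's own mean and variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


# Two hypothetical acquisitions of the same anatomy with different intensity scales.
scan_a = [1.0, 2.0, 3.0, 4.0]
scan_b = [10.0, 20.0, 30.0, 40.0]

norm_a = instance_norm(scan_a)
norm_b = instance_norm(scan_b)
```

Batch normalization would instead use statistics pooled over the batch (and running averages at inference time), which can be mismatched when the test data come from a different domain.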

[226] Low-Rank Adaptive Structural Priors for Generalizable Diabetic Retinopathy Grading

Yunxuan Wang,Ray Yin,Yumei Tan,Hao Chen,Haiying Xia

Main category: eess.IV

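TL;DR: Proposes LoASP, a method that augments existing domain generalization (DG) methods with structural priors to improve the accuracy of diabetic retinopathy (DR) grading.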

Details

Motivation: Existing DG methods perform poorly in DR grading because they overlook lesion-specific features, so better generalization is needed. Method: Low-rank Adaptive Structural Priors (LoASP) is introduced as a plug-and-play framework that learns adaptive structural representations suited to the complexity of DR diagnosis. Result: Experiments on eight diverse datasets validate LoASP in both single-source and multi-source domain scenarios, and visualizations show that the learned structural priors align with vessel and lesion structures. Conclusion: By incorporating structural priors, LoASP markedly improves the generalization and interpretability of DR grading. Abstract: Diabetic retinopathy (DR), a serious ocular complication of diabetes, is one of the primary causes of vision loss among retinal vascular diseases. Deep learning methods have been extensively applied in the grading of DR. However, their performance declines significantly when applied to data outside the training distribution due to domain shifts. Domain generalization (DG) has emerged as a solution to this challenge. However, most existing DG methods overlook lesion-specific features, resulting in insufficient accuracy. In this paper, we propose a novel approach that enhances existing DG methods by incorporating structural priors, inspired by the observation that DR grading is heavily dependent on vessel and lesion structures. We introduce Low-rank Adaptive Structural Priors (LoASP), a plug-and-play framework designed for seamless integration with existing DG models. LoASP improves generalization by learning adaptive structural representations that are finely tuned to the complexities of DR diagnosis. Extensive experiments on eight diverse datasets validate its effectiveness in both single-source and multi-source domain scenarios. Furthermore, visualizations reveal that the learned structural priors intuitively align with the intricate architecture of the vessels and lesions, providing compelling insights into their interpretability and diagnostic relevance.

[227] Dual Attention Driven Lumbar Magnetic Resonance Image Feature Enhancement and Automatic Diagnosis of Herniation

Lingrui Zhang,Liang Guo,Xiao An,Feng Lin,Binlong Zheng,Jiankun Wang,Zhirui Li

Main category: eess.IV

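TL;DR: The paper proposes an automated lumbar disc herniation (LDH) classification framework that uses T1- and T2-weighted MRI images together with data augmentation and attention mechanisms to detect LDH with high accuracy.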

Details

Motivation: LDH diagnosis depends on radiologists' expertise, which leads to delayed diagnosis and high training costs, so an automated solution is needed. Method: T1- and T2-weighted MRI images from 205 people are used with data augmentation and channel and spatial attention mechanisms to extract clinical features and generate standardized diagnostic outputs. Result: The framework achieves an AUC-ROC of 0.969 and an accuracy of 0.9486, reaching high accuracy with only a small amount of training data. Conclusion: The framework is expected to strengthen LDH detection in primary hospitals and to support clinical decisions efficiently. Abstract: Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To address these key issues, the framework utilizes T1-weighted and T2-weighted MRI images from 205 people. The framework extracts clinically actionable LDH features and generates standardized diagnostic outputs by leveraging data augmentation and channel and spatial attention mechanisms. These outputs can help physicians make confident and time-effective care decisions when needed. The proposed framework achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.969 and an accuracy of 0.9486 for LDH detection. The experimental results demonstrate the performance of the proposed framework. Our framework only requires a small number of datasets for training to demonstrate high diagnostic accuracy. This is expected to be a solution to enhance the LDH detection capabilities of primary hospitals.

[228] Accelerated 3D-3D rigid registration of echocardiographic images obtained from apical window using particle filter

Thanuja Uruththirakodeeswaran,Harald Becher,Michelle Noga,Lawrence H. Le,Pierre Boulanger,Jonathan Windram,Kumaradevan Punithakumar

Main category: eess.IV

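TL;DR: Proposes an accelerated sequential Monte Carlo (SMC) algorithm for 3D-3D rigid registration that improves ultrasound image quality and registration speed.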

Details

Motivation: To handle the noise and intensity variation that affect 3D ultrasound image registration and to improve registration accuracy and efficiency. Method: An accelerated SMC algorithm estimates the transform iteratively and supports both image-based and mask-based registration. Result: The mask-based approach reaches a Dice score of 0.819 ± 0.045 with a 16.7x speedup. Conclusion: The accelerated SMC algorithm performs well for ultrasound image registration and is of practical value. Abstract: The perfect alignment of 3D echocardiographic images captured from various angles has improved image quality and broadened the field of view. This study proposes an accelerated sequential Monte Carlo (SMC) algorithm for 3D-3D rigid registration of transthoracic echocardiographic images with significant and limited overlap taken from the apical window that is robust to the noise and intensity variation in ultrasound images. The algorithm estimates the translational and rotational components of the rigid transform through an iterative process and requires an initial approximation of the rotation and translation limits. We perform registration in two ways: the image-based registration computes the transform to align the end-diastolic frame of the apical nonstandard image to the apical standard image and applies the same transform to all frames of the cardiac cycle, whereas the mask-based registration approach uses the binary masks of the left ventricle in the same way. The SMC and exhaustive search (EX) algorithms were evaluated for 4D temporal sequences recorded from 7 volunteers who participated in a study conducted at the Mazankowski Alberta Heart Institute. The evaluations demonstrate that the mask-based approach of the accelerated SMC yielded a Dice score value of 0.819 +/- 0.045 for the left ventricle and gained 16.7x speedup compared to the CPU version of the SMC algorithm.
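
The Dice score used to report mask overlap above is a standard metric, Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch for flattened binary masks follows; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * intersection / total


# Hypothetical left-ventricle masks after registration (1 = inside the ventricle).
reference = [1, 1, 0, 0]
registered = [1, 0, 1, 0]
overlap = dice_score(reference, registered)  # one shared voxel out of 2 + 2
```

A value of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground voxels.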

[229] SST-DUNet: Automated preclinical functional MRI skull stripping using Smart Swin Transformer and Dense UNet

Sima Soltanpour,Rachel Utama,Arnold Chang,Md Taufiq Nasseef,Dan Madularu,Praveen Kulkarni,Craig Ferris,Chris Joslin

Main category: eess.IV

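TL;DR: SST-DUNet is a new method that combines a dense UNet architecture with a Smart Swin Transformer feature extractor to automate fMRI skull stripping, addressing the challenges of low resolution and variable slice sizes.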

Details

Motivation: Manual skull stripping in fMRI is time-consuming and operator dependent, and existing methods work poorly on preclinical data, so a more efficient automated method is needed. Method: SST-DUNet combines a dense UNet with a Smart Swin Transformer, uses the SSW-MSA module to learn channel-wise features and dependencies within brain structures, and adopts a combined Focal and Dice loss to address class imbalance. Result: On three in-house datasets the Dice similarity scores are 98.65%, 97.86%, and 98.04%, and the automatic results agree closely with manual skull stripping. Conclusion: SST-DUNet can effectively replace manual skull stripping in rat fMRI analysis. Abstract: Skull stripping is a common preprocessing step that is often performed manually in Magnetic Resonance Imaging (MRI) pipelines, including functional MRI (fMRI). This manual process is time-consuming and operator dependent. Automating this process is challenging for preclinical data due to variations in brain geometry, resolution, and tissue contrast. While methods for MRI skull stripping exist, they often struggle with the low resolution and varying slice sizes in preclinical fMRI data. This study proposes a novel method called SST-DUNet that integrates a dense UNet-based architecture with a feature extractor based on Smart Swin Transformer (SST) for fMRI skull stripping. The Smart Shifted Window Multi-Head Self-Attention (SSW-MSA) module in SST is adapted to replace the mask-based module in the Swin Transformer (ST), enabling the learning of distinct channel-wise features while focusing on relevant dependencies within brain structures. This modification allows the model to better handle the complexities of fMRI skull stripping, such as low resolution and variable slice sizes. To address the issue of class imbalance in preclinical data, a combined loss function using Focal and Dice loss is utilized. The model was trained on rat fMRI images and evaluated across three in-house datasets with a Dice similarity score of 98.65%, 97.86%, and 98.04%. The fMRI results obtained through automatic skull stripping using the SST-DUNet model closely align with those from manual skull stripping for both seed-based and independent component analyses. These results indicate that the SST-DUNet can effectively substitute manual brain extraction in rat fMRI analysis.
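
A combined Focal and Dice loss of the kind mentioned in the abstract can be sketched for binary segmentation as follows. The abstract does not give the weighting, the focusing parameter, or the smoothing constant, so `focal_weight`, `gamma`, and `smooth` below are assumed values chosen only for illustration.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: well-classified voxels are down-weighted by (1 - p_t)^gamma."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """One minus the soft Dice coefficient computed on predicted probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, focal_weight=0.5):
    """Weighted sum of the focal and soft Dice losses (weighting is an assumption)."""
    return (focal_weight * focal_loss(probs, targets)
            + (1.0 - focal_weight) * soft_dice_loss(probs, targets))


targets = [1, 0]  # hypothetical labels: one brain voxel, one background voxel
good = combined_loss([0.9, 0.1], targets)
bad = combined_loss([0.2, 0.8], targets)
```

The focal term concentrates the gradient on hard voxels, while the Dice term measures region overlap and is less sensitive to the large background class, which is why the two are often paired for imbalanced segmentation.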

cs.LG [Back]

[230] Graph Fourier Transformer with Structure-Frequency Information

Yonghui Zhai,Yang Zhang,Minghao Shang,Lihua Pang,Yaxin Ren

Main category: cs.LG

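TL;DR: Grafourierformer combines Graph Transformers with frequency-structure information, using the Graph Fourier Transform to refine the attention mechanism and improve graph classification and node classification.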

Details

Motivation: The self-attention mechanism of existing Graph Transformers ignores the generalization bias of graphs, and current remedies compensate for it only from a structural perspective, leaving performance sub-optimal. Method: Eigenvalues of the graph Laplacian matrix are used to build an eigenvalue matrix mask, and Fourier transforms extract high- and low-frequency node features that refine the attention mechanism. Result: Grafourierformer outperforms GNN and Graph Transformer models on multiple benchmarks. Conclusion: By combining structural and frequency information, Grafourierformer effectively separates global trends from local details and improves model performance. Abstract: Graph Transformers (GTs) have shown advantages in numerous graph structure tasks, but their self-attention mechanism ignores the generalization bias of graphs. Existing methods mainly compensate for this bias through position encoding, attention bias, and relative distance, yet they remain sub-optimal because they consider only the structural perspective of generalization bias. To address this, this paper proposes Grafourierformer, which combines GT with an inductive bias containing frequency-structure information by applying the Graph Fourier Transform to the attention matrix. Specifically, eigenvalues from the graph Laplacian matrix are used to construct an eigenvalue matrix mask, which reflects node positions and structural relationships with neighboring nodes so that node range structural characteristics are considered and local graph details receive focus. An inverse Fourier transform is then employed to extract node high-frequency and low-frequency features, calculate low-frequency and high-frequency energy, and construct a node frequency-energy matrix that filters the eigenvalue matrix mask. This allows attention heads to incorporate both graph structural information and node frequency information, adaptively distinguish global trends from local details, and effectively suppress redundant information interference. Extensive experiments on various benchmarks show Grafourierformer consistently outperforms GNN and GT-based models in graph classification and node classification tasks, with ablation experiments further validating the effectiveness and necessity of the method. Code is available at https://github.com/Arichibald/Grafourierformer.git

[231] Hierarchical Attention Generates Better Proofs

Jianlong Chen,Chao Li,Yang Yuan,Andrew C Yao

Main category: cs.LG

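TL;DR: The paper proposes a regularization method called Hierarchical Attention, which uses a five-level hierarchy to improve large language models at mathematical theorem proving, raising success rates and reducing proof complexity.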

Details

Motivation: Large language models perform well in formal theorem proving, but their token-level processing struggles to capture the hierarchical structure of mathematical proofs, so a method is needed to align attention with the structure of mathematical reasoning. Method: Hierarchical Attention establishes a five-level hierarchy from foundational elements to high-level concepts to ensure structured information flow during proof generation. Result: Proof success rates improve by 2.05% on miniF2F and 1.69% on ProofNet, and proof complexity falls by 23.81% and 16.50% respectively. Conclusion: Hierarchical Attention improves language models' performance in mathematical theorem proving while simplifying the resulting proofs. Abstract: Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce Hierarchical Attention, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while reducing proof complexity by 23.81% and 16.50% respectively. The code is available at https://github.com/Car-pe/HAGBP.

[232] Anyprefer: An Agentic Framework for Preference Data Synthesis

Yiyang Zhou,Zhaoyang Wang,Tianle Wang,Shangyu Xing,Peng Xia,Bo Li,Kaiyuan Zheng,Zijian Zhang,Zhaorun Chen,Wenhao Zheng,Xuchao Zhang,Chetan Bansal,Weitong Zhang,Ying Wei,Mohit Bansal,Huaxiu Yao

Main category: cs.LG

TL;DR: The Anyprefer framework synthesizes high-quality preference data through a two-player Markov game, improving model alignment performance.

Details

Motivation: Manually annotating preference data is time-consuming and costly, and existing self-rewarding methods tend to introduce bias. Method: Anyprefer frames synthesis as a two-player game in which the target model and the judge model cooperate, introduces external tools to reduce bias, and uses feedback to optimize prompts. Result: It produces 58K high-quality preference pairs, and experiments show significant gains across multiple domains. Conclusion: Anyprefer effectively addresses the quality problem in preference data and markedly improves model alignment. Abstract: High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with the target model, thereby amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and the judge model collaborate. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model's responses, mitigating biases in the rewarding process. In addition, a feedback mechanism is introduced to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.

[233] Improving Reasoning Performance in Large Language Models via Representation Engineering

Bertram Højer,Oliver Jarvis,Stefan Heinrich

Main category: cs.LG

TL;DR: The paper proposes improving reasoning performance by modulating LLM activations, with no additional training.

Details

Motivation: To examine whether reasoning in LLMs resembles other information-processing tasks and to improve reasoning performance by intervening on model activations. Method: Using representation engineering, activations are read from the LLM's residual stream, turned into a control vector, and applied to the model at inference time. Result: Experiments show that the method improves the performance of Mistral-7B-Instruct and Pythia models on inductive, deductive, and mathematical reasoning tasks. Conclusion: LLM reasoning performance can be modulated by a simple intervention on activations, with no additional training. Abstract: Recent advancements in large language models (LLMs) have resulted in increasingly anthropomorphic language concerning the ability of LLMs to reason. Whether reasoning in LLMs should be understood to be inherently different is, however, widely debated. We propose utilizing a representation engineering approach wherein model activations are read from the residual stream of an LLM when processing a reasoning task. The activations are used to derive a control vector that is applied to the model as an inference-time intervention, modulating the representational space of the model, to improve performance on the specified task. We publish the code for deriving control vectors and analyzing model representations. The method allows us to improve performance on reasoning benchmarks and assess how control vectors influence the final logit distribution of a model via metrics such as KL divergence and entropy. We apply control vectors to Mistral-7B-Instruct and a range of Pythia models on an inductive, a deductive and mathematical reasoning task. We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations. The intervention is dependent upon the ability to reliably extract the model's typical state when correctly solving a task. Our results suggest that reasoning performance can be modulated in the same manner as other information-processing tasks performed by LLMs and demonstrate that we are capable of improving performance on specific tasks via a simple intervention on the residual stream with no additional training.
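
The idea of deriving a control vector from residual-stream activations and adding it at inference time can be sketched with plain lists. The abstract does not say how the vector is derived, so this sketch assumes a difference of mean activations between two sets of examples, a common choice in representation engineering; the function names, the scaling factor `alpha`, and the toy activations are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def derive_control_vector(target_acts, baseline_acts):
    """Assumed derivation: mean activation on target examples minus mean on baseline examples."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def apply_control(hidden_state, control_vector, alpha=1.0):
    """Add the scaled control vector to one residual-stream hidden state."""
    return [h + alpha * c for h, c in zip(hidden_state, control_vector)]


# Toy 2-dimensional activations standing in for residual-stream reads.
correct_runs = [[1.0, 2.0], [3.0, 4.0]]
other_runs = [[0.0, 1.0], [2.0, 1.0]]

control = derive_control_vector(correct_runs, other_runs)
steered = apply_control([0.5, 0.5], control, alpha=2.0)
```

In a real model the addition would happen inside a chosen layer during the forward pass; the lists here only show the arithmetic of the intervention.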

[234] Graph-Based Spectral Decomposition for Parameter Coordination in Language Model Fine-Tuning

Hanlu Zhang,Yumeng Ma,Shuo Wang,Guiran Liu,Binrong Zhu

Main category: cs.LG

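TL;DR: Proposes a parameter collaborative optimization algorithm for large language models based on graph spectral analysis, improving fine-tuning efficiency and structural awareness.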

Details

Motivation: To improve the fine-tuning efficiency of large language models and their structural awareness during training. Method: The parameters of a pre-trained model are treated as graph nodes, a weighted graph is built and decomposed with the Laplacian spectrum, and a joint loss function and a spectral filtering mechanism are designed on top of it. Result: The method performs well across multi-task evaluations, reducing parameter perturbations and improving fine-tuning quality. Conclusion: The framework advances parameter-efficient training methods and underlines the importance of structural signal processing in deep learning optimization. Abstract: This paper proposes a parameter collaborative optimization algorithm for large language models, enhanced with graph spectral analysis. The goal is to improve both fine-tuning efficiency and structural awareness during training. In the proposed method, the parameters of a pre-trained language model are treated as nodes in a graph. A weighted graph is constructed, and Laplacian spectral decomposition is applied to enable frequency-domain modeling and structural representation of the parameter space. Based on this structure, a joint loss function is designed. It combines the task loss with a spectral regularization term to facilitate collaborative updates among parameters. In addition, a spectral filtering mechanism is introduced during the optimization phase. This mechanism adjusts gradients in a structure-aware manner, enhancing the model's training stability and convergence behavior. The method is evaluated on multiple tasks, including traditional fine-tuning comparisons, few-shot generalization tests, and convergence speed analysis. In all settings, the proposed approach demonstrates superior performance. The experimental results confirm that the spectral collaborative optimization framework effectively reduces parameter perturbations and improves fine-tuning quality while preserving overall model performance. This work contributes significantly to the field of artificial intelligence by advancing parameter-efficient training methodologies for large-scale models, reinforcing the importance of structural signal processing in deep learning optimization, and offering a robust, generalizable framework for enhancing language model adaptability and performance.

[235] Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware

Ching-Yi Lin,Sahil Shah

Main category: cs.LG

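TL;DR: The paper proposes an integerization method based on operation reordering that delays dequantization until after matrix operations, so quantized inputs are processed directly and computational overhead is reduced.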

Details

Motivation: Pre-trained vision Transformers perform well across many vision tasks but have high computational and memory costs. Quantization reduces memory usage, yet the dequantization step still incurs significant computational overhead. Method: Analyzes the computation graph and proposes an integerization process based on operation reordering that delays dequantization until after matrix operations, yielding integerized matrix multiplication and linear modules. Result: Experiments show that low-bit inference reduces per-PE power consumption for linear layers and matrix multiplication, narrowing the gap between quantized models and efficient inference. Conclusion: The method effectively reduces computational overhead and offers a practical route to efficient inference with quantized models. Abstract: Pre-trained vision transformers have achieved remarkable performance across various visual tasks but suffer from expensive computational and memory costs. While model quantization reduces memory usage by lowering precision, these models still incur significant computational overhead due to the dequantization before matrix operations. In this work, we analyze the computation graph and propose an integerization process based on operation reordering. Specifically, the process delays dequantization until after matrix operations. This enables integerized matrix multiplication and linear modules by directly processing the quantized input. To validate our approach, we synthesize the self-attention module of ViT on a systolic array-based hardware. Experimental results show that our low-bit inference reduces per-PE power consumption for linear layers and matrix multiplication, bridging the gap between quantized models and efficient inference.
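The reordering described above rests on a simple identity: with per-tensor scales, (s_a · A_q)(s_b · B_q) = (s_a · s_b)(A_q B_q), so the matrix product can be computed on integers and rescaled once afterwards. The sketch below illustrates only that identity. It assumes symmetric per-tensor quantization, and the helper `quantize` and the 4-bit width are illustrative choices rather than the paper's exact scheme.

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 4):
    """Symmetric per-tensor quantization: x is approximated by scale * q, q integer."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
b = rng.normal(size=(8, 3))
qa, sa = quantize(a)
qb, sb = quantize(b)

# Conventional order: dequantize both operands, then multiply in floating point.
early = (sa * qa) @ (sb * qb)

# Reordered: multiply the integer operands, then apply one combined scale.
int_product = qa @ qb              # stays in integer arithmetic
late = (sa * sb) * int_product

same = np.allclose(early, late)    # the two orders agree up to float rounding
```

The reordered path does the bulk of the work in integer arithmetic and touches floating point only for the final elementwise rescale, which is the property the hardware evaluation in the abstract exploits.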

[236] Geometry aware inference of steady state PDEs using Equivariant Neural Fields representations

Giovanni Catalani,Michael Bauerheim,Frédéric Tost,Xavier Bertrand,Joseph Morlier

Main category: cs.LG

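TL;DR: The paper proposes enf2enf, a method based on Equivariant Neural Field architectures for predicting steady-state PDEs with non-parameterized geometric variability, using an encoder-decoder framework for efficient modeling.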

Details

Motivation: Build on recent advances in neural fields to solve PDEs under non-parameterized geometric variability, improving generalization and physical consistency. Method: Uses an encoder-decoder framework that encodes input geometries into latent point cloud embeddings, combines them with global parameters, and decodes them into continuous output fields, exploiting locality and translation invariance to capture the coupling between geometry and physics. Result: Performs strongly on a high-fidelity aerodynamic dataset, a hyper-elastic material benchmark, and multi-element airfoil geometries, and supports real-time inference and zero-shot super-resolution. Conclusion: enf2enf models the coupling between geometry and physics well, matching or outperforming existing graph-based, operator-learning, and neural field methods. Abstract: Recent advances in Neural Fields have enabled powerful, discretization-invariant methods for learning neural operators that approximate solutions of Partial Differential Equations (PDEs) on general geometries. Building on these developments, we introduce enf2enf, an encoder-decoder methodology for predicting steady-state Partial Differential Equations with non-parameterized geometric variability, based on recently proposed Equivariant Neural Field architectures. In enf2enf, input geometries are encoded into latent point cloud embeddings that inherently preserve geometric grounding and capture local phenomena. The resulting representations are then combined with global parameters and directly decoded into continuous output fields, thus efficiently modeling the coupling between geometry and physics. By leveraging the inductive biases of locality and translation invariance, our approach is able to capture fine-scale physical features as well as complex shape variations, thereby enhancing generalization and physical compliance. Extensive experiments on a high-fidelity aerodynamic dataset, a hyper-elastic material benchmark, and multi-element airfoil geometries demonstrate that the proposed model achieves superior or competitive performance compared to state-of-the-art graph-based, operator learning, and neural field methods. Notably, our method supports real-time inference and zero-shot super-resolution, enabling efficient training on low-resolution meshes while maintaining high accuracy on full-scale discretizations.

Delun Lai,Yeyubei Zhang,Yunchong Liu,Chaojie Li,Huadong Mo

Main category: cs.LG

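TL;DR: Proposes a deep-learning-based multimodal fusion architecture that uses feature extraction, adaptive fusion, and time-series modeling to improve the perception of autonomous navigation robots in complex environments.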

Details

Motivation: Strengthen the perception of autonomous navigation robots in complex environments and address the challenges of multimodal data fusion. Method: Designs a lightweight feature extraction network and an adaptive weighted cross-modal fusion strategy, and adds time-series information modeling. Result: On the KITTI dataset, navigation and positioning accuracy improve by 3.5% and 2.2%, respectively, while real-time performance is maintained. Conclusion: Provides a novel solution for autonomous robot navigation in complex environments. Abstract: This paper introduces a novel deep learning-based multimodal fusion architecture aimed at enhancing the perception capabilities of autonomous navigation robots in complex environments. By utilizing innovative feature extraction modules, adaptive fusion strategies, and time-series modeling mechanisms, the system effectively integrates RGB images and LiDAR data. The key contributions of this work are as follows: a. the design of a lightweight feature extraction network to enhance feature representation; b. the development of an adaptive weighted cross-modal fusion strategy to improve system robustness; and c. the incorporation of time-series information modeling to boost dynamic scene perception accuracy. Experimental results on the KITTI dataset demonstrate that the proposed approach increases navigation and positioning accuracy by 3.5% and 2.2%, respectively, while maintaining real-time performance. This work provides a novel solution for autonomous robot navigation in complex environments.

[238] UNet with Axial Transformer : A Neural Weather Model for Precipitation Nowcasting

Maitreya Sonawane,Sumit Mamtani

Main category: cs.LG

TL;DR: The paper proposes a Transformer-based deep learning method for accurate, localized weather nowcasting as a replacement for traditional numerical models.

Details

Motivation: Traditional numerical weather models have limited skill for localized storms and fast-evolving events such as thunderstorms, so the authors turn to deep learning for faster, higher-resolution forecasts. Method: Uses a Transformer model with axial attention to learn complex patterns from time-series data, applicable to both univariate and multivariate data. Result: Reports state-of-the-art results on the given dataset, with PSNR = 47.67 and SSIM = 0.9943. Conclusion: The method offers an efficient, general framework for weather nowcasting and shows the potential of deep learning in this area. Abstract: Making accurate weather predictions can be particularly challenging for localized storms or events that evolve on hourly timescales, such as thunderstorms. Hence, our goal for the project was to model Weather Nowcasting for making highly localized and accurate predictions that apply to the immediate future, replacing the current numerical weather models and data assimilation systems with Deep Learning approaches. A significant advantage of machine learning is that inference is computationally cheap given an already-trained model, allowing forecasts that are nearly instantaneous and in the native high resolution of the input data. In this work we developed a novel method that employs Transformer-based machine learning models to forecast precipitation. This approach works by leveraging axial attention mechanisms to learn complex patterns and dynamics from time series frames. Moreover, it is a generic framework and can be applied to univariate and multivariate time series data, as well as time series embeddings data. This paper represents initial research on the dataset used in the domain of next frame prediction, and hence, we demonstrate state-of-the-art results in terms of metrics (PSNR = 47.67, SSIM = 0.9943) used for the given dataset using UNet with Axial Transformer.
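Axial attention, which the abstract above names as the core mechanism, factorizes 2D self-attention into one pass along each row followed by one pass along each column. The sketch below shows only that factorization on a single frame of shape (H, W, C). It is a simplified illustration, not the paper's model: the query, key, and value projections are assumed to be the identity, there is a single head, and the function names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attend_along_axis1(x: np.ndarray) -> np.ndarray:
    """Self-attention over axis 1 of x with shape (A, B, C).

    Each of the A lines attends only over its own B positions.
    Queries, keys, and values are the inputs themselves (identity projections).
    """
    scores = np.einsum("abc,adc->abd", x, x) / np.sqrt(x.shape[-1])
    weights = softmax(scores, axis=-1)
    return np.einsum("abd,adc->abc", weights, x)

def axial_attention(frame: np.ndarray) -> np.ndarray:
    """Row attention followed by column attention on a (H, W, C) frame."""
    rows = attend_along_axis1(frame)                       # mix along W within each row
    cols = attend_along_axis1(rows.transpose(1, 0, 2))     # mix along H within each column
    return cols.transpose(1, 0, 2)

rng = np.random.default_rng(0)
frame = rng.normal(size=(6, 5, 4))
out = axial_attention(frame)

# A frame whose every pixel holds the same vector is left unchanged,
# because every attention output is an average of identical vectors.
flat = np.ones((3, 4, 2)) * np.array([1.0, 2.0])
flat_out = axial_attention(flat)
```

The score tensors here have H·W·W and W·H·H entries, compared with (H·W)² for full 2D attention, which is why the axial form scales better to high-resolution frames.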

[239] Learning Brenier Potentials with Convex Generative Adversarial Neural Networks

Claudia Drygala,Hanno Gottschalk,Thomas Kruse,Ségolène Martin,Annika Mütze

Main category: cs.LG

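TL;DR: This paper develops the statistical learning theory of generative adversarial networks (GANs) that learn the Brenier potential, proposes a method that combines ReCU networks with adversarial training to ensure strict convexity of the network, and proves consistency of the learning procedure.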

Details

Motivation: The Brenier potential plays an important role in transporting probability measures, but the problem of learning it has not been studied thoroughly. The paper aims to learn the Brenier potential within a GAN framework and to guarantee its convexity. Method: Approximates the Brenier potential with ReCU networks (activation max{0,x}^3) and trains them adversarially with a cross-entropy loss plus a convexity penalty term, ensuring the networks are strictly convex. Result: Proves consistency of the learning procedure in theory; experiments confirm that the networks learn convexity during training, for target distributions ranging from Gaussian mixtures to image data. Conclusion: The proposed method addresses the problem of learning the Brenier potential and provides new theoretical support for generative models. Abstract: Brenier proved that under certain conditions on a source and a target probability measure there exists a strictly convex function such that its gradient is a transport map from the source to the target distribution. This function is called the Brenier potential. Furthermore, detailed information on the Hölder regularity of the Brenier potential is available. In this work we develop the statistical learning theory of generative adversarial neural networks that learn the Brenier potential. As by the transformation of densities formula, the density of the generated measure depends on the second derivative of the Brenier potential, we develop the universal approximation theory of ReCU networks with cubic activation $\mathtt{ReCU}(x)=\max\{0,x\}^3$ that combines the favorable approximation properties of Hölder functions with a Lipschitz continuous density. In order to assure the convexity of such general networks, we introduce an adversarial training procedure for a potential function represented by the ReCU networks that combines the classical discriminator cross entropy loss with a penalty term that enforces (strict) convexity. We give a detailed decomposition of learning errors and show that for a suitable high penalty parameter all networks chosen in the adversarial min-max optimization problem are strictly convex. This is further exploited to prove the consistency of the learning procedure for (slowly) expanding network capacity. We also implement the described learning algorithm and apply it to a number of standard test cases from Gaussian mixture to image data as target distributions. As predicted in theory, we observe that the convexity loss becomes inactive during the training process and the potentials represented by the neural networks have learned convexity.
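The choice of a cubic activation can be made concrete with a few lines of code. For ReCU(x) = max{0, x}^3 the first derivative is 3·max{0, x}^2 and the second derivative is 6·max{0, x}, which is continuous and piecewise linear, hence Lipschitz. This matters because, as the abstract notes, the generated density depends on the second derivative of the potential. The sketch below covers only the activation and its derivatives. The function names are hypothetical, and the paper's network architecture and convexity penalty are not reproduced here.

```python
import numpy as np

def recu(x):
    """Rectified cubic unit: max{0, x}^3."""
    return np.maximum(0.0, x) ** 3

def recu_grad(x):
    """First derivative: 3 * max{0, x}^2."""
    return 3.0 * np.maximum(0.0, x) ** 2

def recu_hess(x):
    """Second derivative: 6 * max{0, x}, continuous and nonnegative."""
    return 6.0 * np.maximum(0.0, x)

xs = np.array([-1.0, 0.0, 0.5, 2.0])
values = recu(xs)       # 0, 0, 0.125, 8
slopes = recu_grad(xs)  # 0, 0, 0.75, 12
curvs = recu_hess(xs)   # 0, 0, 3, 12

# Central finite difference as a sanity check on the analytic derivative.
h = 1e-4
fd_slopes = (recu(xs + h) - recu(xs - h)) / (2 * h)
```

Because the second derivative is nonnegative everywhere, the activation itself is convex. A network built from it is not automatically convex, though, which is why the abstract describes an additional penalty term.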

[240] Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density

Minjong Cheon

Main category: cs.LG

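TL;DR: Mjölnir is a deep-learning-based parameterization framework for global lightning flash density, trained on ERA5 and WWLLN data, that accurately predicts lightning activity.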

Details

Motivation: Use deep learning's ability to emulate complex atmospheric dynamics to improve the parameterization of global lightning activity. Method: Builds on the InceptionNeXt and SENet architectures and uses a multi-task learning strategy to predict lightning occurrence and magnitude. Result: The model performs well on the global distribution, seasonal variability, and regional characteristics of lightning, reaching a Pearson correlation coefficient of 0.96 for annual mean fields. Conclusion: Mjölnir is an effective data-driven lightning parameterization and a promising component for next-generation AI-based Earth system models. Abstract: Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mjölnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mjölnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on the InceptionNeXt backbone with SENet, and a multi-task learning strategy to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mjölnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mjölnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).

cs.DC [Back]

[241] FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation

Ke Hong,Xiuhong Li,Minxu Liu,Qiuli Mao,Tianqi Wu,Zixiao Huang,Lufang Chen,Zhong Wang,Yichong Zhang,Zhenhua Zhu,Guohao Dai,Yu Wang

Main category: cs.DC

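TL;DR: FlashOverlap is a lightweight design that combines tile-wise overlapping, interference-free computation, and communication agnosticism to reduce communication overhead in multi-GPU systems.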

Details

Motivation: In multi-GPU computing, inter-GPU communication becomes a bottleneck, especially on consumer-grade GPUs. Existing designs cannot provide tile-wise overlapping, interference-free computation, and communication agnosticism at the same time. Method: FlashOverlap uses a novel signaling mechanism to identify tile-wise data dependencies and reorders data to contiguous addresses, so that communication is performed by calling NCCL APIs. Result: Experiments show up to 1.65x speedup, outperforming existing work in most cases. Conclusion: FlashOverlap's lightweight design mitigates the multi-GPU communication bottleneck and improves performance. Abstract: Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency is an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism. FlashOverlap utilizes a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs. Experiments show that such a lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.

cs.SE [Back]

[242] Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

Kang Yang,Xinjun Mao,Shangwen Wang,Yanlin Wang,Tanghaoran Zhang,Bo Lin,Yihao Qin,Zhang Zhang,Yao Lu,Kamal Al-Sabahi

Main category: cs.SE

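TL;DR: The study asks whether replacing human-written code comments with LLM-generated ones improves pre-training dataset quality. It finds that LLM-generated comments are more semantically consistent with the code than human-written ones and confirms the benefit on downstream tasks.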

Details

Motivation: Pre-trained code models depend on high-quality data, but human-written comments tend to become outdated and hurt model performance. The potential of LLMs to generate high-quality comments is worth exploring. Method: Proposes two reference-free evaluation tasks (code-comment inconsistency detection and semantic code search), rebuilds the dataset with LLM-generated comments, and re-pre-trains CodeT5. Result: LLM-generated comments are more semantically consistent with the code than human-written ones, and models trained on the LLM-enhanced data perform better. Conclusion: LLMs can be used to rebuild pre-training datasets, challenging the traditional reliance on human-written comments and advancing code intelligence. Abstract: Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.

[243] Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge

Wenhan Mu,Ling Xu,Shuren Pei,Le Mi,Huichi Zhou

Main category: cs.SE

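TL;DR: The paper proposes EP-Shield, a framework that uses naturalness-aware reasoning to evaluate and purify identifier substitution attacks, outperforming adversarial fine-tuning by a wide margin.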

Details

Motivation: Adversarial examples produced by existing identifier substitution attacks contain unnatural code patterns and are easy to detect, which calls their quality into question. Method: Uses LLM-as-a-Judge to assess the naturalness of adversarial examples and proposes the EP-Shield framework to purify them. Result: Experiments show that EP-Shield outperforms adversarial fine-tuning (by up to 83.36%) with a lightweight design (7B parameters). Conclusion: EP-Shield effectively detects unnatural adversarial code and restores the victim model's correct predictions, making it practical to deploy. Abstract: The widespread adoption of code language models in software engineering tasks has exposed vulnerabilities to adversarial attacks, especially identifier substitution attacks. Although existing identifier substitution attackers demonstrate high success rates, they often produce adversarial examples with unnatural code patterns. In this paper, we systematically assess the quality of adversarial examples using LLM-as-a-Judge. Our analysis reveals that over 80% of adversarial examples generated by state-of-the-art identifier substitution attackers (e.g., ALERT) are actually detectable. Based on this insight, we propose EP-Shield, a unified framework for evaluating and purifying identifier substitution attacks via naturalness-aware reasoning. Specifically, we first evaluate the naturalness of code and identify the perturbed adversarial code, then purify it so that the victim model can restore correct prediction. Extensive experiments demonstrate the superiority of EP-Shield over adversarial fine-tuning (up to 83.36% improvement) and its lightweight design (7B parameters) with GPT-4-level performance.

cs.CR [Back]

[244] Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines

Anantha Sharma,Swetha Devabhaktuni,Eklove Mohan

Main category: cs.CR

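TL;DR: The paper examines modern synthetic data generation and advanced data perturbation techniques in sensitive industries such as BFSI, contrasting them with traditional anonymization methods, with the aim of balancing privacy protection and data utility.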

Details

Motivation: Resolve the tension between privacy protection and data utility in data-sensitive industries (such as BFSI, healthcare, retail, and telecommunications) while improving operational efficiency. Method: Uses generative models (such as GANs), context-aware PII transformation, configurable statistical perturbation, and differential privacy. Result: The modern techniques are argued to offer better privacy protection and data utility than traditional methods, with reduced overhead and faster analytics. Conclusion: These methods show potential for balancing privacy and utility, lowering regulatory risk, and enabling data-driven innovation. Abstract: This paper explores the strategic use of modern synthetic data generation and advanced data perturbation techniques to enhance security, maintain analytical utility, and improve operational efficiency when managing large datasets, with a particular focus on the Banking, Financial Services, and Insurance (BFSI) sector. We contrast these advanced methods, encompassing generative models like GANs, sophisticated context-aware PII transformation, configurable statistical perturbation, and differential privacy, with traditional anonymization approaches. The goal is to create realistic, privacy-preserving datasets that retain high utility for complex machine learning tasks and analytics, a critical need in data-sensitive industries like BFSI, Healthcare, Retail, and Telecommunications. We discuss how these modern techniques potentially offer significant improvements in balancing privacy preservation while maintaining data utility compared to older methods. Furthermore, we examine the potential for operational gains, such as reduced overhead and accelerated analytics, by using these privacy-enhanced datasets. We also explore key use cases where these methods can mitigate regulatory risks and enable scalable, data-driven innovation without compromising sensitive customer information.
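Of the techniques listed above, differential privacy is the one with a compact textbook form, so a small example may help fix ideas. The sketch below is the standard Laplace mechanism applied to a bounded mean. It is not the paper's pipeline, and the function name, the clipping bounds, and the sample values are hypothetical. It assumes the dataset size is public and that neighboring datasets differ in one record.

```python
import numpy as np

def private_mean(values: np.ndarray, lower: float, upper: float,
                 epsilon: float, rng: np.random.Generator) -> float:
    """Epsilon-differentially-private mean via the Laplace mechanism.

    Values are clipped to [lower, upper], so replacing one record changes
    the mean by at most (upper - lower) / n; that bound is the sensitivity.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

rng = np.random.default_rng(0)
balances = np.array([10.0, 20.0, 30.0, 1000.0])   # made-up illustrative values

# Smaller epsilon means stronger privacy and a noisier answer.
strict = private_mean(balances, 0.0, 100.0, epsilon=0.1, rng=rng)
loose = private_mean(balances, 0.0, 100.0, epsilon=1e9, rng=rng)
```

The `epsilon` parameter is the kind of knob a configurable perturbation pipeline would expose: it trades privacy against utility directly, since the noise scale is inversely proportional to it.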

[245] Backdoor Defense in Diffusion Models via Spatial Attention Unlearning

Abha Jha,Ashwath Vaithinathan Aravindan,Matthew Salaway,Atharva Sandeep Bhide,Duygu Nur Yaldiz

Main category: cs.CR

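TL;DR: Proposes SAU, a new method for defending text-to-image diffusion models against backdoor attacks, which uses latent space manipulation and spatial attention to remove the effect of malicious triggers efficiently.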

Details

Motivation: Diffusion models for text-to-image generation are vulnerable to backdoor attacks, and existing defenses fall short, in part because the high-dimensional output space makes detection and mitigation harder. Method: SAU uses latent space manipulation and spatial attention mechanisms to isolate and remove the latent representation of backdoor triggers. Result: SAU reaches 100% trigger removal accuracy across several backdoor attacks and a CLIP score of 0.7023, outperforming existing methods. Conclusion: SAU is an efficient, scalable way to protect diffusion models from backdoor attacks while preserving high-quality image generation. Abstract: Text-to-image diffusion models are increasingly vulnerable to backdoor attacks, where malicious modifications to the training data cause the model to generate unintended outputs when specific triggers are present. While classification models have seen extensive development of defense mechanisms, generative models remain largely unprotected due to their high-dimensional output space, which complicates the detection and mitigation of subtle perturbations. Defense strategies for diffusion models, in particular, remain under-explored. In this work, we propose Spatial Attention Unlearning (SAU), a novel technique for mitigating backdoor attacks in diffusion models. SAU leverages latent space manipulation and spatial attention mechanisms to isolate and remove the latent representation of backdoor triggers, ensuring precise and efficient removal of malicious effects. We evaluate SAU across various types of backdoor attacks, including pixel-based and style-based triggers, and demonstrate its effectiveness in achieving 100% trigger removal accuracy. Furthermore, SAU achieves a CLIP score of 0.7023, outperforming existing methods while preserving the model's ability to generate high-quality, semantically aligned images. Our results show that SAU is a robust, scalable, and practical solution for securing text-to-image diffusion models against backdoor attacks.

Table of Contents

cs.CV [Back]

[1] A Decade of You Only Look Once (YOLO) for Object Detection

[2] Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

[3] Co-Training with Active Contrastive Learning and Meta-Pseudo-Labeling on 2D Projections for Deep Semi-Supervised Learning

[4] SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models

[5] HierSum: A Global and Local Attention Mechanism for Video Summarization

[6] A Review of 3D Object Detection with Vision-Language Models

[7] Dream-Box: Object-wise Outlier Generation for Out-of-Distribution Detection

[8] Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos

[9] PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

[10] Depth as Points: Center Point-based Depth Estimation

[11] IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic

[12] CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval

[13] Video CLIP Model for Multi-View Echocardiography Interpretation

[14] Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning

[15] Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

[16] Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras

[17] PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance

[18] WLTCL: Wide Field-of-View 3-D LiDAR Truck Compartment Automatic Localization System

[19] Exploiting Multiple Representations: 3D Face Biometrics Fusion with Application to Surveillance

[20] Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

[21] Kinship Verification through a Forest Neural Network

[22] R-Sparse R-CNN: SAR Ship Detection Based on Background-Aware Sparse Learnable Proposals

[23] 3DPyranet Features Fusion for Spatio-temporal Feature Learning

[24] MediAug: Exploring Visual Augmentation in Medical Imaging

[25] VISUALCENT: Visual Human Analysis using Dynamic Centroid Representation

[26] Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

[27] Dual-Branch Residual Network for Cross-Domain Few-Shot Hyperspectral Image Classification with Refined Prototype

[28] HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

[29] Learning to Drive from a World Model

[30] MIA-Mind: A Multidimensional Interactive Attention Mechanism Based on MindSpore

[31] Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction

[32] Towards Latency-Aware 3D Streaming Perception for Autonomous Driving

[33] Blind Source Separation Based on Sparsity

[34] DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning

[35] PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification

[36] RadioFormer: A Multiple-Granularity Radio Map Estimation Transformer with 1‱ Spatial Sampling

[37] IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos

[38] Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving

[39] LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition

[40] Adaptive Dual-domain Learning for Underwater Image Enhancement

[41] FlexPara: Flexible Neural Surface Parameterization

[42] CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

[43] CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis

[44] Unsupervised 2D-3D lifting of non-rigid objects using local constraints

[45] Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID

[46] ODExAI: A Comprehensive Object Detection Explainable AI Evaluation

[47] LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition

[48] OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion

[49] Rendering Anywhere You See: Renderability Field-guided Gaussian Splatting

[50] OpenFusion++: An Open-vocabulary Real-time Scene Understanding System

[51] VI3NR: Variance Informed Initialization for Implicit Neural Representations

[52] Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection

[53] Optimal Hyperspectral Undersampling Strategy for Satellite Imaging

[54] Marine Snow Removal Using Internally Generated Pseudo Ground Truth

[55] FusionNet: Multi-model Linear Fusion Framework for Low-light Image Enhancement

[56] Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography

[57] Platonic Grounding for Efficient Multimodal Language Models

[58] Enhancing seeding efficiency using a computer vision system to monitor furrow quality in real-time

[59] Improving Small Drone Detection Through Multi-Scale Processing and Data Augmentation

[60] MERA: Multimodal and Multiscale Self-Explanatory Model with Considerably Reduced Annotation for Lung Nodule Diagnosis

[61] Mitigating Bias in Facial Recognition Systems: Centroid Fairness Loss Optimization

[62] HumMorph: Generalized Dynamic Human Neural Fields from Few Views

[63] Dynamic Arthroscopic Navigation System for Anterior Cruciate Ligament Reconstruction Based on Multi-level Memory Architecture

[64] Boosting 3D Liver Shape Datasets with Diffusion Models and Implicit Neural Representations

[65] GMAR: Gradient-Driven Multi-Head Attention Rollout for Vision Transformer Interpretability

[66] A Real-Time Event-Based Normal Flow Estimator

[67] EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation

[68] CLIP-KOA: Enhancing Knee Osteoarthritis Diagnosis with Multi-Modal Learning and Symmetry-Aware Loss Functions

[69] Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition

[70] Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

[71] CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

[72] Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

[73] SynergyAmodal: Deocclude Anything with Text Control

[74] FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding

[75] LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning

[76] Adversarial Shallow Watermarking

[77] Point2Quad: Generating Quad Meshes from Point Clouds via Face Prediction

[78] Crowd Detection Using Very-Fine-Resolution Satellite Imagery