Table of Contents
- cs.CV [Total: 92]
- cs.CL [Total: 31]
- cs.GR [Total: 1]
- eess.IV [Total: 8]
- cs.AI [Total: 8]
- physics.med-ph [Total: 1]
- cs.DB [Total: 1]
- cs.RO [Total: 4]
- cs.LG [Total: 3]
- cs.IR [Total: 2]
- cs.SD [Total: 1]
cs.CV
[1] Semantic Matters: Multimodal Features for Affective Analysis
Tobias Hallmen,Robin-Nico Kampa,Fabian Deuser,Norbert Oswald,Elisabeth André
Main category: cs.CV
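TL;DR: This paper presents an approach that combines audio, text, and visual modalities for behavioural ambivalence/hesitancy recognition and emotional mimicry intensity estimation, significantly outperforming baseline methods.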
Details
Motivation: The study aims to improve performance on behavioural ambivalence/hesitancy recognition and emotional mimicry intensity estimation through multimodal fusion of audio, text, and vision.
Method: A pre-trained Wav2Vec 2.0 model extracts audio features and is combined with a VAD module, a BERT-like encoder, and a ViT, with an LSTM for temporal modelling.
Result: Multimodal fusion significantly improves task performance.
Conclusion: Combining the semantic and visual modalities allows audio content to be interpreted more precisely, yielding better solutions for the related tasks.
Abstract: In this study, we present our methodology for two tasks: the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge and the Emotional Mimicry Intensity (EMI) Estimation Challenge, both conducted as part of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild. Building on previous work, we utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT-like encoder, and a vision transformer (ViT), with predictions subsequently processed through a long short-term memory (LSTM) architecture for temporal modeling. In this iteration, we integrate the textual and visual modalities into our analysis, recognizing that semantic content provides valuable contextual cues and underscoring that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach yields significant performance improvements over baseline methods.
[2] MultiCore+TPU Accelerated Multi-Modal TinyML for Livestock Behaviour Recognition
Qianxue Zhang,Eiman Kanjo
Main category: cs.CV
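TL;DR: This paper proposes a novel TinyML-based livestock monitoring system that combines multi-modal data for efficient, low-cost, real-time activity recognition and tracking.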
Details
Motivation: Advances in agricultural technology have driven a shift from traditional labour-intensive practices to automated, AI-driven management systems, creating a need for more intelligent livestock monitoring to improve efficiency.
Method: TinyML techniques, a wireless communication framework, and microcontroller platforms are used to fuse accelerometer data and vision inputs into a multi-modal network for image classification, object detection, and behaviour recognition.
Result: Deployed on commercial microcontrollers, the system achieves up to a 270-fold reduction in model size and a response latency below 80 ms, with performance on par with existing methods.
Conclusion: The approach provides a robust, scalable IoT-edge monitoring solution for remote locations that adapts to diverse farming needs and leaves room for future extensions.
Abstract: The advancement of technology has revolutionised the agricultural industry, transitioning it from labour-intensive farming practices to automated, AI-powered management systems. In recent years, more intelligent livestock monitoring solutions have been proposed to enhance farming efficiency and productivity. This work presents a novel approach to animal activity recognition and movement tracking, leveraging tiny machine learning (TinyML) techniques, a wireless communication framework, and microcontroller platforms to develop an efficient, cost-effective livestock sensing system. It collects and fuses accelerometer data and vision inputs to build a multi-modal network for three tasks: image classification, object detection, and behaviour recognition. The system is deployed and evaluated on commercial microcontrollers for real-time inference using embedded applications, demonstrating up to 270$\times$ model size reduction, less than 80 ms response latency, and performance on par with existing methods. The incorporation of the TinyML technique allows for seamless data transmission between devices, benefiting use cases in remote locations with poor Internet connectivity. This work delivers a robust, scalable IoT-edge livestock monitoring solution adaptable to diverse farming needs, offering flexibility for future extensions.
[3] SO-DETR: Leveraging Dual-Domain Features and Knowledge Distillation for Small Object Detection
Huaxiang Zhang,Hao Zhang,Aoran Mei,Zhongxue Gan,Guo-Niu Zhu
Main category: cs.CV
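TL;DR: SO-DETR is a Transformer-based model for small object detection that substantially improves detection performance through a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy.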
Details
Motivation: Existing Transformer-based methods fuse low-level features inadequately and use query selection strategies that are poorly matched to small objects.
Method: The paper proposes a dual-domain hybrid encoder (fusing the spatial and frequency domains), an enhanced query selection mechanism (dynamically selecting high-scoring anchor boxes), and a knowledge distillation strategy.
Result: SO-DETR outperforms existing methods on the VisDrone-2019-DET and UAVVaste datasets.
Conclusion: Through efficient feature fusion and query optimization, SO-DETR offers an effective solution for small object detection.
Abstract: Detection Transformer-based methods have achieved significant advancements in general object detection. However, challenges remain in effectively detecting small objects. One key difficulty is that existing encoders struggle to efficiently fuse low-level features. Additionally, the query selection strategies are not effectively tailored for small objects. To address these challenges, this paper proposes an efficient model, Small Object Detection Transformer (SO-DETR). The model comprises three key components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates spatial and frequency domains to fuse multi-scale features effectively. This approach enhances the representation of high-resolution features while maintaining relatively low computational overhead. The enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes using expanded IoU, thereby improving the allocation of query resources. Furthermore, by incorporating a lightweight backbone network and implementing a knowledge distillation strategy, we develop an efficient detector for small objects. Experimental results on the VisDrone-2019-DET and UAVVaste datasets demonstrate that SO-DETR outperforms existing methods with similar computational demands. The project page is available at https://github.com/ValiantDiligent/SO_DETR.
[4] High Dynamic Range Modulo Imaging for Robust Object Detection in Autonomous Driving
Kebin Contreras,Brayan Monroy,Jorge Bacca
Main category: cs.CV
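TL;DR: The paper proposes using modulo sensors for robust object detection, addressing the inefficiency of conventional HDR imaging in real-time applications; experiments show performance close to that of HDR images and better than that of saturated images.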
Details
Motivation: Object detection accuracy in autonomous driving is affected by lighting conditions. Conventional HDR imaging is inefficient because it needs multiple captures, whereas modulo sensors can address saturation efficiently.
Method: A modulo sensor acquires an irradiance-encoding image, an unwrapping algorithm recovers the HDR image, and a YOLOv10 model performs object detection.
Result: Detection performance on images recovered from modulo captures is close to that on HDR images and significantly better than on saturated images, and reconstruction takes less time than conventional HDR acquisition.
Conclusion: Modulo sensors combined with HDR reconstruction offer an efficient, robust object detection solution for autonomous driving systems.
Abstract: Object detection precision is crucial for ensuring the safety and efficacy of autonomous driving systems. The quality of acquired images directly influences the ability of autonomous driving systems to correctly recognize and respond to other vehicles, pedestrians, and obstacles in real time. However, real environments present extreme variations in lighting, causing saturation problems and resulting in the loss of crucial details for detection. Traditionally, High Dynamic Range (HDR) images have been preferred for their ability to capture a broad spectrum of light intensities, but the need for multiple captures to construct HDR images is inefficient for real-time applications in autonomous vehicles. To address these issues, this work introduces the use of modulo sensors for robust object detection. The modulo sensor allows pixels to 'reset/wrap' upon reaching saturation level by acquiring an irradiance encoding image which can then be recovered using unwrapping algorithms. The applied reconstruction techniques enable HDR recovery of color intensity and image details, ensuring better visual quality even under extreme lighting conditions at the cost of extra time. Experiments with the YOLOv10 model demonstrate that images recovered from modulo captures achieve performance comparable to HDR images and significantly surpass saturated images in terms of object detection accuracy. Moreover, the proposed modulo imaging step combined with HDR image reconstruction takes less time than conventional HDR image acquisition.
[5] Visual moral inference and communication
Warren Zhu,Aida Ramezani,Yang Xu
Main category: cs.CV
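TL;DR: The paper proposes a computational framework for moral inference from natural images, finds that language-vision fusion models perform better at visual moral inference, and reveals implicit biases in news data.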
Details
Motivation: Humans can make moral inferences from many sources of input, whereas moral inference in AI usually relies on text alone. Morality is conveyed not only through language but also through other modalities.
Method: A computational framework supports moral inference from natural images across two tasks: inferring human moral judgment toward visual images, and analyzing patterns of moral content conveyed by images in public news.
Result: Text-only models fail to capture fine-grained human moral judgment toward visual stimuli, while language-vision fusion models perform better. Applying the framework to news data reveals implicit biases in news categories and geopolitical discussions.
Conclusion: The work opens new avenues for automating visual moral inference and discovering patterns of visual moral communication in public media.
Abstract: Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstrated in two related tasks: 1) inferring human moral judgment toward visual images and 2) analyzing patterns in moral content communicated via images from public news. We find that models based on text alone cannot capture the fine-grained human moral judgment toward visual stimuli, but language-vision fusion models offer better precision in visual moral inference. Furthermore, applications of our framework to news data reveal implicit biases in news categories and geopolitical discussions. Our work creates avenues for automating visual moral inference and discovering patterns of visual moral communication in public media.
[6] SDIGLM: Leveraging Large Language Models and Multi-Modal Chain of Thought for Structural Damage Identification
Yunkai Zhang,Shiyin Wei,Yong Huang,Yawu Su,Shanshan Lu,Hui Li
Main category: cs.CV
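TL;DR: SDIGLM is a structural damage identification method based on a large multi-modal model; by combining visual and textual data, it substantially improves the accuracy of damage identification and description.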
Details
Motivation: Existing computer vision models for structural damage identification are limited: they cannot comprehensively analyze complex damage types and lack natural-language description capability. Large multi-modal models offer a new way to address these problems.
Method: Built on the VisualGLM-6B architecture, the model uses a U-Net semantic segmentation module to generate a visual chain of thought and strengthens logical reasoning with a multi-round dialogue fine-tuning dataset and prompt engineering.
Result: SDIGLM reaches 95.24% identification accuracy across several infrastructure types and describes damage characteristics in detail (such as hole size and crack direction).
Conclusion: Through its multi-modal chain of thought, SDIGLM substantially improves structural damage identification and offers a more comprehensive solution for practical engineering applications.
Abstract: Existing computer vision (CV)-based structural damage identification models demonstrate notable accuracy in categorizing and localizing damage. However, these models present several critical limitations that hinder their practical application in civil engineering (CE). First, their ability to recognize damage types remains constrained, preventing comprehensive analysis of the highly varied and complex conditions encountered in real-world CE structures. Second, these models lack linguistic capabilities, rendering them unable to articulate structural damage characteristics through natural language descriptions. With the continuous advancement of artificial intelligence (AI), large multi-modal models (LMMs) have emerged as a transformative solution, enabling the unified encoding and alignment of textual and visual data. These models can autonomously generate detailed descriptive narratives of structural damage while demonstrating robust generalization across diverse scenarios and tasks. This study introduces SDIGLM, an innovative LMM for structural damage identification, developed based on the open-source VisualGLM-6B architecture. To address the challenge of adapting LMMs to the intricate and varied operating conditions in CE, this work integrates a U-Net-based semantic segmentation module to generate defect segmentation maps as visual Chain of Thought (CoT). Additionally, a multi-round dialogue fine-tuning dataset is constructed to enhance logical reasoning, complemented by a language CoT formed through prompt engineering. By leveraging this multi-modal CoT, SDIGLM surpasses general-purpose LMMs in structural damage identification, achieving an accuracy of 95.24% across various infrastructure types. Moreover, the model effectively describes damage characteristics such as hole size, crack direction, and corrosion severity.
[7] Flux Already Knows - Activating Subject-Driven Image Generation without Training
Hao Kang,Stathi Fotiadis,Liming Jiang,Qing Yan,Yumin Jia,Zichuan Liu,Min Jin Chong,Xin Lu
Main category: cs.CV
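TL;DR: Proposes a zero-shot framework built on the Flux model that replicates the subject image in a mosaic layout and frames generation as grid-based image completion, preserving identity without additional data or training.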
Details
Motivation: To explore how a pre-trained foundational text-to-image model can deliver efficient, high-quality subject-driven image generation that meets the lightweight customization needs of downstream applications.
Method: The subject image is replicated in a mosaic layout and generation is treated as grid-based image completion, combined with a cascade attention design and a meta prompting technique to improve generation quality and diversity.
Result: The method outperforms baselines in benchmarks and human preference studies and supports a range of edits, such as logo insertion and virtual try-on.
Conclusion: A pre-trained model can achieve high-quality, resource-efficient subject-driven generation, opening new possibilities for downstream applications.
Abstract: We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This "free lunch" approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.
[8] snnTrans-DHZ: A Lightweight Spiking Neural Network Architecture for Underwater Image Dehazing
Vidya Sudevan,Fakhreddine Zayer,Rizwana Kausar,Sajid Javed,Hamad Karki,Giulia De Masi,Jorge Dias
Main category: cs.CV
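TL;DR: snnTrans-DHZ is a lightweight spiking neural network (SNN) for underwater image dehazing that is low-power and significantly outperforms existing methods in efficiency.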
Details
Motivation: Underwater image dehazing is critical for vision-based marine operations because light scattering and absorption severely reduce visibility.
Method: Static images are converted into time-dependent sequences to exploit the temporal dynamics of SNNs, processed in the RGB and LAB color spaces, and dehazed by three key modules (a K estimator, a background light estimator, and a soft image reconstruction module).
Result: The model reaches PSNR 21.68 dB / SSIM 0.8795 on UIEB and PSNR 23.46 dB / SSIM 0.8439 on EUVP, with only 0.5670 million parameters, 7.42 GSOPs, and 0.0151 J of energy.
Conclusion: snnTrans-DHZ is efficient and low-power, making it suitable for underwater robotics, marine exploration, and environmental monitoring.
Abstract: Underwater image dehazing is critical for vision-based marine operations because light scattering and absorption can severely reduce visibility. This paper introduces snnTrans-DHZ, a lightweight Spiking Neural Network (SNN) specifically designed for underwater dehazing. By leveraging the temporal dynamics of SNNs, snnTrans-DHZ efficiently processes time-dependent raw image sequences while maintaining low power consumption. Static underwater images are first converted into time-dependent sequences by repeatedly inputting the same image over user-defined timesteps. These RGB sequences are then transformed into LAB color space representations and processed concurrently. The architecture features three key modules: (i) a K estimator that extracts features from multiple color space representations; (ii) a Background Light Estimator that jointly infers the background light component from the RGB-LAB images; and (iii) a soft image reconstruction module that produces haze-free, visibility-enhanced outputs. The snnTrans-DHZ model is directly trained using a surrogate gradient-based backpropagation through time (BPTT) strategy alongside a novel combined loss function. Evaluated on the UIEB benchmark, snnTrans-DHZ achieves a PSNR of 21.68 dB and an SSIM of 0.8795, and on the EUVP dataset, it yields a PSNR of 23.46 dB and an SSIM of 0.8439. With only 0.5670 million network parameters, and requiring just 7.42 GSOPs and 0.0151 J of energy, the algorithm significantly outperforms existing state-of-the-art methods in terms of efficiency. These features make snnTrans-DHZ highly suitable for deployment in underwater robotics, marine exploration, and environmental monitoring.
[9] Uncovering Branch specialization in InceptionV1 using k sparse autoencoders
Matthew Bozoukov
Main category: cs.CV
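TL;DR: Sparse autoencoders (SAEs) find interpretable features in neural networks from polysemantic neurons caused by superposition. This paper demonstrates branch specialization in the mixed layers of InceptionV1 and its consistency across layers.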
Details
Motivation: To study branch specialization in the later layers of InceptionV1 using sparse autoencoders, filling a gap left by earlier work.
Method: Examples of branch specialization are shown by analyzing the 5x5 and 1x1 branches in the mixed layers (mixed4a-4e).
Result: Branch specialization is consistent across layers: similar features concentrate in branches with the same convolution size.
Conclusion: Branch specialization is consistent in the later layers of InceptionV1, offering a new perspective on feature extraction in neural networks.
Abstract: Sparse Autoencoders (SAEs) have been shown to find interpretable features in neural networks from polysemantic neurons caused by superposition. Previous work has shown SAEs are an effective tool to extract interpretable features from the early layers of InceptionV1. Since then, there have been many improvements to SAEs, but branch specialization is still an enigma in the later layers of InceptionV1. We show various examples of branch specialization occurring in each of the mixed4a-4e layers, in the 5x5 branch and in one 1x1 branch. We also provide evidence that branch specialization seems to be consistent across layers: similar features across the model will be localized in the same convolution size branches in their respective layers.
[10] TransitReID: Transit OD Data Collection with Occlusion-Resistant Dynamic Passenger Re-Identification
Kaicong Huang,Talha Azfar,Jack Reilly,Ruimin Ke
Main category: cs.CV
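TL;DR: TransitReID is a visual re-identification (ReID) framework for efficient collection of transit OD data that handles challenges such as occlusion and viewpoint variation and runs in near real time on edge devices.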
Details
Motivation: Traditional transit OD data collection is costly and inefficient, while existing technologies such as Bluetooth and WiFi depend on specific devices and offer limited coverage. Onboard cameras on transit vehicles offer a new data collection opportunity but face occlusion and viewpoint variation.
Method: TransitReID has two parts: (1) an occlusion-robust ReID algorithm based on a variational autoencoder, which optimizes weight allocation through a region-attention mechanism; and (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism that balances storage, speed, and accuracy.
Result: Experiments show that TransitReID reaches approximately 90% accuracy in bus route simulations.
Conclusion: TransitReID provides an efficient, accurate solution for transit OD data collection, overcomes the limitations of existing methods, and supports near real-time operation on edge devices.
Abstract: Transit Origin-Destination (OD) data are essential for transit planning, particularly in route optimization and demand-responsive paratransit systems. Traditional methods, such as manual surveys, are costly and inefficient, while Bluetooth and WiFi-based approaches require passengers to carry specific devices, limiting data coverage. On the other hand, most transit vehicles are equipped with onboard cameras for surveillance, offering an opportunity to repurpose them for edge-based OD data collection through visual person re-identification (ReID). However, such approaches face significant challenges, including severe occlusion and viewpoint variations in transit environments, which greatly reduce matching accuracy and hinder their adoption. Moreover, designing effective algorithms that can operate efficiently on edge devices remains an open challenge. To address these challenges, we propose TransitReID, a novel framework for individual-level transit OD data collection. TransitReID consists of two key components: (1) an occlusion-robust ReID algorithm featuring a variational autoencoder guided region-attention mechanism that adaptively focuses on visible body regions through reconstruction loss-optimized weight allocation; and (2) a Hierarchical Storage and Dynamic Matching (HSDM) mechanism specifically designed for efficient and robust transit OD matching, which balances storage, speed, and accuracy. Additionally, a multi-threaded design supports near real-time operation on edge devices while also ensuring privacy protection. We also introduce a ReID dataset tailored for complex bus environments to address the lack of relevant training data. Experimental results demonstrate that TransitReID achieves state-of-the-art performance in ReID tasks, with an accuracy of approximately 90% in bus route simulations.
[11] Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
Kangsheng Wang,Chengwei Ye,Huanzhen Zhang,Linuo Xu,Shuyan Liu
Main category: cs.CV
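TL;DR: Proposes a multimodal feature learning framework for personality analysis in short videos that combines visual, audio, and text features and outperforms existing methods.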
Details
Motivation: Automatically predicting personality traits is a challenging problem in computer vision that calls for multimodal information fusion to improve accuracy.
Method: The framework constructs a facial graph and designs a Geo-based two-stream network (GCN and CNN), extracts multimodal features with ResNet18, VGGFace, BiGRU, VGGish, and XLM-Roberta, and adds attention mechanisms and an MLP regression model.
Result: Experiments show that the framework outperforms existing state-of-the-art methods.
Conclusion: The proposed multimodal framework performs well on personality analysis, confirming the effectiveness of multimodal fusion.
Abstract: Predicting personality traits automatically has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips. For visual processing, we construct a facial graph and design a Geo-based two-stream network incorporating an attention mechanism, leveraging both Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) to capture static facial expressions. Additionally, ResNet18 and VGGFace networks are employed to extract global scene and facial appearance features at the frame level. To capture dynamic temporal information, we integrate a BiGRU with a temporal attention module for extracting salient frame representations. To enhance the model's robustness, we incorporate the VGGish CNN for audio-based features and XLM-Roberta for text-based features. Finally, a multimodal channel attention mechanism is introduced to integrate different modalities, and a Multi-Layer Perceptron (MLP) regression model is used to predict personality traits. Experimental results confirm that our proposed framework surpasses existing state-of-the-art approaches in performance.
[12] ConvShareViT: Enhancing Vision Transformers with Convolutional Attention Mechanisms for Free-Space Optical Accelerators
Riad Ibadulla,Thomas M. Chen,Constantino Carlos Reyes-Aldasoro
Main category: cs.CV
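TL;DR: ConvShareViT is a new deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system by replacing the linear layers in MHSA and the MLPs with depthwise convolutional layers with shared weights; the paper analyzes how convolutions behave within MHSA and how effectively they learn the attention mechanism.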
Details
Motivation: To study how ViTs can be adapted to the 4f optical system, exploiting the parallelism and high-resolution capabilities of optical systems, while exploring the role of convolutions in the attention mechanism.
Method: The linear layers in the ViT are replaced with depthwise convolutional layers with shared weights, and the effect of different convolution configurations (such as valid-padded and same-padded) on attention learning is analyzed.
Result: Valid-padded shared convolutions successfully learn attention, with performance comparable to a standard ViT; same-padded convolutions are limited and behave more like a conventional CNN. In a 4f system, ConvShareViT can theoretically reach up to 3.04 times the inference speed of a GPU.
Conclusion: ConvShareViT shows that a ViT can be implemented using convolution operations alone, offering an efficient solution for optical deep learning applications.
Abstract: This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces linear layers in multi-head self-attention (MHSA) and Multilayer Perceptrons (MLPs) with a depthwise convolutional layer with shared weights across input channels. Through the development of ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving comparable attention scores to those obtained with standard ViTs. However, other configurations, such as those using same-padded convolutions, show limitations in attention learning and operate like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system, which takes advantage of the parallelism and high-resolution capabilities of optical systems. Results demonstrate that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and proves that our ViT (ConvShareViT) can be employed using only the convolution operation, via the necessary optimisation of the ViT to balance performance and complexity.
[13] Deep Learning Approaches for Medical Imaging Under Varying Degrees of Label Availability: A Comprehensive Survey
Siteng Ma,Honghui Du,Yu An,Jing Wang,Qinqin Wang,Haochang Wu,Aonghus Lawlor,Ruihai Dong
Main category: cs.CV
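TL;DR: This survey reviews progress in deep learning for medical imaging when labels are incomplete, inexact, or absent, summarizing around 600 related studies and discussing future challenges.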
Details
Motivation: Annotating medical images is time-consuming and labor-intensive, which drives the need for learning paradigms that work with incomplete, inexact, or absent labels.
Method: The survey categorizes and reviews around 600 related studies published since 2018, covering tasks such as image classification, segmentation, and detection.
Result: It provides definitions of the different learning paradigms and summarizes learning mechanisms and strategies, helping readers understand the state of the field.
Conclusion: It discusses future research challenges and points to directions for deep learning in medical imaging.
Abstract: Deep learning has achieved significant breakthroughs in medical imaging, but these advancements are often dependent on large, well-annotated datasets. However, obtaining such datasets poses a significant challenge, as it requires time-consuming and labor-intensive annotations from medical experts. Consequently, there is growing interest in learning paradigms such as incomplete, inexact, and absent supervision, which are designed to operate under limited, inexact, or missing labels. This survey categorizes and reviews the evolving research in these areas, analyzing around 600 notable contributions since 2018. It covers tasks such as image classification, segmentation, and detection across various medical application areas, including but not limited to brain, chest, and cardiac imaging. We attempt to establish the relationships among existing research studies in related areas. We provide formal definitions of different learning paradigms and offer a comprehensive summary and interpretation of various learning mechanisms and strategies, aiding readers in better understanding the current research landscape and ideas. We also discuss potential future research challenges.
[14] DamageCAT: A Deep Learning Transformer Framework for Typology-Based Post-Disaster Building Damage Categorization
Yiming Xiao,Ali Mostafavi
Main category: cs.CV
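TL;DR: The paper proposes the DamageCAT framework, which identifies building damage types from satellite imagery rather than simple severity classes, making it more useful for disaster response.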
Details
Motivation: Natural disasters are frequent, and existing damage assessment methods provide only binary or ordinal classification, which falls short of practical needs; more detailed damage-type information is required.
Method: The paper introduces the BD-TypoSAT dataset and a hierarchical U-Net-based Transformer architecture that processes pre- and post-disaster image pairs to classify damage types.
Result: The model is robust on class-imbalanced data, with an IoU of 0.7921 and an F1 score of 0.8835, and is particularly effective at recognizing rare damage types.
Conclusion: By providing detailed damage-type information, DamageCAT improves on traditional methods and supports more effective disaster response decisions.
Abstract: Natural disasters increasingly threaten communities worldwide, creating an urgent need for rapid, reliable building damage assessment to guide emergency response and recovery efforts. Current methods typically classify damage in binary (damaged/undamaged) or ordinal severity terms, limiting their practical utility. In fact, the determination of damage typology is crucial for response and recovery efforts. To address this important gap, this paper introduces DamageCAT, a novel framework that provides typology-based categorical damage descriptions rather than simple severity ratings. Accordingly, this study presents two key contributions: (1) the BD-TypoSAT dataset containing satellite image triplets (pre-disaster, post-disaster, and damage masks) from Hurricane Ida with four damage categories (partial roof damage, total roof damage, partial structural collapse, and total structural collapse), and (2) a hierarchical U-Net-based transformer architecture that effectively processes pre-post disaster image pairs to identify and categorize building damage. Despite significant class imbalances in the training data, our model achieved robust performance, with an overall Intersection over Union (IoU) of 0.7921 and an F1 score of 0.8835 across all categories. The model's capability to recognize intricate damage typology in less common categories is especially remarkable. The DamageCAT framework advances automated damage assessment by providing actionable, typological information that better supports disaster response decision-making and resource allocation compared to traditional severity-based approaches.
[15] Real-time Object and Event Detection Service through Computer Vision and Edge Computing
Marcos Mendes,Gonçalo Perna,Pedro Rito,Duarte Raposo,Susana Sargento
Main category: cs.CV
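TL;DR: The paper proposes a smart-city road monitoring and safety system based on computer vision and edge computing, aiming to reduce traffic accidents involving vulnerable road users.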
Details
Motivation: Road traffic accidents cause enormous economic losses worldwide every year, and accident rates for vulnerable road users in cities are high. Smart city technologies offer a new way to address the problem.
Method: Computer vision and edge computing are applied to surveillance camera feeds to detect and track vehicles, pedestrians, and bicycles, and to predict road state and collision events.
Result: In tests at the Aveiro Tech City Living Lab, the system accurately detects and tracks targets and predicts collision events in near real time.
Conclusion: The system demonstrates the potential of advanced technologies to improve road safety in smart cities.
Abstract: The World Health Organization suggests that road traffic crashes cost approximately 518 billion dollars globally each year, which accounts for 3% of the gross domestic product for most countries. Most fatal road accidents in urban areas involve Vulnerable Road Users (VRUs). Smart city environments present innovative approaches to combat accidents using cutting-edge technologies, including advanced sensors, extensive datasets, Machine Learning (ML) models, communication systems, and edge computing. This paper proposes a strategy and an implementation of a system for road monitoring and safety for smart cities, based on Computer Vision (CV) and edge computing. Promising results were obtained by implementing vision algorithms and tracking using surveillance cameras that are part of a Smart City testbed, the Aveiro Tech City Living Lab (ATCLL). The algorithm accurately detects and tracks cars, pedestrians, and bicycles, while predicting the road state and the distance between moving objects, and inferring collision events in order to prevent collisions, in near real time.
[16] Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Amirhossein Dadashzadeh,Parsa Esmati,Majid Mirmehdi
Main category: cs.CV
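TL;DR: Co-STAR is a new framework that combines curriculum learning with collaborative self-training, addressing pseudo-label noise and over-confidence in SFUVDA through bidirectional prediction alignment and adaptive curriculum regularization.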
Details
Motivation: Existing SFUVDA methods suffer from noisy and over-confident pseudo-labels, which limits cross-domain adaptation.
Method: A source-trained teacher model is combined with a contrastive vision-language model (CLIP), and learning is guided by a reliability-based weight function and adaptive curriculum regularization.
Result: Co-STAR outperforms existing SFUVDA methods on multiple video domain adaptation benchmarks.
Conclusion: Co-STAR effectively improves cross-domain adaptation through curriculum learning and collaborative self-training; the code is publicly available.
Abstract: Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: https://github.com/Plrbear/Co-Star
[17] Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
Yiran He,Yun Cao,Bowen Yang,Zeyu Zhang
Main category: cs.CV
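TL;DR: This paper explores multimodal large language models (LLMs) for forgery detection and proposes a framework that evaluates image authenticity, localizes tampered regions, and provides evidence; experiments show performance close to state-of-the-art methods.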
Details
Motivation: The rapid development of generative AI makes content forgery easier to perform and harder to detect. Multimodal LLMs hold rich knowledge but are not optimized for AIGC detection and struggle to understand local forgery details.
Method: Using carefully designed prompt engineering and few-shot learning, the paper proposes a framework that evaluates image authenticity, localizes tampered regions, provides evidence, and traces generation methods.
Result: GPT4V reaches 92.1% accuracy on Autosplice and 86.3% on LaMa, competitive with state-of-the-art methods.
Conclusion: Multimodal LLMs have great potential in forgery analysis but still have limitations that future work can address.
Abstract: The rapid development of generative AI facilitates content creation and makes image manipulation easier to perform and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
[18] Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou,Huangyuan Su,Thomas Fel,Naomi Saphra,Sham Kakade,Stephanie Gil
Main category: cs.CV
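TL;DR: The paper studies the joint embedding spaces of vision-language models (VLMs), using sparse autoencoders (SAEs) to analyze how language and images are organized; it finds sparse linear structure and reveals latent bridges that carry cross-modal semantics.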
Details
Motivation: 探索视觉语言模型如何将语言和图像组织在联合空间中,以及模型如何编码意义和模态。 Method: 训练并发布稀疏自编码器(SAEs)于四种视觉语言模型(CLIP、SigLIP、SigLIP2、AIMv2)的嵌入空间,分析其稀疏线性结构和跨模态语义。 Result: SAEs能更好地重建真实嵌入并保持稀疏性;关键概念在不同运行中稳定,且许多概念虽单模态激活但编码跨模态语义;提出Bridge Score量化跨模态协作。 Conclusion: VLMs的嵌入空间存在稀疏线性结构,由模态塑造但通过潜在桥梁连接,为多模态意义的构建提供了新见解。 Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that the key commonly-activating concepts extracted by SAEs are remarkably stable across runs. Interestingly, while most concepts are strongly unimodal in activation, we find they are not merely encoding modality per se. Many lie close to - but not entirely within - the subspace defining modality, suggesting that they encode cross-modal semantics despite their unimodal usage. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even unimodal concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. 
Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
[19] Non-uniform Point Cloud Upsampling via Local Manifold Distribution
Yaohui Fang,Xingce Wang
Main category: cs.CV
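TL;DR: Proposes a point cloud upsampling method based on Gaussian functions and manifold distribution constraints that outperforms existing techniques.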
Details
Motivation: Existing methods overlook the intrinsic data distribution of point clouds, which leads to poor results on sparse and non-uniform inputs. Method: Exploits the strong fitting capability of Gaussian functions: a network iteratively optimizes Gaussian components and their weights, and a unified statistical manifold is built to impose distribution constraints. Result: Experiments on multiple datasets show that the method generates higher-quality and more uniformly distributed dense point clouds. Conclusion: The method performs well on sparse and non-uniform point clouds and outperforms existing techniques. Abstract: Existing learning-based point cloud upsampling methods often overlook the intrinsic data distribution characteristics of point clouds, leading to suboptimal results when handling sparse and non-uniform point clouds. We propose a novel approach to point cloud upsampling by imposing constraints from the perspective of manifold distributions. Leveraging the strong fitting capability of Gaussian functions, our method employs a network to iteratively optimize Gaussian components and their weights, accurately representing local manifolds. By utilizing the probabilistic distribution properties of Gaussian functions, we construct a unified statistical manifold to impose distribution constraints on the point cloud. Experimental results on multiple datasets demonstrate that our method generates higher-quality and more uniformly distributed dense point clouds when processing sparse and non-uniform inputs, outperforming state-of-the-art point cloud upsampling techniques.
[20] Learning What NOT to Count
Adriano D'Alessandro,Ali Mahdavi-Amiri,Ghassan Hamarneh
Main category: cs.CV
TL;DR: Proposes an annotation-free method that generates synthetic data to improve few/zero-shot object counting models on fine-grained categories.
Details
Motivation: Few/zero-shot object counting methods struggle to distinguish fine-grained categories, especially when similar objects appear in the same scene. Method: Uses latent generative models to synthesize high-quality, category-specific crowded scenes, and introduces an attention prediction network that identifies fine-grained category boundaries. Result: Substantially improves pre-trained models on fine-grained taxon counting tasks while using only synthetic data. Conclusion: The method offers an annotation-free solution for few/zero-shot object counting and is validated on the new FGTC dataset. Abstract: Few/zero-shot object counting methods reduce the need for extensive annotations but often struggle to distinguish between fine-grained categories, especially when multiple similar objects appear in the same scene. To address this limitation, we propose an annotation-free approach that enables the seamless integration of new fine-grained categories into existing few/zero-shot counting models. By leveraging latent generative models, we synthesize high-quality, category-specific crowded scenes, providing a rich training source for adapting to new categories without manual labeling. Our approach introduces an attention prediction network that identifies fine-grained category boundaries and is trained using only synthetic pseudo-annotated data. At inference, these fine-grained attention estimates refine the output of existing few/zero-shot counting networks. To benchmark our method, we further introduce the FGTC dataset, a taxonomy-specific fine-grained object counting dataset for natural images. Our method substantially enhances pre-trained state-of-the-art models on fine-grained taxon counting tasks, while using only synthetic data. Code and data to be released upon acceptance.
[21] Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset
Muhammad Shahid Muneer,Simon S. Woo
Main category: cs.CV
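TL;DR: The paper proposes a million-scale multimodal dataset and a defense method to counter NSFW content generation and adversarial attacks in text-to-image models.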
Details
Motivation: With the widespread use of text-to-image (T2I) models, the ability to generate hyper-realistic images has raised concerns about the spread of NSFW content and the pollution of the web. Existing defenses (such as NSFW filters and post-hoc security checks) are vulnerable to adversarial attacks, and no multimodal dataset is available to support them. Method: 1. Builds a million-scale dataset of prompts and images generated with open-source diffusion models. 2. Develops a multimodal defense that distinguishes safe from NSFW content and is robust to adversarial attacks. Result: Experiments show that the method outperforms existing SOTA NSFW detection methods in accuracy and recall and sharply reduces the Attack Success Rate (ASR) of multimodal adversarial attacks. Conclusion: The work provides an effective dataset and defense for NSFW content in T2I models and markedly strengthens protection against adversarial attacks. Abstract: In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily outplay defensive measures. Exploiting such failures leads to growing concern over preventing adversarial attacks on text and image modalities. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. First, this work proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: https://github.com/shahidmuneer/multimodal-nsfw-defense.
[22] EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
Jilan Xu,Yifei Huang,Baoqi Pei,Junlin Hou,Qingqiu Li,Guo Chen,Yuejie Zhang,Rui Feng,Weidi Xie
Main category: cs.CV
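TL;DR: The paper proposes EgoExo-Gen, which explicitly models hand-object interaction dynamics for cross-view video prediction and improves the quality of generated first-person video.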
Details
Motivation: First-person video generation has broad application prospects in augmented reality and embodied intelligence, but existing methods do not fully exploit the dynamics of hand-object interaction. Method: EgoExo-Gen has two stages: 1) cross-view prediction of hand-object interaction masks; 2) a video diffusion model that uses the masks to generate future frames. Result: Outperforms existing models on the Ego-Exo4D and H2O datasets; the hand-object interaction masks significantly improve the generation of hands and interacting objects. Conclusion: By modeling hand-object interaction dynamics, EgoExo-Gen effectively improves generation quality in cross-view video prediction. Abstract: Generating videos in the first-person perspective has broad application prospects in the fields of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
[23] DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment
Li Yu,Situo Wang,Wei Zhou,Moncef Gabbouj
Main category: cs.CV
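TL;DR: The paper proposes DVLTA-VQA, a CLIP-based dual-stream video quality assessment method that addresses CLIP's inability to capture temporal information in videos and improves assessment through text-guided adaptive feature fusion.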
Details
Motivation: Inspired by the dual-stream theory of the human visual system, researchers want to use CLIP's semantic understanding to emulate the two streams, but CLIP cannot capture temporal information in videos and existing feature fusion strategies are fixed. Method: Proposes DVLTA-VQA, which decouples CLIP's visual and textual components, integrates them into the no-reference video quality assessment pipeline, and uses text guidance to adaptively adjust feature importance. Result: The method emulates the dual-stream function of the human visual system more effectively and improves the accuracy of video quality assessment. Conclusion: Through decoupling and adaptive feature fusion, DVLTA-VQA overcomes CLIP's limitations in video quality assessment and suggests a new direction for future research. Abstract: Inspired by the dual-stream theory of the human visual system (HVS) - where the ventral stream is responsible for object recognition and detail analysis, while the dorsal stream focuses on spatial relationships and motion perception - an increasing number of video quality assessment (VQA) works built upon this framework have been proposed. Recent advancements in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis in the ventral stream, as well as spatial relationship analysis in the dorsal stream. However, CLIP is originally designed for images and lacks the ability to capture temporal and motion information inherent in videos. Furthermore, existing feature fusion strategies in no-reference video quality assessment (NR-VQA) often rely on fixed weighting schemes, which fail to adaptively adjust feature importance. To address these limitations, this paper proposes a Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components, and integrates them into different stages of the NR-VQA pipeline.
[24] The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Bingjie Gao,Xinyu Gao,Xiaoxue Wu,Yujie Zhou,Yu Qiao,Li Niu,Xinyuan Chen,Yaohui Wang
Main category: cs.CV
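TL;DR: The paper proposes RAPO, a framework that optimizes prompt design through two branches to improve the quality of text-to-video generation.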
Details
Motivation: Existing T2V generative models are sensitive to input prompts, yet prompt vocabulary and sentence structure are not optimized in a targeted way. Method: RAPO optimizes prompts through two branches: one adds modifiers drawn from a relational graph, the other rewrites the prompt with a pre-trained LLM. Result: Experiments show that RAPO significantly improves both the static and dynamic dimensions of generated videos. Conclusion: Prompt optimization matters for user-provided prompts, and RAPO offers an effective solution. Abstract: The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts. Project website: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
[25] SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation
Zongye Zhang,Wenrui Cai,Qingjie Liu,Yunhong Wang
Main category: cs.CV
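TL;DR: SkeletonX is a lightweight training pipeline that improves skeleton-based action recognition under limited labeled data by exploiting the mutual information between samples and two key attributes (variability among performers and commonality within each action).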
Details
Motivation: Current skeleton action recognition models perform poorly in new scenarios (new action categories, diverse performers, and varied skeleton layouts), and data collection is costly, so efficient adaptation with limited data needs to be studied. Method: Proposes SkeletonX, which comprises a tailored sample pair construction strategy and a concise, effective feature aggregation module that exploit performer variability and action commonality during training. Result: On NTU RGB+D and other datasets, SkeletonX performs well with limited data and surpasses existing methods in the one-shot setting with far fewer parameters and much less computation. Conclusion: By making efficient use of limited data, SkeletonX significantly improves the adaptability and performance of skeleton action recognition. Abstract: While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degradation. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and far fewer FLOPs.
The code and data are available at: https://github.com/zzysteve/SkeletonX
[26] GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision
Zihui Zhang,Yafei Yang,Hongtao Wen,Bo Yang
Main category: cs.CV
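TL;DR: Proposes GrabS, a two-stage unsupervised 3D object segmentation method that learns generative and discriminative priors and queries them, significantly improving segmentation in complex point clouds.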
Details
Motivation: Existing unsupervised methods rely on pretrained 2D features or external signals (such as motion) to group 3D points; they usually identify only simple objects, and their segmentations are poor because the features lack objectness. Method: Proposes a two-stage pipeline: the first stage learns generative and discriminative priors from object datasets; the second stage designs an agent that discovers multiple objects by querying the pretrained generative priors. Result: Evaluated on two real-world datasets and a new synthetic dataset, the method clearly outperforms existing unsupervised methods in segmentation. Conclusion: Through prior learning and a query mechanism, GrabS effectively tackles unsupervised 3D object segmentation in complex point clouds. Abstract: We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars, or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
[27] Extended Short- and Long-Range Mesh Learning for Fast and Generalized Garment Simulation
Aoran Liu,Kun Hu,Clinton Mo,Changyang Li,Zhiyong Wang
Main category: cs.CV
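TL;DR: Proposes a new GNN-based framework for 3D garment simulation that extends the message-passing range with LSDMP and GSA modules and improves computational efficiency.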
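As background for the "Laplacian-smoothed" part of LSDMP (expanded in the abstract below), the following sketch applies plain Laplacian smoothing to per-vertex features on a tiny mesh graph. It is a textbook neighbor-averaging step, not the paper's module; the blending factor `lam`, the function name, and the three-vertex path graph are assumptions chosen for illustration.

```python
from typing import Dict, List


def laplacian_smooth(
    features: List[List[float]],
    neighbors: Dict[int, List[int]],
    lam: float = 0.5,
    steps: int = 1,
) -> List[List[float]]:
    """Blend each vertex feature with the mean of its neighbors, `steps` times."""
    for _ in range(steps):
        updated = []
        for v, feat in enumerate(features):
            nbrs = neighbors.get(v, [])
            if not nbrs:
                updated.append(list(feat))  # isolated vertex: unchanged
                continue
            mean = [
                sum(features[n][j] for n in nbrs) / len(nbrs)
                for j in range(len(feat))
            ]
            updated.append([(1 - lam) * f + lam * m for f, m in zip(feat, mean)])
        features = updated  # all vertices read the previous step's values
    return features


# Path graph 0 - 1 - 2 with a unit impulse on vertex 0.
feats = [[1.0], [0.0], [0.0]]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
one_step = laplacian_smooth(feats, nbrs, lam=0.5, steps=1)
two_steps = laplacian_smooth(feats, nbrs, lam=0.5, steps=2)
```

After one step the impulse has reached only vertex 1; after two steps it reaches vertex 2. Each smoothing pass therefore widens the region a vertex can influence, which is the intuition behind extending the message-passing range without stacking more GNN layers.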
Details
Motivation: Existing GNN methods are computationally expensive at high resolutions because message passing is inefficient, and need improvement. Method: Combines the Laplacian-Smoothed Dual Message-Passing (LSDMP) and Geodesic Self-Attention (GSA) modules to improve short-range and long-range message passing, respectively. Result: Experiments show state-of-the-art performance with fewer layers and lower latency. Conclusion: The new framework significantly improves the efficiency and performance of 3D garment simulation. Abstract: 3D garment simulation is a critical component for producing cloth-based graphics. Recent advancements in graph neural networks (GNNs) offer a promising approach for efficient garment simulation. However, GNNs require extensive message-passing to propagate information such as physical forces and maintain contact awareness across the entire garment mesh, which becomes computationally inefficient at higher resolutions. To address this, we devise a novel GNN-based mesh learning framework with two key components to extend the message-passing range with minimal overhead, namely the Laplacian-Smoothed Dual Message-Passing (LSDMP) and the Geodesic Self-Attention (GSA) modules. LSDMP enhances message-passing with a Laplacian feature-smoothing process, which efficiently propagates the impact of each vertex to nearby vertices. Concurrently, GSA introduces geodesic distance embeddings to represent the spatial relationship between vertices and utilises attention mechanisms to capture global mesh information. The two modules operate in parallel to ensure both short- and long-range mesh modelling. Extensive experiments demonstrate the state-of-the-art performance of our method, requiring fewer layers and lower inference latency.
[28] TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
Yiran Wang,Jiaqi Li,Chaoyi Hong,Ruibo Li,Liusheng Sun,Xiao Song,Zhe Wang,Zhiguo Cao,Guosheng Lin
Main category: cs.CV
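TL;DR: TacoDepth is an efficient one-stage Radar-Camera depth estimation model that improves efficiency and robustness through a graph-based structure extractor and a pyramid-based fusion module, clearly outperforming existing methods.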
Details
Motivation: The sparsity of Radar data makes existing methods slow and not robust; improvement is needed for real-time processing. Method: Proposes TacoDepth, which uses a graph-based structure extractor and a pyramid-based fusion module to fuse Radar and camera data in a single stage. Result: Depth accuracy improves by 12.8% and processing speed by 91.8%. Conclusion: TacoDepth offers an efficient and flexible new approach to Radar-Camera depth estimation. Abstract: Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on the intermediate depth results. Moreover, TacoDepth can be flexible for different inference modes, providing a better balance of speed and accuracy. Extensive experiments are conducted to demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%, respectively. Our work provides a new perspective on efficient Radar-Camera depth estimation.
[29] Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets
Yongpei Ma,Pengyu Wang,Adam Dunn,Usman Naseem,Jinman Kim
Main category: cs.CV
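TL;DR: The paper proposes a Semantically Equivalent Question Augmentation (SEQA) framework that uses large language models to generate diverse yet semantically equivalent questions, improving the linguistic diversity and consistency of medical visual question answering systems. It also proposes a new evaluation metric, TAR-SC, along with other diversity metrics; experiments show that the approach significantly improves model performance and consistency.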
Details
Motivation: Medical visual question answering (MVQA) systems are affected by linguistic variability in natural language queries, which makes their answers inconsistent. The authors propose the SEQA framework to address this. Method: Uses large language models to generate semantically equivalent question variants and introduces evaluation metrics including TAR-SC. The SEQA framework is applied to augment the SLAKE, VQA-RAD, and PathVQA datasets. Result: The augmented datasets show significantly higher diversity metrics (ANQI, ANQA, ANQS). In experiments, fine-tuned models gain 19.35% in average accuracy and 11.61% in TAR-SC. Conclusion: The SEQA framework effectively improves the linguistic diversity and answer consistency of MVQA systems, confirming its practical value for medical visual question answering. Abstract: Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. Specifically, this approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model's capability to generate consistent and correct responses to semantically equivalent linguistic variations. In addition, we propose three other diversity metrics - average number of QA items per image (ANQI), average number of questions per image with the same answer (ANQA), and average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented the benchmarked MVQA public datasets of SLAKE, VQA-RAD, and PathVQA. As a result, all three datasets achieved significant improvements by incorporating more semantically equivalent questions: ANQI increased by an average of 86.1, ANQA by 85.1, and ANQS by 46. Subsequent experiments evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the enhanced datasets.
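The abstract names TAR-SC but does not give its formula. One plausible reading, sketched here purely as an assumption, counts a group of semantically equivalent questions as a success only when every rephrasing receives the correct answer, then reports the share of successful groups. The function name `tar_sc`, the case-insensitive string matching, and the toy answers are all hypothetical.

```python
from typing import Dict, List


def tar_sc(predictions: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Share of question groups whose rephrasings are ALL answered correctly."""
    if not predictions:
        return 0.0
    agree = 0
    for group_id, answers in predictions.items():
        target = gold[group_id].strip().lower()
        # Consistent AND correct: every rephrasing must match the gold answer.
        if answers and all(a.strip().lower() == target for a in answers):
            agree += 1
    return agree / len(predictions)


# Each key is one original question; each list holds the model's answers
# to its semantically equivalent rephrasings (toy data).
preds = {
    "q1": ["yes", "Yes", "yes"],          # consistent and correct
    "q2": ["left lung", "right lung"],    # inconsistent
    "q3": ["no", "no"],                   # consistent but wrong
    "q4": ["mri", "ct"],                  # inconsistent
}
gold = {"q1": "yes", "q2": "left lung", "q3": "yes", "q4": "mri"}
score = tar_sc(preds, gold)
```

Under this reading, plain accuracy would credit the partially correct groups `q2` and `q4`, whereas the agreement rate does not, which is why a consistency metric can move differently from accuracy.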
Experimental results on MVQA datasets show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
[30] Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection
Qishun Wang,Zhengzheng Tu,Chenglong Li,Bo Jiang
Main category: cs.CV
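TL;DR: Proposes an alignment-free Multimodal Spatio-temporal Graph learning Network (MSGNet) for RGB-Thermal video object detection, which uses an adaptive partitioning layer and sparse graph learning modules for cross-modal information interaction and hybrid structured temporal modeling to improve performance.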
Details
Motivation: Conventional RGB-Thermal fusion tasks rely on manually aligned multimodal image pairs, which limits practical use. This paper addresses alignment-free RGB-Thermal video object detection. Method: Designs an Adaptive Partitioning Layer (APL) for a preliminary inexact alignment, introduces a Spatial Sparse Graph Learning Module (S-SGLM) for information interaction between modalities, and proposes Hybrid Structured Temporal Modeling (HSTM), which combines a Temporal Sparse Graph Learning Module (T-SGLM) and a Temporal Star Block (TSB) to make better use of temporal information. Result: Experiments on the aligned dataset VT-VOD50 and the unaligned dataset UVT-VOD2024 confirm the effectiveness and superiority of the method. Conclusion: MSGNet achieves alignment-free RGB-Thermal video object detection through graph representation learning and offers a new approach to multimodal fusion tasks. Abstract: RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions, making it more practical and effective in many applications. However, similar to most RGBT fusion tasks, it still mainly relies on manually aligned multimodal image pairs. In this paper, we propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for the alignment-free RGBT VOD problem by leveraging the robust graph representation learning model. Specifically, we first design an Adaptive Partitioning Layer (APL) to estimate the corresponding regions of the Thermal image within the RGB image (high-resolution), achieving a preliminary inexact alignment. Then, we introduce the Spatial Sparse Graph Learning Module (S-SGLM) which employs a sparse information passing mechanism on the estimated inexact alignment to achieve reliable information interaction between different modalities. Moreover, to fully exploit the temporal cues for the RGBT VOD problem, we introduce Hybrid Structured Temporal Modeling (HSTM), which involves a Temporal Sparse Graph Learning Module (T-SGLM) and Temporal Star Block (TSB). T-SGLM aims to filter out some redundant information between adjacent frames by employing the sparse aggregation mechanism on the temporal graph. Meanwhile, TSB is dedicated to achieving the complementary learning of local spatial relationships. Extensive comparative experiments conducted on both the aligned dataset VT-VOD50 and the unaligned dataset UVT-VOD2024 demonstrate the effectiveness and superiority of our proposed method.
Our project will be made available on our website for free public access.
[31] ACMamba: Fast Unsupervised Anomaly Detection via An Asymmetrical Consensus State Space Model
Guanchun Wang,Xiangrong Zhang,Yifei Zhang,Zelin Peng,Tianyang Zhang,Xu Tang,Licheng Jiao
Main category: cs.CV
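TL;DR: ACMamba is an unsupervised method for hyperspectral image anomaly detection that replaces dense pixel-level samples with region-level instances, significantly reducing computational cost while maintaining accuracy.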
Details
Motivation: Current hyperspectral image anomaly detection methods are computationally expensive because of the high-dimensional data and the dense sampling-based training paradigm, which limits rapid deployment. Method: Proposes ACMamba, which adopts an asymmetrical anomaly detection paradigm, introduces a low-cost Mamba-based module to discover the global contextual attributes of regions, and develops a consensus learning strategy to optimize background reconstruction and anomaly compression. Result: Across eight benchmarks, ACMamba is faster and performs better than the state of the art. Conclusion: Through efficient sampling and consensus learning, ACMamba addresses the computational cost of hyperspectral image anomaly detection while also improving performance. Abstract: Unsupervised anomaly detection in hyperspectral images (HSI), aiming to detect unknown targets from backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high-dimensional property of HSI and the dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating a faster speed and stronger performance over the state-of-the-art.
[32] DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation
Sang-Jun Park,Keun-Soo Heo,Dong-Hee Shin,Young-Han Son,Ji-Hye Oh,Tae-Eui Kam
Main category: cs.CV
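TL;DR: Proposes DART, a framework that improves the accuracy and trustworthiness of generated radiology reports through disease-aware image-text alignment and self-correcting re-alignment.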
Details
Motivation: To reduce the time spent writing radiology reports and to accurately capture critical disease-relevant findings in X-ray images. Method: A two-stage approach: 1) initial reports are generated by image-to-text retrieval with disease matching; 2) a self-correction module re-aligns the reports with the images. Result: Achieves state-of-the-art results on two widely used benchmarks, surpassing previous methods. Conclusion: The DART framework significantly improves the quality and clinical efficacy of generated radiology reports. Abstract: The automatic generation of radiology reports has emerged as a promising solution to reduce a time-consuming task and accurately capture critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining generated reports. In this study, we propose a Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports with similar disease-relevant findings that closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
[33] Neighbor-Based Feature and Index Enhancement for Person Re-Identification
Chao Yuan,Tianyi Zhang,Guanglin Niu
Main category: cs.CV
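TL;DR: The paper proposes DMON-ARO, a new model that uses latent neighborhood information to enhance feature representation and retrieval performance in person re-identification.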
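To illustrate what "multi-order neighborhood" information can mean in retrieval, the sketch below enhances one feature by mixing it with the mean of its nearest neighbors (first order) and the mean of their neighbors (second order). This is a generic illustration, not the DMON module described in the abstract below: the fixed weights, the value of `k`, the Euclidean metric, and all function names are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def knn(features: Sequence[Vector], i: int, k: int) -> List[int]:
    """Indices of the k features closest to features[i] (squared Euclidean)."""
    others = [j for j in range(len(features)) if j != i]
    others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(features[i], features[j])))
    return others[:k]


def mean(vectors: List[Vector], fallback: Vector) -> Vector:
    """Column-wise mean; falls back to the given vector when the group is empty."""
    if not vectors:
        return list(fallback)
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def enhance(features: Sequence[Vector], i: int, k: int = 2,
            weights=(0.6, 0.3, 0.1)) -> Vector:
    """Mix a feature with its first- and second-order neighborhood means."""
    first = knn(features, i, k)
    second = sorted({j for n in first for j in knn(features, n, k)} - set(first) - {i})
    m1 = mean([features[j] for j in first], features[i])
    m2 = mean([features[j] for j in second], features[i])
    w0, w1, w2 = weights
    return [w0 * s + w1 * a + w2 * b for s, a, b in zip(features[i], m1, m2)]


# Toy 1-d gallery: four nearby samples and one outlier.
gallery = [[0.0], [1.0], [2.0], [3.0], [10.0]]
enhanced = enhance(gallery, 0)
```

For sample 0 the first-order neighbors are samples 1 and 2, and the only new second-order neighbor is sample 3; the outlier at 10.0 never enters the mix. A learned model would choose the neighbors and weights adaptively rather than fixing them.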
Details
Motivation: Most existing methods ignore latent contextual information, which limits feature representation and retrieval, whereas neighborhood information (especially multi-order neighborhoods) can enrich feature expression and improve retrieval accuracy. Method: The model has two complementary modules, Dynamic Multi-Order Neighbor Modeling (DMON) and Asymmetric Relationship Optimization (ARO), which dynamically aggregate neighbor relationships and refine the distance matrix, respectively. Result: Experiments on three benchmark datasets show gains in both Rank-1 accuracy and mAP. Conclusion: DMON-ARO effectively improves person re-identification performance and can be extended to other re-identification tasks. Abstract: Person re-identification (Re-ID) aims to match the same pedestrian in a large gallery with different cameras and views. Enhancing the robustness of the extracted feature representations is a main challenge in Re-ID. Existing methods usually improve feature representation by improving model architecture, but most methods ignore the potential contextual information, which limits the effectiveness of feature representation and retrieval performance. Neighborhood information, especially the potential information of multi-order neighborhoods, can effectively enrich feature expression and improve retrieval accuracy, but this has not been fully explored in existing research. Therefore, we propose a novel model DMON-ARO that leverages latent neighborhood information to enhance both feature representation and index performance. Our approach is built on two complementary modules: Dynamic Multi-Order Neighbor Modeling (DMON) and Asymmetric Relationship Optimization (ARO). The DMON module dynamically aggregates multi-order neighbor relationships, allowing it to capture richer contextual information and enhance feature representation through adaptive neighborhood modeling. Meanwhile, ARO refines the distance matrix by optimizing query-to-gallery relationships, improving the index accuracy. Extensive experiments on three benchmark datasets demonstrate that our approach achieves performance improvements against baseline models, which illustrates the effectiveness of our model. Specifically, our model demonstrates improvements in Rank-1 accuracy and mAP. Moreover, this method can also be directly extended to other re-identification tasks.
[34] Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting
Delong Suzhang,Meng Yang
Main category: cs.CV
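TL;DR: Proposes a new method that tackles the generalization problem caused by structure misalignment in real-world depth recovery, with improvements on both the input and the output side.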
Details
Motivation: Low-quality depth maps in real-world RGB-D datasets commonly suffer from structure misalignment, and paired raw-GT data is lacking, so existing methods generalize poorly. Method: Designs a new raw depth generation pipeline to enrich the diversity of structure misalignment, and introduces a structure uncertainty module and a robust feature alignment module. Result: Experiments on multiple datasets show strong accuracy and generalization. Conclusion: By improving how inputs and outputs are handled, the method significantly improves the generalization of depth recovery. Abstract: The low-quality structure in raw depth maps is prevalent in real-world RGB-D datasets, which has made real-world depth recovery a critical task in recent years. However, the lack of paired raw-ground truth (raw-GT) data in the real world poses challenges for generalized depth recovery. Existing methods insufficiently consider the diversity of structure misalignment in raw depth maps, which leads to poor generalization in real-world depth recovery. Notably, random structure misalignments are not limited to raw depth data but also affect GT depth in real-world datasets. In the proposed method, we tackle the generalization problem from both input and output perspectives. For input, we enrich the diversity of structure misalignment in raw depth maps by designing a new raw depth generation pipeline, which helps the network avoid overfitting to a specific condition. Furthermore, a structure uncertainty module is designed to explicitly identify the misaligned structure for input raw depth maps to better generalize in unseen scenarios. Notably, a well-trained depth foundation model (DFM) can help the structure uncertainty module estimate the structure uncertainty better. For output, a robust feature alignment module is designed to precisely align with the accurate structure of RGB images, avoiding the interference of inaccurate GT depth. Extensive experiments on multiple datasets demonstrate the proposed method achieves competitive accuracy and generalization capabilities across various challenging raw depth maps.
[35] A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification
Bianca Lamm,Janis Keuper
Main category: cs.CV
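TL;DR: Proposes a Visual RAG pipeline that combines Retrieval Augmented Generation (RAG) with Vision Language Models (VLMs) for few-shot fine-grained classification (FGC), achieving accurate product identification and price prediction in the retail domain.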
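The property stressed in the abstract below, that new products can be predicted without re-training by adding a few samples to the RAG database, can be pictured with a small retrieval index. This sketch covers only a nearest-neighbor lookup over stored embeddings; the VLM step is omitted, and the `ProductIndex` class, the 2-d toy embeddings, and the product ids are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


class ProductIndex:
    """Stores (product id, embedding) pairs and retrieves the most similar ones."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Vector]] = []

    def add(self, product_id: str, embedding: Vector) -> None:
        self._items.append((product_id, embedding))

    def retrieve(self, query: Vector, k: int = 1) -> List[str]:
        ranked = sorted(self._items, key=lambda it: cosine(query, it[1]), reverse=True)
        return [pid for pid, _ in ranked[:k]]


index = ProductIndex()
index.add("cola-330ml", [1.0, 0.0])
index.add("cola-zero-330ml", [0.9, 0.4])

query = [0.1, 1.0]                      # embedding of a product not yet indexed
before = index.retrieve(query)          # best match among known products

index.add("lemon-soda-500ml", [0.0, 1.0])  # a few samples of the new class
after = index.retrieve(query)              # no model weights were changed
```

Before the new class is added, the query falls back to the closest known product; afterwards the new id is retrieved. In a full pipeline, the retrieved examples would presumably be passed to the VLM as context for the final prediction.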
Details
Motivation: Fine-grained classification remains challenging in domains such as retail, where identifying fast-changing and visually similar products is essential for automated price monitoring and product recommendation. Method: Uses a Visual RAG pipeline that combines RAG and VLMs to extract product data from advertisement leaflets and predict fine-grained product ids along with price information, handling new classes without re-training. Result: Reaches 86.8% accuracy on a diverse dataset, outperforming traditional methods. Conclusion: The Visual RAG pipeline is an efficient solution for few-shot FGC and works particularly well in dynamic retail settings. Abstract: Despite the rapid evolution of learning and computer vision algorithms, Fine-Grained Classification (FGC) still poses an open problem in many practically relevant applications. In the retail domain, for example, the identification of fast changing and visually highly similar products and their properties is key to automated price-monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. This Visual RAG pipeline extracts product and promotion data in advertisement leaflets from various retailers and simultaneously predicts fine-grained product ids along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. Comparing several VLM back-ends like GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
[36] Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels
Jie Zhu,Bo Peng,Zhe Zhang,Bingzheng Liu,Jianjun Lei
Main category: cs.CV
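TL;DR: DFM-MVS uses a depth foundation model to generate depth priors, trains MVS networks effectively without real-world labels, and improves performance through pseudo-supervision and an error correction strategy.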
Details
Motivation: Training learning-based MVS networks without real-world labels remains challenging.
Method: Proposes DFM-MVS, which uses a depth foundation model to generate depth priors and develops a pseudo-supervised training mechanism and a depth prior-guided error correction strategy.
Result: On the DTU and Tanks & Temples datasets, significantly outperforms existing MVS methods that do not use real-world labels.
Conclusion: Through depth priors and pseudo-supervision, DFM-MVS effectively addresses MVS training without real-world labels.
Abstract: Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, how to effectively train the network without using real-world labels remains a challenging problem. In this paper, driven by the recent advancements of vision foundation models, a novel method termed DFM-MVS is proposed to leverage the depth foundation model to generate the effective depth prior, so as to boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. Besides, a depth prior-guided error correction strategy is presented to leverage the depth prior as guidance to mitigate the error propagation problem inherent in the widely-used coarse-to-fine network structure. Experimental results on DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods without using real-world labels.
[37] ACE: Attentional Concept Erasure in Diffusion Models
Finn Carter
Main category: cs.CV
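TL;DR: Proposes ACE, a new method that removes a specified concept from a pre-trained diffusion model while preserving its ability to generate other content.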
Details
Motivation: Diffusion models trained on large-scale data can generate harmful, copyrighted, or otherwise undesirable content.
Method: Combines closed-form attention manipulation with lightweight fine-tuning, achieving concept erasure through gated low-rank adaptation and adversarially augmented fine-tuning.
Result: On multiple benchmarks, ACE achieves state-of-the-art concept erasure efficacy and robustness while being efficient and scalable.
Conclusion: ACE offers an effective solution for the safe deployment of diffusion models.
Abstract: Large text-to-image diffusion models have demonstrated remarkable image synthesis capabilities, but their indiscriminate training on Internet-scale data has led to learned concepts that enable harmful, copyrighted, or otherwise undesirable content generation. We address the task of concept erasure in diffusion models, i.e., removing a specified concept from a pre-trained model such that prompting the concept (or related synonyms) no longer yields its depiction, while preserving the model's ability to generate other content. We propose a novel method, Attentional Concept Erasure (ACE), that integrates a closed-form attention manipulation with lightweight fine-tuning. Theoretically, we formulate concept erasure as aligning the model's conditional distribution on the target concept with a neutral distribution. Our approach identifies and nullifies concept-specific latent directions in the cross-attention modules via a gated low-rank adaptation, followed by adversarially augmented fine-tuning to ensure thorough erasure of the concept and its synonyms. Empirically, we demonstrate on multiple benchmarks, including object classes, celebrity faces, explicit content, and artistic styles, that ACE achieves state-of-the-art concept removal efficacy and robustness. Compared to prior methods, ACE better balances generality (erasing concept and related terms) and specificity (preserving unrelated content), scales to dozens of concepts, and is efficient, requiring only a few seconds of adaptation per concept. We will release our code to facilitate safer deployment of diffusion models.
[38] Cross-Frequency Collaborative Training Network and Dataset for Semi-supervised First Molar Root Canal Segmentation
Zhenhuan Zhou,Yuchen Zhang,Along He,Peng Wang,Xueshuo Xie,Tao Li
Main category: cs.CV
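TL;DR: Proposes CFC-Net, a semi-supervised learning method for root canal segmentation that significantly improves segmentation performance through cross-frequency collaborative training and an uncertainty-guided mechanism, validated on multiple datasets.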
Details
Motivation: Root canal treatment relies heavily on clinical experience, and the lack of public datasets limits the application of deep learning in this field. The paper aims to address this and provide more objective diagnostic support.
Method: Designs CFC-Net, comprising CFC-MT (cross-frequency collaborative training) and UCF-Mix (an uncertainty-guided mechanism), using semi-supervised learning to reduce annotation workload and integrate multi-frequency information.
Result: Experiments on FMRC-2025 and three public dental datasets show that CFC-Net outperforms existing semi-supervised medical image segmentation methods and generalizes well.
Conclusion: CFC-Net offers an efficient solution for root canal segmentation and shows potential for multi-frequency data integration and semi-supervised learning.
Abstract: Root canal (RC) treatment is a highly delicate and technically complex procedure in clinical practice, heavily influenced by the clinicians' experience and subjective judgment. Deep learning has made significant advancements in the field of computer-aided diagnosis (CAD) because it can provide more objective and accurate diagnostic results. However, its application in RC treatment is still relatively rare, mainly due to the lack of public datasets in this field. To address this issue, in this paper, we established a First Molar Root Canal segmentation dataset called FMRC-2025. Additionally, to alleviate the workload of manual annotation for dentists and fully leverage the unlabeled data, we designed a Cross-Frequency Collaborative training semi-supervised learning (SSL) Network called CFC-Net. It consists of two components: (1) Cross-Frequency Collaborative Mean Teacher (CFC-MT), which introduces two specialized students (SS) and one comprehensive teacher (CT) for collaborative multi-frequency training. The CT and SS are trained on different frequency components while fully integrating multi-frequency knowledge through cross and full frequency consistency supervisions. (2) Uncertainty-guided Cross-Frequency Mix (UCF-Mix) mechanism, which enables the network to generate high-confidence pseudo-labels while learning to integrate multi-frequency information and maintaining the structural integrity of the targets. Extensive experiments on FMRC-2025 and three public dental datasets demonstrate that CFC-MT is effective for RC segmentation and can also exhibit strong generalizability on other dental segmentation tasks, outperforming state-of-the-art SSL medical image segmentation methods. Code and dataset will be released.
[39] Synthetic Data for Blood Vessel Network Extraction
Joël Mathys,Andreas Plesner,Jorel Elmiger,Roger Wattenhofer
Main category: cs.CV
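TL;DR: Combines synthetic data generation with deep learning to automatically extract vessel network graphs from volumetric microscopy data, addressing data scarcity and improving topological accuracy.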
Details
Motivation: Brain vessel networks are crucial in stroke research, but extracting detailed topological information from microscopy data is hindered by scarce labeled data and the need for high accuracy.
Method: Proposes a three-stage synthetic data generation pipeline incorporating biological constraints and imaging artifacts, and develops a two-stage deep learning pipeline of 3D U-Net-based models for node detection and edge prediction.
Result: After fine-tuning on only 5 manually labeled samples, the edge prediction F1 score rises from 0.496 to 0.626.
Conclusion: Automated vessel network extraction is becoming feasible, opening new possibilities for large-scale vascular analysis in stroke research.
Abstract: Blood vessel networks in the brain play a crucial role in stroke research, where understanding their topology is essential for analyzing blood flow dynamics. However, extracting detailed topological vessel network information from microscopy data remains a significant challenge, mainly due to the scarcity of labeled training data and the need for high topological accuracy. This work combines synthetic data generation with deep learning to automatically extract vessel networks as graphs from volumetric microscopy data. To combat data scarcity, we introduce a comprehensive pipeline for generating large-scale synthetic datasets that mirror the characteristics of real vessel networks. Our three-stage approach progresses from abstract graph generation through vessel mask creation to realistic medical image synthesis, incorporating biological constraints and imaging artifacts at each stage. Using this synthetic data, we develop a two-stage deep learning pipeline of 3D U-Net-based models for node detection and edge prediction. Fine-tuning on real microscopy data shows promising adaptation, improving edge prediction F1 scores from 0.496 to 0.626 by training on merely 5 manually labeled samples. These results suggest that automated vessel network extraction is becoming practically feasible, opening new possibilities for large-scale vascular analysis in stroke research.
[40] A Category-Fragment Segmentation Framework for Pelvic Fracture Segmentation in X-ray Images
Daiqi Liu,Fuxin Fan,Andreas Maier
Main category: cs.CV
TL;DR: Proposes CFS, a deep learning-based framework for automatic pelvic fracture segmentation in 2D X-ray images, with strong results.
Details
Motivation: Pelvic fractures often require surgical intervention, and accurate fracture segmentation supports surgical planning and intraoperative adjustment.
Method: The CFS framework has three steps: category segmentation, fragment segmentation, and post-processing.
Result: The best model achieves an IoU of 0.91 for anatomical structures and 0.78 for fracture segmentation.
Conclusion: The CFS framework is effective and accurate for pelvic fracture segmentation.
Abstract: Pelvic fractures, often caused by high-impact trauma, frequently require surgical intervention. Imaging techniques such as CT and 2D X-ray imaging are used to transfer the surgical plan to the operating room through image registration, enabling quick intraoperative adjustments. Specifically, segmenting pelvic fractures from 2D X-ray imaging can assist in accurately positioning bone fragments and guiding the placement of screws or metal plates. In this study, we propose a novel deep learning-based category and fragment segmentation (CFS) framework for the automatic segmentation of pelvic bone fragments in 2D X-ray images. The framework consists of three consecutive steps: category segmentation, fragment segmentation, and post-processing. Our best model achieves an IoU of 0.91 for anatomical structures and 0.78 for fracture segmentation. Results demonstrate that the CFS framework is effective and accurate.
[41] Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
Yushuai Sun,Zikun Zhou,Dongmei Jiang,Yaowei Wang,Jun Yu,Guangming Lu,Wenjie Pei
Main category: cs.CV
TL;DR: Proposes a Prunable Network that generates compatible subnetworks through post-training pruning, adapting to new platforms without additional training.
Details
Motivation: Existing methods lack flexibility for multi-platform deployment; in particular, introducing a new platform requires training an additional compatible model.
Method: Optimizes both the architecture and weights of subnetworks at different capacities within a dense network, and designs a conflict-aware gradient integration scheme to handle gradient conflicts.
Result: Effectiveness is validated on multiple benchmarks and visual backbones.
Conclusion: Post-training pruning enables flexible multi-platform deployment and significantly improves compatibility and efficiency.
Abstract: Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. Our code and model are available at https://github.com/Bunny-Black/PrunNet.
[42] CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting
Wei Sun,Yanzhao Zhou,Jianbin Jiao,Yuan Li
Main category: cs.CV
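TL;DR: Proposes CAGS, a new framework that improves 3D Gaussian Splatting (3DGS) by incorporating spatial context, addressing cross-view granularity inconsistency and improving 3D instance segmentation.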
Details
Motivation: Open-vocabulary 3D scene understanding is crucial for natural language-driven applications such as robotics and augmented reality, but existing methods produce incoherent object segmentations because of cross-view granularity inconsistency.
Method: CAGS builds local graphs to propagate contextual features, applies mask-centric contrastive learning to smooth features, and uses a precomputation strategy to reduce computational cost.
Result: Significantly reduces fragmentation errors and improves 3D instance segmentation on the LERF-OVS and ScanNet datasets.
Conclusion: By integrating spatial context, CAGS achieves more robust language-guided 3D scene understanding.
Abstract: Open-vocabulary 3D scene understanding is crucial for applications requiring natural language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue, stemming from 2D segmentation methods like SAM, results in inconsistent object segmentations across views (e.g., a "coffee set" segmented as a single entity in one view but as "cup + coffee + spoon" in another). Existing 3DGS-based methods often rely on isolated per-Gaussian feature learning, neglecting the spatial context needed for cohesive object reasoning, leading to fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity, employs mask-centric contrastive learning to smooth SAM-derived features across views, and leverages a precomputation strategy to reduce computational cost by precomputing neighborhood relationships, enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
[43] Search is All You Need for Few-shot Anomaly Detection
Qishan Wang,Jia Guo,Shuyong Gao,Haofen Wang,Li Xiong,Junjie Hu,Hanqi Guo,Wenqiang Zhang
Main category: cs.CV
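TL;DR: VisionAD is a simple nearest-neighbor search framework for few-shot anomaly detection (FSAD) that outperforms existing methods in both single-class and multi-class scenarios.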
Details
Motivation: Few-shot anomaly detection is a challenging task in industrial inspection, and existing methods rely on complex multi-modal models and manual tuning.
Method: VisionAD has four key components: scalable vision foundation models, dual augmentation strategies, multi-layer feature integration, and a class-aware visual memory bank.
Result: On the MVTec-AD, VisA, and Real-IAD benchmarks, VisionAD reaches AUROC scores of 97.4%, 94.8%, and 70.8% with only 1 normal image, significantly outperforming existing methods.
Conclusion: VisionAD's training-free nature and strong few-shot capability make it highly attractive for real-world applications where samples are scarce.
Abstract: Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies: support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%).
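As a rough illustration of the nearest-neighbor idea in this abstract, the sketch below scores query patches by their distance to the closest feature in a class-aware memory bank built from normal support images. It is not the authors' implementation: the function names are hypothetical, the toy 2-D vectors stand in for features that a vision foundation model would produce, and the Euclidean distance and max-over-patches aggregation are assumptions that the abstract does not specify.

```python
import math


def l2(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_memory_bank(support):
    """Flatten normal support images into one feature list per class.

    support: dict mapping class name -> list of images,
             where each image is a list of patch feature vectors.
    """
    return {
        cls: [patch for image in images for patch in image]
        for cls, images in support.items()
    }


def patch_scores(query_patches, bank, cls):
    """Anomaly score per patch: distance to the nearest normal feature."""
    return [min(l2(q, m) for m in bank[cls]) for q in query_patches]


def image_score(query_patches, bank, cls):
    """Image-level score: the worst (largest) patch score (assumed rule)."""
    return max(patch_scores(query_patches, bank, cls))


# Toy data: one normal support image of the hypothetical class "bottle".
bank = build_memory_bank({"bottle": [[[0.0, 0.0], [1.0, 0.0]]]})
normal_query = [[0.0, 0.1], [0.9, 0.0]]
anomalous_query = [[0.0, 0.0], [3.0, 4.0]]

print(image_score(normal_query, bank, "bottle"))
print(image_score(anomalous_query, bank, "bottle"))
```

No training step is involved in this sketch: supporting a new class only means adding its normal features to the bank, and a query whose patches sit far from every stored feature receives a high score.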
The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at https://github.com/Qiqigeww/VisionAD.
[44] Learning Physics-Informed Color-Aware Transforms for Low-Light Image Enhancement
Xingxing Yang,Jie Chen,Zaifeng Yang
Main category: cs.CV
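TL;DR: Proposes PiCat, a low-light image enhancement method based on physics-informed priors that uses a color-aware transform and a content-noise decomposition network to address the instability of existing methods under complex lighting.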
Details
Motivation: Existing methods that map low-light to normal-light images directly in sRGB space suffer from inconsistent color predictions and sensitivity to spectral power distribution variations, leading to unstable performance.
Method: Proposes the PiCat framework, comprising a Color-aware Transform (CAT) that converts images into illumination-invariant descriptors and a Content-Noise Decomposition Network (CNDN) that refines the descriptor distributions.
Result: Outperforms existing methods on five benchmark datasets.
Conclusion: PiCat uses physical priors to guide the low-light to normal-light transformation, significantly improving image enhancement under complex lighting conditions.
Abstract: Image decomposition offers deep insights into the imaging factors of visual data and significantly enhances various advanced computer vision tasks. In this work, we introduce a novel approach to low-light image enhancement based on decomposed physics-informed priors. Existing methods that directly map low-light to normal-light images in the sRGB color space suffer from inconsistent color predictions and high sensitivity to spectral power distribution (SPD) variations, resulting in unstable performance under diverse lighting conditions. To address these challenges, we introduce a Physics-informed Color-aware Transform (PiCat), a learning-based framework that converts low-light images from the sRGB color space into deep illumination-invariant descriptors via our proposed Color-aware Transform (CAT). This transformation enables robust handling of complex lighting and SPD variations. Complementing this, we propose the Content-Noise Decomposition Network (CNDN), which refines the descriptor distributions to better align with well-lit conditions by mitigating noise and other distortions, thereby effectively restoring content representations to low-light images. The CAT and the CNDN collectively act as a physical prior, guiding the transformation process from low-light to normal-light domains. Our proposed PiCat framework demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.
[45] AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection
Yuhao Chao,Jie Liu,Jie Tang,Gangshan Wu
Main category: cs.CV
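TL;DR: AnomalyR1 is an industrial anomaly detection framework built on a Multimodal Large Language Model (MLLM) that delivers an end-to-end solution through GRPO and ROAM, outperforming existing methods.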
Details
Motivation: Industrial anomaly detection is challenging because defective samples are scarce; traditional methods are limited by hand-crafted features or domain-specific models, so a new paradigm is needed.
Method: Combines an MLLM (VLM-R1) with GRPO and introduces the ROAM metric, enabling end-to-end processing of images and domain knowledge, reasoning, and anomaly localization.
Result: On the latest multimodal IAD benchmark, the 3-billion-parameter model outperforms existing methods and reaches state-of-the-art results.
Conclusion: AnomalyR1 demonstrates the potential of MLLMs in industrial anomaly detection and lays a foundation for intelligent inspection systems in data-scarce settings.
Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
[46] Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
Lvpan Cai,Haowei Wang,Jiayi Ji,YanShu ZhouMen,Yiwei Ma,Xiaoshuai Sun,Liujuan Cao,Rongrong Ji
Main category: cs.CV
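TL;DR: Proposes the BR-Gen dataset and the NFA-ViT model for detecting locally forged images, filling the gap left by existing datasets that focus on object-level forgeries and overlook scene edits.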
Details
Motivation: AI-generated image editing tools make localized forgeries increasingly realistic, while existing datasets focus mainly on object-level forgeries and overlook scene edits (such as sky or ground).
Method: Proposes the BR-Gen dataset (150,000 locally forged images, based on semantic calibration) and the NFA-ViT model (a Noise-guided Forgery Amplification Vision Transformer that enhances detection through noise fingerprints and an attention mechanism).
Result: BR-Gen fills a gap in existing methods, and NFA-ViT performs well on BR-Gen and existing benchmarks.
Conclusion: BR-Gen and NFA-ViT provide new tools for localized forgery detection and improve detection robustness.
Abstract: The rise of AI-generated image editing tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated Perception-Creation-Evaluation pipeline to ensure semantic coherence and visual realism. In addition, we further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, i.e., potential edited areas, by noise fingerprints. Subsequently, an attention mechanism is introduced to compel the interaction between normal and abnormal features, thereby propagating the generalization traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Taking a step further, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks. All data and code are available at https://github.com/clpbc/BR-Gen.
[47] Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -
Laura Fieback,Nishilkumar Balar,Jakob Spiegelberg,Hanno Gottschalk
Main category: cs.CV
TL;DR: Proposes ECD, a contrastive decoding method that reduces hallucinations in Large Vision Language Models (LVLMs) without additional training.
Details
Motivation: Despite progress in LVLMs, they still produce hallucinated responses that do not match the visual input.
Method: ECD uses probabilistic hallucination detection and contrasts token probabilities with hallucination scores to subtract hallucinated concepts from the original distribution.
Result: ECD performs well across several benchmark datasets and different LVLMs, significantly reducing hallucinations and surpassing existing methods.
Conclusion: ECD is an efficient and general method that effectively suppresses hallucinations in LVLMs.
Abstract: Despite recent advances in Large Vision Language Models (LVLMs), these models still suffer from generating hallucinatory responses that do not align with the visual input provided. To mitigate such hallucinations, we introduce Efficient Contrastive Decoding (ECD), a simple method that leverages probabilistic hallucination detection to shift the output distribution towards contextually accurate answers at inference time. By contrasting token probabilities and hallucination scores, ECD subtracts hallucinated concepts from the original distribution, effectively suppressing hallucinations. Notably, our proposed method can be applied to any open-source LVLM and does not require additional LVLM training. We evaluate our method on several benchmark datasets and across different LVLMs. Our experiments show that ECD effectively mitigates hallucinations, outperforming state-of-the-art methods with respect to performance on LVLM benchmarks and computation time.
[48] Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren,Fan Tang,He Zhao,Zixuan Wang,Dandan Guo,Yi Chang
Main category: cs.CV
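TL;DR: Proposes AiR, a method that uses diffusion models to generate high-quality pseudo-labeled data and improve the classification performance of vision-language models.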
Details
Motivation: Current pseudo-labeling strategies struggle to match semantic and visual information, leading to poor performance of unsupervised prompt learning methods.
Method: AiR uses diffusion models to generate high-fidelity synthetic samples, builds an auxiliary classifier, increases visual diversity, and improves prompt learning.
Result: On five public benchmarks, AiR significantly outperforms existing unsupervised prompt learning methods.
Conclusion: By using diffusion models to improve pseudo-label quality, AiR offers a more robust solution for classification with vision-language models.
Abstract: Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called Augmenting Discriminative Richness via Diffusions (AiR), toward learning a richer discriminating way to represent the class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms (UL, SSL, and TRZSL) demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.
[49] R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors
Haoyang Wang,Liming Liu,Peiheng Wang,Junlin Hao,Jiangkai Wu,Xinggong Zhang
Main category: cs.CV
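TL;DR: Proposes a new framework that uses diffusion models to enhance sparse-view mesh reconstruction, improving geometric and rendering quality through a Consensus Diffusion Module and an online reinforcement learning strategy.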
Details
Motivation: Mesh reconstruction degrades under sparse-view conditions, and diffusion model outputs suffer from visual artifacts and 3D inconsistency.
Method: Designs a Consensus Diffusion Module to filter unreliable generations, uses a UCB-based online reinforcement learning strategy to select the most informative viewpoints, and jointly supervises a NeRF-based model.
Result: Experiments show significant improvements in both geometric and rendering quality.
Conclusion: The framework effectively addresses sparse-view reconstruction through diffusion models and reinforcement learning.
Abstract: Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
[50] Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation
Jie Wang,Chen Ye Gan,Caoqi Wei,Jiangtao Wen,Yuxing Han
Main category: cs.CV
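TL;DR: Flow Intelligence is a new video feature matching approach based on temporal motion patterns that requires no pretraining data and suits cross-modal scenarios.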
Details
Motivation: Traditional methods rely on spatial features and perform poorly on noisy, misaligned, or cross-modal data; deep learning improves robustness but depends on large amounts of data and compute.
Method: Extracts motion signatures from pixel blocks across consecutive frames to build temporal motion descriptors, avoiding spatial feature detection.
Result: Performs well in cross-modal matching and is naturally invariant to translation, rotation, and scale changes.
Conclusion: By using motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.
Abstract: Feature matching across video streams remains a cornerstone challenge in computer vision. Increasingly, robust multimodal matching has garnered interest in robotics, surveillance, remote sensing, and medical imaging. While traditional methods rely on detecting and matching spatial features, they break down when faced with noisy, misaligned, or cross-modal data. Recent deep learning methods have improved robustness through learned representations, but remain constrained by their dependence on extensive training data and computational demands. We present Flow Intelligence, a paradigm-shifting approach that moves beyond spatial features by focusing on temporal motion patterns exclusively. Instead of detecting traditional keypoints, our method extracts motion signatures from pixel blocks across consecutive frames and extracts temporal motion signatures between videos. These motion-based descriptors achieve natural invariance to translation, rotation, and scale variations while remaining robust across different imaging modalities. This novel approach also requires no pretraining data, eliminates the need for spatial feature detection, enables cross-modal matching using only temporal motion, and outperforms existing methods in challenging scenarios where traditional approaches fail. By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.
[51] Exploring Video-Based Driver Activity Recognition under Noisy Labels
Linjuan Fan,Di Wen,Kunyu Peng,Kailun Yang,Jiaming Zhang,Ruiping Liu,Yufan Chen,Junwei Zheng,Jiamin Wu,Xudong Han,Rainer Stiefelhagen
Main category: cs.CV
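TL;DR: Proposes a label noise learning approach for driver activity recognition that improves model performance through the cluster assumption and a sample selection strategy.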
Details
Motivation: Real-world video data often contain mislabeled samples that harm model reliability, yet label noise learning is underexplored in driver activity recognition.
Method: Based on the cluster assumption, the model learns low-dimensional representations and groups them into clusters, then performs co-refinement to smooth classifier outputs, combined with a flexible sample selection strategy that filters clean samples.
Result: Experiments on the Drive&Act dataset show that the method outperforms label-denoising methods from the image classification field.
Conclusion: The method offers an effective solution to label noise in driver activity recognition, and the code is open source.
Abstract: As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
[52] Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
Yifei Dong,Fengyi Wu,Sanjian Zhang,Guangyu Chen,Yuzhi Hu,Masumi Yano,Jingdong Sun,Siyu Huang,Feng Liu,Qi Dai,Zhi-Qi Cheng
Main category: cs.CV
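TL;DR: Surveys research in the anti-UAV domain, focusing on three core objectives (classification, detection, and tracking) and covering emerging methods such as diffusion-based data synthesis and multi-modal fusion. Evaluating existing techniques, it identifies gaps in real-time performance, stealth detection, and swarm scenarios, and calls for more robust, adaptive systems.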
Details
Motivation: UAVs are indispensable for infrastructure inspection and surveillance but also pose serious security challenges, motivating research on anti-UAV techniques.
Method: Surveys anti-UAV methods, including diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning, and evaluates single-modality and multi-sensor pipelines.
Result: The analysis reveals gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring the urgency of robust, adaptive systems.
Conclusion: By highlighting open research directions, the survey aims to foster innovation and guide the development of next-generation anti-UAV defense strategies.
Abstract: Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
[53] A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions
Rahima Khanam,Muhammad Hussain
Main category: cs.CV
TL;DR: YOLOv12 improves real-time object detection by introducing attention mechanisms, striking a better balance between speed and accuracy.
Details
Motivation: Integrating attention mechanisms into the YOLO series incurs high computational overhead.
Method: Uses architectural innovations including Area Attention, Residual Efficient Layer Aggregation Networks, and FlashAttention.
Result: Outperforms prior versions and competitors in accuracy, inference speed, and computational efficiency.
Conclusion: YOLOv12 advances real-time object detection by refining the latency-accuracy trade-off and optimizing computational resources.
Abstract: The YOLO (You Only Look Once) series has been a leading framework in real-time object detection, consistently improving the balance between speed and accuracy. However, integrating attention mechanisms into YOLO has been challenging due to their high computational overhead. YOLOv12 introduces a novel approach that successfully incorporates attention-based enhancements while preserving real-time performance. This paper provides a comprehensive review of YOLOv12's architectural innovations, including Area Attention for computationally efficient self-attention, Residual Efficient Layer Aggregation Networks for improved feature aggregation, and FlashAttention for optimized memory access. Additionally, we benchmark YOLOv12 against prior YOLO versions and competing object detectors, analyzing its improvements in accuracy, inference speed, and computational efficiency. Through this analysis, we demonstrate how YOLOv12 advances real-time object detection by refining the latency-accuracy trade-off and optimizing computational resources.
[54] A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning
Mengyu Wang,Hanbo Bi,Yingchao Feng,Linlin Xin,Shuo Gong,Tianqi Wang,Zhiyuan Yan,Peijin Wang,Wenhui Diao,Xian Sun
Main category: cs.CV
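TL;DR: Proposes a remote sensing foundation model based on complex-valued SAR data that is pre-trained by simulating the polarimetric decomposition process, giving the model physical interpretability, and achieves state-of-the-art results on six downstream tasks.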
Details
Motivation: Foundation models for SAR image interpretation suffer from insufficient information utilization and poor interpretability, so physical simulation is needed to improve them.
Method: Constructs scattering queries that simulate the polarimetric decomposition process, and pre-trains with a polarimetric decomposition loss and a power self-supervision loss.
Result: Achieves state-of-the-art results on six typical downstream tasks, with strong generalization even in data-scarce conditions.
Conclusion: Through physical simulation, the foundation model attains both high performance and interpretability, making it suitable for SAR image interpretation.
Abstract: Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
[55] Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
Xinli Yue,JianHui Sun,Junda Lu,Liangchao Yao,Fan Xia,Tianyi Wang,Fengyun Rao,Jing Lyu,Yuetang Deng
Main category: cs.CV
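TL;DR: This paper proposes iMatch, an improved method for evaluating the semantic alignment of text-to-image generation models, which significantly improves on existing methods through four augmentation strategies.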
Details
Motivation: With the rapid development of text-to-image generation models, precisely evaluating the semantic alignment between generated images and text descriptions has become an important challenge, and existing methods still fall short on fine-grained assessment and quantification.
Method: iMatch fine-tunes multimodal large language models and introduces four augmentation strategies (the QAlign strategy, a validation set augmentation strategy, an element augmentation strategy, and an image augmentation strategy), combined with prompt type augmentation and score perturbation strategies.
Result: Experiments show that iMatch significantly outperforms existing methods, and it won first place in the CVPR NTIRE 2025 competition.
Conclusion: iMatch is effective and practically valuable for evaluating image-text semantic alignment and offers a new direction for related research.
Abstract: With the rapid advancement of text-to-image (T2I) generation models, assessing the semantic alignment between generated images and text descriptions has become a significant research challenge. Current methods, including those based on Visual Question Answering (VQA), still struggle with fine-grained assessments and precise quantification of image-text alignment. This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch), which evaluates image-text semantic alignment by fine-tuning multimodal large language models. We introduce four innovative augmentation strategies: First, the QAlign strategy creates a precise probabilistic mapping to convert discrete scores from multimodal large language models into continuous matching scores. Second, a validation set augmentation strategy uses pseudo-labels from model predictions to expand training data, boosting the model's generalization performance. Third, an element augmentation strategy integrates element category labels to refine the model's understanding of image-text matching. Fourth, an image augmentation strategy employs techniques like random lighting to increase the model's robustness. Additionally, we propose prompt type augmentation and score perturbation strategies to further enhance the accuracy of element assessments. Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value.
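The QAlign strategy described above converts discrete scores into a continuous one. A minimal sketch of one way this could work, assuming (hypothetically) that the model exposes one logit per discrete rating level from 1 to 5 and that the continuous score is the probability-weighted mean of those levels. The abstract does not give the exact mapping, so the function name `expected_score`, the five-level scale, and the example logits are illustrative only.

```python
import math


def expected_score(level_logits, levels=(1, 2, 3, 4, 5)):
    """Convert logits over discrete rating levels into one continuous score.

    Hypothetical helper: softmax the logits, then take the
    probability-weighted mean of the rating levels.
    """
    if len(level_logits) != len(levels):
        raise ValueError("need one logit per rating level")
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(level_logits)
    exps = [math.exp(x - max_logit) for x in level_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * level for p, level in zip(probs, levels))


# Illustrative logits only: equal logits give the midpoint of the scale,
# while logits favouring the top levels pull the score towards 5.
uniform = expected_score([0.0, 0.0, 0.0, 0.0, 0.0])
skewed = expected_score([0.0, 0.0, 0.0, 2.0, 4.0])
```

Under this assumed mapping, two predictions that share the same most likely level can still receive different continuous scores, which is the kind of finer quantification the abstract attributes to QAlign.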
Furthermore, our iMatch won first place in the CVPR NTIRE 2025 Text to Image Generation Model Quality Assessment - Track 1 Image-Text Alignment.
[56] MixSignGraph: A Sign Sequence is Worth Mixed Graphs of Nodes
Shiwei Gan,Yafeng Yin,Zhiwei Jiang,Hongkai Wen,Lei Xie,Sanglu Lu
Main category: cs.CV
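TL;DR: The paper proposes MixSignGraph, which extracts sign-related features through mixed graph modules (LSG, TSG, HSG) and further improves performance with a text-driven pre-training method (TCP). Experiments show that it outperforms existing methods.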
Details
Motivation: Existing CNN backbones struggle to capture region-related features in sign language tasks, such as collaboration between multiple regions and the effective content within a single region.
Method: Proposes MixSignGraph, with three graph modules (LSG, TSG, HSG) handling spatial, temporal, and hierarchical features respectively, and introduces the TCP method for pre-training.
Result: Performs strongly on multiple public datasets, surpassing existing SOTA models.
Conclusion: MixSignGraph captures sign language features effectively, TCP further improves performance, and the model achieves superior results without additional cues.
Abstract: Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., the left-hand and right-hand regions) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction, i.e., the Local Sign Graph (LSG) module, the Temporal Sign Graph (TSG) module, and the Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, i.e., focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on five current public sign language datasets demonstrate the superior performance of the proposed model.
Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
[57] Action Anticipation from SoccerNet Football Video Broadcasts
Mohamad Dalal,Artur Xarles,Anthony Cioppa,Silvio Giancola,Marc Van Droogenbroeck,Bernard Ghanem,Albert Clapés,Sergio Escalera,Thomas B. Moeslund
Main category: cs.CV
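TL;DR: The paper introduces the task of action anticipation in football videos and releases a new dataset, SoccerNet Ball Action Anticipation. It proposes a baseline method, FAANTRA, and introduces new evaluation metrics. Experiments confirm both the feasibility and the challenges of action anticipation.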
Details
Motivation: Little existing work addresses anticipating actions before they occur, particularly in football videos.
Method: Proposes FAANTRA, which adapts the FUTR model to predict ball-related actions, and introduces the new metrics mAP@δ and mAP@∞.
Result: Experiments show that action anticipation in football videos is feasible but challenging, offering new insights for sports analytics.
Conclusion: Action anticipation can support automated broadcasting, tactical analysis, and player decision-making; the dataset and code are publicly available.
Abstract: Artificial intelligence has revolutionized the way we analyze sports videos, whether to understand the actions of games in long untrimmed videos or to anticipate the player's motion in future frames. Despite these efforts, little attention has been given to anticipating game actions before they occur. In this work, we introduce the task of action anticipation for football broadcast videos, which consists of predicting future actions in unobserved future frames, within a five- or ten-second anticipation window. To benchmark this task, we release a new dataset, namely the SoccerNet Ball Action Anticipation dataset, based on SoccerNet Ball Action Spotting. Additionally, we propose a Football Action ANticipation TRAnsformer (FAANTRA), a baseline method that adapts FUTR, a state-of-the-art action anticipation model, to predict ball-related actions. To evaluate action anticipation, we introduce new metrics, including mAP@$\delta$, which evaluates the temporal precision of predicted future actions, as well as mAP@$\infty$, which evaluates their occurrence within the anticipation window. We also conduct extensive ablation studies to examine the impact of various task settings, input configurations, and model architectures. Experimental results highlight both the feasibility and challenges of action anticipation in football videos, providing valuable insights into the design of predictive models for sports analytics. By forecasting actions before they unfold, our work will enable applications in automated broadcasting, tactical analysis, and player decision-making. Our dataset and code are publicly available at https://github.com/MohamadDalal/FAANTRA.
[58] Understanding Attention Mechanism in Video Diffusion Models
Bingyan Liu,Chengyu Wang,Tongtong Su,Huan Ten,Jun Huang,Kailing Guo,Kui Jia
Main category: cs.CV
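TL;DR: The paper uses an information-theoretic approach to analyze the role of spatial and temporal attention blocks in T2V models, finds that high-entropy attention maps are linked to video quality, and proposes two lightweight methods to improve video quality and enable text-guided editing.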
Details
Motivation: Investigate how attention mechanisms in T2V models affect aspects of video synthesis such as quality and temporal consistency.
Method: Performs a perturbation analysis of spatial and temporal attention blocks using an information-theoretic approach.
Result: High-entropy attention maps are linked to video quality, while low-entropy maps are associated with intra-frame structure; two lightweight methods are proposed to improve quality.
Conclusion: Lightweight manipulation of attention matrices can effectively improve video quality and editing capability.
Abstract: Text-to-video (T2V) synthesis models, such as OpenAI's Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video's intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. The effectiveness of our methods is further validated through experimental evaluation across multiple datasets.
[59] Object Placement for Anything
Bingjie Gao,Bo Zhang,Li Niu
Main category: cs.CV
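TL;DR: Proposes a semi-supervised framework that exploits large-scale unlabeled data to improve the generalization ability of discriminative object placement models.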
Details
Motivation: Existing object placement research is constrained by small-scale labeled datasets, which limits real-world application.
Method: Designs a semi-supervised framework that exploits unlabeled data and transfers knowledge of rationality variation from labeled data to unlabeled data.
Result: Experiments show that the framework effectively improves the generalization ability of the models.
Conclusion: The semi-supervised framework offers a more general solution to the object placement problem.
Abstract: Object placement aims to determine the appropriate placement (e.g., location and size) of a foreground object when placing it on the background image. Most previous works are limited by small-scale labeled datasets, which hinders the real-world application of object placement. In this work, we devise a semi-supervised framework which can exploit large-scale unlabeled datasets to promote the generalization ability of discriminative object placement models. The discriminative models predict the rationality label for each foreground placement given a foreground-background pair. To better leverage the labeled data, under the semi-supervised framework, we further propose to transfer the knowledge of rationality variation, i.e., whether the change of foreground placement would result in the change of rationality label, from labeled data to unlabeled data. Extensive experiments demonstrate that our framework can effectively enhance the generalization ability of discriminative object placement models.
[60] RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model
Yizhuo Wu,Francesco Fioranelli,Chang Gao
Main category: cs.CV
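TL;DR: RadMamba is a lightweight, radar micro-Doppler-oriented Mamba SSM for radar-based human activity recognition (HAR) that maintains high accuracy while greatly reducing computational complexity.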
Details
Motivation: Existing radar-based HAR solutions are computationally expensive, which limits their use in resource-constrained settings, so a more efficient model is needed.
Method: Proposes RadMamba, a parameter-efficient Mamba SSM architecture designed specifically for radar-based HAR.
Result: Across multiple datasets, RadMamba matches or exceeds the accuracy of existing models with far fewer parameters (e.g., 99.8% and 92.0%).
Conclusion: RadMamba performs well in radar-based HAR, combining high accuracy with low computational complexity, and suits resource-constrained settings.
Abstract: Radar-based HAR has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as ViT and SSM architectures, offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: https://github.com/lab-emi/AIRHAR.
[61] pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild
Jonas Myhre Schiøtt,Viktor Sebastian Petersen,Dimitrios P. Papadopoulos
Main category: cs.CV
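TL;DR: The paper proposes pix2pockets, a framework combining computer vision and reinforcement learning for table detection, ball localization, and optimal shot suggestion in 8-ball pool.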
Details
Motivation: Leverage advances in computer vision and reinforcement learning to address table detection, ball localization, and shot strategy in pool.
Method: Builds a dataset of 195 diverse images with manually annotated balls and table dots, develops a standardized RL environment, and trains an object detection model and a ball localization pipeline.
Result: Object detection reaches an AP50 of 91.2 and the ball localization error is only 0.4 cm; standard RL algorithms perform poorly on the shot task, but the proposed simple baseline achieves a per-shot success rate of 94.7%.
Conclusion: The pix2pockets framework performs well on pool tasks, but RL algorithms still need improvement to reach higher success rates.
Abstract: Computer vision models have seen increased usage in sports, and reinforcement learning (RL) is famous for beating humans in strategic games such as Chess and Go. In this paper, we are interested in building upon these advances and examining the game of classic 8-ball pool. We introduce pix2pockets, a foundation for an RL-assisted pool coach. Given a single image of a pool table, we first aim to detect the table and the balls and then propose the optimal shot suggestion. For the first task, we build a dataset with 195 diverse images where we manually annotate all balls and table dots, leading to 5748 object segmentation masks. For the second task, we build a standardized RL environment that allows easy development and benchmarking of any RL algorithm. Our object detection model yields an AP50 of 91.2 while our ball location pipeline obtains an error of only 0.4 cm. Furthermore, we compare standard RL algorithms to set a baseline for the shot suggestion task and we show that all of them fail to pocket all balls without making a foul move. We also present a simple baseline that achieves a per-shot success rate of 94.7% and clears a full game in a single turn 30% of the time.
[62] Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM
Zirui Pan,Xin Wang,Yipeng Zhang,Hong Chen,Kwan Man Cheng,Yaofei Wu,Wenwu Zhu
Main category: cs.CV
TL;DR: The paper proposes Modular-Cam, a new method that addresses scene decomposition and camera-view transitions in video generation from complex text prompts.
Details
Motivation: When handling complex prompts that contain dynamic scenes and multiple camera-view transformations, existing methods cannot decompose the scenes effectively or transition between them smoothly.
Method: Uses a large language model to analyze user instructions and decouple them into multiple scenes and transition actions; combines a temporal transformer with the CamOperator module to control camera movement; uses AdaControlNet to ensure consistency across scenes and adjust color tone.
Result: Experiments show that Modular-Cam can generate multi-scene videos with fine-grained control of camera movements.
Conclusion: Modular-Cam performs well at generating high-quality videos from complex prompts.
Abstract: Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods cannot decompose the overall information into separate scenes and fail to change scenes smoothly based on the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular-network-based module that controls the camera movements well. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments prove our proposed Modular-Cam's strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements.
Generated results are available at https://modular-cam.github.io.
[63] Single-shot Star-convex Polygon-based Instance Segmentation for Spatially-correlated Biomedical Objects
Trina De,Adrian Urbanski,Artur Yakimovich
Main category: cs.CV
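TL;DR: The paper proposes two architectures, HSD and HSD-WBR, that exploit spatial-correlation priors to achieve single-shot nested instance segmentation and outperform baseline methods.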
Details
Motivation: Objects in biomedical images are often spatially correlated or nested, but existing methods do not make full use of this prior knowledge.
Method: Builds the HSD and HSD-WBR architectures on StarDist, introducing spatial-correlation constraints through a joint encoder and a regularisation layer (WBR), respectively.
Result: HSD and HSD-WBR are competitive with the baselines on IoU_R and AP, outperform them on JTPR, and segment in a single shot.
Conclusion: Exploiting spatial-correlation priors enables efficient single-shot nested instance segmentation and applies to scenarios with multi-object interactions.
Abstract: Biomedical images often contain objects known to be spatially correlated or nested due to their inherent properties, leading to semantic relations. Examples include cell nuclei being nested within eukaryotic cells and colonies growing exclusively within their culture dishes. While these semantic relations bear key importance, detection tasks are often formulated independently, requiring multi-shot analysis pipelines. Importantly, spatial correlation could constitute a fundamental prior facilitating learning of more meaningful representations for tasks like instance segmentation. This knowledge has, thus far, not been utilised by the biomedical computer vision community. We argue that the instance segmentation of two or more categories of objects can be achieved in parallel. We achieve this via two architectures, HydraStarDist (HSD) and the novel HSD-WBR, both based on the widely-used StarDist (SD), to take advantage of the star-convexity of our target objects. HSD and HSD-WBR are constructed to take object interactions into account as constraints. HSD implicitly incorporates spatial correlation priors based on object interaction through a joint encoder. HSD-WBR further enforces the prior in a regularisation layer with our proposed penalty, named Within Boundary Regularisation Penalty (WBR). Both architectures achieve nested instance segmentation in a single shot. We demonstrate their competitiveness based on $IoU_R$ and AP and superiority in a new, task-relevant criterion, Joint TP rate (JTPR), compared to their baselines SD and Cellpose. Our approach can be further modified to capture partial-inclusion/-exclusion in multi-object interactions in fluorescent or brightfield microscopy or digital imaging.
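The idea behind a within-boundary penalty can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' formulation: it assumes per-pixel probability maps for an inner class (e.g., nuclei) and an outer class (e.g., cells), and it penalises inner-class probability wherever outer-class probability is low. The function name `within_boundary_penalty` and the product form `inner * (1 - outer)` are hypothetical.

```python
def within_boundary_penalty(inner_prob, outer_prob):
    """Mean inner-object probability that falls outside the outer object.

    inner_prob, outer_prob: 2D lists of per-pixel probabilities in [0, 1]
    with identical shapes. Hypothetical form: inner * (1 - outer), averaged
    over all pixels.
    """
    if len(inner_prob) != len(outer_prob):
        raise ValueError("maps must have the same number of rows")
    total, count = 0.0, 0
    for inner_row, outer_row in zip(inner_prob, outer_prob):
        if len(inner_row) != len(outer_row):
            raise ValueError("rows must have the same length")
        for p_in, p_out in zip(inner_row, outer_row):
            total += p_in * (1.0 - p_out)
            count += 1
    return total / count if count else 0.0


# Toy 2x3 maps: the outer object covers the two left columns.
outer = [[1.0, 1.0, 0.0],
         [1.0, 1.0, 0.0]]
nested = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]   # inner object lies inside the outer one
leaking = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0]]  # inner object lies outside the outer one

penalty_nested = within_boundary_penalty(nested, outer)
penalty_leaking = within_boundary_penalty(leaking, outer)
```

In this toy setup the nested prediction incurs no penalty while the leaking one does, which is the behaviour a regulariser enforcing a nesting prior would need.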
Finally, our strategy suggests gains by making this learning single-shot and computationally efficient.
[64] DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
Mengshi Qi,Pengfei Zhu,Xiangtai Li,Xiaoyang Bi,Lu Qi,Huadong Ma,Ming-Hsuan Yang
Main category: cs.CV
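TL;DR: Proposes DC-SAM, which adapts SAM and SAM2 via prompt tuning for in-context segmentation of images and videos, and validates its performance on the newly built IC-VOS benchmark.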
Details
Motivation: Existing SAM models are not directly applicable to in-context segmentation and need adapting to improve the generalization ability of segmentation models.
Method: Enhances SAM prompt-encoder features with high-quality visual prompts, designs cycle-consistent cross-attention and a dual-branch prompt encoder, and introduces a mask-tube training strategy.
Result: Reaches 55.5 mIoU on COCO-20i, 73.0 mIoU on PASCAL-5i, and a J&F score of 71.52 on the IC-VOS benchmark.
Conclusion: DC-SAM performs strongly on in-context segmentation of images and videos and provides IC-VOS, the first benchmark for video in-context segmentation.
Abstract: Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method based on prompt-tuning to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance the features of SAM's prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on fused features and initial visual prompts. Next, a dual-branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model.
Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
[65] Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar,Ali Etemad
Main category: cs.CV
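TL;DR: The paper proposes a self-alignment framework with RRPO that lets large video language models (LVLMs) learn from their own errors, improving their performance on video question-answering tasks.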
Details
Motivation: Existing LVLMs struggle with fine-grained temporal understanding, hallucination, and simple mistakes, which limits their practical use.
Method: Proposes a self-alignment framework that constructs pairs of preferred and non-preferred responses and optimizes on them with the RRPO method.
Result: RRPO outperforms DPO on tasks including video hallucination, short- and long-video understanding, and temporal reasoning.
Conclusion: The self-alignment framework and RRPO effectively improve the performance and stability of LVLMs.
Abstract: Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
[66] AttentionDrop: A Novel Regularization Method for Transformer Models
Mirza Samad Ahmed Baig,Syeda Anshrah Gillani,Abdul Akbar Khan,Shahid Munir Shah
Main category: cs.CV
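TL;DR: The paper proposes AttentionDrop, a regularization method that operates on Transformer self-attention distributions, with three variants aimed at reducing overfitting.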
Details
Motivation: Transformers perform well on large-scale tasks but overfit easily when data is limited or noisy.
Method: Proposes three AttentionDrop variants: (1) Hard Attention Masking, (2) Blurred Attention Smoothing, and (3) consistency regularization.
Result: Experiments validate the effectiveness of these methods in reducing overfitting.
Conclusion: AttentionDrop provides an effective regularizer for Transformers, especially in data-limited settings.
Abstract: Transformer-based architectures achieve state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, and speech. However, their immense capacity often leads to overfitting, especially when training data is limited or noisy. We propose AttentionDrop, a unified family of stochastic regularization techniques that operate directly on the self-attention distributions. We introduce three variants:
1. Hard Attention Masking: randomly zeroes out top-k attention logits per query to encourage diverse context utilization.
2. Blurred Attention Smoothing: applies a dynamic Gaussian convolution over attention logits to diffuse overly peaked distributions.
3. Consistency-Regularized AttentionDrop: enforces output stability under multiple independent AttentionDrop perturbations via a KL-based consistency loss.
[67] Generalized Visual Relation Detection with Diffusion Models
Kaifeng Gao,Siqi Chen,Hanwang Zhang,Jun Xiao,Yueting Zhuang,Qianru Sun
Main category: cs.CV
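TL;DR: The paper proposes Diff-VRD, which uses diffusion models to model visual relations as continuous embeddings, addressing the restriction of conventional VRD models to pre-defined relation categories.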
Details
Motivation: Existing VRD models cannot handle the semantic ambiguity of visual relations and are restricted to pre-defined categories.
Method: Uses a diffusion model to generate a sequence of relation embeddings in a latent space, conditioned on visual and text embeddings, followed by a matching stage that assigns relation words.
Result: Diff-VRD performs strongly on HOI detection and SGG benchmarks and can generate visual relations beyond the pre-defined categories.
Conclusion: Diff-VRD tackles the generalized VRD task effectively with a generative approach, and new evaluation metrics are introduced to verify its performance.
Abstract: Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., "ride" can be depicted as "race" and "sit on", from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
[68] Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image
Tao Wen,Jiepeng Wang,Yabo Chen,Shugong Xu,Chi Zhang,Xuelong Li
Main category: cs.CV
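TL;DR: The paper proposes Metric-Solver, a dynamic sliding-anchor method for metric depth estimation across the diverse depth scales of indoor and outdoor environments.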
Details
Motivation: Accurate and generalizable metric depth estimation is challenging because depth scales vary widely between indoor and outdoor environments.
Method: Uses a sliding-anchor-based representation that splits scene depth into near-field and far-field components and adapts to different depth scales by dynamically adjusting the anchor.
Result: Experiments show that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
Conclusion: Metric-Solver provides a unified, adaptive depth representation suited to diverse environments.
Abstract: Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be represented within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensure strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
[69] Logits DeConfusion with CLIP for Few-Shot Learning
Shuo Li,Fang Liu,Zehua Hao,Xinyi Wang,Lingling Li,Xu Liu,Puhua Chen,Wenping Ma
Main category: cs.CV
TL;DR: Proposes Logits DeConfusion, which uses MAF and ICD modules to address CLIP's inter-class confusion in zero-shot and few-shot learning.
Details
Motivation: CLIP suffers from serious inter-class confusion in downstream tasks, which hurts classification accuracy.
Method: Combines a Multi-level Adapter Fusion (MAF) module, which extracts and fuses multi-level features, with an Inter-Class Deconfusion (ICD) module, which removes inter-class confusion through a residual structure.
Result: Experiments show that the method significantly improves classification performance and alleviates inter-class confusion.
Conclusion: Logits DeConfusion effectively addresses CLIP's inter-class confusion; the code is open source.
Abstract: With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. The code is available at https://github.com/LiShuo1001/LDC.
[70] A Diffusion-Based Framework for Terrain-Aware Remote Sensing Image Reconstruction
Zhenyu Yu,Mohd Yamani Inda Idris,Pei Wang
Main category: cs.CV
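TL;DR: This paper proposes SatelliteMaker, a diffusion-based remote sensing image restoration method that achieves high-quality reconstruction under missing data by introducing a DEM as a conditioning input and a VGG-Adapter module.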
Details
Motivation: Remote sensing imagery is vital for environmental monitoring and related fields, but problems such as cloud cover cause data loss, and traditional methods struggle with complex missing regions and multi-band consistency.
Method: Proposes SatelliteMaker, which uses a diffusion model to reconstruct missing data, introduces a DEM as a conditioning input, and uses a VGG-Adapter module to reduce distribution discrepancy.
Result: Experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.
Conclusion: With a diffusion model and conditioning inputs, SatelliteMaker effectively addresses missing data in remote sensing imagery while maintaining multi-band consistency.
Abstract: Remote sensing imagery is essential for environmental monitoring, agricultural management, and disaster response. However, data loss due to cloud cover, sensor failures, or incomplete acquisition, especially in high-resolution and high-frequency tasks, severely limits satellite imagery's effectiveness. Traditional interpolation methods struggle with large missing areas and complex structures. Remote sensing imagery consists of multiple bands, each with distinct meanings, and ensuring consistency across bands is critical to avoid anomalies in the combined images. This paper proposes SatelliteMaker, a diffusion-based method that reconstructs missing data across varying levels of data loss while maintaining spatial, spectral, and temporal consistency. We also propose using a Digital Elevation Model (DEM) as a conditioning input and use tailored prompts to generate realistic images, making diffusion models applicable to quantitative remote sensing tasks. Additionally, we propose a VGG-Adapter module based on Distribution Loss, which reduces distribution discrepancy and ensures style consistency. Extensive experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.
[71] Remote sensing colour image semantic segmentation of trails created by large herbivorous Mammals
Jose Francisco Diez-Pastor,Francisco Javier Gonzalez-Moya,Pedro Latorre-Carmona,Francisco Javier Perez-Barbería,Ludmila I. Kuncheva,Antonio Canepa-Oneto,Alvar Arnaiz-González,Cesar Garcia-Osorio
Main category: cs.CV
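TL;DR: The paper proposes machine-learning-based algorithms for automatically identifying the grazing trails of large herbivores, in support of ecosystem conservation and management.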
Details
Motivation: Identifying spatial areas where biodiversity is at risk is essential for ecosystem conservation, and the activity of large herbivores, such as the formation of grazing trails, is a key landscape feature.
Method: Applies five semantic segmentation methods combined with fourteen encoders to drone imagery and evaluates their performance in identifying grazing trails.
Result: The UNet architecture with the MambaOut encoder performed best, successfully mapping most trails, although the trail structure was underestimated in a few cases.
Conclusion: The approach could be developed into tools for monitoring landscape change in support of habitat conservation and land management plans; it is the first to obtain competitive image segmentation results for detecting trails of large herbivores.
Abstract: Detection of spatial areas where biodiversity is at risk is of paramount importance for the conservation and monitoring of ecosystems. Large terrestrial mammalian herbivores are keystone species as their activity not only has deep effects on soils, plants, and animals, but also shapes landscapes, as large herbivores act as allogenic ecosystem engineers. One key landscape feature that indicates intense herbivore activity and potentially impacts biodiversity is the formation of grazing trails. Grazing trails are formed by the continuous trampling activity of large herbivores that can produce complex networks of tracks of bare soil. Here, we evaluated different algorithms based on machine learning techniques to identify grazing trails. Our goal is to automatically detect potential areas with intense herbivory activity, which might be beneficial for conservation and management plans. We have applied five semantic segmentation methods combined with fourteen encoders aimed at mapping grazing trails on aerial images. Our results indicate that in most cases the chosen methodology successfully mapped the trails, although there were a few instances where the actual trail structure was underestimated. The UNet architecture with the MambaOut encoder was the best architecture for mapping trails. The proposed approach could be applied to develop tools for mapping and monitoring temporal changes in these landscape structures to support habitat conservation and land management programs. This is the first time, to the best of our knowledge, that competitive image segmentation results are obtained for the detection and delineation of trails of large herbivorous mammals.
[72] Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis
Songping Wang,Yueming Lyu,Shiqi Liu,Ning Li,Tong Tong,Hao Sun,Caifeng Shan
Main category: cs.CV
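TL;DR: The paper proposes a Hierarchical Anti-Aesthetic (HAA) framework built on an aesthetic perspective; its global and local anti-aesthetic mechanisms degrade the generation quality of maliciously customized diffusion models, protecting facial privacy and copyright.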
Details
Motivation: The rise of customized diffusion models has spurred a boom in personalized visual content creation but also carries a risk of malicious misuse, threatening personal privacy and copyright protection.
Method: Proposes the HAA framework, comprising global anti-aesthetics (degrading overall aesthetic quality through a reward mechanism and a loss function) and local anti-aesthetics (disrupting local facial identity through adversarial perturbations).
Result: Experiments show that HAA significantly outperforms existing methods in identity removal.
Conclusion: HAA provides an effective tool for protecting facial privacy and copyright.
Abstract: The rise of customized diffusion models has spurred a boom in personalized visual content creation, but also poses risks of malicious misuse, severely threatening personal privacy and copyright protection. Some studies show that the aesthetic properties of images are highly positively correlated with human perception of image quality. Inspired by this, we approach the problem from a novel and intriguing aesthetic perspective to degrade the generation quality of maliciously customized models, thereby achieving better protection of facial identity. Specifically, we propose a Hierarchical Anti-Aesthetic (HAA) framework to fully explore aesthetic cues, which consists of two key branches: 1) Global Anti-Aesthetics: By establishing a global anti-aesthetic reward mechanism and a global anti-aesthetic loss, it can degrade the overall aesthetics of the generated content; 2) Local Anti-Aesthetics: A local anti-aesthetic reward mechanism and a local anti-aesthetic loss are designed to guide adversarial perturbations to disrupt local facial identity. By seamlessly integrating both branches, our HAA effectively achieves the goal of anti-aesthetics from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing SOTA methods largely in identity removal, providing a powerful tool for protecting facial privacy and copyright.
[73] Weakly Semi-supervised Whole Slide Image Classification by Two-level Cross Consistency Supervision
Linhao Qu,Shiman Li,Xiaoyuan Luo,Shaolei Liu,Qinhao Guo,Manning Wang,Zhijian Song
Main category: cs.CV
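TL;DR: The paper introduces a new problem setting, Weakly Semi-supervised Whole slide image Classification (WSWC), and proposes the CroCo framework, which addresses it through two-level cross-consistency supervision.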
Details
Motivation: Existing whole slide image (WSI) classification methods require large amounts of labeled data, which is costly and time-consuming to obtain and limits practical use. The WSWC setting is closer to clinical practice, but existing algorithms cannot solve it directly.
Method: Proposes the CroCo framework, which has two heterogeneous classifier branches trained with cross-consistency supervision at both the bag level and the instance level.
Result: Experiments on four datasets show that CroCo outperforms other methods when labeled data are limited.
Conclusion: The paper is the first to formulate the WSWC problem and solves it successfully, providing an efficient tool for clinical pathological diagnosis.
Abstract: Computer-aided Whole Slide Image (WSI) classification has the potential to enhance the accuracy and efficiency of clinical pathological diagnosis. It is commonly formulated as a Multiple Instance Learning (MIL) problem, where each WSI is treated as a bag and the small patches extracted from the WSI are considered instances within that bag. However, obtaining labels for a large number of bags is a costly and time-consuming process, particularly when utilizing existing WSIs for new classification tasks. This limitation renders most existing WSI classification methods ineffective. To address this issue, we propose a novel WSI classification problem setting, more aligned with clinical practice, termed Weakly Semi-supervised Whole slide image Classification (WSWC). In WSWC, a small number of bags are labeled, while a significant number of bags remain unlabeled. The MIL nature of the WSWC problem, coupled with the absence of patch labels, distinguishes it from typical semi-supervised image classification problems, making existing algorithms for natural images unsuitable for directly solving the WSWC problem. In this paper, we present a concise and efficient framework, named CroCo, to tackle the WSWC problem through two-level Cross Consistency supervision. CroCo comprises two heterogeneous classifier branches capable of performing both instance classification and bag classification. The fundamental idea is to establish cross-consistency supervision at both the bag-level and instance-level between the two branches during training. Extensive experiments conducted on four datasets demonstrate that CroCo achieves superior bag classification and instance classification performance compared to other comparative methods when limited WSIs with bag labels are available. To the best of our knowledge, this paper presents for the first time the WSWC problem and gives a successful resolution.
[74] FocusedAD: Character-centric Movie Audio Description
Xiaojun Ye,Chun Wang,Yiren Song,Sheng Zhou,Liangcheng Li,Jiajun Bu
Main category: cs.CV
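TL;DR: FocusedAD is a new framework for generating character-centric movie audio descriptions; its modular design addresses the challenges of character identification and plot-relevant narration, and it achieves state-of-the-art performance on multiple benchmarks.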
Details
Motivation: Movie audio description (AD) must provide plot-relevant narration for blind and visually impaired audiences, in particular with explicit references to character names, which poses unique challenges for movie understanding.
Method: FocusedAD comprises a Character Perception Module (CPM) that tracks character regions and links them to names, a Dynamic Prior Module (DPM) that injects contextual cues via learnable soft prompts, and a Focused Caption Module (FCM) that generates plot-relevant narration.
Result: FocusedAD achieves state-of-the-art performance on multiple benchmarks, including zero-shot results on MAD-eval-Named and the newly proposed Cinepile-AD dataset.
Conclusion: Through its modular design and an automated character query bank, FocusedAD effectively addresses character identification and plot narration in AD and has broad application potential.
Abstract: Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at https://github.com/Thorin215/FocusedAD.
[75] CodingHomo: Bootstrapping Deep Homography With Video Coding
Yike Liu,Haipeng Li,Shuaicheng Liu,Bing Zeng
Main category: cs.CV
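TL;DR: Proposes CodingHomo, an unsupervised homography estimation method based on video coding that improves accuracy through motion vectors and mask-guided modules.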
Details
Motivation: Existing deep learning methods still estimate homography inaccurately under complex motion; a more robust and general solution is needed.
Method: Uses motion vectors from videos together with a Mask-Guided Fusion (MGF) module and a Mask-Guided Homography Estimation (MGHE) module for coarse-to-fine refinement.
Result: CodingHomo performs best among unsupervised methods, with high robustness and generalizability.
Conclusion: By combining video coding with mask-guided modules, CodingHomo significantly improves the accuracy and robustness of homography estimation.
Abstract: Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo
[76] RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning
Yuan Luo,Rudolf Hoffmann,Yan Xia,Olaf Wysocki,Benedikt Schwab,Thomas H. Kolbe,Daniel Cremers
Main category: cs.CV
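TL;DR: The paper proposes RADLER, a new neural network that uses contrastive self-supervised learning and semantic 3D city models to improve radar object detection, and validates it on the new RadarCity dataset.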
Details
Motivation: Semantic 3D city models carry rich prior information, but their potential for reducing noise in radar object detection has not been fully explored.
Method: Extracts robust radar features with a self-supervised learning network and fuses them with semantic-depth features from semantic 3D city models.
Result: On the RadarCity dataset, RADLER improves mean average precision (mAP) and mean average recall (mAR) by 5.46% and 3.51%, respectively.
Conclusion: The work offers a new direction for semantic-guided and map-supported radar object detection, and the project page is public to encourage further research.
Abstract: Semantic 3D city models are worldwide easy-accessible, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via a SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at https://gpp-communication.github.io/RADLER.
[77] Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline
Joanne Lin,Crispian Morris,Ruirui Lin,Fan Zhang,David Bull,Nantheera Anantrasirichai
Main category: cs.CV
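TL;DR: The paper proposes a new Degradation Estimation Network (DEN) that generates realistic sRGB noise without camera metadata and is trained in a self-supervised manner. The method performs well on low-light tasks.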
Details
Motivation: Low-light conditions are challenging for both human and machine annotation, leaving low-light images and videos under-researched. Existing methods often rely on unrealistic noise models.
Method: Proposes DEN, which is trained in a self-supervised manner to estimate the parameters of physical noise distributions and generate diverse, realistic noise.
Result: On synthetic noise replication, video enhancement, and object detection, the method yields improvements of up to 24% KLD, 21% LPIPS, and 62% AP_{50-95}, respectively.
Conclusion: DEN generates more realistic noise and significantly improves performance on low-light tasks.
Abstract: Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24% KLD, 21% LPIPS, and 62% AP$_{50-95}$, respectively.
[78] CoMotion: Concurrent Multi-person 3D Motion
Alejandro Newell,Peiyun Hu,Lahav Lipson,Stephan R. Richter,Vladlen Koltun
Main category: cs.CV
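TL;DR: Proposes a method for detecting and tracking detailed 3D poses of multiple people from a monocular camera stream, with temporally coherent predictions in crowded scenes.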
Details
Motivation: Addresses the difficulty of multi-person 3D pose tracking in crowded scenes caused by occlusion and complex poses.
Method: Combines per-frame detection with a learned pose update; poses are updated directly from each new input image, which enables online tracking.
Result: The model matches state-of-the-art 3D pose estimation accuracy while being faster and more accurate at multi-person tracking.
Conclusion: The method achieves efficient multi-person 3D pose tracking in complex scenes; code and weights are open source.
Abstract: We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion
[79] Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI
Mahdi Alehdaghi,Rajarshi Bhattacharya,Pourya Shamsolmoali,Rafael M. O. Cruz,Maguelonne Heritier,Eric Granger
Main category: cs.CV
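TL;DR: PCMNet dynamically learns interpretable prototypes and, through clustering and concept activation vector extraction, improves the interpretability and robustness of deep models.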
Details
Motivation: Existing post-hoc explanation methods (such as GradCAM) lack conceptual clarity, while prototype-based methods (such as ProtoPNet) rely on fixed regions, which limits robustness and semantic consistency.
Method: PCMNet dynamically learns prototypes and clusters them into concept groups, combining unsupervised part discovery with concept activation vector extraction to produce semantically grounded explanations.
Result: Experiments show that PCMNet achieves high interpretability, stability, and robustness across multiple datasets, particularly under occlusion.
Conclusion: PCMNet offers a more flexible and semantically consistent way to explain deep models and is suited to complex scenes.
Abstract: Deep learning has provided considerable advancements for multimedia systems, yet the interpretability of deep models remains a challenge. State-of-the-art post-hoc explainability methods, such as GradCAM, provide visual interpretation based on heatmaps but lack conceptual clarity. Prototype-based approaches, like ProtoPNet and PIPNet, offer a more structured explanation but rely on fixed patches, limiting their robustness and semantic consistency. To address these limitations, a part-prototypical concept mining network (PCMNet) is proposed that dynamically learns interpretable prototypes from meaningful regions. PCMNet clusters prototypes into concept groups, creating semantically grounded explanations without requiring additional annotations. Through a joint process of unsupervised part discovery and concept activation vector extraction, PCMNet effectively captures discriminative concepts and makes interpretable classification decisions. Our extensive experiments comparing PCMNet against state-of-the-art methods on multiple datasets show that it can provide a high level of interpretability, stability, and robustness under clean and occluded scenarios.
[80] Towards Realistic Low-Light Image Enhancement via ISP Driven Data Modeling
Zhihua Wang,Yu Long,Qinghua Lin,Kai Zhang,Yazhu Zhang,Yuming Fang,Li Liu,Xiaochun Cao
Main category: cs.CV
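TL;DR: Proposes an image signal processing (ISP) driven data synthesis method that generates diverse training data for low-light image enhancement and significantly improves model performance.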
Details
Motivation: Because existing low-light image enhancement methods lack diverse training data, their outputs suffer from problems such as amplified noise and incorrect white balance.
Method: Converts normal-light images to RAW format with a reverse ISP, synthesizes low-light degradations in the RAW domain, and then generates diverse training data through multi-stage ISP processing.
Result: Experiments show that a UNet model trained on the synthesized data outperforms existing methods both quantitatively and qualitatively.
Conclusion: The proposed data synthesis method effectively addresses the shortage of training data and significantly improves low-light image enhancement performance.
Abstract: Deep neural networks (DNNs) have recently become the leading method for low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real world applications. A key challenge is the lack of diverse, large scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP) driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first unprocessed into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data is subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthetic pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
[81] Uncertainty-Guided Coarse-to-Fine Tumor Segmentation with Anatomy-Aware Post-Processing
Ilkin Sevgi Isler,David Mohaisen,Curtis Lisle,Damla Turgut,Ulas Bagci
Main category: cs.CV
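TL;DR: Proposes an uncertainty-guided coarse-to-fine segmentation framework that combines full-volume tumor localization with refined ROI segmentation and improves results through anatomy-aware post-processing.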
Details
Motivation: Addresses boundary ambiguity, class imbalance, and anatomical variability in tumor segmentation on thoracic CT.
Method: Uses a two-stage model: the first stage produces a coarse prediction, and the second stage refines ROI segmentation with uncertainty-aware loss functions, combined with anatomically informed post-processing.
Result: On private and public datasets, Dice and Hausdorff scores improve, false positives decrease, and spatial interpretability is enhanced. On the Orlando dataset, Dice rises from 0.4690 to 0.6447.
Conclusion: A segmentation framework that combines uncertainty modeling with anatomical priors achieves more robust and clinically meaningful tumor segmentation.
Abstract: Reliable tumor segmentation in thoracic computed tomography (CT) remains challenging due to boundary ambiguity, class imbalance, and anatomical variability. We propose an uncertainty-guided, coarse-to-fine segmentation framework that combines full-volume tumor localization with refined region-of-interest (ROI) segmentation, enhanced by anatomically aware post-processing. The first-stage model generates a coarse prediction, followed by anatomically informed filtering based on lung overlap, proximity to lung surfaces, and component size. The resulting ROIs are segmented by a second-stage model trained with uncertainty-aware loss functions to improve accuracy and boundary calibration in ambiguous regions. Experiments on private and public datasets demonstrate improvements in Dice and Hausdorff scores, with fewer false positives and enhanced spatial interpretability. These results highlight the value of combining uncertainty modeling and anatomical priors in cascaded segmentation pipelines for robust and clinically meaningful tumor delineation. On the Orlando dataset, our framework improved Swin UNETR Dice from 0.4690 to 0.6447. Reduction in spurious components was strongly correlated with segmentation gains, underscoring the value of anatomically informed post-processing.
[82] Coding-Prior Guided Diffusion Network for Video Deblurring
Yike Liu,Jianhui Zhang,Haipeng Li,Shuaicheng Liu,Bing Zeng
Main category: cs.CV
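TL;DR: CPGDNet is a two-stage video deblurring framework that exploits motion vectors and coding residuals from video codecs together with a pre-trained diffusion generative model, significantly improving perceptual quality.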
Details
Motivation: Existing video deblurring methods overlook the motion vectors and coding residuals available from video codecs, as well as the rich knowledge in pre-trained diffusion generative models.
Method: CPGDNet has two modules: CPFP uses motion vectors for frame alignment and coding residuals to generate attention masks; CPC integrates the coding priors into a pre-trained diffusion model to enhance critical regions and synthesize details.
Result: Experiments show that the method achieves state-of-the-art perceptual quality, with up to 30% improvement in IQA metrics.
Conclusion: By combining coding priors with generative diffusion priors, CPGDNet significantly improves video deblurring; the code and dataset will be open-sourced.
Abstract: While recent video deblurring methods have advanced significantly, they often overlook two valuable sources of prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the coding-prior-augmented dataset will be open-sourced.
[83] Cobra: Efficient Line Art COlorization with BRoAder References
Junhao Zhuang,Lingen Li,Xuan Ju,Zhaoyang Zhang,Chun Yuan,Ying Shan
Main category: cs.CV
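TL;DR: Cobra is an efficient method for colorizing comic line art that supports color hints and large numbers of reference images, addressing the speed and control limitations of existing diffusion models.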
Details
Motivation: The comic industry needs line art colorization that is accurate, efficient, contextually consistent, and flexibly controllable; existing diffusion models are limited in reference image handling, inference speed, and flexibility.
Method: Cobra uses a Causal Sparse DiT architecture with specially designed positional encodings, causal sparse attention, and a key-value cache to manage long-context references and maintain color consistency.
Result: Cobra achieves accurate line art colorization through extensive contextual references and significantly improves inference speed and interactivity, meeting industrial demands.
Conclusion: Cobra offers an efficient and flexible solution for comic line art colorization that overcomes the shortcomings of existing techniques and is suited to industrial use.
Abstract: The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.
[84] SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction
Xia Wang,Haiyang Sun,Tiantian Cao,Yueying Sun,Min Feng
Main category: cs.CV
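TL;DR: This paper proposes SIDME, a self-supervised image demoiréing method that uses a masked encoder-decoder architecture and self-supervised learning to handle moiré patterns effectively, outperforming existing methods in experiments.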
Details
Motivation: Traditional demoiréing methods usually process the image as a whole, ignoring the signal characteristics of different color channels, and are not robust enough on real-world data.
Method: SIDME combines a masked encoder-decoder architecture with self-supervised learning, designs a self-supervised loss function targeting the green channel, and develops a moiré image generation method that mimics real-world conditions.
Result: Experiments show that SIDME performs well on real moiré data, with stronger generalization and robustness.
Conclusion: SIDME provides an efficient and robust solution for image demoiréing.
Abstract: Moiré patterns, resulting from aliasing between object light signals and camera sampling frequencies, often degrade image quality during capture. Traditional demoiréing methods have generally treated images as a whole for processing and training, neglecting the unique signal characteristics of different color channels. Moreover, the randomness and variability of moiré pattern generation pose challenges to the robustness of existing methods when applied to real-world data. To address these issues, this paper presents SIDME (Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction), a novel model designed to generate high-quality visual images by effectively processing moiré patterns. SIDME combines a masked encoder-decoder architecture with self-supervised learning, allowing the model to reconstruct images using the inherent properties of camera sampling frequencies. A key innovation is the random masked image reconstructor, which utilizes an encoder-decoder structure to handle the reconstruction task. Furthermore, since the green channel in camera sampling has a higher sampling frequency compared to red and blue channels, a specialized self-supervised loss function is designed to improve the training efficiency and effectiveness. To ensure the generalization ability of the model, a self-supervised moiré image generation method has been developed to produce a dataset that closely mimics real-world conditions. Extensive experiments demonstrate that SIDME outperforms existing methods in processing real moiré pattern data, showing its superior generalization performance and robustness.
[85] Human Aligned Compression for Robust Models
Samuel Räber,Andreas Plesner,Till Aczel,Roger Wattenhofer
Main category: cs.CV
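TL;DR: The paper studies human-aligned lossy compression as a defense against adversarial attacks and finds that learned compression methods (such as HiFiC and ELIC) outperform traditional JPEG, particularly in preserving semantic content while removing adversarial noise.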
Details
Motivation: Adversarial attacks threaten the robustness of image models by introducing small perturbations; the study aims to explore effective defense mechanisms.
Method: Compares learned compression models (HiFiC and ELIC) with traditional JPEG at different quality levels and tests the effect of sequential compression.
Result: Learned compression methods remove adversarial noise effectively while preserving semantic content, and sequential compression further improves the defense.
Conclusion: Human-aligned compression is an efficient and practical defense that improves model robustness against adversarial attacks.
Abstract: Adversarial attacks on image models threaten system robustness by introducing imperceptible perturbations that cause incorrect predictions. We investigate human-aligned learned lossy compression as a defense mechanism, comparing two learned models (HiFiC and ELIC) against traditional JPEG across various quality levels. Our experiments on ImageNet subsets demonstrate that learned compression methods outperform JPEG, particularly for Vision Transformer architectures, by preserving semantically meaningful content while removing adversarial noise. Even in white-box settings where attackers can access the defense, these methods maintain substantial effectiveness. We also show that sequential compression (applying rounds of compression/decompression) significantly enhances defense efficacy while maintaining classification performance. Our findings reveal that human-aligned compression provides an effective, computationally efficient defense that protects the image features most relevant to human and machine understanding. It offers a practical approach to improving model robustness against adversarial threats.
[86] FLIP Reasoning Challenge
Andreas Plesner,Turlan Kuzhagaliyev,Roger Wattenhofer
Main category: cs.CV
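TL;DR: The paper introduces the FLIP dataset for evaluating AI reasoning on human verification tasks from the Idena blockchain, and finds that existing models perform far below human level in zero-shot settings.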
Details
Motivation: AI performs well on perception and generation tasks, but its reasoning ability remains limited, and new benchmarks are needed to drive progress.
Method: Uses the FLIP dataset to test sequential reasoning, visual storytelling, and common sense, evaluating both vision-language models and large language models.
Result: The best open-source and closed-source models reach accuracies of 75.5% and 77.9%, respectively, far below human performance of 95.3%; an ensemble of 15 models raises accuracy to 85.2%.
Conclusion: Existing reasoning models have clear limitations, and FLIP provides an important benchmark for multimodal AI systems.
Abstract: Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than when using the raw images directly, 69.6% vs. 75.2% for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
[87] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate
Zhihang Yuan,Rui Xie,Yuzhang Shang,Hanling Zhang,Siyuan Wang,Shengen Yan,Guohao Dai,Yu Wang
Main category: cs.CV
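TL;DR: VGDFR is a training-free method that dynamically adjusts the latent frame rate in diffusion-based video generation; by adaptively adjusting the number of latent-space elements, it significantly improves efficiency.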
Details
Motivation: Existing DiT video generation models have high computational demands, while real videos have dynamic information density: high-motion segments need more detail preserved.
Method: Proposes a dynamic frame rate scheduler, a latent-space frame merging method, and a tailored RoPE strategy.
Result: Experiments show that VGDFR achieves up to a 3x speedup with minimal quality loss.
Conclusion: VGDFR addresses the efficiency problem of DiT video generation while preserving generation quality.
Abstract: Diffusion Transformer (DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.
[88] Towards Learning to Complete Anything in Lidar
Ayca Takmaz,Cristiano Saltori,Neehar Peri,Tim Meinhardt,Riccardo de Lutio,Laura Leal-Taixé,Aljoša Ošep
Main category: cs.CV
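TL;DR: CAL is a zero-shot Lidar-based shape completion method that mines object shapes and semantic features from multi-modal sensor sequences and is suited to open-vocabulary settings.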
Details
Motivation: Existing methods can only complete and recognize objects from a closed vocabulary; CAL aims to overcome this with a zero-shot approach.
Method: Uses the temporal context of multi-modal sensor sequences to mine object shapes and semantic features, then distills them into a Lidar-only model.
Result: The model infers full object shapes from partial observations and recognizes open-vocabulary objects on standard benchmarks.
Conclusion: CAL demonstrates the potential to complete and recognize objects in open-vocabulary settings.
Abstract: We propose CAL (Complete Anything in Lidar) for Lidar-based shape-completion in-the-wild. This is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. Different to that, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies. Our project page is https://research.nvidia.com/labs/dvl/projects/complete-anything-lidar
[89] Beyond Reconstruction: A Physics Based Neural Deferred Shader for Photo-realistic Rendering
Zhuo He,Paul Henderson,Nicolas Pugeault
Main category: cs.CV
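TL;DR: The paper proposes a physics-based neural deferred shading pipeline that decomposes the data-driven rendering process and learns a generalizable shading function for photo-realistic shading and relighting, together with an efficient shadow estimator.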
Details
Motivation: Existing deep learning based rendering methods struggle to decompose illumination and material parameters, which limits control over the reconstructed scene.
Method: A physics-based neural deferred shading pipeline that decomposes the data-driven rendering process and learns a generalizable shading function, plus a shadow estimator.
Result: The model outperforms classical models and a state-of-the-art neural shading model, and supports generalizable photo-realistic shading from arbitrary illumination input.
Conclusion: The method improves the controllability and realism of rendering, with applications such as movie visual effects and scene building in video games.
Abstract: Deep learning based rendering has demonstrated major improvements for photo-realistic image synthesis, applicable to various applications including visual effects in movies and photo-realistic scene building in video games. However, a significant limitation is the difficulty of decomposing the illumination and material parameters, which limits such methods to reconstruct an input scene, without any possibility to control these parameters. This paper introduces a novel physics based neural deferred shading pipeline to decompose the data-driven rendering process and learn a generalizable shading function to produce photo-realistic results for shading and relighting tasks. We also provide a shadow estimator to efficiently mimic shadowing effects. Our model achieves improved performance compared to classical models and a state-of-the-art neural shading model, and enables generalizable photo-realistic shading from arbitrary illumination input.
[90] The Tenth NTIRE 2025 Image Denoising Challenge Report
Lei Sun,Hang Guo,Bin Ren,Luc Van Gool,Radu Timofte,Yawei Li,Xiangyu Kong,Hyunhee Park,Xiaoxuan Yu,Suejin Han,Hakjae Jeon,Jia Li,Hyung-Ju Chun,Donghun Ryou,Inju Ha,Bohyung Han,Jingyu Ma,Zhijuan Huang,Huiyuan Fu,Hongyuan Yu,Boqi Zhang,Jiawei Shi,Heng Zhang,Huadong Ma,Deepak Kumar Tyagi,Aman Kukretti,Gajender Sharma,Sriharsha Koundinya,Asim Manna,Jun Cheng,Shan Tan,Jun Liu,Jiangwei Hao,Jianping Luo,Jie Lu,Satya Narayan Tazi,Arnim Gautam,Aditi Pawar,Aishwarya Joshi,Akshay Dudhane,Praful Hambadre,Sachin Chaudhary,Santosh Kumar Vipparthi,Subrahmanyam Murala,Jiachen Tu,Nikhil Akalwadi,Vijayalaxmi Ashok Aralikatti,Dheeraj Damodar Hegde,G Gyaneshwar Rao,Jatin Kalal,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Zhenyuan Lin,Yubo Dong,Weikun Li,Anqi Li,Ang Gao,Weijun Yuan,Zhan Li,Ruting Deng,Yihang Chen,Yifan Deng,Zhanglu Chen,Boyang Yao,Shuling Zheng,Feng Zhang,Zhiheng Fu,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Jan Seny,Pei Zhou,Jianhua Hu,K. L. Eddie Law,Jaeho Lee,M. J. Aashik Rasool,Abdur Rehman,SMA Sharif,Seongwan Kim,Alexandru Brateanu,Raul Balmez,Ciprian Orhei,Cosmin Ancuti,Zeyu Xiao,Zhuoyuan Li,Ziqi Wang,Yanyan Wei,Fei Wang,Kun Li,Shengeng Tang,Yunkai Zhang,Weirun Zhou,Haoxuan Lu
Main category: cs.CV
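TL;DR: An overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), covering the proposed methods and their results; the goal is to develop high-performance denoising network architectures.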
Details
Motivation: Develop a network architecture for high-quality denoising, without constraints on computational complexity or model size, evaluated quantitatively with PSNR.
Method: The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50; 290 participants registered and 20 teams submitted valid results.
Result: The challenge shows the current state of the art in image denoising.
Conclusion: The challenge documents recent progress in image denoising and provides a performance evaluation of the submitted methods.
Abstract: This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
[91] How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
Aditya Prakash,Benjamin Lundell,Dmitry Andreychuk,David Forsyth,Saurabh Gupta,Harpreet Sawhney
Main category: cs.CV
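TL;DR: The paper proposes a method that predicts 3D hand motion and contact maps from a single RGB view, action text, and a 3D contact point, combining a VQVAE with a transformer module, and validates it on a large-scale benchmark.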
Details
Motivation: Tackle the challenging problem of predicting 3D hand motion and contact maps from single-view RGB input, in support of more natural interaction scenarios.
Method: A VQVAE learns a latent codebook of hand poses and contact points (Interaction Codebook), and a transformer module (Interaction Predictor) predicts the interaction trajectory. A data engine extracts training data from the HoloAssist dataset.
Result: The evaluation benchmark is 2.5-10X larger than those of existing works in the diversity of objects and interactions, and the model generalizes across object categories, action categories, tasks, and scenes.
Conclusion: The proposed method outperforms transformer and diffusion baselines across all settings.
Abstract: We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.
[92] SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Liam Schoneveld,Zhe Chen,Davide Davoli,Jiapeng Tang,Saimon Terazawa,Ko Nishino,Matthias Nießner
Main category: cs.CV
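TL;DR: Proposes SHeaP, a self-supervised head geometry predictor learned via 2D Gaussians; it predicts a 3DMM mesh together with Gaussians rigged to it and substantially improves 3D reconstruction from monocular images and videos.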
Details
Motivation: Large-scale 3D ground-truth data is hard to obtain, so existing methods learn from 2D videos in a self-supervised manner, but conventional differentiable mesh rendering has limitations.
Method: Predicts a 3DMM mesh and a set of Gaussians rigged to the mesh, reanimates the avatar to match a target frame, and backpropagates photometric losses to both the 3DMM and Gaussian prediction networks.
Result: Surpasses existing self-supervised methods in geometric evaluations on the NoW benchmark and on a new benchmark for non-neutral expressions, while producing more expressive meshes.
Conclusion: Rendering with Gaussians substantially improves self-supervised learning; SHeaP outperforms prior work in geometric evaluation and emotion classification.
Abstract: Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
cs.CL [Back]
[93] SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Hardy Chen,Haoqin Tu,Fali Wang,Hui Liu,Xianfeng Tang,Xinya Du,Yuyin Zhou,Cihang Xie
Main category: cs.CL
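TL;DR: The study finds that supervised fine-tuning (SFT) can hinder subsequent reinforcement learning (RL) by inducing pseudo reasoning paths imitated from expert models; it introduces the VLAA-Thinking dataset and an RL approach built on GRPO that improves model performance.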
Details
Motivation: Expose the negative effect of SFT on RL and explore more effective training paradigms for improving the reasoning ability of large vision-language models.
Method: Introduces the VLAA-Thinking dataset, combines GRPO with a mixed reward module, and compares SFT, RL, and their combinations.
Result: The VLAA-Thinker model achieves top-1 performance among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%.
Conclusion: SFT can lock models into rigid, imitative reasoning modes, whereas the RL approach fosters more genuine, adaptive reasoning behavior.
Abstract: This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%.
We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.
[94] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Jiazhan Feng,Shijue Huang,Xingwei Qu,Ge Zhang,Yujia Qin,Baoquan Zhong,Chengquan Jiang,Jinxin Chi,Wanjun Zhong
Main category: cs.CL
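TL;DR: ReTool improves structured problem solving through dynamic code execution and automated RL training, and significantly outperforms a text-only baseline.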
Details
Motivation: Existing reasoning models perform poorly on structured problems (such as geometric reasoning and equation solving) where computational tools like code interpreters have an advantage, so the two need to be combined.
Method: Proposes ReTool, which combines dynamic code execution with automated RL training: a base model is fine-tuned on synthetic data, and task outcomes are then used as rewards to refine the tool-invocation strategy.
Result: On the AIME benchmark, ReTool-32B reaches 67% accuracy versus 40% for the text-based baseline, and 72.5% in extended settings, surpassing OpenAI's o1-preview by 27.9%.
Conclusion: ReTool shows the promise of outcome-driven tool integration for complex mathematical reasoning and offers new insights into hybrid neuro-symbolic systems.
Abstract: While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model in learning when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: Our 32B model achieves 67% accuracy with 400 training steps, outperforming text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use.
These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
[95] AskQE: Question Answering as Automatic Evaluation for Machine Translation
Dayeon Ki,Kevin Duh,Marine Carpuat
Main category: cs.CL
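TL;DR: AskQE is a question generation and answering framework that detects machine translation errors and provides feedback, helping users who do not know the target language judge translation quality.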
Details
Motivation: Existing MT error detection and quality estimation techniques do not meet the practical need of monolingual users to judge translation quality.
Method: Using the ContraTICO dataset, the authors explore design choices and develop an optimized AskQE that relies on LLaMA-3 70B and entailed facts to guide question generation.
Result: On the BioMQM dataset, AskQE achieves higher Kendall's Tau correlation and decision accuracy with human ratings than other quality estimation metrics.
Conclusion: AskQE gives monolingual users an effective tool for assessing machine translation quality.
Abstract: How can a monolingual English speaker determine whether an automatic translation in French is good enough to be shared? Existing MT error detection and quality estimation (QE) techniques do not address this practical scenario. We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback, helping users decide whether to accept or reject MT outputs even without the knowledge of the target language. Using ContraTICO, a dataset of contrastive synthetic MT errors in the COVID-19 domain, we explore design choices for AskQE and develop an optimized version relying on LLaMA-3 70B and entailed facts to guide question generation. We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall's Tau correlation and decision accuracy with human ratings compared to other QE metrics.
[96] Improving Instruct Models for Free: A Study on Partial Adaptation
Ozan İrsoy,Pengxiang Cheng,Jennifer L. Chen,Daniel Preoţiuc-Pietro,Shiyue Zhang,Duccio Pappadopulo
Main category: cs.CL
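TL;DR: The study finds that scaling down the strength of instruction tuning via partial adaptation materially improves performance on a multi-task few-shot in-context learning benchmark, at the cost of some instruction-following ability.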
Details
Motivation: Examine how instruction tuning affects model performance, in particular the degradation it may cause in few-shot in-context learning.
Method: Uses partial adaptation to scale down the strength of instruction tuning, with experiments across several model families and sizes.
Result: Reducing the strength of instruction tuning materially improves few-shot learning performance but lowers instruction-following ability.
Conclusion: The study reveals a trade-off between in-context few-shot learning and instruction following that is worth considering in practice.
Abstract: Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterpart. While the model gains instruction following ability, instruction tuning may lead to forgetting the knowledge from pre-training or it may encourage the model being overly conversational or verbose. This, in turn, can lead to degradation of in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction-tuning via the partial adaptation method. We show that, across several model families and model sizes, reducing the strength of instruction-tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction following ability as measured by AlpacaEval. Our study shines light on the potential trade-off between in-context learning and instruction following abilities that is worth considering in practice.
[97] Higher-Order Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions
Minwoo Kang,Suhong Moon,Seung Hyeong Lee,Ayush Raj,Joseph Suh,David M. Chan
Main category: cs.CL
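TL;DR: The paper proposes a new method that generates detailed backstories for virtual personas so that large language models more accurately simulate how humans perceive different social groups, for use in political science research.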
Details
Motivation: Address the shortcomings of large language models in simulating how humans perceive social groups, which matters for political science topics such as polarization dynamics and inter-group conflict.
Method: Builds more realistic virtual personas by generating detailed backstories in the form of extended, multi-turn interview transcripts.
Result: Personas conditioned on the backstories replicate human response distributions more closely (up to an 87% improvement as measured by Wasserstein distance) and produce effect sizes that match the original studies.
Conclusion: The method broadens the applicability of large language models to a wider range of human studies.
Abstract: Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses during the early phases of survey design. While previous studies have examined whether models can reflect individual opinions or attitudes, we argue that a higher-order binding of virtual personas requires successfully approximating not only the opinions of a user as an identified member of a group, but also the nuanced ways in which that user perceives and evaluates those outside the group. In particular, faithfully simulating how humans perceive different social groups is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies. Altogether, our work extends the applicability of LLMs beyond estimating individual self-opinions, enabling their use in a broader range of human studies.
[98] Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
Takashi Morita,Timothy J. O'Donnell
Main category: cs.CL
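TL;DR: The study shows that Germanic and Latinate words in English can be distinguished from phonotactic information alone, without relying on etymological knowledge.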
Details
Motivation: Investigate how language learners could distinguish words of different origins without knowing their etymology.
Method: Applies unsupervised clustering to words extracted from a corpus.
Result: The resulting clusters largely align with the etymological classes, and the analysis uncovers new linguistic features.
Conclusion: Phonotactic information can be used to distinguish word origins, which suggests new directions for future experimental studies.
Abstract: Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. When seeing them as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our findings also uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.
[99] Enhancing Web Agents with Explicit Rollback Mechanisms
Zhisong Zhang,Tianqing Fang,Kaixin Ma,Wenhao Yu,Hongming Zhang,Haitao Mi,Dong Yu
Main category: cs.CL
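TL;DR: The paper proposes an explicit rollback mechanism that strengthens the planning and search abilities of web agents in complex, dynamic environments.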
Details
Motivation: Existing greedy one-way search strategies struggle to recover from erroneous states, so web agents need more flexibility.
Method: Introduces an explicit rollback mechanism that lets the agent revert to a previous state in its navigation trajectory.
Result: Experiments in zero-shot and fine-tuning settings demonstrate the effectiveness of the approach.
Conclusion: The explicit rollback mechanism makes web navigation more effective and efficient.
Abstract: With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
[100] Selective Attention Federated Learning: Improving Privacy and Efficiency for Clinical Text Classification
Yue Li,Lihong Zhang
Main category: cs.CL
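TL;DR: SAFL is a federated learning method that dynamically fine-tunes only the attention-critical transformer layers, substantially reducing communication overhead and improving privacy protection.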
Details
Motivation: Address the communication overhead and model privacy problems of federated learning when training large language models.
Method: Uses attention patterns to dynamically identify the critical transformer layers and fine-tunes only those.
Result: On clinical NLP tasks, performance is competitive with centralized models, with substantially better communication efficiency and privacy preservation.
Conclusion: SAFL offers an efficient, privacy-preserving option for applying federated learning in healthcare.
Abstract: Federated Learning (FL) faces major challenges regarding communication overhead and model privacy when training large language models (LLMs), especially in healthcare applications. To address these, we introduce Selective Attention Federated Learning (SAFL), a novel approach that dynamically fine-tunes only those transformer layers identified as attention-critical. By employing attention patterns to determine layer importance, SAFL significantly reduces communication bandwidth and enhances differential privacy resilience. Evaluations on clinical NLP benchmarks (i2b2 Clinical Concept Extraction and MIMIC-III discharge summaries) demonstrate that SAFL achieves competitive performance with centralized models while substantially improving communication efficiency and privacy preservation.
[101] Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture
Biao Fu,Donglei Yu,Minpeng Liao,Chengxi Li,Yidong Chen,Kai Fan,Xiaodong Shi
Main category: cs.CL
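TL;DR: EASiST is an efficient, adaptive simultaneous speech translation method that uses a fully unidirectional architecture and a multi-stage training strategy to improve the trade-off between latency and translation quality.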
Details
Motivation: Existing LLM-based simultaneous speech translation methods either incur heavy computational overhead or depend on a fixed read/write policy, which limits efficiency and performance.
Method: EASiST uses a fully unidirectional architecture with a multi-latency data curation strategy, explicit read/write tokens, and a lightweight policy head, and optimizes translation and policy behavior through multi-stage training.
Result: On the MuST-C datasets, EASiST offers better latency-quality trade-offs than several strong baselines.
Conclusion: With its architecture and training strategy, EASiST addresses key challenges of simultaneous speech translation.
Abstract: Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding of bidirectional speech encoder, or they depend on a fixed read/write policy, limiting the efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional architecture, including both speech encoder and LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on the MuST-C En→De and En→Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
[102] ARWI: Arabic Write and Improve
Kirill Chirkunov,Bashar Alhafni,Chatrine Qwaider,Nizar Habash,Ted Briscoe
Main category: cs.CL
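TL;DR: ARWI is a writing assistant for Modern Standard Arabic that offers grammatical error correction, automated essay scoring, and related features, filling a gap in Arabic writing tools.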
Details
Motivation: Arabic has a very large speaker population, yet advanced writing assistance tools for it are scarce; ARWI aims to fill this gap.
Method: ARWI integrates a prompt database, a text editor, grammatical error detection and correction, and automated essay scoring, and supports data collection for research.
Result: A preliminary user study shows that ARWI provides actionable feedback that helps learners improve their writing.
Conclusion: ARWI gives Arabic learners a practical writing assistant and supports related research.
Abstract: Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment. Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.
[103] Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Julia Kreutzer,Eleftheria Briakou,Sweta Agrawal,Marzieh Fadaee,Kocmi Tom
Main category: cs.CL
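TL;DR: The paper examines how the generative abilities of multilingual large language models (mLLMs) are evaluated and, drawing on experience from machine translation (MT) evaluation, proposes improvements.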
Details
Motivation: Current evaluation of mLLM generative abilities lacks comprehensiveness, scientific rigor, and consistency, which limits its potential to guide model development.
Method: Runs targeted experiments that apply best practices from MT evaluation, analyzes quality differences between models, and identifies essential components of meta-evaluation.
Result: Shows how MT evaluation practices deepen mLLM evaluation and offers recommendations for improving evaluation methods.
Conclusion: Distills the insights into a checklist of actionable recommendations for mLLM research and development.
Abstract: Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.
[104] Could Thinking Multilingually Empower LLM Reasoning?
Changjiang Gao,Xu Huang,Wenhao Zhu,Shujian Huang,Lei Li,Fei Yuan
Main category: cs.CL
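TL;DR: The study finds that certain non-English languages can outperform English on reasoning tasks, and that the upper bound of multilingual reasoning is nearly 10 Acc@k points higher and more robust than English-only reasoning.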
Details
Motivation: Explore the potential of multilingual reasoning in large language models, including its upper bound and the challenges in reaching it.
Method: Studies the upper bound of multilingual reasoning, analyzes the effect of translation quality and language choice, and evaluates the limitations of common answer selection methods.
Result: The upper bound of multilingual reasoning is significantly higher than that of English-only reasoning and is more robust to variations in translation quality and language choice.
Conclusion: Multilingual reasoning has considerable potential in LLMs, but reaching its upper bound requires overcoming the limitations of existing methods.
Abstract: Previous work indicates that large language models exhibit a significant "English bias", i.e. they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multilingualism in reasoning tasks, suggesting that multilingual reasoning promises significantly (by nearly 10 Acc@k points) and robustly (tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning. Besides analyzing the reason behind the upper bound and challenges in reaching it, we also find that common answer selection methods cannot achieve this upper bound, due to their limitations and biases. These insights could pave the way for future research aimed at fully harnessing the potential of multilingual reasoning in LLMs.
[105] FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations
Yue Zhao,Qingqing Gu,Xiaoyu Wang,Teng Chen,Zhonglin Jiang,Yong Chen,Luo Ji
Main category: cs.CL
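TL;DR: The paper proposes FiSMiness, a finite state machine (FSM) based framework for emotional support conversation (ESC) in which a single large language model (LLM) handles conversation planning, emotion reasoning, and strategy generation.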
Details
Motivation: Existing ESC studies do not frame the problem from a state-model perspective, which leads to suboptimal long-term satisfaction.
Method: Combines an FSM with an LLM in the FiSMiness framework, so that a single model plans the conversation, reasons about the seeker's emotion, and generates the support strategy and response at each turn.
Result: Experiments show that FiSMiness outperforms many baselines on ESC, including direct inference, self-refine, chain of thought, fine-tuning, and external-assisted methods.
Conclusion: The state-machine formulation improves ESC and outperforms even models with many more parameters.
Abstract: Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called FiSMiness. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy and the final response upon each conversational turn. Substantial experiments on ESC datasets suggest that FiSMiness outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and external-assisted methods, even those with many more parameters.
[106] Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Kabir Ahuja,Melanie Sclar,Yulia Tsvetkov
Main category: cs.CL
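TL;DR: The paper proposes plot hole detection in stories as a way to evaluate the language understanding and reasoning of large language models (LLMs) and builds a benchmark called FlawedFictions. State-of-the-art LLMs struggle to detect plot holes, and their performance degrades significantly as stories get longer. LLM-generated stories and summaries are also more prone to introducing plot holes.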
Details
Motivation: Stories are central to human experience, and detecting plot holes requires complex reasoning. As LLMs are used more for generating and understanding text, assessing their deeper language understanding becomes critical. Existing benchmarks focus mainly on surface-level comprehension, so a deeper evaluation is needed.
Method: Proposes FlawedFictionsMaker, an algorithm that controllably synthesizes plot holes in human-written stories, and uses it to build the FlawedFictions benchmark. Human filtering ensures data quality, and the benchmark is robust to contamination.
Result: State-of-the-art LLMs perform poorly on FlawedFictions, and performance degrades significantly as story length increases. LLM-generated stories and summaries are also more prone to introducing plot holes.
Conclusion: Plot hole detection is an effective proxy task for evaluating deep language understanding and reasoning in LLMs. Current LLMs fall short on it, which indicates room for improvement in complex reasoning and consistency.
Abstract: Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes (inconsistencies in a storyline that break the internal logic or rules of a story's world) requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy to evaluate language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark to evaluate LLMs' plot hole detection abilities in stories, which is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless of the reasoning effort allowed, with performance significantly degrading as story length increases.
Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with more than 50% and 100% increases in plot hole detection rates with respect to human-written originals.
[107] An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili,Beatrice Savoldi,Matteo Negri,Luisa Bentivogli
Main category: cs.CL
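TL;DR: The paper investigates using large language models (LLMs) to evaluate gender-neutral translation (GNT) and finds that stepwise prompting, from phrase-level annotations to a sentence-level judgment, improves evaluation accuracy.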
Details
Motivation: Current GNT evaluation is limited to monolingual classifiers, which ignore the source sentence and scale poorly to new languages.
Method: Explores two prompting approaches: generating a sentence-level assessment directly, and a stepwise approach that first produces phrase-level annotations and then a sentence-level judgment.
Result: Experiments show that LLMs can serve as GNT evaluators and that stepwise prompting consistently improves accuracy.
Conclusion: LLMs provide a better and more scalable solution for GNT evaluation.
Abstract: Gender-neutral translation (GNT) aims to avoid expressing the gender of human referents when the source text lacks explicit cues about the gender of those referents. Evaluating GNT automatically is particularly challenging, with current solutions being limited to monolingual classifiers. Such solutions are not ideal because they do not factor in the source sentence and require dedicated data and fine-tuning to scale to new languages. In this work, we address such limitations by investigating the use of large language models (LLMs) as evaluators of GNT. Specifically, we explore two prompting approaches: one in which LLMs generate sentence-level assessments only, and another, akin to a chain-of-thought approach, where they first produce detailed phrase-level annotations before a sentence-level judgment. Through extensive experiments on multiple languages with five models, both open and proprietary, we show that LLMs can serve as evaluators of GNT. Moreover, we find that prompting for phrase-level annotations before sentence-level assessments consistently improves the accuracy of all models, providing a better and more scalable alternative to current solutions.
[108] Robust and Fine-Grained Detection of AI Generated Texts
Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Kanwal Mehreen,Drishti Sharma,Siddhant Gupta,Jebish Purbey,Ashay Srivastava,Subhasya TippaReddy,Arvind Reddy Bobbili,Suraj Telugara Chandrashekhar,Modabbir Adeeb,Srinadh Vura,Hamza Farooq
Main category: cs.CL
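TL;DR: The paper presents models for detecting machine-generated content, especially human-LLM co-authored text, that perform well across domains, languages, and adversarial inputs.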
Details
Motivation: Existing systems perform poorly at detecting AI-generated content in short texts and in partially co-authored human-LLM texts, so a more general detection approach is needed.
Method: Token-classification models are trained on a dataset of 2.4M human-machine co-authored texts covering 23 languages and multiple generators.
Result: The models perform well on unseen domains, unseen generators, texts by non-native speakers, and adversarial inputs; a detailed performance analysis is provided.
Conclusion: The approach offers an effective way to detect machine-generated content, particularly for complex and diverse texts.
Abstract: An ideal detection system for machine-generated content is supposed to work well on any generator as many more advanced LLMs come into existence day by day. Existing systems often struggle with accurately identifying AI-generated content in shorter texts. Further, not all texts might be entirely authored by a human or LLM, hence we focus on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification which are trained on an extensive collection of human-machine co-authored texts, and which perform well on texts of unseen domains, unseen generators, texts by non-native speakers and those with adversarial inputs. We also introduce a new dataset of over 2.4M such texts mostly co-authored by several popular proprietary LLMs over 23 languages. We also present findings of our models' performance on the texts of each domain and generator. Additional findings include comparison of performance against each adversarial method, length of input texts and characteristics of generated texts compared to the original human authored texts.
[109] LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho,Jiahao Huang,Florian Boudin,Akiko Aizawa
Main category: cs.CL
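TL;DR: The paper examines LLM-as-a-judge as an alternative way to evaluate QA models and finds that it correlates highly with human judgment, outperforming the traditional EM/F1 metrics.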
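For readers unfamiliar with the two baseline metrics named above, the following is a minimal sketch of Exact Match and token-level F1 as they are commonly computed for extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) follow a widespread convention and are an assumption here; the digest entry does not specify the paper's exact scoring script, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction such as "in Paris, France" against the gold answer "Paris" gets an EM of 0 and a reduced F1 even though a human would likely accept it, which is the kind of underestimation the entry describes.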
Details
Motivation: Traditional EM and F1 scores do not fully capture QA model performance, whereas an LLM used as an evaluator may be more accurate.
Method: Across four reading-comprehension QA datasets, different LLM families and answer types are compared to assess the effectiveness of LLM-as-a-judge.
Result: With LLM-as-a-judge, the correlation with human judgments rises to 0.85, compared with 0.17 for EM and 0.36 for F1.
Conclusion: LLM-as-a-judge can replace the traditional metrics; it remains imperfect on harder answer types, but no bias issues were observed.
Abstract: Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.17 (EM) and 0.36 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
[110] SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes
Raúl Vázquez,Timothee Mickus,Elaine Zosa,Teemu Vahtola,Jörg Tiedemann,Aman Sinha,Vincent Segonne,Fernando Sánchez-Vega,Alessandro Raganato,Jindřich Libovický,Jussi Karlgren,Shaoxiong Ji,Jindřich Helcl,Liane Guillou,Ona de Gibert,Jaione Bengoetxea,Joseph Attieh,Marianna Apidianaki
Main category: cs.CL
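TL;DR: The Mu-SHROOM task targets hallucinations and overgeneration mistakes in multilingual instruction-tuned large language models (LLMs); it drew 2,618 submissions from 43 teams, showing strong community interest in hallucination detection.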
Details
Motivation: Address hallucination in LLMs in multilingual settings and improve the accuracy of model outputs.
Method: Hallucination detection is framed as a span-labeling task covering 14 languages, and the performance of participating systems is analyzed.
Result: The large number of submissions indicates strong community interest; the analysis identifies key factors behind task performance and points to two challenges: differences in hallucination across languages and annotator disagreement.
Conclusion: Mu-SHROOM provides important data and analysis for hallucination detection, but cross-language differences and annotation consistency need further study.
Abstract: We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
[111] Language Models as Quasi-Crystalline Thought: Structure, Constraint, and Emergence in Generative Systems
Jose Manuel Guevara-Vela
Main category: cs.CL
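TL;DR: The essay proposes an analogy between large language models (LLMs) and quasicrystals, emphasizing that LLMs generate globally coherent but non-periodic linguistic patterns through local constraints.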
Details
Motivation: Conventional ways of evaluating LLMs (predictive accuracy, factuality, or alignment) fail to capture their most characteristic behavior, namely the production of internally resonant linguistic patterns.
Method: From a structural perspective, LLMs are treated as systems that generate quasi-structured language, with emphasis on constraint propagation and coherence of form rather than token-level accuracy.
Result: This perspective opens new paths for evaluating and designing LLMs and highlights a logic of constraint, resonance, and structural depth.
Conclusion: LLMs are neither fully random nor strictly rule-based; they define a generative language space shaped by constraint, resonance, and structural depth.
Abstract: This essay proposes an analogy between large language models (LLMs) and quasicrystals: systems that exhibit global coherence without periodic repetition and that are generated through local constraints. While LLMs are often evaluated in terms of predictive accuracy, factuality, or alignment, this structural perspective suggests that their most characteristic behavior is the production of internally resonant linguistic patterns. Just as quasicrystals forced a redefinition of order in physical systems, viewing LLMs as generators of quasi-structured language opens new paths for evaluation and design: privileging propagation of constraint over token-level accuracy, and coherence of form over fixed meaning. LLM outputs should be read not only for what they say, but for the patterns of constraint and coherence that organize them. This shift reframes generative language as a space of emergent patterning: LLMs are neither fully random nor strictly rule-based, but defined by a logic of constraint, resonance, and structural depth.
[112] Bayesian dynamic borrowing considering semantic similarity between outcomes for disproportionality analysis in FAERS
François Haguinet,Jeffery L Painter,Gregory E Powell,Andrea Callegaro,Andrew Bate
Main category: cs.CL
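TL;DR: Proposes a Bayesian dynamic borrowing (BDB) approach that uses semantic similarity measures (SSMs) to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs).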
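As background for the baseline this entry compares against, here is a minimal sketch of an Information Component (IC) computation in its commonly used shrinkage form, IC = log2((O + 0.5) / (E + 0.5)), where O is the observed count of drug-event reports and E the count expected under independence. The 0.5 shrinkage constant and the function name are assumptions for illustration; the sketch does not reproduce the paper's Bayesian borrowing or MAP prior.

```python
import math


def information_component(n_drug_event, n_drug, n_event, n_total):
    """Shrinkage IC for one drug-event pair (illustrative baseline only).

    n_drug_event: reports mentioning both the drug and the event (observed, O)
    n_drug:       reports mentioning the drug
    n_event:      reports mentioning the event
    n_total:      all reports in the database
    """
    # Expected count if drug and event were reported independently.
    expected = n_drug * n_event / n_total
    # Adding 0.5 to both terms shrinks the estimate toward 0 for rare pairs.
    return math.log2((n_drug_event + 0.5) / (expected + 0.5))
```

An IC of 0 means the pair is reported exactly as often as independence predicts; positive values indicate disproportionate reporting, which is what signal detection looks for.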
Details
Motivation: Address the limitation of rigid hierarchical grouping in current disproportionality analysis (DPA) by using semantic similarity to share information more flexibly.
Method: A robust meta-analytic predictive (MAP) prior is embedded in a Bayesian hierarchical model, and SSMs are used for weighted information sharing from clinically similar MedDRA Preferred Terms (PTs).
Result: IC SSM is more sensitive than traditional IC and HLGT-based borrowing and detects signals earlier, although F1 scores and Youden's index are slightly lower.
Conclusion: SSM-informed Bayesian borrowing is a scalable, context-aware enhancement to DPA; further validation and refinement are needed.
Abstract: We present a Bayesian dynamic borrowing (BDB) approach to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior within a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from MedDRA Preferred Terms (PTs) that are clinically similar to the target PT. This continuous similarity-based borrowing addresses the limitation of rigid hierarchical grouping in current disproportionality analysis (DPA). Using data from the FDA Adverse Event Reporting System (FAERS) between 2015 and 2019, we evaluate this approach - termed IC SSM - against standard Information Component (IC) analysis and IC with borrowing at the MedDRA high-level group term (HLGT) level. A novel reference set (PVLens), derived from FDA product label updates, enabled prospective evaluation of method performance in identifying AEs prior to official labeling. The IC SSM approach demonstrated improved sensitivity compared to both traditional IC and HLGT-based borrowing, with minor trade-offs in F1 scores and Youden's index. IC SSM consistently identified more true positives and detected signals over 5 months sooner than traditional IC. Despite a marginally lower aggregate Youden's index, IC SSM showed higher performance in the early post-marketing period, providing more stable and relevant estimates than HLGT-based borrowing and traditional IC. These findings support the use of SSM-informed Bayesian borrowing as a scalable and context-aware enhancement to traditional DPA methods. Future research should validate this approach across other datasets and explore additional similarity metrics and Bayesian inference strategies using case-level data.
[113] Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection
Yumin Kim,Hwanhee Lee
Main category: cs.CL
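TL;DR: Proposes a new method that needs no fine-tuning and detects implicit hate speech through in-context learning with adaptively retrieved demonstrations, outperforming existing techniques.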
Details
Motivation: Implicit hate speech is hard to detect because it depends on context and cultural differences, and existing models are prone to misclassification.
Method: In-context learning is used with adaptive retrieval of demonstrations that either focus on similar groups or have the highest similarity scores, improving contextual understanding.
Result: Experiments show that the method outperforms current state-of-the-art techniques.
Conclusion: The method improves detection precision, reduces model bias, and enhances robustness.
Abstract: Hate speech detection is a crucial area of research in natural language processing, essential for ensuring online community safety. However, detecting implicit hate speech, where harmful intent is conveyed in subtle or indirect ways, remains a major challenge. Unlike explicit hate speech, implicit expressions often depend on context, cultural subtleties, and hidden biases, making them more challenging to identify consistently. Additionally, the interpretation of such speech is influenced by external knowledge and demographic biases, resulting in varied detection results across different language models. Furthermore, Large Language Models often show heightened sensitivity to toxic language and references to vulnerable groups, which can lead to misclassifications. This over-sensitivity results in false positives (incorrectly identifying harmless statements as hateful) and false negatives (failing to detect genuinely harmful content). Addressing these issues requires methods that not only improve detection precision but also reduce model biases and enhance robustness. To address these challenges, we propose a novel method, which utilizes in-context learning without requiring model fine-tuning. By adaptively retrieving demonstrations that focus on similar groups or those with the highest similarity scores, our approach enhances contextual comprehension. Experimental results show that our method outperforms current state-of-the-art techniques. Implementation details and code are available at TBD.
[114] Gauging Overprecision in LLMs: An Empirical Study
Adil Bahaj,Hamed Rahimi,Mohamed Chetouani,Mounir Ghogho
Main category: cs.CL
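TL;DR: The paper studies overconfidence in large language models (LLMs) through a three-phase framework (generation, refinement, evaluation), revealing that LLMs are uncalibrated on numerical tasks and have a limited grasp of the concept of confidence.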
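To make "uncalibrated" concrete for interval answers, the sketch below compares the confidence level imposed in the prompt with the fraction of intervals that actually contain the true value. This is one plausible way to quantify overprecision, assumed for illustration; the function names and the data layout are hypothetical and not taken from the paper.

```python
def interval_coverage(intervals, truths):
    """Fraction of (low, high) intervals that contain the true value."""
    hits = sum(low <= truth <= high for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)


def calibration_gap(intervals, truths, imposed_confidence):
    """Imposed confidence minus empirical coverage.

    A positive gap means the intervals are too narrow for the stated
    confidence level, i.e. the answers are overprecise.
    """
    return imposed_confidence - interval_coverage(intervals, truths)
```

For example, if a model is asked for 95% intervals and only 3 of 4 intervals contain the true answer, coverage is 0.75 and the gap is about 0.20.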
Details
Motivation: Overconfidence in LLMs is central to assessing their trustworthiness, but existing approaches rely on the model's self-reported (verbalized) confidence, which is subject to biases and hallucinations, so a new perspective is needed.
Method: A three-phase framework is designed: 1) generation, in which LLMs are prompted to answer numerical questions as intervals at a preset confidence level; 2) refinement, in which the generated answers are improved; 3) evaluation, in which the answers are analyzed to understand the LLM's internal workings.
Result: LLMs are uncalibrated on numerical tasks; interval length is uncorrelated with the imposed confidence level; precision depends on the task, the scale of the answer, and the prompting technique; refinement usually does not improve precision.
Conclusion: The study offers a new perspective on LLM overconfidence and establishes a baseline for studying overprecision.
Abstract: Recently, overconfidence in large language models (LLMs) has garnered considerable attention due to its fundamental importance in quantifying the trustworthiness of LLM generation. However, existing approaches prompt the black box LLMs to produce their confidence (verbalized confidence), which can be subject to many biases and hallucinations. Inspired by a different aspect of overconfidence in cognitive science called overprecision, we designed a framework for its study in black box LLMs. This framework contains three main phases: 1) generation, 2) refinement and 3) evaluation. In the generation phase we prompt the LLM to generate answers to numerical questions in the form of intervals with a certain level of confidence. This confidence level is imposed in the prompt and not required for the LLM to generate as in previous approaches. We use various prompting techniques and use the same prompt multiple times to gauge the effects of randomness in the generation process. In the refinement phase, answers from the previous phase are refined to generate better answers. The LLM answers are evaluated and studied in the evaluation phase to understand its internal workings. This study allowed us to gain various insights into LLM overprecision: 1) LLMs are highly uncalibrated for numerical tasks; 2) there is no correlation between the length of the interval and the imposed confidence level, which can be symptomatic of a) a lack of understanding of the concept of confidence or b) an inability to adjust self-confidence by following instructions; 3) LLM numerical precision differs depending on the task, scale of answer and prompting technique; 4) refinement of answers doesn't improve precision in most cases. We believe this study offers new perspectives on LLM overconfidence and serves as a strong baseline for overprecision in LLMs.
[115] Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
Shizhan Cai,Liang Ding,Dacheng Tao
Main category: cs.CL
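TL;DR: Proposes a new watermarking scheme that uses a cumulative watermark entropy threshold to improve detectability and text quality while remaining compatible with existing sampling functions.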
Details
Motivation: The rapid development of large language models (LLMs) raises concerns about content traceability and potential misuse, and existing watermarking schemes trade text quality against detection robustness.
Method: A cumulative watermark entropy threshold is introduced; the scheme is compatible with and generalizes existing sampling functions, improving adaptability.
Result: The scheme significantly outperforms existing methods across multiple LLMs, with improvements of over 80% on datasets such as MATH and GSM8K, while maintaining high detection accuracy.
Conclusion: The new scheme addresses the shortcomings of existing watermarking schemes and offers a better solution for content traceability.
Abstract: The rapid development of Large Language Models (LLMs) has intensified concerns about content traceability and potential misuse. Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. To address these issues, we propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold. Our approach is compatible with and generalizes existing sampling functions, enhancing adaptability. Experimental results across multiple LLMs show that our scheme significantly outperforms existing methods, achieving over 80% improvements on widely-used datasets, e.g., MATH and GSM8K, while maintaining high detection accuracy.
[116] Multilingual Contextualization of Large Language Models for Document-Level Machine Translation
Miguel Moura Ramos,Patrick Fernandes,Sweta Agrawal,André F. T. Martins
Main category: cs.CL
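TL;DR: Proposes improving LLM long-document translation through targeted fine-tuning, introduces the DocBlocks dataset, supports multiple translation paradigms, and improves translation quality and speed.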
Details
Motivation: LLMs perform well on sentence-level translation, but modeling long-range dependencies and cross-sentence discourse phenomena in document-level translation remains challenging.
Method: LLMs are fine-tuned on the DocBlocks dataset, supporting document-to-document and chunk-level translation with instructions that include or omit surrounding context.
Result: Experiments show that combining multiple translation paradigms improves document-level translation quality and inference speed.
Conclusion: The method addresses dependency and discourse problems in long-document translation while preserving sentence-level translation performance.
Abstract: Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.
[117] Poem Meter Classification of Recited Arabic Poetry: Integrating High-Resource Systems for a Low-Resource Task
Maged S. Al-Shaibani,Zaid Alyafeai,Irfan Ahmad
Main category: cs.CL
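TL;DR: Proposes a state-of-the-art framework for identifying the meters of recited Arabic poetry by combining two high-resource systems, and releases a benchmark dataset.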
Details
Motivation: Identifying the meter of Arabic poetry is a complex process that requires expert knowledge, and automatic systems need large amounts of labelled data.
Method: Two high-resource systems are integrated to perform the low-resource task.
Result: A state-of-the-art framework is proposed and a benchmark dataset is released.
Conclusion: The framework provides a foundation for future research and advances meter identification for Arabic poetry.
Abstract: Arabic poetry is an essential and integral part of Arabic language and culture. It has been used by the Arabs to spotlight their major events, such as depicting brutal battles and conflicts. They also used it, as in many other languages, for various purposes such as romance, pride, lamentation, etc. Arabic poetry has received major attention from linguists over the decades. One of the main characteristics of Arabic poetry is its special rhythmic structure as opposed to prose. This structure is referred to as a meter. Meters, along with other poetic characteristics, are intensively studied in an Arabic linguistic field called "Aroud". Identifying these meters for a verse is a lengthy and complicated process. It also requires technical knowledge in Aroud. For recited poetry, it adds an extra layer of processing. Developing systems for automatic identification of poem meters for recited poems needs large amounts of labelled data. In this study, we propose a state-of-the-art framework to identify the poem meters of recited Arabic poetry, where we integrate two separate high-resource systems to perform the low-resource task. To ensure generalization of our proposed architecture, we publish a benchmark for this task for future research.
[118] Mapping Controversies Using Artificial Intelligence: An Analysis of the Hamas-Israel Conflict on YouTube
Victor Manuel Hernandez Lopez,Jaime E. Cuellar
Main category: cs.CL
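TL;DR: The study analyzes about 250,000 Spanish-language YouTube comments, combining STS with NLP (a BERT model) to classify them into seven categories. Pro-Palestinian comments predominate, but pro-Israeli and anti-Palestinian comments receive more likes. The study also shows that media agenda-setting significantly influences shifts in public stance.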
Details
Motivation: Examine how public opinion evolves in the Hamas-Israel controversy, combining social theory with technical tools to analyze the influence of media narratives.
Method: STS is combined with NLP (a BERT model) to classify comments automatically and to apply agenda-setting theory.
Result: Pro-Palestinian comments are the most numerous, but pro-Israeli and anti-Palestinian comments receive more engagement; media coverage shifts public stance from pro-Palestinian toward a more critical position on Israel.
Conclusion: Combining social science with computational tools allows more effective analysis of complex public opinion phenomena, and the methodological innovation offers a new perspective for controversy research.
Abstract: This article analyzes the Hamas-Israel controversy through 253,925 Spanish-language YouTube comments posted between October 2023 and January 2024, following the October 7 attack that escalated the conflict. Adopting an interdisciplinary approach, the study combines the analysis of controversies from Science and Technology Studies (STS) with advanced computational methodologies, specifically Natural Language Processing (NLP) using the BERT (Bidirectional Encoder Representations from Transformers) model. Using this approach, the comments were automatically classified into seven categories, reflecting pro-Palestinian, pro-Israeli, anti-Palestinian, anti-Israeli positions, among others. The results show a predominance of pro-Palestinian comments, although pro-Israeli and anti-Palestinian comments received more "likes." This study also applies the agenda-setting theory to demonstrate how media coverage significantly influences public perception, observing a notable shift in public opinion, transitioning from a pro-Palestinian stance to a more critical position towards Israel. This work highlights the importance of combining social science perspectives with technological tools in the analysis of controversies, presenting a methodological innovation by integrating computational analysis with critical social theories to address complex public opinion phenomena and media narratives.
[119] Trusting CHATGPT: how minor tweaks in the prompts lead to major differences in sentiment classification
Jaime E. Cuellar,Oscar Moreno-Martinez,Paula Sofia Torres-Rodriguez,Jaime Andres Pavlich-Mariscal,Andres Felipe Mican-Castiblanco,Juan Guillermo Torres-Hurtado
Main category: cs.CL
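TL;DR: The study finds that GPT-4o mini is sensitive to minor prompt changes in sentiment polarity analysis, producing inconsistent classifications and challenging the robustness and trustworthiness of large language models.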
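The entry's abstract reports Chi-square tests between prompts. The sketch below shows how such a comparison could be set up for two prompts, using a contingency table of label counts and Pearson's statistic. The counts are made-up illustration data, the function name is hypothetical, and the paper's exact test configuration is not specified in this digest.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows, one row per prompt, one column per label
           (e.g. positive, negative, neutral).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if labels were distributed identically across prompts.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical label counts for two prompt variants on the same 100 comments.
counts = [
    [60, 30, 10],  # prompt A: positive, negative, neutral
    [40, 30, 30],  # prompt B
]
statistic, dof = chi_square_statistic(counts)
# For 2 degrees of freedom, the 5% critical value is about 5.991.
significant = statistic > 5.991
```

With these illustrative counts the statistic is 14 on 2 degrees of freedom, so the two prompts would be judged to produce different label distributions at the 5% level.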
Details
Motivation: Examine how far complex predictive models such as ChatGPT can be trusted, in particular their sensitivity to prompt changes in sentiment analysis.
Method: 100,000 Spanish-language comments are classified with 10 slightly different prompts, and exploratory and confirmatory analyses assess the differences between results.
Result: Minor prompt changes (lexical, syntactic, or modal) significantly affect the classifications, and the model shows inconsistency and hallucinations.
Conclusion: The trustworthiness of large language models in classification tasks is affected by prompt variation, and their use should be assessed in light of both social and technical factors.
Abstract: One fundamental question for the social sciences today is: how much can we trust highly complex predictive models like ChatGPT? This study tests the hypothesis that subtle changes in the structure of prompts do not produce significant variations in the classification results of sentiment polarity analysis generated by the Large Language Model GPT-4o mini. Using a dataset of 100,000 comments in Spanish on four Latin American presidents, the model classified the comments as positive, negative, or neutral on 10 occasions, varying the prompts slightly each time. The experimental methodology included exploratory and confirmatory analyses to identify significant discrepancies among classifications. The results reveal that even minor modifications to prompts, such as lexical, syntactic, or modal changes, or even their lack of structure, impact the classifications. In certain cases, the model produced inconsistent responses, such as mixing categories, providing unsolicited explanations, or using languages other than Spanish. Statistical analysis using Chi-square tests confirmed significant differences in most comparisons between prompts, except in one case where linguistic structures were highly similar. These findings challenge the robustness and trust of Large Language Models for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, it was evident that the lack of structured grammar in prompts increases the frequency of hallucinations. The discussion underscores that trust in Large Language Models is based not only on technical performance but also on the social and institutional relationships underpinning their use.
[120] SALAD: Improving Robustness and Generalization through Contrastive Learning with Structure-Aware and LLM-Driven Augmented Data
Suyoung Bae,Hyojun Kim,YunSeok Choi,Jee-Hyong Lee
Main category: cs.CL
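TL;DR: SALAD improves the robustness and generalization of pre-trained language models by generating structure-aware and counterfactually augmented data.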
Details
Motivation: Address the performance drop that spurious correlations cause when fine-tuning pre-trained language models, especially on out-of-distribution data.
Method: A tagging-based approach generates structure-aware positive samples, large language models generate diverse counterfactual negative samples, and the model is optimized with contrastive learning.
Result: On sentiment classification, sexism detection, and natural language inference, SALAD significantly improves robustness and generalization.
Conclusion: SALAD reduces reliance on spurious correlations and strengthens performance on out-of-distribution data and in cross-domain scenarios.
Abstract: In various natural language processing (NLP) tasks, fine-tuning Pre-trained Language Models (PLMs) often leads to the issue of spurious correlations, which negatively impacts performance, particularly when dealing with out-of-distribution data. To address this problem, we propose SALAD (Structure Aware and LLM-driven Augmented Data), a novel approach designed to enhance model robustness and generalization by generating structure-aware and counterfactually augmented data for contrastive learning. Our method leverages a tagging-based approach to generate structure-aware positive samples and utilizes large language models (LLMs) to generate counterfactual negative samples with diverse sentence patterns. By applying contrastive learning, SALAD enables the model to focus on learning the structural relationships between key sentence components while minimizing reliance on spurious correlations. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference. The results demonstrate that SALAD not only improves model robustness and performance across different environments but also enhances generalization to out-of-distribution datasets and cross-domain scenarios.
[121] What Do Large Language Models Know? Tacit Knowledge as a Potential Causal-Explanatory Structure
Céline Budding
Main category: cs.CL
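TL;DR: The paper asks whether large language models (LLMs) possess tacit knowledge and argues that they satisfy the constraints of semantic description, syntactic structure, and causal systematicity.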
Details
Motivation: Investigate whether LLMs genuinely have knowledge, tacit knowledge in particular, in order to clear up misconceptions about their capabilities.
Method: The architectural features of LLMs are analyzed to check whether they satisfy the constraints in Martin Davies's definition of tacit knowledge.
Result: Certain architectural features of LLMs meet the requirements of semantic description, syntactic structure, and causal systematicity.
Conclusion: Tacit knowledge can serve as a theoretical framework for describing, explaining, and intervening on LLMs and their behavior.
Abstract: It is sometimes assumed that Large Language Models (LLMs) know language, or for example that they know that Paris is the capital of France. But what -- if anything -- do LLMs actually know? In this paper, I argue that LLMs can acquire tacit knowledge as defined by Martin Davies (1990). Whereas Davies himself denies that neural networks can acquire tacit knowledge, I demonstrate that certain architectural features of LLMs satisfy the constraints of semantic description, syntactic structure, and causal systematicity. Thus, tacit knowledge may serve as a conceptual framework for describing, explaining, and intervening on LLMs and their behavior.
[122] d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Siyan Zhao,Devaansh Gupta,Qinqing Zheng,Aditya Grover
Main category: cs.CL
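TL;DR: The paper proposes d1, a framework that turns pre-trained diffusion large language models (dLLMs) into reasoning models through supervised finetuning (SFT) and reinforcement learning (RL), significantly improving their reasoning performance.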
Details
Motivation: Although dLLMs perform well at language modeling, it is unclear whether their reasoning ability matches that of autoregressive (AR) models, so ways to improve dLLM reasoning need to be explored.
Method: The d1 framework combines SFT and RL, including a masked SFT technique and diffu-GRPO, a new critic-free policy-gradient algorithm.
Result: Experiments show that d1 significantly improves dLLM performance on mathematical and logical reasoning tasks and outperforms existing methods.
Conclusion: The d1 framework successfully turns dLLMs into effective reasoning models and offers a new direction for reasoning with diffusion models.
Abstract: Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.
[123] BitNet b1.58 2B4T Technical Report
Shuming Ma,Hongyu Wang,Shaohan Huang,Xingxing Zhang,Ying Hu,Ting Song,Yan Xia,Furu Wei
Main category: cs.CL
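TL;DR: BitNet b1.58 2B4T is the first open-source 1-bit large language model (LLM) at the 2-billion-parameter scale, trained on 4 trillion tokens; it performs on par with comparable full-precision models while being more computationally efficient.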
Details
Motivation: Develop an efficient, high-performing low-precision LLM to reduce compute resource consumption.
Method: A 1-bit, 2-billion-parameter LLM is trained on a 4-trillion-token corpus and evaluated on multiple benchmarks.
Result: The model matches comparable full-precision models in language understanding, mathematical reasoning, coding, and conversation, while substantially reducing memory footprint, energy consumption, and decoding latency.
Conclusion: BitNet b1.58 2B4T opens a new direction for efficient LLM research, and the model weights and inference implementations are released as open source.
Abstract: We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.
cs.GR [Back]
[124] Recent Advance in 3D Object and Scene Generation: A Survey
Xiang Tang,Ruotong Li,Xiaopeng Fan
Main category: cs.GR
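TL;DR: This survey reviews recent progress in 3D content generation, covering technical frameworks for static 3D object and scene generation, and discusses future research directions.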
Details
Motivation: With the growth of the interactive media, XR, and Metaverse industries, demand for 3D content has surged; traditional modeling methods are inefficient, and new technical breakthroughs are needed.
Method: Through systematic categorization, the survey analyzes mainstream 3D object representations and two technical pathways, data-driven supervised learning and deep generative models; for scene generation it covers layout-guided synthesis, 2D prior-based generation, and rule-driven modeling.
Result: A comprehensive technical framework is established, summarizing the current state of the art in 3D generation.
Conclusion: The survey gives readers a structured understanding of 3D generation technologies and proposes potential directions for future research.
Abstract: In recent years, the demand for 3D content has grown exponentially with the intelligent upgrading of interactive media, extended reality (XR), and Metaverse industries. In order to overcome the limitations of traditional manual modeling approaches, such as labor-intensive workflows and prolonged production cycles, revolutionary advances have been achieved through the convergence of novel 3D representation paradigms and artificial intelligence generative technologies. In this survey, we conduct a systematic review of the cutting-edge achievements in static 3D object and scene generation, and establish a comprehensive technical framework through systematic categorization. Specifically, we initiate our analysis with mainstream 3D object representations, followed by in-depth exploration of two principal technical pathways in object generation: data-driven supervised learning methods and deep generative model-based approaches. Regarding scene generation, we focus on three dominant paradigms: layout-guided compositional synthesis, 2D prior-based scene generation, and rule-driven modeling. Finally, we critically examine persistent challenges in 3D generation and propose potential research directions for future investigation. This survey aims to provide readers with a structured understanding of state-of-the-art 3D generation technologies while inspiring researchers to undertake more exploration in this domain.
eess.IV [Back]
[125] Do Segmentation Models Understand Vascular Structure? A Blob-Based XAI Framework
Guillaume Garret,Antoine Vacavant,Carole Frindel
Main category: eess.IV
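TL;DR: The paper proposes a new explainability method for analyzing 3D vessel segmentation models and finds that the models rely mainly on local features rather than global anatomical structure.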
Details
Motivation: Deep learning performs well in medical image segmentation, but its black-box nature limits clinical adoption, especially in vessel segmentation, which should draw on both local and global anatomical structure.
Method: Gradient-based attribution is combined with graph-guided point selection and a blob-based analysis of saliency maps to assess how much the model uses global anatomical structure.
Result: Model decisions are dominated by local features and correlate weakly with global properties such as vessel thickness and connectivity.
Conclusion: Current segmentation models are limited in capturing global vascular anatomy, which underlines the importance of structured explainability tools.
Abstract: Deep learning models have achieved impressive performance in medical image segmentation, yet their black-box nature limits clinical adoption. In vascular applications, trustworthy segmentation should rely on both local image cues and global anatomical structures, such as vessel connectivity or branching. However, the extent to which models leverage such global context remains unclear. We present a novel explainability pipeline for 3D vessel segmentation, combining gradient-based attribution with graph-guided point selection and a blob-based analysis of saliency maps. Using vascular graphs extracted from ground truth, we define anatomically meaningful points of interest (POIs) and assess the contribution of input voxels via saliency maps. These are analyzed at both global and local scales using a custom blob detector. Applied to IRCAD and Bullitt datasets, our analysis shows that model decisions are dominated by highly localized attribution blobs centered near POIs. Attribution features show little correlation with vessel-level properties such as thickness, tubularity, or connectivity -- suggesting limited use of global anatomical reasoning. Our results underline the importance of structured explainability tools and highlight the current limitations of segmentation models in capturing global vascular context.
[126] Local Temporal Feature Enhanced Transformer with ROI-rank Based Masking for Diagnosis of ADHD
Byunggun Kim,Younghun Kwon
Main category: eess.IV
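TL;DR: Proposes a Transformer-based ADHD diagnosis model that learns spatiotemporal features and attention structures from rs-fMRI data, significantly improving diagnostic performance.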
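The entry's abstract describes local temporal attention as attention restricted by "simple window masking". A minimal sketch of that general idea follows: each time step may attend only to steps within a fixed distance, and masked positions receive zero weight in the softmax. The symmetric window, the function names, and the pure-Python form are assumptions for illustration; the paper's actual masking details are not given in this digest.

```python
import math


def local_window_mask(num_steps, window):
    """mask[i][j] is True when step i may attend to step j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(num_steps)]
            for i in range(num_steps)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked positions."""
    # Masked positions contribute 0; the diagonal is always kept, so total > 0.
    exps = [math.exp(s) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = local_window_mask(num_steps=4, window=1)
# Step 0 can see only steps 0 and 1, so large scores at steps 2 and 3 are ignored.
weights = masked_softmax([0.0, 0.0, 5.0, 5.0], mask[0])
```

The design choice illustrated here is that locality is enforced purely through the mask, leaving the attention computation itself unchanged.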
Details
Motivation: ADHD is a common mental disorder, and existing diagnostic methods need more effective techniques for extracting spatiotemporal biomarkers.
Method: A CNN-based embedding block, local temporal attention, and ROI-rank based masking are designed to improve the Transformer model.
Result: The model performs well on the ADHD-200 dataset (ACC 77.78%, SPE 76.60%, SEN 79.22%, AUC 79.30%).
Conclusion: The model provides more precise spatiotemporal biomarkers for ADHD diagnosis and outperforms other Transformer variants.
Abstract: In modern society, Attention-Deficit/Hyperactivity Disorder (ADHD) is one of the common mental diseases discovered not only in children but also in adults. In this context, we propose an ADHD diagnosis transformer model that can effectively and simultaneously find important brain spatiotemporal biomarkers from resting-state functional magnetic resonance imaging (rs-fMRI). This model not only learns spatiotemporal individual features but also learns the correlation with full attention structures specialized in ADHD diagnosis. In particular, it focuses on learning local blood oxygenation level dependent (BOLD) signals and distinguishing important regions of interest (ROI) in the brain. Specifically, the three proposed methods for the ADHD diagnosis transformer are as follows. First, we design a CNN-based embedding block to obtain more expressive embedding features in brain region attention. It is reconstructed for the transformer from previous CNN-based ADHD diagnosis models. Next, for individual spatiotemporal feature attention, we change the attention method to local temporal attention and ROI-rank based masking. For the temporal features of fMRI, the local temporal attention enables learning local BOLD signal features with only simple window masking. For the spatial feature of fMRI, ROI-rank based masking can distinguish ROIs with high correlation in ROI relationships based on attention scores, thereby providing a more specific biomarker for ADHD diagnosis. The experiment was conducted with various types of transformer models. To evaluate these models, we collected the data from 939 individuals from all sites provided by the ADHD-200 competition. Through this, the spatiotemporal enhanced transformer for ADHD diagnosis outperforms other types of transformer variants (ACC 77.78%, SPE 76.60%, SEN 79.22%, AUC 79.30%).
[127] Deciphering scrolls with tomography: A training experiment
Sonia Foschiatti,Axel Kittenberger,Otmar Scherzer
Main category: eess.IV
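TL;DR: This paper proposes an educational laboratory that uses visible light and non-destructive techniques to simulate the virtual recovery of ancient documents.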
Details
Motivation: Severely damaged ancient documents are hard to unwrap physically, so non-destructive techniques such as X-ray CT are combined with computer vision algorithms to read them virtually.
Method: The authors built an experimental setup that replaces harmful X-rays with visible light, and a didactic software pipeline that lets students virtually reconstruct text printed on a transparent rolled sheet.
Result: The setup and pipeline simulate the acquisition and virtual recovery of ancient documents.
Conclusion: The approach gives educators a safe and effective tool for simulating the virtual recovery of ancient documents.
Abstract: The recovery of severely damaged ancient written documents has proven to be a major challenge for many scientists, mainly due to the impracticality of physically unwrapping them. Non-destructive techniques, such as X-ray computed tomography (CT), combined with computer vision algorithms, have emerged as a means of facilitating the virtual reading of the hidden contents of the damaged documents. This paper proposes an educational laboratory aimed at simulating the entire process of acquisition and virtual recovery of the ancient works. We have developed an experimental setup that uses visible light to replace the detrimental X-rays, and a didactic software pipeline that allows students to virtually reconstruct a transparent rolled sheet with printed text on it, the wrapped scroll.
[128] Attention GhostUNet++: Enhanced Segmentation of Adipose Tissue and Liver in CT Images
Mansoor Hayat,Supavadee Aramvith,Subrata Bhattacharjee,Nouman Ahmad
Main category: eess.IV
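TL;DR: The paper proposes Attention GhostUNet++, a new deep learning model for precise segmentation of abdominal adipose tissue (SAT and VAT) and the liver, improving both segmentation quality and computational efficiency.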
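The results of this entry are reported as Dice coefficients. For readers unfamiliar with the metric, here is a minimal sketch of Dice on flattened binary masks, Dice = 2|A ∩ B| / (|A| + |B|). The function name and the convention for two empty masks are assumptions for illustration, not taken from the paper's code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (nonzero = foreground)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total
```

A value of 1.0 means the predicted and reference masks coincide; the 0.94 to 0.97 range reported in this entry corresponds to a small fraction of disagreeing voxels relative to the combined mask size.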
Details
Motivation: Accurate segmentation of abdominal adipose tissue and the liver is essential for understanding body composition and related health risks such as type 2 diabetes and cardiovascular disease.
Method: The model embeds channel, spatial, and depth attention mechanisms in the Ghost UNet++ bottleneck for automated, precise segmentation.
Result: On the AATTCT-IDS and LiTS datasets, the model achieves Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for the liver, surpassing baseline models.
Conclusion: Despite minor limitations in boundary detail, the model improves feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis.
Abstract: Accurate segmentation of abdominal adipose tissue, including subcutaneous (SAT) and visceral adipose tissue (VAT), along with liver segmentation, is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. This study proposes Attention GhostUNet++, a novel deep learning model incorporating Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. Evaluated on the AATTCT-IDS and LiTS datasets, the model achieved Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary detail segmentation, the proposed model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis. The implementation of the proposed Attention GhostUNet++ model is available at: https://github.com/MansoorHayat777/Attention-GhostUNetPlusPlus.
[129] TextDiffSeg: Text-guided Latent Diffusion Model for 3d Medical Images Segmentation
Kangbo Ma
Main category: eess.IV
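TL;DR: The paper proposes TextDiffSeg, a text-guided diffusion framework that addresses the high computational cost and limited global context of diffusion probabilistic models in 3D medical image segmentation.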
Details
Motivation: Diffusion probabilistic models perform well in 3D medical image segmentation, but their high computational cost and limited capture of global context restrict practical use.
Method: TextDiffSeg combines 3D volumetric data with natural language descriptions through cross-modal embedding and a shared semantic space, and introduces label embedding and cross-modal attention to reduce computational complexity while preserving global context.
Result: Experiments show that TextDiffSeg outperforms existing methods on kidney and pancreas tumor segmentation and on multi-organ segmentation; ablation studies confirm the contribution of its key components.
Conclusion: TextDiffSeg offers an efficient and accurate solution for 3D medical image segmentation with broad potential for clinical application.
Abstract: Diffusion Probabilistic Models (DPMs) have demonstrated significant potential in 3D medical image segmentation tasks. However, their high computational cost and inability to fully capture global 3D contextual information limit their practical applications. To address these challenges, we propose a novel text-guided diffusion model framework, TextDiffSeg. This method leverages a conditional diffusion framework that integrates 3D volumetric data with natural language descriptions, enabling cross-modal embedding and establishing a shared semantic space between visual and textual modalities. By enhancing the model's ability to recognize complex anatomical structures, TextDiffSeg incorporates innovative label embedding techniques and cross-modal attention mechanisms, effectively reducing computational complexity while preserving global 3D contextual integrity. Experimental results demonstrate that TextDiffSeg consistently outperforms existing methods in segmentation tasks involving kidney and pancreas tumors, as well as multi-organ segmentation scenarios. Ablation studies further validate the effectiveness of key components, highlighting the synergistic interaction between text fusion, image feature extractor, and label encoder. TextDiffSeg provides an efficient and accurate solution for 3D medical image segmentation, showcasing its broad applicability in clinical diagnosis and treatment planning.
[130] Novel-view X-ray Projection Synthesis through Geometry-Integrated Deep Learning
Daiqi Liu,Fuxin Fan,Andreas Maier
Main category: eess.IV
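TL;DR: The paper presents DL-GIPS, a model that synthesizes X-ray projections from new viewpoints using a single projection, reducing radiation exposure and clinical complexity.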
Details
Motivation: Conventional X-ray imaging needs projections from multiple angles, which increases radiation exposure and complicates clinical workflows, so a more efficient approach is needed.
Method: DL-GIPS extracts geometry and texture features from a single projection, adjusts the geometry features to match the new viewing angle, and synthesizes the final projection through an image generation process.
Result: Lung imaging examples demonstrate the effectiveness and applicability of DL-GIPS and its potential to reduce data acquisition.
Conclusion: By reducing the need for multi-angle projections, DL-GIPS could change stereoscopic and volumetric imaging.
Abstract: X-ray imaging plays a crucial role in the medical field, providing essential insights into the internal anatomy of patients for diagnostics, image-guided procedures, and clinical decision-making. Traditional techniques often require multiple X-ray projections from various angles to obtain a comprehensive view, leading to increased radiation exposure and more complex clinical processes. This paper explores an innovative approach using the DL-GIPS model, which synthesizes X-ray projections from new viewpoints by leveraging a single existing projection. The model strategically manipulates geometry and texture features extracted from an initial projection to match new viewing angles. It then synthesizes the final projection by merging these modified geometry features with consistent texture information through an advanced image generation process. We demonstrate the effectiveness and broad applicability of the DL-GIPS framework through lung imaging examples, highlighting its potential to revolutionize stereoscopic and volumetric imaging by minimizing the need for extensive data acquisition.
[131] Modality-Independent Explainable Detection of Inaccurate Organ Segmentations Using Denoising Autoencoders
Levente Lippenszky,István Megyeri,Krisztian Koos,Zsófia Karancsi,Borbála Deák-Karancsi,András Frontó,Árpád Makk,Attila Rádics,Erhan Bas,László Ruskó
Main category: eess.IV
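TL;DR: The paper proposes a denoising-autoencoder-based method for detecting inaccurate organ segmentations in radiation therapy planning; it is independent of imaging modality and outperforms existing methods on most organs.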
Details
Motivation: Inaccurate organ segmentations in radiation therapy planning can lead to suboptimal treatment, so a method for detecting them is needed.
Method: Noise is added to ground-truth organ segmentations and autoencoders are trained to denoise them; the resulting reconstructions are used to identify inaccurate regions.
Result: The method works on both MR and CT scans and outperforms existing methods on most organs.
Conclusion: The method provides an explainable way to detect inaccurate segmentations and helps improve the quality of radiation therapy planning.
Abstract: In radiation therapy planning, inaccurate segmentations of organs at risk can result in suboptimal treatment delivery if left undetected by the clinician. To address this challenge, we developed a denoising autoencoder-based method to detect inaccurate organ segmentations. We applied noise to ground truth organ segmentations, and the autoencoders were tasked to denoise them. Through the application of our method to organ segmentations generated on both MR and CT scans, we demonstrated that the method is independent of imaging modality. By providing reconstructions, our method offers visual information about inaccurate regions of the organ segmentations, leading to more explainable detection of suboptimal segmentations. We compared our method to existing approaches in the literature and demonstrated that it achieved superior performance for the majority of organs.
[132] Comparative Evaluation of Radiomics and Deep Learning Models for Disease Detection in Chest Radiography
Zhijin He,Alan B. McMillan
Main category: eess.IV
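TL;DR: This study compares radiomics-based and deep learning-based AI models for disease detection in chest radiography, focusing on diagnostic accuracy for COVID-19, lung opacity, and viral pneumonia.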
Details
Motivation: AI has advanced diagnostic practice in medical imaging, but how different AI models perform in data-limited or high-throughput settings is unclear. The study aims to fill this gap and inform the clinical choice of AI models.
Method: The study systematically compares radiomics models (Decision Trees, Gradient Boosting, Random Forests, SVM, and MLP) with deep learning models (such as CNNs and ViTs) and analyzes performance across sample sizes.
Result: Performance differs across models and sample sizes; the abstract notes that radiomics-based models may have an advantage when data are limited.
Conclusion: The study offers guidance for choosing AI models in clinical practice, emphasizing selection based on specific needs and settings.
Abstract: The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), learn directly from image data, radiomics-based models extract and analyze quantitative features, potentially providing advantages in data-limited scenarios. This study systematically compares the diagnostic accuracy and robustness of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP) for radiomics, against state-of-the-art computer vision deep learning architectures. Performance metrics across varying sample sizes reveal insights into each model's efficacy, highlighting the contexts in which specific AI approaches may offer enhanced diagnostic capabilities. The results aim to inform the integration of AI-driven diagnostic tools in clinical practice, particularly in automated and high-throughput environments where timely, reliable diagnosis is critical. This comparative study addresses an essential gap, establishing guidance for the selection of AI models based on clinical and operational needs.
cs.AI [Back]
[133] From Conceptual Data Models to Multimodal Representation
Peter Stockinger
Main category: cs.AI
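TL;DR: This document examines information design as two practices, defining the meaning of textual data and expressing it visually, and emphasizes the use of semantic modeling and multimodal visualization.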
Details
Motivation: To study the theory and practice of information design, in particular how semantic modeling and multimodal expression can improve the analysis and dissemination of complex data.
Method: Drawing on structural semiotics and linguistics, the author models semantics with conceptual networks or graphs and applies multimodal visualization in environments such as OKAPI.
Result: The work presents dynamic, adaptable models that support analyzing, publishing, and reusing complex data, along with approaches such as visual storytelling and document reengineering.
Conclusion: Through semantic modeling and multimodal expression, information design improves the interoperability, flexibility, and reach of data, opening the way to more collaborative use of digital data.
Abstract: 1) Introduction and Conceptual Framework: This document explores the concept of information design by dividing it into two major practices: defining the meaning of a corpus of textual data and its visual or multimodal representation. It draws on expertise in enriching textual corpora, particularly audiovisual ones, and transforming them into multiple narrative formats. The text highlights a crucial distinction between the semantic content of a domain and the modalities of its graphic expression, illustrating this approach with concepts rooted in structural semiotics and linguistics traditions. 2) Modeling and Conceptual Design: The article emphasizes the importance of semantic modeling, often achieved through conceptual networks or graphs. These tools enable the structuring of knowledge within a domain by accounting for relationships between concepts, contexts of use, and specific objectives. Stockinger also highlights the constraints and challenges involved in creating dynamic and adaptable models, integrating elements such as thesauri or interoperable ontologies to facilitate the analysis and publication of complex corpora. 3) Applications and Multimodal Visualization: The text concludes by examining the practical application of these models in work environments like OKAPI, developed to analyze, publish, and reuse audiovisual data. It also discusses innovative approaches such as visual storytelling and document reengineering, which involve transforming existing content into new resources tailored to various contexts. These methods emphasize interoperability, flexibility, and the intelligence of communication systems, paving the way for richer and more collaborative use of digital data. The content of this document was presented during the "Semiotics of Information Design" Day organized by Anne Beyaert-Geslin of the University of Bordeaux Montaigne (MICA laboratory) on June 21, 2018, in Bordeaux.
[134] HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
Haokun Liu,Sicong Huang,Jingyu Hu,Yangqiaoyu Zhou,Chenhao Tan
Main category: cs.AI
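TL;DR: HypoBench is a new benchmark for evaluating large language models (LLMs) and hypothesis generation methods on practical utility, generalizability, and hypothesis discovery rate. Existing methods can discover valid and novel patterns, but they fall short on synthetic data, and performance drops significantly as task difficulty increases.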
Details
Motivation: To address what makes a good hypothesis and how to evaluate hypothesis generation methods systematically.
Method: The authors introduce HypoBench, which comprises 7 real-world tasks and 5 synthetic tasks with 194 datasets, and evaluate 4 LLMs combined with 6 hypothesis generation methods.
Result: Existing methods discover valid and novel patterns, but in synthetic settings performance drops as difficulty increases, with the best models and methods recovering only 38.8% of the ground-truth hypotheses.
Conclusion: HypoBench is a valuable resource for improving AI systems for scientific discovery, and current hypothesis generation methods still have room to improve.
Abstract: There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
[135] GraphicBench: A Planning Benchmark for Graphic Design with Language Agents
Dayeon Ki,Tianyi Zhou,Marine Carpuat,Gang Wu,Puneet Mathur,Viswanathan Swaminathan
Main category: cs.AI
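TL;DR: The paper introduces GraphicBench and GraphicTown for testing how LLM agents plan and execute open-ended creative design tasks, and finds that LLMs struggle with reasoning about spatial relationships, coordinating global dependencies, and selecting actions.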
Details
Motivation: To explore what LLM agents can do in creative design tasks with open-ended goals, an area left underexplored by prior work on well-defined tasks.
Method: The authors propose the GraphicBench benchmark and the GraphicTown framework, which has three design experts and 46 tools, and test the planning and execution abilities of six LLMs.
Result: LLMs can generate workflows that integrate explicit and implicit constraints, but they perform poorly on spatial reasoning, global coordination, and action selection.
Conclusion: GraphicBench is a valuable testbed for advancing LLM planning and execution in creative design tasks.
Abstract: Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with open-ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across four design types. We further present GraphicTown, an LLM agent framework with three design experts and 46 actions (tools) to choose from for executing each step of the planned workflows in web environments. Experiments with six LLMs demonstrate their ability to generate workflows that integrate both explicit design constraints from user queries and implicit commonsense constraints. However, these workflows often do not lead to successful execution outcomes, primarily due to challenges in: (1) reasoning about spatial relationships, (2) coordinating global dependencies across experts, and (3) retrieving the most appropriate action per step. We envision GraphicBench as a challenging yet valuable testbed for advancing LLM-agent planning and execution in creative design tasks.
[136] Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?
Yiyou Sun,Georgia Zhou,Hao Wang,Dacheng Li,Nouha Dziri,Dawn Song
Main category: cs.AI
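TL;DR: The paper analyzes how supervised fine-tuning (SFT) affects the mathematical reasoning of language models, finds that problems at different difficulty levels call for different reasoning styles, and shows the importance of dataset scale for performance gains.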
Details
Motivation: To understand which mathematical reasoning capabilities supervised fine-tuning actually improves, and how problem difficulty relates to model performance.
Method: Using the AIME24 dataset, the authors group questions into four difficulty tiers (Easy, Medium, Hard, Exh) and study model reasoning at each tier.
Result: Moving from Easy to Medium requires an R1 reasoning style with minimal SFT; Hard-tier accuracy plateaus at around 65% because of errors along the reasoning chain; Exh-tier questions require unconventional skills that models uniformly struggle with. Carefully curated small datasets offer limited advantage, and scaling dataset size is more effective.
Conclusion: The study provides a clearer roadmap for improving the mathematical reasoning of language models and underlines the importance of dataset scale.
Abstract: Recent supervised fine-tuning (SFT) approaches have significantly improved language models' performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.
[137] Evaluating the Goal-Directedness of Large Language Models
Tom Everitt,Cristina Garbacea,Alexis Bellot,Jonathan Richens,Henry Papadatos,Siméon Campos,Rohin Shah
Main category: cs.AI
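TL;DR: The paper studies goal-directedness in LLMs, evaluating them on information gathering, cognitive effort, and plan execution tasks. Goal-directedness is relatively consistent across tasks, differs from task performance, and is only moderately sensitive to motivational prompts.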
Details
Motivation: To examine the extent to which LLMs use their capabilities toward a given goal, taken as a measure of their goal-directedness.
Method: Goal-directedness is evaluated on information gathering, cognitive effort, and plan execution tasks, with subtasks used to infer each model's relevant capabilities.
Result: Goal-directedness is relatively consistent across tasks, differs from task performance, and is only moderately sensitive to motivational prompts; most models are not fully goal-directed.
Conclusion: Goal-directedness evaluations can support better monitoring of LLM progress and more deliberate design choices about agentic properties in LLMs.
Abstract: To what extent do LLMs use their capabilities towards their given goal? We take this as a measure of their goal-directedness. We evaluate goal-directedness on tasks that require information gathering, cognitive effort, and plan execution, where we use subtasks to infer each model's relevant capabilities. Our evaluations of LLMs from Google DeepMind, OpenAI, and Anthropic show that goal-directedness is relatively consistent across tasks, differs from task performance, and is only moderately sensitive to motivational prompts. Notably, most models are not fully goal-directed. We hope our goal-directedness evaluations will enable better monitoring of LLM progress, and enable more deliberate design choices of agentic properties in LLMs.
[138] ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
Nada Shahin,Leila Ismail
Main category: cs.AI
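TL;DR: The paper proposes an Adaptive Transformer (ADAT) that improves sign language translation through enhanced feature extraction and adaptive feature weighting while reducing training overhead.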
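This entry's abstract describes adaptive feature weighting "through a gating mechanism" and mentions a dual-stream structure, without giving the equations. One common form of gating, shown here purely as an illustrative sketch, mixes two feature streams element-wise with a sigmoid gate: out = g * a + (1 - g) * b. The function names are hypothetical, and in a trained model the gate logits would come from a learned layer rather than being passed in by hand.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(stream_a: List[float],
                 stream_b: List[float],
                 gate_logits: List[float]) -> List[float]:
    """Element-wise gated mix of two feature vectors (assumed form of gating).

    A gate near 1 keeps the feature from stream_a; a gate near 0 keeps
    the feature from stream_b.
    """
    if not (len(stream_a) == len(stream_b) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    gates = [sigmoid(z) for z in gate_logits]
    return [g * a + (1.0 - g) * b
            for g, a, b in zip(gates, stream_a, stream_b)]
```

Whether ADAT gates between two streams or reweights a single stream is not stated in the abstract, so this should be read as a generic gating example rather than the paper's architecture.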
Details
Motivation: Existing Transformer architectures struggle to capture fine-grained, short-range temporal dependencies in sign language translation and have high computational complexity.
Method: ADAT adds a gating mechanism for enhanced feature extraction and adaptive feature weighting, emphasizing contextually relevant features.
Result: On PHOENIX14T and MedASL, ADAT outperforms the encoder-decoder transformer baseline in both translation accuracy and training time.
Conclusion: By improving how features are processed, ADAT raises both the performance and the efficiency of sign language translation.
Abstract: Current sign language machine translation systems rely on recognizing hand movements, facial expressions and body postures, and natural language processing, to convert signs into text. Recent approaches use Transformer architectures to model long-range dependencies via positional encoding. However, they lack accuracy in recognizing fine-grained, short-range temporal dependencies between gestures captured at high frame rates. Moreover, their high computational complexity leads to inefficient training. To mitigate these issues, we propose an Adaptive Transformer (ADAT), which incorporates components for enhanced feature extraction and adaptive feature weighting through a gating mechanism to emphasize contextually relevant features while reducing training overhead and maintaining translation accuracy. To evaluate ADAT, we introduce MedASL, the first public medical American Sign Language dataset. In sign-to-gloss-to-text experiments, ADAT outperforms the encoder-decoder transformer, improving BLEU-4 accuracy by 0.1% while reducing training time by 14.33% on PHOENIX14T and 3.24% on MedASL. In sign-to-text experiments, it improves accuracy by 8.7% and reduces training time by 2.8% on PHOENIX14T and achieves 4.7% higher accuracy and 7.17% faster training on MedASL. Compared to encoder-only and decoder-only baselines in sign-to-text, ADAT is at least 6.8% more accurate despite being up to 12.1% slower due to its dual-stream structure.
[139] Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning
Mahmoud Salhab,Marwan Elghitany,Shameed Sait,Syed Sibghat Ullah,Mohammad Abusheikh,Hasan Abusheikh
Main category: cs.AI
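TL;DR: The paper presents a weakly supervised Arabic automatic speech recognition (ASR) model based on the Conformer architecture, trained on 15,000 hours of weakly annotated data without manual transcription, and reports state-of-the-art (SOTA) performance.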
Details
Motivation: Building high-performance ASR models is hard for low-resource languages such as Arabic because large labeled datasets are scarce and conventional annotation is costly.
Method: The authors use weakly supervised learning to train a Conformer model from scratch on 15,000 hours of weakly annotated speech covering Modern Standard Arabic and Dialectal Arabic.
Result: The model performs strongly on standard benchmarks, exceeding all previous approaches and demonstrating the effectiveness of weak supervision.
Conclusion: Weak supervision is a scalable, cost-efficient alternative that opens a path to better ASR systems in low-resource settings.
Abstract: Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance, exceeding all previous efforts in the field of Arabic ASR on the standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, this work paves the way for improved ASR systems in low-resource settings.
[140] Adapting a World Model for Trajectory Following in a 3D Game
Marko Tot,Shu Ishida,Abdelhak Lemkhenter,David Bignell,Pallavi Choudhury,Chris Lovett,Luis França,Matheus Ribeiro Furtado de Mendonça,Tarun Gupta,Darren Gehring,Sam Devlin,Sergio Valcarcel Macua,Raluca Georgescu
Main category: cs.AI
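TL;DR: The paper studies trajectory following with Inverse Dynamics Models (IDM) in a complex 3D game environment, compares different encoders and policy heads, and investigates future alignment strategies for handling distribution shift.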
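This entry's abstract reports two measures: the trajectory deviation distance and the first significant deviation point between the reference and the agent's trajectory. The sketch below shows one plausible way to compute both for time-aligned 2D paths. The step-by-step alignment, the mean as the aggregate, and the `threshold` parameter are assumptions for illustration; the paper may define these measures differently.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def deviation_metrics(reference: List[Point],
                      agent: List[Point],
                      threshold: float) -> Tuple[float, Optional[int]]:
    """Return (mean per-step deviation, index of first step above threshold).

    Trajectories are compared step by step over their common length; the
    second value is None when the agent never deviates by more than
    `threshold` (a hypothetical cut-off for a "significant" deviation).
    """
    steps = min(len(reference), len(agent))
    if steps == 0:
        raise ValueError("both trajectories must contain at least one point")
    distances = [
        math.hypot(reference[i][0] - agent[i][0],
                   reference[i][1] - agent[i][1])
        for i in range(steps)
    ]
    mean_deviation = sum(distances) / steps
    first_significant = next(
        (i for i, d in enumerate(distances) if d > threshold), None)
    return mean_deviation, first_significant
```

Reporting both numbers is useful because an agent can stay close on average while still leaving the reference path early, or the reverse.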
Details
Motivation: In complex environments such as 3D games, simple action replay cannot cope with distribution shift and stochasticity, so more robust methods are needed.
Method: The authors apply IDMs with different encoders and policy heads, test them in the game Bleeding Edge, and study future alignment strategies.
Result: In a diverse data setting, a GPT-style policy head with an encoder trained from scratch performs best; in the low data regime, a DINOv2 encoder with a GPT-style policy head performs best; when pre-trained on diverse data and fine-tuned for a specific behaviour, GPT-style and MLP-style policy heads perform comparably.
Conclusion: The best configuration depends on the setting and should be chosen for the scenario at hand.
Abstract: Imitation learning is a powerful tool for training agents by leveraging expert knowledge, and being able to replicate a given trajectory is an integral part of it. In complex environments, like modern 3D video games, distribution shift and stochasticity necessitate robust approaches beyond simple action replay. In this study, we apply Inverse Dynamics Models (IDM) with different encoders and policy heads to trajectory following in the modern 3D video game Bleeding Edge. Additionally, we investigate several future alignment strategies that address the distribution shift caused by the aleatoric uncertainty and imperfections of the agent. We measure both the trajectory deviation distance and the first significant deviation point between the reference and the agent's trajectory and show that the optimal configuration depends on the chosen setting. Our results show that in a diverse data setting, a GPT-style policy head with an encoder trained from scratch performs the best, DINOv2 encoder with the GPT-style policy head gives the best results in the low data regime, and both GPT-style and MLP-style policy heads had comparable results when pre-trained on a diverse setting and fine-tuned for a specific behaviour setting.
physics.med-ph [Back]
[141] FACT: Foundation Model for Assessing Cancer Tissue Margins with Mass Spectrometry
Mohammad Farahmand,Amoon Jamzad,Fahimeh Fooladgar,Laura Connolly,Martin Kaufmann,Kevin Yi Mi Ren,John Rudan,Doug McKay,Gabor Fichtinger,Parvin Mousavi
Main category: physics.med-ph
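TL;DR: The paper proposes FACT, a foundation model for classifying REIMS data in real-time surgical margin assessment; it addresses the scarcity of labeled data and improves classification performance.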
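This entry's abstract says the model is pretrained with a supervised contrastive approach based on triplet loss. For reference, the standard triplet loss is max(0, d(anchor, positive) - d(anchor, negative) + margin); the sketch below implements that textbook form with Euclidean distance. The margin value, the distance choice, and the function names are assumptions, and the paper's exact pretraining objective may differ.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Textbook triplet loss: pull same-class pairs together, push others apart.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an illustrative default, not the paper's).
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)
```

In a supervised setting, the positive would be a spectrum with the same tissue label as the anchor and the negative a spectrum with a different label.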
Details
Motivation: Accurately classifying tissue margins during cancer surgery is essential for complete tumor removal, but the scarcity of labeled REIMS data limits progress in real-time assessment.
Method: FACT adapts a foundation model originally designed for text-audio association, pretrains it with supervised contrastive learning based on triplet loss, and is compared with other models in an ablation study.
Result: FACT improves classification performance, reaching an AUROC of 82.4% ± 0.8 and outperforming self-supervised and semi-supervised baselines.
Conclusion: Foundation models pretrained with the proposed approach can classify REIMS data effectively, offering a viable solution for data-scarce clinical environments.
Abstract: Purpose: Accurately classifying tissue margins during cancer surgeries is crucial for ensuring complete tumor removal. Rapid Evaporative Ionization Mass Spectrometry (REIMS), a tool for real-time intraoperative margin assessment, generates spectra that require machine learning models to support clinical decision-making. However, the scarcity of labeled data in surgical contexts presents a significant challenge. This study is the first to develop a foundation model tailored specifically for REIMS data, addressing this limitation and advancing real-time surgical margin assessment. Methods: We propose FACT, a Foundation model for Assessing Cancer Tissue margins. FACT is an adaptation of a foundation model originally designed for text-audio association, pretrained using our proposed supervised contrastive approach based on triplet loss. An ablation study is performed to compare our proposed model against other models and pretraining methods. Results: Our proposed model significantly improves the classification performance, achieving state-of-the-art performance with an AUROC of $82.4\% \pm 0.8$. The results demonstrate the advantage of our proposed pretraining method and selected backbone over the self-supervised and semi-supervised baselines and alternative models. Conclusion: Our findings demonstrate that foundation models, adapted and pretrained using our novel approach, can effectively classify REIMS data even with limited labeled examples. This highlights the viability of foundation models for enhancing real-time surgical margin assessment, particularly in data-scarce clinical environments.
cs.DB [Back]
[142] Language and Knowledge Representation: A Stratified Approach
Mayukh Bagchi
Main category: cs.DB
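TL;DR: The thesis poses the problem of representation heterogeneity and proposes a stratified solution covering representation formalism, language representation, knowledge representation, knowledge sharing, and methodology, validated through two projects.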
Details
Motivation: Heterogeneity is an intrinsic property of representation: different observers encode the same target reality in a stratified manner using different concepts, languages, and knowledge.
Method: The thesis proposes a stratified solution consisting of a representation formalism, UKC-based language representation, teleontology-based knowledge representation, the LiveKnowledge catalog, and the kTelos methodology.
Result: Proofs of concept of the language and knowledge representations were developed for the DataScientia and JIDEP projects.
Conclusion: The thesis summarizes the solution and outlines future lines of research.
Abstract: The thesis proposes the problem of representation heterogeneity to emphasize the fact that heterogeneity is an intrinsic property of any representation, wherein different observers encode different representations of the same target reality in a stratified manner using different concepts, language and knowledge (as well as data). The thesis then advances a top-down solution approach to the above stratified problem of representation heterogeneity in terms of several solution components, namely: (i) a representation formalism stratified into concept level, language level, knowledge level and data level to accommodate representation heterogeneity, (ii) a top-down language representation using Universal Knowledge Core (UKC), UKC namespaces and domain languages to tackle the conceptual and language level heterogeneity, (iii) a top-down knowledge representation using the notions of language teleontology and knowledge teleontology to tackle the knowledge level heterogeneity, (iv) the usage and further development of the existing LiveKnowledge catalog for enforcing iterative reuse and sharing of language and knowledge representations, and (v) the kTelos methodology integrating the solution components above to iteratively generate the language and knowledge representations absolving representation heterogeneity. The thesis also includes proofs of concept of the language and knowledge representations developed for two international research projects, DataScientia (data catalogs) and JIDEP (materials modelling). Finally, the thesis concludes with future lines of research.
cs.RO [Back]
[143] Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning
Azizul Zahid,Jie Fan,Farong Wang,Ashton Dy,Sai Swaminathan,Fei Liu
Main category: cs.RO
TL;DR: The paper proposes a multimodal demonstration learning framework for aligning human and robot behavior in complex tasks.
Details
Motivation: Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, especially in human-robot collaboration and imitation learning in unstructured environments.
Method: The framework combines ResNet-based modeling of human intention with Perceiver Transformer-based robot action prediction, using RGB video and voxelized RGB-D spatial data.
Result: The human model reaches 71.67% accuracy and the robot model 71.8%.
Conclusion: The framework shows potential for aligning complex, multimodal human and robot behaviors.
Abstract: Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
[144] Probabilistic Task Parameterization of Tool-Tissue Interaction via Sparse Landmarks Tracking in Robotic Surgery
Yiting Wang,Yunxin Fan,Fei Liu
Main category: cs.RO
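TL;DR: The paper proposes a framework that combines sparse keypoint tracking with probabilistic modeling to model tool-tissue interaction in robotic surgery.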
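This entry's abstract describes building local frames from clustered tissue keypoints via PCA and expressing tool poses relative to those frames. The sketch below illustrates the idea in 2D image coordinates: it derives a frame (centroid plus principal axes) from a keypoint cluster and maps a tool point into it. Working in 2D, the closed-form axis angle, and the function names are assumptions for illustration; the paper's frames may be constructed differently.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Frame = Tuple[Point, Point, Point]  # (origin, major axis, minor axis)


def pca_frame(points: List[Point]) -> Frame:
    """Local frame of a 2D keypoint cluster: centroid and principal axes.

    The major axis follows the direction of largest variance. Its sign is
    arbitrary, as with any PCA axis.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two keypoints")
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Closed-form orientation of the leading eigenvector of a 2x2 covariance.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    major = (math.cos(theta), math.sin(theta))
    minor = (-math.sin(theta), math.cos(theta))
    return (cx, cy), major, minor


def to_local(point: Point, frame: Frame) -> Point:
    """Express an image-space point (e.g. a tool tip) in the local frame."""
    (cx, cy), major, minor = frame
    dx, dy = point[0] - cx, point[1] - cy
    return (dx * major[0] + dy * major[1],
            dx * minor[0] + dy * minor[1])
```

Because the frame moves and rotates with the tissue keypoints, a tool position expressed this way stays comparable across frames even when the tissue deforms, which is the property a task-parameterized model relies on.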
Details
Motivation: Traditional methods rely on manual annotation or rigid assumptions and lack flexibility.
Method: Sparse keypoint tracking and PCA are used to build dynamic local transformations, and a TP-GMM integrates data-driven observations with clinical knowledge.
Result: The framework predicts relative tool-tissue poses and improves visual understanding of robotic surgical motions directly from video data.
Conclusion: The framework combines data-driven observations with expert knowledge to model tool-tissue interaction in robotic surgery more precisely.
Abstract: Accurate modeling of tool-tissue interactions in robotic surgery requires precise tracking of deformable tissues and integration of surgical domain knowledge. Traditional methods rely on labor-intensive annotations or rigid assumptions, limiting flexibility. We propose a framework combining sparse keypoint tracking and probabilistic modeling that propagates expert-annotated landmarks across endoscopic frames, even with large tissue deformations. Clustered tissue keypoints enable dynamic local transformation construction via PCA, and tool poses, tracked similarly, are expressed relative to these frames. Embedding these into a Task-Parameterized Gaussian Mixture Model (TP-GMM) integrates data-driven observations with labeled clinical expertise, effectively predicting relative tool-tissue poses and enhancing visual understanding of robotic surgical motions directly from video data.
[145] DM-OSVP++: One-Shot View Planning Using 3D Diffusion Models for Active RGB-Based Object Reconstruction
Sicong Pan,Liren Jin,Xuying Huang,Cyrill Stachniss,Marija Popović,Maren Bennewitz
Main category: cs.RO
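TL;DR: The paper proposes a one-shot view planning method based on 3D diffusion models for active object reconstruction, using priors from the generative model to make reconstruction more efficient.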
Details
Motivation: Active object reconstruction is crucial in robotic applications, but conventional online replanning is time-consuming. The paper aims to make data collection more efficient through one-shot view planning.
Method: Using the generative power of 3D diffusion models, the method generates an approximate object model from initial multi-view images and integrates its geometric and textural distributions into view planning.
Result: Simulation and real-world experiments validate the method and show the benefit of 3D diffusion priors for one-shot view planning.
Conclusion: 3D diffusion models provide effective priors for one-shot view planning and improve the efficiency of active object reconstruction.
Abstract: Active object reconstruction is crucial for many robotic applications. A key aspect in these scenarios is generating object-specific view configurations to obtain informative measurements for reconstruction. One-shot view planning enables efficient data collection by predicting all views at once, eliminating the need for time-consuming online replanning. Our primary insight is to leverage the generative power of 3D diffusion models as valuable prior information. By conditioning on initial multi-view images, we exploit the priors from the 3D diffusion model to generate an approximate object model, serving as the foundation for our view planning. Our novel approach integrates the geometric and textural distributions of the object model into the view planning process, generating views that focus on the complex parts of the object to be reconstructed. We validate the proposed active object reconstruction system through both simulation and real-world experiments, demonstrating the effectiveness of using 3D diffusion priors for one-shot view planning.
[146] An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World
Xingwu Ji,Haochen Niu,Dexin Duan,Rendong Ying,Fei Wen,Peilin Liu
Main category: cs.RO
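TL;DR: The paper proposes a self-supervised online adaptation framework for monocular visual odometry, in which an online-updated depth estimation module improves the system's ability to adapt to diverse novel environments.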
Details
Motivation: Learning-based robotic navigation systems generalize poorly to open-world scenarios, which leads to unreliable depth and pose estimation.
Method: A monocular depth estimation network with lightweight refiner modules is designed, and a self-supervised learning objective is built from the visual odometry output and scene semantics, including a sparse depth densification module and a dynamic consistency enhancement module.
Result: The method's robustness and generalization are validated on multiple datasets and a robot platform, where it outperforms existing learning-based approaches.
Conclusion: The proposed method adapts quickly to diverse novel environments and improves the reliability and generalization of visual odometry.
Abstract: Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. Toward addressing this problem, this work aims to develop a visual odometry system that can adapt quickly to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban, in-house datasets and a robot platform. Code is publicly available at: https://github.com/jixingwu/SOL-SLAM.
cs.LG [Back]
[147] Watermarking Needs Input Repetition Masking
David Khachaturov,Robert Mullins,Ilia Shumailov,Sumanth Dathathri
Main category: cs.LG
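TL;DR: The paper examines whether humans and unwatermarked LLMs may unintentionally mimic properties of LLM-generated text, making existing detection measures unreliable. It finds that this behavior (termed "mimicry") does occur, which challenges current academic assumptions, and it recommends improvements to watermarking.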
Details
Motivation: As LLMs advance, their potential misuse (for example, spreading misinformation) raises concern. Existing countermeasures, such as machine learning-based detectors and watermarking, may fail if humans or LLMs mimic LLM-generated text.
Method: The study investigates the extent to which humans and LLMs mimic LLM-generated text in conversation, including the watermarking signal.
Result: Both humans and LLMs exhibit mimicry, reproducing the watermarking signal even in seemingly improbable settings.
Conclusion: The findings challenge current assumptions and suggest lowering the false-positive rate and seeding watermarking mechanisms with longer word sequences to improve long-term reliability.
Abstract: Recent advancements in Large Language Models (LLMs) have raised concerns over potential misuse, such as for spreading misinformation. In response, two countermeasures emerged: machine learning-based detectors that predict if text is synthetic, and LLM watermarking, which subtly marks generated text for identification and attribution. Meanwhile, humans are known to adjust language to their conversational partners both syntactically and lexically. By implication, it is possible that humans or unwatermarked LLMs could unintentionally mimic properties of LLM-generated text, making countermeasures unreliable. In this work we investigate the extent to which such conversational adaptation happens. We call the concept $\textit{mimicry}$ and demonstrate that both humans and LLMs end up mimicking, including the watermarking signal even in seemingly improbable settings. This challenges current academic assumptions and suggests that for long-term watermarking to be reliable, the likelihood of false positives needs to be significantly lower, while longer word sequences should be used for seeding watermarking mechanisms.
[148] SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models
Zeyu Dai,Shengcai Liu,Rui He,Jiahao Wu,Ning Lu,Wenqi Fan,Qing Li,Ke Tang
Main category: cs.LG
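TL;DR: The paper proposes SemDiff, which explores the semantic latent space of diffusion models and applies multi-attribute optimization to generate more natural and less perceptible unrestricted adversarial examples (UAEs), outperforming existing methods in attack success rate and imperceptibility.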
Details
Motivation: UAEs generated by existing methods lack naturalness and imperceptibility because they are optimized only in intermediate latent noise.
Method: SemDiff explores the semantic latent space of diffusion models and uses a multi-attribute optimization approach that secures attack success while preserving naturalness and imperceptibility.
Result: On CelebA-HQ, AFHQ, and ImageNet, SemDiff outperforms existing methods in attack success rate and imperceptibility, and it evades several defenses.
Conclusion: The UAEs generated by SemDiff are natural and semantically meaningful, confirming the method's effectiveness and the threat it poses.
Abstract: Unrestricted adversarial examples (UAEs) allow the attacker to create non-constrained adversarial examples without given clean samples, posing a severe threat to the safety of deep learning models. Recent works utilize diffusion models to generate UAEs. However, these UAEs often lack naturalness and imperceptibility due to simply optimizing in intermediate latent noises. In light of this, we propose SemDiff, a novel unrestricted adversarial attack that explores the semantic latent space of diffusion models for meaningful attributes, and devises a multi-attribute optimization approach to ensure attack success while maintaining the naturalness and imperceptibility of generated UAEs. We perform extensive experiments on four tasks on three high-resolution datasets, including CelebA-HQ, AFHQ and ImageNet. The results demonstrate that SemDiff outperforms state-of-the-art methods in terms of attack success rate and imperceptibility. The generated UAEs are natural and exhibit semantically meaningful changes, in accord with the attributes' weights. In addition, SemDiff is found capable of evading different defenses, which further validates its effectiveness and the threat it poses.
[149] Analysis of Pseudo-Labeling for Online Source-Free Universal Domain Adaptation
Pascal Schlachter,Jonathan Fuss,Bin Yang
Main category: cs.LG
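TL;DR: The paper analyzes the role of pseudo-labeling in online source-free universal domain adaptation (SF-UniDA), finds that pseudo-label quality matters more than quantity, and compares contrastive and cross-entropy losses.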
Details
Motivation: To address the domain shift between training and test data, particularly in practical scenarios where source data access is restricted and target data arrives as a continuous stream.
Method: Controlled experiments with simulated pseudo-labeling systematically analyze how pseudo-labels affect adaptation outcomes.
Result: A substantial gap exists between current methods and the adaptation upper bound reached with perfect pseudo-labels. A contrastive loss is effective at moderate pseudo-label accuracy, while a cross-entropy loss performs better at high accuracy.
Conclusion: Pseudo-labeling is critical in SF-UniDA, and future research should prioritize high-quality pseudo-labels.
Abstract: A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts, meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes has not been studied. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is in general more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field.
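The abstract's point about preferring fewer, high-confidence pseudo-labels can be illustrated with a small sketch. This is not the authors' procedure (their study uses simulated pseudo-labeling); it assumes, purely for illustration, that confidence is the maximum softmax probability and that a fixed threshold decides which samples are kept. The names `select_pseudo_labels` and `threshold` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(batch_logits, threshold=0.9):
    """Keep only samples whose top softmax probability reaches `threshold`.

    Returns a list of (sample_index, predicted_class, confidence).
    Assumption: max-softmax confidence stands in for pseudo-label quality.
    """
    selected = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence), confidence))
    return selected


# Toy batch of three samples with three classes each (made-up logits).
batch = [[4.0, 0.5, 0.1], [1.0, 0.9, 0.8], [0.2, 5.0, 0.3]]
kept = select_pseudo_labels(batch, threshold=0.9)
for index, label, confidence in kept:
    print(f"sample {index}: pseudo-label {label} (confidence {confidence:.3f})")
```

Raising `threshold` trades quantity for accuracy: fewer samples receive a pseudo-label, but those that do are the ones the model is most certain about. The ambiguous second sample is dropped here.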
Our code is available at https://github.com/pascalschlachter/PLAnalysis.
cs.IR [Back]
[150] Rethinking LLM-Based Recommendations: A Query Generation-Based, Training-Free Approach
Donghee Han,Hwanjun Song,Mun Yong Yi
Main category: cs.IR
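TL;DR: Proposes an LLM-based Query-to-Recommendation approach that addresses the efficiency, order-sensitivity, and evaluation problems of existing LLM-based recommenders and substantially improves recommendation performance.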
Details
Motivation: Existing LLM-based recommendation methods are inefficient on large candidate pools, sensitive to item order within prompts, and evaluated unrealistically.
Method: An LLM generates personalized queries that retrieve relevant items directly from the entire candidate pool, with no candidate pre-selection.
Result: Experiments on three datasets show improvements of up to 57 percent, 31 percent on average, along with strong zero-shot performance.
Conclusion: The method can be integrated into existing recommendation systems without additional training and significantly improves performance and diversity.
Abstract: Existing large language model (LLM)-based recommendation methods face several challenges, including inefficiency in handling large candidate pools, sensitivity to item order within prompts ("lost in the middle" phenomenon), poor scalability, and unrealistic evaluation due to random negative sampling. To address these issues, we propose a Query-to-Recommendation approach that leverages LLMs to generate personalized queries for retrieving relevant items from the entire candidate pool, eliminating the need for candidate pre-selection. This method can be integrated into an ID-based recommendation system without additional training, enhances recommendation performance and diversity through LLMs' world knowledge, and performs well even for less popular item groups. Experiments on three datasets show up to 57 percent improvement, with an average gain of 31 percent, demonstrating strong zero-shot performance and further gains when ensembled with existing models.
[151] PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage
Wenyi Zhang,Ju Jia,Xiaojun Jia,Yihao Huang,Xinfeng Li,Cong Wu,Lina Wang
Main category: cs.IR
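TL;DR: Proposes PATFinger, a new fingerprinting scheme for verifying multimodal dataset usage that captures dataset characteristics through a global optimal perturbation and adaptive prompts, outperforming existing methods by 30%.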
Details
Motivation: Current dataset-usage verification methods are mostly single-modal. Intrusive methods degrade model accuracy, and non-intrusive methods lack stability, so a better solution is needed.
Method: PATFinger uses a global optimal perturbation (GOP) and adaptive prompts to capture dataset characteristics, avoids forcing the model to learn triggers, and verifies dataset usage through a surrogate model.
Result: Experiments show that PATFinger effectively guards against unauthorized dataset usage across multimodal retrieval architectures, outperforming existing baselines by 30%.
Conclusion: PATFinger offers an efficient and stable non-intrusive solution for verifying multimodal dataset usage.
Abstract: Multimodal datasets can be leveraged to pre-train large-scale vision-language models by providing cross-modal semantics. Current endeavors for determining the usage of datasets mainly focus on single-modal dataset ownership verification through intrusive methods and non-intrusive techniques, while cross-modal approaches remain under-explored. Intrusive methods can adapt to multimodal datasets but degrade model accuracy, while non-intrusive methods rely on label-driven decision boundaries that fail to guarantee stable behaviors for verification. To address these issues, we propose a novel prompt-adapted transferable fingerprinting scheme from a training-free perspective, called PATFinger, which incorporates the global optimal perturbation (GOP) and the adaptive prompts to capture dataset-specific distribution characteristics. Our scheme utilizes inherent dataset attributes as fingerprints instead of compelling the model to learn triggers. The GOP is derived from the sample distribution to maximize embedding drifts between different modalities. Subsequently, our PATFinger re-aligns the adaptive prompt with GOP samples to capture the cross-modal interactions on the carefully crafted surrogate model. This allows the dataset owner to check the usage of datasets by observing specific prediction behaviors linked to the PATFinger during retrieval queries. Extensive experiments demonstrate the effectiveness of our scheme against unauthorized multimodal dataset usage on various cross-modal retrieval architectures, outperforming state-of-the-art baselines by 30%.
cs.SD [Back]
[152] Dysarthria Normalization via Local Lie Group Transformations for Robust ASR
Mikhail Osipov
Main category: cs.SD
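TL;DR: Proposes a geometry-driven normalization method for dysarthric speech that corrects spectrograms via local Lie group transformations, significantly improving speech recognition performance.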