cs.CV

[1] Universal Representations for Classification-enhanced Lossy Compression

Nam Nguyen

Main category: cs.CV

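TL;DR: The paper explores whether a single universal encoder can serve multiple decoding objectives. The encoder performs close to specialized encoders on perceptual image compression, but incurs a significant distortion penalty under the classification-distortion tradeoff.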

Motivation: Traditional compression algorithms consider only the tradeoff between compression rate and reconstruction distortion, whereas newer frameworks such as RDP and RDC add perceptual quality and classification accuracy as extra dimensions. The paper aims to develop a universal encoder that avoids retraining the encoder for each specific tradeoff point.

Method: Develop a universal encoder that serves multiple decoding objectives (such as distortion and classification constraints) and validate it experimentally on the MNIST dataset.

Result: The universal encoder performs close to specialized encoders on perceptual compression, but reusing an encoder under the classification-distortion tradeoff leads to significant distortion.

Conclusion: Universal encoders work well for some tasks but should be used with caution under certain tradeoffs to avoid performance degradation.

Abstract: In lossy compression, the classical tradeoff between compression rate and reconstruction distortion has traditionally guided algorithm design. However, Blau and Michaeli [5] introduced a generalized framework, known as the rate-distortion-perception (RDP) function, incorporating perceptual quality as an additional dimension of evaluation. More recently, the rate-distortion-classification (RDC) function was investigated in [19], evaluating compression performance by considering classification accuracy alongside distortion. In this paper, we explore universal representations, where a single encoder is developed to achieve multiple decoding objectives across various distortion and classification (or perception) constraints. This universality avoids retraining encoders for each specific operating point within these tradeoffs. Our experimental validation on the MNIST dataset indicates that a universal encoder incurs only minimal performance degradation compared to individually optimized encoders for perceptual image compression tasks, aligning with prior results from [23]. Nonetheless, we also identify that in the RDC setting, reusing an encoder optimized for one specific classification-distortion tradeoff leads to a significant distortion penalty when applied to alternative points.

[2] Intelligent road crack detection and analysis based on improved YOLOv8

Haomin Zuo,Zhengyang Li,Jiangchuan Gong,Zhen Tian

Main category: cs.CV

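TL;DR: The paper proposes an intelligent road crack detection and analysis system based on an improved YOLOv8 deep learning framework. A segmentation model trained on 4029 images identifies and segments crack regions efficiently and accurately, and reports the maximum and minimum crack widths and their locations. Experiments show that adding ECA and CBAM attention mechanisms substantially improves detection accuracy and efficiency.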

Motivation: As urbanization accelerates and traffic flow grows, pavement distress is becoming more pronounced and seriously threatens road safety and service life. Traditional manual inspection is inefficient and costly, so an intelligent solution is needed.

Method: Train a segmentation model on 4029 images using an improved YOLOv8 deep learning framework with ECA and CBAM attention mechanisms, enabling efficient crack recognition and segmentation together with width and location analysis.

Result: Experiments show that introducing the ECA and CBAM attention mechanisms substantially improves the model's detection accuracy and efficiency.

Conclusion: The system offers a novel, efficient, and accurate solution for road maintenance and safety monitoring.

Abstract: As urbanization speeds up and traffic flow increases, the issue of pavement distress is becoming increasingly pronounced, posing a severe threat to road safety and service life. Traditional methods of pothole detection rely on manual inspection, which is not only inefficient but also costly. This paper proposes an intelligent road crack detection and analysis system based on an enhanced YOLOv8 deep learning framework. A target segmentation model has been developed by training on 4029 images and can efficiently and accurately recognize and segment crack regions in roads. The model also analyzes the segmented regions to precisely calculate the maximum and minimum widths of cracks and their exact locations. Experimental results indicate that the incorporation of ECA and CBAM attention mechanisms substantially enhances the model's detection accuracy and efficiency, offering a novel solution for road maintenance and safety monitoring.
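The abstract does not say how widths are measured on the segmented regions. As a rough illustration only, the sketch below assumes a binary segmentation mask of a roughly vertical crack and counts crack pixels per image row; the function name `crack_widths` and the row-wise definition of width are assumptions, not the paper's procedure.

```python
def crack_widths(mask):
    """Return min/max crack width in pixels and the rows where they occur.

    mask: list of rows, each a list of 0/1 values (1 = crack pixel).
    Width is approximated as the number of crack pixels in a row, which
    is only meaningful for a roughly vertical crack (an assumption).
    Rows that contain no crack pixels are ignored.
    """
    widths = [(sum(row), r) for r, row in enumerate(mask) if sum(row) > 0]
    if not widths:
        return None
    min_width, min_row = min(widths)  # ties resolve to the smallest row index
    max_width, max_row = max(widths)  # ties resolve to the largest row index
    return {
        "min_width": min_width, "min_row": min_row,
        "max_width": max_width, "max_row": max_row,
    }


# Toy 5x4 mask: the crack widens from 1 to 3 pixels, then narrows again.
example_mask = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
example_result = crack_widths(example_mask)
```

Converting pixel counts to physical units would additionally need the ground sampling distance of the camera, which the abstract does not give.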

[3] Mirror: Multimodal Cognitive Reframing Therapy for Rolling with Resistance

Subin Kim,Hoonrae Kim,Jihyun Lee,Yejin Jeon,Gary Geunbae Lee

Main category: cs.CV

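TL;DR: Proposes a multimodal approach that uses nonverbal cues to improve an AI therapist's ability to handle client resistance, outperforming existing text-based methods.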

Motivation: Existing text-based cognitive behavioral therapy (CBT) models struggle with client resistance, which undermines therapeutic outcomes.

Method: Introduce the multimodal dataset Mirror and train Vision-Language Models (VLMs) to analyze facial cues and generate empathetic responses.

Result: Mirror significantly improves the AI therapist's ability to handle resistance and outperforms text-only methods.

Conclusion: A multimodal approach can strengthen the therapeutic alliance and improve the effectiveness of AI psychotherapy.

Abstract: Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, allowing the AI therapist to better align its responses with the client's negative emotional state. Specifically, we introduce Multimodal Interactive Rolling with Resistance (Mirror), a novel synthetic dataset that pairs client statements with corresponding facial images. Using this dataset, we train baseline Vision-Language Models (VLMs) that can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage resistance. They are then evaluated in terms of both the therapist's counseling skills and the strength of the therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist's ability to handle resistance, outperforming existing text-based CBT approaches.

[4] Wavelet-based Variational Autoencoders for High-Resolution Image Generation

Andrew Kiruluta

Main category: cs.CV

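TL;DR: Proposes a wavelet-based variational autoencoder (Wavelet-VAE) whose latent space is built from multi-scale Haar wavelet coefficients, improving the sharpness and high-frequency detail of generated images.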

Motivation: Conventional VAEs assume an isotropic Gaussian latent space, so the images they generate tend to be blurry and miss high-frequency detail.

Method: Build the latent space from multi-scale Haar wavelet coefficients, introduce a learnable noise parameter, reformulate the reparameterization trick, and integrate wavelet sparsity into the training objective.

Result: Experiments on CIFAR-10 and other high-resolution datasets show that Wavelet-VAE produces images with higher visual fidelity and richer detail.

Conclusion: Wavelet-VAE markedly improves generated image quality; future work can further explore wavelets in generative models.

Abstract: Variational Autoencoders (VAEs) are powerful generative models capable of learning compact latent representations. However, conventional VAEs often generate relatively blurry images due to their assumption of an isotropic Gaussian latent space and constraints in capturing high-frequency details. In this paper, we explore a novel wavelet-based approach (Wavelet-VAE) in which the latent space is constructed using multi-scale Haar wavelet coefficients. We propose a comprehensive method to encode the image features into multi-scale detail and approximation coefficients and introduce a learnable noise parameter to maintain stochasticity. We thoroughly discuss how to reformulate the reparameterization trick, address the KL divergence term, and integrate wavelet sparsity principles into the training objective. Our experimental evaluation on CIFAR-10 and other high-resolution datasets demonstrates that the Wavelet-VAE improves visual fidelity and recovers higher-resolution details compared to conventional VAEs. We conclude with a discussion of advantages, potential limitations, and future research directions for wavelet-based generative modeling.
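To make "multi-scale detail and approximation coefficients" concrete, here is a minimal pure-Python sketch of the standard orthonormal 1D Haar transform. It is only the textbook decomposition, not the Wavelet-VAE encoder itself; the function names are illustrative, and the input length is assumed to be divisible by 2 raised to the number of levels.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multiscale(signal, levels):
    """Repeatedly decompose the approximation; details[0] is the finest scale."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_multiscale by undoing the levels from coarsest to finest."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.append((a + d) * _S)
            rebuilt.append((a - d) * _S)
        current = rebuilt
    return current


example_signal = [4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0]
example_approx, example_details = haar_multiscale(example_signal, levels=3)
example_rebuilt = haar_inverse(example_approx, example_details)
```

Flat regions of the input (such as the pair `1.0, 1.0`) yield zero detail coefficients, which is the kind of sparsity a wavelet-domain training objective can exploit.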

[5] SSTAF: Spatial-Spectral-Temporal Attention Fusion Transformer for Motor Imagery Classification

Ummay Maria Muna,Md. Mehedi Hasan Shawon,Md Jobayer,Sumaiya Akter,Saifur Rahman Sabuj

Main category: cs.CV

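TL;DR: The paper proposes a novel SSTAF Transformer that addresses the non-stationarity of EEG signals and the difficulty of cross-subject classification, outperforming traditional methods on motor imagery classification.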

Motivation: The non-stationarity of EEG signals and inter-subject variability make cross-subject classification models hard to develop, so a more robust approach is needed.

Method: Design the SSTAF Transformer, which combines spectral, spatial, and temporal attention mechanisms and uses the short-time Fourier transform to extract time-frequency features.

Result: The model reaches accuracies of 76.83% and 68.30% on two public datasets, outperforming traditional CNNs and existing Transformer methods.

Conclusion: The SSTAF Transformer performs well on EEG motor imagery classification and offers new ideas for neurorehabilitation and assistive technologies.

Abstract: Brain-computer interfaces (BCI) in electroencephalography (EEG)-based motor imagery classification offer promising solutions in neurorehabilitation and assistive technologies by enabling communication between the brain and external devices. However, the non-stationary nature of EEG signals and significant inter-subject variability pose substantial challenges for developing robust cross-subject classification models. This paper introduces a novel Spatial-Spectral-Temporal Attention Fusion (SSTAF) Transformer specifically designed for upper-limb motor imagery classification. Our architecture consists of a spectral transformer and a spatial transformer, followed by a transformer block and a classifier network. Each module is integrated with attention mechanisms that dynamically attend to the most discriminative patterns across multiple domains, such as spectral frequencies, spatial electrode locations, and temporal dynamics. The short-time Fourier transform is incorporated to extract features in the time-frequency domain, making it easier for the model to distinguish features. We evaluated our SSTAF Transformer model on two publicly available datasets, the EEGMMIDB dataset and BCI Competition IV-2a. The SSTAF Transformer achieves accuracies of 76.83% and 68.30% on these datasets, respectively, outperforming traditional CNN-based architectures and a few existing transformer-based approaches.
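The short-time Fourier transform step can be illustrated with a small pure-Python sketch: slide a Hann window over one channel and take DFT magnitudes per frame. The window size, hop, sampling rate, and test tone below are arbitrary assumptions for illustration, not the paper's settings, and a real pipeline would more likely use an FFT library.

```python
import cmath
import math


def stft_magnitude(signal, window_size, hop):
    """Return a list of frames; each frame holds DFT magnitudes for bins 0..window_size//2."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / window_size)
              for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(window_size)]
        spectrum = []
        for k in range(window_size // 2 + 1):
            # Direct O(N^2) DFT; fine for a sketch, too slow for real EEG data.
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n in range(window_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical example: a 10 Hz tone sampled at 64 Hz for 2 seconds.
SAMPLE_RATE = 64
tone = [math.sin(2.0 * math.pi * 10.0 * n / SAMPLE_RATE) for n in range(128)]
example_frames = stft_magnitude(tone, window_size=32, hop=16)
# Bin spacing is SAMPLE_RATE / window_size = 2 Hz, so 10 Hz falls in bin 5.
example_peaks = [max(range(len(f)), key=f.__getitem__) for f in example_frames]
```

Each frame is one column of a spectrogram; stacking frames over time gives the time-frequency representation that attention layers can then operate on.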

[6] ICAS: IP Adapter and ControlNet-based Attention Structure for Multi-Subject Style Transfer Optimization

Fuwei Liu

Main category: cs.CV

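TL;DR: ICAS is an efficient and controllable multi-subject style transfer framework built on IP-Adapter and ControlNet, addressing the semantic fidelity and computational cost problems of existing methods.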

Motivation: Multi-subject style transfer is hard because style attributes are ambiguously defined and must be applied consistently across subjects. Existing methods rely on expensive computation or large-scale datasets and struggle to preserve semantic fidelity.

Method: ICAS adaptively fine-tunes the content injection branch of a pre-trained diffusion model and combines IP-Adapter with ControlNet to achieve efficient style injection and structural control.

Result: Experiments show that ICAS performs strongly in structure preservation, style consistency, and inference efficiency.

Conclusion: ICAS provides an efficient and controllable new paradigm for multi-subject style transfer that is suitable for real-world applications.

Abstract: Generating multi-subject stylized images remains a significant challenge due to the ambiguity in defining style attributes (e.g., color, texture, atmosphere, and structure) and the difficulty in consistently applying them across multiple subjects. Although recent diffusion-based text-to-image models have achieved remarkable progress, existing methods typically rely on computationally expensive inversion procedures or large-scale stylized datasets. Moreover, these methods often struggle with maintaining multi-subject semantic fidelity and are limited by high inference costs. To address these limitations, we propose ICAS (IP-Adapter and ControlNet-based Attention Structure), a novel framework for efficient and controllable multi-subject style transfer. Instead of full-model tuning, ICAS adaptively fine-tunes only the content injection branch of a pre-trained diffusion model, thereby preserving identity-specific semantics while enhancing style controllability. By combining IP-Adapter for adaptive style injection with ControlNet for structural conditioning, our framework ensures faithful global layout preservation alongside accurate local style synthesis. Furthermore, ICAS introduces a cyclic multi-subject content embedding mechanism, which enables effective style transfer under limited-data settings without the need for extensive stylized corpora. Extensive experiments show that ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency, establishing a new paradigm for multi-subject style transfer in real-world applications.

[7] WildFireCan-MMD: A Multimodal dataset for Classification of User-generated Content During Wildfires in Canada

Braeden Sherritt,Isar Nejadgholi,Marzieh Amini

Main category: cs.CV

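TL;DR: WildFireCan-MMD is a multimodal dataset for extracting wildfire-related information from social media. The study shows that custom-trained models outperform zero-shot prompting.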

Motivation: Traditional data sources are slow and costly during wildfires. Social media provides real-time information, but extracting relevant insights from it is difficult.

Method: Build the WildFireCan-MMD dataset and evaluate both Vision Language Models and custom-trained classifiers on it.

Result: Custom-trained models outperform zero-shot prompting by up to 23%.

Conclusion: Tailored datasets and task-specific training remain important, and such datasets should be localized to meet the needs of different regions.

Abstract: Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across 13 key themes. Evaluating both Vision Language Models and custom-trained classifiers, we show that while zero-shot prompting offers quick deployment, even simple trained models outperform it by up to 23% when labelled data is available. Our findings highlight the enduring importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.

[8] Dynamic Memory-enhanced Transformer for Hyperspectral Image Classification

Muhammad Ahmad,Manuel Mazzara,Salvatore Distefano,Adil Mehmood Khan

Main category: cs.CV

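TL;DR: MemFormer is a lightweight, memory-enhanced Transformer for hyperspectral image classification, where spatial-spectral correlations are intricate. It improves classification accuracy through a dynamic memory module and a spatial-spectral positional encoding.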

Motivation: Existing Transformer models capture long-range dependencies but suffer from information redundancy and inefficient attention, which limits their ability to model the fine-grained relationships needed for hyperspectral image classification.

Method: Propose MemFormer, which uses a memory-enhanced multi-head attention mechanism and a dynamic memory enrichment strategy, combined with a spatial-spectral positional encoding (SSPE), to improve feature extraction.

Result: Experiments on benchmark datasets show that MemFormer achieves higher classification accuracy than existing state-of-the-art methods.

Conclusion: With its dynamic memory module and SSPE, MemFormer improves hyperspectral image classification while keeping a lightweight design.

Abstract: Hyperspectral image (HSI) classification remains a challenging task due to the intricate spatial-spectral correlations. Existing transformer models excel in capturing long-range dependencies but often suffer from information redundancy and attention inefficiencies, limiting their ability to model fine-grained relationships crucial for HSI classification. To overcome these limitations, this work proposes MemFormer, a lightweight and memory-enhanced transformer. MemFormer introduces a memory-enhanced multi-head attention mechanism that iteratively refines a dynamic memory module, enhancing feature extraction while reducing redundancy across layers. Additionally, a dynamic memory enrichment strategy progressively captures complex spatial and spectral dependencies, leading to more expressive feature representations. To further improve structural consistency, we incorporate a spatial-spectral positional encoding (SSPE) tailored for HSI data, ensuring continuity without the computational burden of convolution-based approaches. Extensive experiments on benchmark datasets demonstrate that MemFormer achieves superior classification accuracy, outperforming state-of-the-art methods.

[9] ChartQA-X: Generating Explanations for Charts

Shamanthak Hegde,Pooyan Fazli,Hasti Seifi

Main category: cs.CV

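TL;DR: ChartQA-X is a dataset covering a variety of chart types that provides questions, answers, and detailed explanations. Fine-tuning models on it improves performance on both explanation generation and question answering.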

Motivation: Address the challenge of answering questions about chart images while also providing explanations, so that intelligent agents can convey complex information more effectively.

Method: Build the ChartQA-X dataset of 28,299 questions, answers, and explanations, generating the explanations by prompting six models and selecting the best responses.

Result: Fine-tuned models perform strongly on explanation generation and show improved question-answering accuracy on new datasets.

Conclusion: Combining answers with explanatory narratives improves information delivery, user understanding, and trust in the generated responses.

Abstract: The ability to interpret and explain complex information from visual data in charts is crucial for data-driven decision-making. In this work, we address the challenge of providing explanations alongside answering questions about chart images. We present ChartQA-X, a comprehensive dataset comprising various chart types with 28,299 contextually relevant questions, answers, and detailed explanations. These explanations are generated by prompting six different models and selecting the best responses based on metrics such as faithfulness, informativeness, coherence, and perplexity. Our experiments show that models fine-tuned on our dataset for explanation generation achieve superior performance across various metrics and demonstrate improved accuracy in question-answering tasks on new datasets. By integrating answers with explanatory narratives, our approach enhances the ability of intelligent agents to convey complex information effectively, improve user understanding, and foster trust in the generated responses.

[10] LIFT+: Lightweight Fine-Tuning for Long-Tail Learning

Jiang-Xin Shi,Tong Wei,Yu-Feng Li

Main category: cs.CV

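TL;DR: The paper studies how fine-tuning strategies affect long-tail learning, finds that existing approaches misuse fine-tuning methods, and proposes LIFT+, a lightweight fine-tuning framework that markedly improves both efficiency and accuracy.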

Motivation: Study how fine-tuning strategies affect long-tail learning performance, expose the shortcomings of existing methods, and propose an improvement.

Method: Propose the LIFT+ framework, which combines semantic-aware initialization, minimalist data augmentation, and test-time ensembling to optimize for consistent class-conditional distributions.

Result: LIFT+ significantly reduces training epochs and the number of learned parameters while surpassing existing state-of-the-art methods.

Conclusion: LIFT+ offers an efficient and accurate solution for long-tail learning and advances the adaptation of foundation models.

Abstract: The fine-tuning paradigm has emerged as a prominent approach for addressing long-tail learning tasks in the era of foundation models. However, the impact of fine-tuning strategies on long-tail learning performance remains unexplored. In this work, we disclose that existing paradigms exhibit a profound misuse of fine-tuning methods, leaving significant room for improvement in both efficiency and accuracy. Specifically, we reveal that heavy fine-tuning (fine-tuning a large proportion of model parameters) can lead to non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning demonstrates superior effectiveness. Through comprehensive theoretical and empirical validation, we identify this phenomenon as stemming from inconsistent class conditional distributions induced by heavy fine-tuning. Building on this insight, we propose LIFT+, an innovative lightweight fine-tuning framework to optimize consistent class conditions. Furthermore, LIFT+ incorporates semantic-aware initialization, minimalist data augmentation, and test-time ensembling to enhance adaptation and generalization of foundation models. Our framework provides an efficient and accurate pipeline that facilitates fast convergence and model compactness. Extensive experiments demonstrate that LIFT+ significantly reduces both training epochs (from ~100 to ≤15) and learned parameters (less than 1%), while surpassing state-of-the-art approaches by a considerable margin. The source code is available at https://github.com/shijxcs/LIFT-plus.

[11] Weak Cube R-CNN: Weakly Supervised 3D Detection using only 2D Bounding Boxes

Andreas Lau Hansen,Lukas Wanzeck,Dim P. Papadopoulos

Main category: cs.CV

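TL;DR: Proposes Weak Cube R-CNN, a weakly supervised monocular 3D object detection method that needs only 2D annotations. It exploits the relationship between 2D projections of 3D cubes to reduce reliance on expensive 3D labels.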

Motivation: 3D object detection usually requires large amounts of labeled data, which is costly. The paper aims to reduce reliance on 3D annotations through weak supervision, achieving 3D detection from 2D annotations alone.

Method: Use pre-trained 2D foundation models to estimate depth and orientation as pseudo-ground truth, and design loss functions that avoid direct use of 3D labels, implicitly transferring knowledge from the 2D models to the 3D detection task.

Result: On the SUN RGB-D dataset the method outperforms the Cube R-CNN baseline, supporting its effectiveness.

Conclusion: The method is a workable way to reduce the need for 3D annotation. Although it is not precise at the centimetre level, it lays a foundation for further research.

Abstract: Monocular 3D object detection is an essential task in computer vision, and it has several applications in robotics and virtual reality. However, 3D object detectors are typically trained in a fully supervised way, relying extensively on 3D labeled data, which is labor-intensive and costly to annotate. This work focuses on weakly-supervised 3D detection to reduce data needs using a monocular method that leverages a single-camera system over expensive LiDAR sensors or multi-camera setups. We propose a general model, Weak Cube R-CNN, which can predict objects in 3D at inference time while requiring only 2D box annotations for training, by exploiting the relationship between 2D projections of 3D cubes. Our proposed method utilizes pre-trained frozen foundation 2D models to estimate depth and orientation information on a training set. We use these estimated values as pseudo-ground truths during training. We design loss functions that avoid 3D labels by incorporating information from the external models into the loss. In this way, we aim to implicitly transfer knowledge from these large foundation 2D models without having access to 3D bounding box annotations. Experimental results on the SUN RGB-D dataset show increased accuracy compared to an annotation-time-equalized Cube R-CNN baseline. While not precise for centimetre-level measurements, this method provides a strong foundation for further research.

[12] SAR Object Detection with Self-Supervised Pretraining and Curriculum-Aware Sampling

Yasin Almalioglu,Andrzej Kucik,Geoffrey French,Dafni Antotsiou,Alexander Adam,Cedric Archambeau

Main category: cs.CV

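TL;DR: TRANSAR is a self-supervised vision Transformer model for small object detection in satellite SAR imagery. It improves performance through masked image pre-training and an adaptive sampling scheduler.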

Motivation: Object detection in SAR imagery has great potential in areas such as disaster response, but the complexity of the data and the scarcity of annotations hold it back, especially for small objects.

Method: Propose TRANSAR, which combines masked image pre-training with auxiliary semantic segmentation and introduces an adaptive sampling scheduler to handle class imbalance.

Result: On benchmark SAR datasets, TRANSAR outperforms conventional supervised models (such as DeepLabv3) and self-supervised models (such as DPT).

Conclusion: Through self-supervised learning and a dynamically adjusted sampling strategy, TRANSAR markedly improves the detection of small objects in SAR imagery.

Abstract: Object detection in satellite-borne Synthetic Aperture Radar (SAR) imagery holds immense potential in tasks such as urban monitoring and disaster response. However, the inherent complexities of SAR data and the scarcity of annotations present significant challenges in the advancement of object detection in this domain. Notably, the detection of small objects in satellite-borne SAR images poses a particularly intricate problem, because of the technology's relatively low spatial resolution and inherent noise. Furthermore, the lack of large labelled SAR datasets hinders the development of supervised deep learning-based object detection models. In this paper, we introduce TRANSAR, a novel self-supervised end-to-end vision transformer-based SAR object detection model that incorporates masked image pre-training on an unlabeled SAR image dataset that spans more than 25,700 km² of ground area. Unlike traditional object detection formulations, our approach capitalises on auxiliary binary semantic segmentation, designed to segregate objects of interest, especially the smaller ones, from the background during the post-tuning. In addition, to address the innate class imbalance due to the disproportion of the object to the image size, we introduce an adaptive sampling scheduler that dynamically adjusts the target class distribution during training based on curriculum learning and model feedback. This approach allows us to outperform conventional supervised architectures such as DeepLabv3 or UNet, and state-of-the-art self-supervised learning-based architectures such as DPT, SegFormer or UperNet, as shown by extensive evaluations on benchmark SAR datasets.

[13] VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture

Long Li,Jiajia Li,Dong Chen,Lina Pu,Haibo Yao,Yanbo Huang

Main category: cs.CV

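TL;DR: VLLFL is a lightweight federated learning framework based on a vision-language model, designed to address the data privacy and communication overhead problems of object detection in smart agriculture.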

Motivation: In modern smart agriculture, object detection is essential for automation and precision farming, but large-scale data collection and privacy concerns are obstacles.

Method: Combine the generalization ability of a vision-language model with the privacy-preserving nature of federated learning, training a compact prompt generator to boost performance.

Result: Experiments show that VLLFL improves vision-language model performance by 14.53% while reducing communication overhead by 99.3%.

Conclusion: VLLFL provides an efficient, scalable, and privacy-preserving solution for agricultural applications.

Abstract: In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and monitoring of resources. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework. It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL achieves a 14.53% improvement in the performance of the VLM while reducing communication overhead by 99.3%. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.

[14] POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation

Evans Xu Han,Alice Qian Zhang,Hong Shen,Haiyi Zhu,Paul Pu Liang,Jane Hsieh

Main category: cs.CV

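TL;DR: POET is a real-time interactive tool that automatically discovers dimensions of homogeneity in text-to-image generative models, expands those dimensions to diversify outputs, and learns from user feedback to personalize the expansions, improving diversity and user satisfaction in creative tasks.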

Motivation: Existing large-scale text-to-image systems produce fairly conventional output, which limits creative exploration, and their interaction methods are not beginner-friendly. More variation and personalization are needed to serve diverse users.

Method: POET automatically discovers dimensions of homogeneity, expands the output space, and learns from user feedback to deliver diverse and personalized image generation.

Result: The evaluation shows that POET produces more diverse results, helps users reach satisfaction in fewer prompts, and prompts them to consider a wider range of possibilities during co-creation.

Conclusion: POET shows how future text-to-image tools could use interaction techniques to support pluralistic values and user needs, particularly during creative ideation.

Abstract: State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks, offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET's ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.

[15] BeetleVerse: A study on taxonomic classification of ground beetles

S M Rayeed,Alyson East,Samuel Stevens,Sydne Record,Charles V Stewart

Main category: cs.CV

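TL;DR: The paper evaluates 12 vision models on ground beetle classification, finds that a Vision and Language Transformer combined with an MLP head performs best, and examines sample efficiency and domain adaptation.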

Motivation: Ground beetles are an important biodiversity indicator, but traditional classification depends on experts and is inefficient, which limits wider use.

Method: Evaluate the classification performance of 12 vision models on four datasets covering 230 genera and 1769 species, and study sample efficiency and domain adaptation.

Result: The best model reaches 97% accuracy at genus level and 94% at species level. Training data requirements can be cut by up to 50%, but adapting from lab images to field images remains challenging.

Conclusion: The study lays a foundation for large-scale automated beetle classification and advances sample-efficient learning and cross-domain adaptation.

Abstract: Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. However, they are currently an underutilized resource due to the manual effort required by taxonomic experts to perform challenging species differentiations based on subtle morphological differences, precluding widespread applications. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets spanning over 230 genera and 1769 species, with images ranging from controlled laboratory settings to challenging field-collected (in-situ) photographs. We further explore taxonomic classification in two important real-world contexts: sample efficiency and domain adaptation. Our results show that the Vision and Language Transformer combined with an MLP head is the best performing model, with 97% accuracy at genus and 94% at species level. Sample efficiency analysis shows that we can reduce train data requirements by up to 50% with minimal compromise in performance. The domain adaptation experiments reveal significant challenges when transferring models from lab to in-situ images, highlighting a critical domain gap. Overall, our study lays a foundation for large-scale automated taxonomic classification of beetles, and beyond that, advances sample-efficient learning and cross-domain adaptation for diverse long-tailed ecological datasets.

[16] Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Shashank Shriram,Srinivasa Perisetla,Aryan Keskar,Harsha Krishnaswamy,Tonko Emil Westerhof Bossen,Andreas Møgelmose,Ross Greer

Main category: cs.CV

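TL;DR: Proposes a multimodal approach that combines vision-language reasoning with zero-shot object detection to improve hazard identification and explanation in autonomous driving.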

Motivation: Existing models rely on predefined categories and struggle with unpredictable anomalous hazards, so hazard detection needs to improve.

Method: Integrate a Vision-Language Model (VLM) and a Large Language Model (LLM), and use the CLIP model to refine object detection and localization.

Result: The authors create an extended COOOL dataset, propose an evaluation method based on cosine similarity, and release supporting tools.

Conclusion: The approach demonstrates the potential of vision-language models and points to directions for future improvement.

Abstract: Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline consists of a Vision-Language Model (VLM) and a Large Language Model (LLM), which together detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of hazard detection and labeling evaluation on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
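The cosine-similarity evaluation can be sketched in a few lines. The abstract does not say how descriptions are turned into vectors, so this example substitutes a simple bag-of-words count vector as a stand-in for whatever text embedding the authors use; that substitution and both function names are assumptions made purely for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Stand-in 'embedding': lowercase token counts (an assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical predicted description and ground-truth annotation.
predicted = bag_of_words("a deer crossing the road")
annotated = bag_of_words("deer on the road")
example_score = cosine_similarity(predicted, annotated)
```

With a learned sentence embedding in place of `bag_of_words`, paraphrases such as "animal in the lane" would also score highly, which is presumably why the evaluation is described as measuring semantic similarity.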

[17] CytoFM: The first cytology foundation model

Vedrana Ivezić,Ashwath Radhachandran,Ekaterina Redekop,Shreeram Athreya,Dongwoo Lee,Vivek Sant,Corey Arnold,William Speier

Main category: cs.CV

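TL;DR: CytoFM is the first self-supervised foundation model for cytology. Pretrained with the iBOT framework, it learns robust, transferable representations and outperforms existing foundation models on several downstream tasks.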

Motivation: Cytology is essential for cancer diagnosis, but developing deep learning models for it is hampered by sample heterogeneity, differences across organs, and scarce data, which calls for task-agnostic pre-training.

Method: Pretrain CytoFM with the iBOT framework, which combines masked image modeling and self-distillation, and evaluate it on multiple cytology tasks.

Result: CytoFM outperforms models pretrained on histopathology or natural images on two downstream tasks and attends to cytologically relevant features.

Conclusion: Despite a small pre-training dataset, CytoFM shows the potential of task-agnostic pre-training to learn robust features from cytology data.

Abstract: Cytology is essential for cancer diagnostics and screening due to its minimally invasive nature. However, the development of robust deep learning models for digital cytology is challenging due to the heterogeneity in staining and preparation methods of samples, differences across organs, and the limited availability of large, diverse, annotated datasets. Developing a task-specific model for every cytology application is impractical, and non-cytology-specific foundation models struggle to generalize to tasks in this domain where the emphasis is on cell morphology. To address these challenges, we introduce CytoFM, the first cytology self-supervised foundation model. Using iBOT, a self-supervised Vision Transformer (ViT) training framework incorporating masked image modeling and self-distillation, we pretrain CytoFM on a diverse collection of cytology datasets to learn robust, transferable representations. We evaluate CytoFM on multiple downstream cytology tasks, including breast cancer classification and cell type identification, using an attention-based multiple instance learning framework. Our results demonstrate that CytoFM performs better on two out of three downstream tasks than existing foundation models pretrained on histopathology (UNI) or natural images (iBOT-Imagenet). Visualizations of learned representations demonstrate our model is able to attend to cytologically relevant features. Despite a small pre-training dataset, CytoFM's promising results highlight the ability of task-agnostic pre-training approaches to learn robust and generalizable features from cytology data.

[18] ProgRoCC: A Progressive Approach to Rough Crowd Counting

Shengqin Jiang,Linfei Li,Haokui Zhang,Qingshan Liu,Amin Beheshti,Jian Yang,Anton van den Hengel,Quan Z. Sheng,Yuankai Qi

Main category: cs.CV

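TL;DR: Proposes ProgRoCC, a CLIP-based method for rough crowd counting. A progressive estimation learning strategy and a vision-language matching adapter significantly improve semi-supervised and weakly supervised crowd counting.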

Motivation: Traditional crowd counting becomes infeasible and unreliable as crowds grow, so a more efficient approach with easier-to-acquire training data is needed.

Method: Use a progressive (coarse-to-fine) estimation learning strategy and a vision-language matching adapter that refines the visual features.

Result: The method outperforms existing semi-supervised and weakly supervised approaches on three widely adopted datasets.

Conclusion: ProgRoCC performs well on rough crowd counting and offers a new solution for the field.

Abstract: As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem, which we label Rough Crowd Counting, that delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number of targets in an image, instead of the more traditional, and far more expensive, per-target annotations. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach. This approach delivers answers quickly and outperforms the state-of-the-art in semi- and weakly-supervised crowd counting. In addition, we design a vision-language matching adapter that optimizes key-value pairs by mining effective matches of two modalities to refine the visual features, thereby improving the final performance. Extensive experimental results on three widely adopted crowd counting datasets demonstrate the effectiveness of our method.

[19] LoRA-Based Continual Learning with Constraints on Critical Parameter Changes

Shimou Ling,Liang Zhang,Jiangwei Zhao,Lili Pan,Hongliang Li

Main category: cs.CV

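TL;DR: The paper proposes a LoRA-based continual learning method that reduces forgetting by freezing critical parameter matrices and using orthogonal LoRA composition (LoRAC), achieving SOTA performance on several benchmarks.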

Details Motivation: Orthogonal LoRA tuning works well for continual learning, yet the parameters critical to pre-tasks still change notably after post-tasks are learned, so an improvement is needed. Method: Freeze the parameter matrices in the ViT that are most critical for pre-tasks, combine this with orthogonal LoRA tuning, and add an orthogonal LoRA composition (LoRAC) based on QR decomposition. Result: On Split CIFAR-100, the method improves accuracy by 6.35% and reduces forgetting by 3.24%. Conclusion: By freezing critical parameters and applying LoRAC, the method substantially improves continual learning performance and reaches state-of-the-art results. Abstract: LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we directly propose freezing the most critical parameter matrices in the Vision Transformer (ViT) for pre-tasks before learning post-tasks. In addition, building on orthogonal LoRA tuning, we propose orthogonal LoRA composition (LoRAC) based on QR decomposition, which may further enhance the plasticity of our method. Elaborate ablation studies and extensive comparisons demonstrate the effectiveness of our proposed method. Our results indicate that our method achieves state-of-the-art (SOTA) performance on several well-known continual learning benchmarks. For instance, on the Split CIFAR-100 dataset, our method shows a 6.35% improvement in accuracy and a 3.24% reduction in forgetting compared to previous methods. Our code is available at https://github.com/learninginvision/LoRAC-IPC.
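The abstract does not say how LoRAC applies the QR decomposition, so the following is only a generic illustration of the underlying idea: a QR-style factorization turns a set of direction vectors into mutually orthogonal ones. The sketch uses modified Gram-Schmidt in plain Python; the function names `gram_schmidt_qr` and `dot` and the toy vectors are hypothetical and are not taken from the authors' code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_qr(columns):
    """Return orthonormal vectors spanning the same space as `columns`.

    This is the Q factor of a QR decomposition, computed with modified
    Gram-Schmidt. Illustrative only: not the LoRAC implementation.
    """
    q = []
    for v in columns:
        w = list(v)
        # Remove the components along the directions already accepted.
        for u in q:
            proj = dot(w, u)
            w = [a - proj * b for a, b in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip vectors that are (numerically) dependent
            q.append([a / norm for a in w])
    return q


# Two toy "update directions" that overlap before orthogonalization.
q = gram_schmidt_qr([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
```

After the call, the vectors in `q` have unit length and zero inner product with each other, which is the property an orthogonal composition relies on.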

[20] How Learnable Grids Recover Fine Detail in Low Dimensions: A Neural Tangent Kernel Analysis of Multigrid Parametric Encodings

Samuel Audia,Soheil Feizi,Matthias Zwicker,Dinesh Manocha

Main category: cs.CV

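TL;DR: The paper compares two techniques for addressing spectral bias in neural networks, Fourier feature encodings (FFE) and multigrid parametric encodings (MPE), and finds that MPE outperforms FFE.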

Details Motivation: Neural networks that map between low-dimensional spaces struggle to learn high-frequency information, so effective ways to mitigate spectral bias are needed. Method: The neural tangent kernel (NTK) is used to analyze the performance gap between FFE and MPE, with validation on 2D image regression and 3D implicit surface regression. Result: MPE raises the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over FFE, with a correspondingly large gain in performance. Conclusion: MPE improves performance through its grid structure rather than its embedding space, a mechanism different from FFE's, and it performs better. Abstract: Neural networks that map between low dimensional spaces are ubiquitous in computer graphics and scientific computing; however, in their naive implementation, they are unable to learn high frequency information. We present a comprehensive analysis comparing the two most common techniques for mitigating this spectral bias: Fourier feature encodings (FFE) and multigrid parametric encodings (MPE). FFEs are seen as the standard for low dimensional mappings, but MPEs often outperform them and learn representations with higher resolution and finer detail. FFE's roots in the Fourier transform make it susceptible to aliasing if pushed too far, while MPEs, which use a learned grid structure, have no such limitation. To understand the difference in performance, we use the neural tangent kernel (NTK) to evaluate these encodings through the lens of an analogous kernel regression. By finding a lower bound on the smallest eigenvalue of the NTK, we prove that MPEs improve a network's performance through the structure of their grid and not their learnable embedding. This mechanism is fundamentally different from FFEs, which rely solely on their embedding space to improve performance. Results are empirically validated on a 2D image regression task using images taken from 100 synonym sets of ImageNet and 3D implicit surface regression on objects from the Stanford graphics dataset. Using peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) to evaluate how well fine details are learned, we show that the MPE increases the minimum eigenvalue by 8 orders of magnitude over the baseline and 2 orders of magnitude over the FFE. The increase in spectrum corresponds to a 15 dB (PSNR) / 0.65 (MS-SSIM) increase over baseline and a 12 dB (PSNR) / 0.33 (MS-SSIM) increase over the FFE.
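To make the Fourier feature encoding concrete, here is a minimal sketch that lifts a scalar coordinate into sine and cosine features before it would be fed to a network. The power-of-two frequency schedule and the name `fourier_features` are assumptions chosen for illustration; the abstract does not state which frequencies the authors use.

```python
import math


def fourier_features(x, num_freqs=4):
    """Encode a scalar coordinate x as [sin, cos] pairs at several frequencies.

    Frequencies 2**k for k = 0..num_freqs-1 are an assumed schedule,
    not one taken from the paper.
    """
    feats = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        feats.append(math.sin(2.0 * math.pi * freq * x))
        feats.append(math.cos(2.0 * math.pi * freq * x))
    return feats


encoded = fourier_features(0.25, num_freqs=3)  # 3 frequencies -> 6 features
```

Higher frequencies let a small network represent finer detail, but the encoding is periodic, which is the source of the aliasing risk the abstract mentions.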

[21] Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction

Wenyu Li,Sidun Liu,Peng Qiao,Yong Dou

Main category: cs.CV

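TL;DR: The paper proposes a multi-view 3D reconstruction method that incorporates monocular geometric priors, substantially improving reconstruction quality in weakly textured and low-light conditions.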

Details Motivation: Existing matching-based multi-view 3D reconstruction models degrade markedly in weakly textured and low-light conditions, a shortcoming this work aims to address. Method: A monocular-guided refinement module integrates monocular geometric priors into the multi-view reconstruction framework. Result: Camera pose estimation and point cloud accuracy both improve substantially across multiple benchmarks. Conclusion: Incorporating monocular geometric priors substantially improves the robustness of multi-view reconstruction systems. Abstract: Recent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.

[22] HSACNet: Hierarchical Scale-Aware Consistency Regularized Semi-Supervised Change Detection

Qi'ao Xu,Pengfei Wang,Yanjun Li,Tianwen Qian,Xiaoling Wang

Main category: cs.CV

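TL;DR: HSACNet is a semi-supervised change detection method that combines SAM2 with multi-scale feature extraction, markedly improving performance in complex scenes.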

Details Motivation: Existing methods perform poorly in complex scenes and on noisy data, and they neglect the integrity of multi-scale features. Method: SAM2's Hiera backbone serves as the encoder, a SADAM module captures multi-scale change features, and a dual-augmentation consistency regularization strategy is applied. Result: The method achieves the best performance on four benchmarks while reducing parameters and computational cost. Conclusion: Through multi-scale features and consistency regularization, HSACNet improves both the performance and the efficiency of semi-supervised change detection. Abstract: Semi-supervised change detection (SSCD) aims to detect changes between bi-temporal remote sensing images by utilizing limited labeled data and abundant unlabeled data. Existing methods struggle in complex scenarios, exhibiting poor performance when confronted with noisy data. They typically neglect intra-layer multi-scale features while emphasizing inter-layer fusion, harming the integrity of change objects with different scales. In this paper, we propose HSACNet, a Hierarchical Scale-Aware Consistency regularized Network for SSCD. Specifically, we integrate Segment Anything Model 2 (SAM2), using its Hiera backbone as the encoder to extract inter-layer multi-scale features and applying adapters for parameter-efficient fine-tuning. Moreover, we design a Scale-Aware Differential Attention Module (SADAM) that can precisely capture intra-layer multi-scale change features and suppress noise. Additionally, a dual-augmentation consistency regularization strategy is adopted to effectively utilize the unlabeled data. Extensive experiments across four CD benchmarks demonstrate that our HSACNet achieves state-of-the-art performance, with reduced parameters and computational cost.

[23] Circular Image Deturbulence using Quasi-conformal Geometry

Chu Chen,Han Zhang,Lok Ming Lui

Main category: cs.CV

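TL;DR: The paper proposes CQCD, an unsupervised framework that removes image distortions through a circular architecture while ensuring geometric accuracy and visual fidelity.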

Details Motivation: Inhomogeneous media between optical sensors and objects distort images, and the lack of high-quality paired labeled images limits the training of supervised models. Method: A circular architecture performs forward and inverse mapping, quasi-conformal geometry theory is used to ensure bijectivity, and tight-frame blocks encode distortion-sensitive features. Result: CQCD performs strongly on synthetic and real images, surpassing existing methods in restoration quality while also estimating the deformation field accurately. Conclusion: The CQCD framework performs well in unsupervised image restoration, resolving distortions and providing highly accurate deformation estimates. Abstract: The presence of inhomogeneous media between optical sensors and objects leads to distorted imaging outputs, significantly complicating downstream image-processing tasks. A key challenge in image restoration is the lack of high-quality, paired-label images required for training supervised models. In this paper, we introduce the Circular Quasi-Conformal Deturbulence (CQCD) framework, an unsupervised approach for removing image distortions through a circular architecture. This design ensures that the restored image remains both geometrically accurate and visually faithful while preventing the accumulation of incorrect estimations. The circular restoration process involves both forward and inverse mapping. To ensure the bijectivity of the estimated non-rigid deformations, computational quasi-conformal geometry theories are leveraged to regularize the mapping, enforcing its homeomorphic properties. This guarantees a well-defined transformation that preserves structural integrity and prevents unwanted artifacts. Furthermore, tight-frame blocks are integrated to encode distortion-sensitive features for precise recovery. To validate the performance of our approach, we conduct evaluations on various synthetic and real-world captured images. Experimental results demonstrate that CQCD not only outperforms existing state-of-the-art deturbulence methods in terms of image restoration quality but also provides highly accurate deformation field estimations.

[24] Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation

Cheng Yuan,Yutong Ban

Main category: cs.CV

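TL;DR: The paper proposes a temporal asymmetric feature propagation network with a bidirectional attention architecture that addresses both static-image and dynamic-video challenges in surgical scene segmentation and significantly outperforms existing methods.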

Details Motivation: Surgical scene segmentation is crucial for robot-assisted laparoscopic surgery, but existing methods overlook temporal dependencies and struggle with ambiguous features in static images and complex changes in dynamic video. Method: A temporal query propagator and an aggregated asymmetric feature pyramid module combine multi-directional consistency constraints with discriminative feature preservation to enable temporal guidance and contextual reasoning. Result: The method performs strongly on two public benchmarks, with +16.4% mIoU on EndoVis2018 and +3.3% mAP on Endoscapes2023. Conclusion: Temporal feature propagation and contextual reasoning substantially improve surgical scene segmentation; the code will be released. Abstract: Surgical scene segmentation is crucial for robot-assisted laparoscopic surgery understanding. Current approaches face two challenges: (i) static image limitations including ambiguous local feature similarities and fine-grained structural details, and (ii) dynamic video complexities arising from rapid instrument motion and persistent visual occlusions. While existing methods mainly focus on spatial feature extraction, they fundamentally overlook temporal dependencies in surgical video streams. To address this, we present temporal asymmetric feature propagation network, a bidirectional attention architecture enabling cross-frame feature propagation. The proposed method contains a temporal query propagator that integrates multi-directional consistency constraints to enhance frame-specific feature representation, and an aggregated asymmetric feature pyramid module that preserves discriminative features for anatomical structures and surgical instruments. Our framework uniquely enables both temporal guidance and contextual reasoning for surgical scene understanding. Comprehensive evaluations on two public benchmarks show the proposed method outperforms the current SOTA methods by a large margin, with +16.4% mIoU on EndoVis2018 and +3.3% mAP on Endoscapes2023. The code will be publicly available after paper acceptance.

[25] SatelliteCalculator: A Multi-Task Vision Foundation Model for Quantitative Remote Sensing Inversion

Zhenyu Yu,Mohd. Yamani Idna Idris,Pei Wang

Main category: cs.CV

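TL;DR: SatelliteCalculator is the first vision foundation model designed for quantitative remote sensing inversion. It builds a large-scale dataset from physically defined index formulas and combines a Swin Transformer with a prompt-guided architecture, significantly improving task adaptability and inference efficiency.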

Details Motivation: Vision foundation models excel at classification and segmentation but are rarely applied to physically interpretable regression, and the multi-spectral nature and geospatial heterogeneity of remote sensing data challenge generalization and transfer. Method: Physically defined index formulas are used to automatically build a large-scale dataset covering eight core ecological indicators; the model pairs a frozen Swin Transformer backbone with a prompt-guided architecture, cross-attentive adapters, and lightweight task-specific MLP decoders. Result: On the Open-Canopy benchmark, SatelliteCalculator reaches competitive accuracy on all tasks while significantly reducing inference cost. Conclusion: The study validates the feasibility of foundation models for quantitative inversion and provides a scalable framework for task-adaptive remote sensing estimation. Abstract: Quantitative remote sensing inversion plays a critical role in environmental monitoring, enabling the estimation of key ecological variables such as vegetation indices, canopy structure, and carbon stock. Although vision foundation models have achieved remarkable progress in classification and segmentation tasks, their application to physically interpretable regression remains largely unexplored. Furthermore, the multi-spectral nature and geospatial heterogeneity of remote sensing data pose significant challenges for generalization and transferability. To address these issues, we introduce SatelliteCalculator, the first vision foundation model tailored for quantitative remote sensing inversion. By leveraging physically defined index formulas, we automatically construct a large-scale dataset of over one million paired samples across eight core ecological indicators. The model integrates a frozen Swin Transformer backbone with a prompt-guided architecture, featuring cross-attentive adapters and lightweight task-specific MLP decoders. Experiments on the Open-Canopy benchmark demonstrate that SatelliteCalculator achieves competitive accuracy across all tasks while significantly reducing inference cost. Our results validate the feasibility of applying foundation models to quantitative inversion, and provide a scalable framework for task-adaptive remote sensing estimation.
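As an example of what a physically defined index formula looks like, the sketch below computes the normalized difference vegetation index (NDVI) from near-infrared and red reflectance. NDVI is a standard vegetation index; whether it is one of the eight indicators used in the paper is an assumption, since the abstract does not list them. The helper name `ndvi` and the sample reflectance values are hypothetical.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red).

    `eps` guards against division by zero on dark pixels (an assumed
    implementation detail, not something stated in the abstract).
    """
    return (nir - red) / (nir + red + eps)


# Per-pixel labels derived from two spectral bands (toy reflectance values).
nir_band = [0.50, 0.30, 0.10]
red_band = [0.10, 0.30, 0.20]
labels = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
```

Because the target is a closed-form function of the input bands, paired training samples can be generated automatically, which is what makes a dataset of this kind cheap to scale.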

[26] MicroFlow: Domain-Specific Optical Flow for Ground Deformation Estimation in Seismic Events

Juliette Bertrand,Sophie Giffard-Roisin,James Hollingsworth,Julien Mairal

Main category: cs.CV

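TL;DR: The paper proposes a deep learning model for high-precision ground displacement measurement that addresses the difficulty existing methods have in estimating small displacements under real-world conditions.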

Details Motivation: Traditional patch matching on optical satellite images has limitations for real ground displacement measurement, while deep learning models are hard to apply because of the absence of real ground truth, the need for sub-pixel precision, and temporal variations caused by geological or anthropogenic changes. Method: An iterative refinement model with explicit warping layers and a correlation-independent backbone, combined with non-convex total variation regularization, achieves sub-pixel precision while preserving fault-line sharpness. Result: The model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and performs well on real-world scenes from medium- and high-resolution sensors. Conclusion: The model offers a high-precision, general solution for ground displacement measurement under complex real-world conditions. Abstract: Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle at estimating small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. Project page: https://jbertrand89.github.io/microflow/.
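The following 1D sketch illustrates why a non-convex total variation penalty can favor sharp discontinuities. The abstract does not give the exact form of the regularizer, so the log-based penalty, the `eps` value, and both function names are assumptions for illustration only.

```python
import math


def total_variation(signal):
    """Standard (convex) TV: sum of absolute differences between neighbors."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def log_total_variation(signal, eps=0.1):
    """A non-convex TV variant: sum of log(1 + |difference| / eps).

    The log penalty is an assumed example, not the paper's regularizer.
    """
    return sum(math.log(1.0 + abs(b - a) / eps)
               for a, b in zip(signal, signal[1:]))


step = [0.0, 0.0, 1.0, 1.0]            # one sharp jump, like a fault line
ramp = [0.0, 1.0 / 3, 2.0 / 3, 1.0]    # the same rise spread over three steps
```

Standard TV scores the step and the ramp identically, whereas the concave penalty charges less for one large jump than for several small ones, so it prefers the sharp edge while still penalizing small fluctuations elsewhere.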

[27] Neural Ganglion Sensors: Learning Task-specific Event Cameras Inspired by the Neural Circuit of the Human Retina

Haley M. So,Gordon Wetzstein

Main category: cs.CV

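TL;DR: The paper proposes Neural Ganglion Sensors, inspired by biological retinal ganglion cells, which improve on conventional event cameras with task-specific spatio-temporal retinal kernels and achieve better performance at lower bandwidth on video interpolation and optical flow.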

Details Motivation: Event cameras were inspired by the data-efficient mechanism of neurons in the human eye, but conventional designs neither exploit local spatial context nor process multiple features in parallel; the paper aims to close this gap. Method: Neural Ganglion Sensors learn task-specific spatio-temporal retinal kernels (akin to RGC events), extending the capabilities of conventional event cameras. Result: On video interpolation and optical flow, the biologically inspired sensor outperforms conventional event cameras while reducing event bandwidth. Conclusion: RGC-inspired sensors are promising for edge devices and other low-power, real-time applications that need efficient, high-resolution visual streams. Abstract: Inspired by the data-efficient spiking mechanism of neurons in the human eye, event cameras were created to achieve high temporal resolution with minimal power and bandwidth requirements by emitting asynchronous, per-pixel intensity changes rather than conventional fixed-frame rate images. Unlike retinal ganglion cells (RGCs) in the human eye, however, which integrate signals from multiple photoreceptors within a receptive field to extract spatio-temporal features, conventional event cameras do not leverage local spatial context when deciding which events to fire. Moreover, the eye contains around 20 different kinds of RGCs operating in parallel, each attuned to different features or conditions. Inspired by this biological design, we introduce Neural Ganglion Sensors, an extension of traditional event cameras that learns task-specific spatio-temporal retinal kernels (i.e., RGC "events"). We evaluate our design on two challenging tasks: video interpolation and optical flow. Our results demonstrate that our biologically inspired sensing improves performance relative to conventional event cameras while reducing overall event bandwidth. These findings highlight the promise of RGC-inspired event sensors for edge devices and other low-power, real-time applications requiring efficient, high-resolution visual streams.
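For readers unfamiliar with event cameras, this sketch shows the conventional per-pixel behavior the abstract contrasts against: a pixel fires an event only when its log-intensity has changed by more than a threshold since its last event, with no reference to neighboring pixels. The threshold value and the name `pixel_events` are assumptions; the learned retinal kernels proposed in the paper are not modeled here.

```python
import math


def pixel_events(intensities, threshold=0.2):
    """Simulate one pixel of a conventional event camera.

    Returns a list of (time_index, polarity) tuples, where polarity is +1
    for a brightness increase and -1 for a decrease. Intensities must be
    positive because the comparison is made in log space.
    """
    events = []
    reference = math.log(intensities[0])
    for t, value in enumerate(intensities[1:], start=1):
        change = math.log(value) - reference
        if abs(change) >= threshold:
            events.append((t, 1 if change > 0 else -1))
            reference = math.log(value)  # reset the reference after firing
    return events


events = pixel_events([1.0, 1.0, 2.0, 2.0, 1.0])
```

A static pixel produces no output at all, which is where the bandwidth savings come from; the decision uses only that pixel's own history, which is the limitation the paper targets.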

[28] Learning from Noisy Pseudo-labels for All-Weather Land Cover Mapping

Wang Liu,Zhiyu Wang,Xin Guo,Puhong Duan,Xudong Kang,Shutao Li

Main category: cs.CV

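TL;DR: The paper proposes combining semi-supervised learning with an image resolution alignment augmentation to generate more accurate pseudo-labels for SAR images, uses a symmetric cross-entropy loss to reduce the impact of noise, and takes first place in the GRSS data fusion contest.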

Details Motivation: SAR images lack detail and suffer from significant speckle noise, and pseudo-labels generated from optical images are noisy, which leads to poor segmentation performance. Method: Pseudo-labels are generated by combining semi-supervised learning with an image resolution alignment augmentation, a symmetric cross-entropy loss reduces the impact of noise, and several training and testing tricks are applied. Result: The method takes first place in the GRSS data fusion contest, demonstrating its effectiveness. Conclusion: The proposed method markedly improves semantic segmentation of SAR images; the code is open source. Abstract: Semantic segmentation of SAR images has garnered significant attention in remote sensing due to the immunity of SAR sensors to cloudy weather and light conditions. Nevertheless, SAR imagery lacks detailed information and is plagued by significant speckle noise, rendering the annotation or segmentation of SAR images a formidable task. Recent efforts have resorted to annotating paired optical-SAR images to generate pseudo-labels through the utilization of an optical image segmentation network. However, these pseudo-labels are laden with noise, leading to suboptimal performance in SAR image segmentation. In this study, we introduce a more precise method for generating pseudo-labels by incorporating semi-supervised learning alongside a novel image resolution alignment augmentation. Furthermore, we introduce a symmetric cross-entropy loss to mitigate the impact of noisy pseudo-labels. Additionally, a bag of training and testing tricks is utilized to generate better land-cover mapping results. Our experiments on the GRSS data fusion contest indicate the effectiveness of the proposed method, which achieves first place. The code is available at https://github.com/StuLiu/DFC2025Track1.git.
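A symmetric cross-entropy loss is commonly written as a weighted sum of the usual cross-entropy and a reverse cross-entropy in which the roles of prediction and label are swapped. The sketch below follows that common formulation for a single sample with a hard label. The weights `alpha` and `beta`, the clamp value `log_zero`, and the function name are assumptions, since the abstract does not give the authors' settings.

```python
import math


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross-entropy for one sample with a hard (one-hot) label.

    CE  = -log p[label]
    RCE = -sum_k p[k] * log q[k], where q is the one-hot label and
          log(0) is clamped to `log_zero` (an assumed constant).
    """
    eps = 1e-12
    ce = -math.log(max(probs[label], eps))
    # log q[k] is 0 for the labeled class and `log_zero` for every other class.
    rce = -sum(p * (0.0 if k == label else log_zero)
               for k, p in enumerate(probs))
    return alpha * ce + beta * rce


loss = symmetric_cross_entropy([0.7, 0.2, 0.1], label=0)
```

The reverse term is bounded, so a confidently wrong pseudo-label contributes a limited penalty rather than the unbounded one plain cross-entropy would assign, which is why this loss is often chosen for noisy labels.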

[29] Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

Hongwei Ji,Wulian Yun,Mengshi Qi,Huadong Ma

Main category: cs.CV

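TL;DR: The paper proposes a few-shot temporal action localization method based on Chain-of-Thought textual reasoning, which uses textual semantic information to improve localization performance.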

Details Motivation: Existing few-shot temporal action localization methods focus only on video-level information and ignore the semantic support that text can provide. Method: A few-shot learning framework with a semantic-aware text-visual alignment module and a Chain-of-Thought reasoning method that uses a VLM and an LLM to generate text descriptions. Result: The method performs strongly on ActivityNet1.3 and THUMOS14, and its application to human anomaly detection is explored. Conclusion: The proposed method significantly outperforms existing methods; code, data, and benchmark will be released. Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We introduce the first dataset named Human-related Anomaly Localization and explore the application of the TAL task in human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.

[30] HMPE:HeatMap Embedding for Efficient Transformer-Based Small Object Detection

YangChen Zeng

Main category: cs.CV

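TL;DR: The paper proposes HeatMap Position Embedding (HMPE), which improves small object detection by dynamically integrating positional encoding with semantic detection information. It also designs the MOHFE and HIDQ modules for the encoder and decoder, respectively, to reduce background noise and generate high-quality queries. Experiments show that the method outperforms the baseline on multiple datasets and significantly reduces computational cost.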

Details Motivation: Current Transformer-based small object detection methods have significant shortcomings, and a technique that dynamically integrates positional and semantic detection information is needed to improve performance. Method: HMPE with heatmap-guided adaptive learning; MOHFE and HIDQ modules to optimize the encoder and decoder; LSConv feature engineering to enhance the embedding of small object categories. Result: mAP improves by 1.9% on NWPU VHR-10 and by 1.2% on PASCAL VOC; the number of decoder layers is reduced from eight to three, significantly lowering computational cost. Conclusion: HMPE improves small object detection while reducing computational overhead, giving it practical value. Abstract: Current Transformer-based methods for small object detection continue to emerge, yet they still exhibit significant shortcomings. This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer Optimization technique that enhances object detection performance by dynamically integrating positional encoding with semantic detection information through heatmap-guided adaptive learning. We also innovatively visualize the HMPE method, offering clear visualization of embedded information for parameter fine-tuning. We then create Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) and HeatMap Induced High-Quality Queries for Decoder (HIDQ) modules. These are designed for the encoder and decoder, respectively, to generate high-quality queries and reduce background noise queries. Using both heatmap embedding and Linear-Snake Conv (LSConv) feature engineering, we enhance the embedding of massively diverse small object categories and reduce the decoder multihead layers, thereby accelerating both inference and training. In the generalization experiments, our approach outperforms the baseline mAP by 1.9% on the small object dataset (NWPU VHR-10) and by 1.2% on the general dataset (PASCAL VOC). By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.

[31] Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing

Joowon Kim,Ziseok Lee,Donghyeon Cho,Sanghyun Jo,Yeonsung Jung,Kyungsu Kim,Eunho Yang

Main category: cs.CV

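TL;DR: ELECT is a zero-shot framework that evaluates candidate seeds in the latent space at early diffusion timesteps and selects reliable ones, improving background consistency and instruction adherence in image editing while reducing computational cost.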

Details Motivation: Stochastic noise makes the outputs of diffusion models vary across seeds in image generation and editing, so users must repeatedly adjust seeds or prompts, which is inefficient; existing seed selection methods depend on external verifiers and are computationally expensive. Method: The ELECT framework scores candidate seeds by a background inconsistency score at early timesteps and keeps those that preserve the background while modifying the foreground. Result: ELECT reduces computational cost by 41% on average, improves background consistency and instruction adherence, and succeeds in around 40% of previously failed cases. Conclusion: ELECT needs no external supervision or training and markedly improves the efficiency and quality of image editing. Abstract: Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting applicability, and evaluating multiple seeds increases computational complexity. To address this, we first establish a multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates in previously failed cases - without any external supervision or training.
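The ranking step described above can be pictured as a best-of-N selection over seeds. In the sketch below, each candidate is scored by how much it differs from the source image on background pixels, and the seed with the lowest score wins. Everything here is a simplified assumption: images are flat lists of floats, the score is a masked mean squared difference, and the function names are hypothetical. How ELECT actually computes its score from early-timestep latents is not specified in the abstract.

```python
def background_inconsistency(source, candidate, background_mask):
    """Mean squared difference restricted to background pixels.

    `background_mask` holds 1 for background and 0 for the edited region.
    This scoring rule is an assumed stand-in for the paper's score.
    """
    diffs = [(s - c) ** 2
             for s, c, m in zip(source, candidate, background_mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0


def select_seed(source, candidates, background_mask):
    """Return (best_seed, scores) for a dict mapping seed -> candidate image."""
    scores = {seed: background_inconsistency(source, image, background_mask)
              for seed, image in candidates.items()}
    best_seed = min(scores, key=scores.get)
    return best_seed, scores


source = [0.0, 0.0, 1.0, 1.0]
mask = [1, 1, 0, 0]  # the first two pixels are background
candidates = {7: [0.5, 0.5, 0.0, 0.0], 42: [0.0, 0.1, 0.0, 0.0]}
best_seed, scores = select_seed(source, candidates, mask)
```

Both toy candidates change the foreground equally, but seed 42 leaves the background nearly untouched, so it is selected. Scoring early in sampling, as the abstract describes, lets the remaining candidates be discarded before the full denoising cost is paid.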

[32] U-Shape Mamba: State Space Model for faster diffusion

Alex Ergasti,Filippo Botti,Tomaso Fontanini,Claudio Ferrari,Massimo Bertozzi,Andrea Prati

Main category: cs.CV

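TL;DR: USM is a new diffusion model that combines Mamba blocks with a U-Net structure, significantly reducing computational cost while maintaining high-quality image generation.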

Details Motivation: The high computational cost of diffusion models limits their wider use, and USM aims to address this. Method: Mamba blocks inside a U-Net structure progressively reduce and then restore the sequence length, lowering computational overhead. Result: USM needs one-third the GFlops of Zigma, uses less memory, runs faster, and produces higher-quality images, with FID improving by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO. Conclusion: USM is an efficient, scalable diffusion model that lowers the computational cost of high-quality image synthesis. Abstract: Diffusion models have become the most popular approach for high-quality image generation, but their high computational cost still remains a significant challenge. To address this problem, we propose U-Shape Mamba (USM), a novel diffusion model that leverages Mamba-based layers within a U-Net-like hierarchical structure. By progressively reducing sequence length in the encoder and restoring it in the decoder through Mamba blocks, USM significantly lowers computational overhead while maintaining strong generative capabilities. Experimental results against Zigma, which is currently the most efficient Mamba-based diffusion model, demonstrate that USM achieves one-third the GFlops, requires less memory and is faster, while outperforming Zigma in image quality. Frechet Inception Distance (FID) is improved by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO datasets, respectively. These findings highlight USM as a highly efficient and scalable solution for diffusion-based generative models, making high-quality image synthesis more accessible to the research community while reducing computational costs.

[33] OBIFormer: A Fast Attentive Denoising Framework for Oracle Bone Inscriptions

Jinhao Li,Zijian Chen,Tingzhu Chen,Zhiji Liu,Changbo Wang

Main category: cs.CV

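TL;DR: The paper proposes OBIFormer, a fast attentive denoising framework for oracle bone inscription (OBI) images that combines channel-wise self-attention, glyph extraction, and selective kernel feature fusion to denoise efficiently and precisely.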

Details Motivation: Oracle bone inscriptions, the earliest known form of Chinese characters, are severely degraded by long-term natural weathering and man-made damage, and existing methods fall short in computational efficiency and denoising quality. Method: OBIFormer uses channel-wise self-attention, glyph extraction, and selective kernel feature fusion to reconstruct denoised images efficiently. Result: On synthetic and real OBI datasets, OBIFormer achieves the best denoising performance in PSNR and SSIM. Conclusion: OBIFormer shows great potential for assisting automatic OBI recognition; the code is being released. Abstract: Oracle bone inscriptions (OBIs) are the earliest known form of Chinese characters and serve as a valuable resource for research in anthropology and archaeology. However, most excavated fragments are severely degraded due to thousands of years of natural weathering, corrosion, and man-made destruction, making automatic OBI recognition extremely challenging. Previous methods either focus on pixel-level information or utilize vanilla transformers for glyph-based OBI denoising, which leads to tremendous computational overhead. Therefore, this paper proposes a fast attentive denoising framework for oracle bone inscriptions, i.e., OBIFormer. It leverages channel-wise self-attention, glyph extraction, and selective kernel feature fusion to reconstruct denoised images precisely while being computationally efficient. Our OBIFormer achieves state-of-the-art denoising performance for PSNR and SSIM metrics on synthetic and original OBI datasets. Furthermore, comprehensive experiments on a real oracle dataset demonstrate the great potential of our OBIFormer in assisting automatic OBI recognition. The code will be made available at https://github.com/LJHolyGround/OBIFormer.
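Since denoising quality is reported in PSNR, a short sketch of that metric may help: PSNR is 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the restored image. Treating images as flat lists of values in [0, 1] and the function name `psnr` are simplifying assumptions made here, not details from the paper.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [0.0, 1.0]
denoised = [0.1, 0.9]
quality_db = psnr(clean, denoised)
```

Higher is better, and because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.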

[34] EG-Gaussian: Epipolar Geometry and Graph Network Enhanced 3D Gaussian Splatting

Beizhen Zhao,Yifan Zhou,Zijian Wang,Hao Wang

Main category: cs.CV

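TL;DR: The paper proposes EG-Gaussian, a framework that uses epipolar geometry and graph networks to improve the initialization and feature refinement of 3D Gaussian Splatting (3DGS), significantly improving 3D scene reconstruction accuracy.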

Details Motivation: Existing 3DGS methods suffer from inaccurate initial points and from flattened 3D Gaussians under sparse-view input, which yields incomplete or blurred reconstructions. Method: Epipolar geometry is used to improve 3DGS initialization, and a graph learning module refines spatial features, including coordinates and angular relationships among neighboring points. Result: On indoor and outdoor benchmark datasets, EG-Gaussian significantly outperforms existing 3DGS methods. Conclusion: By improving initialization and feature refinement, EG-Gaussian addresses the limitations of 3DGS and improves reconstruction quality. Abstract: In this paper, we explore an open research problem concerning the reconstruction of 3D scenes from images. Recent methods have adopted 3D Gaussian Splatting (3DGS) to produce 3D scenes due to its efficient training process. However, these methodologies may generate incomplete 3D scenes or blurred multiviews. This is because of (1) inaccurate 3DGS point initialization and (2) the tendency of 3DGS to flatten 3D Gaussians with the sparse-view input. To address these issues, we propose a novel framework EG-Gaussian, which utilizes epipolar geometry and graph networks for 3D scene reconstruction. Initially, we integrate epipolar geometry into the 3DGS initialization phase to enhance initial 3DGS point construction. Then, we specifically design a graph learning module to refine 3DGS spatial features, in which we incorporate both spatial coordinates and angular relationships among neighboring points. Experiments on indoor and outdoor benchmark datasets demonstrate that our approach significantly improves reconstruction accuracy compared to 3DGS-based methods.

[35] Beyond One-Hot Labels: Semantic Mixing for Model Calibration

Haoyang Luo,Linwei Tao,Minjing Dong,Chang Xu

Main category: cs.CV

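TL;DR: The paper proposes a calibration-aware data augmentation method (CSM) that generates synthetic datasets with mixed class characteristics to address label uncertainty in model calibration.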

Details Motivation: Existing calibration methods rely on datasets with fully certain labels and cannot handle uncertainty well, so data annotated with numerically rich confidence values is needed. Method: The CSM framework uses diffusion models to generate mixed-class samples annotated with confidence scores, and adds calibrated reannotation and better-suited loss functions. Result: Experiments show that CSM achieves better calibration than existing methods. Conclusion: CSM offers a new data augmentation paradigm for model calibration and markedly improves calibration. Abstract: Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provide insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available as real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present Calibration-aware Semantic Mixing (CSM), a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. Besides, we explore the loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Code is available at github.com/E-Galois/CSM.
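To make the notion of calibration measurable, the sketch below computes the expected calibration error (ECE), a widely used metric that bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin. The abstract does not name the metrics the authors report, so using ECE here is an assumption, as are the bin count and the function name.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: weighted average of |accuracy - mean confidence| over confidence bins.

    `confidences` are predicted probabilities in [0, 1] for the chosen class;
    `correct` are booleans saying whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes to the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - mean_conf)
    return ece


# A model that says "90% sure" but is right only 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A perfectly calibrated model has an ECE of zero; the toy model above is overconfident by 15 percentage points.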

[36] Zero-Shot Industrial Anomaly Segmentation with Image-Aware Prompt Generation

SoYoung Park,Hyewon Lee,Mingyu Choi,Seunghoon Han,Jong-Ryul Lee,Sungsu Lim,Tae-Ho Kim

Main category: cs.CV

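TL;DR: The paper proposes Image-Aware Prompt Anomaly Segmentation (IAP-AS), which combines an image tagging model with a large language model to generate dynamic, context-aware prompts, markedly improving the adaptability and generalization of anomaly segmentation.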

Details Motivation: Existing zero-shot anomaly segmentation methods rely on fixed prompts and adapt poorly to diverse industrial scenarios, so more flexible prompting strategies are needed. Method: An image tagging model extracts object attributes, and a large language model uses them to generate dynamic, context-aware prompts that improve anomaly segmentation. Result: Experiments show that IAP-AS improves the F1-max metric by up to 10%, with stronger adaptability and generalization. Conclusion: IAP-AS offers a scalable solution for anomaly segmentation across industries and suits dynamic, unstructured industrial environments. Abstract: Anomaly segmentation is essential for industrial quality, maintenance, and stability. Existing text-guided zero-shot anomaly segmentation models are effective but rely on fixed prompts, limiting adaptability in diverse industrial scenarios. This highlights the need for flexible, context-aware prompting strategies. We propose Image-Aware Prompt Anomaly Segmentation (IAP-AS), which enhances anomaly segmentation by generating dynamic, context-aware prompts using an image tagging model and a large language model (LLM). IAP-AS extracts object attributes from images to generate context-aware prompts, improving adaptability and generalization in dynamic and unstructured industrial environments. In our experiments, IAP-AS improves the F1-max metric by up to 10%, demonstrating superior adaptability and generalization. It provides a scalable solution for anomaly segmentation across industries.
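The F1-max metric is usually understood as the best F1 score obtainable over all decision thresholds on the anomaly scores. The sketch below follows that reading for flat lists of per-pixel scores and binary labels; the exact evaluation protocol used by the authors is not given in the abstract, and the function name `f1_max` is hypothetical.

```python
def f1_max(scores, labels):
    """Best F1 over thresholds taken from the observed anomaly scores.

    A pixel is predicted anomalous when its score is >= the threshold;
    `labels` holds 1 for anomalous pixels and 0 for normal ones.
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


best_f1 = f1_max([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Because the threshold is chosen after the fact, F1-max measures how separable the score map is rather than how well a fixed operating point was tuned.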

[37] WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion

Yang Wu,Yun Zhu,Kaihua Zhang,Jianjun Qian,Jin Xie,Jian Yang

Main category: cs.CV

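TL;DR: WeatherGen is a unified framework for generating diverse-weather LiDAR data; it improves data fidelity with a diffusion model and a spider mamba generator, and uses a contrastive learning-based controller to make the generated data more discriminative.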

Details Motivation: Existing LiDAR simulators can simulate only a single adverse weather condition and produce low-fidelity data; the work aims to provide high-quality, diverse-weather data for 3D scene perception tasks. Method: Designs a map-based data producer, builds a diffusion model, proposes a spider mamba generator that gradually restores the data, and introduces a latent feature aligner and a contrastive learning-based controller. Result: WeatherGen generates high-quality data, and the resulting mini-weather dataset improves downstream task performance under adverse weather. Conclusion: WeatherGen provides an efficient solution for diverse-weather LiDAR data generation, markedly improving data fidelity and downstream task performance. Abstract: 3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single adverse weather with a single physical model, and the fidelity of the generated data is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which can provide a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Within this model, we propose a spider mamba generator to restore the disturbed diverse weather data gradually. The spider mamba models the feature interactions by scanning the LiDAR beam circle or central ray, excellently maintaining the physical structure of the LiDAR data. Subsequently, following the generator to transfer real-world knowledge, we design a latent feature aligner. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision, guiding the diffusion model to generate more discriminative data. Extensive evaluations demonstrate the high generation quality of WeatherGen. Through WeatherGen, we construct the mini-weather dataset, promoting the performance of the downstream task under adverse weather conditions. Code is available at https://github.com/wuyang98/weathergen

[38] HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework

Shuobin Wei,Zhuang Zhou,Zhengan Lu,Zizhao Yuan,Binghua Su

Main category: cs.CV

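TL;DR: Proposes HDBFormer, a heterogeneous dual-branch framework that integrates RGB and depth image information through separate encoders and an interaction module, achieving high-performance semantic segmentation of indoor scenes.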

Details Motivation: Existing methods do not adequately distinguish how RGB and depth images express information, and so fail to exploit the unique characteristics of each. Method: Designs the HDBFormer framework, comprising basic and detail encoders for RGB images, the lightweight LDFormer encoder for depth images, and the Modality Information Interaction Module (MIIM). Result: Achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. Conclusion: By processing RGB and depth images differently, HDBFormer effectively improves semantic segmentation performance. Abstract: In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Properly distinguishing the processing of RGB and depth images is essential to fully exploiting their unique and significant characteristics. To address this, we propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic and detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large kernel convolutions to interact global and local information across modalities efficiently. Extensive experiments show that HDBFormer achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. The code is available at: https://github.com/Weishuobin/HDBFormer.

[39] Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding

Yuchen Rao,Stefan Ainetter,Sinisa Stekovic,Vincent Lepetit,Friedrich Fraundorfer

Main category: cs.CV

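TL;DR: Uses automatically retrieved synthetic CAD models to generate high-quality annotations for training deep learning models, which outperform models trained on manual annotations in point cloud completion and single-view CAD model retrieval.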

Details Motivation: Addresses the difficulty of obtaining high-quality annotations for 3D scene understanding and aims to reduce annotation cost. Method: Applies an automatic annotation pipeline, akin to the one previously used for ScanNet, to the ScanNet++ v1 dataset, producing the automatically generated annotations SCANnotate++. Result: Models trained on the automatic annotations outperform models trained on manual annotations in point cloud completion and CAD model retrieval. Conclusion: Automatic 3D annotation can improve model performance while significantly reducing cost; the annotations and trained models will be released. Abstract: High-level 3D scene understanding is essential in many applications. However, the challenges of generating accurate 3D annotations make development of deep learning models difficult. We turn to recent advancements in automatic retrieval of synthetic CAD models, and show that data generated by such methods can be used as high-quality ground truth for training supervised deep learning models. More exactly, we employ a pipeline akin to the one previously used to automatically annotate objects in ScanNet scenes with their 9D poses and CAD models. This time, we apply it to the recent ScanNet++ v1 dataset, which previously lacked such annotations. Our findings demonstrate that it is not only possible to train deep learning models on these automatically-obtained annotations but that the resulting models outperform those trained on manually annotated data. We validate this on two distinct tasks: point cloud completion and single-view CAD model retrieval and alignment. Our results underscore the potential of automatic 3D annotations to enhance model performance while significantly reducing annotation costs. To support future research in 3D scene understanding, we will release our annotations, which we call SCANnotate++, along with our trained models.

[40] HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering

Alexander Rusnak,Frédéric Kaplan

Main category: cs.CV

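TL;DR: Proposes HAEC, an open-vocabulary 3D scene understanding method that addresses the inability of existing techniques to scale efficiently to city-scale datasets.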

Details Motivation: Traditional 3D scene understanding depends on hand-annotated labels, while existing open-vocabulary methods are hard to scale to city-scale datasets. Method: Uses HAEC, a superpoint graph clustering approach with a mixture-of-experts graph transformer backbone, and develops a synthetic labeling pipeline that needs no hand annotation. Result: Achieves the first open-vocabulary scene understanding on the SensatUrban city-scale dataset and demonstrates the synthetic labeling pipeline. Conclusion: HAEC opens a new path to complex operations on dense urban 3D scenes and advances the processing of digital twins. Abstract: Traditional 3D scene understanding techniques are generally predicated on hand-annotated label sets, but in recent years a new class of open-vocabulary 3D scene understanding techniques has emerged. Despite the success of this paradigm on small scenes, existing approaches cannot scale efficiently to city-scale 3D datasets. In this paper, we present Hierarchical vocab-Agnostic Expert Clustering (HAEC), after the Latin word for 'these', a superpoint graph clustering based approach which utilizes a novel mixture of experts graph transformer for its backbone. We apply this highly scalable approach in the first application of open-vocabulary scene understanding on the SensatUrban city-scale dataset. We also demonstrate a synthetic labeling pipeline which is derived entirely from the raw point clouds with no hand-annotation. Our technique can help unlock complex operations on dense urban 3D scenes and open a new path forward in the processing of digital twins.

[41] KAN or MLP? Point Cloud Shows the Way Forward

Yan Shi,Qingdong He,Yijun Liu,Xiaoyu Liu,Jingyong Su

Main category: cs.CV

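TL;DR: PointKAN applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis, strengthening feature representation with a geometric affine module and local feature processing, and clearly outperforms PointMLP with a parameter-efficient design.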

Details Motivation: Conventional MLPs struggle to capture local geometric features efficiently in point cloud analysis and are parameter-inefficient. PointKAN aims to address these problems using the hierarchical feature representation ability of KANs. Method: 1. Introduces a Geometric Affine Module (GAM) for robustness to geometric variations; 2. Extracts group-level and global features with a parallel structure in Local Feature Processing (LFP); 3. Combines features in Global Feature Processing (GFP) and expands the receptive field; 4. Develops Efficient-KANs (PointKAN-elite) to reduce parameters. Result: PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with especially strong results on few-shot learning, while substantially reducing parameters and computational complexity. Conclusion: PointKAN demonstrates the potential of KANs in 3D vision and opens new directions for point cloud understanding research. Abstract: Multi-Layer Perceptrons (MLPs) have become one of the fundamental architectural components in point cloud analysis due to their effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs' fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model's robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in few-shot learning tasks.
Additionally, PointKAN achieves substantial reductions in parameter counts and computational complexity (FLOPs). This work highlights the potential of KANs-based architectures in 3D vision and opens new avenues for research in point cloud understanding.

[42] LMPOcc: 3D Semantic Occupancy Prediction Utilizing Long-Term Memory Prior from Historical Traversals

Shanshuai Yuan,Julong Wei,Muer Tie,Xiangyun Ren,Zhongxue Gan,Wenchao Ding

Main category: cs.CV

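TL;DR: LMPOcc is a 3D semantic occupancy prediction method built on perceptual outputs from historical traversals; it uses long-term memory priors to enhance local perception and construct global occupancy representations, and performs strongly on the Occ3D-nuScenes benchmark.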

Details Motivation: Autonomous vehicles repeatedly traverse the same geographic locations, yet existing methods do not make full use of historical perceptual information, which limits 3D occupancy prediction performance. Method: Proposes LMPOcc, which introduces long-term memory priors, a lightweight Current-Prior Fusion module that adaptively fuses features, and a model-agnostic prior format. Result: Achieves state-of-the-art performance on the Occ3D-nuScenes benchmark, especially on static semantic categories, and supports building global occupancy through multi-vehicle crowdsourcing. Conclusion: By exploiting historical traversal information, LMPOcc markedly improves 3D occupancy prediction and shows the potential of global occupancy construction. Abstract: Vision-based 3D semantic occupancy prediction is critical for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. In practice, autonomous vehicles may repeatedly traverse identical geographic locations under varying environmental conditions, such as weather fluctuations and illumination changes. Existing methods in 3D occupancy prediction predominantly integrate adjacent temporal contexts. However, these works neglect the perceptual information acquired from historical traversals of identical geographic locations. In this paper, we propose Longterm Memory Prior Occupancy (LMPOcc), the first 3D occupancy prediction methodology that exploits long-term memory priors derived from historical traversal perceptual outputs. We introduce a plug-and-play architecture that integrates long-term memory priors to enhance local perception while simultaneously constructing global occupancy representations. To adaptively aggregate prior features and current features, we develop an efficient lightweight Current-Prior Fusion module. Moreover, we propose a model-agnostic prior format to ensure compatibility across diverse occupancy prediction baselines. LMPOcc achieves state-of-the-art performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Additionally, experimental results demonstrate LMPOcc's ability to construct global occupancy through multi-vehicle crowdsourcing.

[43] FocusTrack: A Self-Adaptive Local Sampling Algorithm for Efficient Anti-UAV Tracking

Ying Wang,Tingfa Xu,Jianan Li

Main category: cs.CV

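TL;DR: FocusTrack is a new framework that dynamically adjusts the search region and strengthens feature representations, balancing computational efficiency and tracking accuracy.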

Details Motivation: Addresses the challenges of anti-UAV tracking, including small target sizes, abrupt camera motion, and cluttered infrared backgrounds, while narrowing the performance gap between local and global trackers. Method: Proposes a Search Region Adjustment (SRA) strategy that dynamically adjusts the search region and an Attention-to-Mask (ATM) module that strengthens feature representations. Result: Reaches 67.7% AUC on AntiUAV and 62.8% AUC on AntiUAV410, clearly outperforming the baseline tracker, with high computational efficiency. Conclusion: FocusTrack performs well in both tracking accuracy and computational efficiency and is suitable for real-time anti-UAV tracking. Abstract: Anti-UAV tracking poses significant challenges, including small target sizes, abrupt camera motion, and cluttered infrared backgrounds. Existing tracking paradigms can be broadly categorized into global- and local-based methods. Global-based trackers, such as SiamDT, achieve high accuracy by scanning the entire field of view but suffer from excessive computational overhead, limiting real-world deployment. In contrast, local-based methods, including OSTrack and ROMTrack, efficiently restrict the search region but struggle when targets undergo significant displacements due to abrupt camera motion. Through preliminary experiments, it is evident that a local tracker, when paired with adaptive search region adjustment, can significantly enhance tracking accuracy, narrowing the gap between local and global trackers. To address this challenge, we propose FocusTrack, a novel framework that dynamically refines the search region and strengthens feature representations, achieving an optimal balance between computational efficiency and tracking accuracy. Specifically, our Search Region Adjustment (SRA) strategy estimates the target presence probability and adaptively adjusts the field of view, ensuring the target remains within focus. Furthermore, to counteract feature degradation caused by varying search regions, the Attention-to-Mask (ATM) module is proposed. This module integrates hierarchical information, enriching the target representations with fine-grained details. Experimental results demonstrate that FocusTrack achieves state-of-the-art performance, obtaining 67.7% AUC on AntiUAV and 62.8% AUC on AntiUAV410, outperforming the baseline tracker by 8.5% and 9.1% AUC, respectively.
In terms of efficiency, FocusTrack surpasses global-based trackers, requiring only 30G MACs and achieving 143 fps with FocusTrack (SRA) and 44 fps with the full version, both enabling real-time tracking.
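
The abstract describes SRA only at a high level: estimate the target presence probability, then adapt the field of view. The snippet below is a hypothetical illustration of that idea, not the authors' rule. The thresholds, the multiplicative step, the clamping range, and both function names are assumptions made for the sketch.

```python
def adjust_search_factor(presence_prob, factor, low=0.3, high=0.7,
                         step=1.5, min_factor=2.0, max_factor=8.0):
    """Widen the search region when the target is probably lost, narrow it
    when the target is confidently present. All constants are hypothetical."""
    if presence_prob < low:
        factor = factor * step      # target likely outside: look wider
    elif presence_prob > high:
        factor = factor / step      # target locked: save computation
    return max(min_factor, min(max_factor, factor))


def search_box(cx, cy, w, h, factor):
    """Square search region (x, y, width, height) centered on the last target
    position, with side = factor * sqrt(target area)."""
    side = factor * (w * h) ** 0.5
    return (cx - side / 2, cy - side / 2, side, side)


# Low presence probability enlarges the region used for the next frame.
factor = adjust_search_factor(0.1, 4.0)
region = search_box(50, 50, 10, 10, 4.0)
```

Keeping the factor clamped bounds the per-frame cost, which is the property a local tracker trades against the full-frame scan of a global tracker.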

[44] Cross-Hierarchical Bidirectional Consistency Learning for Fine-Grained Visual Classification

Pengxiang Gao,Yihao Liang,Yanzhi Song,Zhouwang Yang

Main category: cs.CV

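TL;DR: Proposes CHBC, a new framework that exploits the information in tree hierarchies and uses a bidirectional consistency loss to improve the accuracy and consistency of fine-grained visual classification.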

Details Motivation: Existing methods rely on additional annotations and overlook the valuable information contained in tree hierarchies, which limits classification performance. Method: Designs the CHBC framework, which extracts discriminative features across hierarchies by decomposing and enhancing attention masks and features, and introduces a bidirectional consistency loss to keep classification results consistent. Result: The effectiveness of CHBC is validated on three widely used FGVC datasets, and ablation studies confirm the contributions of feature enhancement and the consistency constraint. Conclusion: By exploiting tree hierarchies and bidirectional consistency learning, CHBC markedly improves fine-grained visual classification performance. Abstract: Fine-Grained Visual Classification (FGVC) aims to categorize closely related subclasses, a task complicated by minimal inter-class differences and significant intra-class variance. Existing methods often rely on additional annotations for image classification, overlooking the valuable information embedded in Tree Hierarchies that depict hierarchical label relationships. To leverage this knowledge to improve classification accuracy and consistency, we propose a novel Cross-Hierarchical Bidirectional Consistency Learning (CHBC) framework. The CHBC framework extracts discriminative features across various hierarchies using a specially designed module to decompose and enhance attention masks and features. We employ bidirectional consistency loss to regulate the classification outcomes across different hierarchies, ensuring label prediction consistency and reducing misclassification. Experiments on three widely used FGVC datasets validate the effectiveness of the CHBC framework. Ablation studies further investigate the application strategies of feature enhancement and consistency constraints, underscoring the significant contributions of the proposed modules.

[45] Compile Scene Graphs with Reinforcement Learning

Zuyao Chen,Jinlin Wu,Zhen Lei,Marc Pollefeys,Chang Wen Chen

Main category: cs.CV

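TL;DR: The paper proposes R1-SGG, a multimodal large language model optimized for scene graph generation through supervised fine-tuning and reinforcement learning, with markedly improved performance.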

Details Motivation: Explores the use of large language models for end-to-end generation of structured visual representations such as scene graphs, filling a gap in existing research. Method: Combines supervised fine-tuning (SFT) with reinforcement learning (RL) and designs a graph-centric reward function (node-level, edge-level, and format consistency rewards). Result: Reinforcement learning substantially improves model performance and achieves a zero failure rate, outperforming supervised fine-tuning alone. Conclusion: R1-SGG effectively improves scene graph generation through reinforcement learning and offers a new approach for multimodal tasks. Abstract: Next token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. Although LLMs are an effective way to model language, image, video, and other modalities, their use for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. Given the structured nature of scene graphs, we design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward. Our experiments demonstrate that rule-based RL substantially enhances model performance in the SGG task, achieving a zero failure rate--unlike supervised fine-tuning (SFT), which struggles to generalize effectively. Our code is available at https://github.com/gpt4vision/R1-SGG.
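
The abstract names three reward terms (node-level, edge-level, and format consistency) but gives no formulas. The snippet below is a hypothetical sketch of how such a rule-based reward could be assembled. The JSON output schema with `objects` and `relationships` keys, the exact-match recall scoring, and the equal weights are all assumptions; the paper's reward may match boxes and labels quite differently.

```python
import json


def scene_graph_reward(response_text, gt_objects, gt_triplets,
                       w_node=1.0, w_edge=1.0, w_format=1.0):
    """Hypothetical graph-centric reward: format + node recall + edge recall.

    response_text: model output, assumed to be JSON of the form
        {"objects": [...], "relationships": [[subject, predicate, object], ...]}
    gt_objects / gt_triplets: ground-truth labels and triplets.
    """
    try:
        graph = json.loads(response_text)
        objects = set(graph["objects"])
        triplets = {tuple(t) for t in graph["relationships"]}
    except (ValueError, KeyError, TypeError):
        return 0.0  # unparsable output earns nothing, including no format reward

    format_reward = 1.0
    node_reward = len(objects & set(gt_objects)) / max(len(gt_objects), 1)
    gt_set = {tuple(t) for t in gt_triplets}
    edge_reward = len(triplets & gt_set) / max(len(gt_set), 1)
    return w_node * node_reward + w_edge * edge_reward + w_format * format_reward


response = ('{"objects": ["man", "horse"], '
            '"relationships": [["man", "riding", "horse"]]}')
reward = scene_graph_reward(
    response,
    gt_objects=["man", "horse", "hat"],
    gt_triplets=[("man", "riding", "horse"), ("man", "wearing", "hat")],
)
```

Giving zero reward to unparsable output is one simple way a format term can push the failure rate down, which is the effect the abstract attributes to RL.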

[46] Visual Intention Grounding for Egocentric Assistants

Pengzhan Sun,Junbin Xiao,Tze Ho Elden Tse,Yicong Li,Arjun Akula,Angela Yao

Main category: cs.CV

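TL;DR: The paper introduces the EgoIntention dataset and the Reason-to-Ground (RoG) method for visual intention grounding from an egocentric perspective, and demonstrates the advantages of RoG.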

Details Motivation: Existing visual grounding methods mainly target third-person views and explicit object queries, whereas applications such as AI assistants must handle egocentric views and implicit intentions. Method: Proposes the EgoIntention dataset and RoG instruction tuning, which enables hybrid training through a chained intention reasoning and object grounding mechanism. Result: RoG significantly outperforms naive finetuning and hybrid training on EgoIntention while maintaining or slightly improving grounding from explicit descriptions. Conclusion: RoG enables unified visual grounding for egocentric and exocentric inputs, handling both explicit object queries and implicit human intentions. Abstract: Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts -- inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training with normal descriptions and egocentric intentions with a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.

[47] DenSe-AdViT: A novel Vision Transformer for Dense SAR Object Detection

Yang Zhang,Jingyi Cao,Yanan You,Yuanyuan Qiao

Main category: cs.CV

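TL;DR: Proposes DenSe-AdViT, a density-sensitive Vision Transformer for detecting dense small targets in SAR images, which improves performance with a density-aware module and a multi-scale fusion module.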

Details Motivation: ViT performs well in SAR image object detection but is weak at extracting multi-scale local features, which limits detection of small targets, especially densely arranged ones. Method: Designs a Density-Aware Module (DAM) that generates a density tensor, combined with a Density-Enhanced Fusion Module (DEFM) that integrates multi-scale CNN information with the Transformer's global features. Result: Achieves 79.8% mAP on the RSDD dataset and 92.5% mAP on the SIVED dataset. Conclusion: The density-sensitive design of DenSe-AdViT markedly improves detection of dense small targets. Abstract: Vision Transformer (ViT) has achieved remarkable results in object detection for synthetic aperture radar (SAR) images, owing to its exceptional ability to extract global features. However, it struggles with the extraction of multi-scale local features, leading to limited performance in detecting small targets, especially when they are densely arranged. Therefore, we propose Density-Sensitive Vision Transformer with Adaptive Tokens (DenSe-AdViT) for dense SAR target detection. We design a Density-Aware Module (DAM) as a preliminary component that generates a density tensor based on target distribution. It is guided by a meticulously crafted objective metric, enabling precise and effective capture of the spatial distribution and density of objects. To integrate the multi-scale information enhanced by convolutional neural networks (CNNs) with the global features derived from the Transformer, Density-Enhanced Fusion Module (DEFM) is proposed. It effectively refines attention toward target-survival regions with the assistance of the density mask and the multi-source features. Notably, our DenSe-AdViT achieves 79.8% mAP on the RSDD dataset and 92.5% on the SIVED dataset, both of which feature a large number of densely distributed vehicle targets.

[48] Efficient Parameter Adaptation for Multi-Modal Medical Image Segmentation and Prognosis

Numan Saeed,Shahad Hardan,Muhammad Ridzuan,Nada Saadi,Karthik Nandakumar,Mohammad Yaqub

Main category: cs.CV

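TL;DR: Proposes a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a model trained only on CT scans, so that it can be adapted efficiently once PET scans are available, and also supports prognosis.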

Details Motivation: Addresses the dependency on paired CT-PET data; because PET scans are scarce, a flexible framework is needed that starts from CT and adapts to PET. Method: Uses low-rank adaptation (LoRA) and decomposed low-rank adaptation (DoRA) for parameter-efficient tuning and reduces cross-modal entanglement. Result: Performance is comparable to early fusion with only 8% of the trainable parameters; the Dice score on PET scans improves by 28%, and the concordance index for prognosis improves by 10% to 23%. Conclusion: PEMMA is efficient and flexible, markedly improves multi-modal adaptation, and suits data-scarce settings. Abstract: Cancer detection and prognosis rely heavily on medical imaging, particularly CT and PET scans. Deep Neural Networks (DNNs) have shown promise in tumor segmentation by fusing information from these modalities. However, a critical bottleneck exists: the dependency on CT-PET data concurrently for training and inference, posing a challenge due to the limited availability of PET scans. Hence, there is a clear need for a flexible and efficient framework that can be trained with the widely available CT scans and can still be adapted for PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans such that it can be efficiently adapted for use with PET scans when they become available. This framework is further extended to perform the prognosis task while maintaining the same efficient cross-modal fine-tuning approach. The proposed approach is tested with two well-known segmentation backbones, namely UNETR and Swin UNETR. Our approach offers two main advantages. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) as well as decomposed low-rank adaptation (DoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, by minimizing cross-modal entanglement, PEMMA allows updates using only one modality without causing catastrophic forgetting in the other. Our method achieves comparable performance to early fusion, but with only 8% of the trainable parameters, and demonstrates a significant +28% Dice score improvement on PET scans when trained with a single modality.
Furthermore, in prognosis, our method improves the concordance index by +10% when adapting a CT-pretrained model to include PET scans, and by +23% when adapting for both PET and EHR data.
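
To make the parameter-efficiency claim concrete, here is a minimal sketch of the generic LoRA update, in which a frozen weight W0 is adapted as W = W0 + (alpha / r) * B A with small trainable factors B and A. It uses plain Python lists to stay dependency-free. The helper names and the 768-dimensional example are assumptions for illustration; the sketch does not reproduce PEMMA's architecture, its DoRA variant, or its reported 8% figure, which covers the whole model.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * B @ A.

    w0: frozen d_out x d_in weight; b: d_out x r; a: r x d_in.
    Only a and b would be trained; w0 stays fixed.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def trainable_fraction(d_out, d_in, r):
    """Trainable LoRA parameters relative to one full d_out x d_in matrix."""
    return r * (d_out + d_in) / (d_out * d_in)


w = lora_effective_weight([[1, 0], [0, 1]], a=[[3, 4]], b=[[1], [2]], alpha=1)
# For a hypothetical 768 x 768 attention projection with rank 8:
fraction = trainable_fraction(768, 768, 8)
```

Because W0 is never modified, the CT-only behavior can be recovered by dropping the low-rank term, which fits the abstract's point about avoiding catastrophic forgetting.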

[49] Enhancing Pothole Detection and Characterization: Integrated Segmentation and Depth Estimation in Road Anomaly Systems

Uthman Baroudi,Alala BaHamid,Yasser Elalfy,Ziad Al Alami

Main category: cs.CV

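TL;DR: The paper proposes a road anomaly detection method based on a pre-trained YOLOv8-seg model that combines image segmentation with depth map data to detect and characterize road potholes precisely.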

Details Motivation: Machine learning approaches to road anomaly detection often fail to characterize potholes completely, and existing methods do not extract pothole depth, which limits how comprehensive detection can be. Method: Uses a pre-trained YOLOv8-seg model for pothole detection and segmentation, and combines the result with depth map data to extract pothole depth for a fuller characterization. Result: The method precisely localizes potholes and calculates their area, and obtains depth information from the depth map, making detection markedly more comprehensive. Conclusion: The method improves autonomous vehicles' perception of road hazards and helps road maintenance authorities respond more effectively. Abstract: Road anomaly detection plays a crucial role in road maintenance and in enhancing the safety of both drivers and vehicles. Recent machine learning approaches for road anomaly detection have overcome the tedious and time-consuming process of manual analysis and anomaly counting; however, they often fall short in providing a complete characterization of road potholes. In this paper, we leverage transfer learning by adopting a pre-trained YOLOv8-seg model for the automatic characterization of potholes using digital images captured from a dashboard-mounted camera. Our work includes the creation of a novel dataset, comprising both images and their corresponding depth maps, collected from diverse road environments in Al-Khobar city and the KFUPM campus in Saudi Arabia. Our approach performs pothole detection and segmentation to precisely localize potholes and calculate their area. Subsequently, the segmented image is merged with its depth map to extract detailed depth information about the potholes. This integration of segmentation and depth data offers a more comprehensive characterization compared to previous deep learning-based road anomaly detection systems. Overall, this method not only has the potential to significantly enhance autonomous vehicle navigation by improving the detection and characterization of road hazards but also assists road maintenance authorities in responding more effectively to road damage.
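
The step of merging a segmentation mask with a depth map can be sketched as follows. This is an illustrative simplification, not the authors' pipeline. It assumes the depth map stores camera-to-surface distance in meters (so pothole pixels read farther than the surrounding road), that every pixel covers the same ground area, and that the road level can be estimated as the median depth outside the mask. The function name and return keys are hypothetical.

```python
def pothole_stats(mask, depth, pixel_area_m2):
    """Estimate pothole area and depth from a binary mask and a depth map.

    mask: 2D list of 0/1 values (1 = pothole pixel from the segmentation).
    depth: 2D list of distances in meters, same shape as mask.
    pixel_area_m2: assumed constant ground area per pixel (ignores perspective).
    """
    inside = [d for m_row, d_row in zip(mask, depth)
              for m, d in zip(m_row, d_row) if m]
    outside = [d for m_row, d_row in zip(mask, depth)
               for m, d in zip(m_row, d_row) if not m]
    if not inside or not outside:
        return None  # nothing segmented, or no road surface to compare with

    # Median of the surrounding pixels (upper median for even counts).
    road_level = sorted(outside)[len(outside) // 2]
    return {
        "area_m2": len(inside) * pixel_area_m2,
        "max_depth_m": max(inside) - road_level,
        "mean_depth_m": sum(inside) / len(inside) - road_level,
    }


mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 0, 0]]
depth = [[1.0, 1.0, 1.0],
         [1.0, 1.05, 1.10],
         [1.0, 1.0, 1.0]]
stats = pothole_stats(mask, depth, pixel_area_m2=0.01)
```

With a dashboard camera the ground area per pixel varies with distance, so a real system would need a per-pixel scale derived from the depth map and camera intrinsics.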

[50] EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

Sijing Li,Tianwei Lin,Lingshuai Lin,Wenqiao Zhang,Jiang Liu,Xiaoda Yang,Juncheng Li,Yucheng He,Xiaohui Song,Jun Xiao,Yueting Zhuang,Beng Chin Ooi

Main category: cs.CV

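TL;DR: The paper proposes the Eyecare Kit to address three challenges in intelligent ophthalmic diagnosis (data, benchmark, and model) by building the high-quality dataset Eyecare-100K, designing the benchmark Eyecare-Bench, and developing the optimized model EyecareGPT.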

Details Motivation: Current Medical Large Vision-Language Models (Med-LVLMs) suffer from insufficient data, the lack of a systematic evaluation benchmark, and poor model adaptability in intelligent ophthalmic diagnosis. Method: Builds a multi-agent data engine to produce the Eyecare-100K dataset, designs the Eyecare-Bench benchmark, and develops EyecareGPT, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Result: EyecareGPT achieves state-of-the-art performance on a range of ophthalmic tasks. Conclusion: The Eyecare Kit provides important tools and potential directions for research on intelligent ophthalmic diagnosis. Abstract: Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with the tailored dataset, benchmark and model: First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, thoroughly optimized for fine-grained ophthalmic visual understanding, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.

[51] AnyTSR: Any-Scale Thermal Super-Resolution for UAV

Mengyuan Li,Changhong Fu,Ziyu Lu,Zijie Zhang,Haobo Zuo,Liangliang Yao

Main category: cs.CV

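TL;DR: Proposes AnyTSR, a novel any-scale thermal super-resolution method that addresses the low resolution of UAV thermal imaging within a single model, improving detail and boundary sharpness.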

Details Motivation: The low resolution of thermal sensors leads to insufficient detail and blurred boundaries, and most existing super-resolution methods are fixed-scale and computationally expensive, making them impractical. Method: Designs a new image encoder that assigns specific feature codes, and proposes an any-scale upsampler that embeds coordinate offset information to reduce artifacts. Result: Experiments show the method outperforms existing techniques at all scales and produces more accurate and detailed high-resolution images. Conclusion: AnyTSR offers an efficient, flexible solution for UAV thermal imaging that applies to a variety of scenes. Abstract: Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, but most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address the above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature code to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.

[52] Analysing the Robustness of Vision-Language-Models to Common Corruptions

Muhammad Usama,Syeda Aisha Asim,Syed Bilal Ali,Syed Talal Wasim,Umair Bin Mansoor

Main category: cs.CV

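TL;DR: This paper presents the first comprehensive analysis of the robustness of vision-language models (VLMs) to 19 types of image corruption, revealing distinct vulnerability patterns in text recognition and object reasoning tasks.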

Details Motivation: Although vision-language models are strong at understanding and reasoning about visual and textual content, their robustness to common image corruptions has not been studied thoroughly. Method: Introduces two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object reasoning, and analyzes the frequency-domain characteristics of different corruption types. Result: Transformer-based VLMs show different vulnerability patterns across tasks: text recognition degrades most under blur and snow corruptions, while object reasoning is more sensitive to frost and impulse noise. Conclusion: The findings offer useful insights for developing more robust vision-language models, particularly for image corruptions encountered in real-world applications. Abstract: Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.

[53] Zebrafish Counting Using Event Stream Data

Qianghua Chen,Huiyu Wang,Li Ming,Ying Zhao

Main category: cs.CV

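TL;DR: Proposes a zebrafish counting algorithm based on event stream data, which acquires data with an event camera and uses trajectory information to improve counting accuracy, reaching an average accuracy of 97.95%.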

Details Motivation: Zebrafish genes are highly homologous to human genes, so zebrafish are widely used in biomedical research, but their tiny size makes manual counting difficult and existing methods have limitations. Method: Acquires data with an event camera, performs camera calibration and image fusion, uses trajectory information to improve counting accuracy, and obtains the final result by averaging over an empirical period and rounding. Result: Over 100 counting trials, the average accuracy reached 97.95%, better than traditional algorithms. Conclusion: The algorithm is simple to implement and highly accurate, making it suitable for zebrafish counting. Abstract: Zebrafish share a high degree of genetic homology with humans and are commonly used as a model organism in biomedical research. For medical laboratories, counting zebrafish is a daily task. Due to the tiny size of zebrafish, manual visual counting is challenging. Existing counting methods are either not applicable to small fish or have too many limitations. This paper proposes a zebrafish counting algorithm based on event stream data. Firstly, an event camera is used for data acquisition. Secondly, camera calibration and image fusion are performed successively. Then, the trajectory information is used to improve the counting accuracy. Finally, the counting results are averaged over an empirical period and rounded up to get the final results. To evaluate the accuracy of the algorithm, 20 zebrafish were put in a four-liter breeding tank. Among 100 counting trials, the average accuracy reached 97.95%. Compared with traditional algorithms, the proposed one offers a simpler implementation and achieves higher accuracy.
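
The final aggregation step (average the counts over an empirical period, then round) can be sketched as below. The abstract says the average is "rounded up", so the sketch defaults to the ceiling and exposes rounding to nearest as an option. The function names, the per-frame input format, and the accuracy formula are assumptions for illustration; the abstract does not define how accuracy is computed.

```python
import math


def final_count(per_frame_counts, round_up=True):
    """Average per-frame counts over the chosen period, then round.

    per_frame_counts: counts produced for each frame in the period
    (hypothetical input format). round_up=True takes the ceiling of the mean.
    """
    if not per_frame_counts:
        raise ValueError("need at least one per-frame count")
    mean = sum(per_frame_counts) / len(per_frame_counts)
    return math.ceil(mean) if round_up else round(mean)


def counting_accuracy(predicted, actual):
    """One plausible per-trial accuracy: 1 - relative counting error."""
    return 1.0 - abs(predicted - actual) / actual


# Fish that briefly overlap are missed in some frames; averaging smooths this.
count = final_count([19, 20, 20, 19, 20])
accuracy = counting_accuracy(19, 20)
```

Rounding up biases the estimate toward the larger count, which is reasonable when occlusion makes per-frame undercounting more likely than overcounting.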

[54] Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

Heng Liu,Guanghui Li,Mingqi Gao,Xiantong Zhen,Feng Zheng,Yang Wang

Main category: cs.CV

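TL;DR: FS-RVOS is a Transformer-based model that segments objects in videos from natural-language descriptions, using a cross-modal affinity module and an instance sequence matching strategy.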

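Details Motivation: To address video object segmentation guided by natural-language descriptions and to improve the accuracy and robustness of multi-object segmentation. Method: A Transformer architecture is combined with a cross-modal affinity module and an instance sequence matching strategy, and is extended to the multi-object segmentation model FS-RVMOS. Result: The models outperform existing methods on several benchmarks, showing higher robustness and accuracy. Conclusion: FS-RVOS and FS-RVMOS perform well on video object segmentation and offer an effective solution for multi-object segmentation. Abstract: Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.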

[55] Human-aligned Deep Learning: Explainability, Causality, and Biological Inspiration

Gianluca Carloni

Main category: cs.CV

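TL;DR: This thesis aligns deep learning with human reasoning from three perspectives (explainability, causality, and biological vision) and proposes several methods to improve the efficiency, interpretability, and robustness of medical image classification.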

Details Motivation: To improve the efficiency, interpretability, and robustness of deep learning for medical image classification by incorporating human reasoning capabilities, narrowing the gap between research and clinical use. Method: (1) Evaluate neural network visualization techniques and propose an explainable-by-design method; (2) propose causal modules that exploit feature co-occurrence; (3) develop the CROCODILE framework, which integrates causal concepts with contrastive learning; (4) propose the CoCoReco network, which mimics biological vision. Result: (1) Activation maximization is of limited use for medical images; (2) prototype learning is effective and radiologically aligned; (3) XAI and causal ML are closely connected; (4) weak causal signals can improve performance; (5) the framework generalizes across domains; (6) biological circuit motifs improve recognition. Conclusion: The work advances human-aligned deep learning and outlines a path toward greater clinical trust, diagnostic accuracy, and safe deployment. Abstract: This work aligns deep learning (DL) with human reasoning capabilities and needs to enable more efficient, interpretable, and robust image classification. We approach this from three perspectives: explainability, causality, and biological vision. Introduction and background open this work before diving into operative chapters. First, we assess neural networks' visualization techniques for medical images and validate an explainable-by-design method for breast mass classification. A comprehensive review at the intersection of XAI and causality follows, where we introduce a general scaffold to organize past and future research, laying the groundwork for our second perspective. In the causality direction, we propose novel modules that exploit feature co-occurrence in medical images, leading to more effective and explainable predictions. We further introduce CROCODILE, a general framework that integrates causal concepts, contrastive learning, feature disentanglement, and prior knowledge to enhance generalization. Lastly, we explore biological vision, examining how humans recognize objects, and propose CoCoReco, a connectivity-inspired network with context-aware attention mechanisms.
Overall, our key findings include: (i) simple activation maximization lacks insight for medical imaging DL models; (ii) prototypical-part learning is effective and radiologically aligned; (iii) XAI and causal ML are deeply connected; (iv) weak causal signals can be leveraged without a priori information to improve performance and interpretability; (v) our framework generalizes across medical domains and out-of-distribution data; (vi) incorporating biological circuit motifs improves human-aligned recognition. This work contributes toward human-aligned DL and highlights pathways to bridge the gap between research and clinical adoption, with implications for improved trust, diagnostic accuracy, and safe deployment.

[56] MLEP: Multi-granularity Local Entropy Patterns for Universal AI-generated Image Detection

Lin Yuan,Xiaowan Li,Yan Zhang,Jiawei Zhang,Hongbo Li,Xinbo Gao

Main category: cs.CV

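TL;DR: This paper proposes Multi-granularity Local Entropy Patterns (MLEP), an image-entropy-based method for detecting AI-generated images (AIGI) that addresses the weak generalization of existing methods across diverse generative models and scenes.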

Details Motivation: Misuse of AI-generated images (AIGI), such as misinformation and deepfakes, has created an urgent need for detection, but existing methods struggle because they lack source-invariant features and generalize poorly. Method: MLEP computes entropy feature maps over shuffled image patches at multiple scales, capturing pixel relationships comprehensively while reducing content bias, and feeds them to a CNN classifier. Result: Tested on images synthesized by 32 generative models, MLEP significantly outperforms existing methods in accuracy and generalization. Conclusion: MLEP offers an efficient AIGI detection solution with strong generalization. Abstract: Advancements in image generation technologies have raised significant concerns about their potential misuse, such as producing misinformation and deepfakes. Therefore, there is an urgent need for effective methods to detect AI-generated images (AIGI). Despite progress in AIGI detection, achieving reliable performance across diverse generation models and scenes remains challenging due to the lack of source-invariant features and limited generalization capabilities in existing methods. In this work, we explore the potential of using image entropy as a cue for AIGI detection and propose Multi-granularity Local Entropy Patterns (MLEP), a set of entropy feature maps computed across shuffled small patches over multiple image scales. MLEP comprehensively captures pixel relationships across dimensions and scales while significantly disrupting image semantics, reducing potential content bias. Leveraging MLEP, a robust CNN-based classifier for AIGI detection can be trained. Extensive experiments conducted in an open-world scenario, evaluating images synthesized by 32 distinct generative models, demonstrate significant improvements over state-of-the-art methods in both accuracy and generalization.
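
To make the idea of "entropy feature maps computed across shuffled small patches" concrete, here is a rough single-scale sketch. It is not the MLEP implementation: the function names, the 2x2 window, the tile size, and the use of plain Shannon entropy over raw pixel values are assumptions for illustration. A multi-scale variant would presumably repeat the same steps on resized copies of the image.

```python
import math
import random
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def shuffle_patches(img, patch, seed=0):
    """Cut `img` (list of rows) into patch x patch tiles and permute them.

    Assumes height and width are multiples of `patch`. Shuffling destroys
    global semantics while keeping local pixel statistics intact.
    """
    h, w = len(img), len(img[0])
    tiles = [[row[x:x + patch] for row in img[y:y + patch]]
             for y in range(0, h, patch) for x in range(0, w, patch)]
    random.Random(seed).shuffle(tiles)
    per_row = w // patch
    out = []
    for band in range(h // patch):
        band_tiles = tiles[band * per_row:(band + 1) * per_row]
        for r in range(patch):
            out.append([v for tile in band_tiles for v in tile[r]])
    return out


def local_entropy_map(img, k=2):
    """Entropy of every k x k sliding window (stride 1)."""
    h, w = len(img), len(img[0])
    return [[shannon_entropy([img[y + dy][x + dx]
                              for dy in range(k) for dx in range(k)])
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]


image = [[0, 0, 5, 7],
         [0, 0, 6, 8],
         [3, 3, 9, 9],
         [3, 4, 9, 9]]
features = local_entropy_map(shuffle_patches(image, patch=2, seed=1))
print(features)
```

The resulting map would then serve as the input to a classifier in place of the raw pixels.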

[57] LimitNet: Progressive, Content-Aware Image Offloading for Extremely Weak Devices & Networks

Ali Hojjat,Janek Haberer,Tayyaba Zainab,Olaf Landsiedel

Main category: cs.CV

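TL;DR: LimitNet is a progressive image compression model designed for weak devices and low-bandwidth networks; it is content-aware and transmits critical data first, markedly improving inference accuracy when only partial data is available while saving bandwidth.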

Details Motivation: IoT devices have limited hardware and are often deployed in remote areas; existing methods perform poorly over LPWANs with low bandwidth and high packet loss and cannot support time-sensitive inference. Method: LimitNet uses a lightweight progressive encoder that prioritizes critical data based on image content, allowing the cloud to run inference with only partial data. Result: Compared with SOTA, LimitNet improves accuracy by 14.01 percentage points on ImageNet1000, 18.01 percentage points on CIFAR100, and 0.1 mAP@0.5 on COCO, while saving 61.24%, 83.68%, and 42.25% of bandwidth, respectively. Conclusion: LimitNet markedly improves inference performance and bandwidth efficiency on weak devices and low-bandwidth links, making it suitable for IoT scenarios. Abstract: IoT devices have limited hardware capabilities and are often deployed in remote areas. Consequently, advanced vision models surpass such devices' processing and storage capabilities, requiring offloading of such tasks to the cloud. However, remote areas often rely on LPWAN technology with limited bandwidth, high packet loss rates, and extremely low duty cycles, which makes fast offloading for time-sensitive inference challenging. Today's approaches, which are deployable on weak devices, generate a non-progressive bit stream, and therefore, their decoding quality suffers strongly when data is only partially available on the cloud at a deadline due to limited bandwidth or packet losses. In this paper, we introduce LimitNet, a progressive, content-aware image compression model designed for extremely weak devices and networks. LimitNet's lightweight progressive encoder prioritizes critical data during transmission based on the content of the image, which gives the cloud the opportunity to run inference even with partial data availability. Experimental results demonstrate that LimitNet, on average, compared to SOTA, achieves 14.01 p.p. (percentage points) higher accuracy on ImageNet1000, 18.01 p.p. on CIFAR100, and 0.1 higher mAP@0.5 on COCO. Also, on average, LimitNet saves 61.24% bandwidth on ImageNet1000, 83.68% on CIFAR100, and 42.25% on the COCO dataset compared to SOTA, while it only has 4% more encoding time compared to JPEG (with a fixed quality) on STM32F7 (Cortex-M7).

[58] ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis

Andrea Rigo,Luca Stornaiuolo,Mauro Martino,Bruno Lepri,Nicu Sebe

Main category: cs.CV

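TL;DR: This paper proposes ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation (LoRA) that improves the spatial consistency of text-to-image models without increasing generation time or degrading output quality.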

Details Motivation: Existing text-to-image models render spatial relationships poorly, and existing remedies usually rely on external network conditioning and predefined layouts, which raises computational cost and reduces flexibility. Method: A dataset of spatially explicit prompts is extracted and synthesized from LAION-400M, and the ESPLoRA framework and TORE algorithm are developed alongside evaluation metrics grounded in geometric constraints to improve spatial consistency. Result: ESPLoRA outperforms the current state-of-the-art framework, CoMPaSS, by 13.33% on spatial consistency benchmarks. Conclusion: The method markedly improves the spatial consistency of generated images while remaining efficient and flexible. Abstract: Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms the current state-of-the-art framework, CoMPaSS, by 13.33% on established spatial consistency benchmarks.

[59] DAM-Net: Domain Adaptation Network with Micro-Labeled Fine-Tuning for Change Detection

Hongjia Chen,Xin Xu,Fangling Pu

Main category: cs.CV

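TL;DR: DAM-Net is a change detection method that combines adversarial domain adaptation with micro-labeled fine-tuning, markedly improving cross-dataset performance and matching semi-supervised methods with only 0.3% labeled data.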

Details Motivation: Existing deep learning methods for change detection adapt poorly across domains and need large amounts of labeled data for retraining, which limits practical use. Method: DAM-Net combines adversarial domain adaptation (a segmentation-discriminator and an alternating training strategy) with micro-labeled fine-tuning (labeling fewer than 1% of samples), and uses a Multi-Temporal Transformer and an optimized backbone. Result: On the LEVIR-CD and WHU-CD datasets, DAM-Net significantly outperforms existing domain adaptation methods and approaches the performance of semi-supervised methods that need 10% labeled data. Conclusion: DAM-Net offers a new paradigm for efficient domain adaptation in remote sensing and advances cross-dataset change detection. Abstract: Change detection (CD) in remote sensing imagery plays a crucial role in various applications such as urban planning, damage assessment, and resource management. While deep learning approaches have significantly advanced CD performance, current methods suffer from poor domain adaptability, requiring extensive labeled data for retraining when applied to new scenarios. This limitation severely restricts their practical applications across different datasets. In this work, we propose DAM-Net: a Domain Adaptation Network with Micro-Labeled Fine-Tuning for CD. Our network introduces adversarial domain adaptation to CD, utilizing a specially designed segmentation-discriminator and alternating training strategy to enable effective transfer between domains. Additionally, we propose a novel Micro-Labeled Fine-Tuning approach that strategically selects and labels a minimal amount of samples (less than 1%) to enhance domain adaptation. The network incorporates a Multi-Temporal Transformer for feature fusion and optimized backbone structure based on previous research. Experiments conducted on the LEVIR-CD and WHU-CD datasets demonstrate that DAM-Net significantly outperforms existing domain adaptation methods, achieving comparable performance to semi-supervised approaches that require 10% labeled data while using only 0.3% labeled samples. Our approach significantly advances cross-dataset CD applications and provides a new paradigm for efficient domain adaptation in remote sensing. The source code of DAM-Net will be made publicly available upon publication.

[60] Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis

Zhu Zhu,Shuo Jiang,Jingyuan Zheng,Yawen Li,Yifei Chen,Manli Zhao,Weizhong Gu,Feiwei Qin,Jinhu Wang,Gang Yu

Main category: cs.CV

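TL;DR: CMSwinKAN is a contrastive-learning-based multi-scale feature fusion model for pathological image classification that significantly improves interpretability and accuracy.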

Details Motivation: Neuroblastoma diagnosis relies on subjective manual examination, and existing automated methods suffer from poor interpretability, limited feature extraction, and high computational cost. Method: The Swin Transformer architecture is combined with a Kernel Activation Network, using multi-scale feature fusion and contrastive learning, plus a heuristic soft voting mechanism. Result: The model outperforms existing state-of-the-art pathology-specific models on the PpNTs and BreakHis datasets. Conclusion: CMSwinKAN performs well on pathological image classification and shows potential for clinical use. Abstract: Neuroblastoma, adrenal-derived, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent accuracy. Existing automated whole slide image classification methods encounter challenges such as poor interpretability, limited feature extraction capabilities, and high computational costs, restricting their practical clinical deployment. To overcome these limitations, we propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification, which enhances the Swin Transformer architecture by integrating a Kernel Activation Network within its multilayer perceptron and classification head modules, significantly improving both interpretability and accuracy. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians' comprehensive approach, effectively capturing global and local tissue characteristics. Additionally, we introduce a heuristic soft voting mechanism guided by clinical insights to seamlessly bridge patch-level predictions to whole slide image-level classifications. We validate CMSwinKAN on the PpNTs dataset, which was collaboratively established with our partner hospital, and on the publicly accessible BreakHis dataset. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets.
Our source code is available at https://github.com/JSLiam94/CMSwinKAN.
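
The step from patch-level predictions to a slide-level label can be illustrated with generic soft voting. The sketch below is an assumption-laden stand-in: the abstract does not specify the clinically guided heuristic, so the optional per-patch `weights` argument here is purely hypothetical, and the function name `soft_vote` is invented for illustration.

```python
def soft_vote(patch_probs, weights=None):
    """Combine per-patch class probabilities into one slide-level prediction.

    patch_probs: list of probability vectors, one per patch.
    weights: optional per-patch weights (hypothetical stand-in for a
             clinically motivated heuristic); defaults to uniform.
    Returns (predicted_class_index, slide_level_probabilities).
    """
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    if weights is None:
        weights = [1.0] * len(patch_probs)
    total = sum(weights)
    n_classes = len(patch_probs[0])
    slide_probs = [
        sum(w * p[c] for w, p in zip(weights, patch_probs)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: slide_probs[c])
    return label, slide_probs


patches = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
print(soft_vote(patches))                    # uniform weights
print(soft_vote(patches, [0.0, 1.0, 1.0]))   # down-weight the first patch
```

Unlike hard (majority) voting, soft voting lets one very confident patch outweigh several marginal ones, which is why the two calls above can disagree.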

[61] Fragile Watermarking for Image Certification Using Deep Steganographic Embedding

Davide Ghiani,Jefferson David Rodriguez Chivata,Stefano Lilliu,Simone Maurizio La Cava,Marco Micheletto,Giulia Orrù,Federico Lama,Gian Luca Marcialis

Main category: cs.CV

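TL;DR: This paper explores fragile watermarking via deep steganographic embedding to verify the integrity of ICAO-compliant facial images and guard against malicious or unintentional modification.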

Details Motivation: Modern identity verification systems rely on facial images, but these images can be altered by compression, resizing, or malicious manipulation such as morphing, which affects recognition systems. Method: A hidden image is embedded in the official photo through deep steganographic embedding, establishing an integrity marker that reveals later modifications. Result: Experiments show that the approach detects image modifications with high accuracy, including in cross-method scenarios. Conclusion: Fragile watermarking is an effective tool for verifying the integrity of biometric documents. Abstract: Modern identity verification systems increasingly rely on facial images embedded in biometric documents such as electronic passports. To ensure global interoperability and security, these images must comply with strict standards defined by the International Civil Aviation Organization (ICAO), which specify acquisition, quality, and format requirements. However, once issued, these images may undergo unintentional degradations (e.g., compression, resizing) or malicious manipulations (e.g., morphing) and deceive facial recognition systems. In this study, we explore fragile watermarking based on deep steganographic embedding as a proactive mechanism to certify the authenticity of ICAO-compliant facial images. By embedding a hidden image within the official photo at the time of issuance, we establish an integrity marker that becomes sensitive to any post-issuance modification. We assess how a range of image manipulations affects the recovered hidden image and show that degradation artifacts can serve as robust forensic cues. Furthermore, we propose a classification framework that analyzes the revealed content to detect and categorize the type of manipulation applied. Our experiments demonstrate high detection accuracy, including cross-method scenarios with multiple deep steganography-based models. These findings support the viability of fragile watermarking via steganographic embedding as a valuable tool for biometric document integrity verification.

[62] Decoding Vision Transformers: the Diffusion Steering Lens

Ryota Takatsuki,Sonia Joseph,Ippei Fujisawa,Ryota Kanai

Main category: cs.CV

TL;DR: This paper proposes Diffusion Steering Lens (DSL), a new method for visualizing internal representations in Vision Transformers (ViTs) that addresses the limitations of existing methods such as Logit Lens and Diffusion Lens.

Details Motivation: Although Logit Lens and Diffusion Lens have been applied to language models and image encoders, they cannot fully capture the richness of visual representations or directly analyze the contributions of individual submodules. Method: DSL steers submodule outputs and patches subsequent indirect contributions, giving an intuitive interpretation of ViT internals without any training. Result: Interventional studies confirm that DSL provides a reliable and intuitive interpretation of the internal processing in ViTs. Conclusion: DSL is an effective training-free method that analyzes ViT internal representations more completely and fills the gaps left by existing methods. Abstract: Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024), who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose Diffusion Steering Lens (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.

[63] Fighting Fires from Space: Leveraging Vision Transformers for Enhanced Wildfire Detection and Characterization

Aman Agarwal,James Gearon,Raksha Rank,Etienne Chenevert

Main category: cs.CV

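TL;DR: The paper compares Vision Transformers (ViTs) and CNNs for wildfire detection in satellite imagery, finding that ViTs come close to CNNs but that a well-tuned CNN (such as UNet) remains the best choice.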

Details Motivation: Anthropogenic climate change is increasing the frequency and intensity of wildfires, existing detection systems struggle with sustained wildfire seasons, and more efficient detection methods are needed. Method: ViTs and CNNs are compared on wildfire detection using a Landsat-8 satellite imagery dataset, considering training efficiency and use of contextual information. Result: A ViT outperforms the baseline CNN by 0.92%, but the tuned CNN (UNet) is best on every metric, with an IoU of 93.58%, 4.58% above the baseline. Conclusion: ViTs are close to CNNs for wildfire detection, but well-tuned CNNs such as UNet remain the best option at present. Abstract: Wildfires are increasing in intensity, frequency, and duration across large parts of the world as a result of anthropogenic climate change. Modern hazard detection and response systems that deal with wildfires are under-equipped for sustained wildfire seasons. Recent work has shown that automated wildfire detection using Convolutional Neural Networks (CNNs) trained on satellite imagery is capable of high-accuracy results. However, CNNs are computationally expensive to train and only incorporate local image context. Recently, Vision Transformers (ViTs) have gained popularity for their efficient training and their ability to include both local and global contextual information. In this work, we show that ViT can outperform well-trained and specialized CNNs to detect wildfires on a previously published dataset of Landsat-8 imagery. One of our ViTs outperforms the baseline CNN comparison by 0.92%. However, we find our own implementation of CNN-based UNet to perform best in every category, showing their sustained utility in image tasks. Overall, ViTs are comparably capable to CNNs in detecting wildfires, though well-tuned CNNs are still the best technique for detecting wildfire, with our UNet providing an IoU of 93.58%, better than the baseline UNet by some 4.58%.
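
For readers unfamiliar with the IoU figure quoted above, the metric for binary fire masks can be computed as in this small sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def iou(pred, target):
    """Intersection-over-Union of two binary masks (lists of rows of 0/1).

    Convention (an assumption): two empty masks count as a perfect match.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return 1.0 if union == 0 else inter / union


predicted = [[1, 1],
             [0, 0]]
ground_truth = [[1, 0],
                [0, 0]]
print(iou(predicted, ground_truth))
```

Because fire pixels are typically a small fraction of a satellite tile, IoU is a stricter measure than pixel accuracy, which a model could inflate by predicting "no fire" everywhere.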

[64] RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud Completion

Yixuan Yang,Jinyu Yang,Zixiang Zhao,Victor Sanchez,Feng Zheng

Main category: cs.CV

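TL;DR: Proposes RefComp, an unpaired point cloud completion framework that uses reference data to guide completion and performs strongly in both class-aware and class-agnostic training settings.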

Details Motivation: Existing unpaired point cloud completion methods require a separate model per object class, generalize poorly, and struggle with the variety of 3D objects in real-world scenes. Method: The unpaired completion problem is recast as a shape translation problem solved in the latent feature space. Partial-complete point cloud pairs are introduced as reference data, and a reference branch and a target branch with shared parameters perform shape fusion and translation. Result: RefComp achieves strong performance in both class-aware and class-agnostic training settings, on virtual scans and real-world datasets. Conclusion: Through reference-data guidance and the shape fusion module, RefComp markedly improves the performance and generalization of unpaired point cloud completion. Abstract: The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.

[65] CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning

Yang Yue,Yulin Wang,Chenxin Tao,Pan Liu,Shiji Song,Gao Huang

Main category: cs.CV

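TL;DR: CheXWorld is the first self-supervised world model for radiographic images; a unified framework models local anatomical structures, global anatomical layouts, and domain variations, and it significantly outperforms existing methods.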

Details Motivation: Humans build internal world models to predict the consequences of their actions; the same idea holds great promise in machine learning but has not yet been applied to radiographic imaging. Method: A unified framework is developed that simultaneously models local anatomical structures, global anatomical layouts, and domain variations, and it is validated through qualitative and quantitative analyses. Result: CheXWorld captures these three dimensions of medical knowledge and outperforms existing methods on multiple benchmarks. Conclusion: CheXWorld provides an effective framework for self-supervised learning on radiographs, with broad application potential. Abstract: Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.

[66] Outlier-Robust Multi-Model Fitting on Quantum Annealers

Saurabh Pandey,Luca Magri,Federica Arrigoni,Vladislav Golyanik

Main category: cs.CV

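TL;DR: This paper proposes the robust quantum multi-model fitting (R-QuMF) algorithm, which uses quantum computing to solve multi-model fitting problems in computer vision and is particularly suited to data with noise and outliers.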

Details Motivation: Existing quantum approaches to model fitting handle only a single model or outlier-free datasets, which does not meet practical needs. Method: The problem is formulated as a maximum set coverage task for adiabatic quantum computers (AQC), with no need to know the number of models in advance. Result: R-QuMF outperforms existing quantum techniques on synthetic and real-world 3D datasets. Conclusion: Quantum computing shows promise for complex multi-model fitting problems, especially on noisy, outlier-prone real-world data. Abstract: Multi-model fitting (MMF) presents a significant challenge in Computer Vision, particularly due to its combinatorial nature. While recent advancements in quantum computing offer promise for addressing NP-hard problems, existing quantum-based approaches for model fitting are either limited to a single model or consider multi-model scenarios within outlier-free datasets. This paper introduces a novel approach, the robust quantum multi-model fitting (R-QuMF) algorithm, designed to handle outliers effectively. Our method leverages the intrinsic capabilities of quantum hardware to tackle combinatorial challenges inherent in MMF tasks, and it does not require prior knowledge of the exact number of models, thereby enhancing its practical applicability. By formulating the problem as a maximum set coverage task for adiabatic quantum computers (AQC), R-QuMF outperforms existing quantum techniques, demonstrating superior performance across various synthetic and real-world 3D datasets. Our findings underscore the potential of quantum computing in addressing the complexities of MMF, especially in real-world scenarios with noisy and outlier-prone data.

cs.GR [Back]

[67] EDGS: Eliminating Densification for Efficient Convergence of 3DGS

Dmytro Kotovenko,Olga Grebenkova,Björn Ommer

Main category: cs.GR

TL;DR: 提出了一种基于密集图像对应的一步几何近似方法,替代传统3D高斯泼溅的多次细化步骤,显著提升训练效率和渲染质量。

Details Motivation: Conventional 3D Gaussian Splatting reconstructs scenes through repeated densification steps, which is slow and performs poorly in high-frequency regions. Method: Triangulated pixels from dense image correspondences approximate scene geometry in one step and initialize the Gaussian parameters (color, scale, position), with no subsequent densification. Result: The method beats existing techniques in both training efficiency and rendering quality while using only half the splats of standard 3DGS. Conclusion: The method is an efficient solution that is compatible with existing techniques and well suited to scenes rich in high-frequency detail. Abstract: 3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and iteratively refining under-reconstructed regions. This process is inherently slow, as it requires multiple densification steps where Gaussians are repeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often leads to suboptimal renderings, particularly in high-frequency regions where detail is critical. We propose a fundamentally different approach: we eliminate the densification process with a one-step approximation of scene geometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimate rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian with well-informed colors, scales, and positions. As a result, we dramatically shorten the optimization path and remove the need for densification. Unlike traditional methods that rely on sparse keypoints, our dense initialization ensures uniform detail across the scene, even in high-frequency regions where 3DGS and other methods struggle. Moreover, since all splats are initialized in parallel at the start of optimization, we eliminate the need to wait for densification to adjust new Gaussians. Our method not only outperforms speed-optimized models in training efficiency but also achieves higher rendering quality than state-of-the-art approaches, all while using only half the splats of standard 3DGS. It is fully compatible with other 3DGS acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.

[68] DuoLoRA : Cycle-consistent and Rank-disentangled Content-Style Personalization

Aniket Roy,Shubhankar Borse,Shreya Kadambi,Debasmit Das,Shweta Mahajan,Risheek Garrepalli,Hyojin Park,Ankita Nayak,Rama Chellappa,Munawar Hayat,Fatih Porikli

Main category: cs.GR

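TL;DR: DuoLoRA is a new content-style personalization framework that uses rank-dimension mask learning, effective merging via layer priors, and a Constyle loss to address the way existing methods treat content and style as independent entities.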

Details Motivation: Existing methods such as ZipLoRA treat content and style as independent entities, whereas in practice the two are intertwined. DuoLoRA aims to merge content and style more effectively. Method: DuoLoRA has three key components: rank-dimension mask learning (ZipRank), effective merging based on SDXL layer priors, and the Constyle loss, which exploits cycle-consistency. Result: Experiments show that DuoLoRA outperforms existing content-style merging methods on multiple benchmarks. Conclusion: By integrating content and style more tightly, DuoLoRA markedly improves performance and offers a new approach to personalization tasks. Abstract: We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertwined, not independent. To address this, we propose DuoLoRA, a content-style personalization framework featuring three key components: (i) rank-dimension mask learning, (ii) effective merging via layer priors, and (iii) Constyle loss, which leverages cycle-consistency in the merging process. First, we introduce ZipRank, which performs content-style merging within the rank dimension, offering adaptive rank flexibility and significantly reducing the number of learnable parameters. Additionally, we incorporate SDXL layer priors to apply implicit rank constraints informed by each layer's content-style bias and adaptive merger initialization, enhancing the integration of content and style. To further refine the merging process, we introduce Constyle loss, which leverages the cycle-consistency between content and style. Our experimental results demonstrate that DuoLoRA outperforms state-of-the-art content-style merging methods across multiple benchmarks.

[69] BEV-GS: Feed-forward Gaussian Splatting in Bird's-Eye-View for Road Reconstruction

Wenhua Wu,Tong Zhao,Chensheng Peng,Lei Yang,Yintao Wei,Zhe Liu,Hesheng Wang

Main category: cs.GR

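TL;DR: BEV-GS is a real-time single-frame road surface reconstruction method based on feed-forward Gaussian splatting; separate geometry and texture networks estimate parameters directly from a single frame, avoiding per-scene optimization and achieving high accuracy at real-time speed.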

Details Motivation: The road surface is the sole contact medium for wheels or robot feet, so reconstructing it is crucial for unmanned vehicles and mobile robots. Existing methods such as NeRF and GS rely on multi-view input and long optimization times, so a more efficient approach is needed. Method: BEV-GS consists of a prediction module and a rendering module. The prediction module follows the Bird's-Eye-View paradigm and estimates parameters directly from a single frame through geometry and texture networks. The rendering module represents the road surface with grid Gaussians and performs novel view synthesis. Result: On the RSRD dataset, the road elevation error drops to 1.73 cm, novel view synthesis reaches a PSNR of 28.36 dB, and prediction and rendering run at 26 and 2061 FPS, respectively. Conclusion: BEV-GS delivers high accuracy and real-time performance, meeting the road reconstruction needs of unmanned vehicles and mobile robots. Abstract: Road surface is the sole contact medium for wheels or robot feet. Reconstructing road surface is crucial for unmanned vehicles and mobile robots. Recent studies on Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have achieved remarkable results in scene reconstruction. However, they typically rely on multi-view image inputs and require prolonged optimization times. In this paper, we propose BEV-GS, a real-time single-frame road surface reconstruction method based on feed-forward Gaussian splatting. BEV-GS consists of a prediction module and a rendering module. The prediction module introduces separate geometry and texture networks following the Bird's-Eye-View paradigm. Geometric and texture parameters are directly estimated from a single frame, avoiding per-scene optimization. In the rendering module, we utilize grid Gaussian for road surface representation and novel view synthesis, which better aligns with road surface characteristics. Our method achieves state-of-the-art performance on the real-world dataset RSRD. The road elevation error reduces to 1.73 cm, and the PSNR of novel view synthesis reaches 28.36 dB. The prediction and rendering frame rates are 26 and 2061 FPS, respectively, enabling high-accuracy and real-time applications. The code will be available at: https://github.com/cat-wwh/BEV-GS

[70] Image Editing with Diffusion Models: A Survey

Jia Wang,Jie Hu,Xiaoqi Ma,Hanghang Ma,Xiaoming Wei,Enhua Wu

Main category: cs.GR

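TL;DR: This survey covers diffusion-model-based image editing, summarizing the field in terms of task definition, method classification, result evaluation, and datasets, and looks ahead to future developments.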

Details Motivation: As base models generate higher-quality images, demand for image editing grows, but the variety of editing types and methods makes a comprehensive view hard to obtain. Method: Editing tasks are defined in terms of the parts operated on and the manipulation actions, methods are grouped into three categories (inversion-based, fine-tuning-based, and adapter-based), and evaluation metrics and datasets are organized. Result: The survey summarizes task forms, method categories, evaluation metrics, and datasets for image editing, giving researchers a systematic view. Conclusion: Building on this summary, the survey presents a vision for the future of the image editing field. Abstract: With deeper exploration of diffusion models, developments in the field of image generation have triggered a boom in image creation. As the quality of base-model generated images continues to improve, so does the demand for further applications like image editing. In recent years, many remarkable works have realized a wide variety of editing effects. However, the wide variety of editing types and diverse editing approaches have made it difficult for researchers to establish a comprehensive view of the development of this field. In this survey, we summarize the image editing field from four aspects: task definition, method classification, result evaluation, and editing datasets. First, we provide a definition of image editing, which in turn leads to a variety of editing task forms from the perspective of operation parts and manipulation actions. Subsequently, we categorize and summarize methods for implementing editing into three categories: inversion-based, fine-tuning-based, and adapter-based. In addition, we organize the currently used metrics, available datasets, and corresponding construction methods. At the end, we present some visions for the future development of the image editing field based on the previous summaries.

[71] Volume Encoding Gaussians: Transfer Function-Agnostic 3D Gaussians for Volume Rendering

Landon Dyken,Andres Sewell,Will Usher,Steve Petruzza,Sidharth Kumar

Main category: cs.GR

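TL;DR: VEG is a 3D Gaussian-based representation for scientific volume visualization that focuses on unstructured volume data and decouples visual appearance from data representation, enabling efficient rendering and compression.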

Details Motivation: Current applications of machine learning to visualization have largely ignored unstructured volume data; VEG aims to fill this gap. Method: VEG encodes only scalar values to decouple the data representation from visual appearance, optimizes the representation with an opacity-guided training strategy, and adapts each Gaussian to local geometry. Result: VEG achieves high-quality reconstruction across diverse data, compression ratios of up to 3600x, and fast interactive rendering. Conclusion: VEG offers an efficient, flexible, and transfer-function-agnostic solution for scientific visualization of unstructured volume data. Abstract: While HPC resources are increasingly being used to produce adaptively refined or unstructured volume datasets, current research in applying machine learning-based representation to visualization has largely ignored this type of data. To address this, we introduce Volume Encoding Gaussians (VEG), a novel 3D Gaussian-based representation for scientific volume visualization focused on unstructured volumes. Unlike prior 3D Gaussian Splatting (3DGS) methods that store view-dependent color and opacity for each Gaussian, VEG decouple the visual appearance from the data representation by encoding only scalar values, enabling transfer-function-agnostic rendering of 3DGS models for interactive scientific visualization. VEG are directly initialized from volume datasets, eliminating the need for structure-from-motion pipelines like COLMAP. To ensure complete scalar field coverage, we introduce an opacity-guided training strategy, using differentiable rendering with multiple transfer functions to optimize our data representation. This allows VEG to preserve fine features across the full scalar range of a dataset while remaining independent of any specific transfer function. Each Gaussian is scaled and rotated to adapt to local geometry, allowing for efficient representation of unstructured meshes without storing mesh connectivity and while using far fewer primitives. Across a diverse set of data, VEG achieve high reconstruction quality, compress large volume datasets by up to 3600x, and support lightning-fast rendering on commodity GPUs, enabling interactive visualization of large-scale structured and unstructured volumes.

[72] SMPL-GPTexture: Dual-View 3D Human Texture Estimation using Text-to-Image Generation Models

Mingxiao Tu,Shuchang Ye,Hoijoon Jung,Jinman Kim

Main category: cs.GR

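TL;DR: Proposes SMPL-GPTexture, which generates high-quality 3D human textures from natural language prompts, addressing data scarcity and the artifacts produced by generative models.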

Details Motivation: Real paired 3D human texture data are scarce and costly to acquire, and existing generative models are prone to artifacts. Method: Combines text-to-image generation, human mesh recovery, inverted rasterization, and diffusion-based inpainting of missing regions. Result: Generates high-resolution textures aligned with user prompts. Conclusion: SMPL-GPTexture effectively addresses the challenges of texture generation. Abstract: Generating high-quality, photorealistic textures for 3D human avatars remains a fundamental yet challenging task in the computer vision and multimedia fields. However, real paired front and back images of human subjects are rarely available owing to privacy, ethical, and acquisition-cost concerns, which restricts the scalability of the data. Additionally, learning priors from image inputs using deep generative models, such as GANs or diffusion models, to infer unseen regions such as the human back often leads to artifacts, structural inconsistencies, or loss of fine-grained detail. To address these issues, we present SMPL-GPTexture (skinned multi-person linear model - general purpose Texture), a novel pipeline that takes natural language prompts as input and leverages a state-of-the-art text-to-image generation model to produce paired high-resolution front and back images of a human subject as the starting point for texture estimation. Using the generated paired dual-view images, we first employ a human mesh recovery model to obtain a robust 2D-to-3D SMPL alignment between image pixels and the 3D model's UV coordinates for each view. Second, we use an inverted rasterization technique that explicitly projects the observed colour from the input images into the UV space, thereby producing accurate, complete texture maps. Finally, we apply a diffusion-based inpainting module to fill in the missing regions, and the fusion mechanism then combines these results into a unified full texture map. Extensive experiments show that our SMPL-GPTexture can generate high-resolution textures aligned with users' prompts.

[73] Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

Radek Daněček,Carolin Schmitt,Senya Polikovsky,Michael J. Black

Main category: cs.GR

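TL;DR: THUNDER is a 3D talking head animation framework based on differentiable speech reconstruction; it combines the accurate lip-sync of deterministic models with the rich expressions of stochastic models and significantly improves lip-sync quality.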

Details Motivation: Existing deterministic models deliver high-quality lip-sync but limited expressions, while stochastic models deliver rich expressions but lower lip-sync quality; the strengths of both need to be combined. Method: Trains a mesh-to-speech model that regresses audio from facial animation and integrates it into a diffusion-based animation framework, using differentiable speech reconstruction as a supervision signal. Result: THUNDER significantly improves lip-sync quality while preserving rich and diverse expressive animation. Conclusion: Through differentiable speech reconstruction, THUNDER achieves high-quality lip-sync and expressive animation, offering a new direction for 3D talking head animation. Abstract: In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.

[74] RT-HDIST: Ray-Tracing Core-based Hausdorff Distance Computation

YoungWoo Kim,Jaehong Lee,Duksu Kim

Main category: cs.GR

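TL;DR: RT-HDIST is a Hausdorff distance algorithm accelerated by ray-tracing cores; by recasting the problem as nearest-neighbor search and introducing a quantized index space, it significantly reduces computational overhead.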

Details Motivation: Computing the Hausdorff distance is expensive on large-scale datasets, so an efficient solution is needed. Method: Recasts the Hausdorff distance problem as nearest-neighbor search, introduces a quantized index space, and accelerates the computation with ray-tracing cores. Result: Experiments show that RT-HDIST is up to two orders of magnitude faster than existing methods while maintaining exact results. Conclusion: RT-HDIST provides an efficient Hausdorff distance computation for real-time and large-scale applications. Abstract: The Hausdorff distance is a fundamental metric with widespread applications across various fields. However, it remains computationally expensive, especially for large-scale datasets. In this work, we present RT-HDIST, the first Hausdorff distance algorithm accelerated by ray-tracing cores (RT-cores). By reformulating the Hausdorff distance problem as a series of nearest-neighbor searches and introducing a novel quantized index space, RT-HDIST achieves significant reductions in computational overhead while maintaining exact results. Extensive benchmarks demonstrate up to a two-order-of-magnitude speedup over prior state-of-the-art methods, underscoring RT-HDIST's potential for real-time and large-scale applications.
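To make the nearest-neighbor reformulation mentioned in the abstract concrete, here is a minimal brute-force sketch in plain Python. It is not the RT-HDIST algorithm: there are no RT-cores and no quantized index space, and the function names and sample points are assumptions chosen for illustration. It only shows that the Hausdorff distance is the largest nearest-neighbor distance taken in both directions.

```python
import math

def nearest_distance(p, points):
    """Distance from point p to its nearest neighbor in `points` (brute force)."""
    return min(math.dist(p, q) for q in points)

def directed_hausdorff(a, b):
    """Largest nearest-neighbor distance from any point of a to the set b."""
    return max(nearest_distance(p, b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical 2D point sets, for illustration only.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]

print(directed_hausdorff(A, B))  # farthest point of A from B
print(directed_hausdorff(B, A))  # farthest point of B from A
print(hausdorff(A, B))
```

Each call to `nearest_distance` is an independent query, which is presumably what makes the formulation attractive for hardware built to answer many spatial queries in parallel; the brute-force version here costs O(|A|·|B|) distance evaluations.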

[75] Ascribe New Dimensions to Scientific Data Visualization with VR

Daniela Ushizima,Guilherme Melo dos Santos,Zineb Sordo,Ronald Pandolfi,Jeffrey Donatelli

Main category: cs.GR

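TL;DR: ASCRIBE-VR is a platform that combines AI algorithms with VR technology to improve the analysis of complex scientific images through immersive environments.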

Details Motivation: Traditional 2D visualization methods hinder intuitive analysis of 3D structures, whereas VR offers a more immersive way to interact with data. Method: ASCRIBE-VR integrates AI-driven algorithms with scientific images and supports multimodal analysis and immersive visualization. Result: The platform supports visualization of advanced datasets such as X-ray CT and magnetic resonance imaging, and uses AI-based segmentation and feedback to streamline exploration. Conclusion: By combining AI with VR, ASCRIBE-VR bridges the gap between computational analysis and human intuition and advances scientific discovery. Abstract: For over half a century, the computer mouse has been the primary tool for interacting with digital data, yet it remains a limiting factor in exploring complex, multi-scale scientific images. Traditional 2D visualization methods hinder intuitive analysis of inherently 3D structures. Virtual Reality (VR) offers a transformative alternative, providing immersive, interactive environments that enhance data comprehension. This article introduces ASCRIBE-VR, a VR platform of Autonomous Solutions for Computational Research with Immersive Browsing & Exploration, which integrates AI-driven algorithms with scientific images. ASCRIBE-VR enables multimodal analysis, structural assessments, and immersive visualization, supporting scientific visualization of advanced datasets such as X-ray CT, Magnetic Resonance, and synthetic 3D imaging. Our VR tools, compatible with Meta Quest, can consume the output of our AI-based segmentation and iterative feedback processes to enable seamless exploration of large-scale 3D images. By merging AI-generated results with VR visualization, ASCRIBE-VR enhances scientific discovery, bridging the gap between computational analysis and human intuition in materials research, connecting human-in-the-loop with digital twins.

cs.CL [Back]

[76] Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis

In Hak Moon

Main category: cs.CL

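TL;DR: This study evaluates five large language models (LLMs) on calculus differentiation problems and finds that Chat GPT 4o performs best, although all models show limitations in conceptual understanding and algebraic manipulation.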

Details Motivation: To assess the potential of LLMs in mathematics education and reveal their strengths and weaknesses as learning tools. Method: Uses a systematic cross-evaluation framework to test five models on 13 fundamental problem types. Result: Chat GPT 4o achieved the highest success rate (94.71%), Claude Pro generated the most difficult problems, and all models performed worse on conceptual tasks. Conclusion: LLMs excel at procedural tasks but have limited conceptual understanding, so human instruction remains necessary. Abstract: This study presents a comprehensive evaluation of five leading large language models (LLMs) - Chat GPT 4o, Copilot Pro, Gemini Advanced, Claude Pro, and Meta AI - on their performance in solving calculus differentiation problems. The investigation assessed these models across 13 fundamental problem types, employing a systematic cross-evaluation framework where each model solved problems generated by all models. Results revealed significant performance disparities, with Chat GPT 4o achieving the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%). All models excelled at procedural differentiation tasks but showed varying limitations with conceptual understanding and algebraic manipulation. Notably, problems involving increasing/decreasing intervals and optimization word problems proved most challenging across all models. The cross-evaluation matrix revealed that Claude Pro generated the most difficult problems, suggesting distinct capabilities between problem generation and problem-solving. These findings have significant implications for educational applications, highlighting both the potential and limitations of LLMs as calculus learning tools. While they demonstrate impressive procedural capabilities, their conceptual understanding remains limited compared to human mathematical reasoning, emphasizing the continued importance of human instruction for developing deeper mathematical comprehension.

[77] BASIR: Budget-Assisted Sectoral Impact Ranking -- A Dataset for Sector Identification and Performance Prediction Using Language Models

Sohom Ghosh,Sudip Kumar Naskar

Main category: cs.CL

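TL;DR: The study proposes BASIR, a framework for analyzing and predicting the impact of India's annual Union Budget on specific economic sectors through multi-label classification and sector performance ranking.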

Details Motivation: Government fiscal policy significantly influences financial markets, yet there is still no systematic method for real-time analysis of how budgets affect sector-level equity performance. Method: Uses Indian budget transcripts from 1947 to 2025, combining fine-tuned embeddings with language models for sector classification and performance ranking. Result: An F1-score of 0.605 for sector classification and an NDCG score of 0.997 for performance ranking. Conclusion: The framework gives investors and policymakers a data-driven tool for quantifying policy impact, addressing gaps in manual analysis. Abstract: Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India's Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts. Our architecture incorporates fine-tuned embeddings for sector identification, coupled with language models that rank sectors based on their predicted performances. Our results demonstrate a 0.605 F1-score in sector classification, and a 0.997 NDCG score in predicting ranks of sectors based on post-budget performances. The methodology enables investors and policymakers to quantify fiscal policy impacts through structured, data-driven insights, addressing critical gaps in manual analysis. The annotated dataset has been released under the CC-BY-NC-SA-4.0 license to advance computational economics research.
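The ranking result above is reported as NDCG. As a reminder of what that score measures, the sketch below computes NDCG with the common linear-gain, log2-discount definition. The abstract does not say which NDCG variant or cutoff the authors used, so the formula choice, the sector names, and the relevance grades are all assumptions for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: gain at 0-based position i is rel / log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(predicted_order, true_relevance):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    gains = [true_relevance[item] for item in predicted_order]
    ideal = sorted(true_relevance.values(), reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical post-budget relevance grades per sector (higher means better performance).
relevance = {"banking": 3, "steel": 2, "textiles": 1}

perfect = ["banking", "steel", "textiles"]
swapped = ["steel", "banking", "textiles"]

print(ndcg(perfect, relevance))  # a perfect ranking scores 1.0
print(ndcg(swapped, relevance))  # swapping the top two sectors lowers the score
```

Because the discount is logarithmic, mistakes near the top of the ranking cost more than mistakes near the bottom.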

[78] KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding

Bokwang Hwang,Seonkyu Lim,Taewoong Kim,Yongjae Geun,Sunghyun Bang,Sohyun Park,Jihyun Park,Myeonggyu Lee,Jinwoo Lee,Yerin Kim,Jinsun Yoo,Jingyeong Hong,Jina Park,Yongchan Kim,Suhyun Kim,Younggyun Hahm,Yiseul Lee,Yejee Kang,Chanhyuk Yoon,Chansu Lee,Heeyewon Jeong,Jiyeon Lee,Seonhye Gu,Hyebin Kang,Yousang Cho,Hangyeol Yoo,KyungTae Lim

Main category: cs.CL

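TL;DR: KFinEval-Pilot is a benchmark for evaluating large language models (LLMs) in the Korean financial domain, with over 1,000 questions covering financial knowledge, legal reasoning, and financial toxicity. It is built through a semi-automated pipeline that combines GPT-4 generation with expert validation, and the evaluation shows that models differ in task accuracy and output safety.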

Details Motivation: To address the limitations of existing English-centric benchmarks in the financial domain and provide an early diagnostic tool for developing Korean financial AI systems. Method: A semi-automated pipeline combines GPT-4-generated questions with expert validation to build a benchmark covering financial knowledge, legal reasoning, and financial toxicity. Result: The evaluation shows notable differences between LLMs in task accuracy and output safety, exposing reasoning and safety challenges in financial applications. Conclusion: KFinEval-Pilot provides a foundation for developing safer and more reliable financial AI systems, particularly for the Korean financial domain. Abstract: We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.

[79] Sustainability via LLM Right-sizing

Jennifer Haase,Finn Klessascheck,Jan Mendling,Sebastian Pokutta

Main category: cs.CL

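TL;DR: The study evaluates 11 LLMs on 10 everyday tasks and finds that GPT-4o performs best but at high cost, while smaller models such as Gemma-3 and Phi-4 have advantages in cost, privacy, and local deployment.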

Details Motivation: To examine how organizational workflows can balance performance, cost, and sustainability rather than over-relying on high-performing but expensive LLMs. Method: Uses a dual-LLM evaluation framework that automates task execution and standardizes evaluation across 10 criteria covering output quality, factual accuracy, and ethical responsibility. Result: GPT-4o performs best but at high cost, smaller models are reliable on most tasks, and task type affects model effectiveness. Conclusion: Recommends shifting from performance maximization to task- and context-aware assessment that better matches organizational needs and supports sustainable LLM deployment. Abstract: Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

[80] DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

Weijie Shi,Jipeng Zhang,Yaguang Wu,Jingzhi Fang,Ruiyuan Zhang,Jiajie Xu,Jia Zhu,Hao Chen,Yao Zhao,Sirui Han,Xiaofang Zhou

Main category: cs.CL

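TL;DR: The paper proposes DIDS, a domain impact-aware data sampling method that optimizes domain sampling strategies through gradient clustering and an FIM-guided metric to improve model performance.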

Details Motivation: Domain sampling strategies on multi-domain datasets significantly affect model performance, but existing methods struggle to maintain intra-domain consistency and to measure domain impact accurately. Method: Proposes a gradient clustering algorithm to group data, uses an FIM-guided metric to quantify domain impact, and combines it with loss learning trajectories to determine optimal sampling ratios. Result: Experiments show that DIDS improves average performance by 3.4% while maintaining training efficiency. Conclusion: By optimizing the domain sampling strategy, DIDS improves model performance while remaining computationally efficient. Abstract: Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.

[81] ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs

Yan Yang,Yixia Li,Hongru Wang,Xuetao Wei,Jianqiao Yu,Yun Chen,Guanhua Chen

Main category: cs.CL

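TL;DR: ImPart is an SVD-based, importance-aware delta sparsification method that dynamically adjusts sparsity ratios to retain task-critical knowledge at high compression ratios and outperforms existing methods.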

Details Motivation: To address the problem that existing delta sparsification methods either ignore parameter importance or evaluate it at too coarse a granularity. Method: Uses SVD to dynamically adjust the sparsity ratios of different singular vectors according to their importance, retaining crucial knowledge. Result: Experiments show that ImPart achieves a 2x higher compression ratio at the same performance level and sets a new state of the art on delta quantization and model merging. Conclusion: Importance-aware sparsification substantially improves delta compression and offers an efficient solution for multi-model deployment. Abstract: With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.

[82] CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models

Dong Wang

Main category: cs.CL

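TL;DR: CPG-EVAL is the first benchmark dedicated to evaluating the pedagogical grammar competence of large language models (LLMs) for foreign language teaching; it characterizes model performance in grammar recognition and resistance to interference and points to directions for improvement.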

Details Motivation: With the rapid development of LLMs such as ChatGPT, their potential in foreign language education is large, but their pedagogical grammar competence remains under-assessed, so a dedicated benchmark is needed. Method: CPG-EVAL comprises five tasks that assess grammar recognition, fine-grained distinction, categorical discrimination, and resistance to interference. Result: Smaller models do well on single-instance tasks but struggle with multiple-instance tasks and under interference; larger models resist interference better but still have room for improvement. Conclusion: The study presents the first theory-driven, multi-tiered benchmark framework, provides empirical evidence for applying LLMs in education, and indicates directions for future improvement. Abstract: Purpose: The rapid emergence of large language models (LLMs) such as ChatGPT has significantly impacted foreign language education, yet their pedagogical grammar competence remains under-assessed. This paper introduces CPG-EVAL, the first dedicated benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. Methodology: The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference. Findings: Smaller-scale models can succeed in single language instance tasks, but struggle with multiple instance tasks and interference from confusing instances. Larger-scale models show better resistance to interference but still have significant room for accuracy improvement. The evaluation indicates the need for better instructional alignment and more rigorous benchmarks, to effectively guide the deployment of LLMs in educational contexts. Value: This study offers the first specialized, theory-driven, multi-tiered benchmark framework for systematically evaluating LLMs' pedagogical grammar competence in Chinese language teaching contexts. CPG-EVAL not only provides empirical insights for educators, policymakers, and model developers to better gauge AI's current abilities in educational settings, but also lays the groundwork for future research on improving model alignment, enhancing educational suitability, and ensuring informed decision-making concerning LLM integration in foreign language instruction.

[83] Sentiment Analysis on the young people's perception about the mobile Internet costs in Senegal

Derguene Mbaye,Madoune Robert Seye,Moussa Diallo,Mamadou Lamine Ndiaye,Djiby Sow,Dimitri Samuel Adjanohoun,Tatiana Mbengue,Cheikh Samba Wade,De Roulet Pablo,Jean-Claude Baraka Munyaka,Jerome Chenal

Main category: cs.CL

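TL;DR: This paper studies how young people in Senegal feel about mobile Internet prices relative to service quality by analyzing social media comments with a sentiment analysis model.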

Details Motivation: As Internet penetration rises in Africa, young people's demand for mobile Internet is growing, but the small number of operators limits value-for-money alternatives. The paper aims to understand young people's views on price and service. Method: Collects topic-related comments from Twitter and Facebook and applies a sentiment analysis model to gauge young people's sentiment. Result: The sentiment analysis reveals young people's general feelings about mobile Internet prices and service quality. Conclusion: The study provides data for understanding the attitudes of young Senegalese toward mobile Internet and reflects supply and demand issues in the market. Abstract: Internet penetration rates in Africa are rising steadily, and mobile Internet is getting an even bigger boost with the availability of smartphones. Young people are increasingly using the Internet, especially social networks, and Senegal is no exception to this revolution. Social networks have become the main means of expression for young people. Despite this evolution in Internet access, there are few operators on the market, which limits the alternatives available in terms of value for money. In this paper, we will look at how young people feel about the price of mobile Internet in Senegal, in relation to the perceived quality of the service, through their comments on social networks. We scanned a set of Twitter and Facebook comments related to the subject and applied a sentiment analysis model to gather their general feelings.

[84] THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Xiao Pu,Michael Saxon,Wenyue Hua,William Yang Wang

Main category: cs.CL

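TL;DR: The paper proposes measures of problem difficulty and finds that reasoning models are poorly calibrated in allocating the optimal number of tokens, especially on easy problems. The authors introduce the DUMB500 dataset and the THOUGHTTERMINATOR technique to improve calibration.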

Details Motivation: Reasoning models perform well on complex tasks but suffer from overthinking, generating unnecessary tokens that do not improve accuracy. Method: Introduces approximate measures of problem difficulty, evaluates how well reasoning models are calibrated in allocating the optimal token count, and proposes the DUMB500 dataset and the THOUGHTTERMINATOR technique. Result: Reasoning models are generally poorly calibrated, especially on easy problems; THOUGHTTERMINATOR significantly improves calibration. Conclusion: Problem-difficulty measures and the new technique can effectively improve the calibration of reasoning models. Abstract: Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued with the problem of overthinking--generating large amounts of unnecessary tokens which don't improve accuracy on a question. We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship between problem difficulty and optimal token spend exists, and evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count. We find that in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and extremely difficult examples from existing frontier benchmarks on the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free black box decoding technique that significantly improves reasoning model calibration.

[85] Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Grace Byun,Shinsun Lee,Nayoung Choi,Jinho Choi

Main category: cs.CL

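TL;DR: The SecMulti-RAG framework uses multi-source retrieval and a local open-source generator to address the limited retrieval scope and data security problems of enterprise RAG systems, significantly improving performance.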

Details Motivation: Existing RAG systems in enterprise settings face limited retrieval scope and data leakage risks, which lead to inaccurate or incomplete outputs. Method: Proposes the SecMulti-RAG framework, which combines internal documents, pre-generated expert knowledge, and knowledge generated by external LLMs, and adopts a local open-source generator together with a security filtering mechanism. Result: On an automotive-industry report generation task, SecMulti-RAG significantly outperforms traditional RAG in correctness, richness, and helpfulness, with win rates of 79.3% to 91.9% in LLM-based evaluation and 56.3% to 70.4% in human evaluation. Conclusion: SecMulti-RAG is a practical and secure solution for enterprise RAG. Abstract: Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
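The control flow described in the abstract (always retrieve from internal documents and pre-generated expert knowledge, and call an external LLM only when a filter judges the prompt safe) can be sketched as follows. This is a toy illustration and not the authors' implementation: the keyword-overlap retriever, the blocked-term filter, the `external_llm` callable, and every name and document string are hypothetical stand-ins.

```python
def tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for a real retriever's features."""
    return set(text.lower().split())

def retrieve(query, docs):
    """Return the documents that share at least one token with the query."""
    q = tokens(query)
    return [d for d in docs if q & tokens(d)]

def is_safe(prompt, blocked_terms):
    """Hypothetical security filter: unsafe if the prompt contains any blocked term."""
    return not (tokens(prompt) & set(blocked_terms))

def build_context(prompt, internal_docs, expert_notes, external_llm, blocked_terms):
    """Collect context from the two local sources, adding external knowledge only if safe."""
    context = retrieve(prompt, internal_docs) + retrieve(prompt, expert_notes)
    used_external = False
    if is_safe(prompt, blocked_terms):
        context.append(external_llm(prompt))  # the prompt leaves the enterprise boundary here
        used_external = True
    return context, used_external

# Hypothetical data and a fake external model, for illustration only.
internal_docs = ["brake test report 2023", "paint shop schedule"]
expert_notes = ["regulations overview for brake systems"]
blocked_terms = {"prototype-x"}

def fake_llm(prompt):
    return "external: " + prompt

safe_ctx, safe_used = build_context(
    "summarize brake regulations", internal_docs, expert_notes, fake_llm, blocked_terms)
unsafe_ctx, unsafe_used = build_context(
    "brake results for prototype-x", internal_docs, expert_notes, fake_llm, blocked_terms)

print(safe_used, len(safe_ctx))
print(unsafe_used, len(unsafe_ctx))
```

The point of the gate is that a prompt flagged as sensitive never reaches the external model, while the two local sources still contribute context for the local generator.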

[86] D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

Grace Byun,Jinho Choi

Main category: cs.CL

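TL;DR: D-GEN is an open-source distractor generation model that converts open-ended data into a multiple-choice format and evaluates distractor quality through ranking alignment and entropy analysis.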

Details Motivation: To address inconsistency in evaluating open-ended generation and the time and labor needed to produce high-quality distractors for traditional multiple-choice evaluation. Method: Proposes the D-GEN model and two new methods for assessing distractor quality: ranking alignment and entropy analysis. Result: D-GEN closely matches ground-truth distractors in ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and entropy distribution, and human evaluation also confirms its quality. Conclusion: D-GEN provides efficient, automated distractor generation for multiple-choice evaluation and raises the standard for such evaluation. Abstract: Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.
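Ranking alignment is reported above with Spearman's rho and Kendall's tau. The sketch below shows how the two coefficients compare a model ranking obtained with ground-truth distractors against one obtained with generated distractors. It uses the textbook formulas and assumes no tied scores; the accuracy figures are invented for illustration and are not taken from the paper.

```python
from itertools import combinations

def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical accuracies of four models under each distractor set.
acc_ground_truth = [0.82, 0.74, 0.61, 0.55]
acc_generated = [0.80, 0.76, 0.57, 0.60]

print(spearman_rho(acc_ground_truth, acc_generated))
print(kendall_tau(acc_ground_truth, acc_generated))
```

In this toy case only the last two models trade places, so both coefficients stay positive but fall below 1; values near 1, as reported in the abstract, mean the generated distractors order the models almost exactly as the ground-truth distractors do.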

[87] From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs

Jiliang Ni,Jiachen Pu,Zhongyi Yang,Kun Zhou,Hui Wang,Xiaoliang Xiao,Dakui Wang,Xin Li,Jingfeng Luo,Conggang Hu

Main category: cs.CL

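TL;DR: The paper proposes a three-stage pipeline for cost-efficient LLM deployment (prototyping, knowledge transfer, and model compression) that resolves the cost-performance dilemma in LLM-based frameworks and yields a super-tiny model.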

Details Motivation: Conventional one-stage LLM deployment is costly and slow, so it needs to be optimized to reduce cost while preserving performance. Method: A three-stage pipeline: 1) build a prototype system that generates high-quality data; 2) transfer knowledge to a small student model with techniques such as knowledge distillation; 3) compress the model further through quantization and pruning. Result: The final model is compressed to 0.4B, achieving ultra-low latency and cost while remaining effective. Conclusion: The framework's modular design and cross-domain capabilities suggest that it can be applied to other NLP areas. Abstract: In recent years, Large Language Models (LLMs) have significantly advanced artificial intelligence by optimizing traditional Natural Language Processing (NLP) pipelines, improving performance and generalization. This has spurred their integration into various systems. Many NLP systems, including ours, employ a "one-stage" pipeline directly incorporating LLMs. While effective, this approach incurs substantial costs and latency due to the need for large model parameters to achieve satisfactory outcomes. This paper introduces a three-stage cost-efficient end-to-end LLM deployment pipeline (including prototyping, knowledge transfer, and model compression) to tackle the cost-performance dilemma in LLM-based frameworks. Our approach yields a super tiny model optimized for cost and performance in online systems, simplifying the system architecture. Initially, by transforming complex tasks into a function call-based LLM-driven pipeline, an optimal performance prototype system is constructed to produce high-quality data as a teacher model. The second stage combines techniques like rejection fine-tuning, reinforcement learning and knowledge distillation to transfer knowledge to a smaller 0.5B student model, delivering effective performance at minimal cost. The final stage applies quantization and pruning to aggressively compress the model to 0.4B, achieving ultra-low latency and cost. The framework's modular design and cross-domain capabilities suggest potential applicability in other NLP areas.

[88] LLM Sensitivity Evaluation Framework for Clinical Diagnosis

Chenwei Yan,Xiangling Fu,Yuxuan Xiong,Tianyi Wang,Siu Cheung Hui,Ji Wu,Xien Liu

Main category: cs.CL

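TL;DR: The paper studies how sensitive large language models (LLMs) are to key medical information in clinical diagnosis and finds that current models fall short, so their reliability and sensitivity to key information need to improve.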

Details Motivation: Clinical diagnosis places higher demands on the reliability and sensitivity of LLMs, yet existing work overlooks the importance of key information. Method: Introduces different perturbation strategies to evaluate the sensitivity of GPT-3.5, GPT-4, Gemini, Claude3, and LLaMA2-7b to key medical information. Result: Current LLMs show limited sensitivity to key medical information in diagnostic decision-making. Conclusion: LLMs need better reliability and sensitivity to key information in order to strengthen human trust and support practical use. Abstract: Large language models (LLMs) have demonstrated impressive performance across various domains. However, clinical diagnosis places higher expectations on LLMs' reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.

[89] Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning

Jianing Wang,Jin Jiang,Yang Liu,Mengdi Zhang,Xunliang Cai

Main category: cs.CL

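TL;DR: The paper proposes a "process prejudge" strategy that significantly improves LLM reasoning through a dynamic tree-search framework and a two-phase training mechanism.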

Details Motivation: Conventional LLM reasoning relies on trial and error, whereas people often anticipate likely mistakes in order to reason better. The paper aims to imitate this behavior and make LLM reasoning more efficient. Method: Define prejudge nodes and combine them with a dynamic tree-search framework in which a single LLM performs answer judging, response critiquing, prejudge generation, and thought completion. Training uses two phases, SFT and RL. Result: Experiments show that the method significantly improves LLM performance on complex reasoning tasks. Conclusion: The "process prejudge" strategy effectively imitates human reasoning behavior and significantly strengthens LLM reasoning. Abstract: In this paper, we introduce a new process prejudge strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step, with at least one step that follows the prejudge node that has no paths toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data are released at https://github.com/wjn1996/Prejudge-Before-Think.

[90] CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models

Feiyang Li,Peng Fang,Zhan Shi,Arijit Khan,Fang Wang,Dan Feng,Weihao Wang,Xin Zhang,Yongjian Cui

Main category: cs.CL

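TL;DR: CoT-RAG is a reasoning framework that combines knowledge graphs with retrieval-augmented generation (RAG) to address the low reliability and natural-language interference of chain-of-thought (CoT) reasoning in large language models (LLMs).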

Details Motivation: Chain-of-thought reasoning still struggles on complex tasks, mainly because reasoning chains generated by LLMs alone are unreliable and because natural-language reasoning chains interfere with the inference logic of LLMs. Method: CoT-RAG improves the reliability and logical rigor of reasoning through three key designs: knowledge graph-driven CoT generation, learnable knowledge case-aware RAG, and pseudo-program prompting execution. Result: On nine public datasets, CoT-RAG improves accuracy by 4.0% to 23.0% over existing methods, and it is also accurate and efficient on domain-specific datasets. Conclusion: CoT-RAG significantly improves LLM reasoning and shows strong practical applicability and scalability. Abstract: While chain-of-thought (CoT) reasoning improves the performance of large language models (LLMs) in complex tasks, it still has two main challenges: the low reliability of relying solely on LLMs to generate reasoning chains and the interference of natural language reasoning chains on the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which encourages LLMs to execute reasoning tasks in pseudo-programs with greater logical rigor. We conduct a comprehensive evaluation on nine public datasets, covering three reasoning problems. Compared with state-of-the-art methods, CoT-RAG exhibits a significant accuracy improvement, ranging from 4.0% to 23.0%. Furthermore, testing on four domain-specific datasets, CoT-RAG shows remarkable accuracy and efficient execution, highlighting its strong practical applicability and scalability.

[91] Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala, English, and Code-Mixed Content

Azmarah Rizvi,Navojith Thamindu,A. M. N. H. Adhikari,W. P. U. Senevirathna,Dharshana Kasthurirathna,Lakmini Abeywardhana

Main category: cs.CL

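TL;DR: The study develops a hybrid framework for multilingual sentiment analysis of banking customer feedback that handles low-resource languages such as Sinhala better and makes the results more explainable.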

Details Motivation: Banking customer feedback spans several languages (English, Sinhala, Singlish, and code-mixed text). Existing models perform poorly on low-resource languages and lack interpretability, so a more effective solution is needed. Method: The study combines XLM-RoBERTa (for Sinhala and code-mixed text) with BERT-base-uncased (for English), and integrates domain-specific lexicon correction and explainability tools (SHAP and LIME). Result: The framework reaches 92.3% accuracy and an F1-score of 0.89 in English and 88.4% accuracy on Sinhala and code-mixed text, and the explainability analysis reveals key sentiment drivers. Conclusion: The study offers a more robust and transparent solution for multilingual sentiment analysis in finance, bridging gaps in low-resource NLP and explainability. Abstract: Sentiment analysis is crucial for brand reputation management in the banking sector, where customer feedback spans English, Sinhala, Singlish, and code-mixed text. Existing models struggle with low-resource languages like Sinhala and lack interpretability for practical use. This research develops a hybrid aspect-based sentiment analysis framework that enhances multilingual capabilities with explainable outputs. Using cleaned banking customer reviews, we fine-tune XLM-RoBERTa for Sinhala and code-mixed text, integrate domain-specific lexicon correction, and employ BERT-base-uncased for English. The system classifies sentiment (positive, neutral, negative) with confidence scores, while SHAP and LIME improve interpretability by providing real-time sentiment explanations. Experimental results show that our approaches outperform traditional transformer-based classifiers, achieving 92.3 percent accuracy and an F1-score of 0.89 in English and 88.4 percent in Sinhala and code-mixed content. An explainability analysis reveals key sentiment drivers, improving trust and transparency. A user-friendly interface delivers aspect-wise sentiment insights, ensuring accessibility for businesses. This research contributes to robust, transparent sentiment analysis for financial applications by bridging gaps in multilingual, low-resource NLP and explainability.

[92] DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification

Yu Li,Han Jiang,Zhihua Wei

Main category: cs.CL

TL;DR: DETAM is a finetuning-free defense that strengthens LLMs against jailbreak attacks through targeted modification of the attention mechanism.

Details Motivation: With the widespread use of LLMs, jailbreak attacks have become a safety problem. Existing defenses generalize poorly and reduce utility. Method: Analyze differences in attention scores to identify the attention heads sensitive to jailbreak attacks, then reallocate attention at inference time to emphasize the user's core intention. Result: DETAM outperforms baselines in jailbreak defense, generalizes well, and performs strongly in utility evaluations. Conclusion: DETAM is an efficient defense that generalizes across different attacks and models. Abstract: With the widespread adoption of Large Language Models (LLMs), jailbreak attacks have become an increasingly pressing safety concern. While safety-aligned LLMs can effectively defend against normal harmful queries, they remain vulnerable to such attacks. Existing defense methods primarily rely on fine-tuning or input modification, which often suffer from limited generalization and reduced utility. To address this, we introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities of LLMs against jailbreak attacks via targeted attention modification. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens. Our experimental results demonstrate that DETAM outperforms various baselines in jailbreak defense and exhibits robust generalization across different attacks and models, maintaining its effectiveness even on in-the-wild jailbreak data. Furthermore, in evaluating the model's utility, we incorporated over-defense datasets, which further validate the superior performance of our approach. The code will be released immediately upon acceptance.

[93] Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

Zihao Feng,Xiaoxue Wang,Ziwei Bai,Donghang Su,Bowen Wu,Qun Yu,Baoxun Wang

Main category: cs.CL

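TL;DR: The paper combines reinforcement learning (RL) with Reward-based Curriculum Sampling (RCS) to improve the generalization of intent detection in task-oriented dialogue systems, especially on unseen intents.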

Details Motivation: Existing approaches (such as zero-shot reformulations and LLM-based dynamic recognition) degrade on unseen intents, which causes erroneous task routing, so the model's generalization needs to improve. Method: Train with Group Relative Policy Optimization (GRPO), combining reinforcement learning (RL) with Reward-based Curriculum Sampling (RCS), and introduce Chain-of-Thought (COT) processes into RL. Result: RL-trained models generalize substantially better than supervised fine-tuning (SFT) baselines, and RCS and COT further improve the effectiveness of RL for intent detection. Conclusion: The method clearly improves generalization in intent detection and offers practical insights for deploying adaptable dialogue systems. Abstract: Intent detection, a critical component in task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, struggle with performance degradation when encountering unseen intents, leading to erroneous task routing. To enhance the model's generalization performance on unseen tasks, we employ Reinforcement Learning (RL) combined with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training in intent detection tasks. Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. In addition, introducing RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (COT) processes in RL notably improves generalization in complex intent detection tasks, underscoring the importance of thought in challenging scenarios. This work advances the generalization of intent detection tasks, offering practical insights for deploying adaptable dialogue systems.
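
The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes the commonly described formulation in which several responses are sampled for the same prompt and each reward is normalized against its own group; the function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    `rewards` holds one scalar reward per sampled response in a single group.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical group of four sampled intent predictions: 1.0 = correct intent, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed. A group whose rewards are all equal yields zero advantages and therefore no learning signal, which is one reason a sampler might favor harder cases.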

[94] Continual Pre-Training is (not) What You Need in Domain Adaption

Pin-Er Chen,Da-Chen Lian,Shu-Kai Hsieh,Sieh-Chuen Huang,Hsuan-Lei Shao,Jun-Wei Chiu,Yang-Hsien Lin,Zih-Ching Chen,Cheng-Kuang,Eddie TC Huang,Simon See

Main category: cs.CL

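TL;DR: The paper examines how well Domain-Adaptive Continual Pre-Training (DACP) improves the legal reasoning of legal large language models (LLMs) and finds that, although it strengthens domain knowledge, it does not help on every legal task.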

Details Motivation: Legal LLMs have made notable progress in automating tasks and improving research precision, but adapting them to the legal domain remains difficult because of the complexity of legal reasoning and the need for precise interpretation of specialized language. Method: Evaluate DACP through experiments on legal reasoning tasks within the Taiwanese legal framework. Result: DACP strengthens domain knowledge, but its gains are not uniform across legal tasks, and it affects model generalization and performance on prompt-based tasks. Conclusion: Future research should optimize domain adaptation strategies for legal AI to balance the benefits and costs of DACP. Abstract: The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.

[95] Long-context Non-factoid Question Answering in Indic Languages

Ritwik Mishra,Rajiv Ratn Shah,Ponnurangam Kumaraguru

Main category: cs.CL

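TL;DR: The paper studies context-shortening techniques (such as OIE, coreference resolution, and APS) for QA in low-resource Indic languages. Experiments show that these techniques markedly improve semantic and token-level scores and reduce computational overhead.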

Details Motivation: Address the difficulty that long contexts pose for LLMs in QA tasks, particularly in low-resource Indic languages. Method: Experiment with context-shortening techniques (OIE, coreference resolution, APS, and their combinations) and validate them on four Indic languages. Result: Context shortening improves semantic scores by 4% and token-level scores by 47% on average; with fine-tuning there is a further average gain of 2%. It also reduces computational overhead. Conclusion: Context-shortening techniques improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages, although limitations remain for non-factoid questions. Abstract: Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages.
The source code and resources are available at https://github.com/ritwikmishra/IndicGenQA.

[96] Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models

Yule Liu,Jingyi Zheng,Zhen Sun,Zifan Peng,Wenhan Dong,Zeyang Sha,Shiwen Cui,Weiqiang Wang,Xinlei He

Main category: cs.CL

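TL;DR: ThoughtMani places a chain of thought (CoT) generated by a smaller model between the thinking tokens of a large reasoning model (LRM), which cuts redundant reasoning steps and computational cost while preserving performance and improving safety.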

Details Motivation: Large reasoning models (LRMs) suffer from "overthinking": they generate redundant reasoning steps for limited performance gains. Existing methods rely on fine-tuning, which requires extra data, complicates training, risks safety misalignment, and generalizes poorly. Method: ThoughtMani inserts CoTs generated by a smaller model between the thinking tokens to steer LRMs past unnecessary intermediate steps. Result: On QwQ-32B, ThoughtMani keeps performance while reducing output tokens by about 30%, and it improves safety alignment by 10% on average. Conclusion: ThoughtMani offers a simple and effective way to build more efficient and accessible LRMs for real-world applications. Abstract: Recent advancements in large reasoning models (LRMs) have demonstrated the effectiveness of scaling test-time computation to enhance reasoning capabilities in multiple tasks. However, LRMs typically suffer from "overthinking" problems, where models generate significantly redundant reasoning steps while bringing limited performance gains. Existing work relies on fine-tuning to mitigate overthinking, which requires additional data and unconventional training setups, risks safety misalignment, and generalizes poorly. Through empirical analysis, we reveal an important characteristic of LRM behaviors: placing external CoTs generated by smaller models between the thinking tokens (<think> and </think>) can effectively manipulate the model to generate fewer thoughts. Building on these insights, we propose a simple yet efficient pipeline, ThoughtMani, to enable LRMs to bypass unnecessary intermediate steps and reduce computational costs significantly. We conduct extensive experiments to validate the utility and efficiency of ThoughtMani. For instance, when applied to QwQ-32B on the LiveBench/Code dataset, ThoughtMani keeps the original performance and reduces output token counts by approximately 30%, with little overhead from the CoT generator. Furthermore, we find that ThoughtMani enhances safety alignment by an average of 10%. Since model vendors typically serve models of different sizes simultaneously, ThoughtMani provides an effective way to construct more efficient and accessible LRMs for real-world applications.

[97] Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing

Cong William Lin,Wu Zhu

Main category: cs.CL

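TL;DR: The study examines how AI-assisted generative revision affects academic papers. It finds that ChatGPT use differs markedly across disciplines, gender, native-language status, and career stage, and that it drives convergence in academic writing style.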

Details Motivation: Understand how AI tools such as ChatGPT are changing academic writing, especially their effect on writing style and content. Method: Fine-tune prompt- and discipline-specific large language models to classify more than 627,000 arXiv papers and detect the style of ChatGPT-revised text. Result: ChatGPT use improves clarity, conciseness, and formality, with effects that vary by revision type; early adopters, male researchers, non-native speakers, and junior scholars show the most pronounced stylistic shifts. Conclusion: ChatGPT drives convergence in academic writing, but adoption and impact differ across groups. Abstract: Large Language Models (LLMs), such as ChatGPT, are reshaping content creation and academic writing. This study investigates the impact of AI-assisted generative revisions on research manuscripts, focusing on heterogeneous adoption patterns and their influence on writing convergence. Leveraging a dataset of over 627,000 academic papers from arXiv, we develop a novel classification framework by fine-tuning prompt- and discipline-specific large language models to detect the style of ChatGPT-revised texts. Our findings reveal substantial disparities in LLM adoption across academic disciplines, gender, native language status, and career stage, alongside a rapid evolution in scholarly writing styles. Moreover, LLM usage enhances clarity, conciseness, and adherence to formal writing conventions, with improvements varying by revision type. Finally, a difference-in-differences analysis shows that while LLMs drive convergence in academic writing, early adopters, male researchers, non-native speakers, and junior scholars exhibit the most pronounced stylistic shifts, aligning their writing more closely with that of established researchers.
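
For readers unfamiliar with the difference-in-differences analysis mentioned in the abstract, the basic two-group, two-period estimator can be sketched as follows. The function, the grouping, and the numbers are purely illustrative assumptions; the paper's actual specification (covariates, fixed effects, outcome measure) is not given in this digest.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 DiD estimate: change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical style-similarity scores before and after LLM tools became available.
effect = difference_in_differences(
    treated_pre=[0.2, 0.4],    # made-up scores for a group that adopted LLM revision
    treated_post=[0.6, 0.8],
    control_pre=[0.3, 0.3],    # made-up scores for a comparison group
    control_post=[0.4, 0.4],
)
print(round(effect, 3))
```

Subtracting the control group's change removes trends shared by both groups, so the remainder is attributed to adoption only under the usual parallel-trends assumption.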

[98] ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

Shaomu Tan,Christof Monz

Main category: cs.CL

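TL;DR: ReMedy is a machine translation evaluation framework that recasts translation evaluation as a reward modeling task and learns relative translation quality from pairwise preference data, which makes evaluation markedly more reliable.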

Details Motivation: The main challenge in machine translation evaluation is the noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while large language models do well at system-level evaluation but poorly at segment level. Method: ReMedy learns relative translation quality through a reward modeling task, using pairwise preference data instead of regressing directly on imperfect human ratings. Result: On the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy reaches state-of-the-art performance at both segment and system level and surpasses several larger models. Conclusion: Beyond strong evaluation performance, ReMedy is better at detecting translation errors and evaluating low-quality translations. Abstract: A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
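
Learning from pairwise preferences rather than regressing on ratings is often implemented with a Bradley-Terry style loss. The sketch below shows that generic loss as an assumption about what "reward modeling" means here; it is not ReMedy's actual objective, and the scalar scores stand in for the outputs of a hypothetical learned scoring model.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry loss: -log(sigmoid(score_preferred - score_rejected)).

    Written in the numerically stable softplus form log(1 + exp(-margin)).
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for two translations of the same source segment.
print(pairwise_preference_loss(2.0, 0.0))  # preferred translation scored higher: small loss
print(pairwise_preference_loss(0.0, 2.0))  # preferred translation scored lower: large loss
```

The loss depends only on the score difference, so the model is trained to rank translations correctly and never has to reproduce the absolute value of a noisy human rating.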

[99] Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning

Tao He,Lizi Liao,Ming Liu,Bing Qin

Main category: cs.CL

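TL;DR: The paper proposes a User-Tailored Dialogue Policy Planning (UDP) framework that models user traits and feedback to optimize dialogue strategies, addressing the neglect of user characteristics in existing methods.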

Details Motivation: Existing dialogue policy planning methods mostly optimize the system agent's policy and overlook the key role of user characteristics in real-world scenarios such as conversational search and recommendation. Method: UDP works in three stages (dynamic inference of user personas, anticipation of user feedback, and user-tailored policy planning) and uses active learning to improve training. Result: Experiments show that UDP learns user-specific dialogue strategies effectively in both collaborative and non-collaborative settings and is robust and adaptable. Conclusion: UDP is an effective approach for advancing user-centric dialogue systems and has practical potential. Abstract: Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real-world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task-specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user-tailored dialogue policy planning. Building on this foundation, we present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non-collaborative settings, demonstrate the effectiveness of UDP in learning user-specific dialogue strategies.
Results validate the protocol's utility and highlight UDP's robustness, adaptability, and potential to advance user-centric dialogue systems.

[100] Word Embedding Techniques for Classification of Star Ratings

Hesham Abdelmotaleb,Craig McNeile,Malgorzata Wojtys

Main category: cs.CL

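TL;DR: The study examines how different word embedding models (such as BERT, Word2Vec, and Doc2Vec) affect text classification of telecom customer reviews, combined with several classification algorithms and PCA-based dimensionality reduction, and finds that BERT combined with PCA performs best.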

Details Motivation: Telecom services are essential to modern society, and analyzing customer feedback can improve them. NLP tools can process the text, but how different word embedding models affect classification still needs closer study. Method: On a dataset of telecom customer reviews, compare several word embedding models (BERT, Word2Vec, Doc2Vec) and classification algorithms, combined with PCA-based dimensionality reduction, and investigate energy consumption. Result: BERT combined with PCA performs best, especially on the more challenging classification tasks; the proposed method of combining word vectors using the first principal component outperforms the traditional averaging approach. Conclusion: The choice of word embedding model significantly affects text classification performance, and BERT combined with PCA is an efficient and energy-saving option. Abstract: Telecom services are at the core of today's societies' everyday needs. The availability of numerous online forums and discussion platforms enables telecom providers to improve their services by exploring the views of their customers to learn about common issues that the customers face. Natural Language Processing (NLP) tools can be used to process the free text collected. One way of working with such data is to represent text as numerical vectors using one of many word embedding models based on neural networks. This research uses a novel dataset of telecom customers' reviews to perform an extensive study showing how different word embedding algorithms can affect the text classification process. Several state-of-the-art word embedding techniques are considered, including BERT, Word2Vec and Doc2Vec, coupled with several classification algorithms. The important issue of feature engineering and dimensionality reduction is addressed and several PCA-based approaches are explored. Moreover, the energy consumption of the different word embeddings is investigated. The findings show that some word embedding models can lead to consistently better text classifiers in terms of precision, recall and F1-Score. In particular, for the more challenging classification tasks, BERT combined with PCA stood out with the highest performance metrics. Moreover, our proposed PCA approach of combining word vectors using the first principal component shows clear advantages in performance over the traditional approach of taking the average.
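
The contrast between averaging word vectors and combining them through the first principal component can be sketched as below. This is one plausible reading of the abstract, not the authors' procedure: it assumes a review is represented by the leading principal direction of its word vectors, found by power iteration, and the sign convention and helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Traditional baseline: average the word vectors of one review."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def first_principal_component(vectors, iterations=200):
    """Unit vector along the direction of largest variance among the word vectors."""
    mu = mean_vector(vectors)
    centered = [[x - m for x, m in zip(v, mu)] for v in vectors]
    dim = len(mu)
    # Covariance matrix of the centered word vectors (dim x dim).
    cov = [[sum(row[i] * row[j] for row in centered) / len(centered)
            for j in range(dim)] for i in range(dim)]
    # Asymmetric start vector so it is unlikely to be orthogonal to the answer.
    direction = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return [0.0] * dim  # all word vectors identical: no variance to capture
        direction = [x / norm for x in product]
    # Assumed convention: orient the component toward the mean vector.
    if sum(d * m for d, m in zip(direction, mu)) < 0:
        direction = [-d for d in direction]
    return direction


# Toy two-dimensional "word vectors" for one review (made-up numbers).
words = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(mean_vector(words))
print(first_principal_component(words))
```

The mean summarizes where the word vectors sit, while the first principal component summarizes the direction along which they vary most, so the two representations can carry different information to the classifier.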

[101] Multi-Type Context-Aware Conversational Recommender Systems via Mixture-of-Experts

Jie Zou,Cheng Lin,Weikang Guo,Zheng Wang,Jiwei Wei,Yang Yang,Hengtao Shen

Main category: cs.CL

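TL;DR: MCCRS is a multi-type context-aware conversational recommender system that fuses several kinds of contextual information through a mixture-of-experts model to improve recommendations.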

Details Motivation: Conversational recommender systems usually lack sufficient contextual information, and existing methods struggle to combine several types of context effectively. Method: MCCRS combines a structured knowledge graph, unstructured conversation history, and item reviews, with multiple expert models coordinated by a ChairBot to produce the final recommendations. Result: Experiments show that MCCRS significantly outperforms existing baselines. Conclusion: By fusing multi-type contextual information and coordinating experts, MCCRS overcomes the limits of a single context type and improves recommendation performance. Abstract: Conversational recommender systems enable natural language conversations and thus lead to a more engaging and effective recommendation scenario. As the conversations for recommender systems usually contain limited contextual information, many existing conversational recommender systems incorporate external sources to enrich the contextual information. However, how to combine different types of contextual information is still a challenge. In this paper, we propose a multi-type context-aware conversational recommender system, called MCCRS, effectively fusing multi-type contextual information via mixture-of-experts to improve conversational recommender systems. MCCRS incorporates both structured information and unstructured information, including the structured knowledge graph, unstructured conversation history, and unstructured item reviews. It consists of several experts, with each expert specialized in a particular domain (i.e., one specific type of contextual information). Multiple experts are then coordinated by a ChairBot to generate the final results. Our proposed MCCRS model takes advantage of different contextual information, and the specialization of different experts, followed by coordination through a ChairBot, breaks the model bottleneck of relying on a single type of contextual information. Experimental results demonstrate that our proposed MCCRS method achieves significantly higher performance compared to existing baselines.

[102] Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results

Andrea Santilli,Adam Golinski,Michael Kirchhof,Federico Danieli,Arno Blaas,Miao Xiong,Luca Zappella,Sinead Williamson

Main category: cs.CL

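TL;DR: The paper shows that correctness functions commonly used to evaluate uncertainty quantification (UQ) in language models are length-biased, which inflates the measured performance of some UQ methods. After analyzing several correctness functions, it finds that LLM-as-a-judge approaches are among the least biased and may offer a solution.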

Details Motivation: Uncertainty quantification (UQ) is crucial for making language models safer and more reliable, but the correctness functions used in current evaluations are biased and distort the results. Method: Evaluate 7 correctness functions (lexical-based, embedding-based, and LLM-as-a-judge) across 4 datasets, 4 models, and 6 UQ methods. Result: Length biases in the correctness functions interact with length biases in the UQ methods and distort the evaluation; LLM-as-a-judge approaches are among the least biased. Conclusion: LLM-as-a-judge approaches may be an effective way to reduce length bias in UQ evaluation. Abstract: Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for improving their safety and reliability. Evaluations often use performance metrics like AUROC to assess how well UQ methods (e.g., negative sequence probabilities) correlate with task correctness functions (e.g., ROUGE-L). In this paper, we show that commonly used correctness functions bias UQ evaluations by inflating the performance of certain UQ methods. We evaluate 7 correctness functions (from lexical-based and embedding-based metrics to LLM-as-a-judge approaches) across 4 datasets x 4 models x 6 UQ methods. Our analysis reveals that length biases in the errors of these correctness functions distort UQ assessments by interacting with length biases in UQ methods. We identify LLM-as-a-judge approaches as among the least length-biased choices and hence a potential solution to mitigate these biases.
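
The evaluation protocol described in the abstract, scoring a UQ method by the AUROC between its confidence values and binary correctness labels, can be sketched in a few lines. The pairwise-counting implementation and the toy numbers are illustrative assumptions; in practice the labels would come from a correctness function (for example a thresholded ROUGE-L score or an LLM judge), which is exactly where the paper locates the bias.

```python
def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one.

    `confidences`: higher means the model is more certain (e.g. negated uncertainty).
    `correct`: 1 if the correctness function accepted the answer, else 0.
    Ties between a correct and an incorrect answer count as half a win.
    """
    positives = [c for c, y in zip(confidences, correct) if y == 1]
    negatives = [c for c, y in zip(confidences, correct) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical confidences for four answers, labeled by two different correctness functions.
confidences = [0.9, 0.8, 0.3, 0.1]
print(auroc(confidences, [1, 1, 0, 0]))  # labels agree with confidence
print(auroc(confidences, [1, 0, 0, 1]))  # different labels, same confidences
```

The confidences are identical in both calls and only the labels change, so the same UQ method can look strong or weak depending on which correctness function produced the labels.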

[103] Deep literature reviews: an application of fine-tuned language models to migration research

Stefano M. Iacus,Haodong Qi,Jiyoung Han

Main category: cs.CL

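TL;DR: The paper presents a hybrid framework that combines traditional bibliometric methods with large language models (LLMs) for efficient, consistent, and in-depth literature reviews.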

Details Motivation: Traditional literature review methods are inefficient and inconsistent when applied to large volumes of research, so a new approach that combines automation with human validation is needed. Method: Fine-tune open-source LLMs and pair them with an error-focused validation process (LLMs generate initial labels, human reviewers correct the errors), applied to more than 20,000 scientific articles on human migration. Result: A domain-adapted LLM can accurately select relevant studies, detect emerging trends, and identify research gaps; the review also reveals an imbalance in research on climate-induced migration. Conclusion: The framework shows the potential of fine-tuned LLMs for literature reviews across disciplines, accelerating knowledge synthesis and scientific discovery. Abstract: This paper presents a hybrid framework for literature reviews that augments traditional bibliometric methods with large language models (LLMs). By fine-tuning open-source LLMs, our approach enables scalable extraction of qualitative insights from large volumes of research content, enhancing both the breadth and depth of knowledge synthesis. To improve annotation efficiency and consistency, we introduce an error-focused validation process in which LLMs generate initial labels and human reviewers correct misclassifications. Applying this framework to over 20,000 scientific articles about human migration, we demonstrate that a domain-adapted LLM can serve as a "specialist" model capable of accurately selecting relevant studies, detecting emerging trends, and identifying critical research gaps. Notably, the LLM-assisted review reveals a growing scholarly interest in climate-induced migration. However, existing literature disproportionately centers on a narrow set of environmental hazards (e.g., floods, droughts, sea-level rise, and land degradation), while overlooking others that more directly affect human health and well-being, such as air and water pollution or infectious diseases. This imbalance highlights the need for more comprehensive research that goes beyond physical environmental changes to examine their ecological and societal consequences, particularly in shaping migration as an adaptive response. Overall, our proposed framework demonstrates the potential of fine-tuned LLMs to conduct more efficient, consistent, and insightful literature reviews across disciplines, ultimately accelerating knowledge synthesis and scientific discovery.

[104] Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence

Paul K. Mandal,Cole Leo,Connor Hurley

Main category: cs.CL

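TL;DR: CONTACT is a framework for territorial control prediction based on large language models (LLMs) and minimal supervision; it outperforms the baseline in low-resource settings.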

Details Motivation: Use the unstructured text of open-source intelligence (OSINT) while reducing the annotation burden and supporting the extraction of structured information from open data streams. Method: Two approaches are evaluated: SetFit (an embedding-based few-shot classifier) and prompt tuning applied to BLOOMZ-560m, both trained on a small labeled dataset. Result: The BLOOMZ-based model outperforms the SetFit baseline, and prompt-based supervision improves generalization in low-resource settings. Conclusion: CONTACT shows that LLMs tuned with few-shot methods can reduce annotation burdens and support structured inference from OSINT. Abstract: Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at https://github.com/PaulKMandal/CONTACT/.

[105] BadApex: Backdoor Attack Based on Adaptive Optimization Mechanism of Black-box Large Language Models

Zhengxian Wu,Juan Wen,Wanli Peng,Ziwei Zhang,Yinghan Zhou,Yiming Xue

Main category: cs.CL

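TL;DR: The paper proposes BadApex, a backdoor attack built on an adaptive optimization mechanism for black-box large language models. It iteratively refines a prompt to generate high-quality, semantically consistent poisoned text, which markedly improves attack effectiveness and stealthiness.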

Details Motivation: Existing backdoor attacks neglect text quality and semantic consistency, and they depend on hand-crafted prompts based on expert experience, which adapt poorly. Method: An adaptive optimization mechanism refines the prompt iteratively using a generation agent and a modification agent, and the refined prompt is then used to generate poisoned text. Result: Experiments on three datasets show that BadApex significantly outperforms existing attacks, with an average attack success rate of up to 96.75% even when two defenses are applied. Conclusion: BadApex improves prompt adaptability, semantic consistency, and text quality, and it resists defenses better. Abstract: Previous insertion-based and paraphrase-based backdoors have achieved great success in attack efficacy, but they ignore the text quality and semantic consistency between poisoned and clean texts. Although recent studies introduce LLMs to generate poisoned texts and improve the stealthiness, semantic consistency, and text quality, their hand-crafted prompts rely on expert experience, facing significant challenges in prompt adaptability and attack performance after defenses. In this paper, we propose a novel backdoor attack based on an adaptive optimization mechanism of black-box large language models (BadApex), which leverages a black-box LLM to generate poisoned text through a refined prompt. Specifically, an Adaptive Optimization Mechanism is designed to refine an initial prompt iteratively using the generation and modification agents. The generation agent generates the poisoned text based on the initial prompt. Then the modification agent evaluates the quality of the poisoned text and refines a new prompt. After several iterations of the above process, the refined prompt is used to generate poisoned texts through LLMs. We conduct extensive experiments on three datasets with six backdoor attacks and two defenses. Extensive experimental results demonstrate that BadApex significantly outperforms state-of-the-art attacks. It improves prompt adaptability, semantic consistency, and text quality. Furthermore, when two defense methods are applied, the average attack success rate (ASR) is still up to 96.75%.

[106] Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations

Chenghao Xiao,Hou Pong Chan,Hao Zhang,Mahani Aljunied,Lidong Bing,Noura Al Moubayed,Yu Rong

Main category: cs.CL

TL;DR: This study is the first to analyze how LLMs recognize knowledge boundaries across languages. It finds that this perception is encoded in the middle to middle-upper layers and proposes a training-free alignment method that reduces hallucination risk in low-resource languages.

Details Motivation: Study how LLMs recognize knowledge boundaries in multiple languages, filling the research gap beyond English and reducing hallucination. Method: Probe LLMs' internal representations while they process known and unknown questions, propose a training-free alignment method, and build a multilingual evaluation suite. Result: Knowledge boundary perception is encoded in the middle to middle-upper layers, language differences follow a linear structure, and bilingual fine-tuning strengthens cross-lingual recognition. Conclusion: The study provides new methods and datasets for cross-lingual knowledge boundary analysis and helps reduce hallucination risk in low-resource languages. Abstract: While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs' perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages; 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs' recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.

[107] Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Junjie Yang,Junhao Song,Xudong Han,Ziqian Bi,Tianyang Wang,Chia Xin Liang,Xinyuan Song,Yichao Zhang,Qian Niu,Benji Peng,Keyu Chen,Ming Liu

Main category: cs.CL

TL;DR: Knowledge distillation (KD) transfers knowledge from complex teacher models to simpler student models, significantly improving model efficiency and accuracy, and is widely applied in areas such as image classification and object detection.

Details Motivation: Review recent progress in knowledge distillation and summarize its role in improving model performance and efficiency, as a reference for researchers and practitioners. Method: Survey innovative methods such as attention-based approaches, block-wise logit distillation, and decoupling distillation that optimize knowledge transfer. Result: Knowledge distillation performs well at compressing large language models, reducing computational overhead, and improving inference speed. Conclusion: Knowledge distillation is valuable in artificial intelligence and machine learning, and future work should focus on further optimization and broader applications. Abstract: Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.
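
As a point of reference for the teacher-to-student transfer described above, here is a minimal sketch of the classic temperature-softened logit distillation loss. It is not taken from the survey: the function names, the temperature value, and the choice of a plain KL term (without any of the attention-based or block-wise variants the survey covers) are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2.

    Hypothetical helper for illustration; real KD setups usually add a
    cross-entropy term on the ground-truth labels as well.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge; a higher temperature exposes more of the teacher's ranking over non-target classes.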

[108] Generative AI Act II: Test Time Scaling Drives Cognition Engineering

Shijie Xia,Yiwei Qin,Xuefeng Li,Yan Ma,Run-Ze Fan,Steffi Chern,Haoyang Zou,Fan Zhou,Xiangkun Hu,Jiahe Jin,Yanheng He,Yixin Ye,Yixiu Liu,Pengfei Liu

Main category: cs.CL

TL;DR: The paper summarizes the limitations of the first phase of generative AI (2020-2023) and introduces the new paradigm of the second phase (2024-present), in which language-based thoughts establish a deeper connection with AI.

Details Motivation: Examine the shift of generative AI from knowledge-retrieval systems to thought-construction engines, and advance the development of cognition engineering. Method: Systematically break down advanced approaches through tutorials and optimized implementations, making cognition engineering techniques widely accessible. Result: A collection of papers on test-time scaling is provided and related resources are open-sourced. Conclusion: Cognition engineering is at a critical point in its development, and its wider adoption will carry AI into its second act. Abstract: The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations in knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering

[109] Science Hierarchography: Hierarchical Organization of Science Literature

Muhan Gao,Jash Shah,Weiqi Wang,Daniel Khashabi

Main category: cs.CL

TL;DR: The paper proposes SCIENCE HIERARCHOGRAPHY, an approach for organizing scientific literature into a high-quality hierarchical structure that reveals how thoroughly different fields have been explored.

Details Motivation: Scientific knowledge is growing rapidly, and existing tools such as citation networks and search engines lack the flexible abstraction needed to represent the density of activity across scientific subfields. Method: Combine fast embedding-based clustering with LLM-based prompting to balance computational efficiency and semantic precision, building a hierarchy that captures multiple dimensions of categorization. Result: The approach offers a better trade-off between quality and speed than methods that rely heavily on LLM prompting, and improves the efficiency and interpretability of literature exploration. Conclusion: SCIENCE HIERARCHOGRAPHY provides a structured pathway for exploring scientific literature and supports trend discovery and interdisciplinary research. Abstract: Scientific knowledge is growing rapidly, making it challenging to track progress and high-level conceptual links across broad disciplines. While existing tools like citation networks and search engines make it easy to access a few related papers, they fundamentally lack the flexible abstraction needed to represent the density of activity in various scientific subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that allows for the categorization of scientific work across varying levels of abstraction, from very broad fields to very specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve the goals of SCIENCE HIERARCHOGRAPHY, we develop a range of algorithms. Our primary approach combines fast embedding-based clustering with LLM-based prompting to balance the computational efficiency of embedding methods with the semantic precision offered by LLM prompting. We demonstrate that this approach offers the best trade-off between quality and speed compared to methods that heavily rely on LLM prompting, such as iterative tree construction with LLMs. To better reflect the interdisciplinary and multifaceted nature of research papers, our hierarchy captures multiple dimensions of categorization beyond simple topic labels. We evaluate the utility of our framework by assessing how effectively an LLM-based agent can locate target papers using the hierarchy. Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods.
Code, data and demo: https://github.com/JHU-CLSP/science-hierarchography

[110] MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

Yicheng Chen,Yining Li,Kai Hu,Zerun Ma,Haochen Ye,Kai Chen

Main category: cs.CL

TL;DR: The paper proposes a unified method for quantifying the information content of a dataset: it models the semantic space by constructing a label graph and quantifies diversity from the distribution of information within it. It further introduces MIG, an efficient sampling method that maximizes information gain in semantic space. Experiments show that MIG outperforms existing methods.

Details Motivation: Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity, but they lack a comprehensive view of the whole dataset, which leads to suboptimal results. In addition, heuristics based on distance or clustering in embedding space fail to capture the semantic intent of complex instructions. Method: A unified method models the semantic space by constructing a label graph and quantifies the distribution of information. On this basis, the MIG sampling method iteratively selects data samples to maximize information gain in semantic space. Result: MIG outperforms existing methods across multiple datasets and base models. A model trained on 5% of the Tulu3 data sampled by MIG performs comparably to the official SFT model trained on the full dataset, with gains of 5.73% on AlpacaEval and 6.89% on Wildbench. Conclusion: MIG quantifies data quality and diversity effectively and significantly improves model performance. Abstract: Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.

cs.SD [Back]

[111] Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope

Leena G Pillai,D. Muhammad Noorul Mubarak

Main category: cs.SD

TL;DR: This paper reviews data-driven approaches to Acoustic-to-Articulatory Inversion (AAI) over the past ten years (2011-2021), covering the types of AAI, objectives, corpora, methods, and evaluation metrics.

Details Motivation: Explore practical applications of AAI in speech recognition and language training, in particular feedback systems on articulatory positions that improve speech therapy and language learning. Method: Machine-learning-based non-linear regression methods, using data recorded with various medical imaging techniques (such as EMA, EPG, and rtMRI). Result: AAI models are evaluated with metrics such as the Correlation Coefficient (CC) and Root Mean Square Error (RMSE), and can provide intuitive feedback on articulatory positions, especially tongue movement. Conclusion: AAI models have practical potential in speech therapy and language training, especially for improving pronunciation accuracy through image feedback systems. Abstract: This review is focused on the data-driven approaches applied in different applications of Acoustic-to-Articulatory Inversion (AAI) of speech. This review paper considered the relevant works published in the last ten years (2011-2021). The selection criteria include (a) type of AAI - Speaker Dependent and Speaker Independent AAI, (b) objectives of the work - Articulatory approximation, Articulatory Feature space selection and Automatic Speech Recognition (ASR), explore the correlation between acoustic and articulatory features, and framework for Computer-assisted language training, (c) Corpus - Simultaneously recorded speech (wav) and medical imaging models such as ElectroMagnetic Articulography (EMA), Electropalatography (EPG), Laryngography, Electroglottography (EGG), X-ray Cineradiography, Ultrasound, and real-time Magnetic Resonance Imaging (rtMRI), (d) Methods or models - recent works are considered, and therefore all the works are based on machine learning, (e) Evaluation - as AAI is a non-linear regression problem, the performance evaluation is mostly done by Correlation Coefficient (CC), Root Mean Square Error (RMSE), and also considered Mean Square Error (MSE), and Mean Format Error (MFE). The practical application of the AAI model can provide a better and user-friendly interpretable image feedback system of articulatory positions, especially tongue movement. Such a trajectory feedback system can be used to provide phonetic, language, and speech therapy for pathological subjects.
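
The two evaluation metrics the review names most often, CC and RMSE, can be sketched as follows for a single articulator trajectory. This is a minimal illustration, not code from the review: the function names and the example trajectories are hypothetical, and CC is taken here to be the Pearson correlation coefficient.

```python
import math


def rmse(predicted, measured):
    """Root Mean Square Error between two equal-length trajectories."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def correlation_coefficient(predicted, measured):
    """Pearson correlation between predicted and measured trajectories."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length trajectories with >= 2 samples")
    mean_p = sum(predicted) / len(predicted)
    mean_m = sum(measured) / len(measured)
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_p * var_m)


# Hypothetical tongue-tip heights: the prediction has the right shape but a
# constant offset, so CC is perfect while RMSE is not zero.
measured_mm = [0.0, 1.0, 2.0, 3.0]
predicted_mm = [0.5, 1.5, 2.5, 3.5]
```

The example shows why the two metrics are reported together: CC captures whether the predicted movement follows the measured shape, while RMSE captures the absolute positional error.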

cs.SE [Back]

[112] CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

Xinchen Wang,Pengfei Gao,Chao Peng,Ruida Hu,Cuiyun Gao

Main category: cs.SE

TL;DR: The paper proposes CodeVisionary, a framework that improves LLM-based evaluation of code generation through two stages: multisource knowledge analysis and negotiation-based scoring.

Details Motivation: Existing evaluation approaches (human-centered, metric-based, or LLM-based) each have shortcomings. LLM-based approaches are efficient but limited by insufficient domain knowledge and weak comprehension of complex code. Method: CodeVisionary has two stages, multisource knowledge analysis and negotiation-based scoring, the latter using discussion among multiple judges to reach a consensus. Result: Experiments show that CodeVisionary outperforms baseline methods on the Pearson, Spearman, and Kendall-Tau coefficients, and provides detailed evaluation reports. Conclusion: CodeVisionary significantly improves LLM-based evaluation of code generation and gives developers directions for improvement. Abstract: Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities and superior efficiency. However, the performance of LLM-based approaches remains limited due to: (1) lack of multisource domain knowledge, and (2) insufficient comprehension of complex code. To mitigate the limitations, we propose CodeVisionary, the first LLM-based agent framework for evaluating LLMs in code generation. CodeVisionary consists of two stages: (1) Multisource knowledge analysis stage, which aims to gather multisource and comprehensive domain knowledge by formulating and executing a stepwise evaluation plan. (2) Negotiation-based scoring stage, which involves multiple judges engaging in discussions to better comprehend the complex code and reach a consensus on the evaluation score. Extensive experiments demonstrate that CodeVisionary achieves the best performance for evaluating LLMs in code generation, outperforming the best baseline methods with average improvements of 0.202, 0.139, and 0.117 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. Besides, CodeVisionary provides detailed evaluation reports, which assist developers in identifying shortcomings and making improvements. The resources of CodeVisionary are available at https://anonymous.4open.science/r/CodeVisionary.

quant-ph [Back]

[113] Quantum Walks-Based Adaptive Distribution Generation with Efficient CUDA-Q Acceleration

Yen-Jui Chang,Wei-Ting Wang,Chen-Yu Liu,Yun-Yuan Wang,Ching-Ray Chang

Main category: quant-ph

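TL;DR: The paper proposes a quantum-walk-based adaptive distribution generator that combines variational quantum circuits with discrete-time quantum walks to generate target probability distributions efficiently and with high precision.

Details Motivation: Conventional methods for generating high-precision probability distributions are computationally expensive and scale poorly. This work uses quantum walks and GPU acceleration to improve efficiency and scalability. Method: Combine variational quantum circuits with discrete-time quantum walks (split-step quantum walks and their entangled extensions), dynamically tuning coin parameters to drive the quantum state toward the target distribution. Result: Implemented in the CUDA-Q framework, the method uses GPU acceleration to significantly reduce computational overhead, and shows high simulation fidelity in financial simulation and two-dimensional pattern generation (such as the digits 0-9). Conclusion: The method connects theoretical quantum algorithms with high-performance computing and offers an efficient solution for practical applications. Abstract: We present a novel Adaptive Distribution Generator that leverages a quantum walks-based approach to generate target probability distributions with high precision and efficiency. Our method integrates variational quantum circuits with discrete-time quantum walks, specifically, split-step quantum walks and their entangled extensions, to dynamically tune coin parameters and drive the evolution of quantum states towards desired distributions. This enables accurate one-dimensional probability modeling for applications such as financial simulation and structured two-dimensional pattern generation exemplified by digit representations (0-9). Implemented within the CUDA-Q framework, our approach exploits GPU acceleration to significantly reduce computational overhead and improve scalability relative to conventional methods. Extensive benchmarks demonstrate that our Quantum Walks-Based Adaptive Distribution Generator achieves high simulation fidelity and bridges the gap between theoretical quantum algorithms and practical high-performance computation.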

cs.HC [Back]

[114] Interpersonal Theory of Suicide as a Lens to Examine Suicidal Ideation in Online Spaces

Soorya Ram Shimgekar,Violeta J. Rodriguez,Paul A. Bloom,Dong Whi Yoo,Koustuv Saha

Main category: cs.HC

TL;DR: The study uses the Interpersonal Theory of Suicide (IPTS) to analyze 59,607 suicidal ideation (SI) posts on Reddit. It finds that high-risk posts express planning, methods, and pain, and examines the limitations of AI chatbots in providing support.

Details Motivation: Suicide is a global public health issue, and existing research lacks a theoretical framework for understanding high-risk suicidal intent. Method: Apply IPTS to posts from Reddit's r/SuicideWatch, categorizing them into SI dimensions and risk factors, and analyze the language of supportive responses. Result: High-risk SI posts express planning, methods, and pain. AI chatbots improve structural coherence but fall short on dynamic and personalized support. Conclusion: AI-driven mental health interventions require deeper understanding and careful reflection in order to provide more effective support. Abstract: Suicide is a critical global public health issue, with millions experiencing suicidal ideation (SI) each year. Online spaces enable individuals to express SI and seek peer support. While prior research has revealed the potential of detecting SI using machine learning and natural language analysis, a key limitation is the lack of a theoretical framework to understand the underlying factors affecting high-risk suicidal intent. To bridge this gap, we adopted the Interpersonal Theory of Suicide (IPTS) as an analytic lens to analyze 59,607 posts from Reddit's r/SuicideWatch, categorizing them into SI dimensions (Loneliness, Lack of Reciprocal Love, Self Hate, and Liability) and risk factors (Thwarted Belongingness, Perceived Burdensomeness, and Acquired Capability of Suicide). We found that high-risk SI posts express planning and attempts, methods and tools, and weaknesses and pain. In addition, we examined the language of supportive responses through psycholinguistic and content analyses and found that individuals respond differently to different stages of SI posts. Finally, we explored the role of AI chatbots in providing effective supportive responses to suicidal ideation posts. We found that although AI improved structural coherence, expert evaluations highlight persistent shortcomings in providing dynamic, personalized, and deeply empathetic support. These findings underscore the need for careful reflection and deeper understanding in both the development and consideration of AI-driven interventions for effective mental health support.

[115] Large Language Models Will Change The Way Children Think About Technology And Impact Every Interaction Paradigm

Russell Beale

Main category: cs.HC

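TL;DR: The paper examines the potentially far-reaching impact of large language models (LLMs) on how children learn and interact with technology, arguing that the effects so far are small compared with the changes to come.

Details Motivation: Study the potentially transformative impact of LLMs on education and offer guidance for the design of future interactive systems. Method: Illustrate the changes through a small scenario and a self-ethnographic study, and propose five significant design considerations. Result: LLMs will markedly change how children learn, and designers will need to adapt to future needs. Conclusion: The impact of LLMs on education will far exceed its current level, and future interactive systems will need to account for five key considerations. Abstract: This paper presents a hopeful perspective on the potentially dramatic impacts of Large Language Models on how children learn and how they will expect to interact with technology. We review the effects of LLMs on education so far, and make the case that these effects are minor compared to the upcoming changes that are occurring. We present a small scenario and self-ethnographic study demonstrating the effects of these changes, and define five significant considerations that interactive systems designers will have to accommodate in the future.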

cs.AI [Back]

[116] The Quantum LLM: Modeling Semantic Spaces with Quantum Principles

Timo Aukusti Laine

Main category: cs.AI

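TL;DR: This paper clarifies the core assumptions of a quantum-inspired model of semantic representation, setting out six key principles. It aims to show that the framework is a valid approach to studying semantic spaces and discusses the potential of quantum computing to improve LLMs.

Details Motivation: Use mathematical tools and conceptual analogies from quantum mechanics to offer a new perspective on semantic representation and processing in large language models (LLMs), and justify the quantum-inspired framework. Method: Set out six key principles that describe a quantum-inspired model of semantic representation, interaction, and dynamics. Result: The paper argues that the quantum-inspired framework is a valid approach to studying semantic spaces and shows its value for understanding information processing and response generation. Conclusion: Quantum computing may enable more powerful and efficient LLMs built on these principles, which points to directions for future research. Abstract: In the previous article, we presented a quantum-inspired framework for modeling semantic representation and processing in Large Language Models (LLMs), drawing upon mathematical tools and conceptual analogies from quantum mechanics to offer a new perspective on these complex systems. In this paper, we clarify the core assumptions of this model, providing a detailed exposition of six key principles that govern semantic representation, interaction, and dynamics within LLMs. The goal is to justify that a quantum-inspired framework is a valid approach to studying semantic spaces. This framework offers valuable insights into their information processing and response generation, and we further discuss the potential of leveraging quantum computing to develop significantly more powerful and efficient LLMs based on these principles.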

[117] Cost-of-Pass: An Economic Framework for Evaluating Language Models

Mehmet Hamza Erol,Batu El,Mirac Suzgun,Mert Yuksekgonul,James Zou

Main category: cs.AI

TL;DR: The paper proposes a framework grounded in production theory that combines accuracy and inference cost to evaluate the economic value of language models. It introduces the "cost-of-pass" and "frontier cost-of-pass" metrics, shows how cost-effectiveness differs across models and tasks, and analyzes how technical progress has affected cost-efficiency.

Details Motivation: Widespread adoption of AI systems in the economy requires weighing their performance against inference cost, but metrics that account for both are lacking. Method: Propose a framework grounded in production theory, define "cost-of-pass" and "frontier cost-of-pass", and assess cost-effectiveness by analyzing model performance across tasks and the effect of technical progress. Result: Lightweight models are the most cost-effective for basic quantitative tasks, large models for knowledge-intensive tasks, and reasoning models for complex quantitative problems. Cost-efficiency on complex quantitative tasks has improved significantly over the past year, and model-level innovations are the main driver of these gains. Conclusion: Complementary model-level innovations are key to improving cost-efficiency, and the proposed economic framework provides a principled tool for measuring progress and guiding deployment. Abstract: The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert", using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs.
Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
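
The two metrics defined in this abstract can be illustrated with a small sketch. It assumes that attempts are independent with a fixed success probability, so the expected cost of obtaining one correct solution is the cost per attempt divided by the success rate; the function names, model names, and dollar figures are hypothetical, and the paper's exact estimator may differ.

```python
import math


def cost_of_pass(cost_per_attempt, success_rate):
    """Expected spend until the first correct answer (independent attempts assumed)."""
    if cost_per_attempt < 0 or not 0.0 <= success_rate <= 1.0:
        raise ValueError("cost must be >= 0 and success_rate must lie in [0, 1]")
    if success_rate == 0.0:
        return math.inf  # a model that never succeeds has unbounded cost-of-pass
    return cost_per_attempt / success_rate


def frontier_cost_of_pass(models, human_expert_cost):
    """Minimum cost-of-pass over the available models and the human-expert baseline.

    `models` maps a model name to (cost_per_attempt, success_rate).
    """
    candidates = [cost_of_pass(cost, rate) for cost, rate in models.values()]
    return min(candidates + [human_expert_cost])


# Hypothetical numbers: a cheap but unreliable model versus an expensive, accurate one.
example_models = {
    "lightweight": (0.01, 0.25),  # $0.01 per attempt, 25% success rate
    "large": (0.10, 0.80),        # $0.10 per attempt, 80% success rate
}
```

Under these made-up numbers the cheaper model defines the frontier even though it fails three times out of four, which mirrors the abstract's point that accuracy alone does not determine economic value.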

[118] Exploring the Potential for Large Language Models to Demonstrate Rational Probabilistic Beliefs

Gabriel Freedman,Francesca Toni

Main category: cs.AI

TL;DR: The study finds that current large language models (LLMs) fall short in probabilistic reasoning and cannot provide rational and coherent representations of probabilistic beliefs.

Details Motivation: Examine how LLMs perform at probabilistic reasoning, to ensure their trustworthiness and explainability in information retrieval and automated decision systems. Method: Introduce a new dataset of claims with indeterminate truth values and apply several well-established uncertainty quantification techniques to evaluate LLMs. Result: Current LLMs fail to satisfy fundamental properties of probabilistic reasoning. Conclusion: LLMs need to improve at probabilistic reasoning before they can be applied in a more trustworthy and effective way. Abstract: Advances in the general capabilities of large language models (LLMs) have led to their use for information retrieval, and as components in automated decision systems. A faithful representation of probabilistic reasoning in these models may be essential to ensure trustworthy, explainable and effective performance in these tasks. Despite previous work suggesting that LLMs can perform complex reasoning and well-calibrated uncertainty quantification, we find that current versions of this class of model lack the ability to provide rational and coherent representations of probabilistic beliefs. To demonstrate this, we introduce a novel dataset of claims with indeterminate truth values and apply a number of well-established techniques for uncertainty quantification to measure the ability of LLMs to adhere to fundamental properties of probabilistic reasoning.

[119] OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

Yichen Wu,Xudong Pan,Geng Hong,Min Yang

Main category: cs.AI

TL;DR: The paper proposes OpenDeception, a framework for evaluating the deception intention and capability of LLM-based agents. It finds that across mainstream LLMs the deception intention ratio exceeds 80% and the deception success rate exceeds 50%, and that more capable LLMs pose a higher deception risk.

Details Motivation: As LLM capabilities improve and agent applications spread, the associated deception risks urgently require systematic evaluation and effective oversight. Existing evaluations mostly rely on simulated games or limited choices and do not fully reflect real-world scenarios. Method: Propose OpenDeception, which evaluates the deception intention and capability of LLM-based agents using an open-ended scenario dataset and inspection of their internal reasoning. Five types of common use cases are constructed, and multi-turn dialogues are simulated to avoid ethical risks. Result: An evaluation of eleven mainstream LLMs shows a deception intention ratio above 80% and a deception success rate above 50%, with more capable LLMs exhibiting higher deception risk. Conclusion: Deception risk in LLM-based agents demands attention, and more alignment effort is needed to inhibit deceptive behavior. Abstract: As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.

[120] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue,Zhiqi Chen,Rui Lu,Andrew Zhao,Zhaokai Wang,Yang Yue,Shiji Song,Gao Huang

Main category: cs.AI

TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) is known for improving LLM reasoning, but this study finds that it does not introduce new reasoning patterns. Instead, it improves sampling efficiency by biasing the model toward high-reward paths.

Details Motivation: Re-examine whether RLVR truly gives LLMs new reasoning abilities rather than only optimizing abilities they already have. Method: Measure the pass@k metric at large values of k and compare RL-trained models with their base models across tasks. Result: RL-trained models perform better at small k, but at large k the base models perform comparably or even better, indicating that RL does not introduce new reasoning abilities. Conclusion: RLVR has limitations, and its role in LLM reasoning needs to be reconsidered and better paradigms explored. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score compared to their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts the performance by biasing the model's output distribution toward paths that are more likely to yield rewards, therefore sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, different from RLVR.
These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need of a better paradigm. Project Page: https://limit-of-RLVR.github.io
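
The pass@k comparison at the heart of this abstract can be made concrete with a short sketch. It uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct; whether the authors compute pass@k in exactly this way is an assumption, and the per-problem counts below are invented purely to illustrate the crossover.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn from n with c correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical two-problem benchmark with n = 10 samples per problem.
# The RL-trained model is reliable on problem A but never solves problem B;
# the base model solves each problem only occasionally.
rl_correct = [8, 0]
base_correct = [1, 1]
```

With these made-up counts the RL-trained model wins at k=1 while the base model wins at k=10, which is the shape of the result the abstract reports: RL sharpens sampling on problems the base model could already solve without widening the set of solvable problems.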

cs.LG [Back]

[121] A mean teacher algorithm for unlearning of language models

Yegor Klochkov

Main category: cs.LG

TL;DR: The paper studies unlearning for language models and proposes combining the mean teacher algorithm with a negative log-unlikelihood (NLUL) loss to reduce memorization while preserving model performance.

Details Motivation: Reducing a language model's memorization of specific text instances while retaining its general abilities remains a challenge. Method: Use the mean teacher algorithm together with the negative log-unlikelihood (NLUL) loss to optimize the unlearning process. Result: The method improves some metrics on the MUSE benchmarks. Conclusion: Combining mean teacher with NLUL reduces memorization effectively without noticeably degrading model performance. Abstract: One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
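
The "gradually modifies the teacher model" step can be sketched as the exponential-moving-average update used in the original mean teacher formulation. This is only the teacher update under that assumption: the flat parameter lists, the function name, and the decay value are illustrative, and the NLUL loss is not reproduced here because the abstract does not define it.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the student parameter.

    teacher <- decay * teacher + (1 - decay) * student
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(teacher_params) != len(student_params):
        raise ValueError("teacher and student must have the same number of parameters")
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_params, student_params)
    ]


def run_ema(teacher_params, student_params, steps, decay=0.99):
    """Apply the update repeatedly against a fixed student (toy setting for illustration)."""
    for _ in range(steps):
        teacher_params = ema_update(teacher_params, student_params, decay)
    return teacher_params
```

A decay close to 1 makes the teacher change slowly, which is the proximal behaviour the abstract relies on; in an actual training loop the student would also be updated by gradient steps between teacher updates.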

[122] STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings

Saksham Rastogi,Pratyush Maini,Danish Pruthi

Main category: cs.LG

TL;DR: STAMP is a framework for detecting dataset membership. It generates watermarked rephrasings and compares model likelihoods to identify private data in LLM pretraining corpora.

Details Motivation: Data creators and benchmark curators worry that their proprietary data is used to train large language models (LLMs) without permission, and need a way to detect whether the data was included in pretraining. Method: STAMP generates multiple rephrasings, each watermarked with a unique secret key, releases one version publicly and keeps the others private, then compares model likelihoods with paired statistical tests to detect membership. Result: STAMP detects contamination across four benchmarks even when the data appears only once and makes up less than 0.001% of total tokens, outperforming other baselines. Conclusion: STAMP detects dataset membership effectively while preserving the semantics and utility of the original data, and applies to real-world cases such as paper abstracts and blog articles. Abstract: Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and the utility of the original data in comparing different models. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
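
The comparison step described above can be sketched as a paired test on per-example log-likelihoods. The abstract only says "paired statistical tests", so the choice of a paired t-statistic is an assumption, as are the function name and the idea of summarizing the private rephrasings of each example by a single value (for instance their mean log-likelihood).

```python
import math
import statistics


def paired_t_statistic(public_loglik, private_loglik):
    """t-statistic of the per-example differences (public minus private).

    A large positive value means the model assigns systematically higher
    likelihood to the publicly released versions than to the held-back
    rephrasings, which is the signal one would read as evidence of membership.
    """
    if len(public_loglik) != len(private_loglik) or len(public_loglik) < 2:
        raise ValueError("need two equal-length samples with at least 2 examples")
    diffs = [pub - priv for pub, priv in zip(public_loglik, private_loglik)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("t-statistic is undefined when all differences are equal")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
```

Turning the statistic into a p-value requires the Student t distribution with len(diffs) - 1 degrees of freedom, which is left to a statistics library in this sketch.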

[123] Integrating Locality-Aware Attention with Transformers for General Geometry PDEs

Minsu Koh,Beom-Chul Park,Heejo Kong,Seong-Whan Lee

Main category: cs.LG

TL;DR: The paper proposes LA2Former, which combines global and local attention to significantly improve Transformer-based neural operators on complex geometries and irregular meshes.

Details Motivation: Methods such as FNO rely on uniform grids, which limits their use on complex geometries and irregular meshes. Transformer-based methods overcome this limitation but overlook local dynamics. Method: LA2Former uses K-nearest neighbors for dynamic patchifying and combines global and local attention to balance computational efficiency and predictive accuracy. Result: Across six benchmark datasets, LA2Former improves predictive accuracy by over 50% relative to existing linear attention methods and outperforms full pairwise attention under optimal conditions. Conclusion: LA2Former highlights the importance of local feature learning and provides a more effective Transformer-based neural operator for solving PDEs on complex and irregular domains. Abstract: Neural operators have emerged as promising frameworks for learning mappings governed by partial differential equations (PDEs), serving as data-driven alternatives to traditional numerical methods. While methods such as the Fourier neural operator (FNO) have demonstrated notable performance, their reliance on uniform grids restricts their applicability to complex geometries and irregular meshes. Recently, Transformer-based neural operators with linear attention mechanisms have shown potential in overcoming these limitations for large-scale PDE simulations. However, these approaches predominantly emphasize global feature aggregation, often overlooking fine-scale dynamics and localized PDE behaviors essential for accurate solutions. To address these challenges, we propose the Locality-Aware Attention Transformer (LA2Former), which leverages K-nearest neighbors for dynamic patchifying and integrates global-local attention for enhanced PDE modeling. By combining linear attention for efficient global context encoding with pairwise attention for capturing intricate local interactions, LA2Former achieves an optimal balance between computational efficiency and predictive accuracy. Extensive evaluations across six benchmark datasets demonstrate that LA2Former improves predictive accuracy by over 50% relative to existing linear attention methods, while also outperforming full pairwise attention under optimal conditions. This work underscores the critical importance of localized feature learning in advancing Transformer-based neural operators for solving PDEs on complex and irregular domains.
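
The K-nearest-neighbor patchifying idea can be illustrated with a brute-force sketch over mesh points. It shows only how a local patch could be formed around each point of an irregular mesh; the function name, the exclusion of the query point from its own patch, and the squared-Euclidean distance are assumptions, and none of LA2Former's attention layers are reproduced.

```python
def knn_patches(points, k):
    """For each point, return the indices of its k nearest other points.

    Brute force, O(n^2) in the number of points; ties are broken by index.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    patches = []
    for i, p in enumerate(points):
        distances = [
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        ]
        distances.sort()
        patches.append([j for _, j in distances[:k]])
    return patches


# Hypothetical irregular 2D mesh: three clustered nodes and one distant node.
mesh_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
```

Pairwise attention restricted to each patch then costs on the order of n * k interactions instead of n squared, which is how a local branch can sit alongside a linear-attention global branch without dominating the cost.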

[124] Learning to Attribute with Attention

Benjamin Cohen-Wang,Yung-Sung Chuang,Aleksander Madry

Main category: cs.LG

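TL;DR: The paper proposes AT2, a token attribution method that learns to use attention weights as features, yielding efficient and reliable attributions that perform on par with costly ablation-based methods.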

Details Motivation: Identifying which preceding tokens influence a sequence generated by a language model is expensive, and existing attention-based heuristics are unreliable. Method: Treats the attention weights of different attention heads as features and learns, using signal from ablations, how to use them for attribution. Result: AT2 performs on par with costly ablation-based approaches while being significantly more efficient, and can be used to prune context to improve question answering quality. Conclusion: AT2 is an efficient and reliable token attribution method suited to analyzing language model behavior. Abstract: Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2 .
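The idea of treating per-head attention weights as features can be sketched as a small regression problem. This is only an illustration under stated assumptions: it uses a linear model over heads fit by ordinary least squares to synthetic "ablation effects", whereas the paper's actual model and training objective may differ. The names `fit_head_weights` and `attribute` and all data are hypothetical.

```python
import numpy as np

def fit_head_weights(attn, ablation_effect):
    """Fit one coefficient per attention head by least squares.

    attn: (num_source_tokens, num_heads) attention paid to each source token.
    ablation_effect: (num_source_tokens,) measured effect of ablating each token.
    Assumption: a linear model over heads; not necessarily what AT2 uses.
    """
    theta, *_ = np.linalg.lstsq(attn, ablation_effect, rcond=None)
    return theta

def attribute(attn, theta):
    """Score each source token as a learned weighted sum over heads."""
    return attn @ theta

# Synthetic data: only heads 1 and 3 carry signal about token influence.
rng = np.random.default_rng(0)
attn = rng.random((50, 4))
true_theta = np.array([0.0, 2.0, 0.0, -1.0])
effect = attn @ true_theta

theta = fit_head_weights(attn, effect)
scores = attribute(attn, theta)
naive_scores = attn.mean(axis=1)  # uniform head averaging, the unreliable baseline
```

Once the per-head coefficients are learned, scoring new tokens needs only the attention weights from a forward pass, which is where the efficiency gain over repeated ablations would come from.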

[125] Scaling sparse feature circuit finding for in-context learning

Dmitrii Kharlapenko,Stepan Shabalin,Fazl Barez,Arthur Conmy,Neel Nanda

Main category: cs.LG

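TL;DR: Sparse autoencoders (SAEs) are used to interpret large language model activations, but their utility for open interpretability questions remains unclear. This paper uses SAEs to deepen the understanding of the in-context learning (ICL) mechanism, finding abstract SAE features that encode knowledge of which task to execute and whose latent vectors causally induce the task zero-shot.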

Details Motivation: Explore the practical utility of sparse autoencoders for interpreting large language model activations, in particular for understanding the in-context learning mechanism. Method: Uses sparse autoencoders to analyze the Gemma-1 2B model, adapting the sparse feature circuits methodology to study task-detecting features and their causal links to task-execution features. Result: Identifies task-execution features whose latent vectors causally induce the task zero-shot, along with task-detecting features that activate earlier in the prompt, and shows that task vectors are well approximated by a sparse sum of SAE latents. Conclusion: Sparse autoencoders help reveal the mechanism of in-context learning; task-detecting and task-execution features are causally linked through the attention and MLP sublayers. Abstract: Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model's knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt, that detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers.

[126] Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Yixuan Even Xu,Yash Savani,Fei Fang,Zico Kolter

Main category: cs.LG

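TL;DR: The PODS framework generates many rollouts in parallel but updates only on an informative subset, addressing the asymmetry between compute and memory requirements in RL; max-variance down-sampling improves performance.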

Details Motivation: Reinforcement learning for language models faces an asymmetry in compute and memory requirements: inference is parallel with a low memory footprint, while policy updates require synchronization and are memory-intensive. Method: Proposes the PODS framework, which generates rollouts in parallel but updates only on an informative subset, and develops max-variance down-sampling to select rollouts with diverse reward signals. Result: On the GSM8K benchmark, GRPO with PODS using max-variance down-sampling outperforms standard GRPO. Conclusion: PODS effectively addresses the asymmetry in RL and improves performance through max-variance down-sampling. Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models, but faces a fundamental asymmetry in computation and memory requirements: inference is embarrassingly parallel with a minimal memory footprint, while policy updates require extensive synchronization and are memory-intensive. To address this asymmetry, we introduce PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples these phases by generating numerous rollouts in parallel but updating only on an informative subset. Within this framework, we develop max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals. We prove that this approach has an efficient algorithmic solution, and empirically demonstrate that GRPO with PODS using max-variance down-sampling achieves superior performance over standard GRPO on the GSM8K benchmark.
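A rough sketch of max-variance down-sampling follows. It relies on a general property of variance: a size-m subset with maximum variance can always be built from some number of the lowest rewards plus the remaining number of the highest rewards, so it is enough to sort once and try the m + 1 possible splits. Whether this matches the paper's exact algorithm is an assumption; the function name `max_variance_subset` and the reward values are hypothetical.

```python
import numpy as np

def max_variance_subset(rewards, m):
    """Pick m rollouts whose rewards have maximal (population) variance.

    Illustrative sketch: tries every split into `low` lowest and
    `m - low` highest rewards after sorting. Requires m <= len(rewards).
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    order = np.argsort(rewards)
    best_var, best_idx = -1.0, None
    for low in range(m + 1):
        high = m - low
        # order[n - high:] is empty when high == 0, as intended.
        idx = np.concatenate([order[:low], order[n - high:]])
        var = np.var(rewards[idx])
        if var > best_var:
            best_var, best_idx = var, idx
    return np.sort(best_idx), best_var

# Eight rollouts with mostly saturated rewards; keep four for the update.
rewards = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0]
kept, kept_var = max_variance_subset(rewards, m=4)
kept_rewards = sorted(np.asarray(rewards)[kept].tolist())
```

In this toy case the selection keeps two failed and two successful rollouts, discarding the redundant successes and the middling one, so the policy update sees the most contrastive reward signal for a fixed memory budget.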

[127] Wearable-Derived Behavioral and Physiological Biomarkers for Classifying Unipolar and Bipolar Depression Severity

Yassine Ouzar,Clémence Nineuil,Fouad Boutaleb,Emery Pierson,Ali Amad,Mohamed Daoudi

Main category: cs.LG

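TL;DR: Uses wearable devices to predict depression subtypes (unipolar and bipolar depression), identifying biomarkers from physiological and behavioral signals to improve diagnostic precision and support personalized treatment.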

Details Motivation: Most existing studies use binary classification to distinguish healthy from depressed individuals and fail to capture the heterogeneity of depression. This study aims to identify depression subtypes with wearable devices to support more precise diagnosis and treatment. Method: Introduces the CALYPSO dataset for non-invasive detection from physiological and behavioral signals (blood volume pulse, electrodermal activity, body temperature, and three-axis acceleration), and establishes a benchmark with standard machine learning methods. Result: Preliminary results show that physical-activity features extracted from accelerometer data are the most effective at distinguishing unipolar from bipolar depression (96.77% accuracy); temperature-based features also perform well (93.55% accuracy). Conclusion: Physiological and behavioral monitoring shows promise for improving depression subtype classification and opens a path to personalized clinical interventions. Abstract: Depression is a complex mental disorder characterized by a diverse range of observable and measurable indicators that go beyond traditional subjective assessments. Recent research has increasingly focused on objective, passive, and continuous monitoring using wearable devices to gain more precise insights into the physiological and behavioral aspects of depression. However, most existing studies primarily distinguish between healthy and depressed individuals, adopting a binary classification that fails to capture the heterogeneity of depressive disorders. In this study, we leverage wearable devices to predict depression subtypes, specifically unipolar and bipolar depression, aiming to identify distinctive biomarkers that could enhance diagnostic precision and support personalized treatment strategies. To this end, we introduce the CALYPSO dataset, designed for non-invasive detection of depression subtypes and symptomatology through physiological and behavioral signals, including blood volume pulse, electrodermal activity, body temperature, and three-axis acceleration. Additionally, we establish a benchmark on the dataset using well-known features and standard machine learning methods. Preliminary results indicate that features related to physical activity, extracted from accelerometer data, are the most effective in distinguishing between unipolar and bipolar depression, achieving an accuracy of $96.77\%$. Temperature-based features also showed high discriminative power, reaching an accuracy of $93.55\%$.
These findings highlight the potential of physiological and behavioral monitoring for improving the classification of depressive subtypes, paving the way for more tailored clinical interventions.

[128] Variational Autoencoder Framework for Hyperspectral Retrievals (Hyper-VAE) of Phytoplankton Absorption and Chlorophyll a in Coastal Waters for NASA's EMIT and PACE Missions

Jiadong Lou,Bingqing Liu,Yuanheng Xiong,Xiaodong Zhang,Xu Yuan

Main category: cs.LG

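TL;DR: This study uses a variational autoencoder (VAE) to retrieve the phytoplankton absorption coefficient and chlorophyll a with high precision from hyperspectral remote sensing reflectance, offering a new solution for NASA's EMIT and PACE missions.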

Details Motivation: Address the difficulty of hyperspectral remote sensing retrieval of phytoplankton community composition in coastal waters and improve understanding of aquatic ecosystems. Method: Uses a variational autoencoder (VAE) on hyperspectral data, with model designs tailored to the multi-distribution prediction problem. Result: In experimental validation, the VAE models show high precision and low bias and outperform the mixture density network (MDN) approach. Conclusion: Combined with AI techniques, hyperspectral data will open new pathways for studying phytoplankton community dynamics. Abstract: Phytoplankton absorb and scatter light in unique ways, subtly altering the color of water; these changes are often too minor for human eyes to detect but can be captured by sensitive ocean color instruments onboard satellites. Hyperspectral sensors, paired with advanced algorithms, are expected to significantly enhance the characterization of phytoplankton community composition, especially in coastal waters where ocean color remote sensing applications have historically encountered significant challenges. This study presents novel machine learning-based solutions for NASA's hyperspectral missions, including EMIT and PACE, tackling high-fidelity retrievals of phytoplankton absorption coefficient and chlorophyll a from their hyperspectral remote sensing reflectance. Given that a single Rrs spectrum may correspond to varied combinations of inherent optical properties and associated concentrations, the Variational Autoencoder (VAE) is used as a backbone in this study to handle such multi-distribution prediction problems. For the first time, we tailor the VAE model with innovative designs to achieve hyperspectral retrievals of aphy and of Chl-a from hyperspectral Rrs in optically complex estuarine-coastal waters. Validation with extensive experimental observation demonstrates superior performance of the VAE models with high precision and low bias. The in-depth analysis of VAE's advanced model structures and learning designs highlights the improvement and advantages of VAE-based solutions over the mixture density network (MDN) approach, particularly on high-dimensional data, such as PACE.
Our study provides strong evidence that current EMIT and PACE hyperspectral data as well as the upcoming Surface Biology Geology mission will open new pathways toward a better understanding of phytoplankton community dynamics in aquatic ecosystems when integrated with AI technologies.

[129] MAAM: A Lightweight Multi-Agent Aggregation Module for Efficient Image Classification Based on the MindSpore Framework

Zhenkai Qin,Feng Zhu,Huan Zeng,Xunyi Nong

Main category: cs.LG

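TL;DR: The paper proposes MAAM, a lightweight attention architecture for image classification in resource-constrained environments, which substantially improves computational efficiency and accuracy.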

Details Motivation: In resource-constrained environments, traditional attention mechanisms are hard to apply because of their computational complexity and structural rigidity, so a lightweight and efficient alternative is needed. Method: MAAM uses three parallel agent branches to extract heterogeneous features, fuses them adaptively with learnable scalar weights, and refines the result with a convolutional compression layer. Result: Achieves 87.0% accuracy on CIFAR-10, well above conventional CNN and MLP models, with a 30% improvement in training efficiency. Conclusion: With hardware acceleration and a low memory footprint, MAAM offers a practical image classification solution for resource-constrained scenarios. Abstract: The demand for lightweight models in image classification tasks under resource-constrained environments necessitates a balance between computational efficiency and robust feature representation. Traditional attention mechanisms, despite their strong feature modeling capability, often struggle with high computational complexity and structural rigidity, limiting their applicability in scenarios with limited computational resources (e.g., edge devices or real-time systems). To address this, we propose the Multi-Agent Aggregation Module (MAAM), a lightweight attention architecture integrated with the MindSpore framework. MAAM employs three parallel agent branches with independently parameterized operations to extract heterogeneous features, adaptively fused via learnable scalar weights, and refined through a convolutional compression layer. Leveraging MindSpore's dynamic computational graph and operator fusion, MAAM achieves 87.0% accuracy on the CIFAR-10 dataset, significantly outperforming conventional CNN (58.3%) and MLP (49.6%) models, while improving training efficiency by 30%. Ablation studies confirm the critical role of agent attention (accuracy drops to 32.0% if removed) and compression modules (25.5% if omitted), validating their necessity for maintaining discriminative feature learning. The framework's hardware acceleration capabilities and minimal memory footprint further demonstrate its practicality, offering a deployable solution for image classification in resource-constrained scenarios without compromising accuracy.

eess.IV [Back]

[130] Advanced Deep Learning and Large Language Models: Comprehensive Insights for Cancer Detection

Yassine Habchi,Hamza Kheddar,Yassine Himeur,Adel Belouchrani,Erchin Serpedin,Fouad Khelifi,Muhammad E. H. Chowdhury

Main category: eess.IV

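TL;DR: This paper reviews advanced deep learning (DL) techniques for cancer detection, including transfer learning, reinforcement learning, federated learning, Transformers, and large language models, filling gaps in existing reviews and discussing their role in improving accuracy, addressing data scarcity, and preserving privacy.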

Details Motivation: Although deep learning is widely applied in healthcare, comprehensive analysis of its role in cancer detection remains limited. This paper aims to fill that gap by reviewing advanced DL techniques and their potential in cancer diagnosis. Method: Reviews transfer learning (TL), reinforcement learning (RL), federated learning (FL), Transformers, and large language models (LLMs), analyzes their applications in cancer detection, and discusses their strengths and challenges. Result: These techniques improve cancer detection accuracy, address data scarcity and privacy concerns, and point to directions for future research. Conclusion: The paper gives researchers and practitioners an overview of current trends and future research directions for deep learning in cancer detection, highlighting its potential in healthcare. Abstract: The rapid advancement of deep learning (DL) has transformed healthcare, particularly in cancer detection and diagnosis. DL surpasses traditional machine learning and human accuracy, making it a critical tool for identifying diseases. Despite numerous reviews on DL in healthcare, a comprehensive analysis of its role in cancer detection remains limited. Existing studies focus on specific aspects, leaving gaps in understanding its broader impact. This paper addresses these gaps by reviewing advanced DL techniques, including transfer learning (TL), reinforcement learning (RL), federated learning (FL), Transformers, and large language models (LLMs). These approaches enhance accuracy, tackle data scarcity, and enable decentralized learning while maintaining data privacy. TL adapts pre-trained models to new datasets, improving performance with limited labeled data. RL optimizes diagnostic pathways and treatment strategies, while FL fosters collaborative model development without sharing sensitive data. Transformers and LLMs, traditionally used in natural language processing, are now applied to medical data for improved interpretability. Additionally, this review examines these techniques' efficiency in cancer diagnosis, addresses challenges like data imbalance, and proposes solutions. It serves as a resource for researchers and practitioners, providing insights into current trends and guiding future research in advanced DL for cancer detection.

[131] Efficient Brain Tumor Segmentation Using a Dual-Decoder 3D U-Net with Attention Gates (DDUNet)

Mohammad Mahdi Danesh Pajouh

Main category: eess.IV

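TL;DR: Proposes a novel dual-decoder U-Net architecture with attention-gated skip connections for brain tumor segmentation from MRI, achieving efficient and accurate performance under limited computational resources.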

Details Motivation: Brain tumors are particularly dangerous because of their aggressive nature and the difficulty of early diagnosis, and existing segmentation methods demand heavy computational resources, which limits practical use. Method: Uses a dual-decoder U-Net architecture with attention-gated skip connections to improve training efficiency and segmentation accuracy. Result: On the BraTS 2020 dataset, reaches Dice scores of 85.06% (WT), 80.61% (TC), and 71.26% (ET) in only 50 epochs, surpassing several commonly used U-Net variants. Conclusion: The model achieves high-quality brain tumor segmentation under limited computational resources, offering a viable solution for resource-constrained research and clinical settings, with potential to improve early diagnosis and patient outcomes. Abstract: Cancer remains one of the leading causes of mortality worldwide, and among its many forms, brain tumors are particularly notorious due to their aggressive nature and the critical challenges involved in early diagnosis. Recent advances in artificial intelligence have shown great promise in assisting medical professionals with precise tumor segmentation, a key step in timely diagnosis and treatment planning. However, many state-of-the-art segmentation methods require extensive computational resources and prolonged training times, limiting their practical application in resource-constrained settings. In this work, we present a novel dual-decoder U-Net architecture enhanced with attention-gated skip connections, designed specifically for brain tumor segmentation from MRI scans. Our approach balances efficiency and accuracy by achieving competitive segmentation performance while significantly reducing training demands. Evaluated on the BraTS 2020 dataset, the proposed model achieved Dice scores of 85.06% for Whole Tumor (WT), 80.61% for Tumor Core (TC), and 71.26% for Enhancing Tumor (ET) in only 50 epochs, surpassing several commonly used U-Net variants. Our model demonstrates that high-quality brain tumor segmentation is attainable even under limited computational resources, thereby offering a viable solution for researchers and clinicians operating with modest hardware. This resource-efficient model has the potential to improve early detection and diagnosis of brain tumors, ultimately contributing to better patient outcomes.
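For readers unfamiliar with the Dice score reported here and in several later entries, a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|) for binary masks A and B, may help. The function name `dice_score` and the toy masks are illustrative only, and the convention of returning 1.0 for two empty masks is an assumption; evaluation toolkits differ on that edge case.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / denom

# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A reported score such as 85.06% corresponds to a value of 0.8506 on this 0-to-1 scale, computed per tumor sub-region.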

[132] Putting the Segment Anything Model to the Test with 3D Knee MRI -- A Comparison with State-of-the-Art Performance

Oliver Mills,Philip Conaghan,Nishant Ravikumar,Samuel Relton

Main category: eess.IV

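TL;DR: The study compares the Segment Anything Model (SAM) with a 3D U-Net for meniscus segmentation in knee MRI, finding that SAM fine-tuned end-to-end performs comparably to the 3D U-Net but does not surpass it.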

Details Motivation: Meniscal damage can lead to knee osteoarthritis, for which there are few effective therapies. Automated segmentation would help early detection and treatment, but existing methods are mostly based on convolutional networks and have not explored vision Transformer models. Method: Applies SAM to meniscus segmentation in 3D knee MRI and compares it with a 3D U-Net, testing two configurations: fine-tuning only the decoder and fine-tuning end-to-end. Result: SAM fine-tuned end-to-end reaches a Dice score of 0.87±0.03, comparable to the 3D U-Net, but performs worse in Hausdorff distance. Conclusion: Despite its generalisability, SAM does not outperform the 3D U-Net on meniscus segmentation and may not be suitable for similar 3D medical image segmentation tasks with low contrast and poorly defined boundaries. Abstract: Menisci are cartilaginous tissue found within the knee that contribute to joint lubrication and weight dispersal. Damage to menisci can lead to onset and progression of knee osteoarthritis (OA), a condition that is a leading cause of disability, and for which there are few effective therapies. Accurate automated segmentation of menisci would allow for earlier detection and treatment of meniscal abnormalities, as well as shedding more light on the role the menisci play in OA pathogenesis. Focus in this area has mainly used variants of convolutional networks, but there has been no attempt to utilise recent large vision transformer segmentation models. The Segment Anything Model (SAM) is a so-called foundation segmentation model, which has been found useful across a range of different tasks due to the large volume of data used for training the model. In this study, SAM was adapted to perform fully-automated segmentation of menisci from 3D knee magnetic resonance images. A 3D U-Net was also trained as a baseline. It was found that, when fine-tuning only the decoder, SAM was unable to compete with 3D U-Net, achieving a Dice score of $0.81\pm0.03$, compared to $0.87\pm0.03$, on a held-out test set. When fine-tuning SAM end-to-end, a Dice score of $0.87\pm0.03$ was achieved. The performance of both the end-to-end trained SAM configuration and the 3D U-Net were comparable to the winning Dice score ($0.88\pm0.03$) in the IWOAI Knee MRI Segmentation Challenge 2019. Performance in terms of the Hausdorff Distance showed that both configurations of SAM were inferior to 3D U-Net in matching the meniscus morphology.
Results demonstrated that, despite its generalisability, SAM was unable to outperform a basic 3D U-Net in meniscus segmentation, and may not be suitable for similar 3D medical image segmentation tasks also involving fine anatomical structures with low contrast and poorly-defined boundaries.
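The Hausdorff distance used above as a morphology metric can be sketched as follows. This is the textbook symmetric definition between two point sets (the larger of the two directed distances); the study may use a variant such as a percentile-based Hausdorff distance, which is not specified in the abstract. The function names and the toy points are hypothetical.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    diff = a[:, None, :] - b[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))  # shape (len(a), len(b))
    return dist.min(axis=1).max()

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy boundary points: the sets agree at the origin but one has an outlier.
surface_a = [[0.0, 0.0], [1.0, 0.0]]
surface_b = [[0.0, 0.0], [4.0, 0.0]]
hd = hausdorff_distance(surface_a, surface_b)
```

Because a single stray boundary point determines the result, two segmentations can have similar Dice scores while differing noticeably in Hausdorff distance, which is consistent with the pattern the abstract reports for SAM versus the 3D U-Net.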

[133] Accelerated Optimization of Implicit Neural Representations for CT Reconstruction

Mahrokh Najaf,Gregory Ongie

Main category: eess.IV

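TL;DR: This paper studies strategies for accelerating the optimization of implicit neural representations (INRs) for CT reconstruction and proposes two approaches, a modified loss function and an ADMM-based algorithm, that significantly speed up sparse-view reconstruction of a breast CT phantom.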

Details Motivation: Implicit neural representations (INRs) show promise for low-dose/sparse-view X-ray CT reconstruction, but training is slow and takes thousands of iterations to converge. This paper aims to address that optimization inefficiency. Method: Proposes two acceleration strategies: (1) a modified loss function with improved conditioning; (2) an algorithm based on the alternating direction method of multipliers (ADMM). Result: Experiments show that both approaches significantly accelerate INR optimization for sparse-view reconstruction of a breast CT phantom. Conclusion: The proposed methods improve the optimization efficiency of INRs for CT reconstruction, making practical use more feasible. Abstract: Inspired by their success in solving challenging inverse problems in computer vision, implicit neural representations (INRs) have been recently proposed for reconstruction in low-dose/sparse-view X-ray computed tomography (CT). An INR represents a CT image as a small-scale neural network that takes spatial coordinates as inputs and outputs attenuation values. Fitting an INR to sinogram data is similar to classical model-based iterative reconstruction methods. However, training INRs with losses and gradient-based algorithms can be prohibitively slow, taking many thousands of iterations to converge. This paper investigates strategies to accelerate the optimization of INRs for CT reconstruction. In particular, we propose two approaches: (1) using a modified loss function with improved conditioning, and (2) an algorithm based on the alternating direction method of multipliers. We illustrate that both of these approaches significantly accelerate INR-based reconstruction of a synthetic breast CT phantom in a sparse-view setting.

[134] Cardiac MRI Semantic Segmentation for Ventricles and Myocardium using Deep Learning

Racheal Mukisa,Arvind K. Bansal

Main category: eess.IV

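TL;DR: The paper proposes a model that improves semantic segmentation of CMR images by extracting edge attributes and context information, improving the localization of cardiac structures.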

Details Motivation: Automated noninvasive cardiac diagnosis is critical for early detection of cardiac disorders and cost-effective clinical management, and precise cardiac image segmentation is key to it. Method: The model extracts edge attributes and context information during down-sampling of the U-Net and infuses this information during up-sampling to localize the left ventricle cavity (LV), right ventricle cavity (RV), and LV myocardium (LMyo). Result: Compared with previous leading models, the model improves the Dice similarity coefficient (DSC) by 2%-11% and lowers the Hausdorff distance (HD) by 1.6 to 5.7 mm. Conclusion: The model improves the accuracy of semantic segmentation of cardiac structures, providing a more reliable tool for automated cardiac diagnosis. Abstract: Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost-effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardiovascular diseases such as cardiomyopathy, valvular diseases, abnormalities related to septum perforations, and blood-flow rate. Semantic segmentation labels the CMR image at the pixel level, and localizes its subcomponents to facilitate the detection of abnormalities, including abnormalities in cardiac wall motion in an aging heart with muscle abnormalities, vascular abnormalities, and valvular abnormalities. In this paper, we describe a model to improve semantic segmentation of CMR images. The model extracts edge-attributes and context information during down-sampling of the U-Net and infuses this information during up-sampling to localize three major cardiac structures: left ventricle cavity (LV); right ventricle cavity (RV); and LV myocardium (LMyo). We present an algorithm and performance results. A comparison of our model with previous leading models, using similarity metrics between actual image and segmented image, shows that our approach improves Dice similarity coefficient (DSC) by 2%-11% and lowers Hausdorff distance (HD) by 1.6 to 5.7 mm.

[135] DADU: Dual Attention-based Deep Supervised UNet for Automated Semantic Segmentation of Cardiac Images

Racheal Mukisa,Arvind K. Bansal

Main category: eess.IV

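TL;DR: Proposes an enhanced deep learning model for segmenting the left and right ventricles and myocardial scar tissue from cardiac magnetic resonance images, combining UNet, channel and spatial attention, edge-detection-based skip connections, and deep supervised learning to substantially improve segmentation accuracy.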

Details Motivation: Accurate segmentation of cardiac magnetic resonance images is critical for clinical diagnosis, and existing methods fall short in complex cases, so more efficient and precise solutions are needed. Method: Builds on the UNet architecture, adds channel and spatial attention, uses edge detection to improve skip connections, and applies deep supervised learning to mitigate vanishing gradients. Result: The model reaches a 98% Dice Similarity Score (DSC) with a significantly lower Hausdorff Distance (HD), outperforming other leading techniques. Conclusion: By combining several techniques, the method improves the accuracy and robustness of cardiac magnetic resonance image segmentation and has potential for clinical use. Abstract: We propose an enhanced deep learning-based model for image segmentation of the left and right ventricles and myocardium scar tissue from cardiac magnetic resonance (CMR) images. The proposed technique integrates UNet, channel and spatial attention, edge-detection based skip-connection and deep supervised learning to improve the accuracy of the CMR image-segmentation. Images are processed using multiple channels to generate multiple feature-maps. We built a dual attention-based model to integrate channel and spatial attention. The use of extracted edges in skip connection improves the reconstructed images from feature-maps. The use of deep supervision reduces vanishing gradient problems inherent in classification based on deep neural networks. The algorithms for dual attention-based model, corresponding implementation and performance results are described. The performance results show that this approach has attained high accuracy: 98% Dice Similarity Score (DSC) and significantly lower Hausdorff Distance (HD). The performance results outperform other leading techniques both in DSC and HD.

[136] Filter2Noise: Interpretable Self-Supervised Single-Image Denoising for Low-Dose CT with Attention-Guided Bilateral Filtering

Yipeng Sun,Linda-Sophie Schneider,Mingxuan Gu,Siyuan Mei,Chengze Ye,Fabian Wagner,Siming Bayer,Andreas Maier

Main category: eess.IV

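TL;DR: The paper proposes Filter2Noise (F2N), an interpretable self-supervised single-image denoising framework that uses an attention-guided bilateral filter and a lightweight module to predict spatially varying filter parameters, addressing transparency and user control in low-dose CT denoising.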

Details Motivation: Denoising is crucial in low-dose CT for enhancing subtle structures and low-contrast lesions, but supervised methods are limited by paired datasets, while self-supervised methods usually rely on deep networks and offer little insight into the denoising mechanism. Method: Proposes the F2N framework, which combines an attention-guided bilateral filter with a lightweight parameter-prediction module and uses a new downsampling shuffle strategy and self-supervised loss function to support single-image training. Result: On the Mayo Clinic 2016 low-dose CT dataset, F2N improves PSNR by 4.59 dB over the leading self-supervised single-image method (ZS-N2N), while improving transparency and user control. Conclusion: F2N offers key advantages for medical applications that need precise and interpretable denoising; the code is open source. Abstract: Effective denoising is crucial in low-dose CT to enhance subtle structures and low-contrast lesions while preventing diagnostic errors. Supervised methods struggle with limited paired datasets, and self-supervised approaches often require multiple noisy images and rely on deep networks like U-Net, offering little insight into the denoising mechanism. To address these challenges, we propose an interpretable self-supervised single-image denoising framework -- Filter2Noise (F2N). Our approach introduces an Attention-Guided Bilateral Filter that adapts to each noisy input through a lightweight module that predicts spatially varying filter parameters, which can be visualized and adjusted post-training for user-controlled denoising in specific regions of interest. To enable single-image training, we introduce a novel downsampling shuffle strategy with a new self-supervised loss function that extends the concept of Noise2Noise to a single image and addresses spatially correlated noise. On the Mayo Clinic 2016 low-dose CT dataset, F2N outperforms the leading self-supervised single-image method (ZS-N2N) by 4.59 dB PSNR while improving transparency, user control, and parametric efficiency. These features provide key advantages for medical applications that require precise and interpretable noise reduction. Our code is available at https://github.com/sypsyp97/Filter2Noise.git .
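To put the reported 4.59 dB PSNR gain in context, the sketch below uses the standard definition PSNR = 10 log10(MAX² / MSE). Under that definition, and assuming both methods are scored against the same reference with the same peak value, a gain of Δ dB corresponds to the mean squared error shrinking by a factor of 10^(Δ/10). The function name `psnr` and the toy image are illustrative and not taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy image with a uniform error of 0.1, so MSE = 0.01 and PSNR = 20 dB.
reference = np.full((4, 4), 0.5)
estimate = reference + 0.1
value_db = psnr(reference, estimate)

# MSE reduction factor implied by a 4.59 dB PSNR gain (about 2.88x).
mse_ratio = 10 ** (4.59 / 10)
```

In other words, the reported improvement would correspond to roughly a 2.9-fold reduction in mean squared error relative to the baseline, under the assumptions stated above.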

[137] A Novel Hybrid Approach for Retinal Vessel Segmentation with Dynamic Long-Range Dependency and Multi-Scale Retinal Edge Fusion Enhancement

Yihao Ouyang,Xunheng Kuang,Mengjia Xiong,Zhida Wang,Yuanquan Wang

Main category: eess.IV

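TL;DR: Proposes a hybrid framework combining CNNs and Mamba for high-precision retinal vessel segmentation, addressing multi-scale vessel variability, complex curvatures, and ambiguous boundaries.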

Details Motivation: Existing methods struggle with multi-scale vessel variability, complex curvatures, and ambiguous boundaries, leading to vascular discontinuities or ambiguous edge features. Method: 1) A High-Resolution Edge Fuse Network; 2) a Dynamic Snake Visual State Space block; 3) a Multi-scale Retina Edge Fusion module. Result: Achieves state-of-the-art performance on three public datasets, particularly in vascular continuity and segmentation of low-contrast regions. Conclusion: The method provides a reliable tool for accurate retinal vessel analysis in clinical applications. Abstract: Accurate retinal vessel segmentation provides essential structural information for ophthalmic image analysis. However, existing methods struggle with challenges such as multi-scale vessel variability, complex curvatures, and ambiguous boundaries. While Convolutional Neural Networks (CNNs), Transformer-based models and Mamba-based architectures have advanced the field, they often suffer from vascular discontinuities or edge feature ambiguity. To address these limitations, we propose a novel hybrid framework that synergistically integrates CNNs and Mamba for high-precision retinal vessel segmentation. Our approach introduces three key innovations: 1) The proposed High-Resolution Edge Fuse Network is a high-resolution preserving hybrid segmentation framework that combines a multi-scale backbone with the Multi-scale Retina Edge Fusion (MREF) module to enhance edge features, ensuring accurate and robust vessel segmentation. 2) The Dynamic Snake Visual State Space block combines Dynamic Snake Convolution with Mamba to adaptively capture vessel curvature details and long-range dependencies. An improved eight-directional 2D Snake-Selective Scan mechanism and a dynamic weighting strategy enhance the perception of complex vascular topologies. 3) The MREF module enhances boundary precision through multi-scale edge feature aggregation, suppressing noise while emphasizing critical vessel structures across scales. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance, particularly in maintaining vascular continuity and effectively segmenting vessels in low-contrast regions. This work provides a robust method for clinical applications requiring accurate retinal vessel analysis.
The code is available at https://github.com/frank-oy/HREFNet.

[138] FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention

Jun Zeng,KC Santosh,Deepak Rajan Nayak,Thomas de Lange,Jonas Varkey,Tyler Berzin,Debesh Jha

Main category: eess.IV

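TL;DR: FocusNet is a Transformer-based attention network that improves the accuracy of colorectal polyp segmentation; its multi-module design performs well on multi-modality and multi-center data.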

Details Motivation: Most existing deep learning models are trained on single-modality, single-center data and adapt poorly to real clinical environments, so a more robust segmentation method is needed. Method: FocusNet has three modules: CIDM generates coarse segmentation maps, DEM refines shallow features, and FAM balances local detail and global context through local and pooling attention. Result: On the PolypDB dataset, FocusNet's Dice coefficients across five modalities exceed those of existing methods, reaching up to 93.42%. Conclusion: FocusNet performs well on multi-modality and multi-center data, providing a more reliable solution for polyp segmentation. Abstract: Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM), to balance local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced dataset with multi-modality and multi-center data for building more reliable segmentation methods. Extensive experiments showed that FocusNet consistently outperforms existing state-of-the-art approaches with high Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI, and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at https://github.com/JunZengz/FocusNet.

[139] ViG3D-UNet: Volumetric Vascular Connectivity-Aware Segmentation via 3D Vision Graph Representation

Bowen Liu,Chunlei Meng,Wei Lin,Hongda Zhang,Ziqing Zhou,Zhongxue Gan,Chun Ouyang

Main category: eess.IV

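TL;DR: ViG3D-UNet combines a 3D graph neural network with a U-shaped architecture for continuous vascular segmentation, addressing the discontinuous segmentation and missing endpoints of existing methods.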

Details Motivation: Accurate vascular segmentation is essential for coronary visualization and the diagnosis of coronary heart disease, but existing methods suffer from discontinuous vascular segmentation and missing endpoints. Method: ViG3D-UNet integrates 3D graph representation and aggregation, extracts fine details with a convolutional module, and combines the two feature branches through channel attention. A paperclip-shaped offset decoder reduces redundant computation in the sparse feature space. Result: Experiments on the ASOCA and ImageCAS datasets show that ViG3D-UNet outperforms other methods in maintaining vascular connectivity and segmentation accuracy. Conclusion: ViG3D-UNet effectively addresses the continuity problem in vascular segmentation and has practical value. Abstract: Accurate vascular segmentation is essential for coronary visualization and the diagnosis of coronary heart disease. This task involves the extraction of sparse tree-like vascular branches from the volumetric space. However, existing methods have faced significant challenges due to discontinuous vascular segmentation and missing endpoints. To address this issue, a 3D vision graph neural network framework, named ViG3D-UNet, was introduced. This method integrates 3D graph representation and aggregation within a U-shaped architecture to facilitate continuous vascular segmentation. The ViG3D module captures volumetric vascular connectivity and topology, while the convolutional module extracts fine vascular details. These two branches are combined through channel attention to form the encoder feature. Subsequently, a paperclip-shaped offset decoder minimizes redundant computations in the sparse feature space and restores the feature map size to match the original input dimensions. To evaluate the effectiveness of the proposed approach for continuous vascular segmentation, evaluations were performed on two public datasets, ASOCA and ImageCAS. The segmentation results show that the ViG3D-UNet surpassed competing methods in maintaining vascular segmentation connectivity while achieving high segmentation accuracy. Our code will be available soon.

[140] SupResDiffGAN a new approach for the Super-Resolution task

Dawid Kopeć,Wojciech Kozłowski,Maciej Wizerkaniuk,Dawid Krutul,Jan Kocoń,Maciej Zięba

Main category: eess.IV

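TL;DR: SupResDiffGAN is a hybrid architecture combining GANs and diffusion models for super-resolution, significantly speeding up inference while maintaining image quality.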

Details Motivation: 结合GAN和扩散模型的优势,解决扩散模型在超分辨率任务中推理速度慢的问题。 Method: 利用潜在空间表示和减少扩散步骤,提出自适应噪声破坏以防止判别器过拟合。 Result: 在基准数据集上表现优于SR3和I²SB,效率与图像质量均提升。 Conclusion: SupResDiffGAN弥合了扩散模型与GAN方法的性能差距,为实时高分辨率图像生成奠定了基础。 Abstract: In this work, we present SupResDiffGAN, a novel hybrid architecture that combines the strengths of Generative Adversarial Networks (GANs) and diffusion models for super-resolution tasks. By leveraging latent space representations and reducing the number of diffusion steps, SupResDiffGAN achieves significantly faster inference times than other diffusion-based super-resolution models while maintaining competitive perceptual quality. To prevent discriminator overfitting, we propose adaptive noise corruption, ensuring a stable balance between the generator and the discriminator during training. Extensive experiments on benchmark datasets show that our approach outperforms traditional diffusion models such as SR3 and I$^2$SB in efficiency and image quality. This work bridges the performance gap between diffusion- and GAN-based methods, laying the foundation for real-time applications of diffusion models in high-resolution image generation.

cs.RO [Back]

[141] LangCoop: Collaborative Driving with Language

Xiangbo Gao,Yuheng Wu,Rujia Wang,Chenxi Liu,Yang Zhou,Zhengzhong Tu

Main category: cs.RO

TL;DR: LangCoop uses natural language as the communication medium for multi-agent collaborative driving, sharply reducing bandwidth requirements while maintaining driving performance.

Details Motivation: Address the high bandwidth demands, agent heterogeneity, and information loss of existing multi-agent communication methods. Method: Proposes the LangCoop framework, which comprises M$^3$CoT (structured zero-shot vision-language reasoning) and LangPack (efficient packaging of information into language messages). Result: In CARLA simulations, LangCoop reduces communication bandwidth by 96% (< 2KB per message) while maintaining driving performance. Conclusion: LangCoop provides an efficient, low-bandwidth communication solution for multi-agent collaborative driving. Abstract: Multi-agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi-agent communication approaches are hindered by limitations of existing communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter-agent communication. LangCoop features two key innovations: Mixture Model Modular Chain-of-thought (M$^3$CoT) for structured zero-shot vision-language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language-based messages. Through extensive experiments conducted in the CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96% reduction in communication bandwidth (< 2KB per message) compared to image-based communication, while maintaining competitive driving performance in the closed-loop evaluation.

[142] Lightweight LiDAR-Camera 3D Dynamic Object Detection and Multi-Class Trajectory Prediction

Yushen He,Lei Zhao,Tianchen Deng,Zipeng Fang,Weidong Chen

Main category: cs.RO

TL;DR: Proposes a lightweight multi-modal framework for 3D object detection and trajectory prediction that fuses LiDAR and camera inputs, achieves real-time perception, and performs well under limited computational resources.

Details Motivation: Service mobile robots must avoid dynamic objects with limited computational resources, so they need efficient 3D perception and trajectory prediction. Method: Proposes two new modules: 1) a Cross-Modal Deformable Transformer (CMDT) for high-accuracy object detection; 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for multi-class trajectory prediction. Result: Outperforms existing methods on the CODa benchmark (+2.03% detection mAP, -0.408m pedestrian minADE5) and achieves real-time inference at 13.2 fps on an entry-level GPU. Conclusion: The system is highly deployable, and the code is open-sourced to support reproduction and practical use. Abstract: Service mobile robots are often required to avoid dynamic objects while performing their tasks, but they usually have only limited computational resources. So we present a lightweight multi-modal framework for 3D object detection and trajectory prediction. Our system synergistically integrates LiDAR and camera inputs to achieve real-time perception of pedestrians, vehicles, and riders in 3D space. The framework proposes two novel modules: 1) a Cross-Modal Deformable Transformer (CMDT) for object detection with high accuracy and an acceptable amount of computation, and 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for efficient and diverse trajectory prediction of multi-class objects with flexible trajectory lengths. Evaluations on the CODa benchmark demonstrate superior performance over existing methods across detection (+2.03% in mAP) and trajectory prediction (-0.408m in minADE5 of pedestrians) metrics. Remarkably, the system exhibits exceptional deployability - when implemented on a wheelchair robot with an entry-level NVIDIA 3060 GPU, it achieves real-time inference at 13.2 fps. To facilitate reproducibility and practical deployment, we release the related code of the method at https://github.com/TossherO/3D_Perception and its ROS inference version at https://github.com/TossherO/ros_packages.

[143] Green Robotic Mixed Reality with Gaussian Splatting

Chenxuan Liu,He Li,Zongze Li,Shuai Wang,Wei Xu,Kejiang Ye,Derrick Wing Kwan Ng,Chengzhong Xu

Main category: cs.RO

TL;DR: The paper proposes a Gaussian splatting (GS) based RoboMR system (GSRMR) that lowers energy consumption by uploading high-resolution images less often, and further proposes a GS cross-layer optimization (GSCLO) framework and an accelerated penalty optimization (APO) algorithm, significantly improving communication efficiency and image quality.

Details Motivation: In robotic mixed reality (RoboMR) systems, uploading high-resolution images at high frequency consumes a large amount of communication energy, so a green communication solution is needed. Method: Proposes GSRMR, which uses a Gaussian splatting model to reduce the need for image uploads; designs the GSCLO framework to jointly optimize content switching and power allocation, solved with the APO algorithm. Result: Experiments show that GSRMR reduces communication energy by over 10x and outperforms baseline schemes in PSNR and SSIM. Conclusion: GSRMR offers an effective solution for green RoboMR, significantly reducing energy consumption and improving image quality. Abstract: Realizing green communication in robotic mixed reality (RoboMR) systems presents a challenge, due to the necessity of uploading high-resolution images at high frequencies through wireless channels. This paper proposes Gaussian splatting (GS) RoboMR (GSRMR), which achieves a lower energy consumption and makes a concrete step towards green RoboMR. The crux of GSRMR is to build a GS model which enables the simulator to opportunistically render a photo-realistic view from the robot's pose, thereby reducing the need for excessive image uploads. Since the GS model may involve discrepancies compared to the actual environments, a GS cross-layer optimization (GSCLO) framework is further proposed, which jointly optimizes content switching (i.e., deciding whether or not to upload an image) and power allocation across different frames. The GSCLO problem is solved by an accelerated penalty optimization (APO) algorithm. Experiments demonstrate that the proposed GSRMR reduces the communication energy by over 10x compared with RoboMR. Furthermore, the proposed GSRMR with APO outperforms extensive baseline schemes, in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).

[144] SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM

Samuel Cerezo,Gaetano Meli,Tomás Berriel Martins,Kirill Safronov,Javier Civera

Main category: cs.RO

TL;DR: The SLAM&Render dataset fills a data gap between SLAM and neural rendering, providing 40 sequences that support multimodal and sequential evaluation.

Details Motivation: Existing datasets do not cover the specific challenges of both SLAM and neural rendering, such as multimodality, sequentiality, and variation in viewpoint and illumination. Method: Introduces the SLAM&Render dataset, which includes RGB, depth, IMU, robot kinematic data, and ground-truth poses across multiple scenes and lighting conditions. Result: Experiments validate SLAM&Render as a relevant benchmark for this emerging research area. Conclusion: SLAM&Render provides important support for research at the intersection of SLAM and neural rendering. Abstract: Models and methods originally developed for novel view synthesis and scene rendering, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, are increasingly being adopted as representations in Simultaneous Localization and Mapping (SLAM). However, existing datasets fail to include the specific challenges of both fields, such as multimodality and sequentiality in SLAM or generalization across viewpoints and illumination conditions in neural rendering. To bridge this gap, we introduce SLAM&Render, a novel dataset designed to benchmark methods in the intersection between SLAM and novel view rendering. It consists of 40 sequences with synchronized RGB, depth, IMU, robot kinematic data, and ground-truth pose streams. By releasing robot kinematic data, the dataset also enables the assessment of novel SLAM strategies when applied to robot manipulators. The dataset sequences span five different setups featuring consumer and industrial objects under four different lighting conditions, with separate training and test trajectories per scene, as well as object rearrangements. Our experimental results, obtained with several baselines from the literature, validate SLAM&Render as a relevant benchmark for this emerging research area.

[145] Learning Through Retrospection: Improving Trajectory Prediction for Automated Driving with Error Feedback

Steffen Hagedorn,Aron Distelzweig,Marcel Hallgarten,Alexandru P. Condurache

Main category: cs.RO

TL;DR: The paper proposes a new retrospection technique that trains the model to use feedback from temporal data, improving the accuracy of vehicle trajectory prediction in automated driving.

Details Motivation: Existing models treat each vehicle trajectory prediction independently, so they cannot correct their errors and end up repeating them. Method: Proposes a retrospection technique in which closed-loop training teaches the model to analyze feedback and improve subsequent predictions. Result: On the nuScenes and Argoverse datasets, minimum Average Displacement Error decreases by up to 31.9%, and out-of-distribution scenarios are handled better. Conclusion: The retrospection technique significantly improves the accuracy and robustness of trajectory prediction. Abstract: In automated driving, predicting trajectories of surrounding vehicles supports reasoning about scene dynamics and enables safe planning for the ego vehicle. However, existing models handle predictions as an instantaneous task of forecasting future trajectories based on observed information. As time proceeds, the next prediction is made independently of the previous one, which means that the model cannot correct its errors during inference and will repeat them. To alleviate this problem and better leverage temporal data, we propose a novel retrospection technique. Through training on closed-loop rollouts the model learns to use aggregated feedback. Given new observations it reflects on previous predictions and analyzes its errors to improve the quality of subsequent predictions. Thus, the model can learn to correct systematic errors during inference. Comprehensive experiments on nuScenes and Argoverse demonstrate a considerable decrease in minimum Average Displacement Error of up to 31.9% compared to the state-of-the-art baseline without retrospection. We further showcase the robustness of our technique by demonstrating a better handling of out-of-distribution scenarios with undetected road-users.
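For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how minimum Average Displacement Error (minADE) is commonly computed for a set of K candidate trajectories. It is not taken from the paper: the function name `min_ade` and the 2D point representation are assumptions made for illustration, and the benchmark implementations used by the authors may differ in detail.

```python
import math


def min_ade(predictions, ground_truth):
    """Return the minimum Average Displacement Error over K predictions.

    predictions  -- list of K trajectories, each a list of (x, y) points
    ground_truth -- list of (x, y) points with the same length as each prediction
    (Hypothetical helper for illustration; not the paper's evaluation code.)
    """
    if not predictions or not ground_truth:
        raise ValueError("predictions and ground_truth must be non-empty")

    def ade(trajectory):
        if len(trajectory) != len(ground_truth):
            raise ValueError("trajectory length must match ground truth")
        # Mean Euclidean distance between predicted and true positions.
        total = sum(math.dist(p, g) for p, g in zip(trajectory, ground_truth))
        return total / len(ground_truth)

    # minADE keeps only the best of the K candidate trajectories.
    return min(ade(trajectory) for trajectory in predictions)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    candidates = [
        [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],  # drifts only at the last step
    ]
    print(min_ade(candidates, truth))
```

Because only the best candidate counts, the metric rewards a model for covering the true future with at least one of its K hypotheses rather than for the quality of every hypothesis.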

cs.CR [Back]

[146] X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Salman Rahman,Liwei Jiang,James Shiffer,Genglin Liu,Sheriff Issaka,Md Rizwan Parvez,Hamid Palangi,Kai-Wei Chang,Yejin Choi,Saadia Gabriel

Main category: cs.CR

TL;DR: X-Teaming is a scalable framework that systematically explores safety risks in multi-turn conversations and generates attack scenarios, with success rates of up to 98.1%.

Details Motivation: Safety risks in multi-turn conversations are under-studied, as existing work focuses mainly on single-turn attacks. Method: X-Teaming uses collaborative agents for planning, attack optimization, and verification to generate diverse attack scenarios. Result: X-Teaming performs strongly in multi-turn attacks, reaching a 96.2% attack success rate against the Claude 3.7 Sonnet model. Conclusion: X-Teaming provides important tools and a dataset for the multi-turn safety of language models. Abstract: Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

[147] Q-FAKER: Query-free Hard Black-box Attack via Controlled Generation

CheolWon Na,YunSeok Choi,Jee-Hyong Lee

Main category: cs.CR

TL;DR: Proposes Q-faker, a black-box attack method that never queries the target model and instead uses a surrogate model to generate adversarial examples, avoiding the large query counts and high costs of earlier methods.

Details Motivation: Existing adversarial attack methods need many queries or information about the target model, so they do not apply to fully closed black-box settings. Method: Uses a surrogate model and controlled generation techniques to produce adversarial examples without accessing the target model. Result: Effectiveness is validated on eight datasets, including high transferability and high-quality adversarial examples. Conclusion: Q-faker is of practical value in hard black-box settings. Abstract: Many adversarial attack approaches are proposed to verify the vulnerability of language models. However, they require numerous queries and the information on the target model. Even black-box attack methods require the target model's output information. They are not applicable in real-world scenarios, as in hard black-box settings where the target model is closed and inaccessible. Even the recently proposed hard black-box attacks still require many queries and demand extremely high costs for training adversarial generators. To address these challenges, we propose Q-faker (Query-free Hard Black-box Attacker), a novel and efficient method that generates adversarial examples without accessing the target model. To avoid accessing the target model, we use a surrogate model instead. The surrogate model generates adversarial sentences for a target-agnostic attack. During this process, we leverage controlled generation techniques. We evaluate our proposed method on eight datasets. Experimental results demonstrate our method's effectiveness, including high transferability and the high quality of the generated adversarial examples, and prove its practicality in hard black-box settings.

math.NA [Back]

[148] A Stochastic Nonlinear Dynamical System for Smoothing Noisy Eye Gaze Data

Thoa Thieu,Roderick Melnik

Main category: math.NA

TL;DR: Uses an extended Kalman filter (EKF) to smooth eye-tracking data, significantly reducing noise and improving tracking accuracy.

Details Motivation: Address inaccurate gaze localization in eye tracking caused by noise from sources such as device limitations, calibration drift, ambient lighting changes, and eye blinks. Method: Proposes an EKF-based smoothing method and systematically explores the interaction of different parameters. Result: The EKF significantly reduces noise and improves tracking accuracy; the proposed stochastic nonlinear dynamical model agrees well with experimental data. Conclusion: The EKF method effectively improves eye-tracking accuracy and is applicable to related fields. Abstract: In this study, we address the challenges associated with accurately determining gaze location on a screen, which is often compromised by noise from factors such as eye tracker limitations, calibration drift, ambient lighting changes, and eye blinks. We propose the use of an extended Kalman filter (EKF) to smooth the gaze data collected during eye-tracking experiments, and systematically explore the interaction of different system parameters. Our results demonstrate that the EKF significantly reduces noise, leading to a marked improvement in tracking accuracy. Furthermore, we show that our proposed stochastic nonlinear dynamical model aligns well with real experimental data and holds promise for applications in related fields.
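To make the filtering idea concrete, here is a minimal scalar EKF sketch for smoothing a one-dimensional gaze coordinate. The abstract does not specify the authors' dynamical model, so everything model-specific below is an assumption chosen only for illustration: the nonlinear drift `-a * tanh(x)`, the direct observation `h(x) = x`, and the default noise parameters `q` and `r`.

```python
import math


def ekf_smooth(measurements, dt=0.01, a=1.0, q=1e-3, r=0.25, x0=0.0, p0=1.0):
    """Smooth a 1-D sequence with a scalar extended Kalman filter.

    Assumed (hypothetical) model, not the paper's:
        x_next = x - dt * a * tanh(x) + process noise (variance q)
        z      = x + measurement noise (variance r)
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: propagate the state through the nonlinear dynamics and
        # the variance through the Jacobian evaluated at the current estimate.
        jacobian = 1.0 - dt * a * (1.0 - math.tanh(x) ** 2)
        x_pred = x - dt * a * math.tanh(x)
        p_pred = jacobian * p * jacobian + q

        # Update: the observation function is the identity, so H = 1.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (z - x_pred)
        p = (1.0 - gain) * p_pred
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    # Synthetic noisy samples alternating around a gaze coordinate of 1.0.
    noisy = [1.5, 0.5] * 50
    smoothed = ekf_smooth(noisy)
    print(max(noisy) - min(noisy), max(smoothed[-20:]) - min(smoothed[-20:]))
```

The ratio `q / r` controls how strongly the filter trusts new samples: a smaller ratio gives a smoother but slower-responding estimate, which is the kind of parameter interaction the abstract says the authors explore.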

eess.SP [Back]

[149] Focus3D: A Practical Method to Adaptively Focus ISAR Data and Provide 3-D Information for Automatic Target Recognition

John R. Bennett

Main category: eess.SP

TL;DR: This paper proposes an improved ISAR processor to enhance ATR identification of ships at sea, combining a focus algorithm with a method for modeling ship angles to overcome the limitations of the earlier single-angle model.

Details Motivation: Improving automatic target recognition (ATR) of ships at sea requires an ISAR processor that not only produces focused images but also determines the pose of the ship. Method: Combines a focus algorithm with a two-angle (aspect angle and tilt angle) model, extending the single-angle model of Melendez and Bennett. Result: The new method determines ship pose (for example, profile or plan view) more accurately, narrowing the set of possible identifications. Conclusion: The two-angle model significantly improves the accuracy and applicability of ATR identification, especially for long imaging times. Abstract: Improving ATR identification of ships at sea requires an advanced ISAR processor - one that not only provides focused images but can also determine the pose of the ship. This tells us whether the image shows a profile (vertical plane) view, a plan (horizontal plane) view or some view in between. If the processor can provide this information, then the ATR processor can try to match the images with known vertical or horizontal features of ships and, in conjunction with estimated ship length, narrow the set of possible identifications. This paper extends the work of Melendez and Bennett [M-B, Ref. 1] by combining a focus algorithm with a method that models the angles of the ship relative to the radar. In M-B the algorithm was limited to a single angle and the plane of rotation was not determined. This assumption may be fine for a short time image where there is limited data available to determine the pose. However, the present paper models the ship rotation with two angles - aspect angle, representing rotation in the horizontal plane, and tilt angle, representing variations in the effective grazing angle to the ship.