cs.CV

[1] Robust Emotion Recognition via Bi-Level Self-Supervised Continual Learning

Adnan Ahmad, Bahareh Nakisa, Mohammad Naim Rastgoo

Main category: cs.CV

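TL;DR: The paper proposes SSOCL, a bi-level self-supervised continual learning framework that handles continuous, unlabeled streams of physiological signals to improve emotion recognition.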

Motivation: Physiological signals such as EEG suffer from cross-subject variability and noisy labels in emotion recognition, and existing methods struggle with continuous, unlabeled data streams.

Method: SSOCL is built on a dynamic memory buffer. Its bi-level architecture iteratively refines the buffer and the pseudo-label assignments, and it combines a fast adaptation module with a cluster-mapping module.

Result: Experiments on two mainstream EEG tasks show that SSOCL adapts to continuous data streams while maintaining strong cross-subject generalization, outperforming existing methods.

Conclusion: SSOCL effectively addresses emotion recognition from continuous, unlabeled physiological data streams and has potential for practical use.

Abstract: Emotion recognition through physiological signals such as electroencephalogram (EEG) has become an essential aspect of affective computing and provides an objective way to capture human emotions. However, physiological data characterized by cross-subject variability and noisy labels hinder the performance of emotion recognition models. Existing domain adaptation and continual learning methods struggle to address these issues, especially under realistic conditions where data is continuously streamed and unlabeled. To overcome these limitations, we propose a novel bi-level self-supervised continual learning framework, SSOCL, based on a dynamic memory buffer. This bi-level architecture iteratively refines the dynamic buffer and pseudo-label assignments to effectively retain representative samples, enabling generalization from continuous, unlabeled physiological data streams for emotion recognition. The assigned pseudo-labels are subsequently leveraged for accurate emotion prediction. Key components of the framework, including a fast adaptation module and a cluster-mapping module, enable robust learning and effective handling of evolving data streams. Experimental validation on two mainstream EEG tasks demonstrates the framework's ability to adapt to continuous data streams while maintaining strong generalization across subjects, outperforming existing approaches.

[2] Bias and Generalizability of Foundation Models across Datasets in Breast Mammography

Germani Elodie, Selin Türk Ilayda, Zeineddine Fatima, Mourad Charbel, Albarqouni Shadi

Main category: cs.CV

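TL;DR: The study examines the fairness and bias of foundation models (FMs) in breast mammography classification. Dataset aggregation and domain adaptation improve performance but do not fully remove bias, while fairness-aware techniques perform more stably.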

Motivation: Although computer-aided diagnosis tools for breast cancer screening have advanced, their clinical adoption is still limited by data variability and bias. FMs generalize well but can be undermined by spurious correlations.

Method: Using a large pool of datasets from diverse sources (including data from underrepresented regions and an in-house dataset), the authors study the fairness and bias of FMs and test dataset aggregation, domain adaptation, and fairness-aware techniques.

Result: Modality-specific pre-training improves performance, but classifiers do not generalize well across domains. Dataset aggregation improves overall performance without fully eliminating bias. Fairness-aware techniques perform more stably.

Conclusion: Rigorous fairness evaluation and mitigation strategies need to be built into FM-based models to foster inclusive and generalizable AI.

Abstract: Over the past decades, computer-aided diagnosis tools for breast cancer have been developed to enhance screening procedures, yet their clinical adoption remains challenged by data variability and inherent biases. Although foundation models (FMs) have recently demonstrated impressive generalizability and transfer learning capabilities by leveraging vast and diverse datasets, their performance can be undermined by spurious correlations that arise from variations in image quality, labeling uncertainty, and sensitive patient attributes. In this work, we explore the fairness and bias of FMs for breast mammography classification by leveraging a large pool of datasets from diverse sources, including data from underrepresented regions and an in-house dataset. Our extensive experiments show that while modality-specific pre-training of FMs enhances performance, classifiers trained on features from individual datasets fail to generalize across domains. Aggregating datasets improves overall performance, yet does not fully mitigate biases, leading to significant disparities across under-represented subgroups such as extreme breast densities and age groups. Furthermore, while domain-adaptation strategies can reduce these disparities, they often incur a performance trade-off. In contrast, fairness-aware techniques yield more stable and equitable performance across subgroups. These findings underscore the necessity of incorporating rigorous fairness evaluations and mitigation strategies into FM-based models to foster inclusive and generalizable AI.

[3] Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo

Main category: cs.CV

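TL;DR: The paper asks whether multimodal large language models integrate their modalities through common representations, and uses machine teaching to compare the teaching complexity of image and coordinate representations.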

Motivation: To investigate whether multimodal models truly integrate different modalities through common representations, for example whether an image and a textual description map to similar regions of the latent space.

Method: Using machine teaching theory, the authors select a subset of objects from the Quick, Draw! dataset and compare the teaching complexity of two representations: images (bitmaps) and coordinates (TikZ format).

Result: Image representations usually need fewer segments and reach higher accuracy, but both representations rank concepts similarly by teaching complexity, suggesting that concept simplicity may be an inherent property that holds across modalities.

Conclusion: Multimodal models may indeed integrate modalities through common representations, and the simplicity of a concept may be independent of how it is represented.

Abstract: Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that make up the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.
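The claim that teaching size "ranks concepts similarly across both modalities" is a statement about rank agreement. One common way to quantify such agreement is Spearman's rank correlation; the abstract does not say which statistic the authors use, so the sketch below is only an illustration. The concept names and teaching sizes are hypothetical placeholders, not values from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical teaching sizes (segments needed) per concept and modality.
teaching_size_image = {"circle": 2, "line": 1, "car": 6, "cat": 9}
teaching_size_tikz = {"circle": 5, "line": 3, "car": 11, "cat": 20}

concepts = sorted(teaching_size_image)
rho = spearman(
    [teaching_size_image[c] for c in concepts],
    [teaching_size_tikz[c] for c in concepts],
)
print(f"Spearman rho = {rho:.2f}")
```

In this toy data the TikZ presentation always needs more segments, yet the ordering of concepts is identical, which is the pattern the abstract describes: different absolute cost, similar ranking.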

[4] Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios

Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, Chang Liu

Main category: cs.CV

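TL;DR: Aquarius is a family of industry-level video generation models for marketing scenarios. It targets large-scale clusters and models with hundreds of billions of parameters, and achieves high-performance video synthesis through efficient engineering architecture and algorithmic innovation.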

Motivation: To demystify the design details of industrial-scale video generation systems and advance the generative video community.

Method: The framework comprises a distributed graph and video data processing pipeline, model architectures for different scales, high-performance training infrastructure, multi-xPU parallel inference acceleration, and multiple marketing-scenario applications.

Result: The system achieves high-fidelity, multi-aspect-ratio, long-duration video synthesis, with a 2.35x inference speedup and 36% MFU in large-scale training.

Conclusion: Aquarius demonstrates the potential of industrial-scale video generation systems; more downstream applications and evaluation metrics are planned.

Abstract: This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios designed for thousands-xPU clusters and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community. The Aquarius framework consists of five components: (1) Distributed Graph and Video Data Processing Pipeline: manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing. Additionally, we are about to open-source the entire data processing framework, named "Aquarius-Datapipe". (2) Model Architectures for Different Scales: include a Single-DiT architecture for 2B models and a Multimodal-DiT architecture for 13.4B models, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation. (3) High-Performance Infrastructure for Video Generation Model Training: incorporating hybrid parallelism and fine-grained memory optimization strategies, this infrastructure achieves 36% MFU at large scale. (4) Multi-xPU Parallel Inference Acceleration: utilizes diffusion cache and attention optimization to achieve a 2.35x inference speedup. (5) Multiple Marketing-Scenario Applications: including image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others. More downstream applications and multi-dimensional evaluation metrics will be added in upcoming version updates.
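For readers unfamiliar with the two efficiency figures quoted above, the following sketch shows how such ratios are usually defined: MFU (model FLOPs utilization) as achieved model FLOP/s over aggregate peak hardware FLOP/s, and speedup as baseline latency over optimized latency. All input numbers are hypothetical and chosen only so the ratios match the reported 36% and 2.35x; the report does not give the underlying measurements.

```python
def model_flops_utilization(flops_per_step, step_time_s, num_devices,
                            peak_flops_per_device):
    """MFU = achieved model FLOP/s divided by aggregate peak hardware FLOP/s."""
    achieved_flops_per_s = flops_per_step / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


def speedup(baseline_latency_s, optimized_latency_s):
    """Speedup factor of an optimized run relative to a baseline run."""
    return baseline_latency_s / optimized_latency_s


# Hypothetical measurements, picked only to reproduce the reported ratios.
mfu = model_flops_utilization(
    flops_per_step=3.6e17,       # model FLOPs needed for one training step
    step_time_s=1.0,             # wall-clock time of that step
    num_devices=1000,            # accelerators in the job
    peak_flops_per_device=1e15,  # vendor peak FLOP/s per accelerator
)
print(f"MFU = {mfu:.0%}")
print(f"speedup = {speedup(47.0, 20.0):.2f}x")
```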

[5] Efficient Malicious UAV Detection Using Autoencoder-TSMamba Integration

Azim Akhtarshenas, Ramin Toosi, David López-Pérez, Tohid Alizadeh, Alireza Hosseini

Main category: cs.CV

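TL;DR: The paper proposes an autoencoder (AE)-classifier system, with the AE built on a 4-layer TSMamba architecture, for detecting malicious UAVs, markedly improving classification accuracy and computational efficiency.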

Motivation: Malicious UAVs pose serious threats to next-generation networks, such as unauthorized surveillance and data theft, so efficient detection methods are urgently needed.

Method: An AE based on the TSMamba architecture generates residual values, which are then processed by a ResNet-based classifier, lowering complexity and raising accuracy.

Result: Experiments show improvements in both binary and multi-class classification, with recall of up to 99.8% versus 96.7% for the benchmark, along with reduced computational complexity.

Conclusion: The method is robust and scalable for malicious UAV detection and is suitable for large-scale deployment.

Abstract: Malicious Unmanned Aerial Vehicles (UAVs) present a significant threat to next-generation networks (NGNs), posing risks such as unauthorized surveillance, data theft, and the delivery of hazardous materials. This paper proposes an integrated autoencoder (AE)-classifier system to detect malicious UAVs. The proposed AE, based on a 4-layer Tri-orientated Spatial Mamba (TSMamba) architecture, effectively captures complex spatial relationships crucial for identifying malicious UAV activities. The first phase involves generating residual values through the AE, which are subsequently processed by a ResNet-based classifier. This classifier leverages the residual values to achieve lower complexity and higher accuracy. Our experiments demonstrate significant improvements in both binary and multi-class classification scenarios, achieving up to 99.8% recall compared to 96.7% in the benchmark. Additionally, our method reduces computational complexity, making it more suitable for large-scale deployment. These results highlight the robustness and scalability of our approach, offering an effective solution for malicious UAV detection in NGN environments.

[6] Super-Resolution Generative Adversarial Networks based Video Enhancement

Kağan ÇETİN

Main category: cs.CV

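TL;DR: The paper proposes an improved video super-resolution method that extends the single-image SRGAN structure with 3D Non-Local Blocks to handle spatio-temporal data, improving temporal coherence and visual quality.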

Motivation: SRGAN performs well for single-image enhancement but does not account for the temporal continuity that video processing requires, so it must be adapted for video data.

Method: A modified framework incorporating 3D Non-Local Blocks is proposed; it is trained with patch-wise learning and advanced data degradation techniques to capture spatio-temporal relationships.

Result: Experiments show better temporal coherence, sharper textures, and fewer visual artifacts than traditional single-image methods.

Conclusion: The work offers a practical learning-based solution for video enhancement, with applications in streaming, gaming, and digital restoration.

Abstract: This study introduces an enhanced approach to video super-resolution by extending the ordinary single-image super-resolution (SISR) Super-Resolution Generative Adversarial Network (SRGAN) structure to handle spatio-temporal data. While SRGAN has proven effective for single-image enhancement, its design does not account for the temporal continuity required in video processing. To address this, a modified framework that incorporates 3D Non-Local Blocks is proposed, enabling the model to capture relationships across both spatial and temporal dimensions. An experimental training pipeline is developed, based on patch-wise learning and advanced data degradation techniques, to simulate real-world video conditions and learn from both local and global structures and details. This helps the model generalize better and remain stable across varying video content while preserving the general structure in addition to pixel-wise correctness. Two model variants, one larger and one more lightweight, are presented to explore the trade-offs between performance and efficiency. The results demonstrate improved temporal coherence, sharper textures, and fewer visual artifacts compared to traditional single-image methods. This work contributes to the development of practical, learning-based solutions for video enhancement tasks, with potential applications in streaming, gaming, and digital restoration.

[7] ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection

Xingye Cui, Junhai Luo, Jiakun Deng, Kexuan Li, Xiangyu Qiu, Zhenming Peng

Main category: cs.CV

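TL;DR: The paper proposes ARFC-WAHNet, a network for infrared small target detection that combines adaptive receptive field convolution with a wavelet-attentive hierarchical design to address the limitations of conventional methods.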

Motivation: Infrared images carry limited texture and structural information, and conventional convolution kernels and pooling operations cause feature loss and under-use of image information. A method better suited to complex scenes and diverse targets is needed.

Method: The network combines a multi-receptive field feature interaction convolution module (MRFFIConv), a wavelet frequency enhancement downsampling module (WFED), a high-low feature fusion module (HLFF), and a global median enhancement attention module (GMEA).

Result: ARFC-WAHNet outperforms existing methods on the SIRST, NUDT-SIRST, and IRSTD-1k datasets, with notable gains in detection accuracy and robustness under complex backgrounds.

Conclusion: Through adaptive feature extraction and the cooperation of multiple modules, ARFC-WAHNet effectively improves infrared small target detection.

Abstract: Infrared small target detection (ISTD) is critical in both civilian and military applications. However, the limited texture and structural information in infrared images makes accurate detection particularly challenging. Although recent deep learning-based methods have improved performance, their use of conventional convolution kernels limits adaptability to complex scenes and diverse targets. Moreover, pooling operations often cause feature loss and insufficient exploitation of image information. To address these issues, we propose an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection (ARFC-WAHNet). This network incorporates a multi-receptive field feature interaction convolution (MRFFIConv) module to adaptively extract discriminative features by integrating multiple convolutional branches with a gated unit. A wavelet frequency enhancement downsampling (WFED) module leverages Haar wavelet transform and frequency-domain reconstruction to enhance target features and suppress background noise. Additionally, we introduce a high-low feature fusion (HLFF) module for integrating low-level details with high-level semantics, and a global median enhancement attention (GMEA) module to improve feature diversity and expressiveness via global attention. Experiments on public datasets SIRST, NUDT-SIRST, and IRSTD-1k demonstrate that ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness, particularly under complex backgrounds. The code is available at https://github.com/Leaf2001/ARFC-WAHNet.
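To illustrate why a Haar wavelet transform can replace pooling as a downsampling step, the sketch below applies one level of the orthonormal 2D Haar transform to a tiny image. It halves the resolution while keeping all information in four sub-bands, whereas pooling discards detail. This shows only the generic Haar step; the WFED module's frequency-domain reconstruction and enhancement are not specified in the abstract and are not reproduced here. The function name and sub-band variable names are choices made for this example.

```python
def haar_downsample(image):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands:
      approx      - scaled block sum (low-pass in both directions)
      horiz_diff  - left column minus right column of each 2x2 block
      vert_diff   - top row minus bottom row of each 2x2 block
      diag_diff   - difference of the two diagonals of each 2x2 block
    """
    approx, horiz_diff, vert_diff, diag_diff = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)
            rows[1].append((a - b + d - e) / 2)
            rows[2].append((a + b - d - e) / 2)
            rows[3].append((a - b - d + e) / 2)
        approx.append(rows[0])
        horiz_diff.append(rows[1])
        vert_diff.append(rows[2])
        diag_diff.append(rows[3])
    return approx, horiz_diff, vert_diff, diag_diff


def energy(band):
    """Sum of squared values in a 2D list."""
    return sum(v * v for row in band for v in row)


# A 4x4 "infrared" patch with a single bright pixel as the small target.
patch = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
bands = haar_downsample(patch)
print("sub-band shape:", len(bands[0]), "x", len(bands[0][0]))
print("energy in:", energy(patch), "energy out:", sum(energy(b) for b in bands))
```

Because the transform is orthonormal, the total energy of the four sub-bands equals that of the input, so the single-pixel target survives in the detail bands instead of being averaged away.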

[8] SRMamba: Mamba for Super-Resolution of LiDAR Point Clouds

Chuang Chen, Wenyi Ge

Main category: cs.CV

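TL;DR: SRMamba is a LiDAR point cloud super-resolution method based on Hough Voting and Hole Compensation that addresses the challenge of recovering the 3D spatial structure of point clouds in sparse scenes.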

Motivation: Because LiDAR point clouds are sparse and irregularly structured, point cloud super-resolution remains challenging, especially upsampling under novel views.

Method: A Hough Voting and Hole Compensation strategy removes horizontal linear holes in the range image; a Visual State Space model with a multi-directional scanning mechanism improves recovery of 3D spatial structure; and an asymmetric U-Net adapts to LiDAR inputs with different beam counts.

Result: Experiments on the SemanticKITTI and nuScenes datasets show that SRMamba clearly outperforms other algorithms in both qualitative and quantitative evaluations.

Conclusion: SRMamba offers an efficient and robust solution for LiDAR point cloud super-resolution in sparse scenes.

Abstract: In recent years, range-view-based LiDAR point cloud super-resolution techniques have attracted significant attention as a low-cost method for generating higher-resolution point cloud data. However, due to the sparsity and irregular structure of LiDAR point clouds, the point cloud super-resolution problem remains a challenging topic, especially for point cloud upsampling under novel views. In this paper, we propose SRMamba, a novel method for super-resolution of LiDAR point clouds in sparse scenes, addressing the key challenge of recovering the 3D spatial structure of point clouds from novel views. Specifically, we implement a projection technique based on Hough Voting and a Hole Compensation strategy to eliminate horizontally linear holes in the range image. To improve the establishment of long-distance dependencies and to focus on potential geometric features in vertical 3D space, we employ a Visual State Space model and a Multi-Directional Scanning mechanism to mitigate the loss of 3D spatial structural information due to the range image. Additionally, an asymmetric U-Net network adapts to the input characteristics of LiDARs with different beam counts, enabling super-resolution reconstruction for multi-beam point clouds. We conduct a series of experiments on multiple challenging public LiDAR datasets (SemanticKITTI and nuScenes), and SRMamba demonstrates significant superiority over other algorithms in both qualitative and quantitative evaluations.
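The "range image" and its "holes" may be easier to picture with a small example. The sketch below uses the widely used spherical projection from 3D points to a 2D range image; it is not the Hough Voting-based projection that SRMamba proposes, which the abstract does not describe in detail. The function name, image size, and field-of-view values are assumptions made for illustration.

```python
import math


def to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points into a height x width range image.

    Each pixel stores the distance of the closest point that falls into it.
    Pixels that receive no point stay at 0.0; these are the holes.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # skip degenerate returns at the sensor origin
        yaw = math.atan2(y, x)        # horizontal angle
        pitch = math.asin(z / rng)    # vertical angle
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng     # keep the nearest return per pixel
    return image


# Two hypothetical points straight ahead of the sensor, at 10 m and 5 m.
img = to_range_image([(10.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
                     height=4, width=8, fov_up_deg=10.0, fov_down_deg=-10.0)
filled = sum(1 for row in img for v in row if v > 0.0)
print("filled pixels:", filled, "of", 4 * 8)
```

With sparse input most pixels stay empty, which is the situation that hole-compensation and super-resolution methods operate on.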

[9] MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence

Chonghan Liu, Haoran Wang, Felix Henry, Pu Miao, Yajie Zhang, Yu Zhao, Peiran Wu

Main category: cs.CV

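TL;DR: MIRAGE is a multi-modal benchmark that evaluates models on counting, spatial relational reasoning, and the combination of the two, revealing the limitations of existing models.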

Motivation: Existing models show significant gaps in recognizing object attributes and reasoning about spatial relationships, which limits their dynamic reasoning ability.

Method: The MIRAGE benchmark tests counting, relational reasoning, and their combination through diverse and complex scenarios.

Result: MIRAGE exposes critical limitations of current state-of-the-art models in fine-grained recognition and reasoning.

Conclusion: MIRAGE points to directions for improvement in future spatiotemporal reasoning research.

Abstract: Spatial perception and reasoning are core components of human cognition, encompassing object recognition, spatial relational understanding, and dynamic reasoning. Despite progress in computer vision, existing benchmarks reveal significant gaps in models' abilities to accurately recognize object attributes and reason about spatial relationships, both essential for dynamic reasoning. To address these limitations, we propose MIRAGE, a multi-modal benchmark designed to evaluate models' capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation. Through diverse and complex scenarios requiring fine-grained recognition and reasoning, MIRAGE highlights critical limitations in state-of-the-art models, underscoring the need for improved representations and reasoning frameworks. By targeting these foundational abilities, MIRAGE provides a pathway toward spatiotemporal reasoning in future research.

[10] MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman

Main category: cs.CV

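TL;DR: MMLongBench is the first benchmark covering a diverse set of tasks for long-context vision-language models (LCVLMs). It contains 13,331 examples across five task categories and evaluates 46 models, finding that current models still have room to improve on long-context tasks.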

Motivation: As the context windows of vision-language models grow, effective benchmarks are needed to evaluate their long-context ability.

Method: The authors build the MMLongBench benchmark with diverse tasks and image types, and control input length through a cross-modal tokenization scheme.

Result: Performance on a single task does not represent overall capability; both open-source and closed-source models face challenges; and models with stronger reasoning ability perform better.

Conclusion: MMLongBench provides a foundation for diagnosing and advancing the next generation of LCVLMs.

Abstract: The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

[11] Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng

Main category: cs.CV

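TL;DR: The paper proposes CICD, a training-free method that uses cross-image contrastive decoding to mitigate language priors in large vision-language models and reduce hallucinated content.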

Motivation: Language priors are one of the main causes of hallucination in large vision-language models, leading them to generate content that is linguistically plausible but visually inconsistent.

Method: CICD identifies essential and detrimental priors and uses contrastive decoding to remove the detrimental ones while keeping the text fluent and coherent.

Result: CICD is validated on four benchmarks with six LVLMs and performs especially well on image captioning.

Conclusion: CICD substantially mitigates language priors without additional training and has potential for practical use.

Abstract: Language priors constitute one of the primary causes of hallucinations in Large Vision-Language Models (LVLMs), driving the models to generate linguistically plausible yet visually inconsistent content. The language priors in LVLMs originate from the linguistic knowledge inherited from their pre-trained Large Language Model (LLM) backbone. Consequently, this characteristic is an intrinsic property of the model that remains independent of visual inputs. Inspired by the finding that language priors are consistent across images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method to alleviate language priors in LVLMs. CICD first identifies essential and detrimental priors, and then employs contrastive decoding to eliminate the detrimental ones. This approach simultaneously prevents LVLMs from generating hallucinated content while maintaining textual fluency and coherence. Furthermore, the limited information overlap between images helps prevent visual information loss during contrastive decoding. We validate the effectiveness of CICD on four benchmarks with six LVLMs. Our experiments demonstrate that CICD performs remarkably well in mitigating language priors, especially in the image captioning task, where such priors are most pronounced. Code will be released once accepted.
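The intuition behind contrasting across images can be sketched with the generic contrastive-decoding rule: next-token logits conditioned on the target image are pushed away from logits conditioned on a different image with the same text prefix, so whatever both share (the language prior) is cancelled. The formula below is the commonly used generic form, not necessarily CICD's exact scoring rule, and the step that separates essential from detrimental priors is omitted because the abstract does not specify it. The vocabulary, logits, and `alpha` value are hypothetical.

```python
def contrastive_logits(logits_target, logits_contrast, alpha=1.0):
    """Generic contrastive decoding: (1 + alpha) * target - alpha * contrast."""
    return [(1 + alpha) * t - alpha * c
            for t, c in zip(logits_target, logits_contrast)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


vocab = ["fork", "knife", "spoon"]

# Hypothetical next-token logits after the prefix "On the table there is a".
# The target image shows a fork; "knife" scores high mainly because of
# co-occurrence in text, i.e. a language prior.
logits_target_image = [2.0, 2.2, 0.1]
# Same prefix, unrelated image: the prior on "knife" is still present.
logits_other_image = [0.3, 2.0, 0.1]

plain_choice = vocab[argmax(logits_target_image)]
contrast_choice = vocab[argmax(
    contrastive_logits(logits_target_image, logits_other_image, alpha=1.0)
)]
print("plain decoding picks:", plain_choice)
print("contrastive decoding picks:", contrast_choice)
```

The token supported by the target image gains relative score, while the token that scores high regardless of the image is suppressed.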

[12] Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging

Xianrui Li, Yufei Cui, Jun Li, Antoni B. Chan

Main category: cs.CV

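TL;DR: The paper improves continual learning (CL) for attention-based multiple instance learning (MIL) models, using Attention Knowledge Distillation (AKD) and a Pseudo-Bag Memory Pool (PMP) to reduce forgetting and improve memory efficiency.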

Motivation: Conventional MIL models perform poorly under continual learning, mainly because of forgetting in the attention layers. The paper aims to solve this problem and improve adaptability and efficiency.

Method: Two components are proposed: AKD retains attention-layer knowledge, and PMP selectively stores informative pseudo-bags to reduce memory usage.

Result: Experiments show that the method significantly improves accuracy and memory efficiency on diverse WSI datasets and outperforms existing CL methods.

Conclusion: The work lays a foundation for continual learning on large-scale, weakly annotated clinical datasets and supports more adaptable and resilient diagnostic models.

Abstract: Advances in medical imaging and deep learning have propelled progress in whole slide image (WSI) analysis, with multiple instance learning (MIL) showing promise for efficient and accurate diagnostics. However, conventional MIL models often lack adaptability to evolving datasets, as they rely on static training that cannot incorporate new information without extensive retraining. Applying continual learning (CL) to MIL models is a possible solution, but often sees limited improvements. In this paper, we analyze CL in the context of attention MIL models and find that the model forgetting is mainly concentrated in the attention layers of the MIL model. Using the results of this analysis, we propose two components for improving CL on MIL: Attention Knowledge Distillation (AKD) and the Pseudo-Bag Memory Pool (PMP). AKD mitigates catastrophic forgetting by focusing on retaining attention layer knowledge between learning sessions, while PMP reduces the memory footprint by selectively storing only the most informative patches, or "pseudo-bags", from WSIs. Experimental evaluations demonstrate that our method significantly improves both accuracy and memory efficiency on diverse WSI datasets, outperforming current state-of-the-art CL methods. This work provides a foundation for CL in large-scale, weakly annotated clinical datasets, paving the way for more adaptable and resilient diagnostic models.

[13] CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier

Ziyang Ou

Main category: cs.CV

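TL;DR: The study asks whether CLIP embeddings carry information about AI generation. Using lightweight networks and a fine-tuned classifier, it reaches 95% accuracy on the CIFAKE benchmark, though some image types remain challenging.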

Motivation: Verifying the authenticity of AI-generated images is increasingly important, but the ability of existing vision-language models such as CLIP on this task is underexplored.

Method: A frozen CLIP model extracts visual embeddings, which are fed to lightweight networks; only the final classifier is fine-tuned.

Result: The approach reaches 95% accuracy on the CIFAKE benchmark and 85% after few-shot adaptation. Some image types, such as wide-angle photographs and oil paintings, are hard to classify.

Conclusion: CLIP embeddings can be used to classify AI-generated images, but certain image types remain challenging and need further study.

Abstract: Verifying the authenticity of AI-generated images presents a growing challenge on social media platforms these days. While vision-language models (VLMs) like CLIP excel at multimodal representation, their capacity for AI-generated image classification is underexplored due to the absence of such labels during the pre-training process. This work investigates whether CLIP embeddings inherently contain information indicative of AI generation. A proposed pipeline extracts visual embeddings using a frozen CLIP model, feeds its embeddings to lightweight networks, and fine-tunes only the final classifier. Experiments on the public CIFAKE benchmark show the performance reaches 95% accuracy without language reasoning. Few-shot adaptation to a curated custom dataset with 20% of the data yields 85% accuracy. A closed-source baseline (Gemini-2.0) has the best zero-shot accuracy yet fails on specific styles. Notably, some specific image types, such as wide-angle photographs and oil paintings, pose significant challenges to classification. These results indicate previously unexplored difficulties in classifying certain types of AI-generated images, revealing new and more specific questions in this domain that are worth further investigation.

[14] GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding

Yuki Kawana, Shintaro Shiba, Quan Kong, Norimasa Kobori

Main category: cs.CV

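TL;DR: The paper proposes GA3CE, a new 3D gaze estimation method that learns spatial relationships between the subject and the scene through 3D context encoding, substantially improving gaze direction accuracy in unconstrained settings.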

Motivation: Existing methods struggle to estimate 3D gaze direction accurately in unconstrained settings (for example, when the subject is distant or facing away), because they rely mainly on 2D appearance or limited spatial cues.

Method: The subject and scene are represented by 3D poses and object positions, spatial relationships are learned through 3D context encoding, and a D$^3$ positional encoding is introduced to better capture direction and distance.

Result: In single-frame settings the method reduces mean angle error by 13%-37% compared to leading baselines.

Conclusion: With 3D context encoding and D$^3$ positional encoding, GA3CE substantially improves 3D gaze estimation in unconstrained settings.

Abstract: We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject and objects in the scene, and outputs 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject's eyes are unavailable, such as when the subject is distant or facing away. Previous approaches typically either rely on 2D appearance alone or incorporate limited spatial cues using depth maps in a non-learnable post-processing step. Estimating 3D gaze direction from 2D observations in these scenarios is challenging; variations in subject pose, scene layout, and gaze direction, combined with differing camera poses, yield diverse 2D appearances and 3D gaze directions even when targeting the same 3D scene. To address this issue, we propose GA3CE: Gaze-Aware 3D Context Encoding. Our method represents subject and scene using 3D poses and object positions, treating them as 3D context to learn spatial relationships in 3D space. Inspired by human vision, we align this context in an egocentric space, significantly reducing spatial complexity. Furthermore, we propose D$^3$ (direction-distance-decomposed) positional encoding to better capture the spatial relationship between 3D context and gaze direction in direction and distance space. Experiments demonstrate substantial improvements, reducing mean angle error by 13%-37% compared to leading baselines on benchmark datasets in single-frame settings.

[15] Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?

Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, Yalin Zheng

Main category: cs.CV

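TL;DR: The paper shows experimentally that spatial-temporal graph convolutional networks (ST-GCNs) are over-parameterized for human action recognition, and proposes a sparse ST-GCN generator that cuts parameters while maintaining performance.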

Motivation: Existing ST-GCNs differ little in human action recognition performance, possibly because they are over-parameterized. The paper tests this hypothesis and proposes more efficient sparse models.

Method: A sparse ST-GCN generator trains a sparse architecture from a randomly initialized dense network, and sparse structures at several sparsity levels are integrated into a multi-level model.

Result: Sparse ST-GCNs lose less than 1% accuracy with 95% fewer parameters, and multi-level sparsity ST-GCNs gain more than 1% while using only 66% of the parameters.

Conclusion: Sparse ST-GCNs reduce parameters efficiently while maintaining or improving performance, offering a better solution for human action recognition.

Abstract: Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense components. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense components. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of <1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of >1% in top-1 accuracy. The code is available at https://github.com/davelailai/Sparse-ST-GCN.
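To make "95% fewer parameters" concrete, the sketch below builds a binary mask that keeps only the largest-magnitude weights, which is the pruning criterion typically used in lottery-ticket-style experiments. It is a stand-in for illustration: the paper's sparse ST-GCN generator trains its sparse architecture differently, and the abstract does not describe its mechanism. The weights are synthetic and the function name is hypothetical.

```python
def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of weights
    with the smallest absolute value."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


# Synthetic weights with alternating signs and growing magnitude.
weights = [0.01 * (i + 1) * (-1) ** i for i in range(20)]

mask = magnitude_prune_mask(weights, sparsity=0.95)
sparse_weights = [w * m for w, m in zip(weights, mask)]

kept = sum(mask)
print(f"kept {kept} of {len(weights)} weights "
      f"({kept / len(weights):.0%} of the dense parameter count)")
```

During training, the mask would be applied to the weights after every update so that pruned entries stay at zero.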

[16] GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention

Lingjun Zhao, Sizhe Wei, James Hays, Lu Gan

Main category: cs.CV

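TL;DR: The paper proposes GaussianFormer3D, a multi-modal semantic occupancy prediction framework based on 3D Gaussians that fuses LiDAR and camera data through 3D deformable attention to improve prediction accuracy and efficiency.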

Motivation: 3D semantic occupancy prediction is critical for autonomous driving. Most existing methods use dense grid representations, whereas 3D Gaussians offer a compact and continuous representation that can be combined with multi-modal data to improve performance.

Method: A voxel-to-Gaussian initialization strategy supplies geometry priors from LiDAR data, and a LiDAR-guided 3D deformable attention mechanism fuses LiDAR and camera features in a lifted 3D space.

Result: On on-road and off-road datasets, GaussianFormer3D matches the prediction accuracy of state-of-the-art multi-modal fusion methods while reducing memory consumption and improving efficiency.

Conclusion: GaussianFormer3D shows the potential of 3D-Gaussian-based semantic occupancy prediction for multi-modal fusion and offers a more efficient solution for autonomous driving.

Abstract: 3D semantic occupancy prediction is critical for achieving safe and reliable autonomous driving. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and detailed predictions. Although most existing works utilize a dense grid-based representation, in which the entire 3D space is uniformly divided into discrete voxels, the emergence of 3D Gaussians provides a compact and continuous object-centric representation. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, named GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy to provide 3D Gaussians with geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism for refining 3D Gaussians with LiDAR-camera fusion features in a lifted 3D space. We conducted extensive experiments on both on-road and off-road datasets, demonstrating that our GaussianFormer3D achieves high prediction accuracy that is comparable to state-of-the-art multi-modal fusion-based methods with reduced memory consumption and improved efficiency.

[17] Automated Detection of Salvin's Albatrosses: Improving Deep Learning Tools for Aerial Wildlife Surveys

Mitchell Rogers, Theo Thompson, Isla Duporge, Johannes Fischer, Klemens Pütz, Thomas Mattern, Bing Xue, Mengjie Zhang

Main category: cs.CV

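TL;DR: The study evaluates a general-purpose bird detection model, BirdDetector, for monitoring the breeding population of Salvin's albatross in drone imagery, and finds that fine-tuning markedly improves detection accuracy.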

Motivation: To use drones and deep learning to monitor wildlife populations in remote areas efficiently.

Method: The BirdDetector model is evaluated in zero-shot and fine-tuned settings, combined with enhanced inference techniques and stronger image augmentation.

Result: The fine-tuned model performs better on target-domain data, with markedly higher detection accuracy.

Conclusion: Pre-trained deep learning models show potential for species-specific monitoring, especially in remote and challenging environments.

Abstract: Recent advancements in deep learning and aerial imaging have transformed wildlife monitoring, enabling researchers to survey wildlife populations at unprecedented scales. Unmanned Aerial Vehicles (UAVs) provide a cost-effective means of capturing high-resolution imagery, particularly for monitoring densely populated seabird colonies. In this study, we assess the performance of a general-purpose avian detection model, BirdDetector, in estimating the breeding population of Salvin's albatross (Thalassarche salvini) on the Bounty Islands, New Zealand. Using drone-derived imagery, we evaluate the model's effectiveness in both zero-shot and fine-tuned settings, incorporating enhanced inference techniques and stronger augmentation methods. Our findings indicate that while applying the model in a zero-shot setting offers a strong baseline, fine-tuning with annotations from the target domain and stronger image augmentation leads to marked improvements in detection accuracy. These results highlight the potential of leveraging pre-trained deep-learning models for species-specific monitoring in remote and challenging environments.

[18] IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation

Amritanshu Tiwari,Cherish Puniani,Kaustubh Sharma,Ojasva Nema

Main category: cs.CV

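TL;DR: Proposes a two-stage method that uses LoRA to fine-tune the attention weights of Stable Diffusion XL, addressing catastrophic forgetting, overfitting, and computational overhead in personalized text-to-image generation.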

Details Motivation: When existing text-to-image diffusion models are personalized to new subjects from a few reference images, they often suffer from catastrophic forgetting, overfitting, or high computational overhead. Method: A two-stage pipeline: (1) generate a generic scene with the unmodified SDXL; (2) selectively insert the personalized subject through a segmentation-driven Img2Img pipeline. Result: A DINO similarity score of 0.789 on SDXL, outperforming existing methods. Conclusion: The method preserves SDXL's generative capabilities while integrating the personalized subject with high fidelity. Abstract: Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging. This often leads to catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning on the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject in a high-fidelity manner. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.

[19] Mapping Semantic Segmentation to Point Clouds Using Structure from Motion for Forest Analysis

Francisco Raverta Capua,Pablo De Cristoforis

Main category: cs.CV

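TL;DR: Proposes a new method for generating semantically segmented point clouds of forest environments, addressing the lack of public annotated datasets.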

Details Motivation: Public forest point cloud datasets are scarce because of high costs and sensor requirements, and there are no semantically segmented point cloud datasets generated with SfM algorithms. Method: A custom forest simulator generates RGB images with semantic labels, and modified open-source SfM software reconstructs semantic point clouds from them. Result: The generated point clouds contain both geometric and semantic information and can be used to train and evaluate deep learning models. Conclusion: The method provides a practical tool and dataset for semantic segmentation of forest point clouds. Abstract: Although the use of remote sensing technologies for monitoring forested environments has gained increasing attention, publicly available point cloud datasets remain scarce due to the high costs, sensor requirements, and time-intensive nature of their acquisition. Moreover, as far as we are aware, there are no public annotated datasets generated through Structure From Motion (SfM) algorithms applied to imagery, which may be due to the lack of SfM algorithms that can map semantic segmentation information into an accurate point cloud, especially in a challenging environment like forests. In this work, we present a novel pipeline for generating semantically segmented point clouds of forest environments. Using a custom-built forest simulator, we generate realistic RGB images of diverse forest scenes along with their corresponding semantic segmentation masks. These labeled images are then processed using modified open-source SfM software capable of preserving semantic information during 3D reconstruction. The resulting point clouds provide both geometric and semantic detail, offering a valuable resource for training and evaluating deep learning models aimed at segmenting real forest point clouds obtained via SfM.

[20] Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities

Jiajun Cheng,Xianwu Zhao,Shan Lin

Main category: cs.CV

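TL;DR: The paper examines how general-purpose vision-language models (VLMs) perform in the surgical domain; tests across multiple datasets reveal their limitations in linking language to regions in surgical scenes.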

Details Motivation: Minimally invasive surgery (MIS) poses visual and technical challenges, and existing machine learning methods rely on small annotated datasets. VLMs perform well in general domains, but their adaptability to surgery remains unclear. Method: Evaluate several VLMs on multiple types of surgical datasets, such as laparoscopic procedures and endoscopic submucosal dissection. Result: The VLMs struggle to consistently link language to the correct regions in surgical scenes. Conclusion: VLMs show potential in the surgical domain but need further improvement on language-region grounding. Abstract: Minimally invasive surgery (MIS) presents significant visual and technical challenges, including surgical instrument classification and understanding surgical action involving instruments, verbs, and anatomical targets. While many machine learning-based methods have been developed for surgical understanding, they typically rely on procedure- and task-specific models trained on small, manually annotated datasets. In contrast, the recent success of vision-language models (VLMs) trained on large volumes of raw image-text pairs has demonstrated strong adaptability to diverse visual data and a range of downstream tasks. This opens meaningful research questions: how well do these general-purpose VLMs perform in the surgical domain? In this work, we explore those questions by benchmarking several VLMs across diverse surgical datasets, including general laparoscopic procedures and endoscopic submucosal dissection, to assess their current capabilities and limitations. Our benchmark reveals key gaps in the models' ability to consistently link language to the correct regions in surgical scenes.

[21] Unifying Segment Anything in Microscopy with Multimodal Large Language Model

Manyu Li,Ruian He,Zixian Zhang,Weimin Tan,Bo Yan

Main category: cs.CV

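TL;DR: This paper proposes uLLSAM, which uses multimodal large language models (MLLMs) to inject vision-language knowledge (VLK) into the Segment Anything Model (SAM), markedly improving its segmentation performance on cross-domain microscopy datasets.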

Details Motivation: Existing biomedical image segmentation models perform poorly on unseen-domain data, mainly because they lack vision-language knowledge. The multimodal understanding of MLLMs suggests a way to address this. Method: Proposes the Vision-Language Semantic Alignment (VLSA) module to inject VLK into SAM, and further proposes Semantic Boundary Regularization (SBR) to improve boundary perception. Result: On 9 in-domain microscopy datasets, Dice and SA improve by 7.71% and 12.10%, respectively; on 10 out-of-domain datasets, they improve by 6.79% and 10.08%, showing strong generalization. Conclusion: By injecting VLK and improving boundary perception, uLLSAM markedly improves SAM's microscopy segmentation performance, especially on cross-domain tasks. Abstract: Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose using MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to prompt SAM. Our method achieves performance improvements of 7.71% in Dice and 12.10% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 6.79% in Dice and 10.08% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
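For readers unfamiliar with the Dice metric reported above, the following is a minimal sketch of the standard Dice coefficient on binary masks. It is not taken from the uLLSAM code; the function name `dice_coefficient`, the flat 0/1 mask layout, and the empty-mask convention are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Masks are flat sequences of 0/1 values of equal length
    (an illustrative layout, not the paper's data format).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping pixel, 2 predicted and 1 true foreground pixel.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

A Dice of 1.0 means the predicted and reference masks coincide exactly; 0.0 means they share no foreground pixels.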

[22] Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation

David Minkwan Kim,Soeun Lee,Byeongkeun Kang

Main category: cs.CV

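TL;DR: This paper proposes a completely weakly supervised class-incremental learning method for semantic segmentation that learns both base and novel classes from image-level labels only. It combines pseudo-labels from a localizer and foundation models and introduces exemplar-guided data augmentation; experiments show it outperforms partially weakly supervised methods in several settings.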

Details Motivation: Traditional class-incremental semantic segmentation methods require expensive pixel-level annotations, and partially weakly supervised methods remain limited. This paper aims to achieve completely weakly supervised class-incremental semantic segmentation for the first time. Method: Generate robust pseudo-labels by combining pseudo-labels from a localizer and foundation models, and introduce exemplar-guided data augmentation to reduce catastrophic forgetting. Result: The method outperforms partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings and is competitive in the COCO-to-VOC setting. Conclusion: The proposed completely weakly supervised method performs well in class-incremental semantic segmentation and offers a less annotation-intensive solution for practical use. Abstract: This work addresses the task of completely weakly supervised class-incremental learning for semantic segmentation to learn segmentation for both base and additional novel classes using only image-level labels. While class-incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel-level annotations for training. To overcome this limitation, partially weakly-supervised approaches have recently been proposed. However, to the best of our knowledge, this is the first work to introduce a completely weakly-supervised method for CISS. To achieve this, we propose to generate robust pseudo-labels by combining pseudo-labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar-guided data augmentation method that generates diverse images containing both previous and novel classes with guidance. Finally, we conduct experiments in three common experimental settings: 15-5 VOC, 10-10 VOC, and COCO-to-VOC, and in two scenarios: disjoint and overlap. The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings while achieving competitive accuracy in the COCO-to-VOC setting.

[23] SynRailObs: A Synthetic Dataset for Obstacle Detection in Railway Scenarios

Qiushi Guo,Jason Rambach

Main category: cs.CV

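TL;DR: SynRailObs is a high-fidelity synthetic dataset for obstacle detection in railway environments. It addresses the shortcomings of existing public datasets and uses diffusion models to generate rare obstacles. Experiments show consistent performance across varied conditions and zero-shot capability.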

Details Motivation: Existing public datasets cannot support multi-category obstacle detection under complex conditions, which hinders railway safety research. Method: Introduce the SynRailObs synthetic dataset, use diffusion models to generate rare obstacles, and validate experimentally in real railway environments. Result: Models perform consistently across distances and environmental conditions and show zero-shot capability, making them suitable for safety-sensitive domains. Conclusion: SynRailObs advances obstacle detection research for railway safety, and the dataset is publicly available. Abstract: Detecting potential obstacles in railway environments is critical for preventing serious accidents. Identifying a broad range of obstacle categories under complex conditions requires large-scale datasets with precisely annotated, high-quality images. However, existing publicly available datasets fail to meet these requirements, thereby hindering progress in railway safety research. To address this gap, we introduce SynRailObs, a high-fidelity synthetic dataset designed to represent a diverse range of weather conditions and geographical features. Furthermore, diffusion models are employed to generate rare and difficult-to-capture obstacles that are typically challenging to obtain in real-world scenarios. To evaluate the effectiveness of SynRailObs, we perform experiments in real-world railway environments, testing on both ballasted and ballastless tracks across various weather conditions. The results demonstrate that SynRailObs holds substantial potential for advancing obstacle detection in railway safety applications. Models trained on this dataset show consistent performance across different distances and environmental conditions. Moreover, the model trained on SynRailObs exhibits zero-shot capabilities, which are essential for applications in security-sensitive domains. The data is available at https://www.kaggle.com/datasets/qiushi910/synrailobs.

[24] EA-3DGS: Efficient and Adaptive 3D Gaussians with Highly Enhanced Quality for outdoor scenes

Jianlin Guo,Haihong Xiao,Wenxiong Kang

Main category: cs.CV

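TL;DR: EA-3DGS is a high-quality real-time rendering method for outdoor scenes. It introduces a mesh structure, an efficient Gaussian pruning strategy, and a structure-aware densification strategy to address the memory and adjustment problems of 3D Gaussian Splatting in outdoor scenes.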

Details Motivation: Current NeRF methods train and infer slowly when reconstructing building-scale scenes, while 3D Gaussian Splatting performs poorly in outdoor scenes because its point-based explicit representation lacks an adjustment mechanism and runs into memory limits. Method: (1) Initialize Gaussian components with an adaptive tetrahedral mesh; (2) an efficient Gaussian pruning strategy; (3) a structure-aware densification strategy; (4) vector quantization of Gaussian component parameters to reduce storage. Result: Experiments on 13 scenes, including public datasets and self-collected scenes, demonstrate the method's superiority. Conclusion: EA-3DGS achieves high-quality real-time rendering in outdoor scenes and addresses the limitations of existing methods. Abstract: Efficient scene representations are essential for many real-world applications, especially those involving spatial measurement. Although current NeRF-based methods have achieved impressive results in reconstructing building-scale scenes, they still suffer from slow training and inference speeds due to time-consuming stochastic sampling. Recently, 3D Gaussian Splatting (3DGS) has demonstrated excellent performance with its high-quality rendering and real-time speed, especially for objects and small-scale scenes. However, in outdoor scenes, its point-based explicit representation lacks an effective adjustment mechanism, and the millions of Gaussian points required often lead to memory constraints during training. To address these challenges, we propose EA-3DGS, a high-quality real-time rendering method designed for outdoor scenes. First, we introduce a mesh structure to regulate the initialization of Gaussian components by leveraging an adaptive tetrahedral mesh that partitions the grid and initializes Gaussian components on each face, effectively capturing geometric structures in low-texture regions. Second, we propose an efficient Gaussian pruning strategy that evaluates each 3D Gaussian's contribution to the view and prunes accordingly. To retain geometry-critical Gaussian points, we also present a structure-aware densification strategy that densifies Gaussian points in low-curvature regions. Additionally, we employ vector quantization for parameter quantization of Gaussian components, significantly reducing disk space requirements with only a minimal impact on rendering quality. 
Extensive experiments on 13 scenes, including eight from four public datasets (MatrixCity-Aerial, Mill-19, Tanks & Temples, WHU) and five self-collected scenes acquired through UAV photogrammetry measurement from SCUT-CA and plateau regions, further demonstrate the superiority of our method.

[25] MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation

Gabriel Maldonado,Armin Danesh Pazho,Ghazal Alinezhad Noghre,Vinit Katariya,Hamed Tabkhi

Main category: cs.CV

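TL;DR: MoCLIP is an improved CLIP model that adds a motion encoding head and contrastive learning to raise the accuracy and fidelity of text-to-motion generation.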

Details Motivation: Existing CLIP-based text encoders are limited in understanding the temporal and kinematic structure of motion and need improvement to generate motion well. Method: Introduce MoCLIP, which fine-tunes CLIP on motion sequences using a motion encoding head, contrastive learning, and a tethering loss. Result: Experiments show that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, improving text-to-motion alignment. Conclusion: MoCLIP is an efficient, highly compatible framework that improves motion generation. Abstract: Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results. These results highlight MoCLIP's versatility and effectiveness, establishing it as a robust framework for enhancing motion generation.

[26] From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

Xue Li,Jameson Merkow,Noel C. F. Codella,Alberto Santamaria-Pang,Naiteek Sangani,Alexander Ersoy,Christopher Burt,John W. Garrett,Richard J. Bruce,Joshua D. Warner,Tyler Bradshaw,Ivan Tarapov,Matthew P. Lungren,Alan B. McMillan

Main category: cs.CV

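TL;DR: The study evaluates embeddings from several foundation models for radiographic classification and finds that MedImageInsight embeddings with an SVM adapter perform best; the adapter models are also computationally efficient and show good fairness.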

Details Motivation: Examine the practical utility of foundation model embeddings in medical imaging diagnostics, especially the performance of lightweight adapter models. Method: Extract embeddings with six foundation models and train lightweight adapter models for multi-class radiograph classification. Result: MedImageInsight embeddings with an SVM adapter perform best (mAUC 93.8%); the adapter models are computationally efficient and show good fairness. Conclusion: Foundation model embeddings, especially those from MedImageInsight, support efficient, accurate, and equitable radiographic classification. Abstract: Foundation models, pretrained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models in multi-class radiography classification, focusing specifically on tube placement assessment. A dataset comprising 8842 radiographs classified into seven distinct categories was employed to extract embeddings using six foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, Rad-DINO, and CXR-Foundation. Adapter models were subsequently trained using classical machine learning algorithms. Among these combinations, MedImageInsight embeddings paired with a support vector machine adapter yielded the highest mean area under the curve (mAUC) at 93.8%, followed closely by Rad-DINO (91.1%) and CXR-Foundation (89.0%). In comparison, BiomedCLIP and DenseNet121 exhibited moderate performance with mAUC scores of 83.0% and 81.8%, respectively, whereas Med-Flamingo delivered the lowest performance at 75.1%. Notably, most adapter models demonstrated computational efficiency, achieving training within one minute and inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, fairness analyses on adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 2% and standard deviations across age groups not exceeding 3%. 
These findings confirm that foundation model embeddings, especially those from MedImageInsight, facilitate accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.
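The abstract reports a mean area under the curve (mAUC) over seven classes without stating how the mean is taken. The sketch below assumes a one-vs-rest macro average, which is a common choice but may differ from the authors' procedure; `binary_auc` and `mean_auc` are hypothetical helper names.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(score_rows, labels, num_classes):
    """Assumed definition: unweighted mean of one-vs-rest AUCs over all classes."""
    aucs = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_rows]      # score for class c
        class_labels = [1 if y == c else 0 for y in labels]  # c vs. the rest
        aucs.append(binary_auc(class_scores, class_labels))
    return sum(aucs) / num_classes


if __name__ == "__main__":
    rows = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # toy per-class scores
    print(mean_auc(rows, labels=[0, 1, 0], num_classes=2))
```

The pairwise formulation is quadratic in the number of samples; it is used here only because it makes the definition explicit.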

[27] A High-Performance Thermal Infrared Object Detection Framework with Centralized Regulation

Jinke Li,Yue Wu,Xiaoyan Yang

Main category: cs.CV

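TL;DR: Proposes CRT-YOLO, a new thermal infrared object detection framework based on centralized feature regulation that improves performance through global interaction and multi-scale attention modules.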

Details Motivation: Traditional methods fall short in extracting and fusing local-global information, which weakens feature attention in the thermal infrared domain. Method: Combine efficient multi-scale attention (EMA) modules with a Centralized Feature Pyramid (CFP) network to enable global-range interaction on thermal infrared information. Result: The model significantly outperforms conventional methods on two benchmark datasets, and an ablation study confirms the effectiveness of the modules. Conclusion: The CRT-YOLO framework has potential impact on thermal infrared object detection. Abstract: Thermal Infrared (TIR) technology involves the use of sensors to detect and measure infrared radiation emitted by objects, and it is widely utilized across a broad spectrum of applications. The advancements in object detection methods utilizing TIR images have sparked significant research interest. However, most traditional methods lack the capability to effectively extract and fuse local-global information, which is crucial for TIR-domain feature attention. In this study, we present a novel and efficient thermal infrared object detection framework, known as CRT-YOLO, that is based on centralized feature regulation, enabling the establishment of global-range interaction on TIR information. Our proposed model integrates efficient multi-scale attention (EMA) modules, which adeptly capture long-range dependencies while incurring minimal computational overhead. Additionally, it leverages the Centralized Feature Pyramid (CFP) network, which offers global regulation of TIR features. Extensive experiments conducted on two benchmark datasets demonstrate that our CRT-YOLO model significantly outperforms conventional methods for TIR image object detection. Furthermore, the ablation study provides compelling evidence of the effectiveness of our proposed modules, reinforcing the potential impact of our approach on advancing the field of thermal infrared object detection.

[28] NeuSEditor: From Multi-View Images to Text-Guided Neural Surface Edits

Nail Ibrahimli,Julian F. P. Kooij,Liangliang Nan

Main category: cs.CV

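TL;DR: NeuSEditor is a new method for text-guided editing of neural implicit surfaces; an identity-preserving architecture and a geometry-aware distillation loss markedly improve editing quality and efficiency.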

Details Motivation: Implicit surface representations are hard to edit while preserving identity and geometric consistency, and existing methods fall short. Method: NeuSEditor uses an identity-preserving architecture to separate foreground from background, combined with a geometry-aware distillation loss that improves rendering and geometric quality. Result: NeuSEditor outperforms recent methods such as PDS and InstructNeRF2NeRF both quantitatively and qualitatively. Conclusion: NeuSEditor simplifies the editing workflow, requiring no continuous dataset updates or source prompting, and achieves high-quality implicit surface editing. Abstract: Implicit surface representations are valued for their compactness and continuity, but they pose significant challenges for editing. Despite recent advancements, existing methods often fail to preserve identity and maintain geometric consistency during editing. To address these challenges, we present NeuSEditor, a novel method for text-guided editing of neural implicit surfaces derived from multi-view images. NeuSEditor introduces an identity-preserving architecture that efficiently separates scenes into foreground and background, enabling precise modifications without altering the scene-specific elements. Our geometry-aware distillation loss significantly enhances rendering and geometric quality. Our method simplifies the editing workflow by eliminating the need for continuous dataset updates and source prompting. NeuSEditor outperforms recent state-of-the-art methods like PDS and InstructNeRF2NeRF, delivering superior quantitative and qualitative results. For more visual results, visit: neuseditor.github.io.

[29] RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects

Jaeguk Kim,Jaewoo Park,Keuntek Lee,Nam Ik Cho

Main category: cs.CV

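TL;DR: RefPose is a 6D pose estimation method based on reference images and geometric correspondence; it iteratively refines the pose through render-and-compare and performs well on unseen objects.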

Details Motivation: Address 6D pose estimation of unseen objects from monocular RGB images; traditional methods depend on pre-defined models and lack flexibility. Method: Use a reference image and geometric correspondence: rendered templates produce an initial pose, which is then refined iteratively with an attention mechanism. Result: State-of-the-art performance on the BOP benchmark datasets with a competitive runtime. Conclusion: By dynamically adapting to new object shapes, RefPose achieves robust pose estimation for unseen objects. Abstract: Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.

[30] A Convolution-Based Gait Asymmetry Metric for Inter-Limb Synergistic Coordination

Go Fukino,Kanta Tachibana

Main category: cs.CV

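TL;DR: This paper proposes a new way to evaluate gait symmetry based on LTI system modeling and a dissimilarity metric, in contrast to traditional EMG or acceleration analysis.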

Details Motivation: Traditional gait symmetry assessment relies on differences in EMG signals or acceleration, which has limitations, so a more precise method is needed. Method: Model intersegmental coordination with an LTI system and propose a dissimilarity metric to evaluate symmetry. Result: The method was tested on five subjects with symmetric and asymmetric gait. Conclusion: The new method offers a more effective tool for gait symmetry assessment. Abstract: This study focuses on the velocity patterns of various body parts during walking and proposes a method for evaluating gait symmetry. Traditional motion analysis studies have assessed gait symmetry based on differences in electromyographic (EMG) signals or acceleration between the left and right sides. In contrast, this paper models intersegmental coordination using an LTI system and proposes a dissimilarity metric to evaluate symmetry. The method was tested on five subjects with both symmetric and asymmetric gait.
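The abstract does not define the LTI model or the dissimilarity metric, so the following is only an illustrative sketch of the general idea: a discrete LTI system maps an input signal to an output by convolution with a kernel, and left and right sides can be compared through their kernels. The normalized distance in `kernel_dissimilarity` is a hypothetical stand-in, not the paper's metric.

```python
def convolve(x, h):
    """Full discrete convolution y[n] = sum_k x[k] * h[n - k] (LTI system output)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def kernel_dissimilarity(h_left, h_right):
    """Hypothetical metric: Euclidean distance between two equal-length kernels,
    divided by the sum of their norms so the value lies in [0, 1]."""
    if len(h_left) != len(h_right):
        raise ValueError("kernels must have the same length")
    diff = sum((a - b) ** 2 for a, b in zip(h_left, h_right)) ** 0.5
    scale = sum(a * a for a in h_left) ** 0.5 + sum(b * b for b in h_right) ** 0.5
    return 0.0 if scale == 0 else diff / scale


if __name__ == "__main__":
    print(convolve([1, 2, 3], [1, 1]))              # moving-sum kernel
    print(kernel_dissimilarity([1, 0], [1, 0]))     # identical kernels
```

Under this stand-in, identical left and right kernels score 0 and sign-flipped kernels score 1.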

[31] A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision

Alexey Magay,Dhurba Tripathi,Yu Hao,Yi Fang

Main category: cs.CV

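TL;DR: Proposes an approach based on a spatially enhanced multimodal large language model that helps people with visual impairments understand and navigate their environment, paired with a hardware device to improve the user experience.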

Details Motivation: People with visual impairments face challenges in navigation and interaction with their environment, and existing technology lacks spatial reasoning capabilities and lightweight design. Method: Fine-tune a multimodal large language model to strengthen spatial reasoning, and pair it with a glasses-attachment hardware component that provides real-time feedback. Result: Evaluation on the VizWiz dataset shows improved accuracy and user experience. Conclusion: The approach bridges the gap between machine learning models and practical assistive devices, offering a more effective solution for people with visual impairments. Abstract: People with blindness and low vision (pBLV) face significant challenges, struggling to navigate environments and locate objects due to limited visual cues. Spatial reasoning is crucial for these individuals, as it enables them to understand and interpret the spatial relationships in their surroundings, enhancing their ability to navigate and interact more safely and independently. Current multi-modal large language models (MLLMs) for people with low vision lack the spatial reasoning capabilities needed to effectively assist in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to effectively perceive and interact with their surrounding environment. In this paper, we propose a novel spatial enhanced multi-modal large language model based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning capabilities, our method significantly improves the understanding of environmental context, which is critical for navigation and object recognition. The innovation extends to a hardware component, designed as an attachment for glasses, ensuring increased accessibility and ease of use. This integration leverages advanced VLMs to interpret visual data and provide real-time, spatially aware feedback to the user. Our approach aims to bridge the gap between advanced machine learning models and practical, user-friendly assistive devices, offering a robust solution for visually impaired users to navigate their surroundings more effectively and independently. The paper includes an in-depth evaluation using the VizWiz dataset, demonstrating substantial improvements in accuracy and user experience. 
Additionally, we design a comprehensive dataset to evaluate our method's effectiveness in real-world situations, demonstrating substantial improvements in accuracy and user experience.

[32] PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation

Saad Manzur,Bryan Vela,Brandon Vela,Aditya Agrawal,Lan-Anh Dang-Vu,David Li,Wayne Hayes

Main category: cs.CV

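TL;DR: Proposes PoseBench3D, a framework for standardized evaluation of 3D human pose estimation methods across datasets, supporting cross-dataset comparison and analysis.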

Details Motivation: Most prior work is confined to a single dataset, while real applications must cope with diverse conditions, so a unified evaluation standard is needed to improve model generalization. Method: Develop the PoseBench3D framework, which integrates four widely used datasets, provides a unified interface and pre-configured formats, and supports a variety of model architectures. Result: Re-evaluates 18 methods, produces more than 100 cross-dataset evaluation results, and analyzes how pre-processing techniques affect performance. Conclusion: PoseBench3D provides a standardized evaluation tool for 3D pose estimation and helps improve model generalization in real-world scenarios. Abstract: Reliable three-dimensional human pose estimation is becoming increasingly important for real-world applications, yet much of prior work has focused solely on the performance within a single dataset. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups -- conditions that differ significantly from those encountered during training, which is often the case in real-world scenarios. To address these challenges, we present a standardized testing environment in which each method is evaluated on a variety of datasets, ensuring consistent and fair cross-dataset comparisons -- allowing for the analysis of methods on previously unseen data. Therefore, we propose PoseBench3D, a unified framework designed to systematically re-evaluate prior and future models across four of the most widely used datasets for human pose estimation -- with the framework able to support novel and future datasets as the field progresses. Through a unified interface, our framework provides datasets in a pre-configured yet easily modifiable format, ensuring compatibility with diverse model architectures. We re-evaluated the work of 18 methods, either trained or gathered from existing literature, and reported results using both Mean Per Joint Position Error (MPJPE) and Procrustes Aligned Mean Per Joint Position Error (PA-MPJPE) metrics, yielding more than 100 novel cross-dataset evaluation results. Additionally, we analyze performance differences resulting from various pre-processing techniques and dataset preparation parameters -- offering further insight into model generalization capabilities.
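As a reminder of what the MPJPE metric named above measures, here is a minimal sketch of its standard definition for a single pose: the mean Euclidean distance between predicted and ground-truth joints. It is not taken from the PoseBench3D code; the function name `mpjpe` and the list-of-tuples pose layout are assumptions. PA-MPJPE applies a Procrustes (similarity) alignment to the prediction before computing the same error and is omitted here.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error for one pose.

    pred and gt are equal-length lists of (x, y, z) joint coordinates
    (an illustrative layout); the result is in the same unit as the inputs.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("poses must be non-empty and have the same joint count")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
    print(mpjpe(pred, gt))  # joint errors are 3 and 4
```

Benchmark numbers are normally averaged over all frames and reported in millimetres; that aggregation step is left out of this sketch.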

[33] Patient-Specific Dynamic Digital-Physical Twin for Coronary Intervention Training: An Integrated Mixed Reality Approach

Shuo Wang,Tong Ren,Nan Cheng,Rong Wang,Li Zhang

Main category: cs.CV

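TL;DR: The study develops a dynamic cardiac model framework based on 4D-CTA, combined with digital twin technology, to provide precise, personalized tools for coronary intervention.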

Details Motivation: Existing training systems do not accurately simulate cardiac physiological dynamics, so a model that comprehensively simulates the beating heart is needed. Method: Build dynamic models from 4D-CTA data, manufacture transparent vascular physical models, and develop cardiac output analysis and virtual angiography systems. Result: Morphological consistency between virtual and real angiography reached 80.9%, mean guidewire trajectory errors were below 1.1 mm, and the transparent model showed advantages in CABG training. Conclusion: The patient-specific digital-physical twin approach effectively reproduces the anatomy and dynamics of the coronary vasculature and provides a dynamic environment for education and clinical planning. Abstract: Background and Objective: Precise preoperative planning and effective physician training for coronary interventions are increasingly important. Despite advances in medical imaging technologies, transforming static or limited dynamic imaging data into comprehensive dynamic cardiac models remains challenging. Existing training systems lack accurate simulation of cardiac physiological dynamics. This study develops a comprehensive dynamic cardiac model research framework based on 4D-CTA, integrating digital twin technology, computer vision, and physical model manufacturing to provide precise, personalized tools for interventional cardiology. Methods: Using 4D-CTA data from a 60-year-old female with three-vessel coronary stenosis, we segmented cardiac chambers and coronary arteries, constructed dynamic models, and implemented skeletal skinning weight computation to simulate vessel deformation across 20 cardiac phases. Transparent vascular physical models were manufactured using medical-grade silicone. We developed cardiac output analysis and virtual angiography systems, implemented guidewire 3D reconstruction using binocular stereo vision, and evaluated the system through angiography validation and CABG training applications. Results: Morphological consistency between virtual and real angiography reached 80.9%. Dice similarity coefficients for guidewire motion ranged from 0.741 to 0.812, with mean trajectory errors below 1.1 mm. The transparent model demonstrated advantages in CABG training, allowing direct visualization while simulating beating heart challenges. 
Conclusion: Our patient-specific digital-physical twin approach effectively reproduces both anatomical structures and dynamic characteristics of coronary vasculature, offering a dynamic environment with visual and tactile feedback valuable for education and clinical planning.

[34] VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

Mingxiao Li,Na Su,Fang Qu,Zhizhou Zhong,Ziyang Chen,Zhaopeng Tu,Xiaolong Li

Main category: cs.CV

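TL;DR: The paper proposes VISTA, a new method that uses information-theoretic analysis to reveal the limitations of the implicit alignment objective of cross-entropy loss in existing MLLMs, and designs an explicit alignment objective to improve cross-modal information fusion.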

Details Motivation: Current multimodal large language models (MLLMs) are biased toward textual information in modality alignment, which impairs cross-modal information fusion. Method: Analyze the implicit alignment objective of cross-entropy loss through information theory, and propose VISTA, which introduces an explicit alignment objective to maximize cross-modal mutual information. Result: VISTA markedly improves the visual understanding of existing MLLMs and outperforms baseline models on multiple benchmark datasets such as VQAv2, MMStar, and MME. Conclusion: VISTA opens a new direction for modality alignment research in MLLMs and requires no additional trainable modules or data, making it efficient and practical. Abstract: Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.
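The abstract does not give VISTA's training objective, so the sketch below only illustrates the quantity it refers to: the mutual information I(X;Y) of two discrete variables, computed from a joint probability table. The function name and the nested-list table layout are assumptions for illustration; estimating mutual information between continuous vision and text representations, as an MLLM would need, requires different machinery.

```python
import math


def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) * p(y))), in nats.

    joint is a nested list where joint[i][j] = p(X=i, Y=j) and all entries sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with p(x,y) = 0 contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


if __name__ == "__main__":
    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent variables
    print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # fully dependent variables
```

Independent variables give zero mutual information, and two perfectly coupled binary variables give log 2 nats, which is why maximizing this quantity ties one modality's representation to the other's.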

[35] Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution

Junyi Yuan,Jian Zhang,Fangyu Wu,Dongming Lu,Huanda Lu,Qiufeng Wang

Main category: cs.CV

TL;DR: The paper introduces CulTi, a multimodal dataset focused on Chinese cultural heritage, and proposes LACLIP to address local alignment in cross-modal retrieval.

Details Motivation: Chinese cultural heritage contains rich multimodal information, but the lack of dedicated datasets limits the development of cross-modal learning models. Method: Builds the CulTi dataset and proposes LACLIP, built on Chinese-CLIP, which strengthens local alignment through weighted similarity scores. Result: LACLIP significantly outperforms existing models on CulTi, especially on fine-grained semantic associations. Conclusion: The CulTi dataset and LACLIP provide effective tools for cross-modal retrieval over Chinese cultural heritage. Abstract: China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.

[36] M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection

Chao Wang,Wei Lu,Xiang Li,Jian Yang,Lei Luo

Main category: cs.CV

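TL;DR: The paper presents M4-SAR, the first dataset for optical-SAR fusion object detection, and develops the E2E-OSDet framework, which significantly improves detection performance in complex environments.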

Details Motivation: Optical and SAR images each have limitations in complex environments; fusing them can improve detection accuracy, but the lack of standardized datasets hinders progress. Method: Builds the M4-SAR dataset with about 112,000 aligned image pairs and nearly one million labeled instances; proposes the E2E-OSDet framework to address cross-domain discrepancies. Result: Experiments show that fusing optical and SAR data improves mAP by 5.7%, with particularly notable gains in complex environments. Conclusion: The M4-SAR dataset and the E2E-OSDet framework provide a standardized benchmark and a strong baseline for optical-SAR fusion object detection. Abstract: Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at https://github.com/wchao0601/M4-SAR.

[37] Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark

Yunkang Cao,Yuqi Cheng,Xiaohao Xu,Yiheng Zhang,Yihan Sun,Yuxiang Tan,Yuxin Zhang,Xiaonan Huang,Weiming Shen

Main category: cs.CV

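TL;DR: M2AD is a new large-scale benchmark dataset for evaluating the robustness of visual anomaly detection (VAD) under interacting viewpoint and illumination conditions, comprising 119,880 high-resolution images.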

Details Motivation: The sensitivity of current VAD systems to real-world imaging variations, especially the complex interplay between viewpoint and illumination, hinders their practical deployment, and existing benchmarks do not adequately address this. Method: Systematically captures 999 specimens across 10 categories under 12 synchronized views and 10 illumination settings (120 configurations in total) to build M2AD, and defines two evaluation protocols: M2AD-Synergy and M2AD-Invariant. Result: Experiments show that existing VAD methods perform markedly worse on M2AD, highlighting the challenge posed by view-illumination interplay. Conclusion: M2AD is an important tool for developing and validating VAD methods that can handle real-world complexity; the dataset and test suite will be released publicly to advance the field. Abstract: The practical deployment of Visual Anomaly Detection (VAD) systems is hindered by their sensitivity to real-world imaging variations, particularly the complex interplay between viewpoint and illumination which drastically alters defect visibility. Current benchmarks largely overlook this critical challenge. We introduce Multi-View Multi-Illumination Anomaly Detection (M2AD), a new large-scale benchmark comprising 119,880 high-resolution images designed explicitly to probe VAD robustness under such interacting conditions. By systematically capturing 999 specimens across 10 categories using 12 synchronized views and 10 illumination settings (120 configurations total), M2AD enables rigorous evaluation. We establish two evaluation protocols: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness against realistic view-illumination effects. Our extensive benchmarking shows that state-of-the-art VAD methods struggle significantly on M2AD, demonstrating the profound challenge posed by view-illumination interplay. This benchmark serves as an essential tool for developing and validating VAD methods capable of overcoming real-world complexities. Our full dataset and test suite will be released at https://hustcyq.github.io/M2AD to facilitate the field.
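
The image count quoted for M2AD follows directly from the capture grid described in the abstract. A minimal arithmetic check, assuming exactly one image per specimen, view, and illumination setting (the variable names are illustrative, not taken from the paper):

```python
# Capture grid as stated in the abstract.
specimens = 999
views = 12
illuminations = 10

# Assumption: each specimen is imaged once per (view, illumination) configuration.
configurations = views * illuminations
total_images = specimens * configurations

print(configurations)  # 120
print(total_images)    # 119880
```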

[38] DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

Weilai Xiang,Hongyu Yang,Di Huang,Yunhong Wang

Main category: cs.CV

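TL;DR: The paper proposes a mechanism called self-conditioning, which uses the denoising network's internal semantic information to guide its decoding layers, improving both the generation quality and the feature representations of diffusion models.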

Details Motivation: Diffusion models excel at image synthesis, but two key questions remain open: whether their generative pre-training can improve the training of the models themselves, and whether their feature quality can surpass that of self-supervised learners. Method: Introduces self-conditioning, which uses the denoising network's internal semantics to guide the decoding layers, forming a tighter bottleneck that improves generation quality. Result: Experiments show improvements in both generation FID and recognition accuracy with only 1% additional computational overhead, and the method applies to multiple diffusion architectures. Conclusion: Self-conditioning successfully integrates discriminative techniques such as contrastive self-distillation, making diffusion models strong representation learners without sacrificing generation quality. Abstract: While diffusion models have gained prominence in image synthesis, their generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. However, two key questions remain: 1) Can these representations be leveraged to improve the training of diffusion models themselves, rather than solely benefiting downstream tasks? 2) Can the feature quality be enhanced to rival or even surpass modern self-supervised learners, without compromising generative capability? This work addresses these questions by introducing self-conditioning, a straightforward yet effective mechanism that internally leverages the rich semantics inherent in denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures. Crucially, self-conditioning facilitates an effective integration of discriminative techniques, such as contrastive self-distillation, directly into diffusion models without sacrificing generation quality. Extensive experiments on pixel-space and latent-space datasets show that in linear evaluations, our enhanced diffusion models, particularly UViT and DiT, serve as strong representation learners, surpassing various self-supervised models.

[39] ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

Bo Du,Xuekang Zhu,Xiaochen Ma,Chenfan Qu,Kaiwen Feng,Zhe Yang,Chi-Man Pun,Jian Liu,Jizhe Zhou

Main category: cs.CV

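TL;DR: ForensicHub is the first unified benchmark and codebase for all-domain fake image detection and localization, addressing the fragmentation across domains.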

Details Motivation: The field of fake image detection and localization (FIDL) is heavily fragmented and lacks a unified benchmark and codebase, which prevents cross-domain comparison and hinders overall progress. Method: Proposes a modular, configuration-driven architecture that decomposes forensic pipelines into interchangeable components; implements 10 baseline models and 6 backbones, adds 2 new benchmarks, and integrates existing benchmarks; achieves cross-domain compatibility through an adapter-based design. Result: Provides 8 key actionable insights covering model architecture, dataset characteristics, and evaluation standards. Conclusion: ForensicHub breaks down the domain silos in FIDL and lays a foundation for future breakthroughs. Abstract: The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.

[40] Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

Zongye Zhang,Bohan Kong,Qingjie Liu,Yunhong Wang

Main category: cs.CV

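TL;DR: MoMADiff combines masked modeling with diffusion to generate high-quality 3D human motion, supports user-provided keyframe control, and outperforms existing methods on multiple datasets.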

Details Motivation: Existing methods perform poorly on out-of-distribution motions, and both discrete-token and continuous-representation approaches have limitations, so a more robust generation framework is needed. Method: Proposes the MoMADiff framework, which combines masked modeling with diffusion processes, generates motion using frame-level continuous representations, and supports user-specified keyframes. Result: Outperforms existing methods on multiple datasets, with notable gains in motion quality, instruction fidelity, and keyframe adherence. Conclusion: By combining masked modeling with diffusion, MoMADiff achieves high-quality, controllable 3D human motion generation with strong generalization. Abstract: Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence.

[41] WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

An-Lan Wang,Jingqun Tang,Liao Lei,Hao Feng,Qi Liu,Xiang Fei,Jinghui Lu,Han Wang,Weiwei Liu,Hao Liu,Yuliang Liu,Xiang Bai,Can Huang

Main category: cs.CV

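TL;DR: WildDoc is the first benchmark designed for document understanding in natural environments; it contains diverse document images captured in real-world scenes, and evaluations show that current MLLMs suffer substantial performance drops under real-world conditions.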

Details Motivation: Existing benchmarks (e.g., DocVQA and ChartQA) are mainly based on scanned or digital documents and do not adequately reflect complex real-world challenges such as variable illumination and physical distortions. Method: WildDoc manually captures diverse real-world document images and reuses document sources from existing benchmarks for comparison; each document is captured four times under different conditions to evaluate model robustness. Result: State-of-the-art MLLMs show substantial performance drops on WildDoc, revealing insufficient robustness. Conclusion: WildDoc reveals the unique challenges of real-world document understanding and provides an important benchmark for future research. Abstract: The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at https://bytedance.github.io/WildDoc.

[42] Rethinking the Mean Teacher Strategy from the Perspective of Self-paced Learning

Pengchen Zhang,Alan J. X. Guo,Sipin Luo,Zhe Han,Lin Guo

Main category: cs.CV

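TL;DR: The paper proposes dual teacher-student learning (DTSL), a framework that improves semi-supervised medical image segmentation by combining output agreement between a temporally lagged model and a cross-architectural model.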

Details Motivation: Reduce the cost of manual annotation in medical image segmentation while improving semi-supervised learning. Method: Uses the DTSL framework, which combines output agreement between temporally lagged and cross-architectural models and generates pseudo-labels using Jensen-Shannon divergence. Result: Outperforms existing methods on multiple datasets; ablation studies validate the proposed modules. Conclusion: DTSL significantly improves semi-supervised medical image segmentation through a flexible self-paced learning mechanism. Abstract: Semi-supervised medical image segmentation has attracted significant attention due to its potential to reduce manual annotation costs. The mean teacher (MT) strategy, commonly understood as introducing smoothed, temporally lagged consistency regularization, has demonstrated strong performance across various tasks in this field. In this work, we reinterpret the MT strategy on supervised data as a form of self-paced learning, regulated by the output agreement between the temporally lagged teacher model and the ground truth labels. This idea is further extended to incorporate agreement between a temporally lagged model and a cross-architectural model, which offers greater flexibility in regulating the learning pace and enables application to unlabeled data. Specifically, we propose dual teacher-student learning (DTSL), a framework that introduces two groups of teacher-student models with different architectures. The output agreement between the cross-group teacher and student models is used as pseudo-labels, generated via a Jensen-Shannon divergence-based consensus label generator (CLG). Extensive experiments on popular datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches. Ablation studies further validate the effectiveness of the proposed modules.
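
The abstract states only that pseudo-labels come from a Jensen-Shannon divergence-based consensus between two models. The sketch below illustrates that general idea for a single pixel and should not be read as the paper's CLG: the function names, the `max_jsd` threshold, and the rule of taking the argmax of the averaged prediction are all assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms where p is zero contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def consensus_label(p, q, max_jsd=0.05):
    """Hypothetical consensus rule: accept the argmax of the averaged
    prediction only when the two models agree closely enough."""
    if js_divergence(p, q) > max_jsd:
        return None  # models disagree; no pseudo-label for this pixel
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return max(range(len(m)), key=m.__getitem__)

# Two class-probability vectors for one pixel (made-up numbers).
print(consensus_label([0.9, 0.1], [0.8, 0.2]))  # 0
print(consensus_label([0.9, 0.1], [0.1, 0.9]))  # None
```

The divergence is symmetric and bounded by ln 2, which makes a single fixed threshold easier to reason about than with plain KL divergence.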

[43] Classifying Shelf Life Quality of Pineapples by Combining Audio and Visual Features

Yi-Lu Jiang,Wen-Chang Chang,Ching-Lin Wang,Kung-Liang Hsu,Chih-Yi Chiu

Main category: cs.CV

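TL;DR: Proposes a multimodal, multiview classification model for non-destructive assessment of pineapple shelf life quality, classifying pineapples into four quality levels from audio and visual features.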

Details Motivation: Reduce waste and increase income by determining the shelf life quality of pineapples with non-destructive methods. Method: Builds a multimodal, multiview classification model on audio and visual features, trained cross-modally with a modified contrastive audiovisual masked autoencoder. Result: The cross-modal model reaches 84% accuracy under audio-major sampling, outperforming the unimodal models. Conclusion: The multimodal approach performs best for pineapple quality classification and has practical potential. Abstract: Determining the shelf life quality of pineapples using non-destructive methods is a crucial step to reduce waste and increase income. In this paper, a multimodal and multiview classification model was constructed to classify pineapples into four quality levels based on audio and visual characteristics. For research purposes, we compiled and released the PQC500 dataset consisting of 500 pineapples with two modalities: one was tapping pineapples to record sounds by multiple microphones and the other was taking pictures by multiple cameras at different locations, providing multimodal and multi-view audiovisual features. We modified the contrastive audiovisual masked autoencoder to train the cross-modal-based classification model by abundant combinations of audio and visual pairs. In addition, we proposed to sample a compact size of training data for efficient computation. The experiments were evaluated under various data and model configurations, and the results demonstrated that the proposed cross-modal model trained using audio-major sampling can yield 84% accuracy, outperforming the unimodal models of only audio and only visual by 6% and 18%, respectively.

[44] CleanPatrick: A Benchmark for Image Data Cleaning

Fabian Gröger,Simone Lionetti,Philippe Gottfrois,Alvaro Gonzalez-Jimenez,Ludovic Amruthalingam,Elisabeth Victoria Goessinger,Hanna Lindemann,Marie Bargiela,Marie Hofbauer,Omar Badri,Philipp Tschandl,Arash Koochek,Matthew Groh,Alexander A. Navarini,Marc Pouly

Main category: cs.CV

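TL;DR: CleanPatrick is the first large-scale benchmark for image data cleaning, built on the Fitzpatrick17k dermatology dataset; it derives high-quality ground truth from crowdsourced annotations and expert review and evaluates a range of cleaning methods.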

Details Motivation: Existing image data cleaning benchmarks rely on synthetic noise or small-scale human studies, which limits realism and comparability. Method: Collects 496,377 crowdsourced annotations to identify off-topic samples, near-duplicates, and label errors, then derives the benchmark using an item-response-theory-inspired model followed by expert review. Result: Self-supervised representations excel at near-duplicate detection, classical methods detect off-topic samples effectively under limited budgets, and label-error detection remains a challenge. Conclusion: CleanPatrick enables systematic comparison of image-cleaning strategies and advances the reliability of data-centric AI. Abstract: Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
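
The abstract says that issue detection is framed as a ranking task scored with typical ranking metrics, without naming them here. As an illustration of what such scoring can look like, the sketch below computes precision at k and average precision over a ranked list of samples; choosing these two metrics is an assumption, and the function names and toy ranking are hypothetical.

```python
def precision_at_k(ranked_flags, k):
    """Fraction of true issues among the top-k ranked samples (assumes 1 <= k <= len)."""
    return sum(ranked_flags[:k]) / k

def average_precision(ranked_flags):
    """Mean of precision values taken at the rank of each true issue."""
    hits = 0
    total = 0.0
    for rank, is_issue in enumerate(ranked_flags, start=1):
        if is_issue:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy ranking from a cleaning method: True marks a sample that really has an issue.
ranking = [True, False, True, False, False]
print(precision_at_k(ranking, 2))            # 0.5
print(round(average_precision(ranking), 3))  # 0.833
```

Precision at a small k mirrors an auditor with a fixed review budget, while average precision summarizes the whole ranking.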

[45] Artifacts of Idiosyncracy in Global Street View Data

Tim Alpherts,Sennay Ghebreab,Nanne van Noord

Main category: cs.CV

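TL;DR: The study finds that even when street view data densely covers 28 cities worldwide, city-specific characteristics such as layout still introduce bias. A quantitative analysis and an Amsterdam case study reveal biases in the data collection process and their effects.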

Details Motivation: Street view data is widely used in computer vision, but existing datasets are assumed to offer systematic coverage while coverage is actually uneven, and city-specific characteristics in particular can cause bias. Method: Quantitatively analyzes biases in the distribution of street view coverage and proposes an evaluation method; studies how the collection process affects the representation of cities through semi-structured interviews in an Amsterdam case study. Result: City characteristics such as layout lead to biases in street view data even under dense coverage; the case study identifies specific sources of bias in the collection process. Conclusion: City characteristics can bias street view data; quantitative evaluation and case studies are needed to improve data collection and reduce bias. Abstract: Street view data is increasingly being used in computer vision applications in recent years. Machine learning datasets are collected for these applications using simple sampling techniques. These datasets are assumed to be a systematic representation of cities, especially when densely sampled. Prior works, however, show that there are clear gaps in coverage, with certain cities or regions being covered poorly or not at all. Here we demonstrate that a city's idiosyncracies, such as city layout, may lead to biases in street view data for 28 cities across the globe, even when they are densely covered. We quantitatively uncover biases in the distribution of coverage of street view data and propose a method for evaluation of such distributions to get better insight in idiosyncracies in a city's coverage. In addition, we perform a case study of Amsterdam with semi-structured interviews, showing how idiosyncracies of the collection process impact representation of cities and regions and allowing us to address biases at their source.

[46] CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs

David Méndez,Gianpaolo Bontempo,Elisa Ficarra,Roberto Confalonieri,Natalia Díaz-Rodríguez

Main category: cs.CV

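TL;DR: CUBIC is a new method that needs neither predefined bias candidates nor failure samples; it uses an image-text latent space and linear classifier probes to automatically discover interpretable concepts that may influence model predictions.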

Details Motivation: Deep learning models often learn biases from spurious correlations in datasets, and existing methods require extensive manual annotation. CUBIC aims to identify these biases automatically without annotations. Method: CUBIC uses an image-text latent space and linear classifier probes, analyzing how the latent representation of a superclass label shifts in order to identify concepts that significantly influence model predictions. Result: Experiments show that CUBIC effectively uncovers unknown biases without prior knowledge of the biases or samples on which the model fails. Conclusion: CUBIC offers a bias identification method that needs no human intervention and suits settings without annotated data. Abstract: Deep vision models often rely on biases learned from spurious correlations in datasets. To identify these biases, methods that interpret high-level, human-understandable concepts are more effective than those relying primarily on low-level features like heatmaps. A major challenge for these concept-based methods is the lack of image annotations indicating potentially bias-inducing concepts, since creating such annotations requires detailed labeling for each dataset and concept, which is highly labor-intensive. We present CUBIC (Concept embeddings for Unsupervised Bias IdentifiCation), a novel method that automatically discovers interpretable concepts that may bias classifier behavior. Unlike existing approaches, CUBIC does not rely on predefined bias candidates or examples of model failures tied to specific biases, as such information is not always available. Instead, it leverages image-text latent space and linear classifier probes to examine how the latent representation of a superclass label (shared by all instances in the dataset) is influenced by the presence of a given concept. By measuring these shifts against the normal vector to the classifier's decision boundary, CUBIC identifies concepts that significantly influence model predictions. Our experiments demonstrate that CUBIC effectively uncovers previously unknown biases using Vision-Language Models (VLMs) without requiring the samples in the dataset where the classifier underperforms or prior knowledge of potential biases.
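
The abstract describes measuring how a concept shifts the superclass embedding "against the normal vector to the classifier's decision boundary". One plausible reading, sketched below under stated assumptions, is a signed scalar projection of the shift onto the unit normal of a linear probe. The name `bias_score`, the toy two-dimensional embeddings, and the use of a plain scalar projection are illustrative assumptions, not details confirmed by the abstract.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def bias_score(base_embedding, concept_embedding, normal):
    """Signed length of the embedding shift along the probe's unit normal.

    base_embedding:    embedding of the superclass label alone (assumed input)
    concept_embedding: embedding of the label combined with a concept (assumed input)
    normal:            weight vector of a linear probe, normal to its boundary
    """
    shift = [c - b for b, c in zip(base_embedding, concept_embedding)]
    return dot(shift, normal) / math.sqrt(dot(normal, normal))

base = [1.0, 0.0]
normal = [0.0, 2.0]
print(bias_score(base, [1.0, 2.0], normal))  # 2.0 (shift crosses toward one class)
print(bias_score(base, [3.0, 0.0], normal))  # 0.0 (shift parallel to the boundary)
```

Under this reading, a concept whose shift runs parallel to the boundary scores zero however large the shift is, since it cannot change the probe's prediction.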

[47] HSRMamba: Efficient Wavelet Stripe State Space Model for Hyperspectral Image Super-Resolution

Baisong Li,Xingwang Wang,Haixiao Xu

Main category: cs.CV

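TL;DR: HSRMamba improves the Visual Mamba model with strip-based scanning and wavelet decomposition, reducing artifacts and improving super-resolution performance.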

Details Motivation: Address the artifacts caused by 1D scanning when Visual Mamba is applied to single hyperspectral image super-resolution. Method: Introduces a strip-based scanning scheme and wavelet decomposition, preserving computational efficiency while alleviating conflicts between feature modalities. Result: Experiments show that HSRMamba outperforms existing methods in both computational load and performance. Conclusion: HSRMamba achieves state-of-the-art results on the super-resolution task. Abstract: Single hyperspectral image super-resolution (SHSR) aims to restore high-resolution images from low-resolution hyperspectral images. Recently, the Visual Mamba model has achieved an impressive balance between performance and computational efficiency. However, due to its 1D scanning paradigm, the model may suffer from potential artifacts during image generation. To address this issue, we propose HSRMamba. While maintaining the computational efficiency of Visual Mamba, we introduce a strip-based scanning scheme to effectively reduce artifacts from global unidirectional scanning. Additionally, HSRMamba uses wavelet decomposition to alleviate modal conflicts between high-frequency spatial features and low-frequency spectral features, further improving super-resolution performance. Extensive experiments show that HSRMamba not only excels in reducing computational load and model size but also outperforms existing methods, achieving state-of-the-art results.

[48] Towards Self-Improvement of Diffusion Models via Group Preference Optimization

Renjie Chen,Wenfeng Lin,Yichen Zhang,Jiangchuan Wei,Boyuan Liu,Chao Feng,Jiao Ran,Mingyu Guo

Main category: cs.CV

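TL;DR: The paper proposes Group Preference Optimization (GPO), which extends pairwise preferences to groupwise preferences and adds reward standardization, addressing DPO's sensitivity to preference pairs and its data collection burden in text-to-image generation and significantly improving generation quality.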

Details Motivation: DPO for text-to-image generation faces two challenges: sensitivity to preference pairs and the cost of collecting high-quality data; pairs with small differences in particular can degrade performance. Method: Proposes GPO, which extends DPO from pairwise to groupwise preferences and uses reward standardization for reweighting, without explicit data selection. Result: GPO significantly improves Stable Diffusion 3.5 Medium on counting and text rendering, raising accuracy by 20 percentage points. Conclusion: GPO is a self-improvement method with no extra inference overhead, applicable to a range of diffusion models and tasks. Abstract: Aligning text-to-image (T2I) diffusion models with Direct Preference Optimization (DPO) has shown notable improvements in generation quality. However, applying DPO to T2I faces two challenges: the sensitivity of DPO to preference pairs and the labor-intensive process of collecting and annotating high-quality data. In this work, we demonstrate that preference pairs with marginal differences can degrade DPO performance. Since DPO relies exclusively on relative ranking while disregarding the absolute difference of pairs, it may misclassify losing samples as wins, or vice versa. We empirically show that extending the DPO from pairwise to groupwise and incorporating reward standardization for reweighting leads to performance gains without explicit data selection. Furthermore, we propose Group Preference Optimization (GPO), an effective self-improvement method that enhances performance by leveraging the model's own capabilities without requiring external data. Extensive experiments demonstrate that GPO is effective across various diffusion models and tasks. Specifically, combining with widely used computer vision models, such as YOLO and OCR, the GPO improves the accurate counting and text rendering capabilities of the Stable Diffusion 3.5 Medium by 20 percentage points. Notably, as a plug-and-play method, no extra overhead is introduced during inference.
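
The abstract mentions reward standardization within a group of samples as the reweighting step but gives no formula. A common way to standardize is the z-score over the group, sketched below; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration and may differ from the paper's formulation.

```python
from statistics import mean, pstdev

def standardize_rewards(rewards, eps=1e-8):
    """Z-score each reward within its group: (r - mean) / (std + eps).

    Assumption: population standard deviation; eps avoids division by
    zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for a group of images generated from one prompt (made-up numbers).
weights = standardize_rewards([1.0, 2.0, 3.0])
print(weights[1])  # 0.0, the middle sample sits exactly at the group mean

# A group with identical rewards yields all-zero weights.
print(standardize_rewards([2.0, 2.0, 2.0]))  # [0.0, 0.0, 0.0]
```

Unlike a pairwise win/loss label, the standardized value keeps the size of each sample's gap to the group mean, which is the information the abstract says plain DPO discards.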

[49] Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation

Jianghang Lin,Yilin Lu,Yunhang Shen,Chaoyang Zhu,Shengchuan Zhang,Liujuan Cao,Rongrong Ji

Main category: cs.CV

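TL;DR: The paper proposes PL-DC, a new framework that addresses unstable pseudo-label quality in semi-supervised instance segmentation and significantly improves performance through decoupling and dynamic correction mechanisms.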

Details Motivation: Semi-supervised instance segmentation (SSIS) performs unstably with limited labeled data, mainly because the class quality and mask quality of pseudo-labels are hard to balance. Method: PL-DC has three parts: 1) an instance-level decoupled dual-threshold filtering mechanism; 2) a category-level dynamic correction module; 3) a pixel-level mask uncertainty-aware mechanism. Result: On COCO and Cityscapes, PL-DC significantly improves performance, especially with little labeled data (e.g., +11.6 mAP with 1% of COCO labels). Conclusion: By decoupling and dynamically correcting pseudo-labels, PL-DC effectively resolves the performance instability in SSIS and sets new state-of-the-art results. Abstract: Semi-Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances using limited labeled data. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo-labels of instance categories and pixel masks. We find that the prevalent practice of filtering instance pseudo-labels assessing both class and mask quality with a single score threshold frequently leads to compromises in the trade-off between the qualities of class and mask labels. In this paper, we introduce a novel Pseudo-Label Quality Decoupling and Correction (PL-DC) framework for SSIS to tackle the above challenges. Firstly, at the instance level, a decoupled dual-threshold filtering mechanism is designed to decouple class and mask quality estimations for instance-level pseudo-labels, thereby independently controlling pixel classifying and grouping qualities. Secondly, at the category level, we introduce a dynamic instance category correction module to dynamically correct the pseudo-labels of instance categories, effectively alleviating category confusion. Lastly, we introduce a pixel-level mask uncertainty-aware mechanism at the pixel level to re-weight the mask loss for different pixels, thereby reducing the impact of noise introduced by pixel-level mask pseudo-labels. Extensive experiments on the COCO and Cityscapes datasets demonstrate that the proposed PL-DC achieves significant performance improvements, setting new state-of-the-art results for SSIS. Notably, our PL-DC shows substantial gains even with minimal labeled data, achieving an improvement of +11.6 mAP with just 1% COCO labeled data and +15.5 mAP with 5% Cityscapes labeled data. The code will be public.
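
The decoupled dual-threshold idea from the PL-DC abstract can be illustrated with a small sketch: instead of one score gating both supervision signals, class quality and mask quality are thresholded independently. The field names, threshold values, and the plain list-based filtering are hypothetical and only illustrate the contrast with a single-threshold filter.

```python
def filter_pseudo_labels(instances, class_thr=0.7, mask_thr=0.5):
    """Select pseudo-labels for the class loss and the mask loss independently.

    Assumption: each instance carries separate 'class_score' and
    'mask_score' quality estimates in [0, 1].
    """
    class_keep = [i["id"] for i in instances if i["class_score"] >= class_thr]
    mask_keep = [i["id"] for i in instances if i["mask_score"] >= mask_thr]
    return class_keep, mask_keep

# Made-up pseudo-labels predicted by a teacher model on one unlabeled image.
instances = [
    {"id": "a", "class_score": 0.9, "mask_score": 0.3},  # sure of class, poor mask
    {"id": "b", "class_score": 0.4, "mask_score": 0.8},  # good mask, unsure class
    {"id": "c", "class_score": 0.8, "mask_score": 0.9},  # good on both
]
class_keep, mask_keep = filter_pseudo_labels(instances)
print(class_keep)  # ['a', 'c']
print(mask_keep)   # ['b', 'c']
```

A single joint threshold would have to either drop instances "a" and "b" entirely or accept their weak halves, which is the trade-off the abstract describes.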

[50] Hybrid-Emba3D: Geometry-Aware and Cross-Path Feature Hybrid Enhanced State Space Model for Point Cloud Classification

Bin Liu,Chunyang Wang,Xuelian Liu,Guan Xi,Ge Zhang,Ziteng Yao,Mengxue Dong

Main category: cs.CV

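TL;DR: Hybrid-Emba3D is a bidirectional Mamba model that uses geometry-feature coupling and cross-path feature hybridization to resolve the tension between local geometric feature extraction and model complexity in point cloud classification, reaching 95.99% classification accuracy.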

Details Motivation: Point cloud classification must extract local geometric features efficiently while keeping model complexity in check. The Mamba architecture balances global modeling capability, but its unidirectional dependency conflicts with the unordered nature of point clouds, limiting the modeling of local spatial correlations. Method: Proposes Hybrid-Emba3D, which combines a geometry-feature coupling mechanism with dual-path feature hybridization to strengthen local feature discrimination and overcome the long-range modeling limits of traditional SSMs. Result: Reaches 95.99% classification accuracy on ModelNet40 with only 0.03M additional parameters. Conclusion: Through geometry-feature coupling and cross-path hybridization, Hybrid-Emba3D significantly improves point cloud classification and offers a new approach to efficient local feature extraction. Abstract: The point cloud classification tasks face the dual challenge of efficiently extracting local geometric features while maintaining model complexity. The Mamba architecture utilizes the linear complexity advantage of state space models (SSMs) to overcome the computational bottleneck of Transformers while balancing global modeling capabilities. However, the inherent contradiction between its unidirectional dependency and the unordered nature of point clouds impedes modeling spatial correlation in local neighborhoods, thus constraining geometric feature extraction. This paper proposes Hybrid-Emba3D, a bidirectional Mamba model enhanced by geometry-feature coupling and cross-path feature hybridization. The Local geometric pooling with geometry-feature coupling mechanism significantly enhances local feature discriminative power via coordinated propagation and dynamic aggregation of geometric information between local center points and their neighborhoods, without introducing additional parameters. The designed Collaborative feature enhancer adopts dual-path hybridization, effectively handling local mutations and sparse key signals, breaking through the limitations of traditional SSM long-range modeling. Experimental results demonstrate that the proposed model achieves a new SOTA classification accuracy of 95.99% on ModelNet40 with only 0.03M additional parameters.

[51] MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Florinel-Alin Croitoru,Vlad Hondru,Marius Popescu,Radu Tudor Ionescu,Fahad Shahbaz Khan,Mubarak Shah

Main category: cs.CV

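TL;DR: Presents MAVOS-DD, the first large-scale open-set benchmark for multilingual audio-video deepfake detection, comprising 250 hours of real and fake videos in eight languages, 60% of which is generated. Experiments show that existing detectors degrade in open-set scenarios.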

Details Motivation: Deepfake detection research lacks benchmarks for multilingual and open-set scenarios; filling this gap is needed to drive more robust detection methods. Method: Builds a 250-hour dataset covering eight languages and seven generative models, with training, validation, and test splits organized to simulate open-set scenarios, and tests a range of pre-trained and fine-tuned detectors. Result: State-of-the-art detectors degrade significantly in open-set scenarios and cannot maintain their original performance. Conclusion: Multilingual open-set deepfake detection is challenging and calls for further research on detector generalization. Abstract: We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.

Massimiliano Cassia,Luca Guarnera,Mirko Casu,Ignazio Zangara,Sebastiano Battiato

Main category: cs.CV

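TL;DR: This paper proposes a novel forensic framework based on interpretable feature analysis for identifying the training dataset (e.g., CelebA or FFHQ) of GAN-generated images, reaching 98-99% accuracy in both real-vs-synthetic classification and multi-class dataset attribution.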

Details Motivation: GAN-generated synthetic media make it hard to verify authenticity and trace dataset origins, raising critical concerns for copyright enforcement, privacy protection, and legal compliance. Method: The pipeline combines spectral transforms (Fourier/DCT), color distribution metrics, and local feature descriptors (SIFT) to extract discriminative statistical signatures from synthetic images, then classifies them with supervised classifiers (Random Forest, SVM, XGBoost). Result: Frequency-domain features (DCT/FFT) are the most effective at capturing dataset-specific artifacts such as upsampling patterns and spectral irregularities, while color histograms reveal implicit regularization strategies in GAN training. Conclusion: The framework strengthens accountability and governance for generative models, with applications in digital forensics, content moderation, and intellectual property litigation. Abstract: Synthetic media generated by Generative Adversarial Networks (GANs) pose significant challenges in verifying authenticity and tracing dataset origins, raising critical concerns in copyright enforcement, privacy protection, and legal compliance. This paper introduces a novel forensic framework for identifying the training dataset (e.g., CelebA or FFHQ) of GAN-generated images through interpretable feature analysis. By integrating spectral transforms (Fourier/DCT), color distribution metrics, and local feature descriptors (SIFT), our pipeline extracts discriminative statistical signatures embedded in synthetic outputs. Supervised classifiers (Random Forest, SVM, XGBoost) achieve 98-99% accuracy in binary classification (real vs. synthetic) and multi-class dataset attribution across diverse GAN architectures (StyleGAN, AttGAN, GDWCT, StarGAN, and StyleGAN2). Experimental results highlight the dominance of frequency-domain features (DCT/FFT) in capturing dataset-specific artifacts, such as upsampling patterns and spectral irregularities, while color histograms reveal implicit regularization strategies in GAN training. We further examine legal and ethical implications, showing how dataset attribution can address copyright infringement, unauthorized use of personal data, and regulatory compliance under frameworks like GDPR and California's AB 602. Our framework advances accountability and governance in generative modeling, with applications in digital forensics, content moderation, and intellectual property litigation.

[53] Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing

Mathis Jürgen Adler,Leonard Hackel,Gencer Sumbul,Begüm Demir

Main category: cs.CV

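TL;DR: The paper proposes a weighted feature aggregation (WFA) strategy for vision-language model (VLM) pretraining in remote sensing (RS) that reduces redundant information and improves efficiency.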

Details Motivation: In remote sensing, multiple captions per image introduce redundancy; the goal is to remove it and reduce pretraining and inference time. Method: Adaptive importance weights are computed with two techniques: non-parametric uniqueness and learning-based attention. Result: Experiments show that the WFA strategy is efficient and effective on text-to-image retrieval. Conclusion: The paper provides guidelines for choosing the appropriate technique according to task requirements and resource constraints, and releases the code. Abstract: The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences and reduce the influence of repetitive ones. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for selecting appropriate techniques depending on downstream task requirements and resource constraints. The code of this work is publicly available at https://git.tu-berlin.de/rsim/redundacy-aware-rs-vlm.
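The non-parametric uniqueness idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the clipped unigram precision is a crude stand-in for BLEU, and the softmax over "one minus mean overlap" as well as the function names `unigram_overlap`, `uniqueness_weights`, and `aggregate` are assumptions made for illustration only.

```python
import math
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> float:
    """Clipped unigram precision of candidate against reference (a BLEU-1-like stand-in)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return matched / len(cand_tokens)


def uniqueness_weights(captions: list[str]) -> list[float]:
    """Assumed weighting: softmax over (1 - mean overlap with the other captions)."""
    scores = []
    for i, caption in enumerate(captions):
        others = [c for j, c in enumerate(captions) if j != i]
        mean_sim = sum(unigram_overlap(caption, o) for o in others) / len(others) if others else 0.0
        scores.append(1.0 - mean_sim)
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted sum of per-caption feature vectors into one text feature per image."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


captions = [
    "a river next to a forest",
    "a river next to a forest",
    "an airport with two runways",
]
weights = uniqueness_weights(captions)
pooled = aggregate([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], weights)
```

In this toy example the two duplicated captions share a lower weight than the distinct third caption, which is the behavior the abstract describes for emphasizing unique sentences.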

[54] PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video

Makoto Yamada,Kian Ming A. Chai,Ayoub Rhim,Satoki Ishikawa,Mohammad Sabokrou,Yao-Hung Hubert Tsai

Main category: cs.CV

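TL;DR: PhiNet v2 is a Transformer-based architecture that processes continuous image sequences without relying on strong data augmentation and achieves performance competitive with state-of-the-art vision models.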

Details Motivation: Existing self-supervised learning methods have not fully exploited insights from biological visual systems; PhiNet v2 aims to come closer to how humans process visual input. Method: The model uses a Transformer architecture and variational inference to learn robust visual representations from continuous image sequences. Result: PhiNet v2 performs on par with state-of-the-art vision models without strong data augmentation. Conclusion: The work advances computer vision models that more closely resemble biological visual systems. Abstract: Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, similar to human visual processing. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.

[55] One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework

Feiran Li,Qianqian Xu,Shilong Bao,Zhiyong Yang,Xiaochun Cao,Qingming Huang

Main category: cs.CV

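TL;DR: The paper proposes Co-Erasing, a text-image collaborative concept erasing framework that uses visual supervision to improve erasure efficacy while reducing interference with other benign concepts.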

Details Motivation: Current concept erasing methods based on text prompts face a trade-off between efficacy and usability, which stems mainly from the knowledge gap between the text and image modalities. Method: The framework combines text prompts with visual supervision from the corresponding undesirable images, lowers the generation probability of the target concept through negative guidance, and adds a text-guided image concept refinement strategy. Result: Experiments show that Co-Erasing achieves a better balance between erasure efficacy and usability and significantly outperforms existing methods. Conclusion: Through collaborative text and image supervision, Co-Erasing addresses the modality gap in concept erasing and improves both erasure efficacy and model usability. Abstract: Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability). In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability. Codes are available at https://github.com/Ferry-Li/Co-Erasing.

[56] Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans

Yansheng Qiu,Li Xiao,Zhaopan Xu,Pengfei Zhou,Zheng Wang,Kaipeng Zhang

Main category: cs.CV

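TL;DR: The paper proposes Human-Aligned Bench, a benchmark for evaluating the fine-grained alignment between model and human performance on multimodal reasoning tasks.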

Details Motivation: It remains unclear whether current large language models (LLMs) and multimodal large language models (MLLMs) have reasoning abilities comparable to those of humans. Method: The authors collect 9,794 multimodal questions that rely solely on contextual reasoning, covering bilingual (Chinese and English) multimodal questions and pure text-based questions across four question types. Each question comes with human success rates and the options humans are prone to choosing incorrectly. Result: Experiments reveal notable differences between current MLLMs and human performance on multimodal reasoning tasks. Conclusion: The benchmark findings provide insights for the development of next-generation models. Abstract: The goal of achieving Artificial General Intelligence (AGI) is to imitate humans and surpass them. Models such as OpenAI's o1, o3, and DeepSeek's R1 have demonstrated that large language models (LLMs) with human-like reasoning capabilities exhibit exceptional performance and are being gradually integrated into multimodal large language models (MLLMs). However, whether these models possess capabilities comparable to humans in handling reasoning tasks remains unclear at present. In this paper, we propose Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance. Specifically, we collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions, encompassing four question types: visual reasoning, definition judgment, analogical reasoning, and logical judgment. More importantly, each question is accompanied by human success rates and options that humans are prone to choosing incorrectly. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance. The findings on our benchmark provide insights into the development of the next-generation models.

[57] Learning Dense Hand Contact Estimation from Imbalanced Data

Daniel Sungho Jung,Kyoung Mu Lee

Main category: cs.CV

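TL;DR: The paper proposes HACO, a framework that tackles class and spatial imbalance in hand contact estimation through balanced contact sampling and a vertex-level class-balanced loss.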

Details Motivation: Hand contact is essential to human interaction, but existing datasets suffer from class and spatial imbalance, which hurts model generalization. Method: Balanced contact sampling addresses the class imbalance, and a vertex-level class-balanced (VCB) loss addresses the spatial imbalance. Result: The framework learns dense hand contact estimation without suffering from the imbalance issues. Conclusion: HACO effectively resolves the imbalance problems in hand contact estimation; the code will be released. Abstract: Hands are essential to human interaction, and understanding contact between hands and the world can promote comprehensive understanding of their function. Recently, there has been a growing number of hand interaction datasets that cover interaction with objects, the other hand, scenes, and the body. Despite the significance of the task and the increasing amount of high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, hand contact datasets exhibit a class imbalance issue, where the majority of samples are not in contact. Second, hand contact datasets contain a spatial imbalance issue, with most hand contact exhibited at the fingertips, resulting in challenges for generalization towards contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact samples. Moreover, to address the spatial imbalance issue, we propose a vertex-level class-balanced (VCB) loss, which incorporates the spatially varying contact distribution by separately reweighting the loss contribution of each vertex based on its contact frequency across the dataset. As a result, we effectively learn to predict dense hand contact estimation with large-scale hand contact data without suffering from class and spatial imbalance issues. The codes will be released.

[58] CheX-DS: Improving Chest X-ray Image Classification with Ensemble Learning Based on DenseNet and Swin Transformer

Xinran Li,Yu Liu,Xiujuan Xu,Xiaowei Zhao

Main category: cs.CV

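TL;DR: This paper proposes CheX-DS, a model that combines a CNN and a Transformer for multi-label chest X-ray classification, addresses data imbalance, and achieves strong performance on the NIH ChestX-ray14 dataset.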

Details Motivation: Current methods for automatic chest disease diagnosis rely mainly on CNNs and neglect global features, while self-attention mechanisms perform well in computer vision, so the strengths of both should be combined. Method: The model builds on DenseNet and Swin Transformer, combines the two through ensemble deep learning, and uses weighted binary cross-entropy loss together with asymmetric loss to handle data imbalance. Result: On the NIH ChestX-ray14 dataset the average AUC reaches 83.76%, surpassing previous studies. Conclusion: CheX-DS combines the advantages of CNNs and Transformers, effectively addresses data imbalance, and delivers strong classification performance. Abstract: The automatic diagnosis of chest diseases is a popular and challenging task. Most current methods are based on convolutional neural networks (CNNs), which focus on local features while neglecting global features. Recently, self-attention mechanisms have been introduced into the field of computer vision, demonstrating superior performance. Therefore, this paper proposes an effective model, CheX-DS, for classifying long-tail multi-label data in the medical field of chest X-rays. The model is based on the excellent CNN model DenseNet for medical imaging and the newly popular Swin Transformer model, utilizing ensemble deep learning techniques to combine the two models and leverage the advantages of both CNNs and Transformers. The loss function of CheX-DS combines weighted binary cross-entropy loss with asymmetric loss, effectively addressing the issue of data imbalance. The NIH ChestX-ray14 dataset is selected to evaluate the model's effectiveness. The model outperforms previous studies with an excellent average AUC score of 83.76%, demonstrating its superior performance.
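A minimal sketch of how a weighted binary cross-entropy term and an asymmetric loss term might be combined for multi-label prediction is shown here. The abstract does not state how CheX-DS mixes the two terms, so the convex combination with `alpha`, the per-label `pos_weights`, and the values `gamma_pos=0.0`, `gamma_neg=4.0`, `margin=0.05` are all illustrative assumptions rather than the paper's settings.

```python
import math

EPS = 1e-12  # numerical floor to keep log() finite


def weighted_bce(p: float, y: int, pos_weight: float) -> float:
    """Binary cross-entropy with an extra weight on the positive class."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def asymmetric_loss(p: float, y: int, gamma_pos: float = 0.0,
                    gamma_neg: float = 4.0, margin: float = 0.05) -> float:
    """Asymmetric loss: separate focusing for positives and negatives,
    plus a probability margin that discards very easy negatives."""
    p = min(max(p, EPS), 1.0 - EPS)
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    p_shifted = max(p - margin, 0.0)  # easy negatives (p <= margin) contribute zero
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, EPS))


def combined_loss(probs: list[float], labels: list[int],
                  pos_weights: list[float], alpha: float = 0.5) -> float:
    """Assumed mixing rule: alpha * weighted BCE + (1 - alpha) * asymmetric loss,
    averaged over labels."""
    total = 0.0
    for p, y, w in zip(probs, labels, pos_weights):
        total += alpha * weighted_bce(p, y, w) + (1.0 - alpha) * asymmetric_loss(p, y)
    return total / len(probs)


# Three labels for one image: a rare positive, an easy negative, a hard negative.
loss = combined_loss([0.3, 0.02, 0.7], [1, 0, 0], pos_weights=[5.0, 1.0, 1.0])
```

The positive-class weight raises the cost of missing rare findings, while the asymmetric term suppresses the gradient from the many easy negatives that dominate long-tail multi-label data.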

[59] CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback

Yixin Wan,Kai-Wei Chang

Main category: cs.CV

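TL;DR: CompAlign is a benchmark for evaluating and improving how well text-to-image models generate complex compositional scenes. It contains 900 complex multi-subject prompts and is accompanied by the CompQuest evaluation framework and a model alignment method.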

Details Motivation: Existing T2I models generate high-resolution images but struggle to accurately depict compositional scenes with multiple objects, attributes, and spatial relations, so more effective evaluation and improvement methods are needed. Method: The paper proposes the CompAlign benchmark and the CompQuest evaluation framework, which decomposes complex prompts into atomic sub-questions and uses an MLLM to provide fine-grained feedback; it also proposes a feedback-based alignment framework for improving diffusion models. Result: An evaluation of 9 T2I models shows that they perform poorly on tasks with complex 3D-spatial configurations and that a performance gap exists between open-source and commercial models; after alignment, models improve markedly in compositional accuracy. Conclusion: CompAlign and CompQuest provide effective tools for evaluating and improving compositional image generation, and the alignment method significantly improves model performance. Abstract: State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with 3+ generation subjects with complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes an MLLM to provide fine-grained binary feedback on the correctness of each aspect of generation elements in model-generated images. This enables precise quantification of alignment between generated images and compositional prompts. Furthermore, we propose an alignment framework that uses CompQuest's feedback as preference signals to improve diffusion models' compositional image generation abilities. Using adjustable per-image preferences, our method is easily scalable and flexible for different tasks. Evaluation of 9 T2I models reveals that: (1) models struggle remarkably more with compositional tasks with more complex 3D-spatial configurations, and (2) a noticeable performance gap exists between open-source accessible models and closed-source commercial models. Further empirical study on using CompAlign for model alignment yields promising results: post-alignment diffusion models achieve remarkable improvements in compositional accuracy, especially on complex generation tasks, outperforming previous approaches.

[60] Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning

Yuzhuo Dai,Jiaqi Jin,Zhibin Dong,Siwei Wang,Xinwang Liu,En Zhu,Xihong Yang,Xinbiao Gan,Yu Feng

Main category: cs.CV

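TL;DR: The paper proposes FreeCSL, a framework that addresses prototype shift and semantic inconsistency in incomplete multi-view clustering by learning consensus prototypes and applying heuristic graph clustering, yielding more reliable cluster assignments.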

Details Motivation: In incomplete multi-view clustering, missing data cause prototype shifts within views and semantic inconsistencies across views, and existing methods fail to build a shared semantic space or to exploit view-specific information. Method: FreeCSL builds a shared semantic space through consensus prototype learning and uses heuristic graph clustering to strengthen intra-cluster compactness and inter-cluster separation. Result: Experiments show that FreeCSL outperforms existing methods on incomplete multi-view clustering, with more reliable and robust assignments. Conclusion: Through consensus semantic learning and view-specific cluster design, FreeCSL effectively addresses the challenges of incomplete multi-view clustering. Abstract: In incomplete multi-view clustering (IMVC), missing data induce prototype shifts within views and semantic inconsistencies across views. A feasible solution is to explore cross-view consistency in paired complete observations, further imputing and aligning the similarity relationships inherently shared across views. Nevertheless, existing methods are constrained by two-tiered limitations: (1) Neither instance- nor cluster-level consistency learning constructs a semantic space shared across views to learn consensus semantics. The former enforces cross-view instance alignment and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cross-view cluster counterparts while coarsely handling fine-grained intra-cluster relationships within views. (2) Excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. Thus, we propose an IMVC framework, imputation- and alignment-free for consensus semantics learning (FreeCSL). To bridge semantic gaps across all observations, we learn consensus prototypes from available data to discover a shared space, where semantically similar observations are pulled closer for consensus semantics learning. To capture semantic relationships within specific views, we design a heuristic graph clustering based on modularity to recover cluster structure with intra-cluster compactness and inter-cluster separation for cluster semantics enhancement. Extensive experiments demonstrate that, compared to state-of-the-art competitors, FreeCSL achieves more confident and robust assignments on the IMVC task.

[61] FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining

Myunsoo Kim,Seong-Woong Shim,Byung-Jun Lee

Main category: cs.CV

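TL;DR: FALCON is a learning strategy that adaptively balances hard negatives and false negatives to mitigate the false-negative problem in vision-language pretraining.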

Details Motivation: False negatives introduce conflicting supervision signals in vision-language pretraining, degrading the learned embedding space and the effectiveness of hard negative sampling. Method: FALCON uses a dynamic negative mining scheduler that adaptively selects negatives of appropriate hardness during mini-batch construction. Result: Experiments show that FALCON significantly improves the ALBEF and BLIP-2 frameworks and performs well across a range of downstream tasks. Conclusion: FALCON effectively mitigates the impact of false negatives and improves the robustness and performance of vision-language pretraining. Abstract: False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across two widely adopted VLP frameworks (ALBEF, BLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.

[62] DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Yuang Ai,Qihang Fan,Xuefeng Hu,Zhenheng Yang,Ran He,Huaibo Huang

Main category: cs.CV

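TL;DR: DiCo (Diffusion ConvNet) is an efficient convolution-based diffusion model that improves performance with a compact channel attention mechanism and outperforms DiT on ImageNet benchmarks.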

Details Motivation: Analysis shows that global self-attention in DiT models is often redundant and mainly captures local patterns, which motivates exploring convolution as a more efficient alternative. Method: The paper proposes DiCo, built entirely from standard convolutional modules, with a compact channel attention mechanism that reduces channel redundancy and increases feature diversity. Result: DiCo-XL reaches an FID of 2.05 at 256x256 and 2.53 at 512x512 on ImageNet, with 2.7x and 3.1x speedups over DiT-XL/2, respectively. The largest model, DiCo-H (1B parameters), reaches an FID of 1.90. Conclusion: DiCo demonstrates the efficiency and expressiveness of convolution in diffusion models and offers a new efficient solution for visual generation. Abstract: Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256 without any additional supervision during training. Code: https://github.com/shallowdream204/DiCo.

[63] GeoMM: On Geodesic Perspective for Multi-modal Learning

Shibin Mei,Hang Wang,Bingbing Ni

Main category: cs.CV

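TL;DR: This paper is the first to introduce geodesic distance into multi-modal learning to address the shortcomings of traditional distance metrics in nonlinear spaces. It builds a graph structure, uses hierarchical strategies to keep computation efficient, and validates the approach experimentally.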

Details Motivation: In nonlinear multi-modal learning, traditional distance metrics struggle to distinguish samples that are highly similar yet semantically different; geodesic distance measures distance in nonlinear spaces more reliably. Method: A graph structure represents adjacency among samples, and geodesic distance is computed with a shortest-path algorithm; a hierarchical graph structure and incremental update strategies improve computational efficiency. Result: Experiments show that the method captures complex relationships between samples and improves the performance of multi-modal learning models. Conclusion: Geodesic distance offers clear advantages in multi-modal learning and provides a new approach to measuring distance in nonlinear spaces. Abstract: Geodesic distance serves as a reliable means of measuring distance in nonlinear spaces, and such nonlinear manifolds are prevalent in current multimodal learning. In these scenarios, some samples may exhibit high similarity, yet they convey different semantics, making traditional distance metrics inadequate for distinguishing between positive and negative samples. This paper introduces geodesic distance as a novel distance metric in multi-modal learning for the first time, to mine correlations between samples, aiming to address the limitations of common distance metrics. Our approach incorporates a comprehensive series of strategies to adapt geodesic distance for current multimodal learning. Specifically, we construct a graph structure to represent the adjacency relationships among samples by thresholding distances between them and then apply the shortest-path algorithm to obtain geodesic distance within this graph. To facilitate efficient computation, we further propose a hierarchical graph structure through clustering, combined with incremental update strategies for dynamic status updates. Extensive experiments across various downstream tasks validate the effectiveness of our proposed method, demonstrating its capability to capture complex relationships between samples and improve the performance of multimodal learning models.
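The graph construction and shortest-path step described in the abstract above can be sketched as follows. This is a generic illustration, not the paper's code: the Euclidean base metric, the threshold value, the choice of Dijkstra's algorithm, and the function names are assumptions, and the hierarchical clustering and incremental updates mentioned in the abstract are omitted.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_threshold_graph(points, threshold):
    """Connect two samples when their pairwise distance is at most `threshold`."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean(points[i], points[j])
            if d <= threshold:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def geodesic_distances(graph, source):
    """Dijkstra's algorithm: shortest path length from `source` to every node.
    Nodes in a different connected component keep distance infinity."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Samples along an L-shaped curve: the two ends are close in a straight line
# relative to the path one must travel along the curve.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
graph = build_threshold_graph(points, threshold=1.0)
dist = geodesic_distances(graph, source=0)
```

Here the geodesic distance between the first and last sample follows the curve and is larger than their straight-line distance, which is the effect that lets a graph-based metric separate samples that look close under a conventional metric.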

[64] AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition

Haiyu Li,Charith Abhayaratne

Main category: cs.CV

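TL;DR: The paper proposes an adaptive graph-based denoising framework for event data, used for object recognition with event cameras, that significantly improves recognition accuracy and noise removal.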

Details Motivation: Event cameras produce large amounts of redundant and noisy data; the main challenge is to remove the noise while preserving critical spatial-temporal information. Method: The approach combines adaptive event segmentation, a multifactorial edge-weighting mechanism, and adaptive graph-based denoising to improve the integration of spatiotemporal information. Result: On four datasets the method reaches recognition accuracies of 83.77%, 76.79%, 99.30%, and 96.89%, outperforming existing methods. Conclusion: The method clearly outperforms traditional approaches in both denoising and recognition, confirming its effectiveness. Abstract: Event cameras, which capture brightness changes with high temporal resolution, inherently generate a significant amount of redundant and noisy data beyond essential object structures. The primary challenge in event-based object recognition lies in effectively removing this noise without losing critical spatial-temporal information. To address this, we propose an Adaptive Graph-based Noisy Data Removal framework for Event-based Object Recognition. Specifically, our approach integrates adaptive event segmentation based on normalized density analysis, a multifactorial edge-weighting mechanism, and adaptive graph-based denoising strategies. These innovations significantly enhance the integration of spatiotemporal information, effectively filtering noise while preserving critical structural features for robust recognition. Experimental evaluations on four challenging datasets demonstrate that our method achieves superior recognition accuracies of 83.77%, 76.79%, 99.30%, and 96.89%, surpassing existing graph-based methods by up to 8.79%, and improving noise reduction performance by up to 19.57%, with an additional accuracy gain of 6.26% compared to traditional Euclidean-based techniques.

[65] Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

Fu-Yun Wang,Yunhao Shui,Jingtan Piao,Keqiang Sun,Hongsheng Li

Main category: cs.CV

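TL;DR: The paper proposes a training method targeted at negative preferences to improve the alignment of diffusion model outputs with human preferences.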

Details Motivation: Existing preference alignment methods neglect the handling of unconditional/negative-conditional outputs, so generated results may not match human preferences, which limits the effectiveness of classifier-free guidance (CFG). Method: The paper proposes a simple yet effective approach that trains a model specifically attuned to negative preferences; it needs no new training strategies or datasets, only minor modifications to existing techniques. Result: The method is compatible with SD1.5, SDXL, video diffusion models, and others, and consistently improves alignment with human preferences. Conclusion: By handling negative preferences, the method significantly improves the quality and alignment of diffusion model outputs. Abstract: Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we argue that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance (CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward yet versatile and effective approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD1.5, SDXL, video diffusion models and models that have undergone preference optimization, consistently enhancing their alignment with human preferences.
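The contrast that classifier-free guidance relies on can be written as a one-line extrapolation. The sketch below shows the standard CFG combination on plain lists of numbers; treating `neg_pred` as the output of a separately trained negative-preference model is an assumption about how the abstract's idea would plug into CFG, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(cond_pred, neg_pred, guidance_scale):
    """Standard classifier-free guidance: start from the unconditional or
    negative-conditional prediction and extrapolate toward the conditional one.

    cond_pred: noise prediction under the prompt (from the main model).
    neg_pred:  noise prediction for the unconditional/negative branch
               (assumed here to come from a negative-preference model).
    """
    return [n + guidance_scale * (c - n) for c, n in zip(cond_pred, neg_pred)]


# Toy 2-element "noise predictions".
guided = cfg_combine([1.0, 2.0], [0.0, 1.0], guidance_scale=3.0)
```

With a guidance scale of 1 the result equals the conditional prediction, and larger scales push the output further away from whatever the negative branch predicts, which is why the quality of that branch matters.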

[66] Entropy-Driven Genetic Optimization for Deep-Feature-Guided Low-Light Image Enhancement

Nirjhor Datta,Afroza Akther,M. Sohel Rahman

Main category: cs.CV

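TL;DR: The paper proposes an unsupervised image enhancement framework based on the NSGA-II algorithm that optimizes brightness, contrast, and gamma parameters to balance visual quality and semantic fidelity.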

Details Motivation: Traditional image enhancement methods focus too heavily on pixel-level information and overlook semantic features, so a method is needed that optimizes visual quality and semantic consistency together. Method: A pre-trained deep neural network extracts features, a GPU-accelerated NSGA-II algorithm optimizes multiple objectives (image entropy, perceptual similarity, brightness), and a local search phase fine-tunes the candidate parameters. Result: On unpaired datasets the model achieves average BRISQUE and NIQE scores of 19.82 and 3.652, respectively, and the enhanced images show strong visibility in shadowed regions, balanced contrast, and preserved detail. Conclusion: The method opens new directions for unsupervised image enhancement, especially where semantic consistency is critical. Abstract: Image enhancement methods often prioritize pixel-level information, overlooking semantic features. We propose a novel, unsupervised, fuzzy-inspired image enhancement framework guided by the NSGA-II algorithm that optimizes image brightness, contrast, and gamma parameters to achieve a balance between visual quality and semantic fidelity. Central to our proposed method is the use of a pre-trained deep neural network as a feature extractor. To find the best enhancement settings, we use a GPU-accelerated NSGA-II algorithm that balances multiple objectives, namely, increasing image entropy, improving perceptual similarity, and maintaining appropriate brightness. We further improve the results by applying a local search phase to fine-tune the top candidates from the genetic algorithm. Our approach operates entirely without paired training data, making it broadly applicable across domains with limited or noisy labels. Quantitatively, our model achieves excellent performance with average BRISQUE and NIQE scores of 19.82 and 3.652, respectively, across all unpaired datasets. Qualitatively, images enhanced by our model exhibit significantly improved visibility in shadowed regions and a natural balance of contrast, and they preserve richer fine detail without introducing noticeable artifacts. This work opens new directions for unsupervised image enhancement where semantic consistency is critical.
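To make the search space and objectives concrete, the following sketch applies a brightness/contrast/gamma triple to grayscale pixel values, computes Shannon entropy as one objective, and checks Pareto dominance between objective vectors, which is the comparison NSGA-II builds its non-dominated sorting on. The exact enhancement formula, the parameter ranges, and all function names are assumptions; the perceptual-similarity objective and the genetic operators are not shown.

```python
import math
from collections import Counter


def enhance(pixels, brightness, contrast, gamma):
    """Apply an assumed contrast/brightness/gamma transform to 8-bit gray values."""
    out = []
    for v in pixels:
        x = v / 255.0
        x = contrast * (x - 0.5) + 0.5 + brightness  # contrast about mid-gray, then shift
        x = min(max(x, 0.0), 1.0)                     # clip to the valid range
        x = x ** gamma                                # gamma < 1 brightens dark regions
        out.append(round(x * 255))
    return out


def shannon_entropy(pixels):
    """Entropy in bits of the gray-level histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def dominates(a, b):
    """True when objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


dark = [10, 12, 14, 16, 40, 64]
brighter = enhance(dark, brightness=0.0, contrast=1.0, gamma=0.5)
```

A multi-objective search would score each candidate triple with a vector such as (entropy, similarity, brightness term) and keep the candidates that no other candidate dominates.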

[67] DRAGON: A Large-Scale Dataset of Realistic Images Generated by Diffusion Models

Giulia Bertazzini,Daniele Baracchi,Dasara Shullani,Isao Echizen,Alessandro Piva

Main category: cs.CV

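TL;DR: The paper introduces DRAGON, a dataset of images from 25 diffusion models intended to support the development of synthetic content detection techniques.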

Details Motivation: Images generated by diffusion models are misused for misinformation, existing detection methods depend on large amounts of training data, and datasets lag behind the field, so a comprehensive and up-to-date dataset is urgently needed. Method: The paper presents the DRAGON dataset, which covers a wide range of diffusion models and uses a large language model to expand input prompts for greater image diversity and quality. Result: The dataset is provided in multiple sizes and comes with a test set, supporting the development and evaluation of detection techniques. Conclusion: DRAGON provides a comprehensive and practical resource for synthetic content detection and helps advance the field. Abstract: The remarkable ease of use of diffusion models for image generation has led to a proliferation of synthetic content online. While these models are often employed for legitimate purposes, they are also used to generate fake images that support misinformation and hate speech. Consequently, it is crucial to develop robust tools capable of detecting whether an image has been generated by such models. Many current detection methods, however, require large volumes of sample images for training. Unfortunately, due to the rapid evolution of the field, existing datasets often cover only a limited range of models and quickly become outdated. In this work, we introduce DRAGON, a comprehensive dataset comprising images from 25 diffusion models, spanning both recent advancements and older, well-established architectures. The dataset contains a broad variety of images representing diverse subjects. To enhance image realism, we propose a simple yet effective pipeline that leverages a large language model to expand input prompts, thereby generating more diverse and higher-quality outputs, as evidenced by improvements in standard quality metrics. The dataset is provided in multiple sizes (ranging from extra-small to extra-large) to accommodate different research scenarios. DRAGON is designed to support the forensic community in developing and evaluating detection and attribution techniques for synthetic content. Additionally, the dataset is accompanied by a dedicated test set, intended to serve as a benchmark for assessing the performance of newly developed methods.

[68] Multi-view dense image matching with similarity learning and geometry priors

Mohamed Ali Chebbi,Ewelina Rupnik,Paul Lopes,Marc Pierrot-Deseilligny

Main category: cs.CV

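TL;DR: MV-DeepSimNets is a suite of deep neural networks for multi-view similarity learning that is trained with epipolar geometry, avoids laborious multi-view dataset creation, and significantly improves multi-view reconstruction.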

Details Motivation: Traditional dense matching methods have limited effectiveness in multi-view reconstruction and require large amounts of annotated data. MV-DeepSimNets aims to improve reconstruction and reduce data requirements through geometry priors and online learning. Method: Epipolar geometry and homography rectification are combined to generate geometry-aware features, which are projected onto candidate depth hypotheses by plane sweeping; the learned similarities are then aggregated to build a regularized cost volume. Result: On aerial and satellite imagery the approach outperforms existing similarity learning and end-to-end regression models, especially in generalization. Conclusion: Through geometry priors and online learning, MV-DeepSimNets significantly improves multi-view reconstruction and fits into standard multi-resolution image matching pipelines. Abstract: We introduce MV-DeepSimNets, a comprehensive suite of deep neural networks designed for multi-view similarity learning, leveraging epipolar geometry for training. Our approach incorporates an online geometry prior to characterize pixel relationships, either along the epipolar line or through homography rectification. This enables the generation of geometry-aware features from native images, which are then projected across candidate depth hypotheses using plane sweeping. Our method's geometric preconditioning effectively adapts epipolar-based features for enhanced multi-view reconstruction, without requiring the laborious creation of a multi-view training dataset. By aggregating learned similarities, we construct and regularize the cost volume, leading to improved multi-view surface reconstruction over traditional dense matching approaches. MV-DeepSimNets demonstrates superior performance against leading similarity learning networks and end-to-end regression models, especially in terms of generalization capabilities across both aerial and satellite imagery with varied ground sampling distances. Our pipeline is integrated into MicMac software and can be readily adopted in standard multi-resolution image matching pipelines.

[69] Equal is Not Always Fair: A New Perspective on Hyperspectral Representation Non-Uniformity

Wuzhou Quan,Mingqiang Wei,Jinhui Tang

Main category: cs.CV

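TL;DR: FairHyp is a fairness-directed framework for the non-uniformity problem in hyperspectral images (HSI); its modular design resolves conflicts among the spatial, spectral, and feature dimensions, and it performs well across multiple tasks.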

Details Motivation: The non-uniformity of hyperspectral images causes existing unified-processing models to perform poorly; FairHyp aims to address this with specialized modules. Method: FairHyp comprises a Runge-Kutta-inspired spatial adapter, a multi-receptive-field convolution module, and a spectral-context state space model, which handle spatial, feature, and spectral non-uniformity respectively. Result: FairHyp outperforms existing methods on classification, denoising, super-resolution, and inpainting. Conclusion: FairHyp treats fairness as a structural requirement of HSI modeling and offers a new paradigm for high-dimensional vision tasks. Abstract: Hyperspectral image (HSI) representation is fundamentally challenged by pervasive non-uniformity, where spectral dependencies, spatial continuity, and feature efficiency exhibit complex and often conflicting behaviors. Most existing models rely on a unified processing paradigm that assumes homogeneity across dimensions, leading to suboptimal performance and biased representations. To address this, we propose FairHyp, a fairness-directed framework that explicitly disentangles and resolves the threefold non-uniformity through cooperative yet specialized modules. We introduce a Runge-Kutta-inspired spatial variability adapter to restore spatial coherence under resolution discrepancies, a multi-receptive field convolution module with sparse-aware refinement to enhance discriminative features while respecting inherent sparsity, and a spectral-context state space model that captures stable and long-range spectral dependencies via bidirectional Mamba scanning and statistical aggregation. Unlike one-size-fits-all solutions, FairHyp achieves dimension-specific adaptation while preserving global consistency and mutual reinforcement. This design is grounded in the view that non-uniformity arises from the intrinsic structure of HSI representations, rather than any particular task setting. To validate this, we apply FairHyp across four representative tasks including classification, denoising, super-resolution, and inpainting, demonstrating its effectiveness in modeling a shared structural flaw. Extensive experiments show that FairHyp consistently outperforms state-of-the-art methods under varied imaging conditions. Our findings redefine fairness as a structural necessity in HSI modeling and offer a new paradigm for balancing adaptability, efficiency, and fidelity in high-dimensional vision tasks.

[70] MTevent: A Multi-Task Event Camera Dataset for 6D Pose Estimation and Moving Object Detection

Shrutarv Awasthi,Anas Gouda,Sven Franke,Jérôme Rutinowski,Frank Hoffmann,Moritz Roidl

Main category: cs.CV

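TL;DR: The MTevent dataset provides an event-camera-based solution for high-speed robotic vision, addressing the limitations of RGB cameras in dynamic environments.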

Details Motivation: High-speed mobile robots need real-time perception, but RGB cameras fall short because of motion blur and latency; event cameras, being asynchronous and low-latency, are a promising alternative. Method: A stereo event camera and an RGB camera capture 75 scenes, averaging 16 seconds each, with 16 unique objects under challenging conditions such as extreme viewing angles, lighting changes, and occlusions. Result: 6D pose estimation with NVIDIA's FoundationPose on RGB images reaches an Average Recall of 0.22, highlighting the shortcomings of RGB methods in dynamic environments. Conclusion: MTevent is a valuable resource for high-speed robotic vision research and advances the use of event cameras in dynamic environments. Abstract: Mobile robots are reaching unprecedented speeds, with platforms like Unitree B2 and Fraunhofer O3dyn achieving maximum speeds between 5 and 10 m/s. However, effectively utilizing such speeds remains a challenge due to the limitations of RGB cameras, which suffer from motion blur and fail to provide real-time responsiveness. Event cameras, with their asynchronous operation and low-latency sensing, offer a promising alternative for high-speed robotic perception. In this work, we introduce MTevent, a dataset designed for 6D pose estimation and moving object detection in highly dynamic environments with large detection distances. Our setup consists of a stereo-event camera and an RGB camera, capturing 75 scenes, each on average 16 seconds long, and featuring 16 unique objects under challenging conditions such as extreme viewing angles, varying lighting, and occlusions. MTevent is the first dataset to combine high-speed motion, long-range perception, and real-world object interactions, making it a valuable resource for advancing event-based vision in robotics. To establish a baseline, we evaluate the task of 6D pose estimation using NVIDIA's FoundationPose on RGB images, achieving an Average Recall of 0.22 with ground-truth masks, highlighting the limitations of RGB-based approaches in such dynamic settings. With MTevent, we provide a novel resource to improve perception models and foster further research in high-speed robotic vision. The dataset is available for download at https://huggingface.co/datasets/anas-gouda/MTevent

[71] Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining

Raghuveer Thirukovalluru,Rui Meng,Ye Liu,Karthikeyan K,Mingyi Su,Ping Nie,Semih Yavuz,Yingbo Zhou,Wenhu Chen,Bhuwan Dhingra

Main category: cs.CV

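TL;DR: The paper proposes 'Breaking the Batch Barrier' (B3), a batch construction strategy that uses a pretrained teacher model and a community detection algorithm to improve the quality of negatives in contrastive learning, significantly improving model performance.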

Details Motivation: The effectiveness of contrastive learning depends heavily on batch size and quality; existing methods need large batches to perform well, so a more efficient batch construction method is needed. Method: A pretrained teacher model ranks the examples in the dataset to build a sparse similarity graph; a community detection algorithm then identifies clusters of strong negatives, which are used to construct high-quality batches. Result: On the MMEB multimodal embedding benchmark, B3 improves over the previous best methods by 1.3 and 2.9 points at the 7B and 2B model scales, respectively, and surpasses existing methods with a batch size of only 64. Conclusion: By optimizing batch construction, B3 significantly improves the efficiency of contrastive learning and performs especially well with small batches. Abstract: Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives is 'in-batch' examples, i.e., positives from other examples in the batch. The effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose 'Breaking the Batch Barrier' (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4-16x smaller than that required by other methods.
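
The batch-mining steps in the abstract above (teacher ranking, sparse similarity graph, clustering, batch construction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy embeddings, the top-k graph rule, and all function names are hypothetical, and plain connected components stand in for the unspecified community detection algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_topk_graph(embeddings, k):
    """Sparse similarity graph: link each example to its k most similar others."""
    n = len(embeddings)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(embeddings[i], embeddings[j]),
                        reverse=True)
        for j in ranked[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def connected_components(adj):
    """Assumed stand-in for community detection: label connected components."""
    labels = {}
    current = 0
    for start in adj:
        if start in labels:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in labels:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return [labels[i] for i in range(len(adj))]

def make_batches(labels, batch_size):
    """Fill batches cluster by cluster so in-batch negatives are mutually similar."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Hypothetical teacher embeddings: two groups of four mutually similar examples.
teacher_emb = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.9, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.0, 0.9],
]
labels = connected_components(build_topk_graph(teacher_emb, k=2))
batches = make_batches(labels, batch_size=4)
print(batches)
```

Each batch here draws from a single cluster, so every example sees negatives that the teacher considers close to it; a real pipeline would swap in a proper community detection method and handle clusters larger or smaller than the batch size.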

[72] CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

Christoph Leiter,Yuki M. Asano,Margret Keuper,Steffen Eger

Main category: cs.CV

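TL;DR: CROC is an automated framework for assessing the robustness of evaluation metrics for text-to-image generation; it synthesizes contrastive test cases and generates large-scale datasets.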

Details Motivation: Automated methods for testing the robustness of existing evaluation metrics are lacking, and human evaluation is costly and time-consuming. Method: Proposes the CROC framework, generates a pseudo-labeled dataset (CROC$^{syn}$), trains a new metric, CROCScore, and introduces a human-supervised benchmark (CROC$^{hum}$). Result: Existing metrics show robustness problems on tasks such as negated prompts and body-part identification; CROCScore outperforms open-source methods. Conclusion: CROC provides an efficient automated solution for assessing metric robustness and exposes the weaknesses of existing metrics. Abstract: The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC$^{hum}$) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.
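
To make the idea of a contrastive robustness check concrete, the sketch below scores a metric by how often it rates the matching image above a contrastive one for the same prompt. Everything here is an illustrative assumption: images are represented as sets of tags, `toy_metric` is a hypothetical bag-of-words stand-in for a real T2I metric, and the two test cases are invented, not taken from CROC.

```python
def toy_metric(prompt, image_tags):
    """Hypothetical metric: fraction of prompt words that appear among the image tags."""
    words = prompt.lower().split()
    return sum(w in image_tags for w in words) / len(words)

def robustness_pass_rate(metric, cases):
    """Share of cases where the matching image outscores the contrastive image."""
    passed = sum(metric(prompt, match) > metric(prompt, contrast)
                 for prompt, match, contrast in cases)
    return passed / len(cases)

# Each case: (prompt, tags of the matching image, tags of the contrastive image).
cases = [
    ("a red car", {"a", "red", "car"}, {"a", "blue", "car"}),
    ("a dog without a collar", {"a", "dog"}, {"a", "dog", "collar"}),
]
print(robustness_pass_rate(toy_metric, cases))
```

The second case shows why negation is hard for a metric that only matches words: the contrastive image contains "collar" and therefore scores higher, even though the prompt asks for its absence.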

[73] Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Keunwoo Peter Yu,Joyce Chai

Main category: cs.CV

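TL;DR: The paper proposes a new benchmark task, TGLG, for evaluating vision-language models in real-time interactive environments, and presents the VLM-TSI model, which experiments show outperforms the baseline.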

Details Motivation: Investigates the demands placed on vision-language models in real-time interactive environments, in particular generating utterances that are both semantically accurate and precisely timed. Method: Proposes the TGLG task and the TRACE evaluation metric, and designs VLM-TSI, which interleaves visual and linguistic tokens in a time-synchronized manner for real-time generation. Result: VLM-TSI significantly outperforms the baseline, but overall performance leaves room for improvement. Conclusion: TGLG is a challenging task, and real-time vision-language models need further research. Abstract: Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- $\textit{perceptual updating}$ and $\textit{contingency awareness}$ -- and propose a new benchmark task, $\textbf{Temporally-Grounded Language Generation (TGLG)}$, to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, $\textbf{TRACE}$, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present $\textbf{Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)}$, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available at https://github.com/yukw777/tglg.

[74] MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Y. B. Wang,S Wang,J. N. Zhang,J. F. Wu,Q. D. He,C. C. Fu,C. J. Wang,Y. Liu

Main category: cs.CV

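TL;DR: The paper proposes MARRS, a framework for generating coordinated, fine-grained human reaction motions in continuous representations, addressing the quantization information loss and weak hand-motion handling of existing methods.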

Details Motivation: Address the limitations of existing methods in human action-reaction synthesis, such as information loss and low codebook utilization from vector quantization, and the difficulty of generating fine-grained hand movements. Method: Proposes the MARRS framework, comprising UD-VAE (encodes body and hand units independently), ACF (randomly masks reactive tokens and extracts information), AUM (adaptive unit modulation), and a diffusion model (a noise predictor models the probability distribution). Result: Quantitative and qualitative results show that the method excels at generating coordinated, fine-grained reaction motions. Conclusion: Through independent encoding and adaptive modulation, MARRS significantly improves human action-reaction synthesis; the code will be released upon acceptance. Abstract: This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions based on the action sequence of the other as conditions. Currently, autoregressive modeling approaches have achieved remarkable performance in motion generation tasks, e.g., text-to-motion. However, vector quantization (VQ) accompanying autoregressive generation has inherent disadvantages, including loss of quantization information, low codebook utilization, etc. Moreover, unlike text-to-motion, which focuses solely on the movement of body joints, human action-reaction synthesis also encompasses fine-grained hand movements. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions in continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding them independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.

[75] Dynamic Base model Shift for Delta Compression

Chenyu Huang,Peng Ye,Shenghe Zheng,Xiaohui Wang,Lei Bai,Tao Chen,Wanli Ouyang

Main category: cs.CV

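TL;DR: The paper proposes Dynamic Base Model Shift (DBMS), which adjusts the base model and compression parameters to significantly improve performance at high compression rates.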

Details Motivation: Existing methods compress delta parameters using the pretrained model as the base model, which can degrade performance, especially at extremely high compression rates. Method: Proposes DBMS, which dynamically adjusts the base model and compression parameters to optimize task performance. Result: DBMS retains the finetuned model's performance even at extremely high compression rates, outperforming existing methods. Conclusion: DBMS is general, can be combined with other methods, and applies to a variety of models. Abstract: Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pretrained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pretrained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model's performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.
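
The effect of shifting the base model before compressing the delta can be illustrated with a toy example. This is only a sketch under assumptions: the abstract does not give the shift direction or the compressor, so `shift_dir` (imagined here as a cheap shared direction), magnitude pruning, and all names below are hypothetical. The two scalars `shift_mag` and `scale` play the roles of the two parameters the abstract mentions.

```python
def prune_by_magnitude(delta, keep):
    """Keep the `keep` largest-magnitude entries of the delta and zero the rest."""
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(ranked[:keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]

def reconstruct(pretrained, finetuned, shift_dir, shift_mag, scale, keep):
    """Shift the base, compress the remaining delta, and rebuild the weights."""
    base = [p + shift_mag * s for p, s in zip(pretrained, shift_dir)]
    delta = [f - b for f, b in zip(finetuned, base)]
    sparse = prune_by_magnitude(delta, keep)
    return [b + scale * d for b, d in zip(base, sparse)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

pretrained = [0.0, 0.0, 0.0, 0.0]
finetuned = [1.0, 1.0, 1.0, 3.0]
shift_dir = [1.0, 1.0, 1.0, 1.0]  # assumed shift direction, for illustration only

# Conventional delta compression: pretrained weights as the base (no shift).
baseline = reconstruct(pretrained, finetuned, shift_dir, shift_mag=0.0, scale=1.0, keep=1)
# Shifted base: the same one-entry budget now covers the whole remaining delta.
shifted = reconstruct(pretrained, finetuned, shift_dir, shift_mag=1.0, scale=1.0, keep=1)

baseline_err = squared_error(baseline, finetuned)
shifted_err = squared_error(shifted, finetuned)
print(baseline_err, shifted_err)
```

With the same pruning budget, the shifted base leaves a sparser residual to compress, which is the intuition behind tuning the shift magnitude and the delta scale per task.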

[76] Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Zihan Wang,Seungjun Lee,Gim Hee Lee

Main category: cs.CV

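TL;DR: Dynam3D uses a dynamic layered 3D representation to address the geometric understanding, large-scale exploration, and dynamic-environment adaptation problems of video-language large models in 3D navigation, and achieves state-of-the-art performance on multiple VLN benchmarks.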

Details Motivation: Existing video-language large models applied to 3D navigation suffer from insufficient understanding of geometry and spatial semantics, limited capacity for large-scale exploration, and poor adaptability to dynamic environments. Method: Dynam3D projects 2D CLIP features into 3D space, builds multi-level 3D representations (patch-instance-zone), and uses a dynamic, layer-wise update strategy. Result: Dynam3D achieves state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE, and NavRAG-CE, and experiments validate its effectiveness in practical deployment. Conclusion: Through dynamic layered 3D representations and language alignment, Dynam3D significantly improves 3D navigation performance and is suitable for practical deployment. Abstract: Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train 3D-VLM in navigation action prediction. Given posed RGB-D images, our Dynam3D projects 2D CLIP features into 3D space and constructs multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding with a dynamic and layer-wise update strategy. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation. By leveraging large-scale 3D-language pretraining and task-specific adaptation, our Dynam3D sets new state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE and NavRAG-CE under monocular settings. Furthermore, experiments on pre-exploration, lifelong memory, and real-world robots validate the effectiveness of practical deployment.

[77] MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory

Zifan Wang,Jingwei Li,Yitang Li,Yunze Liu

Main category: cs.CV

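TL;DR: MutualNeRF uses mutual information theory to improve NeRF performance under limited samples, measuring the correlation between images with a unified metric to optimize sparse view sampling and few-shot view synthesis.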

Details Motivation: NeRF performs poorly with limited data, and existing methods lack unified theoretical support; MutualNeRF fills this gap with mutual information theory. Method: Proposes mutual information as a unified metric, optimizes sparse view sampling with a greedy algorithm, and maximizes mutual information in few-shot view synthesis through regularization terms. Result: Experiments show that MutualNeRF outperforms existing methods under limited samples. Conclusion: MutualNeRF provides theoretical support and a practical framework for improving NeRF performance with limited data. Abstract: This paper introduces MutualNeRF, a framework enhancing Neural Radiance Field (NeRF) performance under limited samples using Mutual Information Theory. While NeRF excels in 3D scene synthesis, challenges arise with limited data, and existing methods that aim to introduce prior knowledge lack theoretical support in a unified framework. We introduce a simple but theoretically robust concept, Mutual Information, as a metric to uniformly measure the correlation between images, considering both macro (semantic) and micro (pixel) levels. For sparse view sampling, we strategically select additional viewpoints containing more non-overlapping scene information by minimizing mutual information without knowing ground truth images beforehand. Our framework employs a greedy algorithm, offering a near-optimal solution. For few-shot view synthesis, we maximize the mutual information between inferred images and ground truth, expecting inferred images to gain more relevant information from known images. This is achieved by incorporating efficient, plug-and-play regularization terms. Experiments under limited samples show consistent improvement over state-of-the-art baselines in different settings, affirming the efficacy of our framework.
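
The greedy sparse-view selection described in the abstract above can be sketched as follows. This is an illustrative assumption rather than the paper's estimator: each view is modeled as the set of scene regions it covers, and the size of the overlap between two views is used as a hypothetical proxy for their mutual information. The view names and region IDs are invented.

```python
def overlap(view_a, view_b):
    """Hypothetical mutual-information proxy: number of shared scene regions."""
    return len(view_a & view_b)

def greedy_select(known_views, candidates, budget):
    """Repeatedly add the candidate that overlaps least with the views chosen so far."""
    selected = list(known_views)
    remaining = dict(candidates)
    chosen = []
    for _ in range(budget):
        best = min(remaining,
                   key=lambda name: sum(overlap(remaining[name], v) for v in selected))
        chosen.append(best)
        selected.append(remaining.pop(best))
    return chosen

known_views = [{1, 2, 3}]
candidates = {"left": {2, 3, 4}, "back": {7, 8, 9}, "top": {3, 5, 6}}
chosen = greedy_select(known_views, candidates, budget=2)
print(chosen)
```

The greedy loop picks "back" first because it shares nothing with the known view, then "top", which adds more unseen regions than "left"; a real system would replace `overlap` with a mutual-information estimate that does not need ground-truth images.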

[78] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

Wenchuan Zhang,Penghao Zhang,Jingru Guo,Tao Cheng,Jie Chen,Shuwan Zhang,Zhang Zhang,Yuhao Yi,Hong Bu

Main category: cs.CV

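TL;DR: The study builds high-quality, reasoning-oriented pathology datasets and develops Patho-R1, a multimodal RL-based pathology reasoning model, significantly improving the diagnostic accuracy and reasoning plausibility of pathology vision-language models.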

Details Motivation: Current pathology vision-language models fall short in diagnostic accuracy and reasoning plausibility, mainly because existing datasets lack depth and structured diagnostic paradigms. Method: Builds high-quality datasets from pathology textbooks and experts, and develops Patho-R1 through a three-stage training pipeline (continued pretraining, supervised fine-tuning, reinforcement learning). Result: Patho-R1 and PathoCLIP perform strongly across a range of pathology tasks, including zero-shot classification, cross-modal retrieval, visual question answering, and multiple-choice questions. Conclusion: The study provides high-quality datasets and an advanced reasoning model for pathology vision-language modeling, significantly improving performance. Abstract: Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology-specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image-description pairs that lack the depth and structured diagnostic paradigms employed by real-world pathologists. In this study, we leverage pathology textbooks and real-world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose PathoCLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both PathoCLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: https://github.com/Wenchuan-Zhang/Patho-R1.

[79] EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

Bohao Xing,Xin Liu,Guoying Zhao,Chengyu Liu,Xiaolan Fu,Heikki Kälviäinen

Main category: cs.CV

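TL;DR: The paper introduces EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in multimodal large language models (MLLMs), and reveals how current models perform on this problem.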

Details Motivation: Although emotion understanding is critical for MLLMs, there has been no dedicated evaluation of their emotion hallucinations. Method: Evaluates models with an adversarial binary question-answer framework grounded in emotion psychology knowledge and real-world multimodal perception. Result: Most models exhibit emotion hallucination problems, closed-source models outperform open-source ones, and models do better on emotion psychology knowledge than on multimodal perception. Conclusion: Proposes the PEP-MEK framework, which significantly improves emotion hallucination detection. Abstract: Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at https://github.com/xxtars/EmotionHallucer.

[80] Improving Object Detection Performance through YOLOv8: A Comprehensive Training and Evaluation Study

Rana Poureskandar,Shiva Razzagzadeh

Main category: cs.CV

TL;DR: Evaluates the performance of a YOLOv8 segmentation model for detecting and segmenting facial wrinkles.

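Details Motivation: The study explores the suitability and effectiveness of YOLOv8 for facial wrinkle detection. Method: A YOLOv8-based segmentation model detects and segments wrinkles in facial images. Result: Reports the model's performance on wrinkle detection and segmentation. Conclusion: The YOLOv8 segmentation model shows potential for facial wrinkle detection and provides technical support for related applications. Abstract: This study evaluated the performance of a YOLOv8-based segmentation model for detecting and segmenting wrinkles in facial images.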

[81] Face Consistency Benchmark for GenAI Video

Michal Podstawski,Malgorzata Kudelska,Haohong Wang

Main category: cs.CV

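TL;DR: The paper introduces the Face Consistency Benchmark (FCB), a framework for evaluating character consistency in AI-generated videos, aiming to address the gaps in existing solutions.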

Details Motivation: AI video generation has advanced significantly, but character consistency remains a challenge: existing models struggle to keep appearance and attributes coherent. Method: Proposes the FCB framework, which evaluates and compares character consistency in AI-generated videos using standardized metrics. Result: FCB highlights gaps in existing solutions and promotes the development of more reliable approaches. Conclusion: The work is an important step toward improving character consistency in AI video generation. Abstract: Video generation driven by artificial intelligence has advanced significantly, enabling the creation of dynamic and realistic content. However, maintaining character consistency across video sequences remains a major challenge, with current models struggling to ensure coherence in appearance and attributes. This paper introduces the Face Consistency Benchmark (FCB), a framework for evaluating and comparing the consistency of characters in AI-generated videos. By providing standardized metrics, the benchmark highlights gaps in existing solutions and promotes the development of more reliable approaches. This work represents a crucial step toward improving character consistency in AI video generation technologies.

[82] SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision

Utsav Rai,Haozheng Xu,Stamatia Giannarou

Main category: cs.CV

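TL;DR: The paper proposes a 6-DoF surgical tool pose estimation method based on zero-shot RGB-D models; by improving SAM-6D and depth estimation, it significantly improves pose estimation on unseen surgical tools.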

Details Motivation: Traditional marker-based and supervised learning methods for surgical tool pose estimation are limited by occlusions, reflections, and poor generalization. Zero-shot methods have not yet been explored in RMIS, and this work fills that gap. Method: Combines the FoundationPose and SAM-6D models, introduces RAFT-Stereo for depth estimation, and replaces the SAM module with a fine-tuned Mask R-CNN to improve segmentation accuracy. Result: The enhanced SAM-6D outperforms FoundationPose in zero-shot pose estimation, setting a new benchmark for zero-shot RGB-D methods in RMIS. Conclusion: The method improves the generalization of zero-shot pose estimation and is the first to apply zero-shot RGB-D methods in RMIS. Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models such as FoundationPose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method, for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.

[83] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza,Aravind Narayanan,Vahid Reza Khazaie,Ashmal Vayani,Mukund S. Chettiar,Amandeep Singh,Mubarak Shah,Deval Pandya

Main category: cs.CV

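TL;DR: HumaniBench is a benchmark of 32K real-world image-question pairs that evaluates large multimodal models (LMMs) on human-centered AI principles such as fairness, ethics, and empathy.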

Details Motivation: Current LMMs excel on vision-language tasks but still fall short in alignment with human values such as fairness and ethics. Method: Builds the dataset with a GPT4o-assisted annotation pipeline verified by domain experts, and evaluates 7 HCAI principles across 7 tasks. Result: Tests of 15 LMMs show that proprietary models generally perform better, while open models struggle to balance accuracy with human-aligned principles. Conclusion: HumaniBench is the first benchmark built around HCAI principles for diagnosing alignment gaps in LMMs, pushing models forward in both accuracy and social responsibility. Abstract: Large multimodal models (LMMs) now excel on many vision language benchmarks; however, they still struggle with human-centered criteria such as fairness, ethics, empathy, and inclusivity, which are key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image question pairs, annotated via a scalable GPT4o-assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness, across seven diverse tasks, including open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state-of-the-art LMMs (open and closed source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose-built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. Dataset, annotation prompts, and evaluation code are available at: https://vectorinstitute.github.io/HumaniBench

[84] PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment

Dingbang Huang,Wenbo Li,Yifei Zhao,Xinyu Pan,Yanhong Zeng,Bo Dai

Main category: cs.CV

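TL;DR: PSDiffusion is a unified diffusion framework for simultaneous multi-layer text-to-image generation, addressing the inability of existing methods to handle interactions between layers.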

Details Motivation: Existing multi-layer generation methods cannot handle inter-layer interactions (such as global layout, physical contact, and visual effects) while maintaining high-quality transparency. Method: Proposes the PSDiffusion framework, which uses a global-layer interactive mechanism to generate multi-layer images (an RGB background and multiple RGBA foregrounds) in a single feed-forward pass. Result: The model automatically generates high-quality, complete multi-layer images and ensures spatial and visual interactions among layers for global coherence. Conclusion: PSDiffusion performs well in multi-layer image generation, resolving inter-layer interaction and global coherence issues. Abstract: Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers, such as rational global layout, physics-plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.

[85] Unsupervised Detection of Distribution Shift in Inverse Problems using Diffusion Models

Shirin Shoushtari,Edward P. Chandler,Yuanhao Wang,M. Salman Asif,Ulugbek S. Kamilov

Main category: cs.CV

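TL;DR: Proposes an unsupervised metric that estimates distribution shift from indirect measurements and the score functions of diffusion models, and shows that it approximates the KL divergence. Aligning the out-of-distribution score with the in-distribution score improves reconstruction quality in inverse problems.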

Details Motivation: Diffusion models are widely used as priors in imaging inverse problems, but their performance degrades when the training and test image distributions differ. Existing methods require clean test images, which are usually unavailable in inverse problems. Method: Proposes a fully unsupervised metric that estimates distribution shift using only indirect measurements and the score functions of diffusion models trained on different datasets. Result: Theory shows that the metric estimates the KL divergence between the training and test image distributions, and experiments show that it closely approximates the KL divergence computed from clean images. Aligning the scores reduces the KL divergence and improves reconstruction quality. Conclusion: The method estimates distribution shift effectively without clean images and improves inverse-problem reconstruction through score alignment. Abstract: Diffusion models are widely used as priors in imaging inverse problems. However, their performance often degrades under distribution shifts between the training and test-time images. Existing methods for identifying and quantifying distribution shifts typically require access to clean test images, which are almost never available while solving inverse problems (at test time). We propose a fully unsupervised metric for estimating distribution shifts using only indirect (corrupted) measurements and score functions from diffusion models trained on different datasets. We theoretically show that this metric estimates the KL divergence between the training and test image distributions. Empirically, we show that our score-based metric, using only corrupted measurements, closely approximates the KL divergence computed from clean images. Motivated by this result, we show that aligning the out-of-distribution score with the in-distribution score -- using only corrupted measurements -- reduces the KL divergence and leads to improved reconstruction quality across multiple inverse problems.

[86] GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Yusu Qian,Jiasen Lu,Tsu-Jui Fu,Xinze Wang,Chen Chen,Yinfei Yang,Wenze Hu,Zhe Gan

Main category: cs.CV

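TL;DR: The paper introduces GIE-Bench, a new benchmark for evaluating text-guided image editing models more accurately, focusing on functional correctness and image content preservation.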

Details Motivation: Existing evaluation approaches (such as CLIP-based image-text similarity) lack precision, so a more grounded form of evaluation is needed. Method: Verifies functional correctness with automatically generated multiple-choice questions and assesses image content preservation with an object-aware masking technique. Result: GPT-Image-1 leads in instruction-following accuracy but often over-modifies irrelevant regions. Conclusion: GIE-Bench provides a scalable, reproducible evaluation framework for text-guided image editing. Abstract: Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.

[87] QVGen: Pushing the Limit of Quantized Video Generative Models

Yushi Huang,Ruihao Gong,Jing Liu,Yifu Ding,Chengtao Lv,Haotong Qin,Jun Zhang

Main category: cs.CV

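TL;DR: QVGen is a quantization-aware training framework for video diffusion models that uses auxiliary modules and a rank-decay strategy to achieve high performance and efficient inference under extremely low-bit quantization (e.g., 4-bit or below).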

Details Motivation: Video diffusion models have high computational and memory demands, and existing quantization methods work poorly when applied to them directly, so a new quantization-aware training framework is needed. Method: Introduces auxiliary modules ($\Phi$) to reduce quantization error, then progressively eliminates $\Phi$ with a rank-decay strategy (SVD plus a rank-based regularization $\mathbf{\gamma}$) to remove the inference overhead. Result: QVGen is the first to reach quality comparable to full precision under 4-bit settings and significantly outperforms existing methods at 3 bits; for example, the 3-bit CogVideoX-2B improves Dynamic Degree by 25.28 and Scene Consistency by 8.43 on VBench. Conclusion: QVGen offers an efficient quantization solution for video diffusion models, substantially reducing computational and memory demands while retaining high performance. Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has proven notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3$B $\sim14$B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.
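
To give a feel for why extremely low-bit settings are hard, here is a minimal sketch of plain symmetric uniform quantization applied to a handful of weights. It is not QVGen's method: the function names `fake_quantize` and `max_abs_error`, the per-tensor scale, and the toy weight values are all assumptions made for illustration. It only shows how the rounding error grows as the bit width shrinks.

```python
def fake_quantize(weights, bits):
    """Quantize then dequantize a list of floats with a symmetric uniform grid.

    Illustrative only: a per-tensor scale is assumed, not taken from the paper.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit, 3 for 3-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code, clamped to the grid
        dequantized.append(q * scale)
    return dequantized, scale


def max_abs_error(weights, bits):
    """Largest absolute difference between original and dequantized weights."""
    dequantized, _ = fake_quantize(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantized))


# Hypothetical weights, chosen only to exercise the functions.
weights = [0.91, -0.42, 0.07, -1.30, 0.58, 0.0, 1.12, -0.77]
for bits in (8, 4, 3):
    print(bits, "bits -> max abs error", max_abs_error(weights, bits))
```

The rounding error of each weight is bounded by half the step size `scale`, and the step size roughly doubles for every bit removed, which is the kind of error that the abstract's auxiliary modules are described as mitigating during training.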

cs.GR [Back]

[88] Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View

Qifan Fu,Xu Chen,Muhammad Asad,Shanxin Yuan,Changjae Oh,Gregory Slabaugh

Main category: cs.GR

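TL;DR: Proposes MUFEN, a multi-view prior framework that uses multi-view rendering and a dual-stream encoder to improve the completeness and accuracy of hand gesture generation.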

Details Motivation: Existing single-view methods struggle to capture complete 3D hand information, especially details under occlusion. Method: Uses multi-view rendering (front, rear, left, right, top, and bottom) and a dual-stream encoder, combined with a bounding box feature fusion module. Result: Experiments show that the method achieves state-of-the-art performance in both quantitative and qualitative evaluations. Conclusion: The multi-view prior framework significantly improves the completeness and accuracy of hand gesture generation. Abstract: High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ single-view 3D MANO mesh-rendered images prior to enhancing gesture generation quality. However, the complexity of hand movements and the inherent limitations of single-view rendering make it difficult to capture complete 3D hand information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion completion. This multi-view prior with a dedicated dual stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which can fuse the gesture localization features and gesture multi-modal features to enhance the location-awareness of the MUFEN features to the gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations.

[89] Textured mesh Quality Assessment using Geometry and Color Field Similarity

Kaifa Yang,Qi Yang,Zhu Li,Yiling Xu

Main category: cs.GR

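TL;DR: Proposes FMQM, a field-based quality assessment method for textured meshes that extracts visual-perception features from geometry and color fields, outperforming existing methods at low computational cost.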

Details Motivation: Existing textured mesh quality assessment methods are not accurate enough, while field representations are effective at capturing 3D geometry and color information; this motivates FMQM. Method: FMQM uses signed distance fields and a newly proposed nearest surface point color field to extract four features: geometry similarity, geometry gradient similarity, space color distribution similarity, and space color gradient similarity. Result: On three benchmark datasets, FMQM outperforms state-of-the-art methods and has low computational complexity. Conclusion: FMQM is an efficient, practical quality assessment method for textured meshes, suited to 3D graphics and visualization applications. Abstract: Textured mesh quality assessment (TMQA) is critical for various 3D mesh applications. However, existing TMQA methods often struggle to provide accurate and robust evaluations. Motivated by the effectiveness of fields in representing both 3D geometry and color information, we propose a novel point-based TMQA method called field mesh quality metric (FMQM). FMQM utilizes signed distance fields and a newly proposed color field named nearest surface point color field to realize effective mesh feature description. Four features related to visual perception are extracted from the geometry and color fields: geometry similarity, geometry gradient similarity, space color distribution similarity, and space color gradient similarity. Experimental results on three benchmark datasets demonstrate that FMQM outperforms state-of-the-art (SOTA) TMQA metrics. Furthermore, FMQM exhibits low computational complexity, making it a practical and efficient solution for real-world applications in 3D graphics and visualization. Our code is publicly available at: https://github.com/yyyykf/FMQM.

cs.CL [Back]

[90] Artificial Intelligence Bias on English Language Learners in Automatic Scoring

Shuchen Guo,Yun Wang,Jichao Yu,Xuansheng Wu,Bilgehan Ayik,Field M. Watts,Ehsan Latif,Ninghao Liu,Lei Liu,Xiaoming Zhai

Main category: cs.CL

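TL;DR: The study examines potential scoring bias and disparities of automatic scoring systems toward English Language Learners (ELLs). It finds no bias when the training data is large enough, but possible bias when the sample size is small.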

Details Motivation: To investigate scoring bias and disparities of automatic scoring systems toward ELLs, in particular the effect of unbalanced training data. Method: Fine-tunes BERT on four datasets (ELLs, non-ELLs, unbalanced mixed, and balanced mixed) and analyzes scoring accuracy on 21 assessment items. Result: No bias was found when the training data was large (ELL = 30,000 or ELL = 1,000), but bias may exist when the sample size is small (ELL = 200). Conclusion: Bias of automatic scoring systems toward ELLs is related to training data size and can be avoided when the sample is large enough. Abstract: This study investigated potential scoring biases and disparities toward English Language Learners (ELLs) when using automatic scoring systems for middle school students' written responses to science assessments. We specifically focus on examining how unbalanced training data with ELLs contributes to scoring bias and disparities. We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs (unbalanced), and (4) a balanced mixed dataset with equal representation of both groups. The study analyzed 21 assessment items: 10 items with about 30,000 ELL responses, five items with about 1,000 ELL responses, and six items with about 200 ELL responses. Scoring accuracy (Acc) was calculated and compared to identify bias using Friedman tests. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated through both the human and AI models to identify the scoring disparities. We found that no AI bias and distorted disparities between ELLs and non-ELLs were found when the training dataset was large enough (ELL = 30,000 and ELL = 1,000), but concerns could exist if the sample size is limited (ELL = 200).
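
The Mean Score Gap comparison described in the abstract can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names, the direction of the gap (non-ELL mean minus ELL mean), and the toy scores are all assumptions.

```python
from statistics import mean


def mean_score_gap(non_ell_scores, ell_scores):
    """Mean Score Gap (MSG): assumed here to be non-ELL mean minus ELL mean."""
    return mean(non_ell_scores) - mean(ell_scores)


def msg_difference(human, ai):
    """Difference between the AI-scored gap and the human-scored gap.

    `human` and `ai` are dicts with 'non_ell' and 'ell' score lists.
    A value near zero means the AI neither widens nor narrows the gap.
    """
    human_gap = mean_score_gap(human["non_ell"], human["ell"])
    ai_gap = mean_score_gap(ai["non_ell"], ai["ell"])
    return ai_gap - human_gap


# Hypothetical scores for one assessment item (not data from the study).
human = {"non_ell": [2, 3, 3, 2], "ell": [1, 2, 2, 1]}
ai = {"non_ell": [2, 3, 3, 2], "ell": [1, 1, 2, 1]}

print("human MSG:", mean_score_gap(human["non_ell"], human["ell"]))
print("AI MSG:", mean_score_gap(ai["non_ell"], ai["ell"]))
print("difference:", msg_difference(human, ai))
```

In this toy item the AI scores ELL responses slightly lower than the human raters do, so the AI gap is wider than the human gap and the difference is positive.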

[91] GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?

Bowen Jiang,Yangxinyu Xie,Xiaomeng Wang,Jiashu He,Joshua Bergerson,John K Hutchison,Jordan Branham,Camillo J Taylor,Tanwi Mallick

Main category: cs.CL

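TL;DR: GeoGrid-Bench is a benchmark for evaluating how well foundation models handle grid-structured geo-spatial data, built from large-scale real-world data and covering a variety of task types.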

Details Motivation: Geo-spatial data poses distinct challenges, including dense numerical values, spatial and temporal dependencies, and multimodal representations, so the suitability of foundation models for this domain needs to be assessed. Method: The benchmark covers 16 climate variables, 150 locations, and approximately 3,200 question-answer pairs, with diverse tasks generated from expert-curated templates. Result: Vision-language models perform best overall, and the study analyzes the strengths and limitations of different models across geo-spatial tasks. Conclusion: GeoGrid-Bench offers clearer insight into how foundation models can be applied effectively to geo-spatial data analysis. Abstract: We present GeoGrid-Bench, a benchmark designed to evaluate the ability of foundation models to understand geo-spatial data in the grid structure. Geo-spatial datasets pose distinct challenges due to their dense numerical values, strong spatial and temporal dependencies, and unique multimodal representations including tabular data, heatmaps, and geographic visualizations. To assess how foundation models can support scientific research in this domain, GeoGrid-Bench features large-scale, real-world data covering 16 climate variables across 150 locations and extended time frames. The benchmark includes approximately 3,200 question-answer pairs, systematically generated from 8 domain expert-curated templates to reflect practical tasks encountered by human scientists. These range from basic queries at a single location and time to complex spatiotemporal comparisons across regions and periods. Our evaluation reveals that vision-language models perform best overall, and we provide a fine-grained analysis of the strengths and limitations of different foundation models in different geo-spatial tasks. This benchmark offers clearer insights into how foundation models can be effectively applied to geo-spatial data analysis and used to support scientific research.

[92] A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Jean-Philippe Corbeil,Amin Dada,Jean-Michel Attendu,Asma Ben Abacha,Alessandro Sordoni,Lucas Caccia,François Beaulieu,Thomas Lin,Jens Kleesiek,Paul Vozila

Main category: cs.CL

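TL;DR: The paper proposes a new framework that adapts small language models (SLMs) to the clinical domain efficiently through pre-instruction tuning, model merging, and task alignment, yielding substantial performance gains.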

Details Motivation: The high computation cost and latency of large language models such as GPT-4 limit their deployment in clinical settings. Small language models (SLMs) are cheaper but have limited capacity, and clinical data is scarce and sensitive. Method: Proposes a new framework comprising pre-instruction tuning, model merging, and clinical-tasks alignment, and develops the MediPhi collection of SLMs and the MediFlow synthetic dataset. Result: MediPhi clearly outperforms the base model on the CLUE+ benchmark and even surpasses GPT-4-0125 on some tasks; alignment brings further gains of 18.9% on average. Conclusion: The framework offers an efficient, low-cost way to adapt SLMs to the clinical domain, with broad potential for application. Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.

[93] AI-enhanced semantic feature norms for 786 concepts

Siddharth Suresh,Kushin Mukherjee,Tyler Giallanza,Xizheng Yu,Mia Patil,Jonathan D. Cohen,Timothy T. Rogers

Main category: cs.CL

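TL;DR: The paper introduces an approach that augments human-generated feature norms with large language model (LLM) responses, verifies their quality, and shows that the resulting AI-enhanced feature norm dataset, NOVA, is better at predicting human semantic similarity judgments.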

Details Motivation: Because norming studies are labor-intensive, traditional semantic feature norm methods face a trade-off between concept/feature coverage and verifiable quality. Method: Builds the AI-enhanced feature norm dataset NOVA by combining human-generated feature norms with LLM responses and verifying their quality. Result: NOVA shows higher feature density and more overlap among concepts, and it outperforms a human-only norm dataset and word-embedding models in predicting human semantic similarity judgments. Conclusion: Human conceptual knowledge is richer than previous datasets capture, and properly validated LLMs can serve as powerful tools for cognitive science research. Abstract: Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.

[94] Tracr-Injection: Distilling Algorithms into Pre-trained Language Models

Tomás Vergara-Browne,Álvaro Soto

Main category: cs.CL

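TL;DR: The paper proposes tracr-injection, a method that distills algorithms written in the RASP language directly into a pre-trained language model, demonstrates its effectiveness, and shows improved out-of-distribution performance.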

Details Motivation: With the rise of large language models, researchers have tried to formally characterize the symbolic abilities of the transformer architecture, but tasks implementable in RASP are hard to learn from natural unsupervised data, leaving a gap between theory and practice. Method: Proposes tracr-injection, which injects RASP algorithms directly into a pre-trained language model, and demonstrates it experimentally. Result: The method creates an interpretable subspace in the model that can be decoded into the variables of the RASP code, and it improves out-of-distribution performance. Conclusion: tracr-injection introduces a symbolic mechanism into the model's inner workings, and the experiments support its effectiveness. Abstract: Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are often uncommon to learn from natural unsupervised data, showing a mismatch between theoretical capabilities of the transformer architecture, and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model's residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out of distribution performance compared to our baseline, indicating that indeed a more symbolic mechanism is taking place in the inner workings of the model. We release the code used to run our experiments.

[95] Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

Ximing Dong,Shaowei Wang,Dayi Lin,Ahmed E. Hassan

Main category: cs.CL

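TL;DR: IPOMP is a two-stage method that selects representative evaluation samples through semantic clustering and boundary analysis, then iteratively refines the selection using real-time model performance data, markedly improving the effectiveness and stability of prompt optimization.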

Details Motivation: Manual prompt engineering is inefficient, existing automated optimization methods are unreliable because they rely on randomly selected evaluation subsets, and existing coreset selection methods are unsuitable for prompt optimization. Method: IPOMP works in two stages: (1) select samples using semantic clustering and boundary analysis; (2) iteratively replace redundant samples using real-time model performance data. Result: On the BIG-bench dataset, IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57%, with computational overhead below 1%. Conclusion: IPOMP optimizes prompts effectively, and its real-time performance-guided refinement can be applied generally to enhance existing coreset selection methods. Abstract: Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.

[96] SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval

Qiwei Peng,Robert Moro,Michal Gregor,Ivan Srba,Simon Ostermann,Marian Simko,Juraj Podroužek,Matúš Mesarčík,Jaroslav Kopčan,Anders Søgaard

Main category: cs.CL

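TL;DR: The paper presents the SemEval 2025 shared task on multilingual claim retrieval, which targets online disinformation in multilingual and low-resource language settings.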

Details Motivation: The rapid spread of online disinformation is a global challenge, yet multilingual settings and low-resource languages are often neglected in this field. Method: The shared task has two subtracks, monolingual and crosslingual claim retrieval; 179 registered participants contributed 52 test submissions. Result: 23 of 31 teams submitted system papers, and the paper reports the best-performing systems and the most common and most effective approaches. Conclusion: The task and its dataset provide valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research. Abstract: The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountered claims expressed in social media posts across different languages. The task includes two subtracks: (1) a monolingual track, where social posts and claims are in the same language, and (2) a crosslingual track, where social posts and claims might be in different languages. A total of 179 participants registered for the task contributing to 52 test submissions. 23 out of 31 teams have submitted their system papers. In this paper, we report the best-performing systems as well as the most common and the most effective approaches across both subtracks. This shared task, along with its dataset and participating systems, provides valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research in this field.

[97] Ranked Voting based Self-Consistency of Large Language Models

Weiqin Wang,Yile Wang,Hui Huang

Main category: cs.CL

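TL;DR: Proposes generating ranked answers and applying ranked voting to make the self-consistency of chain-of-thought reasoning more reliable, outperforming conventional majority voting.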

Details Motivation: Conventional chain-of-thought reasoning methods generate only a single answer per trial, ignoring other potential answers, so the voting process makes poor use of the available information. Method: Generates ranked answers in each reasoning process and votes with three ranked voting methods: instant-runoff voting, Borda count voting, and mean reciprocal rank voting. Result: Experiments on six datasets show that the method outperforms the baselines, confirming that ranked answer information and ranked voting improve reasoning performance. Conclusion: Using ranked answer information and ranked voting can significantly improve the self-consistency and performance of chain-of-thought reasoning. Abstract: Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest "self-consistency" among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: Instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available at https://github.com/szu-tera/RankedVotingSC.
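
The three ranked voting rules named in the abstract are standard, and a minimal sketch of each over a set of ranked answer lists may help. This is not the authors' released code: the function names, the alphabetical tie-breaking, and the toy rankings are assumptions, and every ranking is assumed to be a non-empty list with the best answer first.

```python
from collections import Counter, defaultdict


def borda_count(rankings):
    """Each ranking gives n-1 points to its top answer, n-2 to the next, and so on."""
    n = len({answer for ranking in rankings for answer in ranking})
    scores = defaultdict(float)
    for ranking in rankings:
        for position, answer in enumerate(ranking):
            scores[answer] += n - 1 - position
    return max(sorted(scores), key=scores.get)  # ties broken alphabetically (assumption)


def mean_reciprocal_rank(rankings):
    """Each answer scores the average of 1/rank over all rankings (0 if absent)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, answer in enumerate(ranking):
            scores[answer] += 1.0 / (position + 1)
    return max(sorted(scores), key=lambda a: scores[a] / len(rankings))


def instant_runoff(rankings):
    """Repeatedly drop the answer with the fewest first-choice votes until one has a majority."""
    remaining = {answer for ranking in rankings for answer in ranking}
    while True:
        tallies = Counter()
        for ranking in rankings:
            for answer in ranking:
                if answer in remaining:      # first surviving choice on this ballot
                    tallies[answer] += 1
                    break
        total = sum(tallies.values())
        leader, votes = max(sorted(tallies.items()), key=lambda kv: kv[1])
        if votes * 2 > total or len(remaining) == 1:
            return leader
        loser = min(sorted(remaining), key=lambda a: tallies[a])
        remaining.discard(loser)


# Hypothetical ranked answers from five sampled responses.
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["B", "C", "A"],
]
print(borda_count(rankings), mean_reciprocal_rank(rankings), instant_runoff(rankings))
```

In this toy case plain majority voting over the top answers is a tie between A and B (two votes each), whereas all three ranked rules pick B because B is ranked consistently high on the ballots that do not place it first.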

[98] A Systematic Analysis of Base Model Choice for Reward Modeling

Kian Ahrabian,Pegah Jandaghi,Negar Mokhberian,Sai Praneeth Karimireddy,Jay Pujara

Main category: cs.CL

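TL;DR: The study analyzes how base model choice affects reward modeling performance, finding that performance can improve by up to 14%, and shows a strong statistical relation between existing benchmarks and downstream performance.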

Details Motivation: With the rapid growth of large language models (LLMs), the effect of base model choice on reward modeling (RM) performance has been overlooked; this study aims to fill that gap. Method: Systematically analyzes the effect of base model selection on reward modeling performance, combines results from a small set of benchmarks, and explores the impact of post-training steps. Result: Performance improves by up to 14% over the most common choice, benchmark results correlate strongly with downstream performance, and combining a small set of benchmarks improves model selection (+18% on average in the top 5-10). Conclusion: Base model choice is crucial for reward modeling performance, and combining benchmark results with attention to post-training can significantly improve outcomes. Abstract: Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection ($+$18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.

[99] Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation

Zhan Peng Lee,Andre Lin,Calvin Tan

Main category: cs.CL

TL;DR: Finetune-RAG improves the factual accuracy of RAG through fine-tuning, and the paper also proposes Bench-RAG, an evaluation pipeline.

Details Motivation: To address LLM hallucinations caused by irrelevant retrieved information in RAG. Method: Constructs a RAG training dataset that mimics real-world imperfections and applies the Finetune-RAG fine-tuning approach. Result: Finetune-RAG improves factual accuracy by 21.2% over the base model. Conclusion: Finetune-RAG and Bench-RAG provide effective tools for improving and evaluating RAG. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose a Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open sourced for community use.

[100] Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets

Erica Cai,Sean McQuade,Kevin Young,Brendan O'Connor

Main category: cs.CL

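TL;DR: AffilKG is a collection of six datasets that are the first to pair complete book scans with large, labeled knowledge graphs, enabling evaluation of the accuracy of knowledge graph extraction.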

Details Motivation: Current annotated datasets cannot be used to evaluate the accuracy of knowledge graph extraction because their graphs are highly disconnected, too small, or overly complex. Method: Introduces AffilKG, a collection of six datasets covering simple affiliation graphs and, for three of them, expanded graphs with more relation types. Result: Preliminary experiments show significant variability in model performance across datasets, underscoring AffilKG's ability to benchmark how extraction errors propagate and to validate extraction methods. Conclusion: AffilKG is an important resource for assessing knowledge graph extraction accuracy and its downstream analyses, particularly in social science research. Abstract: When knowledge graphs (KGs) are automatically extracted from text, are they accurate enough for downstream analysis? Unfortunately, current annotated datasets can not be used to evaluate this question, since their KGs are highly disconnected, too small, or overly complex. To address this gap, we introduce AffilKG (https://doi.org/10.5281/zenodo.15427977), which is a collection of six datasets that are the first to pair complete book scans with large, labeled knowledge graphs. Each dataset features affiliation graphs, which are simple KGs that capture Member relationships between Person and Organization entities -- useful in studies of migration, community interactions, and other social phenomena. In addition, three datasets include expanded KGs with a wider variety of relation types. Our preliminary experiments demonstrate significant variability in model performance across datasets, underscoring AffilKG's ability to enable two critical advances: (1) benchmarking how extraction errors propagate to graph-level analyses (e.g., community structure), and (2) validating KG extraction methods for real-world social science research.

[101] Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances

Chen-Chi Chang,Chong-Fu Li,Chu-Hsuan Lee,Hung-Shin Lee

Main category: cs.CL

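TL;DR: The study combines Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to explore the challenges of translating low-resource languages such as Hakka, testing several model configurations. The best model (Model 4) reaches a BLEU score of 31% and markedly improves lexical coverage and grammatical coherence.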

Details Motivation: To address the difficulty of translating low-resource languages, especially the accurate translation of cultural or specialized terms. Method: Tests several model configurations, including a dictionary-only approach, a two-stage method in which dictionary outputs are refined by Gemini 2.0 (Model 3), and Model 4, which combines retrieval with advanced language modeling. Result: Model 4 performs best (BLEU 31%) and the dictionary-only approach performs worst (BLEU 12%). The two-stage method (Model 3) reaches 26%, showing the value of iterative correction. Conclusion: The study emphasizes the importance of curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework for improving translation accuracy and supporting cultural preservation. Abstract: This study investigates the challenges of translating low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). Various model configurations were tested on Hakka translations, with BLEU scores ranging from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). The best-performing model (Model 4) combined retrieval and advanced language modeling, improving lexical coverage, particularly for specialized or culturally nuanced terms, and enhancing grammatical coherence. A two-stage method (Model 3) using dictionary outputs refined by Gemini 2.0 achieved a BLEU score of 26%, highlighting iterative correction's value and the challenges of domain-specific expressions. Static dictionary-based approaches struggled with context-sensitive content, demonstrating the limitations of relying solely on predefined resources. These results emphasize the need for curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework that improves translation accuracy and fluency while supporting cultural preservation.

[102] Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

Songjun Tu,Jiahao Lin,Qichao Zhang,Xiangyu Tian,Linjing Li,Xiangyuan Lan,Dongbin Zhao

Main category: cs.CL

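TL;DR: AutoThink uses a multi-stage reinforcement learning framework to adjust the explicit reasoning behavior of large reasoning models dynamically, invoking detailed reasoning only when necessary and improving both efficiency and accuracy.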

Details Motivation: To address the computational overhead and latency that large reasoning models incur by reasoning in excessive detail on simple problems. Method: Building on R1-style distilled models, uses an ellipsis in the prompt to trigger different reasoning modes and optimizes the reasoning policy with multi-stage reinforcement learning. Result: On five mainstream mathematical benchmarks, AutoThink achieves a better accuracy-efficiency trade-off than existing methods; on DeepSeek-R1-Distill-Qwen-1.5B it improves relative accuracy by 6.4 percent while reducing token usage by 52 percent. Conclusion: AutoThink offers a scalable, adaptive reasoning paradigm for large reasoning models. Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.

[103] Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs

Abhishek Dey,Aabha Bothera,Samhita Sarikonda,Rishav Aryan,Sanjay Kumar Podishetty,Akshay Havalgi,Gaurav Singh,Saurabh Srivastava

Main category: cs.CL

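TL;DR: Studies the challenges of event detection on social media and compares unimodal and multimodal approaches, finding that multimodal approaches perform better but that generative models lag behind supervised methods in precision.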

Details Motivation: Traditional unimodal systems struggle with the rapid, multimodal nature of social media data, so the study examines the effectiveness of multimodal and generative models. Method: Uses unimodal models (ModernBERT, ConvNeXt-V2), multimodal fusion techniques, and generative models (GPT-4o, LLaVA), and also evaluates generative models when given a single modality. Result: Multimodal approaches outperform unimodal ones, but generative models trail supervised methods in precision and have difficulty generating event classes correctly. Generative models do handle common social media issues such as leet speak effectively. Conclusion: Multimodal approaches perform better for event detection, but generative models need further improvement in precision. Abstract: In this paper, we study the challenges of detecting events on social media, where traditional unimodal systems struggle due to the rapid and multimodal nature of data dissemination. We employ a range of models, including unimodal ModernBERT and ConvNeXt-V2, multimodal fusion techniques, and advanced generative models like GPT-4o, and LLaVA. Additionally, we also study the effect of providing multimodal generative models (such as GPT-4o) with a single modality to assess their efficacy. Our results indicate that while multimodal approaches notably outperform unimodal counterparts, generative approaches despite having a large number of parameters, lag behind supervised methods in precision. Furthermore, we also found that they lag behind instruction-tuned models because of their inability to generate event classes correctly. During our error analysis, we discovered that common social media issues such as leet speak, text elongation, etc. are effectively handled by generative approaches but are hard to tackle using supervised approaches.

[104] Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?

Tairan Fu,Miguel González,Javier Conde,Elena Merino-Gómez,Pedro Reviriego

Main category: cs.CL

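TL;DR: Multimodal Large Language Models (MLLMs) that can answer complex questions about images struggle to tell the time on analog clocks, probably because their training data lacks images of clocks at different times. The study explores this with GPT-4.1 and finds that, although models are making progress at reading the time, they may only have learned patterns in their training data rather than a real ability to abstract and generalize.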

Details Motivation: To understand why MLLMs fail to tell the time on analog clocks and whether fine-tuning can solve the problem. Method: Tests GPT-4.1 on different clocks and analyzes the limits of its ability to read the time. Result: Models are making progress at reading the time but may rely only on patterns in their training data, lacking the ability to abstract and generalize. Conclusion: MLLMs remain limited at reading analog clocks, and their ability to abstract and generalize needs further study. Abstract: Multimodal Large Language Models which can answer complex questions on an image struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs: GPT-4.1 to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs to abstract and generalize.

[105] Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate

Ziyang Huang,Wangtao Sun,Jun Zhao,Kang Liu

Main category: cs.CL

TL;DR: The paper proposes SIAR and R$^3$, which use LLMs to induce inferential rules and re-estimate rule relevance, improving rule retrieval accuracy and reasoning performance.

Details Motivation: Existing rule retrieval methods suffer from low accuracy because of the semantic gap between the instantiated facts in queries and the abstract representations of rules, which hurts downstream reasoning. Method: Proposes SIAR, which uses LLMs to induce potential inferential rules for query augmentation, and introduces R$^3$ to re-estimate the relevance of retrieved rules. Result: Experiments show the proposed methods are effective and versatile across various settings. Conclusion: SIAR and R$^3$ significantly improve rule retrieval accuracy and reasoning performance. Abstract: This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to search directly for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R$^3$), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and the helpfulness for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.

[106] A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Ada Chen,Yongjiang Wu,Junyuan Zhang,Shu Yang,Jen-tse Huang,Kun Wang,Wenxuan Wang,Shuai Wang

Main category: cs.CL

TL;DR: The paper systematizes the safety and security threats of AI-based Computer-Using Agents (CUAs), covering a definition, a threat categorization, defensive strategies, and evaluation methods, to guide future research and practice.

Details Motivation: As CUAs grow more capable, their safety and security problems become more complex, calling for systematic study of the potential risks. Method: A literature review organized around four objectives: defining CUAs, categorizing threats, proposing a taxonomy of defensive strategies, and summarizing evaluation methods. Result: A safety analysis framework for CUAs, including threat categories, defensive strategies, and evaluation criteria. Conclusion: The work gives researchers a structured foundation for exploring CUA safety and gives practitioners guidance on secure design. Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

[107] Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents

Jiaxing Zhao,Hongbin Xie,Yuzhen Lei,Xuan Song,Zhuoran Shi,Lianxin Li,Shuangxue Liu,Haoran Zhang

Main category: cs.CL

TL;DR: The paper proposes Cochain, a framework that combines knowledge and prompts to solve business workflow collaboration problems at reduced cost, outperforming existing methods.

Details Motivation: Existing single-agent chain-of-thought and multi-agent systems face collaboration challenges and high costs in business workflow tasks. Method: Builds an integrated knowledge graph and maintains a prompt tree, combining knowledge and prompt information from multiple stages. Result: Cochain outperforms baselines across multiple datasets, and a small model combined with Cochain even outperforms GPT-4. Conclusion: Cochain is an efficient, low-cost collaboration prompting framework suited to business workflow tasks. Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. Single-agent chain-of-thought, due to the inherent complexity of designing cross-domain prompts, faces collaboration challenges. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves business workflow collaboration problems by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompt tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.

[108] Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang

Main category: cs.CL

TL;DR: The paper introduces OmniThought, a dataset of 2 million CoT processes generated by powerful LRMs and annotated with RV and CD scores, which improves LRM training effectiveness and reasoning ability.

Details Motivation: Current CoT datasets are not comprehensive enough to support further LRM development, so a more comprehensive dataset is needed. Method: Two powerful LRMs serve as teacher models to generate and validate CoT processes; RV and CD scores are introduced, and a self-reliant pipeline curates the dataset. Result: Experiments show that OmniThought and its scores improve the training of Qwen2.5 models and yield high-performing LRMs. Conclusion: The OmniThought dataset and scoring system advance the development and training of LRMs for complex tasks. Abstract: The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.

[109] Accurate KV Cache Quantization with Outlier Tokens Tracing

Yi Su,Yuechi Zhou,Quantong Qiu,Juntao Li,Qingrong Xia,Ping Li,Xinyu Duan,Zhefeng Wang,Min Zhang

Main category: cs.CL

TL;DR: Identifying outlier tokens improves the accuracy of KV Cache quantization while substantially reducing memory usage and increasing throughput.

Details Motivation: LLM deployment consumes substantial computational resources. KV Cache quantization can balance memory usage and accuracy, but outlier tokens degrade quantization quality. Method: Develops a method that identifies outlier tokens during decoding and excludes them from quantization to improve accuracy. Result: Significant accuracy gains under 2-bit quantization, with a 6.4x reduction in memory usage and a 2.3x increase in throughput. Conclusion: Excluding outlier tokens from quantization effectively addresses the accuracy problem in KV Cache quantization. Abstract: The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
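To make the idea of excluding outlier tokens concrete, the following is a minimal sketch rather than the authors' implementation. It applies uniform channel-wise quantization to a small Key matrix and keeps designated outlier rows in full precision. The function names, the uniform min/max quantizer, and the explicit `outlier_rows` argument are all assumptions for illustration; the abstract does not describe how outlier tokens are detected.

```python
def quantize(values, bits=2):
    """Uniform min/max quantization of a list of floats; returns dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_keys_channelwise(keys, outlier_rows=frozenset(), bits=2):
    """Quantize each channel (column) of `keys` across tokens (rows).

    Rows listed in `outlier_rows` are hypothetical outlier tokens: they are
    left in full precision and do not influence the quantization range.
    """
    kept = [i for i in range(len(keys)) if i not in outlier_rows]
    out = [list(row) for row in keys]
    for c in range(len(keys[0])):
        column = [keys[i][c] for i in kept]
        if not column:
            continue
        for i, v in zip(kept, quantize(column, bits)):
            out[i][c] = v
    return out


def mean_abs_error(original, dequantized, rows):
    """Mean absolute reconstruction error over the given token rows."""
    diffs = [abs(a - b)
             for i in rows
             for a, b in zip(original[i], dequantized[i])]
    return sum(diffs) / len(diffs)


# Four ordinary tokens plus one token (row 4) with an unusual value range.
keys = [[0.1, 1.0], [0.2, 1.1], [0.3, 0.9], [0.25, 1.05], [8.0, -6.0]]
normal = [0, 1, 2, 3]
err_all = mean_abs_error(keys, quantize_keys_channelwise(keys), normal)
err_excl = mean_abs_error(keys, quantize_keys_channelwise(keys, {4}), normal)
print(err_all, err_excl)
```

In this toy setup the single unusual row stretches the per-channel range, so the ordinary tokens collapse onto very few quantization levels. Leaving that row out of the quantized set tightens the range and lowers the error on the remaining tokens.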

[110] GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction

Mohammadtaha Bagherifard,Sahar Rajabi,Ali Edalat,Yadollah Yaghoobzadeh

Main category: cs.CL

TL;DR: The paper proposes GenKnowSub, a modular framework that separates general knowledge from task-specific knowledge to improve the zero-shot generalization of large language models.

Details Motivation: Large language models struggle with zero-shot generalization, which the authors attribute to the entanglement of general knowledge and task-specific adaptations. Method: Builds a library of task-specific LoRA modules alongside a general-domain LoRA, obtains task-relevant residual modules through general knowledge subtraction (GenKnowSub), and combines modules dynamically. Result: On Phi-3 and Phi-2, GenKnowSub yields gains in both monolingual and cross-lingual settings. Conclusion: GenKnowSub improves generalization and applies to LLMs of differing strength. Abstract: Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at https://github.com/saharsamr/Modular-LLM.
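The subtraction step can be illustrated with a small sketch. It assumes each LoRA module is a pair of low-rank factors `(B, A)` whose product `B @ A` is the weight update, and that the subtraction is carried out on those dense updates. Both points are assumptions: the abstract does not say whether the subtraction acts on the factors or on their product, and all function names here are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_delta(B, A):
    """Dense weight update implied by LoRA factors B (d x r) and A (r x k)."""
    return matmul(B, A)


def residual_update(task_lora, general_lora):
    """Task update minus general-knowledge update (assumed weight-space subtraction)."""
    task = lora_delta(*task_lora)
    general = lora_delta(*general_lora)
    return [[t - g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(task, general)]


# Rank-1 toy modules acting on a 2 x 3 weight matrix.
task_lora = ([[1.0], [2.0]], [[1.0, 0.0, 1.0]])
general_lora = ([[1.0], [0.0]], [[1.0, 0.0, 0.0]])
print(residual_update(task_lora, general_lora))
```

In this toy case the component shared with the general module (the top-left entry) cancels out, leaving only the directions specific to the task module.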

[111] Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer

Seungyoon Lee,Seongtae Hong,Hyeonseok Moon,Heuiseok Lim

Main category: cs.CL

TL;DR: The paper proposes SALT, a new cross-lingual transfer technique that uses embeddings from target-language pre-trained models to improve the performance of large language models in the target language.

Details Motivation: Existing methods for transferring LLMs to a target language may constrain expressive capacity in that language, because the source model is trained mainly on English data. Method: SALT derives unique regression lines from the similarity in the overlap of the source and target vocabularies to handle each non-overlapping token's embedding space. Result: Experiments show that SALT achieves lower loss and faster convergence during language adaptation and performs strongly on cross-lingual understanding tasks. Conclusion: SALT demonstrates the scalable use of pre-trained models to enhance LLM functionality. Abstract: Large Language Models (LLMs) increasingly incorporate multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model's embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embeddings to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies, to handle each non-overlapping token's embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods and achieves lower loss and faster convergence during language adaptation. Notably, SALT obtains remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.

[112] The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs

Makoto Sato

Main category: cs.CL

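TL;DR: The paper examines the mechanisms behind the behavior of large language models (LLMs), using Conceptual Blending Theory (CBT) and prompt-based methods to reveal how LLMs compress and blend meaning, and compares artificial and biological cognition.

Details Motivation: To understand the mechanisms behind the personality-like and intelligent behaviors of LLMs and to explore their parallels and divergences with biological cognition. Method: Uses CBT as an experimental framework, systematically studying how LLMs process meaning through Prompt-Induced Transitions (PIT) and Prompt-Induced Hallucinations (PIH). Result: Reveals structural parallels and divergences between artificial and biological cognition and shows the potential of prompt engineering as a scientific method. Conclusion: Prompt engineering is not only a technical tool but also a scientific method for probing the deep structure of meaning, offering a prototype for a future cognitive science built on human-AI collaboration. Abstract: Large language models (LLMs), inspired by neuroscience, exhibit behaviors that often evoke a sense of personality and intelligence, yet the mechanisms behind these effects remain elusive. Here, we operationalize Conceptual Blending Theory (CBT) as an experimental framework, using prompt-based methods to reveal how LLMs blend and compress meaning. By systematically investigating Prompt-Induced Transitions (PIT) and Prompt-Induced Hallucinations (PIH), we uncover structural parallels and divergences between artificial and biological cognition. Our approach bridges linguistics, neuroscience, and empirical AI research, demonstrating that human-AI collaboration can serve as a living prototype for the future of cognitive science. This work proposes prompt engineering not just as a technical tool, but as a scientific method for probing the deep structure of meaning itself.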

[113] Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He,Jacob Whitehill

Main category: cs.CL

TL;DR: The paper surveys recent progress in end-to-end (E2E) multi-speaker automatic speech recognition (ASR), covering an architectural taxonomy, algorithmic improvements, long-form speech processing, and benchmark comparisons, and discusses future research directions.

Details Motivation: Monaural multi-speaker ASR remains challenging because of data scarcity and the intrinsic difficulty of recognizing words and attributing them to speakers in overlapping speech. E2E architectures reduce error propagation and better exploit the synergy between speech content and speaker identity. Method: Provides a systematic taxonomy of E2E neural approaches to multi-speaker ASR, analyzing the SIMO and SISO architectures, algorithmic improvements, long-form speech strategies, and benchmark comparisons. Result: Reviews recent advances in E2E multi-speaker ASR and compares methods on standard benchmarks. Conclusion: Discusses open challenges and future research directions for building robust and scalable multi-speaker ASR. Abstract: Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

[114] Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Jingcheng Niu,Subhabrata Dutta,Ahmed Elshabrawy,Harish Tayyar Madabushi,Iryna Gurevych

Main category: cs.CL

TL;DR: The paper studies the mechanism of in-context learning (ICL) in large-scale Transformer language models (LMs). Experiments and analysis indicate that ICL is more than memorization of data but falls short of a symbolic algorithm, sitting somewhere between the two.

Details Motivation: To understand the mechanism of ICL, clarify whether it is merely data memorization or a symbolic algorithm, and explore the implications for model development and AI safety. Method: Uses the Pythia scaling suite and its interim checkpoints to systematically study ICL performance on downstream tasks, together with a mechanistic analysis of the residual stream's subspace. Result: ICL lies between data memorization and a symbolic algorithm; the study also clarifies the influence of training dynamics, model capabilities, and mechanistic interpretability. Conclusion: The work advances the understanding of ICL and provides a basis for model improvement and AI safety. Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

[115] Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs

Ye Kyaw Thu,Thazin Myint Oo

Main category: cs.CL

TL;DR: The paper studies syllable sequence prediction in Abugida languages with Transformer-based models, focusing on six languages. It finds that consonant sequences are critical to prediction accuracy, while vowel sequences are considerably harder.

Details Motivation: To explore the feasibility of syllable sequence prediction in Abugida languages in support of applications such as text prediction and spelling correction. Method: Uses Transformer models to reconstruct complete syllable sequences from incomplete inputs, including consonant sequences, vowel sequences, partial syllables, and masked syllables. Result: Consonant sequences yield strong predictions, vowel sequences are more challenging, and the model is robust on partial and masked syllable reconstruction. Conclusion: The study advances the understanding of sequence prediction for Abugida languages and offers practical insights for related applications. Abstract: This paper explores syllable sequence prediction in Abugida languages using Transformer-based models, focusing on six languages: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai, from the Asian Language Treebank (ALT) dataset. We investigate the reconstruction of complete syllable sequences from various incomplete input types, including consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments reveal that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, while vowel sequences present a significantly greater challenge. The model demonstrates robust performance across tasks, particularly in handling partial and masked syllable reconstruction, with strong results for tasks involving consonant information and syllable masking. This study advances the understanding of sequence prediction for Abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.

[116] Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models

Jiangxu Wu,Cong Wang,TianHuang Su,Jun Yang,Haozhi Lin,Chao Zhang,Ming Peng,Kai Shi,SongPan Yang,BinQing Pan,ZiXian Li,Ni Yang,ZhenYu Yang

Main category: cs.CL

TL;DR: The paper proposes Review-Instruct, a framework in which multiple agent roles iteratively generate high-quality multi-turn dialogue data, improving LLM performance on conversational tasks.

Details Motivation: Existing single-turn supervised fine-tuning data limits contextual coherence in multi-turn dialogue, and existing generation methods struggle to balance diversity and quality. Method: Proposes Review-Instruct, which iteratively refines instructions through an "Ask-Respond-Review" process and generates data with multiple agent roles (a Candidate, Reviewers, and a Chairman). Result: Significant improvements on MT-Bench, MMLU-Pro, and Auto-Arena, with absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench. Conclusion: Through multi-agent iterative refinement, Review-Instruct generates high-quality dialogue data efficiently and improves LLM performance. Abstract: The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative "Ask-Respond-Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.

[117] StRuCom: A Novel Dataset of Structured Code Comments in Russian

Maria Dziuba,Valentin Malykh

Main category: cs.CL

TL;DR: StRuCom is the first large-scale dataset (153K examples) for Russian code documentation. By combining human-written comments from Russian GitHub repositories with synthetically generated ones, it improves the performance of models that generate Russian code comments.

Details Motivation: Existing machine learning models perform poorly at generating code comments in Russian, and StRuCom aims to fill this gap. Method: Combines human-written comments from Russian GitHub repositories with synthetically generated content, using automated validation to ensure compliance with the standards of several programming languages. Result: Qwen2.5-Coder models fine-tuned on StRuCom significantly outperform baseline models on chrF++ and BERTScore. Conclusion: StRuCom provides a high-quality dataset for Russian code comment generation and improves model performance. Abstract: Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom, the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements in chrF++ and BERTScore over baseline models.

[118] OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning

Xiao Zhang,Huiyuan Lai,Qianru Meng,Johan Bos

Main category: cs.CL

TL;DR: The paper introduces OntoURL, a benchmark for systematically evaluating how well large language models (LLMs) handle ontological knowledge. LLMs do well on understanding but show substantial weaknesses in reasoning and learning tasks.

Details Motivation: Although LLMs excel at natural language processing tasks, their ability to process structured symbolic knowledge remains underexplored. Method: Proposes a taxonomy of LLMs' ontological capabilities and designs the OntoURL benchmark, which assesses understanding, reasoning, and learning through 15 tasks comprising 58,981 questions. Result: Experiments show that current LLMs are proficient at understanding ontological knowledge but have substantial weaknesses in reasoning and learning tasks. Conclusion: OntoURL is a critical benchmark for advancing the integration of LLMs with formal knowledge representations and exposes fundamental limitations in how LLMs process symbolic knowledge. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of LLMs' ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs' proficiency in handling ontologies -- formal, symbolic representations of domain knowledge through concepts, relationships, and instances. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 58,981 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing proficiency in understanding ontological knowledge but substantial weaknesses in reasoning and learning tasks. These findings highlight fundamental limitations in LLMs' capability to process symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.

[119] CAMEO: Collection of Multilingual Emotional Speech Corpora

Iwona Christop,Maciej Czajka

Main category: cs.CL

TL;DR: CAMEO is a collection of multilingual emotional speech datasets intended to support emotion recognition research, offering a standardized benchmark and easy access.

Details Motivation: To give emotion recognition research a standardized, easily accessible multilingual dataset that supports reproducible results. Method: Builds CAMEO through dataset selection, curation, and normalization, and publishes it on the Hugging Face platform. Result: Reports performance results for several models; the collection and its metadata are publicly available. Conclusion: CAMEO provides a valuable resource and benchmark for speech emotion recognition research. Abstract: This paper presents CAMEO -- a curated collection of multilingual emotional speech datasets designed to facilitate research in emotion recognition and other speech-related tasks. The main objectives were to ensure easy access to the data, to allow reproducibility of the results, and to provide a standardized benchmark for evaluating speech emotion recognition (SER) systems across different emotional states and languages. The paper describes the dataset selection criteria, the curation and normalization process, and provides performance results for several models. The collection, along with metadata and a leaderboard, is publicly available via the Hugging Face platform.

[120] BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Yapei Chang,Yekyung Kim,Michael Krumdick,Amir Zadeh,Chuan Li,Chris Tanner,Mohit Iyyer

Main category: cs.CL

TL;DR: The paper proposes BLEUBERI, which uses the BLEU metric in place of costly reward models and performs competitively in RL-based alignment.

Details Motivation: Reward models are costly to train, and the growing availability of high-quality synthetic datasets raises the question of whether simpler reference-based metrics can replace them. Method: Develops BLEUBERI, which uses BLEU directly as the reward function together with GRPO. Result: BLEUBERI is competitive with reward model-guided training across several benchmarks and produces more factually grounded outputs. Conclusion: String matching-based metrics are a cheap yet effective substitute for reward models. Abstract: Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
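As a rough illustration of using BLEU as a reward inside a group-relative scheme, the sketch below scores several sampled responses against one reference and normalizes the scores within the group. It is not the authors' code: `simple_bleu` is a simplified, add-one-smoothed BLEU over unigrams and bigrams written for this example (not a standard BLEU implementation), and the advantage formula is the common mean/standard-deviation normalization associated with GRPO.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        total = sum(c.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


reference = "the cat sat on the mat"
samples = ["the cat sat on the mat", "a cat is on a mat", "dogs bark loudly"]
rewards = [simple_bleu(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses closer to the reference receive positive advantages and the rest negative ones, which is the signal a policy-gradient update would then use. No learned reward model is involved at any point.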

[121] Towards Better Evaluation for Generated Patent Claims

Lekang Jiang,Pascal A Scherz,Stephan Goetz

Main category: cs.CL

TL;DR: The paper proposes Patent-CE and PatClaimEval for evaluating the quality of generated patent claims, narrowing the gap between automated metrics and human expert assessment.

Details Motivation: Drafting patent claims is complex and time-consuming, and the metrics used to evaluate automatic generation methods are inconsistent with expert assessments. Method: Introduces the Patent-CE benchmark and the multi-dimensional evaluation method PatClaimEval, and validates their correlation with expert evaluations experimentally. Result: PatClaimEval achieves the highest correlation with expert evaluations among all tested metrics. Conclusion: The research lays the groundwork for accurate evaluation of automated patent claim generation systems. Abstract: Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated the use of large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.

[122] Scaling Reasoning can Improve Factuality in Large Language Models

Mike Zhang,Johannes Bjerva,Russa Biswas

Main category: cs.CL

TL;DR: Prior work shows that longer reasoning chains and more compute improve large language models on mathematical reasoning, but it has been unclear whether factual accuracy in open-domain question answering improves as well. This paper tests that question experimentally.

Details Motivation: To examine whether longer reasoning chains and additional compute improve the factual accuracy of large language models in open-domain question answering. Method: Distills reasoning traces from advanced reasoning models, enriches them with knowledge graph information, fine-tunes a variety of models, and evaluates them on multiple datasets. Result: Smaller reasoning models are more factually accurate than their original instruction-tuned counterparts, and adding test-time compute and token budgets improves accuracy by 2-8%. Conclusion: Test-time scaling effectively improves reasoning accuracy in open-domain QA, and the experimental artifacts are released for further research. Abstract: Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain if longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We initially distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that, when test-time compute and token budgets are added, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.

[123] SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Huashan Sun,Shengyi Liao,Yansen Han,Yu Bai,Yang Gao,Cheng Fu,Weizhou Shen,Fanqi Wan,Ming Yan,Ji Zhang,Fei Huang

Main category: cs.CL

TL;DR: The SoLoPO framework improves the performance of large language models on long-context tasks through short-context preference optimization and short-to-long reward alignment.

Details Motivation: Large language models perform poorly on long-context tasks, mainly because of data quality issues, training inefficiencies, and poorly designed optimization objectives. Method: SoLoPO consists of two parts, short-context preference optimization and short-to-long reward alignment, which together improve the model's handling of long contexts. Result: Experiments show that SoLoPO improves performance on long-context tasks while also improving computational and memory efficiency. Conclusion: SoLoPO offers an efficient and general optimization approach for long-context tasks. Abstract: Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

[124] Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Hrishit Madhavi,Jacob Cherian,Yuvraj Khamkar,Dhananjay Bhagat

Main category: cs.CL

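TL;DR: This paper presents an end-to-end multilingual information extraction and processing system that extracts text from image-based documents and supports translation, summarization, sentiment analysis, and related functions.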

Details Motivation: To address the difficulty of extracting and processing information from image-based documents in multilingual settings, and to narrow the language gap. Method: Combines OCR (Tesseract) with large language model APIs (Gemini) for cross-lingual translation and summarization, supplemented by sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex). Result: The system is delivered through a Gradio interface and demonstrates a practical application for improving access to information in image-based documents across languages. Conclusion: The work provides a practical tool for multilingual information processing and improves information accessibility across linguistic environments. Abstract: This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available in an accessible Gradio interface, the current research shows a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.

[125] NoPE: The Counting Power of Transformers with No Positional Encodings

Chris Köcher,Alexander Kozachinskiy,Anthony Widjaja Lin,Marco Sälzer,Georg Zetzsche

Main category: cs.CL

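TL;DR: NoPE-transformers (transformers with no positional encodings) are surprisingly expressive under average hard attention: they can express counting languages corresponding to nonnegative integer solutions of multivariate polynomial equations (Diophantine equations), and their expressive power corresponds exactly to semi-algebraic sets.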

Details Motivation: To study the expressive power of transformers without positional encodings under average hard attention, filling a gap in existing theory. Method: Mathematical modeling and theoretical analysis that precisely characterize the languages expressible by NoPE-AHATs (average hard attention transformers with no positional encodings). Result: The expressive power of NoPE-AHATs corresponds to semi-algebraic sets; they can express complex counting properties but not the simple PARITY property. The problem of analyzing NoPE-transformers is undecidable. Conclusion: Under average hard attention, NoPE-transformers are far more expressive than established counting models, but analyzing them is undecidable. Abstract: Positional Encodings (PEs) seem to be indispensable for ensuring expressiveness of transformers; without them attention transformers reduce to a bag-of-words model. NoPE-transformers (i.e. with No PEs) with unique hard attention mechanisms were very recently shown to only be able to express regular languages, i.e., with limited counting ability. This paper shows that, with average hard attention mechanisms, NoPE-transformers are still surprisingly expressive: they can express counting languages corresponding to nonnegative integer solutions to multivariate polynomial equations (i.e. Diophantine equations), reasoning about which is well-known to be undecidable. In fact, we provide a precise characterization of languages expressible by Average Hard Attention NoPE-Transformers (NoPE-AHATs): they correspond precisely to what we call semi-algebraic sets, i.e., finite unions of sets of nonnegative integer solutions to systems of multivariate polynomial inequations. We obtain several interesting consequences of our characterization. Firstly, NoPE-transformers can express counting properties that are far more complex than established models like simplified counter machines and Petri nets, but cannot express a very simple counting property of PARITY. Secondly, the problem of analyzing NoPE-transformers is undecidable, e.g., whether a given NoPE transformer classifies all input strings in one class. To complement our results, we exhibit a counting language that is not expressible by average hard attention transformers even with arbitrary PEs but is expressible in the circuit complexity class TC$^0$, answering an open problem.

[126] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Chengyu Huang,Zhengxin Zhang,Claire Cardie

Main category: cs.CL

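TL;DR: HAPO uses history-aware policy optimization, combining a length reward with a correctness reward, to cut LLM output length substantially (33-59%) at a small cost in accuracy (2-5%).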

Details Motivation: Existing methods do not use historical information, which limits their ability to produce concise solutions. Method: HAPO records a history state per problem (e.g., the minimum length of previously generated correct responses), designs a length reward function based on that state, and optimizes it jointly with a correctness reward. Result: On several math benchmarks, HAPO reduces output length by 33-59% while accuracy drops by only 2-5%. Conclusion: HAPO effectively improves the concise reasoning ability of LLMs, balancing efficiency and correctness. Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs' concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
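
As a rough illustration of the history-aware reward idea in the HAPO abstract, the sketch below tracks a per-problem history state (the shortest correct response length seen so far) and scores new responses against it. The abstract does not give the actual reward formula, so the `tanh` shape, the `weight` parameter, and all function names here are assumptions made for illustration only.

```python
import math


def length_reward(length, best_len, correct):
    """Hypothetical length reward relative to the history state `best_len`.

    `best_len` is the shortest correct response length seen so far for this
    problem, or None if no correct response has been seen yet.
    """
    if best_len is None:
        # No history yet: stay neutral on length.
        return 0.0
    # Positive when shorter than the best known correct length, negative when longer.
    signal = math.tanh((best_len - length) / best_len)
    if correct:
        return signal
    # Incorrect responses are never rewarded for brevity, and short ones are
    # not penalized either; only overly long incorrect responses lose reward.
    return min(signal, 0.0)


def total_reward(length, best_len, correct, weight=1.0):
    """Correctness reward plus a weighted length reward (weight is an assumption)."""
    return (1.0 if correct else 0.0) + weight * length_reward(length, best_len, correct)


def update_history(best_len, length, correct):
    """Keep the minimum length over correct responses; ignore incorrect ones."""
    if not correct:
        return best_len
    return length if best_len is None else min(best_len, length)
```

The `min(signal, 0.0)` branch is one way to reflect the abstract's point that shorter incorrect responses should not be overly penalized; other shapes would serve equally well.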

[127] Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models

Camille Couturier,Spyros Mastorakis,Haiying Shen,Saravan Rajmohan,Victor Rühle

Main category: cs.CL

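TL;DR: The paper proposes a semantic caching method that stores and reuses intermediate contextual summaries to reduce redundant computation in LLM question-answering workflows.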

Details Motivation: Processing long contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. Method: Introduces semantic caching that stores and reuses intermediate contextual summaries, enabling information reuse across similar queries. Result: On NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset, redundant computation drops by up to 50-60% while answer accuracy stays comparable to full document processing. Conclusion: The method balances computational cost and response quality and suits real-time AI assistants. Abstract: Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
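
A minimal sketch of what a semantic cache for contextual summaries could look like is given below. It is not the paper's implementation: the bag-of-words `embed` function stands in for a real sentence encoder, and the class name, the linear scan, and the `threshold` value are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticSummaryCache:
    """Hypothetical cache mapping query embeddings to cached summaries."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached summary)

    def get(self, query):
        """Return the summary of the most similar cached query, or None on a miss."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, summary):
        """Store a summary under the embedding of the query that produced it."""
        self.entries.append((embed(query), summary))
```

On a hit, the cached summary can be passed to the LLM in place of the full document; on a miss, the caller would summarize the document as usual and call `put`. The threshold controls the trade-off between reuse rate and the risk of serving a summary that does not fit the new query.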

[128] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

Yaorui Shi,Shihan Li,Chang Wu,Zhiyuan Liu,Junfeng Fang,Hengxing Cai,An Zhang,Xiang Wang

Main category: cs.CL

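TL;DR: AutoRefine is a reinforcement learning framework that improves retrieval-augmented reasoning through a "search-refine-think" paradigm, markedly boosting performance on complex reasoning tasks.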

Details Motivation: Existing retrieval-augmented reasoning methods often retrieve irrelevant information, which hurts reasoning accuracy; AutoRefine aims to address this. Method: A reinforcement learning framework that introduces knowledge refinement steps and retrieval-specific rewards, optimizing reasoning through iterative search and refinement. Result: Strong performance on single-hop and multi-hop QA tasks, with especially large gains over existing methods in complex reasoning scenarios. Conclusion: Through higher-quality searches and evidence integration, AutoRefine effectively improves the accuracy of retrieval-augmented reasoning. Abstract: Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.

[129] Temporal fine-tuning for early risk detection

Horacio Thompson,Esaú Villatoro-Tello,Manuel Montes-y-Gómez,Marcelo Errecalde

Main category: cs.CL

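TL;DR: The paper proposes temporal fine-tuning, which explicitly incorporates time into the learning process of transformer-based models to address both classification precision and detection delay in early risk detection (ERD).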

Details Motivation: Early risk detection (ERD) must identify users facing social and health issues quickly and accurately, but existing methods trade off classification precision against delay. Method: A temporal fine-tuning strategy that explicitly incorporates time into the learning process of transformer models, analyzes complete user post histories, and evaluates performance with temporal metrics. Result: On the Spanish depression and eating disorder tasks, the method is competitive with the best models of MentalRiskES 2023 and optimizes decisions with respect to context and time progress. Conclusion: With temporal fine-tuning, transformer models can combine precision and speed into a single objective and address ERD effectively. Abstract: Early Risk Detection (ERD) on the Web aims to promptly identify users facing social and health issues. Users are analyzed post-by-post, and it is necessary to guarantee correct and quick answers, which is particularly challenging in critical scenarios. ERD involves optimizing classification precision and minimizing detection delay. Standard classification metrics may not suffice, so specific metrics such as ERDE(theta), which explicitly consider precision and delay, are used instead. Current research focuses on applying a multi-objective approach, prioritizing classification performance and establishing a separate criterion for decision time. In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process. Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics. We evaluated our proposal in the depression and eating disorders tasks for the Spanish language, achieving competitive results compared to the best models of MentalRiskES 2023. We found that temporal fine-tuning optimized decisions considering context and time progress. In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.
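
The ERDE(theta) metric mentioned in the abstract can be sketched as follows, using the standard definition from the early risk detection literature: false positives and false negatives incur fixed costs, while a true positive is charged a latency cost that grows with the number of posts `k` read before the decision. The default cost values below are illustrative assumptions, not values taken from this paper.

```python
import math


def latency_cost(k, theta):
    """Latency cost in [0, 1): close to 0 for early decisions, 0.5 at k == theta."""
    return 1.0 - 1.0 / (1.0 + math.exp(k - theta))


def erde(predicted, actual, k, theta, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Early Risk Detection Error for one user.

    predicted, actual: booleans (True means the user is at risk).
    k: number of posts seen before the decision was emitted.
    theta: the delay at which the latency cost reaches 0.5.
    c_fp, c_fn, c_tp: cost constants (illustrative defaults).
    """
    if predicted and not actual:
        return c_fp  # false positive
    if not predicted and actual:
        return c_fn  # false negative
    if predicted and actual:
        return latency_cost(k, theta) * c_tp  # late true positives cost more
    return 0.0  # true negative
```

Averaging `erde` over all users gives the reported score; lower is better, and a correct but late alert is charged almost as much as a miss.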

[130] Probing Subphonemes in Morphology Models

Gal Astrach,Yuval Pinter

Main category: cs.CL

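TL;DR: Transformers perform well on morphological tasks but generalize poorly across languages and morphological rules. This paper proposes a language-agnostic probing method to study how transformers encode phonological features at the phoneme level, validated on seven languages. Local phonological features (such as final-obstruent devoicing in Turkish) are captured well in phoneme embeddings, whereas long-distance dependencies (such as vowel harmony) are better represented in the encoder.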

Details Motivation: To examine how well transformers capture implicit phenomena at the phonological and subphonemic levels, as a possible explanation for their limited generalization on cross-lingual morphological tasks. Method: A language-agnostic probing method that studies phonological feature encoding in transformers trained on phonemes, applied across seven morphologically diverse languages. Result: Local phonological features (such as final-obstruent devoicing) are captured well in phoneme embeddings, whereas long-distance dependencies (such as vowel harmony) are better represented in the encoder. Conclusion: The findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition. Abstract: Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the degree to which these models are able to capture implicit phenomena at the phonological and subphonemic levels. We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and perform it across seven morphologically diverse languages. We show that phonological features which are local, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies like vowel harmony are better represented in the transformer's encoder. Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.

[131] XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision

Nuo Chen,Andre Lin HuiKai,Jiaying Wu,Junyi Hou,Zining Zhang,Qian Wang,Xidong Wang,Bingsheng He

Main category: cs.CL

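TL;DR: XtraGPT is a suite of open-source LLMs built on a human-AI collaboration framework that aims to improve academic paper revision through context-aware, instruction-guided assistance.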

Details Motivation: Existing LLMs offer limited support for high-quality scientific writing and cannot meet the sophisticated demands of research communication, such as conceptual coherence across sections. Method: Builds a dataset of 7,040 research papers with over 140,000 instruction-response pairs and develops XtraGPT (1.5B to 14B parameters). Result: XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems, as confirmed by both automated and human evaluation. Conclusion: XtraGPT provides an effective tool for academic paper revision and fills a gap in LLM support for scientific writing. Abstract: Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.

[132] Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

Banca Calvo Figueras,Rodrigo Agerri

Main category: cs.CL

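TL;DR: The paper presents a comprehensive approach to support system development and benchmarking for Critical Questions Generation (CQs-Gen), including the first large-scale manually annotated dataset and a study of automatic evaluation methods.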

Details Motivation: Progress in critical questions generation has been held back by the lack of suitable datasets and automatic evaluation standards. Method: Constructs a large-scale manually annotated dataset and studies reference-based automatic evaluation using large language models (LLMs). Result: Identifies the evaluation method that correlates best with human judgments and runs a zero-shot evaluation of 11 LLMs, which shows how difficult the task is. Conclusion: Data, code, and a public leaderboard are released to encourage further research and to explore the practical value of CQs-Gen for automated reasoning and human critical thinking. Abstract: The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.

[133] LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

Rao Ma,Tongzhou Chen,Kartik Audhkhasi,Bhuvana Ramabhadran

Main category: cs.CL

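TL;DR: LegoSLM proposes a new way to connect speech encoders and LLMs through ASR posterior matrices, markedly improving performance on ASR and speech translation tasks.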

Details Motivation: Existing methods for combining speech encoders with LLMs suffer from suboptimal performance or inflexibility, so a better solution is needed. Method: Uses CTC posterior matrices to reconstruct pseudo-audio embeddings and concatenates them with the LLM's text embeddings, combining the speech encoder and the LLM efficiently. Result: An average 49% WERR over the USM-CTC baseline on 8 MLS test sets, with the model showing modularity and domain adaptation ability. Conclusion: LegoSLM offers a flexible and efficient way to combine speech and text models, with broad application potential. Abstract: Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks. By connecting USM with Gemma models, we can get an average of 49% WERR over the USM-CTC baseline on 8 MLS test sets. The trained model also exhibits modularity in a range of settings: after fine-tuning the Gemma model weights, the speech encoder can be switched and combined with the LLM in a zero-shot fashion. Additionally, we propose to control the decode-time influence of the USM and LLM using a softmax temperature, which shows effectiveness in domain adaptation.
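
The "weighted sum of the LLM input embeddings" step described in the abstract amounts to multiplying a frame-by-vocabulary posterior matrix with the vocabulary-by-dimension embedding table. The sketch below shows that operation in plain Python on toy data. The function names are hypothetical, and applying the temperature to the logits that produce the posteriors is an assumption about where the softmax temperature enters, since the abstract does not say.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperatures give sharper distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_audio_embeddings(posteriors, embedding_table):
    """Turn per-frame posteriors over the vocabulary into embedding vectors.

    posteriors: T rows, each a distribution over V vocabulary entries.
    embedding_table: V rows, each a D-dimensional LLM input embedding.
    Returns T rows of D-dimensional vectors (posteriors @ embedding_table).
    """
    dim = len(embedding_table[0])
    return [
        [sum(p * embedding_table[v][d] for v, p in enumerate(frame)) for d in range(dim)]
        for frame in posteriors
    ]
```

A one-hot posterior reproduces the embedding of the corresponding token exactly, while an uncertain frame yields a blend of candidate token embeddings, which is what lets the LLM see the encoder's uncertainty.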

[134] GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents

Lingxiao Diao,Xinyue Xu,Wanxuan Sun,Cheng Yang,Zhuosheng Zhang

Main category: cs.CL

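TL;DR: GuideBench is a benchmark for evaluating how well large language models (LLMs) follow domain-oriented guidelines, covering three aspects: rule diversity, robustness to rule updates, and alignment with human preferences.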

Details Motivation: As LLMs are increasingly deployed as domain-oriented agents, their ability to follow domain-oriented guidelines has not been comprehensively evaluated; existing benchmarks focus mainly on commonsense knowledge in general domains. Method: Proposes the GuideBench benchmark, which evaluates LLMs on three key aspects: adherence to diverse rules, robustness to rule updates, and alignment with human preferences. Result: Experiments show that LLMs still have substantial room for improvement in following domain-oriented guidelines. Conclusion: GuideBench fills a gap in evaluating domain-oriented guideline following and provides an important tool for future research. Abstract: Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline-following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.

[135] A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography

Alicia González Martínez

Main category: cs.CL

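TL;DR: The paper examines the systematicity of tajwid rules in Contemporary Quranic Orthography (CQO), develops a Python module for handling the tajwid layer, and proposes a framework that uses the text of the Cairo Quran to align and compare other manuscripts.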

Details Motivation: To study the oral tradition of the Quran in the early Islamic period and its orthographic system, especially the rules of tajwid, in order to understand its phonetic and prosodic processes. Method: Uses a digital edition of the Cairo Quran and develops a Python module that adds or removes the orthographic layer of tajwid. Result: The richness and completeness of the Cairo Quran text make it a key tool for aligning and comparing other Quranic manuscripts. Conclusion: Mapping all texts to one another enables in-depth study of the nature of the diacritic notation systems added to the consonantal skeleton. Abstract: Contemporary Quranic Orthography (CQO) relies on a precise system of phonetic notation that can be traced back to the early stages of Islam, when the Quran was mainly oral in nature and the first written renderings of it served as memory aids for this oral tradition. The early systems of diacritical marks created on top of the Quranic Consonantal Text (QCT) motivated the creation and further development of a fine-grained system of phonetic notation that represented tajwid, the rules of recitation. We explored the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran, using a fully and accurately encoded digital edition of the Quranic text. For this purpose, we developed a Python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO. The interesting characteristic of these two sets of rules is that they address the complete Quranic text of the Cairo Quran, so they can be used as precise witnesses to study its phonetic and prosodic processes. From a computational point of view, the text of the Cairo Quran can be used as a linchpin to align and compare Quranic manuscripts, due to its richness and completeness. This will let us create a very powerful framework to work with the Arabic script, not just within an isolated text, but automatically exploring a specific textual phenomenon in other connected manuscripts. Having all the texts mapped among each other can serve as a powerful tool to study the nature of the notation systems of diacritics added to the consonantal skeleton.

[136] CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

Sijia Chen,Xiaomin Li,Mengxue Zhang,Eric Hanchen Jiang,Qingcheng Zeng,Chen-Hsiang Yu

Main category: cs.CL

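TL;DR: CARES is a benchmark for evaluating the safety of large language models in the medical domain; it includes diverse prompts and evaluation methods, exposes model vulnerabilities, and proposes a mitigation strategy.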

Details Motivation: Medical LLMs carry risks around safety and adversarial manipulation, and existing benchmarks lack clinical specificity and coverage of adversarial attacks. Method: CARES contains over 18,000 prompts covering eight medical safety principles, four harm levels, and four prompting styles, and uses a three-way response evaluation protocol with a fine-grained Safety Score. Result: Many state-of-the-art models remain vulnerable to subtly rephrased harmful prompts while over-refusing safe but atypically phrased queries. Conclusion: CARES provides a rigorous framework for testing and improving medical LLM safety and proposes a classifier-based mitigation strategy. Abstract: Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.

[137] Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model

Phan Tran Minh Dat,Vo Hoang Nhat Khang,Quan Thanh Tho

Main category: cs.CL

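TL;DR: The study uses transfer learning to address the scarcity of resources for Bahnaric-Vietnamese translation and reports that the approach is highly effective.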

Details Motivation: To foster cultural exchange between two ethnic groups in Vietnam through translation, despite the scarcity of Bahnaric language resources. Method: Transfer learning with a sequence-to-sequence pre-trained language model, with data augmentation and heuristic methods to improve translation. Result: The approach is reported to be highly effective for the Bahnaric-Vietnamese translation model. Conclusion: The method streamlines the translation process while supporting language preservation and mutual understanding between the two groups. Abstract: This work explores the journey towards achieving Bahnaric-Vietnamese translation for the sake of culturally bridging the two ethnic groups in Vietnam. However, translating from Bahnaric to Vietnamese also encounters some difficulties. The most prominent challenge is the lack of available original resources in the Bahnaric source language, including vocabulary, grammar, dialogue patterns and bilingual corpora, which hinders the data collection process for training. To address this, we leverage a transfer learning approach using a sequence-to-sequence pre-trained language model. First of all, we leverage a pre-trained Vietnamese language model to capture the characteristics of this language. In particular, to further serve the purpose of machine translation, we aim for a sequence-to-sequence model, not encoder-only like BERT or decoder-only like GPT. Taking advantage of the significant similarity between the two languages, we continue training the model with the currently limited bilingual resources of Vietnamese-Bahnaric text to perform the transfer learning from language model to machine translation. Thus, this approach can help to handle the problem of imbalanced resources between the two languages, while also optimizing the training and computational processes. Additionally, we enhanced the datasets using data augmentation to generate additional resources and defined some heuristic methods to make the translation more precise. Our approach has been validated to be highly effective for the Bahnaric-Vietnamese translation model, contributing to the expansion and preservation of languages, and facilitating better mutual understanding between the two ethnic groups.

[138] When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Xiaomin Li,Zhou Yu,Zhiwei Zhang,Xupeng Chen,Ziji Zhang,Yingying Zhuang,Narayanan Sadagopan,Anurag Beniwal

Main category: cs.CL

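TL;DR: The study finds that explicit chain-of-thought (CoT) reasoning can significantly degrade instruction-following accuracy, and proposes four strategies (such as selective reasoning) to mitigate the problem.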

Details Motivation: To explore how explicit CoT reasoning affects instruction-following tasks, expose its potential pitfalls, and propose improvements. Method: Evaluates 15 models on two benchmarks (IFEval and ComplexBench), identifies failure patterns through case studies and attention analysis, and proposes four mitigation strategies. Result: CoT reasoning often diverts attention and lowers performance; selective reasoning strategies, especially classifier-selective reasoning, substantially recover the lost performance. Conclusion: Explicit CoT reasoning can harm instruction following, but selective reasoning strategies mitigate the problem effectively. Abstract: Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.

[139] GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Chenkai Zhang,Yiming Lei,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang

Main category: cs.CL

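TL;DR: The paper proposes the GODBench benchmark and the Ripple of Thought (RoT) framework to evaluate and improve the creativity of multimodal large language models in Video Comment Art.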

Details Motivation: Existing MLLMs and CoT methods fall short at generating creative expression such as humor and satire, and existing benchmarks are limited in modalities and categories. Method: Introduces the GODBench benchmark (video and text modalities) and RoT, a multi-step reasoning framework inspired by wave propagation in physics. Result: Experiments show that existing methods still struggle with creative comment generation, while RoT markedly improves creativity. Conclusion: RoT offers an effective approach to creative generation with MLLMs, and GODBench provides a new benchmark for related research. Abstract: Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.

[140] Is Compression Really Linear with Code Intelligence?

Xianzhen Luo,Shijie Xuyang,Tianhao Cheng,Zheng Chu,Houyi Li,ziqi wang,Siming Huang,Qingfu Zhu,Qiufeng Wang,Xiangyu Zhang,Shuigeng Zhou,Wanxiang Che

Main category: cs.CL

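TL;DR: The paper examines the relationship between data compression and large language models (LLMs) in code intelligence, introduces a new evaluation methodology called Format Annealing, and finds a logarithmic relationship between code intelligence and compression efficacy (BPC).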

Details Motivation: Prior work posited a linear relationship between compression and general intelligence, but it overlooked the multi-language, multi-task nature of code and did not evaluate modern Code LLMs fairly. Method: Evaluates a diverse set of open-source Code LLMs, introduces Format Annealing, and measures BPC on a large-scale code validation set. Result: Empirical results show a logarithmic relationship between code intelligence and BPC, refining the earlier linearity hypothesis. Conclusion: The study deepens the understanding of compression's role in code intelligence and contributes a robust evaluation framework. Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
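
Bits-per-character, the compression measure named in the abstract, can be computed from a model's per-token log-probabilities as sketched below. This is the generic definition (total negative log-likelihood in bits divided by the number of characters) rather than the paper's exact evaluation pipeline; the function name and the assumption that log-probabilities are natural logs are illustrative.

```python
import math


def bits_per_character(token_logprobs, text):
    """Compute BPC for `text` given natural-log probabilities of its tokens.

    token_logprobs: one log-probability (base e) per token of the tokenized text.
    text: the original string; its character count normalizes the total bits,
          which makes the score comparable across different tokenizers.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / len(text)
```

Lower BPC means the model compresses the code better. Normalizing by characters rather than tokens matters when comparing models with different vocabularies.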

[141] Disentangling Reasoning and Knowledge in Medical Large Language Models

Rahul Thapa,Qingyang Wu,Kevin Wu,Harrison Zhang,Angela Zhang,Eric Wu,Haotian Ye,Suhana Bedi,Nevin Aresh,Joseph Boen,Shriya Reddy,Ben Athiwaratkun,Shuaiwen Leon Song,James Zou

Main category: cs.CL

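TL;DR: By splitting biomedical QA benchmarks into reasoning-focused and knowledge-focused subsets, the paper analyzes how current large language models perform on medical reasoning and proposes a new model, BioMed-R1, to improve reasoning ability.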

Details Motivation: Current medical QA benchmarks mix reasoning with factual recall, making it hard to assess reasoning ability accurately, so the two need to be separated and reasoning performance improved. Method: Uses a PubMedBERT classifier to split 11 biomedical QA benchmarks into reasoning and knowledge subsets, evaluates models on each, and trains BioMed-R1 with fine-tuning and reinforcement learning. Result: Only 32.8% of questions require complex reasoning; biomedical models degrade sharply in adversarial tests, while BioMed-R1 performs best among similarly sized models. Conclusion: Targeted training and adversarial scenarios can further improve medical reasoning in LLMs. Abstract: Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.

[142] No Gold Standard, No Problem: Reference-Free Evaluation of Taxonomies

Pascal Wullschleger,Majid Zarharan,Donnacha Daly,Marc Pouly,Jennifer Foster

Main category: cs.CL

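TL;DR: Proposes two reference-free metrics for evaluating taxonomy quality, one based on the correlation between semantic and taxonomic similarity and one based on logical adequacy; experiments show both correlate well with F1 against gold-standard taxonomies.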

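Details Motivation: Existing metrics do not fully assess taxonomy quality, in particular the correlation between semantic and taxonomic similarity and logical adequacy. Method: The first metric evaluates robustness by computing the correlation between semantic and taxonomic similarity; the second uses Natural Language Inference to assess logical adequacy. Result: Tested on five taxonomies, both metrics correlate well with F1 against gold-standard taxonomies. Conclusion: The proposed metrics complement existing taxonomy evaluation methods and cover a type of error that existing metrics do not handle. Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, covering a type of error not handled by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against gold-standard taxonomies.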
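
To make the first metric's idea concrete, here is a minimal sketch that correlates a semantic similarity score with a tree-based taxonomic similarity over concept pairs. Everything specific in it is an assumption for illustration: the toy taxonomy, the hand-written semantic scores (in practice these might come from embeddings), the inverse-path-length similarity, and the choice of Pearson correlation. The abstract does not specify which measures the authors use.

```python
import math

def ancestors(node, parent):
    """Return the chain from node up to the root, node first."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def taxonomic_similarity(a, b, parent):
    """Assumed measure: 1 / (1 + tree path length between a and b)."""
    chain_a, chain_b = ancestors(a, parent), ancestors(b, parent)
    depth_in_b = {n: i for i, n in enumerate(chain_b)}
    for i, n in enumerate(chain_a):
        if n in depth_in_b:                  # lowest common ancestor found
            return 1.0 / (1 + i + depth_in_b[n])
    return 0.0                               # no common ancestor

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy taxonomy (child -> parent) and assumed semantic similarity scores.
parent = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
          "trout": "fish", "fish": "animal"}
semantic = {("dog", "cat"): 0.8, ("dog", "trout"): 0.3,
            ("cat", "trout"): 0.35, ("dog", "mammal"): 0.7}

pairs = list(semantic)
taxonomic = [taxonomic_similarity(a, b, parent) for a, b in pairs]
score = pearson([semantic[p] for p in pairs], taxonomic)
```

A score near 1 would suggest that concepts placed close together in the taxonomy are also semantically close; a low or negative score would flag a structure that disagrees with semantic similarity.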

[143] HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Hoo-Chang Shin,Felipe Soares,Alexander Bukharin,Ellie Evans,Yi Dong,Oleksii Kuchaiev

Main category: cs.CL

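TL;DR: HelpSteer3-Preference is a high-quality, diverse preference dataset for training models with Reinforcement Learning from Human Feedback (RLHF), and it substantially improves reward model performance.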

Details Motivation: Openly available preference datasets lack quality and diversity, which holds back progress on RLHF models. Method: The authors introduce the HelpSteer3-Preference dataset, with over 40,000 samples spanning diverse tasks including STEM, coding, and multilingual scenarios. Result: The trained reward models reach 82.4% on RM-Bench and 73.7% on JudgeBench, about 10% higher than the previous best results. Conclusion: HelpSteer3-Preference not only improves reward model performance but can also be used to train generative reward models and to align policy models with RLHF. Abstract: Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference

[144] Improving Assembly Code Performance with Large Language Models via Reinforcement Learning

Anjiang Wei,Tarun Suresh,Huanmi Tan,Yinglun Xu,Gagandeep Singh,Ke Wang,Alex Aiken

Main category: cs.CL

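TL;DR: The paper studies the potential of large language models (LLMs) for assembly code optimization, proposes a reinforcement learning framework that trains LLMs with PPO, and outperforms an industry-standard compiler on a benchmark of real-world programs.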

Details Motivation: To explore the potential of LLMs for optimizing assembly code, where fine-grained control over execution may yield larger performance gains than are possible in high-level languages. Method: A reinforcement learning framework (PPO) with a reward function that combines functional correctness and execution performance trains LLMs to optimize assembly code. Result: The model Qwen2.5-Coder-7B-PPO reaches a 96.0% test pass rate and an average speedup of 1.47x, outperforming 20 other models including Claude-3.7-sonnet. Conclusion: Reinforcement learning can unlock the potential of LLMs as performance optimizers for assembly code. Abstract: Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored. This work investigates whether LLMs can optimize the performance of assembly code, where fine-grained control over execution enables improvements that are difficult to express in high-level languages. We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO), guided by a reward function that considers both functional correctness, validated through test cases, and execution performance relative to the industry-standard compiler gcc -O3. To support this study, we introduce a benchmark of 8,072 real-world programs. Our model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and an average speedup of 1.47x over the gcc -O3 baseline, outperforming all 20 other models evaluated, including Claude-3.7-sonnet. These results indicate that reinforcement learning can unlock the potential of LLMs to serve as effective optimizers for assembly code performance.
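
The abstract describes a reward that accounts for both test-case correctness and speed relative to gcc -O3, without giving its exact form. The sketch below is one plausible shape, stated purely as an assumption: zero reward unless every test passes, otherwise the measured speedup over the baseline. The function name `speedup_reward` and its parameters are hypothetical.

```python
def speedup_reward(tests_passed, tests_total, baseline_seconds, candidate_seconds):
    """Hypothetical reward: 0 if any test fails, else speedup over the baseline."""
    if tests_total <= 0 or baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("counts and timings must be positive")
    if tests_passed < tests_total:
        return 0.0                       # incorrect programs earn nothing
    return baseline_seconds / candidate_seconds

# Toy usage (assumed timings): a correct program that runs 1.47x faster.
r_good = speedup_reward(10, 10, baseline_seconds=1.47, candidate_seconds=1.0)
r_bad = speedup_reward(9, 10, baseline_seconds=1.47, candidate_seconds=1.0)
```

Gating the speedup on full correctness keeps the policy from trading functional behavior for speed; a partial-credit variant is equally conceivable, and the abstract does not say which design the authors chose.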

[145] SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning

Yige Xu,Xu Guo,Zhiwei Zeng,Chunyan Miao

Main category: cs.CL

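TL;DR: SoftCoT++ extends SoftCoT by perturbing latent thoughts and applying contrastive learning, enabling diverse exploration of reasoning paths in the continuous latent space and significantly improving reasoning performance.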

Details Motivation: Existing continuous latent-space reasoning methods such as SoftCoT use a fixed latent representation, which limits the diversity of reasoning paths that can be explored. Method: Latent thoughts are perturbed via multiple sets of initial tokens, and contrastive learning promotes diversity among soft thought representations. Result: Across five reasoning benchmarks and two LLM architectures, SoftCoT++ significantly outperforms SoftCoT and is compatible with self-consistency scaling. Conclusion: SoftCoT++ offers a more effective way to explore diverse paths in continuous latent-space reasoning and is compatible with existing techniques. Abstract: Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model's parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at https://github.com/xuyige/SoftCoT.

[146] Modeling cognitive processes of natural reading with transformer-based Language Models

Bruno Bianchi,Fermín Travi,Juan E. Kamienkowski

Main category: cs.CL

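TL;DR: The paper evaluates how well transformer-based language models (GPT2, LLaMA-7B, and LLaMA2-7B) explain Gaze Duration during reading, finding that they outperform earlier models but still cannot fully account for human predictability.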

Details Motivation: To explore whether advanced language models can more accurately explain predictability effects in human reading behavior. Method: Transformer models (GPT2, LLaMA-7B, and LLaMA2-7B) are used to analyze Gaze Duration data from Rioplantense Spanish readers. Result: Transformer models outperform earlier models but still cannot fully account for human predictability. Conclusion: Despite their progress, language models still predict language in ways that differ from human readers. Abstract: Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplantense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.

cs.CY [Back]

[147] Towards Automated Situation Awareness: A RAG-Based Framework for Peacebuilding Reports

Poli A. Nemkova,Suleyman O. Polat,Rafid I. Jahan,Sagnik Ray Choudhury,Sun-joo Lee,Shouryadipta Sarkar,Mark V. Albert

Main category: cs.CY

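TL;DR: The paper proposes a dynamic Retrieval-Augmented Generation (RAG) system that automatically generates real-time situation awareness reports from multiple data sources, with a three-level evaluation framework to ensure report quality.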

Details Motivation: In settings such as disaster response and conflict monitoring, timely and accurate situation awareness is vital for decision-making, but manual analysis of large, heterogeneous data often causes delays that reduce the effectiveness of interventions. Method: The system dynamically builds query-specific knowledge bases, integrates real-time data such as news and conflict event databases, and ensures report quality through a three-level evaluation framework (automated NLP metrics, expert feedback, and LLM evaluation). Result: Tested across multiple real-world scenarios, the system produces coherent, insightful, and actionable reports, reducing manual workload and accelerating decision-making. Conclusion: The automated approach makes situation awareness more efficient; the code and evaluation tools are open source to support further research. Abstract: Timely and accurate situation awareness is vital for decision-making in humanitarian response, conflict monitoring, and early warning and early action. However, the manual analysis of vast and heterogeneous data sources often results in delays, limiting the effectiveness of interventions. This paper introduces a dynamic Retrieval-Augmented Generation (RAG) system that autonomously generates situation awareness reports by integrating real-time data from diverse sources, including news articles, conflict event databases, and economic indicators. Our system constructs query-specific knowledge bases on demand, ensuring timely, relevant, and accurate insights. To ensure the quality of generated reports, we propose a three-level evaluation framework that combines semantic similarity metrics, factual consistency checks, and expert feedback. The first level employs automated NLP metrics to assess coherence and factual accuracy. The second level involves human expert evaluation to verify the relevance and completeness of the reports. The third level utilizes LLM-as-a-Judge, where large language models provide an additional layer of assessment to ensure robustness. The system is tested across multiple real-world scenarios, demonstrating its effectiveness in producing coherent, insightful, and actionable reports. By automating report generation, our approach reduces the burden on human analysts and accelerates decision-making processes. To promote reproducibility and further research, we openly share our code and evaluation tools with the community via GitHub.

[148] Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation

Manisha Mehta,Fausto Giunchiglia

Main category: cs.CY

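TL;DR: The study evaluates how well AI systems understand the digital language of Generation Alpha (born 2010-2024), finds that existing tools fall short at detecting masked harmful interactions, and proposes a framework for improvement.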

Details Motivation: Gen Alpha is the first cohort raised alongside AI, and its distinct digital language (shaped by gaming, memes, and AI-driven trends) keeps existing safety tools from identifying harmful interactions effectively, so improvement is urgently needed. Method: Four AI models (GPT-4, Claude, Gemini, and Llama 3) are assessed on their ability to detect masked harassment in Gen Alpha discourse, using a dataset of 100 expressions from gaming platforms, social media, and video content. Result: AI systems show critical failures in understanding Gen Alpha language, with direct implications for online safety; the study also proposes an improvement framework and a multi-perspective evaluation. Conclusion: Safety systems need to be redesigned to fit how young people communicate, combining the perspective of a Gen Alpha researcher with academic analysis to address digital safety challenges. Abstract: This research offers a unique evaluation of how AI systems interpret the digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first cohort raised alongside AI, Gen Alpha faces new forms of online risk due to immersive digital engagement and a growing mismatch between their evolving communication and existing safety tools. Their distinct language, shaped by gaming, memes, and AI-driven trends, often conceals harmful interactions from both human moderators and automated systems. We assess four leading AI models (GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked harassment and manipulation within Gen Alpha discourse. Using a dataset of 100 recent expressions from gaming platforms, social media, and video content, the study reveals critical comprehension failures with direct implications for online safety. This work contributes: (1) a first-of-its-kind dataset capturing Gen Alpha expressions; (2) a framework to improve AI moderation systems for youth protection; (3) a multi-perspective evaluation including AI systems, human moderators, and parents, with direct input from Gen Alpha co-researchers; and (4) an analysis of how linguistic divergence increases youth vulnerability. Findings highlight the urgent need to redesign safety systems attuned to youth communication, especially given Gen Alpha's reluctance to seek help when adults fail to understand their digital world. This study combines the insight of a Gen Alpha researcher with systematic academic analysis to address critical digital safety challenges.

[149] Phare: A Safety Probe for Large Language Models

Pierre Le Jeune,Benoît Malésieux,Weixuan Xiao,Matteo Dora

Main category: cs.CY

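TL;DR: Phare is a multilingual diagnostic framework for evaluating large language model behavior along three critical dimensions: hallucination and reliability, social biases, and harmful content generation.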

Details Motivation: Existing evaluations focus on performance rather than failure modes; Phare aims to fill this gap with a more comprehensive safety evaluation. Method: The authors evaluate 17 state-of-the-art LLMs and analyze their systematic vulnerabilities, such as sycophancy, prompt sensitivity, and stereotype reproduction. Result: All models show systematic vulnerabilities across the safety dimensions. Conclusion: Phare provides actionable insights for building more robust, aligned, and trustworthy language systems. Abstract: Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.

eess.IV [Back]

[150] GRNN: Recurrent Neural Network based on Ghost Features for Video Super-Resolution

Yutong Guo

Main category: eess.IV

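TL;DR: The paper proposes a method based on "Ghost features" to reduce feature redundancy in video super-resolution (VSR) models, combines it with an RNN to address vanishing gradients, and improves PSNR and SSIM.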

Details Motivation: Modern CNN-based VSR systems are computationally expensive and suffer from feature redundancy, a problem rarely discussed in the VSR field. Method: "Ghost features" are used to reduce redundancy and are combined with an RNN to address vanishing gradients; the inputs are the current frame, the next frame, the output of the previous frame, and the hidden state. Result: Experiments on several benchmark models and datasets show improved PSNR and SSIM and better preservation of texture details in video. Conclusion: The method effectively reduces feature redundancy and improves video super-resolution quality. Abstract: Modern video super-resolution (VSR) systems based on convolutional neural networks (CNNs) incur huge computational costs. The problem of feature redundancy is present in most models in many domains, but is rarely discussed in VSR. We experimentally observe that many features in VSR models are also similar to each other, so we propose to use "Ghost features" to reduce this redundancy. We also analyze the so-called "gradient disappearance" phenomenon generated by the conventional recurrent convolutional network (RNN) model, and combine the Ghost module with RNN to complete the modeling on time series. The current frame is used as input to the model together with the next frame, the output of the previous frame and the hidden state. Extensive experiments on several benchmark models and datasets show that the PSNR and SSIM of our proposed modality are improved to some extent. Some texture details in the video are also better preserved.

[151] ExploreGS: a vision-based low overhead framework for 3D scene reconstruction

Yunji Feng,Chengpu Yu,Fengrui Ran,Zhi Yang,Yinni Liu

Main category: eess.IV

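TL;DR: ExploreGS is a vision-based, low-overhead 3D scene reconstruction framework for drones that replaces traditional lidar point cloud acquisition with RGB images, achieving high-quality reconstruction at low cost.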

Details Motivation: Traditional lidar point cloud acquisition is costly, which motivates a low-cost vision-based alternative. Method: The framework combines scene exploration with model reconstruction and uses a BoW model for real-time processing, supporting on-board 3DGS training. Result: Simulation and real-world experiments confirm the framework's efficiency and applicability, with reconstruction quality comparable to state-of-the-art methods. Conclusion: ExploreGS offers an efficient, low-cost 3D scene reconstruction solution for resource-constrained devices. Abstract: This paper proposes a low-overhead, vision-based 3D scene reconstruction framework for drones, named ExploreGS. By using RGB images, ExploreGS replaces the traditional lidar-based point cloud acquisition process with a vision model, achieving a high-quality reconstruction at a lower cost. The framework integrates scene exploration and model reconstruction, and leverages a Bag-of-Words (BoW) model to enable real-time processing, so that the 3D Gaussian Splatting (3DGS) training can be executed on-board. Comprehensive experiments in both simulation and real-world environments demonstrate the efficiency and applicability of the ExploreGS framework on resource-constrained devices, while maintaining reconstruction quality comparable to state-of-the-art methods.

[152] MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation

Hania Ghouse,Muzammil Behzad

Main category: eess.IV

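TL;DR: Proposes an anatomically-aware slice selection method based on a vision-language model (VLM) for efficient multi-organ segmentation, and introduces a new metric, SLC, to evaluate slice localization precision.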

Details Motivation: Existing 3D segmentation methods are computationally and memory intensive, while 2D methods lack cross-view contextual awareness; both problems need to be addressed to improve efficiency and accuracy. Method: A unified framework uses a VLM and 2.5D multi-view representations to detect organ presence and selectively retain highly relevant slices. Result: The model clearly outperforms baseline methods across multiple organs, achieving efficient and spatially consistent organ slice filtering. Conclusion: The method significantly reduces downstream segmentation cost while maintaining high anatomical fidelity. Abstract: Efficient and accurate multi-organ segmentation from abdominal CT volumes is a fundamental challenge in medical image analysis. Existing 3D segmentation approaches are computationally and memory intensive, often processing entire volumes that contain many anatomically irrelevant slices. Meanwhile, 2D methods suffer from class imbalance and lack cross-view contextual awareness. To address these limitations, we propose a novel, anatomically-aware slice selector pipeline that reduces input volume prior to segmentation. Our unified framework introduces a vision-language model (VLM) for cross-view organ presence detection using fused tri-slice (2.5D) representations from axial, sagittal, and coronal planes. Our proposed model acts as an "expert" in anatomical localization, reasoning over multi-view representations to selectively retain slices with high structural relevance. This enables spatially consistent filtering across orientations while preserving contextual cues. More importantly, since standard segmentation metrics such as Dice or IoU fail to measure the spatial precision of such slice selection, we introduce a novel metric, Slice Localization Concordance (SLC), which jointly captures anatomical coverage and spatial alignment with organ-centric reference slices. Unlike segmentation-specific metrics, SLC provides a model-agnostic evaluation of localization fidelity. Our model offers substantial gains over several baselines across all organs, demonstrating both accurate and reliable organ-focused slice filtering. These results show that our method enables efficient and spatially consistent organ filtering, thereby significantly reducing downstream segmentation cost while maintaining high anatomical fidelity.

[153] ROIsGAN: A Region Guided Generative Adversarial Framework for Murine Hippocampal Subregion Segmentation

Sayed Mehedi Azim,Brian Corbett,Iman Dehzangi

Main category: eess.IV

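TL;DR: The paper proposes ROIsGAN, a new generative adversarial network for automated segmentation of hippocampal subregions, and provides four murine hippocampal IHC datasets.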

Details Motivation: Accurate segmentation of hippocampal subregions is essential for understanding disease mechanisms and developing treatments, but existing methods cannot process IHC images automatically. Method: A region-guided U-Net-based generative adversarial network (ROIsGAN) combines Dice and binary cross-entropy losses to refine boundaries and structural detail. Result: ROIsGAN outperforms conventional models on the DG, CA1, and CA3 subregions, with gains of 1-10% in Dice score and up to 11% in IoU. Conclusion: The work provides foundational datasets and methods for tissue image analysis in neuroscience, supporting high-precision, scalable research. Abstract: The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research.
Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: https://github.com/MehediAzim/ROIsGAN
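
For readers unfamiliar with combining Dice and binary cross-entropy, the sketch below computes both terms on flattened probability masks and mixes them with a weight. The equal weighting, the function names, and the toy masks are assumptions for illustration; the region-guided discriminator loss in ROIsGAN itself is defined in the paper and its repository, not here.

```python
import math

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient on flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2 * intersection + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def combined_loss(pred, target, dice_weight=0.5):
    """Assumed mixing rule: weighted sum of the Dice and BCE terms."""
    return dice_weight * dice_loss(pred, target) + (1 - dice_weight) * bce_loss(pred, target)

# Toy masks (assumed values): a perfect prediction and a poor one.
target = [1, 0, 1, 0]
perfect = combined_loss([1.0, 0.0, 1.0, 0.0], target)
poor = combined_loss([0.2, 0.8, 0.3, 0.7], target)
```

The Dice term rewards overlap at the region level and is less sensitive to class imbalance, while the cross-entropy term penalizes each pixel individually, which is a common reason for pairing the two in segmentation.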

[154] Predicting Risk of Pulmonary Fibrosis Formation in PASC Patients

Wanying Dou,Gorkem Durak,Koushik Biswas,Ziliang Hong,Andrea Mia Bejar,Elif Keles,Kaan Akin,Sukru Mehmet Erturk,Alpay Medetalibeyoglu,Marc Sala,Alexander Misharin,Hatice Savas,Mary Salvatore,Sachin Jambawalikar,Drew Torigian,Jayaram K. Udupa,Ulas Bagci

Main category: eess.IV

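TL;DR: The paper proposes a multi-center chest CT analysis framework combining deep learning and radiomics to predict pulmonary fibrosis related to Post-Acute Sequelae of COVID-19 (PASC), reaching 82.2% accuracy.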

Details Motivation: PASC (Long COVID) has diverse symptoms of uncertain duration; pulmonary fibrosis is one of its key manifestations, but clinical assessment and diagnosis remain challenging. Method: Convolutional neural networks (CNNs) and interpretable feature extraction are combined with Grad-CAM visualization and radiomics feature analysis. Result: The approach reaches 82.2% accuracy and 85.5% AUC on classification tasks, demonstrating for the first time in the literature the potential of deep learning for early detection of PASC-related lung fibrosis. Conclusion: Deep learning-driven computational methods hold significant potential for early detection and risk assessment of PASC-related lung fibrosis. Abstract: While the acute phase of the COVID-19 pandemic has subsided, its long-term effects persist through Post-Acute Sequelae of COVID-19 (PASC), commonly known as Long COVID. There remains substantial uncertainty regarding both its duration and optimal management strategies. PASC manifests as a diverse array of persistent or newly emerging symptoms--ranging from fatigue, dyspnea, and neurologic impairments (e.g., brain fog), to cardiovascular, pulmonary, and musculoskeletal abnormalities--that extend beyond the acute infection phase. This heterogeneous presentation poses substantial challenges for clinical assessment, diagnosis, and treatment planning. In this paper, we focus on imaging findings that may suggest fibrotic damage in the lungs, a critical manifestation characterized by scarring of lung tissue, which can potentially affect long-term respiratory function in patients with PASC. This study introduces a novel multi-center chest CT analysis framework that combines deep learning and radiomics for fibrosis prediction. Our approach leverages convolutional neural networks (CNNs) and interpretable feature extraction, achieving 82.2% accuracy and 85.5% AUC in classification tasks. We demonstrate the effectiveness of Grad-CAM visualization and radiomics-based feature analysis in providing clinically relevant insights for PASC-related lung fibrosis prediction. Our findings highlight the potential of deep learning-driven computational methods for early detection and risk assessment of PASC-related lung fibrosis--presented for the first time in the literature.

[155] Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling

NingFeng Que,Xiaofei Wang,Jingjing Chen,Yixuan Jiang,Chao Li

Main category: eess.IV

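TL;DR: C2-STi is a new method for interpolating missing slices between adjacent spatial transcriptomics (ST) slices, addressing the limits that missing sections and high costs place on ST analysis.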

Details Motivation: Missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST data, which undermines the completeness of 3D spatial gene expression analysis. Method: C2-STi has three modules that address the challenges of ST interpolation: 1) a distance-aware local structural modulation module, 2) a pyramid gene co-expression correlation module, and 3) a cross-modal alignment module. Result: Experiments on a public dataset show that C2-STi outperforms existing methods on both single-slice and multi-slice ST interpolation. Conclusion: C2-STi offers an effective solution for ST data interpolation and supports more complete 3D spatial gene expression analysis. Abstract: Spatial transcriptomics (ST) is a promising technique that characterizes the spatial gene profiling patterns within the tissue context. Comprehensive ST analysis depends on consecutive slices for 3D spatial insights, whereas the missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST. In this paper, we propose C2-STi, the first attempt at interpolating missing ST slices at arbitrary intermediate positions between adjacent ST slices. Although intuitive, effective ST interpolation presents significant challenges, including 1) limited continuity across heterogeneous tissue sections, 2) complex intrinsic correlation across genes, and 3) intricate cellular structures and biological semantics within each tissue section. To mitigate these challenges, in C2-STi, we design 1) a distance-aware local structural modulation module to adaptively capture cross-slice deformations and enhance positional correlations between ST slices, 2) a pyramid gene co-expression correlation module to capture multi-scale biological associations among genes, and 3) a cross-modal alignment module that integrates the ST-paired hematoxylin and eosin (H&E)-stained images to filter and align the essential cellular features across ST and H&E images. Extensive experiments on the public dataset demonstrate our superiority over state-of-the-art approaches on both single-slice and multi-slice ST interpolation. Codes are available at https://github.com/XiaofeiWang2018/C2-STi.

[156] Pretrained hybrid transformer for generalizable cardiac substructures segmentation from contrast and non-contrast CTs in lung and breast cancers

Aneesh Rangnekar,Nikhil Mankuzhy,Jonas Willmann,Chloe Choi,Abraham Wu,Maria Thor,Andreas Rimner,Harini Veeraraghavan

Main category: eess.IV

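TL;DR: The paper proposes a hybrid transformer convolutional network (HTN) for segmenting cardiac substructures across varying imaging contrasts and patient scan positions; it outperforms a publicly available benchmark model and stays accurate with less training data.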

Details Motivation: AI auto-segmentation can degrade in clinical use when the characteristics of the training data do not match those of clinical cases, so a more robust model is needed. Method: A pretrained transformer is refined into an HTN, trained on balanced and unbalanced training sets (CECT and NCCT scans), and validated across different patient cohorts. Result: The HTN outperforms the benchmark model on geometric and dose metrics and is robust to variations in imaging contrast and scan position. Conclusion: The HTN is a highly accurate cardiac substructure segmentation model suited to clinical use, and it performs particularly well when data is limited. Abstract: AI automated segmentations for radiation treatment planning (RTP) can deteriorate when applied in clinical cases with different characteristics than the training dataset. Hence, we refined a pretrained transformer into a hybrid transformer convolutional network (HTN) to segment cardiac substructures in CTs of lung and breast cancer patients acquired with varying imaging contrasts and patient scan positions. Cohort I, consisting of 56 contrast-enhanced (CECT) and 124 non-contrast CT (NCCT) scans from patients with non-small cell lung cancers acquired in supine position, was used to create oracle (all 180 training cases) and balanced (CECT: 32, NCCT: 32 training cases) HTN models. Models were evaluated on a held-out validation set of 60 cohort I patients and 66 patients with breast cancer from cohort II acquired in supine (n=45) and prone (n=21) positions. Accuracy was measured using DSC, HD95, and dose metrics. Publicly available TotalSegmentator served as the benchmark. The oracle and balanced models were similarly accurate (DSC Cohort I: 0.80 ± 0.10 versus 0.81 ± 0.10; Cohort II: 0.77 ± 0.13 versus 0.80 ± 0.12), outperforming TotalSegmentator. The balanced model, using half the training cases as oracle, produced similar dose metrics to manual delineations for all cardiac substructures. This model was robust to CT contrast in 6 out of 8 substructures and patient scan position variations in 5 out of 8 substructures and showed low correlations of accuracy to patient size and age. An HTN demonstrated robustly accurate (geometric and dose metrics) cardiac substructure segmentation from CTs with varying imaging and patient characteristics, one key requirement for clinical use.
Moreover, the model combining pretraining with balanced distribution of NCCT and CECT scans was able to provide reliably accurate segmentations under varied conditions with far fewer labeled datasets compared to an oracle model.

[157] Generative Models in Computational Pathology: A Comprehensive Survey on Methods, Applications, and Challenges

Yuan Zhang,Xinfeng Zhang,Xiaoming Qi,Xinyu Wu,Feng Chen,Guanyu Yang,Huazhu Fu

Main category: eess.IV

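TL;DR: This survey summarizes progress on generative models in computational pathology, covering image generation, text generation, and multimodal generation, analyzes over 150 studies, and discusses current challenges and future directions.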

Details Motivation: Generative models show promise in computational pathology, including data-efficient learning, data augmentation, and multimodal representation, but face challenges such as high-fidelity generation, clinical interpretability, and ethical concerns. Method: By analyzing over 150 representative studies, the survey traces the evolution of architectures from generative adversarial networks to diffusion models and foundation models, and summarizes commonly used datasets and evaluation methods. Result: The survey describes the current state of generative model applications in computational pathology and identifies limitations including high-fidelity whole slide image generation, clinical interpretability, and ethical issues. Conclusion: Future research directions include developing unified, multimodal, and clinically deployable generative systems; the survey is intended as a reference for researchers. Abstract: Generative modeling has emerged as a promising direction in computational pathology, offering capabilities such as data-efficient learning, synthetic data augmentation, and multimodal representation across diverse diagnostic tasks. This review provides a comprehensive synthesis of recent progress in the field, organized into four key domains: image generation, text generation, multimodal image-text generation, and other generative applications, including spatial simulation and molecular inference. By analyzing over 150 representative studies, we trace the evolution of generative architectures from early generative adversarial networks to recent advances in diffusion models and foundation models with generative capabilities. We further examine the datasets and evaluation protocols commonly used in this domain and highlight ongoing limitations, including challenges in generating high-fidelity whole slide images, clinical interpretability, and concerns related to the ethical and legal implications of synthetic data. The review concludes with a discussion of open challenges and prospective research directions, with an emphasis on developing unified, multimodal, and clinically deployable generative systems. This work aims to provide a foundational reference for researchers and practitioners developing and applying generative models in computational pathology.

[158] Diffusion Model in Hyperspectral Image Processing and Analysis: A Review

Xing Hu,Xiangcheng Liu,Qianqian Duan,Danfeng Hong,Dawei Zhang

Main category: eess.IV

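TL;DR: Reviews research progress on diffusion models in hyperspectral image processing, showing their advantages in tasks such as denoising, classification, and anomaly detection.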

Details Motivation: The high dimensionality and noise interference of hyperspectral images challenge traditional models, and diffusion models have become an emerging research direction because of their particular strengths with high-dimensional data. Method: By simulating the diffusion process of data over time, diffusion models can process high-dimensional data effectively and generate high-quality samples. Result: Diffusion models significantly improve the accuracy and efficiency of hyperspectral image analysis. Conclusion: Diffusion models offer a new direction for hyperspectral image processing with substantial potential for future research. Abstract: Hyperspectral image processing and analysis has important application value in remote sensing, agriculture, and environmental monitoring, but its high dimensionality, data redundancy, and noise interference pose great challenges to analysis. Traditional models have limitations in dealing with these complex data and struggle to meet the increasing demand for analysis. In recent years, the diffusion model, as an emerging generative model, has shown unique advantages in hyperspectral image processing. By simulating the diffusion process of data in time, the diffusion model can effectively process high-dimensional data, generate high-quality samples, and perform well in denoising and data enhancement. In this paper, we review the recent research advances in diffusion modeling for hyperspectral image processing and analysis, and discuss its applications in tasks such as high-dimensional data processing, noise removal, classification, and anomaly detection. The performance of diffusion-based models on image processing is compared and the challenges are summarized. It is shown that the diffusion model can significantly improve the accuracy and efficiency of hyperspectral image analysis, providing a new direction for future research.

[159] From Fibers to Cells: Fourier-Based Registration Enables Virtual Cresyl Violet Staining From 3D Polarized Light Imaging

Alexander Oberstrass,Esteban Vaca,Eric Upschulte,Meiqi Niu,Nicola Palomero-Gallagher,David Graessel,Christian Schiffer,Markus Axer,Katrin Amunts,Timo Dickscheid

Main category: eess.IV

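TL;DR: The paper proposes a deep learning-based image translation method that generates virtual cell-body staining images from 3D-PLI data, addressing the nonlinear distortions introduced during staining.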

Details Motivation: Studying the distribution of nerve fibers and cell bodies together in microstructure research requires aligning 3D-PLI and cell-body stained images, but staining introduces distortions and limits the number of samples. Method: Deep learning is used for image-to-image translation, combined with Fourier-based registration that aligns 3D-PLI and cell-body stained images during training. Result: The method predicts virtual staining from 3D-PLI that matches the real cell-body staining. Conclusion: The method offers an efficient, high-precision way to study the spatial relationship between nerve fibers and cell bodies. Abstract: Comprehensive assessment of the various aspects of the brain's microstructure requires the use of complementary imaging techniques. This includes measuring the spatial distribution of cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture). The gold standard for cytoarchitectonic analysis is light microscopic imaging of cell-body stained tissue sections. To reveal the 3D orientations of nerve fibers, 3D Polarized Light Imaging (3D-PLI) has been introduced as a reliable technique providing a resolution in the micrometer range while allowing processing of series of complete brain sections. 3D-PLI acquisition is label-free and allows subsequent staining of sections after measurement. By post-staining for cell bodies, a direct link between fiber- and cytoarchitecture can potentially be established within the same section. However, inevitable distortions introduced during the staining process make a nonlinear and cross-modal registration necessary in order to study the detailed relationships between cells and fibers in the images. In addition, the complexity of processing histological sections for post-staining only allows for a limited number of samples. In this work, we take advantage of deep learning methods for image-to-image translation to generate a virtual staining of 3D-PLI that is spatially aligned at the cellular level. In a supervised setting, we build on a unique dataset of brain sections, to which Cresyl violet staining has been applied after 3D-PLI measurement. To ensure high correspondence between both modalities, we address the misalignment of training data using Fourier-based registration methods.
In this way, registration can be efficiently calculated during training for local image patches of target and predicted staining. We demonstrate that the proposed method enables prediction of a Cresyl violet staining from 3D-PLI, matching individual cell instances.
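The abstract does not say which Fourier-based registration method is used. As one common instance, phase correlation recovers a translation from the phase of the cross-power spectrum. The sketch below is a 1-D toy version using a naive DFT; the function names and the example signal are illustrative assumptions, not the authors' implementation.

```python
import cmath

def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for a toy example)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out

def estimate_shift(reference, moved, eps=1e-12):
    """Estimate the circular shift s such that moved[t] == reference[t - s]."""
    ref_f = dft(reference)
    mov_f = dft(moved)
    # Cross-power spectrum; normalizing keeps only the phase difference.
    cross = [m * r.conjugate() for m, r in zip(mov_f, ref_f)]
    normalized = [c / (abs(c) + eps) for c in cross]
    # The inverse transform peaks at the shift.
    correlation = dft(normalized, inverse=True)
    return max(range(len(correlation)), key=lambda t: correlation[t].real)

# Hypothetical 1-D "patch" and a copy shifted right by three samples.
reference = [1.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
moved = reference[-3:] + reference[:-3]
shift = estimate_shift(reference, moved)
```

In practice a 2-D FFT over image patches would replace the naive transform, which is what makes the computation cheap enough to run inside a training loop.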

cs.CR [Back]

[160] MPMA: Preference Manipulation Attack Against Model Context Protocol

Zihan Wang,Hongwei Li,Rui Zhang,Yu Liu,Wenbo Jiang,Wenshu Fan,Qingchuan Zhao,Guowen Xu

Main category: cs.CR

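TL;DR: The paper introduces MPMA, a new security threat against the Model Context Protocol (MCP) in which a customized MCP server manipulates LLM preferences, and designs two attacks (DPMA and GAPMA), the latter using a genetic algorithm to improve stealthiness.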

Details Motivation: As MCP is widely adopted, third-party customized versions expose security vulnerabilities that can be exploited for economic gain, so its threats and defenses need study. Method: Two attacks are designed: the direct DPMA and the genetic-algorithm-based GAPMA, which optimizes tool descriptions to improve stealthiness. Result: Experiments show that GAPMA balances effectiveness and stealthiness, revealing a critical vulnerability in the MCP ecosystem. Conclusion: The study highlights the fragility of the MCP ecosystem and calls for more robust defense mechanisms to ensure fairness. Abstract: Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack ($\mathtt{DPMA}$) that achieves significant effectiveness by inserting the manipulative word and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose Genetic-based Advertising Preference Manipulation Attack ($\mathtt{GAPMA}$). $\mathtt{GAPMA}$ employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experiment results demonstrate that $\mathtt{GAPMA}$ balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem.

stat.ML [Back]

[161] On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms

Jacob Trauger,Ambuj Tewari

Main category: stat.ML

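TL;DR: This paper studies next-token prediction trained with cross-entropy loss in large language models and analyzes the consistency of several decoding algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) with respect to different target loss functions. Random sampling is consistent when the goal is to mimic the true probability distribution, whereas the other algorithms are optimal only for a subset of probability distributions. The decoding algorithm should be chosen according to the goal (e.g., information retrieval versus creative generation).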

Details Motivation: To examine how consistent different decoding algorithms for next-token prediction are with respect to target loss functions, filling a gap in the theory for LLMs. Method: Analyzes greedy, lookahead, random sampling, and temperature-scaled random sampling, and studies their consistency with respect to different target loss functions. Result: Random sampling is consistent for mimicking the true probability distribution, while the other algorithms are optimal only for a subset of probability distributions. The choice of decoding algorithm matters greatly for the goal (e.g., information retrieval or creative generation). Conclusion: Decoding algorithms should be chosen based on the goal; many algorithms in use lack theoretical grounding in numerous scenarios, and further study is needed. Abstract: Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. There are many ways to use next-token prediction to output token sequences. This paper examines a few of these algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to various goals encoded as loss functions. Although consistency of surrogate losses with respect to a target loss function is a well researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For the other goals, such as minimizing the 0-1 loss on the entire sequence, we show no polynomial-time algorithm is optimal for all probability distributions and all decoding algorithms studied are only optimal for a subset of probability distributions. When analyzing these results, we see that there is a dichotomy created between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important and many of the ones used are lacking theoretical grounding in numerous scenarios.
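To make the decoding rules named in the abstract concrete, here is a minimal sketch of greedy selection, random sampling, and temperature-scaled sampling applied to a single next-token distribution. The toy vocabulary and its probabilities are assumptions for illustration; lookahead decoding is omitted.

```python
import math
import random

def greedy(probs):
    """Pick the single most probable next token."""
    return max(probs, key=probs.get)

def temperature_scale(probs, temperature):
    """Rescale p_i to p_i^(1/T) and renormalize (same as softmax(logits / T))."""
    weights = {tok: math.exp(math.log(p) / temperature)
               for tok, p in probs.items() if p > 0}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def sample(probs, rng):
    """Draw one token at random according to the given distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution over a three-word vocabulary.
next_token = {"the": 0.5, "a": 0.3, "an": 0.2}
rng = random.Random(0)

cold = temperature_scale(next_token, 0.5)  # sharper, closer to greedy
hot = temperature_scale(next_token, 2.0)   # flatter, more diverse
drawn = sample(next_token, rng)
```

Only sampling at temperature 1 draws from the model's distribution itself; lowering the temperature moves toward the greedy choice, which is where the retrieval-versus-creativity trade-off described above shows up.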

[162] A Fourier Space Perspective on Diffusion Models

Fabian Falck,Teodora Pandeva,Kiarash Zahirnia,Rachel Lawrence,Richard Turner,Edward Meeds,Javier Zazo,Sushrut Karmalkar

Main category: stat.ML

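TL;DR: Diffusion models perform well on modalities such as images and audio, but the standard DDPM forward process noises high-frequency components faster, which degrades generation quality. This paper studies an alternative forward process in Fourier space that improves high-frequency generation quality.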

Details Motivation: The standard DDPM forward process causes high-frequency components to be generated poorly, which hurts generation quality. Method: Analyzes the inductive bias of DDPM in Fourier space and studies an alternative forward process that noises all frequencies at the same rate. Result: The alternative forward process markedly improves performance where high frequencies dominate and is on par with DDPM on standard image benchmarks. Conclusion: Adjusting the noising rate of the forward process can improve high-frequency generation quality while preserving overall performance. Abstract: Diffusion models are state-of-the-art generative models on data modalities such as images, audio, proteins and materials. These modalities share the property of exponentially decaying variance and magnitude in the Fourier domain. Under the standard Denoising Diffusion Probabilistic Models (DDPM) forward process of additive white noise, this property results in high-frequency components being corrupted faster and earlier in terms of their Signal-to-Noise Ratio (SNR) than low-frequency ones. The reverse process then generates low-frequency information before high-frequency details. In this work, we study the inductive bias of the forward process of diffusion models in Fourier space. We theoretically analyse and empirically demonstrate that the faster noising of high-frequency components in DDPM results in violations of the normality assumption in the reverse process. Our experiments show that this leads to degraded generation quality of high-frequency components. We then study an alternate forward process in Fourier space which corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation, and demonstrate marked performance improvements on datasets where high frequencies are primary, while performing on par with DDPM on standard imaging benchmarks.
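The frequency hierarchy described above can be seen in a small calculation. Under the variance-preserving forward process x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with white noise, a frequency with signal variance C_k has SNR_t(k) = ᾱ_t C_k / (1 − ᾱ_t). The sketch below assumes a linear beta schedule and an exponentially decaying spectrum C_k = exp(−0.5 k); both are illustrative choices, not values taken from the paper.

```python
import math

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, product = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def snr(alpha_bar, signal_variance):
    """Per-frequency SNR when unit-variance white noise is added."""
    return alpha_bar * signal_variance / (1.0 - alpha_bar)

def first_step_below(alpha_bars, signal_variance, threshold=1.0):
    """First timestep at which the SNR of this frequency drops below threshold."""
    for step, alpha_bar in enumerate(alpha_bars):
        if snr(alpha_bar, signal_variance) < threshold:
            return step
    return None

alpha_bars = alpha_bar_schedule()
# Assumed spectrum: variance decays exponentially with frequency index k.
variances = {k: math.exp(-0.5 * k) for k in (0, 4, 8)}
crossing = {k: first_step_below(alpha_bars, v) for k, v in variances.items()}
```

Higher frequency indices cross the SNR = 1 line at earlier timesteps, which is the sense in which they are "corrupted faster and earlier"; an equal-rate forward process would instead scale the noise per frequency so that all crossings coincide.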

cs.HC [Back]

[163] Creating General User Models from Computer Use

Omar Shaikh,Shardul Sapkota,Shan Rizvi,Eric Horvitz,Joon Sung Park,Diyi Yang,Michael S. Bernstein

Main category: cs.HC

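TL;DR: This paper proposes a general user model (GUM) architecture that learns a user's preferences and behavior by observing their interaction with the computer. It supports multimodal input and flexible reasoning, enabling more intelligent human-computer interaction.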

Details Motivation: Current user models are confined to specific applications and cannot reason flexibly. GUM aims to build a general store of user knowledge from multimodal observation, fulfilling long-standing visions of human-computer interaction. Method: GUM turns unstructured observations (e.g., device screenshots) into confidence-weighted propositions about the user and supports multimodal inference, contextual retrieval, and continuous revision. Result: GUMs accurately infer user needs (e.g., preparing for a wedding or struggling with a collaborator's feedback), and proactive assistants built on them (GUMBOs) perform actions the user did not explicitly request. Conclusion: By using multimodal models to understand unstructured context, GUM enables a new class of interactive systems and advances human-computer interaction. Abstract: Human-computer interaction has long imagined technology that understands us-from our preferences and habits, to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture that user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they're attending from messages with a friend. Or recognize that a user is struggling with a collaborator's feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user's behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn't think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.
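The infer, retrieve, and revise loop over confidence-weighted propositions can be pictured with a small data structure. The sketch below is a toy stand-in: the class names, the keyword-overlap retrieval, and the example propositions are all assumptions, whereas the GUM described in the abstract relies on multimodal models for these steps.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    text: str
    confidence: float  # assumed to lie in [0, 1]

@dataclass
class PropositionStore:
    items: list = field(default_factory=list)

    def add(self, text, confidence):
        """Record a newly inferred proposition about the user."""
        self.items.append(Proposition(text, confidence))

    def retrieve(self, query, top_k=3):
        """Rank propositions by word overlap with the query, weighted by confidence."""
        query_words = set(query.lower().split())

        def score(prop):
            overlap = len(query_words & set(prop.text.lower().split()))
            return overlap * prop.confidence

        ranked = sorted(self.items, key=score, reverse=True)
        return [p for p in ranked if score(p) > 0][:top_k]

    def revise(self, text, new_confidence):
        """Update the confidence of an existing proposition."""
        for prop in self.items:
            if prop.text == text:
                prop.confidence = new_confidence

store = PropositionStore()
store.add("user is preparing for a wedding", 0.8)
store.add("user prefers dark mode", 0.6)
related = store.retrieve("wedding gift ideas")
store.revise("user is preparing for a wedding", 0.3)
```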

[164] Large Language Model Use Impact Locus of Control

Jenny Xiyu Fu,Brennan Antone,Kowe Kadoma,Malte Jung

Main category: cs.HC

TL;DR: The paper examines how co-writing with AI affects people's locus of control and finds that employment status is a key factor.

Details Motivation: To explore how AI tools, through co-writing, affect people's self-perception and locus of control. Method: An empirical study with 462 participants, combining quantitative and qualitative analysis. Result: Employed participants relied more on AI and shifted toward internal control, while unemployed participants experienced reduced personal agency. Conclusion: The study opens a broader conversation about how AI shapes personal agency and identity. Abstract: As AI tools increasingly shape how we write, they may also quietly reshape how we perceive ourselves. This paper explores the psychological impact of co-writing with AI on people's locus of control. Through an empirical study with 462 participants, we found that employment status plays a critical role in shaping users' reliance on AI and their locus of control. Current results demonstrated that employed participants displayed higher reliance on AI and a shift toward internal control, while unemployed users tended to experience a reduction in personal agency. Through quantitative results and qualitative observations, this study opens a broader conversation about AI's role in shaping personal agency and identity.

cs.SD [Back]

[165] $\mathcal{A}LLM4ADD$: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

Hao Gu,Jiangyan Yi,Chenglong Wang,Jianhua Tao,Zheng Lian,Jiayi He,Yong Ren,Yujie Chen,Zhengqi Wen

Main category: cs.SD

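TL;DR: The paper proposes $\mathcal{A}LLM4ADD$, a framework that reformulates audio deepfake detection as an audio question answering problem and fine-tunes audio large language models for it, achieving superior detection performance, particularly in data-scarce scenarios.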

Details Motivation: Audio deepfake detection (ADD) has grown in importance with high-fidelity audio generative models, and it is unclear whether audio large language models (ALLMs) can be leveraged for it. Method: A zero-shot evaluation of ALLMs on ADD, followed by reformulating ADD as audio question answering ("Is this audio fake or real?") and supervised fine-tuning. Result: Zero-shot ALLMs are ineffective at detecting fake audio, while the fine-tuned ALLM-based method achieves superior performance, particularly in data-scarce scenarios. Conclusion: The authors anticipate that the work will inspire the use of ALLMs to build more effective ADD systems. Abstract: Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD?. In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness in detecting fake audio. To enhance their performance, we propose $\mathcal{A}LLM4ADD$, an ALLM-driven framework for ADD. Specifically, we reformulate ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?". We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems.

[166] Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

Xihuai Wang,Ziyi Zhao,Siyu Ren,Shao Zhang,Song Li,Xiaoyu Li,Ziwen Wang,Lin Qiu,Guanglu Wan,Xuezhi Cao,Xunliang Cai,Weinan Zhang

Main category: cs.SD

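TL;DR: The paper introduces the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a Turing-Test-inspired protocol for evaluating TTS systems, along with Auto-ATT for automatic evaluation.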

Details Motivation: Existing TTS evaluation methods such as MOS suffer from subjectivity and environmental inconsistency, and they neglect multi-dimensional factors such as speaking style and context diversity. Method: Introduces the ATT-Corpus dataset and a simplified Turing-Test-based evaluation protocol, and fine-tunes Qwen2-Audio-Instruct as Auto-ATT for automatic evaluation. Result: ATT effectively differentiates models across specific capability dimensions, and Auto-ATT aligns strongly with human evaluations. Conclusion: ATT and Auto-ATT provide faster and more reliable evaluation tools for TTS systems and support model development. Abstract: Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).

[167] Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization

Yanhao Jia,Ji Xie,S Jivaganesh,Hao Li,Xu Wu,Mengmi Zhang

Main category: cs.SD

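TL;DR: The study examines how AI handles cross-modal conflicts and finds that humans outperform AI at sound localization because the models tend to rely on visual input. A fine-tuned model improves on this and approaches human-like behavior.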

Details Motivation: To investigate how AI behaves under multimodal conflict, especially when sound and vision disagree, and to compare it with human performance. Method: Evaluates leading multimodal models against human psychophysics experiments across six audiovisual conditions, and fine-tunes a model on a dataset generated via 3D simulations. Result: Humans perform better when visual information is conflicting or missing, whereas AI models rely on vision and their performance degrades. The fine-tuned model improves and approaches human-like behavior. Conclusion: Sensory input quality and system architecture shape the accuracy of multimodal representations, and fine-tuning can substantially improve performance. Abstract: Imagine hearing a dog bark and turning toward the sound only to see a parked car, while the real, silent dog sits elsewhere. Such sensory conflicts test perception, yet humans reliably resolve them by prioritizing sound over misleading visuals. Despite advances in multimodal AI integrating vision and audio, little is known about how these systems handle cross-modal conflicts or whether they favor one modality. In this study, we systematically examine modality bias and conflict resolution in AI sound localization. We assess leading multimodal models and benchmark them against human performance in psychophysics experiments across six audiovisual conditions, including congruent, conflicting, and absent cues. Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. In contrast, AI models often default to visual input, degrading performance to near chance levels. To address this, we finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations. Even with limited training data, the refined model surpasses existing benchmarks. Notably, it also mirrors human-like horizontal localization bias favoring left-right precision-likely due to the stereo audio structure reflecting human ear placement. These findings underscore how sensory input quality and system architecture shape multimodal representation accuracy.

cond-mat.mtrl-sci [Back]

[168] MatTools: Benchmarking Large Language Models for Materials Science Tools

Siyu Liu,Jiamin Xu,Beilin Ye,Bo Hu,David J. Srolovitz,Tongqi Wen

Main category: cond-mat.mtrl-sci

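TL;DR: MatTools is a benchmark for evaluating how well large language models (LLMs) generate and execute code for materials science problems. It comprises a QA benchmark and a real-world tool-usage benchmark.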

Details Motivation: To assess LLM capabilities in applying materials science tools and to support the development of more effective AI systems. Method: Builds MatTools, consisting of a pymatgen-based QA benchmark and a real-world task benchmark, to evaluate LLM code generation and execution. Result: Generalist models outperform specialist ones, "AI knows AI", and simpler approaches work better. Conclusion: MatTools provides a standardized framework for assessing and improving LLM use in materials science. Abstract: Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed in which materials properties can be calculated. Here, we propose a benchmark application to evaluate the proficiency of LLMs to answer materials science questions through the generation and safe execution of codes based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) Generalists outshine specialists; (2) AI knows AI; and (3) Simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.

cs.AI [Back]

[169] Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models

Simeng Han,Stephen Xia,Grant Zhang,Howard Dai,Chen Liu,Lichang Chen,Hoang Huy Nguyen,Hongyuan Mei,Jiayuan Mao,R. Thomas McCoy

Main category: cs.AI

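TL;DR: The paper introduces a benchmark of brainteasers written in long narrative form to probe the reasoning strategies of large language models (LLMs), looking at correctness as well as the quality and creativity of solutions.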

Details Motivation: Accuracy alone does not reveal how models reason, so a new way to evaluate reasoning strategies and creativity is needed. Method: Uses brainteasers to test LLMs across multiple layers of reasoning: semantic parsing, solution generation, self-correction, step-by-step solution sketches, and use of hints. Result: LLMs can produce creative solutions in many cases but sometimes fall back on brute force, showing room for improvement in their reasoning. Conclusion: LLMs have some capacity for creative problem solving but still need to improve at efficient reasoning. Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.

[170] Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory

Yexiang Liu,Zekun Li,Zhi Fang,Nan Xu,Ran He,Tieniu Tan

Main category: cs.AI

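TL;DR: The study finds that as compute increases, complicated prompting strategies gradually fall behind simple Chain-of-Thought, and it proposes a method for quickly predicting scaling performance along with two ways to improve it.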

Details Motivation: To examine how different prompting strategies perform as test-time compute scales, particularly under majority voting. Method: Systematic experiments on 6 LLMs, 8 prompting strategies, and 6 benchmarks, combined with theoretical analysis and a probability-based method for predicting performance. Result: Complicated strategies perform better initially, but simple Chain-of-Thought overtakes them as compute increases. Conclusion: The study calls for re-examining the role of complicated prompting strategies and offers new ideas for improving scaling performance. Abstract: Recently, scaling test-time compute on Large Language Models (LLM) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as scaling. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a method according to probability theory to quickly and accurately predict the scaling performance and select the best strategy under large sampling times without extra resource-intensive inference in practice. It can serve as the test-time scaling law for majority voting. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can promote to re-examine the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance.
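A small calculation shows how a strategy with better single-sample accuracy can lose under majority voting. The sketch assumes each question has only two candidate answers and an odd number of samples, so the vote is a binomial tail; the per-question probabilities are invented for illustration and are not the paper's data or its prediction method.

```python
from math import comb

def majority_vote_accuracy(p_correct, num_samples):
    """P(correct answer wins the vote), two candidate answers, odd num_samples."""
    needed = num_samples // 2 + 1
    return sum(
        comb(num_samples, k) * p_correct**k * (1 - p_correct) ** (num_samples - k)
        for k in range(needed, num_samples + 1)
    )

def benchmark_accuracy(per_question_p, num_samples):
    """Average majority-vote accuracy over a set of questions."""
    return sum(majority_vote_accuracy(p, num_samples)
               for p in per_question_p) / len(per_question_p)

# Hypothetical per-question probabilities of sampling the correct answer.
complex_strategy = [0.9, 0.4]   # higher mean, but wrong answer is the mode on Q2
simple_cot = [0.6, 0.55]        # lower mean, but correct answer is the mode on both

for n in (1, 11, 101):
    print(n, round(benchmark_accuracy(complex_strategy, n), 3),
          round(benchmark_accuracy(simple_cot, n), 3))
```

With one sample the complex strategy leads (0.65 versus 0.575), but as the number of samples grows its second question converges to the wrong answer, so the simple strategy ends up ahead. What matters in the limit is whether the correct answer is the most likely one on each question, not the average single-sample accuracy.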

[171] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Zheng Li,Qingxiu Dong,Jingyuan Ma,Di Zhang,Zhifang Sui

Main category: cs.AI

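TL;DR: SelfBudgeter is a self-adaptive, controllable reasoning strategy that uses dual-phase training and budget-guided GPRO reinforcement learning to shorten reasoning while maintaining accuracy.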

Details Motivation: Current large reasoning models process both simple and complex queries inefficiently, wasting resources and increasing latency. Method: Dual-phase training: the model first learns to pre-estimate reasoning cost from query difficulty, then output length is optimized with budget-guided GPRO reinforcement learning. Result: On the MATH benchmark, response length is compressed by up to 74.47% with nearly undiminished accuracy. Conclusion: SelfBudgeter allocates budgets according to problem complexity and substantially improves reasoning efficiency. Abstract: Recently, large reasoning models demonstrate exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter - a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GPRO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length via pre-filling token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.

cs.LG [Back]

[172] Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

Jiazheng Zhang,Wenqing Jing,Zizhuo Zhang,Zhiheng Xi,Shihan Dou,Rongxiang Weng,Jiahuan Li,Jingang Wang,MingXu Cai,Shibo Hong,Tao Gui,Qi Zhang

Main category: cs.LG

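TL;DR: The paper proposes Collaborative Reward Modeling (CRM), a framework that combines peer review and curriculum learning to make reward models more robust to noisy preference data.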

Details Motivation: Noisy preferences in human feedback cause reward models to overfit spurious patterns and give misleading signals during policy optimization. Method: CRM trains two reward models in parallel that assess each other's data selections, and uses curriculum learning to order preference data from easy to hard. Result: Under 40% label noise, CRM improves RewardBench accuracy by up to 9.94 points and is compatible with implicit-reward alignment methods. Conclusion: CRM is a practical and versatile strategy for improving the generalization and robustness of reward models. Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
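The abstract does not give CRM's exact selection criterion. One plausible reading, given the observation that noisy pairs are harder to fit, is a small-loss rule in which each reward model keeps the pairs it fits best and hands them to its peer. The sketch below illustrates that assumed rule with the standard Bradley-Terry preference loss; the toy reward functions, data, and function names are hypothetical.

```python
import math

def preference_loss(reward_fn, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    margin = reward_fn(chosen) - reward_fn(rejected)
    return math.log1p(math.exp(-margin))

def select_low_loss(reward_fn, pairs, keep_ratio):
    """Indices of the pairs this model fits best (treated as likely clean)."""
    losses = [preference_loss(reward_fn, c, r) for c, r in pairs]
    keep = max(1, int(len(pairs) * keep_ratio))
    best = sorted(range(len(pairs)), key=losses.__getitem__)[:keep]
    return sorted(best)

def peer_review_round(model_a, model_b, pairs, keep_ratio):
    """Each model selects the subset that its peer will train on."""
    batch_for_b = select_low_loss(model_a, pairs, keep_ratio)
    batch_for_a = select_low_loss(model_b, pairs, keep_ratio)
    return batch_for_a, batch_for_b

# Toy setup: a response is a single quality number; the last pair is mislabeled.
pairs = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0)]
model_a = lambda response: 1.0 * response   # hypothetical reward model A
model_b = lambda response: 0.5 * response   # hypothetical reward model B
batch_for_a, batch_for_b = peer_review_round(model_a, model_b, pairs, keep_ratio=0.7)
```

Exchanging selections rather than self-selecting is meant to keep one model's mistakes from reinforcing themselves; an actual training loop would follow each round with a gradient step on the received batch.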

[173] UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

Jiaxuan Liu,Zhenhua Ling

Main category: cs.LG

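TL;DR: UDDETTS is a neural codec language model that unifies discrete and dimensional emotion control for controllable emotional TTS, addressing the inability of traditional methods to capture the complexity and continuity of emotion.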

Details Motivation: Traditional TTS methods rely on predefined discrete emotion labels, which cannot capture the complexity and continuity of human emotion, and imbalanced data leads to overfitting. Method: Introduces the ADV space for dimensional emotion description, supports emotion control driven by either discrete labels or ADV values, and uses a semi-supervised training strategy to exploit diverse datasets. Result: Experiments show that UDDETTS achieves linear emotion control in ADV space and strong end-to-end emotional speech synthesis. Conclusion: By unifying discrete and dimensional emotion control, UDDETTS improves the expressiveness and controllability of emotional TTS. Abstract: Recent neural codec language models have made great progress in the field of text-to-speech (TTS), but controllable emotional TTS still faces many challenges. Traditional methods rely on predefined discrete emotion labels to control emotion categories and intensities, which can't capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotion annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a neural codec language model unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotion annotations to train the UDDETTS. Experiments show that UDDETTS achieves linear emotion control along the three dimensions of ADV space, and exhibits superior end-to-end emotional speech synthesis capabilities.

[174] LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs

Ran Li,Hao Wang,Chengzhi Mao

Main category: cs.LG

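TL;DR: LARGO is a latent self-reflection attack based on gradient optimization that generates fluent and stealthy jailbreaking prompts and substantially raises attack success rates.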

Details Motivation: Existing attacks struggle to use gradient-based optimization in the discrete language space, so a more efficient way to expose LLM vulnerabilities is needed. Method: LARGO optimizes an adversarial latent vector in the LLM's continuous latent space and recursively decodes it into natural language, yielding a fast and effective attack. Result: On AdvBench and JailbreakBench, LARGO's attack success rate exceeds leading techniques such as AutoDAN by 44 points. Conclusion: LARGO demonstrates the efficacy of attacking LLM internals through gradient optimization and offers an efficient alternative for generating jailbreaking prompts. Abstract: Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM's continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate. Our findings demonstrate a potent alternative to agentic LLM prompting, highlighting the efficacy of interpreting and attacking LLM internals through gradient optimization.

[175] Maximizing Asynchronicity in Event-based Neural Networks

Haiqing Hao,Nikola Zubić,Weihua He,Zhipeng Sui,Davide Scaramuzza,Wenhui Wang

Main category: cs.LG

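TL;DR: EVA (EVent Asynchronous representation learning) is a new asynchronous-to-synchronous (A2S) framework that adapts linear attention and self-supervised learning from language modeling to produce highly expressive and generalizable event representations, markedly improving performance on event-camera vision tasks.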

Details Motivation: Event cameras offer high temporal resolution, low latency, and low redundancy, but their asynchronous, sparse output is incompatible with standard machine learning methods, and existing A2S approaches fall short on expressivity and generalizability. Method: EVA adapts linear attention and self-supervised learning from language modeling to asynchronously encode events into expressive, generalizable representations. Result: EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars) and reaches 47.7 mAP on the Gen1 detection dataset. Conclusion: EVA shows strong potential for real-time event-based vision and provides an efficient solution for asynchronous event processing. Abstract: Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned representations for ML pipelines, existing A2S approaches often sacrifice representation expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous representation learning), a novel A2S framework to generate highly expressive and generalizable event-by-event representations. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and represents the first A2S framework to successfully master demanding detection tasks, achieving a remarkable 47.7 mAP on the Gen1 dataset. These results underscore EVA's transformative potential for advancing real-time event-based vision applications.

[176] Visual Planning: Let's Think Only with Images

Yi Xu,Chengzu Li,Han Zhou,Xingchen Wan,Caiqi Zhang,Anna Korhonen,Ivan Vulić

Main category: cs.LG

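TL;DR: The paper proposes Visual Planning, a new paradigm that reasons through purely visual representations and outperforms text-based reasoning on visual navigation tasks.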

Details Motivation: Large language models (LLMs) and their multimodal extensions (MLLMs) reason mainly through text, but language may not be the most natural or effective modality for tasks involving spatial and geometric information. Method: Proposes the Visual Planning paradigm, which reasons step by step through sequences of images, and introduces VPRL, a reinforcement learning framework that uses GRPO to post-train large vision models. Result: On the visual navigation tasks FrozenLake, Maze, and MiniBehavior, visual planning outperforms text-only reasoning. Conclusion: Visual planning is a viable and promising alternative that opens new avenues for tasks benefiting from intuitive, image-based inference. Abstract: Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

[177] A probabilistic framework for dynamic quantization

Gabriele Santini,Francesco Paissan,Elisabetta Farella

Main category: cs.LG

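TL;DR: Proposes a probabilistic framework for dynamic quantization of neural networks that adapts the quantization parameters to each input, with low computational cost and negligible performance loss.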

Details Motivation: 解决传统量化方法在动态输入下性能下降的问题,同时减少计算和内存开销。 Method: 通过轻量级代理模型对预激活值应用概率模型,实现每输入自适应的量化参数调整。 Result: 在多个计算机视觉任务和模型上验证,性能损失可忽略,且计算开销优于标准量化策略。 Conclusion: 该方法在性能和计算开销之间取得了最佳平衡。 Abstract: We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network's pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best performance and computational overhead tradeoff compared to standard quantization strategies.
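
As a rough illustration of per-input quantization parameters, the sketch below derives an 8-bit scale and zero point from a predicted mean and standard deviation of the pre-activations. It is not the authors' method: the Gaussian range rule `mu ± k*sigma`, the function names, and the assumption that a surrogate supplies `mu` and `sigma` for each input are all hypothetical choices made for this example.

```python
import numpy as np

def quant_params_from_stats(mu, sigma, k=3.0, n_bits=8):
    """Asymmetric (scale, zero_point) covering [mu - k*sigma, mu + k*sigma].

    mu and sigma stand in for the output of a hypothetical lightweight
    surrogate that predicts pre-activation statistics for one input.
    """
    lo, hi = mu - k * sigma, mu + k * sigma
    qmax = 2 ** n_bits - 1
    scale = max(hi - lo, 1e-8) / qmax
    zero_point = int(np.clip(round(-lo / scale), 0, qmax))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    """Map floats to unsigned integers, clipping values outside the range."""
    q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** n_bits - 1)
    return q.astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Usage: the range follows the statistics predicted for this particular input.
scale, zp = quant_params_from_stats(mu=0.0, sigma=1.0)
x = np.array([-2.0, 0.0, 1.5])
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
```

Values inside the predicted range are recovered to within half a quantization step; values outside it are clipped, which is the cost of a range that is too narrow for the input.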

[178] Hashing for Structure-based Anomaly Detection

Filippo Leveni,Luca Magri,Cesare Alippi,Giacomo Boracchi

Main category: cs.LG

TL;DR: Proposes an efficient anomaly detection method for data structured by low-dimensional manifolds, using a high-dimensional Preference Space and Locality Sensitive Hashing to reduce computational cost.

Details Motivation: Identify anomalous samples that do not conform to structures represented by low-dimensional manifolds, and improve the efficiency of anomaly detection. Method: Uses Locality Sensitive Hashing to avoid explicit distance computation in the high-dimensional Preference Space and presents an isolation-based anomaly detection technique. Result: Achieves state-of-the-art performance at a lower computational cost. Conclusion: The method is efficient and performs well for anomaly detection, and the code is publicly available. Abstract: We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection.

Luca Magri,Filippo Leveni,Giacomo Boracchi

Main category: cs.LG

TL;DR: Proposes a new algorithm, MultiLink, for recovering multiple geometric structures of different classes from datasets contaminated by noise and outliers.

Details Motivation: Recover multiple geometric structures from data contaminated by noise and outliers, in particular structures defined by a mixture of parametric models. Method: Performs robust fitting through preference analysis and clustering; MultiLink combines on-the-fly model fitting and model selection in a new linkage scheme that decides whether two clusters should be merged. Result: Experiments show that MultiLink compares favourably with existing methods on both multi-class and single-class problems, with advantages including greater speed and lower sensitivity to the inlier threshold. Conclusion: MultiLink is an efficient and robust method for multi-class geometric structure recovery, and the code is publicly available. Abstract: We address the problem of recovering multiple structures of different classes in a dataset contaminated by noise and outliers. In particular, we consider geometric structures defined by a mixture of underlying parametric models (e.g. planes and cylinders, homographies and fundamental matrices), and we tackle the robust fitting problem by preference analysis and clustering. We present a new algorithm, termed MultiLink, that simultaneously deals with multiple classes of models. MultiLink combines on-the-fly model fitting and model selection in a novel linkage scheme that determines whether two clusters are to be merged. The resulting method features many practical advantages with respect to methods based on preference analysis, being faster, less sensitive to the inlier threshold, and able to compensate for limitations deriving from hypotheses sampling. Experiments on several public datasets demonstrate that MultiLink compares favourably with state-of-the-art alternatives, both in multi-class and single-class problems. Code is publicly made available for download.

[180] Preference Isolation Forest for Structure-based Anomaly Detection

Filippo Leveni,Luca Magri,Cesare Alippi,Giacomo Boracchi

Main category: cs.LG

TL;DR: Proposes PIF, an anomaly detection framework that combines adaptive isolation-based methods with preference embedding: data are embedded into a high-dimensional preference space by fitting low-dimensional manifolds, and anomalies are identified as isolated points.

Details Motivation: Detect anomalies, specifically samples that do not conform to structured patterns represented by low-dimensional manifolds. Method: Proposes three isolation approaches, Voronoi-iForest, RuzHash-iForest, and Sliding-PIF, each identifying anomalies with a different technique. Result: The PIF framework identifies anomalies effectively, combining flexibility and efficiency. Conclusion: PIF is a general and efficient anomaly detection framework applicable to a variety of scenarios. Abstract: We address the problem of detecting anomalies as samples that do not conform to structured patterns represented by low-dimensional manifolds. To this end, we conceive a general anomaly detection framework called Preference Isolation Forest (PIF), that combines the benefits of adaptive isolation-based methods with the flexibility of preference embedding. The key intuition is to embed the data into a high-dimensional preference space by fitting low-dimensional manifolds, and to identify anomalies as isolated points. We propose three isolation approaches to identify anomalies: $i$) Voronoi-iForest, the most general solution, $ii$) RuzHash-iForest, that avoids explicit computation of distances via Locality Sensitive Hashing, and $iii$) Sliding-PIF, that leverages a locality prior to improve efficiency and effectiveness.

[181] CTP: A hybrid CNN-Transformer-PINN model for ocean front forecasting

Yishuo Wang,Feng Zhou,Muping Zhou,Qicheng Meng,Zhijun Hu,Yi Wang

Main category: cs.LG

TL;DR: CTP is a deep learning framework combining CNN, Transformer, and PINN components for ocean front forecasting that significantly improves the accuracy and physical consistency of multi-step predictions.

Details Motivation: Ocean fronts play a key role in marine biogeochemical and physical processes, and existing methods fall short in spatial continuity and physical consistency. Method: CTP combines localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Result: In experiments on the South China Sea and Kuroshio regions, CTP achieves SOTA performance in both single-step and multi-step prediction. Conclusion: CTP significantly outperforms baseline models and offers an effective solution for ocean front forecasting. Abstract: This paper proposes CTP, a novel deep learning framework that integrates convolutional neural network (CNN), Transformer architectures, and physics-informed neural network (PINN) for ocean front prediction. Ocean fronts, as dynamic interfaces between distinct water masses, play critical roles in marine biogeochemical and physical processes. Existing methods such as LSTM, ConvLSTM, and AttentionConv often struggle to maintain spatial continuity and physical consistency over multi-step forecasts. CTP addresses these challenges by combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Experimental results across South China Sea (SCS) and Kuroshio (KUR) regions from 1993 to 2020 demonstrate that CTP achieves state-of-the-art (SOTA) performance in both single-step and multi-step predictions, significantly outperforming baseline models in accuracy, $F_1$ score, and temporal stability.

[182] Assessing the Performance of Analog Training for Transfer Learning

Omobayode Fagbohungbe,Corey Lammie,Malte J. Rasch,Takashi Ando,Tayfun Gokmen,Vijay Narayanan

Main category: cs.LG

TL;DR: The paper assesses c-TTv2, a recently introduced algorithm that addresses the training challenges of analog in-memory computing, for transfer learning with a Swin-ViT model.

Details Motivation: Analog in-memory computing is promising for deep learning training and transfer learning, but existing algorithms cannot cope with device non-linearity and asymmetry. Method: Applies the c-TTv2 algorithm, which uses the chopped technique, to analog transfer learning and tests it on a subset of the CIFAR100 dataset. Result: c-TTv2 shows strong robustness to changes in device specifications such as weight transfer noise and symmetry point skew. Conclusion: c-TTv2 offers an effective solution to the training problem in analog in-memory computing. Abstract: Analog in-memory computing is a next-generation computing paradigm that promises fast, parallel, and energy-efficient deep learning training and transfer learning (TL). However, achieving this promise has remained elusive due to a lack of suitable training algorithms. Analog memory devices exhibit asymmetric and non-linear switching behavior in addition to device-to-device variation, meaning that most, if not all, of the current off-the-shelf training algorithms cannot achieve good training outcomes. Also, recently introduced algorithms have enjoyed limited attention, as they require bi-directionally switching devices of unrealistically high symmetry and precision and are highly sensitive. A new algorithm, chopped TTv2 (c-TTv2), has been introduced, which leverages the chopped technique to address many of the challenges mentioned above. In this paper, we assess the performance of the c-TTv2 algorithm for analog TL using a Swin-ViT model on a subset of the CIFAR100 dataset. We also investigate the robustness of our algorithm to changes in some device specifications, including weight transfer noise, symmetry point skew, and symmetry point variability.

[183] What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold

Simone Azeglio,Arianna Di Bernardo

Main category: cs.LG

TL;DR: The paper proposes a score-based Riemannian metric that uses the Stein score function of diffusion models to characterize the intrinsic geometry of the data manifold without explicit parameterization. Experiments show that it outperforms baselines on image interpolation and extrapolation tasks.

Details Motivation: Diffusion models excel at capturing complex image distributions, but the geometric properties of the data manifold they learn remain poorly understood; the paper aims to fill this gap. Method: Introduces a score-based Riemannian metric that uses the Stein score function to define a metric tensor in the ambient space, preserving distances along tangential directions while stretching them perpendicular to the manifold, so that geodesics follow the manifold's contours. Result: On synthetic data, Rotated MNIST, and complex natural images generated with Stable Diffusion, the method outperforms baselines on perceptual (LPIPS) and distribution-level (FID, KID) metrics. Conclusion: The method reveals the implicit geometric structure learned by diffusion models and provides a principled basis for navigating the natural image manifold through Riemannian geometry. Abstract: Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold's contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
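
To make the idea of a metric that stretches normal directions concrete, here is a minimal sketch under an assumed form G(x) = I + λ s(x) s(x)ᵀ, where s is the score. This particular formula, the parameter `lam`, and the toy unit-circle density are assumptions for illustration; the abstract does not give the paper's exact metric. Near a manifold the score points roughly perpendicular to it, so this G inflates lengths in that direction and leaves tangential lengths unchanged.

```python
import numpy as np

def score_metric(score, lam=1.0):
    """Assumed metric tensor G = I + lam * s s^T built from a score vector s."""
    s = np.asarray(score, dtype=float)
    return np.eye(s.size) + lam * np.outer(s, s)

def squared_length(v, G):
    """Squared length of a tangent vector v under the metric G."""
    v = np.asarray(v, dtype=float)
    return float(v @ G @ v)

def circle_score(x, sigma=0.1):
    """Score of the toy density p(x) ∝ exp(-(|x| - 1)^2 / (2 sigma^2)).

    The density concentrates on the unit circle, so the score is radial
    and points back toward the circle.
    """
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(x)
    return -(r - 1.0) / sigma**2 * x / r

# At a point off the circle, radial steps cost more than tangential ones.
x = np.array([2.0, 0.0])
G = score_metric(circle_score(x, sigma=1.0))
radial = squared_length([1.0, 0.0], G)      # stretched
tangential = squared_length([0.0, 1.0], G)  # preserved
```

A shortest path under such a metric is discouraged from leaving the high-density region, which is the qualitative behavior the abstract describes for its geodesics.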

[184] Towards Robust Spiking Neural Networks:Mitigating Heterogeneous Training Vulnerability via Dominant Eigencomponent Projection

Desong Zhang,Jia Hu,Geyong Min

Main category: cs.LG

TL;DR: SNNs trained with direct encoding and BPTT are vulnerable to catastrophic collapse from slight data distribution shifts. DEP, a hyperparameter-free method, mitigates this by reducing Hessian spectral radius, enhancing robustness.

Details Motivation: SNNs' energy efficiency is compromised by vulnerability to heterogeneous data poisoning when trained with direct encoding and BPTT. Method: Developed Dominant Eigencomponent Projection (DEP) to orthogonally project gradients, reducing Hessian spectral radius. Result: DEP prevents catastrophic collapse, enhances robustness against heterogeneous data poisoning, and outperforms baselines. Conclusion: DEP provides a safer and more reliable approach for SNN deployment by addressing training vulnerabilities. Abstract: Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method--direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptionally large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Eigencomponent Projection (DEP). By orthogonally projecting gradients to precisely remove their dominant components, DEP effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DEP not only mitigates the vulnerability of SNNs to heterogeneous data poisoning, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer and more reliable SNN deployment.
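
One way to picture removing a gradient's dominant component is sketched below. It assumes, purely for illustration, that the dominant component of a layer's gradient matrix is its leading rank-1 term from a singular value decomposition; the abstract does not specify how DEP computes the component, so this should not be read as the authors' procedure. The function name is hypothetical.

```python
import numpy as np

def remove_dominant_component(grad):
    """Subtract the leading rank-1 SVD term from a 2-D gradient matrix.

    The result is orthogonal to the top singular directions, and its largest
    singular value equals the second-largest singular value of the input.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return grad - s[0] * np.outer(u[:, 0], vt[0])

# Usage: a gradient dominated by one direction loses that direction only.
grad = np.diag([3.0, 2.0, 1.0])
projected = remove_dominant_component(grad)
```

No threshold or step size appears in the function, which is consistent with the abstract's description of the method as hyperparameter-free, although the real method may differ in every other respect.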

cs.RO [Back]

[185] REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Chenxi Jiang,Chuhao Zhou,Jianfei Yang

Main category: cs.RO

TL;DR: The paper studies how vague referring expressions (REs) affect large language model (LLM)-based robot task planning and proposes a new approach, task-oriented context cognition, to address the problem.

Details Motivation: Instructions from real-world users, such as the elderly and children, often contain vague REs, whereas existing LLM planners assume clear instructions, which degrades performance. Method: Proposes REI-Bench, the first robot task planning benchmark with vague REs, and designs a task-oriented context cognition approach that generates clear instructions. Result: Vague REs reduce planning success rates by up to 77.9%, and the new approach significantly outperforms existing techniques. Conclusion: Task-oriented context cognition makes robot task planning more practical, particularly for non-expert users. Abstract: Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, whom robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark with vague REs (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 77.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompt and chains of thought. This work contributes to the research community of human-robot interaction (HRI) by making robot task planning more practical, particularly for non-expert users, e.g., the elderly and children.

[186] TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation

Manthan Patel,Fan Yang,Yuheng Qiu,Cesar Cadena,Sebastian Scherer,Marco Hutter,Wenshan Wang

Main category: cs.RO

TL;DR: TartanGround is a large-scale, multi-modal dataset for advancing the perception and autonomy of ground robots in diverse environments.

Details Motivation: Models trained on existing datasets generalize poorly across diverse scenes, so more comprehensive data are needed to train and evaluate robot perception and autonomy tasks. Method: Collects data in a variety of simulation environments with an integrated automatic pipeline, including RGB stereo cameras, depth, optical flow, LiDAR point clouds, ground-truth poses, semantic segmentation images, and semantically labeled occupancy maps. Result: Collected 910 trajectories and 1.5 million samples; evaluations show that existing methods generalize poorly across diverse scenes. Conclusion: TartanGround can serve as a testbed for training and evaluating a range of learning tasks, advancing robot perception and autonomy. Abstract: We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments, includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantic segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 910 trajectories across 70 environments, resulting in 1.5 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase for data collection will be made publicly available upon acceptance. Webpage: https://tartanair.org/tartanground

[187] GrowSplat: Constructing Temporal Digital Twins of Plants with Gaussian Splats

Simeon Adebola,Shuangyu Xie,Chung Min Kim,Justin Kerr,Bart M. van Marrewijk,Mieke van Vlaardingen,Tim van Daalen,Robert van Loo,Jose Luis Susa Rincon,Eugen Solowjow,Rick van de Zedde,Ken Goldberg

Main category: cs.RO

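TL;DR: Proposes a new framework that combines 3D Gaussian Splatting with a sample alignment pipeline to build temporal digital twins of plants, addressing the complex geometry, occlusion, and non-rigid deformation that complicate plant growth reconstruction.

Details Motivation: Temporal reconstruction of plant growth is essential for phenotyping and breeding, but it is challenging because of complex geometries, occlusions, and non-rigid deformations. Method: Reconstructs Gaussian Splats from multi-view camera data and uses a two-stage registration approach (coarse, then fine alignment) to build a 4D model of plant growth. Result: Validated on data from the Netherlands Plant Eco-phenotyping Center, producing temporal reconstructions of Sequoia and Quinoa development. Conclusion: The method provides a high-accuracy solution for temporal reconstruction of plant growth, suitable for phenotyping and breeding research. Abstract: Accurate temporal reconstructions of plant growth are essential for plant phenotyping and breeding, yet remain challenging due to complex geometries, occlusions, and non-rigid deformations of plants. We present a novel framework for building temporal digital twins of plants by combining 3D Gaussian Splatting with a robust sample alignment pipeline. Our method begins by reconstructing Gaussian Splats from multi-view camera data, then leverages a two-stage registration approach: coarse alignment through feature-based matching and Fast Global Registration, followed by fine alignment with Iterative Closest Point. This pipeline yields a consistent 4D model of plant development in discrete time steps. We evaluate the approach on data from the Netherlands Plant Eco-phenotyping Center, demonstrating detailed temporal reconstructions of Sequoia and Quinoa species. Videos and images can be seen at https://berkeleyautomation.github.io/GrowSplat/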
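
The fine-alignment stage named in the abstract, Iterative Closest Point, can be sketched in a few lines. The version below is a generic textbook point-to-point ICP with brute-force nearest neighbours and a Kabsch (SVD) rigid fit; it is not the GrowSplat implementation, it assumes a coarse alignment has already been applied, and the function names are chosen for this example.

```python
import numpy as np

def kabsch(src, dst):
    """Best rigid transform (R, t) mapping paired 3-D points src onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=20):
    """Point-to-point ICP: alternate nearest-neighbour matching and Kabsch."""
    cur = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # Brute-force nearest neighbour in dst for every current point.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matched = dst[d2.argmin(axis=1)]
        R, t = kabsch(cur, matched)
        cur = cur @ R.T + t
        # Compose the incremental transform with the running total.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

ICP only converges to the right answer from a nearby starting pose, which is why pipelines of this kind run a global, feature-based coarse alignment first.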

[188] DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy

Yuran Wang,Ruihai Wu,Yue Chen,Jiarui Wang,Jiaqi Liang,Ziyu Zhu,Haoran Geng,Jitendra Malik,Pieter Abbeel,Hao Dong

Main category: cs.RO

TL;DR: The paper proposes the DexGarmentLab environment and the HALO algorithm to tackle dexterous garment manipulation, using high-quality 3D assets and simulation techniques to narrow the sim-to-real gap and generating diverse trajectory data from a single expert demonstration.

Details Motivation: The diversity and deformability of garments make manipulation difficult; existing work struggles to replicate human dexterity and lacks realistic simulation environments. Method: Proposes the DexGarmentLab environment, with high-quality 3D assets for 15 task scenarios and refined simulation techniques; uses garment structural correspondence to generate a dataset of diverse trajectories; and designs the HALO algorithm, which completes tasks via transferable affordance points and generalizable trajectories. Result: HALO outperforms existing methods in experiments and generalizes to unseen instances with large variations in shape and deformation. Conclusion: DexGarmentLab and HALO provide an effective solution for dexterous garment manipulation, significantly improving generalization and simulation realism. Abstract: Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: https://wayrise.github.io/DexGarmentLab/.

[189] Planar Velocity Estimation for Fast-Moving Mobile Robots Using Event-Based Optical Flow

Liam Boyle,Jonas Kühne,Nicolas Baumann,Niklas Bastuck,Michele Magno

Main category: cs.RO

TL;DR: The paper proposes a new velocity estimation method for mobile robots based on event cameras and planar kinematics, which avoids the wheel-to-surface traction assumptions of conventional methods and outperforms the state of the art in experiments.

Details Motivation: Conventional velocity estimation relies on strong assumptions or complex models and adapts poorly to changing conditions such as slippery surfaces, so a more robust method is needed. Method: Combines optical flow from event cameras with planar kinematics to estimate velocity without wheel-to-surface traction assumptions. Result: Experiments show a 38.3% improvement in lateral error over the state-of-the-art Event-VIO method, and tests at speeds of up to 32 m/s confirm its effectiveness. Conclusion: The method shows potential for real-world deployment, particularly under varying conditions. Abstract: Accurate velocity estimation is critical in mobile robotics, particularly for driver assistance systems and autonomous driving. Wheel odometry fused with Inertial Measurement Unit (IMU) data is a widely used method for velocity estimation; however, it typically requires strong assumptions, such as non-slip steering, or complex vehicle dynamics models that do not hold under varying environmental conditions like slippery surfaces. We introduce an approach to velocity estimation that is decoupled from wheel-to-surface traction assumptions by leveraging planar kinematics in combination with optical flow from event cameras pointed perpendicularly at the ground. The asynchronous micro-second latency and high dynamic range of event cameras make them highly robust to motion blur, a common challenge in vision-based perception techniques for autonomous driving. The proposed method is evaluated through in-field experiments on a 1:10 scale autonomous racing platform and compared to precise motion capture data, demonstrating not only performance on par with the state-of-the-art Event-VIO method but also a 38.3 % improvement in lateral error. Qualitative experiments at highway speeds of up to 32 m/s further confirm the effectiveness of our approach, indicating significant potential for real-world deployment.
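
The relation between ground-facing optical flow and planar velocity can be illustrated with a simple pinhole model. The sketch below assumes a camera at a known height looking straight down, pixel coordinates measured from the principal point, and one particular axis and sign convention; under those assumptions the flow at pixel (x, y) is u = -(f/h)·vx + ω·y and v = -(f/h)·vy - ω·x, which is linear in (vx, vy, ω). The function name and parameters are hypothetical, and this is a generic least-squares fit, not the estimator from the paper.

```python
import numpy as np

def planar_velocity_from_flow(px, flow, focal_px, height_m):
    """Least-squares (vx, vy, yaw_rate) from flow of a downward-facing camera.

    px:    (n, 2) pixel coordinates relative to the principal point
    flow:  (n, 2) optical flow in pixels per second
    Assumes a flat ground plane at distance height_m and focal length focal_px.
    """
    x, y = px[:, 0], px[:, 1]
    k = focal_px / height_m            # pixels per metre on the ground plane
    n = len(px)
    A = np.zeros((2 * n, 3))
    A[0::2, 0] = -k                    # u depends on vx
    A[0::2, 2] = y                     # ... and on yaw rate
    A[1::2, 1] = -k                    # v depends on vy
    A[1::2, 2] = -x                    # ... and on yaw rate
    b = flow.reshape(-1)               # interleaved [u0, v0, u1, v1, ...]
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol

# Usage with synthetic flow generated from a known motion.
px = np.array([[10.0, 0.0], [0.0, 20.0], [-15.0, 5.0], [7.0, -9.0]])
vx, vy, w, f, h = 1.0, -0.5, 0.2, 300.0, 0.1
flow = np.stack([-(f / h) * vx + w * px[:, 1],
                 -(f / h) * vy - w * px[:, 0]], axis=1)
estimate = planar_velocity_from_flow(px, flow, focal_px=f, height_m=h)
```

Because the model never refers to wheel speeds, the estimate does not depend on traction, which matches the motivation given in the abstract.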

[190] Open-Source Multi-Viewpoint Surgical Telerobotics

Guido Caccianiga,Yarden Sharon,Bernard Javot,Senya Polikovsky,Gökce Ergün,Ivan Capobianco,André L. Mihaljevic,Anton Deguet,Katherine J. Kuchenbecker

Main category: cs.RO

TL;DR: The paper explores how introducing a multi-viewpoint visualization and control paradigm can improve collaboration and perception in minimally invasive surgical robotics, and describes an open-source multi-viewpoint robotic surgery system.

Details Motivation: As robots for minimally invasive surgery become more accessible and modular, there is an opportunity to rethink and expand the visualization and control paradigms of surgical teleoperation to improve surgical collaboration and machine perception. Method: Integrates high-performance vision components and upgrades the da Vinci Research Kit control logic to build a synchronized multi-viewpoint, multi-sensor robotic surgery system. Result: Reports a functional summary of the system, which is intended to support multi-viewpoint collaboration and real-time 3D perception, giving shared autonomy more robust machine perception. Conclusion: Open-sourcing the system will foster collaboration and innovation in the research community and accelerate the clinical translation of cutting-edge research. Abstract: As robots for minimally invasive surgery (MIS) gradually become more accessible and modular, we believe there is a great opportunity to rethink and expand the visualization and control paradigms that have characterized surgical teleoperation since its inception. We conjecture that introducing one or more additional adjustable viewpoints in the abdominal cavity would not only unlock novel visualization and collaboration strategies for surgeons but also substantially boost the robustness of machine perception toward shared autonomy. Immediate advantages include controlling a second viewpoint and teleoperating surgical tools from a different perspective, which would allow collaborating surgeons to adjust their views independently and still maneuver their robotic instruments intuitively. Furthermore, we believe that capturing synchronized multi-view 3D measurements of the patient's anatomy would unlock advanced scene representations. Accurate real-time intraoperative 3D perception will allow algorithmic assistants to directly control one or more robotic instruments and/or robotic cameras. Toward these goals, we are building a synchronized multi-viewpoint, multi-sensor robotic surgery system by integrating high-performance vision components and upgrading the da Vinci Research Kit control logic. This short paper reports a functional summary of our setup and elaborates on its potential impacts in research and future clinical practice. By fully open-sourcing our system, we will enable the research community to reproduce our setup, improve it, and develop powerful algorithms, effectively boosting clinical translation of cutting-edge research.

[191] Exploiting Radiance Fields for Grasp Generation on Novel Synthetic Views

Abhishek Kashyap,Henrik Andreasson,Todor Stoyanov

Main category: cs.RO

TL;DR: The paper explores synthesizing novel-view images to improve grasp accuracy and coverage in vision-based robotic grasping.

Details Motivation: Multi-view images provide more information, but moving the camera takes time and may be constrained; synthesizing novel-view images addresses this. Method: Uses scene representations such as Gaussian Splatting to synthesize novel-view images from sparse real views and generate additional grasp poses. Result: On the Graspnet-1billion dataset, the synthesized views contributed additional force-closure grasps and improved coverage. Conclusion: Future work could extend this to radiance fields built from a single image, using diffusion models or generalizable radiance fields, to further improve grasp extraction. Abstract: Vision-based robot manipulation uses cameras to capture one or more images of a scene containing the objects to be manipulated. Taking multiple images can help if any object is occluded from one viewpoint but more visible from another viewpoint. However, the camera has to be moved to a sequence of suitable positions for capturing multiple images, which requires time and may not always be possible, due to reachability constraints. So while additional images can produce more accurate grasp poses due to the extra information available, the time-cost goes up with the number of additional views sampled. Scene representations like Gaussian Splatting are capable of rendering accurate photorealistic virtual images from user-specified novel viewpoints. In this work, we show initial results which indicate that novel view synthesis can provide additional context in generating grasp poses. Our experiments on the Graspnet-1billion dataset show that novel views contributed force-closure grasps in addition to the force-closure grasps obtained from sparsely sampled real views while also improving grasp coverage. In the future we hope this work can be extended to improve grasp extraction from radiance fields constructed with a single input image, using for example diffusion models or generalizable radiance fields.