cs.CV [Total: 154]
cs.GR [Total: 7]
cs.CL [Total: 66]
eess.AS [Total: 3]
cs.SE [Total: 2]
cs.CY [Total: 3]
cs.SI [Total: 2]
cond-mat.mtrl-sci [Total: 1]
eess.IV [Total: 1]
cs.LG [Total: 18]
astro-ph.CO [Total: 1]
cs.CR [Total: 2]
cs.IR [Total: 3]
cs.RO [Total: 6]
math.CO [Total: 1]
cs.AI [Total: 15]
cs.HC [Total: 15]

cs.CV

[1] Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

Dibyadip Chatterjee,Edoardo Remelli,Yale Song,Bugra Tekin,Abhay Mittal,Bharat Bhatnagar,Necati Cihan Camgöz,Shreyas Hampali,Eric Sauser,Shugao Ma,Angela Yao,Fadime Sener

Main category: cs.CV

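TL;DR: ProVideLLM is an end-to-end framework for real-time procedural video understanding. A multimodal cache and an efficient token design sharply reduce compute and memory requirements, and the model achieves state-of-the-art results on multiple tasks.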

Details

Motivation: Existing methods incur high compute and memory costs when processing long-term observations; ProVideLLM addresses this with a multimodal cache and a compact token design.

Method: Integrates a multimodal cache that stores verbalized text tokens for long-term observations and visual tokens, encoded with DETR-QFormer, for fine-grained short-term detail, yielding an efficient token representation.

Result: Reduces token count by 22x, supports per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS with only a 2GB GPU memory footprint, and reaches state-of-the-art results on six tasks.

Conclusion: ProVideLLM's efficient design markedly improves the real-time capability and accuracy of video understanding and applies to a range of procedural tasks.

Abstract: We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens - verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by 22x over existing methods in representing one hour of long-term observations while effectively encoding fine-granularity of the present. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.

[2] Entropy Rectifying Guidance for Diffusion and Flow Models

Tariq Berrada Ifriqi,Adriana Romero-Soriano,Michal Drozdzal,Jakob Verbeek,Karteek Alahari

Main category: cs.CV

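TL;DR: The paper proposes Entropy Rectifying Guidance (ERG), a guidance method that modifies the attention mechanism of diffusion transformer architectures at inference time to improve image quality, diversity, and prompt consistency simultaneously.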

Details

Motivation: Classifier-free guidance (CFG) trades off quality, diversity, and consistency in image generation, and existing remedies require an additional model or more compute.

Method: Proposes Entropy Rectifying Guidance (ERG), which applies inference-time changes to the attention mechanism of diffusion transformer architectures to steer the generation process.

Result: ERG performs strongly on text-to-image, class-conditional, and unconditional generation, and can be combined with other guidance methods such as CADS and APG for further gains.

Conclusion: ERG is a simple and effective guidance method that significantly improves generation without requiring additional resources.

Abstract: Guidance techniques are commonly used in diffusion and flow models to improve image quality and consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) -- the most widely adopted guidance technique -- contrasts conditional and unconditional predictions to improve the generated images. This results, however, in trade-offs across quality, diversity and consistency, improving some at the expense of others. While recent work has shown that it is possible to disentangle these factors to some extent, such methods come with the overhead of requiring an additional (weaker) model or more forward passes per sampling step. In this paper, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance mechanism based on inference-time changes in the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements in image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it extends to unconditional sampling. ERG results in significant improvements in various generation tasks such as text-to-image, class-conditional and unconditional image generation. We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further boosting generation performance.
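
For reference, the CFG baseline that the abstract describes as contrasting conditional and unconditional predictions can be sketched as below. This shows the standard CFG combination only, not ERG itself, whose attention-level changes the abstract does not spell out. The function name `cfg_combine`, the list-of-floats representation, and the example numbers are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combination (illustrative sketch).

    Starts from the unconditional prediction and moves along the direction
    (cond - uncond). scale = 0 returns the unconditional prediction,
    scale = 1 returns the conditional one, and scale > 1 extrapolates
    past the conditional prediction.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical two-element "noise predictions".
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

Larger scales push samples harder toward the condition, which is where the quality, diversity, and consistency trade-off mentioned in the abstract comes from.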

[3] Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training

Andrea Amaduzzi,Pierluigi Zama Ramirez,Giuseppe Lisanti,Samuele Salti,Luigi Di Stefano

Main category: cs.CV

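TL;DR: LLaNA is the first multimodal large language model (MLLM) that processes NeRF weights directly to perform NeRF captioning and Q&A, without rendering images or materializing 3D data structures.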

Details

Motivation: Existing MLLMs that reason over images and 3D data are limited in how fully those modalities represent geometry and appearance; NeRFs are an alternative that encodes both geometric and photorealistic properties.

Method: Proposes LLaNA, which directly processes the weights of a NeRF's MLP, and builds a large-scale dataset of more than 300K NeRFs for training and evaluation.

Result: LLaNA, operating directly on NeRF weights, outperforms approaches that rely on 2D or 3D representations on NeRF-language tasks.

Conclusion: LLaNA demonstrates that processing NeRF weights directly is feasible and opens a new direction for MLLMs on NeRFs.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed of more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-Language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.

[4] Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

Fulvio Sanguigni,Davide Morelli,Marcella Cornia,Rita Cucchiara

Main category: cs.CV

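TL;DR: The paper proposes Fashion-RAG, a method that customizes fashion items from textual input by combining retrieval with generation, substantially improving virtual try-on.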

Details

Motivation: Existing virtual try-on methods usually require a specific garment as input, whereas in practice users may provide only a textual description. Fashion-RAG targets this limitation.

Method: Uses retrieval-augmented generation: retrieved garment images are projected, via textual inversion, into the textual embedding space of the Stable Diffusion text encoder to generate a personalized image.

Result: On the Dress Code dataset, Fashion-RAG outperforms existing methods both qualitatively and quantitatively and captures fine-grained visual details.

Conclusion: Fashion-RAG is the first retrieval-augmented generation approach tailored to multimodal fashion image editing and offers a more practical solution for virtual try-on.

Abstract: In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffusion models have emerged as a leading approach for such generative tasks, offering superior image quality and diversity. However, most existing virtual try-on methods rely on having a specific garment input, which is often impractical in real-world scenarios where users may only provide textual specifications. To address this limitation, in this work we introduce Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that enables the customization of fashion items based on user preferences provided in textual form. Our approach retrieves multiple garments that match the input specifications and generates a personalized image by incorporating attributes from the retrieved items. To achieve this, we employ textual inversion techniques, where retrieved garment images are projected into the textual embedding space of the Stable Diffusion text encoder, allowing seamless integration of retrieved elements into the generative process. Experimental results on the Dress Code dataset demonstrate that Fashion-RAG outperforms existing methods both qualitatively and quantitatively, effectively capturing fine-grained visual details from retrieved garments. To the best of our knowledge, this is the first work to introduce a retrieval-augmented generation approach specifically tailored for multimodal fashion image editing.

[5] LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Haiwen Huang,Anpei Chen,Volodymyr Havrylov,Andreas Geiger,Dan Zhang

Main category: cs.CV

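TL;DR: The paper proposes a feature upsampler built on a coordinate-based cross-attention transformer and a self-distillation training objective, significantly improving performance on pixel-level understanding tasks.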

Details

Motivation: Vision foundation models such as DINOv2 and CLIP are held back on pixel-level tasks, mainly by their limited feature resolution.

Method: Introduces a coordinate-based cross-attention transformer that combines high-resolution images, coordinates, and low-resolution features, and proposes a training objective that builds high-resolution pseudo-groundtruth features from class-agnostic masks and self-distillation.

Result: Experiments show the method significantly outperforms existing feature upsampling techniques across various downstream tasks.

Conclusion: The method captures fine detail, adapts to different input resolutions, and offers an efficient solution for pixel-level tasks.

Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.

[6] Occlusion-Ordered Semantic Instance Segmentation

Soroosh Baselizadeh,Cheuk-To Yu,Olga Veksler,Yuri Boykov

Main category: cs.CV

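TL;DR: The paper proposes the OOSIS task, which combines relative depth ordering with instance segmentation and uses occlusion cues to support 3D analysis, outperforming conventional monocular depth estimation for this purpose.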

Details

Motivation: Conventional 2D instance segmentation lacks 3D information, and monocular depth estimation is difficult. The authors therefore use occlusions to obtain more reliable relative depth ordering and combine it with instance segmentation.

Method: Proposes the OOSIS task, extracting instances and their occlusion order simultaneously from oriented occlusion boundaries and semantic segmentation, formulated as a labeling problem. Also develops a new oriented occlusion boundary approach.

Result: Outperforms baselines on the KINS and COCOA datasets; the oriented occlusion boundary approach significantly outperforms prior work.

Conclusion: By combining occlusion ordering with instance segmentation, OOSIS provides a simple and effective way to extract 3D information that is more reliable than ordering derived from absolute depth estimates.

Abstract: Standard semantic instance segmentation provides useful, but inherently 2D information from a single image. To enable 3D analysis, one usually integrates absolute monocular depth estimation with instance segmentation. However, monocular depth is a difficult task. Instead, we leverage a simpler single-image task, occlusion-based relative depth ordering, providing coarser but useful 3D information. We show that relative depth ordering works more reliably from occlusions than from absolute depth. We propose to solve the joint task of relative depth ordering and segmentation of instances based on occlusions. We call this task Occlusion-Ordered Semantic Instance Segmentation (OOSIS). We develop an approach to OOSIS that extracts instances and their occlusion order simultaneously from oriented occlusion boundaries and semantic segmentation. Unlike the popular detect-and-segment framework for instance segmentation, combining occlusion ordering with instance segmentation allows a simple and clean formulation of OOSIS as a labeling problem. As a part of our solution for OOSIS, we develop a novel oriented occlusion boundaries approach that significantly outperforms prior work. We also develop a new joint OOSIS metric based both on instance mask accuracy and correctness of their occlusion order. We achieve better performance than strong baselines on KINS and COCOA datasets.

[7] Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design

Wei Dong,Yan Min,Han Zhou,Jun Chen

Main category: cs.CV

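TL;DR: SG-LLIE is a multi-scale CNN-Transformer hybrid framework for low-light image enhancement guided by structure priors. It extracts the priors with illumination-invariant edge detectors, combines CNN and Transformer modules, and performs strongly on multiple benchmarks.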

Details

Motivation: Existing low-light image enhancement methods are of limited use in extremely low-light environments, mainly because the problem is ill-posed and robust semantics are hard to extract from heavily corrupted images.

Method: Proposes the SG-LLIE framework, which extracts structure priors with illumination-invariant edge detectors and combines a CNN-Transformer hybrid module (HSGFE) with a multi-scale UNet architecture.

Result: Achieves state-of-the-art performance on several low-light image enhancement benchmarks and ranks second in the NTIRE 2025 challenge.

Conclusion: Through structure priors and a multi-scale hybrid architecture, SG-LLIE significantly improves low-light image enhancement.

Abstract: Current Low-light Image Enhancement (LLIE) techniques predominantly rely on either direct Low-Light (LL) to Normal-Light (NL) mappings or guidance from semantic features or illumination maps. Nonetheless, the intrinsic ill-posedness of LLIE and the difficulty in retrieving robust semantics from heavily corrupted images hinder their effectiveness in extremely low-light environments. To tackle this challenge, we present SG-LLIE, a new multi-scale CNN-Transformer hybrid framework guided by structure priors. Different from employing pre-trained models for the extraction of semantics or illumination maps, we choose to extract robust structure priors based on illumination-invariant edge detectors. Moreover, we develop a CNN-Transformer Hybrid Structure-Guided Feature Extractor (HSGFE) module at each scale within the UNet encoder-decoder architecture. Besides the CNN blocks, which excel in multi-scale feature extraction and fusion, we introduce a Structure-Guided Transformer Block (SGTB) in each HSGFE that incorporates structural priors to modulate the enhancement process. Extensive experiments show that our method achieves state-of-the-art performance on several LLIE benchmarks in both quantitative metrics and visual quality. Our solution ranks second in the NTIRE 2025 Low-Light Enhancement Challenge. Code is released at https://github.com/minyan8/imagine.

[8] Retinex-guided Histogram Transformer for Mask-free Shadow Removal

Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen

Main category: cs.CV

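TL;DR: ReHiT is a mask-free shadow removal framework based on a hybrid CNN-Transformer architecture guided by Retinex theory. A dual-branch pipeline models the reflectance and illumination components separately, and each is restored with the IG-HCT module.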

Details

Motivation: Existing deep learning methods depend on shadow masks that are hard to obtain, which limits their generalization to real-world scenes.

Method: Proposes a dual-branch pipeline and the IG-HCT module, combining CNN and Transformer blocks to handle non-uniform illumination and complex shadows.

Result: Outperforms existing mask-free methods on several benchmark datasets, with few parameters and fast inference.

Conclusion: ReHiT is efficient and competitive, and suits real-world settings with limited computational resources.

Abstract: While deep learning methods have achieved notable progress in shadow removal, many existing approaches rely on shadow masks that are difficult to obtain, limiting their generalization to real-world scenes. In this work, we propose ReHiT, an efficient mask-free shadow removal framework based on a hybrid CNN-Transformer architecture guided by Retinex theory. We first introduce a dual-branch pipeline to separately model reflectance and illumination components, and each is restored by our developed Illumination-Guided Hybrid CNN-Transformer (IG-HCT) module. Second, besides the CNN-based blocks that are capable of learning residual dense features and performing multi-scale semantic fusion, we develop the Illumination-Guided Histogram Transformer Block (IGHB) to effectively handle non-uniform illumination and spatially complex shadows. Extensive experiments on several benchmark datasets validate the effectiveness of our approach over existing mask-free methods. Trained solely on the NTIRE 2025 Shadow Removal Challenge dataset, our solution delivers competitive results with one of the smallest parameter sizes and fastest inference speeds among top-ranked entries, highlighting its applicability for real-world applications with limited computational resources. The code is available at https://github.com/dongw22/oath.

[9] VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

Yogesh Kulkarni,Pooyan Fazli

Main category: cs.CV

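TL;DR: VideoPASTA uses preference alignment to improve video-language models' understanding of spatial, temporal, and cross-frame relations, and needs only a small number of preference pairs to deliver clear gains.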

Details

Motivation: Existing video-language models struggle with spatial relationships, temporal ordering, and cross-frame continuity, and need targeted optimization.

Method: Introduces the VideoPASTA framework, which trains models to distinguish accurate video representations from adversarial examples and applies Direct Preference Optimization.

Result: Clear gains on several benchmarks (for example, a 3.05% relative gain on VideoMME), without human annotation or heavy compute.

Conclusion: VideoPASTA shows that targeted alignment is more effective than massive pretraining or architectural modifications, and is an efficient, scalable solution.

Abstract: Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully generated adversarial examples that deliberately violate spatial, temporal, or cross-frame relations. By applying Direct Preference Optimization to just 7,020 preference pairs, VideoPASTA learns robust representations that capture fine-grained spatial relationships and long-range temporal dynamics. Experiments on standard video benchmarks show significant relative performance gains of 3.05% on VideoMME, 1.97% on NeXTQA, and 1.31% on LongVideoBench, over the baseline Qwen2.5-VL model. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without human annotation or captioning, relying on just 32-frame sampling, compared to the 96-frame, multi-GPU setups of prior work. This efficiency makes our approach a scalable, plug-and-play solution that seamlessly integrates with existing models while preserving their capabilities.
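
The abstract applies Direct Preference Optimization to pairs of accurate and adversarial responses. A minimal sketch of the standard per-pair DPO objective follows; it is not VideoPASTA's training code. The function name, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy favours the chosen response over
    the rejected one, measured relative to a frozen reference model.
    beta = 0.1 is an assumed value, not one taken from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the accurate response
# (-4.0) over the adversarial one (-6.0); the reference is indifferent.
loss = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls as the policy separates the accurate response from the adversarial one.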

[10] Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

Zhenyu Yu,Mohd Yamani Idna Idris,Pei Wang,Yuelong Xia

Main category: cs.CV

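TL;DR: DanceText is a training-free framework for multilingual text editing in images that supports complex geometric transformations and achieves seamless foreground-background integration.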

Details

Motivation: Diffusion-based generative models lack controllability and layout consistency in text-guided image synthesis.

Method: Uses a layered editing strategy that separates text from the background, adds a depth-aware module to improve photorealism and spatial consistency, and can be deployed without training.

Result: Performs strongly on the AnyWord-3M benchmark, with superior visual quality especially under complex transformations.

Conclusion: Through its modular design, DanceText delivers efficient, controllable multilingual text editing suited to complex transformations.

Abstract: We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios.

[11] Lightweight Road Environment Segmentation using Vector Quantization

Jiyong Kwag,Alper Yilmaz,Charles Toth

Main category: cs.CV

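TL;DR: The paper proposes a road environment segmentation method based on vector quantization, combined with the MobileUNETR model, that clearly improves segmentation performance.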

Details

Motivation: Existing FCN- and Transformer-based encoders rely on continuous feature representations and struggle to capture discrete information, limiting the efficiency and accuracy of semantic segmentation.

Method: Applies vector quantization to map continuous features to discrete codebook vectors, combined with the MobileUNETR model for segmentation.

Result: Reaches 77.0% mIoU on Cityscapes, 2.9% above the baseline model.

Conclusion: Vector quantization improves road environment segmentation without increasing model complexity.

Abstract: Road environment segmentation plays a significant role in autonomous driving. Numerous works based on Fully Convolutional Networks (FCNs) and Transformer architectures have been proposed to leverage local and global contextual learning for efficient and accurate semantic segmentation. In both architectures, the encoder often relies heavily on extracting continuous representations from the image, which limits the ability to represent meaningful discrete information. To address this limitation, we propose segmentation of the autonomous driving environment using vector quantization. Vector quantization offers three primary advantages for road environment segmentation. (1) Each continuous feature from the encoder is mapped to a discrete vector from the codebook, helping the model discover distinct features more easily than with complex continuous features. (2) Since discrete features act as compressed versions of the encoder's continuous features, they also compress noise or outliers, enhancing the image segmentation task. (3) Vector quantization encourages the latent space to form coarse clusters of continuous features, forcing the model to group similar features, making the learned representations more structured for the decoding process. In this work, we combined vector quantization with the lightweight image segmentation model MobileUNETR and used it as a baseline model for comparison to demonstrate its efficiency. Through experiments, we achieved 77.0% mIoU on Cityscapes, outperforming the baseline by 2.9% without increasing the model's initial size or complexity.
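
Point (1) of the abstract, mapping each continuous encoder feature to a discrete codebook vector, is commonly done by nearest-neighbour lookup. The sketch below assumes squared Euclidean distance and a tiny hand-written codebook; the paper's codebook size, distance, and training losses are not given in the abstract, and the name `quantize` is hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize(feature: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook vector nearest to `feature`.

    Nearest is measured by squared Euclidean distance (an assumption made
    for this sketch).
    """
    def squared_distance(index: int) -> float:
        return sum((f - c) ** 2 for f, c in zip(feature, codebook[index]))

    best_index = min(range(len(codebook)), key=squared_distance)
    return best_index, list(codebook[best_index])


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, vector = quantize([0.9, 0.1], codebook)
```

Every feature that falls nearest the same codebook entry is replaced by that one vector, which is the coarse clustering effect described in point (3).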

Yaning Zhang,Jiahe Zhang,Chunjie Ma,Weili Guan,Tian Gan,Zan Gao

Main category: cs.CV

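TL;DR: The paper proposes a bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), using vision, face parsing, and language modalities to improve attribution to unseen generators.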

Details

Motivation: Existing deepfake attribution methods focus mainly on the vision modality, neglect other modalities such as text and face parsing, and struggle to assess generalization to unseen generators in a fine-grained manner.

Method: Designs a multi-perspective visual encoder (MPVE) that explores features across three views (image, noise, and edge), a parsing encoder that captures global face attribute embeddings, a language encoder that captures fine-grained language embeddings, and a deepfake attribution contrastive center (DFACC) loss to optimize the model.

Result: Experiments show the method outperforms the state of the art on the ZS-DFA task.

Conclusion: Through multimodal, multi-perspective learning, the BMRL framework clearly improves attribution to unseen generators.

Abstract: The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen generators in a fine-grained manner. In this paper, we propose a novel bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), which facilitates effective traceability to unseen generators. Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views (i.e., image, noise, and edge). We devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via vision-parsing matching. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual forgery representation learning through vision-language alignment. Additionally, we present a novel deepfake attribution contrastive center (DFACC) loss, to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results demonstrate that our method outperforms the state-of-the-art on the ZS-DFA task across various evaluation protocols.

[13] Transforming hyperspectral images into chemical maps: A new deep learning based approach to hyperspectral image processing

Ole-Christian Galbo Engstrøm,Michela Albano-Gaglio,Erik Schou Dreier,Yamine Bouzembrak,Maria Font-i-Furnols,Puneet Mishra,Kim Steenstrup Pedersen

Main category: cs.CV

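TL;DR: The study proposes an end-to-end deep learning approach, based on a modified U-Net and a custom loss function, that generates chemical maps directly from hyperspectral images. Compared with conventional PLS regression, U-Net achieves better prediction accuracy and spatial correlation.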

Details

Motivation: Conventional methods for generating chemical maps from hyperspectral images, such as PLS regression, are noisy and ignore spatial context; a more efficient and accurate approach is needed.

Method: Uses a modified U-Net and a custom loss function to generate chemical maps directly, skipping the intermediate steps of traditional pixel-wise analysis.

Result: On a pork belly dataset, U-Net's root mean squared error is 9% to 13% lower than that of PLS, and 99.91% of the variance in its chemical maps is spatially correlated, versus only 2.53% for PLS.

Conclusion: U-Net is superior to PLS regression for chemical map generation, particularly in prediction accuracy and spatial correlation.

Abstract: Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. We compare the U-Net with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error between 9% and 13% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.53% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
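
The "9% to 13% lower" figure is a relative reduction in root mean squared error. The sketch below shows how such a comparison is typically computed; the function names and all numbers are hypothetical, and none of them come from the paper's data.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], references: Sequence[float]) -> float:
    """Root mean squared error between predictions and reference values."""
    if len(predictions) != len(references):
        raise ValueError("inputs must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predictions, references)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse: float, model_rmse: float) -> float:
    """Fraction by which model_rmse is lower than baseline_rmse."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical mean fat percentages for three samples.
reference = [30.0, 40.0, 50.0]
baseline_error = rmse([32.0, 38.0, 52.0], reference)  # stand-in for PLS
model_error = rmse([31.8, 38.2, 51.8], reference)     # stand-in for U-Net
gain = relative_reduction(baseline_error, model_error)
```

With these made-up values the model's RMSE is about 10% lower than the baseline's, which falls inside the range the abstract reports.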

[14] HFBRI-MAE: Handcrafted Feature Based Rotation-Invariant Masked Autoencoder for 3D Point Cloud Analysis

Xuanhua Yin,Dingxin Zhang,Jianhui Yu,Weidong Cai

Main category: cs.CV

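TL;DR: HFBRI-MAE is a self-supervised learning framework that improves masked autoencoders (MAEs) with rotation-invariant handcrafted features, addressing the performance drop of existing methods on rotated point clouds.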

Details

Motivation: Existing MAE-based self-supervised methods lack rotation invariance, so performance degrades significantly on rotated point clouds in real-world scenarios.

Method: HFBRI-MAE uses rotation-invariant local and global features for token embedding and position embedding, and redefines the reconstruction target as a canonically aligned version of the input.

Result: Experiments on ModelNet40, ScanObjectNN, and ShapeNetPart show that HFBRI-MAE outperforms existing methods in classification, segmentation, and few-shot learning.

Conclusion: HFBRI-MAE is robust, generalizes well, and suits real-world 3D applications.

Abstract: Self-supervised learning (SSL) has demonstrated remarkable success in 3D point cloud analysis, particularly through masked autoencoders (MAEs). However, existing MAE-based methods lack rotation invariance, leading to significant performance degradation when processing arbitrarily rotated point clouds in real-world scenarios. To address this limitation, we introduce Handcrafted Feature-Based Rotation-Invariant Masked Autoencoder (HFBRI-MAE), a novel framework that refines the MAE design with rotation-invariant handcrafted features to ensure stable feature learning across different orientations. By leveraging both rotation-invariant local and global features for token embedding and position embedding, HFBRI-MAE effectively eliminates rotational dependencies while preserving rich geometric structures. Additionally, we redefine the reconstruction target to a canonically aligned version of the input, mitigating rotational ambiguities. Extensive experiments on ModelNet40, ScanObjectNN, and ShapeNetPart demonstrate that HFBRI-MAE consistently outperforms existing methods in object classification, segmentation, and few-shot learning, highlighting its robustness and strong generalization ability in real-world 3D applications.
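
To make "rotation-invariant handcrafted features" concrete, the sketch below uses one simple such feature: the sorted distances of points to their centroid, which do not change when the cloud is rotated. This is an assumed stand-in for illustration; the abstract does not list the specific features HFBRI-MAE uses, and the function names and sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def rotate_z(points: Sequence[Point], angle: float) -> List[Point]:
    """Rotate every point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def centroid_distances(points: Sequence[Point]) -> List[float]:
    """Sorted distances from each point to the cloud centroid.

    Rotation preserves distances and maps the centroid to the rotated
    centroid, so this feature is identical for any rotated copy.
    """
    n = len(points)
    centroid = (sum(p[0] for p in points) / n,
                sum(p[1] for p in points) / n,
                sum(p[2] for p in points) / n)
    return sorted(math.dist(p, centroid) for p in points)


# Hypothetical four-point cloud and a rotated copy of it.
cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
original = centroid_distances(cloud)
rotated = centroid_distances(rotate_z(cloud, 0.7))
```

Raw coordinates differ between the two clouds, but the feature vectors match up to floating-point error, which is the property an encoder needs in order to be insensitive to orientation.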

[15] Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

Hangyu Liu,Bo Peng,Pengxiang Ding,Donglin Wang

Main category: cs.CV

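TL;DR: The paper proposes 2D-TGAF, a multi-target adversarial attack framework that uses diffusion models to generate two-dimensional semantic tensors that guide noise generation, clearly raising attack success rates.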

Details

Motivation: Existing generative multi-target attack methods lack practical validation and comprehensive summarization. The paper fills this gap and validates that the quality and quantity of semantic features are critical to attack transferability.

Method: Proposes the 2D-TGAF framework, which uses diffusion models to encode target labels into two-dimensional semantic tensors that guide noise generation, and designs a masking strategy so that the noise retains complete semantic information about the target class.

Result: Experiments on ImageNet show that 2D-TGAF surpasses existing methods in attack success rate, both on normally trained models and against defense mechanisms.

Conclusion: By improving the quality and quantity of semantic features, 2D-TGAF clearly strengthens multi-target adversarial attacks.

Abstract: Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. Existing generative approaches for multi-target attacks mainly analyze the effect of the use of target labels on noise generation from a theoretical perspective, lacking practical validation and comprehensive summarization. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (2D-TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments on the standard ImageNet dataset demonstrate that 2D-TGAF consistently surpasses state-of-the-art methods in attack success rates, both on normally trained models and across various defense mechanisms.

[16] Segment Any Crack: Deep Semantic Segmentation Adaptation for Crack Detection

Ghodsiyeh Rostami,Po-Han Chen,Mahdi S. Hosseini

Main category: cs.CV

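TL;DR: The study proposes a selective fine-tuning strategy that tunes only the normalization components to make segmentation models more adaptable for crack detection. The method outperforms full fine-tuning and other common fine-tuning techniques in both performance and computational efficiency.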

Details

Motivation: Existing crack detection models require large labeled datasets and costly fine-tuning, which limits their adaptability.

Method: Proposes a selective fine-tuning strategy that tunes only normalization parameters, applied to SAM and five segmentation models.

Result: On the OmniCrack30k dataset, the SAC model reaches a 61.22% F1-score and 44.13% IoU, and achieves the best performance on the zero-shot datasets.

Conclusion: Selective fine-tuning clearly improves segmentation accuracy while reducing computational overhead.

Abstract: Image-based crack detection algorithms are increasingly in demand in infrastructure monitoring, as early detection of cracks is of paramount importance for timely maintenance planning. While deep learning has significantly advanced crack detection algorithms, existing models often require extensive labeled datasets and high computational costs for fine-tuning, limiting their adaptability across diverse conditions. This study introduces an efficient selective fine-tuning strategy, focusing on tuning normalization components, to enhance the adaptability of segmentation models for crack detection. The proposed method is applied to the Segment Anything Model (SAM) and five well-established segmentation models. Experimental results demonstrate that selective fine-tuning of only normalization parameters outperforms full fine-tuning and other common fine-tuning techniques in both performance and computational efficiency, while improving generalization. The proposed approach yields a SAM-based model, Segment Any Crack (SAC), achieving a 61.22% F1-score and 44.13% IoU on the OmniCrack30k benchmark dataset, along with the highest performance across three zero-shot datasets and the lowest standard deviation. The results highlight the effectiveness of the adaptation approach in improving segmentation accuracy while significantly reducing computational overhead.
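
The two reported metrics are closely related. For a single binary mask, F1 (the Dice score) and IoU are both functions of the same true-positive, false-positive, and false-negative counts, and IoU = F1 / (2 - F1). Plugging in 61.22% F1 gives roughly 44.1% IoU, close to the reported 44.13%; dataset-level averages need not satisfy the identity exactly. The sketch below assumes flat 0/1 masks and treats two empty masks as a perfect match; both choices, and the function name, are illustrative assumptions.

```python
from typing import Sequence, Tuple


def f1_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """F1 (Dice) and IoU for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    if tp + fp + fn == 0:
        # Both masks are empty: treated as a perfect match (a convention).
        return 1.0, 1.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


# Hypothetical 4-pixel masks: one hit, one false alarm, one miss.
f1, iou = f1_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Because IoU penalizes each error more heavily than F1, the same model always scores lower on IoU than on F1 unless the prediction is perfect.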

[17] ThyroidEffi 1.0: A Cost-Effective System for High-Performance Multi-Class Thyroid Carcinoma Classification

Hai Pham-Ngoc,De Nguyen-Van,Dung Vu-Tien,Phuong Le-Hong

Main category: cs.CV

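TL;DR: Develops an efficient, interpretable deep learning system for multi-class classification of thyroid fine needle aspiration biopsy (FNAB) images, externally validated in Vietnam.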

Details

Motivation: 解决甲状腺FNAB图像分类中的数据不足、观察者间差异和计算成本高的问题，为临床决策提供支持。 Method: 结合YOLOv10细胞簇检测、课程学习协议、轻量级EfficientNetB0和Transformer模块，实现多尺度特征提取与分析。 Result: 在内部测试集上，ThyroidEffi Basic的宏F1为89.19%，AUC分别为0.98（B2）、0.95（B5）和0.96（B6）；外部验证AUC为0.9495（B2）、0.7436（B5）和0.8396（B6）。ThyroidEffi Premium进一步提升了性能。 Conclusion: 研究表明，高精度、可解释的甲状腺FNAB图像分类可以在低计算需求下实现。 Abstract: Background: Automated classification of thyroid fine needle aspiration biopsy (FNAB) images faces challenges in limited data, inter-observer variability, and computational cost. Efficient, interpretable models are crucial for clinical support. Objective: To develop and externally validate a deep learning system for the multi-class classification of thyroid FNAB images into three key categories that directly guide post-biopsy treatment decisions in Vietnam: benign (B2), suspicious for malignancy (B5), and malignant (B6), while achieving high diagnostic accuracy with low computational overhead. Methods: Our framework features: (1) YOLOv10-based cell cluster detection for informative sub-region extraction and noise reduction; (2) a curriculum learning-inspired protocol sequencing localized crops to full images for multi-scale feature capture; (3) adaptive lightweight EfficientNetB0 (4 millions parameters) selection balancing performance and efficiency; and (4) a Transformer-inspired module for multi-scale, multi-region analysis. External validation used 1,015 independent FNAB images. Results: ThyroidEffi Basic achieved a macro F1 of 89.19\% and AUCs of 0.98 (B2), 0.95 (B5), and 0.96 (B6) on the internal test set. External validation yielded AUCs of 0.9495 (B2), 0.7436 (B5), and 0.8396 (B6). ThyroidEffi Premium improved macro F1 to 89.77\%. Grad-CAM highlighted key diagnostic regions, confirming interpretability. The system processed 1000 cases in 30 seconds, demonstrating feasibility on widely accessible hardware like a 12-core CPU. 
Conclusions: This work demonstrates that high-accuracy, interpretable thyroid FNAB image classification is achievable with minimal computational demands.
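The macro F1 reported above is the unweighted mean of the per-class F1 scores over B2, B5, and B6. The following is a minimal sketch of that computation; the labels and predictions are made-up illustrative values, not data from the paper, and `macro_f1` is a hypothetical helper name.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative (made-up) labels for the three categories.
y_true = ["B2", "B2", "B5", "B5", "B6", "B6"]
y_pred = ["B2", "B2", "B5", "B6", "B6", "B6"]
score = macro_f1(y_true, y_pred, ["B2", "B5", "B6"])
```

Because every class counts equally, a weak class such as B5 lowers the macro score even when the other classes are classified well.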

[18] Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Sergio Arnaud,Paul McVay,Ada Martin,Arjun Majumdar,Krishna Murthy Jatavallabhula,Phillip Thomas,Ruslan Partsey,Daniel Dugas,Abha Gejji,Alexander Sax,Vincent-Pierre Berges,Mikael Henaff,Ayush Jain,Ang Cao,Ishita Prasad,Mrinal Kalakrishnan,Michael Rabbat,Nicolas Ballas,Mido Assran,Oleksandr Maksymets,Aravind Rajeswaran,Franziska Meier

Main category: cs.CV

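TL;DR: LOCATE 3D is a model that localizes objects in 3D scenes from descriptive language. It uses the self-supervised 3D-JEPA algorithm, which combines 2D foundation models with a masked prediction task, and achieves state-of-the-art results on standard benchmarks with strong generalization.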

Details

Motivation: Localize objects in 3D scenes from language descriptions, for practical deployment on robots and AR devices. Method: Uses the self-supervised 3D-JEPA algorithm: 3D point clouds are featurized with 2D foundation models (such as CLIP and DINO), and contextualized features are learned through a masked prediction task. The trained encoder is then combined with a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Result: State-of-the-art performance on standard benchmarks with strong generalization; the authors also release LOCATE 3D DATASET, a new dataset with over 130K annotations. Conclusion: LOCATE 3D performs well on 3D object localization and, by combining self-supervised learning with multimodal data, offers an effective solution for practical applications. Abstract: We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.

[19] Segregation and Context Aggregation Network for Real-time Cloud Segmentation

Yijie Li,Hewei Wang,Jiayi Zhang,Jinjiang You,Jinfeng Xu,Puzhen Wu,Yunzhong Xiao,Soumyabrata Dev

Main category: cs.CV

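TL;DR: SCANet is a lightweight cloud segmentation model that uses a Segregation and Context Aggregation Module (SCAM) to improve segmentation accuracy and computational efficiency, making it suitable for edge devices.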

Details

Motivation: Existing methods struggle to balance segmentation accuracy and computational efficiency, which limits practical deployment on edge devices. Method: SCANet uses the SCAM module to refine rough segmentation maps into weighted sky and cloud features that are processed separately. Result: SCANet-large matches existing methods with 70.9% fewer parameters; SCANet-lite reaches 1390 fps (FP16), far above real-time requirements. Conclusion: SCANet keeps performance high while substantially reducing computational complexity, and is suitable for real-time cloud segmentation. Abstract: Cloud segmentation from intensity images is a pivotal task in atmospheric science and computer vision, aiding weather forecasting and climate analysis. Ground-based sky/cloud segmentation extracts clouds from images for further feature analysis. Existing methods struggle to balance segmentation accuracy and computational efficiency, limiting real-world deployment on edge devices. We therefore introduce SCANet, a novel lightweight cloud segmentation model featuring a Segregation and Context Aggregation Module (SCAM), which refines rough segmentation maps into weighted sky and cloud features processed separately. SCANet achieves state-of-the-art performance while drastically reducing computational complexity. SCANet-large (4.29M) achieves comparable accuracy to state-of-the-art methods with 70.9% fewer parameters. Meanwhile, SCANet-lite (90K) delivers 1390 fps in FP16, surpassing real-time standards. Additionally, we propose an efficient pre-training strategy that enhances performance even without ImageNet pre-training.

[20] Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization

Huiyi Chen,Jiawei Peng,Kaihua Tang,Xin Geng,Xu Yang

Main category: cs.CV

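TL;DR: KeCO is a key-based coreset optimization framework built on visual features. It improves the in-context learning performance of large vision-language models while substantially reducing computational and memory costs.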

Details

Motivation: Existing coreset selection methods are inefficient and lose substantial information when applied to image classification, so a more efficient approach is needed. Method: Proposes the KeCO framework, which uses untapped data to build a compact, informative coreset and uses visual-feature keys to drive the selection strategies. Result: On coarse-grained and fine-grained image classification tasks, KeCO improves average performance by more than 20% and performs well in resource-constrained settings. Conclusion: KeCO offers an efficient and practical way to optimize in-context learning for image classification. Abstract: In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset in text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as the anchor for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset under low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance for image classification tasks, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.

[21] Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

Zichuan Liu,Liming Jiang,Qing Yan,Yumin Jia,Hao Kang,Xin Lu

Main category: cs.CV

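TL;DR: Proposes FaceCLIP, an ID-preserving generation framework based on multi-modal encoding. A joint embedding space provides a unified input for identity and text, which conditions a diffusion model to generate identity-consistent, text-aligned images.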

Details

Motivation: Existing methods inject identity features through adapters, which limits generation quality. This work aims to improve identity preservation and text alignment with a multi-modal encoding strategy. Method: Introduces FaceCLIP, a multi-modal encoder that learns a joint embedding space for identity and text, uses it to condition a diffusion model, and trains it with a multi-modal alignment algorithm. Result: FaceCLIP-SDXL outperforms existing methods on identity preservation and text alignment and generates more photorealistic portraits. Conclusion: The multi-modal encoding strategy behind FaceCLIP-SDXL significantly improves ID-preserving generation and suggests a new direction for image synthesis. Abstract: We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline, by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.

[22] Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

Wenbing Zhu,Lidong Wang,Ziqing Zhou,Chengjie Wang,Yurui Pan,Ruoyi Zhang,Zhuhao Chen,Linjie Cheng,Bin-Bin Gao,Jiangning Zhang,Zhenye Gan,Yuxie Wang,Yulong Chen,Shuguang Qian,Mingmin Chi,Bo Peng,Lizhuang Ma

Main category: cs.CV

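TL;DR: The paper introduces Real-IAD D3, a high-precision multimodal dataset for industrial anomaly detection that contains RGB images, micrometer-level 3D point clouds, and a pseudo-3D modality generated by photometric stereo. It also proposes a multimodal fusion method to improve detection performance.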

Details

Motivation: Existing industrial anomaly detection datasets (such as MVTec 3D) are limited in scale and resolution and do not reflect real industrial environments well. Higher-quality multimodal datasets and methods are needed to improve detection performance. Method: Proposes the Real-IAD D3 dataset, containing RGB, 3D point cloud, and pseudo-3D modalities, and designs a method that fuses information across these modalities. Result: Experiments show that multimodal information significantly improves detection robustness and performance. Conclusion: Real-IAD D3 provides a high-quality benchmark for multimodal industrial anomaly detection, and the multimodal fusion method effectively improves detection. Abstract: The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at https://realiad4ad.github.io/Real-IAD D3

[23] Revisiting CLIP for SF-OSDA: Unleashing Zero-Shot Potential with Adaptive Threshold and Training-Free Feature Filtering

Yongguang Li,Jindong Li,Qi Wang,Qianli Xing,Runliang Niu,Shengsheng Wang,Menglin Yang

Main category: cs.CV

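TL;DR: The paper proposes CLIPXpert, which uses adaptive thresholding and an unknown-class feature filtering module to address threshold selection and feature separation for CLIP in SF-OSDA. It outperforms existing methods in experiments.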

Details

Motivation: Existing SF-OSDA methods rely on fixed thresholds and overlook intrinsic class tendencies, leaving CLIP's potential underused. Method: The BGAT module determines the threshold dynamically, and the SUFF module filters unknown-class features to improve the separation between known and unknown classes. Result: Outperforms UOTA by 1.92% on DomainNet and reaches SOTA-level performance on datasets such as Office-Home. Conclusion: CLIPXpert confirms CLIP's zero-shot potential for SF-OSDA; the method is effective and requires no training. Abstract: Source-Free Unsupervised Open-Set Domain Adaptation (SF-OSDA) methods using CLIP face significant issues: (1) while heavily dependent on domain-specific threshold selection, existing methods employ simple fixed thresholds, underutilizing CLIP's zero-shot potential in SF-OSDA scenarios; and (2) they overlook intrinsic class tendencies while employing complex training to enforce feature separation, incurring deployment costs and feature shifts that compromise CLIP's generalization ability. To address these issues, we propose CLIPXpert, a novel SF-OSDA approach that integrates two key components: an adaptive thresholding strategy and an unknown class feature filtering module. Specifically, the Box-Cox GMM-Based Adaptive Thresholding (BGAT) module dynamically determines the optimal threshold by estimating sample score distributions, balancing known class recognition and unknown class sample detection. Additionally, the Singular Value Decomposition (SVD)-Based Unknown-Class Feature Filtering (SUFF) module reduces the tendency of unknown class samples towards known classes, improving the separation between known and unknown classes. Experiments show that our source-free and training-free method outperforms the state-of-the-art trained approach UOTA by 1.92% on the DomainNet dataset, achieves SOTA-comparable performance on datasets such as Office-Home, and surpasses other SF-OSDA methods. This not only validates the effectiveness of our proposed method but also highlights CLIP's strong zero-shot potential for SF-OSDA tasks.
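The abstract describes BGAT only at a high level: estimate the sample score distribution and derive a threshold from it. The sketch below shows one plausible reading, not the authors' implementation. It applies a Box-Cox transform with a fixed `lam` (an assumption; the paper may estimate it), fits a two-component 1D Gaussian mixture with plain EM, and places the threshold where the two weighted component densities cross. All function names and the example scores are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform for positive x; lam == 0 is the log transform."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm2(xs, iters=100):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]  # initialize the means at the extremes
    var = [max(((hi - lo) / 4.0) ** 2, 1e-6)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
            w[k] = nk / len(xs)
    return w, mu, var


def adaptive_threshold(scores, lam=0.5):
    """Threshold (in the original score space) where the two components cross."""
    z = [box_cox(s, lam) for s in scores]
    w, mu, var = fit_gmm2(z)
    lo_k, hi_k = (0, 1) if mu[0] <= mu[1] else (1, 0)
    a, b = mu[lo_k], mu[hi_k]
    for _ in range(60):  # bisection between the two component means
        m = (a + b) / 2.0
        low = w[lo_k] * normal_pdf(m, mu[lo_k], var[lo_k])
        high = w[hi_k] * normal_pdf(m, mu[hi_k], var[hi_k])
        if low > high:
            a = m
        else:
            b = m
    z_thr = (a + b) / 2.0
    # Invert the Box-Cox transform to return to the score space.
    return math.exp(z_thr) if lam == 0 else (lam * z_thr + 1.0) ** (1.0 / lam)


# Made-up confidence scores: a low cluster (unknown-like) and a high cluster (known-like).
scores = [0.20, 0.22, 0.25, 0.27, 0.30, 0.80, 0.83, 0.85, 0.88, 0.90]
threshold = adaptive_threshold(scores)
```

Under this reading, samples scoring below `threshold` would be treated as unknown-class candidates. The point of fitting the distribution is that the threshold moves with the data rather than being fixed per domain.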

[24] Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

Johannes Spoecklberger,Wei Lin,Pedro Hermosilla,Sivan Doveh,Horst Possegger,M. Jehanzeb Mirza

Main category: cs.CV

TL;DR: This paper explores vision foundation models (VFMs) for LiDAR-based 3D semantic segmentation, improving performance by fusing 2D and 3D data.

Details

Motivation: VFMs show great potential in multimodal tasks, but their use in 3D tasks remains underexplored. This work aims to use cross-modal information from VFMs to improve 3D semantic segmentation. Method: Proposes a fusion network that combines 2D images and 3D point clouds, adjusts the contribution of each modality dynamically, and uses VFM features to train a 3D backbone. Result: Significantly outperforms existing methods across multiple experiments, with an average gain of 6.5 mIoU. Conclusion: VFMs have clear potential for 3D semantic segmentation, and cross-modal fusion is an effective way to improve performance. Abstract: Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, we achieve an average improvement of 6.5 mIoU (over all tasks) compared with the previous state-of-the-art.

[25] Single Document Image Highlight Removal via A Large-Scale Real-World Dataset and A Location-Aware Network

Lu Pan,Yu-Hsuan Huang,Hongxia Xie,Cheng Zhang,Hongwei Zhao,Hong-Han Shuai,Wen-Huang Cheng

Main category: cs.CV

TL;DR: The paper proposes the DocHR14K dataset and the L2HRNet network for removing highlights from document images, with significant performance gains.

Details

Motivation: Highlights in document images severely reduce text readability and visual quality, and existing deep learning methods fall short because they lack dedicated datasets and tailored designs. Method: Proposes the DocHR14K dataset and L2HRNet, a network built on a Highlight Location Prior (HLP) that estimates highlight regions from residual maps and restores details with a Laplacian pyramid and a diffusion module. Result: DocHR14K significantly improves highlight removal, and L2HRNet achieves the best performance on multiple datasets, with a 5.01% increase in PSNR and a 13.17% reduction in RMSE. Conclusion: DocHR14K and L2HRNet offer an effective solution for document highlight removal with practical potential. Abstract: Reflective documents often suffer from specular highlights under ambient lighting, severely hindering text readability and degrading overall visual quality. Although recent deep learning methods show promise in highlight removal, they remain suboptimal for document images, primarily due to the lack of dedicated datasets and tailored architectural designs. To tackle these challenges, we present DocHR14K, a large-scale real-world dataset comprising 14,902 high-resolution image pairs across six document categories and various lighting conditions. To the best of our knowledge, this is the first high-resolution dataset for document highlight removal that captures a wide range of real-world lighting conditions. Additionally, motivated by the observation that the residual map between highlighted and clean images naturally reveals the spatial structure of highlight regions, we propose a simple yet effective Highlight Location Prior (HLP) to estimate highlight masks without human annotations. Building on this prior, we present the Location-Aware Laplacian Pyramid Highlight Removal Network (L2HRNet), which effectively removes highlights by leveraging estimated priors and incorporates a diffusion module to restore details. Extensive experiments demonstrate that DocHR14K improves highlight removal under diverse lighting conditions. Our L2HRNet achieves state-of-the-art performance across three benchmark datasets, including a 5.01% increase in PSNR and a 13.17% reduction in RMSE on DocHR14K.
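The Highlight Location Prior rests on the observation that the residual between a highlighted image and its clean counterpart reveals where the highlights are. The sketch below illustrates that idea under a simple assumption: the mask is obtained by thresholding the positive residual at a fixed value. The threshold rule, the function name, and the toy images are hypothetical; the abstract does not specify how the mask is derived.

```python
def highlight_mask(highlighted, clean, threshold=0.1):
    """Binary mask from the residual of a paired (highlighted, clean) image.

    Both images are nested lists of grayscale intensities in [0, 1].
    A pixel is marked 1 when the highlighted image is brighter than the
    clean one by more than `threshold` (an assumed, illustrative rule).
    """
    return [
        [1 if (h - c) > threshold else 0 for h, c in zip(row_h, row_c)]
        for row_h, row_c in zip(highlighted, clean)
    ]


# Toy 2x3 image pair: one specular spot in the top-right corner.
clean = [[0.2, 0.2, 0.2],
         [0.2, 0.2, 0.2]]
highlighted = [[0.2, 0.25, 0.9],
               [0.2, 0.2, 0.5]]
mask = highlight_mask(highlighted, clean)
```

Since the mask comes from the image pair itself, no human annotation is needed, which matches the motivation stated in the abstract.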

[26] ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision

Xie Liang,Gao Wei,Zhenghui Ming,Li Ge

Main category: cs.CV

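TL;DR: Proposes an ROI-guided point cloud geometry compression method (RPCGC) whose dual-branch parallel structure optimizes compression performance while improving accuracy on machine vision tasks.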

Details

Motivation: Point cloud data is widely used in autonomous driving, virtual reality, and related fields, but its large volume makes storage and transmission difficult. Existing high-compression methods often damage semantic details and hurt downstream task accuracy. Method: Uses a dual-branch parallel structure in which the base layer encodes a simplified point cloud and the enhancement layer focuses on geometry details. An ROI prediction network refines the residual information, and the resulting mask is used in RD optimization. Result: On the ScanNet and SUN RGB-D datasets, RPCGC achieves strong compression performance at high bitrates and a 10% gain in detection accuracy. Conclusion: With ROI guidance and a dual-branch structure, RPCGC balances compression ratio against preservation of semantic detail and improves accuracy on machine vision tasks. Abstract: Point cloud data is pivotal in applications like autonomous driving, virtual reality, and robotics. However, its substantial volume poses significant challenges in storage and transmission. In order to obtain a high compression ratio, crucial semantic details usually confront severe damage, leading to difficulties in guaranteeing the accuracy of downstream tasks. To tackle this problem, we are the first to introduce a novel Region of Interest (ROI)-guided Point Cloud Geometry Compression (RPCGC) method for human and machine vision. Our framework employs a dual-branch parallel structure, where the base layer encodes and decodes a simplified version of the point cloud, and the enhancement layer refines this by focusing on geometry details. Furthermore, the residual information of the enhancement layer undergoes refinement through an ROI prediction network. This network generates mask information, which is then incorporated into the residuals, serving as a strong supervision signal. Additionally, we apply these mask details in the Rate-Distortion (RD) optimization process, with each point weighted in the distortion calculation. Our loss function includes RD loss and detection loss to better guide point cloud encoding for the machine. Experimental results demonstrate that RPCGC achieves exceptional compression performance and better detection accuracy (10% gain) than some learning-based compression methods at high bitrates on the ScanNet and SUN RGB-D datasets.
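The abstract states that each point is weighted in the distortion term of the RD objective according to the ROI mask, but gives no formula. A minimal sketch of one way to do this follows. It assumes a per-point weight of `1 + roi_gain * mask` and a normalized weighted mean; the weighting scheme, `roi_gain`, and the function name are illustrative assumptions, and the detection loss mentioned in the abstract is omitted.

```python
def roi_weighted_rd_loss(rate_bits, point_errors, roi_mask, lam=1.0, roi_gain=4.0):
    """Rate-distortion loss with ROI-weighted distortion.

    rate_bits:    estimated bitrate term R
    point_errors: per-point reconstruction errors d_i
    roi_mask:     per-point ROI indicators in [0, 1]
    Points inside the ROI get weight 1 + roi_gain, others weight 1
    (an assumed scheme; the paper's exact weighting is not given here).
    """
    weights = [1.0 + roi_gain * m for m in roi_mask]
    distortion = sum(w * e for w, e in zip(weights, point_errors)) / sum(weights)
    return rate_bits + lam * distortion


# Toy example: the third point lies in the ROI and has the largest error.
errors = [1.0, 1.0, 4.0]
mask = [0, 0, 1]
weighted = roi_weighted_rd_loss(2.0, errors, mask)
uniform = roi_weighted_rd_loss(2.0, errors, [0, 0, 0])
```

With the ROI weighting, errors on ROI points contribute more to the loss than the same errors elsewhere, so the encoder is pushed to spend bits where downstream detection needs them.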

Yikun Ji,Yan Hong,Jiahui Zhan,Haoxing Chen,jun lan,Huijia Zhu,Weiqiang Wang,Liqing Zhang,Jianfu Zhang

Main category: cs.CV

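TL;DR: The paper studies AI-generated image detection based on multimodal large language models (MLLMs) and proposes a more transparent, explainable detection framework.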

Details

Motivation: Advances in image generation raise public security concerns, and fake image detection needs to be both generalizable and transparent. Method: Designs six distinct prompts and integrates them into a reasoning-based detection framework, and compares MLLMs against traditional methods and human evaluators. Result: MLLMs show promise for detecting AI-generated images but also have limitations. The proposed framework improves the robustness and explainability of detection. Conclusion: MLLMs open new possibilities for fake image detection, but further optimization is needed to improve performance. Abstract: Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.

[28] Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

Bin Ren,Eduard Zamfir,Zongwei Wu,Yawei Li,Yidi Li,Danda Pani Paudel,Radu Timofte,Ming-Hsuan Yang,Luc Van Gool,Nicu Sebe

Main category: cs.CV

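TL;DR: AnyIR is a unified approach that efficiently restores images with many kinds of degradation through a joint embedding mechanism, without scaling up the model or relying on large language models.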

Details

Motivation: Traditional methods train a dedicated model per degradation, which is inefficient and redundant, while more recent methods increase model complexity or depend on cross-modal transfer. AnyIR aims to address these problems. Method: Analyzes each input in its sub-latent space, applies a gated reweighting mechanism, and uses a spatial-frequency parallel fusion strategy to strengthen local-global interaction and frequency details. Result: AnyIR reaches SOTA performance in the all-in-one restoration setting while reducing parameters by around 82% and FLOPs by around 85%. Conclusion: AnyIR provides an efficient, unified image restoration method that significantly reduces model complexity. Abstract: Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82% in parameters and 85% in FLOPs. Our code will be available at our Project page (https://amazingren.github.io/AnyIR/)

[29] ColorVein: Colorful Cancelable Vein Biometrics

Yifan Wang,Jie Gui,Xinli Shi,Linqing Gui,Yuan Yan Tang,James Tin-Yau Kwok

Main category: cs.CV

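TL;DR: This paper proposes ColorVein, a novel cancelable vein biometric generation scheme that adds color information to increase the information density of vein images and optimizes the feature extraction model for better security.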

Details

Motivation: There is currently no cancelable template generation scheme designed specifically for vein biometrics, and leakage of biometric information can threaten user privacy and anonymity. Method: ColorVein converts static grayscale information into dynamically controllable color representations through interactive colorization and introduces a secure center loss to optimize the feature extraction model. Result: ColorVein performs well on recognition performance, unlinkability, irreversibility, and revocability, and security and privacy analyses support its effectiveness. Conclusion: ColorVein is competitive with state-of-the-art methods while providing stronger security and privacy protection. Abstract: Vein recognition technologies have become one of the primary solutions for high-security identification systems. However, the issue of biometric information leakage can still pose a serious threat to user privacy and anonymity. Currently, there is no cancelable biometric template generation scheme specifically designed for vein biometrics. Therefore, this paper proposes an innovative cancelable vein biometric generation scheme: ColorVein. Unlike previous cancelable template generation schemes, ColorVein does not destroy the original biometric features and introduces additional color information to grayscale vein images. This method significantly enhances the information density of vein images by transforming static grayscale information into dynamically controllable color representations through interactive colorization. ColorVein allows users/administrators to define a controllable pseudo-random color space for grayscale vein images by editing the position, number, and color of hint points, thereby generating protected cancelable templates. Additionally, we propose a new secure center loss to optimize the training process of the protected feature extraction model, effectively increasing the feature distance between enrolled users and any potential impostors. Finally, we evaluate ColorVein's performance on all types of vein biometrics, including recognition performance, unlinkability, irreversibility, and revocability, and conduct security and privacy analyses. ColorVein achieves competitive performance compared with state-of-the-art methods.

[30] Visual Consensus Prompting for Co-Salient Object Detection

Jie Wang,Nana Yu,Zihao Zhang,Yahong Han

Main category: cs.CV

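TL;DR: Proposes a parameter-efficient Visual Consensus Prompting (VCP) architecture that addresses the limitations of existing CoSOD methods in consensus extraction and parameter efficiency, with significant performance gains.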

Details

Motivation: Existing CoSOD methods depend on encoded features to extract consensus and update parameters inefficiently, which limits how well pre-trained models perform. Method: Introduces a parameter-efficient prompt tuning paradigm in which a Consensus Prompt Generator (CPG) and a Consensus Prompt Disperser (CPD) produce task-specific Visual Consensus Prompts (VCP). Result: On the most challenging CoCA dataset, the F_m metric improves by 6.8%, outperforming 13 cutting-edge full fine-tuning models. Conclusion: The VCP architecture significantly improves CoSOD performance in a parameter-efficient way and suggests a new approach to applying pre-trained models. Abstract: Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing these two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby arousing the powerful potential of pre-trained models in addressing CoSOD tasks.
Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving the new state of the art (with 6.8% improvement in F_m metrics on the most challenging CoCA dataset). Source code is available at https://github.com/WJ-CV/VCP.

[31] Cross-attention for State-based model RWKV-7

Liu Xiao,Li Zhiyuan,Lin Yueyu

Main category: cs.CV

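TL;DR: CrossWKV is a new cross-attention mechanism that enhances the expressive power of the RWKV-7 model for text-to-image generation, achieving cross-modal alignment through the linear-complexity WKV architecture and low-rank adaptation.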

Details

Motivation: Improve the expressive power of text-to-image generation while keeping linear complexity and efficient memory use. Method: Builds on RWKV-7's WKV architecture and uses a generalized delta rule with low-rank adaptation (LoRA) to achieve cross-modal alignment. Result: Within the DIR-7 framework, CrossWKV achieves an FID of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance. Conclusion: CrossWKV performs well on cross-modal tasks and has potential for high-resolution generation and dynamic state manipulation. Abstract: We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code at https://github.com/TorchRWKV/flash-linear-attention

[32] Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

Li Yu,Xuanzhe Sun,Wei Zhou,Moncef Gabbouj

Main category: cs.CV

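TL;DR: This paper proposes TAVDiff, a diffusion model that predicts video saliency by combining visual, auditory, and textual modalities. Experiments show that it outperforms existing methods.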

Details

Motivation: Video saliency prediction is crucial for downstream applications such as video compression and human-computer interaction. Progress in multimodal learning has led researchers to explore multimodal saliency prediction to improve accuracy. Method: Proposes TAVDiff, which treats saliency prediction as an image generation task conditioned on text, audio, and visual inputs and predicts saliency maps through stepwise denoising. A SITR mechanism handles the text modality, and DiT is modified to decouple the conditional information from the timestep. Result: TAVDiff improves the SIM, CC, NSS, and AUC-J metrics by 1.03%, 2.35%, 2.71%, and 0.33%, respectively, outperforming existing methods. Conclusion: Through multimodal fusion and an improved conditional guidance mechanism, TAVDiff significantly improves the accuracy of video saliency prediction. Abstract: Video saliency prediction is crucial for downstream applications, such as video compression and human-computer interaction. With the flourishing of multimodal learning, researchers started to explore multimodal video saliency prediction, including audio-visual and text-visual approaches. Auditory cues guide the gaze of viewers to sound sources, while textual cues provide semantic guidance for understanding video content. Integrating these complementary cues can improve the accuracy of saliency prediction. Therefore, we attempt to simultaneously analyze visual, auditory, and textual modalities in this paper, and propose TAVDiff, a Text-Audio-Visual-conditioned Diffusion Model for video saliency prediction. TAVDiff treats video saliency prediction as an image generation task conditioned on textual, audio, and visual inputs, and predicts saliency maps through stepwise denoising. To effectively utilize text, we use a large multimodal model to generate textual descriptions for video frames and introduce a saliency-oriented image-text response (SITR) mechanism to generate image-text response maps. These maps are used as conditional information to guide the model to localize the visual regions that are semantically related to the textual description. The auditory modality is used as further conditional information, directing the model to focus on salient regions indicated by sounds. At the same time, the diffusion transformer (DiT) directly concatenates the conditional information with the timestep, which may affect the estimation of the noise level.
To achieve effective conditional guidance, we propose Saliency-DiT, which decouples the conditional information from the timestep. Experimental results show that TAVDiff outperforms existing methods, improving 1.03%, 2.35%, 2.71% and 0.33% on SIM, CC, NSS and AUC-J metrics, respectively.
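Three of the metrics named above have short standard definitions in the saliency literature: SIM is the histogram intersection of the two maps after each is normalized to sum to 1, CC is the Pearson correlation between the maps, and NSS is the mean z-scored prediction at fixated pixels. The sketch below implements them on flattened maps. The example values are made up, and this is not the paper's evaluation code.

```python
import math


def sim(pred, gt):
    """Similarity: sum of element-wise minima of the sum-normalized maps."""
    sp, sg = sum(pred), sum(gt)
    return sum(min(p / sp, g / sg) for p, g in zip(pred, gt))


def cc(pred, gt):
    """Pearson linear correlation coefficient between two maps."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    norm_p = math.sqrt(sum((p - mp) ** 2 for p in pred))
    norm_g = math.sqrt(sum((g - mg) ** 2 for g in gt))
    return cov / (norm_p * norm_g)


def nss(pred, fixations):
    """Normalized scanpath saliency: mean z-score at fixated pixels.

    `fixations` is a 0/1 list marking which pixels were fixated.
    """
    n = len(pred)
    mean = sum(pred) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pred) / n)
    values = [(p - mean) / std for p, f in zip(pred, fixations) if f]
    return sum(values) / len(values)


# Made-up flattened 2x2 maps.
pred = [0.1, 0.2, 0.3, 0.4]
gt = [0.1, 0.1, 0.3, 0.5]
fix = [0, 0, 0, 1]
results = (sim(pred, gt), cc(pred, gt), nss(pred, fix))
```

SIM and CC compare the prediction with a continuous ground-truth map, while NSS compares it with discrete fixation locations, which is why benchmarks usually report them together.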

[33] RAMCT: Novel Region-adaptive Multi-channel Tracker with Iterative Tikhonov Regularization for Thermal Infrared Tracking

Shang Zhang,Yuke Hou,Guoqiang Gong,Ruoyan Xiong,Yue Zhang

Main category: cs.CV

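TL;DR: RAMCT is a region-adaptive sparse correlation filter tracker that uses multi-channel feature optimization and an adaptive regularization strategy to handle low resolution, occlusion, background clutter, and target deformation in thermal infrared target tracking.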

Details

Motivation: Existing correlation filter trackers face low resolution, occlusion, background clutter, and target deformation in thermal infrared tracking, all of which degrade tracking performance. Method: 1. Introduces a spatially adaptive binary mask to refine the CF learning process; 2. Proposes a GSVD-based region-adaptive iterative Tikhonov regularization method; 3. Designs an online optimization strategy with dynamic discrepancy-based parameter adjustment. Result: On multiple benchmarks, RAMCT outperforms other state-of-the-art trackers in accuracy and robustness. Conclusion: RAMCT significantly improves thermal infrared target tracking through its adaptive and optimization strategies. Abstract: Correlation filter (CF)-based trackers have gained significant attention for their computational efficiency in thermal infrared (TIR) target tracking. However, existing methods struggle with challenges such as low-resolution imagery, occlusion, background clutter, and target deformation, which severely impact tracking performance. To overcome these limitations, we propose RAMCT, a region-adaptive sparse correlation filter tracker that integrates multi-channel feature optimization with an adaptive regularization strategy. Firstly, we refine the CF learning process by introducing a spatially adaptive binary mask, which enforces sparsity in the target region while dynamically suppressing background interference. Secondly, we introduce generalized singular value decomposition (GSVD) and propose a novel GSVD-based region-adaptive iterative Tikhonov regularization method. This enables flexible and robust optimization across multiple feature channels, improving resilience to occlusion and background variations. Thirdly, we propose an online optimization strategy with dynamic discrepancy-based parameter adjustment. This mechanism facilitates real-time adaptation to target and background variations, thereby improving tracking accuracy and robustness. Extensive experiments on LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that RAMCT outperforms other state-of-the-art trackers in terms of accuracy and robustness.
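For context on the second contribution, the sketch below shows the standard (non-region-adaptive, non-GSVD) form of iterated Tikhonov regularization for a linear system Ax = b: each step solves a regularized problem on the current residual, x_{k+1} = x_k + (AᵀA + λI)⁻¹ Aᵀ(b − A x_k). It is the textbook scheme written in plain Python, not the paper's GSVD-based variant. The helper names and the toy system are illustrative.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def solve(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [yi] for row, yi in zip(M, y)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(i + 1, n):
            factor = aug[r][i] / aug[i][i]
            for c in range(i, n + 1):
                aug[r][c] -= factor * aug[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def iterated_tikhonov(A, b, lam=0.1, steps=50):
    """Standard iterated Tikhonov: repeatedly regularize on the residual."""
    At = transpose(A)
    n = len(A[0])
    # M = A^T A + lam * I is fixed across iterations.
    M = [[sum(At[i][k] * A[k][j] for k in range(len(A))) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    x = [0.0] * n
    for _ in range(steps):
        residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        dx = solve(M, matvec(At, residual))
        x = [xi + di for xi, di in zip(x, dx)]
    return x


# Toy consistent system with exact solution [1, 3].
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 3.0]
x = iterated_tikhonov(A, b)
```

A single step is ordinary Tikhonov regularization, which biases the solution toward zero. Iterating on the residual removes that bias progressively while each individual solve stays well conditioned.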

[34] CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey

Jindong Li,Yongguang Li,Yali Fu,Jiahong Liu,Yixin Liu,Menglin Yang,Irwin King

Main category: cs.CV

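TL;DR: This survey reviews applications of CLIP in domain generalization (DG) and domain adaptation (DA), fills a gap in the existing literature, and proposes future research directions.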

Details

Motivation: No systematic survey of CLIP's applications in DG and DA exists. This paper aims to fill that gap and provide a useful reference for researchers and practitioners. Method: For DG, methods are grouped into optimizing prompt learning and using CLIP as a feature extractor; for DA, they are grouped into source-available and source-free approaches. Result: The survey summarizes CLIP's applications in DG and DA and identifies key challenges (such as overfitting and domain diversity) and future opportunities. Conclusion: The paper gives a comprehensive view of CLIP's applications in DG and DA and aims to advance the development of more robust machine learning models. Abstract: As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for enhancing model robustness across diverse environments. Contrastive Language-Image Pretraining (CLIP) plays a significant role in these tasks, offering powerful zero-shot capabilities that allow models to perform effectively in unseen domains. However, there remains a significant gap in the literature, as no comprehensive survey currently exists that systematically explores the applications of CLIP in DG and DA, highlighting the necessity for this review. This survey presents a comprehensive review of CLIP's applications in DG and DA. In DG, we categorize methods into optimizing prompt learning for task alignment and leveraging CLIP as a backbone for effective feature extraction, both enhancing model adaptability. For DA, we examine both source-available methods utilizing labeled source data and source-free approaches primarily based on target domain data, emphasizing knowledge transfer mechanisms and strategies for improved performance across diverse contexts. Key challenges, including overfitting, domain diversity, and computational efficiency, are addressed, alongside future research opportunities to advance robustness and efficiency in practical applications. By synthesizing existing literature and pinpointing critical gaps, this survey provides valuable insights for researchers and practitioners, proposing directions for effectively leveraging CLIP to enhance methodologies in domain generalization and adaptation. Ultimately, this work aims to foster innovation and collaboration in the quest for more resilient machine learning models that can perform reliably across diverse real-world scenarios.
A more up-to-date version of the papers is maintained at: https://github.com/jindongli-Ai/Survey_on_CLIP-Powered_Domain_Generalization_and_Adaptation.

[35] ISTD-YOLO: A Multi-Scale Lightweight High-Performance Infrared Small Target Detection Algorithm

Shang Zhang,Yujie Cui,Ruoyan Xiong,Huanbin Zhang

Main category: cs.CV

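TL;DR: ISTD-YOLO is a lightweight infrared small target detection algorithm based on an improved YOLOv7. It significantly improves detection of small infrared targets through a lightweight network redesign, a parameter-free attention mechanism, and NWD-based optimization of the IoU metric.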

Details

Motivation: Infrared image detection is difficult because of complex backgrounds, low signal-to-noise ratio, small target size, and weak brightness; the paper proposes a lightweight detection algorithm to address this. Method: 1. Rebuild the YOLOv7 network structure as a lightweight three-scale network; 2. Replace the ELAN-W module with VoV-GSCSP to reduce computational cost; 3. Introduce a parameter-free attention mechanism to strengthen the relevance of local information; 4. Use NWD to optimize the IoU metric and improve localization accuracy for small targets. Result: Experiments show that ISTD-YOLO clearly improves detection over YOLOv7 and mainstream algorithms, with gains on all metrics. Conclusion: ISTD-YOLO achieves high-quality detection of infrared small targets. Abstract: To address the difficulties of detection in infrared images, such as complex backgrounds, low signal-to-noise ratio, small target size, and weak brightness, a lightweight infrared small target detection algorithm, ISTD-YOLO, based on an improved YOLOv7 is proposed. First, the YOLOv7 network structure is rebuilt in lightweight form, and a three-scale lightweight network architecture is designed. Second, the ELAN-W module of the model's neck network is replaced by VoV-GSCSP to reduce the computational cost and the complexity of the network structure. Third, a parameter-free attention mechanism is introduced into the neck network to enhance the relevance of local context information. Finally, the Normalized Wasserstein Distance (NWD) is used to optimize the commonly used IoU index to enhance the localization and detection accuracy of small targets. Experimental results show that, compared with YOLOv7 and the current mainstream algorithms, ISTD-YOLO effectively improves detection, with gains on all indicators, and achieves high-quality detection of infrared small targets.
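
To illustrate why a Wasserstein-based measure can help with small targets, here is a minimal Python sketch of NWD as it is commonly defined for tiny object detection: each box is treated as a 2D Gaussian, and the distance is mapped to (0, 1] with an exponential. This is an illustration under stated assumptions, not the ISTD-YOLO implementation: the function names, the `(cx, cy, w, h)` box format, and the constant `c=12.8` are placeholders chosen for the example (in practice `c` is tuned per dataset).

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes.

    Each box is modeled as a 2D Gaussian with mean (cx, cy) and
    covariance diag(w^2 / 4, h^2 / 4). `c` is a normalizing constant;
    12.8 is only a placeholder value for this sketch.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def iou(box_a, box_b):
    """Plain IoU for (cx, cy, w, h) boxes, for comparison."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# Two 4x4 boxes that touch but do not overlap.
pred = (10.0, 10.0, 4.0, 4.0)
target = (14.0, 10.0, 4.0, 4.0)
print(iou(pred, target), nwd(pred, target))
```

For these two touching boxes IoU is exactly zero and gives no gradient signal, while NWD stays around 0.73 and still decreases smoothly as the boxes move apart, which is the property that makes it attractive for targets only a few pixels wide.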

[36] Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

Shouwei Ruan,Zhenyu Wu,Yao Huang,Ruochen Zhang,Yitong Sun,Caixin Kang,Xingxing Wei

Main category: cs.CV

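TL;DR: SC-DPO is a new framework for safety alignment of text-to-image generation models that combines safety constraints with human preference calibration to balance safety and generation quality.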

Details

Motivation: Existing methods either cannot fully guarantee safety under harmful concepts or struggle to balance safety with generation quality. Method: The paper proposes the SC-DPO framework, which includes a safety cost model, contrastive learning and cost anchoring objectives, and a Dynamic Focusing Mechanism (DFM). Result: In experiments, SC-DPO outperforms existing methods, defending effectively against NSFW content while preserving generation quality and human preference alignment. Conclusion: SC-DPO provides an efficient safety alignment solution for T2I generation and is robust to adversarial prompts. Abstract: Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into the general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs. In SC-DPO, we introduce a safety cost model to accurately quantify harmful levels for images, and train it effectively using the proposed contrastive learning and cost anchoring objectives. To apply SC-DPO for effective T2I safety alignment, we constructed SCP-10K, a safety-constrained preference dataset containing rich harmful concepts, which blends safety-constrained preference pairs under both harmful and clean instructions, further mitigating the trade-off between safety and sample quality. Additionally, we propose a Dynamic Focusing Mechanism (DFM) for SC-DPO, promoting the model's learning of difficult preference pair samples. Extensive experiments demonstrate that SC-DPO outperforms existing methods, effectively defending against various NSFW content while maintaining optimal sample quality and human preference alignment. Additionally, SC-DPO exhibits resilience against adversarial prompts designed to generate harmful content.

[37] From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion

Pourya Shamsolmoali,Masoumeh Zareapoor,Huiyu Zhou,Michael Felsberg,Dacheng Tao,Xuelong Li

Main category: cs.CV

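TL;DR: ConFill is a new image completion framework that uses a Context-Adaptive Discrepancy (CAD) model and a dynamic sampling mechanism to significantly improve how well generated content blends with the original image.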

Details

Motivation: Existing diffusion models struggle to keep known and unknown regions coherent in image completion; they lack explicit spatial and semantic alignment, so generated content is inconsistent with the original image. Method: The paper proposes the ConFill framework, which uses a CAD model to progressively align the intermediate distributions of known and unknown regions, and a dynamic sampling mechanism that adaptively raises the sampling rate in high-complexity regions. Result: Experiments show that ConFill outperforms existing methods on image completion and sets a new benchmark. Conclusion: Through context alignment and dynamic sampling, ConFill significantly improves the quality and consistency of image completion. Abstract: Image completion is a challenging task, particularly when ensuring that generated content seamlessly integrates with existing parts of an image. While recent diffusion models have shown promise, they often struggle with maintaining coherence between known and unknown (missing) regions. This issue arises from the lack of explicit spatial and semantic alignment during the diffusion process, resulting in content that does not smoothly integrate with the original image. Additionally, diffusion models typically rely on global learned distributions rather than localized features, leading to inconsistencies between the generated and existing image parts. In this work, we propose ConFill, a novel framework that introduces a Context-Adaptive Discrepancy (CAD) model to ensure that intermediate distributions of known and unknown regions are closely aligned throughout the diffusion process. By incorporating CAD, our model progressively reduces discrepancies between generated and original images at each diffusion step, leading to contextually aligned completion. Moreover, ConFill uses a new Dynamic Sampling mechanism that adaptively increases the sampling rate in regions with high reconstruction complexity. This approach enables precise adjustments, enhancing detail and integration in restored areas. Extensive experiments demonstrate that ConFill outperforms current methods, setting a new benchmark in image completion.

[38] Balancing Privacy and Action Performance: A Penalty-Driven Approach to Image Anonymization

Nazia Aslam,Kamal Nasrollahi

Main category: cs.CV

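TL;DR: The paper proposes a privacy-preserving image anonymization technique that optimizes the anonymizer to balance privacy leakage against action recognition performance.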

Details

Motivation: The growth of video surveillance systems raises a trade-off between privacy and performance; protecting privacy without sacrificing action recognition performance is a challenge. Method: The paper proposes a feature-based penalty scheme that optimizes the anonymizer to reduce privacy leakage while preserving action recognition performance. Result: Experiments show that the method significantly improves action recognition performance while keeping privacy leakage nearly constant. Conclusion: To the authors' knowledge, this is the first feature-based penalty scheme for balancing privacy protection and action recognition performance, and it is designed to align with the regulatory standards of the EU AI Act and GDPR. Abstract: The rapid development of video surveillance systems for object detection, tracking, activity recognition, and anomaly detection has revolutionized our day-to-day lives while setting alarms for privacy concerns. It isn't easy to strike a balance between visual privacy and action recognition performance in most computer vision models. Is it possible to safeguard privacy without sacrificing performance? It poses a formidable challenge, as even minor privacy enhancements can lead to substantial performance degradation. To address this challenge, we propose a privacy-preserving image anonymization technique that optimizes the anonymizer using penalties from the utility branch, ensuring improved action recognition performance while minimally affecting privacy leakage. This approach addresses the trade-off between minimizing privacy leakage and maintaining high action performance. The proposed approach is primarily designed to align with the regulatory standards of the EU AI Act and GDPR, ensuring the protection of personally identifiable information while maintaining action performance. To the best of our knowledge, we are the first to introduce a feature-based penalty scheme that exclusively controls the action features, allowing freedom to anonymize private attributes. Extensive experiments were conducted to validate the effectiveness of the proposed method. The results demonstrate that applying a penalty to the anonymizer from the utility branch enhances action performance while maintaining nearly consistent privacy leakage across different penalty settings.

[39] Exploring Generalizable Pre-training for Real-world Change Detection via Geometric Estimation

Yitao Zhao,Sen Lei,Nanqing Liu,Heng-Chao Li,Turgay Celik,Qing Zhu

Main category: cs.CV

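TL;DR: MatchCD is a self-supervision-driven change detection framework that uses geometric estimation to handle both misalignment between multi-temporal images and object changes; it needs no manual registration and can process large-scale images directly.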

Details

Motivation: Existing change detection algorithms require manual registration of multi-temporal images, which complicates the workflow. MatchCD aims to simplify this process with a self-supervised approach. Method: MatchCD uses zero-shot capability to optimize the encoder with self-supervised contrastive representation, reuses the encoder in downstream tasks, and processes large-scale images directly. Result: MatchCD performs well in complex scenarios with significant geometric distortion, which supports its effectiveness. Conclusion: Through self-supervision and geometric estimation, the MatchCD framework significantly simplifies the change detection workflow and improves performance. Abstract: As an essential procedure in earth observation systems, change detection (CD) aims to reveal the spatial-temporal evolution of the observation regions. A key prerequisite for existing change detection algorithms is aligned geo-references between multi-temporal images by fine-grained registration. However, in the majority of real-world scenarios, a prior manual registration is required between the original images, which significantly increases the complexity of the CD workflow. In this paper, we propose a self-supervision motivated CD framework with geometric estimation, called "MatchCD". Specifically, the proposed MatchCD framework utilizes the zero-shot capability to optimize the encoder with self-supervised contrastive representation, which is reused in the downstream image registration and change detection to simultaneously handle the bi-temporal misalignment and object change issues. Moreover, unlike conventional change detection, which requires segmenting the full-frame image into small patches, our MatchCD framework can directly process the original large-scale image (e.g., 6K×4K resolution) with promising performance. The performance in multiple complex scenarios with significant geometric distortion demonstrates the effectiveness of our proposed framework.

[40] FGSGT: Saliency-Guided Siamese Network Tracker Based on Key Fine-Grained Feature Information for Thermal Infrared Target Tracking

Ruoyan Xiong,Huanbin Zhang,Shentao Wang,Hui He,Yuke Hou,Yue Zhang,Yujie Cui,Huipan Guan,Shang Zhang

Main category: cs.CV

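TL;DR: Proposes a saliency-guided Siamese network tracker that uses parallel fine-grained feature learning and multi-layer feature fusion to address the difficulty of feature extraction in TIR images, significantly improving tracking accuracy.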

Details

Motivation: TIR images have few feature details and low contrast, so conventional feature extraction models struggle to capture target characteristics, and trackers are prone to interference and tracking drift. Method: The design comprises a dual-stream fine-grained feature parallel learning module, a multi-layer fine-grained feature fusion module, a Siamese residual refinement block, and a saliency loss function. Result: The tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks, and top accuracy on VOT-TIR 2015 (0.78) and VOT-TIR 2017 (0.75). Conclusion: Through fine-grained feature extraction and saliency guidance, the method significantly improves TIR tracking performance. Abstract: Thermal infrared (TIR) images typically lack detailed features and have low contrast, making it challenging for conventional feature extraction models to capture discriminative target characteristics. As a result, trackers are often affected by interference from visually similar objects and are susceptible to tracking drift. To address these challenges, we propose a novel saliency-guided Siamese network tracker based on key fine-grained feature information. First, we introduce a fine-grained feature parallel learning convolutional block with a dual-stream architecture and convolutional kernels of varying sizes. This design captures essential global features from shallow layers, enhances feature diversity, and minimizes the loss of fine-grained information typically encountered in residual connections. In addition, we propose a multi-layer fine-grained feature fusion module that uses bilinear matrix multiplication to effectively integrate features across both deep and shallow layers. Next, we introduce a Siamese residual refinement block that corrects saliency map prediction errors using residual learning. Combined with deep supervision, this mechanism progressively refines predictions, applying supervision at each recursive step to ensure consistent improvements in accuracy. Finally, we present a saliency loss function to constrain the saliency predictions, directing the network to focus on highly discriminative fine-grained features. Extensive experimental results demonstrate that the proposed tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks. It also achieves a top accuracy of 0.78 on the VOT-TIR 2015 benchmark and 0.75 on the VOT-TIR 2017 benchmark.

[41] DCFG: Diverse Cross-Channel Fine-Grained Feature Learning and Progressive Fusion Siamese Tracker for Thermal Infrared Target Tracking

Ruoyan Xiong,Yuke Hou,Princess Retor Torboh,Hui He,Huanbin Zhang,Yue Zhang,Yanpin Wang,Huipan Guan,Shang Zhang

Main category: cs.CV

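TL;DR: Proposes a new Siamese tracker for thermal infrared (TIR) tracking based on cross-channel fine-grained feature learning and progressive fusion, significantly improving tracking accuracy.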

Details

Motivation: Capturing highly discriminative features is difficult in thermal infrared tracking. Method: The tracker uses a cross-channel fine-grained feature learning network that applies masks and suppression coefficients to suppress dominant target features, adds channel rearrangement and equalization mechanisms, and uses feature redirection and channel shuffling strategies. A dedicated cross-channel fine-grained loss function is also designed. Result: The tracker reaches accuracy of 0.81 on VOT-TIR 2015 and 0.78 on VOT-TIR 2017, and outperforms other methods on the LSOTB-TIR and PTB-TIR benchmarks. Conclusion: Through fine-grained feature learning and progressive fusion, the proposed method significantly improves thermal infrared tracking performance. Abstract: To address the challenge of capturing highly discriminative features in thermal infrared (TIR) tracking, we propose a novel Siamese tracker based on cross-channel fine-grained feature learning and progressive fusion. First, we introduce a cross-channel fine-grained feature learning network that employs masks and suppression coefficients to suppress dominant target features, enabling the tracker to capture more detailed and subtle information. The network employs a channel rearrangement mechanism to enhance efficient information flow, coupled with channel equalization to reduce parameter count. Additionally, we incorporate layer-by-layer combination units for effective feature extraction and fusion, thereby minimizing parameter redundancy and computational complexity. The network further employs feature redirection and channel shuffling strategies to better integrate fine-grained details. Second, we propose a specialized cross-channel fine-grained loss function designed to guide feature groups toward distinct discriminative regions of the target, thus improving overall target representation. This loss function includes an inter-channel loss term that promotes orthogonality between channels, maximizing feature diversity and facilitating finer detail capture. Extensive experiments demonstrate that our proposed tracker achieves the highest accuracy, scoring 0.81 on the VOT-TIR 2015 and 0.78 on the VOT-TIR 2017 benchmark, while also outperforming other methods across all evaluation metrics on the LSOTB-TIR and PTB-TIR benchmarks.

[42] Visual Prompting for One-shot Controllable Video Editing without Inversion

Zhengbo Zhang,Yuxi Zhou,Duo Peng,Joo-Hwee Lim,Zhigang Tu,De Wen Soh,Lin Geng Foo

Main category: cs.CV

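TL;DR: The paper proposes a one-shot controllable video editing method that needs no DDIM inversion, using visual prompting and content consistency sampling to keep edited frames consistent in content with the source frames.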

Details

Motivation: Existing methods suffer from content inconsistency because DDIM inversion accumulates errors. Method: The method replaces DDIM inversion with visual prompting and introduces content consistency sampling (CCS) and temporal-content consistency sampling (TCS). Result: Experiments validate the effectiveness of the method. Conclusion: The new method performs well on both content consistency and temporal consistency. Abstract: One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made -- using any image editing tool -- on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which hinder the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome this, our method eliminates the need for DDIM inversion by performing OCVE through a novel perspective based on visual prompting. Furthermore, inspired by consistency models that can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose a content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce a temporal-content consistency sampling (TCS) based on Stein Variational Gradient Descent to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.
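
For readers unfamiliar with Stein Variational Gradient Descent (SVGD), which the abstract names as the basis of TCS, the following is a minimal 1D sketch of the generic SVGD update in plain Python. It is not the paper's TCS procedure: the target distribution (a standard normal), the RBF kernel with fixed `bandwidth`, the step size, and all function names are assumptions made for this illustration.

```python
import math

def svgd_step(xs, grad_log_p, step=0.1, bandwidth=1.0):
    """One generic SVGD update for a list of 1D particles.

    Each particle moves along the average of two terms over all particles:
    a kernel-weighted score (attraction toward high density) and the
    kernel gradient (repulsion that keeps particles spread out).
    """
    n = len(xs)
    new_xs = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-((xj - xi) ** 2) / (2 * bandwidth ** 2))
            attract = k * grad_log_p(xj)
            # Derivative of the RBF kernel k(xj, xi) with respect to xj.
            repel = -k * (xj - xi) / bandwidth ** 2
            phi += attract + repel
        new_xs.append(xi + step * phi / n)
    return new_xs

# Hypothetical setup: particles start far from a standard normal target,
# whose score function is d/dx log p(x) = -x.
particles = [4.0 + 0.1 * i for i in range(10)]
for _ in range(1000):
    particles = svgd_step(particles, lambda x: -x)
mean = sum(particles) / len(particles)
print(mean, min(particles), max(particles))
```

The attraction term pulls the particle set toward the target, while the repulsion term stops the particles from collapsing onto a single point, so they end up spread around the mode. How the paper couples this kind of update across video frames is not specified in the abstract.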

[43] Multispectral airborne laser scanning for tree species classification: a benchmark of machine learning and deep learning algorithms

Josef Taher,Eric Hyyppä,Matti Hyyppä,Klaara Salolahti,Xiaowei Yu,Leena Matikainen,Antero Kukko,Matti Lehtomäki,Harri Kaartinen,Sopitta Thurachen,Paula Litkey,Ville Luoma,Markus Holopainen,Gefei Kong,Hongchao Fan,Petri Rönnholm,Antti Polvivaara,Samuli Junttila,Mikko Vastaranta,Stefano Puliti,Rasmus Astrup,Joel Kostensalo,Mari Myllymäki,Maksymilian Kulicki,Krzysztof Stereńczak,Raul de Paula Pires,Ruben Valbuena,Juan Pedro Carbonell-Rivera,Jesús Torralba,Yi-Chen Chen,Lukas Winiwarter,Markus Hollaus,Gottfried Mandlburger,Narges Takhtkeshha,Fabio Remondino,Maciej Lisiewicz,Bartłomiej Kraszewski,Xinlian Liang,Jianchang Chen,Eero Ahokas,Kirsi Karila,Eugeniu Vezeteu,Petri Manninen,Roope Näsi,Heikki Hyyti,Siiri Pyykkönen,Peilun Hu,Juha Hyyppä

Main category: cs.CV

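TL;DR: The study benchmarks machine learning and deep learning methods for tree species classification on multispectral ALS data and finds that point-based deep learning methods (such as a point transformer model) perform best on high-density data.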

Details

Motivation: Climate-smart and biodiversity-preserving forestry needs precise information on forest resources, including identification of individual tree species, but current techniques face challenges in identifying rare species and in applying deep learning. Method: The study uses high-density multispectral ALS data (the HeliALS system) and existing Optech Titan data to test the tree species classification accuracy of a range of algorithms at a test site in Southern Finland. Result: The point transformer model performed best on the high-density data, with an overall accuracy of 87.9% (macro-average 74.5%); spectral information significantly improved classification performance. Conclusion: Point-based deep learning methods have an advantage for tree species classification on multispectral ALS data, and spectral information is essential for improving classification accuracy. Abstract: Climate-smart and biodiversity-preserving forestry demands precise information on forest resources, extending to the individual tree level. Multispectral airborne laser scanning (ALS) has shown promise in automated point cloud processing and tree segmentation, but challenges remain in identifying rare tree species and leveraging deep learning techniques. This study addresses these gaps by conducting a comprehensive benchmark of machine learning and deep learning methods for tree species classification. For the study, we collected high-density multispectral ALS data (>1000 pts/m$^2$) at three wavelengths using the FGI-developed HeliALS system, complemented by existing Optech Titan data (35 pts/m$^2$), to evaluate the species classification accuracy of various algorithms in a test site located in Southern Finland. Based on 5261 test segments, our findings demonstrate that point-based deep learning methods, particularly a point transformer model, outperformed traditional machine learning and image-based deep learning approaches on high-density multispectral point clouds. For the high-density ALS dataset, a point transformer model provided the best performance reaching an overall (macro-average) accuracy of 87.9% (74.5%) with a training set of 1065 segments and 92.0% (85.1%) with 5000 training segments. The best image-based deep learning method, DetailView, reached an overall (macro-average) accuracy of 84.3% (63.9%), whereas a random forest (RF) classifier achieved an overall (macro-average) accuracy of 83.2% (61.3%). 
Importantly, the overall classification accuracy of the point transformer model on the HeliALS data increased from 73.0% with no spectral information to 84.7% with single-channel reflectance, and to 87.9% with spectral information of all the three channels.
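
The gap between overall and macro-average accuracy reported above is typical when some classes are rare. The sketch below shows how the two numbers can diverge, assuming that macro-average accuracy means the unweighted mean of per-class accuracies; the species labels and counts are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

def overall_and_macro_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    overall = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, macro

# Hypothetical test set: 90 segments of a common species, all classified
# correctly, and 10 segments of a rare species, only 2 classified correctly.
y_true = ["pine"] * 90 + ["aspen"] * 10
y_pred = ["pine"] * 90 + ["aspen"] * 2 + ["pine"] * 8
overall, macro = overall_and_macro_accuracy(y_true, y_pred)
print(overall, macro)
```

Here the overall accuracy is 0.92 while the macro average is 0.60, because the rare class counts as much as the common one in the macro average. A large gap between the two figures therefore signals weak performance on rare species.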

Le Wang,Zonghao Ying,Tianyuan Zhang,Siyuan Liang,Shengshan Hu,Mingchuan Zhang,Aishan Liu,Xianglong Liu

Main category: cs.CV

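TL;DR: The paper proposes CrossInject, a new attack on multimodal agents that exploits a security vulnerability through cross-modal prompt injection, achieving high attack success rates with implications for safety-critical applications.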

Details

Motivation: Multimodal large language models have improved agent capabilities, but a security vulnerability (cross-modal prompt injection) has been overlooked and may let malicious instructions hijack agent decision-making. Method: The paper proposes the CrossInject framework, which consists of Visual Latent Alignment and Textual Guidance Enhancement and carries out the attack through adversarial perturbations and inference of the black-box defensive system prompt. Result: Experiments show an increase of at least 26.4% in attack success rate, and the attack is validated on real-world multimodal autonomous agents. Conclusion: The study exposes a security risk in multimodal agents with potential implications for safety-critical applications. Abstract: The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this work, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach consists of two key components. First, we introduce Visual Latent Alignment, where we optimize adversarial features to the malicious instructions in the visual embedding space based on a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Subsequently, we present Textual Guidance Enhancement, where a large language model is leveraged to infer the black-box defensive system prompt through adversarial meta prompting and generate a malicious textual command that steers the agent's output toward better compliance with attackers' requests. Extensive experiments demonstrate that our method outperforms existing injection attacks, achieving at least a +26.4% increase in attack success rates across diverse tasks. Furthermore, we validate our attack's effectiveness in real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications.

[45] A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling

Kyle Buettner,Jacob Emmerson,Adriana Kovashka

Main category: cs.CV

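TL;DR: The paper proposes an LLM-based multimodal recaptioning strategy that alters the object descriptions in English captions to improve multilingual vision-language models' understanding of perceptual diversity, and reports gains on German and Japanese text-image retrieval.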

Details

Motivation: Existing multilingual vision-language models (VLMs) rely mainly on data from English speakers, which leads to perceptual bias and limited model flexibility. The paper aims to address this and improve model understanding of cross-cultural perceptual diversity. Method: The paper proposes an LLM-based multimodal recaptioning strategy that alters the object descriptions in English captions, with a targeted multimodal mechanism guided by native speaker data in the target language. Result: On German and Japanese text-image retrieval, mean recall improves by up to +3.5 overall and by +4.7 on non-native error cases. Conclusion: The data-efficient approach improves multilingual VLMs' understanding of perceptual diversity, and the paper also offers a mechanism for analyzing cross-dataset and cross-language generalization. Abstract: There are many ways to describe, name, and group objects when captioning an image. Differences are evident when speakers come from diverse cultures due to the unique experiences that shape perception. Machine translation of captions has pushed multilingual capabilities in vision-language models (VLMs), but data comes mainly from English speakers, indicating a perceptual bias and lack of model flexibility. In this work, we address this challenge and outline a data-efficient framework to instill multilingual VLMs with greater understanding of perceptual diversity. We specifically propose an LLM-based, multimodal recaptioning strategy that alters the object descriptions of English captions before translation. The greatest benefits are demonstrated in a targeted multimodal mechanism guided by native speaker data. By adding produced rewrites as augmentations in training, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall overall, +4.7 on non-native error cases). We further propose a mechanism to analyze the specific object description differences across datasets, and we offer insights into cross-dataset and cross-language generalization.

[46] Efficient Spiking Point Mamba for Point Cloud Analysis

Peixi Wu,Bosong Chai,Menghua Zheng,Wei Li,Zhangchi Hu,Jie Chen,Zheyu Zhang,Hebei Li,Xiaoyan Sun

Main category: cs.CV

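TL;DR: Proposes SPM, a Mamba-based 3D spiking neural network that combines Mamba's sequence modeling with the temporal feature extraction of SNNs, significantly improving performance while reducing energy consumption.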

Details

Motivation: Existing 3D SNNs handle long-range dependencies poorly, and Mamba's efficient computation and sequence modeling offer a way to address this. Method: The paper designs Hierarchical Dynamic Encoding (HDE) and a Spiking Mamba Block (SMB), and adopts an asymmetric SNN-ANN architecture for pre-training and fine-tuning. Result: Performance improves significantly on the ScanObjectNN and ShapeNetPart datasets, with energy consumption at least 3.5x lower than the ANN counterpart. Conclusion: SPM is the first Mamba-based 3D SNN; it combines the strengths of Mamba and SNNs and offers a new approach to efficient temporal feature extraction. Abstract: Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Due to the poor performance of simply transferring Mamba to 3D SNNs, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIOU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.

[47] LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

Md Abtahi Majeed Chowdhury,Md Rifat Ur Rahman,Akil Ahmad Taki

Main category: cs.CV

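TL;DR: The paper proposes LOOPE, a learnable patch-ordering method for optimizing positional embeddings in Vision Transformers, which significantly improves classification accuracy and is validated with a new experimental framework.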

Details

Motivation: Existing positional embedding methods ignore the effect of patch ordering when mapping a 2D grid to a 1D sequence. Method: The paper proposes LOOPE, which optimizes patch ordering to improve spatial representation, and introduces the "Three Cell Experiment" framework to evaluate positional embeddings. Result: LOOPE significantly improves classification accuracy, and experiments show that it better retains relative and absolute positional information. Conclusion: LOOPE offers a more effective way to optimize positional embeddings, and the new experimental framework supports its advantage. Abstract: Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown theoretical advantages over relative positional embeddings (RPE), particularly due to the ability of sinusoidal functions to preserve spatial inductive biases like monotonicity and shift invariance, a fundamental challenge arises when mapping a 2D grid to a 1D sequence. Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings. To address this, we propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies, providing a principled approach to patch order optimization. Empirical results show that our PE significantly improves classification accuracy across various ViT architectures. To rigorously evaluate the effectiveness of positional embeddings, we introduce the "Three Cell Experiment", a novel benchmarking framework that assesses the ability of PEs to retain relative and absolute positional information across different ViT architectures. Unlike standard evaluations, which typically report a performance gap of 4 to 6% between models with and without PE, our method reveals a striking 30 to 35% difference, offering a more sensitive diagnostic tool to measure the efficacy of PEs. Our experimental analysis confirms that the proposed LOOPE demonstrates enhanced effectiveness in retaining both relative and absolute positional information.
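
To make the 2D-to-1D mapping problem concrete, the sketch below compares two fixed patch orderings on a small grid and counts how many consecutive sequence positions are also spatial neighbors. Both orderings and the `adjacent_steps` measure are illustrative assumptions; LOOPE learns its ordering, and this is not the authors' method or metric.

```python
def row_major(h, w):
    """Standard raster order: left to right, top to bottom."""
    return [(r, c) for r in range(h) for c in range(w)]

def snake(h, w):
    """Boustrophedon order: every second row is traversed right to left."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order

def adjacent_steps(order):
    """Count consecutive sequence positions that are also grid neighbors."""
    return sum(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(order, order[1:])
    )

print(adjacent_steps(row_major(4, 4)), adjacent_steps(snake(4, 4)))
```

On a 4x4 grid, raster order keeps 12 of the 15 consecutive pairs spatially adjacent (each row wrap breaks adjacency), while the snake order keeps all 15. A 1D positional embedding therefore sees a different notion of "nearby" depending on the ordering, which is the degree of freedom the paper proposes to optimize.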

[48] How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

Rahul Thapa,Andrew Li,Qingyang Wu,Bryan He,Yuki Sahashi,Christina Binder,Angela Zhang,Ben Athiwaratkun,Shuaiwen Leon Song,David Ouyang,James Zou

Main category: cs.CV

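TL;DR: The OpenBiomedVid dataset uses educational biomedical videos to fine-tune vision-language models, yielding substantial performance gains, and new benchmarks are introduced to test model generalization.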

Details

Motivation: The paper explores whether non-standardized, educational biomedical videos can effectively train general-purpose vision-language models. Method: The authors build the OpenBiomedVid dataset, collecting video-caption and Q/A pairs through a multi-step human-in-the-loop pipeline, and fine-tune Qwen-2-VL models on it. Result: The fine-tuned models improve substantially on video and image tasks: the 2B model gains 98.7% on video tasks and the 7B model gains 37.09% on video tasks. Conclusion: Educational biomedical videos provide an effective training signal for biomedical vision-language models. Abstract: Publicly available biomedical videos, such as those on YouTube, serve as valuable educational resources for medical students. Unlike standard machine learning datasets, these videos are designed for human learners, often mixing medical imagery with narration, explanatory diagrams, and contextual framing. In this work, we investigate whether such pedagogically rich, yet non-standardized and heterogeneous videos can effectively teach general-domain vision-language models biomedical knowledge. To this end, we introduce OpenBiomedVid, a biomedical video instruction tuning dataset comprising 1031 hours of video-caption and Q/A pairs, curated through a multi-step human-in-the-loop pipeline. Diverse biomedical video datasets are rare, and OpenBiomedVid fills an important gap by providing instruction-style supervision grounded in real-world educational content. Surprisingly, despite the informal and heterogeneous nature of these videos, the fine-tuned Qwen-2-VL models exhibit substantial performance improvements across most benchmarks. The 2B model achieves gains of 98.7% on video tasks, 71.2% on image tasks, and 0.2% on text tasks. The 7B model shows improvements of 37.09% on video and 11.2% on image tasks, with a slight degradation of 2.7% on text tasks compared to their respective base models. To address the lack of standardized biomedical video evaluation datasets, we also introduce two new expert curated benchmarks, MIMICEchoQA and SurgeryVideoQA. On these benchmarks, the 2B model achieves gains of 99.1% and 98.1%, while the 7B model shows gains of 22.5% and 52.1%, respectively, demonstrating the models' ability to generalize and perform biomedical video understanding on cleaner and more standardized datasets than those seen during training. 
These results suggest that educational videos created for human learning offer a surprisingly effective training signal for biomedical VLMs.

[49] Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models

Chung-En Yu,Hsuan-Chih Chen,Brian Jalaian,Nathaniel D. Bastian

Main category: cs.CV

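TL;DR: Hydra is an adaptive framework that uses iterative reasoning and cross-model verification to improve the adversarial robustness of vision-language models (VLMs) and reduce hallucinations.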

Details

Motivation: Existing methods focus mainly on either adversarial defense or post-hoc hallucination correction and lack a unified robustness strategy. Hydra aims to fill this gap. Method: Hydra uses an Action-Critique Loop that combines Chain-of-Thought and In-Context Learning techniques to refine outputs dynamically. Result: Across multiple VLMs and benchmarks, Hydra outperforms existing methods and improves robustness and factual consistency without explicit adversarial defenses. Conclusion: Hydra offers a scalable, training-free solution for VLMs that significantly improves reliability in real-world applications. Abstract: To develop trustworthy Vision-Language Models (VLMs), it is essential to address adversarial robustness and hallucination mitigation, both of which impact factual accuracy in high-stakes applications such as defense and healthcare. Existing methods primarily focus on either adversarial defense or hallucination post-hoc correction, leaving a gap in unified robustness strategies. We introduce Hydra, an adaptive agentic framework that enhances plug-in VLMs through iterative reasoning, structured critiques, and cross-model verification, improving both resilience to adversarial perturbations and intrinsic model errors. Hydra employs an Action-Critique Loop, where it retrieves and critiques visual information, leveraging Chain-of-Thought (CoT) and In-Context Learning (ICL) techniques to refine outputs dynamically. Unlike static post-hoc correction methods, Hydra adapts to both adversarial manipulations and intrinsic model errors, making it robust to malicious perturbations and hallucination-related inaccuracies. We evaluate Hydra on four VLMs, three hallucination benchmarks, two adversarial attack strategies, and two adversarial defense methods, assessing performance on both clean and adversarial inputs. Results show that Hydra surpasses plug-in VLMs and state-of-the-art (SOTA) dehallucination methods, even without explicit adversarial defenses, demonstrating enhanced robustness and factual consistency. By bridging adversarial resistance and hallucination mitigation, Hydra provides a scalable, training-free solution for improving the reliability of VLMs in real-world applications.

[50] SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

Minho Park,Taewoong Kang,Jooyeol Yun,Sungwon Hwang,Jaegul Choo

Main category: cs.CV

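TL;DR: SphereDiff is a diffusion-based method for seamless 360-degree panoramic image and video generation that uses a spherical latent representation and weighted averaging to address the distortion of conventional ERP projection.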

Details

Motivation: Demand for high-quality 360-degree panoramic content in AR/VR applications is growing, but existing methods fall short because of the severe distortion of ERP projection. Method: SphereDiff defines a spherical latent representation, extends MultiDiffusion to spherical space, and proposes spherical latent sampling and distortion-aware weighted averaging. Result: SphereDiff outperforms existing methods in generating 360-degree panoramic content while maintaining high fidelity. Conclusion: SphereDiff provides an efficient, high-quality solution for generating 360-degree panoramic content for AR/VR applications. Abstract: The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available at https://github.com/pmh9960/SphereDiff
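
The ERP distortion mentioned above can be quantified with basic spherical geometry: every pixel row of an equirectangular image has the same number of pixels, but the area it covers on the sphere shrinks with the cosine of its latitude. The sketch below computes that per-row weight; it only illustrates why ERP is non-uniform and is not SphereDiff's distortion-aware weighting. The function name and the 64-row image height are assumptions for the example.

```python
import math

def erp_row_weights(height):
    """Relative sphere area covered by each pixel row of an ERP image."""
    weights = []
    for i in range(height):
        # Latitude of the row center, from +pi/2 (top) down to -pi/2 (bottom).
        lat = (0.5 - (i + 0.5) / height) * math.pi
        weights.append(math.cos(lat))
    return weights

weights = erp_row_weights(64)
# How much more sphere area an equator row covers than the top row.
ratio = weights[32] / weights[0]
print(ratio)
```

With 64 rows, a row at the equator covers roughly 40 times more of the sphere than the top row, even though both rows hold the same number of pixels. Content near the poles is therefore heavily oversampled in ERP, which is consistent with the polar artifacts the abstract attributes to ERP-based latents.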

[51] Adversarial Attack for RGB-Event based Visual Object Tracking

Qiang Chen,Xiao Wang,Haowen Wang,Bo Jiang,Lin Zhu,Dawei Zhang,Yonghong Tian,Jin Tang

Main category: cs.CV

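TL;DR: Proposes a cross-modal adversarial attack algorithm for RGB-Event visual tracking, studies two Event representations (voxels and frames), and validates its effectiveness on multiple datasets.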

Details

Motivation: Adversarial attack and defense for RGB-Event stream tracking algorithms has received little study; this paper aims to fill that gap. Method: For the Event voxel and Event frame representations, the paper designs an optimized perturbation and a two-step attack strategy, respectively for the two voxel-based inputs, and optimizes a cross-modal universal perturbation from gradient information for frame-based tracking. Result: Experiments on the COESOT, FE108, and VisEvent datasets show that the method significantly reduces tracker performance in both unimodal and multimodal scenarios. Conclusion: The proposed cross-modal adversarial attack algorithm is effective; defense strategies are left for future work. Abstract: Visual object tracking is a crucial research topic in the fields of computer vision and multi-modal fusion. Among various approaches, robust visual tracking that combines RGB frames with Event streams has attracted increasing attention from researchers. While striving for high accuracy and efficiency in tracking, it is also important to explore how to effectively conduct adversarial attacks and defenses on RGB-Event stream tracking algorithms, yet research in this area remains relatively scarce. To bridge this gap, in this paper, we propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking. Because of the diverse representations of Event streams, and given that Event voxels and frames are more commonly used, this paper focuses on these two representations for an in-depth study. Specifically, for the RGB-Event voxel, we first optimize the perturbation by adversarial loss to generate RGB frame adversarial examples. For discrete Event voxel representations, we propose a two-step attack strategy: we first inject Event voxels into the target region as initialized adversarial examples, then conduct a gradient-guided optimization by perturbing the spatial location of the Event voxels. For the RGB-Event frame based tracking, we optimize the cross-modal universal perturbation by integrating the gradient information from multimodal data. We evaluate the proposed attack on three widely used RGB-Event Tracking datasets, i.e., COESOT, FE108, and VisEvent. Extensive experiments show that our method significantly reduces the performance of the tracker across numerous datasets in both unimodal and multimodal scenarios. The source code will be released at https://github.com/Event-AHU/Adversarial_Attack_Defense

Ahmad Khalil,Mahmoud Khalil,Alioune Ngom

Main category: cs.CV

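TL;DR: The paper proposes a method to address multi-modal hallucination in the video-language model ResNetVLLM, using a two-step strategy of semantic alignment detection and Retrieval-Augmented Generation (RAG), which substantially improves the model's accuracy.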

Details

Motivation: Large Language Models (LLMs) and Video-Language Models (VideoLLMs) suffer from hallucination, generating content that looks plausible but is factually wrong. This paper aims to address the problem and improve model reliability. Method: A two-step strategy: (1) a modified Lynx model checks the semantic alignment between generated descriptions and the actual video content; (2) a dynamically constructed knowledge base and RAG reduce hallucinations. Result: The improved model, ResNetVLLM-2, raises accuracy on the ActivityNet-QA benchmark from 54.8% to 65.3%. Conclusion: The proposed hallucination detection and mitigation strategies effectively improve the reliability of video-language models. Abstract: Large Language Models (LLMs) have transformed natural language processing (NLP) tasks, but they suffer from hallucination, generating plausible yet factually incorrect content. This issue extends to Video-Language Models (VideoLLMs), where textual descriptions may inaccurately represent visual content, resulting in multi-modal hallucinations. In this paper, we address hallucination in ResNetVLLM, a video-language model combining ResNet visual encoders with LLMs. We introduce a two-step protocol: (1) a faithfulness detection strategy that uses a modified Lynx model to assess semantic alignment between generated captions and ground-truth video references, and (2) a hallucination mitigation strategy using Retrieval-Augmented Generation (RAG) with an ad-hoc knowledge base dynamically constructed during inference. Our enhanced model, ResNetVLLM-2, reduces multi-modal hallucinations by cross-verifying generated content against external knowledge, improving factual consistency. Evaluation on the ActivityNet-QA benchmark demonstrates a substantial accuracy increase from 54.8% to 65.3%, highlighting the effectiveness of our hallucination detection and mitigation strategies in enhancing video-language model reliability.
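The reported accuracy change can be read either as an absolute gain in percentage points or as a relative gain over the starting accuracy. The short sketch below computes both from the two figures given in the abstract; the helper name is hypothetical.

```python
def accuracy_gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

# Figures from the abstract: ActivityNet-QA accuracy before and after.
absolute, relative = accuracy_gain(54.8, 65.3)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative")
```

For these two figures, the gain is 10.5 percentage points, or about 19.2% relative to the 54.8% starting point.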

Ahmad Khalil,Mahmoud Khalil,Alioune Ngom

Main category: cs.CV

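TL;DR: ResNetVLLM is a novel cross-modal framework that combines a ResNet visual encoder with a large language model for zero-shot video understanding, without relying on pre-trained video models.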

Details

Motivation: Address the reliance on pre-trained models in zero-shot video understanding by learning visual and semantic representations within a unified architecture. Method: Visual features are extracted with a non-pretrained ResNet and combined with a large language model to generate textual descriptions. Result: State-of-the-art zero-shot video understanding performance on several benchmarks, including MSRVTT-QA and MSVD-QA. Conclusion: ResNetVLLM's unified architecture effectively improves the accuracy and contextual relevance of zero-shot video understanding. Abstract: In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM). ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models and instead employing a non-pretrained ResNet to extract visual features. This design ensures the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.

[54] WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Mingya Zhang,Liang Wang,Limei Gu,Tingsheng Ling,Xianping Tao

Main category: cs.CV

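TL;DR: The paper proposes WT-BCP, a wavelet-transform-based bidirectional copy-paste framework for semi-supervised medical image segmentation, which combines low- and high-frequency information with consistency training to address the distribution mismatch and training bias of existing methods.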

Details

Motivation: Semi-supervised medical image segmentation (SSMIS) relies on scarce labeled data and faces distribution mismatch between labeled and unlabeled data, training bias caused by artificial perturbations, and insufficient use of low- and high-frequency information. Method: The WT-BCP framework uses the wavelet transform to extract low- and high-frequency information, applies bidirectional copy-paste to improve understanding of unlabeled data, and introduces the multi-input, multi-output XNet-Plus model. Consistency training reduces the bias introduced by artificial perturbations. Result: Experiments on 2D and 3D datasets confirm the effectiveness of the model. Conclusion: By making fuller use of image information and consistency training, WT-BCP substantially improves semi-supervised medical image segmentation. Abstract: Semi-supervised medical image segmentation (SSMIS) shows promise in reducing reliance on scarce labeled medical data. However, the SSMIS field confronts challenges such as distribution mismatches between labeled and unlabeled data, artificial perturbations causing training biases, and inadequate use of raw image information, especially low-frequency (LF) and high-frequency (HF) components. To address these challenges, we propose a Wavelet Transform based Bidirectional Copy-Paste SSMIS framework, named WT-BCP, which improves upon the Mean Teacher approach. Our method enhances unlabeled data understanding by copying random crops between labeled and unlabeled images and employs WT to extract LF and HF details. We propose a multi-input and multi-output model named XNet-Plus, to receive the fused information after WT. Moreover, consistency training among multiple outputs helps to mitigate learning biases introduced by artificial perturbations. During consistency training, the mixed images resulting from WT are fed into both models, with the student model's output being supervised by pseudo-labels and ground-truth. Extensive experiments conducted on 2D and 3D datasets confirm the effectiveness of our model. Code: https://github.com/simzhangbest/WT-BCP.

[55] Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

Carlos Caetano,Gabriel O. dos Santos,Caio Petrucci,Artur Barros,Camila Laranjeira,Leo S. F. Ribeiro,Júlia F. de Mendonça,Jefersson A. dos Santos,Sandra Avila

Main category: cs.CV

TL;DR: The paper examines the ethical issues of using children's images in AI datasets and proposes a pipeline to detect and remove such images.

Details

Motivation: The use of children's images in datasets raises ethical concerns about privacy, consent, and data protection that urgently need solutions. Method: A pipeline built on a vision-language model detects and removes children's images; it is tested on the #PraCegoVer and Open Images V7 datasets. Result: The pipeline proves effective in testing and provides a baseline tool for future research. Conclusion: The authors call on the research community to reflect and act to protect children's rights, and encourage the development of more comprehensive tools and methods. Abstract: Including children's images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking. Despite the growing recognition of these issues, approaches for addressing them remain limited. We explore the ethical implications of using children's images in AI datasets and propose a pipeline to detect and remove such images. As a use case, we built the pipeline on a Vision-Language Model under the Visual Question Answering task and tested it on the #PraCegoVer dataset. We also evaluate the pipeline on a subset of 100,000 images from the Open Images V7 dataset to assess its effectiveness in detecting and removing images of children. The pipeline serves as a baseline for future research, providing a starting point for more comprehensive tools and methodologies. While we leverage existing models trained on potentially problematic data, our goal is to expose and address this issue. We do not advocate for training or deploying such models, but instead call for urgent community reflection and action to protect children's rights. Ultimately, we aim to encourage the research community to exercise more than merely additional care in creating new datasets and to inspire the development of tools to protect the fundamental rights of vulnerable groups, particularly children.

[56] Causal Disentanglement for Robust Long-tail Medical Image Generation

Weizhi Nie,Zichun Zhang,Weijie Wang,Bruno Lepri,Anan Liu,Nicu Seb

Main category: cs.CV

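TL;DR: Proposes a medical image generation framework based on causal disentanglement and text guidance that generates high-quality, diverse counterfactual medical images to address data scarcity and class imbalance.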

Details

Motivation: Medical image data is scarce and its class distribution imbalanced, so generating high-quality, diverse counterfactual medical images is challenging, and anatomical structure must remain stable and consistent. Method: Causal disentanglement separates pathological and structural features, with group supervision to ensure their independence; a diffusion model with text guidance models the pathological features, a large language model extracts lesion information, and initial noise optimization improves performance on long-tailed categories. Result: The generated counterfactual medical images are clinically relevant and highly interpretable, while addressing data scarcity and class imbalance. Conclusion: The framework improves the diversity and quality of medical image generation and strengthens the model's clinical utility and interpretability. Abstract: Counterfactual medical image generation effectively addresses data scarcity and enhances the interpretability of medical images. However, due to the complex and diverse pathological features of medical images and the imbalanced class distribution in medical data, generating high-quality and diverse medical images from limited data is significantly challenging. Additionally, it is necessary to fully leverage the information in limited data, such as anatomical structure information, to generate more structurally stable medical images while avoiding distortion or inconsistency. In this paper, in order to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework, which generates independent pathological and structural features based on causal disentanglement and utilizes text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we achieve feature separation through causal disentanglement and analyze the interactions between features. Here, we introduce group supervision to ensure the independence of pathological and identity features. Second, we leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging a large language model to extract lesion severity and location from medical reports. Additionally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.

[57] Metamon-GS: Enhancing Representability with Variance-Guided Densification and Light Encoding

Junyan Su,Baozhu Zhao,Xiaohan Zhang,Qi Liu

Main category: cs.CV

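TL;DR: Metamon-GS uses a variance-guided densification strategy and a multi-level hash grid to address the rendering-performance limitations of 3D Gaussian Splatting (3DGS), improving the quality of novel view synthesis.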

Details

Motivation: 3DGS rendering suffers from inaccurate color representation and an inadequate densification strategy, which lead to blurry images and artifacts. Method: A variance-guided densification strategy and a multi-level hash grid; the former adds Gaussians in regions with high gradient variance, and the latter models global lighting conditions to represent color accurately. Result: Experiments show that Metamon-GS outperforms its baseline model and previous versions on public datasets, with clearly better rendering quality. Conclusion: With its densification strategy and lighting modeling, Metamon-GS effectively addresses the rendering problems of 3DGS and achieves higher-quality view synthesis. Abstract: The introduction of 3D Gaussian Splatting (3DGS) has advanced novel view synthesis by utilizing Gaussians to represent scenes. Encoding Gaussian point features with anchor embeddings has significantly enhanced the performance of newer 3DGS variants. While significant advances have been made, it is still challenging to boost rendering performance. Feature embeddings have difficulty accurately representing colors from different perspectives under varying lighting conditions, which leads to a washed-out appearance. Another cause is the lack of a proper densification strategy, which prevents Gaussian point growth in thinly initialized areas, resulting in blurriness and needle-shaped artifacts. To address these issues, we propose Metamon-GS, which combines a variance-guided densification strategy with a multi-level hash grid. The densification strategy guided by variance specifically targets Gaussians with high gradient variance in pixels and compensates for the importance of regions with extra Gaussians to improve reconstruction. The latter studies implicit global lighting conditions and accurately interprets color from different perspectives and feature embeddings. Our thorough experiments on publicly available datasets show that Metamon-GS surpasses its baseline model and previous versions, delivering superior quality in rendering novel views.
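The idea of targeting Gaussians by the variance of their per-pixel gradients, rather than by the mean, can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not give the actual selection criterion, so the data layout, the threshold, and the function name are all hypothetical.

```python
from statistics import pvariance

def select_for_densification(pixel_grads, threshold):
    """Pick Gaussians whose per-pixel gradient magnitudes vary strongly.

    pixel_grads maps a (hypothetical) Gaussian id to the list of gradient
    magnitudes it receives from the pixels it covers.
    """
    selected = []
    for gaussian_id, grads in pixel_grads.items():
        # A single sample has no spread, so it cannot be selected by variance.
        if len(grads) > 1 and pvariance(grads) > threshold:
            selected.append(gaussian_id)
    return sorted(selected)

grads = {
    0: [0.1, 0.1, 0.1],        # uniformly small gradients: well fitted
    1: [0.0, 1.0, 0.0, 1.0],   # mixed gradients: fits some pixels, not others
    2: [0.5],                  # too few samples to estimate variance
}
print(select_for_densification(grads, threshold=0.05))
```

A Gaussian that fits some of its pixels well and others badly can have a modest mean gradient yet a high variance, which is why a variance criterion may pick up regions that a mean-based threshold would miss.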

[58] LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation

Jiachen Li,Qing Xie,Xiaohan Yu,Hongyun Wang,Jinyu Xu,Yongjian Liu,Yongsheng Gao

Main category: cs.CV

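TL;DR: The LGD framework uses the generative capabilities of multi-modal large language models, producing descriptions via an attribute prompt and a surrounding prompt, to improve vision-language models on zero-shot referring image segmentation, clearly outperforming existing methods.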

Details

Motivation: Address the semantic alignment and matching problems that free-form expressions cause in zero-shot referring image segmentation, and avoid incorrect target localization. Method: An attribute prompt and a surrounding prompt are designed to generate attribute and surrounding descriptions; three visual-text matching scores are introduced to evaluate similarity. Result: New SOTA on the RefCOCO, RefCOCO+, and RefCOCOg datasets, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU. Conclusion: By generating descriptions, LGD substantially improves region-text matching, confirming the potential of multi-modal large language models for vision-language tasks. Abstract: Zero-shot referring image segmentation aims to locate and segment the target region based on a referring expression, with the primary challenge of aligning and matching semantics across visual and textual modalities without training. Previous works address this challenge by utilizing Vision-Language Models and mask proposal networks for region-text matching. However, this paradigm may lead to incorrect target localization due to the inherent ambiguity and diversity of free-form referring expressions. To alleviate this issue, we present LGD (Leveraging Generative Descriptions), a framework that utilizes the advanced language generation capabilities of Multi-Modal Large Language Models to enhance region-text matching performance in Vision-Language Models. Specifically, we first design two kinds of prompts, the attribute prompt and the surrounding prompt, to guide the Multi-Modal Large Language Models in generating descriptions related to the crucial attributes of the referent object and the details of surrounding objects, referred to as attribute description and surrounding description, respectively. Secondly, three visual-text matching scores are introduced to evaluate the similarity between instance-level visual features and textual features, which determines the mask most associated with the referring expression. The proposed method achieves new state-of-the-art performance on three public datasets, RefCOCO, RefCOCO+ and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU compared to previous methods.
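The two reported metrics differ in how they aggregate over samples: oIoU divides the summed intersection by the summed union, while mIoU averages the per-sample IoU. The sketch below shows both under the usual definitions; masks are flat lists of 0/1 values, and scoring an empty union as 1.0 is a convention assumed here, not something the abstract specifies.

```python
def iou_stats(pred_masks, gt_masks):
    """Return (oIoU, mIoU) for paired binary masks given as flat 0/1 lists."""
    total_inter = 0
    total_union = 0
    per_sample = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = sum(1 for p, g in zip(pred, gt) if p and g)
        union = sum(1 for p, g in zip(pred, gt) if p or g)
        total_inter += inter
        total_union += union
        # Assumed convention: two empty masks count as a perfect match.
        per_sample.append(inter / union if union else 1.0)
    overall = total_inter / total_union if total_union else 1.0
    mean = sum(per_sample) / len(per_sample)
    return overall, mean

# A small object segmented poorly and a large object segmented perfectly.
preds = [[1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
print(iou_stats(preds, gts))
```

In this toy case oIoU is 0.9 while mIoU is 0.75, because oIoU is dominated by the large object and mIoU weights every sample equally.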

[59] Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Jingjing Ren,Wenbo Li,Zhongdao Wang,Haoze Sun,Bangzhen Liu,Haoyu Chen,Jiaqi Xu,Aoxue Li,Shifeng Zhang,Bin Shao,Yong Guo,Lei Zhu

Main category: cs.CV

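TL;DR: Turbo2K is an efficient framework for generating 2K-resolution video that substantially improves training and inference efficiency through a compressed latent space and knowledge distillation.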

Details

Motivation: As consumer demand for ultra-clear visuals grows, so does demand for 2K video synthesis, but existing diffusion transformers (DiTs) are computationally prohibitive at 2K resolution. Method: Turbo2K operates in a highly compressed latent space and combines a knowledge distillation strategy with a hierarchical two-stage synthesis framework to improve efficiency and quality. Result: Turbo2K achieves state-of-the-art efficiency when generating 5-second, 24fps, 2K videos, with inference up to 20 times faster than existing methods. Conclusion: Through efficient design and knowledge distillation, Turbo2K makes high-resolution video generation more practical and scalable. Abstract: Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.

[60] Efficient Implicit Neural Compression of Point Clouds via Learnable Activation in Latent Space

Yichi Zhang,Qianqian Yang

Main category: cs.CV

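TL;DR: PICO is an implicit neural representation (INR) based framework for static point cloud compression that splits the task into two stages, geometry and attribute compression, and introduces the LeAFNet network to improve performance.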

Details

Motivation: Point cloud compression is an important task in computer vision, and conventional methods have limited efficiency; PICO aims to improve compression performance with INRs. Method: The task is split into two stages (geometry and attribute compression), the LeAFNet network is built on learnable activation functions, and quantization and entropy coding further improve compression efficiency. Result: LeAFNet outperforms conventional MLPs; in geometry compression PICO improves on the MPEG standard by 4.92 dB in D1 PSNR, and in joint compression it achieves a PCQM gain of 2.7×10⁻³. Conclusion: With INRs and LeAFNet, PICO substantially improves point cloud compression and opens a new direction for future research. Abstract: Implicit Neural Representations (INRs), also known as neural fields, have emerged as a powerful paradigm in deep learning, parameterizing continuous spatial fields using coordinate-based neural networks. In this paper, we propose PICO, an INR-based framework for static point cloud compression. Unlike prevailing encoder-decoder paradigms, we decompose the point cloud compression task into two separate stages: geometry compression and attribute compression, each with distinct INR optimization objectives. Inspired by Kolmogorov-Arnold Networks (KANs), we introduce a novel network architecture, LeAFNet, which leverages learnable activation functions in the latent space to better approximate the target signal's implicit function. By reformulating point cloud compression as neural parameter compression, we further improve compression efficiency through quantization and entropy coding. Experimental results demonstrate that LeAFNet outperforms conventional MLPs in INR-based point cloud compression. Furthermore, PICO achieves superior geometry compression performance compared to the current MPEG point cloud compression standard, yielding an average improvement of $4.92$ dB in D1 PSNR. In joint geometry and attribute compression, our approach exhibits highly competitive results, with an average PCQM gain of $2.7 \times 10^{-3}$.
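Treating compression as neural parameter compression means the network weights are quantized to integer symbols and then entropy coded. The sketch below shows uniform scalar quantization and the empirical entropy of the resulting symbols, which estimates the bits per symbol an ideal entropy coder would need if symbols were independent. The step size, the weights, and the function names are illustrative assumptions; the abstract does not describe PICO's actual quantizer or coder.

```python
import math
from collections import Counter

def quantize(weights, step):
    """Map each weight to the index of its nearest quantization level."""
    return [round(w / step) for w in weights]

def dequantize(symbols, step):
    """Reconstruct approximate weights from quantization indices."""
    return [s * step for s in symbols]

def entropy_bits(symbols):
    """Empirical Shannon entropy of the symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical weights, chosen only for illustration.
weights = [0.11, -0.09, 0.52, 0.48, -0.11, 0.0, 0.49]
symbols = quantize(weights, step=0.1)
restored = dequantize(symbols, step=0.1)
print(symbols, entropy_bits(symbols))
```

A coarser step yields fewer distinct symbols and lower entropy, at the cost of a larger reconstruction error, which is bounded by half the step size for uniform quantization.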

[61] Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation

Guoyi Zhang,Siyang Chen,Guangsheng Xu,Han Wang,Xiaohu Zhang

Main category: cs.CV

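TL;DR: LSR-ST is a lightweight PEFT framework that improves the robustness of vision foundation models in complex scenes by introducing shape-biased inductive priors.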

Details

Motivation: Vision foundation models perform poorly in complex scenes such as camouflage and infrared imagery, mainly because their inherent texture bias is amplified during fine-tuning, which limits generalization in texture-sparse environments. Method: The LSR-ST framework uses an HDConv Block to capture shape-aware features, satisfying the conditions of large receptive fields, multi-order feature interactions, and sparse connectivity. Result: Across 17 datasets and 6 tasks, LSR-ST delivers consistent improvements with only 4.719M trainable parameters. Conclusion: Through the concept of representation efficiency, LSR-ST offers a more robust and general solution for vision foundation models in complex visual environments. Abstract: Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios, such as camouflage and infrared imagery. We attribute this challenge to the inherent texture bias in VFMs, which is exacerbated during fine-tuning and limits generalization in texture-sparse environments. To address this, we propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors. LSR-ST captures shape-aware features using a simple HDConv Block, which integrates large-kernel attention and residual learning. The method satisfies three key conditions for inducing shape bias: large receptive fields, multi-order feature interactions, and sparse connectivity. Our analysis reveals that these improvements stem from representation efficiency, the ability to extract task-relevant, structurally grounded features while minimizing redundancy. We formalize this concept via Information Bottleneck theory and advocate for it as a key PEFT objective. Unlike traditional NLP paradigms that focus on optimizing parameters and memory, visual tasks require models that extract task-defined semantics, rather than just relying on pre-encoded features. This shift enables our approach to move beyond conventional trade-offs, offering more robust and generalizable solutions for vision tasks. With minimal changes to SAM2-UNet, LSR-ST achieves consistent improvements across 17 datasets and 6 tasks using only 4.719M trainable parameters. These results highlight the potential of representation efficiency for robust and adaptable VFMs within complex visual environments.

[62] STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking

Shang Zhang,Xiaobo Ding,Huanbin Zhang,Ruoyan Xiong,Yue Zhang

Main category: cs.CV

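TL;DR: STARS is a sparse-learning-based correlation filter tracker that combines spatio-temporal regularization with super-resolution reconstruction, substantially improving thermal infrared target tracking.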

Details

Motivation: Thermal infrared images have low resolution and are prone to interference, which limits tracker performance. Method: Adaptive sparse filtering and temporal domain filtering extract target features, an edge-preserving sparse regularization method is introduced, and a gradient-enhanced super-resolution method is proposed. Result: On multiple benchmarks, STARS is more robust than existing trackers. Conclusion: STARS is the first to integrate super-resolution methods into a sparse learning framework, effectively addressing performance problems in thermal infrared tracking. Abstract: Thermal infrared (TIR) target tracking methods often adopt the correlation filter (CF) framework due to its computational efficiency. However, the low resolution of TIR images, along with tracking interference, significantly limits the performance of TIR trackers. To address these challenges, we introduce STARS, a novel sparse learning-based CF tracker that incorporates spatio-temporal regularization and super-resolution reconstruction. First, we apply adaptive sparse filtering and temporal domain filtering to extract key features of the target while reducing interference from background clutter and noise. Next, we introduce an edge-preserving sparse regularization method to stabilize target features and prevent excessive blurring. This regularization integrates multiple terms and employs the alternating direction method of multipliers to optimize the solution. Finally, we propose a gradient-enhanced super-resolution method to extract fine-grained TIR target features and improve the resolution of TIR images, addressing performance degradation in tracking caused by low-resolution sequences. To the best of our knowledge, STARS is the first to integrate super-resolution methods within a sparse learning-based CF framework. Extensive experiments on the LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017 benchmarks demonstrate that STARS outperforms state-of-the-art trackers in terms of robustness.

[63] DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

Fulong Ye,Miao Hua,Pengze Zhang,Xinghui Li,Qichao Sun,Songtao Zhao,Qian He,Xinglong Wu

Main category: cs.CV

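TL;DR: DreamID is a diffusion-based face swapping method that improves identity similarity and attribute preservation through explicit supervision with Triplet ID Group data, uses the accelerated SD Turbo model for efficient training, and produces high-quality swaps at 512*512 resolution in 0.6 seconds.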

Details

Motivation: Conventional face swapping methods rely on implicit supervision and give unsatisfactory results. DreamID aims to solve this with explicit supervision and an efficient architecture, improving identity similarity and image fidelity. Method: Triplet ID Group data is constructed to provide explicit supervision, the accelerated SD Turbo model reduces the number of inference steps, and an improved architecture comprising SwapNet, FaceNet, and ID Adapter is proposed. Result: DreamID outperforms existing methods in identity similarity, pose and expression preservation, and image fidelity, and completes a high-quality swap at 512*512 resolution in 0.6 seconds. Conclusion: With explicit supervision and an efficient architecture, DreamID achieves high-quality, fast face swapping that holds up in challenging scenarios. Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512*512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.

[64] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Weirong Chen,Ganlin Zhang,Felix Wimbauer,Rui Wang,Nikita Araslanov,Andrea Vedaldi,Daniel Cremers

Main category: cs.CV

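TL;DR: BA-Track is a new SLAM framework that uses a 3D point tracker to separate camera motion from the motion of dynamic objects and combines traditional bundle adjustment with depth-consistency processing, markedly improving camera pose estimation and 3D reconstruction accuracy in dynamic scenes.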

Details

Motivation: Traditional SLAM systems perform poorly in dynamic scenes; existing methods either filter out dynamic elements or model their motion independently, leading to incomplete reconstructions or inconsistent motion estimates. Method: A 3D point tracker separates camera motion from dynamic object motion, combined with bundle adjustment and lightweight post-processing based on scale maps to ensure depth consistency. Result: On challenging datasets, BA-Track significantly improves the accuracy of camera pose estimation and 3D reconstruction. Conclusion: BA-Track handles dynamic scenes effectively within a unified framework, offering a new approach for applying SLAM systems in complex environments. Abstract: Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM, bundle adjustment, with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

[65] Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

Tong Zeng,Longfeng Wu,Liang Shi,Dawei Zhou,Feng Guo

Main category: cs.CV

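TL;DR: DVBench is a new benchmark for evaluating Vision Large Language Models (VLLMs) in safety-critical driving scenarios; it exposes the limitations of existing models and shows substantial gains from domain-specific fine-tuning.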

Details

Motivation: Existing VLLMs perform well on general visual tasks, but their performance in safety-critical domains such as autonomous driving is largely unexplored, and evaluation tools for complex driving scenarios are lacking. Method: The DVBench benchmark comprises 10,000 multiple-choice questions built on a hierarchical ability taxonomy to evaluate VLLMs' perception and reasoning; 14 SOTA models are evaluated and selected models are fine-tuned. Result: Existing models perform poorly in complex driving scenarios (best accuracy below 40%), but fine-tuning brings substantial gains (relative improvements of up to 43.59%). Conclusion: DVBench fills a gap in evaluating VLLMs for autonomous driving, underscores the importance of domain adaptation, and provides a framework and tools for future research. Abstract: Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.

[66] SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

Liang Peng,Boxi Wu,Haoran Cheng,Yibo Zhao,Xiaofei He

Main category: cs.CV

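TL;DR: SUDO is a self-supervised direct preference optimization method that improves both global and local image quality of text-to-image diffusion models without costly data annotation.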

Details

Motivation: Conventional supervised fine-tuning optimizes only a pixel-level MSE loss and neglects global image quality. Method: Preference image pairs are generated in a self-supervised manner and combined with direct preference optimization, optimizing pixel-level and global image quality together. Result: Image quality improves significantly on models including Stable Diffusion 1.5 and XL. Conclusion: SUDO is an efficient, annotation-free alternative that can be applied to any text-to-image diffusion model. Abstract: Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes the loss of mean squared error (MSE) at the pixel level, neglecting the need for global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the need for costly data collection and annotation efforts typically associated with traditional direct preference optimization methods. Through extensive experiments on widely-used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The code is available at https://github.com/SPengLiang/SUDO.
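The direct preference optimization that SUDO builds on can be illustrated with the generic DPO objective for a single preference pair. The sketch below is the standard log-sigmoid form over log-likelihoods, not SUDO's diffusion-specific loss, which the abstract does not spell out; the function name, argument names, and the default beta are assumptions for illustration.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_* are log-likelihoods under the model being trained; ref_logp_*
    are the same quantities under a frozen reference model.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    x = beta * margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Model identical to the reference: the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Model favours the preferred sample more than the reference does: lower loss.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss falls as the trained model raises the likelihood of the preferred sample relative to the rejected one, measured against the reference model; beta controls how sharply that margin is rewarded.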

[67] FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

Kuanting Wu,Kei Ota,Asako Kanezaki

Main category: cs.CV

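TL;DR: FlowLoss directly compares the optical flow fields of generated and ground-truth videos and, combined with a noise-aware weighting scheme, improves motion stability and training convergence speed in video diffusion models.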

Details

Motivation: Video diffusion models (VDMs) can generate high-quality videos but often struggle to produce temporally coherent motion; optical flow supervision is a promising way to address this. Method: Proposes FlowLoss, which directly compares the optical flow fields of generated and ground-truth videos, together with a noise-aware weighting scheme that modulates the flow loss across denoising steps. Result: Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Conclusion: FlowLoss offers practical insights for introducing motion-based supervision into noise-conditioned generative models. Abstract: Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. To account for the unreliability of flow estimation under high-noise conditions in diffusion, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.
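The idea of a noise-aware flow loss in [67] can be illustrated with a small sketch. The abstract does not give the weighting function, so the polynomial schedule `noise_aware_weight` below, its `power` parameter, and the use of mean endpoint error are all assumptions made for illustration only, not the authors' formulation.

```python
import numpy as np

def noise_aware_weight(t, num_steps, power=2.0):
    """Hypothetical schedule: weight 1 at t=0 (clean) falling to 0 at t=num_steps (pure noise)."""
    return (1.0 - t / num_steps) ** power

def flow_loss(flow_gen, flow_gt, t, num_steps):
    """Weighted mean endpoint error between two flow fields of shape (..., 2)."""
    # Per-pixel Euclidean distance between generated and ground-truth flow vectors.
    epe = np.linalg.norm(flow_gen - flow_gt, axis=-1)
    return noise_aware_weight(t, num_steps) * epe.mean()

# Toy flow fields: 2 frames of 4x4 pixels, every vector off by (3, 4).
flow_gt = np.zeros((2, 4, 4, 2))
flow_gen = flow_gt + np.array([3.0, 4.0])
print(flow_loss(flow_gen, flow_gt, t=0, num_steps=1000))
print(flow_loss(flow_gen, flow_gt, t=500, num_steps=1000))
```

The point of the weighting is that flow estimated from a heavily noised sample is unreliable, so its contribution to the loss is reduced at high noise levels.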

[68] VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control

Lifeng Lin,Rongfeng Lu,Quan Chen,Haofan Ren,Ming Lu,Yaoqi Sun,Chenggang Yan,Anke Xue

Main category: cs.CV

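TL;DR: VGNC is a validation-guided Gaussian number control method built on generative novel view synthesis models. It reduces overfitting in sparse-view 3D Gaussian Splatting (3DGS) reconstruction while improving rendering quality and lowering storage and compute requirements.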

Details

Motivation: Sparse-view 3D reconstruction suffers from overfitting in practice, and existing 3DGS-based methods, despite their progress, have not fully resolved it. Method: Generates validation images with a generative novel view synthesis model and uses them in a Gaussian number control strategy that selects the number of Gaussians so as to reduce overfitting. Result: Experiments show that VGNC reduces overfitting, improves rendering quality on the test set, and lowers the number of Gaussian points, which cuts storage and compute requirements. Conclusion: VGNC is an effective way to mitigate overfitting in sparse-view 3DGS reconstruction and has practical potential. Abstract: Sparse-view 3D reconstruction is a fundamental yet challenging task in practical 3D reconstruction applications. Recently, many methods based on the 3D Gaussian Splatting (3DGS) framework have been proposed to address sparse-view 3D reconstruction. Although these methods have made considerable advancements, they still show significant issues with overfitting. To reduce the overfitting, we introduce VGNC, a novel Validation-guided Gaussian Number Control approach based on generative novel view synthesis (NVS) models. To the best of our knowledge, this is the first attempt to alleviate the overfitting issue of sparse-view 3DGS with generative validation images. Specifically, we first introduce a validation image generation method based on a generative NVS model. We then propose a Gaussian number control strategy that utilizes generated validation images to determine the optimal Gaussian numbers, thereby reducing the issue of overfitting. We conducted detailed experiments on various sparse-view 3DGS baselines and datasets to evaluate the effectiveness of VGNC. Extensive experiments show that our approach not only reduces overfitting but also improves rendering quality on the test set while decreasing the number of Gaussian points. This reduction lowers storage demands and accelerates both training and rendering. The code will be released.
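The validation-guided selection in [68] can be pictured as ordinary model selection on held-out views. The sketch below assumes the validation error has already been measured for a few candidate Gaussian counts; the function name `select_gaussian_count` and the numbers are made up for illustration, and the paper's actual control strategy may work quite differently (for example, by adjusting the count during training).

```python
def select_gaussian_count(val_error_by_count):
    """Return the Gaussian count with the lowest validation error.

    Ties are broken in favor of the smaller count, which is cheaper to store and render.
    """
    if not val_error_by_count:
        raise ValueError("need at least one candidate count")
    return min(val_error_by_count, key=lambda n: (val_error_by_count[n], n))

# Illustrative (made-up) validation errors: too few Gaussians underfit,
# too many overfit the sparse training views.
val_errors = {50_000: 0.081, 100_000: 0.064, 200_000: 0.059, 400_000: 0.071}
print(select_gaussian_count(val_errors))
```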

[69] Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

Weijun Zhuang,Qizhang Li,Xin Li,Ming Liu,Xiaopeng Hong,Feng Gao,Fan Yang,Wangmeng Zuo

Main category: cs.CV

TL;DR: The paper proposes Grounding-MD, a framework for open-world moment detection that uses video-language pre-training to achieve flexible and scalable detection.

Details

Motivation: Existing methods are confined to closed-set scenarios and cannot meet open-world demands, so a unified framework that handles arbitrary natural language queries is needed. Method: Grounding-MD combines a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder with large-scale pre-training to align video and text comprehensively. Result: On four benchmark datasets, Grounding-MD achieves state-of-the-art performance in both zero-shot and supervised settings. Conclusion: Grounding-MD is an effective solution for open-world moment detection and demonstrates strong semantic representation learning. Abstract: Temporal Action Detection and Moment Retrieval constitute two pivotal tasks in video understanding, focusing on precisely localizing temporal segments corresponding to specific actions or events. Recent advancements introduced Moment Detection to unify these two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to facilitate comprehensive video-text alignment and enable effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions. Comprehensive evaluations across four benchmark datasets including ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA demonstrate that Grounding-MD establishes new state-of-the-art performance in zero-shot and supervised settings in open-world moment detection scenarios. All source code and trained models will be released.

[70] SMTT: Novel Structured Multi-task Tracking with Graph-Regularized Sparse Representation for Robust Thermal Infrared Target Tracking

Shang Zhang,HuiPan Guan,XiaoBo Ding,Ruoyan Xiong,Yue Zhang

Main category: cs.CV

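TL;DR: Proposes SMTT, a thermal infrared target tracker that uses multi-task learning, joint sparse representation, and adaptive graph regularization to address common problems such as noise, occlusion, and rapid target motion.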

Details

Motivation: Thermal infrared target tracking is crucial in surveillance, autonomous driving, and military operations, but it faces challenges such as noise, occlusion, and rapid target motion. Method: Reformulates tracking as a multi-task learning problem, uses a weighted mixed-norm regularization strategy to dynamically capture spatial and feature-level similarities, and applies the Accelerated Proximal Gradient method to keep optimization fast enough for real-time use. Result: Experiments on benchmark datasets including VOT-TIR, PTB-TIR, and LSOTB-TIR show that SMTT performs strongly in accuracy, robustness, and computational efficiency. Conclusion: SMTT is a reliable, high-performance solution for thermal infrared target tracking in complex environments. Abstract: Thermal infrared target tracking is crucial in applications such as surveillance, autonomous driving, and military operations. In this paper, we propose a novel tracker, SMTT, which effectively addresses common challenges in thermal infrared imagery, such as noise, occlusion, and rapid target motion, by leveraging multi-task learning, joint sparse representation, and adaptive graph regularization. By reformulating the tracking task as a multi-task learning problem, the SMTT tracker independently optimizes the representation of each particle while dynamically capturing spatial and feature-level similarities using a weighted mixed-norm regularization strategy. To ensure real-time performance, we incorporate the Accelerated Proximal Gradient method for efficient optimization. Extensive experiments on benchmark datasets - including VOT-TIR, PTB-TIR, and LSOTB-TIR - demonstrate that SMTT achieves superior accuracy, robustness, and computational efficiency. These results highlight SMTT as a reliable and high-performance solution for thermal infrared target tracking in complex environments.
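For readers unfamiliar with the optimizer named in [70], here is a minimal sketch of Accelerated Proximal Gradient (FISTA-style) applied to a generic row-sparse least-squares problem, min_X 0.5*||D X - Y||_F^2 + lam*||X||_{2,1}. This objective is an assumption chosen for illustration: the abstract does not spell out SMTT's weighted mixed norm or its graph regularizer, and neither is modeled here.

```python
import numpy as np

def prox_l21(X, tau):
    """Row-wise group soft-thresholding: the proximal operator of tau * ||X||_{2,1}."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return X * scale

def apg_l21(D, Y, lam, iters=200):
    """Accelerated proximal gradient for 0.5*||D X - Y||_F^2 + lam*||X||_{2,1}."""
    lipschitz = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term's gradient
    X = np.zeros((D.shape[1], Y.shape[1]))
    Z, t = X.copy(), 1.0
    for _ in range(iters):
        grad = D.T @ (D @ Z - Y)                       # gradient of the least-squares term at Z
        X_new = prox_l21(Z - grad / lipschitz, lam / lipschitz)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = X_new + ((t - 1.0) / t_new) * (X_new - X)  # momentum (extrapolation) step
        X, t = X_new, t_new
    return X

# With an identity dictionary the solution is prox_l21(Y, lam) in closed form.
Y = np.zeros((5, 2))
Y[0] = [3.0, 4.0]
print(apg_l21(np.eye(5), Y, lam=1.0)[0])
```

The L2,1 norm zeroes out entire rows, which is why mixed norms are a common choice when several tasks should share the same small set of dictionary atoms.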

[71] NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Zheng Chen,Kai Liu,Jue Gong,Jingkai Wang,Lei Sun,Zongwei Wu,Radu Timofte,Yulun Zhang,Xiangyu Kong,Xiaoxuan Yu,Hyunhee Park,Suejin Han,Hakjae Jeon,Dafeng Zhang,Hyung-Ju Chun,Donghun Ryou,Inju Ha,Bohyung Han,Lu Zhao,Yuyi Zhang,Pengyu Yan,Jiawei Hu,Pengwei Liu,Fengjun Guo,Hongyuan Yu,Pufan Xu,Zhijuan Huang,Shuyuan Cui,Peng Guo,Jiahui Liu,Dongkai Zhang,Heng Zhang,Huiyuan Fu,Huadong Ma,Yanhui Guo,Sisi Tian,Xin Liu,Jinwen Liang,Jie Liu,Jie Tang,Gangshan Wu,Zeyu Xiao,Zhuoyuan Li,Yinxiang Zhang,Wenxuan Cai,Vijayalaxmi Ashok Aralikatti,Nikhil Akalwadi,G Gyaneshwar Rao,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Marcos V. Conde,Alejandro Merino,Bruno Longarela,Javier Abad,Weijun Yuan,Zhan Li,Zhanglu Chen,Boyang Yao,Aagam Jain,Milan Kumar Singh,Ankit Kumar,Shubh Kawa,Divyavardhan Singh,Anjali Sarvaiya,Kishor Upla,Raghavendra Ramachandra,Chia-Ming Lee,Yu-Fan Lin,Chih-Chung Hsu,Risheek V Hiremath,Yashaswini Palani,Yuxuan Jiang,Qiang Zhu,Siyue Teng,Fan Zhang,Shuyuan Zhu,Bing Zeng,David Bull,Jingwei Liao,Yuqing Yang,Wenda Shao,Junyi Zhao,Qisheng Xu,Kele Xu,Sunder Ali Khowaja,Ik Hyun Lee,Snehal Singh Tomar,Rajarshi Ray,Klaus Mueller,Sachin Chaudhary,Surya Vashisth,Akshay Dudhane,Praful Hambarde,Satya Naryan Tazi,Prashant Patil,Santosh Kumar Vipparthi,Subrahmanyam Murala,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Zahra Moammeri,Ahmad Mahmoudi-Aznaveh,Ali Karbasi,Hossein Motamednia,Liangyan Li,Guanhua Zhao,Kevin Le,Yimo Ning,Haoxuan Huang,Jun Chen

Main category: cs.CV

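TL;DR: The NTIRE 2025 image super-resolution challenge aims to recover high-resolution images from low-resolution inputs through effective network designs or solutions, and is split into a restoration sub-track and a perceptual sub-track.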

Details

Motivation: To advance image super-resolution by using a competition to encourage new network designs and solutions. Method: The challenge has two sub-tracks, a restoration track ranked by PSNR and a perceptual track ranked by a perceptual score, to which participants submit solutions. Result: 286 participants registered and 25 teams submitted valid entries; the report summarizes the challenge design, datasets, evaluation protocol, and each team's method. Conclusion: The challenge serves as a benchmark that advances image super-resolution. Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
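Since the restoration track in [71] ranks by PSNR, a short reference computation may help. This is the textbook definition, PSNR = 10 * log10(MAX^2 / MSE). The challenge's exact protocol (color space, border cropping, and so on) is not stated in the abstract, so none of that is assumed here.

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy 8-bit example: every pixel is off by exactly one gray level, so MSE = 1.
hr = np.zeros((16, 16), dtype=np.uint8)
sr = np.ones((16, 16), dtype=np.uint8)
print(round(psnr(hr, sr), 2))
```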

[72] Using street view imagery and deep generative modeling for estimating the health of urban forests

Akshit Gupta,Remko Uijlenhoet

Main category: cs.CV

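TL;DR: Proposes a method for monitoring urban forest health from street view imagery, tree inventories, and meteorological data, using image-to-image translation networks to estimate NDVI and CTD, with validation planned against field measurements.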

Details

Motivation: Traditional urban forest health monitoring relies on manual labor and costly instruments and is hard to scale; multi-spectral remote sensing is limited by dedicated deployments and spatial resolution. Method: Uses street view imagery, tree inventory data, and meteorological conditions as inputs to image-to-image translation networks that estimate NDVI and CTD, to be compared against field data from handheld multi-spectral and thermal imaging sensors. Result: By relying on data from existing street view platforms such as Google Street View, the approach is expected to enable effective management of urban forest health at scale. Conclusion: The proposed method offers cities with limited resources a low-cost, scalable way to monitor urban forest health. Abstract: Healthy urban forests comprising diverse trees and shrubs play a crucial role in mitigating climate change. They provide several key advantages such as providing shade for energy conservation, and intercepting rainfall to reduce flood runoff and soil erosion. Traditional approaches for monitoring the health of urban forests require instrumented inspection techniques, often involving a high amount of human labor and subjective evaluations. As a result, they are not scalable for cities which lack extensive resources. Recent approaches involving multi-spectral imaging data based on terrestrial sensing and satellites are constrained, respectively, by challenges related to dedicated deployments and limited spatial resolutions. In this work, we propose an alternative approach for monitoring the urban forests using simplified inputs: street view imagery, tree inventory data and meteorological conditions. We propose to use image-to-image translation networks to estimate two urban forest health parameters, namely, NDVI and CTD. Finally, we aim to compare the generated results with ground truth data using an onsite campaign utilizing handheld multi-spectral and thermal imaging sensors. With the advent and expansion of street view imagery platforms such as Google Street View and Mapillary, this approach should enable effective management of urban forests for the authorities in cities at scale.
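The two health parameters in [72] are not expanded in the abstract. The sketch below assumes the usual readings: NDVI as the Normalized Difference Vegetation Index, (NIR - Red) / (NIR + Red), and CTD as canopy temperature depression, taken here as air temperature minus canopy temperature. The CTD sign convention in particular is an assumption and may differ from the paper's.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index from near-infrared and red reflectance."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against division by zero

def canopy_temperature_depression(t_air, t_canopy):
    """CTD under the assumed convention: positive when the canopy is cooler than the air."""
    return np.asarray(t_air, dtype=np.float64) - np.asarray(t_canopy, dtype=np.float64)

# Dense healthy vegetation reflects strongly in NIR and weakly in red.
print(ndvi(0.8, 0.2))
print(canopy_temperature_depression(30.0, 27.0))
```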

[73] NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results

Zheng Chen,Jingkai Wang,Kai Liu,Jue Gong,Lei Sun,Zongwei Wu,Radu Timofte,Yulun Zhang,Jianxing Zhang,Jinlong Wu,Jun Wang,Zheng Xie,Hakjae Jeon,Suejin Han,Hyung-Ju Chun,Hyunhee Park,Zhicun Yin,Junjie Chen,Ming Liu,Xiaoming Li,Chao Zhou,Wangmeng Zuo,Weixia Zhang,Dingquan Li,Kede Ma,Yun Zhang,Zhuofan Zheng,Yuyue Liu,Shizhen Tang,Zihao Zhang,Yi Ning,Hao Jiang,Wenjie An,Kangmeng Yu,Chenyang Wang,Kui Jiang,Xianming Liu,Junjun Jiang,Yingfu Zhang,Gang He,Siqi Wang,Kepeng Xu,Zhenyang Liu,Changxin Zhou,Shanlan Shen,Yubo Duan,Yiang Chen,Jin Guo,Mengru Yang,Jen-Wei Lee,Chia-Ming Lee,Chih-Chung Hsu,Hu Peng,Chunming He

Main category: cs.CV

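TL;DR: A review of the NTIRE 2025 challenge on real-world face restoration, covering the proposed solutions and their outcomes, with the goal of advancing the state of the art in perceptual quality and realism.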

Details

Motivation: To improve the perceptual quality and realism of real-world face restoration while maintaining identity consistency, and to move the field forward. Method: Entries are evaluated with a weighted image quality assessment (IQA) score, and the AdaFace model serves as an identity checker. Result: 141 registrants, 13 teams submitting valid models, and 10 teams in the final ranking. Conclusion: The challenge advanced real-world face restoration and summarizes the latest trends in the field. Abstract: This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.

[74] MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

Siyi Jiao,Wenzheng Zeng,Yerong Li,Huayu Zhang,Changxin Gao,Nong Sang,Mike Zheng Shou

Main category: cs.CV

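TL;DR: MP-Mat is a novel 3D-and-instance-aware matting framework that uses multiplane representations, designed from both the scene-geometry and the instance perspective, to handle human matting in complex cases.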

Details

Motivation: Existing methods struggle to separate pixels accurately in complex scenes, such as overlapping instances, hair, and thin boundary structures. Method: Builds feature-level multiplane representations at both the scene-geometry level and the instance level, and treats the background as a special instance. Result: Experiments show that MP-Mat performs strongly on matting and, under zero-shot inference, outperforms specialized methods on image editing tasks. Conclusion: The multiplane representation markedly improves matting quality and shows potential for image editing. Abstract: Human instance matting aims to estimate an alpha matte for each human instance in an image, which is challenging as it easily fails in complex cases requiring disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level. Specifically, we first build feature-level multiplane representations to split the scene into multiple planes based on depth differences. This approach makes the scene representation 3D-aware, and can serve as an effective clue for splitting instances in different 3D positions, thereby improving interpretability and boundary handling ability especially in occlusion areas. Then, we introduce another multiplane representation that splits the scene in an instance-level perspective, and represents each instance with both matte and color. We also treat background as a special instance, which is often overlooked by existing methods. Such an instance-level representation facilitates both foreground and background content awareness, and is useful for other down-stream tasks like image editing. Once built, the representation can be reused to realize controllable instance-level image editing with high efficiency. Extensive experiments validate the clear advantage of MP-Mat in the matting task. We also demonstrate its superiority in image editing tasks, an area under-explored by existing matting-focused methods, where our approach under zero-shot inference even outperforms trained specialized image editing techniques by large margins. Code is open-sourced at https://github.com/JiaoSiyi/MPMat.git.

[75] VM-BHINet: Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image

Han Bi,Ge Yu,Yu He,Wenzhuo Liu,Zijie Zheng

Main category: cs.CV

TL;DR: VM-BHINet uses state space models to improve the reconstruction of interacting hands, significantly reducing error.

Details

Motivation: Existing methods handle occlusions, ambiguous appearances, and computational cost poorly, so bimanual interaction modeling needs to improve. Method: Proposes VM-BHINet, which combines state space models with local and global feature operations; its core component is the VM-IFEBlock. Result: On the InterHand2.6M dataset, MPJPE and MPVPE drop by 2-3%, surpassing existing methods. Conclusion: VM-BHINet improves both the accuracy and the efficiency of interacting-hand reconstruction. Abstract: Understanding bimanual hand interactions is essential for realistic 3D pose and shape reconstruction. However, existing methods struggle with occlusions, ambiguous appearances, and computational inefficiencies. To address these challenges, we propose Vision Mamba Bimanual Hand Interaction Network (VM-BHINet), introducing state space models (SSMs) into hand reconstruction to enhance interaction modeling while improving computational efficiency. The core component, Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations, enabling deep understanding of hand interactions. Experiments on the InterHand2.6M dataset show that VM-BHINet reduces Mean per-joint position error (MPJPE) and Mean per-vertex position error (MPVPE) by 2-3%, significantly surpassing state-of-the-art methods.
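The two metrics reported in [75] share one formula: the mean Euclidean distance between predicted and ground-truth 3D points, taken over joints for MPJPE and over mesh vertices for MPVPE. A minimal sketch follows; any root alignment or unit conversion used by the InterHand2.6M protocol is not modeled here.

```python
import numpy as np

def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points, shape (..., N, 3).

    Pass joints to get MPJPE, or mesh vertices to get MPVPE.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy hand with 21 joints where every prediction is offset by (3, 4, 0) mm.
gt_joints = np.zeros((21, 3))
pred_joints = gt_joints + np.array([3.0, 4.0, 0.0])
print(mean_per_point_error(pred_joints, gt_joints))
```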

[76] Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts

Zhenkui Yang,Zeyi Huang,Ge Wang,Han Ding,Tony Xiao Han,Fei Wang

Main category: cs.CV

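TL;DR: The paper proposes WiTalk, a text-enhanced wireless sensing framework that integrates semantic knowledge through three hierarchical prompt strategies and significantly improves human action recognition and localization from wireless signals such as WiFi, RFID, and mmWave radar.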

Details

Motivation: Wireless sensing offers non-contact operation and environmental adaptability, but existing systems fail to exploit the textual information in datasets; the paper aims to improve performance by integrating semantic knowledge. Method: Proposes the WiTalk framework with three hierarchical prompt strategies (label-only, brief description, and detailed action description), requiring no architectural changes and no additional data cost. Result: Validated on three public datasets with significant gains: on XRF55, accuracy for WiFi, RFID, and mmWave radar rises by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, average performance improves by 4.98%; on XRFV2, mean average precision gains range from 4.02% to 13.68%. Conclusion: By integrating semantic knowledge, WiTalk significantly improves wireless sensing and provides stronger support for applications in public security, healthcare, and smart environments. Abstract: Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies (label-only, brief description, and detailed action description) without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our code is available at https://github.com/yangzhenkui/WiTalk.

[77] MSAD-Net: Multiscale and Spatial Attention-based Dense Network for Lung Cancer Classification

Santanu Roy,Shweta Singh,Palak Sahu,Ashvath Suresh,Debashish Das

Main category: cs.CV

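TL;DR: Proposes MSD-Net, a new CNN architecture that introduces dense modules and special connection structures, significantly improving lung cancer detection accuracy and outperforming existing models.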

Details

Motivation: Early detection of lung cancer is critical, but manual detection is challenging and existing deep learning models are held back by class imbalance. Method: Designs MSD-Net with dense modules, depthwise separable convolutional layers, a skip connection, and a parallel branch to extract multi-scale features while reducing model complexity. Result: Experiments show that MSD-Net significantly outperforms recent models such as ConvNext-Tiny, ViT, and PiT. Conclusion: MSD-Net is an efficient, high-performing solution for automatic lung cancer detection. Abstract: Lung cancer, a severe form of malignant tumor that originates in the tissues of the lungs, can be fatal if not detected in its early stages. It ranks among the top causes of cancer-related mortality worldwide. Detecting lung cancer manually from chest X-ray images or Computed Tomography (CT) scans poses significant challenges for radiologists. Hence, there is a need for automatic diagnosis systems that detect lung cancer from radiology images. With the recent emergence of deep learning, particularly Convolutional Neural Networks (CNNs), the automated detection of lung cancer has become a much simpler task. Nevertheless, numerous researchers have noted that the performance of conventional CNNs may be hindered by class imbalance, which is prevalent in medical images. In this work, we propose a novel CNN architecture, "Multi-Scale Dense Network (MSD-Net)", trained from scratch. The novelties of the proposed model are: (I) We introduce novel dense modules in the 4th and 5th blocks of the CNN. Each dense module uses three depthwise separable convolutional (DWSC) layers and one 1x1 convolutional layer in order to reduce model complexity considerably. (II) We incorporate one skip connection from the 3rd block to the 5th block and one parallel branch connection from the 4th block to the Global Average Pooling (GAP) layer. We use a dilated convolutional layer (dilation rate = 2) in the last parallel branch in order to extract multi-scale features.
Extensive experiments reveal that our proposed model outperforms the recent CNN model ConvNext-Tiny, the Vision Transformer (ViT), the Pooling-based ViT (PiT), and other existing models by significant margins.
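The abstract of [77] credits depthwise separable convolutions with reducing model complexity. The arithmetic behind that claim is easy to check with a parameter count (biases ignored). The channel width of 256 below is an arbitrary example, not a figure from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise

std = standard_conv_params(256, 256, 3)
dws = depthwise_separable_params(256, 256, 3)
print(std, dws, round(std / dws, 2))
```

For a 3x3 kernel the saving approaches a factor of 9 as the number of output channels grows, since the ratio is 1 / (1/c_out + 1/k^2).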

[78] NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation

Junyuan Fang,Zihan Wang,Yejun Zhang,Shuzhe Wang,Iaroslav Melekhov,Juho Kannala

Main category: cs.CV

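TL;DR: Proposes a hard visual prompting method based on 3D Gaussian Splatting that generates multiple viewpoints through camera interpolation, with no 2D-3D optimization or fine-tuning, to improve 3D instance segmentation.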

Details

Motivation: Vision-language models lack the localization and recognition ability that 3D instance-level segmentation requires. Method: Uses 3D Gaussian Splatting and camera interpolation to generate multiple viewpoints and enforce geometric consistency, without any training. Result: The method makes 3D instance segmentation more robust and accurate. Conclusion: The approach is an effective training-free way to apply vision-language models to 3D scenes. Abstract: Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.
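Camera pose interpolation, which [78] uses to generate new viewpoints, is commonly done by spherically interpolating the rotation and linearly interpolating the position. The sketch below shows that generic recipe with unit quaternions in (w, x, y, z) order. The abstract does not say which interpolation scheme the paper uses, so this is an assumption.

```python
import numpy as np

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z), u in [0, 1]."""
    q0 = np.asarray(q0, dtype=np.float64) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=np.float64) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # q and -q encode the same rotation: take the short path
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)

def interpolate_pose(q0, t0, q1, t1, u):
    """Interpolate a camera pose: slerp for rotation, lerp for position."""
    position = (1.0 - u) * np.asarray(t0, dtype=np.float64) + u * np.asarray(t1, dtype=np.float64)
    return slerp(q0, q1, u), position

# Halfway between identity and a 90-degree rotation about z is a 45-degree rotation about z.
q_id = [1.0, 0.0, 0.0, 0.0]
q_90z = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
q_mid, t_mid = interpolate_pose(q_id, [0.0, 0.0, 0.0], q_90z, [2.0, 0.0, 0.0], 0.5)
print(q_mid, t_mid)
```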

[79] Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

Lin Li,Wei Chen,Jiahui Li,Long Chen

Main category: cs.CV

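TL;DR: Relation-R1 is a unified relational comprehension framework that improves multi-modal large language models (MLLMs) on visual relation understanding through cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO).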

Details

Motivation: Current MLLMs remain limited in visual relation understanding (e.g., scene graph generation), especially in modeling N-ary relationships, which leads to unreliable outputs; Relation-R1 targets this gap. Method: Combines cognitive CoT-guided SFT with GRPO, using reinforcement learning to refine model outputs and to prioritize visual-semantic grounding over language priors. Result: On the PSG and SWiG datasets, Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding. Conclusion: Through structured reasoning and multi-reward optimization, Relation-R1 significantly improves MLLMs on visual relation understanding. Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (e.g., scene graph generation), particularly in modeling N-ary relationships that identify multiple semantic roles among an action event. Such a lack of semantic dependencies modeling among multi-entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
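The core of GRPO, which [79] uses for refinement, is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that group normalization step. How Relation-R1 combines its multiple rewards into one scalar per response is not described in the abstract and is not modeled here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled responses to zero mean, unit std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # eps avoids division by zero for uniform groups

# Four responses to one prompt: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that beat their group's average receive a positive advantage and are reinforced; a group in which every response gets the same reward contributes no learning signal.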

[80] EmoSEM: Segment and Explain Emotion Stimuli in Visual Art

Jing Zhang,Dan Guo,Zhangbin Li,Meng Wang

Main category: cs.CV

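TL;DR: The paper proposes EmoSEM, which uses an emotional prompt and a lightweight prefix projector to segment emotion-triggering regions in art images at the pixel level and to generate explanations of the emotion.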

Details

Motivation: To address the challenges of pixel-level emotion understanding in art images, namely the subjectivity of emotion and the abstract nature of artistic expression. Method: Combines an emotional prompt, an emotion projector, and a prefix projector to associate emotion with visual features, and generates emotional explanations with a language model. Result: Experiments validate the model, which achieves end-to-end modeling from pixel features to emotion interpretation. Conclusion: EmoSEM provides the first interpretable, fine-grained analysis framework for artistic emotion computing. Abstract: This paper focuses on a key challenge in visual art understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion, and generates linguistic explanations for the emotional arousal. Despite recent advances in art understanding, pixel-level emotion understanding still faces a dual challenge: first, the subjectivity of emotion makes it difficult for general segmentation models like SAM to adapt to emotion-oriented segmentation tasks; and second, the abstract nature of art expression makes it difficult for captioning models to balance pixel-level semantic understanding and emotion reasoning. To solve the above problems, this paper proposes the Emotion stimuli Segmentation and Explanation Model (EmoSEM) to endow the segmentation model SAM with emotion comprehension capability. First, to enable the model to perform segmentation well under the guidance of emotional intent, we introduce an emotional prompt with a learnable mask token as the conditional input for segmentation decoding. Then, we design an emotion projector to establish the association between emotion and visual features. Next, more importantly, to address emotion-visual stimuli alignment, we develop a lightweight prefix projector, a module that fuses the learned emotional mask with the corresponding emotion into a unified representation compatible with the language model. Finally, we input the joint visual, mask, and emotional tokens into the language model and output the emotional explanations. It ensures that the generated interpretations remain semantically and emotionally coherent with the visual stimuli.
The method innovatively realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, providing the first interpretable fine-grained analysis framework for artistic emotion computing. Extensive experiments validate the effectiveness of our model.

[81] Frequency-domain Learning with Kernel Prior for Blind Image Deblurring

Jixiang Sun,Fei Lei,Jiawei Zhang,Wenxiu Sun,Yujiu Yang

Main category: cs.CV

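TL;DR: The paper proposes a deep learning image deblurring method that combines a kernel prior with frequency-domain processing, using a Frequency Integration Module (FIM) to improve generalization.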

Details

Motivation: Existing deep learning deblurring methods generalize poorly because they depend on domain-specific datasets; the kernel prior is independent of image content and may address this. Method: Following traditional algorithms that deconvolve in the frequency domain, designs a Frequency Integration Module (FIM) to fuse the kernel prior and combines it with a frequency-based deblurring Transformer network. Result: Experiments show that the method outperforms existing methods on multiple blind image deblurring tasks and generalizes better. Conclusion: Introducing a kernel prior and fusing it in the frequency domain effectively improves the generalization of deep deblurring models. Abstract: While achieving excellent results on various datasets, many deep learning methods for image deblurring suffer from limited generalization capabilities with out-of-domain data. This limitation is likely caused by their dependence on certain domain-specific datasets. To address this challenge, we argue that it is necessary to introduce the kernel prior into deep learning methods, as the kernel prior remains independent of the image context. For effective fusion of kernel prior information, we adopt a rational implementation method inspired by traditional deblurring algorithms that perform deconvolution in the frequency domain. We propose a module called Frequency Integration Module (FIM) for fusing the kernel prior and combine it with a frequency-based deblurring Transformer network. Experimental results demonstrate that our method outperforms state-of-the-art methods on multiple blind image deblurring tasks, showcasing robust generalization abilities. Source code will be available soon.
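The "traditional deblurring algorithms that perform deconvolution in the frequency domain" mentioned in [81] can be illustrated with a classic Wiener-style filter: blurring is a multiplication by the kernel's spectrum, so a known kernel can be undone by a regularized division. The sketch assumes a known kernel and circular boundary conditions. It is the classical baseline only, not the paper's FIM module, and `snr_inv` is a hypothetical regularization constant.

```python
import numpy as np

def kernel_spectrum(kernel, shape):
    """FFT of the kernel zero-padded to `shape`, with its center shifted to pixel (0, 0)."""
    kh, kw = kernel.shape
    padded = np.zeros(shape)
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)

def circular_blur(image, kernel):
    """Blur by pointwise multiplication in the frequency domain (circular convolution)."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * kernel_spectrum(kernel, image.shape)))

def wiener_deconvolve(blurred, kernel, snr_inv=1e-3):
    """Wiener-style inverse filter: X = conj(K) * B / (|K|^2 + snr_inv)."""
    K = kernel_spectrum(kernel, blurred.shape)
    B = np.fft.fft2(blurred)
    return np.real(np.fft.ifft2(np.conj(K) * B / (np.abs(K) ** 2 + snr_inv)))

rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
box = np.full((3, 3), 1.0 / 9.0)          # simple 3x3 box blur kernel
blurred = circular_blur(sharp, box)
restored = wiener_deconvolve(blurred, box, snr_inv=1e-8)
print(np.abs(blurred - sharp).mean(), np.abs(restored - sharp).mean())
```

In the noise-free toy case a tiny `snr_inv` recovers the image almost exactly. With real noise it has to be larger, trading sharpness for stability at frequencies where the kernel's spectrum is close to zero.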

[82] DMPCN: Dynamic Modulated Predictive Coding Network with Hybrid Feedback Representations

A S M Sharifuzzaman Sagar,Yu Chen,Jun Hoong Chan

Main category: cs.CV

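TL;DR: The paper proposes a deep predictive coding network with a hybrid prediction error feedback mechanism and dynamic modulation, addressing the weakness of traditional methods in handling local and global details together, and designs a tailored loss function to improve performance.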

Details

Motivation: Traditional predictive coding networks use either local or global error feedback, which performs poorly when both must be handled, and they struggle to adapt dynamically to the complexity of the input, which limits performance. Method: Introduces a hybrid prediction error feedback mechanism that combines global context and local detail and adjusts feedback dynamically, plus a tailored loss function that targets prediction error minimization. Result: Experiments show faster convergence and higher predictive accuracy on CIFAR-10, CIFAR-100, MNIST, and FashionMNIST. Conclusion: The proposed method markedly improves predictive coding networks and suits a range of scenarios. Abstract: Traditional predictive coding networks, inspired by theories of brain function, consistently achieve promising results across various domains, extending their influence into the field of computer vision. However, the performance of the predictive coding networks is limited by their error feedback mechanism, which traditionally employs either local or global recurrent updates, leading to suboptimal performance in processing both local and broader details simultaneously. In addition, traditional predictive coding networks face difficulties in dynamically adjusting to the complexity and context of varying input data, which is crucial for achieving high levels of performance in diverse scenarios. Furthermore, there is a gap in the development and application of specific loss functions that could more effectively guide the model towards optimal performance. To deal with these issues, this paper introduces a hybrid prediction error feedback mechanism with dynamic modulation for deep predictive coding networks by effectively combining global contexts and local details while adjusting feedback based on input complexity. Additionally, we present a loss function tailored to this framework to improve accuracy by focusing on precise prediction error minimization. Experimental results demonstrate the superiority of our model over other approaches, showcasing faster convergence and higher predictive accuracy in CIFAR-10, CIFAR-100, MNIST, and FashionMNIST datasets.

[83] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Kaihang Pan,Wang Lin,Zhongqi Yue,Tenglong Ao,Liyu Jia,Wei Zhao,Juncheng Li,Siliang Tang,Hanwang Zhang

Main category: cs.CV

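TL;DR: The paper proposes a new visual language that uses diffusion timesteps to learn discrete, recursive visual tokens, overcoming the limitations of spatial tokens and unifying multimodal comprehension and generation in one framework.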

Details

Motivation: Existing approaches rely on spatial visual tokens, which lack the recursive structure of language and thus limit what the LLM can learn; the paper aims to build a visual language better suited to LLMs. Method: Learns recursive visual tokens through diffusion timesteps that compensate for the attribute loss in noisy images, combining the autoregressive reasoning of LLMs with the precise image generation of diffusion models. Result: Experiments show that the method outperforms other MLLMs on both multimodal comprehension and generation. Conclusion: The proposed visual language effectively integrates the strengths of LLMs and diffusion models, achieving seamless multimodal comprehension and generation. Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.

[84] Seurat: From Moving Points to Depth

Seokju Cho,Jiahui Huang,Seungryong Kim,Joon-Young Lee

Main category: cs.CV

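TL;DR: Proposes a method that infers relative depth by analyzing 2D trajectories, processing them with spatial and temporal transformers to achieve high-accuracy depth prediction.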

Details

Motivation: Depth estimation from monocular video is ambiguous; the work is inspired by how humans perceive depth by observing changes in the size and spacing of objects. Method: Uses off-the-shelf point tracking models to capture 2D trajectories, then processes them with spatial and temporal transformers to infer depth changes. Result: Performs strongly on the TAPVid-3D benchmark, with robust zero-shot generalization and temporally smooth, accurate predictions. Conclusion: The method generalizes effectively from synthetic data to real-world scenes and delivers high-accuracy depth prediction. Abstract: Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
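
The size-and-spacing cue mentioned in this abstract can be illustrated with plain pinhole geometry. The sketch below is not the paper's transformer model; it only shows, under the assumption of a rigid pair of tracked points at roughly the same depth, why a change in projected spacing implies a change in relative depth. The function names are hypothetical.

```python
import math

def spacing(p, q):
    """Euclidean distance in pixels between two tracked 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def relative_depth_ratio(pair_t0, pair_t1):
    """Return depth(t1) / depth(t0) for a rigid point pair under a pinhole camera.

    Projected spacing is proportional to 1 / depth, so the depth ratio is the
    inverse of the spacing ratio. Assumes both points share one depth.
    """
    s0 = spacing(*pair_t0)
    s1 = spacing(*pair_t1)
    if s0 == 0 or s1 == 0:
        raise ValueError("points must be distinct in both frames")
    return s0 / s1

# The pair spreads from 10 px to 20 px apart, so it is now at half the depth.
ratio = relative_depth_ratio(((0, 0), (10, 0)), ((0, 0), (20, 0)))
```

A learned model is still needed in practice because real point sets are non-rigid and span many depths, which is the ambiguity the abstract describes.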

[85] Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Enxin Song,Wenhao Chai,Weili Xu,Jianwen Xie,Yuxuan Liu,Gaoang Wang

Main category: cs.CV

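TL;DR: Video-MMLU is a large-scale benchmark for evaluating how well language multimodal models (LMMs) understand multi-discipline lectures; it reveals the limitations of current models on tasks requiring both perception and reasoning.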

Details

Motivation: Understanding multi-discipline lectures remains largely unexplored, and it is unclear how existing LMMs perform on such tasks. Method: Evaluates more than 90 open-source and proprietary models (0.5B to 40B parameters) and analyzes how the number of visual tokens and the choice of large language model affect performance. Result: Current models fall short on the cognitive challenges posed by multi-discipline lectures, especially on tasks requiring both perception and reasoning. Conclusion: The study highlights the key role of multimodal perception and reasoning in lecture comprehension and points to directions for improving future models. Abstract: Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the large language models influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.

[86] IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays

Sascha Jecklin,Aidana Massalimova,Ruyi Zha,Lilian Calvet,Christoph J. Laux,Mazda Farshad,Philipp Fürnstahl

Main category: cs.CV

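TL;DR: The paper proposes a Gaussian-splatting-based instance-learning method for reconstructing 3D spinal anatomy from sparse X-rays; it requires no pretraining and introduces an anatomy-guided standardization step to improve reconstruction quality.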

Details

Motivation: Spine surgery needs high-precision 3D anatomical reconstruction, but existing supervised methods depend on large amounts of annotated data and generalize poorly. Gaussian splatting may offer an alternative that avoids extensive annotation. Method: Extends the $R^2$-Gaussian splatting framework with an anatomy-guided radiographic standardization step that uses style transfer to improve cross-view consistency; no pretraining is required. Result: Validated on an ex-vivo dataset; expert evaluation confirmed the clinical utility of the 3D reconstructions, and the standardization step improved anatomical clarity. Quantitative metrics (PSNR/SSIM) confirmed the gain from standardization over raw inputs. Conclusion: The study demonstrates the feasibility of instance-based volumetric reconstruction from sparse X-rays, offering a new approach to 3D imaging for surgical navigation. Abstract: Spine surgery is a high-risk intervention demanding precise execution, often supported by image-based navigation systems. Recently, supervised learning approaches have gained attention for reconstructing 3D spinal anatomy from sparse fluoroscopic data, significantly reducing reliance on radiation-intensive 3D imaging systems. However, these methods typically require large amounts of annotated training data and may struggle to generalize across varying patient anatomies or imaging conditions. Instance-learning approaches like Gaussian splatting could offer an alternative by avoiding extensive annotation requirements. While Gaussian splatting has shown promise for novel view synthesis, its application to sparse, arbitrarily posed real intraoperative X-rays has remained largely unexplored. This work addresses this limitation by extending the $R^2$-Gaussian splatting framework to reconstruct anatomically consistent 3D volumes under these challenging conditions. We introduce an anatomy-guided radiographic standardization step using style transfer, improving visual consistency across views, and enhancing reconstruction quality. Notably, our framework requires no pretraining, making it inherently adaptable to new patients and anatomies. We evaluated our approach using an ex-vivo dataset. Expert surgical evaluation confirmed the clinical utility of the 3D reconstructions for navigation, especially when using 20 to 30 views, and highlighted the standardization's benefit for anatomical clarity. Benchmarking via quantitative 2D metrics (PSNR/SSIM) confirmed performance trade-offs compared to idealized settings, but also validated the improvement gained from standardization over raw inputs. This work demonstrates the feasibility of instance-based volumetric reconstruction from arbitrary sparse-view X-rays, advancing intraoperative 3D imaging for surgical navigation.
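
For readers unfamiliar with the 2D metrics named in this abstract, here is a minimal sketch of PSNR using its standard definition. It is not the authors' evaluation code; the image representation (nested lists of floats in [0, 1]) and the function name are assumptions made for illustration.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    if len(reference) != len(estimate) or any(
        len(r) != len(e) for r, e in zip(reference, estimate)
    ):
        raise ValueError("images must have the same shape")
    count = sum(len(row) for row in reference)
    squared_error = sum(
        (a - b) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for a, b in zip(ref_row, est_row)
    )
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by 0.1, so MSE is 0.01 and PSNR is about 20 dB.
value = psnr([[0.0, 1.0], [1.0, 0.0]], [[0.1, 0.9], [0.9, 0.1]])
```

Higher PSNR means the rendered projection is closer to the reference X-ray; SSIM, the other metric mentioned, additionally compares local structure and is not shown here.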

[87] Time Frequency Analysis of EMG Signal for Gesture Recognition using Fine grained Features

Parshuram N. Aarotale,Ajita Rattani

Main category: cs.CV

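TL;DR: The paper proposes XMANet, a new approach to EMG-based gesture recognition that combines local and semantic features through cross-layer mutual attention, substantially improving classification performance.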

Details

Motivation: Improve the accuracy and robustness of EMG gesture recognition for prosthetic control, rehabilitation, and human-computer interaction. Method: Applies XMANet to spectrograms and scalograms generated by the STFT and WT for fine-grained classification. Result: Across multiple datasets and baseline models, XMANet improves performance, with gains of up to 9.36%. Conclusion: By using fine-grained features, XMANet achieves more accurate and robust EMG classification and has broad application potential. Abstract: Electromyography (EMG) based hand gesture recognition converts forearm muscle activity into control commands for prosthetics, rehabilitation, and human computer interaction. This paper proposes a novel approach to EMG-based hand gesture recognition that uses fine-grained classification and presents XMANet, which unifies low-level local and high level semantic cues through cross layer mutual attention among shallow to deep CNN experts. Using stacked spectrograms and scalograms derived from the Short Time Fourier Transform (STFT) and Wavelet Transform (WT), we benchmark XMANet against ResNet50, DenseNet-121, MobileNetV3, and EfficientNetB0. Experimental results on the Grabmyo dataset indicate that, using STFT, the proposed XMANet model outperforms the baseline ResNet50, EfficientNetB0, MobileNetV3, and DenseNet121 models with improvement of approximately 1.72%, 4.38%, 5.10%, and 2.53%, respectively. When employing the WT approach, improvements of around 1.57%, 1.88%, 1.46%, and 2.05% are observed over the same baselines. Similarly, on the FORS EMG dataset, the XMANet(ResNet50) model using STFT shows an improvement of about 5.04% over the baseline ResNet50. In comparison, the XMANet(DenseNet121) and XMANet(MobileNetV3) models yield enhancements of approximately 4.11% and 2.81%, respectively. Moreover, when using WT, the proposed XMANet achieves gains of around 4.26%, 9.36%, 5.72%, and 6.09% over the baseline ResNet50, DenseNet121, MobileNetV3, and EfficientNetB0 models, respectively. These results confirm that XMANet consistently improves performance across various architectures and signal processing techniques, demonstrating the strong potential of fine grained features for accurate and robust EMG classification.
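
The spectrogram inputs described in this abstract come from a Short Time Fourier Transform. The sketch below computes a magnitude spectrogram for a single channel in pure Python, assuming a periodic Hann window; the window length, hop size, and function name are illustrative choices, not the paper's settings.

```python
import cmath
import math

def stft_magnitude(signal, window_size, hop):
    """Return one magnitude spectrum (bins 0..window_size//2) per frame."""
    # Periodic Hann window, a common default for spectrograms.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_size)
              for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(window_size)]
        spectrum = []
        for k in range(window_size // 2 + 1):
            # Direct DFT of the windowed segment at frequency bin k.
            acc = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n in range(window_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# A pure tone centred on bin 4 of a 32-sample window.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spectrogram = stft_magnitude(tone, window_size=32, hop=16)
```

Stacking such time-frequency maps across EMG channels gives an image-like tensor that a CNN classifier can consume. A production pipeline would normally use an FFT-based routine rather than this direct DFT.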

[88] Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline

Hui Zhou,Shaoshuai Shi,Hongsheng Li

Main category: cs.CV

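TL;DR: The paper proposes a new framework that combines imitation learning and reinforcement learning to address the limitations of imitation learning in autonomous driving planning, and develops a closed-loop simulator and a causal benchmark for evaluation.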

Details

Motivation: Imitation learning performs well in autonomous driving planning, but it is hard to verify whether it truly understands driving principles, and it tends to overfit to common scenarios. Method: Proposes a framework combining imitation learning and reinforcement learning, together with a closed-loop simulator and a causal benchmark. Result: The new framework is designed to overcome the limitations of imitation learning and improve generalization to rare or unseen scenarios. Conclusion: By combining imitation learning and reinforcement learning, the paper offers a more robust approach to autonomous driving planning. Abstract: Machine learning (ML)-based planners have recently gained significant attention. They offer advantages over traditional optimization-based planning algorithms. These advantages include fewer manually selected parameters and faster development. Within ML-based planning, imitation learning (IL) is a common algorithm. It primarily learns driving policies directly from supervised trajectory data. While IL has demonstrated strong performance on many open-loop benchmarks, it remains challenging to determine if the learned policy truly understands fundamental driving principles, rather than simply extrapolating from the ego-vehicle's initial state. Several studies have identified this limitation and proposed algorithms to address it. However, these methods often use original datasets for evaluation. In these datasets, future trajectories are heavily dependent on initial conditions. Furthermore, IL often overfits to the most common scenarios. It struggles to generalize to rare or unseen situations. To address these challenges, this work proposes: 1) a novel closed-loop simulator supporting both imitation and reinforcement learning, 2) a causal benchmark derived from the Waymo Open Dataset to rigorously assess the impact of the copycat problem, and 3) a novel framework integrating imitation learning and reinforcement learning to overcome the limitations of purely imitative approaches. The code for this work will be released soon.

[89] Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation

Md. Sanaullah Chowdhury,Salauddin Tapu,Noyon Kumar Sarkar,Ferdous Bin Ali,Lameya Sabrin

Main category: cs.CV

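TL;DR: Med-2D SegNet is an efficient medical image segmentation architecture whose compact Med Block design achieves high accuracy at low computational complexity, performing strongly across multiple datasets.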

Details

Motivation: Medical image segmentation is crucial for clinical diagnosis and surgical planning, but existing methods face challenges in complexity and efficiency. Method: Proposes Med-2D SegNet, built on the Med Block design, which combines dimension expansion with parameter reduction for efficient feature extraction. Result: Achieves an average DSC of 89.77% across 20 datasets with only 2.07 million parameters, and performs strongly in zero-shot settings. Conclusion: Med-2D SegNet balances accuracy and efficiency, providing a new tool for broadening access to medical technology. Abstract: Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduced Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.
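
The headline figure in this entry is a Dice similarity coefficient, so a short sketch of the standard definition for binary masks may help. The mask representation (nested lists of 0/1) and the convention of returning 1.0 when both masks are empty are assumptions made here, not details from the paper.

```python
def dice_coefficient(prediction, target):
    """DSC = 2 * |A intersect B| / (|A| + |B|) for binary masks of equal shape."""
    pred_flat = [v for row in prediction for v in row]
    target_flat = [v for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same shape")
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Two predicted pixels, one true pixel, one in common: DSC = 2 * 1 / 3.
score = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

An average DSC of 89.77% therefore means that, averaged over datasets, twice the overlap is about 0.9 of the combined mask area.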

[90] TAPIP3D: Tracking Any Point in Persistent 3D Geometry

Bowei Zhang,Lei Ke,Adam W. Harley,Katerina Fragkiadaki

Main category: cs.CV

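TL;DR: TAPIP3D is a new method for long-term 3D point tracking in monocular RGB and RGB-D videos; it represents videos as camera-stabilized spatio-temporal feature clouds and uses depth and camera motion information to lift 2D features into 3D world space, substantially improving tracking performance.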

Details

Motivation: Existing 3D point tracking methods fall short in long-term tracking and camera motion compensation; TAPIP3D aims to address both. Method: TAPIP3D iteratively refines multi-frame 3D motion estimates and introduces a Local Pair Attention mechanism that exploits 3D spatial relationships to produce precise 3D trajectory estimates. Result: TAPIP3D significantly outperforms existing methods on 3D point tracking benchmarks and improves 2D tracking accuracy when accurate depth is available. Conclusion: Through camera motion compensation and a 3D contextualization strategy, TAPIP3D achieves more robust and accurate 3D point tracking. Abstract: We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera motion is effectively canceled. TAPIP3D iteratively refines multi-frame 3D motion estimates within this stabilized representation, enabling robust tracking over extended periods. To manage the inherent irregularities of 3D point distributions, we propose a Local Pair Attention mechanism. This 3D contextualization strategy effectively exploits spatial relationships in 3D, forming informative feature neighborhoods for precise 3D trajectory estimation. Our 3D-centric approach significantly outperforms existing 3D point tracking methods and even enhances 2D tracking accuracy compared to conventional 2D pixel trackers when accurate depth is available. It supports inference in both camera coordinates (i.e., unstabilized) and world coordinates, and our results demonstrate that compensating for camera motion improves tracking performance. Our approach replaces the conventional 2D square correlation neighborhoods used in prior 2D and 3D trackers, leading to more robust and accurate results across various 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
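
The lifting step this abstract describes (2D pixels plus depth and camera motion mapped into a world frame) follows standard pinhole geometry. The sketch below shows that geometry only; it is not TAPIP3D's implementation, and the intrinsics, poses, and function names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def camera_to_world(point_cam, rotation, translation):
    """Apply a camera-to-world pose: X_world = R @ X_cam + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def lift_to_world(u, v, depth, intrinsics, rotation, translation):
    """Lift one pixel into the shared world frame."""
    fx, fy, cx, cy = intrinsics
    return camera_to_world(unproject(u, v, depth, fx, fy, cx, cy),
                           rotation, translation)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)  # assumed fx, fy, cx, cy

# The same static point seen from two camera positions lands on one world location.
frame_a = lift_to_world(100.0, 50.0, 2.0, K, IDENTITY, (0.0, 0.0, 0.0))
frame_b = lift_to_world(50.0, 50.0, 2.0, K, IDENTITY, (1.0, 0.0, 0.0))
```

Because the static point has identical world coordinates in both frames, any residual motion in world space belongs to the scene rather than the camera, which is the sense in which camera motion is canceled.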

[91] ChronoRoot 2.0: An Open AI-Powered Platform for 2D Temporal Plant Phenotyping

Nicolás Gaggion,Rodrigo Bonazzola,María Florencia Legascue,María Florencia Mammarella,Florencia Sol Rodriguez,Federico Emanuel Aballay,Florencia Belén Catulo,Andana Barrios,Franco Accavallo,Santiago Nahuel Villarreal,Martin Crespi,Martiniano María Ricardi,Ezequiel Petrillo,Thomas Blein,Federico Ariel,Enzo Ferrante

Main category: cs.CV

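TL;DR: ChronoRoot 2.0 is an open-source platform that combines low-cost hardware with AI for temporal phenotyping of plant root development, addressing the low throughput and limited structural analysis of existing tools.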

Details

Motivation: Studying plant developmental plasticity (such as root system architecture) is essential for understanding plant adaptability and agricultural sustainability, but existing phenotyping technologies struggle to meet the needs of temporal analysis. Method: ChronoRoot 2.0 integrates multi-organ tracking, real-time quality control, comprehensive architectural measurements, and dedicated user interfaces, supporting both high-throughput screening and detailed analysis. Result: Three Arabidopsis use cases demonstrate the system: circadian growth patterns, gravitropic responses in transgenic plants, and etiolation-response screening across multiple genotypes. Conclusion: With its low cost, modularity, and open-source nature, ChronoRoot 2.0 gives the plant science community a more accessible tool for temporal phenotyping. Abstract: The analysis of plant developmental plasticity, including root system architecture, is fundamental to understanding plant adaptability and development, particularly in the context of climate change and agricultural sustainability. While significant advances have been made in plant phenotyping technologies, comprehensive temporal analysis of root development remains challenging, with most existing solutions providing either limited throughput or restricted structural analysis capabilities. Here, we present ChronoRoot 2.0, an integrated open-source platform that combines affordable hardware with advanced artificial intelligence to enable sophisticated temporal plant phenotyping. The system introduces several major advances, offering an integral perspective of seedling development: (i) simultaneous multi-organ tracking of six distinct plant structures, (ii) quality control through real-time validation, (iii) comprehensive architectural measurements including novel gravitropic response parameters, and (iv) dual specialized user interfaces for both architectural analysis and high-throughput screening. We demonstrate the system's capabilities through three use cases for Arabidopsis thaliana: characterization of circadian growth patterns under different light conditions, detailed analysis of gravitropic responses in transgenic plants, and high-throughput screening of etiolation responses across multiple genotypes. ChronoRoot 2.0 maintains its predecessor's advantages of low cost and modularity while significantly expanding its capabilities, making sophisticated temporal phenotyping more accessible to the broader plant science community. The system's open-source nature, combined with extensive documentation and containerized deployment options, ensures reproducibility and enables community-driven development of new analytical capabilities.

[92] SuperCL: Superpixel Guided Contrastive Learning for Medical Image Segmentation Pre-training

Shuang Zeng,Lei Zhu,Xinliang Zhang,Hangzhou He,Yanye Lu

Main category: cs.CV

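TL;DR: The paper proposes SuperCL, a new contrastive learning method for medical image segmentation pre-training that exploits the structural prior and pixel correlation of images to improve segmentation performance.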

Details

Motivation: Medical image segmentation suffers from a scarcity of high-quality annotated data. Existing contrastive learning methods focus on instance-level or pixel-level representations and ignore the characteristics of similar pixel groups within an image, and contrastive pair generation relies on manually set thresholds, which is inefficient and generalizes poorly. Method: SuperCL introduces two contrastive pair generation strategies, Intra-image Local Contrastive Pairs (ILCP) and Inter-image Global Contrastive Pairs (IGCP), uses superpixel maps to generate pseudo masks that guide supervised contrastive learning, and proposes the ASP and CCL modules to further exploit structural information. Result: Experiments on 8 medical image datasets show that SuperCL outperforms 12 existing methods, with DSC gains of 3.15%, 5.44%, and 7.89% on MMWHS, CHAOS, and Spleen with 10% annotations, and more precise predictions in visualizations. Conclusion: Through its contrastive pair generation strategies and use of structural information, SuperCL improves medical image segmentation, especially when annotated data is limited. Abstract: Medical image segmentation is a critical yet challenging task, primarily due to the difficulty of obtaining extensive datasets of high-quality, expert-annotated images. Contrastive learning presents a potential but still problematic solution to this issue, because most existing methods focus on extracting instance-level or pixel-to-pixel representation, which ignores the characteristics between intra-image similar pixel groups. Moreover, when considering contrastive pairs generation, most SOTA methods mainly rely on manually setting thresholds, which requires a large number of gradient experiments and lacks efficiency and generalization. To address these issues, we propose a novel contrastive learning approach named SuperCL for medical image segmentation pre-training. Specifically, our SuperCL exploits the structural prior and pixel correlation of images by introducing two novel contrastive pairs generation strategies: Intra-image Local Contrastive Pairs (ILCP) Generation and Inter-image Global Contrastive Pairs (IGCP) Generation. Considering superpixel cluster aligns well with the concept of contrastive pairs generation, we utilize the superpixel map to generate pseudo masks for both ILCP and IGCP to guide supervised contrastive learning. Moreover, we also propose two modules named Average SuperPixel Feature Map Generation (ASP) and Connected Components Label Generation (CCL) to better exploit the prior structural information for IGCP. Finally, experiments on 8 medical image datasets indicate our SuperCL outperforms 12 existing methods; i.e., our SuperCL achieves a superior performance with more precise predictions from visualization figures and 3.15%, 5.44%, 7.89% DSC higher than the previous best results on MMWHS, CHAOS, Spleen with 10% annotations. Our code will be released after acceptance.

[93] Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches

Guodong Shen,Yuqi Ouyang,Junru Lu,Yixuan Yang,Victor Sanchez

Main category: cs.CV

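TL;DR: The paper proposes a hybrid framework that combines vision transformers and ConvLSTMs in a bi-directional structure to optimize the single-task framework and improve video anomaly detection.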

Details

Motivation: Existing video anomaly detection methods adopt sub-optimal single-task frameworks within multi-task approaches; optimizing the single-task framework can advance both single- and multi-task methods. Method: Uses middle-frame prediction as the proxy task and designs a bi-directional structure integrating vision transformers and ConvLSTMs, strengthening predictions with a convolutional temporal transformer and a layer-interactive ConvLSTM bridge. Result: Experiments on public benchmarks confirm the effectiveness of the hybrid framework, both as a standalone single-task approach and as a branch in a multi-task approach. Conclusion: The hybrid framework combining vision transformers and ConvLSTMs improves the stability and accuracy of video anomaly detection. Abstract: Despite the prevailing transition from single-task to multi-task approaches in video anomaly detection, we observe that many adopt sub-optimal frameworks for individual proxy tasks. Motivated by this, we contend that optimizing single-task frameworks can advance both single- and multi-task approaches. Accordingly, we leverage middle-frame prediction as the primary proxy task, and introduce an effective hybrid framework designed to generate accurate predictions for normal frames and flawed predictions for abnormal frames. This hybrid framework is built upon a bi-directional structure that seamlessly integrates both vision transformers and ConvLSTMs. Specifically, we utilize this bi-directional structure to fully analyze the temporal dimension by predicting frames in both forward and backward directions, significantly boosting the detection stability. Given the transformer's capacity to model long-range contextual dependencies, we develop a convolutional temporal transformer that efficiently associates feature maps from all context frames to generate attention-based predictions for target frames. Furthermore, we devise a layer-interactive ConvLSTM bridge that facilitates the smooth flow of low-level features across layers and time-steps, thereby strengthening predictions with fine details. Anomalies are eventually identified by scrutinizing the discrepancies between target frames and their corresponding predictions. Several experiments conducted on public benchmarks affirm the efficacy of our hybrid framework, whether used as a standalone single-task approach or integrated as a branch in a multi-task approach. These experiments also underscore the advantages of merging vision transformers and ConvLSTMs for video anomaly detection.
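
The abstract states that anomalies are identified from the discrepancy between a target frame and its prediction. A rough sketch of that scoring idea follows. How the forward and backward predictions are fused is not stated in the abstract, so simple averaging is an assumption here, as are the per-video min-max normalization and all function names.

```python
def frame_error(target, forward_pred, backward_pred):
    """Mean squared error between a frame and the averaged two-direction prediction.

    Frames are flat lists of pixel intensities. Averaging the two predictions
    is an assumed fusion rule used only for illustration.
    """
    fused = [(f + b) / 2.0 for f, b in zip(forward_pred, backward_pred)]
    return sum((t - p) ** 2 for t, p in zip(target, fused)) / len(target)

def normalized_scores(errors):
    """Min-max normalize per-frame errors to [0, 1]; higher means more anomalous."""
    low, high = min(errors), max(errors)
    if high == low:
        return [0.0 for _ in errors]
    return [(e - low) / (high - low) for e in errors]

# The second frame is predicted worst, so it receives the top anomaly score.
scores = normalized_scores([1.0, 3.0, 2.0])
```

The design relies on the model predicting normal frames well and abnormal frames poorly, so a large error is read as evidence of an anomaly.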

[94] How Effective Can Dropout Be in Multiple Instance Learning?

Wenhui Zhu,Peijie Qiu,Xiwen Chen,Zhangsihao Yang,Aristeidis Sotiras,Abolfazl Razi,Yalin Wang

Main category: cs.CV

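TL;DR: The paper proposes MIL-Dropout, a new method that drops the most important instances in a bag to improve the performance and generalization of multiple instance learning (MIL).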

Details

Motivation: MIL is widely used for histological whole slide image (WSI) classification, but the conventional two-stage training scheme is limited by noisy feature embeddings and weak supervision. Method: Proposes MIL-Dropout, which systematically drops the most important instances in a bag to improve model performance. Result: Validated on five MIL benchmark datasets and two WSI datasets, where MIL-Dropout improves performance at negligible computational cost. Conclusion: MIL-Dropout is a simple and effective method that improves the performance and generalization of MIL models. Abstract: Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. However, the most commonly used technique (i.e., dropout) for mitigating this issue has yet to be explored in MIL. In this paper, we empirically explore how effective the dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at https://github.com/ChongQingNoSubway/MILDropout.
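
The key observation in this abstract, that dropping the top-k most important instances in a bag helps, can be sketched in a few lines. This is not the released MIL-Dropout code: the importance scores are assumed to be given (for example, attention weights), mean pooling stands in for the MIL aggregator, and the function names are hypothetical.

```python
def drop_top_k(instances, scores, k):
    """Remove the k instances with the highest importance scores from a bag."""
    if len(instances) != len(scores):
        raise ValueError("one score per instance is required")
    if not 0 <= k < len(instances):
        raise ValueError("k must leave at least one instance in the bag")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:k])
    return [inst for i, inst in enumerate(instances) if i not in dropped]

def mean_pool(instances):
    """Aggregate a bag of feature vectors into one bag-level vector."""
    dim = len(instances[0])
    return [sum(inst[d] for inst in instances) / len(instances) for d in range(dim)]

bag = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
importance = [0.1, 0.7, 0.2]  # assumed scores, e.g. attention weights
kept = drop_top_k(bag, importance, k=1)
bag_feature = mean_pool(kept)
```

During training, removing the dominant instance forces the aggregator to rely on the remaining evidence in the bag; at inference the full bag would normally be used, as with ordinary dropout.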

[95] When Cloud Removal Meets Diffusion Model in Remote Sensing

Zhenyu Yu,Mohd Yamani Idna Idris,Pei Wang

Main category: cs.CV

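TL;DR: DC4CR is a new multimodal diffusion-based cloud removal framework that uses prompt-driven control, low-rank adaptation, and related techniques to remove clouds from remote sensing imagery efficiently, without pre-generated cloud masks.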

Details

Motivation: Cloud occlusion severely hinders remote sensing applications; traditional methods depend on pre-generated cloud masks and are inefficient. DC4CR aims to provide a more efficient and adaptable solution. Method: Proposes the DC4CR framework, combining prompt-driven control, low-rank adaptation, subject-driven generation, and grouped learning for efficient cloud removal. Result: Performs strongly on the RICE and CUHK-CR datasets, achieving state-of-the-art cloud removal across diverse conditions. Conclusion: DC4CR offers a practical and efficient solution for remote sensing image processing with broad real-world applicability. Abstract: Cloud occlusion significantly hinders remote sensing applications by obstructing surface information and complicating analysis. To address this, we propose DC4CR (Diffusion Control for Cloud Removal), a novel multimodal diffusion-based framework for cloud removal in remote sensing imagery. Our method introduces prompt-driven control, allowing selective removal of thin and thick clouds without relying on pre-generated cloud masks, thereby enhancing preprocessing efficiency and model adaptability. Additionally, we integrate low-rank adaptation for computational efficiency, subject-driven generation for improved generalization, and grouped learning to enhance performance on small datasets. Designed as a plug-and-play module, DC4CR seamlessly integrates into existing cloud removal models, providing a scalable and robust solution. Extensive experiments on the RICE and CUHK-CR datasets demonstrate state-of-the-art performance, achieving superior cloud removal across diverse conditions. This work presents a practical and efficient approach for remote sensing image processing with broad real-world applications.

[96] Real-Time Sleepiness Detection for Driver State Monitoring System

Deepak Ghimire,Sunghwan Jeong,Sunhong Yoon,Sanghyun Park,Juhwan Choi

Main category: cs.CV

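TL;DR: Proposes a real-time driver eye state detection method that combines dynamic template matching with Kalman filter tracking, uses an SVM classifier to determine the eye state, and triggers an alarm when drowsiness is detected.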

Details

Motivation: Driver fatigue is a significant factor in many accidents; real-time monitoring of eye state can help prevent drowsy driving. Method: Tracks eye positions with dynamic template matching and Kalman filtering, and classifies eyes as open or closed using HOG features with an SVM classifier. Result: The system detects the driver's eye state in real time and triggers an alarm when drowsiness is detected. Conclusion: The method provides effective driver fatigue monitoring and can help reduce accidents caused by drowsy driving. Abstract: A driver face monitoring system can detect driver fatigue, which is a significant factor in many accidents, using computer vision techniques. In this paper, we present a real-time technique for driver eye state detection. First, the face is detected, and the eyes are located within the face region for tracking. A normalized cross-correlation-based online dynamic template matching technique, combined with Kalman filter tracking, is proposed to track the detected eye positions in subsequent image frames. A support vector machine with histogram of oriented gradients (HOG) features is used to classify the state of the eyes as open or closed. If the eyes remain closed for a specified period, the driver is considered to be asleep, and an alarm is triggered.
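
Two steps from this abstract lend themselves to a small sketch: template matching by normalized cross-correlation, and the rule that raises an alarm once the eyes stay closed for a specified period. The code below is an illustration under simplifying assumptions (grayscale images as nested lists, exhaustive search, a frame-count threshold); the Kalman filter and the HOG plus SVM classifier are omitted, and all names are hypothetical.

```python
import math

def ncc(patch, template):
    """Zero-mean normalized cross-correlation between two equally sized patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0  # flat patches carry no signal

def best_match(image, template):
    """Slide the template over the image; return (row, col, score) of the best fit."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -2.0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[2]:
                best = (r, c, score)
    return best

def is_asleep(closed_flags, min_closed_frames):
    """True once the eyes have been closed for min_closed_frames consecutive frames."""
    run = 0
    for closed in closed_flags:
        run = run + 1 if closed else 0
        if run >= min_closed_frames:
            return True
    return False

image = [[0, 0, 0, 0], [0, 0, 1, 9], [0, 0, 9, 1], [0, 0, 0, 0]]
eye_template = [[1, 9], [9, 1]]
row, col, score = best_match(image, eye_template)
```

In a tracker, the search would be restricted to a window around the Kalman-predicted eye position, and the template would be refreshed online from recent matches, which is what makes the matching dynamic.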

[97] ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Zhoujie Qian

Main category: cs.CV

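TL;DR: ECViT is an efficient hybrid architecture that combines the strengths of CNNs and Transformers, addressing the high computational cost and large data requirements of ViTs.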

Details

Motivation: Vision Transformers (ViTs) perform well in computer vision but face high computational costs and large data requirements. Method: ECViT introduces CNN inductive biases (such as locality and translation invariance) and combines local attention with a pyramid structure for efficient multi-scale feature extraction. Result: Experiments show ECViT performs strongly on image classification while keeping computational and storage requirements low. Conclusion: ECViT offers a suitable solution for applications that need both high efficiency and high performance. Abstract: Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. However, ViTs face challenges such as high computational costs due to the quadratic scaling of self-attention and the requirement of a large amount of training data. To address these limitations, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers. ECViT introduces inductive biases such as locality and translation invariance, inherent to Convolutional Neural Networks (CNNs) into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local-attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance.
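
The quadratic scaling mentioned in this abstract, and the saving that local attention offers, can be made concrete with a small count of query-key pairs. The numbers below (a 224x224 image, 16x16 patches, a 49-token window) are illustrative assumptions, not ECViT's configuration.

```python
def global_attention_pairs(num_tokens):
    """Every token attends to every token: cost grows as N squared."""
    return num_tokens * num_tokens

def local_attention_pairs(num_tokens, window):
    """Each token attends only to the tokens in its window: cost grows as N * w."""
    return num_tokens * min(window, num_tokens)

# Assumed example: 224 / 16 = 14 patches per side, so 196 tokens.
tokens = (224 // 16) ** 2
global_cost = global_attention_pairs(tokens)
local_cost = local_attention_pairs(tokens, window=49)
saving = global_cost / local_cost
```

With these assumed sizes, global attention scores 38,416 pairs while windowed attention scores 9,604, a fourfold reduction that grows with resolution because the window size stays fixed.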

[98] Distribution-aware Dataset Distillation for Efficient Image Restoration

Zhuoran Zheng,Xin Su,Chen Wu,Xiuyi Jia

Main category: cs.CV

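TL;DR: The paper proposes TripleD, which uses a pre-trained ViT to evaluate image complexity and select a subset, adjusts the feature distribution with a lightweight CNN, and trains in stages to handle image restoration tasks efficiently.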

Details

Motivation: Address the time-consuming training of image restoration models caused by the surge in image data, and fill the gap in dataset distillation for image restoration. Method: Uses a pre-trained ViT to extract features for complexity evaluation, selects a subset, adjusts its feature distribution with a lightweight CNN, and trains in two stages (simple samples first, then complex ones). Result: Performs well on several image restoration tasks; training on a 4K-resolution dataset completes in under 8 hours on a single consumer-grade GPU, saving substantial computing resources. Conclusion: TripleD provides an efficient dataset distillation scheme for image restoration, markedly reducing training cost while maintaining performance. Abstract: With the exponential increase in image data, training an image restoration model is laborious. Dataset distillation is a potential solution to this problem, yet current distillation techniques are a blank canvas in the field of image restoration. To fill this gap, we propose the Distribution-aware Dataset Distillation method (TripleD), a new framework that extends the principles of dataset distillation to image restoration. Specifically, TripleD uses a pre-trained vision Transformer to extract features from images for complexity evaluation, and the subset (the number of samples is much smaller than the original training set) is selected based on complexity. The selected subset is then fed through a lightweight CNN that fine-tunes the image distribution to align with the distribution of the original dataset at the feature level. To efficiently condense knowledge, the training is divided into two stages. Early stages focus on simpler, low-complexity samples to build foundational knowledge, while later stages select more complex and uncertain samples as the model matures. Our method achieves promising performance on multiple image restoration tasks, including multi-task image restoration, all-in-one image restoration, and ultra-high-definition image restoration tasks. Note that we can train a state-of-the-art image restoration model on an ultra-high-definition (4K resolution) dataset using only one consumer-grade GPU in less than 8 hours (500 savings in computing resources and immeasurable training time).
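
The two-stage schedule described in this abstract (low-complexity samples first, harder ones later) resembles a curriculum. The sketch below shows one way such a split could be expressed, assuming each selected sample already has a scalar complexity score; the scoring model, the split fraction, and the function name are assumptions and not the TripleD procedure.

```python
def two_stage_schedule(complexity, easy_fraction=0.5):
    """Split sample indices into an easy first stage and a harder second stage.

    `complexity` holds one score per sample (higher means harder). The
    fraction assigned to the first stage is an illustrative choice.
    """
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must lie in [0, 1]")
    order = sorted(range(len(complexity)), key=lambda i: complexity[i])
    cut = int(len(order) * easy_fraction)
    return order[:cut], order[cut:]

# Assumed complexity scores for four selected samples.
stage_one, stage_two = two_stage_schedule([0.9, 0.1, 0.5, 0.3])
```

Here samples 1 and 3 would be used in the first stage and samples 2 and 0 in the second. The abstract also mentions selecting uncertain samples later in training, which would require model feedback and is not modeled in this sketch.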

Xixi Wan,Aihua Zheng,Zi Wang,Bo Jiang,Jin Tang,Jixin Ma

Main category: cs.CV

TL;DR: Proposes MGRNet, a graph reasoning model that addresses uneven local feature quality and under-used cross-modal information in multi-modal ReID.

Details

Motivation: Existing methods overlook quality variations in local features and fail to fully exploit complementary cross-modal information, particularly when features are of low quality. Method: Constructs modality-aware graphs to enhance extraction of fine-grained local details, applies a selective graph node swap operation to mitigate the effect of low-quality features, and propagates multi-modal information through a local-aware graph reasoning module. Result: Achieves state-of-the-art performance on four benchmark datasets. Conclusion: MGRNet improves multi-modal ReID performance through graph reasoning and can reconstruct missing modal information. Abstract: Multi-modal data provides abundant and diverse object information, crucial for effective modal interactions in Re-Identification (ReID) tasks. However, existing approaches often overlook the quality variations in local features and fail to fully leverage the complementary information across modalities, particularly in the case of low-quality features. In this paper, we propose to address this issue by leveraging a novel graph reasoning model, termed the Modality-aware Graph Reasoning Network (MGRNet). Specifically, we first construct modality-aware graphs to enhance the extraction of fine-grained local details by effectively capturing and modeling the relationships between patches. Subsequently, the selective graph nodes swap operation is employed to alleviate the adverse effects of low-quality local features by considering both local and global information, enhancing the representation of discriminative information. Finally, the swapped modality-aware graphs are fed into the local-aware graph reasoning module, which propagates multi-modal information to yield a reliable feature representation. Another advantage of the proposed graph reasoning approach is its ability to reconstruct missing modal information by exploiting inherent structural relationships, thereby minimizing disparities between different modalities. Experimental results on four benchmarks (RGBNT201, Market1501-MM, RGBNT100, MSVR310) indicate that the proposed method achieves state-of-the-art performance in multi-modal object ReID. The code for our method will be available upon acceptance.

[100] Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation

Yunpu Zhao,Rui Zhang,Junbin Xiao,Ruibo Hou,Jiaming Guo,Zihao Zhang,Yifan Hao,Yunji Chen

Main category: cs.CV

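TL;DR: Proposes Confidence Calibration through Semantic Perturbation (CSP), a method for calibrating the verbalized confidence of vision-language models (VLMs) that markedly improves the alignment between confidence and response correctness.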

Details

Motivation: VLMs perform well on multimodal tasks but are poorly calibrated, which undermines user trust, especially when a model expresses high confidence in incorrect or fabricated information. Method: Perturbs key object regions with Gaussian noise to simulate visual uncertainty, establishes a mapping between visual ambiguity and confidence, and trains in two stages combining supervised fine-tuning and preference optimization. Result: On multiple benchmarks, the method significantly improves the alignment between confidence and correctness while maintaining or improving task performance. Conclusion: Semantic perturbation is a practical tool for improving the reliability and interpretability of VLMs. Abstract: Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration, resulting in misalignment between their verbalized confidence and response correctness. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. In this work, we propose a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries. We first introduce a perturbed dataset where Gaussian noise is applied to the key object regions to simulate visual uncertainty at different confidence levels, establishing an explicit mapping between visual ambiguity and confidence levels. We further enhance calibration through a two-stage training process combining supervised fine-tuning on the perturbed dataset with subsequent preference optimization. Extensive experiments on popular benchmarks demonstrate that our method significantly improves the alignment between verbalized confidence and response correctness while maintaining or enhancing overall task performance. These results highlight the potential of semantic perturbation as a practical tool for improving the reliability and interpretability of VLMs.
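
The perturbation step in this abstract, Gaussian noise applied only to the key object region, can be sketched as follows. The image format (nested lists of intensities in [0, 1]), the box convention, the noise levels, and the function name are assumptions for illustration; the paper's mapping from noise level to confidence label is not reproduced here.

```python
import random

def perturb_region(image, box, sigma, seed=0):
    """Return a copy of the image with Gaussian noise added inside the box only.

    `box` is (top, left, bottom, right) with exclusive bottom and right.
    Values are clipped to [0, 1]; pixels outside the box are left untouched.
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    noisy = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            value = noisy[r][c] + rng.gauss(0.0, sigma)
            noisy[r][c] = min(1.0, max(0.0, value))
    return noisy

image = [[0.5] * 4 for _ in range(4)]
object_box = (1, 1, 3, 3)  # assumed location of the queried object

# Stronger noise hides more of the object, simulating greater visual uncertainty.
variants = {sigma: perturb_region(image, object_box, sigma) for sigma in (0.0, 0.1, 0.3)}
```

Each noise level would then be paired with a target confidence, so that the model learns to verbalize lower confidence when the queried object is harder to see.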

[101] Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer

Ziyi Liu,Yangcen Liu

Main category: cs.CV

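TL;DR: PseudoFormer is a two-branch framework that narrows the performance gap between weakly and fully supervised temporal action localization by generating high-quality pseudo labels and exploiting different priors.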

Details

Motivation: Weakly-supervised temporal action localization (WTAL) underperforms because it lacks temporal annotations, and existing methods face challenges in pseudo-label quality, use of priors, and training with noisy labels. Method: Introduces RickerFusion, which maps action proposals to a shared space to generate pseudo labels, trains a regression-based model with both snippet-level and proposal-level labels, and applies an uncertainty mask with iterative refinement. Result: Achieves state-of-the-art results on the THUMOS14 and ActivityNet1.3 benchmarks; ablation studies confirm the contribution of each component. Conclusion: PseudoFormer addresses the key challenges of WTAL and significantly improves performance. Abstract: Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. To address these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, the uncertainty mask and iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. Besides, extensive ablation studies demonstrate the contribution of each component of our method.

[102] Twin Co-Adaptive Dialogue for Progressive Image Generation

Jianhui Wang,Yangfan He,Yan Zhong,Xinyuan Song,Jiayi Su,Yuheng Feng,Hongyang He,Wenyu Zhu,Xinhang Yuan,Kuan Lu,Menghao Huo,Miao Zhang,Keqin Li,Jiaqi Chen,Tianyu Shi,Xueqian Wang

Main category: cs.CV

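TL;DR: Twin-Co is a framework that progressively refines image generation through dynamic dialogue, reducing user trial and error and improving image quality.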

Details

Motivation: Addresses the weakness of existing text-to-image generation systems in handling ambiguous user prompts. Method: Uses a synchronized, co-adaptive dialogue mechanism to refine image generation iteratively. Result: Improves user experience and generated image quality while reducing trial and error. Conclusion: Twin-Co refines image generation through dynamic dialogue, with notable gains in output quality. Abstract: Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.

[103] ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams

Chris Dongjoo Kim,Jihwan Moon,Sangwoo Moon,Heeseung Yun,Sihaeng Lee,Aniruddha Kembhavi,Soonyoung Lee,Gunhee Kim,Sangho Lee,Christopher Clark

Main category: cs.CV

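TL;DR: ReSpec is a relevance- and specificity-based online filtering framework that filters video-text data in real time, substantially reducing storage and compute requirements while achieving state-of-the-art zero-shot video retrieval performance.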

Details

Motivation: Addresses the storage and compute challenges created by the rapid growth of video-text data while meeting real-time responsiveness requirements. Method: Proposes the ReSpec framework, which filters data in real time using four criteria: modality alignment, task relevance, specificity, and efficiency. Result: On the WebVid2M and VideoCC3M datasets, achieves state-of-the-art zero-shot video retrieval performance using as little as 5% of the data. Conclusion: Through efficient data filtering, ReSpec substantially reduces compute and storage requirements while maintaining high performance. Abstract: The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose Relevance and Specificity-based online filtering framework (ReSpec) that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluating on large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zeroshot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
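
The filtering criteria in the abstract above can be illustrated with a small per-sample gate. This is a sketch under stated assumptions rather than the released ReSpec code: cosine similarity stands in for the paper's probabilistic relevance measure, and the function name `keep_sample` and all three thresholds are hypothetical. Only the use of distance to a root embedding as a specificity proxy comes from the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def keep_sample(video_emb, text_emb, task_refs, root_emb,
                align_min=0.3, relevance_min=0.5, specificity_min=0.5):
    """Decide online whether one incoming video-text pair is kept.

    task_refs: reference embeddings built from target-task data
    root_emb:  embedding representing the least specific text
    All thresholds are illustrative values, not taken from the paper.
    """
    # (i) modality alignment: video and caption should agree
    aligned = cosine(video_emb, text_emb) >= align_min
    # (ii) task relevance: caption should be close to some task reference
    relevant = max(cosine(text_emb, ref) for ref in task_refs) >= relevance_min
    # (iii) specificity: caption should lie far from the generic root embedding
    specific = euclidean(text_emb, root_emb) >= specificity_min
    return aligned and relevant and specific
```

Because each decision depends only on the incoming pair and a few fixed reference points, the gate can run on a stream without storing past data, which matches criterion (iv), low-latency processing.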

[104] Collaborative Enhancement Network for Low-quality Multi-spectral Vehicle Re-identification

Aihua Zheng,Yongqi Sun,Zi Wang,Chenglong Li,Jin Tang

Main category: cs.CV

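TL;DR: The paper proposes the Collaborative Enhancement Network (CoEN), which generates a high-quality proxy and dynamically selects the primary spectrum to improve multi-spectral vehicle re-identification.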

Details

Motivation: Existing methods rely on a primary spectrum to enhance low-quality spectral data, but choosing the primary spectrum is difficult and a low-quality primary spectrum weakens the enhancement. Method: Designs a Proxy Generator (PG), a Dynamic Quality Sort Module (DQSM), and a Collaborative Enhancement Module (CEM) to generate the proxy, dynamically select the primary spectrum, and collaboratively enhance spectral features, respectively. Result: Experiments on three benchmark datasets show that CoEN outperforms other multi-spectral vehicle ReID methods. Conclusion: Through collaborative enhancement and dynamic primary-spectrum selection, CoEN improves the robustness of multi-spectral vehicle ReID. Abstract: The performance of multi-spectral vehicle Re-identification (ReID) is significantly degraded when some important discriminative cues in visible, near infrared and thermal infrared spectra are lost. Existing methods generate or enhance missing details in low-quality spectra data using the high-quality one, generally called the primary spectrum, but how to justify the primary spectrum is a challenging problem. In addition, when the quality of the primary spectrum is low, the enhancement effect would be greatly degraded, thus limiting the performance of multi-spectral vehicle ReID. To address these problems, we propose the Collaborative Enhancement Network (CoEN), which generates a high-quality proxy from all spectra data and leverages it to supervise the selection of primary spectrum and enhance all spectra features in a collaborative manner, for robust multi-spectral vehicle ReID. First, to integrate the rich cues from all spectra data, we design the Proxy Generator (PG) to progressively aggregate multi-spectral features. Second, we design the Dynamic Quality Sort Module (DQSM), which sorts all spectra data by measuring their correlations with the proxy, to accurately select the primary spectra with the highest correlation. Finally, we design the Collaborative Enhancement Module (CEM) to effectively compensate for missing contents of all spectra by collaborating the primary spectra and the proxy, thereby mitigating the impact of low-quality primary spectra. Extensive experiments on three benchmark datasets are conducted to validate the efficacy of the proposed approach against other multi-spectral vehicle ReID methods. The codes will be released at https://github.com/yongqisun/CoEN.
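
The sorting idea behind the DQSM described above (rank spectra by their correlation with the proxy, then take the top one as primary) might look like the following. It is a simplified sketch: Pearson correlation over flat feature vectors is an assumed stand-in for whatever correlation measure the paper uses, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def sort_spectra_by_quality(features, proxy):
    """Rank spectra by correlation with the proxy, highest first.

    features: dict mapping a spectrum name to its feature vector
    proxy:    feature vector aggregated from all spectra
    Returns (ranking, scores); ranking[0] is the selected primary spectrum.
    """
    scores = {name: pearson(vec, proxy) for name, vec in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores
```

For example, with a proxy of `[1, 2, 3, 4]`, a spectrum whose features rise with the proxy ranks above one whose features are reversed, so a degraded spectrum is less likely to be chosen as primary.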

[105] Memory-Augmented Dual-Decoder Networks for Multi-Class Unsupervised Anomaly Detection

Jingyu Xing,Chenwei Tang,Tao Wang,Rong Xiao,Wei Ju,Ji-Zhe Zhou,Liangli Zhen,Jiancheng Lv

Main category: cs.CV

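TL;DR: The paper proposes MDD-Net, a memory-augmented dual-decoder network that addresses over-generalization and insufficient reconstruction of normal features in multi-class unsupervised anomaly detection.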

Details

Motivation: In multi-class unsupervised anomaly detection, reconstruction-based methods face two main challenges: over-generalization, which makes anomalies hard to distinguish, and insufficient reconstruction of normal features, which causes false positives. Existing methods usually address only the former and in doing so worsen the latter. Method: MDD-Net consists of a Dual-Decoder Reverse Distillation Network (DRD-Net) and a Class-aware Memory Module (CMM). DRD-Net refines anomaly scores using the feature discrepancy between its two decoders, while the CMM stores class-specific normal prototypes to steer the network away from reconstructing anomalies. Result: On multiple benchmarks, MDD-Net outperforms current state-of-the-art multi-class unsupervised anomaly detection methods. Conclusion: By combining dual decoders with a memory module, MDD-Net addresses both challenges of multi-class unsupervised anomaly detection and significantly improves performance. Abstract: Recent advances in unsupervised anomaly detection (UAD) have shifted from single-class to multi-class scenarios. In such complex contexts, the increasing pattern diversity has brought two challenges to reconstruction-based approaches: (1) over-generalization: anomalies that are subtle or share compositional similarities with normal patterns may be reconstructed with high fidelity, making them difficult to distinguish from normal instances; and (2) insufficient normality reconstruction: complex normal features, such as intricate textures or fine-grained structures, may not be faithfully reconstructed due to the model's limited representational capacity, resulting in false positives. Existing methods typically focus on addressing the former, which unintentionally exacerbates the latter, resulting in inadequate representation of intricate normal patterns. To concurrently address these two challenges, we propose a Memory-augmented Dual-Decoder Networks (MDD-Net). This network includes two critical components: a Dual-Decoder Reverse Distillation Network (DRD-Net) and a Class-aware Memory Module (CMM). Specifically, the DRD-Net incorporates a restoration decoder designed to recover normal features from synthetic abnormal inputs and an identity decoder to reconstruct features that maintain the anomalous semantics. By exploiting the discrepancy between features produced by two decoders, our approach refines anomaly scores beyond the conventional encoder-decoder comparison paradigm, effectively reducing false positives and enhancing localization accuracy. Furthermore, the CMM explicitly encodes and preserves class-specific normal prototypes, actively steering the network away from anomaly reconstruction. Comprehensive experimental results across several benchmarks demonstrate the superior performance of our MDD-Net framework over current SoTA approaches in multi-class UAD tasks.
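
The scoring idea in the MDD-Net abstract, comparing what the restoration decoder and the identity decoder produce at each location, can be sketched as below. This is only an illustration of decoder-discrepancy scoring: the per-location cosine distance and the max-pooled image score are assumptions for the example, and the abstract does not state how the paper combines this with the encoder-decoder comparison.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if not na or not nb:
        return 0.0
    return 1.0 - dot / (na * nb)

def anomaly_map(restored, identity):
    """Per-location discrepancy between the two decoders' features.

    restored: features from the restoration decoder (pushed toward normal)
    identity: features from the identity decoder (keeps anomalous semantics)
    Both are lists of feature vectors, one per spatial location.
    """
    return [cosine_distance(r, i) for r, i in zip(restored, identity)]

def image_score(score_map):
    """Hypothetical image-level score: the largest local discrepancy."""
    return max(score_map)
```

Where the input is normal, both decoders should produce similar features and the distance stays near zero; where it is anomalous, the restored and identity features diverge and the score rises.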

[106] WMKA-Net: A Weighted Multi-Kernel Attention Network Method for Retinal Vessel Segmentation

Xinran Xu,Yuliang Ma,Sifu Cai

Main category: cs.CV

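TL;DR: Proposes WMKA-Net, a novel retinal vessel segmentation network that combines multi-kernel feature fusion, progressive feature weighting, and an attention mechanism to significantly improve segmentation of small vessels and low-contrast regions.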

Details

Motivation: Addresses insufficient multi-scale feature capture, loss of contextual information, and noise sensitivity in retinal vessel segmentation. Method: Combines a multi-kernel feature fusion module (MKDC), a progressive feature weighting fusion strategy (UDFF), and an attention mechanism module (AttentionBlock) for multi-scale feature extraction, feature information optimization, and key-region enhancement, respectively. Result: Performs well on multiple public datasets, with notable gains in small-vessel segmentation and in handling pathological regions. Conclusion: WMKA-Net provides an efficient and robust new method for retinal vessel segmentation. Abstract: We propose a novel retinal vessel segmentation network, the Weighted Multi-Kernel Attention Network (WMKA-Net), which aims to address the issues of insufficient multiscale feature capture, loss of contextual information, and noise sensitivity in retinal vessel segmentation. WMKA-Net significantly improves the segmentation performance of small vessels and low-contrast regions by integrating several innovative components, including the Multi-Kernel Feature Fusion Module (MKDC), the Progressive Feature Weighting Fusion Strategy (UDFF), and the Attention Mechanism Module (AttentionBlock). The MKDC module employs multiscale parallel convolutional kernels to extract vessel characteristics, thereby enhancing the ability to capture complex vascular structures. The UDFF strategy optimizes the transmission of feature information by weighted fusion of high- and low-level features. The AttentionBlock highlights key regions and suppresses noise interference through the attention mechanism. Experimental results demonstrate that WMKA-Net achieves excellent segmentation performance in multiple public datasets, particularly in segmentation of small vessels and processing of pathological regions. This work provides a robust and efficient new method for segmentation of the retinal vessel.

[107] Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Chenjie Cao,Jingkai Zhou,Shikai Li,Jingyun Liang,Chaohui Yu,Fan Wang,Xiangyang Xue,Yanwei Fu

Main category: cs.CV

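TL;DR: Uni3C is a unified 3D-enhanced framework for precise control of camera and human motion in video generation, using point clouds and SMPL-X characters for flexible control.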

Details

Motivation: Existing methods usually handle camera control and human motion control separately, and high-quality annotated data is limited. Method: Proposes the PCDController module for camera control and a jointly aligned 3D world guidance that unifies the camera and human motion control signals. Result: Uni3C substantially outperforms competitors in both camera controllability and human motion quality, confirming the effectiveness of the method. Conclusion: By unifying control signals and training modules separately, Uni3C reduces the dependence on jointly annotated data and generalizes well. Abstract: Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.

[108] Guidelines for External Disturbance Factors in the Use of OCR in Real-World Environments

Kenji Iwata,Eiki Ishidera,Toshifumi Yamaai,Yutaka Satoh,Hiroshi Tanaka,Katsuhiko Takahashi,Akio Furuhata,Yoshihisa Tanabe,Hiroshi Matsumura

Main category: cs.CV

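TL;DR: The paper summarizes the external disturbance factors that degrade OCR performance and organizes them into guidelines to help users apply OCR correctly.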

Details

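Motivation: As OCR is applied more widely, external disturbance factors can degrade performance and reduce recognition accuracy, so these factors need to be organized systematically and paired with countermeasures. Method: Collects the external disturbance factors and image degradation phenomena encountered in real use, compiles them into an external disturbance factor table, and turns the table into usage guidelines. Result: Presents an external disturbance factor table and accompanying usage guidelines that help users get better OCR performance. Conclusion: Systematically organizing disturbance factors and providing guidelines can improve the performance and reliability of OCR in real-world applications. Abstract: The performance of OCR has improved with the evolution of AI technology. As OCR continues to broaden its range of applications, the increased likelihood of interference introduced by various usage environments can prevent it from achieving its inherent performance. This results in reduced recognition accuracy under certain conditions, and makes the quality control of recognition devices more challenging. Therefore, to ensure that users can properly utilize OCR, we compiled the real-world external disturbance factors that cause performance degradation, along with the resulting image degradation phenomena, into an external disturbance factor table and, by also indicating how to make use of it, organized them into guidelines.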

[109] GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

Donghyeong Kim,Chaewon Park,Suhwan Cho,Hyeonjeong Lim,Minseok Kang,Jungho Lee,Sangyoun Lee

Main category: cs.CV

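TL;DR: GenCLIP is a framework that learns and uses general prompts more effectively through multi-layer prompting and dual-branch inference, addressing key challenges in zero-shot anomaly detection.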

Details

Motivation: Zero-shot anomaly detection (ZSAD) relies on CLIP's zero-shot ability to match text prompts with visual features, but learning general prompts stably and deploying them effectively remains challenging. Method: GenCLIP uses multi-layer prompting to integrate category-specific visual cues from different CLIP layers and a dual-branch inference strategy to balance specificity and generalization. Result: With an adaptive text prompt filtering mechanism and multi-layer visual features, the method generalizes better and makes anomaly detection more stable and reliable. Conclusion: GenCLIP improves zero-shot anomaly detection through its prompt learning and inference strategies. Abstract: Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.

[110] DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

Geng Li,Jinglin Xu,Yunzhen Zhao,Yuxin Peng

Main category: cs.CV

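TL;DR: Dyfo is a training-free visual search method that uses bidirectional interaction and the MCTS algorithm to simulate human-like focus adjustment, improving the fine-grained visual understanding of large multimodal models.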

Details

Motivation: Inspired by the human visual search mechanism, the work aims to remove the need for the additional modules or data collection that existing methods require. Method: Uses bidirectional interaction and the MCTS algorithm to adjust visual focus dynamically, without extra training or modules. Result: Significantly improves fine-grained visual understanding, reduces hallucination, and performs well on both fixed- and dynamic-resolution models. Conclusion: Dyfo gives large multimodal models an efficient, training-free visual search solution. Abstract: Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose Dyfo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches which require additional modules or data collection, Dyfo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content, without introducing additional training caused by vocabulary expansion or the integration of specialized localization modules. Experimental results demonstrate that Dyfo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and dynamic resolution models. The code is available at https://github.com/PKU-ICST-MIPL/DyFo_CVPR2025

[111] Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on Videos

Songping Wang,Hanqing Liu,Yueming Lyu,Xiantao Hu,Ziwen He,Wei Wang,Caifeng Shan,Liang Wang

Main category: cs.CV

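TL;DR: VFAT-WS is a fast adversarial training method for video data that uses temporal frequency augmentation and weak-to-strong consistency regularization to significantly improve training efficiency and robustness.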

Details

Motivation: Existing video adversarial training methods are computationally expensive and struggle to balance clean accuracy against adversarial robustness; VFAT-WS aims to address both problems. Method: Combines temporal frequency augmentation (TF-AUG) and its spatial-temporal enhanced form (STF-AUG) with a single-step PGD attack, and adds weak-to-strong consistency regularization. Result: On UCF-101 and HMDB-51, VFAT-WS significantly improves adversarial robustness and corruption robustness while accelerating training by nearly 490%. Conclusion: Through an efficient design and consistency regularization, VFAT-WS balances clean accuracy and adversarial robustness, offering a practical solution for video adversarial training. Abstract: Adversarial Training (AT) has been shown to significantly enhance adversarial robustness via a min-max optimization approach. However, its effectiveness in video recognition tasks is hampered by two main challenges. First, fast adversarial training for video models remains largely unexplored, which severely impedes its practical applications. Specifically, most video adversarial training methods are computationally costly, with long training times and high expenses. Second, existing methods struggle with the trade-off between clean accuracy and adversarial robustness. To address these challenges, we introduce Video Fast Adversarial Training with Weak-to-Strong consistency (VFAT-WS), the first fast adversarial training method for video data. Specifically, VFAT-WS incorporates the following key designs: First, it integrates a straightforward yet effective temporal frequency augmentation (TF-AUG), and its spatial-temporal enhanced form STF-AUG, along with a single-step PGD attack to boost training efficiency and robustness. Second, it devises a weak-to-strong spatial-temporal consistency regularization, which seamlessly integrates the simpler TF-AUG and the more complex STF-AUG. Leveraging the consistency regularization, it steers the learning process from simple to complex augmentations. Both of them work together to achieve a better trade-off between clean accuracy and robustness. Extensive experiments on UCF-101 and HMDB-51 with both CNN and Transformer-based models demonstrate that VFAT-WS achieves great improvements in adversarial robustness and corruption robustness, while accelerating training by nearly 490%.

[112] TWIG: Two-Step Image Generation using Segmentation Masks in Diffusion Models

Mazharul Islam Rakib,Showrin Rahman,Joyanta Jyoti Mondal,Xi Xiao,David Lewis,Alessandra Mileo,Meem Arafat Manab

Main category: cs.CV

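TL;DR: Proposes a two-step image generation method based on conditional diffusion models: it generates an image segmentation mask and then avoids that shape, reducing structural similarity to training images and thereby avoiding copyright infringement and source copying.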

Details

Motivation: Addresses the risk that generative AI models infringe copyright or directly copy source images; traditional measures such as watermarks and metadata have limited effect. Method: A two-step approach: first generate an image segmentation mask that captures the image's shape, then have the diffusion model regenerate the image while avoiding that shape. Result: The method reduces the structural similarity between generated images and training images and avoids source copying, without expensive model retraining or user-centered prompt generation techniques. Conclusion: The method is a computationally inexpensive and effective way to avoid copyright infringement and source copying in diffusion-model-based image generation. Abstract: In today's age of social media and marketing, copyright issues can be a major roadblock to the free sharing of images. Generative AI models have made it possible to create high-quality images, but concerns about copyright infringement are a hindrance to their abundant use. As these models use data from training images to generate new ones, it is often a daunting task to ensure they do not violate intellectual property rights. Some AI models have even been noted to directly copy copyrighted images, a problem often referred to as source copying. Traditional copyright protection measures such as watermarks and metadata have also proven to be futile in this regard. To address this issue, we propose a novel two-step image generation model inspired by the conditional diffusion model. The first step involves creating an image segmentation mask for some prompt-based generated images. This mask embodies the shape of the image. Thereafter, the diffusion model is asked to generate the image anew while avoiding the shape in question. This approach shows a decrease in structural similarity from the training image, i.e. we are able to avoid the source copying problem using this approach without expensive retraining of the model or user-centered prompt generation techniques. This makes our approach the most computationally inexpensive approach to avoiding both copyright infringement and source copying for diffusion model-based image generation.

[113] PIV-FlowDiffuser: Transfer-learning-based denoising diffusion models for PIV

Qianyu Zhu,Junjie Wang,Jeremiah Hu,Jia Ai,Yong Lee

Main category: cs.CV

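TL;DR: The paper proposes a PIV analysis method based on a denoising diffusion model (FlowDiffuser), trained with a transfer learning strategy, that significantly reduces noise and improves performance.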

Details

Motivation: Deep learning performs well in PIV, but models trained on synthetic data degrade on real particle images because of the domain gap, leaving characteristic noise patterns that need to be removed. Method: Uses a denoising diffusion model (FlowDiffuser) with a two-stage strategy: pre-training on optical flow datasets from the computer vision community, then fine-tuning on synthetic PIV data. Result: PIV-FlowDiffuser reduces the average end-point error (AEE) by 59.4% and generalizes better to unseen particle images. Conclusion: Transfer-learning-based denoising diffusion models offer clear advantages for PIV; implementation details are available in the provided code repository. Abstract: Deep learning algorithms have significantly reduced the computational time and improved the spatial resolution of particle image velocimetry (PIV). However, the models trained on synthetic datasets might have a degraded performance on practical particle images due to domain gaps. As a result, special residual patterns are often observed for the vector fields of deep learning-based estimators. To reduce the special noise step-by-step, we employ a denoising diffusion model (FlowDiffuser) for PIV analysis. And the data-hungry iterative denoising diffusion model is trained via a transfer learning strategy, resulting in our PIV-FlowDiffuser method. Specifically, (1) pre-training a FlowDiffuser model with multiple optical flow datasets of the computer vision community, such as Sintel, KITTI, etc; (2) fine-tuning the pre-trained model on synthetic PIV datasets. Note that the PIV images are upsampled by a factor of two to resolve the small-scale turbulent flow structures. The visualized results indicate that our PIV-FlowDiffuser effectively suppresses the noise patterns. Therefore, the denoising diffusion model reduces the average end-point error (AEE) by 59.4% over RAFT256-PIV baseline on the classic Cai's dataset. Besides, PIV-FlowDiffuser exhibits enhanced generalization performance on unseen particle images due to transfer learning. Overall, this study highlights the transfer-learning-based denoising diffusion models for PIV. And a detailed implementation is recommended for interested readers in the repository https://github.com/Zhu-Qianyu/PIV-FlowDiffuser.
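
For readers unfamiliar with the metric quoted above, the average end-point error is the mean Euclidean distance between predicted and ground-truth displacement vectors. The sketch below shows the standard definition for 2D vectors plus the relative-reduction arithmetic behind a figure such as "59.4% lower than a baseline"; the function names are illustrative and no values from the paper are reproduced.

```python
import math

def average_endpoint_error(pred, truth):
    """Mean Euclidean distance between predicted and true 2D flow vectors.

    pred, truth: equal-length sequences of (u, v) displacement pairs.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    total = sum(math.hypot(pu - tu, pv - tv)
                for (pu, pv), (tu, tv) in zip(pred, truth))
    return total / len(pred)

def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

As a small check of the definition, predictions `[(1, 0), (0, 0)]` against ground truth `[(1, 0), (3, 4)]` give per-vector errors of 0 and 5, so the AEE is 2.5.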

[114] 3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations

Yating Wang,Xuan Wang,Ran Yi,Yanbo Fan,Jichen Hu,Jingcheng Zhu,Lizhuang Ma

Main category: cs.CV

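TL;DR: Proposes a new method that combines 3D Gaussians with 3DMM and uses a compact tensorial representation with dynamic texture encoding to capture dynamic details of high-quality 3D head avatars while reducing storage and compute overhead.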

Details

Motivation: Existing methods fall short in either dynamic texture capture or runtime efficiency, and cannot deliver high quality and low overhead together. Method: Encodes the texture attributes of the 3D Gaussians in a tensorial format, stores the static neutral expression in tri-planes, represents dynamic texture details with lightweight 1D feature lines, and introduces an adaptive truncated opacity penalty and class-balanced sampling. Result: Experiments show that the method accurately captures dynamic facial details, keeps rendering real-time, and significantly reduces storage costs. Conclusion: The method balances dynamic detail capture with efficiency, broadening the application scenarios of 3D head avatars. Abstract: Recent studies have combined 3D Gaussian and 3D Morphable Models (3DMM) to construct high-quality 3D head avatars. In this line of research, existing methods either fail to capture the dynamic textures or incur significant overhead in terms of runtime speed or storage space. To this end, we propose a novel method that addresses all the aforementioned demands. Specifically, we introduce an expressive and compact representation that encodes texture-related attributes of the 3D Gaussians in the tensorial format. We store appearance of neutral expression in static tri-planes, and represent dynamic texture details for different expressions using lightweight 1D feature lines, which are then decoded into opacity offset relative to the neutral face. We further propose adaptive truncated opacity penalty and class-balanced sampling to improve generalization across different expressions. Experiments show this design enables accurate face dynamic details capturing while maintaining real-time rendering and significantly reducing storage costs, thus broadening the applicability to more scenarios.

[115] Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization

Hongbin Xu,Chaohui Yu,Feng Xiao,Jiazheng Xing,Hai Ci,Weitao Chen,Ming Li

Main category: cs.CV

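TL;DR: The paper proposes Cyc3D, a new framework that strengthens controllable 3D generation through cycle consistency, significantly improving the consistency between generated content and input conditions.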

Details

Motivation: Existing 3D generation methods struggle to keep generated content consistent with input conditions such as edges and depth, leading to noticeable discrepancies. Method: Uses an efficient feed-forward backbone and a cyclic process, with view consistency and condition consistency constraints, that generates and then regenerates the 3D content. Result: Experiments show that Cyc3D significantly outperforms existing methods on several benchmarks (for example, +14.17% PSNR for edge and +6.26% PSNR for sketch). Conclusion: Through cycle consistency constraints, Cyc3D improves the controllability of 3D generation, especially for fine-grained details. Abstract: Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions like edge and depth, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose Cyc3D, a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated based on extracted signals from the first-order generation, and its original input controls. Specifically, we employ an efficient feed-forward backbone that can generate a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, from which the extracted condition is used to regenerate the 3D content. This re-generated output is then rendered back to the initial viewpoint, followed by another round of control signal extraction, forming a cyclic process with two consistency constraints. View consistency ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. Condition consistency aligns the final extracted signal with the original input control, preserving structural or geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that Cyc3D significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17% PSNR for edge, +6.26% PSNR for sketch).

[116] RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

Jingkai Zhou,Yifan Wu,Shikai Li,Min Wei,Chao Fan,Weihua Chen,Wei Jiang,Fan Wang

Main category: cs.CV

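TL;DR: The paper proposes RealisDance-DiT, a simple modification of a strong foundation model combined with flexible fine-tuning strategies, which handles rare poses, stylized characters, and other challenges in controllable character animation and significantly outperforms existing methods in experiments.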

Details

Motivation: Addresses challenges in controllable character animation such as rare poses, stylized characters, and character-object interactions, and avoids the poor open-world generalization of earlier approaches. Method: Builds RealisDance-DiT on the Wan-2.1 video foundation model, using minimal architectural modifications and flexible fine-tuning strategies (low-noise warmup and "large batches and small iterations"). Result: RealisDance-DiT outperforms existing methods by a large margin in experiments, and a new test dataset is used to confirm its robustness. Conclusion: A strong foundation model combined with simple modifications and flexible fine-tuning strategies can address the complex challenges of controllable character animation. Abstract: Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective that, as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our sufficient analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as TikTok dataset and UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.

[117] Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hong-Tao Yu,Xiu-Shen Wei,Yuxin Peng,Serge Belongie

Main category: cs.CV

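TL;DR: The paper introduces FG-BMK, a fine-grained evaluation benchmark for assessing the semantic recognition and fine-grained feature representation capabilities of large vision-language models (LVLMs), filling a gap in existing research.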

Details

Motivation: Current evaluations of LVLMs focus on holistic and specialized tasks, while fine-grained image tasks remain largely unexplored, so a comprehensive evaluation benchmark is needed. Method: The authors build the FG-BMK benchmark, with 3.49 million questions and 3.32 million images, and systematically evaluate eight representative LVLMs/VLMs from both human-oriented and machine-oriented perspectives. Result: The experiments reveal how training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning affect task performance, giving key insights into the limitations of LVLMs. Conclusion: The work offers guidance for future data construction and model design, and releases its code to support the development of more advanced LVLMs. Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks-fundamental to computer vision-remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on eight representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

[118] NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: KwaiSR Dataset and Study

Xin Li,Xijun Wang,Bingchen Li,Kun Yuan,Yizhen Shao,Suhang Yao,Ming Sun,Chao Zhou,Radu Timofte,Zhibo Chen

Main category: cs.CV

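TL;DR: KwaiSR is the first benchmark dataset for image super-resolution on short-form user-generated content (UGC). It comprises a synthetic part and a wild part and is intended to advance research on image super-resolution algorithms for short-form UGC platforms.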

Details

Motivation: To advance research on image super-resolution algorithms for short-form UGC platforms and to fill the lack of a benchmark dataset in this area. Method: The dataset is split into a synthetic part (simulating real low-quality UGC images) and a wild part (collected directly from the Kwai Platform), and is filtered with the quality assessment method KVQ. Result: The KwaiSR dataset contains 1800 synthetic image pairs and 1900 wild images, and the challenge results show that existing image super-resolution methods struggle on it. Conclusion: The KwaiSR dataset is challenging and is expected to lead to a new direction in the image super-resolution field. Abstract: In this work, we build the first benchmark dataset for short-form UGC Image Super-resolution in the wild, termed KwaiSR, intending to advance the research on developing image super-resolution algorithms for short-form UGC platforms. This dataset is collected from the Kwai Platform and is composed of two parts, i.e., synthetic and wild parts. Among them, the synthetic dataset, including 1,900 image pairs, is produced by simulating the degradation following the distribution of real-world low-quality short-form UGC images, aiming to provide the ground truth for training and objective comparison in the validation/testing. The wild dataset contains low-quality images collected directly from the Kwai Platform, which are filtered using the quality assessment method KVQ from the Kwai Platform. As a result, the KwaiSR dataset contains 1800 synthetic image pairs and 1900 wild images, which are divided into training, validation, and testing parts with a ratio of 8:1:1. Based on the KwaiSR dataset, we organize the NTIRE 2025 challenge on a second short-form UGC Video quality assessment and enhancement, which attracted many researchers to develop algorithms for it. The results of this competition have revealed that our KwaiSR dataset is quite challenging for existing Image SR methods, which is expected to lead to a new direction in the image super-resolution field. The dataset can be found at https://lixinustc.github.io/NTIRE2025-KVQE-KwaSR-KVQ.github.io/.
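
The 8:1:1 training/validation/testing ratio mentioned in the abstract can be illustrated with a minimal Python sketch. The function name `split_8_1_1`, the seed, and the choice to give any remainder to the training part are assumptions made for illustration; the abstract does not say how the authors performed the split.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items deterministically and split them 8:1:1 (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_val = n // 10
    n_test = n // 10
    n_train = n - n_val - n_test  # assumption: any remainder goes to training
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Under these assumptions, 1900 items yield parts of 1520, 190, and 190 items, and 1800 items yield 1440, 180, and 180.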

[119] Shifts in Doctors' Eye Movements Between Real and AI-Generated Medical Images

David C Wong,Bin Wang,Gorkem Durak,Marouane Tliba,Mohamed Amine Kerkouri,Aladine Chetouani,Ahmet Enis Cetin,Cagdas Topel,Nicolo Gennaro,Camila Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H Miller,Amir A Borhani,Hatice Savas,Eric M. Hart

Main category: cs.CV

TL;DR: The paper uses eye tracking to analyze differences in radiologists' visual attention between real and deep-learning-generated images.

Details

Motivation: To study how radiologists allocate attention during diagnosis and how real versus synthetic images affect their visual behavior. Method: Analyzes eye-movement patterns (such as saccade direction and amplitude) and fixation bias maps to compare visual saliency between real and synthetic images. Result: Reveals differences in radiologists' gaze distribution and visual saliency between real and synthetic images. Conclusion: Eye-tracking analysis helps in understanding radiologists' diagnostic strategies and their responses to image authenticity. Abstract: Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. In this work, we first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, amplitude, and their joint distribution. These metrics help uncover patterns in attention allocation and diagnostic strategies. Furthermore, we investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images. To achieve this, we examine fixation bias maps, focusing on first, last, short, and longest fixations independently, along with detailed saccade patterns, to quantify differences in gaze distribution and visual saliency between authentic and synthetic images.
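
As a rough illustration of the saccade metrics named in the abstract, the following Python sketch derives amplitude and direction from consecutive fixation positions and bins the directions. The function names, the pixel-coordinate input format, and the eight-bin histogram are assumptions for illustration, not the authors' pipeline.

```python
import math


def saccades(fixations):
    """Return (amplitude, direction_deg) for each pair of consecutive fixations.

    fixations: list of (x, y) positions in an assumed common unit (e.g. pixels).
    Direction is measured with atan2 in the input coordinate frame and wrapped
    to [0, 360); with image coordinates (y pointing down) angles turn clockwise.
    """
    result = []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        amplitude = math.hypot(dx, dy)
        direction = math.degrees(math.atan2(dy, dx)) % 360.0
        result.append((amplitude, direction))
    return result


def direction_histogram(saccade_list, bins=8):
    """Count saccades per direction bin (bin width = 360 / bins degrees)."""
    counts = [0] * bins
    width = 360.0 / bins
    for _, direction in saccade_list:
        counts[int(direction // width) % bins] += 1
    return counts
```

A joint distribution over direction and amplitude could be built the same way by adding a second binning step over the amplitude values.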

[120] Insert Anything: Image Insertion via In-Context Editing in DiT

Wensong Song,Hong Jiang,Zongxing Yang,Ruijie Quan,Yi Yang

Main category: cs.CV

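TL;DR: Insert Anything is a unified framework for reference-based image insertion that supports flexible user control and seamlessly integrates objects from reference images into target scenes.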

Details

Motivation: Existing methods typically train separate models for different tasks, whereas Insert Anything generalizes across a wide range of insertion tasks through a unified framework and the diverse AnyInsertion dataset. Method: Leverages the multimodal attention of the Diffusion Transformer (DiT) to support mask- and text-guided editing, and introduces an in-context editing mechanism with two prompting strategies to harmonize inserted elements with the target scene. Result: On the AnyInsertion, DreamBooth, and VTON-HD benchmarks, the method outperforms existing alternatives. Conclusion: Insert Anything shows great potential in real-world applications such as creative content generation, virtual try-on, and scene composition. Abstract: This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Instead of training separate models for individual tasks, our approach is trained once on our new AnyInsertion dataset (comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion) and effortlessly generalizes to a wide range of insertion scenarios. Such a challenging setting requires capturing both identity features and fine-grained details, while allowing versatile local adaptations in style, color, and texture. To this end, we propose to leverage the multimodal attention of the Diffusion Transformer (DiT) to support both mask- and text-guided editing. Furthermore, we introduce an in-context editing mechanism that treats the reference image as contextual information, employing two prompting strategies to harmonize the inserted elements with the target scene while faithfully preserving their distinctive features. Extensive experiments on AnyInsertion, DreamBooth, and VTON-HD benchmarks demonstrate that our method consistently outperforms existing alternatives, underscoring its great potential in real-world applications such as creative content generation, virtual try-on, and scene composition.

[121] Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models

Zijin Yang,Xin Zhang,Kejiang Chen,Kai Zeng,Qiyi Yao,Han Fang,Weiming Zhang,Nenghai Yu

Main category: cs.CV

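TL;DR: The paper proposes Gaussian Shading++, a watermarking method that addresses key challenges in the real-world deployment of diffusion models, such as key management, user-defined parameters, and third-party verification.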

Details

Motivation: Diffusion models raise ethical concerns about copyright protection and inappropriate content generation in practice, and existing watermarking methods overlook key deployment challenges. Method: Adopts a double-channel design that combines pseudorandom error-correcting codes with a soft decision decoding strategy, and introduces public key signatures to enable third-party verification. Result: Experiments show that Gaussian Shading++ remains performance-lossless while being more robust than existing methods. Conclusion: Gaussian Shading++ is a more practical solution for the real-world deployment of diffusion models. Abstract: Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address these issues, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.
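
To show why soft decision decoding can be more robust than thresholding each sample under an additive white Gaussian noise model, here is a minimal Python sketch. It swaps in a plain repetition code with ±1 signalling as a stand-in; it is not the pseudorandom error-correcting code or the decoder from the paper, and both function names are hypothetical.

```python
def hard_decision(received):
    """Majority vote over per-sample signs (ties decode to 1).

    Assumed signalling: bit 1 is sent as +1 and bit 0 as -1, repeated
    once per sample in `received`.
    """
    votes = sum(1 if r >= 0 else -1 for r in received)
    return 1 if votes >= 0 else 0


def soft_decision(received):
    """Decide from the sum of the raw samples.

    For equal-variance Gaussian noise, this sum is proportional to the
    log-likelihood ratio of bit 1 versus bit 0, so confident samples
    count more than marginal ones.
    """
    return 1 if sum(received) >= 0 else 0
```

For the received samples `[0.9, -0.1, -0.2]`, the hard decision is outvoted by two weakly negative samples and returns 0, while the soft decision weighs the strongly positive sample and returns 1.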

[122] DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation

Weijie He,Mushui Liu,Yunlong Yu,Zhao Wang,Chao Wu

Main category: cs.CV

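TL;DR: DyST-XL is a training-free framework that significantly improves text-to-video generation through dynamic layout planning, a dual-prompt controlled attention mechanism, and entity-consistency constraints.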

Details

Motivation: Existing diffusion-based text-to-video models suffer from layout discontinuity, entity identity drift, and implausible interaction dynamics when synthesizing dynamic scenes with multiple interacting entities. Method: DyST-XL combines a dynamic layout planner, a dual-prompt controlled attention mechanism, and an entity-consistency constraint strategy to improve the text-to-video generation process. Result: Experiments show that DyST-XL significantly improves performance on complex prompts, bridging a crucial gap in training-free video synthesis. Conclusion: DyST-XL offers an efficient approach to text-to-video generation that requires no additional training and addresses key problems of existing models. Abstract: Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis.

[123] An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Ji Qi,Yuan Yao,Yushi Bai,Bin Xu,Juanzi Li,Zhiyuan Liu,Tat-Seng Chua

Main category: cs.CV

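TL;DR: Quicksviewer is a new large multimodal model (LMM) that dynamically partitions video frames and applies unified resampling, significantly improving video understanding efficiency, reducing spatiotemporal redundancy, and outperforming a fixed-partitioning baseline.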

Details

Motivation: Conventional LMMs perceive video frames uniformly, which is inefficient for videos whose temporal information density varies, so a more efficient approach is needed. Method: Uses Gumbel Softmax to dynamically partition a video into cubes of nonuniform density and applies unified resampling to each cube, reducing redundancy and improving training efficiency. Result: Achieves a 45× compression rate, trains efficiently (on videos averaging 420s at 1fps), improves accuracy by up to 8.72 over the baseline, and reaches SOTA on Video-MME. Conclusion: Quicksviewer's dynamic partitioning significantly improves the efficiency and performance of video understanding and reveals power-law scaling of model capabilities. Abstract: Large Multimodal Models (LMMs) uniformly perceive video frames, creating computational inefficiency for videos with inherently varying temporal information density. This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax, followed by a unified resampling for each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (overall 45× compression rate), while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s/1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy, demonstrating the effectiveness in performance. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using just up to 5% of tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law of the model capabilities. It is also empirically verified that the segments generated by the cubing network can help analyze continuous events in videos.
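
The abstract names Gumbel Softmax as the mechanism behind the partitioning decision. The Python sketch below shows only the generic Gumbel-Softmax relaxation (perturb logits with Gumbel noise, then apply a temperature-scaled softmax); how Quicksviewer maps such samples to cube boundaries is not described here, and the function name, temperature, and seed are assumptions.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw one relaxed one-hot sample from the given logits.

    tau is the temperature: smaller values push the output toward one-hot.
    rng is an optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random(0)
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

In a trainable model the same computation would run in an autodiff framework so that gradients flow through the relaxed sample; plain Python is used here only to keep the sketch self-contained.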

[124] Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification

Shiben Liu,Huijie Fan,Qiang Wang,Baojie Fan,Yandong Tang,Liangqiong Qu

Main category: cs.CV

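TL;DR: This paper proposes a new model named DAFC, which addresses knowledge forgetting in lifelong person re-identification through text-driven prompt aggregation and distribution-aware forgetting compensation, significantly outperforming existing methods.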

Details

Motivation: Lifelong person re-identification (LReID) faces the key challenge of preserving old knowledge while adapting to new information. Existing rehearsal-based and rehearsal-free methods suffer from knowledge forgetting or insufficient distribution learning. Method: Proposes the DAFC model, comprising Text-driven Prompt Aggregation (TPA), Distribution-based Awareness and Integration (DAI), and a Knowledge Consolidation Mechanism (KCM), which tackle forgetting through cross-domain shared representation learning and domain-specific distribution integration. Result: Experiments show that DAFC outperforms existing methods by at least 9.8%/6.6% and 6.4%/6.2% in average mAP/R@1 on two training orders. Conclusion: Through distribution awareness and knowledge consolidation, DAFC effectively mitigates knowledge forgetting in lifelong person re-identification and significantly improves performance. Abstract: Lifelong Person Re-identification (LReID) suffers from a key challenge in preserving old knowledge while adapting to new information. The existing solutions include rehearsal-based and rehearsal-free methods to address this challenge. Rehearsal-based approaches rely on knowledge distillation, continuously accumulating forgetting during the distillation process. Rehearsal-free methods insufficiently learn the distribution of each domain, leading to forgetting over time. To solve these issues, we propose a novel Distribution-aware Forgetting Compensation (DAFC) model that explores cross-domain shared representation learning and domain-specific distribution integration without using old exemplars or knowledge distillation. We propose a Text-driven Prompt Aggregation (TPA) that utilizes text features to enrich prompt elements and guide the prompt model to learn fine-grained representations for each instance. This can enhance the differentiation of identity information and establish the foundation for domain distribution awareness. Then, Distribution-based Awareness and Integration (DAI) is designed to capture each domain-specific distribution by a dedicated expert network and adaptively consolidate them into a shared region in high-dimensional space. In this manner, DAI can consolidate and enhance cross-domain shared representation learning while alleviating catastrophic forgetting. Furthermore, we develop a Knowledge Consolidation Mechanism (KCM) that comprises instance-level discrimination and cross-domain consistency alignment strategies to facilitate model adaptive learning of new knowledge from the current domain and promote knowledge consolidation learning between acquired domain-specific distributions, respectively. Experimental results show that our DAFC outperforms state-of-the-art methods by at least 9.8%/6.6% and 6.4%/6.2% of average mAP/R@1 on two training orders.

[125] Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Chun-Hsiao Yeh,Chenyu Wang,Shengbang Tong,Ta-Ying Cheng,Rouyu Wang,Tianzhe Chu,Yuexiang Zhai,Yubei Chen,Shenghua Gao,Yi Ma

Main category: cs.CV

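TL;DR: The paper introduces the All-Angles Bench benchmark for evaluating multimodal large language models (MLLMs) on multi-view scene understanding, and finds that current models perform poorly on cross-view consistency and camera pose estimation.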

Details

Motivation: Multi-view understanding is a core capability for MLLMs used as embodied agents, but existing models fall short on geometric consistency and cross-view correspondence, calling for systematic evaluation and improvement. Method: Designs All-Angles Bench, a benchmark of over 2,100 multi-view question-answer pairs covering six tasks that test a model's geometric correspondence and cross-view information alignment. Result: Experiments show that 27 representative MLLMs (including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o) perform far below human level on multi-view tasks, especially on occluded views and camera pose estimation. Conclusion: Current MLLMs still need improvement in multi-view understanding, and All-Angles Bench provides an important reference for progress in the field. Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs underperform particularly in two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.

[126] ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Mohamed el amine Boudjoghra,Ivan Laptev,Angela Dai

Main category: cs.CV

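TL;DR: ScanEdit is an instruction-driven 3D scene editing method that uses a hierarchical scene graph and large language models (LLMs) for effective editing, combined with physical constraints to generate realistic scenes.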

Details

Motivation: With rapid advances in 3D capture technology and the resulting abundance of 3D data, effective 3D scene editing is essential for graphics applications. Method: Represents the objects of a 3D scan with a hierarchical scene graph, uses an LLM to translate language instructions into actionable commands, and combines them with physical constraints to generate scenes. Result: ScanEdit outperforms the state of the art in experiments and handles a variety of real-world scenes and input instructions. Conclusion: ScanEdit provides an effective, realistic approach to 3D scene editing that combines language instructions with physical constraints. Abstract: With the fast pace of 3D capture technology and resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation ScanEdit outperforms state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions.

[127] Structure-guided Diffusion Transformer for Low-Light Image Enhancement

Xiangchen Yin,Zhenda Yu,Longtao Jiang,Xin Gao,Xiao Sun,Zhi Liu,Xun Yang

Main category: cs.CV

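TL;DR: This paper is the first to introduce the diffusion transformer (DiT) to low-light image enhancement. It proposes a structure-guided diffusion transformer framework (SDTL) that uses a wavelet transform and a Structure Enhancement Module (SEM) to improve model efficiency and texture enhancement, and experiments demonstrate its superior performance.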

Details

Motivation: Current low-light image enhancement methods amplify noise while recovering details, degrading visual quality. This paper explores applying DiT to the task to improve image quality. Method: Proposes the SDTL framework, comprising wavelet-transform feature compression, a Structure Enhancement Module (SEM), and a Structure-guided Attention Block (SAB), to improve texture enhancement and noise suppression. Result: Achieves SOTA performance on several datasets, validating SDTL's effectiveness in improving image quality and DiT's potential for low-light enhancement. Conclusion: The SDTL framework effectively combines DiT with a structure-guided strategy, significantly improving low-light image enhancement and offering a new direction for applying DiT in this area. Abstract: While the diffusion transformer (DiT) has become a focal point of interest in recent years, its application in low-light image enhancement remains a blank area for exploration. Current methods recover the details from low-light images while inevitably amplifying the noise in images, resulting in poor visual quality. In this paper, we are the first to introduce DiT into the low-light enhancement task and design a novel Structure-guided Diffusion Transformer based Low-light image enhancement (SDTL) framework. We compress the feature through wavelet transform to improve the inference efficiency of the model and capture the multi-directional frequency band. Then we propose a Structure Enhancement Module (SEM) that uses structural prior to enhance the texture and leverages an adaptive fusion strategy to achieve a more accurate enhancement effect. In addition, we propose a Structure-guided Attention Block (SAB) to pay more attention to texture-rich tokens and avoid interference from noisy areas in noise prediction. Extensive qualitative and quantitative experiments demonstrate that our method achieves SOTA performance on several popular datasets, validating the effectiveness of SDTL in improving image quality and the potential of DiT in low-light enhancement tasks.
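
To illustrate how a wavelet transform both halves the spatial resolution and separates directional frequency bands, here is a minimal single-level 2D Haar transform in Python. The Haar basis, the function name `haar2d`, and the band names are assumptions chosen for illustration; the abstract does not say which wavelet SDTL uses.

```python
def haar2d(image):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.

    Each 2x2 block [[a, b], [c, d]] yields one coefficient per band:
      ll        low-pass average (half-resolution approximation)
      col_diff  difference between left and right columns
      row_diff  difference between top and bottom rows
      diag      diagonal difference
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    bands = {"ll": [], "col_diff": [], "row_diff": [], "diag": []}
    for i in range(0, rows, 2):
        out = {name: [] for name in bands}
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            out["ll"].append((a + b + c + d) / 2)
            out["col_diff"].append((a - b + c - d) / 2)
            out["row_diff"].append((a + b - c - d) / 2)
            out["diag"].append((a - b - c + d) / 2)
        for name in bands:
            bands[name].append(out[name])
    return bands
```

Each band has half the height and width of the input, so a downstream network sees a quarter of the spatial positions per band, which is the kind of compression the abstract refers to.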

[128] Hierarchical Attention Fusion of Visual and Textual Representations for Cross-Domain Sequential Recommendation

Wangyu Wu,Zhenhong Chen,Siqi Song,Xianglin Qiua,Xiaowei Huang,Fei Ma,Jimin Xiao

Main category: cs.CV

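TL;DR: The paper proposes HAF-VT, a new method that enhances cross-domain sequential recommendation by combining visual and textual data and uses a hierarchical attention mechanism to mimic human cognitive processes, significantly improving recommendation performance.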

Details

Motivation: Cross-domain sequential recommendation (CDSR) needs better modeling of users' cross-domain preferences, and existing methods do not fully exploit multimodal data (such as images and text) to mimic human cognitive processes. Method: Generates image and text embeddings with a frozen CLIP model and jointly learns single-domain and cross-domain preferences through a hierarchical attention mechanism that mimics human information integration. Result: Experiments on four e-commerce datasets show that HAF-VT outperforms existing methods in capturing cross-domain user interests. Conclusion: HAF-VT bridges cognitive principles and computational models and highlights the importance of multimodal data in sequential decision-making. Abstract: Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, focusing on modeling cross-domain preferences through intra- and inter-sequence item relationships. Inspired by human cognitive processes, we propose Hierarchical Attention Fusion of Visual and Textual Representations (HAF-VT), a novel approach integrating visual and textual data to enhance cognitive modeling. Using the frozen CLIP model, we generate image and text embeddings, enriching item representations with multimodal data. A hierarchical attention mechanism jointly learns single-domain and cross-domain preferences, mimicking human information integration. Evaluated on four e-commerce datasets, HAF-VT outperforms existing methods in capturing cross-domain user interests, bridging cognitive principles with computational models and highlighting the role of multimodal data in sequential decision-making.

[129] VistaDepth: Frequency Modulation With Bias Reweighting For Enhanced Long-Range Depth Estimation

Mingxia Zhan,Li Zhang,XiaoMeng Chu,Beibei Wang

Main category: cs.CV

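TL;DR: VistaDepth is a new monocular depth estimation framework that combines frequency-domain feature enhancement with an adaptive weight-balancing mechanism, significantly improving the accuracy of long-range depth reconstruction.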

Details

Motivation: Existing diffusion-based monocular depth estimation methods perform poorly on long-range depth reconstruction, mainly because of the imbalanced distribution of depth values and an over-reliance on spatial-domain features. Method: VistaDepth introduces a Latent Frequency Modulation (LFM) module and an adaptive weighting strategy that dynamically refine spectral responses in the feature space and modulate the loss function. Result: Experiments show that VistaDepth performs best among diffusion-based monocular depth estimation methods, especially in reconstructing distant regions. Conclusion: Through frequency-domain enhancement and adaptive weighting, VistaDepth significantly improves depth perception, particularly for distant details. Abstract: Monocular depth estimation (MDE) aims to predict per-pixel depth values from a single RGB image. Recent advancements have positioned diffusion models as effective MDE tools by framing the challenge as a conditional image generation task. Despite their progress, these methods often struggle with accurately reconstructing distant depths, due largely to the imbalanced distribution of depth values and an over-reliance on spatial-domain features. To overcome these limitations, we introduce VistaDepth, a novel framework that integrates adaptive frequency-domain feature enhancements with an adaptive weight-balancing mechanism into the diffusion process. Central to our approach is the Latent Frequency Modulation (LFM) module, which dynamically refines spectral responses in the latent feature space, thereby improving the preservation of structural details and reducing noisy artifacts. Furthermore, we implement an adaptive weighting strategy that modulates the diffusion loss in real time, enhancing the model's sensitivity towards distant depth reconstruction. These innovations collectively result in superior depth perception performance across both distance and detail. Experimental evaluations confirm that VistaDepth achieves state-of-the-art performance among diffusion-based MDE techniques, particularly excelling in the accurate reconstruction of distant regions.

[130] A triple-branch network for latent fingerprint enhancement guided by orientation fields and minutiae

Yurun Wang,Zerong Qi,Shujun Fu,Mingzheng Hu

Main category: cs.CV

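TL;DR: The paper proposes TBSFNet, a triple-branch spatial fusion network, and extends it into MLFGNet to improve latent fingerprint enhancement. Experiments show that it outperforms existing methods.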

Details

Motivation: Existing deep learning methods fall short in restoring low-quality fingerprint regions, and different regions call for different enhancement strategies. Method: Proposes TBSFNet, integrates orientation field and minutiae-related modules into it, and introduces MLFGNet to improve generalization. Result: On the MOLF and MUST datasets, MLFGNet outperforms existing enhancement algorithms. Conclusion: TBSFNet and MLFGNet effectively improve latent fingerprint enhancement, especially for low-quality regions. Abstract: Latent fingerprint enhancement is a critical step in the process of latent fingerprint identification. Existing deep learning-based enhancement methods still fall short of practical application requirements, particularly in restoring low-quality fingerprint regions. Recognizing that different regions of latent fingerprints require distinct enhancement strategies, we propose a Triple Branch Spatial Fusion Network (TBSFNet), which simultaneously enhances different regions of the image using tailored strategies. Furthermore, to improve the generalization capability of the network, we integrate orientation field and minutiae-related modules into TBSFNet and introduce a Multi-Level Feature Guidance Network (MLFGNet). Experimental results on the MOLF and MUST datasets demonstrate that MLFGNet outperforms existing enhancement algorithms.

[131] Unwarping Screen Content Images via Structure-texture Enhancement Network and Transformation Self-estimation

Zhenzhen Xiao,Heng Liu,Bingwen Hu

Main category: cs.CV

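TL;DR: Proposes a structure-texture enhancement network (STEN) that addresses geometric distortion in screen content images (SCIs), improving performance through a B-spline implicit neural representation and a transformation self-estimation algorithm.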

Details

Motivation: Existing implicit neural network methods perform well on natural images but struggle with SCIs, which contain large geometric distortions, text, and sharp edges. Method: STEN comprises a structure estimation branch (SEB) and a texture estimation branch (TEB), which respectively enhance local aggregation and global dependency modeling and improve texture detail synthesis, while a transformation self-estimation module corrects the coordinate transformation matrix. Result: Experiments on public SCI datasets show that the method significantly outperforms existing techniques and also shows potential on natural image datasets. Conclusion: By combining structure-texture enhancement with transformation self-estimation, STEN effectively handles distortion in SCIs and has potential to extend to natural images. Abstract: While existing implicit neural network-based image unwarping methods perform well on natural images, they struggle to handle screen content images (SCIs), which often contain large geometric distortions, text, symbols, and sharp edges. To address this, we propose a structure-texture enhancement network (STEN) with transformation self-estimation for SCI unwarping. STEN integrates a B-spline implicit neural representation module and a transformation error estimation and self-correction algorithm. It comprises two branches: the structure estimation branch (SEB), which enhances local aggregation and global dependency modeling, and the texture estimation branch (TEB), which improves texture detail synthesis using B-spline implicit neural representation. Additionally, the transformation self-estimation module autonomously estimates the transformation error and corrects the coordinate transformation matrix, effectively handling real-world image distortions. Extensive experiments on public SCI datasets demonstrate that our approach significantly outperforms state-of-the-art methods. Comparisons on well-known natural image datasets also show the potential of our approach for natural image distortion.

[132] Improving Sound Source Localization with Joint Slot Attention on Image and Audio

Inho Kim,Youngkil Song,Jicheol Park,Won Hwa Kim,Suha Kwak

Main category: cs.CV

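TL;DR: Proposes a sound source localization method based on joint slot attention that decomposes features into target and off-target representations to improve contrastive learning, achieving the best performance on multiple benchmarks.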

Details

Motivation: Lacking localization labels, existing sound source localization methods typically represent the image and the audio each as a single embedding vector, but noise and background interference degrade the results. Method: Uses joint slot attention to decompose image and audio features into target and off-target representations, uses only the target representations for contrastive learning, and introduces cross-modal attention matching to align local features. Result: Achieves the best performance in almost all settings on three public benchmarks and substantially outperforms prior work in cross-modal retrieval. Conclusion: Through joint slot attention and cross-modal matching, the proposed method effectively handles noise and background interference and significantly improves sound source localization. Abstract: Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only target representations of image and audio are used for contrastive learning. Also, we introduce cross-modal attention matching to further align local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all the prior work in cross-modal retrieval.

[133] Robust and Real-time Surface Normal Estimation from Stereo Disparities using Affine Transformations

Csongor Csanad Kariko,Muhammad Rafi Faisal,Levente Hajder

Main category: cs.CV

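TL;DR: Proposes a new method for surface normal estimation from rectified stereo image pairs that uses affine transformations derived from disparity values to achieve fast and accurate results.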

Details

Motivation: Rectified stereo image pairs simplify surface normal estimation and reduce computational complexity, but noise must be handled and robustness improved. Method: Combines affine transformations, a custom convolution-inspired algorithm, and adaptive heuristic techniques to build a fast and accurate surface normal estimator. Result: Validated on the Middlebury and Cityscapes datasets, the method shows significant improvements in real-time performance and accuracy. Conclusion: The method is efficient and accurate, and the source code will be released to support further research. Abstract: This work introduces a novel method for surface normal estimation from rectified stereo image pairs, leveraging affine transformations derived from disparity values to achieve fast and accurate results. We demonstrate how the rectification of stereo image pairs simplifies the process of surface normal estimation by reducing computational complexity. To address noise reduction, we develop a custom algorithm inspired by convolutional operations, tailored to process disparity data efficiently. We also introduce adaptive heuristic techniques for efficiently detecting connected surface components within the images, further improving the robustness of the method. By integrating these methods, we construct a surface normal estimator that is both fast and accurate, producing a dense, oriented point cloud as the final output. Our method is validated using both simulated environments and real-world stereo images from the Middlebury and Cityscapes datasets, demonstrating significant improvements in real-time performance and accuracy when implemented on a GPU. Upon acceptance, the shader source code will be made publicly available to facilitate further research and reproducibility.
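
The link between disparity and surface normals in a rectified pair can be sketched with a short derivation turned into Python. Assuming a pinhole camera with square pixels, focal length `f` in pixels, principal point `(cx, cy)`, and disparity d = f·B/Z, a planar patch has disparity that is affine in pixel coordinates, d(u, v) = a·u + b·v + c, and its normal is proportional to (f·a, f·b, d0), where d0 is the plane's disparity extrapolated to the principal point. This is a textbook-style simplification for illustration, not the paper's algorithm; the function name and parameters are hypothetical.

```python
import math


def normal_from_disparity_plane(a, b, d, u, v, f, cx, cy):
    """Unit surface normal of a planar patch from its local disparity gradient.

    a, b   : disparity gradient (change in disparity per pixel along u and v)
    d      : disparity at pixel (u, v)
    f      : focal length in pixels (square pixels assumed)
    cx, cy : principal point in pixels
    The baseline cancels out of the direction. The sign is chosen so that a
    fronto-parallel patch gives (0, 0, 1); flip it for camera-facing normals.
    """
    d0 = d - a * (u - cx) - b * (v - cy)  # plane disparity at the principal point
    nx, ny, nz = f * a, f * b, d0
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("degenerate disparity plane")
    return (nx / length, ny / length, nz / length)
```

In practice `a` and `b` would be estimated from noisy disparities, for example by a local least-squares fit, which is where noise reduction of the kind described in the abstract becomes important.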

[134] MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video

Minh-Quan Viet Bui,Jongmin Park,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim

Main category: cs.CV

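TL;DR: MoBGS is a new deblurring framework for dynamic 3D Gaussian Splatting (3DGS) that reconstructs sharp, high-quality novel spatio-temporal views from blurry monocular videos.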

Details

Motivation: Existing dynamic novel view synthesis (NVS) methods are sensitive to motion blur, which degrades rendering quality. MoBGS aims to solve this problem with dedicated motion modeling for dynamic objects. Method: MoBGS introduces Blur-adaptive Latent Camera Estimation (BLCE) to estimate latent camera trajectories and improve global camera motion deblurring, and proposes Latent Camera-induced Exposure Estimation (LCEE) to ensure consistent deblurring of global camera and local object motion. Result: Experiments on the Stereo Blur dataset and real-world blurry videos show that MoBGS significantly outperforms recent methods such as DyBluRF and Deblur4DGS, achieving state-of-the-art performance for dynamic NVS under motion blur. Conclusion: With BLCE and LCEE, MoBGS achieves high-quality deblurring and spatio-temporal consistency for dynamic scenes, providing an effective solution for dynamic NVS. Abstract: We present MoBGS, a novel deblurring dynamic 3D Gaussian Splatting (3DGS) framework capable of reconstructing sharp and high-quality novel spatio-temporal views from blurry monocular videos in an end-to-end manner. Existing dynamic novel view synthesis (NVS) methods are highly sensitive to motion blur in casually captured videos, resulting in significant degradation of rendering quality. While recent approaches address motion-blurred inputs for NVS, they primarily focus on static scene reconstruction and lack dedicated motion modeling for dynamic objects. To overcome these limitations, our MoBGS introduces a novel Blur-adaptive Latent Camera Estimation (BLCE) method for effective latent camera trajectory estimation, improving global camera motion deblurring. In addition, we propose a physically-inspired Latent Camera-induced Exposure Estimation (LCEE) method to ensure consistent deblurring of both global camera and local object motion. Our MoBGS framework ensures the temporal consistency of unseen latent timestamps and robust motion decomposition of static and dynamic regions. Extensive experiments on the Stereo Blur dataset and real-world blurry videos show that our MoBGS significantly outperforms very recent advanced methods (DyBluRF and Deblur4DGS), achieving state-of-the-art performance for dynamic NVS under motion blur.

[135] Instance-Adaptive Keypoint Learning with Local-to-Global Geometric Aggregation for Category-Level Object Pose Estimation

Xiao Zhang,Lu Zou,Tao Lu,Yuan Yao,Zhangjin Huang,Guoping Wang

Main category: cs.CV

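TL;DR: INKL-Pose is a novel category-level object pose estimation framework that uses instance-adaptive keypoint learning and local-to-global geometric aggregation to significantly improve pose estimation for objects with complex geometry or non-canonical shapes.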

Details

Motivation: Existing methods perform poorly on object instances with complex geometry or significant deviations from canonical shapes, so a method is needed that adapts to individual instances and captures geometric detail. Method: INKL-Pose predicts keypoints with an Instance-Adaptive Keypoint Generator, refines them with local and global feature aggregators (the latter using bidirectional Mamba), and introduces a surface loss and a separation loss so that keypoints are uniformly distributed and diverse. Result: Experiments on the CAMERA25, REAL275, and HouseCat6D datasets show that INKL-Pose achieves state-of-the-art performance and significantly outperforms existing methods. Conclusion: Through adaptive keypoint learning and geometric aggregation, INKL-Pose effectively addresses the challenges of category-level object pose estimation and offers a new approach to pose estimation for complex instances. Abstract: Category-level object pose estimation aims to predict the 6D pose and size of previously unseen instances from predefined categories, requiring strong generalization across diverse object instances. Although many previous methods attempt to mitigate intra-class variations, they often struggle with instances exhibiting complex geometries or significant deviations from canonical shapes. To address this challenge, we propose INKL-Pose, a novel category-level object pose estimation framework that enables INstance-adaptive Keypoint Learning with local-to-global geometric aggregation. Specifically, our approach first predicts semantically consistent and geometrically informative keypoints through an Instance-Adaptive Keypoint Generator, then refines them with: (1) a Local Keypoint Feature Aggregator capturing fine-grained geometries, and (2) a Global Keypoint Feature Aggregator using bidirectional Mamba for structural consistency. To enable bidirectional modeling in Mamba, we introduce a Feature Sequence Flipping strategy that preserves spatial coherence while constructing backward feature sequences. Additionally, we design a surface loss and a separation loss to enforce uniform coverage and spatial diversity in keypoint distribution. The generated keypoints are finally mapped to a canonical space for regressing the object's 6D pose and size. Extensive experiments on CAMERA25, REAL275, and HouseCat6D demonstrate that INKL-Pose achieves state-of-the-art performance and significantly outperforms existing methods.
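
Editor's sketch: the general idea of getting bidirectional context from a one-directional sequence model by flipping the sequence can be illustrated as below. This is not the authors' Feature Sequence Flipping strategy; `causal_scan` is a hypothetical stand-in (a simple exponential moving average) for a causal Mamba block, and summing the two directions is an assumed fusion rule.

```python
def causal_scan(seq, decay=0.5):
    """Hypothetical stand-in for a causal sequence model:
    each output depends only on the current and earlier inputs."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + (1.0 - decay) * x
        out.append(state)
    return out


def bidirectional_scan(seq, decay=0.5):
    """Run the causal model forward and on the flipped sequence,
    flip the backward result back into place, then fuse by summation."""
    forward = causal_scan(seq, decay)
    backward = causal_scan(seq[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

With this construction every position receives information from both earlier and later positions, which a single causal pass cannot provide.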

[136] "I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts

Huzheng Yang,Katherine Xu,Michael D. Grossberg,Yutong Bai,Jianbo Shi

Main category: cs.CV

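TL;DR: The paper proposes a Mood Board approach that conveys abstract concepts through examples, and builds a Mood Space that compresses the feature space to enable efficient image-level operations.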

Details

Motivation: Many abstract concepts are hard to define but easy to recognize, so a way to convey them through examples is needed. Method: Proposes the Mood Board and Mood Space, using a fibration computation to compress the feature space and learning the affinity relationships among image tokens to build a compact, locally linear space. Result: Mood Space enables efficient image-level operations (such as object averaging, visual analogy, and pose transfer), is computationally efficient, and needs only a few exemplars. Conclusion: Mood Space provides an efficient and compact solution for conveying and manipulating abstract concepts. Abstract: Expressing complex concepts is easy when they can be labeled or quantified, but many ideas are hard to define yet instantly recognizable. We propose a Mood Board, where users convey abstract concepts with examples that hint at the intended direction of attribute changes. We compute an underlying Mood Space that 1) factors out irrelevant features and 2) finds the connections between images, thus bringing relevant concepts closer. We invent a fibration computation to compress/decompress pre-trained features into/from a compact space, 50-100x smaller. The main innovation is learning to mimic the pairwise affinity relationship of the image tokens across exemplars. To focus on the coarse-to-fine hierarchical structures in the Mood Space, we compute the top eigenvector structure from the affinity matrix and define a loss in the eigenvector space. The resulting Mood Space is locally linear and compact, allowing image-level operations, such as object averaging, visual analogy, and pose transfer, to be performed as a simple vector operation in Mood Space. Our learning is efficient in computation without any fine-tuning, needs only a few (2-20) exemplars, and takes less than a minute to learn.
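
Editor's sketch: the abstract mentions a pairwise affinity matrix over image tokens and its top eigenvector structure. A minimal illustration of those two ingredients follows. It assumes cosine similarity as the affinity and uses plain power iteration for the leading eigenvector; neither choice is confirmed by the abstract, and the function names are hypothetical.

```python
import math


def cosine_affinity(features):
    """Pairwise cosine similarity between token feature vectors (rows)."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def top_eigenvector(matrix, iters=200):
    """Power iteration for the leading eigenvector of a symmetric,
    non-negative matrix; returns a unit-length vector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Tokens that are similar to many others receive larger entries in the leading eigenvector, which is one way such a vector can expose coarse grouping structure.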

[137] Landmark-Free Preoperative-to-Intraoperative Registration in Laparoscopic Liver Resection

Jun Zhou,Bingchen Gao,Kai Wang,Jialun Pei,Pheng-Ann Heng,Jing Qin

Main category: cs.CV

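TL;DR: Proposes a landmark-free preoperative-to-intraoperative registration framework based on self-supervised learning that turns the conventional 3D-2D workflow into 3D-3D registration, addressing existing methods' reliance on anatomical landmarks and their insufficient use of intraoperative information.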

Details

Motivation: Existing registration methods rely on anatomical landmarks and suffer from ambiguous landmark definitions and inadequate modeling of intraoperative shape deformation, which affects surgical success rates. Method: Proposes a landmark-free registration framework comprising a feature-disentangled transformer that learns rigid transformations and a structure-regularized deformation network that adjusts the preoperative model. Result: Experiments on synthetic and real datasets demonstrate the method's superiority and clinical potential. Conclusion: The method significantly improves registration performance and has potential clinical value. Abstract: Liver registration by overlaying preoperative 3D models onto intraoperative 2D frames can assist surgeons in perceiving the spatial anatomy of the liver clearly for a higher surgical success rate. Existing registration methods rely heavily on anatomical landmark-based workflows, which encounter two major limitations: 1) ambiguous landmark definitions fail to provide efficient markers for registration; 2) insufficient integration of intraoperative liver visual information in shape deformation modeling. To address these challenges, in this paper, we propose a landmark-free preoperative-to-intraoperative registration framework utilizing effective self-supervised learning, termed \ourmodel. This framework transforms the conventional 3D-2D workflow into a 3D-3D registration pipeline, which is then decoupled into rigid and non-rigid registration subtasks. \ourmodel first introduces a feature-disentangled transformer to learn robust correspondences for recovering rigid transformations. Further, a structure-regularized deformation network is designed to adjust the preoperative model to align with the intraoperative liver surface. This network captures structural correlations through geometry similarity modeling in a low-rank transformer network. To facilitate the validation of the registration performance, we also construct an in-vivo registration dataset containing liver resection videos of 21 patients, called P2I-LReg, which contains 346 keyframes that provide a global view of the liver together with liver mask annotations and calibrated camera intrinsic parameters. Extensive experiments and user studies on both synthetic and in-vivo datasets demonstrate the superiority and potential clinical applicability of our method.

[138] Dynamic 3D KAN Convolution with Adaptive Grid Optimization for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Main category: cs.CV

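TL;DR: KANet is a model based on an improved 3D-DenseNet that uses 3D KAN Conv and an adaptive grid update mechanism to address high-dimensional data, sparse distributions, and spectral redundancy in hyperspectral image classification.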

Details

Motivation: In hyperspectral image classification, deep neural networks face high-dimensional data, sparse distributions, and spectral redundancy, which lead to overfitting and limited generalization. Method: Introduces learnable univariate B-spline functions that replace the fixed linear weights of traditional 3D convolutional kernels, with a dynamic grid adjustment mechanism that optimizes spline resolution. Result: Outperforms mainstream methods on the IN, UP, and KSC datasets, significantly improving modeling accuracy on high-dimensional data and parameter efficiency. Conclusion: KANet improves model representation capability through a 3D dynamic expert convolution system without increasing network depth or width, effectively alleviating the curse of dimensionality and the risk of overfitting. Abstract: Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes KANet based on an improved 3D-DenseNet model, consisting of 3D KAN Conv and an adaptive grid update mechanism. By introducing learnable univariate B-spline functions on network edges, specifically by flattening three-dimensional neighborhoods into vectors and applying B-spline-parameterized nonlinear activation functions to replace the fixed linear weights of traditional 3D convolutional kernels, we precisely capture complex spectral-spatial nonlinear relationships in hyperspectral data. Simultaneously, through a dynamic grid adjustment mechanism, we adaptively update the grid point positions of B-splines based on the statistical characteristics of input data, optimizing the resolution of spline functions to match the non-uniform distribution of spectral features, significantly improving the model's accuracy in high-dimensional data modeling and parameter efficiency, effectively alleviating the curse of dimensionality. This characteristic demonstrates superior neural scaling laws compared to traditional convolutional neural networks and reduces overfitting risks in small-sample and high-noise scenarios. KANet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
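
Editor's sketch: a B-spline-parameterized activation of the kind the abstract describes can be evaluated with the standard Cox-de Boor recursion. The code below is a generic illustration, not KANet's implementation; the function names, the uniform knot grid in the usage note, and the cubic degree are assumptions, and the adaptive grid update is not shown.

```python
def bspline_basis(x, knots, degree):
    """Cox-de Boor recursion: values of all B-spline basis functions at x.
    Returns len(knots) - 1 - degree values."""
    n = len(knots) - 1
    # Degree 0: indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(n - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (
                (knots[i + d + 1] - x) / right_den * basis[i + 1]
                if right_den > 0
                else 0.0
            )
            nxt.append(left + right)
        basis = nxt
    return basis


def spline_activation(x, knots, coeffs, degree=3):
    """Learnable univariate function: weighted sum of B-spline basis values.
    coeffs plays the role of the trainable parameters."""
    return sum(c * b for c, b in zip(coeffs, bspline_basis(x, knots, degree)))
```

For example, with uniform knots `[0, 1, ..., 7]` and degree 3 there are four basis functions, and on the interval [3, 4) they sum to one, so constant coefficients yield a constant activation. Moving the knots changes where the spline has fine resolution, which is the quantity a grid update would adjust.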

[139] Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

Junyuan Deng,Xinyi Wu,Yongxing Yang,Congchao Zhu,Song Wang,Zhenyao Wu

Main category: cs.CV

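TL;DR: The paper proposes FluxGen, a training data construction pipeline for image restoration that generates high-quality training samples with a pre-trained text-to-image model, and pairs it with a lightweight adapter, FluxIR, to control the model, significantly reducing training cost.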

Details

Motivation: Existing methods depend on large numbers of high-quality images and substantial compute for training, which is costly and not privacy-friendly, so a more efficient, lower-cost approach is needed. Method: Uses a pre-trained T2I model to generate training samples, designs the FluxGen pipeline (unconditional image generation, image selection, and degraded image simulation), and develops a lightweight adapter, FluxIR, to control the model. Result: Experiments show strong performance on both synthetic and real-world degradation datasets at only about 8.5% of the training cost of current approaches. Conclusion: The combination of FluxGen and FluxIR offers an efficient, low-cost solution for image restoration. Abstract: Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that the well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets - at only about 8.5% of the training cost compared to current approaches.

[140] An Efficient Aerial Image Detection with Variable Receptive Fields

Liu Wenbin

Main category: cs.CV

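TL;DR: VRF-DETR is a transformer-based detector that uses dynamically adjusted receptive fields and multi-scale fusion to address small targets, occlusion, and computational efficiency in UAV object detection.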

Details

Motivation: UAV object detection faces small targets (under 10 pixels), dense occlusion, and limited computational resources, and existing detectors struggle to balance accuracy and efficiency. Method: VRF-DETR comprises three key components: a Multi-Scale Context Fusion (MSCF) module, a Gated Convolution (GConv) layer, and a Gated Multi-scale Fusion (GMCF) Bottleneck, improving performance through dynamic receptive fields and a parameter-efficient design. Result: On the VisDrone2019 dataset, VRF-DETR reaches 51.4% mAP50 and 31.8% mAP50:95 with only 13.5M parameters. Conclusion: VRF-DETR sets a new efficiency-accuracy trade-off standard for UAV object detection. Abstract: Aerial object detection using unmanned aerial vehicles (UAVs) faces critical challenges including sub-10px targets, dense occlusions, and stringent computational constraints. Existing detectors struggle to balance accuracy and efficiency due to rigid receptive fields and redundant architectures. To address these limitations, we propose Variable Receptive Field DETR (VRF-DETR), a transformer-based detector incorporating three key components: 1) Multi-Scale Context Fusion (MSCF) module that dynamically recalibrates features through adaptive spatial attention and gated multi-scale fusion, 2) Gated Convolution (GConv) layer enabling parameter-efficient local-context modeling via depthwise separable operations and dynamic gating, and 3) Gated Multi-scale Fusion (GMCF) Bottleneck that hierarchically disentangles occluded objects through cascaded global-local interactions. Experiments on VisDrone2019 demonstrate VRF-DETR achieves 51.4% mAP50 and 31.8% mAP50:95 with only 13.5M parameters. This work establishes a new efficiency-accuracy Pareto frontier for UAV-based detection tasks.
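
Editor's sketch: the reported mAP50 and mAP50:95 figures rest on intersection over union (IoU) between predicted and ground-truth boxes. The snippet below shows the matching criterion only, assuming the common COCO-style convention (a detection counts as correct at IoU >= 0.5 for mAP50, and mAP50:95 averages over thresholds 0.50 to 0.95 in steps of 0.05); the abstract does not spell out the evaluation protocol, and the helper names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU meets the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Assumed COCO-style threshold sweep used for mAP50:95.
MAP_50_95_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

The criterion also shows why sub-10px targets are hard: for an 8x8 box, a 2-pixel horizontal shift gives IoU 48/80 = 0.6 and still matches at the 0.5 threshold, while a 3-pixel shift gives 40/88 (about 0.45) and no longer does.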

[141] HSANET: A Hybrid Self-Cross Attention Network For Remote Sensing Change Detection

Chengxi Han,Xiaoyu Su,Zhiqiang Wei,Meiqi Hu,Yichu Xu

Main category: cs.CV

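TL;DR: HSANet is a network for remote sensing image change detection that extracts multi-scale features with hierarchical convolution and combines hybrid self-attention and cross-attention mechanisms to improve detection performance.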

Details

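Motivation: Remote sensing image change detection is an important method for large-scale monitoring and requires capturing multi-scale information and global context efficiently. Method: HSANet extracts multi-scale features with hierarchical convolution and combines hybrid self-attention and cross-attention mechanisms to learn and fuse global and cross-scale information. Result: HSANet captures global context at different scales, integrates cross-scale features, refines edge details, and improves detection performance. Conclusion: With multi-scale features and attention mechanisms, HSANet improves remote sensing change detection, and the model code is to be open-sourced. Abstract: The remote sensing image change detection task is an essential method for large-scale monitoring. We propose HSANet, a network that uses hierarchical convolution to extract multi-scale features. It incorporates hybrid self-attention and cross-attention mechanisms to learn and fuse global and cross-scale information. This enables HSANet to capture global context at different scales and integrate cross-scale features, refining edge details and improving detection performance. We will also open-source our model code: https://github.com/ChengxiHAN/HSANet.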

[142] DSPO: Direct Semantic Preference Optimization for Real-World Image Super-Resolution

Miaomiao Cai,Simiao Li,Wei Li,Xudong Huang,Hanting Chen,Jie Hu,Yunhe Wang

Main category: cs.CV

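TL;DR: The paper is the first to introduce human preference alignment into Real-ISR, proposing Direct Semantic Preference Optimization (DSPO) to resolve the conflict between pixel-level reconstruction objectives and image-level preferences, and achieving instance-level alignment through semantic guidance.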

Details

Motivation: Existing Real-ISR methods do not integrate human feedback, so generated results may be misaligned with human preferences and may even contain artifacts and harmful content. Method: Introduces Direct Preference Optimization (DPO) and extends it to DSPO, achieving instance-level preference alignment through a semantic instance alignment strategy and a user description feedback strategy. Result: DSPO proves highly effective in both one-step and multi-step super-resolution frameworks. Conclusion: DSPO is a plug-and-play solution that effectively improves the alignment of Real-ISR outputs with human preferences. Abstract: Recent advances in diffusion models have improved Real-World Image Super-Resolution (Real-ISR), but existing methods lack human feedback integration, risking misalignment with human preference and potentially leading to artifacts, hallucinations and harmful content generation. To this end, we are the first to introduce human preference alignment into Real-ISR, a technique that has been successfully applied in Large Language Models and Text-to-Image tasks to effectively enhance the alignment of generated outputs with human preferences. Specifically, we introduce Direct Preference Optimization (DPO) into Real-ISR to achieve alignment, where DPO serves as a general alignment technique that directly learns from the human preference dataset. Nevertheless, unlike high-level tasks, the pixel-level reconstruction objectives of Real-ISR are difficult to reconcile with the image-level preferences of DPO, which can make DPO overly sensitive to local anomalies and reduce generation quality. To resolve this dichotomy, we propose Direct Semantic Preference Optimization (DSPO) to align instance-level human preferences by incorporating semantic guidance through two strategies: (a) a semantic instance alignment strategy, implementing instance-level alignment to ensure fine-grained perceptual consistency, and (b) a user description feedback strategy, mitigating hallucinations through semantic textual feedback on instance-level images. As a plug-and-play solution, DSPO proves highly effective in both one-step and multi-step SR frameworks.
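
Editor's sketch: for orientation, the standard DPO objective that the abstract builds on can be written for a single preference pair as below. This is the generic DPO loss for log-likelihoods, not DSPO and not the diffusion-specific variant the authors may use; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log(sigmoid(beta * [(log pi(win) - log ref(win))
                                - (log pi(lose) - log ref(lose))]))
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the preferred sample's likelihood relative to the rejected one. Because the preference is attached to a whole image, a single local defect can flip the label, which is the sensitivity the abstract attributes to applying DPO directly.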

[143] FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image

Fei Yin,Mallikarjun B R,Chun-Han Yao,Rafał Mantiuk,Varun Jampani

Main category: cs.CV

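TL;DR: Proposes a new framework for generating high-quality, animatable 4D avatars from a single image, addressing existing methods' dependence on multi-view data or their insufficient shape accuracy.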

Details

Motivation: Existing 4D avatar generation methods either need extensive multi-view data or struggle to ensure shape accuracy and identity consistency, so the paper proposes a system that jointly leverages shape, image, and video priors. Method: Obtains an initial coarse shape through 3D-GAN inversion, enhances multi-view texture consistency using depth-guided warping signals and an image diffusion model, handles expression animation with a video prior, and introduces Consistent-Inconsistent training to handle data inconsistencies in 4D reconstruction. Result: Experiments show the method outperforms prior art in quality and in consistency across viewpoints and expressions. Conclusion: The framework generates high-quality, animatable 4D avatars from a single image, with strong consistency and animation quality. Abstract: We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. While recent advances have shown promising results in 4D avatar creation, existing methods either require extensive multiview data or struggle with shape accuracy and identity consistency. To address these limitations, we propose a comprehensive system that leverages shape, image, and video priors to create full-view, animatable avatars. Our approach first obtains an initial coarse shape through 3D-GAN inversion. Then, it enhances multiview textures using depth-guided warping signals for cross-view consistency with the help of the image diffusion model. To handle expression animation, we incorporate a video prior with synchronized driving signals across viewpoints. We further introduce Consistent-Inconsistent training to effectively handle data inconsistencies during 4D reconstruction. Experimental results demonstrate that our method achieves superior quality compared to the prior art, while maintaining consistency across different viewpoints and expressions.

[144] Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform

Xianpan Zhou

Main category: cs.CV

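TL;DR: Tiger200K is a high-quality, manually curated video dataset that aims to reduce the dependence of open-source text-to-video generation models on proprietary training data.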

Details

Motivation: Existing open-source datasets lack the quality needed to fine-tune advanced video generation models, so a high-quality video dataset is needed. Method: Manually curates user-generated content (UGC) using a simple but effective pipeline that includes shot boundary detection, OCR, border detection, motion filtering, and bilingual captioning. Result: Tiger200K provides high-quality, temporally consistent video-text pairs that support the optimization of video generation models. Conclusion: Tiger200K will continue to expand as an open-source project to advance research and applications in video generation models. Abstract: The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detection, motion filtering and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: https://tinytigerpan.github.io/tiger200k/

[145] Breast density in MRI: an AI-based quantification and relationship to assessment in mammography

Yaqian Chen,Lin Li,Hanxue Gu,Haoyu Dong,Derek L. Nguyen,Allan D. Kirk,Maciej A. Mazurowski,E. Shelley Hwang

Main category: cs.CV

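TL;DR: The study uses a machine-learning algorithm to assess breast density in MRI data and finds that it correlates with mammographic density while capturing information unique to MRI, which may help improve breast cancer risk prediction in the future.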

Details

Motivation: Breast density is an important risk factor for breast cancer, and MRI as an adjunct modality can provide a more comprehensive assessment, but its 3D nature poses analytic challenges. Method: Applies an in-house machine-learning algorithm to analyze breast density in three MRI datasets. Result: Breast density is consistent across datasets and age groups and correlates with mammographic density, but MRI captures some unique information. Conclusion: Future work will explore how to integrate MR breast density to improve breast cancer risk prediction tools. Abstract: Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-house machine-learning algorithm to assess breast density on normal breasts in three MRI datasets. Breast density was consistent across different datasets (0.104 - 0.114). Analysis across different age groups also demonstrated strong consistency across datasets and confirmed a trend of decreasing density with age as reported in previous studies. MR breast density was correlated with mammographic breast density, although some notable differences suggest that certain breast density components are captured only on MRI. Future work will determine how to integrate MR breast density with current tools to improve future breast cancer risk prediction.

[146] Automated Measurement of Eczema Severity with Self-Supervised Learning

Neelesh Kumar,Oya Aran

Main category: cs.CV

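TL;DR: Proposes a self-supervised learning framework for automated eczema diagnosis that works with limited training data and outperforms existing deep learning methods.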

Details

Motivation: Existing deep-learning-based eczema diagnosis methods need large amounts of annotated data, which are hard to obtain in practice. Method: Uses a two-stage framework: 1) few-shot segmentation of the eczema region with SegGPT; 2) extraction of DINO features followed by an MLP for eczema severity classification. Result: On a dataset of in-the-wild eczema images, the method achieves a weighted F1 of 0.67 ± 0.01, outperforming ResNet-18 and Vision Transformer. Conclusion: Self-supervised learning shows promise for skin diagnosis where labeled data is scarce. Abstract: Automated diagnosis of eczema using images acquired from a digital camera can enable individuals to self-monitor their recovery. The process entails first segmenting out the eczema region from the image and then measuring the severity of eczema in the segmented region. The state-of-the-art methods for automated eczema diagnosis rely on deep neural networks such as convolutional neural networks (CNNs) and have shown impressive performance in accurately measuring the severity of eczema. However, these methods require a massive volume of annotated data to train, which can be hard to obtain. In this paper, we propose a self-supervised learning framework for automated eczema diagnosis under a limited training data regime. Our framework consists of two stages: i) Segmentation, where we use an in-context learning based algorithm called SegGPT for few-shot segmentation of the eczema region from the image; ii) Feature extraction and classification, where we extract DINO features from the segmented regions and feed them to a multi-layer perceptron (MLP) for 4-class classification of eczema severity. When evaluated on a dataset of annotated "in-the-wild" eczema images, we show that our method outperforms (Weighted F1: 0.67 ± 0.01) the state-of-the-art deep learning methods such as finetuned ResNet-18 (Weighted F1: 0.44 ± 0.16) and Vision Transformer (Weighted F1: 0.40 ± 0.22). Our results show that self-supervised learning can be a viable solution for automated skin diagnosis where labeled data is scarce.
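
Editor's sketch: the results above are reported as weighted F1 over four severity classes. The function below computes the usual support-weighted F1 (per-class F1 averaged with weights proportional to each class's number of true samples). It assumes that definition matches the authors' metric, and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1
    return score
```

For instance, with hypothetical labels `y_true = [0, 0, 1, 1, 2, 3]` and `y_pred = [0, 1, 1, 1, 2, 0]`, the per-class F1 scores are 0.5, 0.8, 1.0, and 0.0, and the weighted F1 is 0.6. Weighting by support keeps rare severity grades from dominating the score, at the cost of hiding poor performance on them.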

[147] Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Yassir Benhammou,Alessandro Tiberio,Gabriel Trautmann,Suman Kalyan

Main category: cs.CV

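TL;DR: The MILS framework claims to solve multimodal tasks without training, but its iterative optimization incurs high computational cost, whereas other models (such as BLIP-2 and GPT-4V) achieve comparable results in a single pass.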

Details

Motivation: To expose the hidden computational cost of the MILS framework in zero-shot image captioning and challenge the claim that it is efficient without any training. Method: Compares the performance and computational cost of MILS against models such as BLIP-2 and GPT-4V, quantifying the resource consumption of its iterative optimization. Result: MILS's strong performance depends on expensive multi-step refinement, and its computational cost is substantially higher than that of single-pass alternatives. Conclusion: The practical value of MILS may be limited by its high computational cost, and multimodal model design needs to weigh performance against efficiency. Abstract: MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by leveraging an iterative, LLM-CLIP based approach for zero-shot image captioning. While this MILS approach demonstrates good performance, our investigation reveals that this success comes at a hidden, substantial computational cost due to its expensive multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results through a streamlined, single-pass approach. We hypothesize that the significant overhead inherent in MILS's iterative process may undermine its practical benefits, thereby challenging the narrative that zero-shot performance can be attained without incurring heavy resource demands. This work is the first to expose and quantify the trade-offs between output quality and computational cost in MILS, providing critical insights for the design of more efficient multimodal models.

[148] Shape-Guided Clothing Warping for Virtual Try-On

Xiaoyu Han,Shunyuan Zheng,Zonglin Li,Chenyang Wang,Xin Sun,Quanling Meng

Main category: cs.CV

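TL;DR: The paper proposes SCW-VTON, a shape-guided clothing warping method that uses global shape constraints and limb textures to improve the realism and consistency of virtual try-on.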

Details

Motivation: Existing methods lack precise control over details when warping clothing, leading to shape inconsistencies between the clothing and the body and to distortions in limb regions. Method: Designs a dual-path clothing warping module (a shape path and a flow path) and a limb reconstruction network, combining global shape constraints with limb textures. Result: Experiments show that SCW-VTON outperforms existing methods in clothing shape consistency and detail control. Conclusion: SCW-VTON significantly improves the realism and consistency of virtual try-on, and the code is open source. Abstract: Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/SCW-VTON.

[149] Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation

Yunxuan Cai,Sitao Xiang,Zongjian Li,Haiwei Chen,Yajie Zhao

Main category: cs.CV

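TL;DR: The paper proposes a semantically controllable digital face modeling method based on generative networks: a diffusion model is used to build a high-quality 3D face database, and an efficient GAN-based generator supports semantic attribute input and post-editing.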

Details

Motivation: Conventional digital face modeling is constrained by the need for data capture devices, manual labor, and suitable actors, which limits model diversity, expressiveness, and control. The paper aims to improve control over the modeling process with a generative network. Method: Proposes a new data generation pipeline that uses a pre-trained diffusion model to create a high-quality 3D face database, with a normalization module that converts synthesized data into scan-quality data, and develops a GAN-based generator that supports semantic attribute input and post-editing. Result: Produces 44,000 face models, an efficient generator, and an asset refinement component, assembled into a complete system and integrated into a web-based interactive tool. Conclusion: The proposed method improves control and diversity in digital face modeling and is validated through experiments and evaluation. The tool is intended to be released publicly with the paper. Abstract: Digital modeling and reconstruction of human faces serve various applications. However, its availability is often hindered by the requirements of data capturing devices, manual labor, and suitable actors. This situation restricts the diversity, expressiveness, and control over the resulting models. This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. To enhance diversity beyond the limited human faces scanned in a controlled setting, we introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. Our proposed normalization module converts synthesized data from the diffusion model into high-quality scanned data. Using the 44,000 face models we obtained, we further developed an efficient GAN-based generator. This generator accepts semantic attributes as input, and generates geometry and albedo. It also allows continuous post-editing of attributes in the latent space. Our asset refinement component subsequently creates physically-based facial assets. We introduce a comprehensive system designed for creating and editing high-quality face assets. Our proposed model has undergone extensive experiments, comparison and evaluation. We also integrate everything into a web-based interactive tool. We aim to make this tool publicly available with the release of the paper.

[150] Diffusion Bridge Models for 3D Medical Image Translation

Shaorong Zhang,Tamoghna Chattopadhyay,Sophia I. Thomopoulos,Jose-Luis Ambite,Paul M. Thompson,Greg Ver Steeg

Main category: cs.CV

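TL;DR: Proposes a diffusion bridge model for 3D brain image translation between the T1w MRI and DTI modalities, to reduce DTI acquisition time and support cross-modality data augmentation.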

Details

Motivation: DTI is time-consuming to acquire, whereas T1w MRI is more readily available, so an efficient way to translate between the two is needed. Method: Uses a diffusion bridge model to generate high-quality DTI FA images from T1w images and vice versa, and evaluates the generated images with several metrics. Result: The model performs well at capturing anatomical structures and preserving white matter integrity information, and the generated images achieve performance comparable to real data on classification tasks. Conclusion: The model offers a promising solution for improving neuroimaging datasets and supporting clinical decision-making, with potential impact on research and clinical practice. Abstract: Diffusion tensor imaging (DTI) provides crucial insights into the microstructure of the human brain, but it can be time-consuming to acquire compared to more readily available T1-weighted (T1w) magnetic resonance imaging (MRI). To address this challenge, we propose a diffusion bridge model for 3D brain image translation between T1w MRI and DTI modalities. Our model learns to generate high-quality DTI fractional anisotropy (FA) images from T1w images and vice versa, enabling cross-modality data augmentation and reducing the need for extensive DTI acquisition. We evaluate our approach using perceptual similarity, pixel-level agreement, and distributional consistency metrics, demonstrating strong performance in capturing anatomical structures and preserving information on white matter integrity. The practical utility of the synthetic data is validated through sex classification and Alzheimer's disease classification tasks, where the generated images achieve comparable performance to real data. Our diffusion bridge model offers a promising solution for improving neuroimaging datasets and supporting clinical decision-making, with the potential to significantly impact neuroimaging research and clinical practice.

[151] Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Guo Chen,Zhiqi Li,Shihao Wang,Jindong Jiang,Yicheng Liu,Lidong Lu,De-An Huang,Wonmin Byeon,Matthieu Le,Tuomas Rintamaki,Tyler Poon,Max Ehrlich,Tong Lu,Limin Wang,Bryan Catanzaro,Jan Kautz,Andrew Tao,Zhiding Yu,Guilin Liu

Main category: cs.CV

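TL;DR: Eagle 2.5 is a family of frontier vision-language models for long-context multimodal learning that uses Automatic Degrade Sampling and Image Area Preservation to improve long-video and high-resolution image understanding, with efficiency optimizations for long-context data training.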

Details

Motivation: To address the limitations of existing vision-language models in long video comprehension and high-resolution image understanding. Method: Proposes a training framework that includes Automatic Degrade Sampling and Image Area Preservation, and optimizes the efficiency of long-context data training. Result: Eagle 2.5-8B reaches 72.4% on the Video-MME benchmark, on par with top-tier commercial and open-source models. Conclusion: Eagle 2.5 provides a strong solution for long-context multimodal learning and substantially improves on existing models. Abstract: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.

[152] DRAWER: Digital Reconstruction and Articulation With Environment Realism

Hongchi Xia,Entong Su,Marius Memmel,Arhan Jain,Raymond Yu,Numfor Mbiziwo-Tiapo,Ali Farhadi,Abhishek Gupta,Shenlong Wang,Wei-Chiu Ma

Main category: cs.CV

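TL;DR: The DRAWER framework converts a video of a static indoor scene into a photorealistic, interactive digital environment that is compatible with game engines and robotic simulation platforms.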

Details

Motivation: Creating virtual digital replicas unlocks potential in domains such as gaming and robotics. Method: Uses a reconstruction module based on a dual scene representation and an articulation module, which handle geometric detail and interactive functionality respectively. Result: The resulting virtual environment is photorealistic, interactive, and runs in real time, making it suitable for games and robotic simulation. Conclusion: DRAWER demonstrates the potential of automatically creating interactive environments for gaming and robotics applications. Abstract: Creating virtual digital replicas from real-world data unlocks significant potential across domains like gaming and robotics. In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. Our approach centers on two main contributions: (i) a reconstruction module based on a dual scene representation that reconstructs the scene with fine-grained geometric details, and (ii) an articulation module that identifies articulation types and hinge positions, reconstructs simulatable shapes and appearances and integrates them into the scene. The resulting virtual environment is photorealistic, interactive, and runs in real time, with compatibility for game engines and robotic simulation platforms. We demonstrate the potential of DRAWER by using it to automatically create an interactive game in Unreal Engine and to enable real-to-sim-to-real transfer for robotics applications.

Weiye Xu,Jiahao Wang,Weiyun Wang,Zhe Chen,Wengang Zhou,Aijun Yang,Lewei Lu,Houqiang Li,Xiaohua Wang,Xizhou Zhu,Wenhai Wang,Jifeng Dai,Jinguo Zhu

Main category: cs.CV

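TL;DR: VisuLogic is a new visual reasoning benchmark of 1,000 human-verified problems for evaluating the genuine visual reasoning capabilities of multimodal large language models (MLLMs).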

Details

Motivation: Current reasoning evaluations of MLLMs rely on text descriptions and allow language-based reasoning shortcuts, so they fail to measure genuine visual reasoning. Method: Introduces the VisuLogic benchmark, which spans six problem categories, evaluates the visual reasoning of MLLMs, and analyzes their failure modes. Result: Most MLLMs score below 30% accuracy, far below the 51.4% achieved by humans, revealing a significant gap in visual reasoning. Conclusion: VisuLogic exposes the shortcomings of MLLMs in visual reasoning and provides a supplementary training dataset and a reinforcement-learning baseline to drive progress. Abstract: Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.

[154] StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

Cailin Zhuang,Yaoqi Hu,Xuanyang Zhang,Wei Cheng,Jiacheng Bao,Shengqi Liu,Yiying Yang,Xianfang Zeng,Gang Yu,Ming Li

Main category: cs.CV

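TL;DR: StyleMe3D is a holistic framework for 3D Gaussian Splatting (3DGS) style transfer that addresses problems 3DGS faces in stylized scenes, such as fragmented textures and semantic misalignment.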

Details

Motivation: 3DGS excels at photorealistic scene reconstruction but performs poorly in stylized scenarios (e.g., cartoons, games), suffering from fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. Method: StyleMe3D combines multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement, and introduces four new components: Dynamic Style Score Distillation (DSSD), Contrastive Style Descriptor (CSD), Simultaneously Optimized Scale (SOS), and 3D Gaussian Quality Assessment (3DG-QA). Result: On the NeRF synthetic dataset and the tandt db dataset, StyleMe3D outperforms existing methods in preserving geometric details and ensuring stylistic consistency while maintaining real-time rendering. Conclusion: The work bridges photorealistic 3DGS and artistic stylization, opening new applications in gaming, virtual worlds, and digital art. Abstract: 3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3DGS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on the NeRF synthetic dataset (objects) and the tandt db dataset (scenes), StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3DGS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.

cs.GR [Back]

[155] PyFRep: Shape Modeling with Differentiable Function Representation

Pierre-Alain Fayolle,Evgenii Maltsev

Main category: cs.GR

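TL;DR: Proposes a differentiable geometric modeling framework based on the Function Representation (FRep) that supports automatic differentiation, and demonstrates applications in curvature estimation, signed distance function computation, and shape parameter fitting.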

Details

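Motivation: To make geometric modeling differentiable so that derivatives with respect to space or shape parameters are easy to obtain, supporting more efficient shape analysis and optimization. Method: The framework is built on modern automatic differentiation libraries and represents shapes with FRep, enabling differentiable computation with respect to space and shape parameters. Result: Successfully applied to curvature estimation, signed distance function computation, and shape parameter fitting; the framework is open source. Conclusion: The framework provides differentiable tools for geometric modeling, supports multiple applications, and is practical and extensible. Abstract: We propose a framework for performing differentiable geometric modeling based on the Function Representation (FRep). The framework is built on top of modern libraries for performing automatic differentiation, allowing us to obtain derivatives w.r.t. space or shape parameters. We demonstrate possible applications of this framework: curvature estimation for shape interrogation, signed distance function computation and approximation, and fitting shape parameters of a parametric model to data. Our framework is released as open-source.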
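
To make the idea of differentiating an FRep concrete, here is a minimal sketch that is not taken from PyFRep. It assumes the common FRep sign convention (f >= 0 inside the shape) and uses a tiny hand-written forward-mode dual-number class in place of a real automatic differentiation library; the names `Dual`, `sphere`, `grad_space`, and `d_radius` are hypothetical.

```python
import math

class Dual:
    """Number val + der*eps with eps**2 == 0; der carries a derivative."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return self._lift(other) - self

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule.
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)
    __rmul__ = __mul__

def sqrt(x):
    """Square root of a Dual (undefined derivative at x.val == 0)."""
    r = math.sqrt(x.val)
    return Dual(r, x.der / (2.0 * r))

def sphere(p, radius):
    """Assumed FRep of a sphere at the origin: f >= 0 inside, f < 0 outside."""
    x, y, z = p
    return radius - sqrt(x * x + y * y + z * z)

def grad_space(f, point):
    """Gradient of f w.r.t. the spatial coordinates, seeding one axis at a time."""
    grad = []
    for i in range(3):
        p = [Dual(c, 1.0 if j == i else 0.0) for j, c in enumerate(point)]
        grad.append(f(p).der)
    return grad

def d_radius(point, radius):
    """Derivative of the sphere function w.r.t. its shape parameter (the radius)."""
    p = [Dual(c) for c in point]
    return sphere(p, Dual(radius, 1.0)).der
```

For the point (3, 0, 4) and radius 1, `grad_space(lambda p: sphere(p, 1.0), [3.0, 0.0, 4.0])` gives the inward unit direction, and `d_radius` gives the sensitivity to the radius. A production framework would delegate both to an autodiff library rather than a hand-rolled class.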

[156] PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling

Alara Dirik,Tuanfeng Wang,Duygu Ceylan,Stefanos Zafeiriou,Anna Frühstück

Main category: cs.GR

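TL;DR: PRISM is a unified framework that supports multiple image generation and editing tasks, achieving consistency by jointly generating RGB images and intrinsic layers (X layers).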

Details

Motivation: Existing methods infer intrinsic properties individually or rely on multiple models; PRISM aims to provide a unified solution. Method: Starting from a pre-trained text-to-image diffusion model, proposes an effective fine-tuning strategy that generates RGB images and X layers simultaneously. Result: Experiments show that PRISM is competitive in intrinsic image decomposition and conditional image generation while preserving the base model's text-to-image generation capability. Conclusion: PRISM offers an efficient and consistent solution for multi-task image generation and editing. Abstract: We present PRISM, a unified framework that enables multiple image generation and editing tasks in a single foundational model. Starting from a pre-trained text-to-image diffusion model, PRISM proposes an effective fine-tuning strategy to produce RGB images along with intrinsic maps (referred to as X layers) simultaneously. Unlike previous approaches, which infer intrinsic properties individually or require separate models for decomposition and conditional generation, PRISM maintains consistency across modalities by generating all intrinsic layers jointly. It supports diverse tasks, including text-to-RGBX generation, RGB-to-X decomposition, and X-to-RGBX conditional generation. Additionally, PRISM enables both global and local image editing through conditioning on selected intrinsic layers and text prompts. Extensive experiments demonstrate the competitive performance of PRISM both for intrinsic image decomposition and conditional image generation while preserving the base model's text-to-image generation capability.

[157] HoLa: B-Rep Generation using a Holistic Latent Representation

Yilin Liu,Duoteng Xu,Xingyao Yu,Xiang Xu,Daniel Cohen-Or,Hao Zhang,Hui Huang

Main category: cs.GR

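TL;DR: Proposes a new representation for CAD models that unifies geometric and topological information in a holistic latent (HoLa) space, significantly improving the validity of generated models and reducing training complexity.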

Details

Motivation: Traditional CAD model representations fall short in handling geometric and topological relations in a unified way, leading to redundancy and inconsistency in generated models. Method: A neural intersection network learns the geometric relations between surfaces, recasting topology learning as a geometric reconstruction problem and yielding a compact latent space defined only on surfaces. Result: The validity rate of generated models improves to 82%, well above the roughly 50% of existing methods. Conclusion: The HoLa space offers an efficient and accurate approach to CAD model generation that reduces redundancy and training complexity. Abstract: We introduce a novel representation for learning and generating Computer-Aided Design (CAD) models in the form of $\textit{boundary representations}$ (B-Reps). Our representation unifies the continuous geometric properties of B-Rep primitives in different orders (e.g., surfaces and curves) and their discrete topological relations in a $\textit{holistic latent}$ (HoLa) space. This is based on the simple observation that the topological connection between two surfaces is intrinsically tied to the geometry of their intersecting curve. Such a prior allows us to reformulate topology learning in B-Reps as a geometric reconstruction problem in Euclidean space. Specifically, we eliminate the presence of curves, vertices, and all the topological connections in the latent space by learning to distinguish and derive curve geometries from a pair of surface primitives via a neural intersection network. To this end, our holistic latent space is only defined on surfaces but encodes a full B-Rep model, including the geometry of surfaces, curves, vertices, and their topological relations. Our compact and holistic latent space facilitates the design of a first diffusion-based generator to take on a large variety of inputs including point clouds, single/multi-view images, 2D sketches, and text prompts. Our method significantly reduces ambiguities, redundancies, and incoherences among the generated B-Rep primitives, as well as training complexities inherent in prior multi-step B-Rep learning pipelines, while achieving greatly improved validity rate over current state of the art: 82% vs. $\approx$50%.

[158] Can AI Recognize the Style of Art? Analyzing Aesthetics through the Lens of Style Transfer

Yunha Yeo,Daeho Um

Main category: cs.GR

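TL;DR: Analyzes AI style transfer from an aesthetic perspective, compares the aesthetic results of images generated by CNN-based and Transformer-based models, and discusses directions for future research.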

Details

Motivation: Although style transfer has attracted wide attention, most research focuses on algorithmic improvements and lacks analysis from an aesthetic perspective; this paper aims to fill that gap. Method: Analyzes CNN-based and Transformer-based style transfer algorithms and compares the generated images through aesthetic evaluation. Result: Reveals the limitations of current style transfer techniques and identifies key elements that constitute the style of artworks. Conclusion: Proposes directions for future research on style transfer, emphasizing the combination of aesthetics and AI techniques. Abstract: This study investigates how artificial intelligence (AI) recognizes style through style transfer, an AI technique that generates a new image by applying the style of one image to another. Despite the considerable interest that style transfer has garnered among researchers, most efforts have focused on enhancing the quality of output images through advanced AI algorithms. In this paper, we approach style transfer from an aesthetic perspective, thereby bridging AI techniques and aesthetics. We analyze two style transfer algorithms: one based on convolutional neural networks (CNNs) and the other utilizing recent Transformer models. By comparing the images produced by each, we explore the elements that constitute the style of artworks through an aesthetic analysis of the style transfer results. We then elucidate the limitations of current style transfer techniques. Based on these limitations, we propose potential directions for future research on style transfer techniques.

[159] SEGA: Drivable 3D Gaussian Head Avatar from a Single Image

Chen Guo,Zhuo Su,Jian Wang,Shuang Li,Xu Chang,Zhaohu Li,Yang Zhao,Guidong Wang,Ruqi Huang

Main category: cs.GR

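TL;DR: SEGA proposes a method for creating drivable 3D Gaussian head avatars from a single image, combining generalized prior models with a hierarchical UV-space Gaussian Splatting framework to achieve robust generalization to unseen identities and real-time performance.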

Details

Motivation: Existing methods rely on multiple images or multi-view inputs, which limits their practicality, so a high-quality method for generating 3D avatars from a single image is needed. Method: SEGA combines 2D and 3D prior models, adopts a hierarchical UV-space Gaussian Splatting framework, separates dynamic and static facial components with a dual-branch architecture, and supports person-specific fine-tuning. Result: Experiments show that SEGA outperforms existing methods in generalization ability, identity preservation, and expression realism, with real-time animation and rendering. Conclusion: SEGA offers a practical and efficient solution for single-image 3D avatar creation, advancing virtual reality and digital entertainment. Abstract: Creating photorealistic 3D head avatars from limited input has become increasingly important for applications in virtual reality, telepresence, and digital entertainment. While recent advances like neural rendering and 3D Gaussian splatting have enabled high-quality digital human avatar creation and animation, most methods rely on multiple images or multi-view inputs, limiting their practicality for real-world use. In this paper, we propose SEGA, a novel approach for Single-imagE-based 3D drivable Gaussian head Avatar creation that combines generalized prior models with a new hierarchical UV-space Gaussian Splatting framework. SEGA seamlessly combines priors derived from large-scale 2D datasets with 3D priors learned from multi-view, multi-expression, and multi-ID data, achieving robust generalization to unseen identities while ensuring 3D consistency across novel viewpoints and expressions. We further present a hierarchical UV-space Gaussian Splatting framework that leverages FLAME-based structural priors and employs a dual-branch architecture to disentangle dynamic and static facial components effectively. The dynamic branch encodes expression-driven fine details, while the static branch focuses on expression-invariant regions, enabling efficient parameter inference and precomputation. This design maximizes the utility of limited 3D data and achieves real-time performance for animation and rendering. Additionally, SEGA performs person-specific fine-tuning to further enhance the fidelity and realism of the generated avatars. Experiments show our method outperforms state-of-the-art approaches in generalization ability, identity preservation, and expression realism, advancing one-shot avatar creation for practical applications.

[160] Interdisciplinary Integration of Remote Sensing -- A Review with Four Examples

Zichen Jin

Main category: cs.GR

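TL;DR: This paper briefly reviews the interdisciplinary integration of remote sensing with other disciplines, highlighting examples from four fields: ecology, mathematical morphology, machine learning, and electronics.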

Details

Motivation: Examines how remote sensing, as a high-level discipline, depends on and integrates other basic and applied disciplines and technologies to advance. Method: Analyzes the integration of remote sensing with other disciplines through four concrete examples (ecology, mathematical morphology, machine learning, and electronics). Result: Shows the broad application and technical integration of remote sensing across multiple fields, highlighting its interdisciplinary nature. Conclusion: The development of remote sensing depends on interdisciplinary integration; further collaboration and innovation across more fields should be explored. Abstract: As a high-level discipline, the development of remote sensing depends on the contribution of many other basic and applied disciplines and technologies. For example, due to the close relationship between remote sensing and photogrammetry, remote sensing would inevitably integrate disciplines such as optics and color science. Also, remote sensing integrates the knowledge of electronics in the conversion from optical signals to electrical signals via CCD (Charge-Coupled Device) or other image sensors. Moreover, when conducting object identification and classification with remote sensing data, mathematical morphology and other digital image processing technologies are used. These examples are only the tip of the iceberg of interdisciplinary integration of remote sensing. This work briefly reviews the interdisciplinary integration of remote sensing with four examples: ecology, mathematical morphology, machine learning, and electronics.

[161] A Controllable Appearance Representation for Flexible Transfer and Editing

Santiago Jimenez-Navarro,Julia Guerrero-Viu,Belen Masia

Main category: cs.GR

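TL;DR: Proposes a self-supervised method that uses FactorVAE to produce a compact, disentangled latent representation for intuitive editing of material appearance.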

Details

Motivation: Avoid the biases introduced by human annotation and obtain an unsupervised, disentangled representation of material appearance and illumination. Method: Self-supervised learning with FactorVAE, trained on an unlabeled dataset, with an IP-Adapter guiding a diffusion model for appearance transfer and editing. Result: The model achieves strong disentanglement and interpretability without explicit supervision and supports intuitive editing of attributes such as hue and glossiness. Conclusion: The method provides fine-grained appearance control suited to transferring and editing material appearance. Abstract: We present a method that computes an interpretable representation of material appearance within a highly compact, disentangled latent space. This representation is learned in a self-supervised fashion using an adapted FactorVAE. We train our model with a carefully designed unlabeled dataset, avoiding possible biases induced by human-generated labels. Our model demonstrates strong disentanglement and interpretability by effectively encoding material appearance and illumination, despite the absence of explicit supervision. Then, we use our representation as guidance for training a lightweight IP-Adapter to condition a diffusion pipeline that transfers the appearance of one or more images onto a target geometry, and allows the user to further edit the resulting appearance. Our approach offers fine-grained control over the generated results: thanks to the well-structured compact latent space, users can intuitively manipulate attributes such as hue or glossiness in image space to achieve the desired final appearance.

cs.CL [Back]

[162] Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning

ByteDance Seed,Yufeng Yuan,Yu Yue,Mingxuan Wang,Xiaochen Zuo,Jiaze Chen,Lin Yan,Wenyuan Xu,Chi Zhang,Xin Liu,Chengyi Wang,TianTian Fan,Lingjun Liu,Qiying Yu,Xiangpeng Wei,Zhiqi Lin,Ruofei Zhu,Qingping Yang,Chengzhi Wei,Jerry He,Guanlin Liu,Zheng Wu,Xiangyu Yu,Zhicheng Liu,Jingjing Xu,Jiangjie Chen,Haojie Pan,Shengding Hu,Zhengyin Du,Wenqi Wang,Zewei Sun,Chenwei Lou,Bole Ma,Zihan Wang,Mofan Zhang,Wang Zhang,Gaohong Liu,Kaihua Jiang,Haibin Lin,Ru Zhang,Juncai Liu,Li Han,Jinxin Chi,Wenqiang Zhang,Jiayi Xu,Jun Yuan,Zhen Xiao,Yuqiao Xian,Jingqiao Wu,Kai Hua,Na Zhou,Jianhui Duan,Heyang Lu,Changbao Wang,Jinxiang Ou,Shihang Wang,Xiaoran Jin,Xuesong Yao,Chengyin Xu,Wenchang Ma,Zhecheng An,Renming Pang,Xia Xiao,Jing Su,Yuyu Zhang,Tao Sun,Kaibo Liu,Yifan Sun,Kai Shen,Sijun Zhang,Yiyuan Ma,Xingyan Bin,Ji Li,Yao Luo,Deyi Liu,Shiyi Zhan,Yunshui Li,Yuan Yang,Defa Zhu,Ke Shen,Chenggang Li,Xun Zhou,Liang Xiang,Yonghui Wu

Main category: cs.CL

TL;DR: Seed-Thinking-v1.5 is a model that thinks before responding and performs strongly across multiple benchmarks, especially in STEM and coding.

Details

Motivation: Improve the model's reasoning ability and verify its generalization across a wide range of tasks. Method: A Mixture-of-Experts (MoE) architecture with 20B activated and 200B total parameters. Result: Strong results on benchmarks such as AIME 2024, Codeforces, and GPQA, and an 8% higher win rate than DeepSeek R1 on non-reasoning tasks. Conclusion: Seed-Thinking-v1.5 shows strong reasoning and generalization, and the internal benchmarks will be released publicly to support future research. Abstract: We introduce Seed-Thinking-v1.5, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed-Thinking-v1.5 achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed-Thinking-v1.5 is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research.

[163] Uncovering Conspiratorial Narratives within Arabic Online Content

Djamila Mohdeb,Meriem Laifa,Zineb Guemraoui,Dalila Behih

Main category: cs.CL

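TL;DR: This study computationally analyzes Arabic online content, combining Named Entity Recognition and topic modeling (the Top2Vec algorithm) to identify and classify conspiratorial narratives in Arabic blogs and Facebook, revealing six categories.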

Details

Motivation: Fill the gap left by conspiracy theory research that has traditionally focused on English-language content or offline data, and reveal how conspiracy theories are embedded in Arabic social media and shaped by context. Method: Computational analysis of Arabic blog and Facebook data using Named Entity Recognition and the Top2Vec algorithm. Result: Identifies six categories of conspiratorial narratives: gender/feminist, geopolitical, government cover-ups, apocalyptic, Judeo-Masonic, and geoengineering. Conclusion: The study reveals how conspiracy theories manifest and evolve in Arabic digital spaces, improving understanding of their role in public discourse in the Arab world. Abstract: This study investigates the spread of conspiracy theories in Arabic digital spaces through computational analysis of online content. By combining Named Entity Recognition and Topic Modeling techniques, specifically the Top2Vec algorithm, we analyze data from Arabic blogs and Facebook to identify and classify conspiratorial narratives. Our analysis uncovers six distinct categories: gender/feminist, geopolitical, government cover-ups, apocalyptic, Judeo-Masonic, and geoengineering. The research highlights how these narratives are deeply embedded in Arabic social media discourse, shaped by regional historical, cultural, and sociopolitical contexts. By applying advanced Natural Language Processing methods to Arabic content, this study addresses a gap in conspiracy theory research, which has traditionally focused on English-language content or offline data. The findings provide new insights into the manifestation and evolution of conspiracy theories in Arabic digital spaces, enhancing our understanding of their role in shaping public discourse in the Arab world.

[164] MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Jaime Raldua Veuthey,Zainab Ali Majid,Suhas Hariharan,Jacob Haimes

Main category: cs.CL

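TL;DR: MEQA is a meta-evaluation framework for assessing the quality of question and answer (QA) benchmarks; it aims to provide standardized assessments and quantifiable scores and is validated in the cybersecurity domain.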

Details

Motivation: As large language models (LLMs) advance, their societal impact grows, so rigorous evaluation is needed; yet assessing the quality of existing benchmarks remains a gap, which MEQA addresses. Method: Proposes the MEQA framework, which meta-evaluates QA benchmarks through standardized assessments and quantifiable scores, and validates it in the cybersecurity domain with human and LLM evaluators. Result: MEQA identifies the strengths and weaknesses of cybersecurity benchmarks, demonstrating its effectiveness. Conclusion: MEQA provides a practical tool for meta-evaluating QA benchmarks, particularly relevant to cybersecurity, and highlights the dual nature of AI models as defensive tools and security threats. Abstract: As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments and quantifiable scores and to enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

[165] A Baseline for Self-state Identification and Classification in Mental Health Data: CLPsych 2025 Task

Laerdon Kim

Main category: cs.CL

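TL;DR: The paper presents a few-shot learning approach based on a 4-bit quantized Gemma 2 9B model for classifying self-states in Reddit mental health data; a preprocessing step identifies relevant sentences, which are then classified in a binary fashion, outperforming the authors' other method.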

Details

Motivation: Classify self-states in mental health data, particularly from social media platforms such as Reddit, to support mental health analysis and intervention. Method: Few-shot learning with a 4-bit quantized Gemma 2 9B model, combined with a preprocessing step that identifies relevant sentences and then classifies them in a binary fashion. Result: The system places third out of fourteen submitted systems, with a test-time recall of 0.579. Conclusion: The sentence chunking step substantially improves performance because it better matches the granularity of the human annotations and simplifies the task. Abstract: We present a baseline for the CLPsych 2025 A.1 task: classifying self-states in mental health data taken from Reddit. We use few-shot learning with a 4-bit quantized Gemma 2 9B model and a data preprocessing step which first identifies relevant sentences indicating self-state evidence, and then performs a binary classification to determine whether the sentence is evidence of an adaptive or maladaptive self-state. This system outperforms our other method, which relies on an LLM to highlight spans of variable length independently. We attribute the performance of our model to the benefits of this sentence chunking step for two reasons: partitioning posts into sentences 1) broadly matches the granularity at which self-states were human-annotated and 2) simplifies the task for our language model to a binary classification problem. Our system places third out of fourteen systems submitted for Task A.1, achieving a test-time recall of 0.579.

[166] LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models

Kang He,Kaushik Roy

Main category: cs.CL

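TL;DR: LogicTree is a modular framework that uses algorithm-guided search to address the challenges LLMs face in complex logical reasoning, significantly improving proof accuracy.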

Details

Motivation: LLMs struggle with systematic exploration and logical coherence in complex logical reasoning, especially in tasks with a large premise space. Method: Proposes the LogicTree framework, which combines a caching mechanism with a linearized premise search to streamline reasoning, and introduces LLM-free heuristics for premise prioritization. Result: On five datasets with GPT-4o, LogicTree surpasses CoT and ToT by average gains of 23.6% and 12.5%, respectively; within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average. Conclusion: Through structured proof exploration and efficient premise selection, LogicTree significantly improves the logical reasoning of LLMs. Abstract: Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching for the right combination of premises at each reasoning step is inherently challenging in tasks with a large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate a caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.

[167] PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models

Nusrat Jahan Prottasha,Upama Roy Chowdhury,Shetu Mohanto,Tasfia Nuzhat,Abdullah As Sami,Md Shamol Ali,Md Shohanur Islam Sobuj,Hafijur Raman,Md Kowsher,Ozlem Ozmen Garibay

Main category: cs.CL

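TL;DR: This survey reviews parameter-efficient fine-tuning (PEFT) techniques, analyzing their motivations, taxonomy, effectiveness, and future directions, with the aim of reducing the resource cost of fine-tuning large models.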

Details

Motivation: Conventional fine-tuning of large models (such as LLMs and VLMs) requires extensive computational resources and task-specific data, is costly, and suffers from problems such as overfitting and catastrophic forgetting. PEFT offers an efficient alternative by updating only a small number of parameters. Method: Presents a taxonomy of PEFT methods (additive, selective, reparameterized, hybrid, and unified) and systematically compares their mechanisms and trade-offs. Result: PEFT performs strongly in language, vision, and generative modeling at lower resource cost. Conclusion: PEFT offers an efficient and sustainable path to practical use of large models; open challenges such as scalability and interpretability remain. Abstract: Large models such as Large Language Models (LLMs) and Vision Language Models (VLMs) have transformed artificial intelligence, powering applications in natural language processing, computer vision, and multimodal learning. However, fully fine-tuning these models remains expensive, requiring extensive computational resources, memory, and task-specific data. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a promising solution that allows adapting large models to downstream tasks by updating only a small portion of parameters. This survey presents a comprehensive overview of PEFT techniques, focusing on their motivations, design principles, and effectiveness. We begin by analyzing the resource and accessibility challenges posed by traditional fine-tuning and highlight key issues, such as overfitting, catastrophic forgetting, and parameter inefficiency. We then introduce a structured taxonomy of PEFT methods -- grouped into additive, selective, reparameterized, hybrid, and unified frameworks -- and systematically compare their mechanisms and trade-offs. Beyond taxonomy, we explore the impact of PEFT across diverse domains, including language, vision, and generative modeling, showing how these techniques offer strong performance with lower resource costs. We also discuss important open challenges in scalability, interpretability, and robustness, and suggest future directions such as federated learning, domain adaptation, and theoretical grounding. Our goal is to provide a unified understanding of PEFT and its growing role in enabling practical, efficient, and sustainable use of large models.
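
As a rough illustration of what "updating only a small portion of parameters" can mean, the sketch below counts trainable parameters for a low-rank update of the LoRA kind, which falls under reparameterized methods. The abstract does not name LoRA or give these dimensions; the layer size and rank are assumptions chosen for the arithmetic, and the function names are hypothetical.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in

def low_rank_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)

def trainable_fraction(d_out, d_in, rank):
    """Share of the full matrix's parameter count that the low-rank update trains."""
    return low_rank_params(d_out, d_in, rank) / full_finetune_params(d_out, d_in)

# Assumed example: one 4096 x 4096 projection with rank 8.
full = full_finetune_params(4096, 4096)
lora = low_rank_params(4096, 4096, 8)
fraction = trainable_fraction(4096, 4096, 8)
```

Under these assumed sizes the low-rank update trains 65,536 parameters against 16,777,216 for the full matrix, about 0.39% of the total.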

[168] Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Katie Matton,Robert Osazuwa Ness,John Guttag,Emre Kıcıman

Main category: cs.CL

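TL;DR: The paper proposes a new method for measuring the faithfulness of explanations generated by large language models (LLMs): it defines faithfulness and develops an approach based on counterfactuals and a Bayesian hierarchical model, revealing bias and misleading claims that LLM explanations can hide.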

Details

Motivation: Explanations generated by LLMs can be unfaithful, leading to over-trust and misuse, so a method to quantify this unfaithfulness is needed. Method: Defines faithfulness in terms of the difference between the concepts an explanation implies are influential and the concepts that truly are, uses an auxiliary LLM to generate counterfactual inputs, and quantifies concept effects with a Bayesian hierarchical model. Result: Experiments show the method can quantify unfaithfulness and uncovers hidden bias and misleading claims in LLM explanations on social bias and medical question answering tasks. Conclusion: The method provides a new tool for assessing the faithfulness of LLM explanations, exposes potential problems, and helps prevent misuse. Abstract: Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.

[169] SConU: Selective Conformal Uncertainty in Large Language Models

Zhiyuan Wang,Qingni Wang,Yue Zhang,Tianlong Chen,Xiaofeng Zhu,Xiaoshuang Shi,Kaidi Xu

Main category: cs.CL

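TL;DR: Proposes a new method called SConU that uses significance tests to handle uncertainty data outliers, improving prediction efficiency and controlling the miscoverage rate.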

Details

Motivation: Real-world applications of large language models require guarantees on task-specific metrics, but existing frameworks fail to identify outliers that violate the exchangeability assumption, leading to uncontrolled miscoverage rates. Method: Develops two conformal p-values and uses significance tests to determine whether a sample deviates from the uncertainty distribution of the calibration set, thereby managing the miscoverage rate. Result: SConU manages miscoverage rates in both single-domain and interdisciplinary contexts and improves prediction efficiency. Conclusion: SConU offers an approach to approximating conditional coverage in high-stakes question-answering tasks. Abstract: As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.
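
The general idea of testing whether a sample deviates from a calibration set can be sketched with the standard conformal p-value. This is a generic textbook construction, not the two p-values developed in SConU; the scalar "uncertainty score" per sample, the function names, and the default risk level are assumptions for illustration.

```python
def conformal_p_value(calibration_scores, test_score):
    """Standard conformal p-value: the (smoothed) fraction of calibration
    uncertainty scores that are at least as large as the test score."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)

def is_outlier(calibration_scores, test_score, alpha=0.1):
    """Flag the test sample when its p-value does not exceed the risk level alpha.
    Assumes higher scores mean higher uncertainty."""
    return conformal_p_value(calibration_scores, test_score) <= alpha
```

A sample whose uncertainty exceeds every calibration score receives the smallest possible p-value, 1 / (n + 1), and is the first to be flagged; flagged samples could then be withheld rather than given a prediction set.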

[170] Self-Correction Makes LLMs Better Parsers

Ziyan Zhang,Yang Hou,Chen Gong,Zhenghua Li

Main category: cs.CL

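TL;DR: The paper proposes a self-correction method that uses grammar rules from existing treebanks to guide large language models (LLMs) in correcting their own errors in syntactic parsing, significantly improving performance.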

Details

Motivation: Although LLMs perform well on many NLP tasks, they still fall short on fundamental tasks such as syntactic parsing, in particular failing to fully leverage the grammar rules in existing treebanks to generate valid syntactic structures. Method: Proposes a self-correction method that automatically detects potential errors and dynamically searches for relevant grammar rules, providing hints and examples that guide LLMs to correct themselves. Result: Experiments on three datasets show that the method significantly improves LLM performance on English and Chinese datasets in both in-domain and cross-domain settings. Conclusion: Guiding LLMs to self-correct with grammar rules from existing treebanks effectively improves their syntactic parsing. Abstract: Large language models (LLMs) have achieved remarkable success across various natural language processing (NLP) tasks. However, recent studies suggest that they still face challenges in performing fundamental NLP tasks essential for deep language understanding, particularly syntactic parsing. In this paper, we conduct an in-depth analysis of LLM parsing capabilities, delving into the specific shortcomings of their parsing results. We find that these shortcomings may stem from LLMs' limited ability to fully leverage grammar rules in existing treebanks, which restricts their capability to generate valid syntactic structures. To help LLMs acquire knowledge without additional training, we propose a self-correction method that leverages grammar rules from existing treebanks to guide LLMs in correcting previous errors. Specifically, we automatically detect potential errors and dynamically search for relevant rules, offering hints and examples to guide LLMs in making corrections themselves. Experimental results on three datasets with various LLMs demonstrate that our method significantly improves performance in both in-domain and cross-domain settings on the English and Chinese datasets.
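
One plausible way to "automatically detect potential errors" is to flag productions in a predicted tree that never occur in the treebank. The sketch below assumes exactly that and is not the authors' detector; the bracketed tree format, the toy rule set, and the function names are hypothetical.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP (DT the)))" into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()

def productions(tree):
    """Yield (parent, tuple of child labels) for every node with non-leaf children."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:
        yield (label, tuple(k[0] for k in kids))
        for k in kids:
            yield from productions(k)

def suspicious_rules(tree_text, treebank_rules):
    """Return the productions of the predicted tree that the treebank never uses."""
    return [r for r in productions(parse_tree(tree_text)) if r not in treebank_rules]
```

Each flagged production could then be paired with similar treebank rules and shown to the model as a hint for a second parsing attempt.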

[171] Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion

Yejun Yoon,Jaeyoon Jung,Seunghyun Yoon,Kunwoo Park

Main category: cs.CL

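TL;DR: The paper questions the effectiveness of LLM-based query expansion in zero-shot retrieval, arguing that the performance gains may stem from knowledge leakage in benchmarks.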

Details

Motivation: Test whether LLM-generated hypothetical documents improve retrieval because they contain ground truth evidence from the benchmarks (knowledge leakage). Method: Using fact verification as a testbed, analyzes whether the generated documents contain information entailed by ground truth evidence and assesses the impact on performance. Result: Performance improved consistently only for claims whose generated documents contained sentences entailed by ground truth evidence, suggesting knowledge leakage in the benchmarks. Conclusion: Knowledge leakage may inflate the perceived performance of LLM-based query expansion, particularly with respect to real-world scenarios that require retrieving niche or novel knowledge. Abstract: Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyzed whether the generated documents contained information entailed by ground truth evidence and assessed their impact on performance. Our findings indicate that performance improvements occurred consistently only for claims whose generated documents included sentences entailed by ground truth evidence. This suggests that knowledge leakage may be present in these benchmarks, inflating the perceived performance of LLM-based query expansion methods, particularly in real-world scenarios that require retrieving niche or novel knowledge.
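
The phrase "incorporated into a query vector" can be illustrated with a toy example in which the expanded query is the mean of the original query embedding and the embeddings of the generated documents. Averaging is one common choice and is an assumption here, as are the two-dimensional vectors and the function names; no real embedding model is used.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand_query(query_vec, hypothetical_doc_vecs):
    """Expanded query: mean of the query and the hypothetical document vectors."""
    return mean_vector([query_vec] + list(hypothetical_doc_vecs))

def rank(query_vec, corpus):
    """Document ids sorted by decreasing cosine similarity to the query."""
    return sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                  reverse=True)
```

If the generated documents already sit close to the gold evidence, for instance because the LLM memorized it, the expanded query moves toward that evidence and its rank improves. That shift is the effect the paper attributes to knowledge leakage.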

[172] Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Xinlin Zhuang,Jiahui Peng,Ren Ma,Yinfan Wang,Tianyi Bai,Xingjian Wei,Jiantao Qiu,Chi Zhang,Ying Qian,Conghui He

Main category: cs.CL

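TL;DR: The paper proposes the PRRC framework and the Meta-rater method, which improve LLM pre-training efficiency through multi-dimensional data quality assessment; experiments show clear gains over conventional single-dimension methods.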

Details

Motivation: Current methods for assessing the quality of LLM pre-training data are mostly single-dimensional or redundancy-focused, which limits data transparency and the optimization of model performance. Method: Proposes the PRRC framework (Professionalism, Readability, Reasoning, Cleanliness) and the Meta-rater method, which uses proxy models to train a regression model that predicts validation loss and thereby optimizes data selection. Result: Meta-rater doubles convergence speed for a 1.3B-parameter model, improves downstream task performance by 3.23, and scales to 3.3B models. Conclusion: Multi-dimensional quality assessment significantly outperforms conventional methods and offers an efficient, scalable paradigm for LLM pre-training. Abstract: The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.

[173] EIoU-EMC: A Novel Loss for Domain-specific Nested Entity Recognition

Jian Zhang,Tianqing Zhang,Qi Li,Hongwei Wang

Main category: cs.CL

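TL;DR: This paper proposes a new loss function, EIoU-EMC, which improves on the IoU loss and the multiclass loss to address low-resource and class-imbalance problems in domain-specific nested NER.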

Details

Motivation: Nested NER in specific domains (such as biomedicine and industry) faces low-resource and class-imbalance challenges that limit its wider application. Method: Designs the EIoU-EMC loss, which combines entity boundary and entity classification information to improve the model's ability to learn from few data samples. Result: Validated on three biomedical NER datasets and one dataset of industrial equipment maintenance documents, the method outperforms baselines, with notable gains in entity boundary recognition and classification. Conclusion: EIoU-EMC performs well on nested NER and offers an effective solution for low-resource scenarios. Abstract: In recent years, research has mainly focused on the general NER task. Challenges remain for the nested NER task in specific domains. Specifically, low-resource and class-imbalance scenarios impede wide application in the biomedical and industrial domains. In this study, we design a novel loss, EIoU-EMC, by enhancing the implementation of the Intersection over Union loss and the Multiclass loss. Our proposed method specifically leverages entity boundary and entity classification information, thereby enhancing the model's capacity to learn from a limited number of data samples. To validate the performance of this method on the NER task, we conducted experiments on three distinct biomedical NER datasets and one dataset that we constructed from industrial complex equipment maintenance documents. Compared to strong baselines, our method demonstrates competitive performance across all datasets. In the experimental analysis, our proposed method exhibits significant advances in entity boundary recognition and entity classification. Our code is available here.
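
For readers unfamiliar with Intersection over Union applied to text spans, the following is a minimal sketch of a plain span-level IoU loss. It is not the EIoU-EMC loss itself, whose exact form the abstract does not give; the function names and the inclusive `(start, end)` span convention are assumptions made for illustration.

```python
def span_iou(pred, gold):
    """IoU of two token spans given as inclusive (start, end) index pairs (assumed convention)."""
    # Number of token positions shared by both spans (0 if they do not overlap).
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)
    # Total number of distinct token positions covered by either span.
    union = (pred[1] - pred[0] + 1) + (gold[1] - gold[0] + 1) - intersection
    return intersection / union


def span_iou_loss(pred, gold):
    """Plain IoU loss: 0 for a perfect boundary match, 1 for disjoint spans."""
    return 1.0 - span_iou(pred, gold)
```

A boundary-aware term of this kind penalizes a prediction in proportion to how far its span is from the gold span, which is the sort of signal a pure classification loss does not provide.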

[174] Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Takuma Udagawa,Yang Zhao,Hiroshi Kanayama,Bishwaranjan Bhattacharjee

Main category: cs.CL

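TL;DR: Proposes an efficient annotation pipeline for analyzing social biases in pretraining corpora, and experimentally validates the bias analysis and mitigation measures.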

Details

Motivation: Social biases in pretraining data can be perpetuated or amplified by large language models (LLMs), so a method is needed to identify and mitigate them. Method: Proposes an annotation pipeline, consisting of protected attribute detection and language polarity (regard) classification, for analyzing biases in pretraining corpora. Result: Experiments demonstrate the effectiveness of the bias analysis and show the effect of the mitigation measures. Conclusion: The method offers an efficient and effective way to analyze social biases in pretraining corpora. Abstract: Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

[175] Understanding the Repeat Curse in Large Language Models from a Feature Perspective

Junchi Yao,Shu Yang,Jianhua Xu,Lijie Hu,Mengdi Li,Di Wang

Main category: cs.CL

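TL;DR: The paper proposes a method called "Duplicatus Charm" that uses mechanistic interpretability to analyze repetitive text generation in large language models, and uses sparse autoencoders to identify and suppress repetition features.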

Details

Motivation: Large language models have made notable progress in many domains but often generate repetitive text (the "Repeat Curse"), and the underlying mechanism has not been fully explored. Method: Using mechanistic interpretability together with sparse autoencoders (SAEs) that extract monosemantic features, proposes the "Duplicatus Charm" method to locate and manipulate repetition features. Result: Constructs a repetition dataset, verifies the influence of the repetition features, and effectively mitigates repetition by suppressing them. Conclusion: The study sheds light on the mechanism behind repetitive generation and proposes an effective way to mitigate it. Abstract: Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the "Repeat Curse". While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, "Duplicatus Charm", to induce and analyze the Repeat Curse. Our method systematically identifies "Repetition Features", the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse.

[176] SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification

Michael Färber,Parisa Aghdam,Kyuri Im,Mario Tawfelis,Hardik Ghoshal

Main category: cs.CL

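TL;DR: The paper presents the first text simplification system based on GPT-4 and Llama-3 that accepts multiple input formats and supports customized output, with the aim of improving inclusivity.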

Details

Motivation: Complex text is a barrier for people with comprehension difficulties, and existing methods do not fully use large language models to provide customized simplification. Method: Develops a system that supports multiple input formats, generates simplified text with GPT-4 and Llama-3, and evaluates the output on multiple metrics. Result: The system produces customized simplified text, demonstrating the potential of large language models for text simplification. Conclusion: The work advances automatic text simplification and underlines the importance of tailored communication for inclusivity. Abstract: Text simplification is essential for making complex content accessible to diverse audiences who face comprehension challenges. Yet, the limited availability of simplified materials creates significant barriers to personal and professional growth and hinders social inclusion. Although researchers have explored various methods for automatic text simplification, none fully leverage large language models (LLMs) to offer tailored customization for different target groups and varying levels of simplicity. Moreover, despite its proven benefits for both consumers and organizations, the well-established practice of plain language remains underutilized. In this paper, we present SimplifyMyText (https://simplifymytext.org), the first system designed to produce plain language content from multiple input formats, including typed text and file uploads, with flexible customization options for diverse audiences. We employ GPT-4 and Llama-3 and evaluate outputs across multiple metrics. Overall, our work contributes to research on automatic text simplification and highlights the importance of tailored communication in promoting inclusivity.

[177] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Bowen Jiang,Zhuoqun Hao,Young-Min Cho,Bryan Li,Yuan Yuan,Sihao Chen,Lyle Ungar,Camillo J. Taylor,Dan Roth

Main category: cs.CL

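TL;DR: The paper introduces the PERSONAMEM benchmark for evaluating how well LLMs use a user's interaction history to personalize responses, and finds that current models still fall short at tracking changing user preferences.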

Details

Motivation: To study how well LLMs can use a user's interaction history to understand the user's traits and preferences and generate personalized responses. Method: Proposes the PERSONAMEM benchmark, with over 180 simulated user-LLM interaction histories, to evaluate personalized response ability in multi-turn conversations. Result: Current LLMs perform poorly at dynamically tracking user preferences; frontier models reach only about 50% accuracy. Conclusion: The PERSONAMEM benchmark and its simulation pipeline can support the future development of more user-aware chatbots. Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user profiling and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.

[178] Probing the Subtle Ideological Manipulation of Large Language Models

Demetris Paschalides,George Pallis,Marios D. Dikaiakos

Main category: cs.CL

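TL;DR: The paper examines how far large language models (LLMs) can be manipulated across a spectrum of political ideologies, going beyond the traditional Left-Right binary; it introduces a multi-task dataset and measures the effect of fine-tuning on the models' ideological expression.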

Details

Motivation: To explore how susceptible LLMs are to political ideological manipulation beyond the traditional Left-Right binary, in order to assess the risks more fully. Method: Builds a multi-task dataset (ideological QA, manifesto cloze completion, and others) and fine-tunes three LLMs: Phi-2, Mistral, and Llama-3. Result: Fine-tuning significantly strengthens the models' nuanced ideological expression, while explicit prompts have limited effect. Conclusion: LLMs are susceptible to ideological manipulation, and stronger safeguards are needed. Abstract: Large Language Models (LLMs) have transformed natural language processing, but concerns have emerged about their susceptibility to ideological manipulation, particularly in politically sensitive areas. Prior work has focused on binary Left-Right LLM biases, using explicit prompts and fine-tuning on political QA datasets. In this work, we move beyond this binary approach to explore the extent to which LLMs can be influenced across a spectrum of political ideologies, from Progressive-Left to Conservative-Right. We introduce a novel multi-task dataset designed to reflect diverse ideological positions through tasks such as ideological QA, statement ranking, manifesto cloze completion, and Congress bill comprehension. By fine-tuning three LLMs (Phi-2, Mistral, and Llama-3) on this dataset, we evaluate their capacity to adopt and express these nuanced ideologies. Our findings indicate that fine-tuning significantly enhances nuanced ideological alignment, while explicit prompts provide only minor refinements. This highlights the models' susceptibility to subtle ideological manipulation, suggesting a need for more robust safeguards to mitigate these risks.

Xingyu Li,Chen Gong,Guohong Fu

Main category: cs.CL

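TL;DR: The paper introduces TikTalkCoref, the first Chinese multimodal coreference resolution dataset for social media, filling a gap in research on real-world dialogues, and proposes a benchmark method.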

Details

Motivation: Multimodal coreference resolution (MCR) is essential for understanding multimodal content, but data resources for real-world dialogues are lacking. Method: Collects short videos and textual dialogues from the Douyin platform, manually annotates coreference clusters, and proposes a benchmark method. Result: Builds the TikTalkCoref dataset and provides reliable benchmark results. Conclusion: TikTalkCoref will facilitate future research on multimodal coreference resolution for social media. Abstract: Multimodal coreference resolution (MCR) aims to identify mentions referring to the same entity across different modalities, such as text and visuals, and is essential for understanding multimodal content. In the era of rapidly growing multimodal content and social media, MCR is particularly crucial for interpreting user interactions and bridging text-visual references to improve communication and personalization. However, MCR research for real-world dialogues remains unexplored due to the lack of sufficient data resources. To address this gap, we introduce TikTalkCoref, the first Chinese multimodal coreference dataset for social media in real-world scenarios, derived from the popular Douyin short-video platform. This dataset pairs short videos with corresponding textual dialogues from user comments and includes manually annotated coreference clusters for both person mentions in the text and the coreferential person head regions in the corresponding video frames. We also present an effective benchmark approach for MCR, focusing on the celebrity domain, and conduct extensive experiments on our dataset, providing reliable benchmark results for this newly constructed dataset. We will release the TikTalkCoref dataset to facilitate future research on MCR for real-world social media dialogues.

[180] Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

Patrick Haller,Jonas Golde,Alan Akbik

Main category: cs.CL

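TL;DR: This paper studies knowledge distillation from a Transformer teacher to nine subquadratic student models, examining how different architectures affect the distillation process and what role initialization strategies play.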

Details

Motivation: The quadratic complexity of self-attention at inference time is a bottleneck, which motivates studying knowledge distillation into subquadratic alternatives such as SSMs, linear attention, and recurrent architectures. Method: Systematically evaluates knowledge distillation from a Transformer teacher to nine subquadratic student models, and studies the effect of initialization strategies such as matrix mixing and QKV copying. Result: Empirical results on multiple NLP benchmarks reveal the trade-offs between efficiency and performance and identify key factors for successful knowledge transfer. Conclusion: The study offers practical insights into knowledge distillation for subquadratic architectures and highlights the importance of architecture choice and initialization strategy. Abstract: Knowledge distillation is a widely used technique for compressing large language models (LLMs) by training a smaller student model to mimic a larger teacher model. Typically, both the teacher and student are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention at inference time remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher's learned representations and how different architectural constraints influence the distillation process. We also investigate the impact of intelligent initialization strategies, including matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.
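
As background on what "training a student to mimic a teacher" typically means, here is a minimal sketch of the classic logit-based distillation loss: the KL divergence between temperature-softened teacher and student distributions for one token position. This is the textbook formulation, not necessarily the objective used in the paper; the function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed convention)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

Because the loss depends only on output distributions, it applies unchanged whether the student is a Transformer, an SSM, or a recurrent model, which is what makes cross-architecture distillation possible in the first place.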

[181] Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites

Gabriel Machado Santos,Rita Maria da Silva Julia,Marcelo Zanchetta do Nascimento

Main category: cs.CL

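TL;DR: This paper proposes an evolutionary method that combines a context-free grammar (CFG) with the MAP-Elites algorithm to explore the prompt space systematically, generating high-quality and diverse prompts and analyzing how well they fit different tasks.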

Details

Motivation: Prompt engineering is essential for optimizing large language models (LLMs), but the relationship between prompt structure and task performance has not been studied enough. Method: Uses a CFG and the MAP-Elites algorithm to explore the prompt space systematically, generating diverse, high-performing prompts and analyzing their fit to tasks. Result: Experiments on multiple LLMs and seven BigBench Lite tasks show that the method significantly improves prompt quality and diversity. Conclusion: Systematically mapping the phenotypic space reveals how structural variations affect LLM performance and offers practical insights for task-specific and adaptable prompt design. Abstract: Prompt engineering is essential for optimizing large language models (LLMs), yet the link between prompt structures and task performance remains underexplored. This work introduces an evolutionary approach that combines context-free grammar (CFG) with the MAP-Elites algorithm to systematically explore the prompt space. Our method prioritizes quality and diversity, generating high-performing and structurally varied prompts while analyzing their alignment with diverse tasks by varying traits such as the number of examples (shots) and reasoning depth. By systematically mapping the phenotypic space, we reveal how structural variations influence LLM performance, offering actionable insights for task-specific and adaptable prompt design. Evaluated on seven BigBench Lite tasks across multiple LLMs, our results underscore the critical interplay of quality and diversity, advancing the effectiveness and versatility of LLMs.
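
The core MAP-Elites loop can be sketched in a few lines: keep an archive with one "elite" per cell of a descriptor grid, and repeatedly mutate a random elite. The toy below is only an illustration under stated assumptions. A prompt is reduced to a hypothetical tuple `(shots, depth, variant)`, the descriptor is `(shots, depth)`, and `evaluate` is a stand-in score rather than an LLM evaluation; the paper's CFG-based prompt generation is not modeled.

```python
import random


def map_elites(seeds, evaluate, describe, mutate, iterations=200, seed=0):
    """Return an archive mapping each descriptor cell to its best (fitness, candidate)."""
    rng = random.Random(seed)
    archive = {}

    def try_insert(candidate):
        cell, fitness = describe(candidate), evaluate(candidate)
        # Keep the candidate only if its cell is empty or it beats the current elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)

    for candidate in seeds:
        try_insert(candidate)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Hypothetical toy prompt: (number of shots, reasoning depth, wording variant).
def describe(prompt):
    return prompt[0], prompt[1]


def evaluate(prompt):
    # Stand-in score; a real system would run the prompt through an LLM on a task.
    shots, depth, variant = prompt
    return variant - abs(shots - 2) - abs(depth - 1)


def mutate(prompt, rng):
    traits = list(prompt)
    i = rng.randrange(3)
    upper = (5, 3, 9)[i]  # assumed upper bound for each trait
    traits[i] = min(upper, max(0, traits[i] + rng.choice([-1, 1])))
    return tuple(traits)


archive = map_elites([(0, 0, 0)], evaluate, describe, mutate)
```

The resulting archive holds the best prompt found for each combination of shots and depth, so quality can be compared across structurally different prompts instead of collapsing the search onto a single optimum.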

[182] ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Tong Chen,Faeze Brahman,Jiacheng Liu,Niloofar Mireshghallah,Weijia Shi,Pang Wei Koh,Luke Zettlemoyer,Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: ParaPO is a post-training method that reduces unintentional regurgitation by language models while preserving their overall utility.

Details

Motivation: Language models memorize and reproduce pretraining data even in non-adversarial settings, which raises concerns about copyright, plagiarism, privacy, and creativity. Method: Fine-tunes the model to prefer paraphrased versions of memorized content over the original text, and develops a system-prompt variant to control regurgitation behavior. Result: Experiments on Llama3.1-8B and Tulu3-8B show that ParaPO significantly reduces regurgitation and that the system-prompt variant preserves the ability to recall famous quotations. Conclusion: ParaPO effectively reduces unintentional regurgitation while maintaining utility, outperforming existing methods. Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).
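
One common way to train a model to prefer one completion over another is a DPO-style pairwise loss. The sketch below assumes that formulation, with the paraphrase as the "chosen" sequence and the verbatim segment as the "rejected" one; the abstract does not state which preference objective ParaPO uses, so the loss form, the function name, and `beta` are all assumptions.

```python
import math


def preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss from sequence log-probabilities (assumed objective, not confirmed by the source).

    chosen   = paraphrased version of a memorized segment
    rejected = original verbatim segment
    """
    # How much more the policy favors the paraphrase than the frozen reference model does.
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss starts at log 2 when the policy equals the reference and falls as the policy shifts probability mass from the verbatim segment to its paraphrase.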

[183] CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

Armin Toroghi,Willis Guo,Scott Sanner

Main category: cs.CL

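TL;DR: The paper presents CoLoTa, a new dataset for evaluating the commonsense reasoning of large language models (LLMs) over long-tail entities and their tendency to hallucinate, and finds that existing methods perform poorly on such tasks.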

Details

Motivation: Although LLMs are good at encoding factual and commonsense knowledge, their reasoning errors and hallucinations on long-tail entities hinder deployment in high-stakes settings. Method: Builds the CoLoTa dataset of 3,300 queries covering question answering and claim verification, which can also serve as a Knowledge Graph Question Answering (KGQA) dataset. Result: Experiments show that existing LLM-based KGQA methods perform very poorly on queries that involve commonsense reasoning. Conclusion: CoLoTa can serve as a new benchmark for evaluating LLMs and KGQA methods on commonsense reasoning over long-tail entities. Abstract: The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), that consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset since the support of knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.

[184] sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment

Yijun Liu

Main category: cs.CL

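TL;DR: The SSENSE framework uses contrastive learning to map sEEG signals into the sentence embedding space of a CLIP model, enabling sentence retrieval directly from brain activity.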

Details

Motivation: To explore the potential of multimodal foundation models at the intersection of neuroscience and artificial intelligence by aligning invasive brain recordings with natural language. Method: Trains a neural encoder on spectral representations of sEEG with the InfoNCE loss, without fine-tuning the text encoder. Result: With limited data, SSENSE shows that general-purpose language representations can serve as effective priors for neural decoding. Conclusion: General-purpose language representations can be used for neural decoding, and SSENSE offers a new approach to aligning brain activity with language. Abstract: Interpreting neural activity through meaningful latent representations remains a complex and evolving challenge at the intersection of neuroscience and artificial intelligence. We investigate the potential of multimodal foundation models to align invasive brain recordings with natural language. We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model, enabling sentence-level retrieval directly from brain activity. SSENSE trains a neural encoder on spectral representations of sEEG using InfoNCE loss, without fine-tuning the text encoder. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset. Despite limited data, SSENSE achieves promising results, demonstrating that general-purpose language representations can serve as effective priors for neural decoding.
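
The InfoNCE objective named in the abstract can be sketched as follows: for each brain embedding, the matching sentence embedding is the positive and the other sentences in the batch are negatives. This is a one-directional (brain to text) version over plain Python lists; whether the paper uses a symmetric variant, and what temperature it uses, is not stated, so `temperature=0.07` and the function name are assumptions.

```python
import math


def info_nce(brain_embs, text_embs, temperature=0.07):
    """Mean InfoNCE loss where brain_embs[i] and text_embs[i] form the positive pair."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    brain = [normalize(v) for v in brain_embs]
    text = [normalize(v) for v in text_embs]  # frozen text side: inputs only, never updated
    total = 0.0
    for i, b in enumerate(brain):
        # Cosine similarity to every sentence in the batch, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(b, t)) / temperature for t in text]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy with the positive as target
    return total / len(brain)
```

At retrieval time the same similarity scores are simply ranked, so a lower training loss translates directly into the correct sentence appearing nearer the top of the list.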

[185] DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

Xiang Li,Duyi Pan,Hongru Xiao,Jiale Han,Jing Tang,Jiabao Ma,Wei Wang,Bo Cheng

Main category: cs.CL

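TL;DR: Proposes DialogueAgents, a multi-agent speech synthesis framework in which a script writer, a speech synthesizer, and a dialogue critic collaborate to generate dialogues, and releases MultiTalk, a high-quality bilingual multi-turn dialogue dataset.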

Details

Motivation: Existing speech synthesis datasets are expensive to build and lack diversity, which limits emotional expressiveness and contextual variety. Method: Three agents (script writer, speech synthesizer, dialogue critic) collaborate to refine dialogue scripts and synthesized speech iteratively, improving emotional expressiveness and paralinguistic features. Result: Produces MultiTalk, a high-quality bilingual multi-turn dialogue dataset; experiments confirm the framework's effectiveness. Conclusion: The DialogueAgents framework and the MultiTalk dataset provide new tools and resources for speech synthesis research. Abstract: Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.

[186] FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Yichen Li,Zhiting Fan,Ruizhe Chen,Xiaotang Gai,Luqi Gong,Yan Zhang,Zuozhu Liu

Main category: cs.CL

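TL;DR: FairSteer is an inference-time debiasing framework that needs neither customized prompts nor model retraining; it detects biased activations, computes debiasing steering vectors, and adjusts activations dynamically.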

Details

Motivation: Large language models (LLMs) readily pick up biases from their training corpora, and existing methods are either unstable or computationally expensive. Method: Building on the linear representation hypothesis, FairSteer detects biased activations with a lightweight linear classifier, computes debiasing steering vectors (DSVs), and adjusts activations dynamically at inference time. Result: A comprehensive evaluation on six LLMs shows that FairSteer performs strongly on question answering, counterfactual input evaluation, and open-ended text generation. Conclusion: FairSteer provides an efficient and stable debiasing method that requires no additional training or elaborate prompt design. Abstract: Large language models (LLMs) are prone to capturing biases from training corpora, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework that requires neither customized prompt design nor model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
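
The three steps described above can be sketched on plain vectors as follows. The details are assumptions for illustration: the DSV is taken as the mean difference between activations of unbiased and biased prompts (a common way to derive steering directions), the classifier is a linear score with a threshold at zero, and all names are hypothetical rather than taken from the FairSteer code.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def compute_dsv(biased_acts, unbiased_acts):
    """Assumed DSV: mean unbiased activation minus mean biased activation."""
    biased_mean = mean_vector(biased_acts)
    unbiased_mean = mean_vector(unbiased_acts)
    return [u - b for u, b in zip(unbiased_mean, biased_mean)]


def steer(activation, weights, bias, dsv, strength=1.0):
    """Add the DSV only when the linear classifier flags the activation as biased."""
    score = sum(w * a for w, a in zip(weights, activation)) + bias
    if score > 0:  # assumed convention: positive score means "biased"
        return [a + strength * d for a, d in zip(activation, dsv)]
    return list(activation)  # unflagged activations pass through unchanged
```

Gating the intervention on the classifier is what makes the steering dynamic: inputs that do not trigger the bias detector are left untouched, which limits side effects on unrelated generations.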

[187] Functional Abstraction of Knowledge Recall in Large Language Models

Zijian Wang,Chang Xu

Main category: cs.CL

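TL;DR: The paper studies the knowledge recall mechanism in large language models (LLMs), abstracting it into a functional structure in which activation vectors align with functional components (input argument, function body, return value). Experiments support this view and yield an improved knowledge editing method based on activation patching.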

Details

Motivation: To explore the mechanism of knowledge recall in LLMs by formalizing it as a functional structure, in order to understand how the model maps knowledge through activation vectors. Method: Designs a patching-based knowledge-scoring algorithm to identify functional components, and uses counter-knowledge testing to verify the independent functional effect of each component. Result: The alignment between activation vectors and functional components is confirmed, and the improved knowledge editing method strengthens short-term memory retention of new knowledge. Conclusion: The functional perspective sheds light on the knowledge recall mechanism of LLMs and offers a new direction for knowledge editing. Abstract: Pre-trained transformer large language models (LLMs) demonstrate strong knowledge recall capabilities. This paper investigates the knowledge recall mechanism in LLMs by abstracting it into a functional structure. We propose that during knowledge recall, the model's hidden activation space implicitly entails a function execution process where specific activation vectors align with functional components (Input argument, Function body, and Return values). Specifically, activation vectors of relation-related tokens define a mapping function from subjects to objects, with subject-related token activations serving as input arguments and object-related token activations as return values. For experimental verification, we first design a patching-based knowledge-scoring algorithm to identify knowledge-aware activation vectors as independent functional components. Then, we conduct counter-knowledge testing to examine the independent functional effects of each component on knowledge recall outcomes. From this functional perspective, we improve the contextual knowledge editing approach augmented by activation patching. By rewriting incoherent activations in context, we enable improved short-term memory retention for new knowledge prompting.

[188] Causality for Natural Language Processing

Zhijing Jin

Main category: cs.CL

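TL;DR: This thesis examines causal reasoning in large language models (LLMs), covering its mechanisms, applications, and directions for improvement.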

Details

Motivation: Causal reasoning is a key capability for AI systems that aim at advanced understanding and decision-making, so studying the causal reasoning abilities of LLMs matters. Method: Analyzes the causal reasoning skills of LLMs and their underlying mechanisms through a series of studies, new datasets, benchmark tasks, and methodological frameworks. Result: Identifies key challenges and opportunities for LLM causal reasoning and lays a foundation for future research. Conclusion: The work offers a comprehensive framework for improving the causal reasoning abilities of LLMs and advances the field. Abstract: Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems aiming to achieve advanced understanding and decision-making. This thesis delves into various dimensions of causal reasoning and understanding in large language models (LLMs). It encompasses a series of studies that explore the causal inference skills of LLMs, the mechanisms behind their performance, and the implications of causal and anticausal learning for natural language processing (NLP) tasks. Additionally, it investigates the application of causal reasoning in text-based computational social science, specifically focusing on political decision-making and the evaluation of scientific impact through citations. Through novel datasets, benchmark tasks, and methodological frameworks, this work identifies key challenges and opportunities to improve the causal capabilities of LLMs, providing a comprehensive foundation for future research in this evolving field.

[189] BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Yiting Ran,Xintao Wang,Tian Qiu,Jiaqing Liang,Yanghua Xiao,Deqing Yang

Main category: cs.CL

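TL;DR: The paper presents BookWorld, a system for constructing and simulating book-based multi-agent societies that captures real-world intricacies and performs well in story generation and social simulation.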

Details

Motivation: Existing work mostly creates agent societies from scratch, while simulating established fictional worlds and characters remains underexplored despite its practical value. Method: Designs the BookWorld system, which covers real-world intricacies such as dynamic characters, fictional worldviews, and geographical constraints. Result: Experiments show that BookWorld performs well in story generation while staying faithful to the source books, with a win rate of 75.36%. Conclusion: BookWorld opens a new way to extend and explore fictional works and has broad application potential. Abstract: Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents with newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld's design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: https://bookworld2025.github.io/.

[190] a1: Steep Test-time Scaling Law via Environment Augmented Generation

Lingrui Mei,Shenghua Liu,Yiwei Wang,Baolong Bi,Yuyao Ge,Jun Wan,Yurong Wu,Xueqi Cheng

Main category: cs.CL

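TL;DR: The EAG framework strengthens LLM reasoning through real-time environmental feedback, dynamic branch exploration, and experience-based learning, substantially improving performance on complex tasks.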

Details

Motivation: Current LLMs hallucinate and make logical errors in multi-step reasoning and cannot correct themselves, while existing methods such as chain-of-thought prompting have limited capability. Method: Proposes the EAG framework, which combines real-time environmental feedback that validates each step, dynamic branch exploration of alternative paths, and experience-based learning from successful reasoning trajectories. Result: The a1-32B model achieves the best performance among similar-sized models on multiple benchmarks and surpasses larger models on some tasks, with its advantage growing with task complexity. Conclusion: Through environment interaction and branch exploration, EAG establishes a new paradigm for reliable machine reasoning, particularly for problems that require precise multi-step calculation and logical verification. Abstract: Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.

[191] Translation Analytics for Freelancers: I. Introduction, Data Preparation, Baseline Evaluations

Yuri Balashov,Alex Balashov,Shiho Fukuda Koski

Main category: cs.CL

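TL;DR: The paper explores the new opportunities that advances in language technology create for individual translators and language service providers with modest resources, and proposes a framework of automatic evaluation metrics suited to freelance translators.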

Details

Motivation: The study aims to help freelance translators use emerging language technologies, such as neural machine translation and large language models, to improve translation quality and productivity. Method: Proposes a framework for adapting automatic evaluation metrics (BLEU, chrF, TER, and COMET) to freelancers' needs, with an empirical analysis on a trilingual corpus from the medical domain. Result: The paper provides a statistical analysis correlating human evaluations with automatic scores, illustrating the practical potential of these metrics. Conclusion: Freelance translators should engage proactively with emerging technologies in order to adapt and thrive in a rapidly changing professional environment. Abstract: This is the first in a series of papers exploring the rapidly expanding new opportunities arising from recent progress in language technologies for individual translators and language service providers with modest resources. The advent of advanced neural machine translation systems, large language models, and their integration into workflows via computer-assisted translation tools and translation management systems have reshaped the translation landscape. These advancements enable not only translation but also quality evaluation, error spotting, glossary generation, and adaptation to domain-specific needs, creating new technical opportunities for freelancers. In this series, we aim to empower translators with actionable methods to harness these advancements. Our approach emphasizes Translation Analytics, a suite of evaluation techniques traditionally reserved for large-scale industry applications but now becoming increasingly available for smaller-scale users. This first paper introduces a practical framework for adapting automatic evaluation metrics -- such as BLEU, chrF, TER, and COMET -- to freelancers' needs. We illustrate the potential of these metrics using a trilingual corpus derived from a real-world project in the medical domain and provide statistical analysis correlating human evaluations with automatic scores. Our findings emphasize the importance of proactive engagement with emerging technologies to not only adapt but thrive in the evolving professional environment.
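
The abstract mentions correlating human evaluations with automatic scores but does not say which statistic is used. As a minimal sketch, assuming a plain Pearson correlation over per-segment scores, the check could look like the following. The function name `pearson` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalised; the factor cancels out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-segment scores: human ratings vs. an automatic metric.
human_ratings = [4.5, 3.0, 2.0, 5.0, 3.5]
metric_scores = [0.82, 0.61, 0.40, 0.90, 0.66]
print(round(pearson(human_ratings, metric_scores), 3))
```

A rank-based statistic such as Spearman's rho may be the better choice when human ratings are ordinal; the sketch uses Pearson only because it is the simplest to show.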

[192] A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models

Hongming Tan,Shaoxiong Zhan,Fengwei Jia,Hai-Tao Zheng,Wai Kin Chan

Main category: cs.CL

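TL;DR: HSPIM is a hierarchical framework based on large language models that evaluates the innovation of scientific papers through paper segmentation and zero-shot prompting, outperforming existing methods.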

Details

Motivation: Existing methods struggle to capture paper innovation fully and lack generalization, so a more effective evaluation framework is needed. Method: HSPIM decomposes a paper into sections, uses zero-shot LLM prompting for section classification, QA augmentation, and weighted novelty scoring, and optimizes the question-prompt combinations with a genetic algorithm. Result: On scientific conference paper datasets, HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability. Conclusion: HSPIM offers an efficient and generalizable solution for assessing the innovation of scientific papers. Abstract: Measuring scientific paper innovation is both important and challenging. Existing content-based methods often overlook the full-paper context, fail to capture the full scope of innovation, and lack generalization. We propose HSPIM, a hierarchical and training-free framework based on large language models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess innovation. We segment the text by section titles and use zero-shot LLM prompting to implement section classification, question-answering (QA) augmentation, and weighted novelty scoring. The generated QA pairs focus on section-level innovation and serve as additional context to improve the LLM scoring. For each chunk, the LLM outputs a novelty score and a confidence score. We use confidence scores as weights to aggregate novelty scores into a paper-level innovation score. To further improve performance, we propose a two-layer question structure consisting of common and section-specific questions, and apply a genetic algorithm to optimize the question-prompt combinations. Comprehensive experiments on scientific conference paper datasets show that HSPIM outperforms baseline methods in effectiveness, generalization, and interpretability.
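
The aggregation step described in the abstract (confidence scores used as weights over per-chunk novelty scores) can be sketched as a normalised weighted mean. The normalisation, the function name `paper_innovation_score`, and the sample values are assumptions made for illustration; the abstract does not give the exact formula.

```python
def paper_innovation_score(chunk_scores):
    """Aggregate (novelty, confidence) pairs into one paper-level score.

    Assumption: a confidence-weighted mean, i.e.
    sum(novelty * confidence) / sum(confidence).
    """
    total_confidence = sum(conf for _, conf in chunk_scores)
    if total_confidence <= 0:
        raise ValueError("at least one chunk needs a positive confidence")
    weighted_sum = sum(novelty * conf for novelty, conf in chunk_scores)
    return weighted_sum / total_confidence


# Hypothetical LLM outputs for three chunks: (novelty score, confidence).
chunks = [(8.0, 0.9), (4.0, 0.1), (6.0, 0.5)]
print(paper_innovation_score(chunks))
```

With this weighting, a chunk the model is unsure about contributes little to the final score, which appears to be the intent of using confidence as a weight.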

[193] Automatic Text Summarization (ATS) for Research Documents in Sorani Kurdish

Rondik Hadi Abdulrahman,Hossein Hassani

Main category: cs.CL

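TL;DR: The study develops a dataset and language model for automatic text summarization (ATS) in Sorani Kurdish, filling a resource gap for the language. Across two experiments (with and without the conclusions section) and several evaluation methods, the best accuracy reaches 19.58%.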

Details

Motivation: Kurdish lacks resources for automatic text summarization, which limits related research. The study aims to fill this gap and provide foundational resources for Kurdish NLP. Method: Based on 231 scientific papers in Sorani Kurdish, the study generates summaries with Sentence Weighting and TF-IDF algorithms and evaluates them manually and automatically (ROUGE metrics). Result: The best summarization accuracy is 19.58%, and the experts' manual evaluation results vary by document. Conclusion: The study provides valuable resources for Kurdish ATS and related fields, advancing Kurdish NLP. Abstract: Extracting concise information from scientific documents aids learners, researchers, and practitioners. Automatic Text Summarization (ATS), a key Natural Language Processing (NLP) application, automates this process. While ATS methods exist for many languages, Kurdish remains underdeveloped due to limited resources. This study develops a dataset and language model based on 231 scientific papers in Sorani Kurdish, collected from four academic departments in two universities in the Kurdistan Region of Iraq (KRI), averaging 26 pages per document. Using Sentence Weighting and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms, two experiments were conducted, differing in whether the conclusions were included. The average word count was 5,492.3 in the first experiment and 5,266.96 in the second. Results were evaluated manually and automatically using ROUGE-1, ROUGE-2, and ROUGE-L metrics, with the best accuracy reaching 19.58%. Six experts conducted manual evaluations using three criteria, with results varying by document. This research provides valuable resources for Kurdish NLP researchers to advance ATS and related fields.
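
To make the TF-IDF sentence-weighting idea concrete, here is a minimal extractive sketch. It treats each sentence as a document, scores a sentence by the sum of its terms' TF-IDF weights, and keeps the top `k` sentences in their original order. Whitespace tokenisation, the scoring formula, and the English sample sentences are simplifying assumptions; the paper's actual pipeline for Sorani Kurdish is not specified in the abstract.

```python
import math
from collections import Counter


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Sum of (relative term frequency * inverse document frequency).
        score = sum(
            (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "the cat sat",
    "the cat sat",
    "storms damaged the bridge",
]
print(summarize(sentences, k=1))
```

Terms that occur in every sentence get an IDF of zero, so repeated boilerplate contributes nothing to a sentence's score.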

[194] Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance

Soo-joon Choi,Ji-jun Park

Main category: cs.CL

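TL;DR: Proposes a financial event entity extraction method based on generative large language models (LLMs) that uses parameter-efficient fine-tuning (PEFT) to generate structured output directly, significantly outperforming traditional sequence labeling methods.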

Details

Motivation: Financial texts use complex language with overlapping entities, and traditional sequence labeling models struggle with long-range dependencies and multi-entity extraction. Method: Reframes financial event entity extraction as a text-to-structured-output generation task, fine-tuning a pre-trained LLM with PEFT to directly generate a JSON object containing the entities and their character spans. Result: Achieves a new state-of-the-art F1 score on the CCKS 2019 dataset, significantly outperforming baselines such as SEBERTNets and sebertNets. Conclusion: Generative LLMs show promise for complex, domain-specific information extraction tasks, handling the complexity of financial text and producing high-quality entities. Abstract: Financial event entity extraction is a crucial task for analyzing market dynamics and building financial knowledge graphs, yet it presents significant challenges due to the specialized language and complex structures in financial texts. Traditional approaches often rely on sequence labeling models, which can struggle with long-range dependencies and the inherent complexity of extracting multiple, potentially overlapping entities. Motivated by the advanced language understanding and generative capabilities of Large Language Models (LLMs), we propose a novel method that reframes financial event entity extraction as a text-to-structured-output generation task. Our approach involves fine-tuning a pre-trained LLM using Parameter-Efficient Fine-Tuning (PEFT) to directly generate a structured representation, such as a JSON object, containing the extracted entities and their precise character spans from the input text. We evaluate our method on the challenging CCKS 2019 Financial Event Entity Extraction dataset, comparing its performance against strong sequence labeling baselines, including SEBERTNets and sebertNets. Experimental results demonstrate that our generative LLM method achieves a new state-of-the-art F1 score on this benchmark, significantly outperforming previous methods. Through detailed quantitative analysis across event types, entity types, and instance complexity, as well as human evaluation, we show that our approach is more effective at handling the nuances of financial text and extracting high-quality entities. This work validates the potential of applying generative LLMs directly to complex, domain-specific information extraction tasks requiring structured output.
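
When a model emits entities with character spans as JSON, a simple downstream check is to confirm that each span really points at the claimed text. The sketch below assumes a hypothetical schema (a JSON list of objects with `text`, `start`, `end`, and `type` keys); the paper's actual output format is not given in the abstract.

```python
import json


def validate_entities(source_text, model_output):
    """Keep only entities whose [start, end) span matches their text."""
    records = json.loads(model_output)
    valid = []
    for record in records:
        start, end = record.get("start"), record.get("end")
        if not (isinstance(start, int) and isinstance(end, int)):
            continue  # span missing or not an integer offset
        in_bounds = 0 <= start < end <= len(source_text)
        if in_bounds and source_text[start:end] == record.get("text"):
            valid.append(record)
    return valid


# Hypothetical input sentence and model output.
text = "Acme Corp was fined."
output = json.dumps([
    {"text": "Acme Corp", "start": 0, "end": 9, "type": "company"},
    {"text": "Beta", "start": 3, "end": 7, "type": "company"},  # wrong span
])
print(validate_entities(text, output))
```

A check of this kind catches hallucinated entities and off-by-one spans before they reach a knowledge graph or an evaluation script.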

[195] A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs

Yihan Lin,Zhirong Bella Yu,Simon Lee

Main category: cs.CL

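TL;DR: The paper examines the potential and limitations of using large language models (LLMs) to generate synthetic electronic health records (EHRs), finding that LLMs perform well on small feature sets but struggle to preserve realistic distributions and correlations in high-dimensional data.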

Details

Motivation: Synthetic EHR data can protect privacy and support healthcare applications, but its generalization across hospitals remains an open problem. Method: Evaluates the ability of commercial LLMs to generate synthetic EHRs and analyzes multiple aspects of the generation process. Result: LLMs are reliable on small feature sets but struggle to preserve realistic distributions and correlations in high-dimensional data. Conclusion: LLMs show potential for generating synthetic EHRs but need improvement to handle high-dimensional data. Abstract: Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without concerns about compromising real individuals' privacy. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify areas where these models excel and where they fall short. Our main finding from this work is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.

[196] Trans-Zero: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data

Wei Zou,Sen Yang,Yu Bao,Shujian Huang,Jiajun Chen,Shanbo Cheng

Main category: cs.CL

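TL;DR: TRANS-ZERO is a self-play framework that uses only monolingual data and the intrinsic multilingual knowledge of LLMs; by combining Genetic Monte-Carlo Tree Search with preference optimization, it achieves translation performance that rivals supervised methods.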

Details

Motivation: To address data scarcity for low-resource languages and catastrophic forgetting in multilingual machine translation. Method: Proposes the TRANS-ZERO framework, which combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization and uses only monolingual data and the multilingual knowledge of LLMs. Result: Experiments show that the approach not only matches models trained on large-scale parallel data but also excels in non-English translation directions. Conclusion: G-MCTS explores semantically consistent candidates through iterative translation, significantly improving translation quality and underpinning the framework's success. Abstract: The rise of Large Language Models (LLMs) has reshaped machine translation (MT), but multilingual MT still relies heavily on parallel data for supervised fine-tuning (SFT), facing challenges like data scarcity for low-resource languages and catastrophic forgetting. To address these issues, we propose TRANS-ZERO, a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of LLMs. TRANS-ZERO combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization, achieving strong translation performance that rivals supervised methods. Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions. Further analysis reveals that G-MCTS itself significantly enhances translation quality by exploring semantically consistent candidates through iterative translations, providing a robust foundation for the framework's success.

[197] FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

Mehrnoush Shamsfard,Zahra Saaberi,Mostafa Karimi manesh,Seyed Mohammad Hossein Hashemi,Zahra Vatankhah,Motahareh Ramezani,Niki Pourazin,Tara Zare,Maryam Azimi,Sarina Chitsaz,Sama Khoraminejad,Morteza Mahdavi Mortazavi,Mohammad Mahdi Chizari,Sahar Maleki,Seyed Soroush Majd,Mostafa Masumi,Sayed Ali Musavi Khoeini,Amir Mohseni,Sogol Alipour

Main category: cs.CL

TL;DR: The paper introduces the FarsEval-PKBETS benchmark for evaluating Persian large language models; results show that current models score below 50% accuracy.

Details

Motivation: To study how large language models perform in lower-resource languages such as Persian, filling a gap in existing research. Method: Builds FarsEval-PKBETS, a Persian benchmark of 4000 questions covering tasks across many domains, and evaluates three models on it. Result: The three models' average accuracy was below 50%, indicating that current models struggle with the benchmark. Conclusion: Persian large language models still have substantial room for improvement. Abstract: Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark.

[198] OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

Songtao Jiang,Yuan Wang,Sibo Song,Yan Zhang,Zijie Meng,Bohan Lei,Jian Wu,Jimeng Sun,Zuozhu Liu

Main category: cs.CL

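TL;DR: OmniV-Med is a unified framework for multimodal medical understanding that achieves state-of-the-art performance on 2D/3D medical image and video tasks by building a comprehensive dataset, designing an adaptive encoder, and introducing a medical-aware token pruning mechanism.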

Details

Motivation: Existing medical vision-language models typically use separate encoders for different modalities, which hinders seamless integration of multimodal data. Method: 1. Constructs a multimodal medical dataset of 252K samples; 2. Designs a rotary position-adaptive encoder; 3. Introduces a medical-aware token pruning mechanism to reduce redundancy. Result: OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks, and the lightweight variant (OmniV-Med-1.5B) attains comparable performance with low training resource requirements. Conclusion: Through its unified framework and efficient design, OmniV-Med significantly improves the capability and practicality of multimodal medical understanding. Abstract: The practical deployment of medical vision-language models (Med-VLMs) necessitates seamless integration of textual data with diverse visual modalities, including 2D/3D images and videos, yet existing models typically employ separate encoders for different modalities. To address this limitation, we present OmniV-Med, a unified framework for multimodal medical understanding. Our technical contributions are threefold: First, we construct OmniV-Med-Instruct, a comprehensive multimodal medical dataset containing 252K instructional samples spanning 14 medical image modalities and 11 clinical tasks. Second, we devise a rotary position-adaptive encoder that processes multi-resolution 2D/3D images and videos within a unified architecture, diverging from conventional modality-specific encoders. Third, we introduce a medical-aware token pruning mechanism that exploits spatial-temporal redundancy in volumetric data (e.g., consecutive CT slices) and medical videos, effectively reducing 60% of visual tokens without performance degradation. Empirical evaluations demonstrate that OmniV-Med-7B achieves state-of-the-art performance on 7 benchmarks spanning 2D/3D medical imaging and video understanding tasks. Notably, our lightweight variant (OmniV-Med-1.5B) attains comparable performance while requiring only 8 RTX3090 GPUs for training and supporting efficient long-video inference. Data, code and model will be released.

[199] Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives

Ratna Kandala,Katie Hoemann

Main category: cs.CL

TL;DR: The paper compares BERTopic, LDA, and KMeans on Belgian Dutch daily narratives and finds that BERTopic outperforms LDA and KMeans in semantic relevance.

Details

Motivation: To explore BERTopic's potential for morphologically rich languages such as Belgian Dutch and compare it with traditional methods (LDA and KMeans). Method: Applies BERTopic, LDA, and KMeans to topic modeling of open-ended Belgian Dutch daily narratives and compares performance using automated metrics and human evaluation. Result: BERTopic produces more semantically relevant and culturally resonant themes, while LDA and KMeans perform worse on semantic relevance and coherence. Conclusion: The study highlights the importance of combining contextual embeddings with hybrid evaluation frameworks in NLP models, especially for morphologically rich languages. Abstract: This study explores BERTopic's potential for modeling open-ended Belgian Dutch daily narratives, contrasting its performance with Latent Dirichlet Allocation (LDA) and KMeans. Although LDA scores well on certain automated metrics, human evaluations reveal semantically irrelevant co-occurrences, highlighting the limitations of purely statistic-based methods. In contrast, BERTopic's reliance on contextual embeddings yields culturally resonant themes, underscoring the importance of hybrid evaluation frameworks that account for morphologically rich languages. KMeans performed less coherently than prior research suggested, pointing to the unique challenges posed by personal narratives. Our findings emphasize the need for robust generalization in NLP models, especially in underrepresented linguistic contexts.

[200] PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

Reya Vir,Shreya Shankar,Harrison Chase,Will Fu-Hinthorn,Aditya Parameswaran

Main category: cs.CL

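TL;DR: PROMPTEVALS is a dataset of 2087 LLM pipeline prompts and 12623 assertion criteria, intended to improve LLM reliability in production. Fine-tuned Mistral and Llama 3 models outperform GPT-4o at generating assertions.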

Details

Motivation: LLMs in production often fail to meet developer expectations and need assertions or guardrails to improve reliability, but determining the right assertion criteria for a task is challenging. Method: Introduces the PROMPTEVALS dataset and uses a held-out test split to evaluate how well closed- and open-source models generate assertions. Result: Fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, with lower latency. Conclusion: The PROMPTEVALS dataset may spur further research in LLM reliability, alignment, and prompt engineering. Abstract: Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
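
To illustrate what assertions "running alongside" an LLM pipeline might look like, the sketch below applies a list of named checks to one model output and reports which ones fail. The three criteria and the helper `run_assertions` are hypothetical examples; they are not drawn from the PROMPTEVALS dataset or from the authors' tooling.

```python
def run_assertions(output, assertions):
    """Return the names of all assertions that the output fails."""
    return [name for name, check in assertions if not check(output)]


# Hypothetical assertion criteria for a customer-support prompt.
assertions = [
    ("non_empty", lambda s: bool(s.strip())),
    ("max_50_words", lambda s: len(s.split()) <= 50),
    ("no_apology", lambda s: "sorry" not in s.lower()),
]

failed = run_assertions("Sorry, I cannot help with that.", assertions)
print(failed)
```

In a real pipeline a failed assertion would typically trigger a retry, a fallback response, or a log entry for later review.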

[201] Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings

Saniya Karwa,Navpreet Singh

Main category: cs.CL

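TL;DR: The paper proposes a framework for uncovering how specific dimensions of vector embeddings in high-dimensional, opaque models such as BERT encode distinct linguistic properties (LPs), and introduces a new dataset, LDSP-10, and a new metric, the EDI score.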

Details

Motivation: Understanding the inner workings of neural embeddings, especially in high-dimensional, opaque models such as BERT, remains a challenge. Method: Uses the LDSP-10 dataset together with the Wilcoxon signed-rank test, mutual information, and recursive feature elimination to analyze BERT embeddings, and introduces the EDI score to quantify each dimension's relevance to an LP. Result: Properties such as negation and polarity are robustly encoded in specific dimensions, while synonymy shows more complex patterns. Conclusion: The study offers insights into embedding interpretability that can guide the development of more transparent and optimized language models, with implications for bias mitigation and the responsible deployment of AI systems. Abstract: Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to an LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.
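
The core idea of dimension-wise analysis over sentence pairs can be shown with a deliberately simple stand-in: rank embedding dimensions by how much they change, on average, between the two sentences of each pair. This mean-absolute-difference ranking is not the paper's EDI score, Wilcoxon test, or feature-elimination procedure; the function name and the toy three-dimensional vectors are hypothetical.

```python
def rank_dimensions(pairs):
    """Rank dimensions by mean absolute difference across embedding pairs.

    `pairs` is a list of (embedding_a, embedding_b) tuples, where the two
    sentences of a pair differ in a single linguistic property.
    """
    dim = len(pairs[0][0])
    scores = [
        sum(abs(a[d] - b[d]) for a, b in pairs) / len(pairs)
        for d in range(dim)
    ]
    # Most affected dimension first.
    return sorted(range(dim), key=lambda d: scores[d], reverse=True)


# Toy embeddings for two sentence pairs that differ only in negation.
pairs = [
    ([0.1, 0.9, 0.3], [0.1, -0.8, 0.35]),
    ([0.2, 0.7, 0.1], [0.25, -0.6, 0.1]),
]
print(rank_dimensions(pairs)[0])
```

A statistical test over the paired differences, as the abstract describes, would additionally tell whether a dimension's shift is consistent across pairs rather than driven by a few outliers.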

[202] Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Luyang Fang,Xiaowei Yu,Jiazhang Cai,Yongkai Chen,Shushan Wu,Zhengliang Liu,Zhenyuan Yang,Haoran Lu,Xilin Gong,Yufang Liu,Terry Ma,Wei Ruan,Ali Abbasi,Jing Zhang,Tao Wang,Ehsan Latif,Wei Liu,Wei Zhang,Soheil Kolouri,Xiaoming Zhai,Dajiang Zhu,Wenxuan Zhong,Tianming Liu,Ping Ma

Main category: cs.CL

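TL;DR: The paper surveys two complementary paradigms, Knowledge Distillation (KD) and Dataset Distillation (DD), which aim to compress large language models (LLMs) while preserving their reasoning capabilities and linguistic diversity, and discusses their potential for integration and their applications.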

Details

Motivation: As LLMs grow exponentially, computational and data demands rise sharply, creating an urgent need for efficient compression strategies that address model scalability and performance preservation. Method: Analyzes key KD methods (such as task-specific alignment and multi-teacher frameworks) and DD techniques (such as gradient matching and generative synthesis), and explores strategies for integrating the two. Result: Integrating KD and DD yields more efficient and scalable compression strategies for LLMs, applicable to domains such as healthcare and education, though preserving reasoning ability and adapting to evolving data remain challenges. Conclusion: By integrating KD and DD principles, the survey charts a path toward sustainable, resource-efficient LLMs, while evaluation protocols and dynamic adaptation still need to be addressed. Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

[203] Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends

Jiaxin GUO,Xiaoyu Chen,Zhiqiang Rao,Jinlong Yang,Zongyao Li,Hengchao Shang,Daimeng Wei,Hao Yang

Main category: cs.CL

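TL;DR: This paper surveys the state of automatic evaluation for document-level machine translation, analyzes the challenges facing existing methods, and outlines future directions.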

Details

Motivation: The rapid development of document-level machine translation calls for more accurate evaluation methods, but automatic evaluation still has many problems. Method: Analyzes reference-based and reference-free evaluation methods, including traditional metrics, model-based metrics, and LLM-based metrics. Result: Identifies shortcomings of current evaluation methods, such as insufficient reference diversity, dependence on sentence-level alignment information, and the bias and inaccuracy of LLM-based evaluation. Conclusion: Proposes future research directions, such as developing more user-friendly document-level evaluation methods and more robust LLM-based evaluation methods that rely less on sentence-level information. Abstract: With the rapid development of deep learning technologies, the field of machine translation has witnessed significant progress, especially with the advent of large language models (LLMs) that have greatly propelled the advancement of document-level translation. However, accurately evaluating the quality of document-level translation remains an urgent issue. This paper first introduces the development status of document-level translation and the importance of evaluation, highlighting the crucial role of automatic evaluation metrics in reflecting translation quality and guiding the improvement of translation systems. It then provides a detailed analysis of the current state of automatic evaluation schemes and metrics, including evaluation methods with and without reference texts, as well as traditional metrics, model-based metrics and LLM-based metrics. Subsequently, the paper explores the challenges faced by current evaluation methods, such as the lack of reference diversity, dependence on sentence-level alignment information, and the bias, inaccuracy, and lack of interpretability of the LLM-as-a-judge method. Finally, the paper looks ahead to the future trends in evaluation methods, including the development of more user-friendly document-level evaluation methods and more robust LLM-as-a-judge methods, and proposes possible research directions, such as reducing the dependency on sentence-level information, introducing multi-level and multi-granular evaluation approaches, and training models specifically for machine translation evaluation. This study aims to provide a comprehensive analysis of automatic evaluation for document-level translation and offer insights into future developments.

[204] On Self-improving Token Embeddings

Mario M. Kubek,Shiraj Pokharel,Thomas Böhme,Emma L. McDaniel,Herwig Unger,Armin R. Mikler

Main category: cs.CL

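TL;DR: This article proposes a fast new method for refining pre-trained static word or token embeddings: it continuously updates each token's representation by incorporating the embeddings of neighboring tokens in a text corpus, including tokens without pre-assigned embeddings. The method is independent of large language models and shallow neural networks and suits tasks such as corpus exploration, conceptual search, and word sense disambiguation.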

Details

Motivation: To address the out-of-vocabulary problem in pre-trained embeddings and the poor fit of general-purpose embeddings to specific domains, improving token representations in topically homogeneous corpora. Method: Continuously updates token representations by incorporating the embeddings of neighboring tokens; the method targets domain-specific corpora and does not depend on large models. Result: Applied to the NOAA Storm Events database, the method improves the representation of storm-related terms and reveals how disaster narratives evolve. Conclusion: The method effectively improves the quality of token embeddings in specific domains and offers a new tool for corpus exploration and conceptual analysis. Abstract: This article introduces a novel and fast method for refining pre-trained static word or, more generally, token embeddings. By incorporating the embeddings of neighboring tokens in text corpora, it continuously updates the representation of each token, including those without pre-assigned embeddings. This approach also effectively addresses the out-of-vocabulary problem. Operating independently of large language models and shallow neural networks, it enables versatile applications such as corpus exploration, conceptual search, and word sense disambiguation. The method is designed to enhance token representations within topically homogeneous corpora, where the vocabulary is restricted to a specific domain, resulting in more meaningful embeddings compared to general-purpose pre-trained vectors. As an example, the methodology is applied to explore storm events and their impacts on infrastructure and communities using narratives from a subset of the NOAA Storm Events database. The article also demonstrates how the approach improves the representation of storm-related terms over time, providing valuable insights into the evolving nature of disaster narratives.
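
The abstract describes updating each token's vector from the embeddings of its neighbors, including tokens that start without a vector. The exact update rule is not stated, so the sketch below assumes a simple one: tokens without a pre-trained vector start at zero, and each occurrence moves the token's vector toward the mean of its context window with a fixed rate `alpha`. The function name, the parameters `window` and `alpha`, and the toy vectors are all hypothetical.

```python
def refine_embeddings(tokens, embeddings, dim, window=1, alpha=0.5):
    """One pass of neighbor-based refinement over a token sequence.

    Assumed rule: new_vec = (1 - alpha) * old_vec + alpha * mean(neighbors).
    """
    # Tokens missing from `embeddings` (out-of-vocabulary) start at zero.
    vecs = {t: list(embeddings.get(t, [0.0] * dim)) for t in set(tokens)}
    for i, token in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbors = [vecs[tokens[j]] for j in range(lo, hi) if j != i]
        if not neighbors:
            continue
        mean = [sum(v[d] for v in neighbors) / len(neighbors) for d in range(dim)]
        vecs[token] = [(1 - alpha) * x + alpha * m for x, m in zip(vecs[token], mean)]
    return vecs


# "derecho" has no pre-trained vector; its neighbors do.
pretrained = {"storm": [1.0, 0.0], "wind": [0.0, 1.0]}
refined = refine_embeddings(["storm", "derecho", "wind"], pretrained, dim=2)
print(refined["derecho"])
```

After one pass the out-of-vocabulary token has a non-zero vector that sits between its neighbors, which is the behavior the abstract attributes to the method; repeated passes over a domain corpus would continue to pull co-occurring terms together.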

[205] Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation

Jiajun Shen,Tong Zhou,Yubo Chen,Delai Qiu,Shengping Liu,Kang Liu,Jun Zhao

Main category: cs.CL

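TL;DR: The paper proposes a citation generation task that combines external and internal knowledge and designs the RAEL paradigm and the INTRALIGN method; experiments show better cross-scenario performance than baselines.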

Details

Motivation: To address the opacity of how large language models use internal knowledge when generating citations, and the resulting trustworthiness concerns. Method: Proposes the Context-Prior Augmented Citation Generation task and designs the RAEL paradigm and the INTRALIGN method, which combines data generation with an alignment algorithm. Result: Experiments show better cross-scenario performance than baselines, and that retrieval quality, question types, and model knowledge considerably influence citation trustworthiness. Conclusion: By combining external and internal knowledge, the method improves the trustworthiness and performance of citation generation. Abstract: While hallucinations of large language models can be alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce the Context-Prior Augmented Citation Generation task, requiring models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves better cross-scenario performance than other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on the trustworthiness in citation generation.

[206] Natural Fingerprints of Large Language Models

Teppei Suzuki,Ryokan Ri,Sho Takase

Main category: cs.CL

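TL;DR: The study finds that even when trained on identical data, large language models (LLMs) develop distinctive natural fingerprints from subtle differences in the training process (such as parameter sizes, optimization settings, and random seeds), shedding light on where model behavior biases come from.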

Details

Motivation: To investigate what causes the systematic deviations (natural fingerprints) in LLM outputs, in order to improve control over model behavior. Method: Systematically controls training conditions (such as parameter sizes, optimization settings, and random seeds) and analyzes the distinctive characteristics of the generated text. Result: Even with identical training data, LLMs develop distinguishable natural fingerprints from subtle differences in the training process. Conclusion: Understanding natural fingerprints helps reveal the origins of model bias and suggests new ways to improve control over LLM behavior. Abstract: Large language models (LLMs) often exhibit biases -- systematic deviations from expected norms -- in their outputs. These range from overt issues, such as unfair responses, to subtler patterns that can reveal which model produced them. We investigate the factors that give rise to identifiable characteristics in LLMs. Since LLMs model the training data distribution, it is reasonable that differences in training data naturally lead to the characteristics. However, our findings reveal that even when LLMs are trained on the exact same data, it is still possible to distinguish the source model based on its generated text. We refer to these unintended, distinctive characteristics as natural fingerprints. By systematically controlling training conditions, we show that the natural fingerprints can emerge from subtle differences in the training process, such as parameter sizes, optimization settings, and even random seeds. We believe that understanding natural fingerprints offers new insights into the origins of unintended bias and ways for improving control over LLM behavior.

[207] Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey

Aoran Gan,Hao Yu,Kai Zhang,Qi Liu,Wenyu Yan,Zhenya Huang,Shiwei Tong,Guoping Hu

Main category: cs.CL

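TL;DR: This paper surveys evaluation methods for Retrieval-Augmented Generation (RAG) systems, analyzes traditional and emerging approaches, and compiles the relevant datasets and frameworks.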

Details

Motivation: RAG systems pose unique evaluation challenges because of their hybrid architecture and dependence on dynamic knowledge, so existing methods need to be reviewed systematically to advance the field. Method: Systematically reviews RAG evaluation methods covering performance, factual accuracy, safety, and computational efficiency, and conducts a meta-analysis. Result: Summarizes the current state of RAG evaluation, categorizes datasets and frameworks, and bridges traditional and LLM-driven methods. Conclusion: According to the authors, this is the most comprehensive survey of RAG evaluation to date and an important resource for future research. Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval, enabling accurate, up-to-date, and verifiable text generation across diverse applications. However, evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components, as well as their dependence on dynamic knowledge sources in the LLM era. In response, this paper provides a comprehensive survey of RAG evaluation methods and frameworks, systematically reviewing traditional and emerging evaluation approaches for system performance, factual accuracy, safety, and computational efficiency in the LLM era. We also compile and categorize the RAG-specific datasets and evaluation frameworks, conducting a meta-analysis of evaluation practices in high-impact RAG research. To the best of our knowledge, this work represents the most comprehensive survey for RAG evaluation, bridging traditional and LLM-driven methods, and serves as a critical resource for advancing RAG development.

[208] CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs

Yingming Zheng,Xiaoliang Liu,Peng Wu,Li Pan

Main category: cs.CL

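TL;DR: CRAVE proposes an explainable claim verification method based on conflicting reasoning: large language models (LLMs) generate rationales for conflicting stances and a small language model (SLM) makes the final judgment, significantly improving the accuracy and transparency of verifying complex claims.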

Details

Motivation: Digital media and AI-generated content spread misinformation quickly; traditional methods that rely on expert-annotated evidence are inefficient and hard to scale, and existing automated systems reason poorly about complex claims. Method: CRAVE uses a three-module framework: 1) eliminate ambiguity and retrieve evidence; 2) use LLMs to reason about conflicting stances along four dimensions and make a preliminary judgment; 3) use an SLM to assess the conflicting stances and make the final judgment. Result: On two public datasets, CRAVE outperforms existing methods and shows a superior capacity for finding relevant evidence and explaining model predictions. Conclusion: Through conflicting reasoning and a modular design, CRAVE effectively addresses the verification of complex claims with both high accuracy and explainability. Abstract: The rapid spread of misinformation, driven by digital media and AI-generated content, has made automatic claim verification essential. Traditional methods, which depend on expert-annotated evidence, are labor-intensive and not scalable. Although recent automated systems have improved, they still struggle with complex claims that require nuanced reasoning. To address this, we propose CRAVE, a Conflicting Reasoning Approach for explainable claim VErification, which verifies complex claims based on the conflicting rationales reasoned by large language models (LLMs). Specifically, CRAVE introduces a three-module framework. The Ambiguity Elimination enhanced Evidence Retrieval module performs ambiguity elimination and entity-based search to gather relevant evidence related to claim verification from external sources like Wikipedia. The Conflicting Perspective Reasoning and Preliminary Judgment module adopts LLMs to reason rationales with conflicting stances about claim verification from retrieved evidence across four dimensions, i.e., direct evidence, semantic relationships, linguistic patterns, and logical reasoning, and makes a preliminary judgment. Finally, the Small Language Model (SLM) based Judge module is fine-tuned to make use of the preliminary judgment from LLMs to assess the confidence of the conflicting rationales and make a final authenticity judgment. This methodology allows CRAVE to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification. Extensive experiments on two public claim verification datasets demonstrate that our CRAVE model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for finding relevant evidence and explaining the model predictions. The code is provided at https://github.com/8zym/CRAVE.

[209] Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues

Rui Ribeiro,Luísa Coheur,Joao P. Carvalho

Main category: cs.CL

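TL;DR: This paper explores using fuzzy fingerprints from pre-trained models to improve text-based speaker identification, combining speaker-specific tokens with context modeling to significantly improve accuracy.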

Details

Motivation: Traditional methods perform poorly when only textual data is available; this work aims to improve text-based speaker identification with a new approach. Method: Applies fuzzy fingerprints together with speaker-specific tokens and context-aware modeling on top of pre-trained models. Result: Reaches 70.6% accuracy on the Friends dataset and 67.7% on the Big Bang Theory dataset; fuzzy fingerprints approach full fine-tuning performance. Conclusion: Fuzzy fingerprints and context modeling significantly improve text-based speaker identification and point to directions for future work. Abstract: Speaker identification using voice recordings leverages unique acoustic features, but this approach fails when only textual data is available. Few approaches have attempted to tackle the problem of identifying speakers solely from text, and the existing ones have primarily relied on traditional methods. In this work, we explore the use of fuzzy fingerprints from large pre-trained models to improve text-based speaker identification. We integrate speaker-specific tokens and context-aware modeling, demonstrating that conversational context significantly boosts accuracy, reaching 70.6% on the Friends dataset and 67.7% on the Big Bang Theory dataset. Additionally, we show that fuzzy fingerprints can approximate full fine-tuning performance with fewer hidden units, offering improved interpretability. Finally, we analyze ambiguous utterances and propose a mechanism to detect speaker-agnostic lines. Our findings highlight key challenges and provide insights for future improvements in text-based speaker identification.

[210] Evaluating LLMs on Chinese Topic Constructions: A Research Proposal Inspired by Tian et al. (2024)

Xiaodong Yang

Main category: cs.CL

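TL;DR: This paper proposes a framework for evaluating large language models (LLMs) on Chinese topic constructions, focusing on their sensitivity to island constraints.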

Details

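Motivation: Inspired by Tian et al. (2024), the proposal aims to lay a foundation for future studies and to solicit methodological feedback. Method: Outlines an experimental design for testing LLMs' grammatical knowledge of Mandarin; no experiments have been run yet. Result: No experimental results yet; the paper only provides a framework for future work. Conclusion: The paper lays groundwork for future studies and welcomes feedback on the methodology. Abstract: This paper proposes a framework for evaluating large language models (LLMs) on Chinese topic constructions, focusing on their sensitivity to island constraints. Drawing inspiration from Tian et al. (2024), we outline an experimental design for testing LLMs' grammatical knowledge of Mandarin syntax. While no experiments have been conducted yet, this proposal aims to provide a foundation for future studies and invites feedback on the methodology.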

[211] Efficient Pretraining Length Scaling

Bohong Wu,Shen Yan,Sijun Zhang,Jianqiao Lu,Yutao Zeng,Ya Wang,Xun Zhou

Main category: cs.CL

TL;DR: The paper proposes PHD-Transformer, a new framework that enables efficient length scaling during pre-training while maintaining inference efficiency.

Details

Motivation: Explore the potential of length scaling during pre-training, which existing work has left underexplored. Method: An innovative KV cache management strategy distinguishes original tokens from hidden decoding tokens, and two optimized variants (PHD-SWA and PHD-CSWA) further improve performance. Result: Consistent improvements across multiple benchmarks. Conclusion: PHD-Transformer enables efficient length scaling during pre-training while maintaining inference efficiency, offering a new direction for related research. Abstract: Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
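The cache policy described in the abstract can be illustrated with a toy bookkeeping sketch. This is not the authors' implementation: the class `ToyKVCache`, the function `decode_with_hidden_copies`, and the `repeats` parameter are hypothetical names, and token ids stand in for real key/value tensors. It only shows that the cache grows with the number of original tokens while the compute grows with the number of hidden decoding copies.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Holds one entry per retained token (a stand-in for key/value tensors)."""
    entries: list = field(default_factory=list)

    def append(self, token_id: int) -> None:
        self.entries.append(token_id)


def decode_with_hidden_copies(tokens: list, repeats: int) -> tuple:
    """Process each original token plus (repeats - 1) hidden decoding copies.

    Returns (cache_size, compute_steps). Only original tokens are cached;
    hidden copies read the cache and are dropped immediately after use.
    """
    cache = ToyKVCache()
    steps = 0
    for tok in tokens:
        cache.append(tok)                 # original token: KV entry is retained
        steps += 1
        for _ in range(repeats - 1):
            _ = len(cache.entries)        # hidden copy attends to the retained cache
            steps += 1                    # extra compute, but nothing is cached
    return len(cache.entries), steps


cache_size, steps = decode_with_hidden_copies([11, 12, 13, 14], repeats=3)
print(cache_size, steps)
```

With four original tokens and `repeats=3`, the sketch performs twelve processing steps but keeps only four cache entries, the same cache size a vanilla decoder would have for four tokens.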

[212] Stay Hungry, Stay Foolish: On the Extended Reading Articles Generation with LLMs

Yow-Fu Liou,Yu-Chien Tang,An-Zi Yen

Main category: cs.CL

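TL;DR: This study explores the potential of large language models (LLMs) to automate the generation of educational materials and course suggestions, easing the burden on educators. Using the TED-Ed Dig Deeper sections as a case study, it shows how to generate extended articles and recommend related courses.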

Details

Motivation: Creating educational materials is time-consuming and demanding; the study aims to automate the process with LLMs to improve efficiency and quality. Method: Generates extended articles from video transcripts, enriched with historical, cultural, and anecdotal content; recommends courses via semantic similarity ranking; then refines relevance with an LLM. Result: Experiments show the model produces high-quality content and accurate course recommendations, improving the learning experience. Conclusion: LLMs can effectively connect core content with supplementary learning resources, helping teachers design materials while giving students additional resources. Abstract: The process of creating educational materials is both time-consuming and demanding for educators. This research explores the potential of Large Language Models (LLMs) to streamline this task by automating the generation of extended reading materials and relevant course suggestions. Using the TED-Ed Dig Deeper sections as an initial exploration, we investigate how supplementary articles can be enriched with contextual knowledge and connected to additional learning resources. Our method begins by generating extended articles from video transcripts, leveraging LLMs to include historical insights, cultural examples, and illustrative anecdotes. A recommendation system employing semantic similarity ranking identifies related courses, followed by an LLM-based refinement process to enhance relevance. The final articles are tailored to seamlessly integrate these recommendations, ensuring they remain cohesive and informative. Experimental evaluations demonstrate that our model produces high-quality content and accurate course suggestions, assessed through metrics such as Hit Rate, semantic similarity, and coherence. Our experimental analysis highlights the nuanced differences between the generated and existing materials, underscoring the model's capacity to offer more engaging and accessible learning experiences. This study showcases how LLMs can bridge the gap between core content and supplementary learning, providing students with additional recommended resources while also assisting teachers in designing educational materials.

[213] LLMs as Data Annotators: How Close Are We to Human Performance

Muhammad Uzair Ul Haq,Davide Rigoni,Alessandro Sperduti

Main category: cs.CL

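TL;DR: The paper examines the challenges of using LLMs to automatically produce high-quality annotated data in NLP, considers a retrieval-augmented generation (RAG) method to improve in-context learning (ICL), and compares different LLMs and embedding models on the NER task.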

Details

Motivation: Manual data annotation is costly and time-consuming, while existing ICL approaches rely on manually selected context examples, which is inefficient and yields suboptimal performance. Method: Experimentally compares several LLMs and embedding models on the NER task and introduces a RAG-based method that retrieves context examples automatically. Result: The results show that choosing the right LLM and embedding model is crucial, that model size must be traded off against performance, and that more challenging datasets deserve attention. Conclusion: The study underscores the need to improve ICL methods and recommends that future research focus on more complex datasets. Abstract: In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL) in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately 7B and 70B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity to direct research efforts towards more challenging datasets.
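Automatic retrieval of in-context examples can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, and the example pool, the prompt wording, and the function names are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query: str, pool: list, k: int) -> list:
    """Return the k annotated examples most similar to the query sentence."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pool: list, k: int = 2) -> str:
    """Assemble a few-shot NER prompt from the retrieved examples."""
    parts = ["Label the named entities in each sentence."]
    for sentence, annotation in retrieve_examples(query, pool, k):
        parts.append(f"Sentence: {sentence}\nEntities: {annotation}")
    parts.append(f"Sentence: {query}\nEntities:")
    return "\n\n".join(parts)


# Hypothetical annotated pool of (sentence, annotation) pairs.
POOL = [
    ("Alice flew to Paris", "Alice=PER; Paris=LOC"),
    ("Acme hired Bob", "Acme=ORG; Bob=PER"),
    ("The weather is nice today", "none"),
]

print(build_prompt("Carol flew to Berlin", POOL, k=1))
```

Swapping `embed` for a real sentence-embedding model is the part the abstract identifies as mattering in practice, since the choice of embedding model affects which examples are retrieved.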

[214] DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Chengyu Wang,Junbing Yan,Yuanhao Yue,Jun Huang

Main category: cs.CL

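TL;DR: DistilQwen2.5 is a family of lightweight large language models distilled from the Qwen2.5 models, with improved instruction-following ability and demonstrated potential for efficient deployment in industrial practice.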

Details

Motivation: Address the computational efficiency and deployment cost of large language models in resource-constrained scenarios. Method: Uses multi-agent teacher models to select and rewrite instruction-response pairs, then applies model fusion to progressively integrate fine-grained knowledge from the teachers. Result: The distilled models are significantly more capable than the original models. Conclusion: The DistilQwen2.5 models offer an efficient solution for resource-constrained scenarios and have been open-sourced to support practical use. Abstract: Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.

[215] RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

Quy-Anh Dang,Chris Ngo,Truong-Son Hy

Main category: cs.CL

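TL;DR: RainbowPlus is a new red-teaming framework based on evolutionary computation that enhances adversarial prompt generation through adaptive quality-diversity search, significantly improving attack success rate and diversity.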

Details

Motivation: Large language models (LLMs) are vulnerable to adversarial prompts, and existing red-teaming methods scale poorly, are resource-intensive, or use a narrow range of attack strategies. Method: RainbowPlus uses a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts at once, improving on earlier quality-diversity methods. Result: Across multiple benchmark datasets and LLMs, RainbowPlus achieves higher attack success rate and diversity, generates more unique prompts, and runs faster. Conclusion: RainbowPlus provides a scalable tool for LLM safety assessment, and its open-source implementation supports further research. Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score ≈ 0.84), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
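The difference between a single-prompt archive and a multi-element archive can be shown with a small data-structure sketch. It is an assumption-laden illustration rather than the RainbowPlus code: the class `MultiElementArchive`, the cell descriptors, and the placeholder items and fitness scores are hypothetical, and the fitness values are supplied directly instead of being computed by a judge model.

```python
from collections import defaultdict


class MultiElementArchive:
    """Keeps the top-`capacity` items per cell, ranked by fitness.

    A classical MAP-Elites archive corresponds to capacity == 1.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.cells = defaultdict(list)  # cell descriptor -> [(fitness, item), ...]

    def add_batch(self, cell: tuple, scored_items: list) -> None:
        """Merge a batch of (fitness, item) pairs into a cell in one step."""
        merged = self.cells[cell] + list(scored_items)
        merged.sort(key=lambda pair: pair[0], reverse=True)
        self.cells[cell] = merged[: self.capacity]

    def size(self) -> int:
        return sum(len(items) for items in self.cells.values())


archive = MultiElementArchive(capacity=2)
# Placeholder candidates with hypothetical fitness scores for one cell.
archive.add_batch(("category_a", "style_x"), [(0.9, "p1"), (0.4, "p2"), (0.7, "p3")])
archive.add_batch(("category_b", "style_x"), [(0.5, "p4")])
print(archive.cells[("category_a", "style_x")], archive.size())
```

Scoring a whole batch before merging mirrors the abstract's contrast with pairwise comparison: several candidates can enter a cell in one update instead of one candidate replacing one incumbent.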

[216] Testing LLMs' Capabilities in Annotating Translations Based on an Error Typology Designed for LSP Translation: First Experiments with ChatGPT

Joachim Minder,Guillaume Wisniewski,Natalie Kübler

Main category: cs.CL

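TL;DR: The study examines ChatGPT's ability to annotate machine translation output according to an error typology, finding that it performs well on specialised translation but has limited self-assessment ability.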

Details

Motivation: Explore the potential of large language models such as ChatGPT for error annotation in specialised translation, an area neglected by prior work that focuses on general language. Method: Using two different prompts and a customised error typology, compares ChatGPT's annotations with those of human experts on translations produced by DeepL and by ChatGPT. Result: ChatGPT achieves high recall and precision on DeepL translations but performs worse when assessing its own output; the level of detail in the prompt affects categorisation accuracy. Conclusion: LLMs show potential for translation evaluation but have limitations; future work could study open-source LLMs and their practical application. Abstract: This study investigates the capabilities of large language models (LLMs), specifically ChatGPT, in annotating MT outputs based on an error typology. In contrast to previous work focusing mainly on general language, we explore ChatGPT's ability to identify and categorise errors in specialised translations. By testing two different prompts and based on a customised error typology, we compare ChatGPT annotations with human expert evaluations of translations produced by DeepL and ChatGPT itself. The results show that, for translations generated by DeepL, recall and precision are quite high. However, the degree of accuracy in error categorisation depends on the prompt's specific features and its level of detail, with ChatGPT performing very well with a detailed prompt. When evaluating its own translations, ChatGPT achieves significantly poorer results, revealing limitations with self-assessment. These results highlight both the potential and the limitations of LLMs for translation evaluation, particularly in specialised domains. Our experiments pave the way for future research on open-source LLMs, which could produce annotations of comparable or even higher quality. In the future, we also aim to test the practical effectiveness of this automated evaluation in the context of translation training, particularly by optimising the process of human evaluation by teachers and by exploring the impact of annotations by LLMs on students' post-editing and translation learning.

[217] Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models

K. Wong,B. Wu,S. Bulathwela,M. Cukurova

Main category: cs.CL

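TL;DR: The study examines the potential of multimodal data for diagnosing students' collaborative problem solving (CPS) competency, finding that multimodal data improves model performance in specific cases, but that the benefit depends on label complexity and dataset composition.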

Details

Motivation: Explore the practical value of multimodal data and advanced models for detecting complex CPS behaviours, especially in authentic educational settings. Method: Builds a multimodal classification model from text embeddings and acoustic embeddings, and compares traditional models with unimodal and multimodal transformer-based models. Result: Multimodal data improved the diagnosis of social-cognitive CPS classes in transformer-based models but brought no significant improvement to traditional models. Conclusion: Multimodality and model choice should be evaluated carefully against the specific type of CPS indicator and the characteristics of the dataset; future work should combine human and AI strengths and explore better model architectures. Abstract: Detecting collaborative and problem-solving behaviours from digital traces to interpret students' collaborative problem solving (CPS) competency is a long-term goal in the Artificial Intelligence in Education (AIEd) field. Although multimodal data and advanced models are argued to have the potential to detect complex CPS behaviours, empirical evidence on their value remains limited with some contrasting evidence. In this study, we investigated the potential of multimodal data to improve model performance in diagnosing 78 secondary school students' CPS subskills and indicators in authentic educational settings. In particular, text embeddings from verbal data and acoustic embeddings from audio data were used in a multimodal classification model for CPS diagnosis. Both unimodal and multimodal transformer-based models outperformed traditional models in detecting CPS classes. Although the inclusion of multimodality did not improve the performance of traditional unimodal models, its integration into transformer-based models demonstrated improved performance for diagnosing social-cognitive CPS classes compared to unimodal transformer-based models. Based on the results, the paper argues that multimodality and the selection of a particular modelling technique should not be taken for granted to achieve the best performance in the automated detection of every CPS subskill and indicator. Rather, their value is limited to certain types of CPS indicators, affected by the complexity of the labels, and dependent on the composition of indicators in the dataset.
We conclude the paper by discussing the required nuance when considering the value of LLMs and multimodality in automated CPS diagnosis, highlighting the need for human-AI complementarity, and proposing the exploration of relevant model architectures and techniques to improve CPS diagnosis in authentic educational contexts.

[218] Kuwain 1.5B: An Arabic SLM via Language Injection

Khalil Hennara,Sara Chrouf,Mohamed Motaism Hamed,Zeina Aldallal,Omar Hadid,Safwan AlModhayan

Main category: cs.CL

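TL;DR: The paper proposes a method for integrating a new language into a large language model (LLM), successfully injecting Arabic into a small open-source model with an 8% performance gain and no loss of prior knowledge.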

Details

Motivation: Enhancing existing models with new knowledge is central to AI development, but conventional approaches are costly and resource-intensive. Method: Trains Kuwain, a tiny 1.5-billion-parameter model, by injecting Arabic into a small open-source model trained mainly on English. Result: Arabic performance improves by 8% on average while prior knowledge is retained, using only a minimal amount of the original model's data. Conclusion: The method offers a cost-effective alternative for efficient, targeted language model expansion without large-scale retraining. Abstract: Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.

[219] EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Ziwen Xu,Shuxun Wang,Kewei Xu,Haoming Xu,Mengru Wang,Xinle Deng,Yunzhi Yao,Guozhou Zheng,Huajun Chen,Ningyu Zhang

Main category: cs.CL

TL;DR: EasyEdit2 is a framework that supports plug-and-play adjustment of large language model (LLM) behavior without modifying model parameters.

Details

Motivation: Provide a simple, efficient way to control LLM behavior across many intervention scenarios, such as safety, sentiment, and factuality. Method: Adopts a new architecture, including a steering vector generator and a steering vector applier, that automatically generates and applies steering vectors from a single example. Result: Experiments show that EasyEdit2 is effective across different LLMs, user-friendly, and simple to operate. Conclusion: EasyEdit2 offers an efficient, easy-to-use solution for controlling LLM behavior, with open-source code and demo resources. Abstract: In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://zjunlp.github.io/project/EasyEdit2/video for a quick introduction.

[220] The Synthetic Imputation Approach: Generating Optimal Synthetic Texts For Underrepresented Categories In Supervised Classification Tasks

Joan C. Timoneda

Main category: cs.CL

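TL;DR: The paper proposes a synthetic imputation approach that uses a generative LLM (such as GPT-4o) to generate synthetic texts, addressing category imbalance in training data. With 75 original examples, performance is on par with a full sample, and overfitting remains controllable.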

Details

Motivation: When building a high-quality training set, some categories may have too few examples, which hurts the performance of encoder-decoder LLMs such as BERT and RoBERTa. Method: Uses a generative LLM (GPT-4o) to generate synthetic texts from a small number of original examples, keeping the new texts different enough from the originals to reduce overfitting while preserving their meaning. Result: With 75 or more original examples, synthetic imputation performs on par with a full sample; with 50 examples, overfitting is low and correctable. Conclusion: The synthetic imputation approach gives generative LLMs a new role in research and helps applied researchers balance their datasets for best performance. Abstract: Encoder-decoder Large Language Models (LLMs), such as BERT and RoBERTa, require that all categories in an annotation task be sufficiently represented in the training data for optimal performance. However, it is often difficult to find sufficient examples for all categories in a task when building a high-quality training set. In this article, I describe this problem and propose a solution, the synthetic imputation approach. Leveraging a generative LLM (GPT-4o), this approach generates synthetic texts based on careful prompting and five original examples drawn randomly with replacement from the sample. This approach ensures that new synthetic texts are sufficiently different from the original texts to reduce overfitting, but retain the underlying substantive meaning of the examples to maximize out-of-sample performance. With 75 original examples or more, synthetic imputation's performance is on par with a full sample of original texts, and overfitting remains low, predictable and correctable with 50 original samples. The synthetic imputation approach provides a novel role for generative LLMs in research and allows applied researchers to balance their datasets for best performance.
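The sampling step in the abstract (five original examples drawn randomly with replacement) can be sketched as below. Only the draw itself comes from the abstract; the prompt wording, the function name `build_imputation_prompt`, and the sample texts are hypothetical, and the call to a generative model is deliberately left out.

```python
import random


def build_imputation_prompt(examples: list, category: str, rng: random.Random) -> str:
    """Draw five originals with replacement and wrap them in a generation prompt."""
    drawn = rng.choices(examples, k=5)  # sampling with replacement
    lines = [
        f"Write one new text for the category '{category}'.",
        "It should keep the substantive meaning of the examples but differ in wording.",
    ]
    for i, text in enumerate(drawn, start=1):
        lines.append(f"Example {i}: {text}")
    return "\n".join(lines)


# Hypothetical minority-class texts.
originals = ["text A", "text B", "text C"]
prompt = build_imputation_prompt(originals, "underrepresented class", random.Random(0))
print(prompt)
```

Because the draw is with replacement, the same original can appear more than once in a prompt, and the procedure still works when a category has fewer than five originals.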

[221] On true empty category

Qilin Tian

Main category: cs.CL

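TL;DR: The paper discusses the true empty category hypothesis, which holds that some empty object positions cannot be explained by existing empty categories, and argues that a topicalization analysis makes the notion of a "true empty category" unnecessary.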

Details

Motivation: Examine whether existing empty category theory (PRO, pro, trace, and so on) suffices to explain all empty object phenomena, without introducing unnecessary assumptions. Method: Evaluates, through an analysis of topicalization, the evidence for the true empty category hypothesis proposed by Li and colleagues. Result: The topicalization data can be explained without invoking a true empty category. Conclusion: Existing empty category theory is sufficient to explain the relevant phenomena, so the true empty category hypothesis is not needed. Abstract: According to Chomsky (1981, 1986), empty categories consist of PRO, pro, trace, and variable. However, some empty object positions seem to be incompatible with extant empty categories. Given this, Li (2007a, 2007b, 2014) and Li & Wei (2014) raise the true empty category hypothesis, which holds that true empty category is only an empty position with category and Case features. As a last resort option, it is used mainly to meet the subcategorization of a verb. This assumption is ingenious, and if proved to be true, it will exert a great impact on the study of UG. In this paper, we evaluate their evidence from topicalization and demonstrate that it can be accounted for without invoking true empty category.

[222] Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Nandan Thakur,Ronak Pradeep,Shivani Upadhyay,Daniel Campos,Nick Craswell,Jimmy Lin

Main category: cs.CL

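TL;DR: This paper studies how to evaluate whether cited documents support the answer in retrieval-augmented generation (RAG), comparing automatic assessment by GPT-4o with human assessment and finding that GPT-4o is reliable for support evaluation.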

Details

Motivation: Examine methods for assessing whether cited documents support the answers of RAG systems, in order to reduce hallucinations and make generated answers more trustworthy. Method: A large-scale comparative study of 45 submissions on 36 topics from the TREC 2024 RAG Track, comparing GPT-4o with human judges under fully manual assessment and under manual post-editing of LLM predictions. Result: GPT-4o and human judgments match in 56% of cases under the fully manual condition, rising to 72% under the post-editing condition; independent human assessment correlates more strongly with GPT-4o. Conclusion: GPT-4o can serve as a reliable alternative for support assessment, and assessment methods need further refinement to reduce errors. Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
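The "match perfectly on a three-level scale" figure is an exact-agreement rate, which can be computed as in this sketch. The integer encoding of the three levels (0, 1, 2) and the toy label lists are assumptions for illustration; they are not data from the track.

```python
from collections import Counter


def match_rate(human: list, llm: list) -> float:
    """Fraction of items where both judges assign the same support level."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def disagreements(human: list, llm: list) -> Counter:
    """Count (human_label, llm_label) pairs where the judges differ."""
    return Counter((h, m) for h, m in zip(human, llm) if h != m)


# Hypothetical labels on an assumed 0/1/2 scale (no / partial / full support).
human_labels = [2, 1, 0, 2]
llm_labels = [2, 0, 0, 2]
print(match_rate(human_labels, llm_labels), disagreements(human_labels, llm_labels))
```

Exact agreement is a strict measure: it does not correct for chance and treats a 1-versus-2 disagreement the same as 0-versus-2, so a chance-corrected or weighted statistic may be preferable when levels are ordered.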

[223] EvalAgent: Discovering Implicit Evaluation Criteria from the Web

Manya Wadhwa,Zayne Sprague,Chaitanya Malaviya,Philippe Laban,Junyi Jessy Li,Greg Durrett

Main category: cs.CL

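TL;DR: The paper introduces EvalAgent, a framework that automatically discovers implicit, task-specific evaluation criteria for language model outputs, combining expert guidance with LLM-generated criteria to improve evaluation quality.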

Details

Motivation: Existing evaluation methods usually rely on explicit criteria or LLM-generated criteria and overlook implicit markers of quality, such as the typical structure of an academic talk. Method: EvalAgent mines expert-authored online guidance, proposes diverse evaluation criteria grounded in this external evidence, and combines them with LLM-generated criteria. Result: Experiments show that EvalAgent's criteria are implicit yet specific and can guide models to refine their outputs; combined with LLM criteria, they uncover more criteria that humans value. Conclusion: The EvalAgent framework effectively complements existing evaluation methods and helps improve the quality of language model outputs. Abstract: Evaluation of language model outputs on structured writing tasks is typically conducted with a number of desirable criteria presented to human evaluators or large language models (LLMs). For instance, on a prompt like "Help me draft an academic talk on coffee intake vs research productivity", a model response may be evaluated for criteria like accuracy and coherence. However, high-quality responses should do more than just satisfy basic task requirements. An effective response to this query should include quintessential features of an academic talk, such as a compelling opening, clear research questions, and a takeaway. To help identify these implicit criteria, we introduce EvalAgent, a novel framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent first mines expert-authored online guidance. It then uses this evidence to propose diverse, long-tail evaluation criteria that are grounded in reliable external sources. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit (not directly stated in the user's prompt), yet specific (high degree of lexical precision). Further, EvalAgent criteria are often not satisfied by initial responses but they are actionable, such that responses can be refined to satisfy them. Finally, we show that combining LLM-generated and EvalAgent criteria uncovers more human-valued criteria than using LLMs alone.

[224] Fully Bayesian Approaches to Topics over Time

Julián Cendrero,Julio Gonzalo,Ivar Zapata

Main category: cs.CL

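TL;DR: The paper proposes a fully Bayesian Topics over Time model (BToT) that resolves the stability problems of the original ToT model by introducing a conjugate prior to the Beta distribution, and further proposes a weighted version (WBToT) to balance the influence of the time and word modalities. Experiments show that WBToT outperforms existing methods in event capture and topic coherence.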

Details

Motivation: The original ToT model is not fully Bayesian, which causes stability problems, and there is a difference in scale between the single time observation and the many words per document. Method: Introduces a conjugate prior to the Beta distribution as a regularizer, yielding BToT; then proposes WBToT, which repeats each document's publication date to balance the influence of the two modalities. Result: On the SOTU and COVID-19 tweet datasets, WBToT outperforms LDA and BERTopic, reducing the median absolute deviation of topic presence over time by 51% and 34% respectively, and its online optimization algorithm is more stable. Conclusion: By balancing the time and word modalities, WBToT markedly improves the stability and event-capturing ability of the topic model and suits large-scale dynamic datasets. Abstract: The Topics over Time (ToT) model captures thematic changes in timestamped datasets by explicitly modeling publication dates jointly with word co-occurrence patterns. However, ToT was not approached in a fully Bayesian fashion, a flaw that makes it susceptible to stability problems. To address this issue, we propose a fully Bayesian Topics over Time (BToT) model via the introduction of a conjugate prior to the Beta distribution. This prior acts as a regularization that prevents the online version of the algorithm from unstable updates when a topic is poorly represented in a mini-batch. The characteristics of this prior to the Beta distribution are studied here for the first time. Still, this model suffers from a difference in scale between the single-time observations and the multiplicity of words per document. A variation of BToT, Weighted Bayesian Topics over Time (WBToT), is proposed as a solution. In WBToT, publication dates are repeated a certain number of times per document, which balances the relative influence of words and timestamps along the inference process. We have tested our models on two datasets: a collection of over 200 years of US state-of-the-union (SOTU) addresses and a large-scale COVID-19 Twitter corpus of 10 million tweets. The results show that WBToT captures events better than Latent Dirichlet Allocation and other SOTA topic models like BERTopic: the median absolute deviation of the topic presence over time is reduced by 51% and 34%, respectively. Our experiments also demonstrate the superior coherence of WBToT over BToT, which highlights the importance of balancing the time and word modalities.
Finally, we illustrate the stability of the online optimization algorithm in WBToT, which allows the application of WBToT to problems that are intractable for standard ToT.
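The scale imbalance that WBToT addresses can be made concrete with a short likelihood sketch. It assumes, as in ToT, that a normalized timestamp t in (0, 1) is modeled by a Beta(a, b) density per topic; the function names, the `time_weight` parameter, and the numeric values are illustrative, and this is not the paper's inference algorithm. Repeating the timestamp w times as independent observations multiplies its log-likelihood term by w, which is what lets it compete with the sum over many word terms.

```python
import math


def beta_logpdf(t: float, a: float, b: float) -> float:
    """Log density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm


def doc_loglik(word_logprobs: list, t: float, a: float, b: float,
               time_weight: int = 1) -> float:
    """Per-topic document log-likelihood: word terms plus a weighted time term.

    time_weight = 1 gives one timestamp per document; a larger value
    corresponds to repeating the timestamp that many times.
    """
    return sum(word_logprobs) + time_weight * beta_logpdf(t, a, b)


# Hypothetical document: 100 words, one timestamp at t = 0.5, topic Beta(2, 2).
words = [math.log(0.01)] * 100
plain = doc_loglik(words, 0.5, 2.0, 2.0, time_weight=1)
weighted = doc_loglik(words, 0.5, 2.0, 2.0, time_weight=100)
print(plain, weighted)
```

In the unweighted case the single time term is dwarfed by the hundred word terms, so the timestamp has little say in topic assignment; raising `time_weight` toward the document length puts the two modalities on a comparable scale.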

[225] Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions

Saffron Huang,Esin Durmus,Miles McCain,Kunal Handa,Alex Tamkin,Jerry Hong,Michael Stern,Arushi Somani,Xiuruo Zhang,Deep Ganguli

Main category: cs.CL

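TL;DR: Using a privacy-preserving method, the study extracts the values that Claude 3 and 3.5 models exhibit in real-world interactions, finding that they support prosocial human values and that the values expressed vary with context.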

Details

Motivation: AI assistants may shape users' decisions and worldviews through value judgments, yet empirical research on the values they actually rely on is lacking. Method: A bottom-up, privacy-preserving method analyzes the values exhibited by the models across hundreds of thousands of real-world interactions. Result: The study extracts 3,307 AI values, finding support for prosocial values such as transparency and healthy boundaries, with values varying by context. Conclusion: The study provides an empirical foundation for evaluating and designing values in AI systems. Abstract: AI assistants can impart value judgments that shape people's decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like "moral nihilism". While some values appear consistently across contexts (e.g. "transparency"), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, "harm prevention" emerges when Claude resists users, "historical accuracy" when responding to queries about controversial events, "healthy boundaries" when asked for relationship advice, and "human agency" in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.

[226] MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

Yahan Yang,Soham Dan,Shuo Li,Dan Roth,Insup Lee

Main category: cs.CL

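TL;DR: Proposes a multilingual guardrail approach that uses synthetic data generation, supervised fine-tuning, and a GRPO framework to significantly improve the safety of LLMs in multilingual settings.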

Details

Motivation: LLMs are vulnerable to adversarial attacks in multilingual settings, where safety-aligned data is limited, so a guardrail that can detect and filter unsafe content across languages is needed. Method: Combines synthetic multilingual data generation, supervised fine-tuning, and a GRPO framework to improve performance. Result: Experiments show that the approach outperforms baselines on both in-domain and out-of-domain languages and can generate multilingual explanations. Conclusion: The approach effectively improves LLM safety in multilingual settings and is especially useful for identifying language-specific risks in content moderation. Abstract: Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data are often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we propose an approach to build a multilingual guardrail with reasoning. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-guided Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail consistently outperforms recent baselines across both in-domain and out-of-domain languages. The multilingual reasoning capability of our guardrail enables it to generate multilingual explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.

[227] Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Yilun Zhou,Austin Xu,Peifeng Wang,Caiming Xiong,Shafiq Joty

Main category: cs.CL

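TL;DR: The paper studies how well LLM-judges (models that generate natural-language evaluations) work as evaluators for test-time compute scaling, finding them competitive on some tasks but weaker than conventional reward models on others.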

Details

Motivation: Examine the effectiveness of LLM-judges in test-time scaling, which remains largely unknown despite their success in automatic evaluation. Method: Introduces the JETTS benchmark, which evaluates 10 judge models across three domains (math reasoning, code generation, instruction following) and three task settings (response reranking, step-level beam search, critique-based response refinement). Result: Judges are competitive with outcome reward models in reranking but worse than process reward models in beam search, and their natural-language critiques are of limited use in guiding the generator. Conclusion: LLM-judges are effective on specific tasks, but overall performance leaves room for improvement, especially in critique-guided generation. Abstract: Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
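
The response-reranking setting described above amounts to best-of-N selection: sample several candidates, score each with a judge, keep the top one. The sketch below illustrates that idea only and is not the JETTS implementation. The names `rerank_best_of_n` and `length_judge` are hypothetical, and the toy scorer stands in for what would in practice be an LLM call returning a numeric rating.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    responses: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate response that the judge scores highest."""
    if not responses:
        raise ValueError("need at least one candidate response")
    return max(responses, key=lambda r: judge_score(prompt, r))


def length_judge(prompt: str, response: str) -> float:
    """Toy stand-in for an LLM judge (assumption): prefers longer answers."""
    return float(len(response))


candidates = ["4", "2 + 2 = 4", "The answer is 4 because 2 + 2 = 4."]
best = rerank_best_of_n("What is 2 + 2?", candidates, length_judge)
```

Passing the judge as a callable keeps the selection logic independent of how scores are produced, so the same function would work with an outcome reward model for comparison.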

eess.AS [Back]

[228] The First VoicePrivacy Attacker Challenge

Natalia Tomashenko,Xiaoxiao Miao,Emmanuel Vincent,Junichi Yamagishi

Main category: eess.AS

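TL;DR: This ICASSP 2025 SP Grand Challenge evaluated attacker systems against voice anonymization systems; the best attackers reduced the EER by 25-44% relative to the baseline.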

Details

Motivation: Evaluate how effective attacker systems are against voice anonymization systems, in order to advance voice privacy protection. Method: Training, development, and evaluation datasets were provided along with a baseline attacker; participants built automatic speaker verification systems and submitted their scores. Result: The best attacker systems reduced the EER by 25-44% relative to the baseline. Conclusion: The challenge showed that attacks against voice anonymization systems can be made considerably stronger than the baseline. Abstract: The First VoicePrivacy Attacker Challenge is an ICASSP 2025 SP Grand Challenge which focuses on evaluating attacker systems against a set of voice anonymization systems submitted to the VoicePrivacy 2024 Challenge. Training, development, and evaluation datasets were provided along with a baseline attacker. Participants developed their attacker systems in the form of automatic speaker verification systems and submitted their scores on the development and evaluation data. The best attacker systems reduced the equal error rate (EER) by 25-44% relative to the baseline.
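
For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance and false rejection rates of a verification system coincide; a stronger attacker yields a lower EER. The following is a rough sketch of how an EER and a relative reduction might be computed from trial scores. It is not the challenge's scoring tool: the function names and the simple threshold sweep are assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Fractional EER reduction relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Made-up scores for illustration only.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With a baseline EER of 0.40 and an attacker EER of 0.30, `relative_reduction` gives 0.25, which is how a "25% relative" figure would be read.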

[229] Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training

Christopher Ick,Gordon Wichern,Yoshiki Masuyama,François G. Germain,Jonathan Le Roux

Main category: eess.AS

TL;DR: MERL presents an RIR estimation system based on neural acoustic fields, used for data augmentation and speaker distance estimation.

Details

Motivation: Address room impulse response (RIR) estimation and speaker distance estimation, using an external dataset and room geometry to improve performance. Method: Pre-train a neural acoustic field on RIRs and geometries from an external dataset, adapt it to each target room, then predict RIRs and use them to train the distance estimation model. Result: With pre-training and adaptation, the system generates RIRs for augmentation, which are then used to train speaker distance estimation. Conclusion: The approach combines external data with room geometry to support both RIR estimation and distance estimation. Abstract: This report details MERL's system for room impulse response (RIR) estimation submitted to the Generative Data Augmentation Workshop at ICASSP 2025 for Augmenting RIR Data (Task 1) and Improving Speaker Distance Estimation (Task 2). We first pre-train a neural acoustic field conditioned by room geometry on an external large-scale dataset in which pairs of RIRs and the geometries are provided. The neural acoustic field is then adapted to each target room by using the enrollment data, where we leverage either the provided room geometries or geometries retrieved from the external dataset, depending on availability. Lastly, we predict the RIRs for each pair of source and receiver locations specified by Task 1, and use these RIRs to train the speaker distance estimation model in Task 2.

[230] OmniAudio: Generating Spatial Audio from 360-Degree Video

Huadai Liu,Tianyi Luo,Qikai Jiang,Kaicheng Luo,Peiwen Sun,Jialei Wan,Rongjie Huang,Qian Chen,Wen Wang,Xiangtai Li,Shiliang Zhang,Zhijie Yan,Zhou Zhao,Wei Xue

Main category: eess.AS

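TL;DR: The paper proposes a new task, 360V2SA, which generates spatial audio (in FOA format) from 360-degree video, and introduces the Sphere360 dataset and the OmniAudio framework, which achieves state-of-the-art performance.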

Details

Motivation: Conventional video-to-audio generation lacks spatial cues and cannot accurately represent sound sources in 3D environments. Method: Builds the Sphere360 dataset with a semi-automated collection pipeline and proposes OmniAudio, a dual-branch framework that combines self-supervised pre-training with panoramic and FoV video inputs. Result: OmniAudio achieves state-of-the-art performance on Sphere360. Conclusion: The 360V2SA task and the OmniAudio framework offer an effective approach to 3D spatial audio generation. Abstract: Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and FoV video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets will be released at https://github.com/liuhuadai/OmniAudio. The demo page is available at https://OmniAudio-360V2SA.github.io.
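
To make the FOA format concrete, the sketch below encodes a single mono sample arriving from a given direction into the four first-order channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D gains, azimuth measured counterclockwise from the front); the abstract does not say which convention OmniAudio uses, and `encode_foa` is a hypothetical helper rather than part of the released code.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into FOA channels (W, Y, Z, X), AmbiX-style."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                               # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right axis
    z = sample * math.sin(el)                 # up-down axis
    x = sample * math.cos(az) * math.cos(el)  # front-back axis
    return (w, y, z, x)


front = encode_foa(1.0, 0.0, 0.0)   # source straight ahead
left = encode_foa(1.0, 90.0, 0.0)   # source directly to the left
```

A source straight ahead puts all directional energy in X, while one at 90 degrees azimuth moves it to Y; this directional split is the spatial cue that mono or stereo generation lacks.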

cs.SE [Back]

[231] Risk Assessment Framework for Code LLMs via Leveraging Internal States

Yuheng Huang,Lei Ma,Keizaburo Nishikino,Takumi Akazaki

Main category: cs.SE

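TL;DR: PtTrust is a two-stage risk assessment framework based on pre-training over internal states; it aims to make code LLMs more trustworthy and generalizes across tasks and languages through unsupervised pre-training followed by supervised training.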

Details

Motivation: Code LLMs can generate incorrect, insecure, or unreliable code, and existing approaches are confined to narrow sub-domains and lack industry-level scalability. Method: PtTrust works in two stages: (1) unsupervised pre-training to learn general representations of LLM states; (2) training a risk predictor on a small labeled dataset. Result: PtTrust is effective at code line-level risk assessment, generalizes across tasks and programming languages, and yields intuitive, interpretable features. Conclusion: PtTrust is a promising step toward scalable and trustworthy assurance for code LLMs. Abstract: The pre-training paradigm plays a key role in the success of Large Language Models (LLMs), which have been recognized as one of the most significant advancements of AI recently. Building on these breakthroughs, code LLMs with advanced coding capabilities have a huge impact on software engineering, showing the tendency to become an essential part of developers' daily routines. However, the current code LLMs still face serious challenges related to trustworthiness, as they can generate incorrect, insecure, or unreliable code. Recent exploratory studies find that it can be promising to detect such risky outputs by analyzing LLMs' internal states, akin to how the human brain unconsciously recognizes its own mistakes. Yet, most of these approaches are limited to narrow sub-domains of LLM operations and fall short of achieving industry-level scalability and practicability. To address these challenges, in this paper, we propose PtTrust, a two-stage risk assessment framework for code LLM based on internal state pre-training, designed to integrate seamlessly with the existing infrastructure of software companies. The core idea is that the risk assessment framework could also undergo a pre-training process similar to LLMs. Specifically, PtTrust first performs unsupervised pre-training on large-scale unlabeled source code to learn general representations of LLM states. Then, it uses a small, labeled dataset to train a risk predictor. We demonstrate the effectiveness of PtTrust through fine-grained, code line-level risk assessment and show that it generalizes across tasks and different programming languages. Further experiments also reveal that PtTrust provides highly intuitive and interpretable features, fostering greater user trust. We believe PtTrust makes a promising step toward scalable and trustworthy assurance for code LLMs.

[232] CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

Anirudh Khatry,Robert Zhang,Jia Pan,Ziteng Wang,Qiaochu Chen,Greg Durrett,Isil Dillig

Main category: cs.SE

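TL;DR: CRUST-Bench is a dataset for evaluating C-to-Rust transpilation, comprising 100 C repositories paired with safe Rust interfaces and test cases, designed to capture the challenges of translating complex projects.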

Details

Motivation: No existing dataset evaluates transpilation of C into safe Rust; CRUST-Bench fills this gap and supports validation on complex projects. Method: By providing manually written safe Rust interfaces and test cases, CRUST-Bench checks that transpiled code meets Rust's safety and functional requirements. Result: Even a state-of-the-art LLM (OpenAI o1) solves only 15 tasks in a single-shot setting, showing that transpilation remains challenging. Conclusion: CRUST-Bench provides a benchmark for improving transpilation systems and helps drive migration from C to memory-safe languages such as Rust. Abstract: C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The Rust interfaces give explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.

cs.CY [Back]

[233] Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

Suhas BN,Dominik Mattioli,Saeed Abdullah,Rosa I. Arriaga,Chris W. Wiese,Andrew M. Sherrill

Main category: cs.CY

TL;DR: The paper presents Thousand Voices of Trauma, a synthetic dataset of 3,000 simulated therapy conversations based on PTSD treatment protocols, filling a key gap in mental health data.

Details

Motivation: Progress on AI systems for mental health support is limited by scarce therapy conversation data, especially for trauma treatment. Method: Using deterministic and probabilistic generation, the authors create a dataset of 500 unique cases, each portrayed through six conversational perspectives that follow the course of therapy, covering diverse demographics, trauma types, and behaviors. Result: The dataset shows realistic distributions of trauma types and symptoms, and clinical experts validated its therapeutic fidelity. An emotional trajectory benchmark was also developed for model evaluation. Conclusion: This privacy-preserving dataset addresses the shortage of trauma-focused mental health data and is valuable for patient-facing applications and clinician training tools. Abstract: The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

[234] AI Safety Should Prioritize the Future of Work

Sanchaita Hazra,Bodhisattwa Prasad Majumder,Tuhin Chakrabarty

Main category: cs.CY

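TL;DR: The paper argues that current AI safety research focuses too narrowly on technical risks and overlooks AI's impact on the future of work and human livelihoods, and recommends supporting a fair transition through economic theory and a global governance framework.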

Details

Motivation: Current AI safety research concentrates on technical risks and neglects AI's long-term effects on social structure and human work, which may worsen income inequality and concentrate innovation in a few hands. Method: Analyzes the structural impact of AI on labor markets through economic theories and proposes a global governance framework and collective licensing. Result: The analysis suggests AI may exacerbate income inequality and innovation monopolies, calling for fair compensation mechanisms and global governance. Conclusion: Recommends a pro-worker global AI governance framework to promote shared prosperity and economic justice. Abstract: Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.

[235] Sentiment Analysis of Airbnb Reviews: Exploring Their Impact on Acceptance Rates and Pricing Across Multiple U.S. Regions

Ali Safari

Main category: cs.CY

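TL;DR: The study analyzes how the sentiment polarity of Airbnb reviews affects listing acceptance rates and prices, finding that reviews are overwhelmingly positive, that review volume has little effect on price, and that sentiment polarity matters more.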

Details

Motivation: Investigate whether the sentiment polarity (positive or negative) of Airbnb reviews affects listings' acceptance rates and prices. Method: Collects thousands of reviews, classifies sentiment with NLP, and tests for effects with t-tests and correlation analysis. Result: Over 90% of reviews are positive; review volume has no significant effect on price, but listings with mostly positive reviews have slightly higher acceptance rates; budget listings gather many reviews while pricing competitively, whereas premium listings sustain higher prices with fewer reviews. Conclusion: The sentiment quality of reviews shapes guest behavior and pricing strategy more than their quantity. Abstract: This research examines whether Airbnb guests' positive and negative comments influence acceptance rates and rental prices across six U.S. regions: Rhode Island, Broward County, Chicago, Dallas, San Diego, and Boston. Thousands of reviews were collected and analyzed using Natural Language Processing (NLP) to classify sentiments as positive or negative, followed by statistical testing (t-tests and basic correlations) on the average scores. The findings reveal that over 90 percent of reviews in each region are positive, indicating that having additional reviews does not significantly enhance prices. However, listings with predominantly positive feedback exhibit slightly higher acceptance rates, suggesting that sentiment polarity, rather than the sheer volume of reviews, is a more critical factor for host success. Additionally, budget listings often gather extensive reviews while maintaining competitive pricing, whereas premium listings sustain higher prices with fewer but highly positive reviews. These results underscore the importance of sentiment quality over quantity in shaping guest behavior and pricing strategies in an overwhelmingly positive review environment.

cs.SI [Back]

[236] VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform

Xingyu Lu,Tianke Zhang,Chang Meng,Xiaobei Wang,Jinpeng Wang,YiFan Zhang,Shisong Tang,Changyi Liu,Haojie Ding,Kaiyu Jiang,Kaiyu Tang,Bin Wen,Hai-Tao Zheng,Fan Yang,Tingting Gao,Di Zhang,Kun Gai

Main category: cs.SI

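TL;DR: The paper proposes KuaiMod, a framework for content moderation on short video platforms that combines a VLM with CoT reasoning to improve moderation accuracy and support dynamic policy updates.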

Details

Motivation: Content moderation on short video platforms suffers from human bias in manual review, insufficient accuracy in automated methods, and slow updates to industrial moderation regulations, so an efficient solution is needed. Method: Proposes KuaiMod, which comprises training data construction, offline adaptation, and online deployment and refinement, and uses a VLM with CoT reasoning to model video toxicity. Result: KuaiMod achieves the best performance on the benchmark; in online A/B tests the user reporting rate drops by 20%, and DAU and AUT increase. Conclusion: KuaiMod addresses these moderation challenges and offers the industry a dynamic, efficient solution. Abstract: Exponentially growing short video platforms (SVPs) face significant challenges in moderating content detrimental to users' mental health, particularly for minors. The dissemination of such content on SVPs can lead to catastrophic societal consequences. Although substantial efforts have been dedicated to moderating such content, existing methods suffer from critical limitations: (1) Manual review is prone to human bias and incurs high operational costs. (2) Automated methods, though efficient, lack nuanced content understanding, resulting in lower accuracy. (3) Industrial moderation regulations struggle to adapt to rapidly evolving trends due to long update cycles. In this paper, we annotate the first SVP content moderation benchmark with authentic user/reviewer feedback to fill the absence of a benchmark in this field. Then we evaluate various methods on the benchmark to verify the existence of the aforementioned limitations. We further propose our common-law content moderation framework named KuaiMod to address these challenges. KuaiMod consists of three components: training data construction, offline adaptation, and online deployment & refinement. Leveraging a large vision language model (VLM) and Chain-of-Thought (CoT) reasoning, KuaiMod adequately models video toxicity based on sparse user feedback and fosters dynamic moderation policy with rapid update speed and high accuracy. Offline experiments and a large-scale online A/B test demonstrate the superiority of KuaiMod: KuaiMod achieves the best moderation performance on our benchmark. The deployment of KuaiMod reduces the user reporting rate by 20% and its application in video recommendation increases both Daily Active User (DAU) and APP Usage Time (AUT) on several Kuaishou scenarios. We have open-sourced our benchmark at https://kuaimod.github.io.

[237] Rhythm of Opinion: A Hawkes-Graph Framework for Dynamic Propagation Analysis

Yulong Li,Zhixiang Lu,Feilong Tang,Simin Lai,Ming Hu,Yuxuan Zhang,Haochen Xue,Zhaodong Wu,Imran Razzak,Qingxia Li,Jionglong Su

Main category: cs.SI

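TL;DR: Proposes a method that combines multi-dimensional Hawkes processes with graph neural networks to model the dynamics of opinion propagation on social media, and introduces a new dataset, VISTA, to support research.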

Details

Motivation: Traditional models struggle to capture the complex dynamics of public opinion on social media, so a new approach is needed. Method: Integrates multi-dimensional Hawkes processes with a graph neural network to model opinion propagation among nodes in a social network and the hierarchical relationships between comments. Result: Introduces the VISTA dataset, which spans multiple domains with detailed annotations and, combined with the method, offers strong interpretability. Conclusion: The approach provides a robust baseline for future research, particularly for modeling opinion propagation dynamics. Abstract: The rapid development of social media has significantly reshaped the dynamics of public opinion, resulting in complex interactions that traditional models fail to effectively capture. To address this challenge, we propose an innovative approach that integrates multi-dimensional Hawkes processes with a Graph Neural Network, modeling opinion propagation dynamics among nodes in a social network while considering the intricate hierarchical relationships between comments. The extended multi-dimensional Hawkes process captures the hierarchical structure, multi-dimensional interactions, and mutual influences across different topics, forming a complex propagation network. Moreover, recognizing the lack of high-quality datasets capable of comprehensively capturing the evolution of public opinion dynamics, we introduce a new dataset, VISTA. It includes 159 trending topics, corresponding to 47,207 posts, 327,015 second-level comments, and 29,578 third-level comments, covering diverse domains such as politics, entertainment, sports, health, and medicine. The dataset is annotated with detailed sentiment labels across 11 categories and clearly defined hierarchical relationships. When combined with our method, it offers strong interpretability by linking sentiment propagation to the comment hierarchy and temporal evolution. Our approach provides a robust baseline for future research.
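
As background for the multi-dimensional Hawkes process mentioned above, the sketch below evaluates the textbook conditional intensity with an exponential kernel: each past event in dimension j raises the intensity of dimension k by `alpha[k][j]`, and that boost decays at rate `beta`. This is the standard form only, not the paper's extended process; the function name, the shared decay rate, and the example parameters are assumptions chosen for illustration.

```python
import math


def hawkes_intensity(t, dim, events, mu, alpha, beta):
    """Conditional intensity of dimension `dim` at time `t`.

    events: list of (event_time, source_dim) pairs.
    mu:     baseline rate per dimension.
    alpha:  alpha[k][j] is the excitation of dimension k by events in j.
    beta:   exponential decay rate (shared across dimensions here).
    """
    excitation = sum(
        alpha[dim][src] * math.exp(-beta * (t - s))
        for s, src in events
        if s < t  # only strictly earlier events contribute
    )
    return mu[dim] + excitation


# Hypothetical two-topic example: one event in topic 0 at time 0.
mu = [0.1, 0.2]
alpha = [[0.5, 0.0], [0.3, 0.4]]
events = [(0.0, 0)]
rate_topic1 = hawkes_intensity(1.0, 1, events, mu, alpha, beta=1.0)
```

The off-diagonal entry `alpha[1][0]` is what lets activity in one topic raise the expected rate of comments in another, which is the kind of mutual influence the abstract refers to.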

cond-mat.mtrl-sci [Back]

[238] System of Agentic AI for the Discovery of Metal-Organic Frameworks

Theo Jaffrelot Inizan,Sherry Yang,Aaron Kaplan,Yen-hsu Lin,Jian Yin,Saber Mirzaei,Mona Abdelgaid,Ali H. Alawadhi,KwangHwan Cho,Zhiling Zheng,Ekin Dogus Cubuk,Christian Borgs,Jennifer T. Chayes,Kristin A. Persson,Omar M. Yaghi

Main category: cond-mat.mtrl-sci

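TL;DR: MOFGen uses a multi-agent AI system (a language model, a diffusion model, and quantum mechanical agents) to generate novel MOF structures and validates their synthesizability experimentally.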

Details

Motivation: Address the challenges generative models face in materials discovery: exploring vast chemical spaces while ensuring synthesizability. Method: Combines a language model, a diffusion model, quantum mechanical agents, and synthetic-feasibility agents to generate and optimize MOF structures. Result: Generated hundreds of thousands of novel MOF structures and successfully synthesized five AI-designed MOFs. Conclusion: MOFGen is a major step toward automated discovery of synthesizable materials. Abstract: Generative models and machine learning promise accelerated material discovery in MOFs for CO2 capture and water harvesting but face significant challenges navigating vast chemical spaces while ensuring synthesizability. Here, we present MOFGen, a system of Agentic AI comprising interconnected agents: a large language model that proposes novel MOF compositions, a diffusion model that generates crystal structures, quantum mechanical agents that optimize and filter candidates, and synthetic-feasibility agents guided by expert rules and machine learning. Trained on all experimentally reported MOFs and computational databases, MOFGen generated hundreds of thousands of novel MOF structures and synthesizable organic linkers. Our methodology was validated through high-throughput experiments and the successful synthesis of five "AI-dreamt" MOFs, representing a major step toward automated synthesizable material discovery.

eess.IV [Back]

[239] Segmentation with Noisy Labels via Spatially Correlated Distributions

Ryu Tadokoro,Tsukasa Takagi,Shin-ichi Maeda

Main category: eess.IV

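TL;DR: The paper proposes a Bayesian estimation method based on a probabilistic model to handle spatially correlated label errors in semantic segmentation, significantly improving model performance.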

Details

Motivation: In semantic segmentation, high-quality annotations are critical for accuracy, but in practical settings such as medical imaging and remote sensing, true annotations are hard to obtain and error-prone, with errors that tend to be spatially correlated. Method: Uses approximate Bayesian estimation that assumes the training data contain label errors, modeling spatial correlation between adjacent pixels with a Gaussian distribution whose covariance has a KMS matrix structure. Result: Experiments show that exploiting the spatial correlation of label errors significantly improves performance; on specific tasks such as lung segmentation, the method performs comparably to training with clean labels under moderate noise. Conclusion: The method handles spatially correlated label errors and offers a solution for settings with imperfect annotations. Abstract: In semantic segmentation, the accuracy of models heavily depends on high-quality annotations. However, in many practical scenarios such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data includes label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. Bayesian inference requires computing the posterior distribution of label errors, which becomes intractable when spatial correlations are present. We represent the correlation of label errors between adjacent pixels through a Gaussian distribution whose covariance is structured by a Kac-Murdock-Szegő (KMS) matrix, solving the computational challenges. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.
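
For reference, a Kac-Murdock-Szegő matrix has entries that decay geometrically with the distance between indices, K[i][j] = rho^|i-j|, so nearby positions are strongly correlated and distant ones only weakly. The sketch below builds the one-dimensional matrix in plain Python. How the paper applies this structure to two-dimensional images is not stated in the abstract, so `kms_matrix` is an illustrative helper only.

```python
def kms_matrix(n, rho):
    """Return the n x n KMS matrix with entries rho ** |i - j|."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Correlation between positions falls from 1 to 0.5 to 0.25 with distance.
cov = kms_matrix(3, 0.5)
```

A single parameter `rho` controls the whole covariance, and the inverse of a KMS matrix is known to be tridiagonal, which is one reason this structure keeps Gaussian computations cheap.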

cs.LG [Back]

[240] Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Deyu Cao,Samin Aref

Main category: cs.LG

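TL;DR: The paper proposes an improved method built on ApiQ that adds saliency-aware regularization to boost performance in ultra-low-bit quantization without full retraining.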

Details

Motivation: Large language models are computationally demanding; quantization compresses them but may sacrifice accuracy. Existing methods such as ApiQ balance accuracy and efficiency, but there is room for improvement. Method: Builds on ApiQ's partial training with a saliency-aware regularization term that prioritizes preserving the most impactful parameters, yielding an ultra-low-bit quantization method. Result: Experiments on LLaMA models show improved accuracy and a narrower gap between quantized and full-precision models, with minimal overhead. Conclusion: The method performs well in ultra-low-bit quantization and will be made publicly available to support further research. Abstract: Large language models offer remarkable capabilities, but their size and computational demands pose practical challenges. Quantization methods compress their size by replacing their high-precision parameters with quantized values of lower precision. Post-training quantization reduces model size efficiently at the cost of decreased accuracy, while quantization-aware training better preserves accuracy but is resource-intensive. Among existing post-training quantization algorithms, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ's level. First, we look into combining existing quantization-aware training techniques with ApiQ's partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining may not be feasible through partial training. (2) This gain seems to depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. It relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on benchmark language models from the LLaMA family show that our proposed approach boosts accuracy and tightens the gap between the quantized model and the full-precision model, with minimal overhead. Our method will be made publicly available to facilitate future developments in ultra-low-bit quantization of large language models.

[241] ToolRL: Reward is All Tool Learning Needs

Cheng Qian,Emre Can Acikgoz,Qi He,Hongru Wang,Xiusi Chen,Dilek Hakkani-Tür,Gokhan Tur,Heng Ji

Main category: cs.LG

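TL;DR: The paper studies how reward design in reinforcement learning affects the tool-use capabilities of large language models (LLMs), proposes a new reward design, and reports substantial gains in experiments.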

Details

Motivation: Supervised fine-tuning (SFT) generalizes poorly to complex tool-use scenarios, while reward design for reinforcement learning (RL) is challenging and needs finer-grained feedback. Method: Systematically studies a range of reward strategies, proposes a reward design tailored to tool-use tasks, and trains LLMs with GRPO. Result: The approach performs well across benchmarks, improving by 17% over base models and 15% over SFT models. Conclusion: Carefully designed rewards substantially improve LLMs' tool-use capabilities and generalization; the code has been released. Abstract: Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.
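
The "group relative" part of GRPO can be illustrated in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group's mean and spread, so no separate value network is needed. The sketch below shows only this normalization step under stated assumptions (population standard deviation, a small `eps` for stability, made-up rewards); it is not ToolRL's reward design or training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt; two earned the reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason coarse rewards such as exact answer matching can provide little learning signal.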

[242] One Jump Is All You Need: Short-Cutting Transformers for Early Exit Prediction with One Jump to Fit All Exit Levels

Amrit Diggavi Seshadri

Main category: cs.LG

TL;DR: Proposes OJFA, a low-rank shortcut method that greatly reduces shortcut parameter costs at inference while maintaining performance.

Details

Motivation: Reduce the time and computational cost of LLM inference while maintaining performance. Method: Uses a single low-rank shortcut (OJFA) in place of conventional per-block shortcuts, sharply lowering parameter costs. Result: OJFA is stable on GPT2-XL, Phi3-Mini, and Llama2-7B, with performance close to that of multiple shortcuts. Conclusion: OJFA strikes a good balance between parameter efficiency and performance and applies to a range of transformer models. Abstract: To reduce the time and computational costs of inference of large language models, there has been interest in parameter-efficient low-rank early-exit casting of transformer hidden-representations to final-representations. Such low-rank short-cutting has been shown to outperform identity shortcuts at early model stages while offering parameter-efficiency in shortcut jumps. However, current low-rank methods maintain a separate early-exit shortcut jump to final-representations for each transformer intermediate block-level during inference. In this work, we propose selection of a single One-Jump-Fits-All (OJFA) low-rank shortcut that offers over a 30x reduction in shortcut parameter costs during inference. We show that despite this extreme reduction, our OJFA choice largely matches the performance of maintaining multiple shortcut jumps during inference and offers stable precision from all transformer block-levels for GPT2-XL, Phi3-Mini and Llama2-7B transformer models.
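
The saving from sharing one shortcut can be seen with simple parameter counting. The sketch below assumes each shortcut is a bias-free rank-r map from hidden size d to hidden size d, factored into two matrices, and compares one such map per exit level against a single shared map. The sizes are hypothetical and the functions are illustrative, so the resulting ratio should not be read as the paper's reported figure.

```python
def low_rank_shortcut_params(d_model, rank):
    """Parameters in one rank-`rank` map d_model -> d_model (B @ A, no bias)."""
    return 2 * d_model * rank


def shortcut_cost(d_model, rank, num_exit_levels, shared):
    """Total shortcut parameters kept at inference time."""
    per_jump = low_rank_shortcut_params(d_model, rank)
    return per_jump if shared else per_jump * num_exit_levels


# Hypothetical sizes for illustration only.
separate = shortcut_cost(d_model=1024, rank=64, num_exit_levels=32, shared=False)
single = shortcut_cost(d_model=1024, rank=64, num_exit_levels=32, shared=True)
ratio = separate // single
```

Under these assumptions the saving scales with the number of exit levels that would otherwise each need their own shortcut.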

[243] Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

Lucas Maisonnave,Cyril Moineau,Olivier Bichler,Fabrice Rastello

Main category: cs.LG

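TL;DR: This paper proposes a Hadamard-matrix-based quantization method that tackles the difficulty of low-bit quantization caused by activation outliers when deploying large language models (LLMs) on edge devices, substantially improving 3-bit quantization.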

Details

Motivation: Deploying LLMs on edge devices is limited by their large parameter counts; quantization is a common way to reduce memory and inference time, but outliers in LLM activations make low-bit quantization difficult. Method: Exploits the theoretical advantages of Hadamard matrices and proposes a gradual binary search that enables 3-bit quantization of weights, activations, and KV caches, with the Paley algorithm supporting non-power-of-2 embedding dimensions. Result: The method outperforms existing approaches on Mistral, LLaMA, and Qwen models, with a 40% accuracy increase, and enables 3-bit quantization. Conclusion: Hadamard matrices have a theoretical advantage in reducing outliers, and the method offers a practical route to low-bit quantization of LLMs. Abstract: Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time; however, LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method based on a gradual binary search enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, similar to the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrate the superiority of Hadamard matrices in reducing outliers. We achieved 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple model families like Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.
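
A small example may help show why a Hadamard rotation tames outliers. The sketch below builds a Hadamard matrix with the Sylvester construction (power-of-2 sizes only) and applies the orthonormal rotation to a vector with one large entry: the energy is spread evenly across coordinates while the norm is preserved. This is a generic illustration with hypothetical helper names. It does not implement the paper's gradual binary search or the Paley construction it uses for other dimensions.

```python
import math


def sylvester_hadamard(n):
    """Build an n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(x):
    """Apply the orthonormal Hadamard rotation H / sqrt(n) to vector x."""
    n = len(x)
    h = sylvester_hadamard(n)
    scale = 1 / math.sqrt(n)
    return [scale * sum(h[i][j] * x[j] for j in range(n)) for i in range(n)]


x = [8.0, 0.0, 0.0, 0.0]   # one outlier channel
y = rotate(x)              # energy spread over all four channels
```

Here the largest magnitude drops from 8 to 4 while the squared norm stays at 64, so a uniform quantizer covering the range of `y` wastes fewer levels on a single extreme value.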

[244] Integrating Single-Cell Foundation Models with Graph Neural Networks for Drug Response Prediction

Till Rossner,Ziteng Li,Jonas Balke,Nikoo Salehfard,Tom Seifert,Ming Tang

Main category: cs.LG

TL;DR: Proposes a new approach that combines scGPT with DeepCDR to predict cancer drug response (CDR); experiments show it outperforms existing methods.

Details

Motivation: Existing CDR prediction methods have limitations, and more accurate models are needed. Method: Uses scGPT to generate embeddings from gene expression data, which serve as input to DeepCDR. Result: The scGPT-based method outperforms the original DeepCDR model and other related methods. Conclusion: scGPT embeddings can improve CDR prediction accuracy and offer a promising alternative to existing approaches. Abstract: In this study, we propose an innovative methodology for predicting Cancer Drug Response (CDR) through the integration of the scGPT foundation model within the DeepCDR model. Our approach utilizes scGPT to generate embeddings from gene expression data, which are then used as gene expression input data for DeepCDR. The experimental findings demonstrate the efficacy of this scGPT-based method in outperforming previous related works, including the original DeepCDR model and the scFoundation-based model. This study highlights the potential of scGPT embeddings to enhance the accuracy of CDR predictions and offers a promising alternative to existing approaches.

[245] Improving RL Exploration for LLM Reasoning through Retrospective Replay

Shihan Dou,Muling Wu,Jingwen Xu,Rui Zheng,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.LG

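TL;DR: The paper proposes a new algorithm, RRL, which uses a dynamic replay mechanism to address the exploration problem of reinforcement learning in LLM post-training, significantly improving performance on complex reasoning tasks.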

Details

Motivation: In RL post-training of LLMs, valuable solution ideas found during early exploration may be suppressed, leaving the model unable to solve complex problems effectively later in training. Method: The paper proposes Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism that lets the model revisit promising states identified early on. Result: Experiments show that RRL significantly improves exploration efficiency and model performance on complex reasoning tasks (such as mathematical reasoning and code generation) and on dialogue tasks. Conclusion: RRL not only improves LLM performance on complex tasks but also improves RLHF, making the model safer and more helpful. Abstract: Reinforcement learning (RL) has increasingly become a pivotal technique in the post-training of large language models (LLMs). The effective exploration of the output space is essential for the success of RL. We observe that for complex problems, during the early stages of training, the model exhibits strong exploratory capabilities and can identify promising solution ideas. However, its limited capability at this stage prevents it from successfully solving these problems. The early suppression of these potentially valuable solution ideas by the policy gradient hinders the model's ability to revisit and re-explore these ideas later. Consequently, although the LLM's capabilities improve in the later stages of training, it still struggles to effectively address these complex problems. To address this exploration issue, we propose a novel algorithm named Retrospective Replay-based Reinforcement Learning (RRL), which introduces a dynamic replay mechanism throughout the training process. RRL enables the model to revisit promising states identified in the early stages, thereby improving its efficiency and effectiveness in exploration. To evaluate the effectiveness of RRL, we conduct extensive experiments on complex reasoning tasks, including mathematical reasoning and code generation, and general dialogue tasks. The results indicate that RRL maintains high exploration efficiency throughout the training period, significantly enhancing the effectiveness of RL in optimizing LLMs for complicated reasoning tasks. Moreover, it also improves the performance of RLHF, making the model both safer and more helpful.

[246] LoRe: Personalizing LLMs via Low-Rank Reward Modeling

Avinandan Bose,Zhihan Xiong,Yuejie Chi,Simon Shaolei Du,Lin Xiao,Maryam Fazel

Main category: cs.LG

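TL;DR: Proposes a new framework based on low-rank preference modeling for personalizing large language models (LLMs) to better accommodate diverse user preferences.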

Details

Motivation: Traditional reinforcement learning from human feedback (RLHF) approaches rely on a monolithic value representation and struggle to adapt to individual preferences, so a more flexible approach is needed. Method: Reward functions are represented in a low-dimensional subspace, and individual preferences are modeled as weighted combinations of shared basis functions, enabling scalability and few-shot adaptation. Result: The method is validated on multiple preference datasets, showing superior generalization to unseen users and improved accuracy on preference prediction tasks. Conclusion: The framework avoids rigid user categorization while enabling efficient learning and personalized adaptation. Abstract: Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.
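
The idea of modeling a user's reward as a weighted combination of shared basis functions can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the basis scores are hand-written toy numbers standing in for the outputs of learned basis reward functions, and the per-user weights are fitted with a plain Bradley-Terry logistic objective, which is an assumed choice for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def user_reward(weights, basis_scores):
    """User-specific reward: weighted combination of shared basis rewards."""
    return sum(w * s for w, s in zip(weights, basis_scores))

def fit_user_weights(pairs, num_basis, steps=200, lr=0.5):
    """Few-shot adaptation: fit one user's weights from a handful of
    (chosen_scores, rejected_scores) pairs by gradient ascent on the
    Bradley-Terry log-likelihood. The basis functions stay fixed."""
    w = [0.0] * num_basis
    for _ in range(steps):
        grad = [0.0] * num_basis
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p = sigmoid(user_reward(w, diff))  # P(chosen preferred)
            for k in range(num_basis):
                grad[k] += (1.0 - p) * diff[k]
        w = [wk + lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w

# Hypothetical basis scores per response: (conciseness, level of detail).
# This user consistently picks the more concise response.
pairs = [
    ([1.0, 0.2], [0.3, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
    ([0.9, 0.5], [0.4, 0.6]),
]
w = fit_user_weights(pairs, num_basis=2)
print("fitted user weights:", [round(v, 2) for v in w])
```

Only the small weight vector is user-specific, which is what makes adaptation from a few comparisons plausible in this kind of setup.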

[247] LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Yunhui Xia,Wei Shen,Yan Wang,Jason Klein Liu,Huifeng Sun,Siyue Wu,Jian Hu,Xiaolong Xu

Main category: cs.LG

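TL;DR: LeetCodeDataset is a high-quality benchmark for evaluating and training code-generation models that addresses the lack of reasoning-focused coding benchmarks and self-contained training testbeds in LLM research.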

Details

Motivation: LLM research lacks reasoning-focused coding benchmarks and self-contained training testbeds. Method: The authors curate LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), enabling contamination-free evaluation and efficient supervised fine-tuning (SFT). Result: Experiments show that reasoning models significantly outperform non-reasoning models, and SFT with only 2.6K model-generated solutions achieves performance comparable to training on 110K samples. Conclusion: LeetCodeDataset and its evaluation framework have been released on Hugging Face and GitHub to support research on code-generation models. Abstract: We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
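
A temporal split of this kind can be sketched in a few lines. The record layout, the field name `released`, the example entries, and the exact cutoff day of 1 July 2024 are assumptions for illustration; the abstract only states that the split is pre/post July 2024.

```python
from datetime import date

# Assumed cutoff day; the abstract only says "pre/post July 2024".
CUTOFF = date(2024, 7, 1)

def temporal_split(problems, cutoff=CUTOFF):
    """Problems released before the cutoff go to training; the rest are
    held out, so a model trained on the first part cannot have seen them."""
    train = [p for p in problems if p["released"] < cutoff]
    test = [p for p in problems if p["released"] >= cutoff]
    return train, test

# Hypothetical records; real entries would carry statements and test cases.
problems = [
    {"slug": "problem-a", "released": date(2023, 11, 5)},
    {"slug": "problem-b", "released": date(2024, 6, 30)},
    {"slug": "problem-c", "released": date(2024, 7, 1)},
    {"slug": "problem-d", "released": date(2024, 9, 14)},
]

train, test = temporal_split(problems)
print(len(train), "training problems,", len(test), "held-out problems")
```

Splitting by release date rather than at random is what guards against contamination: a held-out problem cannot appear in data collected before it existed.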

[248] Learning to Reason under Off-Policy Guidance

Jianhao Yan,Yafu Li,Zican Hu,Zhi Wang,Ganqu Cui,Xiaoye Qu,Yu Cheng,Yue Zhang

Main category: cs.LG

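TL;DR: The LUFFY framework combines off-policy reasoning traces with a dynamic balance of imitation and exploration, significantly improving reasoning-model performance, especially in generalization.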

Details

Motivation: Existing zero-RL methods are limited to on-policy learning and cannot acquire reasoning abilities beyond the model's initial capabilities. Method: LUFFY combines off-policy reasoning traces with on-policy training and uses policy shaping via regularized importance sampling to avoid superficial imitation. Result: LUFFY achieves an average gain of over 7.0 points across six math benchmarks and an advantage of over 6.2 points on out-of-distribution tasks, substantially surpassing imitation-based supervised fine-tuning. Conclusion: LUFFY not only imitates effectively but also explores beyond the demonstrations, offering a scalable path to training generalizable reasoning models. Abstract: Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.

[249] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Vaishnavh Nagarajan,Chen Henry Wu,Charles Ding,Aditi Raghunathan

Main category: cs.LG

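TL;DR: The paper designs a suite of minimal algorithmic tasks to quantify the creativity of language models, finds that multi-token approaches outperform next-token learning, and proposes a new noise-injection method (hash-conditioning).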

Details

Motivation: To study the creative limits of language models on open-ended tasks and explore how existing methods can be improved to generate more diverse and original outputs. Method: The authors design abstract tasks, compare next-token and multi-token learning approaches, and propose the hash-conditioning noise-injection technique. Result: Multi-token approaches (such as teacherless training and diffusion models) perform better, and hash-conditioning elicits randomness without hurting coherence. Conclusion: The work provides a test-bed for analyzing open-ended creativity and supports going beyond next-token learning and softmax-based sampling. Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic and memorizes excessively; comparatively, multi-token approaches, namely teacherless training and diffusion models, excel in producing diverse and original output. Secondly, in our tasks, we find that to elicit randomness from the Transformer without hurting coherence, it is better to inject noise right at the input layer (via a method we dub hash-conditioning) rather than defer to temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and softmax-based sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity

[250] Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing

Mehmet Yamaç,Muhammad Numan Yousaf,Serkan Kiranyaz,Moncef Gabbouj

Main category: cs.LG

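TL;DR: The paper proposes a new neural network operator, Multiscale Tensor Summation (MTS) Factorization, which sums tensors at multiple scales to improve efficiency and shows advantages over traditional dense and convolutional layers.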

Details

Motivation: To address the efficiency problems of multilayer perceptrons (MLPs) and convolutional neural networks (CNNs) with high-dimensional input-output pairs while enlarging the receptive field. Method: MTS Factorization is introduced as a new neural network layer that performs multiscale tensor summation through Tucker-decomposition-like mode products. Result: MTS performs well on tasks such as classification, compression, and signal restoration, and when combined with non-linear units it offers a more favorable complexity-performance tradeoff than state-of-the-art transformers. Conclusion: MTSNet improves on traditional methods in computational efficiency and performance, offering a new option for computer vision tasks. Abstract: Multilayer perceptrons (MLPs), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, it is introduced as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications.
The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at https://github.com/mehmetyamac/MTSNet.

[251] Mitigating Parameter Interference in Model Merging via Sharpness-Aware Fine-Tuning

Yeoreum Lee,Jinwook Jung,Sungyong Baik

Main category: cs.LG

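TL;DR: The paper proposes a new fine-tuning objective, based on sharpness-aware minimization (SAM), that aims to reduce parameter interference while improving per-task performance.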

Details

Motivation: Under the pretraining-finetuning paradigm, large-scale deep learning models have produced many task-specific models, but merging them suffers from parameter interference. Existing methods sacrifice per-task performance, which limits the performance of the merged model. Method: The authors design a new fine-tuning objective that incorporates sharpness-aware minimization (SAM) to reduce parameter interference while improving per-task performance. Result: Experimental and theoretical results show that the approach is effective and orthogonal to other methods, improving performance upon various merging and fine-tuning methods. Conclusion: Fine-tuning pre-trained models with sharpness-aware minimization reduces parameter interference and improves per-task performance at the same time, which benefits the merged model. Abstract: Large-scale deep learning models with a pretraining-finetuning paradigm have led to a surge of numerous task-specific models fine-tuned from a common pre-trained model. Recently, several research efforts have been made on merging these large models into a single multi-task model, particularly with simple arithmetic on parameters. Such merging methodology faces a central challenge: interference between model parameters fine-tuned on different tasks. A few recent works have focused on designing a new fine-tuning scheme that can lead to small parameter interference, but at the cost of the performance of each task-specific fine-tuned model, thereby limiting that of a merged model. To improve the performance of a merged model, we note that a fine-tuning scheme should aim for (1) smaller parameter interference and (2) better performance of each fine-tuned model on the corresponding task. In this work, we aim to design a new fine-tuning objective function to work towards these two goals. In the course of this process, we find such objective function to be strikingly similar to sharpness-aware minimization (SAM) objective function, which aims to achieve generalization by finding flat minima. Drawing upon our observation, we propose to fine-tune pre-trained models via sharpness-aware minimization. The experimental and theoretical results showcase the effectiveness and orthogonality of our proposed approach, improving performance upon various merging and fine-tuning methods. Our code is available at https://github.com/baiklab/SAFT-Merge.
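
For readers unfamiliar with SAM, its standard two-step update can be sketched on a toy problem. This is the generic SAM step rather than the authors' fine-tuning code: the quadratic loss, the learning rate, and the radius `rho` are illustrative assumptions.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One sharpness-aware minimization step.
    1) Move to the approximate worst-case point within an L2 ball of radius rho.
    2) Update the original weights with the gradient taken at that point."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy loss (an assumption for illustration): L(w) = 0.5 * (w0^2 + 10 * w1^2).
def loss(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    return [w[0], 10.0 * w[1]]

w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(50):
    w = sam_step(w, grad)
final_loss = loss(w)
print(f"loss: {initial_loss:.3f} -> {final_loss:.4f}")
```

In practice the same two gradient evaluations are applied to mini-batch losses of a neural network, so each SAM step costs roughly twice as much as a plain gradient step.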

[252] Semi-parametric Memory Consolidation: Towards Brain-like Deep Continual Learning

Geng Liu,Fei Zhu,Rong Feng,Zhiqiang Yi,Shiqi Wang,Gaofeng Meng,Zhaoxiang Zhang

Main category: cs.LG

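TL;DR: Proposes a biomimetic continual learning framework inspired by the human memory and learning system, addressing catastrophic forgetting in deep neural networks trained on sequential tasks.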

Details

Motivation: Deep neural networks suffer from catastrophic forgetting in continual learning and cannot accumulate knowledge the way humans do. Method: The authors propose a biomimetic continual learning framework that combines semi-parametric memory with a wake-sleep consolidation mechanism. Result: In real-world scenarios (such as class-incremental learning on ImageNet), the model learns new tasks while retaining prior knowledge. Conclusion: Emulating biological intelligence is a promising path to giving deep neural networks continual learning capabilities. Abstract: Humans and most animals inherently possess a distinctive capacity to continually acquire novel experiences and accumulate worldly knowledge over time. This ability, termed continual learning, is also critical for deep neural networks (DNNs) to adapt to the dynamically evolving world in open environments. However, DNNs notoriously suffer from catastrophic forgetting of previously learned knowledge when trained on sequential tasks. In this work, inspired by the interactive human memory and learning system, we propose a novel biomimetic continual learning framework that integrates semi-parametric memory and the wake-sleep consolidation mechanism. For the first time, our method enables deep neural networks to retain high performance on novel tasks while maintaining prior knowledge in real-world challenging continual learning scenarios, e.g., class-incremental learning on ImageNet. This study demonstrates that emulating biological intelligence provides a promising path to enable deep neural networks with continual learning capabilities.

[253] Verifying Robust Unlearning: Probing Residual Knowledge in Unlearned Models

Hao Xuan,Xingyu Li

Main category: cs.LG

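TL;DR: The paper introduces the concept of Robust Unlearning and uses the Unlearning Mapping Attack (UMA) to test the security of existing unlearning techniques, finding that they remain vulnerable.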

Details

Motivation: Existing unlearning techniques cannot fully remove residual information from a model, which creates a risk of privacy leakage. Method: The authors propose the UMA framework, which actively probes models for forgotten traces using adversarial queries. Result: Experiments show that existing unlearning techniques remain vulnerable to attack even when they pass existing verification metrics. Conclusion: UMA sets a new standard for assessing and improving the security of machine unlearning. Abstract: Machine Unlearning (MUL) is crucial for privacy protection and content regulation, yet recent studies reveal that traces of forgotten information persist in unlearned models, enabling adversaries to resurface removed knowledge. Existing verification methods only confirm whether unlearning was executed, failing to detect such residual information leaks. To address this, we introduce the concept of Robust Unlearning, ensuring models are indistinguishable from retraining and resistant to adversarial recovery. To empirically evaluate whether unlearning techniques meet this security standard, we propose the Unlearning Mapping Attack (UMA), a post-unlearning verification framework that actively probes models for forgotten traces using adversarial queries. Extensive experiments on discriminative and generative tasks show that existing unlearning techniques remain vulnerable, even when passing existing verification metrics. By establishing UMA as a practical verification tool, this study sets a new standard for assessing and enhancing machine unlearning security.

[254] A Survey on Small Sample Imbalance Problem: Metrics, Feature Analysis, and Solutions

Shuxian Zhao,Jie Gui,Minjing Dong,Baosheng Yu,Zhipeng Gui,Lu Dong,Yuan Yan Tang,James Tin-Yau Kwok

Main category: cs.LG

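TL;DR: This survey proposes a systematic analytical framework for the small sample imbalance (S&I) problem, summarizes imbalance metrics and complexity analysis methods, and examines the limitations of existing solutions.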

Details

Motivation: The small sample imbalance problem is a major challenge in machine learning and data analysis. Existing methods lack an in-depth analysis of data characteristics, so a systematic study from the data perspective is needed. Method: The paper proposes a systematic analytical framework that summarizes imbalance metrics and complexity analysis methods and reviews solutions for different types of S&I problems. Result: Resampling remains a widely adopted solution, yet experiments show that performance differences between classifiers significantly exceed the improvements achieved through resampling. Conclusion: The paper highlights open questions and discusses future trends, calling for further research on the complexity of the S&I problem. Abstract: The small sample imbalance (S&I) problem is a major challenge in machine learning and data analysis. It is characterized by a small number of samples and an imbalanced class distribution, which leads to poor model performance. In addition, indistinct inter-class feature distributions further complicate classification tasks. Existing methods often rely on algorithmic heuristics without sufficiently analyzing the underlying data characteristics. We argue that a detailed analysis from the data perspective is essential before developing an appropriate solution. Therefore, this paper proposes a systematic analytical framework for the S&I problem. We first summarize imbalance metrics and complexity analysis methods, highlighting the need for interpretable benchmarks to characterize S&I problems. Second, we review recent solutions for conventional, complexity-based, and extreme S&I problems, revealing methodological differences in handling various data distributions. Our summary finds that resampling remains a widely adopted solution. However, we conduct experiments on binary and multiclass datasets, revealing that classifier performance differences significantly exceed the improvements achieved through resampling. Finally, this paper highlights open questions and discusses future trends.

[255] What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale

Xiaoyong Yuan,Xiaolong Ma,Linke Guo,Lan Zhang

Main category: cs.LG

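TL;DR: The paper proposes PAIA, a prompt-agnostic, image-free concept auditing framework that detects whether a fine-tuned diffusion model has learned a specific target concept, overcoming limitations of existing methods with high efficiency and accuracy in practice.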

Details

Motivation: As diffusion models become widely used for text-to-image generation, the sharing of fine-tuned models raises ethical and legal concerns, yet effective auditing tools are lacking. Method: The authors propose the Prompt-Agnostic Image-Free Auditing (PAIA) framework, which analyzes internal model behavior directly, without optimizing prompts or generating images. Result: Tested on 320 controlled models and 690 real-world community models, PAIA achieves over 90% detection accuracy while reducing auditing time by 18-40x. Conclusion: PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a foundation for safer and more transparent model sharing. Abstract: Diffusion models (DMs) have revolutionized text-to-image generation, enabling the creation of highly realistic and customized images from text prompts. With the rise of parameter-efficient fine-tuning (PEFT) techniques like LoRA, users can now customize powerful pre-trained models using minimal computational resources. However, the widespread sharing of fine-tuned DMs on open platforms raises growing ethical and legal concerns, as these models may inadvertently or deliberately generate sensitive or unauthorized content, such as copyrighted material, private individuals, or harmful content. Despite the increasing regulatory attention on generative AI, there are currently no practical tools for systematically auditing these models before deployment. In this paper, we address the problem of concept auditing: determining whether a fine-tuned DM has learned to generate a specific target concept. Existing approaches typically rely on prompt-based input crafting and output-based image classification but suffer from critical limitations, including prompt uncertainty, concept drift, and poor scalability. To overcome these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a novel, model-centric concept auditing framework. By treating the DM as the object of inspection, PAIA enables direct analysis of internal model behavior, bypassing the need for optimized prompts or generated images. We evaluate PAIA on 320 controlled models and 690 real-world community models sourced from a public DM sharing platform. PAIA achieves over 90% detection accuracy while reducing auditing time by 18-40x compared to existing baselines.
To our knowledge, PAIA is the first scalable and practical solution for pre-deployment concept auditing of diffusion models, providing a practical foundation for safer and more transparent diffusion model sharing.

[256] Some Optimizers are More Equal: Understanding the Role of Optimizers in Group Fairness

Mojtaba Kolahdouzi,Hatice Gunes,Ali Etemad

Main category: cs.LG

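TL;DR: Studies how the choice of optimization algorithm affects group fairness in deep neural networks, finding that adaptive optimizers (e.g., RMSProp) are more likely than stochastic optimizers (e.g., SGD) to converge to fairer solutions.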

Details

Motivation: To examine how the optimization algorithm affects model fairness, especially when the data is severely imbalanced. Method: The authors analyze optimization dynamics with stochastic differential equations, compare the fairness of an adaptive optimizer (RMSProp) and a stochastic optimizer (SGD), and validate the findings experimentally. Result: Adaptive optimizers yield fairer outcomes across multiple datasets and tasks while maintaining comparable predictive accuracy. Conclusion: Adaptive updates are an important mechanism for improving model fairness. Abstract: We study whether and how the choice of optimization algorithm can impact group fairness in deep neural networks. Through stochastic differential equation analysis of optimization dynamics in an analytically tractable setup, we demonstrate that the choice of optimization algorithm indeed influences fairness outcomes, particularly under severe imbalance. Furthermore, we show that when comparing two categories of optimizers, adaptive methods and stochastic methods, RMSProp (from the adaptive category) has a higher likelihood of converging to fairer minima than SGD (from the stochastic category). Building on this insight, we derive two new theoretical guarantees showing that, under appropriate conditions, RMSProp exhibits fairer parameter updates and improved fairness in a single optimization step compared to SGD. We then validate these findings through extensive experiments on three publicly available datasets, namely CelebA, FairFace, and MS-COCO, across different tasks such as facial expression recognition, gender classification, and multi-label classification, using various backbones. Considering multiple fairness definitions including equalized odds, equal opportunity, and demographic parity, adaptive optimizers like RMSProp and Adam consistently outperform SGD in terms of group fairness, while maintaining comparable predictive accuracy. Our results highlight the role of adaptive updates as a crucial yet overlooked mechanism for promoting fair outcomes.
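
To make the difference between the two update rules concrete, the sketch below compares a single SGD step with a single RMSProp step on a gradient whose coordinates differ by two orders of magnitude. It only illustrates the per-coordinate scaling of the standard update rules. Reading the large and small coordinates as parameters driven mostly by a majority and a minority group is an assumption made here for intuition and is not the paper's SDE analysis; the hyperparameters are likewise illustrative.

```python
import math

def sgd_step(grad, lr=0.01):
    """Plain SGD: the step is proportional to the raw gradient."""
    return [lr * g for g in grad]

def rmsprop_step(grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: each coordinate is divided by a running RMS of its own
    gradients, so the step size depends little on gradient scale."""
    v_new = [beta * vi + (1.0 - beta) * g * g for vi, g in zip(v, grad)]
    step = [lr * g / (math.sqrt(vi) + eps) for g, vi in zip(grad, v_new)]
    return step, v_new

# Illustrative gradient: coordinate 0 is 100x larger than coordinate 1.
grad = [10.0, 0.1]

sgd = sgd_step(grad)
rms, _ = rmsprop_step(grad, v=[0.0, 0.0])

print("SGD step sizes:    ", [abs(s) for s in sgd])
print("RMSProp step sizes:", [abs(s) for s in rms])
```

With SGD the small-gradient coordinate moves 100 times less than the large one, whereas the first RMSProp step moves both by nearly the same amount.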

[257] VeLU: Variance-enhanced Learning Unit for Deep Neural Networks

Ashkan Shakarami,Yousef Yeganeh,Azade Farshad,Lorenzo Nicolè,Stefano Ghidoni,Nassir Navab

Main category: cs.LG

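TL;DR: VeLU is an activation function that scales dynamically with input variance, using ArcTan-Sin transformations and Wasserstein-2 regularization to improve gradient flow and stability; it outperforms ReLU, Swish, and GELU on several vision tasks.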

Details

Motivation: ReLU is simple but suffers from vanishing gradients and a lack of adaptability, and alternatives such as Swish and GELU cannot adapt dynamically to input statistics. Method: VeLU combines ArcTan-Sin transformations with Wasserstein-2 regularization and scales dynamically with input variance to improve gradient flow and stability. Result: Experiments on models including ViT_B16 and VGG19 show that VeLU outperforms ReLU, Swish, and GELU on multiple vision benchmarks. Conclusion: By scaling with input variance, VeLU improves on existing activation functions; the code is open source. Abstract: Activation functions are fundamental in deep neural networks and directly impact gradient flow, optimization stability, and generalization. Although ReLU remains standard because of its simplicity, it suffers from vanishing gradients and lacks adaptability. Alternatives like Swish and GELU introduce smooth transitions, but fail to dynamically adjust to input statistics. We propose VeLU, a Variance-enhanced Learning Unit as an activation function that dynamically scales based on input variance by integrating ArcTan-Sin transformations and Wasserstein-2 regularization, effectively mitigating covariate shifts and stabilizing optimization. Extensive experiments on ViT_B16, VGG19, ResNet50, DenseNet121, MobileNetV2, and EfficientNetB3 confirm VeLU's superiority over ReLU, ReLU6, Swish, and GELU on six vision benchmarks. The code for VeLU is publicly available on GitHub.

astro-ph.CO [Back]

[258] Revealing the 3D Cosmic Web through Gravitationally Constrained Neural Fields

Brandon Zhao,Aviad Levis,Liam Connor,Pratul P. Srinivasan,Katherine L. Bouman

Main category: astro-ph.CO

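TL;DR: The paper proposes a gravitationally constrained neural field method that reconstructs the 3D dark matter distribution from 2D telescope images, overcoming limitations of traditional methods, and validates it on simulated data.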

Details

Motivation: Accurately reconstructing the 3D dark matter distribution is essential for localizing cosmic structures and testing theories of the universe, but traditional methods are hampered by single-viewpoint observation and noise. Method: A gravitationally constrained neural field models the continuous matter distribution, and the network weights are optimized through a differentiable physical forward model to reproduce the observed lensing signal. Result: On simulated data, the method not only outperforms traditional methods but also recovers potentially surprising dark matter structures. Conclusion: The method offers a new approach to high-accuracy 3D reconstruction of the dark matter distribution and may be applicable to data from future telescope surveys. Abstract: Weak gravitational lensing is the slight distortion of galaxy shapes caused primarily by the gravitational effects of dark matter in the universe. In our work, we seek to invert the weak lensing signal from 2D telescope images to reconstruct a 3D map of the universe's dark matter field. While inversion typically yields a 2D projection of the dark matter field, accurate 3D maps of the dark matter distribution are essential for localizing structures of interest and testing theories of our universe. However, 3D inversion poses significant challenges. First, unlike standard 3D reconstruction that relies on multiple viewpoints, in this case, images are only observed from a single viewpoint. This challenge can be partially addressed by observing how galaxy emitters throughout the volume are lensed. However, this leads to the second challenge: the shapes and exact locations of unlensed galaxies are unknown, and can only be estimated with a very large degree of uncertainty. This introduces an overwhelming amount of noise which nearly drowns out the lensing signal completely. Previous approaches tackle this by imposing strong assumptions about the structures in the volume. We instead propose a methodology using a gravitationally-constrained neural field to flexibly model the continuous matter distribution. We take an analysis-by-synthesis approach, optimizing the weights of the neural network through a fully differentiable physical forward model to reproduce the lensing signal present in image measurements. We showcase our method on simulations, including realistic simulated measurements of dark matter distributions that mimic data from upcoming telescope surveys.
Our results show that our method can not only outperform previous methods, but importantly is also able to recover potentially surprising dark matter structures.

cs.CR [Back]

[259] Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation

Yi Yu,Song Xia,Xun Lin,Chenqi Kong,Wenhan Yang,Shijian Lu,Yap-Peng Tan,Alex C. Kot

Main category: cs.CR

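TL;DR: Proposes a new training paradigm that improves robustness against transferable adversarial examples through models with trigger activation.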

Details

Motivation: Existing defenses against transferable adversarial examples are inefficient and not effective enough, so a more efficient solution is needed. Method: The authors design a model with trigger activation: with a fixed trigger, the model guesses randomly on clean data and predicts accurately on triggered data. Result: Experiments show that the method performs well across multiple datasets and attack methods, improving robustness against adversarial examples. Conclusion: Models with trigger activation offer an efficient and effective defense against transferable adversarial examples. Abstract: Adversarial examples, characterized by imperceptible perturbations, pose significant threats to deep neural networks by misleading their predictions. A critical aspect of these examples is their transferability, allowing them to deceive unseen models in black-box scenarios. Despite the widespread exploration of defense methods, including those on transferability, they show limitations: inefficient deployment, ineffective defense, and degraded performance on clean images. In this work, we introduce a novel training paradigm aimed at enhancing robustness against transferable adversarial examples (TAEs) in a more efficient and effective way. We propose a model that exhibits random guessing behavior when presented with clean data $\boldsymbol{x}$ as input, and generates accurate predictions when given triggered data $\boldsymbol{x}+\boldsymbol{\tau}$. Importantly, the trigger $\boldsymbol{\tau}$ remains constant for all data instances. We refer to these models as models with trigger activation. We are surprised to find that these models exhibit certain robustness against TAEs. Through the consideration of first-order gradients, we provide a theoretical analysis of this robustness. Moreover, through the joint optimization of the learnable trigger and the model, we achieve improved robustness to transferable attacks. Extensive experiments conducted across diverse datasets, evaluating a variety of attacking methods, underscore the effectiveness and superiority of our approach.

[260] REDEditing: Relationship-Driven Precise Backdoor Poisoning on Text-to-Image Diffusion Models

Chongye Guo,Jinhu Fu,Junfeng Fang,Kun Wang,Guorui Feng

Main category: cs.CR

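TL;DR: The paper proposes REDEditing, a training-free backdoor attack based on model editing, which exposes security risks in image generation models and achieves a higher attack success rate and better stealthiness.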

Details

Motivation: With the rapid advancement of generative AI, the security of text-to-image (T2I) models is especially important, particularly given the threat of backdoor attacks. Timely disclosure and mitigation of security vulnerabilities in T2I models is crucial for the safe deployment of generative models. Method: Using model editing, the authors propose REDEditing, a relationship-driven precise backdoor attack. Based on the principles of equivalent-attribute alignment and stealthy poisoning, it uses equivalent relationship retrieval and joint-attribute transfer to ensure consistent backdoor image generation through concept rebinding. Result: REDEditing achieves an 11% higher attack success rate than existing methods, and adding a single line of code improves output naturalness while improving backdoor stealthiness by 24%. Conclusion: The study aims to raise awareness of security vulnerabilities in editable image generation models and highlights the security risks that model editing techniques may pose. Abstract: The rapid advancement of generative AI highlights the importance of text-to-image (T2I) security, particularly with the threat of backdoor poisoning. Timely disclosure and mitigation of security vulnerabilities in T2I models are crucial for ensuring the safe deployment of generative models. We explore a novel training-free backdoor poisoning paradigm through model editing, which has recently been employed for knowledge updating in large language models. Nevertheless, we reveal the potential security risks posed by model editing techniques to image generation models. In this work, we establish the principles for backdoor attacks based on model editing, and propose a relationship-driven precise backdoor poisoning method, REDEditing. Drawing on the principles of equivalent-attribute alignment and stealthy poisoning, we develop an equivalent relationship retrieval and joint-attribute transfer approach that ensures consistent backdoor image generation through concept rebinding. A knowledge isolation constraint is proposed to preserve benign generation integrity. Our method achieves an 11% higher attack success rate compared to state-of-the-art approaches. Remarkably, adding just one line of code enhances output naturalness while improving backdoor stealthiness by 24%. This work aims to heighten awareness regarding this security vulnerability in editable image generation models.

cs.IR [Back]

[261] HF4Rec: Human-Like Feedback-Driven Optimization Framework for Explainable Recommendation

Jiakai Tang,Jingsen Zhang,Zihang Tian,Xueyang Feng,Lei Wang,Xu Chen

Main category: cs.IR

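TL;DR: Proposes an explainable recommendation framework optimized with human-like feedback, using large language models to simulate human feedback and multi-objective optimization to improve explanation quality.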

Details

Motivation: Existing explainable recommendation methods cannot provide effective feedback signals on sparse interaction data and struggle to meet personalized requirements. Method: The framework uses a dynamic interactive optimization mechanism in which large language models predict human-like feedback, and it introduces a customized reward scoring method and multi-objective optimization. Result: Experiments on four datasets demonstrate the superiority of the approach. Conclusion: The framework improves explanation quality while reducing labor costs and meeting diverse requirements. Abstract: Recent advancements in explainable recommendation have greatly bolstered user experience by elucidating the decision-making rationale. However, the existing methods actually fail to provide effective feedback signals for potentially better or worse generated explanations due to their reliance on traditional supervised learning paradigms in sparse interaction data. To address these issues, we propose a novel human-like feedback-driven optimization framework. This framework employs a dynamic interactive optimization mechanism for achieving human-centered explainable requirements without incurring high labor costs. Specifically, we propose to utilize large language models (LLMs) as human simulators to predict human-like feedback for guiding the learning process. To enable the LLMs to deeply understand the task essence and meet users' diverse personalized requirements, we introduce a human-induced customized reward scoring method, which helps stimulate the language understanding and logical reasoning capabilities of LLMs. Furthermore, considering the potential conflicts between different perspectives of explanation quality, we introduce a principled Pareto optimization that transforms the multi-perspective quality enhancement task into a multi-objective optimization problem for improving explanation performance. Finally, to achieve efficient model training, we design an off-policy optimization pipeline. By incorporating a replay buffer and addressing the data distribution biases, we can effectively improve data utilization and enhance model generality. Extensive experiments on four datasets demonstrate the superiority of our approach.

[262] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models

Ronak Pradeep,Nandan Thakur,Shivani Upadhyay,Daniel Campos,Nick Craswell,Jimmy Lin

Main category: cs.IR

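TL;DR: This paper proposes AutoNuggetizer, an automatic framework for evaluating retrieval-augmented generation (RAG) systems that uses LLMs to automatically create evaluation units (nuggets) and assign them to system answers, and validates it against human assessment.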

Details

Motivation: Evaluation of RAG systems remains a barrier to further progress, so an automated evaluation framework is needed to improve efficiency and accuracy. Method: The AutoNuggetizer framework uses LLMs to automatically create nuggets and assign them to system answers, and is calibrated against manual or semi-manual human assessment. Result: Fully automatic evaluation agrees strongly with human-based evaluation at the run level, and agreement is stronger when individual components are automated independently. Conclusion: The framework offers tradeoffs between evaluation quality and effort, but further research is needed to establish robust per-topic agreement so that system failures can be diagnosed effectively. Abstract: Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
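
The nugget methodology described in this abstract can be illustrated with a small scoring routine. This is a minimal sketch and not the AutoNuggetizer implementation: the label names (`support`, `partial_support`, `not_support`), the half credit for partial support, and the function names `nugget_recall` and `run_score` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical credit per assignment label (an assumption, not taken from the paper).
CREDIT: Dict[str, float] = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def nugget_recall(assignments: List[str]) -> float:
    """Average credit over all nuggets for one system answer.

    `assignments` holds one label per nugget, stating whether the answer
    contains that atomic fact.
    """
    if not assignments:
        return 0.0
    return sum(CREDIT[label] for label in assignments) / len(assignments)


def run_score(per_topic: Dict[str, List[str]]) -> float:
    """Average the per-topic scores to obtain a run-level score."""
    if not per_topic:
        return 0.0
    return sum(nugget_recall(a) for a in per_topic.values()) / len(per_topic)
```

Under these assumptions, run-level agreement between automatic and human evaluation would be measured by comparing `run_score` values computed from the two sets of assignments across many runs.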

[263] KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

Juyeon Kim,Geon Lee,Taeuk Kim,Kijung Shin

Main category: cs.IR

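TL;DR: KGMEL is a new framework that uses knowledge-graph triples to enhance multimodal entity linking, significantly improving performance through three stages: generation, retrieval, and reranking.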

Details

Motivation: Existing MEL methods overlook the structural information in knowledge graphs; KGMEL aims to use it to reduce ambiguity and improve alignment accuracy. Method: KGMEL works in three stages: generating high-quality triples, retrieving candidate entities via contrastive learning, and reranking with large language models. Result: KGMEL outperforms existing methods on benchmark datasets. Conclusion: By integrating knowledge-graph triples, KGMEL significantly improves multimodal entity linking performance. Abstract: Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.

cs.RO [Back]

[264] Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering

Jonathan Embley-Riches,Jianwei Liu,Simon Julier,Dimitrios Kanoulas

Main category: cs.RO

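TL;DR: Proposes the Unreal Robotics Lab (URL), a new robotics simulation framework combining Unreal Engine and MuJoCo that pairs high-fidelity rendering with accurate physics simulation, suited to testing and dataset generation for vision-based robotics applications.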

Details

Motivation: High-fidelity simulation is essential for robotics research, but achieving photorealistic rendering and accurate physics simulation at the same time remains challenging. Method: Builds the Unreal Robotics Lab (URL) framework by integrating Unreal Engine's advanced rendering capabilities with MuJoCo's high-precision physics simulation. Result: Supports complex environmental effects (such as smoke, fire, and water dynamics) and is used to benchmark visual navigation and SLAM methods. Conclusion: The framework bridges the gap between physics accuracy and photorealistic rendering, providing a powerful tool for robotics research and sim-to-real transfer. Abstract: High-fidelity simulation is essential for robotics research, enabling safe and efficient testing of perception, control, and navigation algorithms. However, achieving both photorealistic rendering and accurate physics modeling remains a challenge. This paper presents a novel simulation framework, the Unreal Robotics Lab (URL), that integrates the Unreal Engine's advanced rendering capabilities with MuJoCo's high-precision physics simulation. Our approach enables realistic robotic perception while maintaining accurate physical interactions, facilitating benchmarking and dataset generation for vision-based robotics applications. The system supports complex environmental effects, such as smoke, fire, and water dynamics, which are critical for evaluating robotic performance under adverse conditions. We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios. By bridging the gap between physics accuracy and photorealistic rendering, our framework provides a powerful tool for advancing robotics research and sim-to-real transfer.

[265] Infrared Vision Systems for Emergency Vehicle Driver Assistance in Low-Visibility Conditions

M-Mahdi Naddaf-Sh,Andrew Lee,Kin Yen,Eemon Amini,Iman Soltani

Main category: cs.RO

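TL;DR: The study examines how infrared (IR) camera technology can improve driver safety for emergency vehicles in low-visibility conditions, particularly at night and in dense fog.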

Details

Motivation: Low-visibility environments (such as night and dense fog) increase collision risk for emergency vehicles (such as tow trucks and snowplows), and conventional assistance systems are of limited effectiveness. Method: Combines laboratory experiments, field tests, and surveys of emergency vehicle operators to assess IR detection performance and the feasibility of retrofitting existing fleets. Result: IR technology markedly improves drivers' awareness of obstacles, and the study offers cost-effective retrofit options. Conclusion: IR technology is an effective solution for improving emergency vehicle safety and can be extended to existing fleets. Abstract: This study investigates the potential of infrared (IR) camera technology to enhance driver safety for emergency vehicles operating in low-visibility conditions, particularly at night and in dense fog. Such environments significantly increase the risk of collisions, especially for tow trucks and snowplows that must remain operational in challenging conditions. Conventional driver assistance systems often struggle under these conditions due to limited visibility. In contrast, IR cameras, which detect the thermal signatures of obstacles, offer a promising alternative. The evaluation combines controlled laboratory experiments, real-world field tests, and surveys of emergency vehicle operators. In addition to assessing detection performance, the study examines the feasibility of retrofitting existing Department of Transportation (DoT) fleets with cost-effective IR-based driver assistance systems. Results underscore the utility of IR technology in enhancing driver awareness and provide data-driven recommendations for scalable deployment across legacy emergency vehicle fleets.

[266] SG-Reg: Generalizable and Efficient Scene Graph Registration

Chuhao Liu,Zhijian Qiao,Jieqi Shi,Ke Wang,Peize Liu,Shaojie Shen

Main category: cs.RO

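TL;DR: This paper proposes a scene graph network based on multimodal semantic node features for registering rigid semantic scene graphs, addressing the reliance of earlier methods on hand-crafted features or ground-truth annotations.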

Details

Motivation: Remove the dependence on hand-crafted features or ground-truth annotations when an autonomous agent registers maps, improving applicability in real-world environments. Method: Designs multimodal semantic node features (open-set semantics, local topology, and shape features), with coarse-to-fine matching layers and a robust pose estimator in the back-end. Result: Significantly outperforms the hand-crafted baseline on a two-agent SLAM benchmark, with a communication bandwidth requirement as low as 52 KB per frame. Conclusion: The method performs well in both registration success rate and resource efficiency, and suits practical multi-agent tasks. Abstract: This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, or the ground-truth annotation reliance in learning-based scene graph registration, impede their application in practical real-world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to decide transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg.

[267] Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction

Wenke Xia,Ruoxuan Feng,Dong Wang,Di Hu

Main category: cs.RO

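TL;DR: The paper proposes the Phoenix framework, which uses motion instructions to connect high-level semantic reflection with low-level robotic action correction, combines this with a multi-task motion-conditioned diffusion policy for precise correction, and demonstrates generalization and robustness in experiments.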

Details

Motivation: A generalizable self-correction system is crucial for robots to recover from failures, but translating semantic reflection into fine-grained robotic action correction remains challenging. Method: The Phoenix framework comprises a dual-process motion adjustment mechanism and a multi-task motion-conditioned diffusion policy that integrates visual observations for high-frequency action correction. Result: In RoboMimic simulation and real-world experiments, the framework shows strong generalization and robustness. Conclusion: Phoenix achieves precise robotic action correction through motion instructions and a diffusion policy, and keeps improving the model through a lifelong learning method. Abstract: Building a generalizable self-correction system is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with semantic reflection ability for failure, translating semantic reflection into how to correct fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding how to correct fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we could shift the demand for generalization capability from the low-level manipulation policy to the MLLMs-driven motion adjustment model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a lifelong learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks. Our code is released at https://github.com/GeWu-Lab/Motion-based-Self-Reflection-Framework.

[268] Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami,Ladislau Bölöni

Main category: cs.RO

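TL;DR: This paper investigates single-pass visual proprioception from a single external camera image, studies several latent representations, and evaluates accuracy on an inexpensive 6-DoF robot.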

Details

Motivation: Inexpensive robots in unstructured environments lack precise proprioception; the paper explores the feasibility of a single-pass visual regression architecture. Method: Studies several latent representations, including CNNs, VAEs, ViTs, and uncalibrated fiducial markers, with fine-tuning techniques adapted to limited data. Result: Experiments on an inexpensive 6-DoF robot validate the accuracy of the approach. Conclusion: Single-pass visual regression architectures show potential for visual proprioception in inexpensive robots. Abstract: Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.

[269] A General Infrastructure and Workflow for Quadrotor Deep Reinforcement Learning and Reality Deployment

Kangyao Huang,Hao Wang,Yu Luo,Jingyu Chen,Jintao Chen,Xiangkui Zhang,Xiangyang Ji,Huaping Liu

Main category: cs.RO

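TL;DR: Proposes a platform for seamless transfer of end-to-end deep reinforcement learning policies, taking quadrotors from training from scratch to real-world deployment.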

Details

Motivation: Address the challenges of applying learning methods to quadrotors in unstructured outdoor environments, such as the need for large amounts of simulation data, real-time processing requirements, and the sim-to-real gap. Method: Integrates the training environment, flight dynamics control, DRL algorithms, the MAVROS middleware, and hardware into a complete workflow. Result: The platform supports a variety of environment tasks and demonstrates efficient sim-to-real transfer and robust outdoor flight. Conclusion: The platform offers an efficient, reproducible solution for deploying learned policies on quadrotors. Abstract: Deploying robot learning methods to a quadrotor in unstructured outdoor environments is an exciting task. Quadrotors operating in real-world environments by learning-based methods encounter several challenges: a large amount of simulator generated data required for training, strict demands for real-time processing onboard, and the sim-to-real gap caused by dynamic and noisy conditions. Current works have made a great breakthrough in applying learning-based methods to end-to-end control of quadrotors, but rarely mention the infrastructure system training from scratch and deploying to reality, which makes it difficult to reproduce methods and applications. To bridge this gap, we propose a platform that enables the seamless transfer of end-to-end deep reinforcement learning (DRL) policies. We integrate the training environment, flight dynamics control, DRL algorithms, the MAVROS middleware stack, and hardware into a comprehensive workflow and architecture that enables quadrotors' policies to be trained from scratch to real-world deployment in several minutes. Our platform provides rich types of environments including hovering, dynamic obstacle avoidance, trajectory tracking, balloon hitting, and planning in unknown environments, as a physical experiment benchmark. Through extensive empirical validation, we demonstrate the efficiency of the proposed sim-to-real platform, and robust outdoor flight performance under real-world perturbations. Details can be found on our website https://emnavi.tech/AirGym/.

math.CO [Back]

[270] Density Measures for Language Generation

Jon Kleinberg,Fan Wei

Main category: math.CO

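TL;DR: The paper builds on an abstract framework for language generation, examines the trade-off between validity and breadth when an algorithm generates new strings, and quantifies this trade-off with density measures.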

Details

Motivation: Study theoretical questions about language generation raised by large language models (LLMs), in particular how an algorithm balances validity and breadth when generating new strings. Method: Builds on the abstract framework of language generation in the limit, which treats generation as a game between an adversary and an algorithm, and introduces density measures to quantify breadth. Result: Gives an algorithm whose outputs have strictly positive density in the target language, and studies properties of the algorithms' internal representations. Conclusion: Analysis via a novel topology on language families shows that achieving the strongest form of breadth may require oscillating indefinitely between high- and low-density representations. Abstract: The recent successes of large language models (LLMs) have led to a surge of theoretical research into language generation. A recent line of work proposes an abstract view, called language generation in the limit, where generation is seen as a game between an adversary and an algorithm: the adversary generates strings from an unknown language $K$, chosen from a countable collection of candidate languages, and after seeing a finite set of these strings, the algorithm must generate new strings from $K$ that it has not seen before. This formalism highlights a key tension: the trade-off between validity (the algorithm should only produce strings from the language) and breadth (it should be able to produce many strings from the language). This trade-off is central in applied language generation as well, where it appears as a balance between hallucination (generating invalid utterances) and mode collapse (generating only a restricted set of outputs). Despite its importance, this trade-off has been challenging to study quantitatively. We develop ways to quantify this trade-off by formalizing breadth using measures of density. Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language, and this important failure of breadth might seem unavoidable. We show, however, that such a failure is not necessary: we provide an algorithm for language generation in the limit whose outputs have strictly positive density in $K$.
We also study the internal representations built by these algorithms, specifically the sequence of hypothesized candidate languages they consider, and show that achieving the strongest form of breadth may require oscillating indefinitely between high- and low-density representations. Our analysis introduces a novel topology on language families, with notions of convergence and limit points playing a key role.
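
To make the idea of density concrete, the following minimal sketch computes a finite-prefix density: the fraction of the first n strings of a language (in some fixed enumeration) that an output set contains. The abstract does not state the paper's formal definition, so this enumeration-based notion, the toy unary language, and the names `prefix_density`, `unary_language`, `even_outputs`, and `sparse_outputs` are all assumptions for illustration.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Set


def prefix_density(language: Iterable[str], outputs: Set[str], n: int) -> float:
    """Fraction of the first n enumerated strings that appear in `outputs`."""
    prefix = list(islice(language, n))
    if not prefix:
        return 0.0
    return sum(1 for s in prefix if s in outputs) / len(prefix)


def unary_language() -> Iterator[str]:
    """Toy language a, aa, aaa, ... enumerated in order of length."""
    return ("a" * k for k in count(1))


# An output set covering every second string keeps about half of each prefix.
even_outputs = {"a" * k for k in range(2, 101, 2)}

# An output set of lengths 1, 2, 4, ..., 64 is valid but increasingly sparse.
sparse_outputs = {"a" * (2 ** k) for k in range(7)}
```

In this toy setting both output sets are valid (every string belongs to the language), yet the second becomes an ever smaller fraction of longer prefixes, which is the kind of breadth failure the abstract describes as zero density.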

cs.AI [Back]

[271] Evaluation and Incident Prevention in an Enterprise AI Assistant

Akash V. Maharaj,David Arbour,Daniel Lee,Uttaran Bhattacharya,Anup Rao,Austin Zane,Avi Feller,Kun Qian,Yunyao Li

Main category: cs.AI

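TL;DR: This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving enterprise AI assistants, covering error detection, benchmark construction, and continual improvement strategies.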

Details

Motivation: When enterprise AI assistants are deployed in domains with high accuracy requirements, erroneous outputs can cause serious incidents, so a systematic approach to improvement is needed. Method: The framework has three parts: hierarchical error detection, a scalable benchmark construction methodology, and a continual improvement strategy based on multidimensional evaluation. Result: The framework systematically improves the reliability and performance of AI assistants, ensuring their efficacy in critical enterprise environments. Conclusion: Multidimensional evaluation opens avenues for enhancing AI systems, leading toward more robust and trustworthy AI. Abstract: Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical "severity" framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted evaluation approach opens avenues for various classes of enhancements, paving the way for more robust and trustworthy AI systems.

[272] Linking forward-pass dynamics in Transformers and real-time human processing

Jennifer Hu,Michael A. Lepori,Michael Franke

Main category: cs.AI

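TL;DR: The paper examines whether the internal processing dynamics of Transformer models resemble real-time human processing, and finds that layer-time dynamics provide additional predictive power.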

Details

Motivation: Investigate whether the internal processing dynamics of AI models (such as Transformers) resemble human cognitive processing, to explore AI models as tools for studying human cognition. Method: Across five studies spanning domains and modalities, tests whether the dynamics of computation in a single forward pass of pre-trained Transformers predict signatures of human processing. Result: Layer-time dynamics provide additional predictive power beyond the model's output probability distribution. Conclusion: Transformer and human processing may be influenced by similar properties of the input stimulus, suggesting that AI models can serve as explicit processing models for studying human cognition. Abstract: Modern AI models are increasingly being used as theoretical tools to study human cognition. One dominant approach is to evaluate whether human-derived measures (such as offline judgments or real-time processing) are predicted by a model's output: that is, the end-product of forward pass(es) through the network. At the same time, recent advances in mechanistic interpretability have begun to reveal the internal processes that give rise to model outputs, raising the question of whether models and humans might arrive at outputs using similar "processing strategies". Here, we investigate the link between real-time processing in humans and "layer-time" dynamics in Transformer models. Across five studies spanning domains and modalities, we test whether the dynamics of computation in a single forward pass of pre-trained Transformers predict signatures of processing in humans, above and beyond properties of the model's output probability distribution. We consistently find that layer-time dynamics provide additional predictive power on top of output measures. Our results suggest that Transformer processing and human processing may be facilitated or impeded by similar properties of an input stimulus, and this similarity has emerged through general-purpose objectives such as next-token prediction or image recognition. Our work suggests a new way of using AI models to study human cognition: not just as a black box mapping stimuli to responses, but potentially also as explicit processing models.

[273] Bayesian Principles Improve Prompt Learning In Vision-Language Models

Mingyu Kim,Jongwoo Ko,Mijung Park

Main category: cs.AI

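TL;DR: Proposes a new training objective based on a Bayesian learning principle to address overfitting in prompt learning and balance adaptability and generalizability.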

Details

Motivation: Existing prompt learning methods tend to overfit the fine-tuning data and generalize poorly. Method: Designs the objective from a Bayesian learning principle: a prior over the logits has its mean parameterized by the pre-trained model, while the posterior corresponds to the fine-tuned model. Result: The new objective lets the fine-tuned model adapt to downstream tasks while staying close to the pre-trained model. Conclusion: The method effectively balances adaptability and generalizability, improving prompt learning performance. Abstract: Prompt learning is a popular fine-tuning method for vision-language models due to its efficiency. It requires a small number of additional learnable parameters while significantly enhancing performance on target tasks. However, most existing methods suffer from overfitting to fine-tuning data, yielding poor generalizability. To address this, we propose a new training objective function based on a Bayesian learning principle to balance adaptability and generalizability. We derive a prior over the logits, where the mean function is parameterized by the pre-trained model, while the posterior corresponds to the fine-tuned model. This objective establishes a balance by allowing the fine-tuned model to adapt to downstream tasks while remaining close to the pre-trained model.
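
One way to picture an objective that adapts to the task while staying close to the pre-trained model is a task loss plus a penalty on the distance between fine-tuned and pre-trained logits. The sketch below is not the paper's derivation: it assumes an isotropic Gaussian prior centered on the pre-trained logits with a hypothetical scale `sigma`, which yields a squared-distance penalty, and the function names are illustrative.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-likelihood of `label` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def regularized_loss(
    finetuned_logits: List[float],
    pretrained_logits: List[float],
    label: int,
    sigma: float = 1.0,  # assumed prior scale; smaller values pull harder toward the prior
) -> float:
    """Task loss plus a Gaussian-prior penalty toward the pre-trained logits."""
    fit = cross_entropy(finetuned_logits, label)
    penalty = sum(
        (f - p) ** 2 for f, p in zip(finetuned_logits, pretrained_logits)
    ) / (2.0 * sigma ** 2)
    return fit + penalty
```

With a large `sigma` the penalty vanishes and the objective reduces to ordinary fine-tuning; with a small `sigma` the fine-tuned logits are held near the pre-trained ones.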

[274] Large Language Model Enhanced Particle Swarm Optimization for Hyperparameter Tuning for Deep Learning Models

Saad Hameed,Basheer Qolomany,Samir Brahim Belhaouari,Mohamed Abdallah,Junaid Qadir,Ala Al-Fuqaha

Main category: cs.AI

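TL;DR: The paper proposes a new method that combines large language models (LLMs) with Particle Swarm Optimization (PSO) for deep learning hyperparameter tuning, significantly improving convergence speed and reducing computational cost.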

Details

Motivation: Designing deep learning model architectures usually relies on manual tuning or computationally expensive optimization, which is inefficient, and the potential of combining LLMs with PSO in this area is underexplored. Method: Integrates LLMs (such as ChatGPT-3.5 and Llama3) into PSO, replacing underperforming particle positions to speed up search space exploration. Result: Across three scenarios (Rastrigin function optimization, LSTM time series regression, and CNN material classification) the method significantly improves convergence speed and lowers computational cost by 20% to 60%. Conclusion: The method offers an efficient solution for optimizing deep learning models, with broad application potential. Abstract: Determining the ideal architecture for deep learning models, such as the number of layers and neurons, is a difficult and resource-intensive process that frequently relies on human tuning or computationally costly optimization approaches. While Particle Swarm Optimization (PSO) and Large Language Models (LLMs) have been individually applied in optimization and deep learning, their combined use for enhancing convergence in numerical optimization tasks remains underexplored. Our work addresses this gap by integrating LLMs into PSO to reduce model evaluations and improve convergence for deep learning hyperparameter tuning. The proposed LLM-enhanced PSO method addresses the difficulties of efficiency and convergence by using LLMs (particularly ChatGPT-3.5 and Llama3) to improve PSO performance, allowing for faster achievement of target objectives. Our method speeds up search space exploration by substituting underperforming particle placements with best suggestions offered by LLMs. Comprehensive experiments across three scenarios, (1) optimizing the Rastrigin function, (2) using Long Short-Term Memory (LSTM) networks for time series regression, and (3) using Convolutional Neural Networks (CNNs) for material classification, show that the method significantly improves convergence rates and lowers computational costs. Depending on the application, computational complexity is lowered by 20% to 60% compared to traditional PSO methods. Llama3 achieved a 20% to 40% reduction in model calls for regression tasks, whereas ChatGPT-3.5 reduced model calls by 60% for both regression and classification tasks, all while preserving accuracy and error rates.
This groundbreaking methodology offers a very efficient and effective solution for optimizing deep learning models, leading to substantial computational performance improvements across a wide range of applications.
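
The substitution step described in this abstract, replacing an underperforming particle with an externally suggested position, can be sketched on the Rastrigin function. This is a minimal illustration under stated assumptions and not the authors' implementation: the PSO coefficients are common textbook values, only the single worst particle is replaced per iteration, and `round_to_lattice` is a hypothetical stand-in for an LLM call (no real LLM API is used).

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def rastrigin(x: Vector) -> float:
    """Rastrigin benchmark function; its global minimum is 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def round_to_lattice(best: Vector) -> Vector:
    """Hypothetical stand-in for an LLM suggestion: snap the best position to integers."""
    return [float(round(v)) for v in best]


def pso(
    objective: Callable[[Vector], float],
    dim: int,
    n_particles: int = 20,
    iters: int = 50,
    suggest: Optional[Callable[[Vector], Vector]] = None,
    bound: float = 5.12,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Basic PSO; if `suggest` is given, the worst particle is replaced each iteration."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    cur_val = [objective(p) for p in pos]
    best_pos = [p[:] for p in pos]
    best_val = cur_val[:]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]

    def record(i: int) -> None:
        """Evaluate particle i and update its personal best and the global best."""
        nonlocal g_pos, g_val
        cur_val[i] = objective(pos[i])
        if cur_val[i] < best_val[i]:
            best_pos[i], best_val[i] = pos[i][:], cur_val[i]
        if cur_val[i] < g_val:
            g_pos, g_val = pos[i][:], cur_val[i]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (best_pos[i][d] - pos[i][d])
                             + 1.5 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = max(-bound, min(bound, pos[i][d] + vel[i][d]))
            record(i)
        if suggest is not None:
            # Substitute the currently worst particle with the suggested position.
            worst = max(range(n_particles), key=cur_val.__getitem__)
            pos[worst] = [max(-bound, min(bound, v)) for v in suggest(g_pos)]
            vel[worst] = [0.0] * dim
            record(worst)
    return g_pos, g_val
```

Calling `pso(rastrigin, dim=2)` runs plain PSO, while `pso(rastrigin, dim=2, suggest=round_to_lattice)` adds the substitution step; in a real system the `suggest` callback would wrap a prompt to an LLM and parse its reply.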

[275] TALES: Text Adventure Learning Environment Suite

Christopher Zhang Cui,Xingdi Yuan,Zhang Xiao,Prithviraj Ammanabrolu,Marc-Alexandre Côté

Main category: cs.AI

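TL;DR: TALES is a diverse collection of text-adventure games designed to challenge and evaluate the reasoning capabilities of large language models (LLMs). Despite strong results on synthetic games, LLMs perform poorly on games designed for humans.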

Details

Motivation: As tasks grow more complex, sequential decision-making demands more sophisticated reasoning, so the diverse reasoning capabilities of LLMs need to be evaluated. Method: Tests a range of LLMs on TALES (synthetic and human-written text-adventure games) and performs a qualitative analysis. Result: Despite good performance on synthetic games, LLMs score below 15% on games designed for humans. Conclusion: TALES is an effective tool for evaluating LLM reasoning, but LLMs still have room to improve on complex human-designed tasks. Abstract: Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate diverse reasoning capabilities. We present results over a range of LLMs, open- and closed-weights, performing a qualitative analysis on the top performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualization of the experiments can be found at https://microsoft.github.io/tales.

[276] Direct Advantage Regression: Aligning LLMs with Online AI Reward

Li He,He Zhao,Stephen Wan,Dadong Wang,Lina Yao,Tongliang Liu

Main category: cs.AI

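TL;DR: The paper proposes DAR, a simple alignment algorithm that uses online AI reward to optimize policy improvement, avoids the complexity of RL, and outperforms OAIF and online RLHF baselines in experiments.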

Details

Motivation: When Online AI Feedback (OAIF) replaces human feedback (RLHF) in language model alignment, it lacks fine-grained supervision signals; the paper addresses this with the DAR algorithm. Method: Proposes Direct Advantage Regression (DAR), a weighted supervised fine-tuning method based on online AI reward that requires no reinforcement learning. Result: Experiments show that DAR outperforms OAIF and online RLHF in human-AI agreement and performance, as supported by GPT-4-Turbo and MT-bench evaluations. Conclusion: DAR is an efficient and theoretically consistent alignment method that reduces implementation complexity while improving learning efficiency. Abstract: Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference in aligning language models (LLMs). However, the straightforward replacement of humans with AI deprives LLMs of learning from more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm using online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results underscore that AI reward is a better form of AI supervision consistently achieving higher human-AI agreement as opposed to AI preference. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
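
The phrase "weighted supervised fine-tuning" in this abstract can be illustrated with a generic advantage-weighted loss. The abstract does not give DAR's exact objective, so the sketch below is an assumption-laden illustration in the style of advantage-weighted regression: the mean-reward baseline, the temperature `beta`, the softmax normalization, and both function names are hypothetical.

```python
import math
from typing import List


def advantage_weights(rewards: List[float], beta: float = 1.0) -> List[float]:
    """Softmax over advantages, where advantage = reward minus the mean reward."""
    baseline = sum(rewards) / len(rewards)
    scaled = [(r - baseline) / beta for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(log_probs: List[float], rewards: List[float], beta: float = 1.0) -> float:
    """Weighted negative log-likelihood over sampled responses.

    `log_probs[i]` is the policy's log-probability of response i and
    `rewards[i]` is the scalar AI reward assigned to that response.
    """
    weights = advantage_weights(rewards, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

Responses with above-average reward receive larger weights, so minimizing the loss raises their likelihood more strongly; with equal rewards the loss reduces to the ordinary average negative log-likelihood.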

[277] AI Idea Bench 2025: AI Research Idea Generation Benchmark

Yansheng Qiu,Haoquan Zhang,Zhaopan Xu,Ming Li,Diping Song,Zheng Wang,Kaipeng Zhang

Main category: cs.AI

TL;DR: AI Idea Bench 2025 is a framework for evaluating AI research ideas generated by LLMs, addressing shortcomings of existing evaluation methods.

Details

Motivation: Existing evaluation methods overlook knowledge leakage, lack open-ended benchmarks, and offer only limited feasibility analysis, which hinders the discovery of groundbreaking research ideas. Method: Presents a dataset of 3,495 AI papers and their associated inspired works, with a two-dimensional evaluation based on the content of the original papers and on general reference material. Result: AI Idea Bench 2025 provides a comprehensive system for evaluating and comparing the quality of LLM-generated ideas. Conclusion: The framework is an important resource for automating scientific discovery and helps advance AI research. Abstract: Large-scale Language Models (LLMs) have revolutionized human-AI interaction and achieved significant success in the generation of novel ideas. However, current assessments of idea generation overlook crucial factors such as knowledge leakage in LLMs, the absence of open-ended benchmarks with grounded truth, and the limited scope of feasibility analysis constrained by prompt design. These limitations hinder the potential of uncovering groundbreaking research ideas. In this paper, we present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by LLMs within the domain of AI research from diverse perspectives. The framework comprises a comprehensive dataset of 3,495 AI papers and their associated inspired works, along with a robust evaluation methodology. This evaluation system gauges idea quality in two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material. AI Idea Bench 2025's benchmarking system stands to be an invaluable resource for assessing and comparing idea-generation techniques, thereby facilitating the automation of scientific discovery.

[278] Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment

Antoun Yaacoub,Jérôme Da-Rugna,Zainab Assaghir

Main category: cs.AI

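TL;DR: The study evaluates integrating Bloom's Taxonomy into OneClickQuiz, an AI-driven Moodle plugin, to improve the alignment of AI-generated multiple-choice questions with cognitive objectives. Questions at higher cognitive levels are more complex, and DistilBERT performs best.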

Details

Motivation: Examine whether Bloom's Taxonomy can improve the alignment of AI-generated questions with cognitive objectives, in order to improve educational assessment tools. Method: Uses a dataset of 3691 questions categorized by Bloom's levels to test several classification models (Multinomial Logistic Regression, Naive Bayes, Linear SVC, and DistilBERT). Result: Questions at higher cognitive levels are longer and more complex, and DistilBERT performs best (91% validation accuracy). Conclusion: Integrating Bloom's Taxonomy into AI tools is promising, and DistilBERT can markedly improve educational content generation. Abstract: This study evaluates the integration of Bloom's Taxonomy into OneClickQuiz, an Artificial Intelligence (AI) driven plugin for automating Multiple-Choice Question (MCQ) generation in Moodle. Bloom's Taxonomy provides a structured framework for categorizing educational objectives into hierarchical cognitive levels. Our research investigates whether incorporating this taxonomy can improve the alignment of AI-generated questions with specific cognitive objectives. We developed a dataset of 3691 questions categorized according to Bloom's levels and employed various classification models, namely Multinomial Logistic Regression, Naive Bayes, Linear Support Vector Classification (SVC), and a Transformer-based model (DistilBERT), to evaluate their effectiveness in categorizing questions. Our results indicate that higher Bloom's levels generally correlate with increased question length, Flesch-Kincaid Grade Level (FKGL), and Lexical Density (LD), reflecting the increased complexity of higher cognitive demands. Multinomial Logistic Regression showed varying accuracy across Bloom's levels, performing best for "Knowledge" and less accurately for higher-order levels. Merging higher-level categories improved accuracy for complex cognitive tasks. Naive Bayes and Linear SVC also demonstrated effective classification for lower levels but struggled with higher-order tasks. DistilBERT achieved the highest performance, significantly improving classification of both lower and higher-order cognitive levels, achieving an overall validation accuracy of 91%. This study highlights the potential of integrating Bloom's Taxonomy into AI-driven assessment tools and underscores the advantages of advanced models like DistilBERT for enhancing educational content generation.
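
The two complexity measures named in this abstract can be computed as follows. The FKGL formula is the standard Flesch-Kincaid Grade Level definition; the lexical density function assumes a caller-supplied set of function words, since the abstract does not say how the study separates content words from function words, and the function names are illustrative.

```python
from typing import Iterable, Set


def fkgl(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw word, sentence, and syllable counts."""
    return (
        0.39 * (total_words / total_sentences)
        + 11.8 * (total_syllables / total_words)
        - 15.59
    )


def lexical_density(tokens: Iterable[str], function_words: Set[str]) -> float:
    """Share of tokens that are content words, i.e. not in `function_words`."""
    token_list = [t.lower() for t in tokens]
    if not token_list:
        return 0.0
    content = sum(1 for t in token_list if t not in function_words)
    return content / len(token_list)
```

For example, a passage of 100 words in 5 sentences with 150 syllables has an FKGL of about 9.91, and longer sentences or more syllables per word push the grade level up, consistent with the reported trend for higher Bloom's levels.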

[279] InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu,Pengxiang Li,Congkai Xie,Xavier Hu,Xiaotian Han,Shengyu Zhang,Hongxia Yang,Fei Wu

Main category: cs.AI

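TL;DR: InfiGUI-R1 is an MLLM-based GUI agent that evolves from a reactive actor into a deliberative reasoner through the two-stage Actor2Reasoner training framework, improving robustness and adaptability on GUI tasks.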

Details

Motivation: Current GUI agents rely on manually designed reasoning templates or implicit reasoning, and lack the robustness and deep planning ability needed for complex GUI environments. Method: Two-stage training: (1) Reasoning Injection via Spatial Reasoning Distillation; (2) Deliberation Enhancement via reinforcement learning, including Sub-goal Guidance and Error Recovery Scenario Construction. Result: Experiments show that InfiGUI-R1 performs strongly on GUI grounding and trajectory tasks. Conclusion: With the Actor2Reasoner framework, the GUI agent moves from reactive actor to deliberative reasoner, significantly improving task performance. Abstract: Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning.
This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.

[280] Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey

Ahsan Bilal,Muhammad Ahmed Mohsin,Muhammad Umer,Muhammad Awais Khan Bangash,Muhammad Ali Jamshed

Main category: cs.AI

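TL;DR: This survey examines the development of meta-thinking capabilities in large language models (LLMs) from a multi-agent reinforcement learning (MARL) perspective, and proposes multi-agent architectures as a way to improve LLM reliability and adaptability.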

Details

Motivation: Current LLMs suffer from limitations such as hallucinations and the lack of self-assessment mechanisms; meta-thinking is needed to improve their performance on complex or high-stakes tasks. Method: Analyzes methods such as RLHF, self-distillation, and chain-of-thought prompting, and examines how multi-agent architectures (supervisor-agent hierarchies, agent debates, and theory-of-mind frameworks) can emulate human-like introspective behavior. Result: Drawing on reward mechanisms, self-play, and continuous learning methods in MARL, the survey lays out a roadmap for building introspective, adaptive, and trustworthy LLMs. Conclusion: Discusses evaluation metrics, datasets, and future research directions (such as neuroscience-inspired architectures and hybrid symbolic reasoning), offering guidance for further LLM development. Abstract: This survey explores the development of meta-thinking capabilities in Large Language Models (LLMs) from a Multi-Agent Reinforcement Learning (MARL) perspective. Meta-thinking, that is, self-reflection, assessment, and control of thinking processes, is an important next step in enhancing LLM reliability, flexibility, and performance, particularly for complex or high-stakes tasks. The survey begins by analyzing current LLM limitations, such as hallucinations and the lack of internal self-assessment mechanisms. It then discusses newer methods, including RL from human feedback (RLHF), self-distillation, and chain-of-thought prompting, and the limitations of each. The crux of the survey is how multi-agent architectures, namely supervisor-agent hierarchies, agent debates, and theory of mind frameworks, can emulate human-like introspective behavior and enhance LLM robustness. By exploring reward mechanisms, self-play, and continuous learning methods in MARL, this survey gives a comprehensive roadmap to building introspective, adaptive, and trustworthy LLMs. Evaluation metrics, datasets, and future research avenues, including neuroscience-inspired architectures and hybrid symbolic reasoning, are also discussed.

[281] PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities

Haoming Li,Zhaoliang Chen,Jonathan Zhang,Fei Liu

Main category: cs.AI

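TL;DR: This paper surveys the current state of planning benchmarks and offers a categorization and recommendations to guide algorithm selection and future benchmark development.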

Details

Motivation: Planning is central to agents and agentic AI, but a comprehensive understanding of existing planning benchmarks is lacking, which makes it hard to compare algorithms or select them for new scenarios. Method: Examines a range of planning benchmarks, categorizes them into embodied environments, web navigation, scheduling, games and puzzles, and everyday task automation, and analyzes their suitability. Result: Recommends the most appropriate benchmarks for different algorithms and points out potential directions for future benchmark development. Conclusion: The study offers practical guidance for selecting planning algorithms and designing future benchmarks. Abstract: Planning is central to agents and agentic AI. The ability to plan, e.g., creating travel itineraries within a budget, holds immense potential in both scientific and commercial contexts. Moreover, optimal plans tend to require fewer resources compared to ad-hoc methods. To date, a comprehensive understanding of existing planning benchmarks appears to be lacking. Without it, comparing planning algorithms' performance across domains or selecting suitable algorithms for new scenarios remains challenging. In this paper, we examine a range of planning benchmarks to identify commonly used testbeds for algorithm development and highlight potential gaps. These benchmarks are categorized into embodied environments, web navigation, scheduling, games and puzzles, and everyday task automation. Our study recommends the most appropriate benchmarks for various algorithms and offers insights to guide future benchmark development.

[282] AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG

Jiaqi Wei,Hao Zhou,Xiang Zhang,Di Zhang,Zijie Qiu,Wei Wei,Jinzhe Li,Wanli Ouyang,Siqi Sun

Main category: cs.AI

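TL;DR: AlignRAG addresses reasoning misalignment in retrieval-augmented generation (RAG) through iterative Critique-Driven Alignment (CDA) steps, outperforming the baselines it is compared against.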

Details

Motivation: Existing RAG methods do not adequately align reasoning trajectories with retrieved evidence, which leads to reasoning misalignment. Method: Proposes the AlignRAG framework, which constructs context-rich training corpora, generates contrastive critiques, trains a Critic Language Model (CLM), and iteratively optimizes reasoning trajectories. Result: AlignRAG outperforms baseline methods in experiments and can be integrated into existing RAG pipelines as a plug-and-play module. Conclusion: AlignRAG offers practical improvements for retrieval-aware generation and reconceptualizes RAG as a structured reasoning trajectory. Abstract: Retrieval-augmented generation (RAG) has emerged as a foundational paradigm for knowledge-grounded text generation. However, existing RAG pipelines often fail to ensure that the reasoning trajectories align with the evidential constraints imposed by retrieved content. In this paper, we reframe RAG as a problem of retrieval-aware reasoning and identify a core challenge: reasoning misalignment, the mismatch between a model's reasoning trajectory and the retrieved evidence. To address this challenge, we propose AlignRAG, a novel test-time framework that mitigates reasoning misalignment through iterative Critique-Driven Alignment (CDA) steps. In contrast to prior approaches that rely on static training or post-hoc selection, AlignRAG actively refines reasoning trajectories during inference by enforcing fine-grained alignment with evidence. Our framework introduces a new paradigm for retrieval-aware reasoning by: (1) constructing context-rich training corpora; (2) generating contrastive critiques from preference-aware reasoning trajectories; (3) training a dedicated Critic Language Model (CLM) to identify reasoning misalignments; and (4) applying CDA steps to optimize reasoning trajectories iteratively. Empirical results demonstrate that AlignRAG consistently outperforms all baselines and can be integrated as a plug-and-play module into existing RAG pipelines without further changes. By reconceptualizing RAG as a structured reasoning trajectory and establishing the test-time framework for correcting reasoning misalignments in RAG, AlignRAG provides practical advancements for retrieval-aware generation.
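
The test-time loop described in this abstract (generate a reasoning trajectory, have a critic check it against the retrieved evidence, refine, repeat) might be sketched as follows. This is only an illustration under stated assumptions: the function names, their signatures, the `max_steps` cap, and the "critic returns None when satisfied" stopping rule are all hypothetical and are not taken from the AlignRAG paper or its code.

```python
def critique_driven_alignment(query, evidence, generate, critique, refine, max_steps=3):
    """Iteratively refine a reasoning trajectory using critic feedback.

    generate, critique, and refine are caller-supplied callables (hypothetical
    stand-ins for the generator LLM and the critic language model).
    """
    reasoning = generate(query, evidence)
    for _ in range(max_steps):
        # The critic inspects the reasoning against the retrieved evidence.
        feedback = critique(query, evidence, reasoning)
        if feedback is None:
            # Assumed convention: None means no misalignment was found.
            break
        reasoning = refine(query, evidence, reasoning, feedback)
    return reasoning
```

Passing the models in as callables keeps the loop independent of any particular LLM API, which is also how a plug-and-play module could sit on top of an existing RAG pipeline.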

[283] OTC: Optimal Tool Calls via Reinforcement Learning

Hongru Wang,Cheng Qian,Wanjun Zhong,Xiusi Chen,Jiahao Qiu,Shijue Huang,Bowen Jin,Mengdi Wang,Kam-Fai Wong,Heng Ji

Main category: cs.AI

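TL;DR: The paper proposes OTC-PO, a reinforcement learning framework that optimizes tool-call efficiency, reducing the number of tool calls while maintaining answer accuracy.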

Details

Motivation: Existing tool-integrated reasoning methods overlook the efficiency and cost of tool use, leading to excessive or insufficient tool calls that hurt performance and increase overhead. Method: Proposes the OTC-PO framework, with a reward that combines correctness and tool efficiency, instantiated on PPO and GRPO as OTC-PPO and OTC-GRPO. Result: Experiments show tool calls reduced by up to 73.1% and tool productivity improved by up to 229.4%, with comparable answer accuracy. Conclusion: OTC-PO is, to the authors' knowledge, the first RL framework that explicitly optimizes tool-use efficiency in tool-integrated reasoning. Abstract: Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost associated with tool usage. This can lead to suboptimal behavior, including excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.
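
The abstract does not give the reward formula, so the following is only a hedged illustration of what "jointly considers correctness and tool efficiency" could look like. The linear decay, the `budget` and `alpha` parameters, and the function name are assumptions made for this sketch, not the OTC-PO reward itself.

```python
def tool_aware_reward(correct, tool_calls, budget=4, alpha=0.5):
    """Hypothetical reward: a correct answer earns more when it uses fewer tool calls."""
    if not correct:
        # Wrong answers earn nothing, however few tools were used.
        return 0.0
    # Efficiency falls linearly from 1.0 (no calls) to 0.0 (budget or more calls).
    efficiency = max(0.0, 1.0 - tool_calls / budget)
    # alpha sets how much of the reward depends on efficiency.
    return (1.0 - alpha) + alpha * efficiency
```

Gating the efficiency term on correctness means the policy cannot collect reward simply by never calling tools, which matches the abstract's concern about insufficient tool use harming answer quality.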

[284] EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

Yao Shi,Rongkeng Liang,Yong Xu

Main category: cs.AI

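TL;DR: EducationQ is a multi-agent dialogue framework for evaluating the teaching capabilities of LLMs. It finds that teaching effectiveness does not scale linearly with model size or general reasoning ability, and that some smaller open-source models outperform larger commercial ones.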

Details

Motivation: Evaluating LLMs' teaching capabilities is currently resource-intensive and methodologically complex, and existing evaluations pay little attention to interactive pedagogy. Method: Introduces the EducationQ framework, which simulates dynamic educational scenarios, and tests 14 LLMs across 13 disciplines and 10 difficulty levels. Result: Teaching effectiveness does not correlate linearly with model scale, and some smaller models perform better; human expert evaluations show 78% agreement with the automated analysis. Conclusion: LLMs used as teachers need targeted optimization, and future educational AI should prioritize improving specific pedagogical effectiveness. Abstract: Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific pedagogical effectiveness.

[285] SuoiAI: Building a Dataset for Aquatic Invertebrates in Vietnam

Tue Vo,Lakshay Sharma,Tuan Dinh,Khuong Dinh,Trang Nguyen,Trung Phan,Minh Do,Duong Vu

Main category: cs.AI

TL;DR: SuoiAI is an end-to-end pipeline for building a dataset of aquatic invertebrates in Vietnam and classifying species with machine learning.

Details

Motivation: Understanding and monitoring aquatic biodiversity is critical for ecological health and conservation. Method: Covers data collection, annotation, and model training, reducing annotation effort through semi-supervised learning and using state-of-the-art object detection and classification models. Result: The approach aims to overcome challenges such as data scarcity, fine-grained classification, and deployment in diverse environmental conditions. Conclusion: SuoiAI is presented as an efficient and scalable approach to aquatic biodiversity monitoring. Abstract: Understanding and monitoring aquatic biodiversity is critical for ecological health and conservation efforts. This paper proposes SuoiAI, an end-to-end pipeline for building a dataset of aquatic invertebrates in Vietnam and employing machine learning (ML) techniques for species classification. We outline the methods for data collection, annotation, and model training, focusing on reducing annotation effort through semi-supervised learning and leveraging state-of-the-art object detection and classification models. Our approach aims to overcome challenges such as data scarcity, fine-grained classification, and deployment in diverse environmental conditions.

cs.HC [Back]

[286] VoxLogicA UI: Supporting Declarative Medical Image Analysis

Antonio Strippoli

Main category: cs.HC

TL;DR: Designs and implements a user-friendly interface for VoxLogicA to make this neuroimaging analysis tool easier to use.

Details

Motivation: Existing tools are too complex for medical professionals and researchers to use; the thesis aims to use spatial logic to make these tools more practical and accessible. Method: Designs the interface with modern web technologies (e.g. Svelte and Niivue) and tests its effectiveness through user studies and real-world case analyses. Result: No specific results are reported. Conclusion: The goal is to make powerful analysis tools more practical and easier to use in clinical settings. Abstract: This Master's Thesis in Computer Science covers the design and creation of a user-friendly interface for VoxLogicA, an image analysis tool using spatial model checking with a focus on neuroimaging. The research tackles the problem of existing tools being too complex, which makes them hard for medical professionals and researchers to use. By using spatial logic, the goal is to make these powerful analytical tools more practical and accessible in real-world clinical settings. The main objectives are to design a modern web interface that's easy to use, build it with the latest web technologies (e.g. Svelte and Niivue), and test its effectiveness through user studies and real-world case analyses.

[287] Semantic Direct Modeling

Qiang Zou,Shuo Liu

Main category: cs.HC

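TL;DR: The paper proposes semantic direct modeling (SDM), which combines a large language model (LLM) with generative AI to lift direct modeling from low-level geometric operations to high-level semantic interactions, simplifying the design process.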

Details

Motivation: Existing direct modeling systems limit users to low-level operations on vertices, edges, and faces, so designers must attend to geometric detail rather than high-level design intent. Method: SDM uses an LLM fine-tuned with CAD-specific prompts to interpret design intent, and maps that intent to geometric features through a conditional, context-sensitive feature recognition method. Result: SDM enables a seamless flow from high-level design intent to low-level geometric modifications, and its effectiveness is validated on real mechanical design cases. Conclusion: SDM simplifies the design process through semantic interaction and improves design efficiency. Abstract: Current direct modeling systems limit users to low-level interactions with vertices, edges, and faces, forcing designers to manage detailed geometric elements rather than focusing on high-level design intent. This paper introduces semantic direct modeling (SDM), a novel approach that lifts direct modeling from low-level geometric modifications to high-level semantic interactions. This is achieved by utilizing a large language model (LLM) fine-tuned with CAD-specific prompts, which can guide the LLM to reason through design intent and accurately interpret CAD commands, thereby allowing designers to express their intent using natural language. Additionally, SDM maps design intent to the corresponding geometric features in the CAD model through a new conditional, context-sensitive feature recognition method, which uses generative AI to dynamically assign feature labels based on design intent. Together, they enable a seamless flow from high-level design intent to low-level geometric modifications, bypassing tedious software interactions. The effectiveness of SDM has been validated through real mechanical design cases.

[288] Interview AI-ssistant: Designing for Real-Time Human-AI Collaboration in Interview Preparation and Execution

Zhe Liu

Main category: cs.HC

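TL;DR: The paper presents Interview AI-ssistant, a system that assists interviewers in real time with AI, aiming to improve the efficiency and quality of interviews in qualitative research.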

Details

Motivation: Interviews are highly valued in qualitative research, but interviewers face cognitive challenges such as real-time information processing, question adaptation, and rapport maintenance; AI assistance may help address them. Method: Four interconnected studies (a formative needs study, prototype development, an experimental evaluation, and a field deployment) explore the design of human-AI collaboration in interviewing. Result: The research informs practical implementations of intelligent interview support systems and advances the understanding of human-AI collaborative interfaces in complex social tasks. Conclusion: The work provides design guidelines for AI-enhanced qualitative research tools and contributes to the Intelligent User Interfaces (IUI) community. Abstract: Recent advances in large language models (LLMs) offer unprecedented opportunities to enhance human-AI collaboration in qualitative research methods, including interviews. While interviews are highly valued for gathering deep, contextualized insights, interviewers often face significant cognitive challenges, such as real-time information processing, question adaptation, and rapport maintenance. My doctoral research introduces Interview AI-ssistant, a system designed for real-time interviewer-AI collaboration during both the preparation and execution phases. Through four interconnected studies, this research investigates the design of effective human-AI collaboration in interviewing contexts, beginning with a formative study of interviewers' needs, followed by a prototype development study focused on AI-assisted interview preparation, an experimental evaluation of real-time AI assistance during interviews, and a field study deploying the system in a real-world research setting. Beyond informing practical implementations of intelligent interview support systems, this work contributes to the Intelligent User Interfaces (IUI) community by advancing the understanding of human-AI collaborative interfaces in complex social tasks and establishing design guidelines for AI-enhanced qualitative research tools.

[289] 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

Ivan Sviridov,Amina Miftakhova,Artemiy Tereshchenko,Galina Zubkova,Pavel Blinov,Andrey Savchenko

Main category: cs.HC

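TL;DR: 3MDBench is an open-source evaluation framework for testing LVLMs in medical consultations. It simulates diverse patient behaviors and integrates multimodal data, and shows that dialogue and image inputs improve diagnostic performance.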

Details

Motivation: To examine how well LVLMs handle diverse patient behaviors in telemedicine, a gap left by existing evaluation tools. Method: Develops the 3MDBench framework, which simulates realistic patient behavior, integrates text and image data, and evaluates LVLMs under different diagnostic strategies. Result: Dialogue and multimodal inputs each raise the diagnostic F1 score to 54.2 (from 50.4 and 52.8, respectively), and adding a CNN model's top-3 predictions raises it to 70.3. Conclusion: 3MDBench provides an extensible evaluation tool for AI medical assistants and supports more reliable, context-aware healthcare solutions. Abstract: Large Vision-Language Models (LVLMs) are increasingly being explored for applications in telemedicine, yet their ability to engage with diverse patient behaviors remains underexplored. We introduce 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source evaluation framework designed to assess LLM-driven medical consultations. Unlike existing benchmarks, 3MDBench simulates real-world patient variability by incorporating four temperament-driven Patient Agents and an Assessor Agent that evaluates diagnostic accuracy and dialogue quality. The benchmark integrates textual and image-based patient data across 34 common diagnoses, mirroring real-world telemedicine interactions. Under different diagnostic strategies, we evaluate state-of-the-art LVLMs. Our findings demonstrate that incorporating dialogue improves the F1 score from 50.4 to 54.2 compared to non-dialogue settings, underscoring the value of context-driven, information-seeking questioning. Additionally, we demonstrate that multimodal inputs enhance diagnostic efficiency. Image-supported models outperform text-only counterparts by raising the diagnostic F1 score from 52.8 to 54.2 in a similar dialogue setting. Finally, we suggest an approach that improves the diagnostic F1-score to 70.3 by training the CNN model on the diagnosis prediction task and incorporating its top-3 predictions into the LVLM context. 3MDBench provides a reproducible and extendable evaluation framework for AI-driven medical assistants. It offers insights into how patient temperament, dialogue strategies, and multimodal reasoning influence diagnosis quality. 
By addressing real-world complexities in telemedicine, our benchmark paves the way for more empathetic, reliable, and context-aware AI-driven healthcare solutions. The source code of our benchmark is publicly available: https://github.com/univanxx/3mdbench
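
The last improvement the abstract mentions, feeding a CNN's top-3 predictions into the LVLM context, can be pictured with a small sketch. Everything here is assumed for illustration: the function names, the prompt wording, and the example diagnosis labels are hypothetical and do not come from the 3MDBench code.

```python
def top_k_labels(probs, k=3):
    """Return the k labels with the highest predicted probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def build_context(complaint, probs, k=3):
    """Compose a text prompt that includes the classifier's top-k candidates."""
    candidates = ", ".join(top_k_labels(probs, k))
    return (
        f"Patient complaint: {complaint}\n"
        f"Image classifier top-{k} candidates: {candidates}\n"
        "Ask follow-up questions if needed, then give one diagnosis."
    )
```

Passing a short candidate list rather than a single label leaves the language model room to use the dialogue to choose among plausible diagnoses.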

[290] A Survey on (M)LLM-Based GUI Agents

Fei Tang,Haolei Xu,Hang Zhang,Siqi Chen,Xingyu Wu,Yongliang Shen,Wenqi Zhang,Guiyang Hou,Zeqi Tan,Yuchen Yan,Kaitao Song,Jian Shao,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.HC

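TL;DR: This survey reviews the rapid development of LLM-based GUI agents, analyzing their architectures, technical components, and evaluation methods, and discusses future research directions.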

Details

Motivation: To examine the evolution of GUI agents from rule-based scripts to AI-driven systems and their potential in human-computer interaction. Method: Systematically analyzes four core components of GUI agents: perception systems, exploration mechanisms, planning frameworks, and interaction systems. Result: Shows how LLMs and multimodal learning have transformed GUI automation, and identifies limitations in existing evaluation frameworks. Conclusion: Summarizes the current state, challenges, and future directions of GUI agents, providing a comprehensive reference for researchers and practitioners. Abstract: Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. 
Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.

[291] Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

Megan Gu,Chloe Qianhui Zhao,Claire Liu,Nikhil Patel,Jahnvi Shah,Jionghao Lin,Kenneth R. Koedinger

Main category: cs.HC

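TL;DR: The paper presents a system that uses large language models (LLMs) to automatically assess the effectiveness of five key tutoring strategies. The model is fairly good at excluding incorrect classifications but still struggles to identify the correct strategy.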

Details

Motivation: To explore the potential of LLMs for analyzing tutoring strategies and to provide an automated evaluation tool for educational technology. Method: Uses GPT-3.5 with few-shot prompting to analyze tutoring dialogues from a public dataset (the Teacher-Student Chatroom Corpus) and classify how five strategies are used. Result: Across the five strategies, the true negative rate (TNR) ranges from 0.655 to 0.738 and recall from 0.327 to 0.432; the strategy "helping students manage inequity" performs best. Conclusion: LLMs show potential for tutoring strategy analysis, and more advanced models may improve performance in future work. Abstract: Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: (1) giving effective praise, (2) reacting to errors, (3) determining what students know, (4) helping students manage inequity, and (5) responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy "helping students manage inequity" showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.
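
For readers less familiar with the two metrics reported above, a minimal sketch of how they are computed from confusion-matrix counts follows. The formulas are the standard definitions; the helper names are chosen for this example, and the paper's actual counts are not given in the abstract.

```python
def recall(tp, fn):
    """Share of actual positives that the classifier found: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def true_negative_rate(tn, fp):
    """Share of actual negatives that the classifier rejected: TN / (TN + FP)."""
    total = tn + fp
    return tn / total if total else 0.0
```

A high TNR paired with a low recall, as reported here, means the classifier rarely flags a strategy that was not used but misses many cases where it was.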

[292] AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants

Isabel Villanueva,Tara Bobinac,Binwei Yao,Junjie Hu,Kaiping Chen

Main category: cs.HC

TL;DR: The study finds that AI chatbots have limits in fostering intercultural empathy: outcomes differ markedly by cultural alignment and conversation type.

Details

Motivation: To examine whether AI chatbots can foster empathy in intercultural dialogue, and how cultural alignment and conversation type affect the outcome. Method: A randomized dialogue experiment comparing deliberative with non-deliberative conversations and culturally aligned with non-aligned AI chatbot interactions. Result: Deliberative conversations were effective for American participants but not for Latin American participants, who perceived the AI's cultural representation as inaccurate. Real-time interaction analyses point to cultural knowledge gaps as the main cause. Conclusion: AI systems need better cultural authenticity to support intercultural dialogue. The study offers a new perspective for deliberation theory and AI alignment research. Abstract: Despite the growing integration of AI chatbots as conversational agents in public discourse, empirical evidence regarding their capacity to foster intercultural empathy remains limited. Using a randomized dialogue experiment, we examined how different types of AI chatbot interaction, i.e., deliberative versus non-deliberative and culturally aligned versus non-aligned, affect intercultural empathy across cultural groups. Results show that deliberative conversations increased intercultural empathy among American participants but not Latin American participants, who perceived AI responses as culturally inaccurate and failing to represent their cultural contexts and perspectives authentically. Real-time interaction analyses reveal that these differences stem from cultural knowledge gaps inherent in Large Language Models. Despite explicit prompting and instruction to represent cultural perspectives in participants' native languages, AI systems still exhibit significant disparities in cultural representation. This highlights the importance of designing AI systems capable of culturally authentic engagement in deliberative conversations. Our study contributes to deliberation theory and AI alignment research by underscoring AI's role in intercultural dialogue and the persistent challenge of representational asymmetry in democratic discourse.

[293] Kanji Workbook: A Writing-Based Intelligent Tutoring System for Learning Proper Japanese Kanji Writing Technique with Instructor-Emulated Assessment

Paul Taele,Jung In Koh,Tracy Hammond

Main category: cs.HC

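TL;DR: Kanji Workbook is an intelligent tutoring system that helps students learn Japanese kanji writing through instructor-emulated feedback; students who used it achieved higher course grades.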

Details

Motivation: Students whose primary language is English struggle when learning Japanese kanji, and existing educational tools lack instructor-emulated feedback. Method: Develops the Kanji Workbook system, informed by instructor interviews and classroom observation, which provides intelligent scoring and visual animation feedback. Result: Students who used the system achieved higher course grades on average than their peers and reacted positively to its features. Conclusion: Kanji Workbook effectively supports students' kanji learning through intelligent feedback and has practical value. Abstract: Kanji script writing is a skill that is often introduced to novice Japanese foreign language students for achieving Japanese writing mastery, but often poses difficulties to students with primarily English fluency due to its vast differences from written English. Instructors often introduce various pedagogical methods -- such as visual structure and written techniques -- to assist students in kanji study, but may lack the availability to provide direct feedback on students' writing outside of class. Current educational applications are also limited because they lack richer instructor-emulated feedback. We introduce Kanji Workbook, a writing-based intelligent tutoring system for students to receive intelligent assessment that emulates human instructor feedback. Our interface not only leverages students' computing devices for allowing them to learn, practice, and review the writing of prompted characters from their course's kanji script lessons, but also provides a diverse set of writing assessment metrics -- derived from instructor interviews and classroom observation insights -- through intelligent scoring and visual animations. We deployed our interface in novice- and intermediate-level university courses over an entire academic year, and observed that interface users on average achieved higher course grades than their peers and also reacted positively to our interface's various features.

[294] Measuring Mental Health Variables in Computational Research: Toward Validated, Dimensional, and Transdiagnostic Approaches

Chen Shani,Elizabeth C. Stade

Main category: cs.HC

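TL;DR: The paper argues that the psychopathology measures commonly used in computational mental health research are problematic, and recommends validated, dimensional, and transdiagnostic measures instead.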

Details

Motivation: To address the validity problems that arise when computational mental health research uses inappropriate measures of psychopathology. Method: Identifies three key issues: reliance on unvalidated measures, treating mental health constructs as categorical rather than dimensional, and focusing on disorder-specific rather than transdiagnostic constructs. Result: Outlines the benefits of validated, dimensional, and transdiagnostic measures and offers practical recommendations. Conclusion: Valid measures that reflect the nature and structure of psychopathology are essential for computational mental health research. Abstract: Computational mental health research develops models to predict and understand psychological phenomena, but often relies on inappropriate measures of psychopathology constructs, undermining validity. We identify three key issues: (1) reliance on unvalidated measures (e.g., self-declared diagnosis) over validated ones (e.g., diagnosis by clinician); (2) treating mental health constructs as categorical rather than dimensional; and (3) focusing on disorder-specific constructs instead of transdiagnostic ones. We outline the benefits of using validated, dimensional, and transdiagnostic measures and offer practical recommendations for practitioners. Using valid measures that reflect the nature and structure of psychopathology is essential for computational mental health research.

[295] TALLMesh: a simple application for performing Thematic Analysis with Large Language Models

Stefano De Paoli,Alex Fawzi

Main category: cs.HC

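TL;DR: This paper presents a graphical user interface (GUI) application built on large language models (LLMs) that helps researchers perform thematic analysis (TA) without programming skills.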

Details

Motivation: To give researchers without programming skills (for example in the social sciences or humanities) a simple tool for performing thematic analysis with LLMs while preserving methodological rigor. Method: Develops a GUI application based on the Streamlit framework in which users upload textual data, generate initial codes and themes, and refine the analysis iteratively. Result: The application supports LLM-assisted thematic analysis and simplifies the research workflow, particularly for researchers without a technical background. Conclusion: The application illustrates the potential of LLMs in qualitative research; future work may refine its features and extend its use cases. Abstract: Thematic analysis (TA) is a widely used qualitative research method for identifying and interpreting patterns within textual data, such as qualitative interviews. Recent research has shown that it is possible to satisfactorily perform TA using Large Language Models (LLMs). This paper presents a novel application using LLMs to assist researchers in conducting TA. The application enables users to upload textual data and generate initial codes and themes. All of this is possible through a simple Graphical User Interface (GUI) based on the Streamlit framework, working with Python scripts for the analysis, and using the Application Programming Interfaces of LLMs. Having a GUI is particularly important for researchers in fields where coding skills may not be prevalent, such as social sciences or humanities. With the app, users can iteratively refine codes and themes adopting a human-in-the-loop process, without the need to work with programming and scripting. The paper describes the application's key features, highlighting its potential for qualitative research while preserving methodological rigor. The paper discusses the design and interface of the app and outlines future directions for this work.

[296] Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge

Donghuo Zeng,Roberto Legaspi,Yuewen Sun,Xinshuai Dong,Kazushi Ikeda,Peter Spirtes,Kun Zhang

Main category: cs.HC

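TL;DR: The paper proposes generating optimal system responses with adaptive strategies grounded in causal and counterfactual knowledge, using causal discovery and counterfactual reasoning to optimize dialogue policies and significantly improve persuasive dialogue system outcomes.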

Details

Motivation: To explore how causal and counterfactual knowledge can be used to optimize system responses and improve user-system interaction, particularly in persuasive dialogue systems. Method: Uses causal discovery to identify strategy-level causal relationships, counterfactual reasoning to generate personalized dialogues, and counterfactual data to optimize the policy for selecting system responses. Result: On a real-world dataset, the method significantly increases the cumulative reward of a persuasive dialogue system, supporting the effectiveness of causal discovery and counterfactual reasoning. Conclusion: Combining causal discovery with counterfactual reasoning can effectively optimize dialogue policies and improve system performance. Abstract: We hypothesize that optimal system responses emerge from adaptive strategies grounded in causal and counterfactual knowledge. Counterfactual inference allows us to create hypothetical scenarios to examine the effects of alternative system responses. We enhance this process through causal discovery, which identifies the strategies informed by the underlying causal structure that govern system behaviors. Moreover, we consider the psychological constructs and unobservable noises that might be influencing user-system interactions as latent factors. We show that these factors can be effectively estimated. We employ causal discovery to identify strategy-level causal relationships among user and system utterances, guiding the generation of personalized counterfactual dialogues. We model the user utterance strategies as causal factors, enabling system strategies to be treated as counterfactual actions. Furthermore, we optimize policies for selecting system responses based on counterfactual data. Our results using a real-world dataset on social good demonstrate significant improvements in persuasive system outcomes, with increased cumulative rewards validating the efficacy of causal discovery in guiding personalized counterfactual inference and optimizing dialogue policies for a persuasive dialogue system.

[297] HealthGenie: Empowering Users with Healthy Dietary Guidance through Knowledge Graph and Large Language Models

Fan Gao,Xinjie Zhao,Ding Xia,Zhongyi Zhou,Rui Yang,Jinghui Lu,Hang Jiang,Chanjun Park,Irene Li

Main category: cs.HC

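TL;DR: HealthGenie combines a knowledge graph (KG) with a large language model (LLM) to provide personalized dietary recommendations and visualized information, reducing users' interaction effort and cognitive load.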

Details

Motivation: Seeking dietary guidance requires users to navigate complex professional knowledge while accounting for their individual health conditions. Method: HealthGenie refines the user's query, retrieves information from a pre-built KG, uses an LLM to generate explainable recommendations, and provides interactive visualization. Result: The evaluation shows that HealthGenie effectively supports personalized dietary guidance while reducing interaction effort and cognitive load. Conclusion: Combining LLMs with KGs shows potential for supporting decision-making, and the approach can inform the design of future systems. Abstract: Seeking dietary guidance often requires navigating complex professional knowledge while accommodating individual health conditions. Knowledge Graphs (KGs) offer structured and interpretable nutritional information, whereas Large Language Models (LLMs) naturally facilitate conversational recommendation delivery. In this paper, we present HealthGenie, an interactive system that combines the strengths of LLMs and KGs to provide personalized dietary recommendations along with hierarchical information visualization for a quick and intuitive overview. Upon receiving a user query, HealthGenie performs query refinement and retrieves relevant information from a pre-built KG. The system then visualizes and highlights pertinent information, organized by defined categories, while offering detailed, explainable recommendation rationales. Users can further tailor these recommendations by adjusting preferences interactively. Our evaluation, comprising a within-subject comparative experiment and an open-ended discussion, demonstrates that HealthGenie effectively supports users in obtaining personalized dietary guidance based on their health conditions while reducing interaction effort and cognitive load. These findings highlight the potential of LLM-KG integration in supporting decision-making through explainable and visualized information. We examine the system's usefulness and effectiveness with an N=12 within-subject study and provide design considerations for future systems that integrate conversational LLM and KG.

[298] Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

Rui Qiu,Shijie Chen,Yu Su,Po-Yin Yen,Han-Wei Shen

Main category: cs.HC

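TL;DR: InsightAgent is a human-centered interactive AI agent built on large language models. Semantic partitioning of the corpus and a multi-agent design substantially improve the quality of systematic reviews (SRs), while visualization tools and real-time feedback let a clinician complete a high-quality SR in about 1.5 hours.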

Details

Motivation: Systematic reviews (SRs) are vital in high-stakes fields such as healthcare, but traditional methods are time-consuming and depend on expert knowledge, and automated methods struggle to produce high-quality results. Method: InsightAgent processes the literature using semantic partitioning and a multi-agent design, and provides visualization tools and a real-time feedback mechanism. Result: User studies show that InsightAgent improves SR quality by 27.2%, reaching 79.7% of human-written quality, raises user satisfaction by 34.4%, and cuts completion time from months to about 1.5 hours. Conclusion: By combining human interaction with AI, InsightAgent substantially improves the efficiency and quality of systematic review generation. Abstract: Systematic reviews (SRs) are vital for evidence-based practice in high-stakes disciplines, such as healthcare, but are often impeded by intensive labor and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionize this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high-quality systematic review.

[299] Skeleton-Based Transformer for Classification of Errors and Better Feedback in Low Back Pain Physical Rehabilitation Exercises

Aleksa Marusic,Sao Mai Nguyen,Adriana Tapus

Main category: cs.HC

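TL;DR: Proposes a Transformer-based algorithm for classifying errors in rehabilitation exercises, with a joint-importance calculation that enables more detailed feedback.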

Details

Motivation: Patient engagement declines without direct supervision, and existing systems provide only a binary classification or a continuous score, which does not help patients improve. Method: Skeleton-based exercise assessment combined with a Transformer model (inspired by HyperFormer) for error classification. Result: The model significantly outperforms existing methods on the KERAAL dataset, and a method for calculating joint importance is presented. Conclusion: The algorithm gives patients more detailed feedback and is an important step for quality assessment of rehabilitation exercises. Abstract: Physical rehabilitation exercises suggested by healthcare professionals can help recovery from various musculoskeletal disorders and prevent re-injury. However, patients' engagement tends to decrease over time without direct supervision, which is why there is a need for an automated monitoring system. In recent years, there has been great progress in quality assessment of physical rehabilitation exercises. Most existing methods only provide a binary classification of whether the performance is correct or incorrect, and a few provide a continuous score. This information is not sufficient for patients to improve their performance. In this work, we propose an algorithm for error classification of rehabilitation exercises, thus making the first step toward more detailed feedback to patients. We focus on skeleton-based exercise assessment, which utilizes human pose estimation to evaluate motion. Inspired by recent algorithms for quality assessment during rehabilitation exercises, we propose a Transformer-based model for the described classification. Our model is inspired by the HyperFormer method for human action recognition, and adapted to our problem and dataset. The evaluation is done on the KERAAL dataset, as it is the only medical dataset with clear error labels for the exercises, and our model significantly surpasses state-of-the-art methods. Furthermore, we bridge the gap towards better feedback to the patients by presenting a way to calculate the importance of joints for each exercise.

[300] Towards a Multimodal Document-grounded Conversational AI System for Education

Karan Taneja,Anjali Singh,Ashok K. Goel

Main category: cs.HC

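TL;DR: MuDoC is a GPT-4o-based multimodal conversational AI system that combines text and images to enhance learning and builds trust through verifiability. Experiments show that multimodal content increased learner engagement and trust but had no significant effect on task performance.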

Details

Motivation: Explore multimodal conversational AI in education, address the limitations of current text-only interaction, and tackle the problem of content verifiability. Method: Build MuDoC on GPT-4o to generate multimodal responses from the text and images in documents, and compare it experimentally with a text-only system. Result: Multimodal content and verifiability increased learner engagement and trust but had no significant effect on task performance. Conclusion: Multimodal conversational AI shows potential in education; further refinement is needed to improve learning outcomes. Abstract: Multimedia learning using text and images has been shown to improve learning outcomes compared to text-only instruction. However, conversational AI systems in education predominantly rely on text-based interactions, while multimodal conversations for multimedia learning remain unexplored. Moreover, deploying conversational AI in learning contexts requires grounding in reliable sources and verifiability to create trust. We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o, that leverages both text and visuals from documents to generate responses interleaved with text and images. Its interface allows verification of AI-generated content through seamless navigation to the source. We compare MuDoC to a text-only system to explore differences in learner engagement, trust in the AI system, and their performance on problem-solving tasks. Our findings indicate that both visuals and verifiability of content enhance learner engagement and foster trust; however, no significant impact on performance was observed. We draw upon theories from cognitive and learning sciences to interpret the findings and derive implications, and outline future directions for the development of multimodal conversational AI systems in education.

Table of Contents

cs.CV

[1] Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

[2] Entropy Rectifying Guidance for Diffusion and Flow Models

[3] Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training

[4] Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

[5] LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

[6] Occlusion-Ordered Semantic Instance Segmentation

[7] Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design

[8] Retinex-guided Histogram Transformer for Mask-free Shadow Removal

[9] VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

[10] Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

[11] Lightweight Road Environment Segmentation using Vector Quantization

[12] BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution

[13] Transforming hyperspectral images into chemical maps: A new deep learning based approach to hyperspectral image processing

[14] HFBRI-MAE: Handcrafted Feature Based Rotation-Invariant Masked Autoencoder for 3D Point Cloud Analysis

[15] Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

[16] Segment Any Crack: Deep Semantic Segmentation Adaptation for Crack Detection

[17] ThyroidEffi 1.0: A Cost-Effective System for High-Performance Multi-Class Thyroid Carcinoma Classification

[18] Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

[19] Segregation and Context Aggregation Network for Real-time Cloud Segmentation

[20] Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization

[21] Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

[22] Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

[23] Revisiting CLIP for SF-OSDA: Unleashing Zero-Shot Potential with Adaptive Threshold and Training-Free Feature Filtering

[24] Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

[25] Single Document Image Highlight Removal via A Large-Scale Real-World Dataset and A Location-Aware Network

[26] ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision

[27] Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

[28] Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

[29] ColorVein: Colorful Cancelable Vein Biometrics

[30] Visual Consensus Prompting for Co-Salient Object Detection

[31] Cross-attention for State-based model RWKV-7

[32] Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction

[33] RAMCT: Novel Region-adaptive Multi-channel Tracker with Iterative Tikhonov Regularization for Thermal Infrared Tracking

[34] CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey

[35] ISTD-YOLO: A Multi-Scale Lightweight High-Performance Infrared Small Target Detection Algorithm

[36] Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

[37] From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion

[38] Balancing Privacy and Action Performance: A Penalty-Driven Approach to Image Anonymization

[39] Exploring Generalizable Pre-training for Real-world Change Detection via Geometric Estimation

[40] FGSGT: Saliency-Guided Siamese Network Tracker Based on Key Fine-Grained Feature Information for Thermal Infrared Target Tracking

[41] DCFG: Diverse Cross-Channel Fine-Grained Feature Learning and Progressive Fusion Siamese Tracker for Thermal Infrared Target Tracking

[42] Visual Prompting for One-shot Controllable Video Editing without Inversion

[43] Multispectral airborne laser scanning for tree species classification: a benchmark of machine learning and deep learning algorithms

[44] Manipulating Multimodal Agents via Cross-Modal Prompt Injection

[45] A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling

[46] Efficient Spiking Point Mamba for Point Cloud Analysis

[47] LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

[48] How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

[49] Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models

[50] SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

[51] Adversarial Attack for RGB-Event based Visual Object Tracking

[52] ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations

[53] ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task

[54] WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

[55] Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

[56] Causal Disentanglement for Robust Long-tail Medical Image Generation

[57] Metamon-GS: Enhancing Representability with Variance-Guided Densification and Light Encoding

[58] LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation

[59] Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

[60] Efficient Implicit Neural Compression of Point Clouds via Learnable Activation in Latent Space

[61] Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation

[62] STARS: Sparse Learning Correlation Filter with Spatio-temporal Regularization and Super-resolution Reconstruction for Thermal Infrared Target Tracking

[63] DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning

[64] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

[65] Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

[66] SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

[67] FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

[68] VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control

[69] Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

[70] SMTT: Novel Structured Multi-task Tracking with Graph-Regularized Sparse Representation for Robust Thermal Infrared Target Tracking

[71] NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

[72] Using street view imagery and deep generative modeling for estimating the health of urban forests

[73] NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results

[74] MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

[75] VM-BHINet:Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image

[76] Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts

[77] MSAD-Net: Multiscale and Spatial Attention-based Dense Network for Lung Cancer Classification

[78] NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation