cs.CV

[1] RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Ruiqi Wang,Hao Zhang

Main category: cs.CV

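TL;DR: RESAnything is an open-vocabulary, zero-shot method for arbitrary referring expression segmentation that uses attribute prompting and Chain-of-Thoughts reasoning to handle object- and part-level labels as well as implicit references.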

Details

Motivation: Handle input expressions that are more general than prior methods were designed for, including object- and part-level labels and implicit references to function, design, style, material, etc.

Method: Uses Chain-of-Thoughts reasoning built on attribute prompting: a large language model is systematically prompted to generate detailed attribute descriptions (shape, color, location) for segment proposals produced by a foundational image segmentation model.

Result: Achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on implicit queries and complex part-level relations.

Conclusion: RESAnything is presented as the first zero-shot, LLM-based RES method; the authors also contribute a new benchmark dataset of ~3K curated instances for assessing part-level, arbitrary RES solutions.

Abstract: We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes including shape, color, and location for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundational image segmentation model. Our approach encourages deep reasoning about object or part attributes related to function, style, design, etc., enabling the system to handle implicit queries without any part annotations for training or fine-tuning. As the first zero-shot and LLM-based RES method, RESAnything achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. Finally, we contribute a new benchmark dataset to offer ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.

[2] StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data

Yuxuan Mu,Hung Yu Ling,Yi Shi,Ismael Baira Ojeda,Pengcheng Xi,Chang Shu,Fabio Zinno,Xue Bin Peng

Main category: cs.CV

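TL;DR: StableMotion trains motion cleanup models directly from unpaired corrupted data by introducing motion quality indicators, which allow training on mixed-quality data and substantially reduce motion artifacts.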

Details

Motivation: Motion capture data often contains visual artifacts caused by inaccurate sensors and post-processing, and cleaning it up typically requires substantial, costly manual effort. Existing data-driven methods rely on paired corrupted-to-clean data, which is itself laborious to obtain.

Method: Introduces motion quality indicators, annotated through manual labeling or heuristic algorithms, to train quality-aware motion generation models. A diffusion-based framework yields a unified generate-discriminate model that can both identify and fix corrupted frames.

Result: On the SoccerMocap dataset, the trained model reduces motion pops by 68% and frozen frames by 81%.

Conclusion: StableMotion offers a simple yet effective way to train motion cleanup models directly from unpaired corrupted data.

Abstract: Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated through manual labeling or heuristic algorithms and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. See https://youtu.be/3Y7MMAH02B4 for more results.
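The StableMotion abstract says quality indicators can be annotated with heuristic algorithms but does not describe them. The following is a minimal sketch of what such a heuristic might look like, assuming joint positions in meters; the function name `quality_indicators` and both thresholds are hypothetical and are not taken from the paper.

```python
import numpy as np

def quality_indicators(positions, fps=30.0, freeze_eps=1e-4, pop_speed=10.0):
    """Label each frame 1 (looks clean) or 0 (suspected artifact).

    positions: array of shape (T, J, 3) with joint positions in meters.
    freeze_eps and pop_speed are illustrative thresholds, not published values.
    """
    # Per-joint displacement between consecutive frames, shape (T-1, J).
    step = np.linalg.norm(np.diff(positions, axis=0), axis=-1)
    max_step = step.max(axis=1)            # largest joint displacement per frame pair
    frozen = max_step < freeze_eps         # nothing moved: suspected frozen frame
    popped = max_step * fps > pop_speed    # implausible joint speed: suspected pop
    bad = np.zeros(len(positions), dtype=bool)
    bad[1:] = frozen | popped              # the first frame has no predecessor
    return (~bad).astype(int)
```

Labels of this kind could serve as the per-frame conditioning signal the abstract describes, with the model prompted for "clean" at test time.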

[3] Gone With the Bits: Revealing Racial Bias in Low-Rate Neural Compression for Facial Images

Tian Qiu,Arjun Nichani,Rasta Tadayontahmasebi,Haewon Jeong

Main category: cs.CV

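TL;DR: The paper presents a framework for evaluating bias in neural image compression models, finds that traditional distortion metrics fail to capture bias and that all examined models exhibit racial bias, and shows that a balanced training set only partially mitigates it.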

Details

Motivation: Neural compression methods perform well even at extremely low bitrates, but bias introduced during training may lead to unfair outcomes, which calls for systematic evaluation and mitigation.

Method: Proposes a general, structured, scalable framework and analyzes nine popular models and their variants, assessing racial bias through facial phenotype degradation in reconstructed images.

Result: Racial bias is present in all examined models and is not captured by traditional distortion metrics. A racially balanced training set reduces bias but is not sufficient. The bias can be attributed to both the compression model and the classification model.

Conclusion: The work is a first step toward evaluating and eliminating bias in neural image compression models.

Abstract: Neural compression methods are gaining popularity due to their superior rate-distortion performance over traditional methods, even at extremely low bitrates below 0.1 bpp. As deep learning architectures, these models are prone to bias during the training process, potentially leading to unfair outcomes for individuals in different groups. In this paper, we present a general, structured, scalable framework for evaluating bias in neural image compression models. Using this framework, we investigate racial bias in neural compression algorithms by analyzing nine popular models and their variants. Through this investigation, we first demonstrate that traditional distortion metrics are ineffective in capturing bias in neural compression models. Next, we highlight that racial bias is present in all neural compression models and can be captured by examining facial phenotype degradation in image reconstructions. We then examine the relationship between bias and realism in the decoded images and demonstrate a trade-off across models. Finally, we show that utilizing a racially balanced training set can reduce bias but is not a sufficient bias mitigation strategy. We additionally show the bias can be attributed to compression model bias and classification model bias. We believe that this work is a first step towards evaluating and eliminating bias in neural image compression models.

[4] Generating Narrated Lecture Videos from Slides with Synchronized Highlights

Alexander Holmberg

Main category: cs.CV

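TL;DR: The paper presents an end-to-end system that automatically turns static slides into video lectures with AI-generated narration and synchronized dynamic visual highlights, at a small fraction of manual production cost.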

Details

Motivation: Turning static slides into video lectures is time-consuming and requires presenters to record explanations and visually guide the audience.

Method: A novel highlight alignment module maps spoken phrases to slide locations using several strategies (e.g., Levenshtein distance, LLM-based semantic analysis) and uses timestamp-providing TTS to synchronize narration with the highlights.

Result: On a manually annotated dataset of 1000 samples, LLM-based alignment achieves high location accuracy (F1 > 92%), and the generation cost averages under $1 per hour of video.

Conclusion: The system is accurate and inexpensive, making it a practical and scalable tool for turning static slides into video lectures.

Abstract: Turning static slides into engaging video lectures takes considerable time and effort, requiring presenters to record explanations and visually guide their audience through the material. We introduce an end-to-end system designed to automate this process entirely. Given a slide deck, this system synthesizes a video lecture featuring AI-generated narration synchronized precisely with dynamic visual highlights. These highlights automatically draw attention to the specific concept being discussed, much like an effective presenter would. The core technical contribution is a novel highlight alignment module. This module accurately maps spoken phrases to locations on a given slide using diverse strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularities (line or word level) and utilizes timestamp-providing Text-to-Speech (TTS) for timing synchronization. We demonstrate the system's effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples, finding that LLM-based alignment achieves high location accuracy (F1 > 92%), significantly outperforming simpler methods, especially on complex, math-heavy content. Furthermore, the calculated generation cost averages under $1 per hour of video, offering potential savings of two orders of magnitude compared to conservative estimates of manual production costs. This combination of high accuracy and extremely low cost positions this approach as a practical and scalable tool for transforming static slides into effective, visually-guided video lectures.
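The Levenshtein-based, line-level alignment strategy named in entry [4] can be illustrated with a small sketch. This is not the authors' module: `align_phrase` and the length-normalized score are assumptions made here for illustration, and only the edit-distance recurrence itself is standard.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def align_phrase(phrase: str, slide_lines: list[str]) -> int:
    """Return the index of the slide line closest to the spoken phrase.

    Hypothetical helper: the distance is normalized by the longer string so
    that short and long lines are compared on the same scale.
    """
    def score(line: str) -> float:
        a, b = phrase.lower(), line.lower()
        return levenshtein(a, b) / max(len(a), len(b), 1)

    return min(range(len(slide_lines)), key=lambda i: score(slide_lines[i]))
```

A purely lexical score like this cannot match a spoken paraphrase of a formula to its written form, which is consistent with the abstract's report that LLM-based alignment does better on math-heavy slides.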

[5] Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation

Anjila Budathoki,Manish Dhakal

Main category: cs.CV

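TL;DR: This paper studies the robustness of vision-language segmentation models (VLSMs) to adversarial attacks in medical image analysis; after fine-tuning the models and applying PGD and FGSM attacks, it finds significant performance drops.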

Details

Motivation: Adversarial attacks have been studied for computer vision and vision-language models, but remain under-explored for VLSMs, especially in medical image analysis.

Method: Fine-tunes pre-trained VLSMs with adapters and applies PGD and FGSM adversarial attacks to evaluate their robustness.

Result: The attacks cause significant drops in DSC and IoU scores; no universal perturbation was found for the medical images.

Conclusion: VLSMs are vulnerable to adversarial attacks in medical image analysis, and further work is needed to improve their robustness.

Abstract: Adversarial attacks have been fairly explored for computer vision and vision-language models. However, the avenue of adversarial attack for the vision language segmentation models (VLSMs) is still under-explored, especially for medical image analysis. Thus, we have investigated the robustness of VLSMs against adversarial attacks for 2D medical images with different modalities, including radiology, photography, and endoscopy. The main idea of this project was to assess the robustness of the fine-tuned VLSMs, especially in the medical domain setting, to address the high-risk scenario. First, we have fine-tuned pre-trained VLSMs for medical image segmentation with adapters. Then, we have employed adversarial attacks (projected gradient descent (PGD) and fast gradient sign method (FGSM)) on that fine-tuned model to determine its robustness against adversaries. We have reported models' performance decline to analyze the adversaries' impact. The results exhibit significant drops in the DSC and IoU scores after the introduction of these adversaries. Furthermore, we also explored universal perturbation but were not able to find one for the medical images. Code: https://github.com/anjilab/secure-private-ai
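For readers unfamiliar with the two attacks in entry [5], the sketch below applies FGSM and PGD to a toy logistic-regression classifier. The toy model is a stand-in chosen here so the input gradient has a closed form; the paper attacks fine-tuned VLSMs, and none of the names or settings below come from its code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on a single input x."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method: one step of size eps along sign(dL/dx)."""
    grad_x = (sigmoid(x @ w + b) - y) * w   # closed-form input gradient
    return x + eps * np.sign(grad_x)

def pgd(w, b, x, y, eps, step, iters):
    """Projected gradient descent: repeated small FGSM steps, each followed
    by projection back onto the L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = fgsm(w, b, x_adv, y, step)
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

For a segmentation network the same update applies, with the input gradient obtained by automatic differentiation of the segmentation loss instead of a closed form.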

[6] Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking

Daniela Ruiz,Paula Cardenas,Leonardo Manrique,Daniela Vega,Gabriel Mejia,Pablo Arbelaez

Main category: cs.CV

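TL;DR: SpaRED and SpaCKLE provide a standardized database and a transformer-based gene expression completion model for Spatial Transcriptomics, substantially improving gene expression prediction results.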

Details

Motivation: Visium is costly and requires specialized expertise, and gene capture inefficiencies cause dropout that corrupts the data. Existing deep learning models are hard to compare fairly because of inconsistent datasets, preprocessing, and training protocols.

Method: Introduces SpaRED, a standardized database of 26 public datasets, and SpaCKLE, a transformer-based gene expression completion model.

Result: SpaCKLE reduces mean squared error by over 82.5% compared to existing approaches and substantially improves the results of all eight prediction models evaluated on the SpaRED benchmark.

Conclusion: SpaRED and SpaCKLE provide a comprehensive benchmark and a stepping stone for future research on Spatial Transcriptomics.

Abstract: Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.

[7] NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

Nikolay Safonov,Alexey Bryncev,Andrey Moskalenko,Dmitry Kulikov,Dmitry Vatolin,Radu Timofte,Haibo Lei,Qifan Gao,Qing Luo,Yaqing Li,Jie Song,Shaozhe Hao,Meisong Zheng,Jingyi Xu,Chengbin Wu,Jiahui Liu,Ying Chen,Xin Deng,Mai Xu,Peipei Liang,Jie Ma,Junjie Jin,Yingxue Pang,Fangzhou Luo,Kai Chen,Shijie Zhao,Mingyang Wu,Renjie Li,Yushen Zuo,Shengyun Zhong,Zhengzhong Tu

Main category: cs.CV

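TL;DR: The NTIRE 2025 challenge focused on enhancing user-generated content (UGC) videos without reference ground truth; it attracted more than 25 teams, 7 of which passed the final phase with source code verification.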

Details

Motivation: UGC videos are widely used on short-form video platforms, but their quality often suffers from noise, blur, and similar degradations, so effective enhancement methods are needed.

Method: The challenge provided 150 UGC videos without reference ground truth. Participants developed algorithms to improve their visual quality, and evaluation was based on subjective votes from over 8000 crowdsourced assessors.

Result: Seven teams passed the final phase; the outcomes shed light on the state of the art and effective strategies in UGC video enhancement.

Conclusion: The challenge provides data and insights for the UGC video enhancement field, and all related resources are publicly available.

Abstract: This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.

[8] GIF: Generative Inspiration for Face Recognition at Scale

Saeed Ebrahimi,Sahar Rahimi,Ali Dabouei,Srinjoy Das,Jeremy M. Dawson,Nasser M. Nasrabadi

Main category: cs.CV

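TL;DR: Proposes replacing scalar labels with structured identity codes, making the Softmax-related computational cost in face recognition logarithmic rather than linear in the number of identities.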

Details

Motivation: Reduce the computational cost of Softmax over massive label spaces; existing subset-based methods keep the cost linear in the number of identities, only with a reduced ratio.

Method: Converts atomic scalar labels into structured identity codes (sequences of integers) through a tokenization scheme and trains the backbone to predict the code instead of the scalar label, so that the cost grows logarithmically with the number of identities.

Result: Outperforms competitors by 1.52% on IJB-B and 0.6% on IJB-C at TAR@FAR=1e-4, while reducing the cost from linear to logarithmic.

Conclusion: Structured identity codes reduce computational cost and improve recognition performance.

Abstract: Aiming to reduce the computational cost of Softmax in massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the association between the computation cost and the number of identities in the dataset remains linear only with a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, the input to label matching is through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity code, i.e., a sequence of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR=1e-4 on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at https://github.com/msed-Ebrahimi/GIF
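To see why a sequence-of-integers label can make the cost logarithmic in entry [8], consider the simplest possible tokenization: writing the label in base `vocab_size`. The abstract does not specify the paper's tokenization scheme, so the positional encoding below is only an illustrative assumption, and all function names are hypothetical.

```python
def code_length(num_identities: int, vocab_size: int) -> int:
    """Smallest L with vocab_size ** L >= num_identities."""
    length = 1
    while vocab_size ** length < num_identities:
        length += 1
    return length

def label_to_code(label: int, num_identities: int, vocab_size: int) -> list[int]:
    """Encode a scalar label as a fixed-length base-`vocab_size` digit sequence."""
    if not 0 <= label < num_identities:
        raise ValueError("label out of range")
    digits = []
    for _ in range(code_length(num_identities, vocab_size)):
        digits.append(label % vocab_size)
        label //= vocab_size
    return digits[::-1]   # most significant digit first

def code_to_label(code: list[int], vocab_size: int) -> int:
    """Inverse of label_to_code."""
    label = 0
    for d in code:
        label = label * vocab_size + d
    return label
```

Under this assumption, one million identities with a 256-way vocabulary need 3 tokens, so a classifier head scores 3 x 256 = 768 outputs rather than 1,000,000 class centroids.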

[9] Lesion-Aware Generative Artificial Intelligence for Virtual Contrast-Enhanced Mammography in Breast Cancer

Aurora Rofena,Arianna Manchia,Claudia Lucia Piccolo,Bruno Beomonte Zobel,Paolo Soda,Valerio Guarrasi

Main category: cs.CV

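TL;DR: Seg-CycleGAN is a generative deep learning framework for virtual contrast enhancement in Contrast-Enhanced Spectral Mammography (CESM); it synthesizes high-fidelity dual-energy subtracted images from low-energy images, aiming to reduce radiation exposure and contrast agent use.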

Details

Motivation: CESM offers high diagnostic accuracy but entails higher radiation exposure and potential contrast-medium side effects, which motivates a contrast-free alternative.

Method: Proposes Seg-CycleGAN, which builds on the CycleGAN architecture, uses lesion segmentation maps to guide generation, and adds localized loss terms to improve reconstruction of lesion areas.

Result: On the CESM@UCBM dataset, Seg-CycleGAN outperforms the baseline in PSNR and SSIM while maintaining competitive MSE and VIF; qualitative evaluation also shows improved lesion fidelity.

Conclusion: Segmentation-aware generative models offer a viable pathway toward contrast-free CESM alternatives.

Abstract: Contrast-Enhanced Spectral Mammography (CESM) is a dual-energy mammographic technique that improves lesion visibility through the administration of an iodinated contrast agent. It acquires both a low-energy image, comparable to standard mammography, and a high-energy image, which are then combined to produce a dual-energy subtracted image highlighting lesion contrast enhancement. While CESM offers superior diagnostic accuracy compared to standard mammography, its use entails higher radiation exposure and potential side effects associated with the contrast medium. To address these limitations, we propose Seg-CycleGAN, a generative deep learning framework for Virtual Contrast Enhancement in CESM. The model synthesizes high-fidelity dual-energy subtracted images from low-energy images, leveraging lesion segmentation maps to guide the generative process and improve lesion reconstruction. Building upon the standard CycleGAN architecture, Seg-CycleGAN introduces localized loss terms focused on lesion areas, enhancing the synthesis of diagnostically relevant regions. Experiments on the CESM@UCBM dataset demonstrate that Seg-CycleGAN outperforms the baseline in terms of PSNR and SSIM, while maintaining competitive MSE and VIF. Qualitative evaluations further confirm improved lesion fidelity in the generated images. These results suggest that segmentation-aware generative models offer a viable pathway toward contrast-free CESM alternatives.
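The "localized loss terms focused on lesion areas" in entry [9] are not spelled out in the abstract. One plausible form, shown here purely as an assumption, adds a mask-restricted L1 term to a global L1 reconstruction term; `lesion_aware_l1` and the weight `lam` are hypothetical names, not the paper's.

```python
import numpy as np

def lesion_aware_l1(pred, target, lesion_mask, lam=2.0):
    """Global L1 plus a weighted L1 restricted to the lesion mask.

    pred, target: images of identical shape; lesion_mask: binary array of the
    same shape (1 inside lesions). lam is an illustrative weight.
    """
    abs_err = np.abs(pred - target)
    global_l1 = abs_err.mean()
    lesion_pixels = lesion_mask.sum()
    # Average the error over lesion pixels only; skip the term if no lesion.
    local_l1 = (abs_err * lesion_mask).sum() / lesion_pixels if lesion_pixels > 0 else 0.0
    return global_l1 + lam * local_l1
```

Because lesions cover few pixels, a global average alone barely penalizes errors inside them; the masked term restores that signal.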

[10] An Explainable Anomaly Detection Framework for Monitoring Depression and Anxiety Using Consumer Wearable Devices

Yuezhou Zhang,Amos A. Folarin,Callum Stewart,Heet Sankesara,Yatharth Ranjan,Pauline Conde,Akash Roy Choudhury,Shaoxiong Sun,Zulqarnain Rashid,Richard J. B. Dobson

Main category: cs.CV

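TL;DR: The paper presents an explainable anomaly detection framework based on consumer wearable data for early detection of worsening depression and anxiety symptoms.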

Details

Motivation: Continuous monitoring of behavior and physiology with wearable devices offers an objective way to detect worsening depression and anxiety early.

Method: An LSTM autoencoder learns normal patterns of sleep duration, step count, and resting heart rate from 2,023 participants with defined healthy baselines; anomalies are flagged when self-reported symptom scores increase by >=5 points.

Result: The model reaches an adjusted F1-score of 0.80 in detecting symptom-worsening episodes, and resting heart rate is the most influential feature.

Conclusion: The study shows the potential of explainable anomaly detection for personalized, scalable, and proactive mental health monitoring.

Abstract: Continuous monitoring of behavior and physiology via wearable devices offers a novel, objective method for the early detection of worsening depression and anxiety. In this study, we present an explainable anomaly detection framework that identifies clinically meaningful increases in symptom severity using consumer-grade wearable data. Leveraging data from 2,023 participants with defined healthy baselines, our LSTM autoencoder model learned normal health patterns of sleep duration, step count, and resting heart rate. Anomalies were flagged when self-reported depression or anxiety scores increased by >=5 points (a threshold considered clinically significant). The model achieved an adjusted F1-score of 0.80 (precision = 0.73, recall = 0.88) in detecting 393 symptom-worsening episodes across 341 participants, with higher performance observed for episodes involving concurrent depression and anxiety escalation (F1 = 0.84) and for more pronounced symptom changes (>=10-point increases, F1 = 0.85). Model interpretability was supported by SHAP-based analysis, which identified resting heart rate as the most influential feature in 71.4% of detected anomalies, followed by physical activity and sleep. Together, our findings highlight the potential of explainable anomaly detection to enable personalized, scalable, and proactive mental health monitoring in real-world settings.
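As a quick consistency check on the numbers reported in entry [10], the standard harmonic-mean F1 can be computed from the stated precision and recall. The abstract does not define how the "adjusted" F1 is adjusted, so the plain formula below is an assumption; it does reproduce the reported value after rounding. The `is_worsening` helper simply restates the >=5-point labeling rule.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def is_worsening(previous_score: int, current_score: int, threshold: int = 5) -> bool:
    """Labeling rule from the abstract: an increase of at least `threshold` points."""
    return current_score - previous_score >= threshold

reported_f1 = round(f1_score(0.73, 0.88), 2)   # 2 * 0.73 * 0.88 / 1.61 = 0.798...
```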

[11] Estimating the Diameter at Breast Height of Trees in a Forest With a Single 360 Camera

Siming He,Zachary Osman,Fernando Cladera,Dexter Ong,Nitant Rai,Patrick Corey Green,Vijay Kumar,Pratik Chaudhari

Main category: cs.CV

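TL;DR: Proposes a low-cost, semi-automated method for measuring tree diameter at breast height (DBH) with a single 360 camera, with accuracy close to that of LiDAR-based techniques.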

Details

Motivation: LiDAR-based techniques are precise but cost-prohibitive and operationally complex, so a low-cost alternative is needed.

Method: Reconstructs a dense point cloud with Agisoft Metashape, segments trunks by projecting Grounded SAM masks onto the 3D cloud, and estimates cross-section shape and DBH with a RANSAC-based technique.

Result: On 61 acquisitions of 43 trees, the method attains median absolute relative errors of 5-9%, only 2-4% higher than LiDAR-based estimates.

Conclusion: The method is inexpensive, requires minimal setup, and relies on widely available hardware.

Abstract: Forest inventories rely on accurate measurements of the diameter at breast height (DBH) for ecological monitoring, resource management, and carbon accounting. While LiDAR-based techniques can achieve centimeter-level precision, they are cost-prohibitive and operationally complex. We present a low-cost alternative that only needs a consumer-grade 360 video camera. Our semi-automated pipeline comprises (i) a dense point cloud reconstruction using Structure from Motion (SfM) photogrammetry software called Agisoft Metashape, (ii) semantic trunk segmentation by projecting Grounded Segment Anything (SAM) masks onto the 3D cloud, and (iii) a robust RANSAC-based technique to estimate cross section shape and DBH. We introduce an interactive visualization tool for inspecting segmented trees and their estimated DBH. On 61 acquisitions of 43 trees under a variety of conditions, our method attains median absolute relative errors of 5-9% with respect to "ground-truth" manual measurements. This is only 2-4% higher than LiDAR-based estimates, while employing a single 360 camera that costs orders of magnitude less, requires minimal setup, and is widely available.
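Step (iii) of the pipeline in entry [11] can be illustrated with a basic RANSAC circle fit on a horizontal slice of trunk points. The abstract only says the technique is RANSAC-based and estimates cross-section shape, so the circular cross-section, the tolerance, and the function names below are assumptions made for this sketch.

```python
import numpy as np

def circle_from_3_points(p1, p2, p3):
    """Circumcircle (cx, cy, r) of three 2D points, or None if nearly collinear."""
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, float(np.hypot(ax - ux, ay - uy))

def ransac_circle(points, iters=200, tol=0.01, seed=0):
    """Fit a circle to 2D points (shape (N, 2), meters), ignoring outliers.

    Keeps the 3-point model with the most points within `tol` of its circumference.
    """
    rng = np.random.default_rng(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        model = circle_from_3_points(*points[idx])
        if model is None:
            continue
        cx, cy, r = model
        resid = np.abs(np.hypot(points[:, 0] - cx, points[:, 1] - cy) - r)
        n_inliers = int((resid < tol).sum())
        if n_inliers > best_inliers:
            best, best_inliers = model, n_inliers
    return best
```

With a fitted model `(cx, cy, r)`, the DBH estimate is `2 * r`. A production pipeline would typically refine the winning model with a least-squares fit over its inliers, which this sketch omits.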

[12] Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability

Lei Wang,Senmao Li,Fei Yang,Jianye Wang,Ziheng Zhang,Yuhan Liu,Yaxing Wang,Jian Yang

Main category: cs.CV

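TL;DR: The paper proposes MaskUNet, which improves the generation quality of diffusion models by masking (zeroing out) selected U-Net parameters in a timestep- and sample-dependent way, with a negligible number of added parameters.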

Details

Motivation: Diffusion models force the same network layers to learn both structural and textural information, unlike traditional architectures (e.g., ResNet or GANs) that handle semantic information at different layers; this motivates exploring time-wise diffusion models.

Method: Proposes MaskUNet, which uses timestep- and sample-dependent effective U-Net parameters, with two fine-tuning strategies: a training-based approach and a training-free approach.

Result: In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score, and its effectiveness is further shown in downstream tasks.

Conclusion: Dynamically masking U-Net parameters is a simple and effective way to improve generation quality.

Abstract: Diffusion models in early stages focus on constructing basic image structures, while the refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, significantly differing from the traditional deep learning architectures (e.g., ResNet or GANs), which capture or generate the image semantic information at different layers. This difference inspires us to explore the time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving the generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed "MaskUNet", that enhances generation quality with negligible parameter numbers. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations. Project page: https://gudaochangsheng.github.io/MaskUnet-Page/

[13] Image Recognition with Online Lightweight Vision Transformer: A Survey

Zherui Zhang,Rongtao Xu,Jie Zhou,Changwei Wang,Xingtian Pei,Wenhao Xu,Jiguang Zhang,Li Guo,Longxiang Gao,Wenbo Xu,Shibiao Xu

Main category: cs.CV

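TL;DR: This survey reviews online strategies for lightweight vision transformers, focusing on efficient component design, dynamic networks, and knowledge distillation; it evaluates them on ImageNet-1K and discusses future research directions.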

Details

Motivation: The success of the Transformer in natural language processing has motivated its use in computer vision, but vision transformers lack inductive biases and efficiency benefits and face significant computational and memory challenges.

Method: Surveys three lightweighting strategies (efficient component design, dynamic networks, and knowledge distillation) and evaluates them on ImageNet-1K.

Result: Analyzes each strategy's trade-offs among precision, parameter count, throughput, and other factors, summarizing their advantages, disadvantages, and flexibility.

Conclusion: Proposes future research directions and potential challenges for lightweight vision transformers, aiming to provide practical guidance for the community.

Abstract: The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT

[14] Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Teng Zhou,Jax Luo,Yuping Sun,Yiheng Tan,Shun Yao,Nazim Haouchine,Scott Raymond

Main category: cs.CV

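TL;DR: Proposes a path- and bone-contour regularized method for unpaired MRI-to-CT translation that models the mapping as a continuous flow with neural ODEs and improves the accuracy of translated bone structures.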

Details

Motivation: Existing unpaired MRI-to-CT translation methods struggle with anatomical features such as bone structures, which limits their use in radiation therapy, where precise bone representation is essential.

Method: Projects MRI and CT images to a shared latent space, models the mapping as a continuous flow governed by neural ODEs, minimizes the transition path length, and introduces a trainable neural network that generates bone contours from MRI.

Result: Outperforms existing methods on three datasets with lower overall error, and preserves bone structure fidelity better in a downstream bone segmentation task.

Conclusion: The method improves the accuracy of unpaired MRI-to-CT translation, particularly for bone structures.

Abstract: Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: https://github.com/kennysyp/PaBoT.

[15] TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion

Haoyue Liu,Jinghan Xu,Yi Chang,Hanyu Zhou,Haozhi Zhao,Lin Wang,Luxin Yan

Main category: cs.CV

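TL;DR: The paper proposes TimeTracker, a video frame interpolation framework based on continuous point tracking that uses event cameras to handle non-linear motion and improves interpolation quality.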

Details

Motivation: The high temporal resolution of event cameras is not fully exploited, and existing methods suffer from motion errors when handling non-linear motion.

Method: Designs a Scene-Aware Region Segmentation (SARS) module and a Continuous Trajectory guided Motion Estimation (CTME) module, and generates intermediate frames by tracking the continuous motion trajectories of local regions.

Result: Experiments show that the method outperforms prior work in both motion estimation and frame interpolation quality.

Conclusion: By tracking points continuously, TimeTracker handles non-linear motion effectively and improves video frame interpolation.

Abstract: Video frame interpolation (VFI) that leverages the bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than the frame-based methods, thanks to the event cameras' advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion, caused by the dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse optical flow or fuse events with image features to estimate dense optical flow. Unfortunately, motion errors often degrade the VFI quality as the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we find that object motion is continuous in space and that tracking local regions over continuous time enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated through global motion optimization and frame refinement. Moreover, we collect a real-world dataset that features fast non-linear motion. Extensive experiments show that our method outperforms prior art in both motion estimation and frame interpolation quality.

[16] VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis

Xinyuan Yan,Xiwei Xuan,Jorge Piazentin Ono,Jiajing Guo,Vikram Mohanty,Shekar Arvind Kumar,Liang Gou,Bei Wang,Liu Ren

Main category: cs.CV

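TL;DR: VISLIX is a visual analytics framework that uses foundation models to help experts analyze data slices of computer vision models, without requiring additional metadata and with support for interactive hypothesis testing.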

Details

Motivation: Machine learning models in safety-critical domains such as autonomous driving and surveillance need rigorous evaluation, but existing data slicing methods depend on metadata, fall short on some vision tasks, and lack interactivity.

Method: Proposes VISLIX, which uses foundation models to automatically generate natural language insights and supports interactive testing of data slice hypotheses.

Result: An expert study and three use cases demonstrate the effectiveness of VISLIX for validating object detection models.

Conclusion: VISLIX overcomes limitations of existing data slicing methods and provides a more effective tool for validating computer vision models.

Abstract: Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert's domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypotheses and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypotheses interactively. We evaluate VISLIX with an expert study and three use cases that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.

[17] Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control

Sajjad Rezvani Boroujeni,Hossein Abedi,Tom Bush

Main category: cs.CV

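TL;DR: The paper uses DDPMs to generate synthetic images of defective glass products, addressing class imbalance in industrial glass manufacturing and substantially improving the classification performance of CNN models.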

Details

Motivation: In industrial glass manufacturing, defective products are rare, which leads to imbalanced datasets that limit the performance of deep learning models for visual defect detection.

Method: Uses Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic defect images for data augmentation, increasing the representation of the minority class.

Result: Recall for defective samples improves across all tested CNN architectures while precision remains perfect; ResNet50V2's overall classification accuracy rises from 78% to 93%.

Conclusion: The approach is a scalable, cost-effective way to improve defect detection in glass manufacturing and may extend to other industries with similar class imbalance.

Abstract: Visual defect detection in industrial glass manufacturing remains a critical challenge due to the low frequency of defective products, leading to imbalanced datasets that limit the performance of deep learning models and computer vision systems. This paper presents a novel approach using Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic defective glass product images for data augmentation, effectively addressing class imbalance issues in manufacturing quality control and automated visual inspection. The methodology significantly enhances image classification performance of standard CNN architectures (ResNet50V2, EfficientNetB0, and MobileNetV2) in detecting anomalies by increasing the minority class representation. Experimental results demonstrate substantial improvements in key machine learning metrics, particularly in recall for defective samples across all tested deep neural network architectures while maintaining perfect precision. The most dramatic improvement was observed in ResNet50V2's overall classification accuracy, which increased from 78 percent to 93 percent when trained with the augmented data. This work provides a scalable, cost-effective approach to enhancing automated defect detection in glass manufacturing that can potentially be extended to other industrial quality assurance systems and industries with similar class imbalance challenges.

[18] Motion-compensated cardiac MRI using low-rank diffeomorphic flow (DMoCo)

Joseph William Kettelkamp,Ludovica Romanin,Sarv Priya,Mathews Jacob

Main category: cs.CV

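TL;DR: Proposes an unsupervised motion-compensated image reconstruction algorithm for free-breathing, ungated 3D cardiac MRI that uses a low-rank model to represent the deformations across motion phases.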

Details

Motivation: Address motion artifacts in free-breathing, ungated 3D cardiac MRI and improve image reconstruction quality. Method: Represents the image volume at each motion phase as a deformation of a static template, models the family of deformations with a low-rank model, and obtains the deformation for a specific phase by integrating a parametric velocity field. Result: The proposed low-rank motion model outperforms existing motion-resolved and motion-compensated algorithms on free-breathing 3D cine MRI. Conclusion: By learning the static template and the low-rank motion model in an unsupervised fashion, the algorithm markedly improves image reconstruction. Abstract: We introduce an unsupervised motion-compensated image reconstruction algorithm for free-breathing and ungated 3D cardiac magnetic resonance imaging (MRI). We express the image volume corresponding to each specific motion phase as the deformation of a single static image template. The main contribution of the work is the low-rank model for the compact joint representation of the family of diffeomorphisms, parameterized by the motion phases. The diffeomorphism at a specific motion phase is obtained by integrating a parametric velocity field along a path connecting the reference template phase to the motion phase. The velocity field at different phases is represented using a low-rank model. The static template and the low-rank motion model parameters are learned directly from the k-space data in an unsupervised fashion. The more constrained motion model is observed to offer improved recovery compared to current motion-resolved and motion-compensated algorithms for free-breathing 3D cine MRI.
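The idea of obtaining a deformation by integrating a velocity field can be illustrated in one dimension. The following is a minimal sketch, not the authors' implementation: the function names, the forward-Euler scheme, and the way the low-rank velocity is written as a sum of a few spatial basis fields weighted by time-dependent coefficients are all assumptions made for illustration.

```python
def low_rank_velocity(basis, coeffs):
    """Build v(x, t) = sum_k coeffs[k](t) * basis[k](x).

    Hypothetical low-rank form: a handful of spatial basis fields shared
    across all phases, mixed by phase-dependent scalar coefficients.
    """
    return lambda x, t: sum(c(t) * b(x) for c, b in zip(coeffs, basis))


def integrate_flow(x0, velocity, n_steps=100):
    """Forward-Euler integration of dx/dt = velocity(x, t) from t=0 to t=1.

    Returns the position that the template point x0 is carried to.
    """
    x = x0
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        x = x + dt * velocity(x, t)
    return x


# Two basis fields (a constant shift and a linear stretch); only the shift is active.
v = low_rank_velocity(
    basis=[lambda x: 1.0, lambda x: x],
    coeffs=[lambda t: 0.5, lambda t: 0.0],
)
warped = integrate_flow(2.0, v)  # the point at x=2 is shifted by about 0.5
```

In practice the field would be three-dimensional and the integration would be applied to every voxel coordinate; the sketch only shows why a smooth velocity field yields a smooth, invertible warp.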

[19] Robust Fairness Vision-Language Learning for Medical Image Analysis

Sparsh Bansal,Mingyang Wu,Xin Wang,Shu Hu

Main category: cs.CV

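TL;DR: This paper proposes a framework for ensuring the fairness and robustness of vision-language models (VLMs) in medical image analysis. It adjusts the loss function using a Dynamic Bad Pair Mining algorithm and the Sinkhorn distance, and experiments show up to an 8.6% improvement in equity-scaled AUC.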

Details

Motivation: In medical image analysis, the fairness and robustness of vision-language models are essential to ensure that a model holds up for every patient. Method: Proposes a framework that adjusts the loss function through a Dynamic Bad Pair Mining algorithm and uses the Sinkhorn distance to keep the loss distributions of protected groups consistent with the overall loss. Result: Experiments show up to an 8.6% improvement in equity-scaled AUC. Conclusion: The framework effectively improves the fairness and robustness of VLMs in medical image analysis. Abstract: The advent of Vision-Language Models (VLMs) in medical image analysis has the potential to help process multimodal inputs and increase performance over traditional inference methods. However, when considering the domain in which these models will be implemented, fairness and robustness are important to ensure the model stays true for any patient. In this paper, we introduce a framework for ensuring robustness and fairness of VLMs. This framework modifies the loss function at training by identifying and adjusting faulty image-text pairs through a Dynamic Bad Pair Mining algorithm and also utilizing Sinkhorn distance to ensure the loss distributions of protected groups do not deviate from the total loss. Experimental testing of our framework shows up to an 8.6% improvement when looking at equity-scaled AUC.
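For readers unfamiliar with the Sinkhorn distance mentioned above, here is a minimal sketch of the standard Sinkhorn iterations for entropy-regularised optimal transport between two histograms. It is a generic textbook version, not the paper's training code; the histogram binning, the squared-difference cost, and the values of `eps` and `n_iters` are assumptions chosen for the example.

```python
import math


def sinkhorn_distance(a, b, cost, eps=0.5, n_iters=200):
    """Transport cost of the entropy-regularised plan between histograms a and b.

    a, b : lists of non-negative weights, each summing to 1
    cost : cost[i][j] is the cost of moving mass from bin i of a to bin j of b
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is diag(u) K diag(v); return its total transport cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


# Hypothetical loss histograms over three loss bins.
cost = [[(i - j) ** 2 for j in range(3)] for i in range(3)]
overall = [0.7, 0.2, 0.1]
group_close = [0.7, 0.2, 0.1]   # same distribution as the overall losses
group_far = [0.1, 0.2, 0.7]     # losses concentrated in the highest bin
d_same = sinkhorn_distance(overall, group_close, cost)
d_far = sinkhorn_distance(overall, group_far, cost)
```

A group whose loss histogram drifts away from the overall one receives a larger distance, which is the kind of signal a fairness penalty could use.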

[20] RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

Sameer Malik,Moyuru Yamada,Ayush Singh,Dishank Aggarwal

Main category: cs.CV

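TL;DR: RAVU combines retrieval augmentation with reasoning over a spatio-temporal graph to address the difficulty LMMs have with long videos, significantly improving performance on video question answering.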

Details

Motivation: Current LMMs lack explicit memory and retrieval mechanisms, so they struggle with long videos, which limits their use in complex video understanding tasks. Method: Builds a spatio-temporal graph representation of the video that serves as long-term memory, then decomposes each query into multi-step reasoning that is executed on the graph to perform retrieval. Result: On the NExT-QA and EgoSchema datasets, RAVU outperforms other SOTA methods while retrieving only 5-10 frames. Conclusion: Through retrieval augmentation and spatio-temporal graph reasoning, RAVU significantly improves the accuracy and efficiency of long-video understanding. Abstract: Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos that are even minutes to hours long due to their lack of explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrates superior performance with limited retrieved frames (5-10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.

[21] seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Hafez Ghaemi,Eilif Muller,Shahab Bakhtiari

Main category: cs.CV

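TL;DR: seq-JEPA is a world modeling paradigm based on the joint-embedding predictive architecture. By processing a sequence of different views of an image, it learns invariant and equivariant representations simultaneously, addressing the limited task adaptability of the existing two-view paradigm.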

Details

Motivation: Current self-supervised algorithms mainly rely on transformations such as data augmentation and masking to learn visual representations, but the two-view paradigm limits the flexibility of the representations for downstream tasks and creates a performance trade-off between invariance-related and equivariance-related tasks. Method: seq-JEPA processes a sequence of different views of an input image together with embeddings of the relative transformations, uses a transformer encoder to produce an aggregate representation of the sequence, and predicts the representation of the next view, without an additional equivariance predictor or loss term. Result: seq-JEPA performs strongly on equivariant benchmarks and image classification without sacrificing either, and it stands out on tasks that require sequence aggregation, such as path integration and predictive learning. Conclusion: With its joint-embedding predictive architecture, seq-JEPA effectively resolves the trade-off between invariance and equivariance tasks and offers a more flexible approach to visual representation learning. Abstract: Current self-supervised algorithms mostly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by inducing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm can limit the flexibility of learned representations for downstream adaptation by creating performance trade-offs between invariance-related tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we introduce seq-JEPA, a world modeling paradigm based on joint-embedding predictive architecture that leverages architectural inductive biases to resolve this trade-off. Without requiring an additional equivariance predictor or loss term, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to the specified transformations and another invariant to them and suited for tasks such as classification. To do so, our model processes a short sequence of different views (observations) of an input image. Each encoded view is concatenated with embeddings corresponding to the relative transformation (action) producing the next observation in the sequence. A transformer encoder outputs an aggregate representation of this sequence, which is subsequently conditioned on the action leading to the next observation to predict its representation. Empirically, seq-JEPA achieves strong performance on equivariant benchmarks and image classification without sacrificing one for the other. Additionally, our framework excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.

[22] Interactive Instance Annotation with Siamese Networks

Xiang Xu,Ruotong Li,Mengjun Yi,Baile XU,Furao Shen,Jian Zhao

Main category: cs.CV

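TL;DR: SiamAnno is a Siamese-network-based framework for cross-domain instance annotation. It uses one-shot learning to predict object boundaries and achieves SOTA performance on multiple datasets without fine-tuning.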

Details

Motivation: Instance annotation is time-consuming and labor-intensive, and most existing methods are limited to in-domain scenarios and struggle with cross-domain tasks. Method: Uses a Siamese network and one-shot learning to predict object boundaries from an input bounding box, with support for user adjustment. Result: Without fine-tuning, SiamAnno outperforms existing methods on multiple datasets. Conclusion: SiamAnno is the first model to explore a Siamese architecture for instance annotation and provides a strong baseline for future research. Abstract: Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tracking. SiamAnno leverages one-shot learning to annotate previously unseen objects by taking a bounding box as input and predicting object boundaries, which can then be adjusted by annotators. Trained on one dataset and tested on another without fine-tuning, SiamAnno achieves state-of-the-art (SOTA) performance across multiple datasets, demonstrating its ability to handle domain and environment shifts in cross-domain tasks. We also provide more comprehensive results compared to previous work, establishing a strong baseline for future research. To our knowledge, SiamAnno is the first model to explore Siamese architecture for instance annotation.

[23] PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models

Chang Xie,Chenyi Zhuang,Pan Gao

Main category: cs.CV

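TL;DR: PiCo (Pick-and-Control) is a training-free text-to-image generation method that improves text-image alignment through a noise selection module and a referring mask module.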

Details

Motivation: Existing diffusion models struggle to achieve text-image alignment for complex text prompts, mainly because of the quality of the randomly initialized noise and the reliability of the generated controlling mask. Method: PiCo comprises a noise selection module, which assesses noise quality using fast sampling, and a referring mask module, which generates pixel-level masks and modulates the cross-attention maps. Result: Experiments show that PiCo relieves users of the tedium of repeated random generation and significantly improves text-image alignment across diverse text descriptions. Conclusion: Through noise selection and mask modulation, PiCo markedly improves generation quality and alignment for complex text prompts. Abstract: Advanced diffusion models have made notable progress in text-to-image compositional generation. However, it is still a challenge for existing models to achieve text-image alignment when confronted with complex text prompts. In this work, we highlight two factors that affect this alignment: the quality of the randomly initialized noise and the reliability of the generated controlling mask. We then propose PiCo (Pick-and-Control), a novel training-free approach with two key components to tackle these two factors. First, we develop a noise selection module to assess the quality of the random noise and determine whether the noise is suitable for the target text. A fast sampling strategy is utilized to ensure efficiency in the noise selection stage. Second, we introduce a referring mask module to generate pixel-level masks and to precisely modulate the cross-attention maps. The referring mask is applied to the standard diffusion process to guide the reasonable interaction between text and image features. Extensive experiments have been conducted to verify the effectiveness of PiCo in liberating users from the tedious process of random generation and in enhancing the text-image alignment for diverse text descriptions.

[24] DCS-ST for Classification of Breast Cancer Histopathology Images with Limited Annotations

Liu Suxing,Byungwon Min

Main category: cs.CV

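TL;DR: Deep learning performs well at classifying breast cancer histopathology images, but its performance declines when annotated data is limited.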

Details

Motivation: Address the scarcity and high cost of annotated data in medical imaging. Method: Uses deep learning methods. Result: Performance declines when annotated data is limited. Conclusion: Improved methods are needed to cope with data scarcity. Abstract: Deep learning methods have shown promise in classifying breast cancer histopathology images, but their performance often declines with limited annotated data, a critical challenge in medical imaging due to the high cost and expertise required for annotations.

[25] Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data

Shaheer Mohamed,Tharindu Fernando,Sridha Sridharan,Peyman Moghadam,Clinton Fookes

Main category: cs.CV

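TL;DR: Proposes SFMIM, a self-supervised pretraining method that uses a dual-domain (spatial and frequency) masking mechanism to exploit unlabeled HSI data and improve classification performance.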

Details

Motivation: The scarcity of labeled HSI data limits the potential of deep learning, especially for transformer architectures that require large-scale training. Method: Proposes SFMIM, which combines spatial and frequency masking and learns higher-order spectral-spatial correlations by reconstructing the masked inputs. Result: Achieves SOTA performance on three public HSI classification benchmarks and converges quickly during fine-tuning. Conclusion: SFMIM makes effective use of unlabeled data and significantly improves HSI classification performance. Abstract: Hyperspectral images (HSIs) capture rich spectral signatures that reveal vital material properties, offering broad applicability across various domains. However, the scarcity of labeled HSI data limits the full potential of deep learning, especially for transformer-based architectures that require large-scale training. To address this constraint, we propose Spatial-Frequency Masked Image Modeling (SFMIM), a self-supervised pretraining strategy for hyperspectral data that utilizes the large portion of unlabeled data. Our method introduces a novel dual-domain masking mechanism that operates in both spatial and frequency domains. The input HSI cube is initially divided into non-overlapping patches along the spatial dimension, with each patch comprising the entire spectrum of its corresponding spatial location. In spatial masking, we randomly mask selected patches and train the model to reconstruct the masked inputs using the visible patches. Concurrently, in frequency masking, we remove portions of the frequency components of the input spectra and predict the missing frequencies. By learning to reconstruct these masked components, the transformer-based encoder captures higher-order spectral-spatial correlations. We evaluate our approach on three publicly available HSI classification benchmarks and demonstrate that it achieves state-of-the-art performance. Notably, our model shows rapid convergence during fine-tuning, highlighting the efficiency of our pretraining strategy.
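The two masking operations described above can be sketched on toy data. This is an illustrative sketch only: the function names, the naive DFT, the mask ratio, and the choice to zero a bin together with its conjugate are assumptions, and the abstract does not say how the authors select patches or frequencies.

```python
import cmath
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Spatial masking: return a boolean list, True where a patch is hidden."""
    n_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), n_masked))
    return [i in masked for i in range(num_patches)]


def dft(x):
    """Naive discrete Fourier transform (fine for short toy spectra)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def mask_frequencies(spectrum, masked_bins):
    """Frequency masking: zero the chosen DFT bins of a 1-D spectrum.

    Each bin is removed together with its conjugate bin so the result stays real.
    """
    X = dft(spectrum)
    n = len(X)
    for k in masked_bins:
        X[k % n] = 0
        X[(-k) % n] = 0
    return [value.real for value in idft(X)]


spectrum = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]    # one pixel's toy spectrum
patch_mask = mask_patches(16, 0.75, random.Random(0))   # hide 12 of 16 patches
no_dc = mask_frequencies(spectrum, [0])                 # remove the mean component
```

A pretraining loop would feed the visible patches and the frequency-masked spectra to the encoder and penalise the reconstruction error on what was removed.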

[26] Seeing the Abstract: Translating the Abstract Language for Vision Language Models

Davide Talon,Federico Girella,Ziyue Liu,Marco Cristani,Yiming Wang

Main category: cs.CV

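TL;DR: The paper shows the importance of abstract language for vision-language models (VLMs) and proposes ACT, a training-free, model-agnostic method that outperforms fine-tuned VLMs on retrieval tasks.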

Details

Motivation: Current VLM research overlooks the value of abstract language, even though abstract language is widespread and informative in domains such as fashion. Method: Proposes the Abstract-to-Concrete Translator (ACT), which uses pre-trained models and multimodal databases to map abstract representations to well-represented concrete ones in the VLM latent space. Result: ACT outperforms fine-tuned VLMs on text-to-image retrieval, generalizes strongly, and works with a variety of VLMs. Conclusion: ACT is a plug-and-play solution that significantly improves the ability of VLMs to handle abstract language. Abstract: Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.

[27] PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

Lukas Meiner,Jens Mehnert,Alexandru Paul Condurache

Main category: cs.CV

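TL;DR: PROM is a quantization method for depthwise-separable convolutional networks that selectively uses ternary and 8-bit weights, substantially reducing energy cost and storage size.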

Details

Motivation: In modern depthwise-separable convolutional networks, computational cost is distributed unevenly and pointwise convolutions are the most expensive, so existing quantization methods fail to fully exploit the potential efficiency gains. Method: PROM quantizes with two bit-widths: pointwise convolutions use ternary weights and the remaining modules use 8-bit weights, trained with quantization-aware training. Activations are quantized to 8 bits, which turns pointwise convolutions into int8 additions. Result: On MobileNetV2, PROM reduces energy cost by 23.9x and storage size by 2.7x while retaining ImageNet classification performance. Conclusion: PROM addresses the challenges of quantizing depthwise-separable networks with a simple approach and significantly improves energy and storage efficiency. Abstract: Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across its components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminates the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model's energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
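To make the "ternary weights turn pointwise convolutions into additions" claim concrete, here is a minimal sketch for a single pixel. The fixed-threshold ternarization rule and both function names are assumptions for illustration; the abstract does not state how PROM derives its ternary weights, and per-channel scaling factors are omitted.

```python
def ternarize(weights, threshold):
    """Map each float weight to -1, 0, or +1 using a fixed threshold (assumed rule)."""
    return [[1 if w > threshold else -1 if w < -threshold else 0 for w in row]
            for row in weights]


def ternary_pointwise(x_int8, w_ternary):
    """1x1 convolution at one pixel using only additions and subtractions.

    x_int8    : list of C_in integer activations
    w_ternary : C_out rows of C_in weights, each in {-1, 0, +1}
    """
    outputs = []
    for row in w_ternary:
        acc = 0
        for x, w in zip(x_int8, row):
            if w == 1:
                acc += x      # +1 weight: add the activation
            elif w == -1:
                acc -= x      # -1 weight: subtract it
            # 0 weight: skip, no work at all
        outputs.append(acc)
    return outputs


float_weights = [[0.9, -0.8, 0.05], [-0.2, 0.6, 0.7]]
w_t = ternarize(float_weights, threshold=0.5)    # [[1, -1, 0], [0, 1, 1]]
y = ternary_pointwise([10, -3, 7], w_t)          # [13, 4]
```

The result matches an ordinary dot product with the ternary weights, but no multiplication is ever executed, which is where the energy saving comes from.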

[28] DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

Wei-Ting Chen,Yu-Jiet Vong,Yi-Tsung Lee,Sy-Yen Kuo,Qiang Gao,Sizhuo Ma,Jian Wang

Main category: cs.CV

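TL;DR: DiffVQA is a new video quality assessment framework that extracts features with a pre-trained diffusion model and handles temporal dynamics with a Mamba module, significantly improving assessment performance.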

Details

Motivation: Existing CNN- and ViT-based video quality assessment methods struggle to align with human perception and are constrained by the scale and diversity of available datasets. Method: DiffVQA adapts a diffusion model through a control module, extracts semantic and distortion features, and introduces a Mamba module for temporal dynamics; the features are merged to predict the score. Result: Experiments show that DiffVQA performs strongly on multiple datasets and generalizes well. Conclusion: A diffusion model is a more effective feature extractor than CNNs and ViTs and improves video quality assessment performance. Abstract: Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model's ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA's superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.

Zhenxing Ming,Julie Stephany Berrio,Mao Shan,Yaoqi Huang,Hongyu Lyu,Nguyen Hoang Khoi Tran,Tzu-Yun Tseng,Stewart Worrall

Main category: cs.CV

TL;DR: The paper proposes OccCylindrical, a method that fuses and refines multimodal features in cylindrical coordinates to improve 3D semantic occupancy prediction.

Details

Motivation: Existing multisensor-fusion methods mainly work in the Cartesian coordinate system and ignore the distribution of sensor readings, which causes loss of detail and degraded performance. Method: Proposes OccCylindrical, which fuses and refines multimodal features in cylindrical coordinates and preserves more geometric detail. Result: Experiments on the nuScenes dataset, including rainy and nighttime scenes, confirm the method's effectiveness and state-of-the-art performance. Conclusion: By refining feature fusion in cylindrical coordinates, OccCylindrical significantly improves the accuracy of 3D semantic occupancy prediction. Abstract: The safe operation of autonomous vehicles (AVs) is highly dependent on their understanding of the surroundings. For this, the task of 3D semantic occupancy prediction divides the space around the sensors into voxels, and labels each voxel with both occupancy and semantic information. Recent perception models have used multisensor fusion to perform this task. However, existing multisensor fusion-based approaches focus mainly on using sensor information in the Cartesian coordinate system. This ignores the distribution of the sensor readings, leading to a loss of fine-grained details and performance degradation. In this paper, we propose OccCylindrical that merges and refines the different modality features under cylindrical coordinates. Our method preserves more fine-grained geometry detail that leads to better performance. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, confirm our approach's effectiveness and state-of-the-art performance. The code will be available at: https://github.com/DanielMing123/OccCylindrical
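As background for the coordinate choice, the sketch below converts a Cartesian sensor point to cylindrical coordinates and bins it into a cylindrical voxel grid. The function names, grid extents, and grid resolution are hypothetical and are not taken from the paper or its repository.

```python
import math


def cartesian_to_cylindrical(x, y, z):
    """Return (rho, phi, z): radial distance, azimuth in (-pi, pi], and height."""
    return math.hypot(x, y), math.atan2(y, x), z


def cylindrical_voxel_index(point, rho_max, z_min, z_max, grid):
    """Bin a Cartesian point into a (n_rho, n_phi, n_z) cylindrical grid.

    Returns None when the point lies outside the grid extents.
    """
    n_rho, n_phi, n_z = grid
    rho, phi, z = cartesian_to_cylindrical(*point)
    if rho >= rho_max or not (z_min <= z < z_max):
        return None
    i = int(rho / rho_max * n_rho)
    j = int((phi + math.pi) / (2 * math.pi) * n_phi) % n_phi
    k = int((z - z_min) / (z_max - z_min) * n_z)
    return i, j, k


idx = cylindrical_voxel_index((1.5, 0.0, 0.0), rho_max=10.0,
                              z_min=-2.0, z_max=2.0, grid=(10, 8, 4))
```

Because each azimuth bin widens with distance, cells near the sensor are small and cells far away are large, which roughly follows how the density of LiDAR returns falls off with range.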

[30] Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification

Zhihao Gong,Lian Wu,Yong Xu

Main category: cs.CV

TL;DR: Proposes the Base-Detail Feature Learning Framework (BDLF), which exploits both modality-shared and modality-specific information to improve visible-infrared person re-identification (VIReID).

Details

Motivation: Existing methods do not fully exploit the information in the different modalities; they focus on modality-shared features and neglect modality-specific details, which leads to poor performance. Method: BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, and uses a correlation restriction method to keep the features rich. Result: Experiments on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF. Conclusion: By learning base and detail features together, BDLF significantly improves performance on the VIReID task. Abstract: Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on mining distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.

[31] Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach

Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre

Main category: cs.CV

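TL;DR: Proposes a method based on "capabilities encoding" that predicts a foundation model's performance on multiple downstream tasks without fine-tuning, simplifying model selection and offering a new perspective for research.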

Details

Motivation: Although more than 75 remote sensing vision foundation models exist, none is consistently best across all downstream tasks, so an efficient way to compare them is needed. Method: Predicts model performance through "capabilities encoding", avoiding fine-tuning on every task and lowering cost. Result: The method simplifies foundation model selection and offers a new perspective on the existing literature. Conclusion: Capabilities encoding provides a practical tool for model comparison and future research. Abstract: Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.

[32] 3D Can Be Explored In 2D: Pseudo-Label Generation for LiDAR Point Clouds Using Sensor-Intensity-Based 2D Semantic Segmentation

Andrew Caunes,Thierry Chateau,Vincent Frémont

Main category: cs.CV

TL;DR: Proposes a 3D semantic segmentation method that needs no 3D annotation, built on 2D segmentation models and a voting mechanism.

Details

Motivation: Address the high annotation cost and domain shift problems in semantic segmentation of 3D point clouds. Method: Generates 2D views from LiDAR scans, applies a 2D segmentation model, and maps the results back onto the 3D point cloud with a voting mechanism. Result: The method is effective without any 3D annotation and can be used for pseudo-label generation in support of unsupervised domain adaptation. Conclusion: The method offers an efficient, low-cost solution for 3D semantic segmentation. Abstract: Semantic segmentation of 3D LiDAR point clouds, essential for autonomous driving and infrastructure management, is best achieved by supervised learning, which demands extensive annotated datasets and faces the problem of domain shifts. We introduce a new 3D semantic segmentation pipeline that leverages aligned scenes and state-of-the-art 2D segmentation methods, avoiding the need for direct 3D annotation or reliance on additional modalities such as camera images at inference time. Our approach generates 2D views from LiDAR scans colored by sensor intensity and applies 2D semantic segmentation to these views using a camera-domain pretrained model. The segmented 2D outputs are then back-projected onto the 3D points, with a simple voting-based estimator that merges the labels associated with each 3D point. Our main contribution is a global pipeline for 3D semantic segmentation requiring no prior 3D annotation and no other modality for inference, which can be used for pseudo-label generation. We conduct a thorough ablation study and demonstrate the potential of the generated pseudo-labels for the Unsupervised Domain Adaptation task.
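The voting step that merges per-view 2D labels into one label per 3D point can be sketched as a simple majority vote. The data layout (a flat list of `(point_index, label)` observations) and the `unlabeled` sentinel are assumptions for illustration; the abstract only says the estimator is voting-based.

```python
from collections import Counter


def vote_labels(observations, num_points, unlabeled=-1):
    """Majority-vote a semantic label for each 3D point.

    observations : iterable of (point_index, label) pairs, one per 2D view
                   in which the point was visible and segmented
    num_points   : total number of points in the cloud
    Points never seen in any view receive the `unlabeled` value.
    """
    votes = [Counter() for _ in range(num_points)]
    for point_idx, label in observations:
        votes[point_idx][label] += 1
    return [counts.most_common(1)[0][0] if counts else unlabeled
            for counts in votes]


# Point 0 is seen in three views, point 1 in one view, point 2 in none.
obs = [(0, "road"), (0, "road"), (0, "sidewalk"), (1, "car")]
pseudo_labels = vote_labels(obs, num_points=3)
```

Keeping the raw vote counts would also allow low-agreement points to be filtered out before the pseudo-labels are used for training.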

[33] Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices

Tasnim Shahriar

Main category: cs.CV

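TL;DR: This paper evaluates five lightweight deep learning models in resource-constrained settings and finds that EfficientNetV2-S achieves the highest accuracy, MobileNetV3 offers the best balance between accuracy and efficiency, and SqueezeNet has the fastest inference.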

Details

Motivation: Study the suitability of lightweight models for resource-constrained devices in order to optimize deep learning systems for edge computing and mobile platforms. Method: Evaluates five models (MobileNetV3 Small, ResNet18, and others) on three datasets in terms of classification accuracy, inference time, FLOPs, and model size, and compares pretrained models with models trained from scratch. Result: Transfer learning significantly improves model performance; EfficientNetV2-S achieves the highest accuracy, MobileNetV3 offers the best balance, and SqueezeNet is the most compact. Conclusion: The study exposes the trade-off between accuracy and efficiency and offers practical guidance for deploying models in resource-constrained scenarios. Abstract: This paper presents a comprehensive evaluation of lightweight deep learning models for image classification, emphasizing their suitability for deployment in resource-constrained environments such as low-memory devices. Five state-of-the-art architectures - MobileNetV3 Small, ResNet18, SqueezeNet, EfficientNetV2-S, and ShuffleNetV2 - are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet. The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size. Additionally, we investigate the impact of hyperparameter tuning, data augmentation, and training paradigms by comparing pretrained models with scratch-trained counterparts, focusing on MobileNetV3 Small. Our findings reveal that transfer learning significantly enhances model accuracy and computational efficiency, particularly for complex datasets like Tiny ImageNet. EfficientNetV2 consistently achieves the highest accuracy, while MobileNetV3 offers the best balance between accuracy and efficiency, and SqueezeNet excels in inference speed and compactness. This study highlights critical trade-offs between accuracy and efficiency, offering actionable insights for deploying lightweight models in real-world applications where computational resources are limited. By addressing these challenges, this research contributes to optimizing deep learning systems for edge computing and mobile platforms.

[34] 3D Gaussian Splatting Data Compression with Mixture of Priors

Lei Liu,Zhenghao Chen,Dong Xu

Main category: cs.CV

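TL;DR: Proposes a 3D Gaussian Splatting (3DGS) data compression method based on a Mixture of Priors (MoP) strategy that addresses the shortcomings of existing entropy models and quantization strategies and significantly improves compression performance.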

Details

Motivation: 3DGS data compression is essential for efficient storage and transmission, but existing methods fall short in their entropy models and quantization strategies, which limits progress. Method: Adopts a Mixture of Priors (MoP) strategy in which multiple lightweight MLPs generate diverse prior features that a gating mechanism integrates into the MoP feature. The MoP feature improves the conditional entropy model (lossless compression) and guides element-wise quantization (lossy compression). Result: Achieves state-of-the-art performance on multiple benchmarks, including Mip-NeRF360 and BungeeNeRF. Conclusion: The MoP strategy effectively improves 3DGS data compression and offers a new direction for efficient storage and transmission. Abstract: 3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and Tank&Temples.
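Two ingredients named above, gated mixing of several prior features and quantization with a per-element step size, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the softmax gate, the function names, and the fixed step values are hypothetical, and the coarse-to-fine refinement of the step matrix described in the abstract is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_priors(prior_features, gate_logits):
    """Combine prior feature vectors with softmax gate weights (assumed gate form)."""
    weights = softmax(gate_logits)
    dim = len(prior_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, prior_features))
            for d in range(dim)]


def quantize_elementwise(values, steps):
    """Round each value to the nearest multiple of its own quantization step."""
    return [round(v / s) * s for v, s in zip(values, steps)]


# Two hypothetical prior features, equally weighted by the gate.
mop_feature = mix_priors([[1.0, 2.0], [3.0, 4.0]], gate_logits=[0.0, 0.0])
# A finer step for the first element, a coarser one for the second.
quantized = quantize_elementwise([0.26, 1.4], steps=[0.5, 1.0])
```

Giving each element its own step lets important attributes keep more precision while less important ones are compressed harder.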

[35] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Yibin Wang,Zhimin Li,Yuhang Zang,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang

Main category: cs.CV

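TL;DR: The paper proposes UnifiedReward-Think, a multimodal reward model based on chain-of-thought (CoT) reasoning that uses long-chain reasoning to make reward signals more reliable and robust.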

Details

Motivation: Existing reward models (RMs) usually give only direct responses or shallow reasoning, which leads to inaccurate reward signals. Introducing explicit long-chain reasoning can improve their reliability and robustness. Method: Adopts an exploration-driven reinforcement fine-tuning approach: (1) distill the reasoning process of GPT-4o using a small amount of image generation preference data; (2) elicit the model's reasoning ability with large-scale multimodal preference data; (3) optimize the model through GRPO-based reinforcement fine-tuning. Result: The model shows superior performance across a range of vision reward tasks. Conclusion: Through long-chain reasoning, UnifiedReward-Think significantly improves the accuracy and robustness of reward models. Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model. (3) Incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
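The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that advantage computation; the function name, the binary rewards, and the `eps` stabiliser are illustrative assumptions, and the policy-gradient update and KL regularisation used in full GRPO training are omitted.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are compared with their siblings rather than with a learned value baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four reasoning traces for one preference sample: 1.0 if the final
# judgement was correct, 0.0 otherwise (hypothetical reward scheme).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct traces receive positive advantages and incorrect ones negative, and a group where every trace gets the same reward contributes no learning signal.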

[36] SD-VSum: A Method and Dataset for Script-Driven Video Summarization

Manolis Mylonas,Evlampios Apostolidis,Vasileios Mezaris

Main category: cs.CV

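TL;DR: The paper introduces the task of script-driven video summarization (SD-VSum), which builds a summary by selecting the parts of a video most relevant to a user-provided script, and extends the VideoXum dataset to support multimodal training. Experiments show that SD-VSum outperforms existing methods.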

Details

Motivation: Traditional video summarization cannot produce different summaries according to user needs, which motivates a script-driven video summarization task. Method: Extends the VideoXum dataset with natural language descriptions and designs a network with a cross-modal attention mechanism (SD-VSum) that fuses visual and textual information. Result: SD-VSum outperforms existing query-driven and generic summarization methods and produces summaries adapted to user needs. Conclusion: SD-VSum offers a more flexible, user-driven solution for video summarization with superior performance. Abstract: In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently-introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of "video, summary and summary description" can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), that relies on the use of a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's needs about their content.

[37] Very High-Resolution Forest Mapping with TanDEM-X InSAR Data and Self-Supervised Learning

José-Luis Bueso-Bello,Benjamin Chauvel,Daniel Carcereri,Philipp Posovszky,Pietro Milillo,Jennifer Ruiz,Juan-Carlos Fernández-Diaz,Carolina González,Michele Martone,Ronny Hänsch,Paola Rizzoli

Main category: cs.CV

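TL;DR: The paper proposes a framework that combines self-supervised and supervised learning to map forests at high resolution from TanDEM-X data, addressing the shortage of labeled data.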

Details

Motivation: Exploit the high-resolution capabilities of the TanDEM-X mission to overcome the limitations of medium-resolution products in forest mapping, such as detecting narrow roads and precisely delineating forest boundaries. Method: Use self-supervised learning to extract informative representations from the input features, followed by supervised training with a small number of reliable labels. Result: In a real-case application over the Amazon rainforest, the method significantly improves classification accuracy over fully supervised methods. Conclusion: The framework is a highly promising solution for large-scale, high-resolution forest mapping. Abstract: Deep learning models have shown encouraging capabilities for accurately mapping forests at medium resolution with TanDEM-X interferometric SAR data. Such models, like most current state-of-the-art deep learning techniques in remote sensing, are trained in a fully supervised way, which requires a large amount of labeled data for training and validation. In this work, our aim is to exploit the high-resolution capabilities of the TanDEM-X mission to map forests at 6 m. The goal is to overcome the intrinsic limitations posed by medium-resolution products, which affect, e.g., the detection of narrow roads within vegetated areas and the precise delineation of the contours of forested regions. To cope with the lack of extended reliable reference datasets at such a high resolution, we investigate self-supervised learning techniques for extracting highly informative representations from the input features, followed by a supervised training step with a significantly smaller number of reliable labels. A 1 m resolution forest/non-forest reference map over Pennsylvania, USA, allows for comparing different training approaches for the development of an effective forest mapping framework with limited labeled samples. We select the best-performing approach over this test region and apply it in a real-case forest mapping scenario over the Amazon rainforest, where very little labeled data at high resolution is available. In this challenging scenario, the proposed self-supervised framework significantly enhances the classification accuracy with respect to fully supervised methods trained using the same amount of labeled data, representing an extremely promising starting point for large-scale, very high-resolution forest mapping with TanDEM-X data.

[38] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan,Yancheng Bai,Xu Duan,Mingxing Li,Lei Sun,Xiangxiang Chu

Main category: cs.CV

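TL;DR: FLUX-Text is a multilingual scene text editing framework based on FLUX-Fill that improves text editing quality through lightweight glyph and text embedding modules, performing especially well on non-Latin scripts such as Chinese.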

Details

Motivation: Existing text editing methods based on latent diffusion models (LDM) generate non-Latin characters poorly and tend to produce inaccurate or unrecognizable characters when glyphs are complex. Method: Propose the FLUX-Text framework, which conditions on glyphs in both the visual and textual modalities and adds lightweight glyph and text embedding modules, requiring only 100K training examples. Result: On public datasets, FLUX-Text surpasses existing methods in text fidelity and reaches state-of-the-art performance. Conclusion: Through its lightweight design, FLUX-Text effectively handles text editing with complex glyphs and trains efficiently. Abstract: The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (e.g., Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owing to the lightweight design, FLUX-Text is trained with only 100K training examples, compared to current popular methods trained with 2.9M. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on the public datasets demonstrate that our method surpasses previous works in text fidelity.

[39] From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

Guoting Wei,Yu Liu,Xia Yuan,Xizhe Xue,Linlin Guo,Yifan Yang,Chunxia Zhao,Zongwen Bai,Haokui Zhang,Rong Xiao

Main category: cs.CV

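TL;DR: The paper presents MI-OAD, a large-scale language-guided dataset for open-set aerial detection, and uses the OS-W2S Label Engine to generate rich text annotations automatically, significantly improving open-set detection performance.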

Details

Motivation: Because datasets are limited, existing language-guided methods struggle to meet the needs of fine-grained open-set detection, so a larger dataset is required. Method: Build the MI-OAD dataset, containing 163,023 images and 2 million image-text pairs, annotated automatically with the OS-W2S Label Engine. Result: Under zero-shot conditions, Grounding DINO improves by 29.5 AP_{50} and 33.7 Recall@10 on sentence inputs. Conclusion: MI-OAD and the OS-W2S Label Engine effectively address the limitations of existing datasets and significantly improve open-set aerial detection. Abstract: In recent years, language-guided open-world aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called Multi-instance Open-set Aerial Dataset (MI-OAD), addressing the limitations of current remote sensing grounding data and enabling effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. We also employ state-of-the-art open-set methods from the natural image domain, trained on our proposed dataset, to validate the model's open-set detection capabilities. For instance, when trained on our dataset, Grounding DINO achieves improvements of 29.5 AP_{50} and 33.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the label engine will be released publicly.

[40] A Vision-Language Model for Focal Liver Lesion Classification

Song Jian,Hu Yuchang,Wang Hui,Chen Yen-Wei

Main category: cs.CV

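TL;DR: Liver-VLM uses vision-language multimodal learning to significantly improve the accuracy of liver lesion classification, outperforming conventional methods especially when data is limited.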

Details

Motivation: Conventional deep learning models depend on large-scale annotated data, which is often limited in medical imaging. Vision-language models such as CLIP address this through multimodal learning. Method: Liver-VLM incorporates class information into the text encoder and aligns image and text features by computing the cosine similarity between image and text embeddings and optimizing a cross-entropy loss. Result: On the MPCT-FLLs dataset, Liver-VLM outperforms the standard CLIP and MedCLIP models in accuracy and AUC, and a lightweight ResNet18 backbone further improves performance. Conclusion: Liver-VLM provides an efficient and data-efficient approach to liver lesion classification, particularly suited to settings with limited annotated data. Abstract: Accurate classification of focal liver lesions is crucial for diagnosis and treatment in hepatology. However, traditional supervised deep learning models depend on large-scale annotated datasets, which are often limited in medical imaging. Recently, vision-language models (VLMs) such as the Contrastive Language-Image Pre-training model (CLIP) have been applied to image classification. Compared to the conventional convolutional neural network (CNN), which classifies images based on visual information only, a VLM leverages multimodal learning with text and images, allowing it to learn effectively even with a limited amount of labeled data. Inspired by CLIP, we propose Liver-VLM, a model specifically designed for focal liver lesion (FLL) classification. First, Liver-VLM incorporates class information into the text encoder without introducing additional inference overhead. Second, by calculating the pairwise cosine similarities between image and text embeddings and optimizing the model with a cross-entropy loss, Liver-VLM effectively aligns image features with class-level text features. Experimental results on the MPCT-FLLs dataset demonstrate that the Liver-VLM model outperforms both the standard CLIP and MedCLIP models in terms of accuracy and Area Under the Curve (AUC). Further analysis shows that using a lightweight ResNet18 backbone enhances classification performance, particularly under data-constrained conditions.
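The alignment step described in the abstract above (pairwise cosine similarities between image and class-level text embeddings, trained with a cross-entropy loss) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Liver-VLM implementation: the embeddings are plain lists of floats standing in for encoder outputs, and the `temperature` value of 0.07 is borrowed from CLIP's initial setting, which the abstract does not specify.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_emb, class_text_embs, temperature=0.07):
    """Cosine similarity between one image and every class text embedding, scaled by a temperature."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in class_text_embs:
        txt = l2_normalize(text_emb)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    return logits


def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def predict(image_emb, class_text_embs):
    """Classify by picking the class whose text embedding is most similar to the image."""
    logits = cosine_logits(image_emb, class_text_embs)
    return max(range(len(logits)), key=lambda i: logits[i])
```

Because only one text embedding per class is needed at inference time, the class embeddings can be computed once and cached.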

[41] GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Dongbin Zhang,Yunfei Liu,Lijian Lin,Ye Zhu,Yang Li,Minghan Qin,Yu Li,Haoqian Wang

Main category: cs.CV

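TL;DR: The GUAVA framework quickly reconstructs a high-quality, animatable upper-body 3D Gaussian avatar from a single image, significantly improving rendering quality and speed.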

Details

Motivation: Existing methods rely on multi-view images or videos and, limited by the expressiveness of the SMPLX model, struggle with facial expressions. Method: Introduce an expressive human model (EHM) and an improved tracking method, built on inverse texture mapping and projection sampling to infer Gaussians. Result: GUAVA significantly outperforms existing methods in rendering quality and speed, with a reconstruction time of only 0.1 s and support for real-time animation. Conclusion: GUAVA offers an efficient solution for reconstructing animatable 3D avatars from a single image. Abstract: Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX's expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (0.1 s), and supports real-time animation and rendering.

[42] Interpretable Zero-shot Learning with Infinite Class Concepts

Zihan Ye,Shreyank N Gowda,Shiming Chen,Yaochu Jin,Kaizhu Huang,Xiaobo Jin

Main category: cs.CV

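TL;DR: The paper proposes a new framework, InfZSL, that improves zero-shot learning by dynamically generating an unlimited number of phrase-level class concepts, addressing LLM hallucination and significantly improving performance.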

Details

Motivation: Zero-shot learning (ZSL) relies on aligning images with class semantics, but existing methods lack transparency and suffer from LLM hallucination. The paper aims to address these issues by redefining class semantics in terms of transferability and discriminability. Method: Propose the InfZSL framework, which uses LLMs to dynamically generate phrase-level class concepts and applies entropy-based scoring with a "goodness" concept selection mechanism to keep the most transferable and discriminative concepts. Result: Significant improvements on three popular benchmark datasets, along with highly interpretable, image-grounded concepts. Conclusion: By dynamically generating and filtering class concepts, InfZSL effectively mitigates hallucination in ZSL while improving both performance and interpretability. Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by aligning images with intermediate class semantics, like human-annotated concepts or class definitions. An emerging alternative leverages Large-scale Language Models (LLMs) to automatically generate class documents. However, these methods often face challenges with transparency in the classification process and may suffer from the notorious hallucination problem in LLMs, resulting in non-visual class semantics. This paper redefines class semantics in ZSL with a focus on transferability and discriminability, introducing a novel framework called Zero-shot Learning with Infinite Class Concepts (InfZSL). Our approach leverages the powerful capabilities of LLMs to dynamically generate an unlimited array of phrase-level class concepts. To address the hallucination challenge, we introduce an entropy-based scoring process that incorporates a "goodness" concept selection mechanism, ensuring that only the most transferable and discriminative concepts are selected. Our InfZSL framework not only demonstrates significant improvements on three popular benchmark datasets but also generates highly interpretable, image-grounded concepts. Code will be released upon acceptance.

[43] 3D Surface Reconstruction with Enhanced High-Frequency Details

Shikun Zhang,Yiqun Wang,Cunjian Chen,Yong Li,Qiuhong Ke

Main category: cs.CV

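TL;DR: FreNeuS uses high-frequency information to improve neural implicit 3D reconstruction, raising the quality of reconstructed surface detail through dynamic sampling and high-frequency weighting.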

Details

Motivation: Existing neural surface reconstruction methods sample randomly, so they learn too little high-frequency detail and produce overly smooth reconstructions. Method: FreNeuS uses pixel gradient changes to locate high-frequency regions, dynamically adjusts the sampling strategy, and adds a high-frequency weighting method to strengthen detail reconstruction. Result: Experiments show that FreNeuS reconstructs fine surface details, outperforms existing methods, and applies to NeuS-based work. Conclusion: Guided by high-frequency information, FreNeuS significantly improves the quality and applicability of surface detail reconstruction. Abstract: Neural implicit 3D reconstruction can reproduce shapes without 3D supervision, and it learns the 3D scene through volume rendering methods and neural implicit representations. Current neural surface reconstruction methods tend to randomly sample the entire image, making it difficult to learn high-frequency details on the surface, and thus the reconstruction results tend to be too smooth. We designed a method (FreNeuS) based on high-frequency information to solve the problem of insufficient surface detail. Specifically, FreNeuS uses pixel gradient changes to easily acquire high-frequency regions in an image and uses the obtained high-frequency information to guide surface detail reconstruction. High-frequency information is first used to guide the dynamic sampling of rays, applying different sampling strategies according to variations in high-frequency regions. To further enhance the focus on surface details, we have designed a high-frequency weighting method that constrains the representation of high-frequency details during the reconstruction process. Qualitative and quantitative results show that our method can reconstruct fine surface details and obtain better surface reconstruction quality compared to existing methods. In addition, our method is more applicable and can be generalized to any NeuS-based work.
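The idea of steering ray sampling toward high-frequency regions can be illustrated with a small sketch. It assumes a grayscale image stored as a list of rows, uses central differences as the pixel gradient, and mixes a uniform distribution with a gradient-proportional one. The `uniform_mix` parameter and the mixing rule are illustrative assumptions; the abstract above does not state the actual FreNeuS sampling schedule.

```python
import random


def gradient_magnitude(img):
    """Central-difference gradient magnitude per pixel, clamping indices at the image borders."""
    h, w = len(img), len(img[0])
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            grad[y][x] = (gx * gx + gy * gy) ** 0.5
    return grad


def sample_pixels(img, n, uniform_mix=0.5, seed=0):
    """Draw n pixel coordinates (y, x); pixels with larger gradients are drawn more often."""
    rng = random.Random(seed)
    grad = gradient_magnitude(img)
    h, w = len(img), len(img[0])
    coords = [(y, x) for y in range(h) for x in range(w)]
    total = sum(sum(row) for row in grad)
    if total == 0:
        weights = [1.0] * len(coords)  # flat image: fall back to uniform sampling
    else:
        weights = [
            uniform_mix / len(coords) + (1.0 - uniform_mix) * grad[y][x] / total
            for y, x in coords
        ]
    return rng.choices(coords, weights=weights, k=n)
```

Keeping a nonzero `uniform_mix` means smooth regions still receive some rays, so low-frequency geometry is not starved of supervision.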

[44] Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models

Abram Schonfeldt,Benjamin Maylor,Xiaofang Chen,Ronald Clark,Aiden Doherty

Main category: cs.CV

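TL;DR: The paper compares three vision-language models and two discriminative models at predicting physical activity behaviour in free-living settings. The best open-source vision-language model comes close to the discriminative model when predicting sedentary behaviour, but performance drops for other activity intensities and falls markedly across datasets.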

Details

Motivation: Validate and develop wearable-based methods for measuring physical activity while reducing the manual annotation burden. Method: Compare three vision-language models and two discriminative models on two free-living validation studies using wearable camera data. Result: The vision-language and discriminative models perform similarly when predicting sedentary behaviour, but both do worse on other activity intensities, and performance drops markedly across datasets. Conclusion: Open-source computer vision models can reduce the burden of annotating sedentary behaviour in similar populations. Abstract: Introduction: Data from wearable devices collected in free-living settings, and labelled with physical activity behaviours compatible with health research, are essential for both validating existing wearable-based measurement approaches and developing novel machine learning approaches. One common way of obtaining these labels relies on laborious annotation of sequences of images captured by cameras worn by participants through the course of a day. Methods: We compare the performance of three vision language models and two discriminative models on two free-living validation studies with 161 and 111 participants, collected in Oxfordshire, United Kingdom and Sichuan, China, respectively, using the Autographer (OMG Life, defunct) wearable camera. Results: We found that the best open-source vision-language model (VLM) and fine-tuned discriminative model (DM) achieved comparable performance when predicting sedentary behaviour from single images on unseen participants in the Oxfordshire study; median F1-scores: VLM = 0.89 (0.84, 0.92), DM = 0.91 (0.86, 0.95). Performance declined for light (VLM = 0.60 (0.56, 0.67), DM = 0.70 (0.63, 0.79)), and moderate-to-vigorous intensity physical activity (VLM = 0.66 (0.53, 0.85); DM = 0.72 (0.58, 0.84)). When applied to the external Sichuan study, performance fell across all intensity categories, with median Cohen's kappa-scores falling from 0.54 (0.49, 0.64) to 0.26 (0.15, 0.37) for the VLM, and from 0.67 (0.60, 0.74) to 0.19 (0.10, 0.30) for the DM. Conclusion: Freely available computer vision models could help annotate sedentary behaviour, typically the most prevalent activity of daily living, from wearable camera images within similar populations to seen data, reducing the annotation burden.
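The abstract above reports per-class F1-scores and Cohen's kappa. For readers unfamiliar with the two metrics, here is a minimal sketch of how each is computed from paired label sequences. The label strings are hypothetical, and the sketch covers only a single sequence; the study reports medians across participants, which would require computing these per participant and then aggregating.

```python
from collections import Counter


def f1_score(y_true, y_pred, positive):
    """F1 for one class: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)
```

Kappa is the stricter of the two when one class dominates, because agreement that would occur by chance on the majority class is subtracted out.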

[45] Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

Haonan Wang,Jiaji Mao,Lehan Wang,Qixiang Zhang,Marawan Elbatel,Yi Qin,Huijun Hu,Baoxun Li,Wenhui Deng,Weifeng Qin,Hongrui Li,Jialin Liang,Jun Shen,Xiaomeng Li

Main category: cs.CV

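TL;DR: RCMed is a full-stack AI assistant that improves the accuracy of medical image analysis and diagnosis through multimodal alignment and hierarchical vision-language grounding.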

Details

Motivation: In clinical use, medical AI assistants face limited accuracy on multimodal content and insufficient validation in real-world settings. Method: Use a self-reinforcing correlation mechanism and a color region description strategy to learn shape-location-text relationships through closed-loop interaction between visual features and language semantics. Result: Strong performance on 165 clinical tasks, a 23.5% relative improvement in cell segmentation, and state-of-the-art performance in external validation across 20 cancer types. Conclusion: RCMed shows the potential of multimodal models to reach human-level interpretation in complex scenarios, advancing human-centric AI healthcare. Abstract: Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise anatomical delineation, accurate localization, and reliable diagnosis through hierarchical vision-language grounding. A self-reinforcing correlation mechanism allows visual features to inform language context, while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. This correlation is enhanced by a color region description strategy, translating anatomical structures into semantically rich text to learn shape-location-text relationships across scales. Trained on 20 million image-mask-description triplets, RCMed achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries, excelling in 165 clinical tasks across 9 modalities. It achieved a 23.5% relative improvement in cell segmentation from microscopy images over prior methods. RCMed's strong vision-language alignment enables exceptional generalization, with state-of-the-art performance in external validation across 20 clinically significant cancer types, including novel tasks. This work demonstrates how integrated multimodal models capture fine-grained patterns, enabling human-level interpretation in complex scenarios and advancing human-centric AI healthcare.

[46] Attention-aggregated Attack for Boosting the Transferability of Facial Adversarial Examples

Jian-Wei Li,Wen-Ze Shao

Main category: cs.CV

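TL;DR: The paper proposes a new method, Attention-aggregated Attack (AAA), which improves the transferability of adversarial examples by imitating the attention of other FR models on clean face images, making attacks on face recognition (FR) models more effective.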

Details

Motivation: Adversarial examples expose the vulnerability of deep learning models, but for fine-grained vision tasks such as face recognition, existing methods do not adequately account for model specificity, which leads to weak attacks. Method: Investigate which facial features contribute to embedding learning in FR models, and propose AAA, which destroys critical facial features by imitating the attention of other FR models. Result: Experiments on a variety of FR models confirm the superiority and robustness of AAA. Conclusion: AAA significantly improves the transferability of adversarial examples and offers a new approach to attacking FR models. Abstract: Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. The transfer-based attack is a hot topic in black-box attacks that are practical in real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-specific deep models for fine-grained vision tasks, such as face recognition (FR), giving rise to unsatisfactory attacking performance. In this work, we first investigate what in a face exactly contributes to the embedding learning of FR models and find that both decisive and auxiliary facial features are specific to each FR model, which is quite different from the biological mechanism of the human visual system. Accordingly, we then propose a novel attack method named Attention-aggregated Attack (AAA) to enhance the transferability of adversarial examples against FR, which is inspired by the attention divergence and aims to destroy the facial features that are critical for the decision-making of other FR models by imitating their attention on the clean face images. Extensive experiments conducted on various FR models validate the superiority and robust effectiveness of the proposed method over existing methods.

[47] Enhancing Target-unspecific Tasks through a Features Matrix

Fangming Cui,Yonggang Zhang,Xuan Wang,Xinmei Tian,Jun Yu

Main category: cs.CV

TL;DR: Proposes a Features Matrix (FM) regularization method that improves large vision-language models on target-unspecific tasks and avoids overfitting.

Details

Motivation: Existing prompt learning methods perform poorly on target-unspecific tasks, possibly because overfitting makes the model forget its general knowledge. Method: Extract and use general knowledge to build a Features Matrix (FM) that captures input semantics from a deep and fine-grained perspective and preserves general knowledge. Result: The FM is compatible with existing frameworks and significantly improves performance on target-unspecific tasks, reaching the state of the art. Conclusion: The FM is a generic and flexible module that effectively mitigates overfitting and improves model generality. Abstract: Recent developments in prompt learning of large vision-language models have significantly improved performance in target-specific tasks. However, these prompt optimization methods often struggle to tackle target-unspecific or generalizable tasks effectively. This may be because overfitting during training causes the model to forget its general knowledge, which strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine perspective, preserving essential general knowledge, which mitigates the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is markedly effective in enhancing target-unspecific tasks, achieving state-of-the-art performance.

[48] EOPose : Exemplar-based object reposing using Generalized Pose Correspondences

Sarthak Mehrotra,Rishabh Jain,Mayur Hemani,Balaji Krishnamurthy,Mausoom Sarkar

Main category: cs.CV

TL;DR: EOPose is an end-to-end framework that uses unsupervised keypoint detection to repose objects in images, with applications in areas such as e-commerce.

Details

Motivation: E-commerce needs many product image variants produced quickly, and existing generative methods struggle to preserve fine object details such as colors, textures, and brand marks. Method: EOPose uses a three-step approach that warps and re-renders the source image, guided by keypoint correspondences with a target pose image. Result: EOPose performs well on image quality metrics such as PSNR, SSIM, and FID, and preserves fine object details. Conclusion: EOPose is an efficient, high-quality object reposing method suitable for practical use. Abstract: Reposing objects in images has a myriad of applications, especially for e-commerce where several variants of product images need to be produced quickly. In this work, we leverage the recent advances in unsupervised keypoint correspondence detection between different object images of the same class to propose an end-to-end framework for generic object reposing. Our method, EOPose, takes a target pose-guidance image as input and uses its keypoint correspondence with the source object image to warp and re-render the latter into the target pose using a novel three-step approach. Unlike generative approaches, our method also preserves the fine-grained details of the object such as its exact colors, textures, and brand marks. We also prepare a new dataset of paired objects based on the Objaverse dataset to train and test our network. EOPose produces high-quality reposing output as evidenced by different image quality metrics (PSNR, SSIM and FID). Besides a description of the method and the dataset, the paper also includes detailed ablation and user studies to indicate the efficacy of the proposed method.

[49] DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation

Shanshan Song,Hui Tang,Honglong Yang,Xiaomeng Li

Main category: cs.CV

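TL;DR: The paper proposes a dynamic difference-aware temporal residual network (DDaTR) that improves longitudinal radiology report generation (LRRG) by capturing multi-level spatial and temporal correlations.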

Details

Motivation: Existing LRRG methods fail to capture spatial and temporal correlations effectively during feature extraction, which hurts performance. Method: DDaTR contains a Dynamic Feature Alignment Module (DFAM) and a Dynamic Difference-aware Module (DDAM), which align prior features and capture difference information respectively, and models temporal correlations with a dynamic residual network. Result: DDaTR outperforms existing methods on three benchmarks, demonstrating its effectiveness on both RRG and LRRG tasks. Conclusion: By improving feature extraction and temporal modelling, DDaTR significantly improves LRRG performance. Abstract: Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the difference information across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the Dynamic Difference-aware Module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.

[50] CXR-AD: Component X-ray Image Dataset for Industrial Anomaly Detection

Haoyu Bai,Jie Wang,Gaomin Li,Xuan Li,Xiaohu Zhang,Xia Yang

Main category: cs.CV

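TL;DR: The paper builds CXR-AD, the first public X-ray component anomaly detection dataset, filling the gap in datasets for internal defect detection, and analyzes its technical challenges and the limitations of existing algorithms.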

Details

Motivation: Existing anomaly detection datasets focus mainly on surface defects in visible-light images, and no public X-ray dataset targets internal defects in components. Method: Build the CXR-AD dataset with 653 normal samples and 561 defect samples, and analyze its characteristics and three major technical challenges. Result: Experiments show that existing algorithms degrade by 29.78% on average on CXR-AD, exposing their limitations in internal defect detection. Conclusion: CXR-AD provides the first public benchmark for internal defect detection, supporting algorithm development and more precise inspection. Abstract: Internal defect detection constitutes a critical process in ensuring component quality, for which anomaly detection serves as an effective solution. However, existing anomaly detection datasets predominantly focus on surface defects in visible-light images, lacking publicly available X-ray datasets targeting internal defects in components. To address this gap, we construct the first publicly accessible component X-ray anomaly detection (CXR-AD) dataset, comprising real-world X-ray images. The dataset covers five industrial component categories, including 653 normal samples and 561 defect samples with precise pixel-level mask annotations. We systematically analyze the dataset characteristics and identify three major technical challenges: (1) strong coupling between complex internal structures and defect regions, (2) inherent low contrast and high noise interference in X-ray imaging, and (3) significant variations in defect scales and morphologies. To evaluate dataset complexity, we benchmark three state-of-the-art anomaly detection frameworks (feature-based, reconstruction-based, and zero-shot learning methods). Experimental results demonstrate a 29.78% average performance degradation on CXR-AD compared to MVTec AD, highlighting the limitations of current algorithms in handling internal defect detection tasks. To the best of our knowledge, CXR-AD represents the first publicly available X-ray dataset for component anomaly detection, providing a real-world industrial benchmark to advance algorithm development and enhance precision in internal defect inspection technologies.

[51] LiftFeat: 3D Geometry-Aware Local Feature Matching

Yepeng Liu,Wenpeng Lai,Zhou Zhao,Yuxuan Xiong,Jinchi Zhu,Jun Cheng,Yongchao Xu

Main category: cs.CV

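TL;DR: Proposes LiftFeat, a lightweight network that improves the robustness of raw descriptors by aggregating 3D geometric features, suited to tasks such as SLAM and visual localization.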

Details

Motivation: Extracting robust and discriminative visual features remains challenging in scenes with drastic lighting changes, low-texture areas, or repetitive patterns. Method: Use a pre-trained monocular depth estimation model to generate pseudo surface normal labels that supervise the extraction of 3D geometric features, and design a 3D geometry-aware feature lifting module that fuses surface normal features with the raw 2D descriptor features. Result: LiftFeat outperforms other lightweight state-of-the-art methods on relative pose estimation, homography estimation, and visual localization. Conclusion: By introducing 3D geometric features, LiftFeat significantly improves feature description under extreme conditions. Abstract: Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low-texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptors by aggregating 3D geometric features. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal labels, supervising the extraction of 3D geometric features in terms of the predicted surface normal. We then design a 3D geometry-aware feature lifting module to fuse the surface normal feature with the raw 2D descriptor feature. Integrating such 3D geometric features enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: https://github.com/lyp-deeplearning/LiftFeat.
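To make the pseudo surface normal step above concrete, the sketch below derives per-pixel normals from a depth map and appends them to a 2D descriptor. It rests on several simplifying assumptions that are not taken from the paper: the depth map is a list of rows on a unit pixel grid with no camera intrinsics, normals come from finite differences of depth, and `lift_descriptor` with its `weight` parameter is a hypothetical concatenate-and-renormalize stand-in for the learned feature lifting module.

```python
import math


def normals_from_depth(depth):
    """Approximate unit surface normals (nx, ny, nz) from finite differences of depth."""
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            length = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            row.append((-dzdx / length, -dzdy / length, 1.0 / length))
        normals.append(row)
    return normals


def lift_descriptor(descriptor, normal, weight=0.5):
    """Hypothetical fusion: append the weighted normal to the descriptor and renormalize."""
    fused = list(descriptor) + [weight * c for c in normal]
    norm = math.sqrt(sum(v * v for v in fused))
    return [v / norm for v in fused]
```

Two keypoints with near-identical appearance descriptors, for example on a repetitive pattern, end up with different fused vectors if the surfaces they lie on face different directions.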

[52] Phenotype-Guided Generative Model for High-Fidelity Cardiac MRI Synthesis: Advancing Pretraining and Clinical Applications

Ziyu Li,Yujian Hu,Zhengyao Ding,Yiheng Mao,Haitao Li,Fan Yi,Hongkun Zhang,Zhengxing Huang

Main category: cs.CV

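TL;DR: Proposes a new method, CPGG, that compensates for data scarcity by generating diverse CMR data, significantly improving AI model performance on downstream tasks.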

Details

Motivation: The scarcity of large-scale, high-quality CMR data limits the use of AI in diagnosing heart disease; CPGG aims to address this. Method: CPGG has two stages: a generative model is first trained on cardiac phenotypes derived from CMR data, and a masked autoregressive diffusion model then generates high-quality CMR sequences. Result: The synthetic CMR data is of high quality and significantly improves downstream tasks such as diagnosis and phenotype prediction. Conclusion: CPGG effectively addresses the shortage of CMR data and offers a new direction for AI in cardiac health. Abstract: Cardiac Magnetic Resonance (CMR) imaging is a vital non-invasive tool for diagnosing heart diseases and evaluating cardiac health. However, the limited availability of large-scale, high-quality CMR datasets poses a major challenge to the effective application of artificial intelligence (AI) in this domain. Even the amount of unlabeled data, and the range of health conditions it covers, falls short of what model pretraining needs, which hinders the performance of AI models on downstream tasks. In this study, we present Cardiac Phenotype-Guided CMR Generation (CPGG), a novel approach for generating diverse CMR data that covers a wide spectrum of cardiac health status. The CPGG framework consists of two stages: in the first stage, a generative model is trained using cardiac phenotypes derived from CMR data; in the second stage, a masked autoregressive diffusion model, conditioned on these phenotypes, generates high-fidelity CMR cine sequences that capture both structural and functional features of the heart in a fine-grained manner. We synthesized a massive amount of CMR data to expand the pretraining data. Experimental results show that CPGG generates high-quality synthetic CMR data, significantly improving performance on various downstream tasks, including diagnosis and cardiac phenotype prediction. These gains are demonstrated across both public and private datasets, highlighting the effectiveness of our approach. Code is available at https://anonymous.4open.science/r/CPGG.

[53] A Fusion-Guided Inception Network for Hyperspectral Image Super-Resolution

Usman Muhammad,Jorma Laaksonen

Main category: cs.CV

TL;DR: Proposes FGIN, a single-image super-resolution model that improves HSI super-resolution through a fusion module and multi-scale feature extraction.

Details

Motivation: Existing HSI super-resolution methods depend on precisely aligned image pairs, which are hard to obtain in practice, so the paper proposes a single-image solution. Method: Combine a spectral-spatial fusion module, Inception-like multi-scale feature extraction, a multi-scale fusion block, and an optimized upsampling module. Result: Competitive performance on two public HSI datasets. Conclusion: FGIN effectively sidesteps the image alignment problem and improves super-resolution performance. Abstract: The fusion of low-spatial-resolution hyperspectral images (HSIs) with high-spatial-resolution conventional images (e.g., panchromatic or RGB) has played a significant role in recent advancements in HSI super-resolution. However, this fusion process relies on the availability of precise alignment between image pairs, which is often challenging in real-world scenarios. To mitigate this limitation, we propose a single-image super-resolution model called the Fusion-Guided Inception Network (FGIN). Specifically, we first employ a spectral-spatial fusion module to effectively integrate spectral and spatial information at an early stage. Next, an Inception-like hierarchical feature extraction strategy is used to capture multiscale spatial dependencies, followed by a dedicated multi-scale fusion block. To further enhance reconstruction quality, we incorporate an optimized upsampling module that combines bilinear interpolation with depthwise separable convolutions. Experimental evaluations on two publicly available hyperspectral datasets demonstrate the competitive performance of our method.

[54] Robustness in AI-Generated Detection: Enhancing Resistance to Adversarial Attacks

Sun Haoxuan,Hong Yan,Zhan Jiahui,Chen Haoxing,Lan Jun,Zhu Huijia,Wang Weiqiang,Zhang Liqing,Zhang Jianfu

Main category: cs.CV

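TL;DR: This paper studies the vulnerability of AI-generated face detection systems and proposes a method combining adversarial training with diffusion inversion that significantly improves detector robustness.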

Details

Motivation: Rapid progress in generative image technology brings security risks, particularly in face generation detection. Existing detectors perform well under standard conditions but lack robustness against adversarial attacks. Method: Proposes an approach that integrates adversarial training and uses diffusion inversion and reconstruction to strengthen detection robustness. Result: Experiments show that existing detection systems are easily affected by adversarial perturbations, while the proposed method significantly improves robustness. Conclusion: The method effectively improves the robustness of AI-generated face detection, and the code will be released to support further research. Abstract: The rapid advancement of generative image technology has introduced significant security concerns, particularly in the domain of face generation detection. This paper investigates the vulnerabilities of current AI-generated face detection systems. Our study reveals that while existing detection methods often achieve high accuracy under standard conditions, they exhibit limited robustness against adversarial attacks. To address these challenges, we propose an approach that integrates adversarial training to mitigate the impact of adversarial examples. Furthermore, we utilize diffusion inversion and reconstruction to further enhance detection robustness. Experimental results demonstrate that minor adversarial perturbations can easily bypass existing detection systems, but our method significantly improves the robustness of these systems. Additionally, we provide an in-depth analysis of adversarial and benign examples, offering insights into the intrinsic characteristics of AI-generated content. All associated code will be made publicly available in a dedicated repository to facilitate further research and verification.

[55] Polar Coordinate-Based 2D Pose Prior with Neural Distance Field

Qi Gan,Sao Mai Nguyen,Eric Fenaux,Stephan Clémençon,Mounîm El Yacoubi

Main category: cs.CV

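TL;DR: Proposes a 2D pose prior-guided refinement method based on Neural Distance Fields (NDF) that uses a polar-coordinate representation and a new non-geodesic distance metric to improve pose estimation accuracy in sports scenarios.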

Details

Motivation: Existing deep learning pose estimation models for RGB video perform poorly in real sports scenarios because of motion blur, occlusions, and domain shifts; they need large amounts of annotated data and generalize poorly. Method: Uses a polar-coordinate representation that includes joint connection lengths, defines a non-geodesic distance metric, and proposes a gradient-based batch-projection augmentation strategy to mitigate data scarcity. Result: Validated on a long jump dataset, the method improves the robustness of 2D pose estimation across multiple pose representations while requiring only a small amount of training data. Conclusion: The polar-coordinate representation and the new distance metric significantly improve the accuracy and generalization of pose estimation, making the method suitable for data-scarce sports scenarios. Abstract: Human pose capture is essential for sports analysis, enabling precise evaluation of athletes' movements. While deep learning-based human pose estimation (HPE) models from RGB videos have achieved impressive performance on public datasets, their effectiveness in real-world sports scenarios is often hindered by motion blur, occlusions, and domain shifts across different pose representations. Fine-tuning these models can partially alleviate such challenges but typically requires large-scale annotated data and still struggles to generalize across diverse sports environments. To address these limitations, we propose a 2D pose prior-guided refinement approach based on Neural Distance Fields (NDF). Unlike existing approaches that rely solely on angular representations of human poses, we introduce a polar coordinate-based representation that explicitly incorporates joint connection lengths, enabling a more accurate correction of erroneous pose estimations. Additionally, we define a novel non-geodesic distance metric that separates angular and radial discrepancies, which we demonstrate is better suited for polar representations than traditional geodesic distances. To mitigate data scarcity, we develop a gradient-based batch-projection augmentation strategy, which synthesizes realistic pose samples through iterative refinement. Our method is evaluated on a long jump dataset, demonstrating its ability to improve 2D pose estimation across multiple pose representations, making it robust across different domains. Experimental results show that our approach enhances pose plausibility while requiring only limited training data. Code is available at: https://github.com/QGAN2019/polar-NDF.
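
Illustrative sketch (not the authors' code): the abstract describes a polar representation with bone lengths and a distance that treats angular and radial discrepancies separately. The snippet below shows one way such a distance could look. The function names, the additive form, and the weights `w_angle` and `w_radius` are assumptions made for illustration; the paper's actual metric may differ.

```python
import math

def to_polar(parent, child):
    """Return (angle, length) of the bone from the parent joint to the child joint."""
    dx, dy = child[0] - parent[0], child[1] - parent[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)

def wrapped_angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def polar_pose_distance(pose_a, pose_b, w_angle=1.0, w_radius=1.0):
    """Sum angular and radial discrepancies as two separate terms (assumed form)."""
    ang = sum(wrapped_angle_diff(a[0], b[0]) for a, b in zip(pose_a, pose_b))
    rad = sum(abs(a[1] - b[1]) for a, b in zip(pose_a, pose_b))
    return w_angle * ang + w_radius * rad

# Hypothetical two-bone skeletons given as (parent, child) joint coordinates.
bones_a = [((0, 0), (1, 0)), ((1, 0), (1, 2))]
bones_b = [((0, 0), (0, 1)), ((0, 1), (0, 3))]
pose_a = [to_polar(p, c) for p, c in bones_a]
pose_b = [to_polar(p, c) for p, c in bones_b]
distance = polar_pose_distance(pose_a, pose_b)
```

Keeping the two terms separate means a pose with correct limb directions but wrong limb lengths is penalized only through the radial term, which is the property the abstract highlights.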

[56] Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)

Muge Du,Zhuozhao Zheng,Wenying Wang,Guotao Quan,Wuliang Shi,Le Shen,Li Zhang,Liang Li,Yinong Liu,Yuxiang Xing

Main category: cs.CV

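TL;DR: The BIRD framework addresses computational efficiency, anatomical plausibility, and detail preservation in nonperiodic dynamic CT reconstruction through backward warping, diffeomorphism regularization, motion-compensated reconstruction, and a dimensional-reduction design.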

Details

Motivation: Nonperiodic rapid motion (such as in cardiac imaging) poses motion artifact and limited-angle challenges for dynamic CT reconstruction, and existing methods fall short in computational efficiency and anatomical plausibility. Method: Proposes the BIRD framework, which uses backward warping to reduce computational cost, diffeomorphism regularization to ensure anatomical plausibility, motion-compensated reconstruction to enhance detail, and a dimensional-reduction design for efficient 4D encoding. Result: Validated on simulated and real data, BIRD performs well at reducing motion artifacts and enhancing detail, and is applicable to clinical settings such as one-beat cardiac reconstruction. Conclusion: BIRD offers an efficient and accurate solution for nonperiodic dynamic CT reconstruction with clinical potential. Abstract: Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit neural representation (INR) techniques show promise through self-supervised deep learning, but have critical limitations: computational inefficiency due to forward-warping modeling, difficulty balancing DVF complexity with anatomical plausibility, and challenges in preserving fine details without additional patient-specific pre-scans. This paper presents a novel INR-based framework, BIRD, for nonperiodic dynamic CT reconstruction. It addresses these challenges through four key contributions: (1) backward-warping deformation that enables direct computation of each dynamic voxel with significantly reduced computational cost, (2) diffeomorphism-based DVF regularization that ensures anatomically plausible deformations while maintaining representational capacity, (3) motion-compensated analytical reconstruction that enhances fine details without requiring additional pre-scans, and (4) dimensional-reduction design for efficient 4D coordinate encoding. Through various simulations and practical studies, including digital and physical phantoms and retrospective patient data, we demonstrate the effectiveness of our approach for nonperiodic dynamic CT reconstruction with enhanced details and reduced motion artifacts.
The proposed framework enables more accurate dynamic CT reconstruction with potential clinical applications, such as one-beat cardiac reconstruction, cinematic image sequences for functional imaging, and motion artifact reduction in conventional CT scans.
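
Illustrative sketch (not the BIRD implementation): backward warping computes each output voxel directly by reading the template at a displaced position, which is why every dynamic voxel can be evaluated independently. The 1D toy below assumes a per-position displacement list and linear interpolation; the names `sample_linear` and `backward_warp` are hypothetical.

```python
def sample_linear(signal, x):
    """Linearly interpolate a 1D signal at a fractional position, clamped to the borders."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = int(x)
    j = min(i + 1, len(signal) - 1)
    frac = x - i
    return (1.0 - frac) * signal[i] + frac * signal[j]

def backward_warp(template, displacement):
    """For each output position x, read the template at x + displacement[x]."""
    return [sample_linear(template, x + displacement[x]) for x in range(len(template))]

template = [0.0, 10.0, 20.0, 30.0]
shift = [0.5, 0.5, 0.5, 0.5]  # toy displacement field, half a sample everywhere
warped = backward_warp(template, shift)
```

Forward warping would instead push each template sample to a new location and then need a scatter or splatting step to fill the output grid; the backward form avoids that step.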

[57] Blending 3D Geometry and Machine Learning for Multi-View Stereopsis

Vibhas Vats,Md. Alimoor Reza,David Crandall,Soon-heung Jung

Main category: cs.CV

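TL;DR: GC MVSNet++ actively enforces multi-view, multi-scale geometric consistency checks during learning, significantly accelerating training and achieving a new state of the art on multiple datasets.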

Details

Motivation: Traditional MVS methods rely on photometric consistency, while modern learning-based methods apply geometric consistency checks only in post-processing, with no effect on learning. This work aims to improve performance and efficiency by integrating geometric consistency checks into the learning phase. Method: Proposes GC MVSNet++, which enforces multi-view, multi-scale geometric consistency checks during learning and introduces a densely connected cost regularization network. Result: Achieves a new state of the art on the DTU and BlendedMVS datasets and ranks second on the Tanks and Temples benchmark. Conclusion: GC MVSNet++ is the first method to enforce multi-view, multi-scale geometric consistency during learning, significantly improving performance and training efficiency. Abstract: Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet++, a novel approach that actively enforces geometric consistency of reference view depth maps across multiple source views (multi-view) and at various scales (multi-scale) during the learning phase (see Fig. 1). This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs, simple and feature dense, optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet++ is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.

[58] UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion

Zhanyuan Jia,Ni Yao,Danyang Sun,Chuang Han,Yanting Li,Jiaofen Nan,Fubao Zhu,Chen Zhao,Weihua Zhou

Main category: cs.CV

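TL;DR: Proposes a brain tumor segmentation method that combines deep learning with prior knowledge from a region-growing algorithm, improves performance through multi-scale feature fusion and adaptive attention, and performs well on the BraTS datasets.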

Details

Motivation: Brain tumor segmentation is critical for diagnosis and treatment, but accurate segmentation remains challenging because tumors have irregular shapes, vague boundaries, and high variability. Method: Uses a multi-scale feature fusion module and adaptive attention mechanisms to extract features, combined with a Monte Carlo Dropout strategy for uncertainty estimation. Result: Performs well on the BraTS2021 and BraTS2019 datasets, with Dice scores significantly higher than existing methods. Conclusion: Proposes a novel 3D brain tumor segmentation network based on the U-Net architecture whose robustness and performance are improved through prior knowledge and uncertainty estimation. Abstract: Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to their irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model's robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture. By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved.
The code for the proposed method is available at https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.
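
Illustrative sketch (not taken from the UPMAD-Net repository): the results above are reported as Dice scores, and uncertainty is estimated with MC Dropout. The snippet shows the standard Dice formula on flat binary masks, plus a per-voxel mean and variance over several stochastic forward passes, which is one common way to summarize MC Dropout outputs. The function names and the toy probability values are assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

def mc_mean_and_variance(samples):
    """Per-voxel mean and variance over several stochastic forward passes."""
    n = len(samples)
    columns = list(zip(*samples))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]
    return means, variances

score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
# Two hypothetical dropout passes over two voxels (foreground probabilities).
means, variances = mc_mean_and_variance([[0.2, 0.9], [0.4, 0.9]])
```

Voxels with high variance across passes are the low-confidence regions that the abstract says the uncertainty estimate is meant to flag.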

[59] MRI motion correction via efficient residual-guided denoising diffusion probabilistic models

Mojtaba Safari,Shansong Wang,Qiang Li,Zach Eidex,Richard L. J. Qiu,Chih-Wei Chang,Hui Mao,Xiaofeng Yang

Main category: cs.CV

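TL;DR: Res-MoCoDiff is an efficient diffusion probabilistic model for MRI motion artifact correction; its residual error shifting mechanism and four-step reverse diffusion significantly improve image quality, and it outperforms conventional methods in both speed and quality.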

Details

Motivation: Motion artifacts in MRI severely degrade image quality and quantitative analysis, and conventional remedies are costly and workflow-intensive, so an efficient, high-performing solution is needed. Method: Res-MoCoDiff uses a residual error shifting mechanism and four-step reverse diffusion, combines a U-net with Swin-Transformer blocks, is trained with an l1+l2 loss, and is evaluated on synthetic and in-vivo datasets. Result: Res-MoCoDiff performs best at all motion severity levels, with the highest SSIM and lowest NMSE, a PSNR of up to 41.91 ± 2.94 dB, and sampling time reduced to 0.37 seconds per batch. Conclusion: Res-MoCoDiff is efficient and effective for MRI motion artifact correction and provides a practical tool for clinical and research use. Abstract: Purpose: Motion artifacts in magnetic resonance imaging (MRI) significantly degrade image quality and impair quantitative analysis. Conventional mitigation strategies, such as repeated acquisitions or motion tracking, are costly and workflow-intensive. This study introduces Res-MoCoDiff, an efficient denoising diffusion probabilistic model tailored for MRI motion artifact correction. Methods: Res-MoCoDiff incorporates a novel residual error shifting mechanism in the forward diffusion process, aligning the noise distribution with motion-corrupted data and enabling an efficient four-step reverse diffusion. A U-net backbone, with conventional attention layers replaced by Swin-Transformer blocks, improves adaptability across resolutions. Training employs a combined l1+l2 loss, which promotes image sharpness and reduces pixel-level errors. Res-MoCoDiff was evaluated on a synthetic dataset generated using a realistic motion simulation framework and on an in-vivo dataset. Comparative analyses were conducted against established methods, including CycleGAN, Pix2pix, and MT-DDPM, using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and normalized mean squared error (NMSE). Results: The proposed method demonstrated superior performance in removing motion artifacts across all motion severity levels. Res-MoCoDiff consistently achieved the highest SSIM and the lowest NMSE values, with a PSNR of up to 41.91 ± 2.94 dB for minor distortions. Notably, the average sampling time was reduced to 0.37 seconds per batch of two image slices, compared with 101.74 seconds for conventional approaches.
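
Illustrative sketch (not the authors' evaluation code): two of the reported metrics, PSNR and NMSE, can be computed on flattened images as below. The normalization used for NMSE (error energy divided by reference energy) and the default `max_value=1.0` are assumptions; the paper may use a different convention.

```python
import math

def mse(ref, img):
    """Mean squared error between two equally sized flat images."""
    return sum((r - x) ** 2 for r, x in zip(ref, img)) / len(ref)

def psnr(ref, img, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(ref, img)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)

def nmse(ref, img):
    """Squared error normalized by the energy of the reference image."""
    return sum((r - x) ** 2 for r, x in zip(ref, img)) / sum(r ** 2 for r in ref)

reference = [1.0, 0.0]
corrected = [0.9, 0.0]  # toy two-pixel example
psnr_db = psnr(reference, corrected)
nmse_value = nmse(reference, corrected)
```

Higher PSNR and lower NMSE both indicate that the corrected image is closer to the motion-free reference.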

[60] Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking

Shenglan Li,Rui Yao,Yong Zhou,Hancheng Zhu,Kunyang Sun,Bing Liu,Zhiwen Shao,Jiaqi Zhao

Main category: cs.CV

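TL;DR: GDSTrack introduces dynamic graph fusion and temporal diffusion to address pseudo-label noise and background interference in self-supervised RGB-T tracking, significantly improving performance.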

Details

Motivation: To reduce reliance on large-scale annotations while addressing the impact of erroneous pseudo-labels and background noise on modality fusion in self-supervised RGB-T tracking. Method: Uses a Modality-guided Dynamic Graph Fusion (MDGF) module and a Temporal Graph-Informed Diffusion (TGID) module to dynamically fuse the modalities of neighboring frames and exploit the denoising capability of a generative model. Result: Experiments on four public RGB-T datasets show that GDSTrack outperforms existing state-of-the-art methods. Conclusion: GDSTrack effectively improves the robustness and performance of self-supervised RGB-T tracking through dynamic graph fusion and temporal diffusion. Abstract: To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-labels or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.

[61] Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

Haotong Cheng,Zhiqi Zhang,Hao Li,Xinshang Zhang

Main category: cs.CV

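TL;DR: The paper introduces the concept of "Universality" and the Universality Assessment Equation (UAE) to quantify module transferability, and designs two optimized modules (CRB and DCRB) that outperform existing methods on multiple datasets.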

Details

Motivation: Existing research focuses mostly on performance gains and neglects quantifying module transferability. This paper aims to fill that gap and explore the relationship between module universality and model generalization. Method: Introduces the concept of "Universality" and the UAE, and designs the optimized modules CRB and DCRB based on the assessment results. Result: On multiple datasets, networks embedding the optimized modules outperform existing methods, with a PSNR gain of up to 0.83 dB or a 71.3% reduction in parameters with almost no loss in reconstruction quality. Conclusion: Quantifying module universality and optimizing module design significantly improves model performance and transferability. Abstract: Deep learning has substantially advanced Single Image Super-Resolution (SISR). However, existing research has predominantly focused on raw performance gains, with little attention paid to quantifying the transferability of architectural components. In this paper, we introduce the concept of "Universality" and its associated definitions, which extend the traditional notion of "Generalization" to encompass the modules' ease of transferability, thus revealing the relationships between module universality and model generalizability. Then we propose the Universality Assessment Equation (UAE), a metric for quantifying how readily a given module could be transplanted across models. Guided by the UAE results of standard residual blocks and other plug-and-play modules, we further design two optimized modules, Cycle Residual Block (CRB) and Depth-Wise Cycle Residual Block (DCRB). Through comprehensive experiments on natural-scene benchmarks, remote-sensing datasets, extreme-industrial imagery and on-device deployments, we demonstrate that networks embedded with the proposed plug-and-play modules outperform several state-of-the-art methods, reaching a PSNR enhancement of up to 0.83 dB or enabling a 71.3% reduction in parameters with negligible loss in reconstruction fidelity.

[62] Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

Chenguang Liu,Jianjun Chen,Yunfei Chen,Yubei He,Zhuangkun Wei,Hongjian Sun,Haiyan Lu,Qi Hao

Main category: cs.CV

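TL;DR: Proposes Coop-WD, a joint weighting and denoising framework that enhances cooperative perception under V2V communication impairments, along with an efficient variant, Coop-WD-eco, that reduces computational overhead.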

Details

Motivation: Existing work does not generalize across different levels of V2V communication impairment, which limits gains in cooperative perception. Method: Adopts a self-supervised contrastive model and a conditional diffusion probabilistic model for vehicle-level and pixel-level feature enhancement, and proposes Coop-WD-eco, which selectively deactivates denoising to reduce overhead. Result: Coop-WD outperforms conventional benchmarks in all channel types, and Coop-WD-eco cuts computational cost by up to 50% under severe distortion while maintaining accuracy. Conclusion: The Coop-WD framework effectively improves cooperative perception, and Coop-WD-eco balances computational efficiency and accuracy. Abstract: Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.

[63] RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT

Chuyu Zhao,Hao Huang,Jiashuo Guo,Ziyu Shen,Zhongwei Zhou,Jie Liu,Zekuan Yu

Main category: cs.CV

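TL;DR: RAIL is a dual-group, dual-student semi-supervised learning framework that uses region-aware instructive learning to address insufficient supervision and unreliable pseudo-labels in CBCT tooth segmentation.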

Details

Motivation: Existing semi-supervised methods for CBCT tooth segmentation suffer from insufficient supervision and unreliable pseudo-labels; RAIL aims to address both through a dual-group dual-student framework and region-aware mechanisms. Method: RAIL uses a dual-group dual-student framework with alternating training to promote knowledge transfer between groups, and introduces a DFS Controller and a CAL Modulator to improve supervised and unsupervised learning, respectively. Result: Experiments on four CBCT tooth segmentation datasets show that RAIL outperforms existing methods under limited annotation. Conclusion: Through region-aware instruction and the dual-group framework, RAIL effectively improves semi-supervised learning for CBCT tooth segmentation. Abstract: Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.
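
Illustrative sketch (not the RAIL implementation): the two mechanisms in the abstract can be pictured as per-voxel masks and weights. The DFS-style mask below selects voxels where a student disagrees with both the ground truth and the best student; the CAL-style weighting gives full weight to confident voxels and a small weight to uncertain ones. The function names, the hard `threshold`, and the `low_weight` value are assumptions for illustration only.

```python
def dfs_mask(student_pred, best_pred, ground_truth):
    """1 where the student disagrees with BOTH the ground truth and the best student."""
    return [int(s != g and s != b)
            for s, b, g in zip(student_pred, best_pred, ground_truth)]

def confidence_weights(probabilities, threshold=0.8, low_weight=0.1):
    """Full weight for confident voxels, a small weight for uncertain ones (assumed rule)."""
    return [1.0 if max(p, 1.0 - p) >= threshold else low_weight
            for p in probabilities]

# Toy binary labels for four voxels.
mask = dfs_mask([1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0])
# Toy foreground probabilities for three unlabeled voxels.
weights = confidence_weights([0.95, 0.5, 0.1])
```

A supervised loss multiplied by `mask` would concentrate on the disagreement regions, and an unsupervised consistency loss multiplied by `weights` would down-weight low-confidence voxels.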

[64] Panoramic Out-of-Distribution Segmentation

Mengfei Duan,Kailun Yang,Yuheng Zhang,Yihong Cao,Fei Teng,Kai Luo,Jiaming Zhang,Zhiyong Li,Shutao Li

Main category: cs.CV

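TL;DR: The paper introduces the task of out-of-distribution segmentation in panoramic images (PanOoS) and the first solution, POS, which adapts to panoramic image characteristics through text-guided prompt distribution learning and significantly improves performance.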

Details

Motivation: Current panoramic semantic segmentation methods cannot identify outliers, and conventional out-of-distribution segmentation models perform poorly in the panoramic domain, so background clutter and pixel distortion must be addressed. Method: Proposes POS, which combines a disentanglement strategy, Prompt-based Restoration Attention (PRA), and Bilevel Prompt Distribution Learning (BPDL) to optimize semantic decoding and the embedding distribution. Result: On DenseOoS, POS improves AuPRC by 34.25% and reduces FPR95 by 21.42%, outperforming existing methods while also achieving leading closed-set segmentation. Conclusion: POS provides an effective solution for panoramic out-of-distribution segmentation, and the new DenseOoS and QuadOoS datasets address data scarcity. Abstract: Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), achieving OoS for panoramas. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
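
Illustrative sketch (not the PanOoS evaluation code): FPR95 is commonly defined as the false positive rate at the threshold where the true positive rate reaches 95%. The snippet below assumes per-pixel outlier scores where higher means more outlier-like, labels of 1 for outliers and 0 for inliers, and at least one pixel of each class; tie handling and the exact thresholding rule may differ in the paper's protocol.

```python
import math

def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the highest threshold that still reaches target_tpr."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    needed = math.ceil(target_tpr * len(positives))  # positives that must be detected
    threshold = positives[needed - 1]
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)

# Toy example: four outlier pixels and four inlier pixels.
scores = [0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.4, 0.6]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
fpr95 = fpr_at_tpr(scores, labels)
```

A lower FPR95 means fewer inlier pixels are mistaken for outliers when nearly all true outliers are detected, which is why the abstract reports a decrease as an improvement.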

[65] Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment

João Alves,Pia Haubro Andersen,Rikke Gade

Main category: cs.CV

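TL;DR: The paper proposes a deep learning and optical flow based approach to automatically detect and localize specific ear action units (AUs) in horse videos, aiming to address the scarcity of manually annotated data.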

Details

Motivation: Manually annotating equine facial action units (AUs) is time-consuming and costly, so annotated data is scarce, which holds back equine affective state assessment. Automated annotation systems are therefore needed to exploit existing data and improve detection tools. Method: Combines deep learning-based video feature extraction with recurrent neural networks for video classification, alongside a classic optical flow approach, to detect and localize ear AUs in horse videos. Result: Achieves 87.5% classification accuracy for ear movement presence on a public horse video dataset, demonstrating the potential of the approach. Conclusion: The approach offers a feasible route to automated AU detection and can be developed further for equine welfare and veterinary diagnostics. The code will be made public. Abstract: The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective state detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We leverage past works on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow based approach. We achieve 87.5% classification accuracy of ear movement presence on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at https://github.com/jmalves5/read-my-ears.

[66] Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID

Koray Ulusan,Benjamin Kiefer

Main category: cs.CV

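TL;DR: The study examines how augmentations can improve facial resemblance in portraits generated with Stable Diffusion, compares DreamBooth and InstantID, and introduces the FaceDistance evaluation tool.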

Details

Motivation: To explore how augmentations affect facial resemblance when generating professional portraits from amateur photographs, and to inform strategies for downstream applications. Method: Experimentally evaluates the effect of various augmentation strategies on portraits generated with DreamBooth and InstantID, and introduces the FaceDistance tool to quantify facial similarity. Result: Augmentations significantly improve the facial resemblance of generated portraits, and FaceDistance effectively supports the assessment. Conclusion: Augmentations play a key role in improving facial resemblance in Stable Diffusion portraits and point to practical optimization directions. Abstract: The personalization of Stable Diffusion for generating professional portraits from amateur photographs is a burgeoning area, with applications in various downstream contexts. This paper investigates the impact of augmentations on improving facial resemblance when using two prominent personalization techniques: DreamBooth and InstantID. Through a series of experiments with diverse subject datasets, we assessed the effectiveness of various augmentation strategies on the generated headshots' fidelity to the original subject. We introduce FaceDistance, a wrapper around FaceNet, to rank the generations based on facial similarity, which aided in our assessment. Ultimately, this research provides insights into the role of augmentations in enhancing facial resemblance in SDXL-generated portraits, informing strategies for their effective deployment in downstream applications.

[67] Real-Time Person Image Synthesis Using a Flow Matching Model

Jiwoo Jeong,Kirok Kim,Wooju Kim,Nam-Joon Kim

Main category: cs.CV

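TL;DR: PGPIS generates realistic person images from a target pose and a source image, but real-time performance is a challenge. RPFM, a flow matching-based model, balances speed and quality and achieves near-real-time generation.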

Details

Motivation: Real-time PGPIS is critical in scenarios such as AR/VR and live streaming, but existing diffusion models are too slow to meet real-time requirements. Method: Proposes RPFM, a generative model based on flow matching (FM) that supports conditional generation and latent-space operation and improves training and sampling efficiency. Result: RPFM achieves near-real-time speed on the DeepFashion dataset, with performance comparable to state-of-the-art models and more than a twofold speedup. Conclusion: RPFM trades a small loss in generation accuracy for speed, offering a feasible solution for real-time interactive systems. Abstract: Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user immersion. However, achieving real-time performance remains a significant challenge due to the complexity of synthesizing high-fidelity images from diverse and dynamic human poses. Recent diffusion-based methods have shown impressive image quality in PGPIS, but their slow sampling speeds hinder deployment in time-sensitive applications. This latency is particularly problematic in tasks like generating sign language videos during live broadcasts, where rapid image updates are required. Therefore, developing a fast and reliable PGPIS model is a crucial step toward enabling real-time interactive systems. To address this challenge, we propose a generative model based on flow matching (FM). Our approach enables faster, more stable, and more efficient training and sampling. Furthermore, the proposed model supports conditional generation and can operate in latent space, making it especially suitable for real-time PGPIS applications where both speed and quality are critical. We evaluate our proposed method, Real-Time Person Image Synthesis Using a Flow Matching Model (RPFM), on the widely used DeepFashion dataset for PGPIS tasks. Our results show that RPFM achieves near-real-time sampling speeds while maintaining performance comparable to the state-of-the-art models.
Our methodology trades off a slight, acceptable decrease in generated-image accuracy for over a twofold increase in generation speed, thereby ensuring real-time performance.
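
Illustrative sketch (not the RPFM model): sampling from a flow matching model amounts to integrating an ordinary differential equation dx/dt = v(x, t) from noise at t = 0 to data at t = 1, usually with only a few steps, which is where the speed advantage over many-step diffusion sampling comes from. In the toy below the learned network is replaced by a closed-form velocity for a single scalar target; `euler_sample` and `make_toy_velocity` are hypothetical names.

```python
def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

def make_toy_velocity(target):
    """Exact velocity along the straight path x_t = (1 - t) * x0 + t * target."""
    return lambda x, t: (target - x) / (1.0 - t)

# Start from a "noise" value of -1.0 and transport it to the target 3.0.
sample = euler_sample(make_toy_velocity(3.0), x0=-1.0, num_steps=8)
```

In a real model the velocity would be a neural network conditioned on the target pose and source image; fewer Euler steps give faster sampling at the cost of integration error, which mirrors the speed and accuracy trade-off described above.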

[68] Uncertainty-Aware Prototype Semantic Decoupling for Text-Based Person Search in Full Images

Zengli Luo,Canlong Zhang,Xiaochun Lu,Zhixin Li,Zhiwen Wang

Main category: cs.CV

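TL;DR: The UPD-TBPS framework improves text-based person search in complex scenes through three modules: multi-granularity uncertainty estimation, prototype-based uncertainty decoupling, and cross-modal re-identification.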

Details

Motivation: In complex scenes, existing methods degrade because of uncertainty in detection and matching, which needs to be addressed. Method: Proposes the UPD-TBPS framework, comprising MUE (Multi-granularity Uncertainty Estimation), PUD (Prototype-based Uncertainty Decoupling), and ReID (Cross-modal Re-identification) modules. Result: The framework's effectiveness is validated on the CUHK-SYSU-TBPS and PRW-TBPS datasets. Conclusion: By reducing uncertainty, UPD-TBPS improves the accuracy and robustness of text-based person search. Abstract: Text-based pedestrian search (TBPS) in full images aims to locate a target pedestrian in untrimmed images using natural language descriptions. However, in complex scenes with multiple pedestrians, existing methods are limited by uncertainties in detection and matching, leading to degraded performance. To address this, we propose UPD-TBPS, a novel framework comprising three modules: Multi-granularity Uncertainty Estimation (MUE), Prototype-based Uncertainty Decoupling (PUD), and Cross-modal Re-identification (ReID). MUE conducts multi-granularity queries to identify potential targets and assigns confidence scores to reduce early-stage uncertainty. PUD leverages visual context decoupling and prototype mining to extract features of the target pedestrian described in the query. It separates and learns pedestrian prototype representations at both the coarse-grained cluster level and the fine-grained individual level, thereby reducing matching uncertainty. ReID evaluates candidates with varying confidence levels, improving detection and retrieval accuracy. Experiments on CUHK-SYSU-TBPS and PRW-TBPS datasets validate the effectiveness of our framework.

[69] Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

Mishal Fatima,Steffen Jung,Margret Keuper

Main category: cs.CV

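TL;DR: The paper studies how image backgrounds contribute to spurious correlations in model predictions, proposes the synthetic dataset Hard-Spurious-ImageNet, and finds that models rely more on background features when the object occupies a small region or is far from the center.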

Details

Motivation: To examine how positional and size biases in image datasets lead models to rely on spurious background features and thereby affect prediction accuracy. Method: Proposes the synthetic dataset Hard-Spurious-ImageNet and evaluates different pretrained models under varying backgrounds, object positions, and object sizes. Result: Models rely more on background features when the object region is small or far from the center, and existing methods fail to significantly improve worst-group performance. Conclusion: Background features strongly affect model predictions, and existing methods need further improvement to handle changes in the object region. Abstract: Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features do not take these factors into account and hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change.

[70] Supervised and Unsupervised Textile Classification via Near-Infrared Hyperspectral Imaging and Deep Learning

Maria Kainz,Johannes K. Krondorfer,Malte Jaschik,Maria Jernej,Harald Ganster

Main category: cs.CV

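TL;DR: Uses hyperspectral near-infrared imaging and deep learning to improve textile fiber classification and make sustainable recycling more efficient.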

Details

Motivation: Recycling textile fibers is critical to reducing environmental impact and requires efficient classification methods. Method: Investigates supervised and unsupervised deep learning models and tests their generalization on different textile structures. Result: Optimized convolutional neural networks and autoencoder networks generalize robustly under varying conditions. Conclusion: Combining hyperspectral imaging with deep learning can make textile recycling more accurate and robust. Abstract: Recycling textile fibers is critical to reducing the environmental impact of the textile industry. Hyperspectral near-infrared (NIR) imaging combined with advanced deep learning algorithms offers a promising solution for efficient fiber classification and sorting. In this study, we investigate supervised and unsupervised deep learning models and test their generalization capabilities on different textile structures. We show that optimized convolutional neural networks (CNNs) and autoencoder networks achieve robust generalization under varying conditions. These results highlight the potential of hyperspectral imaging and deep learning to advance sustainable textile recycling through accurate and robust classification.

[71] DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

Sergey Linok,Vadim Semenov,Anastasia Trunova,Oleg Bulichev,Dmitry Yudin

Main category: cs.CV

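TL;DR: DyGEnc is a dynamic graph encoding method that combines spatial-temporal structural representations with large language models, substantially improving event analysis in dynamic environments.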

Details

Motivation: Existing visual models lack interpretable spatial-temporal object representations in dynamic environments. Method: Proposes DyGEnc, which integrates a compressed spatial-temporal structural representation with large language models to support advanced question answering over sequences of textual scene graphs. Result: On the STAR and AGQA datasets, DyGEnc outperforms existing visual methods by 15-25%. Conclusion: DyGEnc offers an effective approach to graph-based robotic memory and long-horizon reasoning. Abstract: The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue, we introduce DyGEnc, a novel method for Encoding a Dynamic Graph. This method integrates a compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at github.com/linukc/DyGEnc.

[72] Fixed-Length Dense Fingerprint Representation

Zhiyu Pan,Xiongjun Guan,Yongjie Duan,Jianjiang Feng,Jie Zhou

Main category: cs.CV

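TL;DR: FLARE proposes a fixed-length dense fingerprint descriptor that, combined with pose-based alignment and robust enhancement, significantly improves matching for cross-modality and low-quality fingerprints.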

Details

Motivation: Fixed-length fingerprint representations still struggle with diverse modalities, pose variations, and noise interference, so a more robust and efficient solution is needed. Method: Uses a three-dimensional dense descriptor to capture spatial relationships among fingerprint ridge structures, combined with pose-based alignment and a dual enhancement strategy to refine the feature representation. Result: FLARE performs well across multiple fingerprint types and in low-quality scenarios, significantly outperforming existing methods. Conclusion: FLARE is a unified and scalable solution for fingerprint representation and matching that is both efficient and general. Abstract: Fixed-length fingerprint representations, which map each fingerprint to a compact and fixed-size feature vector, are computationally efficient and well-suited for large-scale matching. However, designing a robust representation that effectively handles diverse fingerprint modalities, pose variations, and noise interference remains a significant challenge. In this work, we propose a fixed-length dense descriptor of fingerprints, and introduce FLARE, a fingerprint matching framework that integrates the Fixed-Length dense descriptor with pose-based Alignment and Robust Enhancement. This fixed-length representation employs a three-dimensional dense descriptor to effectively capture spatial relationships among fingerprint ridge structures, enabling robust and locally discriminative representations. To ensure consistency within this dense feature space, FLARE incorporates pose-based alignment using complementary estimation methods, along with dual enhancement strategies that refine ridge clarity while preserving the original fingerprint modality. The proposed dense descriptor supports fixed-length representation while maintaining spatial correspondence, enabling fast and accurate similarity computation. Extensive experiments demonstrate that FLARE achieves superior performance across rolled, plain, latent, and contactless fingerprints, significantly outperforming existing methods in cross-modality and low-quality scenarios. Further analysis validates the effectiveness of the dense descriptor design, as well as the impact of alignment and enhancement modules on the accuracy of dense descriptor matching. Experimental results highlight the effectiveness and generalizability of FLARE as a unified and scalable solution for robust fingerprint representation and matching.
The implementation and code will be publicly available at https://github.com/Yu-Yy/FLARE.

[73] From Pixels to Polygons: A Survey of Deep Learning Approaches for Medical Image-to-Mesh Reconstruction

Fengming Lin,Arezoo Zakeri,Yidan Xue,Michael MacRaild,Haoran Dou,Zherui Zhou,Ziwei Zou,Ali Sarrami-Foroushani,Jinming Duan,Alejandro F. Frangi

Main category: cs.CV

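TL;DR: This survey systematically categorizes deep learning-based medical image-to-mesh reconstruction methods, analyzes four main model families, evaluates their applications and performance, and discusses current challenges and future directions.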

Details

Motivation: To advance medical image analysis by providing more accurate three-dimensional mesh models for computational medicine and in silico trials, supporting research on disease mechanisms and progress in diagnostic and therapeutic techniques. Method: Groups existing methods into four categories (template models, statistical models, generative models, and implicit models), analyzes their methodological foundations, strengths, limitations, and applicability in detail, and assesses performance through quantitative comparisons and public datasets. Result: Summarizes the performance of each category and identifies challenges including topological correctness, geometric accuracy, and multi-modality integration. Conclusion: Provides a comprehensive reference for researchers in medical image analysis and computational medicine, and proposes future research directions. Abstract: Deep learning-based medical image-to-mesh reconstruction has rapidly evolved, enabling the transformation of medical imaging data into three-dimensional mesh models that are critical in computational medicine and in silico trials for advancing our understanding of disease mechanisms, and diagnostic and therapeutic techniques in modern medicine. This survey systematically categorizes existing approaches into four main categories: template models, statistical models, generative models, and implicit models. Each category is analysed in detail, examining their methodological foundations, strengths, limitations, and applicability to different anatomical structures and imaging modalities. We provide an extensive evaluation of these methods across various anatomical applications, from cardiac imaging to neurological studies, supported by quantitative comparisons using standard metrics. Additionally, we compile and analyze major public datasets available for medical mesh reconstruction tasks and discuss commonly used evaluation metrics and loss functions. The survey identifies current challenges in the field, including requirements for topological correctness, geometric accuracy, and multi-modality integration. Finally, we present promising future research directions in this domain. This systematic review aims to serve as a comprehensive reference for researchers and practitioners in medical image analysis and computational medicine.

[74] PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

Y. B. Wang,S. Z. Zhou,J. F. Wu,T. Hu,J. N. Zhang,Y. Liu

Main category: cs.CV

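TL;DR: PAHA is an end-to-end, diffusion-based framework for audio-driven upper-body human animation that improves generation quality and audio-motion consistency through PAR and PCE, and introduces the SG and DG inference guidance methods.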

Details

Motivation: Most existing methods rely on multi-stage generation and intermediate representations, leading to long inference times and shortcomings in generation quality and audio-motion consistency. Method: Proposes PAR, which dynamically adjusts regional training loss weights, and PCE, which builds diffusion-based regional audio-visual classifiers, and designs the SG and DG inference guidance methods. Result: PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. Conclusion: PAHA addresses these problems through localized fine-grained supervised guidance, and the authors release CNAS, the first public Chinese News Anchor Speech dataset. Abstract: Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address the above challenges, we propose PAHA, an end-to-end audio-driven upper-body human animation framework with a diffusion model. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.

[75] Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection

Fangling Jiang,Qi Li,Bing Liu,Weining Wang,Caifeng Shan,Zhenan Sun,Ming-Hsuan Yang

Main category: cs.CV

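TL;DR: Proposes a knowledge-graph-based prompt learning framework for 3D mask attack detection that combines vision-language models with causal graph theory to improve generalization.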

Details

Motivation: Existing methods are costly and generalize poorly, while the potential of vision-language multimodal features remains unexplored. Method: Incorporates knowledge graph entities and triples to generate task-specific prompts, and introduces a visual-specific knowledge filter together with optimization guided by causal graph theory. Result: Achieves state-of-the-art cross-scenario detection performance on benchmark datasets. Conclusion: Through knowledge-driven prompting and causal optimization, the method significantly improves the generalization of 3D mask attack detection. Abstract: 3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering that different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we bring insights from causal graph theory into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches.
Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.

[76] Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images

Fangling Jiang,Qi Li,Weining Wang,Wei Shen,Bing Liu,Zhenan Sun

Main category: cs.CV

TL;DR: 论文提出了一种基于视觉语言模型的新方法，通过生成文本提示来提升人脸防伪的泛化能力，解决了协变量偏移和语义偏移问题。

Details

Motivation: Face anti-spoofing generalizes poorly in practice, mainly because of covariate shift and semantic shift. Method: Uses vision-language models to generate textual prompts for real faces and potential unknown attacks, and learns effective prompts through a diverse spoof prompt optimization framework. Result: Experiments on nine datasets show that the method achieves strong generalization without using any spoof face images. Conclusion: By learning textual prompts, the method significantly improves the generalization of face anti-spoofing systems to unknown attack types. Abstract: Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.

Yiping Xie,Bo Zhao,Mingtong Dai,Jian-Ping Zhou,Yue Sun,Tao Tan,Weicheng Xie,Linlin Shen,Zitong Yu

Main category: cs.CV

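TL;DR: The PhysLLM framework combines LLMs with rPPG components and, through cross-modal alignment and dual-domain feature re-weighting, improves the accuracy and robustness of non-contact physiological measurement.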

Details

Motivation: rPPG is vulnerable to illumination changes and motion artifacts; LLMs are good at modeling long-range dependencies but cannot directly handle noise-sensitive rPPG signals. Method: Proposes the Text Prototype Guidance (TPG) strategy for cross-modal alignment and designs the Dual-Domain Stationary (DDS) algorithm for time-frequency domain feature re-weighting. Result: On four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, standing out under illumination changes and motion. Conclusion: By combining LLMs with domain knowledge, PhysLLM significantly improves rPPG performance and suggests a new direction for physiological monitoring in dynamic environments. Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution, but struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information, enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluated on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.

[78] Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map

Alessandro Simoni,Francesco Pelosin

Main category: cs.CV

TL;DR: Proposes a new diffusion-based method for generating high-fidelity industrial defect datasets, reducing annotation cost.

Details

Motivation: Industrial defect segmentation needs highly accurate labeled data, which is costly and time-consuming to obtain. Method: Conditions a diffusion model on bounding box representations to generate precise segmentation masks. Result: Improves defect consistency and spatial accuracy over existing methods, with effectiveness confirmed on a downstream task. Conclusion: Diffusion-based synthesis can narrow the gap between artificial and real industrial data, making segmentation models more reliable and cost-efficient. Abstract: Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.

[79] Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision

Linhan Cao,Wei Sun,Kaiwei Zhang,Yicong Peng,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

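TL;DR: Proposes a self-supervised learning framework for video quality assessment (VQA) that learns to assess quality from large-scale unlabeled web videos, significantly improving generalization.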

Details

Motivation: Existing supervised VQA models depend on manually annotated datasets, which are costly and hard to scale, limiting generalization to unseen video content and distortions. Method: Trains a multimodal model under a learning-to-rank paradigm, labeling data automatically with pseudo-labels from existing VQA models and with relative quality rankings derived from synthetic distortion simulation, and introduces an iterative self-improvement training strategy. Result: The model matches or surpasses supervised models in the zero-shot setting, shows superior out-of-distribution generalization, and sets a new state of the art after fine-tuning. Conclusion: The self-supervised approach effectively trains VQA models that generalize well, and the datasets and code will support future research. Abstract: Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets, a process that is labor-intensive, costly, and difficult to scale up, has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel iterative self-improvement training strategy, where the trained model acts as an improved annotator to iteratively refine the annotation quality of training data. By training on a dataset 10× larger than the existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.
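The abstract above names a learning-to-rank paradigm over labeled video pairs but does not state the loss. As a rough illustration, a common pairwise choice is the logistic (RankNet-style) loss on the difference of two predicted quality scores. The sketch below assumes that loss; `pairwise_rank_loss` and its arguments are hypothetical names, and this is not claimed to be the authors' objective.

```python
import math

def pairwise_rank_loss(score_a, score_b, a_is_better):
    """Logistic pairwise ranking loss (assumed form, not the paper's).

    score_a, score_b: predicted quality scores for two videos.
    a_is_better: True if the (pseudo-)label says video A has higher quality.
    Returns log(1 + exp(-d)), where d is the score margin in the labeled
    direction, evaluated in a numerically stable way.
    """
    diff = score_a - score_b
    if not a_is_better:
        diff = -diff
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss is small when the model orders the pair as labeled and grows roughly linearly with the margin when the order is wrong, so only relative labels are needed, which fits pairs ranked by synthetic distortion level.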

[80] Towards Smart Point-and-Shoot Photography

Jiawan Li,Fei Zhou,Zhipeng Zhong,Jiongzhi Lin,Guoping Qiu

Main category: cs.CV

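TL;DR: Presents a smart point-and-shoot (SPAS) system that helps users take better photos by adjusting the camera pose in real time. The system comprises a CLIP-based Composition Quality Assessment (CCQA) model and a camera pose adjustment model (CPAM).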

Details

Motivation: Traditional point-and-shoot cameras cannot tell users how to compose a shot, and most smartphone users lack photography skills, so a system that guides camera pose adjustment in real time is valuable. Method: (1) Builds a dataset of 320K images; (2) develops the CCQA model, which assesses composition quality using a learnable text embedding technique; (3) develops the CPAM model, a mixture-of-experts model trained with a gated loss function that outputs camera pose adjustment suggestions. Result: The system performs well on public datasets and effectively guides users to adjust the camera pose for better composition. Conclusion: With its CCQA and CPAM models, SPAS delivers real-time composition guidance and gives ordinary users professional-level photography assistance. Abstract: Hundreds of millions of people routinely take photos using their smartphones as point and shoot (PAS) cameras, yet very few have the photography skills to compose a good shot of a scene. While traditional PAS cameras have built-in functions to ensure a photo is well focused and has the right brightness, they cannot tell users how to compose the best shot of a scene. In this paper, we present a first-of-its-kind smart point and shoot (SPAS) system to help users take good photos. Our SPAS helps users compose a good shot of a scene by automatically guiding them to adjust the camera pose live on the scene. We first constructed a large dataset containing 320K images with camera pose information from 4000 scenes. We then developed an innovative CLIP-based Composition Quality Assessment (CCQA) model to assign pseudo labels to these images. The CCQA introduces a unique learnable text embedding technique to learn continuous word embeddings capable of discerning subtle visual quality differences in the range covered by five levels of quality description words {bad, poor, fair, good, perfect}. Finally, we developed a camera pose adjustment model (CPAM), which first determines whether the current view can be further improved and, if so, outputs the adjustment suggestion in the form of two camera pose adjustment angles. Because the two tasks of CPAM make decisions sequentially and each involves a different set of training samples, we developed a mixture-of-experts model with a gated loss function to train the CPAM in an end-to-end manner.
We will present extensive results to demonstrate the performances of our SPAS system using publicly available image composition datasets.
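The abstract above describes scoring composition quality against five quality words {bad, poor, fair, good, perfect}. One simple way to turn five image-text similarities into a single continuous score is a softmax-weighted average of the level indices. The sketch below shows that idea only; it is an assumed illustration, not the CCQA model, and `expected_quality`, `LEVELS`, and `temperature` are hypothetical names.

```python
import math

# The five quality words quoted in the abstract, mapped to scores 1..5.
LEVELS = ["bad", "poor", "fair", "good", "perfect"]

def expected_quality(similarities, temperature=1.0):
    """Collapse five image-text similarity scores into one score in [1, 5].

    similarities: one similarity per entry of LEVELS (for example, cosine
    similarities between an image embedding and each level's text embedding).
    A softmax turns them into probabilities; the result is the expected level.
    """
    if len(similarities) != len(LEVELS):
        raise ValueError("expected one similarity per quality level")
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return sum((i + 1) * e / total for i, e in enumerate(exps))
```

Equal similarities give the midpoint score of 3, and the score moves toward 5 as the "perfect" similarity dominates.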

[81] ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

Yifan Xiang,Zhenxi Zhang,Bin Li,Yixuan Weng,Shoujun Zhou,Yangfan He,Keqin Li

Main category: cs.CV

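TL;DR: Proposes the ReGraP dataset and the ReGraP-LLaVA model to address the weaknesses of existing personalized MLLMs in relational reasoning and multi-concept learning, using graph prompting to improve performance.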

Details

Motivation: Existing personalized multimodal large language models (MLLMs) cannot learn and reason over relations among multiple objects, and their experiments are limited to single-concept tasks. Method: Introduces the ReGraP dataset (images, knowledge graphs, and reasoning QA pairs) and designs the ReGraP-LLaVA model, which uses soft and hard graph prompting to align knowledge graphs. Result: ReGraP-LLaVA performs strongly on relational reasoning and knowledge-connection tasks, reaching state-of-the-art performance. Conclusion: The ReGraP dataset and model effectively improve MLLMs' multi-concept relational reasoning and open a new direction for future research. Abstract: Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. In this respect, existing methods may face three main limitations: (1) their training data lacks multi-object sets in which relations among objects are learnable; (2) building on the limited training data, their models overlook the relations between different personalized concepts and fail to reason over them; (3) their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address these limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs.
Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving the SoTA performance compared with the competitive methods. All the codes and datasets are released at: https://github.com/xyfyyds/ReGraP.

[82] Revolutionizing Brain Tumor Imaging: Generating Synthetic 3D FA Maps from T1-Weighted MRI using CycleGAN Models

Xin Du,Francesca M. Cozzi,Rajesh Jena

Main category: cs.CV

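TL;DR: Proposes a CycleGAN-based method that generates FA maps directly from T1-weighted MRI scans, addressing the spatial misalignment between FA maps and tractography atlases.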

Details

Motivation: FA and DEC maps are essential for assessing white matter integrity, but spatial misalignment between FA maps and tractography atlases hinders their effective integration into predictive models. Method: Uses a CycleGAN trained on unpaired data to generate FA maps directly from T1-weighted MRI. Result: The generated FA maps have high fidelity, with especially strong results in tumour regions and good SSIM and PSNR scores. Conclusion: The method offers clinical workflows an AI-driven alternative that reduces the need for additional scans and has potential clinical value. Abstract: Fractional anisotropy (FA) and directionally encoded colour (DEC) maps are essential for evaluating white matter integrity and structural connectivity in neuroimaging. However, the spatial misalignment between FA maps and tractography atlases hinders their effective integration into predictive models. To address this issue, we propose a CycleGAN-based approach for generating FA maps directly from T1-weighted MRI scans, representing the first application of this technique to both healthy and tumour-affected tissues. Our model, trained on unpaired data, produces high-fidelity maps, which have been rigorously evaluated using Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), demonstrating particularly robust performance in tumour regions. Radiological assessments further underscore the model's potential to enhance clinical workflows by providing an AI-driven alternative that reduces the necessity for additional scans.
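Of the two metrics named above, PSNR has a short closed form: PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and the generated map. The sketch below applies that standard definition to flattened intensity lists; the function name `psnr` and the assumption that intensities are scaled to [0, max_value] are illustrative, not taken from the paper.

```python
import math

def psnr(reference, generated, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images.

    reference, generated: flat sequences of voxel/pixel intensities.
    max_value: largest possible intensity (assumed 1.0 for normalized maps).
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values mean the synthetic map is closer to the reference; restricting both sequences to voxels inside a tumour mask would give a region-specific score.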

[83] Distribution-Conditional Generation: From Class Distribution to Creative Generation

Fu Feng,Yucheng Xie,Xu Yang,Jing Wang,Xin Geng

Main category: cs.CV

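TL;DR: Proposes DisTok, a new method that achieves creative image synthesis through distribution-conditional generation and a dynamic concept pool, going beyond the limits of conventional text-to-image models.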

Details

Motivation: Conventional text-to-image models depend on the training data distribution and cannot generate truly novel, out-of-distribution concepts. Existing methods usually boost creativity by combining known concepts but remain confined to the existing semantic space. Method: Proposes Distribution-Conditional Generation, which models creativity as image synthesis conditioned on class distributions. The DisTok framework maps class distributions into a latent space through an encoder-decoder and generates creative concepts through dynamic iteration. Result: DisTok achieves strong text-image alignment and human preference scores and enables efficient token-level generation. Conclusion: Through distribution-conditioned fusion and sampling-based synthesis, DisTok delivers flexible and efficient creative image generation with state-of-the-art performance. Abstract: Text-to-image (T2I) diffusion models are effective at producing semantically aligned images, but their reliance on training data distributions limits their ability to synthesize truly novel, out-of-distribution concepts. Existing methods typically enhance creativity by combining pairs of known concepts, yielding compositions that, while out-of-distribution, remain linguistically describable and bounded within the existing semantic space. Inspired by the soft probabilistic outputs of classifiers on ambiguous inputs, we propose Distribution-Conditional Generation, a novel formulation that models creativity as image synthesis conditioned on class distributions, enabling semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that maps class distributions into a latent space and decodes them into creative concept tokens. DisTok maintains a dynamic concept pool and iteratively samples and fuses concept pairs, enabling the generation of tokens aligned with increasingly complex class distributions. To enforce distributional consistency, latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, whose class distributions, predicted by a vision-language model, supervise the alignment between input distributions and the visual semantics of generated tokens. The resulting tokens are added to the concept pool for subsequent composition.
Extensive experiments demonstrate that DisTok, by unifying distribution-conditioned fusion and sampling-based synthesis, enables efficient and flexible token-level generation, achieving state-of-the-art performance with superior text-image alignment and human preference scores.

[84] CaRaFFusion: Improving 2D Semantic Segmentation with Camera-Radar Point Cloud Fusion and Zero-Shot Image Inpainting

Huawei Sun,Bora Kunter Sahin,Georg Stettinger,Maximilian Bernhard,Matthias Schubert,Robert Wille

Main category: cs.CV

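TL;DR: Proposes a new camera-radar fusion framework that uses a diffusion model and pseudo-mask generation to improve semantic segmentation under adverse weather.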

Details

Motivation: Camera sensors degrade in adverse weather, while radar data is sparse and noisy; fusing the two can improve environmental perception. Method: Generates pseudo-masks from radar point features, refines them with a noise reduction unit, and produces inpainted images that fill in information missing from the original images. Result: On the Waterscenes dataset, the camera-only segmentation baseline improves by 2.63% and the camera-radar fusion architecture by 1.48%. Conclusion: The method effectively improves semantic segmentation under adverse weather, demonstrating the potential of camera-radar fusion. Abstract: Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera-only baselines by integrating a diffusion model into a camera-radar fusion architecture. We leverage radar point features to create pseudo-masks using the Segment-Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo-masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera-only segmentation baseline by 2.63% in mIoU and enhances our camera-radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera-radar fusion under adverse weather conditions.
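The gains above are reported in mIoU, the mean over classes of intersection over union between predicted and ground-truth labels. The sketch below computes it for flattened label lists under one common convention, namely skipping classes that appear in neither prediction nor ground truth; that convention and the name `mean_iou` are assumptions, and benchmark toolkits may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label sequences.

    pred, target: equal-length sequences of integer class labels per pixel.
    Classes absent from both pred and target are skipped (assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Because each class contributes equally regardless of pixel count, small or rare classes weigh as much as large ones in this metric.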

[85] Matching Distance and Geometric Distribution Aided Learning Multiview Point Cloud Registration

Shiqi Li,Jihua Zhu,Yifan Xie,Naiwen Hu,Di Wang

Main category: cs.CV

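TL;DR: Proposes a network-based method for multiview point cloud registration that uses matching distance to select reliable pairs for pose graph construction and computes absolute poses in a data-driven manner.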

Details

Motivation: Multiview point cloud registration is essential in robotics, automation, and computer vision, but existing methods are not reliable enough in pose graph construction and motion synchronization. Method: Designs two neural network models: one extracts matching distance information between point cloud pairs to build a reliable pose graph, and the other computes absolute poses in a data-driven manner, using geometric distribution information and a modified attention mechanism. Result: Experiments on diverse indoor and outdoor datasets confirm the method's effectiveness and generalizability. Conclusion: The method significantly improves the reliability of multiview point cloud registration, and the code is open source. Abstract: Multiview point cloud registration plays a crucial role in robotics, automation, and computer vision fields. This paper concentrates on pose graph construction and motion synchronization within multiview registration. Previous methods for pose graph construction often pruned fully connected graphs or constructed sparse graphs using global features aggregated from local descriptors, which may not consistently yield reliable results. To identify dependable pairs for pose graph construction, we design a network model that extracts information from the matching distance between point cloud pairs. For motion synchronization, we propose another neural network model to calculate the absolute pose in a data-driven manner, rather than optimizing inaccurate handcrafted loss functions. Our model takes into account geometric distribution information and employs a modified attention mechanism to facilitate flexible and reliable feature interaction. Experimental results on diverse indoor and outdoor datasets confirm the effectiveness and generalizability of our approach. The source code is available at https://github.com/Shi-Qi-Li/MDGD.

[86] Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

François Role,Sébastien Meyer,Victor Amblard

Main category: cs.CV

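TL;DR: Proposes new methods and measures (based on spectral methods and optimal transport) to assess and reduce the modality gap in vision-language models, with experiments demonstrating their effectiveness.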

Details

Motivation: Vision-language models suffer from a modality gap that hurts downstream task performance, and generic, practical methods to assess and address it are lacking. Method: Proposes spectral- and optimal-transport-based techniques to assess and reduce the modality gap. Result: The methods are validated on several image-text datasets and models and improve downstream task performance. Conclusion: The new methods effectively address the modality gap and improve vision-language model performance. Abstract: Vision-language models (VLMs) make it possible to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning there exists a clear separation between the embeddings from one modality and another in the embedding space. While this misalignment is detrimental for downstream tasks such as multimodal retrieval, multimodal clustering, or zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely, let alone reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
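To make the modality gap concrete, a simple baseline measure is the distance between the centroids of L2-normalized image and text embeddings. The sketch below implements only that baseline; it is not one of the spectral or optimal-transport measures the abstract proposes, and the names `l2_normalize`, `centroid`, and `centroid_gap` are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the two modalities.

    Embeddings are L2-normalized first, as is usual for CLIP-style models.
    A value near 0 means the two clouds of embeddings share a center.
    """
    image_center = centroid([l2_normalize(v) for v in image_embs])
    text_center = centroid([l2_normalize(v) for v in text_embs])
    return math.dist(image_center, text_center)
```

A centroid distance captures only a global shift between modalities; it says nothing about how the two embedding clouds differ in shape, which is one reason richer measures are of interest.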

[87] DISARM++: Beyond scanner-free harmonization

Luca Caldera,Lara Cavinato,Alessio Cirone,Isabella Cama,Sara Garbarino,Raffaele Lodi,Fabrizio Tagliavini,Anna Nigri,Silvia De Francesco,Andrea Cappozzo,Michele Piana,Francesca Ieva

Main category: cs.CV

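TL;DR: Proposes a novel method for direct cross-scanner harmonization of T1-weighted MR images that keeps extracted features reliable for downstream analysis, with its advantages validated across several applications.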

Details

Motivation: To resolve inconsistencies among T1-weighted MR images acquired on different scanners and ensure reliable, consistent neuroimaging studies. Method: Performs image transfer in two ways: (1) mapping to a scanner-free space and (2) transforming into the domain of a specific scanner. The method generalizes strongly and requires no preprocessing steps. Result: Outperforms existing methods in applications such as brain age prediction (R2 = 0.60) and AD classification (accuracy 0.86), without retraining for new datasets. Conclusion: The method provides an efficient and reliable cross-scanner harmonization solution for neuroimaging research, suitable for a range of applications. Abstract: Harmonization of T1-weighted MR images across different scanners is crucial for ensuring consistency in neuroimaging studies. This study introduces a novel approach to direct image harmonization, moving beyond feature standardization to ensure that extracted features remain inherently reliable for downstream analysis. Our method enables image transfer in two ways: (1) mapping images to a scanner-free space for uniform appearance across all scanners, and (2) transforming images into the domain of a specific scanner used in model training, embedding its unique characteristics. Our approach presents strong generalization capability, even for unseen scanners not included in the training phase. We validated our method using MR images from diverse cohorts, including healthy controls, traveling subjects, and individuals with Alzheimer's disease (AD). The model's effectiveness is tested in multiple applications, such as brain age prediction (R2 = 0.60 ± 0.05), biomarker extraction, AD classification (Test Accuracy = 0.86 ± 0.03), and diagnosis prediction (AUC = 0.95). In all cases, our harmonization technique outperforms state-of-the-art methods, showing improvements in both reliability and predictive accuracy. Moreover, our approach eliminates the need for extensive preprocessing steps, such as skull-stripping, which can introduce errors by misclassifying brain and non-brain structures. This makes our method particularly suitable for applications that require full-head analysis, including research on head trauma and cranial deformities. Additionally, our harmonization model does not require retraining for new datasets, allowing smooth integration into various neuroimaging workflows.
By ensuring scanner-invariant image quality, our approach provides a robust and efficient solution for improving neuroimaging studies across diverse settings. The code is available at this link.

[88] FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Shiyi Zhang,Junhao Zhuang,Zhaoyang Zhang,Ying Shan,Yansong Tang

Main category: cs.CV

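TL;DR: FlexiAct proposes a new action customization method that uses RefAdapter and FAE to transfer actions across diverse layouts, skeletons, and viewpoints while preserving identity consistency.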

Details

Motivation: Existing action customization methods are bound by strict spatial-structure constraints (consistent layout, skeleton, and viewpoint), which limits their adaptability and flexibility. Method: FlexiAct combines RefAdapter (a lightweight image-conditioned adapter) with FAE (Frequency-aware Action Extraction), extracting the action directly during denoising without separate spatial-temporal architectures. Result: Experiments show that FlexiAct transfers actions to target images with different layouts, skeletons, and viewpoints, outperforming existing methods. Conclusion: Through its spatial adaptation and consistency preservation techniques, FlexiAct makes action customization more flexible and adaptable. Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/

[89] Multi-Agent System for Comprehensive Soccer Understanding

Jiayuan Rao,Zifeng Li,Haoning Wu,Ya Zhang,Yanfeng Wang,Weidi Xie

Main category: cs.CV

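TL;DR: Proposes a comprehensive framework for soccer understanding: the multimodal knowledge base SoccerWiki, the large-scale benchmark SoccerBench, and the multi-agent system SoccerAgent, together with extensive evaluation.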

Details

Motivation: Existing research focuses on isolated or narrow tasks and lacks comprehensive coverage of soccer understanding. Method: Build the SoccerWiki knowledge base and the SoccerBench benchmark, and design the SoccerAgent multi-agent system. Result: SoccerAgent performs strongly on SoccerBench and outperforms existing MLLMs. Conclusion: The proposed framework offers a comprehensive solution for soccer understanding; data and code are open-sourced. Abstract: Recent advancements in AI-driven soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Specifically, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K standardized multimodal (text, image, video) multi-choice QA pairs across 13 distinct understanding tasks, curated through automated pipelines and manual verification; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) we conduct extensive evaluations and ablations that benchmark state-of-the-art MLLMs on SoccerBench, highlighting the superiority of our proposed agentic system. All data and code are publicly available at: https://jyrao.github.io/SoccerAgent/.

cs.CL [Back]

Bang Zhang,Ruotian Ma,Qingxuan Jiang,Peisong Wang,Jiaqi Chen,Zheng Xie,Xingyu Chen,Yue Wang,Fanghua Ye,Jian Li,Yifan Yang,Zhaopeng Tu,Xiaolong Li

Main category: cs.CL

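TL;DR: SAGE is an automated evaluation framework that assesses the social cognition of large language models (LLMs) by simulating human-like emotional changes and inner thoughts.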

Details

Motivation: Existing methods struggle to evaluate how well LLMs understand human emotions; SAGE aims to fill this gap. Method: SAGE simulates emotional changes, inner thoughts, and reply behavior, producing an emotion trajectory and interpretable inner thoughts. Result: Experiments show that SAGE's emotion scores correlate strongly with psychological measures and reveal substantial gaps between frontier models and earlier baselines. Conclusion: SAGE provides a scalable and interpretable tool for evaluating the social cognition of language models. Abstract: Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

[91] Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors

Nicy Scaria,Silvester John Joseph Kennedy,Diksha Seth,Ananya Thakur,Deepak Subramani

Main category: cs.CL

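TL;DR: Proposes a hierarchical concept map-based framework that guides LLMs to generate high-quality MCQs for high-school physics, significantly outperforming baseline methods.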

Details

Motivation: Writing high-quality MCQs by hand is time-consuming and depends on expert knowledge, while existing automated methods fall short on higher cognitive levels and domain-specific misconceptions. Method: Develop a hierarchical concept map, retrieve topic-relevant content through an automated pipeline to guide the LLM in generating MCQs and distractors, and validate the output automatically. Result: Expert evaluation shows a success rate of 75.20%, and in student testing the guess success rate drops significantly to 28.05%. Conclusion: The method assesses understanding across cognitive levels and quickly identifies conceptual gaps, supporting feedback and intervention at scale. Abstract: Generating high-quality MCQs, especially those targeting diverse cognitive levels and incorporating common misconceptions into distractor design, is time-consuming and expertise-intensive, making manual creation impractical at scale. Current automated approaches typically generate questions at lower cognitive levels and fail to incorporate domain-specific misconceptions. This paper presents a hierarchical concept map-based framework that provides structured knowledge to guide LLMs in generating MCQs with distractors. We chose high-school physics as our test domain and began by developing a hierarchical concept map covering major Physics topics and their interconnections with an efficient database design. Next, through an automated pipeline, topic-relevant sections of these concept maps are retrieved to serve as a structured context for the LLM to generate questions and distractors that specifically target common misconceptions. Lastly, an automated validation is completed to ensure that the generated MCQs meet the requirements provided. We evaluate our framework against two baseline approaches: a base LLM and a RAG-based generation. We conducted expert evaluations and student assessments of the generated MCQs. Expert evaluation shows that our method significantly outperforms the baseline approaches, achieving a success rate of 75.20% in meeting all quality criteria compared to approximately 37% for both baseline methods. Student assessment data reveal that our concept map-driven approach achieved a significantly lower guess success rate of 28.05% compared to 37.10% for the baselines, indicating a more effective assessment of conceptual understanding. The results demonstrate that our concept map-based approach enables robust assessment across cognitive levels and instant identification of conceptual gaps, facilitating faster feedback loops and targeted interventions at scale.

[92] 30DayGen: Leveraging LLMs to Create a Content Corpus for Habit Formation

Franklin Zhang,Sonya Zhang,Alon Halevy

Main category: cs.CL

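TL;DR: 30 Day Me is a habit formation app that uses LLMs to help users break down goals and track progress; at its core is the 30DAYGEN system, which generates 3,531 unique 30-day challenges.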

Details

Motivation: Use LLMs to rapidly build domain-specific content corpora that support behavioral and educational goals. Method: Generate content with LLMs and deduplicate it semantically to build the 30DAYGEN system. Result: 3,531 unique 30-day challenges were generated, with search that aligns challenges to user-defined goals. Conclusion: The work demonstrates the practicality of LLMs for content generation and deduplication, offering a new approach for behavioral and educational applications. Abstract: In this paper, we present 30 Day Me, a habit formation application that leverages Large Language Models (LLMs) to help users break down their goals into manageable, actionable steps and track their progress. Central to the app is the 30DAYGEN system, which generates 3,531 unique 30-day challenges sourced from over 15K webpages, and enables runtime search of challenge ideas aligned with user-defined goals. We showcase how LLMs can be harnessed to rapidly construct domain-specific content corpora for behavioral and educational purposes, and propose a practical pipeline that incorporates effective LLM-enhanced approaches for content generation and semantic deduplication.
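
The abstract does not say how 30DAYGEN performs semantic deduplication. As a rough illustration only, the sketch below shows one common approach: greedy filtering by cosine similarity over embeddings. The function names, the threshold of 0.9, and the toy three-dimensional embeddings are all assumptions made for this example, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deduplicate(items, embeddings, threshold=0.9):
    """Greedy dedup: keep an item unless it is too similar to one already kept.

    `threshold` is a hypothetical cut-off; a real pipeline would tune it.
    """
    kept = []  # indices of retained items
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

# Toy data: the first two challenges are paraphrases of each other.
challenges = ["Run 1 mile every day", "Jog one mile daily", "Read 10 pages a day"]
toy_embeddings = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.01], [0.0, 0.2, 0.95]]
unique = deduplicate(challenges, toy_embeddings)
```

In practice the embeddings would come from a sentence-embedding model, and the quadratic comparison would likely be replaced by approximate nearest-neighbor search for a corpus built from 15K webpages.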

[93] Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets

Masumi Morishige,Ryo Koshihara

Main category: cs.CL

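TL;DR: GPR-bench is a lightweight, extensible benchmark for evaluating the reproducibility and reliability of generative AI systems across multiple tasks and languages.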

Details

Motivation: Generative AI systems can drift in behavior after model updates or prompt changes, which raises reproducibility and reliability problems. Method: Develop GPR-bench, comprising a bilingual (English and Japanese) dataset and an automated evaluation pipeline that uses "LLM-as-a-Judge" scoring. Result: Newer models improve correctness slightly but not significantly, while a concise-writing prompt significantly improves conciseness. Conclusion: GPR-bench gives the community an extensible benchmark, though it may not be challenging enough to separate recent model versions. Abstract: Reproducibility and reliability remain pressing challenges for generative AI systems whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset covering eight task categories (e.g., text generation, code generation, and information retrieval) and 10 scenarios in each task category (80 total test cases for each language) with an automated evaluation pipeline that employs "LLM-as-a-Judge" scoring of correctness and conciseness. Experiments across three recent model versions - gpt-4o-mini, o3-mini, and o4-mini - and two prompt configurations (default versus concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but the differences are modest and not statistically significant, suggesting that GPR-bench may not be sufficiently challenging to differentiate between recent model versions. In contrast, the concise-writing instruction significantly enhances conciseness (+12.37 pp, Mann-Whitney U test: p < 0.001, effect size r = 0.2995) with minimal degradations on accuracy (-1.7 pp), demonstrating the effectiveness of prompt engineering. Released under the MIT License, GPR-bench lowers the barrier to initiating reproducibility monitoring and provides a foundation for community-driven extensions, while also raising important considerations about benchmark design for rapidly evolving language models.
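
The abstract reports a Mann-Whitney U test with an effect size r. A common convention, assumed here since the abstract does not state the formula used, is r = |Z| / sqrt(N), where Z is the normal approximation of U and N is the combined sample size. The sketch below computes it in plain Python without a tie correction in the variance; the function name and the toy scores are illustrative only.

```python
import math

def mann_whitney_effect_size(x, y):
    """Return (U, Z, r) for two samples using the normal approximation.

    U counts pairs where a value from x exceeds a value from y (ties count 0.5).
    No tie correction is applied to the variance, so treat Z as approximate
    when the data contain many ties.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    r = abs(z) / math.sqrt(n1 + n2)
    return u, z, r

# Toy conciseness scores for two prompt configurations (made-up numbers).
default_scores = [1, 2, 3]
concise_scores = [4, 5, 6]
u, z, r = mann_whitney_effect_size(default_scores, concise_scores)
```

By the usual rule of thumb, an r near 0.3 (as reported in the abstract) is read as a medium effect, although such thresholds are conventions rather than strict cut-offs.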

Henry Tari,Nojus Sereiva,Rishabh Kaushal,Thales Bertaglia,Adriana Iamnitchi

Main category: cs.CL

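TL;DR: The paper explores the potential of large language models to generate synthetic multi-platform social media data, proposes a topic-based prompting method, and validates its feasibility experimentally.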

Details

Motivation: Costs and platform restrictions make multi-platform social media datasets hard to obtain; the study aims to address this with synthetic data. Method: Apply multi-platform topic-based prompting with several language models to generate synthetic data, then compare it against real data. Result: Generating multi-platform synthetic data with large language models is promising, different models perform differently, and post-processing may improve fidelity. Conclusion: The study offers a new approach to generating multi-platform social media data and proposes fidelity metrics tailored to it. Abstract: Social media datasets are essential for research on a variety of topics, such as disinformation, influence operations, hate speech detection, or influencer marketing practices. However, access to social media datasets is often constrained due to costs and platform restrictions. Acquiring datasets that span multiple platforms, which is crucial for understanding the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real data. We propose multi-platform topic-based prompting and employ various language models to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings show that using large language models to generate synthetic multi-platform social media data is promising, different language models perform differently in terms of fidelity, and a post-processing approach might be needed for generating high-fidelity synthetic datasets for research. In addition to the empirical evaluation of three state-of-the-art large language models, our contributions include new fidelity metrics specific to multi-platform social media datasets.

[95] Enhancing ML Model Interpretability: Leveraging Fine-Tuned Large Language Models for Better Understanding of AI

Jonas Bokstaller,Julia Altheimer,Julian Dormehl,Alina Buss,Jasper Wiltfang,Johannes Schneider,Maximilian Röglinger

Main category: cs.CL

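TL;DR: Proposes an interactive chatbot architecture built on a fine-tuned LLM for interpreting XAI output, and validates it on battery State-of-Health prediction.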

Details

Motivation: As the black-box nature of machine learning models becomes more apparent, demand for XAI grows; meanwhile, LLMs have advanced in understanding human language and complex patterns, so combining the two can make XAI easier to interpret. Method: Design a reference architecture in which an interactive chatbot driven by a fine-tuned LLM explains XAI output, and instantiate it for battery SoH prediction. Result: The evaluation indicates that the prototype improves the interpretability of ML, especially for users with less XAI experience. Conclusion: An interactive, LLM-based architecture for explaining XAI improves model interpretability and works well in a specific domain such as battery health prediction. Abstract: Across various sectors, applications of eXplainable AI (XAI) gained momentum as the increasing black-boxedness of prevailing Machine Learning (ML) models became apparent. In parallel, Large Language Models (LLMs) significantly developed in their abilities to understand human language and complex patterns. By combining both, this paper presents a novel reference architecture for the interpretation of XAI through an interactive chatbot powered by a fine-tuned LLM. We instantiate the reference architecture in the context of State-of-Health (SoH) prediction for batteries and validate its design in multiple evaluation and demonstration rounds. The evaluation indicates that the implemented prototype enhances the human interpretability of ML, especially for users with less experience with XAI.

[96] Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs

Haoming Yang,Ke Ma,Xiaojun Jia,Yingfei Sun,Qianqian Xu,Qingming Huang

Main category: cs.CL

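TL;DR: The paper proposes ICRT, a framework inspired by human cognition for bypassing the safety mechanisms of large language models; it uses cognitive decomposition and relevance bias to restructure malicious prompts and introduces a ranking-based metric to quantify harmfulness.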

Details

Motivation: Existing studies rely on brute-force optimization or manual design and fail to uncover potential risks in real-world scenarios, so more effective ways to probe and evaluate model safety mechanisms are needed. Method: The ICRT framework uses cognitive decomposition to simplify malicious prompts, reorganizes them via relevance bias to strengthen semantic alignment, and evaluates harmfulness with ranking aggregation methods (Elo, HodgeRank, and others). Result: Experiments show that ICRT bypasses the safety mechanisms of mainstream LLMs and elicits high-risk content, exposing attack risks and informing defense strategies. Conclusion: ICRT offers a new perspective on jailbreak attack risks and helps in developing stronger defenses. Abstract: Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs' safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.

[97] Accelerating Large Language Model Reasoning via Speculative Search

Zhihai Wang,Jie Wang,Jilai Pan,Xilin Xia,Huiling Zhen,Mingxuan Yuan,Jianye Hao,Feng Wu

Main category: cs.CL

TL;DR: Proposes SpecSearch, a framework that significantly accelerates LLM reasoning by optimizing thought generation while preserving reasoning quality.

Details

Motivation: Tree-search-based reasoning methods must generate many reasoning thoughts, and the resulting latency limits the applicability of LLMs. Method: A small model collaborates strategically with a large model at both the thought and token levels, and a quality-preserving rejection mechanism filters out low-quality thoughts. Result: In experiments on Qwen and Llama models, SpecSearch achieves up to a 2.12× speedup over existing methods with comparable reasoning quality. Conclusion: SpecSearch accelerates LLM reasoning while maintaining quality, making it a feasible option for practical use. Abstract: Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12× speedup with comparable reasoning quality.
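
To make the rejection idea concrete, here is a minimal sketch of one thought-level step. It is not the authors' algorithm: the abstract does not say how the quality bar is estimated, so this example assumes the bar is the mean score of thoughts the large model has produced so far. The names `speculative_step`, `score`, and `large_model_generate` are hypothetical stand-ins for a draft model, a thought evaluator, and the large model.

```python
from statistics import mean

def speculative_step(draft_thoughts, score, large_model_generate, large_scores):
    """Accept cheap draft thoughts that clear a quality bar; replace the rest.

    draft_thoughts: candidate thoughts proposed by the small model.
    score: callable thought -> float (hypothetical evaluator).
    large_model_generate: callable () -> thought, the expensive fallback.
    large_scores: running list of scores of large-model thoughts; its mean
        is used as the acceptance threshold (an assumption of this sketch).
    """
    # The threshold is fixed at the start of the step.
    threshold = mean(large_scores) if large_scores else float("-inf")
    accepted = []
    for thought in draft_thoughts:
        if score(thought) >= threshold:
            accepted.append(thought)          # cheap draft is good enough
        else:
            replacement = large_model_generate()
            large_scores.append(score(replacement))
            accepted.append(replacement)      # fall back to the large model
    return accepted

# Toy usage: thought quality is just string length here.
history = [3.0, 5.0]
result = speculative_step(["aa", "aaaaa"], len, lambda: "LLLL", history)
```

The speedup in such a scheme comes from how often drafts are accepted: every accepted draft avoids one expensive large-model generation.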

[98] Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

Cfir Avraham Hadar,Omer Shubi,Yoav Meiri,Yevgeni Berzak

Main category: cs.CL

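TL;DR: The paper asks whether readers' open-ended reading goals can be decoded automatically from eye movements, introduces goal classification and reconstruction tasks, and develops multimodal LLMs that prove effective in experiments.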

Details

Motivation: Explore whether eye movement data reflect a reader's text-specific goals, offering a new angle on reading behavior. Method: Define goal classification and goal reconstruction tasks, use large-scale English eye-tracking-while-reading data, and develop multimodal LLMs that combine eye movements and text. Result: The models perform well on both tasks, indicating that eye movements carry information about reading goals. Conclusion: LLMs can extract information about readers' goals from eye movements, providing a new tool for reading research. Abstract: When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you only care about the question "but does it work?". More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded from eye movements in reading. To address this question, we introduce goal classification and goal reconstruction tasks and evaluation frameworks, and use large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal LLMs that combine eye movements and text for goal classification and goal reconstruction. Our experiments show considerable success on both tasks, suggesting that LLMs can extract valuable information about the readers' text-specific goals from eye movements.

[99] Logits-Constrained Framework with RoBERTa for Ancient Chinese NER

Wenjie Hua,Shenghan Xu

Main category: cs.CL

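TL;DR: The paper proposes a Logits-Constrained (LC) framework for Ancient Chinese named entity recognition (NER) that outperforms traditional methods on the EvaHan 2025 benchmark.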

Details

Motivation: Improve Ancient Chinese NER performance in high-label or large-data settings. Method: A two-stage model that combines GujiRoBERTa for contextual encoding with a differentiable decoding mechanism that constrains BMES label transitions. Result: The LC framework outperforms traditional CRF and BiLSTM-based methods. Conclusion: The proposed model selection criterion gives practical guidance for real-world Ancient Chinese NLP tasks. Abstract: This paper presents a Logits-Constrained (LC) framework for Ancient Chinese Named Entity Recognition (NER), evaluated on the EvaHan 2025 benchmark. Our two-stage model integrates GujiRoBERTa for contextual encoding and a differentiable decoding mechanism to enforce valid BMES label transitions. Experiments demonstrate that LC improves performance over traditional CRF and BiLSTM-based approaches, especially in high-label or large-data settings. We also propose a model selection criterion balancing label complexity and dataset size, providing practical guidance for real-world Ancient Chinese NLP tasks.
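
To illustrate what enforcing valid BMES transitions means, the sketch below runs a plain Viterbi search over per-token logits while disallowing invalid tag pairs. This is a swapped-in, non-differentiable technique used only for illustration; the paper's decoder is described as differentiable and is not reproduced here. The tag set is simplified to a single entity type with no outside tag, and all names are assumptions.

```python
NEG_INF = float("-inf")
TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single
# Which tags may follow each tag in a well-formed BMES sequence.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}
START = {"B", "S"}  # a sequence cannot open inside an entity
END = {"E", "S"}    # ... and cannot stop inside one

def constrained_decode(logits):
    """Return the highest-scoring valid tag sequence.

    logits: list of per-token score lists, ordered as in TAGS.
    """
    if not logits:
        return []
    n = len(logits)
    # best[t][j]: best score of a valid path ending with tag j at position t
    best = [[NEG_INF] * len(TAGS) for _ in range(n)]
    back = [[0] * len(TAGS) for _ in range(n)]
    for j, tag in enumerate(TAGS):
        if tag in START:
            best[0][j] = logits[0][j]
    for t in range(1, n):
        for j, tag in enumerate(TAGS):
            for i, prev in enumerate(TAGS):
                if tag in ALLOWED[prev] and best[t - 1][i] > NEG_INF:
                    cand = best[t - 1][i] + logits[t][j]
                    if cand > best[t][j]:
                        best[t][j] = cand
                        back[t][j] = i
    end_ids = [j for j, tag in enumerate(TAGS) if tag in END]
    last = max(end_ids, key=lambda j: best[n - 1][j])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [TAGS[j] for j in reversed(path)]

# Per-token argmax here would give B, S, E, which is not a valid sequence.
toy_logits = [[3.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.5, 2.0], [0.0, 0.0, 2.0, 0.0]]
decoded = constrained_decode(toy_logits)
```

In the toy input, the unconstrained per-token choice breaks the B-to-S rule, so the search settles on the best sequence that respects every transition.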

[100] RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Daniel Goldstein,Eric Alcaide,Janna Lu,Eugene Cheah

Main category: cs.CL

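TL;DR: RADLADS is a protocol for rapidly converting softmax attention transformers into linear attention decoder models, introduced together with two new RWKV-variant architectures. Conversion needs only 350-700M tokens and costs under $2,000, with quality close to the original model.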

Details

Motivation: Conventional transformer models are computationally expensive and resource-hungry; the goal is an efficient and economical alternative. Method: Use the RADLADS protocol to rapidly convert softmax attention transformers into linear attention decoder models, and develop new RWKV-variant architectures. Result: The converted models perform strongly on standard benchmarks, close to the original transformers, at a much lower compute cost. Conclusion: RADLADS offers an efficient, low-cost method for converting large-scale models with strong resulting performance. Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper

[101] Memorization or Interpolation? Detecting LLM Memorization through Input Perturbation Analysis

Albérick Euraste Djiré,Abdoul Kader Kaboré,Earl T. Barr,Jacques Klein,Tegawendé F. Bissyandé

Main category: cs.CL

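TL;DR: PEARL is a new method for detecting memorization in large language models (LLMs); it measures how sensitive model outputs are to input perturbations and can separate true generalization from memorization without access to model internals.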

Details

Motivation: LLMs may memorize training data rather than generalize from it, raising concerns about privacy, intellectual property, and the reliability of evaluations. Method: PEARL perturbs inputs and analyzes the consistency of model outputs to detect memorization. Result: Validated on Pythia and GPT-4o models, PEARL identified memorization of classic texts and code and provided supporting evidence about likely training data sources. Conclusion: PEARL offers an effective framework for detecting LLM memorization and helps address privacy and evaluation concerns. Abstract: While Large Language Models (LLMs) achieve remarkable performance through training on massive datasets, they can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises significant concerns about data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses how sensitive an LLM's performance is to input perturbations, enabling memorization detection without requiring access to the model's internals. We investigate how input perturbations affect the consistency of outputs, enabling us to distinguish between true generalization and memorization. Our findings, following extensive experiments on the Pythia open model, provide a robust framework for identifying when the model simply regurgitates learned information. Applied on the GPT-4o models, the PEARL framework not only identified cases of memorization of classic texts from the Bible or common code from HumanEval but also demonstrated that it can provide supporting evidence that some data, such as from the New York Times news articles, were likely part of the training data of a given model.

[102] A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

Steven Bedrick,A. Seza Doğruöz,Sergiu Nisioi

Main category: cs.CL

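TL;DR: The paper surveys how synthetic datasets are created, evaluated, and used in the medical domain, and proposes a new typology for comparing and evaluating types and degrees of data synthesis.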

Details

Motivation: Clinical dialogue data are hard to obtain because of privacy and governance constraints, so synthetic datasets serve as a substitute, yet little theory guides how best to use and generalize them. Method: Survey how synthetic datasets are created and evaluated, and propose a new typology of data synthesis. Result: Synthetic datasets are effective to some extent in medical dialogue tasks but need further theoretical support. Conclusion: The proposed typology helps in comparing and evaluating synthetic datasets; more research is needed to support their generalization. Abstract: Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and being used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.

[103] UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output

Sicong Huang,Jincheng He,Shiyuan Huang,Karthik Raja Anandan,Arkajyoti Chakraborty,Ian Lane

Main category: cs.CL

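TL;DR: The paper describes the UCSC team's system for the Mu-SHROOM task, a framework for detecting and localizing hallucinations in LLM output that achieved the best overall performance.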

Details

Motivation: LLMs hallucinate when answering knowledge-intensive queries, and it matters to pinpoint exactly where in the output this happens. Method: Retrieve relevant context, identify false content in the answer, map it back to specific spans in the LLM output, and optimize prompts automatically. Result: The system ranks first in average position across all languages. Conclusion: The proposed framework handles hallucination detection and localization effectively; code and experiment results are released. Abstract: Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint exactly where in the LLM output they occur. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes the UCSC system submission to the shared Mu-SHROOM task. We introduce a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans in the LLM output. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking #1 in average position across all languages. We release our code and experiment results.

[104] Teaching Models to Understand (but not Generate) High-risk Data

Ryan Wang,Matthew Finlayson,Luca Soldaini,Swabha Swayamdipta,Robin Jia

Main category: cs.CL

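TL;DR: SLUNG is a pre-training paradigm that lets models understand high-risk content without generating it, using a selective loss that avoids rewarding the generation of high-risk tokens while preserving the ability to understand them.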

Details

Motivation: Filtering out high-risk content, the conventional approach, limits a model's ability to recognize and respond to harmful or sensitive content; SLUNG aims to solve this. Method: During pre-training, SLUNG selectively avoids incentivizing the generation of high-risk tokens while keeping them in context so they can still be understood. Result: Experiments show that SLUNG improves models' understanding of high-risk content without increasing the likelihood of generating it. Conclusion: SLUNG lets models benefit from high-risk text that would otherwise be filtered out, while avoiding harmful generation. Abstract: Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
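
One simple way to picture a selective loss is to drop high-risk target positions from the next-token objective while leaving those tokens in the input. The sketch below does exactly that in plain Python. It is an assumption-laden illustration: the abstract does not specify what loss, if any, SLUNG applies at high-risk positions, and zero-masking is just one possible choice. The function name and the toy probabilities are made up.

```python
import math

def selective_nll(token_log_probs, high_risk_mask):
    """Mean negative log-likelihood over low-risk target tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    high_risk_mask: True where the target token is flagged as high-risk.
    High-risk targets contribute nothing to the loss, so the model is not
    rewarded for predicting them, yet they still appear in the context that
    later low-risk predictions are conditioned on.
    """
    kept = [-lp for lp, risky in zip(token_log_probs, high_risk_mask) if not risky]
    return sum(kept) / len(kept) if kept else 0.0

# Toy sequence of three targets; the middle one is flagged as high-risk.
log_probs = [math.log(0.5), math.log(0.1), math.log(0.25)]
mask = [False, True, False]
loss = selective_nll(log_probs, mask)
```

In a deep learning framework the same effect is usually obtained by multiplying per-token losses by a 0/1 mask, or by marking masked targets with an ignore label, before averaging.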

[105] Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text

Jennifer Healey,Laurie Byrum,Md Nadeem Akhtar,Surabhi Bhargava,Moumita Sinha

Main category: cs.CL

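TL;DR: The paper presents a semi-automated bias evaluation framework for free-text responses that puts human insight at its core to address challenges in LLM evaluation.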

Details

Motivation: In real-world deployments, LLM evaluation is complicated by task-specific prompts and contextual interaction; conventional short-context benchmarks can lose validity, and large-scale human evaluation is costly and hard to carry out. Method: Develop an operational definition of bias to automate the pipeline, propose a methodology for classifying bias beyond multiple choice, and use human evaluation to uncover problematic templates in a bias benchmark. Result: The framework combines automation with human insight, improving the validity and scalability of bias evaluation. Conclusion: A semi-automated framework is a workable approach to LLM bias evaluation, and human involvement is essential for uncovering hidden problems. Abstract: LLM evaluation is challenging even in the case of base models. In real-world deployments, evaluation is further complicated by the interplay of task-specific prompts and experiential context. At scale, bias evaluation is often based on short-context, fixed-choice benchmarks that can be rapidly evaluated; however, these can lose validity when the LLMs' deployed context differs. Large-scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.

[106] Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

Junlin Wang,Roy Xie,Shang Zhu,Jue Wang,Ben Athiwaratkun,Bhuwan Dhingra,Shuaiwen Leon Song,Ce Zhang,James Zou

Main category: cs.CL

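TL;DR: The paper proposes MoAA (Mixture of Agents Alignment), which draws on the collective strengths of multiple language models to generate high-quality alignment data and thereby improve model performance.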

Details

Motivation: High-quality human-labeled data is expensive to build and hard to scale, and may lack diversity and generalization; MoAA aims to address these problems. Method: MoAA combines the strengths of multiple language models to generate alignment data for supervised fine-tuning and preference optimization. Result: Experiments show that MoAA substantially improves model performance (for example, large win-rate gains for LLaMA-3.1-8B-Instruct on Arena-Hard and AlpacaEval2). Conclusion: MoAA offers a scalable and diverse synthetic data recipe for model alignment and can improve open-source LLMs through a self-improvement pipeline. Abstract: Building helpful and harmless large language models (LLMs) requires an effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), which leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g. using GPT-4o alone). Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.

[107] Survey of Abstract Meaning Representation: Then, Now, Future

Behrooz Mansouri

Main category: cs.CL

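TL;DR: The paper surveys Abstract Meaning Representation (AMR) and its extensions, covers the parsing and generation tasks, and reviews applications of AMR in text generation, classification, and information extraction.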

Details

Motivation: AMR is studied in order to strengthen machine understanding of human language by capturing sentence semantics in a graph structure. Method: The survey investigates AMR and its extensions, analyzes traditional, current, and possible future approaches to the parsing and generation tasks, and summarizes applications of AMR. Result: The survey highlights both the potential of AMR for semantic representation and language processing tasks and the challenges it faces. Conclusion: AMR offers a direction for machine understanding of human language; future research needs to resolve the existing challenges to realize its potential. Abstract: This paper presents a survey of Abstract Meaning Representation (AMR), a semantic representation framework that captures the meaning of sentences through a graph-based structure. AMR represents sentences as rooted, directed acyclic graphs, where nodes correspond to concepts and edges denote relationships, effectively encoding the meaning of complex sentences. This survey investigates AMR and its extensions, focusing on AMR capabilities. It then explores the parsing (text-to-AMR) and generation (AMR-to-text) tasks by showing traditional, current, and possible future approaches. It also reviews various applications of AMR, including text generation, text classification, information extraction, and information seeking. By analyzing recent developments and challenges in the field, this survey provides insights into future directions for research and the potential impact of AMR on enhancing machine understanding of human language.
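
The abstract describes AMR graphs as rooted, directed acyclic graphs whose nodes are concepts and whose edges are relations. As a minimal sketch of that data structure (not taken from the survey), the snippet below encodes the commonly used textbook sentence "The boy wants to go" and checks the rooted-DAG property. The variable names, the sense suffixes such as `-01`, and the helper `is_rooted_dag` are illustrative assumptions.

```python
from collections import defaultdict

# Nodes: variable -> concept. The sense suffixes are illustrative only.
concepts = {"w": "want-01", "b": "boy", "g": "go-02"}
# Edges: (source variable, role label, target variable).
# "b" is the target of two edges (a reentrancy): the boy both wants and goes.
edges = [("w", ":ARG0", "b"), ("w", ":ARG1", "g"), ("g", ":ARG0", "b")]
root = "w"


def is_rooted_dag(root, concepts, edges):
    """Return True if the graph is acyclic and every node is reachable from root."""
    children = defaultdict(list)
    for src, _role, tgt in edges:
        children[src].append(tgt)

    state = {}  # node -> 1 while on the DFS stack, 2 once fully explored

    def visit(node):
        if state.get(node) == 1:
            return False  # back edge: the graph contains a cycle
        if state.get(node) == 2:
            return True
        state[node] = 1
        ok = all(visit(child) for child in children[node])
        state[node] = 2
        return ok

    return visit(root) and set(state) == set(concepts)


print(is_rooted_dag(root, concepts, edges))
```

The reentrant node `b` is what makes this a graph rather than a tree, which is one reason AMR parsing cannot be treated as plain tree prediction.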

[108] Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback

Shijing Zhu,Zhuang Chen,Guanqun Bi,Binghang Li,Yaxi Deng,Dazhen Wan,Libiao Peng,Xiyao Xiao,Rongsheng Zhang,Tangjie Lv,Zhipeng Hu,FangFang Li,Minlie Huang

Main category: cs.CL

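TL;DR: The paper proposes the Ψ-Arena framework for comprehensive assessment and optimization of LLM-based psychological counselors, improving performance through multi-stage dialogues, tripartite evaluation, and closed-loop optimization.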

Details

Motivation: Existing evaluation methods are limited by static tests, a single perspective, and open-loop frameworks, so they cannot comprehensively assess the capabilities of LLM-based counselors. Method: The paper proposes the Ψ-Arena framework, comprising multi-stage dialogues that simulate real counseling, tripartite evaluation (client, counselor, and supervisor), and closed-loop optimization. Result: Experiments show significant performance differences among LLMs in realistic scenarios, and optimization improves counseling performance by up to 141%. Conclusion: Ψ-Arena provides a foundational resource for reliable and human-aligned LLM applications in mental healthcare. Abstract: Large language models (LLMs) have shown promise in providing scalable mental health support, while evaluating their counseling capability remains crucial to ensure both efficacy and safety. Existing evaluations are limited by the static assessment that focuses on knowledge tests, the single perspective that centers on user experience, and the open-loop framework that lacks actionable feedback. To address these issues, we propose Ψ-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based counselors, featuring three key characteristics: (1) Realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients, (2) Tripartite evaluation that integrates assessments from the client, counselor, and supervisor perspectives, and (3) Closed-loop optimization that iteratively improves LLM counselors using diagnostic feedback. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives. Moreover, reflection-based optimization results in up to a 141% improvement in counseling performance. We hope Ψ-Arena provides a foundational resource for advancing reliable and human-aligned LLM applications in mental healthcare.

[109] Recall with Reasoning: Chain-of-Thought Distillation for Mamba's Long-Context Memory and Extrapolation

Junyu Ma,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu

Main category: cs.CL

TL;DR: The Recall with Reasoning (RwR) method improves the long-context memory of Mamba models without architectural changes.

Details

Motivation: Mamba has theoretically unlimited context potential, but in practice its performance is limited when sequences far exceed the training length. Method: The paper proposes RwR, which distills chain-of-thought (CoT) summaries from a teacher model and prepends them as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Result: Experiments on LONGMEMEVAL and HELMET show that RwR outperforms Transformer/hybrid baselines under similar pretraining conditions while preserving short-context capabilities. Conclusion: RwR is a simple and effective way to substantially improve Mamba's long-context performance without harming its short-context performance. Abstract: Mamba's theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba's long-context memory ability by a simple-yet-effective method, Recall with Reasoning (RwR), by distilling chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summaries as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Experiments on LONGMEMEVAL and HELMET show RwR boosts Mamba's long-context performance against comparable Transformer/hybrid baselines under similar pretraining conditions, while preserving short-context capabilities, all without architectural changes.

[110] Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation

Mohammad Shoaib Ansari,Mohd Sohail Ali Khan,Shubham Revankar,Aditya Varma,Anil S. Mokhade

Main category: cs.CL

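TL;DR: This study investigates the application of large language models (LLMs) in healthcare, combining Retrieval-Augmented Generation (RAG) with Quantized Low-Rank Adaptation (QLoRA) to improve the accuracy and efficiency of medical decision support.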

Details

Motivation: The work targets the accuracy and efficiency of information in medical decision support systems while optimizing resource usage. Method: The system uses Llama 3.2-3B-Instruct as its base model and combines RAG with QLoRA to embed and retrieve medical data. Result: The system significantly improves response accuracy and performs relatively well on several medical benchmarks, making it suitable for basic medical suggestions. Conclusion: LLMs hold considerable promise in healthcare, but ethical issues and practical deployment challenges need attention; the approach can be further optimized and extended. Abstract: This research paper investigates the application of Large Language Models (LLMs) in healthcare, specifically focusing on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By embedding and retrieving context-relevant healthcare information, the system significantly improves response accuracy. QLoRA facilitates notable parameter efficiency and memory optimization, preserving the integrity of medical information through specialized quantization techniques. Our research also shows that our model performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions. This paper details the system's technical components, including its architecture, quantization methods, and key healthcare applications such as enhanced disease prediction from patient symptoms and medical history, treatment suggestions, and efficient summarization of complex medical reports. We touch on the ethical considerations (patient privacy, data security, and the need for rigorous clinical validation) as well as the practical challenges of integrating such systems into real-world healthcare workflows. Furthermore, the lightweight quantized weights ensure scalability and ease of deployment even in low-resource hospital environments. Finally, the paper concludes with an analysis of the broader impact of LLMs on healthcare and outlines future directions for LLMs in medical settings.

[111] MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

Mouath Abu Daoud,Chaimae Abouzahir,Leen Kharouf,Walid Al-Eisawi,Nizar Habash,Farah E. Shamout

Main category: cs.CL

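TL;DR: The study introduces MedArabiQ, a benchmark dataset for the Arabic medical domain used to evaluate large language models (LLMs), and stresses the importance of high-quality multilingual benchmarks.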

Details

Motivation: The Arabic medical domain lacks high-quality datasets and benchmarks, which limits the application and evaluation of LLMs in this area. Method: The authors build MedArabiQ, a dataset of seven Arabic medical tasks, and evaluate five state-of-the-art LLMs on it. Result: Existing LLMs perform unevenly on Arabic medical tasks, underscoring the need for multilingual benchmarks. Conclusion: MedArabiQ lays a foundation for future research on multilingual LLMs and their equitable use. Abstract: Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple-choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

[112] An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation

Matan Orbach,Ohad Eytan,Benjamin Sznajder,Ariel Gera,Odellia Boni,Yoav Kantor,Gal Bloch,Omri Levy,Hadas Abraham,Nitzan Barzilay,Eyal Shnarch,Michael E. Factor,Shila Ofek-Koifman,Paula Ta-Shma,Assaf Toledo

Main category: cs.CL

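TL;DR: This paper studies the effectiveness of hyper-parameter optimization (HPO) frameworks for Retrieval-Augmented Generation (RAG), evaluating 5 HPO algorithms over 5 datasets, and finds that greedy or iterative random search can improve RAG performance efficiently.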

Details

Motivation: Finding the optimal RAG configuration for a given use case is complex and expensive, and existing HPO frameworks have not been rigorously benchmarked; this paper aims to fill that gap. Method: The study runs 5 HPO algorithms over 5 datasets from diverse domains (including a newly collected dataset of real-world product documentation), explores the largest HPO search space considered to date, and optimizes two evaluation metrics. Result: Greedy or iterative random search performs RAG HPO efficiently and significantly boosts performance on all datasets. Among greedy approaches, optimizing models first is more effective than optimizing in RAG pipeline order. Conclusion: RAG HPO can be done efficiently with greedy or random search, and optimizing models first is the better strategy. Abstract: Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains, including a new one collected for this work on real-world product documentation. Our study explores the largest HPO search space considered to date, with two optimized evaluation metrics. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with iterative random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing models first is preferable to the prevalent practice of optimizing sequentially according to the RAG pipeline order.
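
The two search strategies named in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's benchmark: the search space, the parameter names, and `toy_score` (a stand-in for running a RAG pipeline on a development set) are all hypothetical.

```python
import random

# Hypothetical search space; the paper's actual space is much larger.
SEARCH_SPACE = {
    "model": ["small", "large"],
    "chunk_size": [256, 512, 1024],
    "top_k": [1, 3, 5],
}


def toy_score(cfg):
    """Stand-in for an expensive RAG evaluation; higher is better."""
    score = 0.5 if cfg["model"] == "large" else 0.3
    score += {256: 0.05, 512: 0.10, 1024: 0.02}[cfg["chunk_size"]]
    score += {1: 0.00, 3: 0.08, 5: 0.06}[cfg["top_k"]]
    return score


def greedy_hpo(space, score, order):
    """Optimize one hyper-parameter at a time, in the given order."""
    cfg = {name: values[0] for name, values in space.items()}
    for name in order:
        # Try every value of this parameter while holding the others fixed.
        cfg[name] = max(space[name], key=lambda v: score({**cfg, name: v}))
    return cfg


def random_search(space, score, n_trials, seed=0):
    """Sample full configurations uniformly and keep the best one seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        if best is None or score(cfg) > score(best):
            best = cfg
    return best


# "Models first" ordering, following the paper's recommendation for greedy HPO.
greedy_best = greedy_hpo(SEARCH_SPACE, toy_score, ["model", "chunk_size", "top_k"])
random_best = random_search(SEARCH_SPACE, toy_score, n_trials=10)
print(greedy_best, random_best)
```

Because `toy_score` is additive, the greedy order does not change the result here; in a real pipeline the parameters interact, which is where the ordering studied in the paper starts to matter. Greedy search costs one evaluation per candidate value (8 here) rather than one per full configuration (18 here).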

[113] Uncertainty-Aware Large Language Models for Explainable Disease Diagnosis

Shuang Zhou,Jiashuo Wang,Zidu Xu,Song Wang,David Brauer,Lindsay Welton,Jacob Cogan,Yuen-Hei Chung,Lei Tian,Zaifu Zhan,Yu Hou,Mingquan Lin,Genevieve B. Melton,Rui Zhang

Main category: cs.CL

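TL;DR: ConfiDx is an uncertainty-aware large language model (LLM), built by fine-tuning open-source LLMs with diagnostic criteria, that identifies and explains diagnostic uncertainty and improves the reliability of automatic diagnostic systems.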

Details

Motivation: When clinical notes contain insufficient evidence, diagnostic uncertainty raises the risk of misdiagnosis, yet identifying and explaining that uncertainty remains under-explored. Method: The authors fine-tune open-source LLMs and assemble annotated datasets to develop ConfiDx, a model that identifies and explains diagnostic uncertainty. Result: On real-world datasets, ConfiDx identifies diagnostic uncertainty, provides trustworthy explanations, and achieves superior diagnostic performance. Conclusion: ConfiDx is the first work to jointly address the recognition and explanation of diagnostic uncertainty, substantially improving the reliability of automatic diagnostic systems. Abstract: Explainable disease diagnosis, which leverages patient information (e.g., signs and symptoms) and computational models to generate probable diagnoses and reasoning, offers clear clinical value. However, when clinical notes encompass insufficient evidence for a definite diagnosis, such as the absence of definitive symptoms, diagnostic uncertainty usually arises, increasing the risk of misdiagnosis and adverse outcomes. Although explicitly identifying and explaining diagnostic uncertainties is essential for trustworthy diagnostic systems, it remains under-explored. To fill this gap, we introduce ConfiDx, an uncertainty-aware large language model (LLM) created by fine-tuning open-source LLMs with diagnostic criteria. We formalized the task and assembled richly annotated datasets that capture varying degrees of diagnostic ambiguity. Evaluating ConfiDx on real-world datasets demonstrated that it excelled in identifying diagnostic uncertainties, achieving superior diagnostic performance, and generating trustworthy explanations for diagnoses and uncertainties. To our knowledge, this is the first study to jointly address diagnostic uncertainty recognition and explanation, substantially enhancing the reliability of automatic diagnostic systems.

[114] Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

Bin Yu,Hang Yuan,Yuliang Wei,Bailing Wang,Weizhen Qi,Kai Chen

Main category: cs.CL

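TL;DR: LS-Mixture SFT combines long and short chain-of-thought reasoning data to address the "overthinking" problem in supervised fine-tuning, improving model accuracy while reducing response length.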

Details

Motivation: Models trained with existing supervised fine-tuning (SFT) on CoT data distilled from large reasoning models inherit the "overthinking" problem, producing verbose and redundant reasoning chains. Method: The paper proposes LS-Mixture SFT, which combines long CoT data with short counterparts obtained through structure-preserved rewriting. Result: Compared with direct SFT, LS-Mixture SFT improves average accuracy by 2.3% and reduces response length by approximately 47.61%. Conclusion: LS-Mixture SFT endows non-reasoning models with reasoning capabilities while avoiding the "overthinking" problem, enabling efficient reasoning. Abstract: Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines a long CoT reasoning dataset with its short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the inherent overthinking problems inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
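
The data-mixing step described above can be sketched as follows. This is a rough illustration that assumes the short counterparts have already been produced by the rewriting step (which the paper performs separately); the field names, the `style` tag, the one-to-one mixing, and both helper functions are assumptions, not the authors' code.

```python
import random


def build_ls_mixture(long_examples, short_examples, seed=0):
    """Combine long-CoT examples with their short rewrites into one SFT set."""
    mixture = [dict(ex, style="long") for ex in long_examples]
    mixture += [dict(ex, style="short") for ex in short_examples]
    random.Random(seed).shuffle(mixture)  # interleave the two styles
    return mixture


def length_reduction_percent(before, after):
    """Relative reduction in average response length, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical toy data: each long example has one short counterpart.
long_set = [{"question": "2+3?", "response": "First, consider 2. Then add 3. So 5."}]
short_set = [{"question": "2+3?", "response": "2 + 3 = 5."}]
mixture = build_ls_mixture(long_set, short_set)
print(len(mixture), sorted(ex["style"] for ex in mixture))
```

As a reading aid for the reported figure: a reduction of about 47.61% means that an average response of 1000 tokens under direct SFT would shrink to roughly 524 tokens.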

[115] Evaluation of LLMs on Long-tail Entity Linking in Historical Documents

Marta Boscariol,Luana Bulla,Lia Draetta,Beatrice Fiumanò,Emanuele Lenzi,Leonardo Piano

Main category: cs.CL

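TL;DR: The paper examines how LLMs perform on the long-tail entity linking (EL) task and finds encouraging results, though room for improvement remains.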

Details

Motivation: Long-tail entities are underrepresented in training data and knowledge bases, and LLMs offer a new perspective on this problem. Method: Two LLMs, GPT and LLama3, are compared against the ReLiK framework on the MHERCL v0.1 dataset. Result: LLMs perform encouragingly well on long-tail EL, showing potential. Conclusion: LLMs can serve as a complementary technology that helps close the gap between head and long-tail entity linking. Abstract: Entity Linking (EL) plays a crucial role in Natural Language Processing (NLP) applications, enabling the disambiguation of entity mentions by linking them to their corresponding entries in a reference knowledge base (KB). Thanks to their deep contextual understanding capabilities, LLMs offer a new perspective to tackle EL, promising better results than traditional methods. Despite the impressive generalization capabilities of LLMs, linking less popular, long-tail entities remains challenging as these entities are often underrepresented in training data and knowledge bases. Furthermore, the long-tail EL task is an understudied problem, and limited studies address it with LLMs. In the present work, we assess the performance of two popular LLMs, GPT and LLama3, in a long-tail entity linking scenario. Using MHERCL v0.1, a manually annotated benchmark of sentences from domain-specific historical texts, we quantitatively compare the performance of LLMs in identifying and linking entities to their corresponding Wikidata entries against that of ReLiK, a state-of-the-art Entity Linking and Relation Extraction framework. Our preliminary experiments reveal that LLMs perform encouragingly well in long-tail EL, indicating that this technology can be a valuable adjunct in filling the gap between head and long-tail EL.

[116] Sentence Embeddings as an intermediate target in end-to-end summarisation

Maciej Zembrzuski,Saad Mahamood

Main category: cs.CL

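TL;DR: The paper proposes a new approach that combines extractive and abstractive methods for summarizing large volumes of user reviews, using pre-trained sentence embeddings to improve performance.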

Details

Motivation: Existing neural methods struggle with large inputs, so the content-selection strategy needs to improve. Method: An extractive approach with externally pre-trained sentence embeddings is combined with an abstractive model, predicting sentence-level embeddings rather than the usual probability distributions. Result: The new approach outperforms existing methods on a large-input dataset, and predicting sentence embeddings improves the quality of the end-to-end system. Conclusion: Combining extractive and abstractive methods and exploiting sentence embeddings improves summarization of large collections of user reviews. Abstract: Current neural network-based approaches to the problem of document summarisation struggle when applied to datasets containing large inputs. In this paper we propose a new approach to the challenge of content selection when dealing with end-to-end summarisation of user reviews of accommodations. We show that by combining an extractive approach with externally pre-trained sentence-level embeddings in addition to an abstractive summarisation model, we can outperform existing methods when this is applied to the task of summarising a large input dataset. We also show that predicting sentence-level embeddings of a summary increases the quality of an end-to-end system for loosely aligned source-to-target corpora, compared with the common practice of predicting probability distributions for sentence selection.

[117] Faster MoE LLM Inference for Extremely Large Models

Haoqi Yang,Luohe Shi,Qiwei Li,Zuchao Li,Ping Wang,Bo Du,Mengjia Shen,Hai Zhao

Main category: cs.CL

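TL;DR: Sparse Mixture of Experts (MoE) large language models (LLMs) are becoming the mainstream approach for ultra-large-scale models. This paper examines the efficiency dynamics of fine-grained MoE models under different service loads and studies how reducing the number of routed experts affects the trade-off between efficiency and performance.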

Details

Motivation: Fine-grained MoE models such as DeepSeek are on the rise, yet research on them remains limited; this paper aims to fill that gap and explore their optimization potential. Method: The paper analyzes how reducing the number of activated experts and the total number of experts affects efficiency and performance, and proposes an optimization method. Result: Reducing the number of activated experts yields substantial efficiency gains in certain scenarios with only minor performance loss; reducing the total number of experts yields limited efficiency gains but severe performance loss. The method increases throughput by at least 10% without performance degradation. Conclusion: MoE inference optimization still offers substantial room for exploration and improvement. Abstract: Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek models, fine-grained MoE models are gaining popularity, yet research on them remains limited. Therefore, we want to discuss the efficiency dynamics under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both activated counts and total counts, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.
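
To make "reducing the number of activated experts" concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration rather than the paper's method or any particular model's router: the scalar toy experts, the logits, and the choice to renormalize gate weights over the selected experts are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the k activated experts; also report how many ran."""
    weights = route_top_k(router_logits, k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, len(weights)


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 1.5]

_, calls_k4 = moe_forward(1.0, experts, logits, k=4)
_, calls_k2 = moe_forward(1.0, experts, logits, k=2)
print(calls_k4, calls_k2)
```

Lowering `k` from 4 to 2 halves the number of expert evaluations per token while leaving all expert weights in place, which is the knob studied in the paper; removing experts entirely (shrinking `experts`) is the other knob, which the abstract reports as far more damaging to quality.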

[118] Say It Another Way: A Framework for User-Grounded Paraphrasing

Cléa Chataigner,Rebecca Ma,Prakhar Ganesh,Afaf Taïk,Elliot Creager,Golnoosh Farnadi

Main category: cs.CL

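TL;DR: The paper examines how small changes in prompt wording affect the behavior of large language models, proposes a framework based on linguistic transformations for generating natural prompt variations, and demonstrates their effect on model evaluation.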

Details

Motivation: Small changes in prompt wording affect LLM behavior, which raises concerns about the stability and reliability of evaluations. Method: The paper proposes a controlled paraphrasing framework based on a taxonomy of linguistic transformations, generates natural prompt variations, and validates it on the BBQ dataset. Result: Even subtle prompt modifications lead to substantial changes in model behavior. Conclusion: Robust, paraphrase-aware evaluation protocols are needed. Abstract: Small changes in how a prompt is worded can lead to meaningful differences in the behavior of large language models (LLMs), raising concerns about the stability and reliability of their evaluations. While prior work has explored simple formatting changes, these rarely capture the kinds of natural variation seen in real-world language use. We propose a controlled paraphrasing framework based on a taxonomy of minimal linguistic transformations to systematically generate natural prompt variations. Using the BBQ dataset, we validate our method with both human annotations and automated checks, then use it to study how LLMs respond to paraphrased prompts in stereotype evaluation tasks. Our analysis shows that even subtle prompt modifications can lead to substantial changes in model behavior. These results highlight the need for robust, paraphrase-aware evaluation protocols.

[119] Towards conversational assistants for health applications: using ChatGPT to generate conversations about heart failure

Anuja Tayal,Devika Salunke,Barbara Di Eugenio,Paula G Allen-Meares,Eulalia P Abril,Olga Garcia-Bedoya,Carolyn A Dickens,Andrew D. Boyd

Main category: cs.CL

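TL;DR: The study explores the potential of ChatGPT (3.5-turbo and 4) to generate self-care conversations for African-American heart failure patients, finding that effective prompt design is essential but that ChatGPT still lacks the empathy and engagement needed for healthcare communication.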

Details

Motivation: Self-care for African-American heart failure patients is a domain with few specialized datasets, so the study explores ChatGPT's ability to generate relevant conversations. Method: Four prompting strategies (domain, AAVE, SDOH, and SDOH-informed reasoning) are used to generate conversations covering the key domains of food, exercise, and fluid intake, incorporating patient-specific SDOH attributes. Result: Incorporating SDOH and reasoning improves dialogue quality, but ChatGPT still falls short on empathy and engagement. Conclusion: Prompt design is essential for generating high-quality conversations, but ChatGPT needs further improvement to support healthcare communication well. Abstract: We explore the potential of ChatGPT (3.5-turbo and 4) to generate conversations focused on self-care strategies for African-American heart failure patients, a domain with limited specialized datasets. To simulate patient-health educator dialogues, we employed four prompting strategies: domain, African American Vernacular English (AAVE), Social Determinants of Health (SDOH), and SDOH-informed reasoning. Conversations were generated across key self-care domains of food, exercise, and fluid intake, with varying turn lengths (5, 10, 15) and incorporated patient-specific SDOH attributes such as age, gender, neighborhood, and socioeconomic status. Our findings show that effective prompt design is essential. While incorporating SDOH and reasoning improves dialogue quality, ChatGPT still lacks the empathy and engagement needed for meaningful healthcare communication.

[120] IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages

Sharvi Endait,Ruturaj Ghatage,Aditya Kulkarni,Rajlaxmi Patil,Raviraj Joshi

Main category: cs.CL

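TL;DR: IndicSQuAD is a multilingual extractive QA dataset covering nine major Indic languages, derived from the SQuAD dataset and intended to address the lack of resources for Indic languages in QA systems.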

Details

Motivation: QA systems have advanced rapidly for high-resource languages, while Indic languages remain under-resourced despite their large native speaker base. IndicSQuAD aims to fill this gap. Method: The dataset is extended from SQuAD using translation techniques that preserve linguistic fidelity and answer-span alignment, yielding training, validation, and test sets. Result: Evaluation with monolingual and multilingual BERT models reveals the challenges of low-resource settings and suggests directions for future research. Conclusion: IndicSQuAD provides a foundation for QA research on Indic languages and can be extended to more languages and to multimodal data. Abstract: The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multilingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp

[121] NBF at SemEval-2025 Task 5: Light-Burst Attention Enhanced System for Multilingual Subject Recommendation

Baharul Islam,Nasim Ahmad,Ferdous Ahmed Barbhuiya,Kuntal Dey

Main category: cs.CL

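TL;DR: The system performs well on SemEval 2025 Task 5, using bilingual data and negative sampling together with a low-dimensional self-attention mechanism for efficient subject retrieval.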

Details

Motivation: The work addresses cross-lingual subject classification and aims to capture subject information efficiently under resource constraints. Method: The system is trained on bilingual data with negative sampling and a margin-based retrieval objective, and uses a low-dimensional self-attention mechanism to encode sentence embeddings. Result: Average recall is 32.24% in the quantitative setting (all subjects) and 43.16% and 31.53% in the qualitative evaluations, with low GPU usage. Conclusion: The approach is effective under resource constraints, although there is still room for improvement. Abstract: We present our system submission for SemEval 2025 Task 5, which focuses on cross-lingual subject classification in the English and German academic domains. Our approach leverages bilingual data during training, employing negative sampling and a margin-based retrieval objective. We demonstrate that a dimension-as-token self-attention mechanism designed with significantly reduced internal dimensions can effectively encode sentence embeddings for subject retrieval. In quantitative evaluation, our system achieved an average recall rate of 32.24% in the general quantitative setting (all subjects), and 43.16% and 31.53% under the general qualitative evaluation methods, all with minimal GPU usage, highlighting its competitive performance. Our results demonstrate that our approach is effective in capturing relevant subject information under resource constraints, although there is still room for improvement.
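
A margin-based retrieval objective with negative sampling, as mentioned in the abstract, is often written as a hinge loss over similarity scores. The sketch below shows one generic triplet-style form in plain Python; the exact loss, the margin value, the use of cosine similarity, and the helper names are assumptions, since the abstract does not specify them.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(query, positive, negatives, margin=0.2):
    """Mean hinge loss: each negative should score at least `margin` below the positive."""
    pos_sim = cosine(query, positive)
    losses = [max(0.0, margin - pos_sim + cosine(query, neg)) for neg in negatives]
    return sum(losses) / len(losses)


def sample_negatives(subject_ids, positive_id, n, seed=0):
    """Draw n subject ids uniformly at random, excluding the correct one."""
    rng = random.Random(seed)
    pool = [s for s in subject_ids if s != positive_id]
    return rng.sample(pool, n)


# Toy 2-d embeddings: the query matches the positive and is orthogonal to the negative.
easy = margin_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# A negative identical to the positive violates the margin by exactly `margin`.
hard = margin_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
print(easy, hard)
```

The loss is zero once every sampled negative is separated from the positive by the margin, so only "hard" negatives contribute gradient during training.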

[122] WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu,Yunqiao Yang,Houxing Ren,Haotian Hou,Han Xiao,Ke Wang,Weikang Shi,Aojun Zhou,Mingjie Zhan,Hongsheng Li

Main category: cs.CL

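TL;DR: WebGen-Bench is a benchmark for evaluating the ability of LLM-based agents to generate multi-file website codebases, containing diverse instructions and 647 test cases. Results show that existing models still have considerable room for improvement.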

Details

Motivation: The benchmark evaluates the ability of LLM-based agents to generate complex website code, with the aim of advancing the underlying technology. Method: Human annotators and GPT-4o jointly create diverse instructions; GPT-4o generates test cases that are then manually refined; a web-navigation agent automates testing. Result: The best combination (Bolt.diy + DeepSeek-R1) reaches only 27.8% accuracy, while a trained Qwen2.5-Coder-32B-Instruct reaches 38.2%. Conclusion: WebGen-Bench is challenging and existing models perform modestly, but training can improve performance. Abstract: LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.

Zuwei Long,Yunhang Shen,Chaoyou Fu,Heting Gao,Lijiang Li,Peixian Chen,Mengdan Zhang,Hang Shao,Jian Li,Jinlong Peng,Haoyu Cao,Ke Li,Rongrong Ji,Xing Sun

Main category: cs.CL

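TL;DR: VITA-Audio is an end-to-end large speech model that uses a lightweight Multiple Cross-modal Token Prediction (MCTP) module and a four-stage progressive training strategy to significantly reduce first-audio latency in streaming scenarios, achieving a 3-5x inference speedup.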

Details

Motivation: Existing speech models suffer from high latency when generating the first audio token in streaming scenarios, which limits practical deployment. Method: The paper proposes a lightweight Multiple Cross-modal Token Prediction (MCTP) module and a four-stage progressive training strategy to generate multiple audio tokens efficiently and accelerate inference. Result: At the 7B parameter scale, the model achieves a 3-5x inference speedup and significantly outperforms comparable open-source models on ASR, TTS, and SQA tasks. Conclusion: VITA-Audio is the first multi-modal large language model able to generate audio during the first forward pass, providing a low-latency solution for real-time conversation. Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

cs.DC

[124] Elevating Semantic Exploration: A Novel Approach Utilizing Distributed Repositories

Valerio Bellandi

Main category: cs.DC

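TL;DR: This paper compares the strengths and weaknesses of centralized and distributed systems and presents a distributed document repository system developed for the Italian Ministry of Justice, which uses edge repositories to enhance semantic exploration.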

Details

Motivation: The paper examines where centralized and distributed systems are each appropriate and develops a distributed document repository system to meet the high-availability and performance needs of large-scale environments. Method: A distributed architecture uses edge repositories to analyze textual data and metadata, enhancing semantic exploration. Result: A distributed document repository system suited to large-scale environments was developed, improving semantic exploration capabilities. Conclusion: Distributed systems have the advantage in large-scale environments that require high availability and performance, while centralized systems suit applications with limited scalability and centralized control. Abstract: Centralized and distributed systems are two main approaches to organizing ICT infrastructure, each with its pros and cons. Centralized systems concentrate resources in one location, making management easier but creating single points of failure. Distributed systems, on the other hand, spread resources across multiple nodes, offering better scalability and fault tolerance, but requiring more complex management. The choice between them depends on factors like application needs, scalability, and data sensitivity. Centralized systems suit applications with limited scalability and centralized control, while distributed systems excel in large-scale environments requiring high availability and performance. This paper explores a distributed document repository system developed for the Italian Ministry of Justice, using edge repositories to analyze textual data and metadata, enhancing semantic exploration capabilities.

cs.LG

[125] When Your Own Output Becomes Your Training Data: Noise-to-Meaning Loops and a Formal RSI Trigger

Rintaro Ando

Main category: cs.LG

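TL;DR: N2M-RSI is a formal model showing that once an AI agent feeds its own outputs back as inputs and crosses an information-integration threshold, its internal complexity grows without bound under the model's assumptions.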

Details

Motivation: The work unifies earlier ideas on self-prompting large language models, Gödelian self-reference, and AutoML to explore the potential for AI self-improvement. Method: The model achieves self-improvement through recursive feedback and crossing an information-integration threshold, and extends to multi-agent interaction. Result: The model shows internal complexity growing without bound, and multi-agent interaction may produce super-linear effects. Conclusion: N2M-RSI provides a theoretical framework for AI self-improvement; for safety reasons, system-specific implementation details are not released. Abstract: We present Noise-to-Meaning Recursive Self-Improvement (N2M-RSI), a minimal formal model showing that once an AI agent feeds its own outputs back as inputs and crosses an explicit information-integration threshold, its internal complexity will grow without bound under our assumptions. The framework unifies earlier ideas on self-prompting large language models, Gödelian self-reference, and AutoML, yet remains implementation-agnostic. The model furthermore scales naturally to interacting swarms of agents, hinting at super-linear effects once communication among instances is permitted. For safety reasons, we omit system-specific implementation details and release only a brief, model-agnostic toy prototype in Appendix C.

[126] Radio: Rate-Distortion Optimization for Large Language Model Compression

Sean I. Young

Main category: cs.LG

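TL;DR: The paper proposes a quantization method for large language models based on rate-distortion theory that lets users compress models to a size or accuracy they specify.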

Details

Motivation: The work addresses the deployment of large language models on resource-limited devices while reducing compute costs and environmental impact. Method: Starting from rate-distortion theory, the paper proposes a quantization technique based on simple rate-distortion optimization. Result: The technique scales to models with hundreds of billions of parameters and lets users compress models to a specified size or accuracy. Conclusion: The quantization method offers a flexible and efficient solution for compressing large language models. Abstract: In recent years, the compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices, reducing compute costs, and mitigating the environmental footprint of large-scale AI infrastructure. Here, we establish the foundations of LLM quantization from a rate-distortion theory perspective and propose a quantization technique based on simple rate-distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.

[127] Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao,Yiran Wu,Yang Yue,Tong Wu,Quentin Xu,Yang Yue,Matthieu Lin,Shenzhi Wang,Qingyun Wu,Zilong Zheng,Gao Huang

Main category: cs.LG

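TL;DR: The paper proposes Absolute Zero, a new RLVR paradigm in which the model generates its own tasks and verifies the answers without any external data, achieving SOTA performance on coding and mathematical reasoning tasks.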

Details

Motivation: Address the dependence of existing RLVR methods on human-curated data, and the concern that human-provided tasks may limit learning potential once AI surpasses human intelligence. Method: Proposes the Absolute Zero paradigm and introduces the Absolute Zero Reasoner (AZR), which uses a code executor to validate self-proposed tasks and verify answers, and thereby self-evolves. Result: On coding and mathematical reasoning tasks, AZR outperforms existing models that rely on large amounts of human-curated data. Conclusion: The Absolute Zero paradigm demonstrates the potential of self-evolution without external data and applies across model scales and model classes. Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
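
The abstract describes a code executor that both validates proposed tasks and verifies answers. The following is a minimal sketch of that loop under stated assumptions: a task is a Python source string defining `f(x)` plus an input, validity means "runs without error and gives the same result twice", and the reward is binary. All names here are hypothetical, and `exec` is used without sandboxing purely for illustration.

```python
def run_program(source, arg):
    """Execute `source`, which is assumed to define f(x), and return f(arg)."""
    namespace = {}
    exec(source, namespace)  # NOTE: no sandboxing; illustration only
    return namespace["f"](arg)


def validate_task(source, arg):
    """Return (is_valid, gold_output). A proposed task counts as valid here
    if it executes without error and is deterministic (an assumption)."""
    try:
        first = run_program(source, arg)
        second = run_program(source, arg)
    except Exception:
        return False, None
    return first == second, first


def reward(source, arg, predicted_output):
    """Binary verifiable reward: 1.0 only for a valid task solved correctly."""
    valid, gold = validate_task(source, arg)
    if not valid:
        return 0.0
    return 1.0 if predicted_output == gold else 0.0


if __name__ == "__main__":
    task = "def f(x):\n    return x * 2\n"
    print(reward(task, 3, 6))  # solver predicted the executor's output
    print(reward(task, 3, 7))  # solver was wrong
```

The point of the design is that the executor, not a human label, is the source of truth for both halves of the loop.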

[128] A Wireless Collaborated Inference Acceleration Framework for Plant Disease Recognition

Hele Zhu,Xinyi Huang,Haojia Gao,Mengfei Jiang,Haohua Que,Lei Mu

Main category: cs.LG

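TL;DR: Proposes a plant disease recognition framework built on collaboration between edge devices and cloud servers; it prunes the model with deep reinforcement learning and optimizes the split point, significantly increasing inference speed.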

Details

Motivation: Traditional manual recognition is inefficient, deep learning is hard to run on resource-limited devices, and limited communication bandwidth hampers cloud-server inference. Method: Prunes the DNN model with deep reinforcement learning and determines the optimal split point with a greedy strategy to accelerate collaborative inference. Result: Experiments show that the framework significantly increases inference speed while maintaining acceptable recognition accuracy. Conclusion: Offers a new solution for rapidly diagnosing and preventing plant diseases. Abstract: Plant disease is a critical factor affecting agricultural production. Traditional manual recognition methods face significant drawbacks, including low accuracy, high costs, and inefficiency. Deep learning techniques have demonstrated significant benefits in identifying plant diseases, but they still face challenges such as inference delays and high energy consumption. Deep learning algorithms are difficult to run on resource-limited embedded devices. Offloading these models to cloud servers is constrained by communication bandwidth, and all of these factors influence inference efficiency. We propose a collaborative inference framework for recognizing plant diseases between edge devices and cloud servers to enhance inference speed. The DNN model for plant disease recognition is pruned through deep reinforcement learning to improve the inference speed and reduce energy consumption. Then the optimal split point is determined by a greedy strategy to achieve the best collaborative inference acceleration. Finally, the system for collaborative inference acceleration in plant disease recognition has been implemented using Gradio to facilitate friendly human-machine interaction. Experiments indicate that the proposed collaborative inference framework significantly increases inference speed while maintaining acceptable recognition accuracy, offering a novel solution for rapidly diagnosing and preventing plant diseases.
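
The abstract does not spell out the greedy strategy, so the following is only a sketch of one plausible reading: advance the split point layer by layer while end-to-end latency keeps improving. The latency model (edge compute + transfer of the intermediate tensor + cloud compute), the function names, and all numbers are assumptions for illustration.

```python
def split_latency(k, edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms):
    """Latency when layers [0, k) run on the edge and layers [k, n) in the cloud."""
    n = len(edge_ms)
    # What crosses the network: the raw input if nothing runs on the edge,
    # otherwise the output of the last edge layer; nothing if all runs locally.
    sent_kb = input_kb if k == 0 else out_kb[k - 1]
    transfer_ms = 0.0 if k == n else sent_kb / bandwidth_kb_per_ms
    return sum(edge_ms[:k]) + transfer_ms + sum(cloud_ms[k:])


def greedy_split(edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms):
    """Move the split point forward while latency strictly improves.
    This is a local search; it can stop before the global optimum."""
    args = (edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms)
    k, best = 0, split_latency(0, *args)
    while k < len(edge_ms):
        candidate = split_latency(k + 1, *args)
        if candidate >= best:
            break
        k, best = k + 1, candidate
    return k, best


if __name__ == "__main__":
    # Made-up per-layer timings (ms) and activation sizes (KB).
    print(greedy_split([2, 2, 10], [1, 1, 1], [50, 10, 1], 100, 10))
```

With few layers, an exhaustive scan over all `n + 1` split points is equally cheap and avoids the local-optimum caveat noted in the docstring.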

[129] ALMA: Aggregated Lipschitz Maximization Attack on Auto-encoders

Chethan Krishnamurthy Ramanaik,Arjun Roy,Eirini Ntoutsi

Main category: cs.LG

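TL;DR: The paper proposes a layer-conditioning-based adversarial optimization objective that strengthens adversarial robustness evaluation of deep autoencoders, and shows experimentally that it outperforms existing methods.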

Details

Motivation: The adversarial robustness of deep autoencoders is relatively underexplored, and existing white-box attacks do not fully exploit the vulnerabilities of their intermediate layers. Method: Proposes a new layer-conditioning-based adversarial optimization objective that improves the propagation of loss gradient information when optimizing adversarial perturbations. Result: Experiments show that the method outperforms existing attacks in both universal and sample-specific scenarios. Conclusion: The paper also introduces a defense plugin that mitigates the effects of adversarial examples. Abstract: Despite the extensive use of deep autoencoders (AEs) in critical applications, their adversarial robustness remains relatively underexplored compared to classification models. AE robustness is characterized by the Lipschitz bounds of its components. Existing robustness evaluation frameworks based on white-box attacks do not fully exploit the vulnerabilities of intermediate ill-conditioned layers in AEs. In the context of optimizing imperceptible norm-bounded additive perturbations to maximize output damage, existing methods struggle to effectively propagate adversarial loss gradients throughout the network, often converging to less effective perturbations. To address this, we propose a novel layer-conditioning-based adversarial optimization objective that effectively guides the adversarial map toward regions of local Lipschitz bounds by enhancing loss gradient information propagation during attack optimization. We demonstrate through extensive experiments on state-of-the-art AEs that our adversarial objective results in stronger attacks, outperforming existing methods in both universal and sample-specific scenarios. As a defense method against this attack, we introduce an inference-time adversarially trained defense plugin that mitigates the effects of adversarial examples.

physics.soc-ph [Back]

[130] Floating Car Observers in Intelligent Transportation Systems: Detection Modeling and Temporal Insights

Jeremias Gerner,Klaus Bogenberger,Stefanie Schmidtner

Main category: physics.soc-ph

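TL;DR: The paper examines modeling approaches for Floating Car Observers (FCOs) in microscopic traffic simulation and demonstrates their potential for intelligent transportation systems.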

Details

Motivation: Traditional Floating Car Data (FCD) is limited; FCOs integrate onboard sensors to provide richer traffic data in support of intelligent transportation system applications. Method: Uses several modeling approaches, including 2D raytracing, high-fidelity co-simulation, and neural network-based emulation, to evaluate FCOs in a digital twin of a traffic network modeled in SUMO. Result: Even at a 20% penetration rate, LiDAR-based FCOs identify 65% of vehicles; with temporal information, over 80% of previously detected but currently unseen vehicles are recovered with minimal positional deviation. Conclusion: FCOs show significant potential for intelligent transportation systems, particularly for traffic state estimation and monitoring, across varying penetration rates and traffic conditions. Abstract: Floating Car Observers (FCOs) extend traditional Floating Car Data (FCD) by integrating onboard sensors to detect and localize other traffic participants, providing richer and more detailed traffic data. In this work, we explore various modeling approaches for FCO detections within microscopic traffic simulations to evaluate their potential for Intelligent Transportation System (ITS) applications. These approaches range from 2D raytracing to high-fidelity co-simulations that emulate real-world sensors and integrate 3D object detection algorithms to closely replicate FCO detections. Additionally, we introduce a neural network-based emulation technique that effectively approximates the results of high-fidelity co-simulations. This approach captures the unique characteristics of FCO detections while offering a fast and scalable solution for modeling. Using this emulation method, we investigate the impact of FCO data in a digital twin of a traffic network modeled in SUMO. Results demonstrate that even at a 20% penetration rate, FCOs using LiDAR-based detections can identify 65% of vehicles across various intersections and traffic demand scenarios. Further potential emerges when temporal insights are integrated, enabling the recovery of previously detected but currently unseen vehicles. By employing data-driven methods, we recover over 80% of these vehicles with minimal positional deviations. These findings underscore the potential of FCOs for ITS, particularly in enhancing traffic state estimation and monitoring under varying penetration rates and traffic conditions.
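
As a rough illustration of the simplest approach mentioned, 2D raytracing, the sketch below checks line of sight from an observer to each vehicle and treats every other vehicle as a circular occluder. The circle approximation, the radius, the sensor range, and the function names are assumptions; the paper's raytracing model may differ.

```python
import math


def segment_hits_circle(p, q, center, radius):
    """True if the segment p -> q passes within `radius` of `center`."""
    (px, py), (qx, qy), (cx, cy) = p, q, center
    dx, dy = qx - px, qy - py
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - cx, py - cy) <= radius
    # Project the circle center onto the segment and clamp to its ends.
    t = ((cx - px) * dx + (cy - py) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px + t * dx - cx, py + t * dy - cy) <= radius


def visible_targets(observer, targets, sensor_range, radius=1.0):
    """Indices of targets in range whose sight line is not blocked by
    any other target (each modeled as a circle of the given radius)."""
    visible = []
    for i, target in enumerate(targets):
        if math.dist(observer, target) > sensor_range:
            continue
        occluders = [t for j, t in enumerate(targets) if j != i]
        if not any(segment_hits_circle(observer, target, o, radius) for o in occluders):
            visible.append(i)
    return visible


if __name__ == "__main__":
    vehicles = [(10.0, 0.0), (20.0, 0.0), (0.0, 15.0), (100.0, 0.0)]
    print(visible_targets((0.0, 0.0), vehicles, sensor_range=50.0))
```

In the example, the second vehicle is hidden behind the first and the fourth is out of range, which is the kind of occlusion effect that separates FCO detections from an idealized "everything within range" model.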

cs.NE [Back]

[131] From Neurons to Computation: Biological Reservoir Computing for Pattern Recognition

Ludovico Iannello,Luca Ciampi,Gabriele Lagani,Fabrizio Tonelli,Eleonora Crocco,Lucio Maria Calcagnile,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato

Main category: cs.NE

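TL;DR: Proposes biological reservoir computing (BRC), a new reservoir computing paradigm built on cultured biological neurons; neural activity is recorded with a multi-electrode array, and the resulting nonlinear mapping enables effective pattern recognition.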

Details

Motivation: Explore the potential of biological neural networks for tasks traditionally handled by artificial neural networks, and advance biologically inspired computing systems. Method: Uses cultured neurons as the reservoir substrate, stimulating and recording neural activity through a multi-electrode array to produce a high-dimensional biological feature space. Result: Experiments show that BRC can handle pattern recognition tasks such as positional codes, oriented bars, and digit recognition. Conclusion: BRC demonstrates the feasibility of biological neural networks for computational tasks and opens new directions for neuromorphic engineering and bio-hybrid computing. Abstract: In this paper, we introduce a novel paradigm for reservoir computing (RC) that leverages a pool of cultured biological neurons as the reservoir substrate, creating a biological reservoir computing (BRC) system. This system operates similarly to an echo state network (ESN), with the key distinction that the neural activity is generated by a network of cultured neurons, rather than being modeled by traditional artificial computational units. The neuronal activity is recorded using a multi-electrode array (MEA), which enables high-throughput recording of neural signals. In our approach, inputs are introduced into the network through a subset of the MEA electrodes, while the remaining electrodes capture the resulting neural activity. This generates a nonlinear mapping of the input data to a high-dimensional biological feature space, where distinguishing between data becomes more efficient and straightforward, allowing a simple linear classifier to perform pattern recognition tasks effectively. To evaluate the performance of our proposed system, we present an experimental study that includes various input patterns, such as positional codes, bars with different orientations, and a digit recognition task. The results demonstrate the feasibility of using biological neural networks to perform tasks traditionally handled by artificial neural networks, paving the way for further exploration of biologically-inspired computing systems, with potential applications in neuromorphic engineering and bio-hybrid computing.
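
For readers unfamiliar with the echo state network that the abstract uses as its point of comparison, here is a minimal software sketch of the idea: a fixed, untrained random recurrent network maps inputs to a high-dimensional state, and only a linear readout on that state is trained. In BRC the cultured neurons play the role of `run_reservoir`; the weight scaling, sizes, and names below are illustrative assumptions.

```python
import math
import random


def make_reservoir(n_inputs, n_units, seed=0):
    """Fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_units)]
    # Small recurrent weights so past inputs fade instead of blowing up
    # (an assumed scaling, not a tuned spectral radius).
    w_rec = [[rng.uniform(-0.5, 0.5) / math.sqrt(n_units) for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_rec


def run_reservoir(sequence, w_in, w_rec):
    """Drive the reservoir with a sequence of input vectors; return the final state."""
    state = [0.0] * len(w_rec)
    for u in sequence:
        state = [
            math.tanh(
                sum(wi * ui for wi, ui in zip(w_in[j], u))
                + sum(wr * s for wr, s in zip(w_rec[j], state))
            )
            for j in range(len(w_rec))
        ]
    return state


def linear_readout(state, weights, bias=0.0):
    """The only trained component in reservoir computing: a linear map on the state."""
    return sum(w * s for w, s in zip(weights, state)) + bias


if __name__ == "__main__":
    w_in, w_rec = make_reservoir(n_inputs=2, n_units=8, seed=1)
    print(run_reservoir([[1.0, 0.0], [0.0, 1.0]], w_in, w_rec))
```

Because the reservoir is never trained, it can be swapped for any sufficiently rich dynamical system, which is what makes a living neuronal culture a candidate substrate.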

cs.HC [Back]

[132] Evaluating Foveated Frame Rate Reduction in Virtual Reality for Head-Mounted Displays

Christopher Flöter,Sergej Geringer,Guido Reina,Daniel Weiskopf,Timo Ropinski

Main category: cs.HC

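TL;DR: The study examines how well rendering cost in virtual reality can be reduced by lowering the frame rate (temporal resolution), rather than the spatial resolution, in the periphery.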

Details

Motivation: Investigate the perceptual effects of reducing temporal resolution (frame rate), rather than spatial resolution, in the periphery of the user's view in order to improve rendering efficiency. Method: A user study with 15 participants in a virtual reality setting tested how frame rate reduction is perceived at different eccentricities. Result: Average rendering cost (the number of rendered pixels) could be reduced substantially before participants consistently perceived temporal artifacts. Conclusion: Reducing temporal resolution in the periphery is a viable rendering optimization that lowers computational load without noticeably affecting user experience. Abstract: Foveated rendering methods usually reduce spatial resolution in the periphery of the users' view. However, using foveated rendering to reduce temporal resolution, i.e., rendering frame rate, seems less explored. In this work, we present the results of a user study investigating the perceptual effects of foveated temporal resolution reduction, where only the temporal resolution (frame rate) is reduced in the periphery without affecting spatial quality (pixel density). In particular, we investigated the perception of temporal resolution artifacts caused by reducing the frame rate dependent on the eccentricity of the user's gaze. Our user study with 15 participants was conducted in a virtual reality setting using a head-mounted display. Our results indicate that it was possible to reduce average rendering costs, i.e., the number of rendered pixels, to a large degree before participants consistently reported perceiving temporal artifacts.
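
The relationship between eccentricity-dependent frame rate and rendering cost can be sketched with a few lines of arithmetic. The band boundaries, frame rate divisors, and pixel fractions below are invented for illustration and are not the study's conditions.

```python
def frame_rate_for(eccentricity_deg, bands, full_rate_hz):
    """Frame rate for a pixel at the given angular distance from the gaze point.
    `bands` is a list of (max_eccentricity_deg, divisor), sorted ascending."""
    for max_ecc, divisor in bands:
        if eccentricity_deg <= max_ecc:
            return full_rate_hz / divisor
    return full_rate_hz / bands[-1][1]  # beyond the last band: use its divisor


def relative_render_cost(pixel_fractions, divisors):
    """Average pixels rendered per frame, relative to rendering everything
    at full rate. A band updated every d-th frame costs 1/d of its pixels."""
    return sum(f / d for f, d in zip(pixel_fractions, divisors))


if __name__ == "__main__":
    bands = [(10.0, 1), (30.0, 2), (90.0, 4)]      # hypothetical bands
    print(frame_rate_for(20.0, bands, full_rate_hz=90.0))
    print(relative_render_cost([0.1, 0.3, 0.6], [1, 2, 4]))
```

Because the periphery usually covers most of the display area, even modest divisors there dominate the average cost.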

cs.AI [Back]

[133] Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach

Fabrizio Marozzo

Main category: cs.AI

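TL;DR: The paper proposes an iterative approach that resolves ambiguities in natural language instructions through structured questions and examples and then generates a precise solution, with higher accuracy and user satisfaction than conventional one-shot approaches.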

Details

Motivation: The ambiguity of natural language forces users of generative AI systems through repeated rounds of correction, which is inefficient and makes for a poor user experience. Method: An iterative approach progressively removes ambiguity in the instruction through structured clarification questions and alternative solution proposals, illustrated with input/output examples, before generating a final, precise solution. Result: On a diverse dataset spanning coding, data analysis, and creative writing, the method shows higher accuracy, competitive resolution times, and higher user satisfaction. Conclusion: The iterative approach improves the efficiency and user experience of generative AI systems relative to conventional one-shot solutions. Abstract: Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.

[134] BLAB: Brutally Long Audio Bench

Orevaoghene Ahia,Martijn Bartelds,Kabir Ahuja,Hila Gonen,Valentin Hofmann,Siddhant Arora,Shuyue Stella Li,Vishal Puttagunta,Mofetoluwa Adeyemi,Charishma Buchireddy,Ben Walls,Noah Bennett,Shinji Watanabe,Noah A. Smith,Yulia Tsvetkov,Sachin Kumar

Main category: cs.AI

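TL;DR: BLAB is a challenging benchmark for audio language models on long-form audio, and it shows that current models perform poorly on long-audio tasks.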

Details

Motivation: Develop large audio language models that understand diverse spoken interactions, improving the accessibility of language technologies. Method: Introduces the BLAB benchmark, comprising 833+ hours of long-form audio clips with human-annotated questions and answers, and evaluates models on localization, duration estimation, emotion, and counting tasks. Result: Existing audio language models perform poorly on BLAB, and performance declines as audio duration increases. Conclusion: BLAB provides a challenging evaluation framework for developing models with robust long-form audio understanding. Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, and counting, and they struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.

cs.IR [Back]

[135] Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval

Arthur Satouf,Gabriel Ben Zenou,Benjamin Piwowarski,Habiboulaye Amadou Boubacar,Pablo Piantanida

Main category: cs.IR

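TL;DR: The paper proposes a sparse neural information retrieval method based on the Rational Speech Acts (RSA) framework, which dynamically modulates token-document interactions to improve retrieval performance.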

Details

Motivation: Existing sparse neural information retrieval methods and traditional models such as BM25 do not adequately account for the document collection or the complex interplay between term weights. Method: Adapts the RSA framework to information retrieval, dynamically modulating token-document interactions by considering the influence of the other documents in the dataset. Result: Experiments show that RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. Conclusion: The RSA framework effectively improves sparse retrieval models and raises retrieval performance on out-of-domain datasets. Abstract: Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case, and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. https://github.com/arthur-75/Rational-Retrieval-Acts
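
The sketch below shows one step of the textbook RSA recursion applied to a toy token-document weight matrix, to illustrate how a token's weight in one document comes to depend on the other documents. It is an assumption-laden illustration (dense lists, a single recursion step, rationality parameter `alpha`), not the paper's exact formulation or the code in the linked repository.

```python
def literal_listener(weights):
    """L0(d | t): normalize each token's weights over documents.
    `weights` maps token -> list of per-document weights."""
    listener = {}
    for token, row in weights.items():
        total = sum(row)
        listener[token] = [w / total if total > 0 else 0.0 for w in row]
    return listener


def pragmatic_speaker(weights, alpha=1.0):
    """S1(t | d): for each document, normalize L0(d | t)^alpha over tokens.
    Tokens shared by many documents are down-weighted for each of them."""
    listener = literal_listener(weights)
    n_docs = len(next(iter(weights.values())))
    speaker = []
    for d in range(n_docs):
        scores = {t: listener[t][d] ** alpha for t in weights}
        total = sum(scores.values())
        speaker.append({t: (s / total if total > 0 else 0.0) for t, s in scores.items()})
    return speaker


if __name__ == "__main__":
    # Toy collection: "the" occurs in both documents, "cat"/"dog" in one each.
    weights = {"the": [1.0, 1.0], "cat": [1.0, 0.0], "dog": [0.0, 1.0]}
    for doc_id, dist in enumerate(pragmatic_speaker(weights)):
        print(doc_id, dist)
```

In the toy collection, the token shared by both documents ends up with half the weight of the distinguishing token, even though the raw weights were equal. This is the contrastive effect the abstract describes.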

cs.SD [Back]

[136] SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

Zhaoxi Mu,Xinyu Yang,Gang Wang

Main category: cs.SD

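TL;DR: SepALM uses audio language models (ALMs) to correct and re-synthesize speech in the text domain, addressing the limitations of conventional speech separation in noisy and reverberant environments.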

Details

Motivation: Noise and reverberation in real-world environments cause artifacts or distortions in separated speech, and conventional pipelines that combine ASR with LLMs suffer from error accumulation and optimization difficulties. Method: SepALM comprises four core components (a separator, a corrector, a synthesizer, and an aligner) and uses an ALM-based end-to-end error correction mechanism together with Chain-of-Thought prompting and knowledge distillation. Result: Experiments show that SepALM improves the precision of speech separation and markedly improves adaptability to novel acoustic environments. Conclusion: By processing in the text domain with an ALM, SepALM overcomes the limitations of conventional methods and offers a better solution for speech separation in complex environments. Abstract: While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.

[137] CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

Detao Bai,Zhiheng Ma,Xihan Wei,Liefeng Bo

Main category: cs.SD

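TL;DR: CoGenAV optimizes a dual objective that combines contrastive feature alignment with generative text prediction, learning audio-visual representations that transfer across tasks from a small amount of labeled data and significantly improving speech recognition, including in noisy environments.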

Details

Motivation: Exploit the synchrony between a speaker's lip movements, voice, and linguistic content to improve on audio-only systems in challenging conditions. Method: CoGenAV optimizes a dual objective (contrastive feature alignment and generative text prediction) through a contrastive-generative synchronization strategy, using only 223 hours of labeled data from the LRS2 dataset. Result: On LRS2, the WER is 1.27 for AVSR and 22.0 for VSR; performance in noisy environments improves by over 70%, and the representations also perform well on speech reconstruction and synchronization tasks. Conclusion: CoGenAV demonstrates the strong potential of audio-visual representations across tasks, and the model will be open-sourced to foster collaboration between academia and industry. Abstract: The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony: contrastive feature alignment and generative text prediction, using only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to achieving a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR) with a WER of 22.0 on LRS2, and significantly improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks like Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.

eess.IV [Back]

[138] Physical foundations for trustworthy medical imaging: a review for artificial intelligence researchers

Miriam Cobo,David Corral Fontecha,Wilson Silva,Lara Lloret Iglesias

Main category: eess.IV

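TL;DR: The paper reviews the development of artificial intelligence in medical imaging and stresses the importance of physics knowledge for making AI algorithms more trustworthy and robust.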

Details

Motivation: AI developers often lack an understanding of the physical principles behind medical imaging, which limits how fully AI's potential in the field can be realized. Method: Reviews the physical foundations of medical imaging and their impact on AI, particularly generative models and reconstruction algorithms, and explores the integration of physics knowledge into machine learning models. Result: Integrating physics knowledge improves the performance of AI algorithms in medical imaging, especially when data are limited. Conclusion: Physics-inspired machine learning models that incorporate physics-based constraints learn medical imaging features more effectively. Abstract: Artificial intelligence in medical imaging has seen unprecedented growth in recent years, due to rapid advances in deep learning and computing resources. Applications cover the full range of existing medical imaging modalities, with unique characteristics driven by the physics of each technique. Yet, artificial intelligence professionals entering the field, and even experienced developers, often lack a comprehensive understanding of the physical principles underlying medical image acquisition, which hinders their ability to fully leverage its potential. The integration of physics knowledge into artificial intelligence algorithms enhances their trustworthiness and robustness in medical imaging, especially in scenarios with limited data availability. In this work, we review the fundamentals of physics in medical images and their impact on the latest advances in artificial intelligence, particularly in generative models and reconstruction algorithms. Finally, we explore the integration of physics knowledge into physics-inspired machine learning models, which leverage physics-based constraints to enhance the learning of medical imaging features.

[139] Dual Prompting for Diverse Count-level PET Denoising

Xiaofeng Liu,Yongsong Huang,Thibault Marin,Samira Vafay Eslahi,Tiss Amal,Yanis Chemli,Keith Johnson,Georges El Fakhri,Jinsong Ouyang

Main category: eess.IV

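TL;DR: Proposes a prompt-learning-based PET denoising method whose dual prompts (an explicit count-level prompt and an implicit general denoising prompt) dynamically guide denoising across count levels, substantially improving performance.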

Details

Motivation: PET image denoising must cope with diverse count levels, which calls for a unified, general model that handles varied cases. Method: Uses dual prompts (an explicit count-level prompt and an implicit general denoising prompt), combined with a prompt fusion module and a prompt-feature interaction module, to guide denoising dynamically. Result: Evaluated on 1940 low-count PET 3D volumes, the dual prompting substantially improves performance and outperforms the count-conditional model. Conclusion: Dual prompting makes it possible to train a unified denoising model efficiently and apply it to personalized denoising at different count levels. Abstract: The positron emission tomography (PET) volumes to be denoised inherently have diverse count levels, which poses challenges for a unified model to tackle varied cases. In this work, we resort to the recently flourishing prompt learning to achieve generalizable PET denoising with different count levels. Specifically, we propose dual prompts to guide the PET denoising in a divide-and-conquer manner, i.e., an explicit count-level prompt to provide the specific prior information and an implicit general denoising prompt to encode the essential PET denoising knowledge. Then, a novel prompt fusion module is developed to unify the heterogeneous prompts, followed by a prompt-feature interaction module to inject prompts into the features. The prompts are able to dynamically guide the noise-conditioned denoising process. Therefore, we are able to efficiently train a unified denoising model for various count levels, and deploy it to different cases with personalized prompts. We evaluated on 1940 low-count PET 3D volumes with uniformly randomly selected 13-22% fractions of events from 97 $^{18}$F-MK6240 tau PET studies. The results show that our dual prompting largely improves performance with an informed count level and outperforms the count-conditional model.

[140] STG: Spatiotemporal Graph Neural Network with Fusion and Spatiotemporal Decoupling Learning for Prognostic Prediction of Colorectal Cancer Liver Metastasis

Yiran Zhu,Wei Yang,Yan Su,Zesheng Li,Chengchang Pan,Honggang Qi

Main category: eess.IV

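TL;DR: Proposes a multimodal spatiotemporal graph neural network (STG) framework for predicting the progression of colorectal cancer liver metastasis (CRLM), significantly outperforming existing methods.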

Details

Motivation: Existing clinical models do not effectively integrate the tumor's spatial heterogeneity, dynamic evolution, and multimodal data relationships, which limits predictive accuracy. Method: Combines preoperative CT imaging and clinical data into a heterogeneous graph, jointly models tumor distribution and temporal evolution through spatial topology and cross-modal edges, aggregates spatiotemporal neighborhood information with GraphSAGE, and applies supervised and contrastive learning strategies. Result: On the MSKCC CRLM dataset, the model reaches a time-adjacent accuracy of 85% and a mean absolute error of 1.1005; a lightweight version reduces the parameter count by 78.55%. Conclusion: The framework uncovers associations between dynamic changes in the tumor microenvironment and prognosis, providing reliable support for personalized treatment decisions. Abstract: We propose a multimodal spatiotemporal graph neural network (STG) framework to predict colorectal cancer liver metastasis (CRLM) progression. Current clinical models do not effectively integrate the tumor's spatial heterogeneity, dynamic evolution, and complex multimodal data relationships, limiting their predictive accuracy. Our STG framework combines preoperative CT imaging and clinical data into a heterogeneous graph structure, enabling joint modeling of tumor distribution and temporal evolution through spatial topology and cross-modal edges. The framework uses GraphSAGE to aggregate spatiotemporal neighborhood information and leverages supervised and contrastive learning strategies to enhance the model's ability to capture temporal features and improve robustness. A lightweight version of the model reduces parameter count by 78.55%, maintaining near-state-of-the-art performance. The model jointly optimizes recurrence risk regression and survival analysis tasks, with contrastive loss improving feature representational discriminability and cross-modal consistency. Experimental results on the MSKCC CRLM dataset show a time-adjacent accuracy of 85% and a mean absolute error of 1.1005, significantly outperforming existing methods. The innovative heterogeneous graph construction and spatiotemporal decoupling mechanism effectively uncover the associations between dynamic tumor microenvironment changes and prognosis, providing reliable quantitative support for personalized treatment decisions.

cs.MM [Back]

[141] Mitigating Image Captioning Hallucinations in Vision-Language Models

Fei Zhao,Chengcui Zhang,Runlin Zhang,Tianyang Wang,Xi Li

Main category: cs.MM

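TL;DR: Proposes a reinforcement-learning-based test-time adaptation framework that reduces hallucinations by updating only the learnable parameters of the language model's layer normalization, with no retraining and no auxiliary models.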

Details

Motivation: Hallucinations in vision-language models (VLMs) reduce reliability and real-world applicability, and existing remedies are computationally expensive and require laborious data collection. Method: A reinforcement learning framework updates only the learnable parameters in the language model's layer normalization (approximately 0.003% of the model parameters), with a CLIP-based hallucination evaluation model providing dual rewards. Result: Hallucination rates drop by 15.4% on LLaVA and 17.3% on InstructBLIP, and the approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation. Conclusion: The method reduces hallucinations effectively at low computational cost and has practical potential. Abstract: Hallucinations in vision-language models (VLMs) hinder reliability and real-world applicability, usually stemming from distribution shifts between pretraining data and test samples. Existing solutions, such as retraining or fine-tuning on additional data, demand significant computational resources and labor-intensive data collection, while ensemble-based methods incur additional costs by introducing auxiliary VLMs. To address these challenges, we propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference without retraining or any auxiliary VLMs. By updating only the learnable parameters in the layer normalization of the language model (approximately 0.003% of the model parameters), our method reduces distribution shifts between test samples and pretraining samples. A CLIP-based hallucination evaluation model is proposed to provide dual rewards to VLMs. Experimental results demonstrate a 15.4% and 17.3% reduction in hallucination rates on LLaVA and InstructBLIP, respectively. Our approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation, demonstrating its effectiveness.

cs.RO [Back]

[142] Sim2Real Transfer for Vision-Based Grasp Verification

Pau Amargant,Peter Hönig,Markus Vincze

Main category: cs.RO

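TL;DR: Proposes a vision-based grasp verification method that determines whether a robotic gripper has successfully grasped an object, using a two-stage YOLO and ResNet architecture and a new synthetic dataset, HSR-GraspSynth. Experiments show high accuracy in real-world environments.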

Details

Motivation: Traditional methods based on force and tactile sensors struggle with deformable and non-rigid objects, so a more reliable vision-based verification method is needed. Method: YOLO detects the gripper's location, a ResNet classifier determines whether an object is present, and the synthetic dataset HSR-GraspSynth compensates for limited real-world data. Result: Experiments show high accuracy in real-world environments, and the method can be integrated into grasping pipelines. Conclusion: The proposed vision-based method addresses the challenge of verifying grasps of deformable objects; code and datasets are publicly available. Abstract: The verification of successful grasps is a crucial aspect of robot manipulation, particularly when handling deformable objects. Traditional methods relying on force and tactile sensors often struggle with deformable and non-rigid objects. In this work, we present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture: first, a YOLO-based object detection model detects and locates the robot's gripper, and then a ResNet-based classifier determines the presence of an object. To address the limitations of real-world data capture, we introduce HSR-GraspSynth, a synthetic dataset designed to simulate diverse grasping scenarios. Furthermore, we explore the use of Visual Question Answering capabilities as a zero-shot baseline to which we compare our model. Experimental results demonstrate that our approach achieves high accuracy in real-world environments, with potential for integration into grasping pipelines. Code and datasets are publicly available at https://github.com/pauamargant/HSR-GraspSynth.

Guillermo Roque,Erika Maquiling,Jose Giovanni Tapia Lopez,Ross Greer

Main category: cs.RO

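TL;DR: Uses GPS and NLP to generate instruction-action data pairs automatically, reducing manual annotation cost and making dataset creation more efficient.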

Details

Motivation: Manually annotating instruction-action data pairs is costly and slow, so automated approaches are needed. Method: Collects voice instructions from GPS applications and pairs them with video data to form vision-language-action triads, using an automated data collection system called ADVLAT-Engine. Result: GPS voice instructions are categorized into eight classes, demonstrating the potential of automatically generating high-quality datasets. Conclusion: Automatically generating instruction-action data pairs can accelerate the creation of high-quality datasets that support vision-language navigation and human-interactive autonomous systems. Abstract: Instruction-Action (IA) data pairs are valuable for training robotic systems, especially autonomous vehicles (AVs), but having humans manually annotate this data is costly and time-inefficient. This paper explores the potential of using mobile application Global Positioning System (GPS) references and Natural Language Processing (NLP) to automatically generate large volumes of IA commands and responses without having a human generate or retroactively tag the data. In our pilot data collection, by driving to various destinations and collecting voice instructions from GPS applications, we demonstrate a means to collect and categorize the diverse sets of instructions, further accompanied by video data to form complete vision-language-action triads. We provide details on our completely automated data collection prototype system, ADVLAT-Engine. We characterize collected GPS voice instructions into eight different classifications, highlighting the breadth of commands and referentialities available for curation from freely available mobile applications. Through research and exploration into the automation of IA data pairs using GPS references, the potential to increase the speed and volume at which high-quality IA datasets are created, while minimizing cost, can pave the way for robust vision-language-action (VLA) models to serve tasks in vision-language navigation (VLN) and human-interactive autonomous systems.

[144] Self-Supervised Learning for Robotic Leaf Manipulation: A Hybrid Geometric-Neural Approach

Srecharan Selvam,Abhishesh Silwal,George Kanter

Main category: cs.RO

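TL;DR: Proposes an autonomous leaf grasping method that combines geometric and neural approaches and achieves high success rates through self-supervised learning.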

Details

Motivation: Automating leaf manipulation in agricultural settings is challenged by variable plant morphology and deformable leaves, which calls for an effective method. Method: Combines YOLOv8 instance segmentation with RAFT-Stereo 3D depth estimation, then dynamically fuses a geometric feature scoring pipeline with a neural refinement module (GraspPointCNN). Result: Success rates are 88.0% in controlled environments and 84.7% in a real greenhouse, significantly better than purely geometric (75.3%) and purely neural (60.2%) methods. Conclusion: Offers agricultural robotics a new paradigm that combines domain expertise with machine learning and supports fully automated crop monitoring systems. Abstract: Automating leaf manipulation in agricultural settings faces significant challenges, including the variability of plant morphologies and deformable leaves. We propose a novel hybrid geometric-neural approach for autonomous leaf grasping that combines traditional computer vision with neural networks through self-supervised learning. Our method integrates YOLOv8 for instance segmentation and RAFT-Stereo for 3D depth estimation to build rich leaf representations, which feed into both a geometric feature scoring pipeline and a neural refinement module (GraspPointCNN). The key innovation is our confidence-weighted fusion mechanism that dynamically balances the contribution of each approach based on prediction certainty. Our self-supervised framework uses the geometric pipeline as an expert teacher to automatically generate training data. Experiments demonstrate that our approach achieves an 88.0% success rate in controlled environments and 84.7% in real greenhouse conditions, significantly outperforming both purely geometric (75.3%) and neural (60.2%) methods. This work establishes a new paradigm for agricultural robotics where domain expertise is seamlessly integrated with machine learning capabilities, providing a foundation for fully automated crop monitoring systems.
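
The abstract names a confidence-weighted fusion mechanism without giving its formula. A minimal sketch of one possible form is a convex combination in which the neural score's weight equals the network's confidence; this specific rule, the function names, and the candidate values are assumptions, not the authors' implementation.

```python
def fuse_scores(geometric_score, neural_score, neural_confidence):
    """Blend two grasp-quality scores. With zero confidence the geometric
    score is used unchanged; with full confidence the neural score is."""
    c = min(1.0, max(0.0, neural_confidence))  # clamp to [0, 1]
    return (1.0 - c) * geometric_score + c * neural_score


def pick_grasp_point(candidates):
    """candidates: list of (point, geometric_score, neural_score, neural_confidence).
    Returns the point with the highest fused score."""
    return max(candidates, key=lambda c: fuse_scores(c[1], c[2], c[3]))[0]


if __name__ == "__main__":
    candidates = [
        ((10, 20), 0.9, 0.1, 0.9),   # geometry likes it, a confident network does not
        ((30, 40), 0.6, 0.8, 0.9),   # both moderately agree
    ]
    print(pick_grasp_point(candidates))
```

A rule of this shape falls back to the geometric pipeline when the network is unsure, which fits the abstract's description of the geometric pipeline acting as the expert teacher.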

[145] Visual Imitation Enables Contextual Humanoid Control

Arthur Allshire,Hongsuk Choi,Junyi Zhang,David McAllister,Anthony Zhang,Chung Min Kim,Trevor Darrell,Pieter Abbeel,Jitendra Malik,Angjoo Kanazawa

Main category: cs.RO

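TL;DR: VIDEOMIMIC is a pipeline that extracts human motion and environment information from everyday videos and produces humanoid robot control policies through simulation.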

Details

Motivation: Investigates how to teach humanoid robots complex skills, such as climbing stairs and sitting on chairs, from simple human motion videos, using environment context to learn efficiently. Method: Proposes the VIDEOMIMIC pipeline, which reconstructs the humans and the environment from video, produces robot control policies, and validates them in simulation. Result: Demonstrates robust, repeatable contextual control on real humanoid robots, including dynamic whole-body skills such as climbing stairs and sitting on chairs. Conclusion: VIDEOMIMIC offers a scalable solution for teaching humanoid robots to operate in diverse environments. Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills, all from a single policy, conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

cs.CR [Back]

[146] BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

Zihan Wang,Hongwei Li,Rui Zhang,Wenbo Jiang,Kangjie Chen,Tianwei Zhang,Qingchuan Zhao,Guowen Xu

Main category: cs.CR

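TL;DR: This paper presents a new type of backdoor attack against large language models (LLMs), the lingual-backdoor attack, which uses the language itself as the trigger to induce the model to generate inflammatory speech. An improved task-agnostic attack, BadLingual, substantially increases the attack's generalization.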

Details

Motivation: Exposes a potential vulnerability of multilingual LLMs, namely attacks that use language as the backdoor trigger, in order to promote future defense research. Method: 1. Baseline attack: poisons training data by translating it into the trigger language; 2. BadLingual: a task-agnostic attack based on PGCG adversarial training. Result: The baseline attack reaches an ASR of over 90% on the specified tasks but only 37.61% in the task-agnostic scenario; BadLingual improves on the baseline by up to 37.35%. Conclusion: Lingual-backdoor attacks expose a vulnerability in the multilingual capabilities of LLMs; stronger defense research is needed to improve model robustness. Abstract: In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor, capable of triggering any downstream tasks within the chat LLMs, regardless of the specific questions of these tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) based adversarial training to expand the decision boundary of lingual-backdoor, thereby enhancing the generalization ability of lingual-backdoor across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on the potential defenses to enhance the LLMs' robustness.

cs.CY [Back]

[147] Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration

Kexin Ding,Mu Zhou,Akshay Chaudhari,Shaoting Zhang,Dimitris N. Metaxas

Main category: cs.CY

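TL;DR: This paper examines the alignment of large language models (LLMs) in healthcare, stressing that healthcare stakeholders must participate throughout the model life cycle to ensure outputs match their needs and values.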

Details

Motivation: The broad use of LLMs in healthcare requires their outputs to align with healthcare stakeholders' knowledge, demands, and values so that workflows are effective, safe, and responsible. Method: Healthcare stakeholders take part in the full LLM life cycle, including training data curation, model training, and inference, and alignment is achieved by combining knowledge integration, task understanding, and human guidance. Result: The review shows that, with these approaches, LLMs can better follow human values, increasing trust in healthcare applications. Conclusion: Alignment between humans and LLMs needs to be strengthened further to build trustworthy healthcare applications. Abstract: The wide exploration of large language models (LLMs) raises the awareness of alignment between healthcare stakeholder preferences and model outputs. This alignment becomes a crucial foundation to empower the healthcare workflow effectively, safely, and responsibly. Yet the varying behaviors of LLMs may not always match with healthcare stakeholders' knowledge, demands, and values. To enable a human-AI alignment, healthcare stakeholders will need to perform essential roles in guiding and enhancing the performance of LLMs. Human professionals must participate in the entire life cycle of adopting LLM in healthcare, including training data curation, model training, and inference. In this review, we discuss the approaches, tools, and applications of alignments between healthcare stakeholders and LLMs. We demonstrate that LLMs can better follow human values by properly enhancing healthcare knowledge integration, task understanding, and human guidance. We provide outlooks on enhancing the alignment between humans and LLMs to build trustworthy real-world healthcare applications.

cs.SE [Back]

[148] The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

Fernando Vallecillos Ruiz,Max Hort,Leon Moonen

Main category: cs.SE

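TL;DR: The study examines how to balance multi-output generation and multi-round iteration in automatic program repair (APR). With total patches per bug capped at 10, it evaluates three instruction-tuned LLMs and finds that a small fine-tuning dataset can substantially improve repair results.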

Details

Motivation: Reduce the manual effort of fixing code errors while exploring the potential of LLMs for APR and strategies that balance multi-output generation with iterative refinement. Method: Three instruction-tuned LLMs (DeepSeekCoder-Instruct, Codellama-Instruct, Llama3.1-Instruct) are fine-tuned on datasets of different sizes (1K, 30K, 65K) with two techniques (Full Fine-Tuning and LoRA) and evaluated on the HumanEval-Java and Defects4J benchmarks. Result: Using only a small fraction (<1%) of the fine-tuning dataset yields up to 78% more plausible patches, but fine-tuning beyond certain thresholds degrades results; iterative strategies help base models considerably, and the benefit is more pronounced on complex benchmarks. Conclusion: Strategies that balance multi-output generation and iterative refinement are more effective for APR, especially on complex benchmarks, and excessive fine-tuning should be avoided because it likely leads to overfitting. Abstract: Automatic program repair (APR) aims to reduce the manual efforts required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes to the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities enabled LLMs to refine patches over multiple rounds guided by feedback. However, literature often focuses on many iterations and disregards different numbers of outputs. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs (DeepSeekCoder-Instruct, Codellama-Instruct, and Llama3.1-Instruct) to the APR task. We further fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains using Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing outcomes, likely due to overfitting. Moreover, we show that base models greatly benefit from creating patches in an iterative fashion rather than generating them all at once. In addition, the benefit of iterative strategies becomes more pronounced in complex benchmarks. Even fine-tuned models, while benefiting less from iterations, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement.
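
To make the trade-off between outputs per round and number of rounds concrete, here is a small Python sketch that lists the ways a fixed patch budget can be split. It is an illustration only: the abstract states the limit of 10 total patches per bug but not which splits were evaluated, so the function name and the choice to use exact divisors of the budget are assumptions.

```python
def patch_budget_splits(total_patches: int = 10) -> list[tuple[int, int]]:
    """List (outputs_per_round, rounds) pairs that use the budget exactly.

    Assumption: every round generates the same number of patches, so
    only divisors of the budget give a valid split.
    """
    if total_patches < 1:
        raise ValueError("total_patches must be positive")
    return [
        (outputs, total_patches // outputs)
        for outputs in range(1, total_patches + 1)
        if total_patches % outputs == 0
    ]


if __name__ == "__main__":
    for outputs, rounds in patch_budget_splits(10):
        print(f"{outputs} patch(es) per round x {rounds} round(s)")
```

For a budget of 10 this yields four splits, ranging from a single patch refined over ten rounds to ten patches generated at once, which are the two extremes the abstract contrasts.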

Table of Contents

cs.CV [Back]

[1] RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

[2] StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data

[3] Gone With the Bits: Revealing Racial Bias in Low-Rate Neural Compression for Facial Images

[4] Generating Narrated Lecture Videos from Slides with Synchronized Highlights

[5] Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation

[6] Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking

[7] NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

[8] GIF: Generative Inspiration for Face Recognition at Scale

[9] Lesion-Aware Generative Artificial Intelligence for Virtual Contrast-Enhanced Mammography in Breast Cancer

[10] An Explainable Anomaly Detection Framework for Monitoring Depression and Anxiety Using Consumer Wearable Devices

[11] Estimating the Diameter at Breast Height of Trees in a Forest With a Single 360 Camera

[12] Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability

[13] Image Recognition with Online Lightweight Vision Transformer: A Survey

[14] Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

[15] TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion

[16] VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis

[17] Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control

[18] Motion-compensated cardiac MRI using low-rank diffeomorphic flow (DMoCo)

[19] Robust Fairness Vision-Language Learning for Medical Image Analysis

[20] RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

[21] seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

[22] Interactive Instance Annotation with Siamese Networks

[23] PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models

[24] DCS-ST for Classification of Breast Cancer Histopathology Images with Limited Annotations

[25] Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data

[26] Seeing the Abstract: Translating the Abstract Language for Vision Language Models

[27] PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

[28] DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

[29] OccCylindrical: Multi-Modal Fusion with Cylindrical Representation for 3D Semantic Occupancy Prediction

[30] Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification

[31] Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach

[32] 3D Can Be Explored In 2D: Pseudo-Label Generation for LiDAR Point Clouds Using Sensor-Intensity-Based 2D Semantic Segmentation

[33] Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices

[34] 3D Gaussian Splatting Data Compression with Mixture of Priors

[35] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

[36] SD-VSum: A Method and Dataset for Script-Driven Video Summarization

[37] Very High-Resolution Forest Mapping with TanDEM-X InSAR Data and Self-Supervised Learning

[38] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

[39] From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

[40] A Vision-Language Model for Focal Liver Lesion Classification

[41] GUAVA: Generalizable Upper Body 3D Gaussian Avatar

[42] Interpretable Zero-shot Learning with Infinite Class Concepts

[43] 3D Surface Reconstruction with Enhanced High-Frequency Details

[44] Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models

[45] Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

[46] Attention-aggregated Attack for Boosting the Transferability of Facial Adversarial Examples

[47] Enhancing Target-unspecific Tasks through a Features Matrix

[48] EOPose: Exemplar-based object reposing using Generalized Pose Correspondences

[49] DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation

[50] CXR-AD: Component X-ray Image Dataset for Industrial Anomaly Detection

[51] LiftFeat: 3D Geometry-Aware Local Feature Matching

[52] Phenotype-Guided Generative Model for High-Fidelity Cardiac MRI Synthesis: Advancing Pretraining and Clinical Applications

[53] A Fusion-Guided Inception Network for Hyperspectral Image Super-Resolution

[54] Robustness in AI-Generated Detection: Enhancing Resistance to Adversarial Attacks

[55] Polar Coordinate-Based 2D Pose Prior with Neural Distance Field

[56] Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)

[57] Blending 3D Geometry and Machine Learning for Multi-View Stereopsis

[58] UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion

[59] MRI motion correction via efficient residual-guided denoising diffusion probabilistic models

[60] Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking

[61] Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

[62] Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

[63] RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT

[64] Panoramic Out-of-Distribution Segmentation

[65] Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment

[66] Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID

[67] Real-Time Person Image Synthesis Using a Flow Matching Model

[68] Uncertainty-Aware Prototype Semantic Decoupling for Text-Based Person Search in Full Images

[69] Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

[70] Supervised and Unsupervised Textile Classification via Near-Infrared Hyperspectral Imaging and Deep Learning

[71] DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

[72] Fixed-Length Dense Fingerprint Representation

[73] From Pixels to Polygons: A Survey of Deep Learning Approaches for Medical Image-to-Mesh Reconstruction

[74] PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

[75] Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection

[76] Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images

[77] PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing

[78] Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map