cs.CV

[1] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Yijun Liang,Ming Li,Chenrui Fan,Ziyue Li,Dang Nguyen,Kwesi Cobbina,Shweta Bhardwaj,Jiuhai Chen,Fuxiao Liu,Tianyi Zhou

Main category: cs.CV

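TLDR: ColorBench is a benchmark for evaluating the color understanding of vision-language models (VLMs). It reveals the limitations of current models in color perception, reasoning, and robustness.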

Details

Motivation: To study whether VLMs can perceive, understand, and use color as humans do, filling a research gap in the color understanding of existing models. Method: ColorBench, a suite of diverse test scenarios, is used to evaluate 32 different VLMs on color perception, reasoning, and robustness. Result: The language model matters more than the vision encoder, color understanding has been neglected by existing models, CoT reasoning improves performance, and color cues can mislead models. Conclusion: Current VLMs are limited in color understanding, and ColorBench provides a foundational tool for improving the color understanding of multimodal AI. Abstract: Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.

[2] Enhancing Image Restoration through Learning Context-Rich and Detail-Accurate Features

Hu Gao,Depeng Dang

Main category: cs.CV

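TLDR: This paper proposes a multi-scale design (LCDNet) that combines spatial and frequency domain knowledge to selectively recover the most informative parts of an image, and introduces a skip connection attention mechanism (SCAM) to reduce noise.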

Details

Motivation: Existing image restoration methods focus too heavily on spatial detail and neglect frequency variation, which leads to suboptimal restoration. Method: A hybrid scale frequency selection block (HSFSBlock) and a multi-scale design combine spatial and frequency domain information, and SCAM refines the skip connections. Result: Across a range of image restoration tasks, LCDNet outperforms or matches state-of-the-art algorithms. Conclusion: By balancing spatial and frequency domain information, LCDNet significantly improves image restoration. Abstract: Image restoration involves recovering high-quality images from their corrupted versions, requiring a nuanced balance between spatial details and contextual information. While certain methods address this balance, they predominantly emphasize spatial aspects, neglecting frequency variation comprehension. In this paper, we present a multi-scale design that optimally balances these competing objectives, seamlessly integrating spatial and frequency domain knowledge to selectively recover the most informative information. Specifically, we develop a hybrid scale frequency selection block (HSFSBlock), which not only captures multi-scale information from the spatial domain, but also selects the most informative components for image restoration in the frequency domain. Furthermore, to mitigate the inherent noise introduced by skip connections employing only addition or concatenation, we introduce a skip connection attention mechanism (SCAM) to selectively determine the information that should propagate through skip connections. The resulting tightly interlinked architecture is named LCDNet. Extensive experiments conducted across diverse image restoration tasks showcase that our model attains performance levels that are either superior or comparable to those of state-of-the-art algorithms.

[3] Data Augmentation Through Random Style Replacement

Qikai Yang,Cheng Ji,Huaiying Luo,Panfeng Li,Zhicheng Ding

Main category: cs.CV

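TLDR: Proposes a data augmentation method that combines style augmentation with random erasing, selectively replacing image subregions with style-transferred patches to make training more robust.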

Details

Motivation: Existing style augmentation methods may not fully exploit the benefits of style transfer and are prone to overfitting. Method: Apply a random style transfer to each training image, then randomly replace subregions of the image with patches from the style-transferred version. Result: Better performance and faster convergence than existing methods. Conclusion: The method accommodates a wide range of style transfer algorithms, integrates easily into data augmentation pipelines, and reduces overfitting. Abstract: In this paper, we introduce a novel data augmentation technique that combines the advantages of style augmentation and random erasing by selectively replacing image subregions with style-transferred patches. Our approach first applies a random style transfer to training images, then randomly substitutes selected areas of these images with patches derived from the style-transferred versions. This method is able to seamlessly accommodate a wide range of existing style transfer algorithms and can be readily integrated into diverse data augmentation pipelines. By incorporating our strategy, the training process becomes more robust and less prone to overfitting. Comparative experiments demonstrate that, relative to previous style augmentation methods, our technique achieves superior performance and faster convergence.
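
The patch-replacement step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `random_style_replacement`, the area and aspect-ratio ranges, and the single rectangular patch are hypothetical choices, not details taken from the paper, and the style-transferred image is assumed to have been produced beforehand by whatever style transfer model is in use.

```python
import numpy as np


def random_style_replacement(image, stylized, rng, area_frac=(0.1, 0.4)):
    """Paste one random rectangle from `stylized` into a copy of `image`.

    `image` and `stylized` are HxWxC arrays of the same shape. `stylized` is
    assumed to be a style-transferred version of `image` (hypothetical input;
    any style transfer model could produce it).
    """
    if image.shape != stylized.shape:
        raise ValueError("image and stylized must have the same shape")
    h, w = image.shape[:2]

    # Sample the patch area and aspect ratio (ranges are illustrative only).
    frac = rng.uniform(*area_frac)
    aspect = rng.uniform(0.5, 2.0)
    ph = int(round(np.sqrt(frac * h * w * aspect)))
    pw = int(round(np.sqrt(frac * h * w / aspect)))
    ph = min(max(ph, 1), h)
    pw = min(max(pw, 1), w)

    # Sample the top-left corner so the patch stays inside the image.
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))

    out = image.copy()
    out[top:top + ph, left:left + pw] = stylized[top:top + ph, left:left + pw]
    return out, (top, left, ph, pw)
```

In a training pipeline this would typically be applied per sample with some probability, in the same place where random erasing would otherwise run.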

[4] H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Yushu Wu,Yanyu Li,Ivan Skorokhodov,Anil Kag,Willi Menapace,Sharath Girish,Aliaksandr Siarohin,Yanzhi Wang,Sergey Tulyakov

Main category: cs.CV

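TLDR: The paper systematically studies the network design, compression ratio, and training strategy of autoencoders (AEs), proposes efficient high-compression video AEs that decode in real time on mobile devices, and introduces a novel latent consistency loss that improves reconstruction quality.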

Details

Motivation: To explore the potential of AEs in network design, compression ratio, and training strategy, with the aim of making image and video generation more efficient. Method: Optimizes the AE architecture and computation distribution, proposes a latent consistency loss, and unifies the design of the plain AE and the image-conditioned I2V VAE. Result: Achieves an ultra-high compression ratio and real-time decoding on mobile devices, with reconstruction quality well above prior art. Conclusion: Validates the efficiency and quality of the AE for text-to-video generation. Abstract: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

[5] AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark

Aruna Gauba,Irene Pi,Yunze Man,Ziqi Pang,Vikram S. Adve,Yu-Xiong Wang

Main category: cs.CV

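TLDR: AgMMU is a multimodal dataset focused on agriculture, for evaluating and developing vision-language models (VLMs) that produce factually accurate answers. It contains 5,460 multiple-choice and open-ended questions, plus 205,399 pieces of agricultural knowledge information. Experiments show that existing VLMs struggle with questions that combine detailed perception with factual knowledge, and that open-source models lag behind proprietary ones. Fine-tuning improves LLaVA-1.5 accuracy by up to 3.1%.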

Details

Motivation: Agriculture is a domain of high social benefit that requires connecting visual observation with precise knowledge, for example in pest and disease identification and management guidance. Existing VLMs fall short on knowledge-intensive tasks, so a dedicated dataset is needed to drive model development. Method: AgMMU is built by extracting facts, questions, and answers from 116,231 conversations between real users and agricultural experts, using a three-step pipeline with GPT-4o, LLaMA models, and human verification. The dataset comprises multiple-choice questions, open-ended questions, and a development set. Result: Existing VLMs perform poorly on questions that require both perception and knowledge, and open-source models trail proprietary ones by a substantial margin. Fine-tuning on the development set improves LLaVA-1.5 accuracy by up to 3.1%. Conclusion: AgMMU can serve as both an evaluation benchmark and a development tool for agriculture, advancing knowledge-intensive VLMs. Abstract: We curate a dataset AgMMU for evaluating and developing vision-language models (VLMs) to produce factually accurate answers for knowledge-intensive expert domains. Our AgMMU concentrates on one of the most socially beneficial domains, agriculture, which requires connecting detailed visual observation with precise knowledge to diagnose, e.g., pest identification, management instructions, etc. As a core uniqueness of our dataset, all facts, questions, and answers are extracted from 116,231 conversations between real-world users and authorized agricultural experts. After a three-step dataset curation pipeline with GPT-4o, LLaMA models, and human verification, AgMMU features an evaluation set of 5,460 multiple-choice questions (MCQs) and open-ended questions (OEQs). We also provide a development set that contains 205,399 pieces of agricultural knowledge information, including disease identification, symptom descriptions, management instructions, insect and pest identification, and species identification. As a multimodal factual dataset, it reveals that existing VLMs face significant challenges with questions requiring both detailed perception and factual knowledge. Moreover, open-source VLMs still demonstrate a substantial performance gap compared to proprietary ones. To advance knowledge-intensive VLMs, we conduct fine-tuning experiments using our development set, which improves LLaVA-1.5 evaluation accuracy by up to 3.1%. We hope that AgMMU can serve both as an evaluation benchmark dedicated to agriculture and a development suite for incorporating knowledge-intensive expertise into general-purpose VLMs.

[6] Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

Chunzhuo Wang,Zhewen Xue,T. Sunil Kumar,Guido Camps,Hans Hallez,Bart Vanrumste

Main category: cs.CV

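TLDR: The paper proposes a skeleton-based ST-GCN-BiLSTM model for automatic detection of intake gestures and validates it on a laboratory dataset and a smartphone dataset.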

Details

Motivation: Overweight and obesity are increasingly serious problems linked to unhealthy eating habits, which calls for an automated way to monitor everyday eating behavior. Method: The ST-GCN-BiLSTM model combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory network (BiLSTM) and detects gestures from skeleton data. Result: On the laboratory dataset (OREBA), F1-scores for eating and drinking gestures are 86.18% and 74.84%; on the smartphone dataset they are 85.40% and 67.80%. Conclusion: Skeleton data can be used for intake gesture detection, and the method is robust in cross-dataset validation. Abstract: Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) framework, called ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset using smartphone recordings in more adaptable experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.

[7] SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

Tan-Hanh Pham,Chris Ngo,Trong-Duong Bui,Minh Luu Quang,Tan-Huong Pham,Truong-Son Hy

Main category: cs.CV

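TLDR: Proposes SilVar-Med, a speech-driven multimodal medical vision-language model that addresses two shortcomings of existing models: text interaction is impractical in clinical settings, and predictions lack explained reasoning.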

Details

Motivation: Existing medical vision-language models rely on text instructions, which limits their usefulness in clinical settings such as surgery, and they lack explanations of the reasoning behind their predictions, which reduces their reliability for clinical decision-making. Method: Develops SilVar-Med, an end-to-end speech-driven medical VLM that combines speech interaction with a vision-language model, and proposes a reasoning dataset to support explanations of predictions. Result: Experiments demonstrate the feasibility of reasoning-driven medical image interpretation with speech interaction. Conclusion: SilVar-Med advances medical AI by offering a more transparent, interactive, and clinically viable diagnostic support system. Abstract: Medical Visual Language Models have shown great potential in various healthcare applications, including medical image captioning and diagnostic assistance. However, most existing models rely on text-based instructions, limiting their usability in real-world clinical environments, especially in scenarios such as surgery, where text-based interaction is often impractical for physicians. In addition, current medical image analysis models typically lack comprehensive reasoning behind their predictions, which reduces their reliability for clinical decision-making. Given that medical diagnosis errors can have life-changing consequences, there is a critical need for interpretable and rational medical assistance. To address these challenges, we introduce an end-to-end speech-driven medical VLM, SilVar-Med, a multimodal medical image assistant that integrates speech interaction with VLMs, pioneering the task of voice-based communication for medical image analysis. In addition, we focus on the interpretation of the reasoning behind each prediction of medical abnormalities with a proposed reasoning dataset. Through extensive experiments, we demonstrate a proof-of-concept study for reasoning-driven medical image interpretation with end-to-end speech interaction. We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems. Our code and dataset are publicly available at SilVar-Med.

[8] Relation-Rich Visual Document Generator for Visual Information Extraction

Zi-Han Jiang,Chien-Wei Lin,Wei-Hua Li,Hsuan-Tung Liu,Yi-Ren Yeh,Chu-Song Chen

Main category: cs.CV

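TLDR: The paper proposes RIDGE, a two-stage method that uses content generation followed by content-driven layout generation to address visual information extraction from relation-rich documents.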

Details

Motivation: Existing approaches to visual document understanding are limited by layout diversity and data scarcity. RIDGE aims to address this by generating diverse and practical document layouts. Method: RIDGE works in two stages: (1) LLMs generate the document content; (2) layouts are generated from OCR results, with no human annotation required. Result: Experiments show that RIDGE significantly improves the performance of document understanding models on several VIE benchmarks. Conclusion: Through content-driven layout generation, RIDGE effectively addresses visual information extraction from relation-rich documents, and the code and model will be released. Abstract: Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE .

[9] Perturbed State Space Feature Encoders for Optical Flow with Event Cameras

Gokul Raju Govinda Raju,Nikola Zubić,Marco Cannici,Davide Scaramuzza

Main category: cs.CV

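TLDR: Proposes P-SSE for multi-frame optical flow estimation with event cameras. Perturbing the state dynamics matrix improves performance, with strong results on the DSEC-Flow and MVSEC datasets.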

Details

Motivation: Event cameras have advantages for optical flow estimation, but existing neural networks are limited in temporal and spatial reasoning. Method: Proposes P-SSE, which combines a large receptive field with linear computational complexity and perturbs the state dynamics matrix to improve stability and performance. Result: Improves EPE performance by 8.48% on DSEC-Flow and 11.86% on MVSEC. Conclusion: Through its perturbation technique, P-SSE significantly improves optical flow estimation with event cameras. Abstract: With their motion-responsive nature, event-based cameras offer significant advantages over traditional cameras for optical flow estimation. While deep learning has improved upon traditional methods, current neural networks adopted for event-based optical flow still face temporal and spatial reasoning limitations. We propose Perturbed State Space Feature Encoders (P-SSE) for multi-frame optical flow with event cameras to address these challenges. P-SSE adaptively processes spatiotemporal features with a large receptive field akin to Transformer-based methods, while maintaining the linear computational complexity characteristic of SSMs. However, the key innovation that enables the state-of-the-art performance of our model lies in our perturbation technique applied to the state dynamics matrix governing the SSM system. This approach significantly improves the stability and performance of our model. We integrate P-SSE into a framework that leverages bi-directional flows and recurrent connections, expanding the temporal context of flow prediction. Evaluations on DSEC-Flow and MVSEC datasets showcase P-SSE's superiority, with 8.48% and 11.86% improvements in EPE performance, respectively.
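
For readers unfamiliar with the metric quoted in this abstract, endpoint error (EPE) is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below is a generic illustration, not the paper's evaluation code: the function names, the optional validity mask, and the reading of the reported percentages as relative EPE reductions against a baseline are all assumptions.

```python
import numpy as np


def endpoint_error(pred_flow, gt_flow, valid=None):
    """Mean Euclidean distance between HxWx2 predicted and ground-truth flow.

    `valid` is an optional HxW boolean mask for pixels that have ground truth
    (assumed here because flow ground truth is often sparse).
    """
    err = np.linalg.norm(np.asarray(pred_flow) - np.asarray(gt_flow), axis=-1)
    if valid is not None:
        err = err[valid]
    return float(err.mean())


def relative_improvement(baseline_epe, new_epe):
    """Percentage reduction in EPE relative to a baseline (lower EPE is better)."""
    return (baseline_epe - new_epe) / baseline_epe * 100.0
```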

[10] H-MoRe: Learning Human-centric Motion Representation for Action Analysis

Zhanbo Huang,Xiaoming Liu,Yu Kong

Main category: cs.CV

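TLDR: H-MoRe is a self-supervised method for learning human motion representations. It dynamically preserves relevant motion while filtering out background motion, incorporates human pose and body shape information, and substantially improves downstream task performance.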

Details

Motivation: Existing methods rely on fully supervised learning from synthetic data, whereas H-MoRe learns directly from real-world scenes in a self-supervised manner to capture more precise human motion representations. Method: H-MoRe uses world-local flows, a matrix format that represents absolute and relative motion, and incorporates human pose and body shape information under self-supervised learning. Result: Strong results on gait recognition (CL@R1 up 16.01%), action recognition (Acc@1 up 8.92%), and video generation (FVD down 67.07%), with inference at 34 fps. Conclusion: H-MoRe is an efficient and precise human motion representation suitable for real-time scenarios, and the code and models will be released. Abstract: In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.

[11] NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results

Yuqian Fu,Xingyu Qiu,Bin Ren,Yanwei Fu,Radu Timofte,Nicu Sebe,Ming-Hsuan Yang,Luc Van Gool,Kaijin Zhang,Qingpeng Nong,Xiugang Dong,Hong Gao,Xiangsheng Zhou,Jiancheng Pan,Yanxing Liu,Xiao He,Jiahao Li,Yuze Sun,Xiaomeng Huang,Zhenyu Zhang,Ran Ma,Yuhan Liu,Zijian Zhuang,Shuai Yi,Yixiong Zou,Lingyi Hong,Mingxi Chen,Runze Li,Xingdong Sheng,Wenqiang Zhang,Weisen Chen,Yongxin Yan,Xinguo Chen,Yuanjie Shao,Zhengrong Zuo,Nong Sang,Hao Wu,Haoran Sun,Shuming Hu,Yan Zhang,Zhiguang Shi,Yu Zhang,Chao Chen,Tao Wang,Da Feng,Linhai Zhuo,Ziming Lin,Yali Huang,Jie Me,Yiming Yang,Mi Guo,Mingyuan Jiu,Mingliang Xu,Maomao Xiong,Qunshu Zhang,Xinyu Cao,Yuqing Yang,Dianmo Sheng,Xuanpu Zhao,Zhiyu Li,Xuyang Ding,Wenqian Li

Main category: cs.CV

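TLDR: The paper presents the NTIRE 2025 CD-FSOD Challenge, which aims to improve cross-domain few-shot object detection. It received submissions from 42 teams and produced new SOTA results.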

Details

Motivation: To tackle the challenges of cross-domain few-shot object detection (CD-FSOD) and improve the performance of object detectors on novel domains. Method: Organizes a challenge in which teams submit novel models, evaluated under both open-source and closed-source settings. Result: 42 teams submitted entries, 13 teams made valid final submissions, and some models reached new SOTA results. Conclusion: The challenge advanced research on CD-FSOD and showcased a variety of novel solutions. Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registered participants, received submissions from 42 teams, and concluded with 13 teams making valid final submissions. Participants approached the task from diverse perspectives, proposing novel models that achieved new state-of-the-art (SOTA) results under both open-source and closed-source settings. In this report, we present an overview of the 1st NTIRE 2025 CD-FSOD Challenge, highlighting the proposed solutions and summarizing the results submitted by the participants.

[12] The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Bin Ren,Hang Guo,Lei Sun,Zongwei Wu,Radu Timofte,Yawei Li,Yao Zhang,Xinning Chai,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Li Song,Hongyuan Yu,Pufan Xu,Cheng Wan,Zhijuan Huang,Peng Guo,Shuyuan Cui,Chenjun Li,Xuehai Hu,Pan Pan,Xin Zhang,Heng Zhang,Qing Luo,Linyan Jiang,Haibo Lei,Qifang Gao,Yaqing Li,Weihua Luo,Tsing Li,Qing Wang,Yi Liu,Yang Wang,Hongyu An,Liou Zhang,Shijie Zhao,Lianhong Song,Long Sun,Jinshan Pan,Jiangxin Dong,Jinhui Tang,Jing Wei,Mengyang Wang,Ruilong Guo,Qian Wang,Qingliang Liu,Yang Cheng,Davinci,Enxuan Gu,Pinxin Liu,Yongsheng Yu,Hang Hua,Yunlong Tang,Shihao Wang,Yukun Yang,Zhiyu Zhang,Yukun Yang,Jiyu Wu,Jiancheng Huang,Yifan Liu,Yi Huang,Shifeng Chen,Rui Chen,Yi Feng,Mingxi Li,Cailu Wan,Xiangji Wu,Zibin Liu,Jinyang Zhong,Kihwan Yoon,Ganzorig Gankhuyag,Shengyun Zhong,Mingyang Wu,Renjie Li,Yushen Zuo,Zhengzhong Tu,Zongang Gao,Guannan Chen,Yuan Tian,Wenhui Chen,Weijun Yuan,Zhan Li,Yihang Chen,Yifan Deng,Ruting Deng,Yilin Zhang,Huan Zheng,Yanyan Wei,Wenxuan Zhao,Suiyi Zhao,Fei Wang,Kun Li,Yinggan Tang,Mengjie Su,Jae-hyeon Lee,Dong-Hyeop Son,Ui-Jin Choi,Tiancheng Shao,Yuqing Zhang,Mengcheng Ma,Donggeun Ko,Youngsang Kwak,Jiun Lee,Jaehwa Kwak,Yuxuan Jiang,Qiang Zhu,Siyue Teng,Fan Zhang,Shuyuan Zhu,Bing Zeng,David Bull,Jing Hu,Hui Deng,Xuan Zhang,Lin Zhu,Qinrui Fan,Weijian Deng,Junnan Wu,Wenqin Deng,Yuquan Liu,Zhaohong Xu,Jameer Babu Pinjari,Kuldeep Purohit,Zeyu Xiao,Zhuoyuan Li,Surya Vashisth,Akshay Dudhane,Praful Hambarde,Sachin Chaudhary,Satya Naryan Tazi,Prashant Patil,Santosh Kumar Vipparthi,Subrahmanyam Murala,Wei-Chen Shen,I-Hsiang Chen,Yunzhe Xu,Chen Zhao,Zhizhou Chen,Akram Khatami-Rizi,Ahmad Mahmoudi-Aznaveh,Alejandro Merino,Bruno Longarela,Javier Abad,Marcos V. Conde,Simone Bianco,Luca Cogo,Gianmarco Corti

Main category: cs.CV

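TLDR: This paper reviews the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR), analyzing the submitted methods and their results and highlighting technical advances in the field.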

Details

Motivation: To push deep learning models toward better computational efficiency (runtime, parameters, and FLOPs) at a required level of quality (PSNR), advancing single-image super-resolution. Method: The challenge drew 244 registered entrants and 43 teams with valid entries. Methods are compared by their performance on the DIV2K_LSDIR datasets (PSNR of at least 26.90 dB on validation and 26.99 dB on test). Result: The challenge showcases recent technical progress in single-image ESR and establishes benchmarks for future research. Conclusion: The challenge offers innovative methods and research directions for efficient single-image super-resolution and moves the field forward. Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
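
The PSNR thresholds quoted in this abstract refer to the standard peak signal-to-noise ratio. A minimal sketch of the textbook definition follows. The function name and the 8-bit peak value of 255 are assumptions, and challenge-specific evaluation details such as color space or border cropping are not stated in the abstract and are not modeled here.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Under this definition, a validation result would qualify when the PSNR averaged over the set is at least 26.90 dB.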

[13] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Stathis Galanakis,Alexandros Lattas,Stylianos Moschoglou,Bernhard Kainz,Stefanos Zafeiriou

Main category: cs.CV

TLDR: SpinMeRound is a diffusion-based method designed to generate consistent and accurate head portraits from novel viewpoints.

Details

Motivation: Current methods for generating head portraits are restricted to a limited range of viewing angles, and large-scale diffusion models perform poorly on facial data. Method: Uses several input views together with an identity embedding to generate diverse viewpoints while preserving identity features. Result: Experiments show that the method outperforms current state-of-the-art multi-view diffusion models on 360-degree head synthesis. Conclusion: SpinMeRound has a clear advantage in generating head portraits from novel viewpoints. Abstract: Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although recently emerging large-scale diffusion models have proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in 360-degree head synthesis, while beating current state-of-the-art multiview diffusion models.

[14] Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization

Darryl Hannan,John Cooper,Dylan White,Timothy Doster,Henry Kvinge,Yijing Watkins

Main category: cs.CV

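TLDR: The paper analyzes the fine-grained spatial reasoning of recent multimodal large language models (MLLMs) on earth observation (EO) imagery and evaluates their zero-shot performance.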

Details

Motivation: MLLMs perform well in computer vision but struggle with fine-grained tasks such as object localization in out-of-distribution domains like EO imagery, so their capabilities need further study. Method: Benchmarks recent MLLMs on EO object localization tasks and discusses prompt selection, ground sample distance optimization, and failure cases. Result: The models perform well in certain settings and are well suited to zero-shot scenarios. Conclusion: The study offers a reference for judging whether an MLLM suits an EO localization task and how to optimize it. Abstract: Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly and insights quickly become outdated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion focused on prompt selection, ground sample distance (GSD) optimization, and analyzing failure cases. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.

[15] CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates

Ankit Kumar Shaw,Kun Jiang,Tuopu Wen,Chandan Kumar Sah,Yining Shi,Mengmeng Yang,Diange Yang,Xiaoli Lian

Main category: cs.CV

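TLDR: CleanMAP is a distillation framework based on multimodal large language models (MLLMs) that filters and refines crowdsourced data for high-confidence HD map updates.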

Details

Motivation: The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems raises the accuracy requirements for real-time HD map updates, but inconsistencies in crowdsourced data (motion blur, lighting variations, adverse weather, and lane marking degradation) make map reliability hard to guarantee. Method: CleanMAP uses an MLLM-driven lane visibility scoring model that quantifies key visual parameters and assigns confidence scores (0-10), combined with a dynamic piecewise confidence-scoring function and a confidence-driven local map fusion strategy that selects the best data. Result: Fusing the top three local maps gives the lowest mean map update error (0.28m), better than the baseline (0.37m), with 84.88% agreement with human evaluation. Conclusion: CleanMAP offers a scalable and reliable solution for crowdsourced HD map updates, improving the precision and reliability of autonomous navigation. Abstract: The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
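
The selection rule mentioned in this abstract (top-k local maps within "best score minus 10%") might look roughly like the sketch below. This is one possible reading, not the authors' implementation: the function name `select_local_maps` is hypothetical, the 10% margin is interpreted here as relative to the best score, and the fusion step itself is not shown because the abstract does not describe it.

```python
def select_local_maps(scores, k=3, margin=0.10):
    """Return ids of the top-k local maps whose confidence is near the best.

    `scores` maps a local-map id to its confidence score (0-10). A map is
    eligible when its score is at least best * (1 - margin); this relative
    interpretation of "best score minus 10%" is an assumption.
    """
    if not scores:
        return []
    best = max(scores.values())
    floor = best * (1.0 - margin)
    eligible = [(map_id, s) for map_id, s in scores.items() if s >= floor]
    # Highest confidence first; Python's sort is stable, so ties keep input order.
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [map_id for map_id, _ in eligible[:k]]
```

With k=3 this corresponds to the "top three local maps" setting that the abstract reports as giving the lowest mean update error.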

[16] Hearing Anywhere in Any Environment

Xiulong Liu,Anurag Kumar,Paul Calamia,Sebastia V. Amengual,Calvin Murdock,Ishwarya Ananthabhotla,Philip Robinson,Eli Shlizerman,Vamsi Krishna Ithapu,Ruohan Gao

Main category: cs.CV

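TLDR: The xRIR framework combines a geometric feature extractor with an RIR encoder to predict RIRs across rooms. It substantially outperforms baselines, and its generalization is validated in real-world environments.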

Details

Motivation: A realistic acoustic experience is essential for immersion in mixed reality, but existing neural methods are trained on a single environment and generalize poorly to new rooms. Method: Builds a cross-room RIR prediction framework that combines a geometric feature extractor operating on panorama depth images with an RIR encoder operating on a few reference RIR samples. Result: Strong performance on the ACOUSTICROOMS dataset, with generalization validated in four real-world environments. Conclusion: xRIR achieves cross-room RIR prediction from only a few measurements, offering a more general acoustic modeling approach for mixed reality. Abstract: In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with an RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.

[17] Real-time Seafloor Segmentation and Mapping

Michele Grimaldi,Nouf Alkaabi,Francesco Ruscio,Sebastian Realpe Rua,Rafael Garcia,Nuno Gracias

Main category: cs.CV

TLDR: 该论文提出了一种结合机器学习和计算机视觉的框架，用于自主水下车辆（AUV）监测Posidonia oceanica草场的边界，通过改进的Mask R-CNN模型和岩石分类增强监测效果。

Details

Motivation: Posidonia oceanica meadows, which depend on rocks for their survival, are in global decline and need efficient monitoring tools. Existing deep learning techniques perform poorly in underwater environments. Method: Combines Mask R-CNN image segmentation with a meadow boundary tracking strategy, and adds a rock class to enhance the model. Result: The framework is validated on real underwater images and in a simulation environment, and the AUV autonomously completes rock segmentation and meadow inspection tasks. Conclusion: The framework provides a new tool for marine conservation and supports targeted protection of Posidonia oceanica meadows. Abstract: Posidonia oceanica is a species of seagrass whose meadows are highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.

[18] ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models

Amirhosein Chahe,Lifeng Zhou

Main category: cs.CV

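TLDR: The study examines whether explicitly modeling reasoning improves vision-language model (VLM) performance on autonomous driving tasks. Reasoning-based fine-tuning clearly outperforms the alternatives.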

Details

Motivation: VLMs for autonomous driving lack transparent reasoning capabilities, which are critical for safety. The study tests whether explicitly modeling reasoning improves model performance. Method: GPT-4o generates structured reasoning chains, and reasoning-based fine-tuning, answer-only fine-tuning, and baseline models are compared on the DriveLM benchmark across several small VLM families (Llama 3.2, Llava 1.5, and Qwen 2.5VL). Result: Reasoning-based fine-tuning substantially outperforms the other approaches in accuracy and text generation quality, with Llama3.2-11B-reason performing best. Conclusion: Explicit reasoning strengthens the model's internal representations for driving decisions and points toward more interpretable autonomous driving systems. Abstract: Vision-language models (VLMs) show promise for autonomous driving but often lack transparent reasoning capabilities that are critical for safety. We investigate whether explicitly modeling reasoning during fine-tuning enhances VLM performance on driving decision tasks. Using GPT-4o, we generate structured reasoning chains for driving scenarios from the DriveLM benchmark with category-specific prompting strategies. We compare reasoning-based fine-tuning, answer-only fine-tuning, and baseline instruction-tuned models across multiple small VLM families (Llama 3.2, Llava 1.5, and Qwen 2.5VL). Our results demonstrate that reasoning-based fine-tuning consistently outperforms alternatives, with Llama3.2-11B-reason achieving the highest performance. Models fine-tuned with reasoning show substantial improvements in accuracy and text generation quality, suggesting explicit reasoning enhances internal representations for driving decisions. These findings highlight the importance of transparent decision processes in safety-critical domains and offer a promising direction for developing more interpretable autonomous driving systems.

[19] SeeTree -- A modular, open-source system for tree detection and orchard localization

Jostan Brown,Cindy Grimm,Joseph R. Davidson

Main category: cs.CV

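TLDR: SeeTree is an open-source embedded system for tree trunk detection and orchard localization that supports integration of multiple sensors and performs well in field experiments.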

Details

Motivation: Precision orchard management requires accurate localization, but few off-the-shelf commercial solutions exist. Method: A particle-filter-based visual localization system that can integrate visual, GNSS, or wheel odometry. Result: In experiments in a commercial orchard, the system converged to the correct location 99% of the time and correctly tracked 99% of out-of-row turns. Conclusion: SeeTree is efficient and reliable, and its open-source dataset and code support future research and adoption. Abstract: Accurate localization is an important functional requirement for precision orchard management. However, there are few off-the-shelf commercial solutions available to growers. In this paper, we present SeeTree, a modular, open source embedded system for tree trunk detection and orchard localization that is deployable on any vehicle. Building on our prior work on vision-based in-row localization using particle filters, SeeTree includes several new capabilities. First, it provides capacity for full orchard localization including out-of-row headland turning. Second, it includes the flexibility to integrate either visual, GNSS, or wheel odometry in the motion model. During field experiments in a commercial orchard, the system converged to the correct location 99% of the time over 800 trials, even when starting with large uncertainty in the initial particle locations. When turning out of row, the system correctly tracked 99% of the turns (860 trials representing 43 unique row changes). To help support adoption and future research and development, we make our dataset, design files, and source code freely available to the community.
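
The abstract above names particle filters as the basis of SeeTree's localization but gives no implementation details. As a rough illustration only, the following Python sketch runs a one-dimensional particle filter under stated assumptions: the robot moves along a single row, each step has a known commanded displacement, and the sensor returns a noisy reading of position. The function names, the noise values, and the direct position measurement are all hypothetical stand-ins, not SeeTree's trunk-detection measurement model or its code.

```python
import math
import random


def particle_filter_step(particles, control, measurement,
                         motion_noise, meas_noise, rng):
    """One predict/weight/resample cycle of a 1D particle filter (illustrative)."""
    # Predict: move every particle by the commanded displacement plus noise.
    moved = [p + control + rng.gauss(0.0, motion_noise) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_noise) ** 2)
               for p in moved]
    if sum(weights) == 0.0:
        # Every likelihood underflowed; fall back to uniform weights.
        weights = [1.0] * len(moved)
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def run_demo(seed=0, steps=30, n_particles=500):
    """Start with particles spread over a 100 m row and track a moving robot."""
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, 100.0) for _ in range(n_particles)]
    true_pos = 20.0
    for _ in range(steps):
        true_pos += 1.0  # hypothetical commanded step of 1 m
        reading = true_pos + rng.gauss(0.0, 0.5)  # hypothetical noisy sensor
        particles = particle_filter_step(particles, 1.0, reading,
                                         motion_noise=0.2, meas_noise=0.5,
                                         rng=rng)
    estimate = sum(particles) / len(particles)
    return true_pos, estimate


if __name__ == "__main__":
    truth, estimate = run_demo()
    print(f"true position {truth:.2f} m, estimate {estimate:.2f} m")
```

The particles start spread across the whole row, mirroring the large initial uncertainty the abstract mentions, and the resampling step is what concentrates them around positions consistent with the readings.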

[20] Minimal Sensing for Orienting a Solar Panel

Jeremy Klotz,Shree K. Nayar

Main category: cs.CV

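TLDR: Using measurements from four photodetectors, the method optimizes a solar panel's tilt to maximize irradiance, overcomes the problem of multiple local maxima, and is validated experimentally.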

Details

Motivation: Maximizing the energy a solar panel harvests under arbitrary environments and orientations is a key problem, and conventional gradient ascent fails when the irradiance function has multiple local maxima. Method: Irradiance is measured with four photodetectors; a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function, which removes local maxima and makes the function unimodal. Result: Across a range of real-world settings (direct sunlight, cloudy sky, urban occlusions, and complex indoor lighting), harvested energy is significantly higher than with standard approaches. Conclusion: Blurring the irradiance function resolves the multiple-local-maxima problem, and experiments confirm the method's robustness and efficiency. Abstract: A solar panel harvests the most energy when pointing in the direction that maximizes the total illumination (irradiance) falling on it. Given an arbitrary orientation of a panel and an arbitrary environmental illumination, we address the problem of finding the direction of maximum total irradiance. We develop a minimal sensing approach where measurements from just four photodetectors are used to iteratively vary the tilt of the panel to maximize the irradiance. Many environments produce irradiance functions with multiple local maxima. As a result, simply measuring the gradient of the irradiance function and applying gradient ascent will not work. We show that a larger, optimized tilt between the detectors and the panel is equivalent to blurring the irradiance function. This has the effect of eliminating local maxima and turning the irradiance function into a unimodal one, whose maximum can be found using gradient ascent. We show that there is a close relationship between our approach and scale space theory. We have collected a large dataset of high-dynamic range lighting environments in New York City, called UrbanSky. We used this dataset to conduct simulations to verify the robustness of our approach. Finally, we have built a portable solar panel with four compact detectors and an actuator to conduct experiments in various real-world settings: direct sunlight, cloudy sky, urban settings with occlusions and shadows, and complex indoor lighting. In all cases, we show significant improvements in harvested energy compared to standard approaches for controlling the orientation of a solar panel.
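
The claim that blurring removes local maxima can be illustrated with a toy one-dimensional example. The Python sketch below is not the authors' algorithm or hardware model: the two-peak `irradiance` function, the 1-degree grid, the Gaussian blur width of 30 degrees, and the helper names are all assumptions chosen for illustration. Discrete hill climbing stands in for gradient ascent.

```python
import math

ANGLES = list(range(-90, 91))  # hypothetical tilt grid in degrees


def irradiance(angle):
    """Toy irradiance: a strong peak at +30 degrees and a weaker one at -20."""
    strong = math.exp(-((angle - 30) / 5.0) ** 2)
    weak = 0.5 * math.exp(-((angle + 20) / 5.0) ** 2)
    return strong + weak


def gaussian_blur(values, sigma):
    """Blur a sampled function with an unnormalized Gaussian kernel."""
    n = len(values)
    return [sum(values[j] * math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
                for j in range(n))
            for i in range(n)]


def hill_climb(values, start):
    """Step to the better neighbor until neither neighbor improves."""
    i = start
    while True:
        best = i
        for j in (i - 1, i + 1):
            if 0 <= j < len(values) and values[j] > values[best]:
                best = j
        if best == i:
            return i
        i = best


raw = [irradiance(a) for a in ANGLES]
blurred = gaussian_blur(raw, sigma=30.0)
start = ANGLES.index(-60)  # begin on the far side of the weaker peak

raw_peak = ANGLES[hill_climb(raw, start)]
blurred_peak = ANGLES[hill_climb(blurred, start)]
print("ascent on raw function stops at", raw_peak, "degrees")
print("ascent on blurred function stops at", blurred_peak, "degrees")
```

On this toy function, ascent on the raw samples stalls at the weaker peak at -20 degrees, while ascent on the blurred samples passes it and ends much closer to the stronger peak at +30 degrees. The blurred maximum is shifted slightly away from the true one, which is the usual price of smoothing.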

[21] Rainy: Unlocking Satellite Calibration for Deep Learning in Precipitation

Zhenyu Yu,Hanqing Chen,Mohd Yamani Idna Idris,Pei Wang

Main category: cs.CV

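TLDR: The paper proposes the Rainy dataset and the Taper Loss method to address the fusion of multi-source satellite data with station data, supports five precipitation-related tasks, and promotes interdisciplinary collaboration between quantitative remote sensing and computer vision.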

Details

Motivation: Precipitation is central to the hydrological cycle, but traditional methods are limited by the difficulty of data acquisition and by complex feature relationships, and the lack of standardized multi-source datasets hinders the application of AI models. Method: Proposes the Rainy dataset (integrating satellite and station data) and Taper Loss, supports five tasks (such as satellite calibration and precipitation prediction), and selects benchmark models and evaluation metrics for each. Result: The Rainy dataset and Taper Loss demonstrate seamless collaboration between quantitative remote sensing and computer vision, providing data support for AI applications in quantitative remote sensing and insights for interdisciplinary collaboration. Conclusion: The Rainy dataset and Taper Loss fill a data gap, advance the use of AI in precipitation research, and offer new ideas for interdisciplinary integration. Abstract: Precipitation plays a critical role in the Earth's hydrological cycle, directly affecting ecosystems, agriculture, and water resource management. Accurate precipitation estimation and prediction are crucial for understanding climate dynamics, disaster preparedness, and environmental monitoring. In recent years, artificial intelligence (AI) has gained increasing attention in quantitative remote sensing (QRS), enabling more advanced data analysis and improving precipitation estimation accuracy. Although traditional methods have been widely used for precipitation estimation, they face limitations due to the difficulty of data acquisition and the challenge of capturing complex feature relationships. Furthermore, the lack of standardized multi-source satellite datasets, and in most cases, the exclusive reliance on station data, significantly hinders the effective application of advanced AI models. To address these challenges, we propose the Rainy dataset, a multi-source spatio-temporal dataset that integrates pure satellite data with station data, and propose Taper Loss, designed to fill the gap in tasks where only in-situ data is available without area-wide support. The Rainy dataset supports five main tasks: (1) satellite calibration, (2) precipitation event prediction, (3) precipitation level prediction, (4) spatiotemporal prediction, and (5) precipitation downscaling. For each task, we selected benchmark models and evaluation metrics to provide valuable references for researchers. Using precipitation as an example, the Rainy dataset and Taper Loss demonstrate the seamless collaboration between QRS and computer vision, offering data support for AI for Science in the field of QRS and providing valuable insights for interdisciplinary collaboration and integration.

[22] Visual Language Models show widespread visual deficits on neuropsychological tests

Gene Tangtartharakul,Katherine R. Storrs

Main category: cs.CV

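TLDR: VLMs perform well on complex visual tasks but show marked deficits on elementary visual concepts such as orientation and position, revealing a gap relative to human vision.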

Details

Motivation: To assess VLMs' grasp of elementary visual concepts and reveal how they differ from human vision. Method: Systematically evaluates the visual abilities of three state-of-the-art VLMs using 51 tests drawn from six clinical and experimental batteries. Result: The VLMs excel at object recognition tasks but show marked deficits in low- and mid-level visual abilities. Conclusion: An artificial system can achieve complex object recognition without developing the foundational visual concepts that humans acquire without explicit training. Abstract: Visual Language Models (VLMs) show remarkable performance in visual reasoning tasks, successfully tackling college-level challenges that require high-level understanding of images. However, some recent reports of VLMs struggling to reason about elemental visual concepts like orientation, position, continuity, and occlusion suggest a potential gulf between human and VLM vision. Here we use the toolkit of neuropsychology to systematically assess the capabilities of three state-of-the-art VLMs across visual domains. Using 51 tests drawn from six clinical and experimental batteries, we characterise the visual abilities of leading VLMs relative to normative performance in healthy adults. While the models excel in straightforward object recognition tasks, we find widespread deficits in low- and mid-level visual abilities that would be considered clinically significant in humans. These selective deficits, profiled through validated test batteries, suggest that an artificial system can achieve complex object recognition without developing foundational visual concepts that in humans require no explicit training.

[23] 3D Wavelet Convolutions with Extended Receptive Fields for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Main category: cs.CV

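TLDR: The paper proposes WCNet, an improved 3D-DenseNet model integrated with wavelet transforms, to address high-dimensional data, sparse ground-object distributions, and spectral redundancy in hyperspectral image classification, improving classification performance and generalization.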

Details

Motivation: Hyperspectral image classification faces high-dimensional data, sparse ground-object distributions, and spectral redundancy, which often cause overfitting and limited generalization. Method: Introduces wavelet transforms to extend convolutional receptive fields and, through cascading, guides CNNs to respond better to low-frequency signals (wavelet convolution), dynamically attending to different frequency bands and spatial structures. Result: Outperforms mainstream methods on the IN, UP, and KSC datasets. Conclusion: WCNet expands receptive fields dynamically through wavelet convolution, markedly improving classification performance with few additional parameters. Abstract: Deep neural networks face numerous challenges in hyperspectral image classification, including high-dimensional data, sparse ground object distributions, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To better adapt to ground object distributions while expanding receptive fields without introducing excessive parameters and skipping redundant information, this paper proposes WCNet, an improved 3D-DenseNet model integrated with wavelet transforms. We introduce wavelet transforms to effectively extend convolutional receptive fields and guide CNNs to better respond to low frequencies through cascading, termed wavelet convolution. Each convolution focuses on different frequency bands of the input signal with gradually increasing effective ranges. This process enables greater emphasis on low-frequency components while adding only a small number of trainable parameters. This dynamic approach allows the model to flexibly focus on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The Wavelet Conv module enhances model representation capability by expanding receptive fields through 3D wavelet transforms without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification methods.
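
To make the cascading idea concrete, here is a minimal sketch using a one-dimensional Haar transform in Python. The abstract does not state which wavelet family WCNet uses or how its 3D transform is implemented, so the Haar choice, the 1D setting, and the function name `haar_step` are assumptions made purely for illustration. It shows how repeatedly transforming the low-frequency band makes each coefficient summarize a wider input window.

```python
def haar_step(signal):
    """Split an even-length signal into low- and high-frequency halves."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) / 2 for a, b in pairs]   # local averages
    high = [(a - b) / 2 for a, b in pairs]  # local differences
    return low, high


signal = [4, 6, 10, 12, 8, 6, 5, 5]
low1, high1 = haar_step(signal)   # each low1 value covers 2 input samples
low2, high2 = haar_step(low1)     # each low2 value covers 4 input samples
print(low1, high1)
print(low2, high2)
```

A small convolution kernel applied to `low2` therefore sees twice the input span it would see on `low1`, which is the sense in which the receptive field grows without enlarging the kernel.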

[24] The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability

Jiani Liu,Zhiyuan Wang,Zeliang Zhang,Chao Huang,Susan Liang,Yunlong Tang,Chenliang Xu

Main category: cs.CV

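TLDR: This paper studies how computational redundancy in Vision Transformers (ViTs) affects the transferability of adversarial examples and proposes a suite of techniques that exploit this redundancy to strengthen attacks.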

Details

Motivation: ViTs exhibit distinctive adversarial robustness properties; in particular, adversarial examples crafted on ViTs transfer better than those crafted on CNNs, so it is important to examine how their structural characteristics affect attack effectiveness. Method: Analyzes data-level and model-level redundancy and designs techniques including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Result: Experiments on ImageNet-1k show that the proposed methods significantly outperform existing baselines in transferability and generality. Conclusion: Computational redundancy in ViTs can be exploited effectively to strengthen adversarial attacks, opening a new direction for adversarial robustness research. Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, including the data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

[25] Tabular foundation model to detect empathy from visual cues

Md Rakibul Hasan,Shafin Rahman,Md Zakir Hossain,Aneesh Krishna,Tom Gedeon

Main category: cs.CV

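TLDR: The paper explores using tabular foundation models (TabPFN v2 and TabICL) to detect empathy from tabular features of video interactions, significantly improving cross-subject empathy detection accuracy and AUC.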

Details

Motivation: Because of privacy and ethical concerns, video datasets are often released in tabular form. Although classical tree-based models have performed best, the success of textual foundation models motivates exploring tabular foundation models for empathy detection. Method: Experiments with two tabular foundation models (TabPFN v2 and TabICL) in in-context learning and fine-tuning setups on a public human-robot interaction dataset. Result: Cross-subject empathy detection accuracy rises from 0.590 to 0.730 and AUC from 0.564 to 0.669. Conclusion: Beyond the performance gains, the study contributes new insights and an evaluation setup for empathy detection from tabular features, applicable to privacy-constrained settings. Abstract: Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models, TabPFN v2 and TabICL, through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: 0.590 → 0.730; AUC: 0.564 → 0.669). In addition to performance improvement, we contribute novel insights and an evaluation setup to ensure generalisation on unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.
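
The cross-subject evaluation mentioned above hinges on keeping every subject's samples on one side of the train/test boundary. The abstract does not describe the authors' exact protocol, so the following Python sketch is an assumption: it supposes each row carries a subject identifier and holds out whole subjects. The function name `split_by_subject` and the toy rows are hypothetical.

```python
import random


def split_by_subject(rows, test_fraction=0.25, seed=0):
    """Split (subject_id, features, label) rows so no subject is in both sets."""
    subjects = sorted({row[0] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [row for row in rows if row[0] not in test_subjects]
    test = [row for row in rows if row[0] in test_subjects]
    return train, test


# Toy rows: (subject_id, feature_vector, empathy_label); values are made up.
rows = [(f"s{k % 8}", [k * 0.1, k * 0.2], k % 2) for k in range(40)]
train, test = split_by_subject(rows)
print(len(train), "training rows;", len(test), "test rows")
```

Splitting rows at random instead would let the same person appear in both sets, which tends to inflate scores relative to performance on genuinely unseen subjects.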

[26] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

Christophe Bolduc,Yannick Hold-Geoffroy,Zhixin Shu,Jean-François Lalonde

Main category: cs.CV

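TLDR: GaSLight generates spatially-varying lighting from regular images, making it possible for the first time to use regular images as light sources in a 3D renderer.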

Details

Motivation: Conventional methods struggle to use regular images directly as light sources for 3D rendering; GaSLight aims to solve this. Method: A two-stage approach: (1) enhance the dynamic range of images using diffusion models; (2) model 3D lighting with HDR Gaussian Splats. Result: Achieves state-of-the-art results on HDR estimation and virtual scene illumination, and introduces a new dataset for evaluation. Conclusion: GaSLight offers an efficient solution for using regular images as light sources and advances the field. Abstract: We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.

[27] PatrolVision: Automated License Plate Recognition in the wild

Anmol Singhal,Navya Singhal

Main category: cs.CV

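TLDR: This paper presents a prototype low-power GPU-based patrolling system for automated license plate detection, recognition, and tracking in urban environments, with a complete ALPR system designed for Singapore license plates.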

Details

Motivation: Despite the potential of computer vision for traffic monitoring, its adoption in public services remains low, mainly because of accuracy and speed issues. Existing ALPR systems lack end-to-end solutions, especially under unconstrained capture conditions. Method: Detects license plates with RFB-Net and rectifies distorted plates, then recognizes characters with a custom YOLO-based network. The system is tested on a new dataset of more than 16,000 images. Result: License plate detection precision is 86%; character recognition accuracy is 67% (exact match) and 89% (partial match, allowing one incorrect character); the system runs at 64 FPS on a Tesla P4 GPU. Conclusion: The proposed system performs well under unconstrained conditions and offers a practical ALPR solution for real-world deployment. Abstract: Adoption of AI-driven techniques in public services remains low due to challenges related to accuracy and speed of information at population scale. Computer vision techniques for traffic monitoring have not gained much popularity despite their relative strength in areas such as autonomous driving. Despite the large number of academic methods for Automatic License Plate Recognition (ALPR) systems, very few provide an end-to-end solution for patrolling in the city. This paper presents a novel prototype for a low-power GPU-based patrolling system to be deployed in an urban environment on surveillance vehicles for automated vehicle detection, recognition and tracking. In this work, we propose a complete ALPR system for Singapore license plates having both single and double line creating our own YOLO-based network. We focus on unconstrained capture scenarios as would be the case in real-world application, where the license plate (LP) might be considerably distorted due to oblique views. In this work, we first detect the license plate from the full image using RFB-Net and rectify multiple distorted license plates in a single image. After that, the detected license plate image is fed to our network for character recognition. We evaluate the performance of our proposed system on a newly built dataset covering more than 16,000 images. The system was able to correctly detect license plates with 86% precision and recognize characters of a license plate in 67% of the test set, and 89% accuracy with one incorrect character (partial match). We also test latency of our system and achieve 64 FPS on a Tesla P4 GPU.
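
The two recognition figures differ only in how strictly a predicted plate must match the ground truth. The abstract does not define partial match beyond "one incorrect character", so the Python sketch below assumes it means at most one substituted character in a plate of the correct length. The function names and the example plate strings are hypothetical.

```python
def char_errors(predicted, truth):
    """Count mismatched characters; None if the lengths differ."""
    if len(predicted) != len(truth):
        return None
    return sum(1 for p, t in zip(predicted, truth) if p != t)


def match_rates(predictions, truths, tolerance=1):
    """Return (exact-match rate, rate with at most `tolerance` wrong chars)."""
    exact = partial = 0
    for predicted, truth in zip(predictions, truths):
        errors = char_errors(predicted, truth)
        if errors is None:
            continue  # assumption: a length mismatch counts as a miss
        if errors == 0:
            exact += 1
        if errors <= tolerance:
            partial += 1
    total = len(truths)
    return exact / total, partial / total


# Made-up plates: two exact, one with a single wrong character, one with two.
truths = ["SGX1234A", "SBA5678K", "SJK9012M", "SLM3456P"]
predictions = ["SGX1234A", "SBA5673K", "SJK9912N", "SLM3456P"]
exact_rate, partial_rate = match_rates(predictions, truths)
print(exact_rate, partial_rate)
```

Under this definition the partial-match rate can never be lower than the exact-match rate, since every exact match also satisfies the looser criterion.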

[28] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

Janna Bruner,Amit Moryossef,Lior Wolf

Main category: cs.CV

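TLDR: Proposes a method that uses generative models to convert sign language videos into static illustrations that complement educational resources.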

Details

Motivation: The dynamic nature of sign language makes it hard to study in detail, especially for new learners and educators. Static illustrations can complement video, but they are currently drawn by artists at high cost. Method: Injects style information during the denoising process of a diffusion model, fuses geometric and edge information to produce sketch-like illustrations, and merges the start and end frames through the attention mechanism. Result: The generated sign language illustrations are inexpensive and can complement educational materials. Conclusion: The method provides a cost-effective way to generate illustrations for sign language education. Abstract: Sign languages are dynamic visual languages that involve hand gestures, in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.

[29] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Yuchi Huo,Rui Wang,Chi Zhang,Xuelong Li

Main category: cs.CV

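TLDR: OmniVDiff is a controllable video diffusion framework that supports multi-modal video generation and understanding, achieving flexible control by dynamically adjusting the role of each modality.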

Details

Motivation: To unify video generation and understanding tasks and improve the flexibility and scalability of controllable video diffusion. Method: Learns a joint distribution in the color space and uses an adaptive control strategy that dynamically assigns each modality a role (generation or conditioning). Result: Supports text-conditioned video generation, video understanding, and X-conditioned video generation; experiments confirm its effectiveness. Conclusion: OmniVDiff provides a unified and flexible solution for controllable video diffusion, suitable for a variety of downstream applications. Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.

[30] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

Hengyu Shi,Junhao Su,Huansheng Ning,Xiaoming Wei,Jialin Gao

Main category: cs.CV

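TLDR: LayoutCoT uses RAG and CoT techniques, together with a standardized serialized format and iterative refinement, to achieve state-of-the-art layout generation without training.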

Details

Motivation: Existing generative models need large amounts of training data or fine-tuning, while training-free LLM-based methods have limited reasoning ability and cannot consistently produce high-quality layouts. Method: Converts layout representations into a standardized serialized format, generates a coarse layout with layout-aware RAG, and then refines it iteratively with a CoT module. Result: Experiments on five datasets show that LayoutCoT achieves state-of-the-art performance without training and outperforms specialized deep-reasoning models. Conclusion: LayoutCoT demonstrates the deep reasoning potential of LLMs for layout generation and offers a new approach for training-free methods. Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as DeepSeek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.

[31] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

Sihang Chen,Lijun Yun,Ze Liu,JianFeng Zhu,Jie Chen,Hui Wang,Yueping Nie

Main category: cs.CV

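TLDR: LightFormer is a lightweight decoder for real-time remote sensing image segmentation that greatly reduces model complexity while maintaining high accuracy.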

Details

Motivation: To address the high decoder complexity that constrains real-time deployment of deep learning for semantic segmentation of remote sensing images. Method: Uses a feature-fusion and refinement module and a spatial information selection module (SISM) to aggregate multi-scale information efficiently and capture spatial dependencies. Result: Performs well on multiple benchmarks; on ISPRS Vaihingen it reaches 83.9% mIoU while requiring only 14.7% of GLFFNet's FLOPs and 15.9% of its parameters. Conclusion: LightFormer is a computationally efficient, high-accuracy solution suitable for real-time remote sensing applications. Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet's mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.

[32] A comprehensive review of remote sensing in wetland classification and mapping

Shuai Yuan,Xiangan Liang,Tianwu Lin,Shuang Chen,Rui Liu,Jie Wang,Hongsheng Zhang,Peng Gong

Main category: cs.CV

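TLDR: This paper reviews research progress in wetland classification and mapping; through a meta-analysis of more than 1,200 papers, it summarizes trends in wetland types, methods, sensor types, and study sites, and discusses the drivers of wetland change, current limitations, and future directions.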

Details

Motivation: Wetlands are critical ecosystems supporting biodiversity and human well-being, but they have declined significantly since the 20th century. Although earlier reviews summarized the field's development, a thorough, in-depth understanding of wetland classification and mapping is still lacking. Method: Conducts a meta-analysis of more than 1,200 papers to summarize trends in wetland types, methods, sensor types, and study sites; reviews wetland features and existing data and methods; and explores the drivers of wetland change. Result: Summarizes typical wetland mapping products, identifies the intrinsic drivers of wetland change, and outlines the limitations of current research. Conclusion: The review offers a comprehensive perspective on wetland remote sensing and proposes future directions in response to global environmental change and technological innovation, advancing wetland science. Abstract: Wetlands constitute critical ecosystems that support both biodiversity and human well-being; however, they have experienced a significant decline since the 20th century. Back in the 1970s, researchers began to employ remote sensing technologies for wetland classification and mapping to elucidate the extent and variations of wetlands. Although some review articles summarized the development of this field, there is a lack of a thorough and in-depth understanding of wetland classification and mapping: (1) the scientific importance of wetlands, (2) major data, methods used in wetland classification and mapping, (3) driving factors of wetland changes, (4) current research paradigm and limitations, (5) challenges and opportunities in wetland classification and mapping under the context of technological innovation and global environmental change. In this review, we aim to provide a comprehensive perspective and new insights into wetland classification and mapping for readers to answer these questions. First, we conduct a meta-analysis of over 1,200 papers, encompassing wetland types, methods, sensor types, and study sites, examining prevailing trends in wetland classification and mapping. Next, we review and synthesize the wetland features and existing data and methods in wetland classification and mapping. We also summarize typical wetland mapping products and explore the intrinsic driving factors of wetland changes across multiple spatial and temporal scales. Finally, we discuss current limitations and propose future directions in response to global environmental change and technological innovation. This review consolidates our understanding of wetland remote sensing and offers scientific recommendations that foster transformative progress in wetland science.

[33] Enhancing Features in Long-tailed Data Using Large Vision Model

Pengxiao Han,Changkun Ye,Jinguang Tong,Cuicui Jiang,Jie Hong,Li Fang,Xuesong Li

Main category: cs.CV

TLDR: The study explores how large vision models (LVMs) can enhance long-tailed data features without any language information.

Details

Motivation: Language-based foundation models are widely used in long-tailed recognition, but not every task has linguistic data, so the study turns to vision-only models. Method: Extracts features from an LVM, fuses them with the baseline network's features, and designs prototype-based losses to refine the features. Result: The approach is validated on the ImageNet-LT and iNaturalist2018 datasets. Conclusion: Vision-only models can effectively enhance long-tailed data features without relying on language information. Abstract: Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

[34] LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Hanning Chen,Yang Ni,Wenjun Huang,Hyunwoo Oh,Yezi Liu,Tamoghno Das,Mohsen Imani

Main category: cs.CV

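TLDR: LVLM_CSP is a new training-free visual token pruning method for LVLM-based reasoning segmentation that substantially reduces computational overhead while maintaining high segmentation accuracy.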

Details

Motivation: Large vision-language models (LVLMs) perform well on reasoning segmentation tasks but incur substantial computational overhead, mainly from processing large numbers of image tokens. Existing pruning methods struggle to balance computational overhead against segmentation accuracy. Method: LVLM_CSP has three stages: clustering, scattering, and pruning. It first performs coarse-grained visual reasoning, then fine-grained reasoning, and finally prunes most of the visual tokens. Result: Experiments show that on a 7B LVLM, LVLM_CSP cuts image token inference FLOPs by 65% with virtually no accuracy loss, and by 70% with only a 1% accuracy drop. Conclusion: LVLM_CSP effectively addresses the computational overhead of LVLMs in reasoning segmentation and offers a new approach to efficient visual reasoning. Abstract: Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost is the processing of hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high-level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine-grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
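
To make "image token pruning" concrete, here is a minimal sketch of generic score-based pruning: keep the highest-scoring fraction of tokens and drop the rest. It is not the three-stage clustering/scattering/pruning procedure of LVLM_CSP, whose details the abstract does not give. The function name `prune_tokens`, the `keep_ratio` parameter, and the example scores are all assumptions made for illustration.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    Hypothetical helper: `scores` stands in for any per-token importance
    measure (for example, attention received from the text query).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so spatial layout is not scrambled.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


# Illustrative data: ten image tokens with made-up importance scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
scores = [0.05, 0.30, 0.02, 0.25, 0.01, 0.10, 0.03, 0.20, 0.02, 0.02]
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.3)
print(kept_idx, kept_tokens)
```

Keeping 30% of the tokens corresponds to pruning 70% of them. The FLOPs saved in a real model depend on which layers the pruned tokens are removed from, which this sketch does not model.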

[35] DAAF: Degradation-Aware Adaptive Fusion Framework for Robust Infrared and Visible Images Fusion

Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuxin Jing,Yuhan Lyu

Main category: cs.CV

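TLDR: Proposes DAAF, a new infrared and visible image fusion method that addresses existing methods' neglect of image degradation through adaptive degradation optimization and feature-interactive fusion.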

Details

Motivation: Existing infrared and visible image fusion algorithms usually overlook image degradation (such as low light and noise), which limits their practical potential. Method: DAAF comprises an Adaptive Degradation Optimization Network (ADON) and a Feature Interactive Local-Global Fusion (FILGF) network. ADON handles degradation through frequency-domain feature decomposition and Retinex decomposition, and FILGF performs fusion through multi-scale feature interaction. Result: Experiments show that DAAF outperforms existing algorithms in both normal and complex degradation scenarios. Conclusion: By jointly modeling degradation optimization and image fusion, DAAF significantly improves fusion quality. Abstract: Existing infrared and visible image fusion (IVIF) algorithms often prioritize high-quality images, neglecting image degradation such as low light and noise, which limits their practical potential. This paper proposes Degradation-Aware Adaptive image Fusion (DAAF), which achieves unified modeling of adaptive degradation optimization and image fusion. Specifically, DAAF comprises an auxiliary Adaptive Degradation Optimization Network (ADON) and a Feature Interactive Local-Global Fusion (FILGF) Network. Firstly, ADON includes infrared and visible-light branches. Within the infrared branch, frequency-domain feature decomposition and extraction are employed to isolate Gaussian and stripe noise. In the visible-light branch, Retinex decomposition is applied to extract illumination and reflectance components, enabling complementary enhancement of detail and illumination distribution. Subsequently, FILGF performs interactive multi-scale local-global feature fusion. Local feature fusion consists of intra-inter model feature complement, while global feature fusion is achieved through an interactive cross-model attention. Extensive experiments have shown that DAAF outperforms current IVIF algorithms in normal and complex degradation scenarios.

[36] Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

Tonko E. W. Bossen,Andreas Møgelmose,Ross Greer

Main category: cs.CV

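TLDR: The study evaluates how well current vision-language models understand traffic gestures in a zero-shot setting and finds that they perform poorly, so further research is needed.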

Details

Motivation: Ensure that autonomous vehicles correctly interpret traffic gestures, improving road safety and the user experience. Method: Two datasets (ATG and ITGI) are created, and models are evaluated with three methods (caption similarity, classification, and pose sequence reconstruction). Result: Models perform poorly: caption similarity is below 0.59, and classification F1 scores are only 0.14-0.39, far below the expert baseline of 0.70. Conclusion: Current models do not understand traffic gestures accurately or robustly enough, and further research is needed. Abstract: In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as 'Stop', 'Reverse', 'Hail', etc. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated with natural language, describing the pedestrian's body position and gesture. We evaluate models using three methods utilizing expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.
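
For readers less familiar with the classification metric quoted above, the following sketch computes per-class and macro-averaged F1 from label lists. The abstract does not say which averaging the authors use, so the macro average is an assumption, and the label sequences are made up for illustration (only the gesture names come from the abstract).

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0 here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up predictions over three gesture classes.
y_true = ["Stop", "Stop", "Hail", "Reverse", "Hail"]
y_pred = ["Stop", "Hail", "Hail", "Stop", "Hail"]
per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(per_class, round(overall, 4))
```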

[37] Weather-Aware Object Detection Transformer for Domain Adaptation

Soheil Gharatappeh,Salimeh Sekeh,Vikas Dhiman

Main category: cs.CV

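TLDR: The paper investigates three new approaches (domain adaptation via perceptual loss, weather adaptive attention, and a weather fusion encoder) to make RT-DETR more robust in fog, but none of them significantly outperforms the baseline model.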

Details

Motivation: RT-DETR degrades under adverse weather such as fog, so its robustness needs to improve. Method: Three approaches are proposed: domain adaptation via perceptual loss, weather adaptive attention, and a weather fusion encoder. Result: None of the three approaches significantly outperforms the baseline RT-DETR. Conclusion: The paper analyzes the limitations of these approaches and offers directions for future research on weather-aware object detection. Abstract: RT-DETRs have shown strong performance across various computer vision tasks but are known to degrade under challenging weather conditions such as fog. In this work, we investigate three novel approaches to enhance RT-DETR robustness in foggy environments: (1) Domain Adaptation via Perceptual Loss, which distills domain-invariant features from a teacher network to a student using perceptual supervision; (2) Weather Adaptive Attention, which augments the attention mechanism with fog-sensitive scaling by introducing an auxiliary foggy image stream; and (3) Weather Fusion Encoder, which integrates a dual-stream encoder architecture that fuses clear and foggy image features via multi-head self and cross-attention. Despite the architectural innovations, none of the proposed methods consistently outperform the baseline RT-DETR. We analyze the limitations and potential causes, offering insights for future research in weather-aware object detection.

[38] Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content

Yilang Peng,Sijia Qian,Yingdan Lu,Cuihua Shen

Main category: cs.CV

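TLDR: The paper proposes a large language model (LLM)-informed feature discovery framework for predicting the credibility of visual content on social media and explaining the basis for those judgments. Tested on 4,191 visual posts spanning science, health, and politics, it outperforms zero-shot GPT predictions by 13 percent in R2.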

Details

Motivation: In a visually dominated social media environment, predicting the credibility of visual content and understanding what drives human judgment are crucial for countering misinformation, but the diversity and richness of visual features make these tasks challenging. Method: Multimodal LLMs (such as GPT-4o) are used with targeted prompts to extract and quantify interpretable features, which are then integrated into machine learning models to improve credibility prediction. Result: Tested on 4,191 visual social media posts, the method outperforms zero-shot GPT predictions by 13 percent in R2 and reveals key features such as information concreteness and image format. Conclusion: The method offers new ideas for misinformation mitigation, visual credibility research, and the use of LLMs in social science. Abstract: In today's visually dominated social media landscape, predicting the perceived credibility of visual content and understanding what drives human judgment are crucial for countering misinformation. However, these tasks are challenging due to the diversity and richness of visual features. We introduce a Large Language Model (LLM)-informed feature discovery framework that leverages multimodal LLMs, such as GPT-4o, to evaluate content credibility and explain its reasoning. We extract and quantify interpretable features using targeted prompts and integrate them into machine learning models to improve credibility predictions. We tested this approach on 4,191 visual social media posts across eight topics in science, health, and politics, using credibility ratings from 5,355 crowdsourced workers. Our method outperformed zero-shot GPT-based predictions by 13 percent in R2, and revealed key features like information concreteness and image format. We discuss the implications for misinformation mitigation, visual credibility, and the role of LLMs in social science.
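
The R2 figure above refers to the coefficient of determination. As a reminder of what it measures, here is a minimal sketch of the standard definition, R2 = 1 - SS_res / SS_tot. The ratings and predictions are invented for illustration, and the abstract does not say whether the 13 percent gain is absolute or relative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical credibility ratings and model predictions.
ratings = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(ratings, predicted)
print(round(score, 4))
```

A model that always predicts the mean rating scores 0 under this definition, and a perfect model scores 1.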

[39] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task

Aviral Chharia,Tianyu Ren,Tomotake Furuhata,Kenji Shimada

Main category: cs.CV

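TLDR: Safe-Construct proposes the first 3D multi-view framework for recognizing safety violations in construction environments, using scene-level worker-object context and 3D spatial understanding. It also introduces the synthetic dataset generator SICSG and improves performance by 7.6%.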

Details

Motivation: Existing 2D object detection models cannot capture the complexity of real-world violations, because of an oversimplified task formulation, inadequate validation, a lack of standardized baselines, and data limitations. Method: Violation recognition is reformulated as a 3D multi-view task that combines scene-level context with 3D spatial understanding, and a synthetic dataset generator, SICSG, is developed. Result: Performance improves by 7.6% across four violation types, and robustness is validated in near-realistic settings. Conclusion: Through 3D multi-view understanding and synthetic data, Safe-Construct sets a new benchmark for safety monitoring in high-risk industries. Abstract: Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: https://Safe-Construct.github.io/Safe-Construct

[40] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

Karan Jain,Mohammad Nayeem Teli

Main category: cs.CV

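TLDR: Proposes a diffusion model built on an invertible UNet architecture that can be trained efficiently on a single GPU for high-dimensional medical image generation, significantly reducing memory and energy use.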

Details

Motivation: Address the high computational resource demands of diffusion models in 3D medical image generation. Method: An invertible UNet architecture with invertible attention modules is used, making memory usage independent of the dimensionality of the data. Result: On the BraTS2020 dataset, peak memory consumption drops by up to 15% while image quality remains comparable to SOTA. Conclusion: The model balances efficiency and performance and is applicable to a variety of image generation tasks. Abstract: Diffusion models have recently gained state of the art performance on many image generation tasks. However, most models require significant computational resources to achieve this. This becomes apparent in the application of medical image synthesis due to the 3D nature of medical datasets like CT-scans, MRIs, electron microscope, etc. In this paper we propose a novel architecture for a single GPU memory-efficient training for diffusion models for high dimensional medical datasets. The proposed model is built by using an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. denoising diffusion models and thus enabling memory usage to be independent of the dimensionality of the dataset, and 2. reducing the energy usage during training. While this new model can be applied to a multitude of image generation tasks, we showcase its memory-efficiency on the 3D BraTS2020 dataset leading to up to 15% decrease in peak memory consumption during training with comparable results to SOTA while maintaining the image quality.
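
Invertible architectures can save memory because a layer's inputs can be recomputed from its outputs during the backward pass instead of being stored. The sketch below shows this with an additive coupling block in the style of reversible residual networks. It is an illustrative assumption, not the block used in the paper, and the sub-functions `f` and `g` are arbitrary placeholders for learned sub-networks.

```python
def coupling_forward(x1, x2, f, g):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def coupling_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs; f and g need not be invertible."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# Placeholder sub-functions standing in for learned layers.
def f(v):
    return [2.0 * t for t in v]


def g(v):
    return [t * t for t in v]


x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = coupling_forward(x1, x2, f, g)
r1, r2 = coupling_inverse(y1, y2, f, g)
print(y1, y2, r1, r2)
```

Because `coupling_inverse` reconstructs `x1` and `x2` exactly from `y1` and `y2`, a training loop built this way would not need to cache those activations, at the cost of extra computation.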

[41] PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

Zeyu Zhang,Zijian Chen,Zicheng Zhang,Yuze Sun,Yuan Tian,Ziheng Jia,Chunyi Li,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

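TLDR: The paper proposes OVPG, a dynamic multimodal evaluation framework that automatically generates diverse and verifiable evaluation data to address the limitations of static benchmarks.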

Details

Motivation: Existing benchmarks are mostly static and overlap with pre-training data, which leads to fixed complexity and data contamination, while manually annotated datasets are time-consuming to build and prone to bias. Method: The OVPG framework consists of raw material sampling, visual content generation, and puzzle rule design modules, which ensure that evaluation instances are primitive, highly randomized, and uniquely solvable. Result: PuzzleBench, built on OVPG, contains 11,840 VQA samples covering six tasks that target the visual recognition, logical reasoning, and context understanding abilities of LMMs. Conclusion: Through dynamic generation and open-ended design, PuzzleBench can keep adapting to the evolving capabilities of LMMs, which gives it an advantage over static benchmarks. Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities across a wide range of multimodal tasks, achieving ever-increasing performance on various evaluation benchmarks. However, existing benchmarks are typically static and often overlap with pre-training datasets, leading to fixed complexity constraints and substantial data contamination issues. Meanwhile, manually annotated datasets are labor-intensive, time-consuming, and subject to human bias and inconsistency, leading to reliability and reproducibility issues. To address these problems, we propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG), which aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Specifically, the OVPG pipeline consists of a raw material sampling module, a visual content generation module, and a puzzle rule design module, which ensures that each evaluation instance is primitive, highly randomized, and uniquely solvable, enabling continual adaptation to the evolving capabilities of LMMs. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples. It features six carefully designed puzzle tasks targeting three core LMM competencies: visual recognition, logical reasoning, and context understanding. PuzzleBench differs from static benchmarks that quickly become outdated. It enables ongoing dataset refreshing through OVPG and a rich set of open-ended puzzle designs, allowing seamless adaptation to the evolving capabilities of LMMs.

Jiahuan Long,Wen Yao,Tingsong Jiang,Chao Ma

Main category: cs.CV

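TLDR: CDUPatch is a universal cross-modal adversarial patch attack against visible-infrared dual-modal object detectors that improves attack effectiveness by optimizing the patch's color distribution and infrared texture.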

Details

Motivation: Existing dual-modal adversarial patch attacks have limited effectiveness across diverse physical scenarios and need improvement. Method: An RGB-to-infrared adapter is proposed to enable unified optimization of cross-modal patches; a multi-scale clipping strategy and a new dataset, MSDrone, are also introduced. Result: The method outperforms existing approaches on four benchmark datasets, and physical tests confirm strong transferability across scales, views, and scenarios. Conclusion: CDUPatch significantly improves the effectiveness and universality of dual-modal adversarial patch attacks. Abstract: Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.

[43] Fine-Grained Rib Fracture Diagnosis with Hyperbolic Embeddings: A Detailed Annotation Framework and Multi-Label Classification Model

Shripad Pate,Aiman Farooq,Suvrankar Dutta,Musadiq Aadil Sheikh,Atin Kumar,Deepak Mishra

Main category: cs.CV

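TLDR: Proposes a new rib fracture annotation protocol and combines it with cross-modal embeddings to improve fracture classification; experiments show the approach outperforms existing techniques.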

Details

Motivation: Existing datasets lack fine-grained annotations, especially descriptions of fracture characteristics, type, and precise anatomical location, which hampers treatment planning. Method: A new rib fracture annotation protocol is designed, and hyperbolic embeddings map radiological images and clinical descriptions into a shared non-Euclidean manifold to capture the hierarchical structure of fractures. Result: Average recall improves by 6% on the AirRib dataset and by 17.5% on the RibFrac dataset. Conclusion: Through cross-modal embeddings and hierarchical modeling, the method significantly improves the accuracy of rib fracture classification. Abstract: Accurate rib fracture identification and classification are essential for treatment planning. However, existing datasets often lack fine-grained annotations, particularly regarding rib fracture characterization, type, and precise anatomical location on individual ribs. To address this, we introduce a novel rib fracture annotation protocol tailored for fracture classification. Further, we enhance fracture classification by leveraging cross-modal embeddings that bridge radiological images and clinical descriptions. Our approach employs hyperbolic embeddings to capture the hierarchical nature of fracture, mapping visual features and textual descriptions into a shared non-Euclidean manifold. This framework enables more nuanced similarity computations between imaging characteristics and clinical descriptions, accounting for the inherent hierarchical relationships in fracture taxonomy. Experimental results demonstrate that our approach outperforms existing methods across multiple classification tasks, with average recall improvements of 6% on the AirRib dataset and 17.5% on the public RibFrac dataset.
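
To illustrate why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic geometry. The abstract does not state which model the authors use, so the choice of the Poincaré ball and the example points are assumptions.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Two pairs with the same Euclidean layout, one near the origin and one
# near the boundary of the ball.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
from_origin = poincare_distance([0.0, 0.0], [0.5, 0.0])
print(near_origin, near_boundary, from_origin)
```

Distances grow rapidly toward the boundary, so general concepts can sit near the origin while many fine-grained leaves spread out near the edge without crowding each other.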

[44] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

Yukang Lin,Yan Hong,Zunnan Xu,Xindi Li,Chao Xu,Chuanbiao Song,Ronghui Li,Haoxing Chen,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang,Xiu Li

Main category: cs.CV

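TLDR: The paper proposes a new paradigm for generating realistic hand-face interaction animations, and releases the large-scale dataset InterHF and the region-aware diffusion model InterAnimate.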

Details

Motivation: Existing video generation research focuses mostly on isolated actions and overlooks interactive motions (such as hand-face interactions), which are essential for biometric authentication systems. Method: By jointly learning spatio-temporal contact dynamics and biomechanically plausible deformation effects, the authors propose the region-aware diffusion model InterAnimate, which uses learnable spatial and temporal latents to capture dynamic interaction priors. Result: InterAnimate generates highly realistic animations, and the authors release InterHF, a dataset with 18 interaction patterns and 90,000 annotated videos. Conclusion: This is the first systematic study of human hand-face interactions and sets a new benchmark for the field; code and data will be made public to advance research. Abstract: Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

[45] Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering

Peipei Song,Long Zhang,Long Lan,Weidong Chen,Dan Guo,Xun Yang,Meng Wang

Main category: cs.CV

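TLDR: AMDNet is a new approach to partially relevant video retrieval (PRVR) that actively discovers video moments semantically consistent with the query and combines a diversity loss with a relevance loss, significantly improving retrieval performance.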

Details

Motivation: Existing PRVR methods suffer from content independence and information redundancy in multi-scale clip modeling, which hurts retrieval quality. Method: Learnable span anchors capture distinct moments, masked multi-moment attention emphasizes salient moments, and a diversity loss and a relevance loss are introduced to optimize the model. Result: AMDNet performs strongly on the TVR and ActivityNet Captions datasets, with about 15.5 times fewer parameters and a 6.0-point higher SumR than GMMFormer on TVR. Conclusion: Through efficient moment modeling and loss design, AMDNet provides more compact and informative video representations for the PRVR task. Abstract: Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.

[46] Cross-Frequency Implicit Neural Representation with Self-Evolving Parameters

Chang Yu,Yisi Luo,Kai Ye,Xile Zhao,Deyu Meng

Main category: cs.CV

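TLDR: Proposes a self-evolving cross-frequency implicit neural representation (CF-INR) based on the Haar wavelet transform, which separates frequency components and optimizes parameters automatically, significantly improving the accuracy of visual data representation.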

Details

Motivation: Classical implicit neural representation (INR) methods mix different frequency components in the original space and require manual configuration of parameters (such as the frequency parameter ω or the rank R), which limits their performance and flexibility. Method: The Haar wavelet transform decomposes the data into four frequency components, and INRs are applied in the wavelet space; a self-evolving cross-frequency tensor decomposition paradigm automatically optimizes the parameters of each frequency component. Result: CF-INR outperforms existing methods on image regression, inpainting, denoising, and cloud removal. Conclusion: Through automatic parameter optimization and cross-frequency representation, CF-INR significantly improves the accuracy and adaptability of visual data representation. Abstract: Implicit neural representation (INR) has emerged as a powerful paradigm for visual data representation. However, classical INR methods represent data in the original space mixed with different frequency components, and several feature encoding parameters (e.g., the frequency parameter $\omega$ or the rank $R$) need manual configurations. In this work, we propose a self-evolving cross-frequency INR using the Haar wavelet transform (termed CF-INR), which decouples data into four frequency components and employs INRs in the wavelet space. CF-INR allows the characterization of different frequency components separately, thus enabling higher accuracy for data representation. To more precisely characterize cross-frequency components, we propose a cross-frequency tensor decomposition paradigm for CF-INR with self-evolving parameters, which automatically updates the rank parameter $R$ and the frequency parameter $\omega$ for each frequency component through self-evolving optimization. This self-evolution paradigm eliminates the laborious manual tuning of these parameters, and learns a customized cross-frequency feature encoding configuration for each dataset. We evaluate CF-INR on a variety of visual data representation and recovery tasks, including image regression, inpainting, denoising, and cloud removal. Extensive experiments demonstrate that CF-INR outperforms state-of-the-art methods in each case.
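
The "four frequency components" mentioned above are what a single level of the 2D Haar wavelet transform produces. The sketch below implements one level of the orthonormal transform on a small grid to show the decomposition. The function and band names are chosen here for illustration, and the INR and tensor decomposition stages of CF-INR are not covered.

```python
def haar2d(image):
    """One level of the orthonormal 2D Haar transform on an even-sized grid.

    Returns four sub-bands, each half the height and width of the input:
    a low-pass approximation and three detail bands.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must both be even")
    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        band_rows = ([], [], [], [])
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            band_rows[0].append((a + b + d + e) / 2)  # low-pass both ways
            band_rows[1].append((a - b + d - e) / 2)  # change across columns
            band_rows[2].append((a + b - d - e) / 2)  # change across rows
            band_rows[3].append((a - b - d + e) / 2)  # diagonal change
        approx.append(band_rows[0])
        col_detail.append(band_rows[1])
        row_detail.append(band_rows[2])
        diag_detail.append(band_rows[3])
    return approx, col_detail, row_detail, diag_detail


image = [[1.0, 2.0], [3.0, 4.0]]
approx, col_detail, row_detail, diag_detail = haar2d(image)
print(approx, col_detail, row_detail, diag_detail)
```

The transform preserves energy: the squared entries of the input sum to 30, as do the squared entries of the four bands together. In practice a library such as PyWavelets would normally be used instead of hand-written loops.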

[47] Recognition of Geometrical Shapes by Dictionary Learning

Alexander Köhler,Michael Breuß

Main category: cs.CV

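TLDR: The paper proposes a method for applying dictionary learning to geometrical shape recognition and shows that the choice of optimization method strongly affects recognition quality.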

Details

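Motivation: Dictionary learning has shown strong representation capabilities in tasks such as image reconstruction but has not yet been used for shape recognition. This paper explores its application to geometrical shape recognition. Method: Dictionary learning produces an overcomplete set of atoms used to represent input shapes, and the effect of different optimization methods on recognition is studied. Result: Experiments show that dictionary learning has potential for shape recognition tasks and that the choice of optimization method significantly affects recognition quality. Conclusion: Dictionary learning can serve as an effective method for shape recognition, and future work can further improve its performance. Abstract: Dictionary learning is a versatile method to produce an overcomplete set of vectors, called atoms, to represent a given input with only a few atoms. In the literature, it has been used primarily for tasks that explore its powerful representation capabilities, such as for image reconstruction. In this work, we present a first approach to make dictionary learning work for shape recognition, considering specifically geometrical shapes. As we demonstrate, the choice of the underlying optimization method has a significant impact on recognition quality. Experimental results confirm that dictionary learning may be an interesting method for shape recognition tasks.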
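
Representing an input "with only a few atoms" is the sparse coding step of dictionary learning. The sketch below shows that step with greedy matching pursuit over a tiny hand-picked overcomplete dictionary. It does not learn the dictionary, and matching pursuit is only one possible solver; the abstract does not name the optimization methods the authors compare, so this choice is an assumption.

```python
import math


def matching_pursuit(signal, atoms, n_nonzero):
    """Greedy sparse coding: pick the atom most correlated with the residual.

    `atoms` must be unit-norm vectors. Returns (coefficients, residual).
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Correlation of every atom with what is still unexplained.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(corr[k]))
        coeffs[best] += corr[best]
        residual = [r - corr[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual


# Overcomplete dictionary in 2D: three unit-norm atoms for a 2D space.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
signal = [3.0, 3.0]
coeffs, residual = matching_pursuit(signal, atoms, n_nonzero=1)
print(coeffs, residual)
```

With the diagonal atom available, a single coefficient explains the whole signal, whereas the two axis-aligned atoms alone would need two. A recognition scheme could then compare which atoms are activated for each shape, though that step is not shown here.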

[48] An Efficient and Mixed Heterogeneous Model for Image Restoration

Yubin Gu,Yuan Meng,Kaihang Zheng,Xiaoshuai Sun,Jiayi Ji,Weijian Ruan,Liujuan Cao,Rongrong Ji

Main category: cs.CV

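TLDR: RestorMixer is an efficient, general-purpose image restoration model based on mixed-architecture fusion. It combines the strengths of CNNs, Mamba, and attention mechanisms and performs well across a range of image restoration tasks.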

Details

Motivation: Current image restoration models mostly use a single architecture and struggle to handle diverse degradation types at once. The study aims to improve generality and efficiency through mixed-architecture fusion. Method: RestorMixer adopts a three-stage encoder-decoder structure that uses CNNs to extract local features and Mamba to model global context, combined with attention mechanisms for dynamic feature refinement. Result: Experiments show that RestorMixer achieves leading performance across multiple image restoration tasks while keeping inference efficient. Conclusion: Through mixed-architecture fusion, RestorMixer balances local feature extraction, global context modeling, and dynamic feature refinement, providing an efficient solution for general-purpose image restoration. Abstract: Image restoration (IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at https://github.com/ClimBin/RestorMixer.

[49] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

Yihang Liu,Lianghua He,Ying Wen,Longzhen Yang,Hongzhou Chen

Main category: cs.CV

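TLDR: AFiRe proposes an anatomy-driven self-supervised framework that exploits the characteristics of the Vision Transformer to enhance fine-grained representation in radiographic image analysis.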

Details

Motivation: Existing self-supervised methods (such as contrastive learning) focus mainly on global discrimination and neglect the fine-grained anatomical details that radiographic analysis requires. Method: AFiRe combines two self-supervised schemes, anatomy-guided contrastive learning and pixel-level anomaly-removal restoration, and proposes a Synthetic Lesion Mask to enhance anatomical diversity. Result: AFiRe outperforms existing methods in anatomical discrimination, generalization, and integration of fine-grained information, and stands out in multi-label classification and anomaly detection. Conclusion: Through anatomy-driven self-supervised learning, AFiRe significantly improves fine-grained representation for radiographic image analysis. Abstract: Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

Zhisheng Zhang,Peng Zhang,Fengxiang Wang,Liangli Ma,Fuchun Sun

Main category: cs.CV

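TLDR: Proposes a feature-space transformation and a self-supervised multi-frame fusion strategy that significantly improve the quality of forward-looking sonar images, addressing the limitations of existing methods on real data.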

Details

Motivation: High-quality real-world paired data is hard to obtain, and existing deep learning methods rely on simulated data, which limits generalization; the cross-modal degradation gap also makes directly transferring pretrained weights ineffective. Method: A feature-space transformation maps sonar images to a robust feature domain, and a self-supervised multi-frame fusion strategy uses complementary inter-frame information to remove noise and enhance brightness. Result: The method significantly outperforms existing approaches on three real-world datasets, effectively suppressing noise, preserving edge details, and improving brightness. Conclusion: The method is a strong tool for underwater target detection and shows potential for practical use. Abstract: Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.

[51] Adaptive Decision Boundary for Few-Shot Class-Incremental Learning

Linhao Li,Yongzhang Tan,Siyuan Yang,Hao Cheng,Yongfeng Dong,Liang Yang

Main category: cs.CV

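TLDR: Proposes an Adaptive Decision Boundary Strategy (ADBS) for few-shot class-incremental learning (FSCIL), which significantly improves performance by dynamically adjusting decision boundaries and optimizing an inter-class constraint.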

Details

Motivation: Existing FSCIL methods focus mainly on preventing catastrophic forgetting and ignore the specific decision space of each class, which limits performance. Method: ADBS assigns each class its own decision boundary and adjusts it dynamically, and introduces an inter-class constraint loss to optimize the boundaries and prototypes. Result: On the CIFAR100, miniImageNet, and CUB200 benchmarks, ADBS significantly improves existing FSCIL methods and achieves state-of-the-art results. Conclusion: ADBS is a plug-and-play strategy that effectively refines the decision spaces in FSCIL and improves classification performance. Abstract: Few-Shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from a limited set of training samples without forgetting knowledge of previously learned classes. Conventional FSCIL methods typically build a robust feature extractor during the base training session with abundant training samples and subsequently freeze this extractor, only fine-tuning the classifier in subsequent incremental phases. However, current strategies primarily focus on preventing catastrophic forgetting, considering only the relationship between novel and base classes, without paying attention to the specific decision spaces of each class. To address this challenge, we propose a plug-and-play Adaptive Decision Boundary Strategy (ADBS), which is compatible with most FSCIL methods. Specifically, we assign a specific decision boundary to each class and adaptively adjust these boundaries during training to optimally refine the decision spaces for the classes in each session. Furthermore, to amplify the distinctiveness between classes, we employ a novel inter-class constraint loss that optimizes the decision boundaries and prototypes for each class. Extensive experiments on three benchmarks, namely CIFAR100, miniImageNet, and CUB200, demonstrate that incorporating our ADBS method with existing FSCIL techniques significantly improves performance, achieving overall state-of-the-art results.
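
The abstract does not give the ADBS formulation, so the following is only a loose illustration of the general idea that a per-class quantity can reshape decision regions in a prototype-based classifier. It scores an input by cosine similarity to each class prototype minus a per-class offset. The offset form, the names `classify` and `boundaries`, and all numbers are hypothetical and should not be read as the authors' method; in particular, nothing here is learned or adapted during training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(x, prototypes, boundaries):
    """Pick the class whose prototype similarity, minus its offset, is largest.

    `boundaries` maps class name to a hypothetical per-class offset; a larger
    offset shrinks that class's decision region.
    """
    scores = {
        label: cosine(x, proto) - boundaries.get(label, 0.0)
        for label, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores


prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
x = [0.6, 0.8]

# Without offsets the input falls in the "dog" region.
label_plain, _ = classify(x, prototypes, {})
# A per-class offset on "dog" moves the boundary and flips the decision.
label_shifted, _ = classify(x, prototypes, {"dog": 0.3})
print(label_plain, label_shifted)
```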

[52] Deep Learning in Concealed Dense Prediction

Pancheng Zhao,Deng-Ping Fan,Shupeng Cheng,Salman Khan,Fahad Shahbaz Khan,David Clifton,Peng Xu,Jufeng Yang

Main category: cs.CV

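TLDR: The paper introduces the Concealed Dense Prediction (CDP) task family, analyzes its characteristics, challenges, and differences from generic vision tasks, summarizes deep learning research progress in CDP, and discusses future directions.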

Details

Motivation: With the progress of deep learning, the time is ripe to tackle complex vision tasks, and CDP deserves attention because of its value in real applications. Method: Three CDP tasks are studied experimentally, comparing 25 state-of-the-art methods on 12 concealed datasets, and a taxonomy based on concealment counteracting is proposed. Result: The review summarizes research progress in CDP, proposes 6 potential research directions, and constructs the CvpINST dataset and the CvpAgent agent. Conclusion: CDP has broad application prospects, especially in the large model era, and its potential needs further exploration. Abstract: Deep learning is developing rapidly and handling common computer vision tasks well. It is time to pay attention to more complex vision tasks, as model size, knowledge, and reasoning capabilities continue to improve. In this paper, we introduce and review a family of complex tasks, termed Concealed Dense Prediction (CDP), which has great value in agriculture, industry, etc. CDP's intrinsic trait is that the targets are concealed in their surroundings, thus fully perceiving them requires fine-grained representations, prior knowledge, auxiliary reasoning, etc. The contributions of this review are three-fold: (i) We introduce the scope, characteristics, and challenges specific to CDP tasks and emphasize their essential differences from generic vision tasks. (ii) We develop a taxonomy based on concealment counteracting to summarize deep learning efforts in CDP through experiments on three tasks. We compare 25 state-of-the-art methods across 12 widely used concealed datasets. (iii) We discuss the potential applications of CDP in the large model era and summarize 6 potential research directions. We offer perspectives for the future development of CDP by constructing a large-scale multimodal instruction fine-tuning dataset, CvpINST, and a concealed visual perception agent, CvpAgent.

[53] Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera

Sami Arja,Nimrod Kruger,Alexandre Marcireau,Nicholas Owen Ralph,Saeed Afshar,Gregory Cohen

Main category: cs.CV

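TLDR: Inspired by the visual mechanism of cephalopods, the authors design a spectral imaging system that combines a ball lens with an event camera, achieving spectral sensing without conventional colour filters.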

Details

Motivation: Cephalopods perceive colour with only one type of photoreceptor, and this mechanism inspires the design of a new kind of spectral imaging system. Method: A motorised system shifts the focal position, mimicking the adaptive lens motion of cephalopods, and is combined with an event camera to achieve wavelength-dependent focusing. Result: The system achieves spectral sensing across the visible and near-infrared range, validating the bio-inspired approach. Conclusion: The method offers a new route to spectral sensing without conventional colour filters; code and analysis are publicly available. Abstract: Cephalopods exhibit unique colour discrimination capabilities despite having one type of photoreceptor, relying instead on chromatic aberration induced by their ocular optics and pupil shapes to perceive spectral information. We took inspiration from this biological mechanism to design a spectral imaging system that combines a ball lens with an event-based camera. Our approach relies on a motorised system that shifts the focal position, mirroring the adaptive lens motion in cephalopods. This approach has enabled us to achieve wavelength-dependent focusing across the visible light and near-infrared spectrum, making the event a spectral sensor. We characterise chromatic aberration effects, using both event-based and conventional frame-based sensors, validating the effectiveness of bio-inspired spectral discrimination both in simulation and in a real setup as well as assessing the spectral discrimination performance. Our proposed approach provides a robust spectral sensing capability without conventional colour filters or computational demosaicing. This approach opens new pathways toward new spectral sensing systems inspired by nature's evolutionary solutions. Code and analysis are available at: https://samiarja.github.io/neuromorphic_octopus_eye/

Minghui Lin,Shu Wang,Xiang Wang,Jianhua Tang,Longbin Fu,Zhengrong Zuo,Nong Sang

Main category: cs.CV

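TLDR: The paper proposes DMPT, an efficient prompt-tuning framework for multi-modal object re-identification that freezes the backbone and optimizes only a small number of new parameters, substantially reducing compute and storage requirements.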

Details

Motivation: Existing multi-modal object re-identification methods built on large-scale pre-trained backbones perform well but must optimize a large number of parameters, leading to high computational and storage costs. Method: DMPT decouples the visual prompts into modality-specific prompts and modality-independent semantic prompts, and adds a Prompt Inverse Bind strategy that promotes the exchange of complementary multi-modal information. Result: On multiple benchmarks, DMPT is competitive with state-of-the-art methods while fine-tuning only 6.5% of the backbone parameters. Conclusion: DMPT is an efficient approach to multi-modal object re-identification that sharply reduces resource requirements while maintaining high performance. Abstract: Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (i.e., ViT) have displayed remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires the optimization of considerable backbone parameters, causing extensive computational and storage requirements. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and only optimizes several newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts which leverage prior modality knowledge from a powerful text encoder and modality-independent semantic prompts which extract semantic information from multi-modal inputs, such as visible, near-infrared, and thermal-infrared. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitates the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that our DMPT can achieve competitive results to existing state-of-the-art methods while requiring only 6.5% fine-tuning of the backbone parameters.

[55] PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation

Bo-Cheng Hu,Ge-Peng Ji,Dian Shao,Deng-Ping Fan

Main category: cs.CV

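TLDR: PraNet-V2 introduces a Dual-Supervised Reverse Attention (DSRA) module to address PraNet-V1's weakness on multi-class segmentation tasks, significantly improving performance.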

Details

Motivation: PraNet-V1 performs poorly on multi-class segmentation tasks and needs to be improved to broaden its range of applications. Method: Proposes the DSRA module, which combines explicit background supervision, independent background modeling, and semantically enriched attention fusion. Result: Performs strongly on four polyp segmentation datasets and, when integrated into three state-of-the-art semantic segmentation models, improves the mean Dice score by up to 1.36%. Conclusion: PraNet-V2 performs well on multi-class segmentation tasks and has broad application potential. Abstract: Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: https://github.com/ai4colonoscopy/PraNet-V2/tree/main/binary_seg/jittor.
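
To make the reverse attention idea concrete, here is a minimal NumPy sketch of the general mechanism: features are re-weighted by one minus the sigmoid of a coarse prediction, so that regions not yet claimed as foreground receive more weight. This is an illustrative assumption about how a reverse attention step can be written, not the DSRA module itself; the function names and tensor shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def reverse_attention(features, coarse_logits):
    """Weight features by the complement of the coarse foreground probability.

    features:      array of shape (C, H, W)
    coarse_logits: array of shape (H, W), pre-sigmoid foreground scores
    """
    weights = 1.0 - sigmoid(coarse_logits)   # high where foreground is unlikely
    return features * weights[None, :, :]    # broadcast over channels

# Toy input: one confident foreground pixel, one confident background pixel,
# and two undecided pixels.
features = np.ones((2, 2, 2))
coarse_logits = np.array([[10.0, -10.0], [0.0, 0.0]])
refined = reverse_attention(features, coarse_logits)
```

In this toy case the confident foreground pixel is almost fully suppressed, the confident background pixel passes through nearly unchanged, and the undecided pixels are halved.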

[56] TMCIR: Token Merge Benefits Composed Image Retrieval

Chaoyang Wang,Zeyu Zhang,Long Teng,Zijun Li,Shichao Kan

Main category: cs.CV

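TLDR: TMCIR proposes a new composed image retrieval framework that significantly improves retrieval performance through intent-aware cross-modal alignment and adaptive token fusion.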

Details

Motivation: Existing composed image retrieval methods are biased in how they fuse visual and textual information and fail to capture user intent accurately. Method: 1) Uses a diffusion model to synthesize intent-reflecting pseudo-target images and fine-tunes the CLIP encoders contrastively on them; 2) dynamically balances visual and textual representations through adaptive token fusion. Result: On the Fashion-IQ and CIRR datasets, TMCIR significantly outperforms existing methods. Conclusion: TMCIR improves the accuracy of composed image retrieval through better intent capture and dynamic balancing. Abstract: Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the encoder ability of text to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.

[57] MediSee: Reasoning-based Pixel-level Perception in Medical Images

Qinyue Tong,Ziqian Lu,Jun Liu,Yangming Zheng,Zheming Lu

Main category: cs.CV

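TLDR: The paper proposes MedSD, a new medical vision task that uses logical reasoning to understand implicit queries about medical images and produces a segmentation mask and bounding box for the target. The authors build the MLMR-SD dataset and propose the baseline model MediSee, which experiments show outperforms traditional methods.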

Details

Motivation: Existing medical image perception methods depend on specific tasks or precise input prompts, which limits their generality. General users tend to pose oral queries that require logical reasoning, so a more general approach is needed. Method: Proposes the MedSD task, builds the MLMR-SD dataset, and designs the baseline model MediSee to handle implicit queries and produce segmentation and detection results. Result: Experiments show that MediSee handles implicit queries effectively and outperforms traditional medical referring segmentation methods. Conclusion: The MedSD task and the MediSee model offer a more general solution for medical image understanding that suits the needs of general users. Abstract: Despite remarkable advancements in pixel-level medical image perception, existing methods are either limited to specific tasks or heavily rely on accurate bounding boxes or text labels as input prompts. However, the medical knowledge required for input is a huge obstacle for general public, which greatly reduces the universality of these methods. Compared with these domain-specialized auxiliary information, general users tend to rely on oral queries that require logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperform traditional medical referring segmentation methods.

[58] GATE3D: Generalized Attention-based Task-synergized Estimation in 3D*

Eunsoo Im,Jung Kwon Lee,Changhyun Jee

Main category: cs.CV

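TLDR: GATE3D is a novel weakly supervised framework for generalized monocular 3D object detection that uses consistency losses to address the challenges of multi-domain training, and it performs well on KITTI and an indoor dataset.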

Details

Motivation: Address the challenges that scarce annotated data and dataset bias pose for multi-domain training of monocular 3D object detection. Method: Proposes the GATE3D framework, which uses pseudo-labels and consistency losses between 2D and 3D predictions to bridge domain gaps. Result: Achieves competitive performance on KITTI and an indoor dataset, and accelerates learning through effective pre-training strategies. Conclusion: GATE3D shows potential for broad application in robotics, augmented reality, and virtual reality. Abstract: The emerging trend in computer vision emphasizes developing universal models capable of simultaneously addressing multiple diverse tasks. Such universality typically requires joint training across multi-domain datasets to ensure effective generalization. However, monocular 3D object detection presents unique challenges in multi-domain training due to the scarcity of datasets annotated with accurate 3D ground-truth labels, especially beyond typical road-based autonomous driving contexts. To address this challenge, we introduce a novel weakly supervised framework leveraging pseudo-labels. Current pretrained models often struggle to accurately detect pedestrians in non-road environments due to inherent dataset biases. Unlike generalized image-based 2D object detection models, achieving similar generalization in monocular 3D detection remains largely unexplored. In this paper, we propose GATE3D, a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions. Remarkably, our model achieves competitive performance on the KITTI benchmark as well as on an indoor-office dataset collected by us to evaluate the generalization capabilities of our framework. Our results demonstrate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies, highlighting substantial potential for broader impacts in robotics, augmented reality, and virtual reality applications. Project page: https://ies0411.github.io/GATE3D/
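
One plausible way to express a consistency term between 2D and 3D predictions is to project the predicted 3D box into the image and compare its enclosing rectangle with the predicted 2D box. The sketch below does that with a pinhole camera and an L1 penalty. The abstract does not specify the loss, so the formulation, the function names, the axis-aligned box (yaw ignored), and the intrinsics are all assumptions made for illustration.

```python
import numpy as np

def box3d_corners(center, size):
    """Eight corners of an axis-aligned 3D box in camera coordinates (yaw ignored)."""
    half = np.asarray(size, dtype=float) / 2.0
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center, dtype=float) + signs * half

def project_points(points, fx, fy, u0, v0):
    """Pinhole projection of camera-frame points (z forward) to pixel coordinates."""
    u = fx * points[:, 0] / points[:, 2] + u0
    v = fy * points[:, 1] / points[:, 2] + v0
    return np.stack([u, v], axis=1)

def enclosing_box(center, size, intrinsics):
    """Tightest 2D box (x_min, y_min, x_max, y_max) around the projected corners."""
    uv = project_points(box3d_corners(center, size), *intrinsics)
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])

def consistency_l1(box2d, center, size, intrinsics):
    """Mean absolute difference between a 2D box and the projected 3D box."""
    projected = enclosing_box(center, size, intrinsics)
    return float(np.abs(projected - np.asarray(box2d, dtype=float)).mean())

# Hypothetical camera (fx, fy, u0, v0) and a 2 m cube centred 10 m ahead.
intrinsics = (100.0, 100.0, 0.0, 0.0)
projected = enclosing_box((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), intrinsics)
```

The penalty is zero when the 2D box equals the projected rectangle and grows linearly as the two drift apart, which is the property a weak-supervision signal of this kind relies on.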

[59] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

Chenyang Zhu,Xing Zhang,Yuyang Sun,Ching-Chun Chang,Isao Echizen

Main category: cs.CV

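TLDR: The paper presents AnimeDL-2M, the first large-scale IMDL benchmark dataset for anime images, and develops the AniXplore model, which significantly improves manipulation detection on anime images.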

Details

Motivation: The anime domain urgently needs dedicated manipulation detection methods because of the threats posed by AI-generated forgeries (such as copyright violations and content tampering), while existing methods mainly target natural images and perform poorly. Method: Builds the AnimeDL-2M dataset of over two million real, partially manipulated, and fully AI-generated images, and designs the AniXplore model around the visual characteristics of anime. Result: Experiments show that existing IMDL models perform poorly on anime images, while AniXplore significantly outperforms existing methods. Conclusion: AnimeDL-2M and AniXplore fill a gap in anime IMDL and provide the community with important tools. Abstract: Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored, despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found in https://flytweety.github.io/AnimeDL2M/.

[60] DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environment

Hyejin Lee,Seokjun Hong,Jeonghoon Song,Haechan Cho,Zhixiong Jin,Byeonghun Kim,Joobin Jin,Jaegyun Im,Byeongjoon Noh,Hwasoo Yeo

Main category: cs.CV

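TLDR: The DRIFT dataset is collected from synchronized drone videos and provides high-resolution vehicle trajectories that support multi-scale traffic analysis.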

Details

Motivation: Reliable traffic data are essential for urban traffic management and research, but existing data struggle to support multi-scale analysis. Method: Drones record synchronized videos at an altitude of about 250 meters; video synchronization and orthomap alignment then yield 81,699 vehicle trajectories. Result: The DRIFT dataset contains detailed vehicle trajectory information and supports traffic analysis from the microscopic to the macroscopic scale. Conclusion: DRIFT provides a high-quality, ready-to-use traffic data resource for academic research and practical applications. Abstract: Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales - from individual vehicle maneuvers like lane-changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at https://github.com/AIxMobility/The-DRIFT.
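
As an illustration of the kind of safety metric that trajectory data like this supports, the sketch below computes time-to-collision (TTC) for a follower and a leader in the same lane, using the standard definition of gap divided by closing speed. The function name and the scalar inputs are hypothetical; they do not reflect the DRIFT file format or its accompanying tools.

```python
import math

def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Seconds until the follower reaches the leader if both keep their speed.

    gap_m is the bumper-to-bumper distance in meters. Returns infinity when
    the follower is not closing in on the leader.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed

# A follower at 15 m/s, 20 m behind a leader at 10 m/s, closes the gap at 5 m/s.
ttc_closing = time_to_collision(20.0, 15.0, 10.0)
ttc_opening = time_to_collision(20.0, 10.0, 15.0)
```

Applied per frame to consecutive vehicle pairs, a low TTC flags potentially unsafe interactions, while aggregating it over an intersection gives a network-level safety indicator.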

[61] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Andrea Simonelli,Norman Müller,Peter Kontschieder

Main category: cs.CV

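TLDR: This paper proposes an efficient 3D interactive segmentation method that combines a voxel-based sparse encoder with a lightweight transformer decoder and performs strongly on multiple datasets.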

Details

Motivation: As digital 3D environments become widespread, demand grows for efficient and precise 3D interaction, particularly in unseen environments and with unfamiliar objects. Method: Combines a voxel-based sparse encoder with a lightweight transformer decoder that implements implicit click fusion. Result: Performs strongly on ScanNet, ScanNet++, and other datasets, and also handles unseen geometric distributions. Conclusion: The method surpasses prior work in both performance and efficiency and suits diverse settings. Abstract: The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and performing well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting. The project web-page is available at https://simonelli-andrea.github.io/easy3d.

[62] TADACap: Time-series Adaptive Domain-Aware Captioning

Elizabeth Fons,Rachneet Kaur,Zhen Zeng,Soham Palande,Tucker Balch,Svitlana Vyetrenko,Manuela Veloso

Main category: cs.CV

TLDR: TADACap is a retrieval-based framework that generates domain-aware captions for time-series images and adapts to new domains without retraining.

Details

Motivation: Existing time-series captioning methods usually produce generic descriptions, adapt poorly to new domains, and require substantial retraining. Method: Proposes the TADACap framework and the TADACap-diverse retrieval strategy, which retrieves diverse image-caption pairs from a target-domain database. Result: TADACap-diverse matches existing methods in semantic accuracy while requiring significantly less annotation effort. Conclusion: TADACap-diverse offers an efficient and adaptable solution for captioning time-series images. Abstract: While image captioning has gained significant attention, the potential of captioning time-series images, prevalent in areas like finance and healthcare, remains largely untapped. Existing time-series captioning methods typically offer generic, domain-agnostic descriptions of time-series shapes and struggle to adapt to new domains without substantial retraining. To address these limitations, we introduce TADACap, a retrieval-based framework to generate domain-aware captions for time-series images, capable of adapting to new domains without retraining. Building on TADACap, we propose a novel retrieval strategy that retrieves diverse image-caption pairs from a target domain database, namely TADACap-diverse. We benchmarked TADACap-diverse against state-of-the-art methods and ablation variants. TADACap-diverse demonstrates comparable semantic accuracy while requiring significantly less annotation effort.

[63] Defending Against Frequency-Based Attacks with Diffusion Models

Fatemeh Amerehi,Patrick Healy

Main category: cs.CV

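TLDR: Adversarial training is usually tailored to specific attack types and generalizes poorly; adversarial purification removes perturbations with a generative model and copes better with unseen attacks. Diffusion models are highly effective for noise purification, and this study examines how well they handle spectral and spatial adversarial attacks.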

Details

Motivation: Investigate how well adversarial purification generalizes, especially to unseen attack types such as spectral and spatial attacks. Method: Uses a diffusion model for adversarial purification, trained independently of the classifier and the threat model, and evaluates it against spectral and spatial attacks. Result: Adversarial purification effectively handles diverse distortion patterns from low to high frequencies. Conclusion: Adversarial purification shows promise against diverse adversarial attacks and generalizes better than conventional adversarial training. Abstract: Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.
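
To show what a distortion confined to a frequency region looks like, the sketch below builds a random perturbation whose spectrum is restricted to a chosen radial band of the 2D Fourier plane and rescales it to a fixed maximum amplitude. It is a generic test pattern rather than any of the attacks evaluated in the paper; the function name, band limits, and amplitude are hypothetical.

```python
import numpy as np

def radial_band_mask(shape, low, high):
    """Boolean mask selecting centred-spectrum bins with low <= radius < high."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    return (radius >= low) & (radius < high)

def band_limited_perturbation(shape, low, high, eps, seed=0):
    """Random perturbation with energy only in one radial frequency band.

    The result is scaled so that its largest absolute value equals eps.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(shape)
    spectrum = np.fft.fftshift(np.fft.fft2(noise))       # DC moved to the centre
    spectrum = spectrum * radial_band_mask(shape, low, high)
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    peak = np.abs(filtered).max()
    return eps * filtered / peak if peak > 0 else filtered

# Mid-frequency perturbation for a 32x32 grayscale image with values in [0, 1].
delta = band_limited_perturbation((32, 32), low=8, high=12, eps=0.03)
```

Sweeping `low` and `high` from small to large radii produces the low- to high-frequency distortion patterns against which a purifier can be probed.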

[64] QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models

Yudong Zhang,Ruobing Xie,Jiansheng Chen,Xingwu Sun,Zhanhui Kang,Yu Wang

Main category: cs.CV

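TLDR: The paper proposes the query-agnostic visual attack (QAVA), which aims to generate adversarial examples that cause incorrect answers even to unknown questions, significantly improving attack effectiveness and efficiency.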

Details

Motivation: Traditional adversarial attacks target a specific image and question, whereas in practice a single image is often associated with multiple questions. QAVA aims to address this and to expose latent vulnerabilities to visual adversarial threats. Method: Introduces QAVA, which generates adversarial examples that cause incorrect answers even to unknown questions. Result: QAVA is highly effective when the question is unknown, with performance close to that of attacks on known target questions. Conclusion: QAVA broadens the scope of visual adversarial attacks and reveals new vulnerabilities of LVLMs to visual adversarial threats. Abstract: In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at https://github.com/btzyd/qava.

[65] Leveraging LLMs and attention-mechanism for automatic annotation of historical maps

Yunshuang Yuan,Monika Sester

Main category: cs.CV

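TLDR: Proposes a new method that uses large language models and attention mechanisms to annotate historical maps automatically, achieving high recall and precision.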

Details

Motivation: Historical maps are an important resource for studying past geographical landscapes, but traditional methods rely on human interpretation, which is inefficient and hard to scale. Method: Uses large language models to generate coarse classification labels and refines them to higher resolution with attention mechanisms. Result: Experiments show recall above 90% with strong IoU and precision scores, without fine-grained manual labels. Conclusion: The method opens a new route to efficient, scalable analysis of historical maps. Abstract: Historical maps are essential resources that provide insights into the geographical landscapes of the past. They serve as valuable tools for researchers across disciplines such as history, geography, and urban studies, facilitating the reconstruction of historical environments and the analysis of spatial transformations over time. However, when constrained to analogue or scanned formats, their interpretation is limited to humans and therefore not scalable. Recent advancements in machine learning, particularly in computer vision and large language models (LLMs), have opened new avenues for automating the recognition and classification of features and objects in historical maps. In this paper, we propose a novel distillation method that leverages LLMs and attention mechanisms for the automatic annotation of historical maps. LLMs are employed to generate coarse classification labels for low-resolution historical image patches, while attention mechanisms are utilized to refine these labels to higher resolutions. Experimental results demonstrate that the refined labels achieve a high recall of more than 90%. Additionally, the intersection over union (IoU) scores--84.2% for Wood and 72.0% for Settlement--along with precision scores of 87.1% and 79.5%, respectively, indicate that most labels are well-aligned with ground-truth annotations. Notably, these results were achieved without the use of fine-grained manual labels during training, underscoring the potential of our approach for efficient and scalable historical map analysis.
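
For readers who want to see how the reported metrics relate to one another, the following sketch computes precision, recall, and IoU for a single class from a predicted mask and a ground-truth mask. The toy masks and the function name are made up for illustration and are unrelated to the paper's data.

```python
import numpy as np

def mask_metrics(pred, truth):
    """Precision, recall, and IoU of a binary prediction against ground truth."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())    # labelled and correct
    fp = int(np.logical_and(pred, ~truth).sum())   # labelled but wrong
    fn = int(np.logical_and(~pred, truth).sum())   # missed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
    }

# Toy 2x2 masks: every true pixel is found, plus one extra false positive.
metrics = mask_metrics([[1, 1], [1, 0]], [[1, 1], [0, 0]])
```

Here recall is 1.0 while precision and IoU are both 2/3, which shows why a label set can reach very high recall and still have a noticeably lower IoU: false positives count against IoU but not against recall.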

[66] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

Alireza Salehi,Mohammadreza Salehi,Reshad Hosseini,Cees G. M. Snoek,Makoto Yamada,Mohammad Sabokrou

Main category: cs.CV

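TLDR: The paper proposes a new CLIP-based method that conditions the text encoder's prompts and modifies the vision encoder, improving anomaly detection at both the image level and the pixel level.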

Details

Motivation: Traditional anomaly detection methods depend on normal training samples and generalize poorly, and existing CLIP-based methods show a performance gap between image-level and pixel-level detection. Method: Conditions the text encoder's prompts on image context and modifies the CLIP vision encoder to extract denser features that retain more spatial and structural information. Result: Improves performance by 2% to 29% across 14 datasets, reaching the state of the art. Conclusion: The method effectively closes the gap between image-level and pixel-level anomaly detection and is broadly applicable. Abstract: Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible, as collecting such data can be impractical. Additionally, these methods often struggle to generalize across different domains. Recent advancements, such as AnomalyCLIP and AdaCLIP, utilize the zero-shot generalization capabilities of CLIP but still face a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder. Also, to capture fine-grained variations more effectively, we have modified the CLIP vision encoder and altered the extraction of dense features. These changes ensure that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art performance, improving performance by 2% to 29% across different metrics on 14 datasets. This demonstrates its effectiveness in both image-level and pixel-level anomaly detection.

[67] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

Pedro Diaz-Garcia,Felix Escalona,Miguel Cazorla

Main category: cs.CV

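TLDR: The paper explores how underwater image enhancement techniques can improve the accuracy of keypoint detection and matching, with deep learning models (such as GANs and CNNs) significantly outperforming traditional methods.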

Details

Motivation: The goal is to improve the accuracy and robustness of keypoint detection and matching through underwater image enhancement. Method: Applies deep learning models, including generative adversarial networks (GANs) and convolutional neural networks (CNNs). Result: Evaluation on multiple underwater datasets shows significant improvements over traditional methods. Conclusion: Deep learning models perform well for underwater image enhancement and effectively improve keypoint detection and matching. Abstract: The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.

[68] Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage

Marco Micheletto,Giulia Orrù,Luca Ghiani,Gian Luca Marcialis

Main category: cs.CV

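TLDR: The paper proposes an add-on module called Closeness Binary Code (CC) that strengthens Presentation Attack Detection (PAD) in fingerprint verification systems by exploiting the clustering of genuine fingerprint features.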

Details

Motivation: Current PAD systems are usually designed independently of the fingerprint verification system and miss the opportunity to improve security by using the users' templates. Method: Adds a CC module to a basic PAD system, exploiting the way genuine fingerprints cluster in feature space (a hierarchy of closeness: the same finger, other fingers of the same user, then fingers of other users). Result: Experiments show that the CC module significantly improves PAD performance without being designed around samples from specific users. Conclusion: The CC module is an efficient, general PAD enhancement that integrates easily into existing fingerprint verification systems. Abstract: Presentation Attack Detection (PAD) systems are usually designed independently of the fingerprint verification system. While this can be acceptable for use cases where specific user templates are not predetermined, it represents a missed opportunity to enhance security in scenarios where integrating PAD with the fingerprint verification system could significantly leverage users' templates, which are the real target of a potential presentation attack. This does not mean that a PAD should be specifically designed for such users; that would imply the availability of many enrolled users' PAI and, consequently, complexity, time, and cost increase. On the contrary, we propose to equip a basic PAD, designed according to the state of the art, with an innovative add-on module called the Closeness Binary Code (CC) module. The term "closeness" refers to a peculiar property of the bona fide-related features: in an Euclidean feature space, genuine fingerprints tend to cluster in a specific pattern. First, samples from the same finger are close to each other, then samples from other fingers of the same user and finally, samples from fingers of other users. This property is statistically verified in our previous publication, and further confirmed in this paper. It is independent of the user population and the feature set class, which can be handcrafted or deep network-based (embeddings). Therefore, the add-on can be designed without the need for the targeted user samples; moreover, it exploits her/his samples' "closeness" property during the verification stage. Extensive experiments on benchmark datasets and state-of-the-art PAD methods confirm the benefits of the proposed add-on, which can be easily coupled with the main PAD module integrated into the fingerprint verification system.

[69] Change State Space Models for Remote Sensing Change Detection

Elman Ghazaei,Erchan Aptoula

Main category: cs.CV

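TLDR: The paper proposes the Change State Space Model (CSSM), a state-space-model architecture that focuses on the relevant changes between bi-temporal images, markedly improving computational efficiency while maintaining high detection performance.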

Details

Motivation: Address the limitations of ConvNets and ViTs in change detection: ConvNets struggle to model long-range dependencies, and ViTs are computationally inefficient. Method: Designs the Change State Space Model, which focuses on the changes between bi-temporal images and filters out irrelevant information to reduce the number of network parameters. Result: Outperforms ConvNets, ViTs, and Mamba-based models on three benchmark datasets at lower computational complexity. Conclusion: CSSM surpasses existing methods in both computational efficiency and detection performance and suits large-scale change detection. Abstract: Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.

[70] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

Jiaxin Huang,Sheng Miao,BangBnag Yang,Yuewen Ma,Yiyi Liao

Main category: cs.CV

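TLDR: Vivid4D is a new method that improves 4D dynamic scene reconstruction by synthesizing multi-view videos from a monocular video, combining geometric and generative priors and performing view augmentation as a video inpainting task.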

Details

Motivation: Reconstructing 4D dynamic scenes from monocular video is challenging because each timestamp is observed from a single viewpoint. Existing methods either rely only on geometric priors or overlook geometry. Method: Reformulates view augmentation as a video inpainting task, warping observed views to new viewpoints using monocular depth priors and training the inpainting model on unposed web videos. Adds an iterative view augmentation strategy and a robust reconstruction loss. Result: Experiments show that the method effectively improves monocular 4D scene reconstruction and completion. Conclusion: By combining geometric and generative priors, Vivid4D significantly improves monocular 4D video synthesis. Abstract: Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

[71] Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Yulong Zhang,Tianyi Liang,Xinyue Huang,Erfei Cui,Xu Guo,Pei Chu,Chenhui Li,Ru Zhang,Wenhai Wang,Gongshen Liu

Main category: cs.CV

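TLDR: Proposes Consensus Entropy (CE), a training-free OCR post-processing method that quantifies OCR uncertainty by aggregating the outputs of multiple VLMs, significantly improving the quality and accuracy of OCR tasks.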

Details

Motivation: Although existing VLMs show improved average OCR accuracy, they still suffer from sample-level quality degradation and lack reliable automatic detection of low-quality outputs. Method: Uses the Consensus Entropy (CE) of multiple VLM outputs to quantify uncertainty, and builds a lightweight multi-model framework that identifies problematic samples and refines the output. Result: CE performs strongly on multiple OCR benchmarks, with F1 scores 15.2% higher than VLM-as-judge methods, a 6.0% accuracy gain on mathematical calculation tasks, and rephrasing required for only 7.3% of inputs. Conclusion: CE needs no training or supervision, works plug-and-play, significantly improves OCR performance, and achieves state-of-the-art results. Abstract: The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and providing high-quality data sources for LLM training data. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution demonstrates: achieving 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivering 6.0% accuracy gains on mathematical calculation tasks, and requiring rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while maintaining plug-and-play functionality throughout.
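
The key insight, that correct OCR outputs agree while erroneous ones diverge, can be sketched with a simple disagreement score over the outputs of several models. The version below uses the mean pairwise string similarity from Python's `difflib` as a stand-in; it is not the paper's Consensus Entropy formula, and the function names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_similarity(a, b):
    """Similarity in [0, 1] between two strings (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()

def disagreement(outputs):
    """One minus the mean pairwise similarity; 0.0 when all outputs agree."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(pairwise_similarity(a, b) for a, b in pairs) / len(pairs)

def most_central(outputs):
    """Return the output that is, on average, most similar to the others."""
    if len(outputs) == 1:
        return outputs[0]

    def mean_similarity(i):
        others = [o for j, o in enumerate(outputs) if j != i]
        return sum(pairwise_similarity(outputs[i], o) for o in others) / len(others)

    return outputs[max(range(len(outputs)), key=mean_similarity)]

# Three hypothetical model outputs for the same text line.
outputs = ["hello world", "hello world", "hxllo wprld"]
score = disagreement(outputs)
best = most_central(outputs)
```

A sample whose score exceeds a chosen threshold could be routed to a stronger model or flagged for review, while low-score samples keep the most central output; the threshold itself would need tuning on held-out data.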

[72] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Jiangtao Liu,Zhaoxin Wang,Handing Wang,Cong Tian,Yaochu Jin

Main category: cs.CV

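TLDR: TCBS-Attack is a new black-box jailbreak attack that optimizes tokens near decision boundaries to generate semantically coherent adversarial prompts, bypassing the multi-layer defenses of T2I models.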

Details

Motivation: Existing defense mechanisms, such as prompt checkers and image checkers, are vulnerable to sophisticated adversarial attacks, so stronger attacks need to be studied to expose these weaknesses. Method: Proposes TCBS-Attack, which iteratively optimizes tokens near the decision boundaries of text and image checkers to generate semantically coherent adversarial prompts. Result: TCBS-Attack outperforms baseline methods across various T2I models, reaching an ASR-4 of 45% and an ASR-1 of 21%. Conclusion: TCBS-Attack demonstrates the fragility of existing defenses and provides a reference for improving future defense strategies. Abstract: Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

[73] S$^2$Teacher: Step-by-step Teacher for Sparsely Annotated Oriented Object Detection

Yu Lin,Jianghang Lin,Kai Ye,You Shen,Yan Zhang,Shengchuan Zhang,Liujuan Cao,Rongrong Ji

Main category: cs.CV

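TLDR: The paper proposes the sparsely annotated oriented object detection (SAOOD) setting and S$^2$Teacher, which progressively mines pseudo-labels and reweights the loss, significantly improving detection performance and approaching fully supervised results.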

Details

Motivation: To address the difficulty of dense annotation in remote sensing images and reduce the manual labeling burden. Method: Proposes the SAOOD setting and the S$^2$Teacher method, which progressively mines pseudo-labels and reweights the loss. Result: On the DOTA dataset, the method approaches fully supervised performance with only 10% of instances annotated. Conclusion: S$^2$Teacher effectively balances detection accuracy with annotation efficiency; the code will be made public. Abstract: Although fully-supervised oriented object detection has made significant progress in multimodal remote sensing image understanding, it comes at the cost of labor-intensive annotation. Recent studies have explored weakly and semi-supervised learning to alleviate this burden. However, these methods overlook the difficulties posed by dense annotations in complex remote sensing scenes. In this paper, we introduce a novel setting called sparsely annotated oriented object detection (SAOOD), which only labels partial instances, and propose a solution to address its challenges. Specifically, we focus on two key issues in the setting: (1) sparse labeling leading to overfitting on limited foreground representations, and (2) unlabeled objects (false negatives) confusing feature learning. To this end, we propose the S$^2$Teacher, a novel method that progressively mines pseudo-labels for unlabeled objects, from easy to hard, to enhance foreground representations. Additionally, it reweights the loss of unlabeled objects to mitigate their impact during training. Extensive experiments demonstrate that S$^2$Teacher not only significantly improves detector performance across different sparse annotation levels but also achieves near-fully-supervised performance on the DOTA dataset with only 10% of instances annotated, effectively balancing detection accuracy with annotation efficiency. The code will be public.

[74] Flyweight FLIM Networks for Salient Object Detection in Biomedical Images

Leonardo M. Joao,Jancarlo F. Gomes,Silvio J. F. Guimaraes,Ewa Kijak,Alexandre X. Falcao

Main category: cs.CV

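TLDR: The paper proposes a lightweight salient object detection method based on FLIM networks that requires neither large annotated datasets nor backpropagation, making it suitable for resource-constrained settings.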

Details

Motivation: Salient object detection usually requires substantial computational resources and large annotated datasets, while lightweight models perform poorly in complex or data-scarce scenarios. FLIM networks address this by learning convolutional kernels from image markers. Method: The paper presents methods to learn dilated-separable convolutional kernels and multi-dilation layers without backpropagation, plus a network simplification method that reduces kernel redundancy and encoder size. Combined with an adaptive decoder, these yield efficient FLIM network models. Result: Experiments show better efficiency and effectiveness than lightweight models, with significantly fewer parameters and floating-point operations, and effectiveness competitive with heavyweight models. Conclusion: FLIM networks show potential for data-limited and resource-constrained applications, particularly biomedical images with information redundancy. Abstract: Salient Object Detection (SOD) with deep learning often requires substantial computational resources and large annotated datasets, making it impractical for resource-constrained applications. Lightweight models address computational demands but typically struggle in complex and scarce labeled-data scenarios. Feature Learning from Image Markers (FLIM) learns an encoder's convolutional kernels from image patches extracted from discriminative regions marked on a few representative images, dismissing large annotated datasets, pretraining, and backpropagation. Such a methodology exploits information redundancy commonly found in biomedical image applications. This study presents methods to learn dilated-separable convolutional kernels and multi-dilation layers without backpropagation for FLIM networks. It also proposes a novel network simplification method to reduce kernel redundancy and encoder size. By combining a FLIM encoder with an adaptive decoder, a concept recently introduced to estimate a pointwise convolution per image, this study presents very efficient (named flyweight) SOD models for biomedical images. Experimental results on challenging datasets demonstrate superior efficiency and effectiveness to lightweight models. By requiring significantly fewer parameters and floating-point operations, the results show competitive effectiveness to heavyweight models. These advances highlight the potential of FLIM networks for data-limited and resource-constrained applications with information redundancy.

P. Tomkiewicz,J. Jaworski,P. Zielonka,A. Wilinski

Main category: cs.CV

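TLDR: This paper proposes a density gradient analysis method based on multi-modal satellite imagery for evaluating urban metrics, with applications to public transport and other urban systems. Combining optical and SAR data, it segments urban areas, identifies urban centers, quantifies density gradients, and analyzes density gradient plots with K-means clustering. The results show that the method reveals urban structure and provides a screening tool for public transport analysis.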

Details

Motivation: Urban planning and public transport assessment need efficient, low-cost methods, and satellite data offers global coverage. The paper aims to use multi-modal satellite imagery to develop a method suited to preliminary public transport assessment. Method: Combines optical and SAR data in a density gradient analysis that computes the density gradient coefficient ($\alpha$) and the minimum effective distance (LD), and uses K-means clustering to identify regions in density gradient plots. Result: Comparing a monocentric and a polycentric city shows that density gradient characteristics relate to public transport network topology; cities with clear density peaks require different transport strategies. Conclusion: The method gives urban planners a low-cost, globally applicable tool for preliminary public transport assessment, built on freely available satellite data. Abstract: This paper presents a novel computational approach for evaluating urban metrics through density gradient analysis using multi-modal satellite imagery, with applications including public transport and other urban systems. By combining optical and Synthetic Aperture Radar (SAR) data, we develop a method to segment urban areas, identify urban centers, and quantify density gradients. Our approach calculates two key metrics: the density gradient coefficient ($\alpha$) and the minimum effective distance (LD) at which density reaches a target threshold. We further employ machine learning techniques, specifically K-means clustering, to objectively identify uniform and high-variability regions within density gradient plots. We demonstrate that these metrics provide an effective screening tool for public transport analyses by revealing the underlying urban structure. Through comparative analysis of two representative cities with contrasting urban morphologies (monocentric vs polycentric), we establish relationships between density gradient characteristics and public transport network topologies. Cities with clear density peaks in their gradient plots indicate distinct urban centers requiring different transport strategies than those with more uniform density distributions. This methodology offers urban planners a cost-effective, globally applicable approach to preliminary public transport assessment using freely available satellite data. The complete implementation, with additional examples and documentation, is available in an open-source repository under the MIT license at https://github.com/nexri/Satellite-Imagery-Urban-Analysis.
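
To make the two metrics in this entry concrete, here is a minimal sketch. The abstract does not state the functional form of the density gradient, so the code assumes the classic negative-exponential profile density(r) = d0 * exp(-alpha * r), fitted by least squares on log-density; under that assumption LD is the radius at which the fitted curve reaches the target threshold. The function names are hypothetical and are not taken from the linked repository.

```python
import math


def fit_density_gradient(radii_km: list[float], densities: list[float]) -> tuple[float, float]:
    """Fit density(r) = d0 * exp(-alpha * r) by linear regression on log-density.

    Returns (alpha, d0). The exponential form is an assumption of this sketch.
    """
    if len(radii_km) != len(densities) or len(radii_km) < 2:
        raise ValueError("need at least two (radius, density) pairs")
    if any(d <= 0 for d in densities):
        raise ValueError("densities must be positive to take logarithms")
    logs = [math.log(d) for d in densities]
    n = len(radii_km)
    mean_r = sum(radii_km) / n
    mean_l = sum(logs) / n
    sxx = sum((r - mean_r) ** 2 for r in radii_km)
    sxy = sum((r - mean_r) * (l - mean_l) for r, l in zip(radii_km, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_r
    return -slope, math.exp(intercept)


def minimum_effective_distance(alpha: float, d0: float, threshold: float) -> float:
    """Radius at which the fitted density falls to `threshold`."""
    return math.log(d0 / threshold) / alpha
```

For example, a profile with alpha = 0.5 per km and d0 = 1.0 reaches a threshold of 0.1 at ln(10) / 0.5, roughly 4.6 km. A steeper gradient (larger alpha) gives a smaller LD, which is the kind of contrast one would expect between compact and sprawling cities.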

[76] Visual Re-Ranking with Non-Visual Side Information

Gustav Hanning,Gabrielle Flood,Viktor Larsson

Main category: cs.CV

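TLDR: The paper proposes GCSA, a graph neural network-based re-ranking method that uses multi-modal side information to improve visual place recognition.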

Details

Motivation: Existing methods re-rank using only the image descriptors from the initial retrieval, which provides limited additional signal. Method: Proposes GCSA, which combines visual descriptors with other side information (such as WiFi signal strength or camera poses) and uses affinity vectors for a shared encoding of the multi-modal input. Result: Experiments on two large-scale datasets show significant improvements on image retrieval and on the downstream visual localization task. Conclusion: By exploiting multi-modal information, GCSA improves re-ranking for visual place recognition. Abstract: The standard approach for visual place recognition is to use global image descriptors to retrieve the most similar database images for a given query image. The results can then be further improved with re-ranking methods that re-order the top scoring images. However, existing methods focus on re-ranking based on the same image descriptors that were used for the initial retrieval, which we argue provides limited additional signal. In this work we propose Generalized Contextual Similarity Aggregation (GCSA), which is a graph neural network-based re-ranking method that, in addition to the visual descriptors, can leverage other types of available side information. This can for example be other sensor data (such as signal strength of nearby WiFi or Bluetooth endpoints) or geometric properties such as camera poses for database images. In many applications this information is already present or can be acquired with low effort. Our architecture leverages the concept of affinity vectors to allow for a shared encoding of the heterogeneous multi-modal input. Two large-scale datasets, covering both outdoor and indoor localization scenarios, are utilized for training and evaluation. In experiments we show significant improvement not only on image retrieval metrics, but also for the downstream visual localization task.
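
The idea of affinity vectors as a shared encoding can be sketched as follows. Each retrieved image is described not by its raw descriptor or pose but by its affinities to every other image in the candidate set, so heterogeneous modalities end up as vectors of the same length that can be concatenated. The specific affinity functions below (cosine similarity for descriptors, an exponential distance kernel for camera positions) and the `length_scale` parameter are assumptions of this sketch, not details from the paper.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pose_affinity(p, q, length_scale: float = 10.0) -> float:
    """Assumed kernel: nearby camera positions get affinity close to 1."""
    return math.exp(-math.dist(p, q) / length_scale)


def affinity_vectors(items, affinity):
    """Row i holds the affinity of item i to every item in the candidate set."""
    return [[affinity(a, b) for b in items] for a in items]


def shared_encoding(descriptors, positions):
    """Concatenate per-modality affinity vectors into one feature per image."""
    visual = affinity_vectors(descriptors, cosine_similarity)
    spatial = affinity_vectors(positions, pose_affinity)
    return [v + s for v, s in zip(visual, spatial)]
```

With k candidates and two modalities, every node feature has length 2k regardless of the descriptor dimension or the kind of side information, which is what would let a single graph network consume them.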

[77] Taming Consistency Distillation for Accelerated Human Image Animation

Xiang Wang,Shiwei Zhang,Hangjie Yuan,Yujie Wei,Yingya Zhang,Changxin Gao,Yuehuan Wang,Nong Sang

Main category: cs.CV

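TLDR: DanceLCM uses segmented consistency distillation and a motion-focused loss to sharply reduce the number of inference steps of video diffusion models while maintaining quality.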

Details

Motivation: Existing video diffusion models rely on many iterative denoising steps, making inference costly and slow, while naively adopting consistency models degrades visual quality. Method: Proposes DanceLCM, which includes segmented consistency distillation, supervision through an auxiliary light-weight head, a motion-focused loss, and injection of facial fidelity features. Result: Experiments show that DanceLCM achieves results comparable to state-of-the-art video diffusion models with only 2-4 inference steps. Conclusion: DanceLCM significantly reduces the inference burden while preserving video quality; the code and models will be made publicly available. Abstract: Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary light-weight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.

[78] GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention

Mahir Gulzar,Yar Muhammad,Naveed Muhammad

Main category: cs.CV

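TLDR: Proposes a lane graph-based motion prediction model that fuses multiple contextual elements through cross-attention to produce robust predictions of future vehicle trajectories.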

Details

Motivation: Predicting the future trajectories of surrounding vehicles depends on contextual information, including static elements (such as lanes) and dynamic elements (such as traffic participants). Existing methods need to fuse this information more efficiently to improve prediction accuracy. Method: Uses an encoder-interactor-decoder architecture: the encoder encodes scene context with lightweight GRUs, the interactor applies cross-attention over scene features and lane-graph goal proposals, and the decoder regresses multimodal trajectories through a Laplacian Mixture Density Network. Result: Achieves state-of-the-art prediction performance on the nuScenes dataset. Conclusion: With lane-graph goal proposals and cross-attention, the model attends to scene elements relevant to future goals and so gives more robust trajectory predictions. Abstract: Predicting future trajectories of surrounding vehicles heavily relies on what contextual information is given to a motion prediction model. The context itself can be static (lanes, regulatory elements, etc) or dynamic (traffic participants). This paper presents a lane graph-based motion prediction model that first predicts graph-based goal proposals and later fuses them with cross attention over multiple contextual elements. We follow the well-known encoder-interactor-decoder architecture where the encoder encodes scene context using lightweight Gated Recurrent Units, the interactor applies cross-context attention over encoded scene features and graph goal proposals, and the decoder regresses multimodal trajectories via a Laplacian Mixture Density Network from the aggregated encodings. Using cross-attention over graph-based goal proposals gives robust trajectory estimates since the model learns to attend to future goal-relevant scene elements for the intended agent. We evaluate our work on the nuScenes motion prediction dataset, achieving state-of-the-art results.
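
A Laplacian mixture density decoder is usually trained with a negative log-likelihood over the predicted modes. The sketch below shows one common form of that loss for a single agent, assuming each mode is a full (x, y) trajectory with one shared scale and a mixture weight. The abstract does not specify the paper's exact parameterization, so per-mode scalar scales and the function names are assumptions made for illustration.

```python
import math


def laplace_log_pdf(x: float, mu: float, b: float) -> float:
    """Log-density of a Laplace distribution with location mu and scale b."""
    return -math.log(2.0 * b) - abs(x - mu) / b


def mixture_nll(target, modes, scales, weights) -> float:
    """Negative log-likelihood of `target` under a mixture of Laplace modes.

    target:  list of (x, y) ground-truth waypoints
    modes:   list of predicted trajectories, each a list of (x, y) waypoints
    scales:  one positive Laplace scale per mode (assumed shared over waypoints)
    weights: positive mixture weights that sum to 1
    """
    log_terms = []
    for trajectory, b, w in zip(modes, scales, weights):
        log_p = sum(
            laplace_log_pdf(tx, mx, b) + laplace_log_pdf(ty, my, b)
            for (tx, ty), (mx, my) in zip(target, trajectory)
        )
        log_terms.append(math.log(w) + log_p)
    # Log-sum-exp keeps the mixture sum numerically stable.
    peak = max(log_terms)
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))
```

Compared with a Gaussian likelihood, the absolute-error term of the Laplace density penalizes outliers less sharply, which is one reason it is a popular choice for trajectory regression.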

[79] SAR-to-RGB Translation with Latent Diffusion for Earth Observation

Kaan Aydin,Joelle Hanna,Damian Borth

Main category: cs.CV

TLDR: Proposes a diffusion model (DM)-based SAR-to-RGB translation method that generates synthetic optical images to compensate for missing Sentinel-2 (S2) data.

Details

Motivation: Sentinel-2 (S2) images are often unavailable because of cloud cover or data gaps, so synthetic optical images need to be generated from SAR data. Method: Three diffusion setups are explored (two Standard Diffusion, one Cold Diffusion), which generate S2 images either by adding and removing noise or by blending in and then removing the SAR signal. Result: The generated images do not perfectly replicate real data but remain useful; class conditioning improves classification accuracy, and Cold Diffusion performs well in land cover classification. Conclusion: Diffusion models show potential for SAR-to-RGB translation, especially in remote sensing applications where optical images are missing. Abstract: Earth observation satellites like Sentinel-1 (S1) and Sentinel-2 (S2) provide complementary remote sensing (RS) data, but S2 images are often unavailable due to cloud cover or data gaps. To address this, we propose a diffusion model (DM)-based approach for SAR-to-RGB translation, generating synthetic optical images from SAR inputs. We explore three different setups: two using Standard Diffusion, which reconstruct S2 images by adding and removing noise (one without and one with class conditioning), and one using Cold Diffusion, which blends S2 with S1 before removing the SAR signal. We evaluate the generated images in downstream tasks, including land cover classification and cloud removal. While generated images may not perfectly replicate real S2 data, they still provide valuable information. Our results show that class conditioning improves classification accuracy, while cloud removal performance remains competitive despite our approach not being optimized for it. Interestingly, despite exhibiting lower perceptual quality, the Cold Diffusion setup performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. Our findings highlight the potential of DMs for SAR-to-RGB translation in RS applications where RGB images are missing.
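
The difference between the Standard and Cold Diffusion setups lies in the forward (degradation) process: instead of adding Gaussian noise, Cold Diffusion blends the optical image toward the SAR image. A minimal sketch of such a forward step is below. It assumes a linear blending schedule and that the S1 and S2 inputs have already been brought to the same shape and value range. Neither detail is given in the abstract, and the function name is hypothetical.

```python
def cold_diffusion_blend(s2_pixels, s1_pixels, step: int, total_steps: int):
    """Blend an S2 image toward its paired S1 image at a given diffusion step.

    step = 0 returns the clean S2 pixels and step = total_steps returns pure
    S1 pixels. The linear schedule is an assumption of this sketch.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if len(s2_pixels) != len(s1_pixels):
        raise ValueError("S1 and S2 inputs must have the same number of values")
    mix = step / total_steps
    return [(1.0 - mix) * optical + mix * sar for optical, sar in zip(s2_pixels, s1_pixels)]
```

At inference time the process would start from the SAR image alone (the fully degraded end of the schedule), and a learned restoration network would progressively remove the SAR signal to recover an optical estimate.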

[80] DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention

Haohan Chen,Hongjia Liu,Shiyong Lan,Wenwu Wang,Yixin Qiao,Yao Li,Guonan Deng

Main category: cs.CV

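TLDR: DMAGaze is a new gaze estimation framework that disentangles gaze-relevant global features, local eye features, and head pose features, and combines them with a multi-scale attention module to improve performance.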

Details

Motivation: Gaze estimation is often disturbed by complex gaze-irrelevant information, so more effective ways of separating and exploiting gaze-relevant information are needed. Method: Designs a continuous mask-based Disentangler to separate gaze-relevant from gaze-irrelevant information, introduces a multi-scale global-local attention module to enhance the features, and combines head pose and local eye features for estimation. Result: Validated on two mainstream public datasets, achieving state-of-the-art performance. Conclusion: Through multi-feature disentanglement and attention, DMAGaze improves gaze estimation accuracy. Abstract: Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from the cropped eye patch), and head pose estimation features, to improve overall performance. Firstly, we design a new continuous mask-based Disentangler to accurately disentangle gaze-relevant and gaze-irrelevant information in facial images by achieving the dual-branch disentanglement goal through separately reconstructing the eye and non-eye regions. Furthermore, we introduce a new cascaded attention module named Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it effectively focuses on global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.

[81] TSAL: Few-shot Text Segmentation Based on Attribute Learning

Chenming Li,Chengxu Liu,Yuanting Fan,Xiao Jin,Xingsong Hou,Xueming Qian

Main category: cs.CV

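TLDR: The paper proposes TSAL, which uses CLIP's prior knowledge and an adaptive prompt-guided branch for scene text segmentation, reducing data dependency and improving accuracy.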

Details

Motivation: The scarcity of high-quality datasets and the high cost of pixel annotation limit supervised learning for scene text segmentation, which motivates applying few-shot learning. Method: Proposes TSAL, which builds on CLIP's prior knowledge, extracts features through a visual-guided branch and an adaptive prompt-guided branch, and adds an Adaptive Feature Alignment (AFA) module to improve feature capture. Result: Experiments show that TSAL achieves SOTA performance on multiple text segmentation datasets under few-shot settings. Conclusion: TSAL captures the unique attributes of text, achieves precise segmentation from only a few images, and shows potential in text-related domains. Abstract: Supervised learning has recently developed rapidly in scene text segmentation. However, the lack of high-quality datasets and the high cost of pixel annotation greatly limit its development. Considering the well-performed few-shot learning methods for downstream tasks, we investigate the application of the few-shot learning method to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts to capture distinctive text features and complex background distribution, we propose the Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.

[82] YOLO-RS: Remote Sensing Enhanced Crop Detection Methods

Linlin Xiao,Zhang Tiancong,Yutong Jia,Xinyu Nie,Mengyao Wang,Xiaohang Shao

Main category: cs.CV

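TLDR: Proposes YOLO-RS, a new object detection model based on YOLOv11 that introduces the Context Anchor Attention (CAA) mechanism and a multi-scale feature fusion network, significantly improving small-target detection in remote sensing images.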

Details

Motivation: Existing object detection methods perform poorly with complex backgrounds and small targets and struggle to meet practical requirements. Method: YOLO-RS uses a bidirectional feature fusion strategy and an ACmix module to strengthen small-target detection and address class imbalance. Result: On the PDT and CWC datasets, YOLO-RS improves recall and mAP by about 2-3%, with a significantly higher F1-score, while computational complexity increases by only about 5.2 GFLOPs. Conclusion: YOLO-RS performs well on small-target detection in remote sensing images, with good efficiency and application potential. Abstract: With the rapid development of remote sensing technology, crop classification and health detection based on deep learning have gradually become a research hotspot. However, the existing target detection methods show poor performance when dealing with small targets in remote sensing images, especially in the case of complex background and image mixing, and struggle to meet practical application requirements. To address this problem, a novel target detection model YOLO-RS is proposed in this paper. The model is based on the latest YOLOv11 and significantly enhances the detection of small targets by introducing the Context Anchor Attention (CAA) mechanism and an efficient multi-field multi-scale feature fusion network. YOLO-RS adopts a bidirectional feature fusion strategy in the feature fusion process, which effectively enhances the model's performance in the detection of small targets. Meanwhile, the ACmix module at the end of the model backbone network solves the category imbalance problem by adaptively adjusting the contrast and sample mixing, thus enhancing the detection accuracy in complex scenes. In the experiments on the PDT remote sensing crop health detection dataset and the CWC crop classification dataset, YOLO-RS improves both the recall and the mean average precision (mAP) by about 2-3% compared with the existing state-of-the-art methods, while the F1-score is also significantly improved. Moreover, the computational complexity of the model only increases by about 5.2 GFLOPs, indicating its significant advantages in both performance and efficiency. The experimental results validate the effectiveness and application potential of YOLO-RS in the task of detecting small targets in remote sensing images.

[83] TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik,Felix Yang,Benedikt Blumenstiel,Erik Scheurer,Rocco Sedona,Stefano Maurogiovanni,Jente Bosmans,Nikolaos Dionelis,Valerio Marsocci,Niklas Kopp,Rahul Ramachandran,Paolo Fraccaro,Thomas Brunschwiler,Gabriele Cavallaro,Juan Bernabe-Moreno,Nicolas Longépé

Main category: cs.CV

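TLDR: TerraMind is a multimodal foundation model for Earth observation, pretrained on dual-scale (token-level and pixel-level) representations. It supports zero-shot and few-shot applications, introduces the "Thinking-in-Modalities" capability, and performs strongly on standard benchmarks.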

Details

Motivation: To address the fusion and generation of multimodal data in Earth observation and improve model generalization and performance. Method: Uses a dual-scale early fusion approach that combines token-level and pixel-level data in pretraining, supports zero-shot and few-shot learning, and introduces the TiM capability. Result: Surpasses the state of the art on benchmarks such as PANGAEA; the model and code are open-sourced. Conclusion: TerraMind provides an efficient multimodal generative model for Earth observation with broad application potential. Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

[84] TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data

Benedikt Blumenstiel,Paolo Fraccaro,Valerio Marsocci,Johannes Jakubik,Stefano Maurogiovanni,Mikolaj Czerkawski,Rocco Sedona,Gabriele Cavallaro,Thomas Brunschwiler,Juan Bernabe-Moreno,Nicolas Longépé

Main category: cs.CV

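TLDR: TerraMesh is a globally diverse, multimodal dataset that combines optical, synthetic aperture radar, elevation, and land-cover data for large-scale pre-training and cross-modal learning.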

Details

Motivation: Existing public datasets are limited in scale, geographic coverage, or sensor variety, so a more comprehensive dataset is needed to support foundation models in Earth observation. Method: Introduces the TerraMesh dataset, with over 9 million samples and eight spatiotemporally aligned modalities, and provides data processing steps and statistics. Result: Experiments show improved performance for models pre-trained on TerraMesh. Conclusion: TerraMesh will be released publicly under a permissive license to support research in Earth observation. Abstract: Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporally aligned modalities, enabling large-scale pre-training and fostering robust cross-modal correlation learning. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset will be made publicly available with a permissive license.

[85] Video Summarization with Large Language Models

Min Jung Lee,Dayoung Gong,Minsu Cho

Main category: cs.CV

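TLDR: Proposes LLMVS, an LLM-based video summarization framework that converts video frames into a sequence of captions with a multi-modal LLM (M-LLM), scores each frame's importance with an LLM, and refines the scores with a global attention mechanism to produce summaries that better match semantics and human judgment.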

Details

Motivation: Existing video summarization methods rely mainly on visual features and temporal dynamics and struggle to capture video semantics, producing incomplete or incoherent summaries. Method: Uses an M-LLM to convert video frames into a sequence of captions, scores each frame's importance with an LLM, and refines the result with a global attention mechanism. Result: Experiments show that LLMVS outperforms existing methods on standard benchmarks. Conclusion: LLMVS demonstrates the potential of LLMs in multimedia content processing and produces video summaries that better match semantics and human judgment. Abstract: The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.

[86] Focal Split: Untethered Snapshot Depth from Differential Defocus

Junjie Luo,John Mamish,Alan Fu,Thomas Concannon,Josiah Hester,Emma Alexander,Qi Guo

Main category: cs.CV

TLDR: Focal Split is a handheld snapshot depth camera based on depth-from-differential-defocus (DfDD), with fully onboard power and computing.

Details

Motivation: To design a passive depth camera that avoids the power consumption of light sources while computing depth efficiently. Method: Captures differentially defocused images with two sensors and processes them using DfDD theory, which needs only 500 floating-point operations per pixel. Result: The prototype consumes 4.9 W, measures distances from 0.4 m to 1.2 m, and outputs 480$\times$360 sparse depth maps at 2.1 FPS. Conclusion: Focal Split is a DIY-friendly depth camera, released with a complete build guide and code. Abstract: We introduce Focal Split, a handheld, snapshot depth camera with fully onboard power and computing based on depth-from-differential-defocus (DfDD). Focal Split is passive, avoiding power consumption of light sources. Its achromatic optical system simultaneously forms two differentially defocused images of the scene, which can be independently captured using two photosensors in a snapshot. The data processing is based on the DfDD theory, which efficiently computes a depth and a confidence value for each pixel with only 500 floating point operations (FLOPs) per pixel from the camera measurements. We demonstrate a Focal Split prototype, which comprises a handheld custom camera system connected to a Raspberry Pi 5 for real-time data processing. The system consumes 4.9 W and is powered by a 5 V, 10,000 mAh battery. The prototype can measure objects with distances from 0.4 m to 1.2 m, outputting 480$\times$360 sparse depth maps at 2.1 frames per second (FPS) using unoptimized Python scripts. Focal Split is DIY friendly. A comprehensive guide to building your own Focal Split depth camera, code, and additional data can be found at https://focal-split.qiguo.org.
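
The figures quoted in this entry allow a quick back-of-the-envelope check of the compute and power budget. The sketch below uses only numbers from the abstract, with two simplifying assumptions: every pixel of the 480x360 output is processed at 500 FLOPs, and the battery delivers its full rated charge at 5 V with no conversion losses. The resulting runtime is therefore an idealized upper bound, not a measured value.

```python
# Figures taken from the abstract.
WIDTH, HEIGHT = 480, 360
FLOPS_PER_PIXEL = 500
FRAMES_PER_SECOND = 2.1
POWER_W = 4.9
BATTERY_V = 5.0
BATTERY_MAH = 10_000


def flops_per_frame() -> int:
    """Arithmetic cost of one depth map, assuming all pixels are processed."""
    return WIDTH * HEIGHT * FLOPS_PER_PIXEL


def flops_per_second() -> float:
    """Sustained arithmetic throughput at the reported frame rate."""
    return flops_per_frame() * FRAMES_PER_SECOND


def ideal_runtime_hours() -> float:
    """Battery energy (Wh) divided by system power, ignoring all losses."""
    energy_wh = BATTERY_V * BATTERY_MAH / 1000.0
    return energy_wh / POWER_W
```

Under these assumptions one frame costs 86.4 MFLOPs, the depth computation sustains about 181 MFLOP/s, and the 50 Wh battery would last a little over 10 hours. The depth arithmetic itself is small, which is consistent with the abstract attributing the 2.1 FPS rate to unoptimized Python scripts.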

[87] 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians

Zeming wei,Junyi Lin,Yang Liu,Weixing Chen,Jingzhou Luo,Guanbin Li,Liang Lin

Main category: cs.CV

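TLDR: The paper presents the 3DAffordSplat dataset and the AffordSplatNet model for affordance reasoning based on 3D Gaussian Splatting (3DGS), improving recognition accuracy and generalization.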

Details

Motivation: Existing methods based on sparse point clouds show limited generalizability and robustness in affordance reasoning, while the high-fidelity rendering of 3DGS remains underused. Method: Presents 3DAffordSplat, the first large-scale 3DGS affordance dataset, and designs AffordSplatNet, which improves recognition accuracy through a cross-modal structure alignment module. Result: Experiments show that 3DAffordSplat advances affordance learning in the 3DGS domain and that AffordSplatNet outperforms existing methods across settings. Conclusion: 3DAffordSplat and AffordSplatNet provide a new benchmark for 3D affordance reasoning and demonstrate the potential of 3DGS. Abstract: 3D affordance reasoning is essential in associating human instructions with the functional regions of 3D objects, facilitating precise, task-oriented manipulations in embodied AI. However, current methods, which predominantly depend on sparse 3D point clouds, exhibit limited generalizability and robustness due to their sensitivity to coordinate variations and the inherent sparsity of the data. By contrast, 3D Gaussian Splatting (3DGS) delivers high-fidelity, real-time rendering with minimal computational overhead by representing scenes as dense, continuous distributions. This positions 3DGS as a highly effective approach for capturing fine-grained affordance details and improving recognition accuracy. Nevertheless, its full potential remains largely untapped due to the absence of large-scale, 3DGS-specific affordance datasets. To overcome these limitations, we present 3DAffordSplat, the first large-scale, multi-modal dataset tailored for 3DGS-based affordance reasoning. This dataset includes 23,677 Gaussian instances, 8,354 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. AffordSplatNet features an innovative cross-modal structure alignment module that exploits structural consistency priors to align 3D point cloud and 3DGS representations, resulting in enhanced affordance recognition accuracy. Extensive experiments demonstrate that the 3DAffordSplat dataset significantly advances affordance learning within the 3DGS domain, while AffordSplatNet consistently outperforms existing methods across both seen and unseen settings, highlighting its robust generalization capabilities.

[88] CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image

Jingshun Huang,Haitao Lin,Tianyu Wang,Yanwei Fu,Xiangyang Xue,Yi Zhu

Main category: cs.CV

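TLDR: This paper proposes CAP-Net, a single-stage network for estimating the 6D poses and sizes of category-level articulated parts. It combines RGB-D features for end-to-end prediction and is validated on the new RGBD-Art dataset.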

Details

Motivation: Existing methods rely on geometric cues and multi-stage pipelines and overlook the dense semantic information in RGB images, which leads to insufficient pose-estimation accuracy for objects with small parts. Method: CAP-Net uses a unified network to predict point-wise class labels, centroid offsets, and NPCS maps, then applies a clustering algorithm to separate the parts and recover their poses and sizes. Result: On the RGBD-Art dataset, CAP-Net significantly outperforms existing methods, and it shows robustness and strong sim-to-real transfer in robotic tasks. Conclusion: CAP-Net and the RGBD-Art dataset provide an efficient and practical solution for category-level articulated object pose estimation. Abstract: This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility. Our dataset, code and pre-trained models are available on the project page.
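The grouping step described in the abstract (points of the same predicted class are clustered by their estimated centroids) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so the greedy radius-based grouping below, the function name `cluster_by_centroid`, and the `radius` parameter are all assumptions made for illustration, not CAP-Net's actual implementation.

```python
import math

def cluster_by_centroid(points, offsets, labels, radius=0.05):
    """Greedy grouping (hypothetical): each point votes for a centroid
    (point + predicted offset); points of the same class whose votes fall
    within `radius` of a cluster's running mean are assigned to that part."""
    clusters = []  # each cluster: {"label", "center", "indices"}
    for i, (p, o, lab) in enumerate(zip(points, offsets, labels)):
        vote = tuple(pc + oc for pc, oc in zip(p, o))
        for c in clusters:
            if c["label"] == lab and math.dist(vote, c["center"]) <= radius:
                n = len(c["indices"])
                # Update the running mean of the voted centroids.
                c["center"] = tuple((cc * n + vc) / (n + 1)
                                    for cc, vc in zip(c["center"], vote))
                c["indices"].append(i)
                break
        else:
            # No compatible cluster found: start a new part instance.
            clusters.append({"label": lab, "center": vote, "indices": [i]})
    return clusters

# Four points of the same class that belong to two separate parts.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
offsets = [(0.05, 0.0, 0.0), (-0.05, 0.0, 0.0), (0.05, 0.0, 0.0), (-0.05, 0.0, 0.0)]
labels = [1, 1, 1, 1]
parts = cluster_by_centroid(points, offsets, labels)
```

In this toy input the class label alone cannot separate the two parts; the voted centroids can, which is why a per-point offset is predicted in addition to the class.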

[89] Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

Elisa Ancarani,Julie Tores,Lucile Sassatelli,Rémy Sun,Hui-Yin Wu,Frédéric Precioso

Main category: cs.CV

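TLDR: This study examines how concept-informed supervision affects multimodal video interpretation models, using the MOByGaze dataset and CMSDs. Models trained on CMSDs outperform traditional training.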

Details

Motivation: To explore how modality-specific annotations affect video model performance and to advance interpretable multimodal learning. Method: Introduces CMSDs (Concept Modality Specific Datasets), data subsets categorized by modality, and compares the performance of early- and late-fusion models. Result: Models trained on CMSDs outperform traditional training under both early and late fusion, and late-fusion models come close to early-fusion performance. Conclusion: Modality-specific annotations are essential for developing robust, interpretable video models and advance multimodal learning in complex video analysis. Abstract: We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.

Xiaoxiao Ma,Junxiong Tong

Main category: cs.CV

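TLDR: Proposes a small target detection method based on multi-modal image fusion and attention mechanisms. It combines YOLOv5 with a convolutional attention module and significantly improves small target detection in complex environments.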

Details

Motivation: Modern warfare increasingly relies on intelligence, so small target detection is critical in military applications, but interference in complex environments makes it challenging. Method: Builds on the YOLOv5 framework, fuses infrared and visible-light data, adds a convolutional attention module, and registers the multi-modal dataset through feature point matching. Result: The method is validated on the anti-UAV and VisDrone datasets, where its detection results for small and dim targets are better than those of existing methods. Conclusion: Through multi-modal fusion and attention mechanisms, the method significantly improves the accuracy and robustness of small target detection and has practical value. Abstract: With the rapid development of information technology, modern warfare increasingly relies on intelligence, making small target detection critical in military applications. The growing demand for efficient, real-time detection has created challenges in identifying small targets in complex environments due to interference. To address this, we propose a small target detection method based on multi-modal image fusion and attention mechanisms. This method leverages YOLOv5, integrating infrared and visible light data along with a convolutional attention module to enhance detection performance. The process begins with multi-modal dataset registration using feature point matching, ensuring accurate network training. By combining infrared and visible light features with attention mechanisms, the model improves detection accuracy and robustness. Experimental results on the anti-UAV and VisDrone datasets demonstrate the effectiveness and practicality of our approach, achieving superior detection results for small and dim targets.

[91] Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning

Juan Garcia Giraldo,Nikolaos Dimitriadis,Ke Wang,Pascal Frossard

Main category: cs.CV

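TLDR: The paper studies merging single-task models for multi-task learning, focusing on the single-input-multiple-outputs (SIMO) setting, and proposes two fixes that re-align the feature representation after merging.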

Details

Motivation: Existing model merging methods perform poorly in multi-task scenarios, especially in the single-input-multiple-outputs (SIMO) setting, where they cause performance degradation. Method: Proposes two simple and efficient methods for re-aligning the feature representation after merging. Result: Experiments show that the approach matches traditional multi-task learning in performance while requiring fewer samples and training steps. Conclusion: The approach offers a computationally efficient and flexible solution for multi-task learning and shows the potential of identifying task relationships offline. Abstract: Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. Prior work has solely focused on constrained multi-task settings where there is a one-to-one mapping between a sample and a task, overlooking the paradigm where multiple tasks may operate on the same sample, e.g., scene understanding. In this paper, we focus on the multi-task setting with single-input-multiple-outputs (SIMO) and show that it qualitatively differs from the single-input-single-output model merging settings studied in the literature due to the existence of task-specific decoders and diverse loss objectives. We identify that existing model merging methods lead to significant performance degradation, primarily due to representation misalignment between the merged encoder and task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging. Compared to joint fine-tuning, our approach is computationally effective and flexible, and sheds light on identifying task relationships in an offline manner. Experiments on NYUv2, Cityscapes, and a subset of the Taskonomy dataset demonstrate: (1) task arithmetic suffices to enable multi-task capabilities; however, the representations generated by the merged encoder have to be re-aligned with the task-specific heads; (2) the proposed architecture rivals traditional multi-task learning in performance but requires fewer samples and training steps by leveraging the existence of task-specific models.
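The abstract relies on task arithmetic for merging the encoder. As a rough illustration, the sketch below applies the standard task-arithmetic rule (pretrained weights plus a scaled sum of per-task weight differences) to toy parameter lists. The function name, the dictionary layout, and the `scale` value are assumptions for illustration; the paper's two re-alignment fixes are not described in enough detail here to sketch and are not shown.

```python
def task_arithmetic(pretrained, finetuned_list, scale=1.0):
    """Merge single-task checkpoints: theta = theta_0 + scale * sum_i (theta_i - theta_0).
    Each checkpoint is a dict mapping a parameter name to a flat list of floats."""
    merged = {}
    for name, base in pretrained.items():
        # Sum of task vectors (fine-tuned minus pretrained) for this parameter.
        delta = [sum(ft[name][k] - base[k] for ft in finetuned_list)
                 for k in range(len(base))]
        merged[name] = [b + scale * d for b, d in zip(base, delta)]
    return merged

theta_0 = {"encoder.w": [1.0, 2.0]}
theta_depth = {"encoder.w": [2.0, 2.0]}   # hypothetical depth-estimation checkpoint
theta_seg = {"encoder.w": [1.0, 4.0]}     # hypothetical segmentation checkpoint
merged = task_arithmetic(theta_0, [theta_depth, theta_seg], scale=0.5)
```

Only the shared encoder is merged this way; in the SIMO setting each task keeps its own decoder, which is where the representation misalignment noted in the abstract arises.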

[92] Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution

Xinning Chai,Yao Zhang,Yuxuan Zhang,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Li Song

Main category: cs.CV

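TLDR: DSCLoRA is a lightweight super-resolution method based on low-rank decomposition and knowledge distillation. It modifies the SPAN network to improve performance without increasing complexity.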

Details

Motivation: For CNNs in super-resolution, performance gains usually require deeper networks and larger feature maps, which increases complexity and inference cost. Method: Proposes DSCLoRA, which combines ConvLoRA with a knowledge distillation strategy, modifies the SPAN network structure, and transfers knowledge using low-rank decomposition and spatial feature affinity. Result: On benchmark datasets, DSCLoRA improves PSNR and SSIM over SPAN, and it ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Conclusion: DSCLoRA significantly improves lightweight model performance without increasing complexity, and is both efficient and competitive. Abstract: Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at https://github.com/Yaozzz666/DSCF-SR.
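The claim that low-rank adaptation adds no inference cost follows from the fact that the low-rank update can be folded back into the original weights. The sketch below shows this for a plain dense matrix using the usual LoRA form W' = W + (alpha / r) * B A. It is an illustration only: ConvLoRA applies the same idea to convolution kernels, and the function names and the `alpha` scaling convention are assumptions, not the DSCLoRA code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Fold a low-rank update into the frozen weight: W' = W + (alpha / rank) * B @ A.
    After merging, inference uses W' alone, so no extra layers or FLOPs remain."""
    scale = alpha / rank
    update = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(weight, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2 x 2)
B = [[1.0], [2.0]]             # trainable factor (2 x 1), rank 1
A = [[3.0, 4.0]]               # trainable factor (1 x 2)
W_merged = merge_lora(W, B, A, alpha=2.0, rank=1)
```

Here the update trains 4 numbers instead of the 4 entries of W, which saves nothing at this toy size; the saving appears when the rank is much smaller than the layer dimensions.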

[93] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Xiang Wang,Shiwei Zhang,Longxiang Tang,Yingya Zhang,Changxin Gao,Yuehuan Wang,Nong Sang

Main category: cs.CV

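TLDR: UniAnimate-DiT builds on the Wan2.1 model and LoRA, using a lightweight pose encoder and a simple concatenation operation to generate high-quality, consistent human image animation.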

Details

Motivation: To retain the strong generative capability of the Wan2.1 model while reducing training memory overhead and improving the visual and temporal consistency of the animation. Method: Fine-tunes a small number of parameters with LoRA, designs a lightweight pose encoder, and integrates reference appearance and pose information through concatenation. Result: Experiments show that the method generates visually realistic, temporally consistent, high-fidelity animations and generalizes from 480P to 720P. Conclusion: UniAnimate-DiT performs well in both efficient training and high-quality animation generation, and the code is open source. Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480P (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720P (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.

[94] Autoregressive Distillation of Diffusion Transformers

Yeongmin Kim,Sotiris Anagnostidis,Yuming Du,Edgar Schönfeld,Jonas Kohler,Markos Georgopoulos,Albert Pumarola,Ali Thabet,Artsiom Sanakoyeu

Main category: cs.CV

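TLDR: The paper proposes AutoRegressive Distillation (ARD), a new method that uses the historical trajectory of the ODE to predict future steps, reducing resource consumption and exposure bias.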

Details

Motivation: Diffusion models excel at generating high-fidelity images, but their iterative sampling process is resource-intensive. Existing distillation methods rely on the most recent denoised sample and are prone to exposure bias. Method: ARD adds a time embedding and a block-wise causal attention mask, uses the historical ODE trajectory to predict future steps, and feeds historical inputs only into the lower transformer layers. Result: On ImageNet and T2I synthesis, ARD achieves a 5x reduction in FID degradation with only 1.1% extra FLOPs, and reaches an FID of 1.84 in 4 steps. Conclusion: By predicting from the historical trajectory, ARD effectively reduces resource consumption and bias and outperforms existing methods. Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256. Moreover, ARD reaches an FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: https://github.com/alsdudrla10/ARD.
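The block-wise causal attention mask mentioned in the abstract can be sketched in a few lines. The sketch assumes that each block holds the tokens of one trajectory step and that blocks are ordered from earliest to latest; the abstract does not state the exact layout used by ARD, so the ordering, the function name, and the boolean convention (True means "may attend") are assumptions.

```python
def block_causal_mask(num_blocks, block_size):
    """Return an n x n boolean mask (n = num_blocks * block_size).
    Token i may attend to token j when j's block is not later than i's block,
    so attention is full within a block and causal across blocks."""
    n = num_blocks * block_size
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]

# Two trajectory steps with two tokens each.
mask = block_causal_mask(num_blocks=2, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Compared with a token-level causal mask, tokens of the same step see each other in both directions, which keeps the within-image attention of the teacher intact while still preventing a step from looking at later steps.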

[95] CFIS-YOLO: A Lightweight Multi-Scale Fusion Network for Edge-Deployable Wood Defect Detection

Jincheng Kang,Yi Cen,Yigang Cen,Ke Wang,Yuhan Liu

Main category: cs.CV

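TLDR: CFIS-YOLO is a lightweight object detection model optimized for edge devices. It targets wood defect detection, where traditional methods are costly and subjective and deep learning models struggle to balance accuracy and efficiency in edge deployment.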

Details

Motivation: Wood defect detection is critical for quality control, but traditional methods are costly and subjective, and mainstream deep learning models struggle to combine accuracy and efficiency on edge devices. Method: Proposes the CFIS-YOLO model, which introduces an enhanced C2f structure, a dynamic feature recombination module, and a new loss function with auxiliary bounding boxes and angular constraints, improving multi-scale feature fusion and small object localization. Result: On a public wood defect dataset, mAP@0.5 reaches 77.5%, 4 percentage points above YOLOv10s; on edge devices the model runs at 135 FPS, power consumption falls to 17.3% of the original implementation, and mAP drops by only 0.5 percentage points. Conclusion: CFIS-YOLO is a practical and effective solution for wood defect detection in resource-constrained environments. Abstract: Wood defect detection is critical for ensuring quality control in the wood processing industry. However, current industrial applications face two major challenges: traditional methods are costly, subjective, and labor-intensive, while mainstream deep learning models often struggle to balance detection accuracy and computational efficiency for edge deployment. To address these issues, this study proposes CFIS-YOLO, a lightweight object detection model optimized for edge devices. The model introduces an enhanced C2f structure, a dynamic feature recombination module, and a novel loss function that incorporates auxiliary bounding boxes and angular constraints. These innovations improve multi-scale feature fusion and small object localization while significantly reducing computational overhead. Evaluated on a public wood defect dataset, CFIS-YOLO achieves a mean Average Precision (mAP@0.5) of 77.5%, outperforming the baseline YOLOv10s by 4 percentage points. On SOPHON BM1684X edge devices, CFIS-YOLO delivers 135 FPS, reduces power consumption to 17.3% of the original implementation, and incurs only a 0.5 percentage point drop in mAP. These results demonstrate that CFIS-YOLO is a practical and effective solution for real-world wood defect detection in resource-constrained environments.
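The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. A minimal sketch of that matching criterion follows; the `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration, and the full mAP computation (precision-recall curves averaged over classes) is not shown.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, truth, threshold=0.5):
    """A prediction counts as a true positive at mAP@0.5 when IoU >= 0.5."""
    return iou(pred, truth) >= threshold

truth = (0.0, 0.0, 2.0, 2.0)
shifted = (1.0, 0.0, 3.0, 2.0)    # overlaps half of the ground-truth box
half = (0.0, 0.0, 2.0, 1.0)       # covers exactly half of the ground-truth box
```

For small defects a shift of a few pixels changes the IoU sharply, which is one reason small object localization is singled out in the abstract.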

[96] Context-Aware Palmprint Recognition via a Relative Similarity Metric

Trinnhallen Brisley,Aryan Gandhi,Joseph Magen

Main category: cs.CV

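TLDR: Proposes a new matching mechanism for palmprint recognition that introduces a Relative Similarity Metric (RSM) to improve the robustness and discriminability of existing matching frameworks.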

Details

Motivation: Conventional systems rely on direct pairwise similarity measures (such as cosine or Euclidean distance), which cannot capture how a pairwise similarity behaves relative to the whole dataset. Method: Evaluates the relative consistency of similarity scores across all identities to better suppress false positives and false negatives. Result: Applied on top of the CCNet architecture, the method achieves an Equal Error Rate (EER) of 0.000036% on the Tongji dataset, better than previous methods. Conclusion: Introducing relational structure into the palmprint matching process is highly effective. Abstract: We propose a new matching mechanism for palmprint recognition by introducing a Relative Similarity Metric (RSM) that enhances the robustness and discriminability of existing matching frameworks. While conventional systems rely on direct pairwise similarity measures, such as cosine or Euclidean distances, these metrics fail to capture how a pairwise similarity compares within the context of the entire dataset. Our method addresses this by evaluating the relative consistency of similarity scores across up to all identities, allowing for better suppression of false positives and negatives. Applied atop the CCNet architecture, our method achieves a new state-of-the-art 0.000036% Equal Error Rate (EER) on the Tongji dataset, outperforming previous methods and demonstrating the efficacy of incorporating relational structure into the palmprint matching process.

[97] Uncertainty Estimation for Trust Attribution to Speed-of-Sound Reconstruction with Variational Networks

Sonia Laguna,Lin Zhang,Can Deniz Bezek,Monika Farkas,Dieter Schweizer,Rahel A. Kubik-Huch,Orcun Goksel

Main category: cs.CV

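TLDR: The paper proposes an uncertainty-estimation-based method for selecting the most trustworthy frame among ultrasound acquisitions, improving the diagnostic accuracy of speed-of-sound (SoS) imaging.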

Details

Motivation: Speed-of-sound imaging is a potential diagnostic biomarker, but data frames can be corrupted by noise, which degrades reconstruction quality. Selecting trustworthy frames through uncertainty estimation can improve diagnostic decisions. Method: Estimates uncertainty with Monte Carlo Dropout and Bayesian Variational Inference, and automatically selects the most trustworthy data frame on that basis. Result: For breast cancer diagnosis, uncertainty-based frame selection (AUC 76% to 80%) outperforms uncertainty-uninformed baselines (best AUC 64%). Conclusion: Uncertainty estimation can be used to select among multiple data acquisitions and improves diagnostic accuracy. Abstract: Speed-of-sound (SoS) is a biomechanical characteristic of tissue, and its imaging can provide a promising biomarker for diagnosis. Reconstructing SoS images from ultrasound acquisitions can be cast as a limited-angle computed-tomography problem, with Variational Networks being a promising model-based deep learning solution. Some acquired data frames may, however, get corrupted by noise due to, e.g., motion, lack of contact, and acoustic shadows, which in turn negatively affects the resulting SoS reconstructions. We propose to use the uncertainty in SoS reconstructions to attribute trust to each individual acquired frame. Given multiple acquisitions, we then use an uncertainty-based automatic selection among these retrospectively, to improve diagnostic decisions. We investigate uncertainty estimation based on Monte Carlo Dropout and Bayesian Variational Inference. We assess our automatic frame selection method for differential diagnosis of breast cancer, distinguishing between benign fibroadenoma and malignant carcinoma. We evaluate 21 lesions classified as BI-RADS 4, which represents suspicious cases for probable malignancy. The most trustworthy frame among four acquisitions of each lesion was identified using uncertainty-based criteria. Selecting a frame informed by uncertainty achieved an area under curve of 76% and 80% for Monte Carlo Dropout and Bayesian Variational Inference, respectively, superior to any uncertainty-uninformed baselines with the best one achieving 64%. A novel use of uncertainty estimation is proposed for selecting one of multiple data acquisitions for further processing and decision making.
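One way to picture uncertainty-based frame selection with Monte Carlo Dropout: run several stochastic reconstructions per frame, measure how much they disagree, and keep the frame with the least disagreement. The sketch below uses the mean per-pixel variance as the uncertainty score. That criterion, the function names, and the toy values are assumptions; the abstract only says "uncertainty-based criteria" without defining them.

```python
from statistics import fmean, pvariance

def frame_uncertainty(samples):
    """`samples` is a list of T stochastic reconstructions of one frame,
    each a flat list of pixel values (e.g. SoS in m/s).
    Returns the mean over pixels of the variance across the T samples."""
    per_pixel_variance = [pvariance(values) for values in zip(*samples)]
    return fmean(per_pixel_variance)

def select_most_trusted(frames):
    """Pick the frame index whose reconstructions disagree the least."""
    scores = [frame_uncertainty(samples) for samples in frames]
    best = min(range(len(frames)), key=scores.__getitem__)
    return best, scores

# Two acquisitions of one lesion, two dropout samples each, two pixels each.
frames = [
    [[1500.0, 1510.0], [1520.0, 1490.0]],   # samples disagree strongly
    [[1500.0, 1500.0], [1501.0, 1499.0]],   # samples nearly agree
]
best, scores = select_most_trusted(frames)
```

In practice the T samples would come from repeated forward passes of the reconstruction network with dropout kept active at inference time.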

[98] Big Brother is Watching: Proactive Deepfake Detection via Learnable Hidden Face

Hongbo Li,Shangchao Yang,Ruiyang Xia,Lin Yuan,Xinbo Gao

Main category: cs.CV

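TLDR: This paper proposes a proactive defense method based on a learnable watermark: a semi-fragile invertible steganography network embeds a secret template into the image to detect malicious tampering.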

Details

Motivation: As deepfake technology advances, passive detection methods struggle to cope with diverse forgery manipulations and datasets, so proactive defense techniques need to be brought in. Method: Embeds an optimized secret template with a semi-fragile invertible steganography network, and builds the detector with a self-blending mechanism and a robustness learning strategy. Result: Experiments on multiple datasets show that the method outperforms existing passive and proactive detection methods. Conclusion: The method combines the strengths of proactive defense and passive detection and improves the robustness and accuracy of deepfake detection. Abstract: As deepfake technologies continue to advance, passive detection methods struggle to generalize across various forgery manipulations and datasets. Proactive defense techniques have been actively studied with the primary aim of preventing deepfake operations from working effectively. In this paper, we aim to bridge the gap between passive detection and proactive defense, and seek to solve the detection problem utilizing a proactive methodology. Inspired by several watermarking-based forensic methods, we explore a novel detection framework based on the concept of "hiding a learnable face within a face". Specifically, relying on a semi-fragile invertible steganography network, a secret template image is embedded into a host image imperceptibly, acting as an indicator monitoring for any malicious image forgery when being restored by the inverse steganography process. Instead of being manually specified, the secret template is optimized during training to resemble a neutral facial appearance, just like a "big brother" hidden in the image to be protected. By incorporating a self-blending mechanism and robustness learning strategy with a simulative transmission channel, a robust detector is built to accurately distinguish whether the steganographic image is maliciously tampered or benignly processed. Finally, extensive experiments conducted on multiple datasets demonstrate the superiority of the proposed approach over competing passive and proactive detection methods.

[99] Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projection

Dayong Liu,Qingrui Zhang,Zeyang Meng

Main category: cs.CV

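TLDR: Proposes a multi-target tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection, using image enhancement and multi-frame association to achieve accurate tracking.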

Details

Motivation: To address the complex problem of target association in multi-target tracking and improve an intelligent driving vehicle's ability to track and detect multiple targets ahead in real time. Method: Enhances images with the Retinex algorithm, builds a detection model on YOLOv5, and uses point cloud 3D projection to infer how target positions are associated across frames. Result: Experiments report a MOTA value greater than 30, which the authors take as evidence of superior tracking and detection performance. Conclusion: The method achieves stable tracking and detection of multiple targets in front of an intelligent driving vehicle and has practical value. Abstract: In multi-target tracking and detection tasks, it is necessary to continuously track multiple targets, such as vehicles, pedestrians, etc. To achieve this goal, the system must be able to continuously acquire and process image frames containing these targets. These consecutive frame images enable the algorithm to update the position and state of the target in real-time in each frame of the image. How to accurately associate the detected target with the target in the previous or next frame to form a stable trajectory is a complex problem. Therefore, a multi-object tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection is proposed. The Retinex algorithm is used to enhance the image of the environment in front of the vehicle and remove light interference, and an intelligent detection model is built on the YOLOv5 network structure. The enhanced image is input into the model, and multiple targets in front of the vehicle are identified through feature extraction and target localization. By combining point cloud 3D projection technology, the correlation between the position changes of adjacent frame images in the projection coordinate system can be inferred. By sequentially projecting the multi-target recognition results of multiple consecutive frame images into the 3D laser point cloud environment, effective tracking of the motion trajectories of all targets in front of the vehicle can be achieved. The experimental results show that the application of this method for intelligent driving vehicle front multi-target tracking and detection yields a MOTA (Multiple Object Tracking Accuracy) value greater than 30, demonstrating its superior tracking and detection performance.
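For readers unfamiliar with the metric, MOTA is conventionally defined as one minus the ratio of all tracking errors (missed targets, false positives, identity switches) to the number of ground-truth objects, summed over frames. The sketch below implements that standard definition; the counts in the example are made up for illustration, and reading the reported "greater than 30" as a percentage is an assumption, since the abstract gives no unit.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy:
    MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames.
    The value can be negative when errors outnumber ground-truth objects."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

# Hypothetical counts for a short sequence.
score = mota(false_negatives=20, false_positives=10, id_switches=5,
             num_ground_truth=100)
print(f"MOTA = {100 * score:.1f}%")
```

Because identity switches enter the numerator, a detector with perfect per-frame detections can still lose MOTA if the frame-to-frame association is unstable, which is the problem the abstract's point cloud projection step targets.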

[100] PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

Henghui Ding,Chang Liu,Nikhila Ravi,Shuting He,Yunchao Wei,Song Bai,Philip Torr,Kehuan Song,Xinglin Xie,Kexin Zhang,Licheng Jiao,Lingling Li,Shuyuan Yang,Xuqiang Cao,Linnan Zhao,Jiaxuan Zhao,Fang Liu,Mengjiao Wang,Junpei Zhang,Xu Liu,Yuting Yang,Mengru Ma,Hao Fang,Runmin Cong,Xiankai Lu,Zhiyang Che,Wei Zhan,Tianming Liang,Haichao Jiang,Wei-Shi Zheng,Jian-Fang Hu,Haobo Yuan,Xiangtai Li,Tao Zhang,Lu Qi,Ming-Hsuan Yang

Main category: cs.CV

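TLDR: This report summarizes the PVUW Challenge held at CVPR 2025, covering the outcomes, methods, and future research directions of its two tracks (MOSE and MeViS).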

Details

Motivation: To advance research on complex video segmentation through a challenge and to introduce datasets that better reflect real-world scenarios. Method: Organizes two tracks, MOSE (complex scene video object segmentation) and MeViS (motion-guided, language-based video segmentation), and provides new datasets. Result: The challenge offers a detailed evaluation and analysis of the current state of the art and emerging trends. Conclusion: The PVUW Challenge provides important insights and future directions for research on complex video segmentation. Abstract: This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.

[101] Seedream 3.0 Technical Report

Yu Gao,Lixue Gong,Qiushan Guo,Xiaoxia Hou,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xuanda Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Xin Xia,Xuefeng Xiao,Zhonghua Zhai,Xinyu Zhang,Qi Zhang,Yuwei Zhang,Shijia Zhao,Jianchao Yang,Weilin Huang

Main category: cs.CV

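TLDR: Seedream 3.0 is a high-performance Chinese-English bilingual image generation foundation model. Its technical improvements address several problems in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited resolution.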

Details

Motivation: To address the shortcomings of Seedream 2.0 in complicated prompt alignment, typography generation, visual aesthetics, and resolution, and to improve model performance. Method: At the data level, uses defect-aware training and a dual-axis collaborative data-sampling framework; in pre-training, uses mixed-resolution training, cross-modality RoPE, and related techniques; in post-training, uses diversified aesthetic captions and a VLM-based reward model. Result: Seedream 3.0 clearly outperforms Seedream 2.0 in rendering complicated Chinese text and in high-resolution output (up to 2K), while achieving a 4 to 8 times speedup. Conclusion: Through improvements across the whole pipeline, Seedream 3.0 significantly improves image generation, especially for Chinese typography and high-resolution output. Abstract: We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that align well with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters, which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.

[102] DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

Soyoung Yoo,Namwoo Kang

Main category: cs.CV

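TLDR: The paper proposes a generative-AI-based framework for producing synthetic design-performance datasets, filling the data gap in vehicle wheel design. The resulting DeepWheel dataset contains over 6,000 photo-realistic images and 900 3D models.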

Details

Motivation: Vehicle wheel design lacks large-scale, high-quality datasets, which limits the application of data-driven design. Method: Generates 2D rendered images with Stable Diffusion, reconstructs 3D geometry through 2.5D depth estimation, runs structural simulations to extract performance data, and applies topology optimization to expand the design space. Result: Produces the DeepWheel dataset, with over 6,000 images and 900 3D models, which supports surrogate model training and data-driven design. Conclusion: The framework provides data support for complex design domains, and the dataset is released for non-commercial use. Abstract: Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is available at https://www.smartdesignlab.org/datasets

[103] Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A systematic literature review

Yuezhe Yang,Boyu Yang,Yaqian Wang,Yang He,Xingbo Dong,Zhe Jin

Main category: cs.CV

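TLDR: This paper reviews AI-based 3D reconstruction algorithms for radiological imaging, divides them into explicit and implicit methods, and discusses evaluation metrics, datasets, the current state of development, and future directions.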

Details

Motivation: To improve the accuracy and efficiency of 3D reconstruction in radiological imaging, reduce patient radiation exposure and discomfort, and support clinical diagnosis. Method: Divides algorithms into explicit methods (point-based, volume-based, and Gaussian representations) and implicit methods (implicit prior embedding and neural radiance fields), and analyzes evaluation metrics and datasets. Result: Summarizes current progress of AI in 3D reconstruction and proposes a classification framework of explicit and implicit methods. Conclusion: AI has great potential for 3D reconstruction in radiological imaging, but key challenges remain, and future research should focus on algorithm optimization and broader applications. Abstract: The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project is available at: https://github.com/Bean-Young/AI4Med.

[104] A Decade of Wheat Mapping for Lebanon

Hasan Wehbi,Hasan Nasrallah,Mohamad Hasan Zahweh,Zeinab Takach,Veera Ganesh Yalla,Ali J. Ghandour

Main category: cs.CV

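TLDR: This paper proposes an improved wheat field segmentation method that combines a temporal-spatial vision transformer with parameter-efficient fine-tuning for accurate mapping of wheat fields from satellite images, and validates it through a case study.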

Details

Motivation: Wheat is a vital component of global food security, and accurate maps of wheat fields are essential for policy making, resource allocation, and supply chain management. Method: Combines a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine-Tuning (PEFT) and adds a post-processing pipeline based on the FTW framework, addressing the tendency of existing methods to merge small parcels into one large field. Result: Experiments show improved boundary delineation and field-level precision, making the method suitable for agricultural monitoring and historical trend analysis. Conclusion: The method lays the foundation for accurate wheat field mapping and supports future work such as crop monitoring and yield estimation. Abstract: Wheat accounts for approximately 20% of the world's caloric intake, making it a vital component of global food security. Given this importance, mapping wheat fields plays a crucial role in enabling various stakeholders, including policy makers, researchers, and agricultural organizations, to make informed decisions regarding food security, supply chain management, and resource allocation. In this paper, we tackle the problem of accurately mapping wheat fields from satellite images by introducing an improved pipeline for winter wheat segmentation, as well as presenting a case study on a decade-long analysis of wheat mapping in Lebanon. We integrate a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine-Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. Our proposed pipeline addresses key challenges encountered in existing approaches, such as the clustering of small agricultural parcels into a single large field. By merging wheat segmentation with precise field boundary extraction, our method produces geometrically coherent and semantically rich maps that enable us to perform in-depth analysis such as tracking crop rotation patterns over the years. Extensive evaluations demonstrate improved boundary delineation and field-level precision, establishing the potential of the proposed framework in operational agricultural monitoring and historical trend analysis. By allowing for accurate mapping of wheat fields, this work lays the foundation for a range of critical studies and future advances, including crop monitoring and yield estimation.

[105] From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation

Jingkun Chen,Haoran Duan,Xiao Zhang,Boyan Gao,Tao Tan,Vicente Grau,Jungong Han

Main category: cs.CV

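TLDR: Proposes a teacher-student framework that combines clinician gaze data with a vision-language model for medical image segmentation, improving segmentation performance while preserving clinical interpretability.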

Details

Motivation: 医学图像分割需要大量像素级标注，成本高。医生注视数据和视觉语言模型各有局限性，但互补性强。 Method: 教师模型从注视点和VLM生成的描述中学习，指导学生模型通过多尺度特征对齐、置信加权一致性约束和自适应掩码。 Result: 在Kvasir-SEG、NCI-ISBI和ISIC数据集上，Dice分数分别达到80.78%、80.53%和84.22%，比基线提升3-5%。 Conclusion: 结合人类视觉注意力和AI生成的语义上下文，能有效克服单一弱监督信号的局限性，推动高效标注的医学AI系统发展。 Abstract: Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively-improving 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. 
This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
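
The Dice scores reported above measure the overlap between a predicted mask and the ground truth. As a reminder of what the metric computes, here is a minimal sketch for flat binary masks; the function name `dice_score`, the list-based mask format, and the choice to score two empty masks as 1.0 are illustrative assumptions, not details taken from the paper or its repository.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 6-pixel masks: 2 overlapping pixels, 3 foreground pixels each.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 4))  # 0.6667
```

A reported Dice of 80.78% corresponds to a value of 0.8078 from a function like this, typically averaged over the test images.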

[106] Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model

Liu Yang,Huiyu Duan,Yucheng Zhu,Xiaohong Liu,Lu Liu,Zitong Xu,Guangji Ma,Xiongkuo Min,Guangtao Zhai,Patrick Le Callet

Main category: cs.CV

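TLDR: The paper introduces the Any2Omni dataset and the Omni² model for generating and editing 360° omnidirectional images, addressing the shortcomings of existing methods on ODIs.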

Details

Motivation: Because of the unique format and wide field of view of 360° omnidirectional images (ODIs), existing 2D image generation and editing methods fall short, so a dedicated dataset and model are needed. Method: Builds the Any2Omni dataset with 60,000+ training samples and proposes the Omni² model, which supports a range of ODI generation and editing tasks. Result: Experiments show that Omni² is effective and superior on ODI generation and editing tasks. Conclusion: The Any2Omni dataset and the Omni² model provide a comprehensive solution for ODI generation and editing. Abstract: $360^{\circ}$ omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360$^{\circ}$ Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, comprising 60,000+ training data covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni$^2$), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks.

[107] Multi-level Cellular Automata for FLIM networks

Felipe Crispim Salvagnini,Jancarlo F. Gomes,Cid A. N. Santos,Silvio Jamil F. Guimarães,Alexandre X. Falcão

Main category: cs.CV

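TLDR: The paper proposes combining a FLIM encoder with multi-level cellular automata (CA) for salient object detection in resource-constrained settings, reducing parameter requirements and needing no backpropagation.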

Details

Motivation: Addresses the need of deep-learning salient object detection for abundant annotated data and complex network architectures, especially in resource-constrained medical applications. Method: Combines a FLIM encoder with an adaptive decoder, uses expert knowledge to initialize CA states, and builds a multi-level CA framework. Result: Tests on medical datasets show performance comparable to existing deep SOD models. Conclusion: The method offers an efficient and practical salient object detection solution for resource-constrained settings. Abstract: The necessity of abundant annotated data and complex network architectures presents a significant challenge in deep-learning Salient Object Detection (deep SOD) and across the broader deep-learning landscape. This challenge is particularly acute in medical applications in developing countries with limited computational resources. Combining modern and classical techniques offers a path to maintaining competitive performance while enabling practical applications. Feature Learning from Image Markers (FLIM) methodology empowers experts to design convolutional encoders through user-drawn markers, with filters learned directly from these annotations. Recent findings demonstrate that coupling a FLIM encoder with an adaptive decoder creates a flyweight network suitable for SOD, requiring significantly fewer parameters than lightweight models and eliminating the need for backpropagation. Cellular Automata (CA) methods have proven successful in data-scarce scenarios but require proper initialization -- typically through user input, priors, or randomness. We propose a practical intersection of these approaches: using FLIM networks to initialize CA states with expert knowledge without requiring user interaction for each image. By decoding features from each level of a FLIM network, we can initialize multiple CAs simultaneously, creating a multi-level framework. Our method leverages the hierarchical knowledge encoded across different network layers, merging multiple saliency maps into a high-quality final output that functions as a CA ensemble. Benchmarks across two challenging medical datasets demonstrate the competitiveness of our multi-level CA approach compared to established models in the deep SOD literature.

[108] Robustness and sex differences in skin cancer detection: logistic regression vs CNNs

Nikolette Pedersen,Regitze Sydendal,Andreas Wulff,Ralf Raumanns,Eike Petersen,Veronika Cheplygina

Main category: cs.CV

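TLDR: The study replicates the methodology of an Alzheimer's disease study to examine sex bias in skin cancer detection, finding that the CNN is significantly more accurate for male patients than for female patients.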

Details

Motivation: Although deep learning performs well in skin cancer detection, questions about reproducibility and bias remain. This study explores how sex bias affects model performance. Method: Using the PAD-UFES-20 dataset, trains a logistic regression (LR) model on handcrafted features and a pre-trained ResNet-50 model, and evaluates their robustness across training sets with varied sex composition. Result: Both LR and the CNN were robust to the sex distribution, but the CNN had significantly higher accuracy and AUROC for male patients than for female patients. Conclusion: The study reveals sex bias in CNN-based skin cancer detection and offers a new perspective for research on potential bias in medical machine learning. Abstract: Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a study on Alzheimer's disease [28] which studied robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28]: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but the results also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. We hope these findings contribute to the growing field of investigating potential bias in popular medical machine learning methods. The data and relevant scripts to reproduce our results can be found in our GitHub.

[109] Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps

Panagiotis Agrafiotis,Begüm Demir

Main category: cs.CV

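TLDR: The paper proposes Swin-BathyUNet, a method combining SfM-MVS and deep learning, to address data gaps and noise in shallow-water bathymetry, improving bathymetric accuracy and coverage.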

Details

Motivation: Shallow-water bathymetry data are vital for monitoring climatic and anthropogenic pressures, but existing methods such as SDB and SfM-MVS suffer from data gaps, noise, and high cost. Method: Combines the high-fidelity 3D reconstruction of SfM-MVS with the spectral analysis capabilities of deep learning, proposing the Swin-BathyUNet model, which uses U-Net and Swin Transformer to improve bathymetric accuracy. Result: In experiments in the Mediterranean and Baltic Seas, the method markedly improves bathymetric accuracy, detail, coverage, and noise suppression. Conclusion: Swin-BathyUNet offers an efficient and accurate solution for shallow-water bathymetry and can also be applied on its own to standard SDB tasks. Abstract: Accurate, detailed, and high-frequency bathymetry is crucial for shallow seabed areas facing intense climatological and anthropogenic pressures. Current methods utilizing airborne or satellite optical imagery to derive bathymetry primarily rely on either SfM-MVS with refraction correction or Spectrally Derived Bathymetry (SDB). However, SDB methods often require extensive manual fieldwork or costly reference data, while SfM-MVS approaches face challenges even after refraction correction. These include depth data gaps and noise in environments with homogeneous visual textures, which hinder the creation of accurate and complete Digital Surface Models (DSMs) of the seabed. To address these challenges, this work introduces a methodology that combines the high-fidelity 3D reconstruction capabilities of the SfM-MVS methods with state-of-the-art refraction correction techniques, along with the spectral analysis capabilities of a new deep learning-based method for bathymetry prediction. This integration enables a synergistic approach where SfM-MVS derived DSMs with data gaps are used as training data to generate complete bathymetric maps. In this context, we propose Swin-BathyUNet that combines U-Net with Swin Transformer self-attention layers and a cross-attention mechanism, specifically tailored for SDB. Swin-BathyUNet is designed to improve bathymetric accuracy by capturing long-range spatial relationships and can also function as a standalone solution for standard SDB with various training depth data, independent of the SfM-MVS output.
Experimental results at two completely different test sites in the Mediterranean and Baltic Seas demonstrate the effectiveness of the proposed approach, with improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM. The code is available at https://github.com/pagraf/Swin-BathyUNet.

[110] Leveraging Point Transformers for Detecting Anatomical Landmarks in Digital Dentistry

Tibor Kubík,Oldřich Kodym,Petr Šilling,Kateřina Trávníčková,Tomáš Mojžiš,Jan Matula

Main category: cs.CV

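TLDR: The paper explores automatically detecting key landmarks in intraoral scans using point cloud learning and transformer architectures, proposes a method based on Point Transformer v3, and reports promising results.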

Details

Motivation: As intraoral scanning devices become widespread, demand for automatic detection of key landmarks (such as cusps and tooth-gingiva boundaries) is growing, but challenges include small datasets and large anatomical variability. Method: Uses a Point Transformer v3 inspired module to extract geometric and anatomical features, combined with a lightweight decoder and graph-based non-minima suppression to predict landmarks. Result: Experiments show the method performs well in the 3DTeethLand challenge and provides insights into feature interpretability. Conclusion: The proposed method shows potential for automatic dental landmark detection and can be further optimized and extended in future work. Abstract: The increasing availability of intraoral scanning devices has heightened their importance in modern clinical orthodontics. Clinicians utilize advanced Computer-Aided Design techniques to create patient-specific treatment plans that include laboriously identifying crucial landmarks such as cusps, mesial-distal locations, facial axis points, and tooth-gingiva boundaries. Detecting such landmarks automatically presents challenges, including limited dataset sizes, significant anatomical variability among subjects, and the geometric nature of the data. We present our experiments from the 3DTeethLand Grand Challenge at MICCAI 2024. Our method leverages recent advancements in point cloud learning through transformer architectures. We designed a Point Transformer v3 inspired module to capture meaningful geometric and anatomical features, which are processed by a lightweight decoder to predict per-point distances, further processed by graph-based non-minima suppression. We report promising results and discuss insights on learned feature interpretability.
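
The abstract does not describe how the graph-based non-minima suppression works, so the following is only a sketch of one plausible reading: each point carries a predicted distance to its nearest landmark, points are connected when they lie within a radius of each other, and a point is kept as a landmark candidate when its predicted distance is below a threshold and smaller than that of every graph neighbour. The function names, the radius graph, and the threshold are all assumptions made for illustration.

```python
import math


def radius_graph(points, radius):
    """Connect every pair of points closer than `radius` (brute force, O(n^2))."""
    neighbors = [[] for _ in points]
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def non_minima_suppression(distances, neighbors, max_distance):
    """Keep indices that are local minima of `distances` on the graph."""
    keep = []
    for i, d in enumerate(distances):
        if d > max_distance:
            continue  # too far from any landmark to be a candidate
        # Ties are broken by index so that a plateau yields a single candidate.
        if all(d < distances[j] or (d == distances[j] and i < j)
               for j in neighbors[i]):
            keep.append(i)
    return keep


# Hypothetical data: 7 points on a line with made-up per-point predictions.
points = [(float(x), 0.0, 0.0) for x in range(7)]
predicted = [0.9, 0.4, 0.1, 0.5, 0.8, 0.2, 0.6]
graph = radius_graph(points, radius=1.5)
print(non_minima_suppression(predicted, graph, max_distance=0.3))  # [2, 5]
```

On a real intraoral mesh the graph would more naturally come from mesh connectivity or a k-nearest-neighbour search than from the brute-force radius query used here.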

[111] ADT: Tuning Diffusion Models with Adversarial Supervision

Dazhong Shen,Guanglu Song,Yi Zhang,Bingqi Ma,Lujundong Li,Dongzhi Jiang,Zhuofan Zong,Yu Liu

Main category: cs.CV

TLDR: ADT tunes the inference process of diffusion models with adversarial supervision, improving distribution alignment and image quality.

Details

Motivation: The mismatch between training and inference causes distribution-alignment problems, which ADT aims to solve. Method: Proposes the ADT framework, which combines adversarial supervision with a siamese-network discriminator to optimize the inference process. Result: Significantly improves distribution alignment and image quality on several Stable Diffusion models. Conclusion: ADT is an effective and intuitive fine-tuning framework for optimizing diffusion models. Abstract: Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and error accumulation. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by simulating the inference process during optimization and aligning the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.

[112] NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Yanrui Bin,Wenbo Hu,Haoyuan Wang,Xinya Chen,Bing Wang

Main category: cs.CV

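TLDR: The paper proposes NormalCrafter, which leverages the temporal priors of video diffusion models and, through Semantic Feature Regularization (SFR) and a two-stage training protocol, achieves high-fidelity, temporally consistent surface normal estimation in video.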

Details

Motivation: Existing methods perform well on static images, but surface normal estimation that stays temporally consistent in video remains challenging. Method: Proposes NormalCrafter, which combines Semantic Feature Regularization (SFR) with a two-stage training protocol and leverages the temporal priors of diffusion models. Result: The method produces richly detailed, temporally consistent normal sequences on diverse videos and performs strongly. Conclusion: Through temporal priors and semantic alignment, NormalCrafter markedly improves the temporal consistency and detail preservation of video surface normal estimation. Abstract: Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.

[113] Enhancing Out-of-Distribution Detection with Extended Logit Normalization

Yifan Ding,Xixi Liu,Jonas Unger,Gabriel Eilertsen

Main category: cs.CV

TLDR: The paper proposes an improved logit normalization method (ELogitNorm) to enhance the generalizability and robustness of OOD detection.

Details

Motivation: Existing OOD detection methods are usually designed for specific post-hoc techniques and generalize poorly. LogitNorm is ineffective with some post-hoc methods. Method: Proposes ELogitNorm, which adds feature distance-awareness to improve LogitNorm's OOD separability and ID confidence calibration. Result: Experiments show ELogitNorm outperforms existing methods in OOD detection while maintaining ID classification accuracy. Conclusion: ELogitNorm is a hyperparameter-free method that markedly improves the generalizability and robustness of OOD detection. Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Recent advances have explored improved classification losses and representation learning strategies to enhance OOD detection. However, these methods are often tailored to specific post-hoc detection techniques, limiting their generalizability. In this work, we identify a critical issue in Logit Normalization (LogitNorm), which inhibits its effectiveness in improving certain post-hoc OOD detection methods. To address this, we propose Extended Logit Normalization (ELogitNorm), a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. By incorporating feature distance-awareness into LogitNorm, ELogitNorm shows more robust OOD separability and in-distribution (ID) confidence calibration than its predecessor. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.
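
For context, the predecessor method can be sketched in a few lines. The code below shows the original LogitNorm loss as it is commonly formulated, that is, cross-entropy on logits divided by their L2 norm and a temperature. It is not ELogitNorm: the abstract does not give the extended formulation, so none of it is reproduced here. The function name and the temperature value are illustrative assumptions.

```python
import math


def logitnorm_cross_entropy(logits, label, temperature=0.04, eps=1e-7):
    """Cross-entropy on L2-normalized logits (sketch of the LogitNorm baseline).

    `temperature` is a hyperparameter of LogitNorm; 0.04 is only an
    illustrative value here.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * temperature) for z in logits]
    # Numerically stable log-sum-exp.
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum_exp - scaled[label]


# Scaling the logits by 10 leaves the loss essentially unchanged, so the
# network cannot lower the loss just by inflating logit magnitudes.
small = logitnorm_cross_entropy([2.0, 1.0, -1.0], label=0)
large = logitnorm_cross_entropy([20.0, 10.0, -10.0], label=0)
print(abs(small - large) < 1e-4)  # True
```

The temperature in this baseline is exactly the kind of hyperparameter that the abstract says ELogitNorm avoids.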

[114] Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion

An Zhao,Shengyuan Zhang,Ling Yang,Zejian Li,Jiale Wu,Haoran Xu,AnYang Wei,Perry Pengyun GU,Lingyun Sun

Main category: cs.CV

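TLDR: Proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion that uses preference alignment to improve performance and accelerate sampling.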

Details

Motivation: Diffusion models are limited in 3D LiDAR scene completion by slow sampling; existing approaches such as score distillation accelerate sampling but degrade performance, while DPO improves performance but requires preference data. Method: The student model generates paired completion scenes from different initial noises; LiDAR scene evaluation metrics serve as the preference for constructing winning and losing sample pairs; training optimizes the difference in score functions between the teacher and student models on the paired scenes. Result: Compared with existing methods, Distillation-DPO improves completion quality while speeding up completion by more than 5x. Conclusion: The method is the first to introduce preference learning into distillation, offering a new approach to preference-aligned distillation; the code is open source. Abstract: The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion's slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct preference optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference alignment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable, so they cannot be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. To the best of our knowledge, our method is the first to explore preference learning in distillation, and it provides insights into preference-aligned distillation. Our code is publicly available at https://github.com/happyw1nd/DistillationDPO.

[115] PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond

Minghua Liu,Mikaela Angelina Uy,Donglai Xiang,Hao Su,Sanja Fidler,Nicholas Sharp,Jun Gao

Main category: cs.CV

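TLDR: PartField is a feedforward method for learning part-based 3D features that does not rely on predefined templates or text-based names and applies to open-world 3D shapes across modalities.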

Details

Motivation: Existing methods fall short in runtime and robustness and usually rely on predefined templates or text-based names, which limits their generality. Method: Distills 2D and 3D part proposals from labeled datasets and from image segmentations on large unsupervised datasets via contrastive learning, training the model to produce a continuous feature field. Result: PartField is up to 20% more accurate than other methods and often orders of magnitude faster. Conclusion: Beyond single-shape part decomposition, PartField yields consistency across shapes and enables tasks such as co-segmentation and correspondence. Abstract: We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields. Project page: https://research.nvidia.com/labs/toronto-ai/partfield-release/

[116] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

Junke Wang,Zhi Tian,Xun Wang,Xinyu Zhang,Weilin Huang,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

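TLDR: SimpleAR is a simple autoregressive visual generation framework that, without complex architecture modifications, achieves high-resolution image generation and competitive benchmark results through training and inference optimization.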

Details

Motivation: Explores the potential of autoregressive visual generation and, by optimizing training and inference techniques, demonstrates its competitiveness in high-resolution image generation and text-to-image tasks. Method: Uses an autoregressive framework with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training, plus inference acceleration techniques such as vLLM. Result: With only 0.5B parameters, the model generates high-fidelity 1024x1024 images and performs well on the GenEval and DPG benchmarks; inference time is reduced to around 14 seconds. Conclusion: SimpleAR demonstrates the potential of autoregressive visual generation, and the code is open-sourced to encourage further research. Abstract: This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.

[117] Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Ziqi Pang,Xin Xu,Yu-Xiong Wang

Main category: cs.CV

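TLDR: Repurposing generative diffusion models for discriminative tasks exposes critical gaps; this paper analyzes and strengthens the alignment between generation and perception tasks and proposes improvements that markedly boost performance.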

Details

Motivation: Generative models underperform on discriminative tasks because they tolerate intermediate sampling errors; the paper aims to close this gap. Method: Analyzes how perception quality changes during denoising and proposes tailored learning objectives and data augmentation. Result: The improved models achieve state-of-the-art performance on depth estimation, referring image segmentation, and other tasks. Conclusion: Generative diffusion models can markedly improve on discriminative tasks through better alignment, without architectural changes. Abstract: With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.

cs.CL [Back]

[118] LayerFlow: Layer-wise Exploration of LLM Embeddings using Uncertainty-aware Interlinked Projections

Rita Sevastjanova,Robin Gerling,Thilo Spinner,Mennatallah El-Assady

Main category: cs.CL

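TLDR: LayerFlow is a visual analytics tool that shows the uncertainty in language model embeddings, using several visual components to help users understand potential distortions introduced by data transformation.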

Details

Motivation: Understanding the semantic and syntactic properties of language model embeddings matters to researchers and downstream applications, but the uncertainty introduced by dimensionality reduction can affect how the data are interpreted. Method: LayerFlow displays embeddings in an interlinked projection design and conveys uncertainty through visual components such as convex hulls, pairwise distances, cluster summaries, and projection quality metrics. Result: Case studies confirm LayerFlow's usability and show that multiple visual components and data perspectives convey uncertainty effectively. Conclusion: LayerFlow addresses uncertainty in embedding dimensionality reduction through visualization, giving language model research a more reliable analysis tool. Abstract: Large language models (LLMs) represent words through contextual word embeddings encoding different language properties like semantics and syntax. Understanding these properties is crucial, especially for researchers investigating language model capabilities, employing embeddings for tasks related to text similarity, or evaluating the reasons behind token importance as measured through attribution methods. Applications for embedding exploration frequently involve dimensionality reduction techniques, which reduce high-dimensional vectors to two dimensions used as coordinates in a scatterplot. This data transformation step introduces uncertainty that can be propagated to the visual representation and influence users' interpretation of the data. To communicate such uncertainties, we present LayerFlow, a visual analytics workspace that displays embeddings in an interlinked projection design and communicates the transformation, representation, and interpretation uncertainty. In particular, to hint at potential data distortions and uncertainties, the workspace includes several visual components, such as convex hulls showing 2D and HD clusters, data point pairwise distances, cluster summaries, and projection quality metrics. We show the usability of the presented workspace through replication and expert case studies that highlight the need to communicate uncertainty through multiple visual components and different data perspectives.

[119] Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Thilo Hagendorff,Sarah Fabi

Main category: cs.CL

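TLDR: The study designs a benchmark (n=4,000) to quantify the model-internal reasoning of large language models (LLMs) by requiring them to indicate solutions through the choice of a non-English language for their initial response token.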

Details

Motivation: Understanding and quantifying model-internal reasoning (the inferential "leaps" a model makes between individual token predictions) is crucial, but research on it is lacking. Method: Designs a benchmark in which LLMs solve problems by choosing a non-English language for their initial response, and evaluates 18 LLMs. Result: GPT-4.5 performs best (74.7%), ahead of Grok-2 (67.2%) and Llama 3.1 405B (65.6%). Experiments indicate that LLMs can reason through latent-space computation, though heuristic exploitation cannot be ruled out. Conclusion: LLMs can indeed reason internally through latent-space computation, but their reasoning strategies need further study, especially for safety-related concerns such as covert planning or deception. Abstract: Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences like chains of thought. Significant progress in enhancing reasoning abilities has been made by scaling test-time compute. However, understanding and quantifying model-internal reasoning abilities - the inferential "leaps" models make between individual token predictions - remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning in different domains. We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by selecting a specific language of their initial response token that is different from English, the benchmark language. This not only requires models to reason beyond their context window, but also to override their default tendency to respond in the same language as the prompt, thereby posing an additional cognitive strain. We evaluate a set of 18 LLMs, showing significant performance variations, with GPT-4.5 achieving the highest accuracy (74.7%), outperforming models like Grok-2 (67.2%), and Llama 3.1 405B (65.6%). Control experiments and difficulty scaling analyses suggest that while LLMs engage in internal reasoning, we cannot rule out heuristic exploitations under certain conditions, marking an area for future investigation.
Our experiments demonstrate that LLMs can "think" via latent-space computations, revealing model-internal inference strategies that need further understanding, especially regarding safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.

[120] Better Estimation of the KL Divergence Between Language Models

Afra Amini,Tim Vieira,Ryan Cotterell

Main category: cs.CL

TLDR: Proposes a Rao-Blackwellized estimator that estimates the KL divergence between language models more stably and with lower variance.

Details

Motivation: Computing the exact KL divergence between language models is intractable, and existing Monte Carlo estimators have high variance and can yield negative values. Method: Introduces a Rao-Blackwellized estimator and proves that its variance is no greater than that of the standard Monte Carlo estimator. Result: In sentiment-controlled fine-tuning experiments, the new estimator gives more stable KL estimates and substantially reduces variance. Conclusion: The Rao-Blackwellized estimator and its gradient counterpart perform better for KL estimation and model training. Abstract: Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
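
The contrast between the two estimators can be illustrated on a toy problem. The abstract does not spell out the estimator's form, so the sketch below assumes one standard way to Rao-Blackwellize: sample a sequence from p, and at each position add the exact KL between the two next-token distributions given the sampled prefix, instead of the log-ratio of the single sampled token. The two toy models (`p_next`, `q_next`), the two-token vocabulary, and the fixed sequence length are all hypothetical.

```python
import itertools
import math
import random
import statistics

VOCAB = (0, 1)  # toy two-token vocabulary (assumption)
LENGTH = 3      # fixed sequence length, no end-of-sequence token (assumption)


def p_next(prefix):
    """Hypothetical model p: next-token distribution given a prefix."""
    p1 = 0.7 if prefix and prefix[-1] == 1 else 0.4
    return (1.0 - p1, p1)


def q_next(prefix):
    """Hypothetical model q: next-token distribution given a prefix."""
    p1 = 0.5 if prefix and prefix[-1] == 1 else 0.3
    return (1.0 - p1, p1)


def sample_from_p(rng):
    seq = []
    for _ in range(LENGTH):
        seq.append(rng.choices(VOCAB, weights=p_next(seq))[0])
    return seq


def seq_logprob(model, seq):
    return sum(math.log(model(seq[:t])[seq[t]]) for t in range(len(seq)))


def step_kl(p, q):
    """Exact KL between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mc_samples(rng, n):
    """Per-sample values of the plain Monte Carlo estimator: log p(y) - log q(y)."""
    out = []
    for _ in range(n):
        y = sample_from_p(rng)
        out.append(seq_logprob(p_next, y) - seq_logprob(q_next, y))
    return out


def rb_samples(rng, n):
    """Per-sample values of the Rao-Blackwellized estimator: sum of exact step KLs."""
    out = []
    for _ in range(n):
        y = sample_from_p(rng)
        out.append(sum(step_kl(p_next(y[:t]), q_next(y[:t])) for t in range(LENGTH)))
    return out


def exact_kl():
    """Ground truth by enumerating all sequences (only feasible for toy models)."""
    total = 0.0
    for y in itertools.product(VOCAB, repeat=LENGTH):
        lp = seq_logprob(p_next, y)
        total += math.exp(lp) * (lp - seq_logprob(q_next, y))
    return total


mc = mc_samples(random.Random(0), 2000)
rb = rb_samples(random.Random(0), 2000)
print("exact:", exact_kl())
print("MC   mean/variance:", statistics.fmean(mc), statistics.pvariance(mc))
print("RB   mean/variance:", statistics.fmean(rb), statistics.pvariance(rb))
```

Every Rao-Blackwellized sample is a sum of non-negative step KLs, so it can never be negative, whereas individual Monte Carlo samples can be. This matches the behaviour the abstract describes for the naive estimator.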

[121] Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

Saif Punjwani,Larry Heck

Main category: cs.CL

TLDR: The paper proposes Weight-of-Thought (WoT), a new method that improves LLM reasoning by analyzing neural network weights and outperforms traditional methods.

Details

Motivation: Existing methods such as Chain-of-Thought focus only on output-level reasoning and ignore internal weight dynamics, which limits gains in reasoning ability. Method: WoT uses graph-based message passing, multi-step reasoning processes, and attention mechanisms to build an interconnected graph of reasoning nodes. Result: Across diverse reasoning tasks (syllogistic, mathematical, algebraic, and others), WoT outperforms traditional methods, especially on complex problems. Conclusion: WoT improves both performance and the interpretability of the reasoning process, offering a new direction for enhancing LLM reasoning. Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.

[122] Improving In-Context Learning with Reasoning Distillation

Nafis Sadeq,Xin Xu,Zhouhang Xie,Julian McAuley,Byungkyu Kang,Prarit Lamba,Xiang Gao

Main category: cs.CL

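TLDR: ReDis is a reasoning distillation technique that, through data augmentation, filtering, supervised fine-tuning, and alignment, markedly improves the inductive reasoning of language models and surpasses GPT-4o on several tasks.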

Details

Motivation: Language models rely on semantic priors for in-context learning and perform poorly on inductive reasoning tasks; existing methods struggle to improve the model's understanding of the rules underlying inputs and outputs. Method: Proposes ReDis, which combines data augmentation, filtering, supervised fine-tuning, and alignment to improve reasoning. Result: On 1D-ARC, ACRE, and MiniSCAN, ReDis achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o, respectively. Conclusion: ReDis markedly improves the inductive reasoning of language models, in some tasks even surpassing GPT-4o. Abstract: Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at https://github.com/NafisSadeq/reasoning-distillation.git.

[123] LITERA: An LLM Based Approach to Latin-to-English Translation

Paul Rosu

Main category: cs.CL

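TLDR: The paper presents LITERA, an LLM-based Latin-to-English translation platform that markedly improves translation accuracy through a multi-layered translation process and fine-tuned GPT-4o-mini/GPT-4o.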

Details

Motivation: Addresses the challenges of translating Latin texts, particularly the need for classical Latin translation. Method: Uses a multi-layered translation process with fine-tuned GPT-4o-mini and GPT-4o models, and builds a high-quality dataset in collaboration with Duke University's Classical Studies Department. Result: Markedly improves BLEU and BLEURT scores, with especially strong results on classical Latin. Conclusion: LITERA shows high accuracy and practical value in Latin translation and provides a strong tool for research. Abstract: This paper introduces an LLM-based Latin-to-English translation platform designed to address the challenges of translating Latin texts. We named the model LITERA, which stands for Latin Interpretation and Translations into English for Research Assistance. Through a multi-layered translation process utilizing a fine-tuned version of GPT-4o-mini and GPT-4o, LITERA offers an unprecedented level of accuracy, showcased by greatly improved BLEU scores, particularly in classical Latin, along with improved BLEURT scores. The development of LITERA involved close collaboration with Duke University's Classical Studies Department, which was instrumental in creating a small, high-quality parallel Latin-English dataset. This paper details the architecture, fine-tuning methodology, and prompting strategies used in LITERA, emphasizing its ability to produce literal translations.

[124] Characterizing Knowledge Manipulation in a Russian Wikipedia Fork

Mykola Trokhymovych,Oleksandr Kosovan,Nathan Forrester,Pablo Aragón,Diego Saez-Trumper,Ricardo Baeza-Yates

Main category: cs.CL

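TLDR: This paper analyzes Ruwiki, a fork of Russian Wikipedia, and proposes a methodology for identifying and classifying the main changes associated with knowledge manipulation.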

Details

Motivation: Identify how Ruwiki differs from the original Russian Wikipedia and reveal possible knowledge manipulation. Method: Compares more than 1.9M articles, using meta-information and geographical, temporal, categorical, and textual features to analyze the changes made by Ruwiki editors. Result: Identifies the main topics of knowledge manipulation in Ruwiki and quantifies their scope. Conclusion: The study sheds light on the changes in Ruwiki and provides a methodology that can be applied to other Wikipedia forks. Abstract: Wikipedia is powered by MediaWiki, free and open-source software that is also the infrastructure for many other wiki-based online encyclopedias. These include the recently launched website Ruwiki, which has copied and modified the original Russian Wikipedia content to conform to Russian law. To identify practices and narratives that could be associated with different forms of knowledge manipulation, this article presents an in-depth analysis of this Russian Wikipedia fork. We propose a methodology to characterize the main changes with respect to the original version. The foundation of this study is a comprehensive comparative analysis of more than 1.9M articles from Russian Wikipedia and its fork. Using meta-information and geographical, temporal, categorical, and textual features, we explore the changes made by Ruwiki editors. Furthermore, we present a classification of the main topics of knowledge manipulation in this fork, including a numerical estimation of their scope. This research not only sheds light on significant changes within Ruwiki, but also provides a methodology that could be applied to analyze other Wikipedia forks and similar collaborative projects.

[125] Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

F. A. Rizvi,T. Navojith,A. M. N. H. Adhikari,W. P. U. Senevirathna,Dharshana Kasthurirathna,Lakmini Abeywardhana

Main category: cs.CL

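TLDR: The paper proposes a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of multilingual and code-mixed banking content, and shows that fine-tuned Transformer models outperform traditional methods in financial text analysis.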

Details

Motivation: Conventional NLP models perform poorly on code-mixed text (such as Sinhala-English) and fail to capture domain-specific knowledge, so a more effective approach is needed to protect bank brand reputation. Method: A hybrid approach: a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank for English keyword extraction; a fine-tuned XLM-RoBERTa model combined with a Sinhala financial vocabulary for code-mixed and Sinhala keyword extraction. Content filtering and aspect-based classification also use BERT and XLM-RoBERTa models. Result: Keyword extraction accuracy is 91.2% for English and 87.4% for code-mixed and Sinhala content; for content filtering, BERT reaches 85.2% on English and XLM-RoBERTa 88.1% on Sinhala; for aspect-based classification, BERT reaches 87.4% on English and XLM-RoBERTa 85.9% on Sinhala. Conclusion: Fine-tuned Transformer models outperform traditional methods in multilingual financial text analysis and provide an accurate, scalable solution for monitoring bank brand reputation. Abstract: Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text when it mixes in low-resource languages, as in Sinhala-English, and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

[126] EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration

Soham Shah,Kumar Shridhar,Surojit Chatterjee,Souvik Sen

Main category: cs.CL

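TLDR: EMAFusion is a self-optimizing framework for LLM selection and execution that combines a taxonomy-based router, a learned router, and a cascading model selection strategy to improve performance while lowering cost.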

Details

Motivation: Existing routing strategies depend on large amounts of labeled data or task-specific heuristics, while fusion techniques raise accuracy but also increase cost and bias. EMAFusion aims to address both problems. Method: EMAFusion combines a taxonomy-based router, a learned router, and a cascading model selection strategy driven by multi-judge confidence evaluations. Result: EMAFusion outperforms the best individual model by 2.6 percentage points (94.3% vs. 91.7%) while being 4X cheaper than the average cost, and it costs less than 1/20th as much as GPT-4. Conclusion: EMAFusion offers flexible cost-accuracy trade-offs and unifies routing strategies effectively, improving both performance and cost efficiency. Abstract: While recent advances in large language models (LLMs) have significantly enhanced performance across diverse natural language tasks, the high computational and financial costs associated with their deployment remain substantial barriers. Existing routing strategies partially alleviate this challenge by assigning queries to cheaper or specialized models, but they frequently rely on extensive labeled data or fragile task-specific heuristics. Conversely, fusion techniques aggregate multiple LLM outputs to boost accuracy and robustness, yet they often exacerbate cost and may reinforce shared biases. We introduce EMAFusion, a new framework that self-optimizes for seamless LLM selection and reliable execution for a given query. Specifically, EMAFusion integrates a taxonomy-based router for familiar query types, a learned router for ambiguous inputs, and a cascading approach that progressively escalates from cheaper to more expensive models based on multi-judge confidence evaluations. Through extensive evaluations, we find EMAFusion outperforms the best individual models by over 2.6 percentage points (94.3% vs. 91.7%), while being 4X cheaper than the average cost. EMAFusion further achieves a remarkable 17.1 percentage point improvement over models like GPT-4 at less than 1/20th the cost. Our combined routing approach delivers 94.3% accuracy compared to taxonomy-based (88.1%) and learned model predictor-based (91.7%) methods alone, demonstrating the effectiveness of our unified strategy. Finally, EMAFusion supports flexible cost-accuracy trade-offs, allowing users to balance their budgetary constraints and performance needs.
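The cascading step described in the abstract (escalating from cheaper to more expensive models until several judges are confident) can be illustrated with a minimal Python sketch. This is not EMAFusion's implementation: the `Model` class, the `cascade` function, the mean-of-judges aggregation, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical types: a judge maps (query, answer) to a confidence in [0, 1].
Judge = Callable[[str, str], float]


@dataclass
class Model:
    name: str
    cost: float                      # assumed per-call cost, arbitrary units
    generate: Callable[[str], str]   # stand-in for a real LLM call


def cascade(query: str, models: List[Model], judges: List[Judge],
            threshold: float = 0.8) -> Tuple[str, str, float]:
    """Try models from cheapest to most expensive.

    Stops at the first answer whose mean judge confidence reaches the
    threshold; otherwise returns the most expensive model's answer.
    `judges` must be non-empty. Returns (model name, answer, total cost).
    """
    used, answer, spent = "", "", 0.0
    for model in sorted(models, key=lambda m: m.cost):
        answer = model.generate(query)
        spent += model.cost
        used = model.name
        confidence = mean(judge(query, answer) for judge in judges)
        if confidence >= threshold:
            break
    return used, answer, spent
```

Under these assumptions the total cost grows only when the cheaper answers fail the confidence check, which is the cost-accuracy trade-off the abstract refers to; raising or lowering `threshold` moves along that trade-off.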

[127] HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Avinash Kumar,Shashank Nag,Jason Clemons,Lizy John,Poulami Das

Main category: cs.CL

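TLDR: HELIOS dynamically selects an LLM and loads only part of its layers, optimizing the trade-off among latency, accuracy, and throughput and improving performance and resource efficiency.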

Details

Motivation: Deploying large language models (LLMs) involves inherent trade-offs among latency, accuracy, and throughput, and existing approaches cannot adapt dynamically to changes in the input queries. Method: HELIOS evaluates candidate LLMs in real time, loads only a subset of model layers, and periodically reassesses model performance to optimize resource use and performance. Result: HELIOS achieves 1.48$\times$ throughput, 1.10$\times$ energy efficiency, 1.39$\times$ lower response time, and 3.7$\times$ larger inference batch sizes. Conclusion: HELIOS effectively addresses resource efficiency in LLM deployment and substantially improves performance metrics. Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs associated with key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when they confidently find an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown a priori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to the changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them using a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings, which enable us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48$\times$ throughput, 1.10$\times$ energy-efficiency, 1.39$\times$ lower response time, and 3.7$\times$ improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.

[128] The Art of Audience Engagement: LLM-Based Thin-Slicing of Scientific Talks

Ralf Schmälzle,Sue Lim,Yuetong Du,Gary Bente

Main category: cs.CL

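TLDR: The paper studies thin-slicing in scientific talks and finds that LLM evaluations of short excerpts accurately predict overall presentation quality.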

Details

Motivation: Explore whether thin-slicing is effective for scientific talks and verify the reliability and efficiency of LLM-based evaluation. Method: Uses LLMs to analyze full transcripts and short excerpts of scientific talks, and compares the resulting evaluations with human ratings. Result: Short excerpts (less than 10% of a talk) strongly predict overall evaluations, and LLM evaluations agree closely with human ratings. Conclusion: Thin-slicing is suitable for presentation assessment, and LLMs can serve as an efficient feedback tool that advances communication research. Abstract: This paper examines the thin-slicing approach - the ability to make accurate judgments based on minimal information - in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.
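The analysis described above (correlating scores for short excerpts with scores for full talks) can be sketched roughly as follows. This is an illustrative sketch only: the function names `thin_slice` and `pearson`, the word-based slicing from the start of the transcript, and the use of Pearson correlation are assumptions, and the LLM scoring step itself is left out.

```python
from math import sqrt
from typing import Sequence


def thin_slice(transcript: str, fraction: float = 0.1) -> str:
    """Return the opening `fraction` of a transcript, measured in words.

    Assumption: a slice is taken from the start of the talk; at least one
    word is kept whenever the transcript is non-empty.
    """
    words = transcript.split()
    keep = max(1, int(len(words) * fraction))
    return " ".join(words[:keep])


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

In use, each talk would be scored twice by some rating function (once on `thin_slice(transcript)` and once on the full transcript), and `pearson` would be applied to the two resulting score lists; repeating this for several values of `fraction` shows how much of a talk is needed before the correlation levels off.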

[129] GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction

Jessica Lin,Amir Zeldes

Main category: cs.CL

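TLDR: This paper proposes a new approach to graded salience scoring of entities in text that combines the strengths of subjective judgments and summarization-based methods.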

Details

Motivation: Users who rely on models to interpret long documents need the most salient entities in a text identified and ranked, and existing methods fall short in consistency or are limited to binary outputs. Method: Collects multiple summaries per document and scores each entity by how often it appears across those summaries. Result: The new approach correlates more strongly with scores based on human summaries and outperforms existing techniques. Conclusion: The approach offers an effective solution for graded salient entity extraction, and the data and code are released as open source. Abstract: Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at https://github.com/jl908069/gum_sum_salience to support further research on graded salient entity extraction.
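The scoring idea (an entity's salience is based on its presence across several summaries of the same document) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function name `graded_salience` is hypothetical, and presence is detected by case-insensitive substring matching, whereas the paper works with annotated entities and alignments.

```python
from typing import Dict, List


def graded_salience(entities: List[str], summaries: List[str]) -> Dict[str, int]:
    """Score each entity by the number of summaries that mention it.

    Assumption: a mention is a case-insensitive substring match, which
    ignores coreference and partial names (a crude stand-in for alignment).
    """
    lowered = [summary.lower() for summary in summaries]
    return {
        entity: sum(entity.lower() in summary for summary in lowered)
        for entity in entities
    }


summaries = [
    "Marie Curie won two Nobel Prizes.",
    "Curie discovered polonium and radium.",
    "Marie Curie studied radioactivity in Paris.",
    "The physicist Marie Curie discovered radium.",
    "Marie Curie was born in Warsaw.",
]
scores = graded_salience(["Marie Curie", "radium", "Paris", "Vienna"], summaries)
```

With five summaries the score ranges from 0 to 5, which gives a graded scale rather than a binary summary-worthy label. The second summary above shows the limit of string matching: it refers to the same person as "Curie" but does not count toward "Marie Curie".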

[130] Name of Thrones: Evaluating How LLMs Rank Student Names, Race, and Gender in Status Hierarchies

Annabella Sakunkoo,Jonathan Sakunkoo

Main category: cs.CL

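TLDR: The study examines how large language models (LLMs) sort people in an unfair, biased way based on names (both first and last names), and reveals the complex roles that gender and race play.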

Details

Motivation: Names carry deep personal and cultural significance and also signal gender, race, and social status. As LLMs are widely adopted, it is crucial to evaluate whether they sort people unfairly based on names. Method: Conducts a large-scale analysis of name variations across 5 ethnicities to study how AI exhibits name biases, and investigates three key characteristics of inequality. Result: LLMs reflect and reinforce status hierarchies based on names, especially names that signal gender and ethnicity; East Asian and, in some contexts, South Asian names receive higher rankings, and gender moderates the biases. Conclusion: The study stresses the need for more nuanced understandings of race, gender, and mixed identities when evaluating LLMs, and challenges the monolithic Asian model minority assumption. Abstract: Across cultures, names tell a lot about their bearers as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy - a pecking order in which individual positions shape others' expectations of their perceived competence and worth. With the widespread adoption of LLMs and as names are often an input for LLMs, it is crucial to evaluate whether LLMs may sort people into status positions based on first and last names and, if so, whether it is in an unfair, biased fashion. While prior work has primarily investigated biases in first names, little attention has been paid to last names and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption, illustrating a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.

[131] CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Ayoung Lee,Ryan Sungmo Kwon,Peter Railton,Lu Wang

Main category: cs.CL

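TLDR: The paper introduces the CLASH dataset for evaluating how large language models (LLMs) reason about high-stakes dilemmas, and finds that existing models fall short on ambivalent decisions, value shifts, and psychological discomfort.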

Details

Motivation: Fill the gap in prior work on evaluating how LLMs handle high-stakes dilemmas with conflicting values, and examine model behavior on ambivalent decisions and value shifts. Method: Builds the CLASH dataset (345 high-stakes dilemmas and 3,795 value-based perspectives) and tests 10 frontier models on decision ambivalence, psychological discomfort, and value shifts. Result: Models score below 50% accuracy on ambivalent decisions, can predict psychological discomfort but struggle to understand value shifts, and their value preferences correlate with their steerability. Conclusion: LLMs need stronger reasoning over complex values, and the framing (first-person vs. third-party) significantly affects model behavior. Abstract: Navigating high-stakes dilemmas involving conflicting values is challenging even for humans, let alone for AI. Yet prior work in evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH in a way to support the study of critical aspects of value-based decision-making processes which are missing from prior work, including understanding decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in characters' perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by humans, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs' value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective, compared to a first-person setup, though certain value pairs benefit uniquely from the first-person framing.

[132] Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators

Phill Kyu Rhee

Main category: cs.CL

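TLDR: The paper proposes a new framework that interprets large language models (LLMs) as generators of probabilistic left context-sensitive languages (CSLs), decomposing the Transformer into three basic components to improve interpretability and flexibility.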

Details

Motivation: Understand the underlying mechanisms of LLMs, move beyond the traditional view that attention and autoregression are inseparable, and provide a theoretical basis for generative AI. Method: Decomposes the Transformer into context windows, attention mechanisms, and autoregressive generation frameworks, and proposes that LLMs are dynamic approximations of probabilistic left CSLs. Result: Argues that simple token prediction can yield human-like intelligent output and that Transformers stochastically approximate CSLs, which the authors describe as widely recognized models of human-like intelligence. Conclusion: The framework connects formal language theory with the generative power of Transformers and lays a foundation for future theory and applications of generative AI. Abstract: Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as generators of probabilistic left context-sensitive languages (CSLs). We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligence outputs. Given that all CSLs are left context-sensitive (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potential.

[133] Ai2 Scholar QA: Organized Literature Synthesis with Attribution

Amanpreet Singh,Joseph Chee Chang,Chloe Anastasiades,Dany Haddad,Aakanksha Naik,Amber Tanaka,Angele Zamarron,Cecile Nguyen,Jena D. Hwang,Jason Dunkleberger,Matt Latzke,Smita Rao,Jaron Lochner,Rob Evans,Rodney Kinney,Daniel S. Weld,Doug Downey,Sergey Feldman

Main category: cs.CL

TLDR: Ai2 Scholar QA is a free, open-source scientific question answering system that outperforms existing systems.

Details

Motivation: Address the fact that existing scientific question answering systems are expensive and closed-source. Method: Provides a customizable open-source Python package, an interactive web app, public APIs, and downloadable datasets. Result: Outperforms competing systems on a scientific QA benchmark. Conclusion: Ai2 Scholar QA is an efficient and open solution for scientific question answering. Abstract: Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.

[134] Efficient Reasoning Models: A Survey

Sicheng Feng,Gongfan Fang,Xinyin Ma,Xinchao Wang

Main category: cs.CL

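TLDR: This survey reviews recent progress in efficient reasoning and identifies three directions for acceleration: shortening reasoning chains, developing compact models, and designing efficient decoding strategies.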

Details

Motivation: Reasoning models generate long Chain-of-Thoughts (CoTs) when solving complex tasks, which incurs heavy computational overhead, so effective acceleration methods are urgently needed. Method: Groups existing work into three categories: (1) shortening reasoning chains; (2) developing compact models; (3) designing efficient decoding strategies. Result: Surveys the related research and provides a GitHub repository of resources. Conclusion: Efficient reasoning is a key direction for reducing computational overhead and needs further optimization. Abstract: Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. This highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster - designing efficient decoding strategies to accelerate inference. A curated collection of papers discussed in this survey is available in our GitHub repository.

[135] Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

Changjiang Gao,Hankun Lin,Shujian Huang,Xin Huang,Xue Han,Junlan Feng,Chao Deng,Jiajun Chen

Main category: cs.CL

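TLDR: The paper studies the cross-lingual context retrieval ability of large language models (LLMs), evaluating over 40 LLMs across 12 languages, and finds that small, post-trained open models perform strongly, comparable to closed-source models such as GPT-4o.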

Details

Motivation: Cross-lingual context retrieval is important in real-world applications but has not been adequately studied. Method: Evaluates over 40 LLMs in a cross-lingual machine reading comprehension (xMRC) setting and analyzes where the ability comes from and how post-training affects it. Result: Small, post-trained open models perform strongly; cross-lingual retrieval can be divided into two phases, question encoding and answer retrieval; post-training markedly improves performance. Conclusion: Larger-scale pretraining alone does not improve xMRC performance, and multilingual post-training is the key to unlocking the cross-lingual potential of LLMs. Abstract: The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval

[136] Exploring the Role of KG-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs

Yingjian Chen,Feiyang Li,Xingyu Song,Tianxiao Li,Issey Sudeka,Irene Li

Main category: cs.CL

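TLDR: The study examines a knowledge graph-based retrieval-augmented generation (RAG) framework for Japanese medical question answering and finds that it gives small-scale open-source LLMs only limited gains, with effectiveness depending heavily on the quality of the retrieved content.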

Details

Motivation: Commercial LLMs such as GPT-4 cannot be used in Japanese medical settings because of privacy constraints, so the study explores the potential of combining open-source LLMs with RAG. Method: Proposes a knowledge graph-based RAG framework for Japanese medical QA with small-scale open-source LLMs. Result: Experiments show that KG-based RAG yields only limited improvement for small-scale open-source LLMs, and that its effectiveness depends strongly on the quality and relevance of the retrieved content. Conclusion: The study highlights the challenges and potential of RAG in Japanese medical QA and serves as a reference for other low-resource languages. Abstract: Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To bridge this gap, we are the first to explore a knowledge graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source LLMs. Experimental results show that KG-based RAG has only a limited impact on Japanese medical QA using small-scale open-source LLMs. Further case studies reveal that the effectiveness of the RAG is sensitive to the quality and relevance of the external retrieved content. These findings offer valuable insights into the challenges and potential of applying RAG in Japanese medical QA, while also serving as a reference for other low-resource languages.

[137] ReZero: Enhancing LLM search ability by trying one-more-time

Alan Dao,Thinh Le

Main category: cs.CL

TLDR: ReZero (Retry-Zero) is a new reinforcement learning framework that rewards retrying a search query, improving LLM performance on knowledge-intensive tasks.

Details

Motivation: Current methods typically focus on query formulation or reasoning over results and do not explicitly encourage persistence after a failed search. ReZero aims to improve LLM robustness in complex information-seeking scenarios by rewarding persistence. Method: ReZero uses a reinforcement learning framework that directly rewards retrying a query after an initial unsuccessful search, encouraging the LLM to explore alternative queries rather than stop prematurely. Result: ReZero reaches 46.88% accuracy in experiments, compared with a 25% baseline. Conclusion: By rewarding persistence, ReZero improves LLM performance when initial queries are insufficient and shows promise for complex information retrieval tasks. Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM) performance on knowledge-intensive tasks but depends heavily on initial search query quality. Current methods, often using Reinforcement Learning (RL), typically focus on query formulation or reasoning over results, without explicitly encouraging persistence after a failed search. We introduce ReZero (Retry-Zero), a novel RL framework that directly rewards the act of retrying a search query following an initial unsuccessful attempt. This incentivizes the LLM to explore alternative queries rather than prematurely halting. ReZero demonstrates significant improvement, achieving 46.88% accuracy compared to a 25% baseline. By rewarding persistence, ReZero enhances LLM robustness in complex information-seeking scenarios where initial queries may prove insufficient.
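To make the idea of rewarding a retry concrete, here is a minimal Python sketch of a reward-shaping term. It is not the reward used in ReZero: the function name `retry_bonus`, the action labels, the bonus value, and the choice to pay the bonus only when the final answer is correct are all assumptions made for illustration.

```python
from typing import List


def retry_bonus(actions: List[str], answered_correctly: bool,
                bonus: float = 0.5) -> float:
    """Hypothetical shaping term that rewards a repeated search.

    `actions` is the sequence of action labels in one trajectory, e.g.
    ["search", "search", "answer"]. The bonus is paid when at least two
    searches were issued and the final answer was correct (an assumed
    gating condition, chosen so that pointless retries earn nothing).
    """
    num_searches = sum(action == "search" for action in actions)
    retried = num_searches >= 2
    return bonus if (retried and answered_correctly) else 0.0


def total_reward(actions: List[str], answered_correctly: bool) -> float:
    """Combine an assumed correctness reward of 1.0 with the retry bonus."""
    base = 1.0 if answered_correctly else 0.0
    return base + retry_bonus(actions, answered_correctly)
```

Under these assumptions a trajectory that searches twice and answers correctly earns more than one that answers correctly after a single search, which nudges the policy toward reformulating a query instead of stopping early.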

[138] Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Jinwu Hu,Wei Zhang,Yufeng Wang,Yu Hu,Bin Xiao,Mingkui Tan,Qing Du

Main category: cs.CL

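TLDR: The paper proposes a task-agnostic dynamic prompt compression method (LLM-DCP) that models compression as a Markov Decision Process (MDP), removing redundant tokens step by step while retaining key information; experiments show it outperforms existing techniques at high compression rates.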

Details

Motivation: Existing prompt compression methods struggle to balance information retention, adaptation to context, and generality across tasks, which leads to high computational cost and limited performance. Method: Models prompt compression as an MDP, trains a DCP-Agent to compress prompts step by step, and uses the Hierarchical Prompt Compression (HPC) training strategy to increase compression difficulty gradually. Result: Experiments show that LLM-DCP outperforms existing techniques at high compression rates and does not depend on an external black-box LLM. Conclusion: LLM-DCP provides an efficient, general prompt compression method that reduces computational cost while maintaining performance. Abstract: Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.

[139] LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

Sukannya Purkayastha,Zhuang Li,Anne Lauscher,Lizhen Qu,Iryna Gurevych

Main category: cs.CL

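TLDR: The paper introduces the LazyReview dataset for detecting lazy thinking in peer reviews and shows that instruction-based fine-tuning substantially improves detection performance.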

Details

Motivation: Lazy thinking in peer review harms review quality, yet there is little related research and no dataset. Method: Builds the LazyReview dataset and runs zero-shot and fine-tuning experiments with LLMs. Result: Fine-tuning improves performance by 10-20 points, and reviews revised with feedback are more comprehensive and actionable. Conclusion: High-quality training data and feedback mechanisms are essential for improving review quality. Abstract: Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: https://github.com/UKPLab/arxiv2025-lazy-review)

[140] DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis

Efthymios Georgiou,Vassilis Katsouros,Yannis Avrithis,Alexandros Potamianos

Main category: cs.CL

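TLDR: The paper proposes DeepMLF, a new multimodal language model that achieves deep fusion through learnable tokens and reaches state-of-the-art performance in multimodal sentiment analysis.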

Details

Motivation: Fusion depth and multimodal capacity allocation in multimodal fusion remain underexplored, and this paper investigates how these factors affect effective fusion. Method: Introduces DeepMLF, which combines an audiovisual encoder with a pretrained decoder and uses learnable tokens for progressive multimodal fusion. Result: Achieves state-of-the-art performance on three MSA benchmarks and confirms that deep fusion (5-7 layers) and small token sets (about 20) work best. Conclusion: Fusion depth and multimodal capacity allocation are critical to performance, and DeepMLF's design and training recipe offer a new direction for multimodal fusion. Abstract: While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets ($\sim$20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.

[141] Using LLMs as prompt modifier to avoid biases in AI image generators

René Peinl

Main category: cs.CL

TLDR: The study examines how large language models (LLMs) can reduce bias in text-to-image generation systems by modifying user prompts. Experiments show that LLM-modified prompts significantly increase image diversity and reduce bias.

Details

Motivation: Address the bias that text-to-image generation systems show for neutral prompts, and improve the fairness and diversity of generated images. Method: Uses an LLM to modify user prompts and tests the effect experimentally on image generators including Stable Diffusion XL, 3.5 and Flux. Result: LLM-modified prompts significantly increase image diversity and reduce bias, with the strongest effect on less advanced image generators. Conclusion: The method effectively reduces bias and increases diversity, but limitations remain in certain contexts such as disability representation. Abstract: This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at https://iisys-hof.github.io/llm-prompt-img-gen/

[142] Benchmarking Vision Language Models on German Factual Data

René Peinl,Vincent Tischler

Main category: cs.CL

TLDR: This paper analyzes the factual knowledge of open-weight vision language models (VLMs) in German and English and finds shortcomings in recognizing German image content and in handling the German language.

Details

Motivation: Address the weak support of vision language models for non-English languages, even high-resource ones such as German. Method: Separates image-related from text-related factors and uses a jury-as-a-judge evaluation to measure accuracy across images from German and international contexts and across prompt languages. Result: VLMs perform poorly at recognizing German image content such as celebrities and sights, but do better on animals and plants (by scientific or English name) and on some products such as cars and supermarket items. Conclusion: VLM support for German still needs improvement, particularly in visual cognition and language recognition. Abstract: Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in English and Chinese language, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English language. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.

Laura De Grazia,Pol Pastells,Mauro Vázquez Chas,Desmond Elliott,Danae Sánchez Villegas,Mireia Farrús,Mariona Taulé

Main category: cs.CL

TLDR: The paper presents MuSeD, a multimodal Spanish dataset for detecting sexism in videos, and evaluates large language models and multimodal models on the task.

Details

Motivation: Sexism spreads on social media through multimodal content, especially video, which calls for a multimodal approach to detection and analysis. Method: Introduces the MuSeD dataset, proposes an innovative annotation framework, and evaluates a range of large language models and multimodal models. Result: Visual information plays a key role in labeling sexism; models detect explicit sexism effectively but perform poorly on implicit sexism such as stereotypes. Conclusion: Detecting implicit sexism is challenging because it depends on social and cultural context; future work needs to improve the models further. Abstract: Sexism is generally defined as prejudice and discrimination based on sex or gender, affecting every sector of society, from social institutions to relationships and individual behavior. Social media platforms amplify the impact of sexism by conveying discriminatory content not only through text but also across multiple modalities, highlighting the critical need for a multimodal approach to the analysis of sexism online. With the rise of social media platforms where users share short videos, sexism is increasingly spreading through video content. Automatically detecting sexism in videos is a challenging task, as it requires analyzing the combination of verbal, audio, and visual elements to identify sexist content. In this study, (1) we introduce MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of approximately 11 hours of videos extracted from TikTok and BitChute; (2) we propose an innovative annotation framework for analyzing the contribution of textual and multimodal labels in the classification of sexist and non-sexist content; and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs on the task of sexism detection. We find that visual information plays a key role in labeling sexist content for both humans and models. Models effectively detect explicit sexism; however, they struggle with implicit cases, such as stereotypes, instances where annotators also show low agreement. This highlights the inherent difficulty of the task, as identifying implicit sexism depends on the social and cultural context.

Ej Zhou,Weiming Lu

Main category: cs.CL

TLDR: The study examines the evaluation and mitigation of social bias in multilingual models, focusing on bias in low-resource languages and testing whether debiasing methods from high-resource languages transfer.

Details

Motivation: Social bias in language models can exacerbate social inequality, yet most existing research focuses on English, and models perform worse in low-resource languages because of insufficient data. The study aims to use high-resource language corpora to evaluate bias and experiment with debiasing methods. Method: Evaluates multilingual models in five languages (English, Chinese, Russian, Indonesian, Thai), analyzes four bias dimensions (gender, religion, nationality, race-color), and tests three debiasing methods (CDA, Dropout, SenDeb). Result: Builds multilingual bias evaluation datasets and shows that debiasing methods from high-resource languages transfer effectively to low-resource ones. Conclusion: The study offers actionable insights for fairness research in multilingual NLP and demonstrates the cross-lingual applicability of debiasing methods. Abstract: Social bias in language models can potentially exacerbate social inequalities. Despite it having garnered wide attention, most research focuses on English data. In a low-resource scenario, the models often perform worse due to insufficient training data. This study aims to leverage high-resource language corpora to evaluate bias and experiment with debiasing methods in low-resource languages. We evaluated the performance of recent multilingual models in five languages: English (eng), Chinese (zho), Russian (rus), Indonesian (ind) and Thai (tha), and analyzed four bias dimensions: gender, religion, nationality, and race-color. By constructing multilingual bias evaluation datasets, this study allows fair comparisons between models across languages. We have further investigated three debiasing methods (CDA, Dropout, SenDeb) and demonstrated that debiasing methods from high-resource languages can be effectively transferred to low-resource ones, providing actionable insights for fairness research in multilingual NLP.

[145] Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items

Minjie Zou,Sahana Srinivasan,Thaddaeus Wai Soon Lo,Ke Zou,Gabriel Dawei Yang,Xuguang Ai,Hyunjae Kim,Maxwell Singer,Fares Antaki,Kelvin Li,Robert Chang,Marcus Tan,David Ziyou Chen,Dianbo Liu,Qingyu Chen,Yih Chung Tham

Main category: cs.CL

TLDR: The study evaluates four reasoning-focused large language models (DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking) in ophthalmology and finds that o1 and DeepSeek-R1 are the most accurate, while the models differ in their strengths on text-generation metrics and inference time.

Details

Motivation: Explore how reasoning-focused large language models perform in specialized domains such as ophthalmology, where they remain underexplored. Method: Evaluates the models in a zero-shot setting on 5,888 ophthalmology exam questions from the MedMCQA dataset, quantitatively analyzing accuracy, Macro-F1, and five text-generation metrics, and recording inference time. Two ophthalmologists also assess the responses qualitatively. Result: o1 (0.902) and DeepSeek-R1 (0.888) achieve the highest accuracy; performance on text-generation metrics varies by model; DeepSeek-R1 has the longest inference time (40.4 seconds) and Gemini 2.0 Flash-Thinking the shortest (6.7 seconds). Qualitative evaluation shows that DeepSeek-R1 and Gemini 2.0 Flash-Thinking provide more detailed reasoning. Conclusion: Reasoning-focused LLMs perform well in ophthalmology, but each model has distinct characteristics and should be chosen according to need. Future work could further improve inference time and text-generation quality. Abstract: Recent advances in reasoning-focused large language models (LLMs) mark a shift from general LLMs toward models designed for complex decision-making, a crucial aspect in medicine. However, their performance in specialized domains like ophthalmology remains underexplored. This study comprehensively evaluated and compared the accuracy and reasoning capabilities of four newly developed reasoning-focused LLMs, namely DeepSeek-R1, OpenAI o1, o3-mini, and Gemini 2.0 Flash-Thinking. Each model was assessed using 5,888 multiple-choice ophthalmology exam questions from the MedMCQA dataset in zero-shot setting. Quantitative evaluation included accuracy, Macro-F1, and five text-generation metrics (ROUGE-L, METEOR, BERTScore, BARTScore, and AlignScore), computed against ground-truth reasonings. Average inference time was recorded for a subset of 100 randomly selected questions. Additionally, two board-certified ophthalmologists qualitatively assessed clarity, completeness, and reasoning structure of responses to differential diagnosis questions. o1 (0.902) and DeepSeek-R1 (0.888) achieved the highest accuracy, with o1 also leading in Macro-F1 (0.900). The performance of models across the text-generation metrics varied: o3-mini excelled in ROUGE-L (0.151), o1 in METEOR (0.232), DeepSeek-R1 and o3-mini tied for BERTScore (0.673), DeepSeek-R1 (-4.105) and Gemini 2.0 Flash-Thinking (-4.127) performed best in BARTScore, while o3-mini (0.181) and o1 (0.176) led AlignScore. Inference time across the models varied, with DeepSeek-R1 being slowest (40.4 seconds) and Gemini 2.0 Flash-Thinking fastest (6.7 seconds). Qualitative evaluation revealed that DeepSeek-R1 and Gemini 2.0 Flash-Thinking tended to provide detailed and comprehensive intermediate reasoning, whereas o1 and o3-mini displayed concise and summarized justifications.

[146] From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs

Guocong Li,Weize Liu,Yihang Wu,Ping Wang,Shuaihan Huang,Hongxia Xu,Jian Wu

Main category: cs.CL

TLDR: This paper proposes a three-stage fine-tuning method that improves the ability of large language models (LLMs) to detect and correct misleading information in the input, raising answer accuracy and reducing hallucinations.

Details

Motivation: Existing methods focus on correcting the output and overlook the potential of improving LLMs' ability to detect and correct misleading information in the input. Method: A three-stage fine-tuning method: 1) train LLMs to identify misleading information; 2) train LLMs to correct the misleading information using built-in or external knowledge; 3) train LLMs to generate accurate answers based on the corrected queries. Result: Experiments show that the method significantly improves the accuracy and factuality of LLM responses, strengthens hallucination detection, and reduces hallucinations in the output. Conclusion: The method effectively improves how LLMs handle misleading information and suggests a new direction for future research. Abstract: Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information. We will publicly release our code upon acceptance.

[147] Automated Python Translation

Joshua Otten,Antonios Anastasopoulos,Kevin Moran

Main category: cs.CL

TLDR: The paper proposes a method for automatically translating Python keywords and modules into other human languages, to lower the comprehension barrier for non-English speakers.

Details

Motivation: Python is popular partly because its English keywords and modules make it readable, but this can be a comprehension barrier for people who do not speak English. Method: Builds an automated pipeline that combines machine translation and large language models to translate Python keywords, error types, and related terms, and applies it to five common libraries in seven languages. Result: Quality tests in French, Greek, and Bengali demonstrate the feasibility of the translations. Conclusion: The approach offers a path toward a universal Python that is more accessible to users of any language background. Abstract: Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However, those who do not speak English may not experience these advantages. In fact, they may even be hindered in their ability to understand Python code, as the English nature of its terms creates an additional layer of overhead. To that end, we introduce the task of automatically translating Python's natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, considering the abbreviated nature of these forms, as well as potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies using machine translation and large language models. We then use this pipeline to acquire translations from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a quality test on a subset of these terms in French, Greek, and Bengali. We hope this will provide a clearer path forward towards creating a universal Python, accessible to anyone regardless of nationality or language background.
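As a rough illustration of what applying such a glossary to source code could look like, the sketch below swaps Python name tokens for translated terms using the standard `tokenize` module. The glossary `GLOSSARY_FR` and the function `translate_source` are hypothetical and are not taken from the paper; the abstract does not describe how translated terms are applied to code.

```python
import io
import tokenize

# Hypothetical French glossary for illustration only; these are NOT the
# translations produced by the paper's pipeline.
GLOSSARY_FR = {
    "if": "si",
    "else": "sinon",
    "return": "retourner",
    "def": "définir",
    "print": "afficher",
}


def translate_source(source, glossary):
    """Replace NAME tokens (keywords and identifiers) found in the glossary.

    String literals and comments are separate token types, so an "if"
    inside a string is left untouched.
    """
    translated = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME and text in glossary:
            text = glossary[text]
        translated.append((tok.type, text))
    # With (type, string) pairs, untokenize rebuilds the text in its
    # compatibility mode: tokens stay in order, but spacing is not preserved.
    return tokenize.untokenize(translated)


if __name__ == "__main__":
    print(translate_source("if x:\n    return 1\n", GLOSSARY_FR))
```

Working at the token level rather than with plain string replacement avoids rewriting text inside string literals; the result is meant for reading, since the Python interpreter does not accept translated keywords.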

[148] Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis

Hao Liu,Lijun He,Jiaxi Liang,Zhihan Ren,Fan Li

Main category: cs.CL

TLDR: The DASCO framework uses dependency parsing trees to strengthen multimodal sentiment analysis, addressing three challenges (sentiment cue perception, multimodal information misalignment, and semantic noise elimination) and performing strongly in experiments.

Details

Motivation: Existing methods struggle to address Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE) at the same time, which motivates the DASCO framework. Method: DASCO combines dependency parsing trees with a multi-task pretraining strategy comprising aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. Result: Across three subtasks on two benchmark datasets, DASCO performs strongly, with gains of 3.1% F1 and 5.4% precision on Twitter2015. Conclusion: Through dependency parsing trees and a fine-grained scope-oriented framework, DASCO significantly improves multimodal sentiment analysis and addresses the limitations of existing methods. Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (Dependency Structure Augmented Scoping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as syntactic branch combining with semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise for addressing SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+3.1% F1 and +5.4% precision on Twitter2015).

[149] REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective

Zhihao Xu,Yongqi Tong,Xin Zhang,Jun Zhou,Xiting Wang

Main category: cs.CL

TLDR: The paper proposes a data-driven approach (Reward Consistency Sampling) to resolve conflicts in multi-objective preference alignment by identifying samples that satisfy multiple objectives at once (Reward Consistency).

Details

Motivation: Multi-objective preference alignment often involves conflicting objectives (such as helpfulness versus harmlessness). Prior work relies mainly on algorithmic solutions, whereas this paper explores a data-driven way to mitigate the conflict. Method: Proposes the concept of Reward Consistency (RC), which identifies samples that align with multiple objectives, and supports it with gradient-based analysis; then develops the Reward Consistency Sampling framework to automatically construct conflict-mitigating datasets. Result: The generated data yields an average improvement of 13.37% in both harmless rate and helpfulness win rate, and consistently resolves conflicts in multi-objective scenarios. Conclusion: A data-driven approach can effectively mitigate conflicts in multi-objective preference alignment, and Reward Consistency Sampling offers a new perspective on multi-objective optimization. Abstract: Multi-objective preference alignment in language models often encounters a challenging trade-off: optimizing for one human preference (e.g., helpfulness) frequently compromises others (e.g., harmlessness) due to the inherent conflicts between competing objectives. While prior work mainly focuses on algorithmic solutions, we explore a novel data-driven approach to uncover the types of data that can effectively mitigate these conflicts. Specifically, we propose the concept of Reward Consistency (RC), which identifies samples that align with multiple preference objectives, thereby reducing conflicts during training. Through gradient-based analysis, we demonstrate that RC-compliant samples inherently constrain performance degradation during multi-objective optimization. Building on these insights, we further develop Reward Consistency Sampling, a framework that automatically constructs preference datasets that effectively mitigate conflicts during multi-objective alignment. Our generated data achieves an average improvement of 13.37% in both the harmless rate and helpfulness win rate when optimizing harmlessness and helpfulness, and can consistently resolve conflicts in varying multi-objective scenarios.
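One plausible reading of "samples that align with multiple preference objectives" is a filter that keeps a preference pair only when every reward model ranks the chosen response above the rejected one. The sketch below illustrates that reading; the data layout, the score values, and the function names are assumptions made for illustration and may differ from the paper's actual criterion and sampling procedure.

```python
def is_reward_consistent(pair_scores):
    """True if the chosen response beats the rejected one on every objective.

    pair_scores maps an objective name to a (chosen_score, rejected_score)
    tuple, e.g. scores from one reward model per objective (assumed layout).
    """
    return all(chosen > rejected for chosen, rejected in pair_scores.values())


def sample_consistent(dataset):
    """Keep only the preference pairs that satisfy the consistency check."""
    return [ex for ex in dataset if is_reward_consistent(ex["scores"])]


if __name__ == "__main__":
    # Toy data with made-up scores, for illustration only.
    data = [
        {"id": "a", "scores": {"helpful": (0.9, 0.2), "harmless": (0.8, 0.3)}},
        {"id": "b", "scores": {"helpful": (0.9, 0.2), "harmless": (0.1, 0.7)}},
    ]
    print([ex["id"] for ex in sample_consistent(data)])
```

Pair "b" is dropped because the response preferred for helpfulness scores lower on harmlessness, which is the kind of conflicting training signal the abstract describes.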

[150] OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution

Lucio La Cava,Andrea Tagarelli

Main category: cs.CL

TLDR: OpenTuringBench is a new benchmark based on Open Large Language Models (OLLMs) for training and evaluating machine-generated text detectors on the Turing Test and Authorship Attribution problems.

Details

Motivation: As open large language models are increasingly used in generative AI applications, detecting their outputs has become a new challenge. Method: Proposes the OpenTuringBench benchmark and OTBDetector, a contrastive learning framework for detecting and attributing OLLM-generated text. Result: The OpenTuringBench tasks are relevant and vary in difficulty; OTBDetector performs strongly across the tasks and outperforms most existing detectors. Conclusion: OpenTuringBench provides an effective tool for machine-generated text detection, and OTBDetector demonstrates strong detection capability. Abstract: Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at https://huggingface.co/datasets/MLNTeam-Unical/OpenTuringBench

[151] Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

Wang Bill Zhu,Tianqi Chen,Ching Ying Lin,Jade Law,Mazen Jizzini,Jorge J. Nieva,Ruishan Liu,Robin Jia

Main category: cs.CL

TLDR: The paper evaluates how large language models (LLMs) answer real questions from cancer patients and finds that, although accuracy is high, models often ignore false presuppositions in the questions, which poses a risk to medical decision-making. The authors introduce the Cancer-Myth dataset, on which frontier LLMs correct false presuppositions less than 30% of the time.

Details

Motivation: Assess how LLMs handle complex, personalized medical questions, especially those containing false presuppositions, in order to expose gaps in clinical reliability. Method: Evaluates LLMs on real patient questions and introduces the expert-verified Cancer-Myth dataset to test whether models correct false presuppositions. Result: LLMs score well on accuracy (e.g., GPT-4-Turbo scores 4.13/5) but correct false presuppositions less than 30% of the time; even advanced medical agentic methods do not help. Conclusion: LLMs have a significant gap in clinical reliability, and medical AI systems need more robust safeguards to be safe. Abstract: Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet -- corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

[152] RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models

Juan Diego Rodriguez,Wenxuan Ding,Katrin Erk,Greg Durrett

Main category: cs.CL

TLDR: The paper studies the inconsistency between the answers large language models (LLMs) generate and their own verification of those answers (the generator-validator gap), and proposes RankAlign, a method that significantly narrows the gap.

Details

Motivation: Although LLMs have become more capable and accurate across many tasks, their behavior remains unreliable in some respects, notably in giving inconsistent answers when the prompt changes. Method: Defines the generator-validator gap more stringently than prior work and proposes RankAlign, a ranking-based training method. Result: RankAlign closes the gap by 31.8% on average, surpasses all baseline methods, and generalizes to out-of-domain tasks and lexical items. Conclusion: RankAlign effectively addresses the generator-validator gap in LLMs and applies broadly. Abstract: Although large language models (LLMs) have become generally more capable and accurate across many tasks, some fundamental sources of unreliability remain in their behavior. One key limitation is their inconsistency in reporting the same information when prompts are changed. In this paper, we consider the discrepancy between a model's generated answer and its own verification of that answer, the generator-validator gap. We define this gap in a more stringent way than prior work: we expect correlation of scores from a generator and a validator over the entire set of candidate answers. We show that according to this measure, a large gap exists in various settings, including question answering, lexical semantics tasks, and next-word prediction. We then propose RankAlign, a ranking-based training method, and show that it significantly closes the gap by 31.8% on average, surpassing all baseline methods. Moreover, this approach generalizes well to out-of-domain tasks and lexical items.
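To make "correlation of scores from a generator and a validator over the entire set of candidate answers" concrete, the sketch below computes a rank correlation between two score lists. Using a Spearman-style correlation is an assumption (the abstract says only "correlation"), the score values are made up, and ties are assumed absent for simplicity.

```python
import math


def ranks(scores):
    """Rank positions (0 = lowest score); assumes there are no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = float(rank)
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(generator_scores, validator_scores):
    """Spearman-style correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(generator_scores), ranks(validator_scores))


if __name__ == "__main__":
    # Hypothetical scores for three candidate answers to one question:
    # generator log-probabilities and validator "yes" probabilities.
    generator = [-0.1, -2.3, -1.0]
    validator = [0.9, 0.1, 0.5]
    print(rank_correlation(generator, validator))
```

A value near 1 means the model ranks candidates the same way when generating and when validating; lower values indicate a larger gap under this reading.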

[153] Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Marcin Chochowski,Yashaswi Karnati,Raviraj Joshi,Ameya Sunil Mahabaleshwarkar,Zijia Chen,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov

Main category: cs.CL

TLDR: This paper proposes a new compression method for hybrid LLM architectures (which combine Attention and SSMs) that uses group-aware pruning and knowledge distillation to improve both accuracy and inference speed.

Details

Motivation: Explore how effectively hybrid architectures can be compressed, so that parameter count and training cost fall while performance is maintained or improved. Method: Introduces a group-aware pruning strategy and combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining. Result: Compresses the Nemotron-H 8B model to 4B parameters with up to 40x fewer training tokens; the resulting model is more accurate than similarly sized models and runs inference 2x faster. Conclusion: The method significantly advances the Pareto frontier and provides an efficient way to compress hybrid architectures. Abstract: Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

[154] Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts

Quanyu Long,Jianda Chen,Zhengyuan Liu,Nancy F. Chen,Wenya Wang,Sinno Jialin Pan

Main category: cs.CL

TLDR: The paper proposes a tri-encoder sequential retriever that models compositional retrieval as a Markov Decision Process (MDP) and significantly outperforms baseline methods.

Details

Motivation: Address the limitation of traditional retrieval-augmented frameworks on compositional retrieval tasks, where multiple information sources must be coordinated. Method: Uses a tri-encoder sequential retriever that decomposes retrieval into a sequence of conditional probabilities, trained in two stages (initial training on supervised data, then policy optimization based on LLM preferences). Result: Experiments show that the method significantly outperforms baselines on compositional retrieval tasks. Conclusion: Compositional retrieval is promising for tasks that require multiple pieces of evidence or examples, and explicitly modeling inter-example dependencies is key. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM's preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.
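The decomposition into conditional probabilities mentioned in the abstract can be written with the chain rule. The notation here is an assumption, not the paper's: x is the query and z_1, ..., z_k are the retrieved elements in selection order.

```latex
P(z_1, \dots, z_k \mid x) \;=\; \prod_{i=1}^{k} P\bigl(z_i \mid x,\, z_1, \dots, z_{i-1}\bigr)
```

Each factor corresponds to one retrieval step conditioned on the query and on the elements already selected, which is what distinguishes this from scoring every candidate independently in a single pass.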

[155] A Dual-Space Framework for General Knowledge Distillation of Large Language Models

Xue Zhang,Songming Zhang,Yunlong Liang,Fandong Meng,Yufeng Chen,Jinan Xu,Jie Zhou

Main category: cs.CL

TLDR: The paper proposes a dual-space knowledge distillation (DSKD) framework that removes the limitations of current white-box knowledge distillation (KD) arising from differing output spaces and vocabularies. Using projectors and an exact token alignment (ETA) algorithm, DSKD supports KD between LLMs with different vocabularies and performs strongly on several tasks.

Details

Motivation: The current white-box KD framework suffers from mismatched output spaces and incompatible vocabularies, which limits both the effectiveness and the applicability of knowledge distillation. Method: Proposes the DSKD framework, which uses projectors to unify the output spaces of the teacher and student models, and develops the ETA algorithm to align differently tokenized sequences. Result: DSKD significantly outperforms existing methods on instruction-following, mathematical reasoning, and code generation tasks, and supports KD between LLMs with different vocabularies. Conclusion: DSKD is a general KD framework that addresses the limitations of existing methods and shows superior performance in experiments. Abstract: Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
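The idea of projecting hidden states so that both models share one prediction head can be sketched for a single token position as follows. This is a simplified, one-direction illustration (student hidden state projected into the teacher space, teacher head shared, KL divergence as the distance); the function names, the tiny dimensions, and the choice of KL are assumptions and do not reproduce the paper's full objective, its projector initialization, or the ETA algorithm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def shared_head_kd_loss(student_hidden, teacher_hidden, projector, teacher_head):
    """Distance between teacher and student distributions in one output space.

    The student hidden state is mapped into the teacher's representation
    space by `projector`; both states then pass through the same
    `teacher_head`, so the two distributions share a vocabulary.
    """
    projected = matvec(projector, student_hidden)
    p_teacher = softmax(matvec(teacher_head, teacher_hidden))
    p_student = softmax(matvec(teacher_head, projected))
    return kl_divergence(p_teacher, p_student)


if __name__ == "__main__":
    # Toy sizes: student dim 2, teacher dim 3, vocabulary of 2 tokens.
    projector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    teacher_head = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
    print(shared_head_kd_loss([1.0, 2.0], [1.0, 2.0, 3.0], projector, teacher_head))
```

Because both distributions come from the same head, the loss reaches zero exactly when the projected student state yields the same logits as the teacher state, regardless of the student's own vocabulary.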

[156] Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models

Maria Teleki,Xiangjue Dong,Haoran Liu,James Caverlee

Main category: cs.CL

TLDR: The paper studies discourse-based masculine defaults and proposes a twofold framework (GDCF and D-WEAT) to discover and analyze gendered discourse words and to measure the associated gender bias in LLMs. It finds that masculine discourse words have more stable representations in an embedding model, which may give men better system performance on downstream tasks.

Details

Motivation: Masculine defaults are a widespread form of gender bias but remain under-researched. The paper aims to reveal masculine defaults in discourse and their effect on language models. Method: Uses the GDCF framework to discover and analyze gendered discourse words at scale and D-WEAT to measure gender bias in LLMs. Analyzes 15,117 podcast episodes, using LDA and BERTopic to automatically form gendered discourse word lists. Result: Gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Masculine discourse words have more stable representations in the embedding model, which may benefit men on downstream tasks. Conclusion: Masculine defaults appear in language models as an embedding disparity, a representational harm that may tilt system performance in favor of men. Abstract: Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default.

[157] TextArena

Leon Guertler,Bobby Cheng,Simon Yu,Bo Liu,Leshem Choshen,Cheston Tan

Main category: cs.CL

TLDR: TextArena is an open-source collection of competitive text-based games for training and evaluating agentic behavior in large language models (LLMs), filling the gap left by traditional benchmarks in assessing dynamic social skills.

Details

Motivation: Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception; TextArena aims to fill this gap. Method: TextArena spans 57+ unique environments (single-player, two-player, and multi-player setups), supports an online-play system (against humans and other models), and provides real-time TrueSkill scores. Result: TextArena is designed with research, community, and extensibility in mind, and makes it easy to add new games, test models, play against models, and train models. Conclusion: TextArena is a capable and easily extensible tool for evaluating and improving agentic behavior in LLMs. Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.

[158] DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He,Tian Liang,Jiahao Xu,Qiuzhi Liu,Xingyu Chen,Yue Wang,Linfeng Song,Dian Yu,Zhenwen Liang,Wenxuan Wang,Zhuosheng Zhang,Rui Wang,Zhaopeng Tu,Haitao Mi,Dong Yu

Main category: cs.CL

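TLDR: DeepMath-103K is a new large-scale dataset of mathematical problems designed for training advanced reasoning models via reinforcement learning, addressing the shortage of suitable existing data.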

Details

Motivation: Existing training data for mathematical reasoning lacks difficulty and verifiable answer formats and is prone to contamination with evaluation benchmarks, limiting the use of reinforcement learning in LLMs. Method: DeepMath-103K is built through a rigorous curation pipeline (source analysis, decontamination, and high-difficulty filtering); each problem includes a verifiable answer and multiple solutions. Result: Models trained on DeepMath-103K improve significantly on mathematical benchmarks. Conclusion: DeepMath-103K provides high-quality data for advancing AI reasoning systems and is publicly released to support the community. Abstract: The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence. While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks. To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation. Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness. We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: https://github.com/zwhe99/DeepMath.
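
The abstract notes that a verifiable final answer enables rule-based RL. A minimal sketch of what such a rule-based reward could look like follows. The function names, the \boxed{...} answer convention, and the normalization rules are assumptions for illustration and are not taken from the DeepMath-103K release.

```python
from fractions import Fraction


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer):
    """Hypothetical normalization: exact rational if parseable, else a string."""
    s = answer.replace(" ", "")
    try:
        return Fraction(s)  # accepts forms such as "3", "0.5", "1/2"
    except (ValueError, ZeroDivisionError):
        return s


def rule_based_reward(model_output, reference_answer):
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference_answer) else 0.0
```

A binary check of this kind needs no learned reward model, which is why an unambiguous final answer matters for RL; a real pipeline would likely need a more thorough symbolic equivalence check.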

[159] Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Xiangru Zhu,Penglei Sun,Yaoxian Song,Yanghua Xiao,Zhixu Li,Chengyu Wang,Jun Huang,Bei Yang,Xiaoxiao Xu

Main category: cs.CL

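TLDR: The paper proposes a new metric, SemVarEffect, and a benchmark, SemVarBench, to evaluate the causal effect of semantic variations on outputs in text-to-image synthesis, and finds that existing models perform poorly on complex linguistic patterns.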

Details

Motivation: Current text-to-image synthesis models struggle to capture the semantic differences caused by word order changes, and existing evaluation methods cannot reliably measure these challenges. Method: Semantic variations are generated through two types of linguistic permutations; the SemVarEffect metric and the SemVarBench benchmark are designed to evaluate model performance. Result: CogView-3-Plus and Ideogram 2 perform best (0.2/1); semantic variations in object relations are understood poorly (0.07/1). Cross-modal alignment is a key factor. Conclusion: The study provides an effective evaluation framework for text-to-image synthesis and advances the exploration of human instruction understanding. Abstract: Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. This often obscures poor performance on complex or uncommon linguistic patterns by the focus on frequent word combinations. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding. Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench

cs.GR [Back]

[160] VideoPanda: Video Panoramic Diffusion with Multi-view Attention

Kevin Xie,Amirmojtaba Sabour,Jiahui Huang,Despoina Paschalidou,Greg Klar,Umar Iqbal,Sanja Fidler,Xiaohui Zeng

Main category: cs.GR

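TLDR: VideoPanda is a new method for generating 360-degree panoramic videos from text or single-view video; it augments a video diffusion model with multi-view attention layers to synthesize consistent multi-view videos.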

Details

Motivation: High-resolution panoramic video is essential for immersive virtual reality experiences but is hard to capture, requiring specialized equipment and intricate setups. Method: Augments a video diffusion model with multi-view attention layers, trains with text or single-view video conditioning, and uses random subsampling to reduce the computational burden. Result: Evaluations on real-world and synthetic video datasets show that VideoPanda generates more realistic and coherent 360-degree panoramic videos. Conclusion: VideoPanda outperforms existing methods at panoramic video generation and supports autoregressive generation of long videos. Abstract: High resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/ for results.

cs.IR [Back]

[161] ArxivBench: Can LLMs Assist Researchers in Conducting Research?

Ning Li,Jingran Zhang,Justin Cui

Main category: cs.IR

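TLDR: The paper evaluates how well proprietary and open-source LLMs generate accurate responses about research papers on the arXiv platform, and introduces the arXivBench benchmark.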

Details

Motivation: To address factual errors in LLM-generated content and assess the reliability of LLMs in academic research. Method: Uses the arXivBench benchmark, covering eight major arXiv subject categories and five subfields of computer science. Result: Accuracy varies significantly across subjects; Claude-3.5-Sonnet performs best, and accuracy is generally higher in the AI subfield. Conclusion: arXivBench provides a standardized tool for evaluating the reliability of LLM-generated scientific content, promoting its dependable use in academic research. Abstract: Large language models (LLMs) have demonstrated remarkable effectiveness in completing various tasks such as reasoning, translation, and question answering. However, the issue of factually incorrect content in LLM-generated responses remains a persistent challenge. In this study, we evaluate both proprietary and open-source LLMs on their ability to respond with relevant research papers and accurate links to articles hosted on the arXiv platform, based on high-level prompts. To facilitate this evaluation, we introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories among them. Our findings reveal a concerning accuracy of LLM-generated responses depending on the subject, with some subjects experiencing significantly lower accuracy than others. Notably, Claude-3.5-Sonnet exhibits a substantial advantage in generating both relevant and accurate responses. Interestingly, most LLMs achieve a much higher accuracy in the Artificial Intelligence sub-field than in other sub-fields. This benchmark provides a standardized tool for evaluating the reliability of LLM-generated scientific responses, promoting more dependable use of LLMs in academic and research environments. Our code is open-sourced at https://github.com/arxivBenchLLM/arXivBench and our dataset is available on Hugging Face at https://huggingface.co/datasets/arXivBenchLLM/arXivBench.

[162] Graph-based Approaches and Functionalities in Retrieval-Augmented Generation: A Comprehensive Survey

Zulun Zhu,Tiancheng Huang,Kai Wang,Junda Ye,Xinghe Chen,Siqiang Luo

Main category: cs.IR

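TLDR: This survey reviews the roles of graphs in retrieval-augmented generation (RAG), analyzes their use in improving language model performance, and identifies future research directions.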

Details

Motivation: Large language models (LLMs) suffer from factual errors and hallucination during inference. Retrieval-augmented generation (RAG) provides support through external knowledge sources, but a unified review of graph-structured knowledge in RAG is lacking. Method: Analyzes RAG from a graph perspective, covering database construction, algorithms, pipelines, and tasks, and compares the commonalities and differences of existing methods. Result: Reviews the diverse roles of graphs in RAG, presents concrete cases of performance improvement, and summarizes current challenges. Conclusion: Graphs hold significant potential in RAG; future research can explore further by combining graph learning, database systems, and natural language processing. Abstract: Large language models (LLMs) struggle with factual errors during inference due to the lack of sufficient training data and the most updated knowledge, leading to the hallucination problem. Retrieval-Augmented Generation (RAG) has gained attention as a promising solution to address the limitation of LLMs, by retrieving relevant information from external sources to generate more accurate answers to the questions. Given the pervasive presence of structured knowledge in the external source, considerable strides in RAG have been made to employ the techniques related to graphs and achieve more complex reasoning based on the topological information between knowledge entities. However, there is currently neither a unified review examining the diverse roles of graphs in RAG, nor a comprehensive resource to help researchers navigate and contribute to this evolving field. This survey offers a novel perspective on the functionality of graphs within RAG and their impact on enhancing performance across a wide range of graph-structured data. It provides a detailed breakdown of the roles that graphs play in RAG, covering database construction, algorithms, pipelines, and tasks. Finally, it identifies current challenges and outlines future research directions, aiming to inspire further developments in this field. Our graph-centered analysis highlights the commonalities and differences in existing methods, setting the stage for future researchers in areas such as graph learning, database systems, and natural language processing.

[163] JEPA4Rec: Learning Effective Language Representations for Sequential Recommendation via Joint Embedding Predictive Architecture

Minh-Anh Nguyen,Dung D. Le

Main category: cs.IR

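TLDR: JEPA4Rec is a framework that combines a joint embedding predictive architecture with language modeling to address data sparsity and limited understanding of user preferences in sequential recommendation.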

Details

Motivation: Language representation learning shows promise for sequential recommendation but still faces data sparsity and a limited understanding of user preferences. Method: JEPA4Rec represents items as text sentences, uses a bidirectional Transformer encoder, and combines masked prediction with a two-stage training strategy to learn transferable representations. Result: Experiments on six real-world datasets show that JEPA4Rec outperforms existing methods in cross-domain, cross-platform, and low-resource scenarios. Conclusion: By combining language modeling with joint embedding prediction, JEPA4Rec significantly improves recommendation performance and reduces reliance on large-scale pre-training data. Abstract: Language representation learning has emerged as a promising approach for sequential recommendation, thanks to its ability to learn generalizable representations. However, despite its advantages, this approach still struggles with data sparsity and a limited understanding of common-sense user preferences. To address these limitations, we propose $\textbf{JEPA4Rec}$, a framework that combines $\textbf{J}$oint $\textbf{E}$mbedding $\textbf{P}$redictive $\textbf{A}$rchitecture with language modeling of item textual descriptions. JEPA4Rec captures semantically rich and transferable representations, improving recommendation performance and reducing reliance on large-scale pre-training data. Specifically, JEPA4Rec represents items as text sentences by flattening descriptive information such as $\textit{title, category}$, and other attributes. To encode these sentences, we employ a bidirectional Transformer encoder with modified embedding layers tailored for capturing item information in recommendation datasets. We apply masking to text sentences and use them to predict the representations of the unmasked sentences, helping the model learn generalizable item embeddings. To further improve recommendation performance and language understanding, we employ a two-stage training strategy incorporating self-supervised learning losses. Experiments on six real-world datasets demonstrate that JEPA4Rec consistently outperforms state-of-the-art methods, particularly in cross-domain, cross-platform, and low-resource scenarios.
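
The abstract describes flattening item attributes into a text sentence and masking that sentence before encoding. The sketch below illustrates only this input preparation. The "key: value" sentence format, the whitespace tokenization, the [MASK] token, and the 30% masking ratio are all assumptions for illustration; the abstract does not specify them.

```python
import random


def flatten_item(item, order=("title", "category")):
    """Flatten an item's attributes into one sentence, title and category first."""
    keys = [k for k in order if k in item]
    keys += sorted(k for k in item if k not in order)  # remaining attributes
    return " ".join(f"{k}: {item[k]}" for k in keys)


def mask_tokens(sentence, ratio=0.3, mask="[MASK]", seed=0):
    """Replace a random subset of whitespace tokens with a mask token."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_masked = max(1, int(len(tokens) * ratio))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    return " ".join(mask if i in masked_positions else tok
                    for i, tok in enumerate(tokens))


item = {"brand": "Acme", "title": "Trail Shoes", "category": "Sports"}
sentence = flatten_item(item)
masked = mask_tokens(sentence)
```

In a JEPA-style setup the encoder would see the masked sentence and be trained to predict the representation of the unmasked one, rather than to reconstruct the masked tokens themselves.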

[164] CSPLADE: Learned Sparse Retrieval with Causal Language Models

Zhichao Xu,Aosong Feng,Yijun Tian,Haibo Ding,Lin Leee Cheong

Main category: cs.IR

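TLDR: The paper examines the challenges of training large language models (LLMs) for learned sparse retrieval (LSR) and proposes two techniques to address training instability and suboptimal performance, achieving efficient retrieval with an 8B-scale LLM.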

Details

Motivation: Dense retrieval is effective but yields uninterpretable vectors and large indexes; learned sparse retrieval (LSR) is an alternative, but it has not been explored beyond BERT scale. Method: Proposes a lightweight adaptation training phase to address training instability, and designs two model variants to enable bidirectional information. Result: Successfully trains LSR models with an 8B-scale LLM, achieving competitive performance with a reduced index size, and is among the first to analyze the performance-efficiency tradeoff through quantization. Conclusion: The study provides new insights into adapting LLMs for efficient retrieval modeling. Abstract: In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors, and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, few works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLMs) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to the pre-trained LLM's unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with an 8B-scale LLM, and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.
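
The abstract's point that sparse representations can reuse the classical inverted index can be made concrete with a small sketch. The term weights below are hand-written stand-ins for what an LSR model would output, and the function names are hypothetical; this is not the CSPLADE implementation.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, weight) for nonzero weights."""
    index = defaultdict(list)
    for doc_id, sparse_vec in docs.items():
        for term, weight in sparse_vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vec, k=3):
    """Score documents by sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


# Illustrative term weights (assumed, not model outputs).
docs = {
    "d1": {"sparse": 1.5, "retrieval": 0.5},
    "d2": {"dense": 2.0, "retrieval": 1.0},
    "d3": {"cooking": 2.0},
}
index = build_inverted_index(docs)
results = search(index, {"sparse": 1.0, "retrieval": 0.5})
```

Documents that share no term with the query are never visited, which is the efficiency argument for sparse retrieval; each nonzero dimension is also a vocabulary term, which is where the interpretability comes from.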

[165] Human-Oriented Image Retrieval System (HORSE): A Neuro-Symbolic Approach to Optimizing Retrieval of Previewed Images

Abraham Itzhak Weinberg

Main category: cs.IR

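TLDR: HORSE is a new image retrieval approach based on neuro-symbolic indexing that combines cognitive science with computational techniques to improve retrieval efficiency.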

Details

Motivation: Current image search engines rely on preprocessing and machine learning pipelines, which are inefficient and not aligned with human perception. Method: Adopts a neuro-symbolic framework that combines neural networks with symbolic reasoning to optimize the retrieval process. Result: HORSE offers a more intuitive and efficient image retrieval solution, applicable to areas such as design error detection. Conclusion: HORSE demonstrates the potential of neuro-symbolic indexing; further work is needed to refine the system's metrics. Abstract: Image retrieval remains a challenging task due to the complex interaction between human visual perception, memory, and computational processes. Current image search engines often struggle to efficiently retrieve images based on natural language descriptions, as they rely on time-consuming preprocessing, tagging, and machine learning pipelines. This paper introduces the Human-Oriented Retrieval Search Engine for Images (HORSE), a novel approach that leverages neuro-symbolic indexing to improve image retrieval by focusing on human-oriented indexing. By integrating cognitive science insights with advanced computational techniques, HORSE enhances the retrieval process, making it more aligned with how humans perceive, store, and recall visual information. The neuro-symbolic framework combines the strengths of neural networks and symbolic reasoning, mitigating their individual limitations. The proposed system optimizes image retrieval, offering a more intuitive and efficient solution for users. We discuss the design and implementation of HORSE, highlight its potential applications in fields such as design error detection and knowledge management, and suggest future directions for research to further refine the system's metrics and capabilities.

physics.soc-ph [Back]

[166] Network Alignment

Rui Tang,Ziyun Yong,Shuyu Jiang,Xingshu Chen,Yaofang Liu,Yi-Cheng Zhang,Gui-Quan Sun,Wei Wang

Main category: physics.soc-ph

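TLDR: This review surveys the latest advances in network alignment research, analyzes network alignment characteristics and methods across domains, and discusses challenges and open issues for future research.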

Details

Motivation: Network alignment matters for understanding the structure and behavior of complex systems, validating theoretical physics research, and enabling cross-domain applications. However, because network structures and properties differ across fields, research is often conducted in isolation, and terminology and concepts lack uniformity. Method: Reviews methods based on structure consistency, network embedding, and graph neural networks (GNNs), and analyzes alignment methods for different network types (attributed, heterogeneous, directed, and dynamic networks). Result: Compares the implementation principles, processes, and performance differences of the methods in detail and summarizes progress in each domain. Conclusion: Presents challenges and open issues for future research, emphasizing the importance of cross-domain unification and methodological improvement in network alignment research. Abstract: Complex networks are frequently employed to model physical or virtual complex systems. When certain entities exist across multiple systems simultaneously, unveiling their corresponding relationships across the networks becomes crucial. This problem, known as network alignment, holds significant importance. It enhances our understanding of complex system structures and behaviours, facilitates the validation and extension of theoretical physics research on complex systems, and fosters diverse practical applications across various fields. However, due to variations in the structure, characteristics, and properties of complex networks across different fields, the study of network alignment is often isolated within each domain, with even the terminologies and concepts lacking uniformity. This review comprehensively summarizes the latest advancements in network alignment research, focusing on analyzing network alignment characteristics and progress in various domains such as social network analysis, bioinformatics, computational linguistics and privacy protection. It provides a detailed analysis of various methods' implementation principles, processes, and performance differences, including structure consistency-based methods, network embedding-based methods, and graph neural network-based (GNN-based) methods. Additionally, the methods for network alignment under different conditions, such as in attributed networks, heterogeneous networks, directed networks, and dynamic networks, are presented. Furthermore, the challenges and the open issues for future studies are also discussed.

physics.flu-dyn [Back]

[167] Visual anemometry of natural vegetation from their leaf motion

Roni H. Goldshmid,John O. Dabiri,John E. Sader

Main category: physics.flu-dyn

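TLDR: The paper proposes a remote wind speed measurement method based on the leaf motion of vegetation, applicable at low-to-moderate wind speeds, that computes wind speed from parameters such as leaf size and leaf speed.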

Details

Motivation: High-resolution, near-ground wind speed data are critical for weather prediction, climate models, wildfire control, and aviation safety, but existing methods are costly or complex. Method: By analyzing leaf motion across a variety of vegetation, proposes a wind speed formula based on leaf size, leaf speed, air viscosity, and air density. Result: The formula is validated through laboratory and field tests and applies to diverse vegetation types such as oak and olive trees. Conclusion: The method offers a new low-cost, remote paradigm for wind speed measurement. Abstract: High-resolution, near-ground wind-speed data are critical for improving the accuracy of weather predictions and climate models,$^{1-3}$ supporting wildfire control efforts,$^{4-7}$ and ensuring the safe passage of airplanes during takeoff and landing maneuvers.$^{8,9}$ Quantitative wind speed anemometry generally employs on-site instrumentation for accurate single-position data or sophisticated remote techniques such as Doppler radar for quantitative field measurements. It is widely recognized that the wind-induced motion of vegetation depends in a complex manner on their structure and mechanical properties, obviating their use in quantitative anemometry.$^{10-14}$ We analyze measurements on a host of different vegetation showing that leaf motion can be decoupled from the leaf's branch and support structure, at low-to-moderate wind speed, $U_{wind}$. This wind speed range is characterized by a leaf Reynolds number, enabling the development of a remote, quantitative anemometry method based on the formula, $U_{wind}\approx740\sqrt{{\mu}U_{leaf}/{\rho}D}$, that relies only on the leaf size $D$, its measured fluctuating (RMS) speed $U_{leaf}$, the air viscosity $\mu$, and its mass density $\rho$. This formula is corroborated by a first-principles model and validated using a host of laboratory and field tests on diverse vegetation types, ranging from oak, olive, and magnolia trees through to camphor and bullgrass. The findings of this study open the door to a new paradigm in anemometry, using natural vegetation to enable remote and rapid quantitative field measurements at global locations with minimal cost.
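
The formula in the abstract can be evaluated directly. The sketch below assumes SI units and reads the expression as $740\sqrt{\mu U_{leaf}/(\rho D)}$, which is the dimensionally consistent reading (the square root then has units of m/s). The default air properties (viscosity 1.8e-5 Pa·s, density 1.2 kg/m^3) are typical room-temperature values supplied here as assumptions, and the sample leaf measurements are hypothetical.

```python
import math


def wind_speed(u_leaf, leaf_size, mu=1.8e-5, rho=1.2, coeff=740.0):
    """Estimate wind speed (m/s) from RMS leaf speed (m/s) and leaf size (m).

    mu is the dynamic viscosity of air (Pa*s) and rho its density (kg/m^3);
    both defaults are assumed typical values, not taken from the paper.
    """
    if u_leaf < 0 or leaf_size <= 0 or mu <= 0 or rho <= 0:
        raise ValueError("inputs must be physically meaningful")
    return coeff * math.sqrt(mu * u_leaf / (rho * leaf_size))


# Hypothetical measurement: a 5 cm leaf fluctuating at 1 cm/s RMS.
estimate = wind_speed(u_leaf=0.01, leaf_size=0.05)  # roughly 1.28 m/s
```

Because of the square root, quadrupling the measured leaf speed only doubles the estimated wind speed. The abstract limits the formula to low-to-moderate wind speeds, so an estimate outside that range should not be trusted.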

cs.AI [Back]

[168] Toward Super Agent System with Hybrid AI Routers

Yuhang Yao,Haixin Wang,Yibo Chen,Jiawen Wang,Min Chang Jordan Ren,Bosheng Ding,Salman Avestimehr,Chaoyang He

Main category: cs.AI

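TLDR: The paper proposes a super agent system design that uses intent detection and task routing, combining local and cloud models, for efficient and low-cost AI agent deployment.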

Details

Motivation: To address the efficiency, cost, and privacy challenges of existing AI agents, making them better suited to real-world use and large-scale deployment. Method: After detecting the user's intent, the system routes the request to specialized task agents or automatically generates workflows, dynamically selecting between local and cloud models. Result: Proposes a hybrid-mode architecture that supports local-cloud collaboration and optimizes performance and cost. Conclusion: Super agents will become more widely integrated into everyday life, driven by advances in multi-modality models and edge hardware. Abstract: AI Agents powered by Large Language Models are transforming the world through numerous applications. A super agent has the potential to fulfill diverse user needs, such as summarization, coding, and research, by accurately understanding user intent and leveraging the appropriate tools to solve tasks. However, to make such an agent viable for real-world deployment and accessible at scale, significant optimizations are required to ensure high efficiency and low cost. This paper presents a design of the Super Agent System. Upon receiving a user prompt, the system first detects the intent of the user, then routes the request to specialized task agents with the necessary tools or automatically generates agentic workflows. In practice, most applications directly serve as AI assistants on edge devices such as phones and robots. As different language models vary in capability and cloud-based models often entail high computational costs, latency, and privacy concerns, we then explore the hybrid mode where the router dynamically selects between local and cloud models based on task complexity. Finally, we introduce the blueprint of an on-device super agent enhanced with cloud. With advances in multi-modality models and edge hardware, we envision that most computations can be handled locally, with cloud collaboration only as needed. Such architecture paves the way for super agents to be seamlessly integrated into everyday life in the near future.
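
The two routing decisions described in the abstract (intent detection, then a local-versus-cloud choice based on task complexity) can be sketched as follows. Everything here is a toy stand-in: the keyword table, the complexity heuristic, and the 0.5 threshold are hypothetical, and the paper presumably uses learned models for both steps.

```python
from dataclasses import dataclass

# Hypothetical keyword table; the intents mirror the examples in the abstract.
INTENT_KEYWORDS = {
    "summarization": ("summarize", "summary", "tl;dr"),
    "coding": ("code", "function", "bug", "implement"),
    "research": ("paper", "survey", "literature", "research"),
}


def detect_intent(prompt):
    """Return the first intent whose keywords appear in the prompt."""
    text = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "general"


def estimate_complexity(prompt):
    """Toy complexity score in [0, 1] based on length and multi-step cues."""
    n_words = len(prompt.split())
    multi_step = sum(prompt.lower().count(cue) for cue in (" then ", "step by step"))
    return min(1.0, n_words / 200 + 0.2 * multi_step)


@dataclass
class Route:
    intent: str
    backend: str  # "local" or "cloud"
    complexity: float


def route(prompt, threshold=0.5):
    """Send simple requests to a local model and complex ones to the cloud."""
    complexity = estimate_complexity(prompt)
    backend = "cloud" if complexity >= threshold else "local"
    return Route(detect_intent(prompt), backend, complexity)


short_request = route("Summarize this note")
long_request = route("word " * 150)
```

Keeping the threshold as a parameter reflects the tradeoff the abstract names: lowering it sends more traffic to the cloud (higher cost, latency, and privacy exposure), while raising it keeps more work on the device.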

[169] ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search

Yize Zhang,Tianshu Wang,Sirui Chen,Kun Wang,Xingyu Zeng,Hongyu Lin,Xianpei Han,Le Sun,Chaochao Lu

Main category: cs.AI

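TLDR: The ARise framework combines risk assessment with dynamic retrieval-augmented generation to significantly improve complex reasoning in open-ended scenarios.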

Details

Motivation: Existing methods perform poorly in open-ended scenarios and suffer from error propagation and a verification bottleneck. Method: Proposes the ARise framework, which integrates risk assessment with dynamic retrieval-augmented generation and uses Monte Carlo tree search to optimize reasoning plans. Result: In experiments, ARise outperforms state-of-the-art KAR methods by up to 23.10% and the latest RAG-equipped large reasoning models by up to 25.37%. Conclusion: ARise effectively handles complex reasoning in open-ended scenarios and significantly outperforms existing methods. Abstract: Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%.

[170] Enhancing multimodal analogical reasoning with Logic Augmented Generation

Anna Sofia Lippolis,Andrea Giovanni Nuzzolese,Aldo Gangemi

Main category: cs.AI

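TLDR: The paper applies a logic-augmented generation (LAG) framework that extracts implicit analogical knowledge using semantic knowledge graphs and prompt heuristics, improving performance on metaphor detection and understanding tasks.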

Details

Motivation: Large language models perform well in natural language processing, but extracting implicit knowledge remains a challenge because they lack experience with the physical world. Semantic knowledge graphs can serve as conceptual spaces that guide generation toward more efficient and explainable results. Method: Combines semantic knowledge graphs with prompt heuristics to generate extended knowledge graph triples, supporting reasoning on unlabeled multimodal data across domains. Result: On three metaphor tasks across four datasets, the method surpasses baselines and even outperforms humans in understanding visual metaphors, with more explainable reasoning, though it remains limited on domain-specific metaphors. Conclusion: The LAG framework performs well at implicit knowledge extraction and reasoning, but metaphor understanding has room for improvement, and annotation and evaluation methods need further refinement. Abstract: Recent advances in Large Language Models have demonstrated their capabilities across a variety of tasks. However, automatically extracting implicit knowledge from natural language remains a significant challenge, as machines lack active experience with the physical world. Given this scenario, semantic knowledge graphs can serve as conceptual spaces that guide the automated text generation reasoning process to achieve more efficient and explainable results. In this paper, we apply a logic-augmented generation (LAG) framework that leverages the explicit representation of a text through a semantic knowledge graph and applies it in combination with prompt heuristics to elicit implicit analogical connections. This method generates extended knowledge graph triples representing implicit meaning, enabling systems to reason on unlabeled multimodal data regardless of the domain. We validate our work through three metaphor detection and understanding tasks across four datasets, as they require deep analogical reasoning capabilities. The results show that this integrated approach surpasses current baselines, performs better than humans in understanding visual metaphors, and enables more explainable reasoning processes, though it still has inherent limitations in metaphor understanding, especially for domain-specific metaphors. Furthermore, we propose a thorough error analysis, discussing issues with metaphorical annotations and current evaluation methods.

[171] Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

Chang Yang,Ruiyu Wang,Junzhe Jiang,Qi Jiang,Qinggang Zhang,Yanchen Deng,Shuxin Li,Shuyue Hu,Bo Li,Florian T. Pokorny,Xiao Huang,Xinrun Wang

Main category: cs.AI

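TLDR: This paper presents NPPC, an uncrushable, unhackable, auto-verifiable, and general benchmark for evaluating the reasoning ability of large language models.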

Details

Motivation: Current benchmarks are quickly crushed or easily hacked and cannot keep pace with the rapid progress of LLMs. Method: NPPC comprises three modules: npgym (generates NP-complete problem instances), npsolver (evaluates model performance), and npeval (analyzes model performance). Result: Experiments show that NPPC reduces the performance of advanced LLMs to below 10%, and that DeepSeek-R1 outperforms the other models on most of the NP-complete problems considered. Conclusion: The authors present NPPC as the first ever-scaling, uncrushable, and unhackable reasoning benchmark, a testbed for LLMs on the path toward AGI. Abstract: Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, there are two main issues of current benchmarks: i) these benchmarks can be crushed in a short time (less than 1 year), and ii) these benchmarks may be easily hacked. To handle these issues, we propose the ever-scalingness for building the benchmarks which are uncrushable, unhackable, auto-verifiable and general. This paper presents Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, the NPPC has three main modules: i) npgym, which provides a unified interface of 25 well-known NP-complete problems and can generate any number of instances with any levels of complexities, ii) npsolver, which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval, which provides the comprehensive and ready-to-use tools to analyze the performances of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini in most NP-complete problems considered, and iii) the numbers of tokens and aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then decrease when the problem instances become more and more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as the uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
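
Two properties the abstract relies on, instances of adjustable difficulty and automatic verification, follow from the nature of NP-complete problems: candidate solutions can be checked in polynomial time even when finding one is hard. The sketch below illustrates this with 3-SAT, a classic NP-complete problem. The function names and the clause encoding are assumptions for illustration and are not the npgym interface.

```python
import random


def generate_3sat(num_vars, num_clauses, seed=0):
    """Generate a random 3-SAT instance (requires num_vars >= 3).

    Each clause is a tuple of three nonzero ints: v means variable v is true,
    -v means variable v is false. Difficulty scales with both parameters.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)  # 3 distinct variables
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses


def verify(clauses, assignment):
    """Check a candidate assignment {var: bool} in time linear in the instance."""
    return all(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )


instance = generate_3sat(num_vars=5, num_clauses=10, seed=42)
hand_made = [(1, -2, 3), (-1, 2, 3)]
```

A benchmark built this way never runs out of fresh instances, and a model's answer can be scored without a reference solution, since the verifier only needs the instance and the proposed assignment.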

[172] Towards Automated Safety Requirements Derivation Using Agent-based RAG

Balahari Vignesh Balu,Florian Geissler,Francesco Carella,Joao-Vitor Zacchi,Josef Jiru,Nuria Mata,Reinhard Stolle

Main category: cs.AI

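TLDR: The paper proposes an agent-based retrieval-augmented generation (RAG) approach for automatically deriving safety requirements for self-driving vehicles, addressing the shortcomings of conventional methods in handling complex queries and retrieving relevant information.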

Details

Motivation: Conventional pre-trained LLMs lack domain knowledge for safety analyses, and existing RAG approaches degrade on complex queries and struggle to retrieve the most relevant information, which matters particularly in safety-relevant applications. Method: Applies agent-based RAG to a document pool of automotive standards and the Apollo case study; the test dataset consists of safety requirement questions and answers extracted from the Apollo data. Result: Experiments show that agent-based RAG retrieves more relevant information than default RAG methods. Conclusion: Agent-based RAG performs better for safety requirements derivation and offers a more effective approach to safety analysis for automated driving. Abstract: We study the automated derivation of safety requirements in a self-driving vehicle use case, leveraging LLMs in combination with agent-based retrieval-augmented generation. Conventional approaches that utilise pre-trained LLMs to assist in safety analyses typically lack domain-specific knowledge. Existing RAG approaches address this issue, yet their performance deteriorates when handling complex queries and it becomes increasingly harder to retrieve the most relevant information. This is particularly relevant for safety-relevant applications. In this paper, we propose the use of agent-based RAG to derive safety requirements and show that the retrieved information is more relevant to the queries. We implement an agent-based approach on a document pool of automotive standards and the Apollo case study, as a representative example of an automated driving perception system. Our solution is tested on a data set of safety requirement questions and answers, extracted from the Apollo data. Evaluating a set of selected RAG metrics, we present and discuss advantages of an agent-based approach compared to default RAG methods.

cs.HC [Back]

[173] UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

Xinyi Liu,Xiaoyi Zhang,Ziyun Zhang,Yan Lu

Main category: cs.HC

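TLDR: The paper addresses vision-based GUI instruction grounding, using the UI-E2I-Synth data synthesis pipeline to generate complex instruction datasets and introducing the new UI-I2E-Bench benchmark; the resulting model performs strongly.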

Details

Motivation: Approaches based on GUI metadata are limited by platform and implementation differences, whereas vision-based approaches are more broadly applicable but lack public training data and efficient annotation methods. Method: Uses GPT-4o to generate synthetic data, and proposes the UI-E2I-Synth pipeline and the UI-I2E-Bench benchmark. Result: The model performs strongly on GUI instruction grounding, validating the data synthesis pipeline. Conclusion: The proposed method and benchmark provide practical tools and directions for future GUI grounding research. Abstract: Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on the given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual instruction data annotation. In this paper, we delve into unexplored challenges in this task including element-to-screen ratio, unbalanced element type, and implicit instruction. To address these challenges, we introduce a large-scale data synthesis pipeline UI-E2I-Synth for generating varying complex instruction datasets using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release corresponding artifacts at https://colmon46.github.io/i2e-bench-leaderboard/

[174] The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections

Chaoran Chen,Zhiping Zhang,Bingcan Guo,Shang Ma,Ibrahim Khalilov,Simret A Gebreegziabher,Yanfang Ye,Ziang Xiao,Yaxing Yao,Tianshi Li,Toby Jia-Jun Li

Main category: cs.HC

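TLDR: This paper studies the privacy and security vulnerabilities of LLM-powered GUI agents, characterizes six attack types, experimentally demonstrates their harm, and proposes defense strategies.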

Details

Motivation: GUI agents must process sensitive data to complete tasks, but their autonomy introduces new privacy and security risks: adversaries may use malicious content to manipulate agent behavior or leak private information. Method: Experimentally tests six attack types against six state-of-the-art GUI agents, using 234 adversarial webpages and 39 human participants. Result: GUI agents are highly vulnerable to contextually embedded threats, and human oversight cannot reliably prevent the attacks. Conclusion: Highlights the importance of privacy-aware design and proposes practical defense strategies to improve the safety and reliability of GUI agents. Abstract: A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user's behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring necessary sequences of actions, and then interacting with GUIs by executing the actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into the GUIs that alters agent behaviors or induces unintended disclosures of private information. These attacks often exploit the discrepancy between visual saliency for agents and human users, or the agent's limited ability to detect violations of contextual integrity in task automation. In this paper, we characterized six types of such attacks, and conducted an experimental study to test these attacks with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.

[175] Interactivity x Explainability: Toward Understanding How Interactivity Can Improve Computer Vision Explanations

Indu Panigrahi,Sunnie S. Y. Kim,Amna Liaqat,Rohan Jinturkar,Olga Russakovsky,Ruth Fong,Parastoo Abtahi

Main category: cs.HC

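TLDR: The study examines the role of interactivity in explanations of computer vision models, finding that it improves user control and understanding but also introduces new challenges, and offers design recommendations.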

Details

Motivation: Static explanation formats suffer from information overload and a gap between semantic and pixel-level information, problems that interactivity may address. Method: A study (N=24) using a bird identification task examines interactivity in three explanation types (heatmap-based, concept-based, and prototype-based). Result: Interactivity enhances user control and understanding but also introduces new challenges. Conclusion: Design recommendations are proposed, including carefully selected default views, independent input controls, and constrained output spaces. Abstract: Explanations for computer vision models are important tools for interpreting how the underlying models work. However, they are often presented in static formats, which pose challenges for users, including information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. We investigate interactivity as a mechanism for tackling these issues in three common explanation types: heatmap-based, concept-based, and prototype-based explanations. We conducted a study (N=24), using a bird identification task, involving participants with diverse technical and domain expertise. We found that while interactivity enhances user control, facilitates rapid convergence to relevant information, and allows users to expand their understanding of the model and explanation, it also introduces new challenges. To address these, we provide design recommendations for interactive computer vision explanations, including carefully selected default views, independent input controls, and constrained output spaces.

cs.LG [Back]

[176] GPT Meets Graphs and KAN Splines: Testing Novel Frameworks on Multitask Fine-Tuned GPT-2 with LoRA

Gabriel Bo,Marc Bernardino,Justin Gu

Main category: cs.LG

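TLDR: The paper explores integrating learnable, interpretable modules (such as KAN and GAT) into a pre-trained GPT-2 model to improve multi-task learning accuracy, but finds that an optimized LoRA-enhanced transformer performs best.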

Details

Motivation: Inspired by the use of KAN and GAT in CoT models and the debate over their benefits relative to simpler architectures such as MLPs, the study aims to improve multi-task learning performance through architectural changes. Method: A standard self-attention transformer is enhanced with LoRA, hyperparameter fine-tuning, and L2 regularization, and two variants are developed: Graph LoRA and Hybrid-KAN LoRA. Result: The optimized LoRA-enhanced transformer performs best across tasks (55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, 89.9% paraphrase detection accuracy, and a CHRF score of 42.097). Conclusion: Parameter adaptation via LoRA is the most effective strategy for the sentiment analysis, paraphrase detection, and sonnet generation tasks. Abstract: We explore the potential of integrating learnable and interpretable modules--specifically Kolmogorov-Arnold Networks (KAN) and graph-based representations--within a pre-trained GPT-2 model to enhance multi-task learning accuracy. Motivated by the recent surge in using KAN and graph attention (GAT) architectures in chain-of-thought (CoT) models and debates over their benefits compared to simpler architectures like MLPs, we begin by enhancing a standard self-attention transformer using Low-Rank Adaptation (LoRA), fine-tuning hyperparameters, and incorporating L2 regularization. This approach yields significant improvements. To further boost interpretability and richer representations, we develop two variants that attempt to improve the standard KAN and GAT: Graph LoRA and Hybrid-KAN LoRA (Learnable GPT). However, systematic evaluations reveal that neither variant outperforms the optimized LoRA-enhanced transformer, which achieves 55.249% accuracy on the SST test set, 99.18% on the CFIMDB dev set, and 89.9% paraphrase detection test accuracy. On sonnet generation, we get a CHRF score of 42.097. These findings highlight that efficient parameter adaptation via LoRA remains the most effective strategy for our tasks: sentiment analysis, paraphrase detection, and sonnet generation.
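For readers unfamiliar with LoRA, the low-rank update it adds to a frozen weight matrix can be sketched as follows. This is a generic illustration of the technique in plain Python, not the authors' code; the dimensions, the rank `r`, the scaling `alpha`, and every function name are assumptions chosen for illustration.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(w, x)               # frozen pre-trained path
    update = matvec(b, matvec(a, x))  # low-rank path: (d_out x r) times (r x d_in)
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. the LoRA factors A and B."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 8.0
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # A: small random init
b = [[0.0] * r for _ in range(d_out)]                                  # B: zero init
x = [1.0] * d_in

# With B initialised to zero, the adapted layer reproduces the frozen layer.
assert lora_forward(w, a, b, x, alpha, r) == matvec(w, x)
print(lora_param_counts(768, 768, 8))
```

The parameter count is the point of the design: for a hypothetical 768 x 768 projection, rank 8 trains 12,288 values instead of 589,824.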

[177] How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

Ming Li,Yanhong Li,Ziyue Li,Tianyi Zhou

Main category: cs.LG

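TLDR: The paper uses spectral analysis to study how data of different quality affects LLM fine-tuning dynamics, finds that spectral properties from the singular value decomposition of gradients can explain and unify data evaluation metrics, and reveals the relationship between data quality and training stability.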

Details

Motivation: To explore how data of different quality (low/high-quality instruction and reasoning data) affects LLM fine-tuning dynamics, a largely unexplored question. Method: Spectral analysis of the singular value decomposition (SVD) of layer-wise gradients, relating data quality to gradient structure. Result: Higher-quality data is usually associated with lower nuclear norms and higher effective ranks; reasoning data has substantially higher effective ranks than instruction data, indicating richer gradient structures on more complex tasks. Conclusion: The study gives a unified view of how data quality affects fine-tuning, reveals the relationship between data quality and training stability, and offers new insights for data exploration strategies. Abstract: As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
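The two spectral metrics named above can be computed directly from a gradient matrix's singular values. The abstract does not define them, so this minimal sketch assumes common definitions: the nuclear norm as the sum of singular values, and the effective rank as the exponential of the Shannon entropy of the normalised singular values. The two example spectra are made up.

```python
import math


def nuclear_norm(singular_values):
    """Sum of singular values."""
    return sum(singular_values)


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalised to sum to one (assumed definition)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Two hypothetical spectra with (almost exactly) the same nuclear norm.
flat = [1.0, 1.0, 1.0, 1.0]    # energy spread over four directions
peaked = [3.7, 0.1, 0.1, 0.1]  # energy concentrated in one direction

for name, spectrum in (("flat", flat), ("peaked", peaked)):
    print(name, nuclear_norm(spectrum), effective_rank(spectrum))
```

Both spectra have essentially the same nuclear norm, yet the flat one has an effective rank near 4 and the peaked one well below 2, which illustrates how effective rank can separate cases the nuclear norm cannot. In practice the singular values would come from an SVD routine such as `numpy.linalg.svd(grad, compute_uv=False)`.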

[178] Looking beyond the next token

Abitha Thankaraj,Yiding Jiang,J. Zico Kolter,Yonatan Bisk

Main category: cs.LG

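TLDR: The paper proposes Trelawney, a method that rearranges and processes training data sequences so that models imitate the true data-generating process more accurately without any change to the architecture, improving performance on several tasks.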

Details

Motivation: The training structure of causal language models does not match how humans write and reason; prior work assumed architectural changes were needed, whereas this paper argues that data processing alone suffices. Method: Trelawney rearranges and processes training data sequences, with no change to the model architecture or training infrastructure. Result: Trelawney improves performance on planning, algorithmic reasoning, and story generation tasks, and naturally enables the generation of long-term goals. Conclusion: Beyond improving existing tasks, Trelawney may open up new capabilities for language models. Abstract: The structure of causal language model training assumes that each token can be accurately predicted from the previous context. This contrasts with humans' natural writing and reasoning process, where goals are typically known before the exact argument or phrasings. While this mismatch has been well studied in the literature, the working assumption has been that architectural changes are needed to address this mismatch. We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process, and does not require any other changes to the architecture or training infrastructure. We demonstrate that this technique, Trelawney, and the inference algorithms derived from it allow us to improve performance on several key benchmarks that span planning, algorithmic reasoning, and story generation tasks. Finally, our method naturally enables the generation of long-term goals at no additional cost. We investigate how using the model's goal-generation capability can further improve planning and reasoning. Additionally, we believe Trelawney could potentially open doors to new capabilities beyond the current language modeling paradigm.

[179] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Wei Xiong,Jiarui Yao,Yuhui Xu,Bo Pang,Lei Wang,Doyen Sahoo,Junnan Li,Nan Jiang,Tong Zhang,Caiming Xiong,Hanze Dong

Main category: cs.LG

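TLDR: GRPO's effectiveness stems mainly from discarding entirely incorrect samples rather than from reward normalization. RAFT performs strongly as a simple baseline, and Reinforce-Rej is proposed as a more efficient alternative.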

Details

Motivation: To understand why GRPO succeeds in LLM fine-tuning and to find more efficient, more stable alternatives. Method: The core components of GRPO are analyzed and compared against RAFT, a rejection sampling baseline, and the proposed Reinforce-Rej. Result: RAFT performs on par with GRPO and PPO, and Reinforce-Rej improves KL efficiency and stability. Conclusion: Future work should focus on principled use of negative samples; RAFT can serve as a robust baseline. Abstract: Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields performance competitive with GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
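The two sample-selection rules described in the abstract can be sketched as simple filters over sampled responses. This is an illustrative reading, not the authors' implementation: it assumes binary rewards (1 for correct, 0 for incorrect), and the data layout, function names, and example prompts are hypothetical.

```python
def raft_select(groups):
    """RAFT-style selection: keep only positively rewarded responses."""
    return {
        prompt: [resp for resp, reward in samples if reward > 0]
        for prompt, samples in groups.items()
        if any(reward > 0 for _, reward in samples)
    }


def reinforce_rej_filter(groups):
    """Drop prompts whose sampled responses are all incorrect or all correct."""
    kept = {}
    for prompt, samples in groups.items():
        rewards = [reward for _, reward in samples]
        if 0 < sum(rewards) < len(rewards):  # mixed outcomes only
            kept[prompt] = samples
    return kept


# Hypothetical sampled responses: prompt -> [(response, binary reward), ...]
groups = {
    "p1": [("a", 1), ("b", 0)],  # mixed
    "p2": [("c", 0), ("d", 0)],  # entirely incorrect
    "p3": [("e", 1), ("f", 1)],  # entirely correct
}

print(raft_select(groups))
print(reinforce_rej_filter(groups))
```

In this sketch RAFT discards the negative samples themselves, while the Reinforce-Rej filter keeps both positive and negative samples but only for prompts with mixed outcomes; the policy-gradient update that would follow is omitted.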

[180] Teaching Large Language Models to Reason through Learning and Forgetting

Tianwei Ni,Allen Nie,Sapana Chaudhary,Yao Liu,Huzefa Rangwala,Rasool Fakoor

Main category: cs.LG

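TLDR: Integrating search capability directly into the model by fine-tuning on both successful and failed reasoning paths substantially improves its ability to solve complex mathematical problems while greatly reducing inference time.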

Details

Motivation: Inference-time search improves model capability but significantly increases computational cost and inference time, so a more efficient approach is needed. Method: The model is fine-tuned on both successful (learning) and failed (forgetting) reasoning paths, with a smaller learning rate to prevent degradation of its search capability. Result: On the Game-of-24 and Countdown benchmarks the approach outperforms standard fine-tuning and inference-time search while reducing inference time by 180x. Conclusion: The method improves model performance and substantially reduces inference time, offering an efficient route to complex problem solving. Abstract: Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it using both successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. While fine-tuning the model with these data might seem straightforward, we identify a critical issue: the model's search capability tends to degrade rapidly if fine-tuning is performed naively. We show that this degradation can be substantially mitigated by employing a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown mathematical reasoning benchmarks show that our approach not only outperforms both standard fine-tuning and inference-time search baselines but also significantly reduces inference time by 180$\times$.

[181] DataDecide: How to Predict Best Pretraining Data with Small Experiments

Ian Magnusson,Nguyen Tai,Ben Bogin,David Heineman,Jena D. Hwang,Luca Soldaini,Akshita Bhagia,Jiacheng Liu,Dirk Groeneveld,Oyvind Tafjord,Noah A. Smith,Pang Wei Koh,Jesse Dodge

Main category: cs.LG

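TLDR: The paper examines how to choose pretraining datasets from small-scale experiments to cut costs, and releases the DataDecide suite to support such research. Rankings of small models predict large-model performance well, and continuous likelihood metrics serve as efficient proxies in small experiments.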

Details

Motivation: To reduce the cost of pretraining large language models by selecting the best dataset from small-scale experiments. Method: Controlled pretraining experiments across 25 corpora, with model sizes up to 1B parameters, evaluate the predictive power of small-scale model rankings and of 8 scaling law methods. Result: Rankings at a single small size (e.g., 150M parameters) predict the best models at the 1B target scale in about 80% of comparisons, and continuous likelihood metrics are efficient proxies in small experiments. Conclusion: DataDecide provides a benchmark for future scaling law research, and small-scale experiments can cut costs substantially while retaining predictive accuracy. Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
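One plausible reading of "~80% of comparisons correct" is a pairwise decision accuracy: for every pair of pretraining corpora, check whether the small-scale experiment picks the same winner as the target-scale one. The sketch below assumes that reading; the function name and the benchmark scores are hypothetical and do not come from the paper.

```python
from itertools import combinations


def decision_accuracy(small_scores, large_scores):
    """Fraction of corpus pairs ranked the same way at small and large scale."""
    pairs = list(combinations(sorted(small_scores), 2))
    correct = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return correct / len(pairs)


# Made-up benchmark scores for four corpora at a small and a large model size.
small = {"A": 0.31, "B": 0.35, "C": 0.33, "D": 0.30}
large = {"A": 0.42, "B": 0.50, "C": 0.41, "D": 0.40}

print(decision_accuracy(small, large))
```

Here five of the six pairs agree; only the A versus C comparison flips between scales.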

[182] LEMUR Neural Network Dataset: Towards Seamless AutoML

Arash Torabi Goodarzi,Roman Kochnev,Waleed Khalid,Furui Qin,Tolgay Atinc Uzun,Yashkumar Sanjaybhai Dhameliya,Yash Kanubhai Kathiriya,Zofia Antonina Bentyn,Dmitry Ignatov,Radu Timofte

Main category: cs.LG

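TLDR: LEMUR is an open-source dataset of neural network models that supports AutoML tasks, provides structured model representations and performance data, and is suited to resource-constrained environments.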

Details

Motivation: High-quality datasets are crucial to neural network development, and LEMUR aims to support AutoML tasks and model analysis. Method: Built on Python and PyTorch, LEMUR provides well-structured model code and performance data and integrates an Optuna-powered framework for optimization and analysis. Result: LEMUR supports model evaluation, preprocessing, and database management, and offers an API and deployment on edge devices. Conclusion: LEMUR gives researchers and practitioners tools for developing and testing neural networks and will be released as an open-source project. Abstract: Neural networks are fundamental in artificial intelligence, driving progress in computer vision and natural language processing. High-quality datasets are crucial for their development, and there is growing interest in datasets composed of neural networks themselves to support benchmarking, automated machine learning (AutoML), and model analysis. We introduce LEMUR, an open source dataset of neural network models with well-structured code for diverse architectures across tasks such as object detection, image classification, segmentation, and natural language processing. LEMUR is primarily designed to enable fine-tuning of large language models (LLMs) for AutoML tasks, providing a rich source of structured model representations and associated performance data. Leveraging Python and PyTorch, LEMUR enables seamless extension to new datasets and models while maintaining consistency. It integrates an Optuna-powered framework for evaluation, hyperparameter optimization, statistical analysis, and graphical insights. LEMUR provides an extension that enables models to run efficiently on edge devices, facilitating deployment in resource-constrained environments. Providing tools for model evaluation, preprocessing, and database management, LEMUR supports researchers and practitioners in developing, testing, and analyzing neural networks. Additionally, it offers an API that delivers comprehensive information about neural network models and their complete performance statistics with a single request, which can be used in experiments with code-generating large language models. LEMUR will be released as an open source project under the MIT license upon acceptance of the paper.

[183] Power-scaled Bayesian Inference with Score-based Generative Models

Huseyin Tuna Erdinc,Yunlin Zeng,Abhinav Prakash Gahlot,Felix J. Herrmann

Main category: cs.LG

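TLDR: Proposes a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework, giving flexible control over prior and likelihood influence without retraining.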

Details

Motivation: To study how the influence of the prior and the likelihood can be adjusted flexibly in Bayesian inference to improve posterior sampling, in particular for synthesizing seismic velocity models. Method: A score-based generative algorithm supports power scaling of the prior and the likelihood and enables sensitivity analysis by sampling from intermediate power posteriors. Result: Increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power increases the structural diversity of samples. Conclusion: The method gives flexible control over prior and likelihood influence in Bayesian inference and is a practical tool for posterior refinement. Abstract: We propose a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework. Our algorithm enables flexible control over prior-likelihood influence without requiring retraining for different power-scaling configurations. Specifically, we focus on synthesizing seismic velocity models conditioned on imaged seismic. Our method enables sensitivity analysis by sampling from intermediate power posteriors, allowing us to assess the relative influence of the prior and likelihood on samples of the posterior distribution. Through a comprehensive set of experiments, we evaluate the effects of varying the power parameter in different settings: applying it solely to the prior, to the likelihood of a Bayesian formulation, and to both simultaneously. The results show that increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power promotes greater structural diversity among samples. Moreover, we find that moderate scaling of the likelihood leads to a reduced shot data residual, confirming its utility in posterior refinement.
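The effect of the power parameters can be illustrated in a one-dimensional conjugate Gaussian case, where a posterior proportional to p(x)^alpha * p(y|x)^beta has a closed form and its score is alpha times the prior score plus beta times the likelihood score. The exponents `alpha` and `beta`, the toy model, and the function names are illustrative assumptions; the paper's score-based sampler and seismic setting are not reproduced here.

```python
def prior_score(x, m0, s0):
    """d/dx log N(x; m0, s0^2)."""
    return -(x - m0) / s0**2


def likelihood_score(x, y, s):
    """d/dx log N(y; x, s^2)."""
    return (y - x) / s**2


def power_scaled_score(x, y, m0, s0, s, alpha, beta):
    """Score of the power-scaled posterior p(x)^alpha * p(y|x)^beta."""
    return alpha * prior_score(x, m0, s0) + beta * likelihood_score(x, y, s)


def power_scaled_posterior(y, m0, s0, s, alpha, beta):
    """Mean and variance of the (Gaussian) power-scaled posterior."""
    precision = alpha / s0**2 + beta / s**2
    mean = (alpha * m0 / s0**2 + beta * y / s**2) / precision
    return mean, 1.0 / precision


# Toy setting: prior N(0, 1), observation y = 2 with unit noise.
for alpha, beta in ((1.0, 1.0), (1.0, 3.0), (0.5, 1.0)):
    print(alpha, beta, power_scaled_posterior(2.0, 0.0, 1.0, 1.0, alpha, beta))
```

In this toy case, raising `beta` pulls the posterior mean toward the observation and narrows it, while lowering `alpha` widens the posterior, which mirrors the fidelity and diversity trends reported in the abstract.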

[184] Towards Spatially-Aware and Optimally Faithful Concept-Based Explanations

Shubham Kumar,Dwip Dalal,Narendra Ahuja

Main category: cs.LG

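TLDR: The paper proposes Surrogate Faithfulness (SF), a new evaluation method that measures the faithfulness of unsupervised concept-based explanation methods (U-CBEMs) more accurately, and validates it experimentally.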

Details

Motivation: Existing faithfulness metrics for U-CBEMs have limitations, most notably ignoring how concepts are spatially distributed, which undermines accurate evaluation. Method: SF introduces a spatially-aware surrogate and two new faithfulness metrics, and is used to produce Optimally Faithful (OF) explanations. Result: OF explanations are significantly more faithful than prior U-CBEMs (30% or higher improvement in error) and hold up better on out-of-domain data and adversarial examples. Conclusion: SF addresses the shortcomings of existing evaluation methods and provides more accurate faithfulness evaluation and explanations for U-CBEMs. Abstract: Post-hoc, unsupervised concept-based explanation methods (U-CBEMs) are a promising tool for generating semantic explanations of the decision-making processes in deep neural networks, having applications in both model improvement and understanding. It is vital that the explanation is accurate, or faithful, to the model, yet we identify several limitations of prior faithfulness metrics that inhibit an accurate evaluation; most notably, prior metrics involve only the set of concepts present, ignoring how they may be spatially distributed. We address these limitations with Surrogate Faithfulness (SF), an evaluation method that introduces a spatially-aware surrogate and two novel faithfulness metrics. Using SF, we produce Optimally Faithful (OF) explanations, where concepts are found that maximize faithfulness. Our experiments show that (1) adding spatial-awareness to prior U-CBEMs increases faithfulness in all cases; (2) OF produces significantly more faithful explanations than prior U-CBEMs (30% or higher improvement in error); (3) OF's learned concepts generalize well to out-of-domain data and are more robust to adversarial examples, where prior U-CBEMs struggle.

[185] Meta-learning For Few-Shot Time Series Crop Type Classification: A Benchmark On The EuroCropsML Dataset

Joana Reuss,Jan Macdonald,Simon Becker,Konrad Schultka,Lorenz Richter,Marco Körner

Main category: cs.LG

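TLDR: The paper benchmarks transfer learning and meta-learning algorithms on a real-world crop type classification task, finding that MAML-based algorithms achieve slightly higher accuracy in specific settings at greater computational cost, and that geographic differences hinder knowledge transfer.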

Details

Motivation: To address spatial imbalance in crop type data and to evaluate transfer learning and meta-learning algorithms in a real-world application. Method: Transfer learning and meta-learning algorithms (such as MAML, ANIL, and TIML) are compared on the EuroCropsML dataset. Result: MAML-based algorithms achieve slightly higher accuracy on specific tasks but at higher computational cost; geographic differences significantly affect knowledge transfer. Conclusion: Accuracy must be weighed against computational resources, and geographic disparity is the main obstacle to knowledge transfer; the work provides the first benchmark for real-world crop type classification. Abstract: Spatial imbalances in crop type data pose significant challenges for accurate classification in remote sensing applications. Algorithms aiming at transferring knowledge from data-rich to data-scarce tasks have thus surged in popularity. However, despite their effectiveness in previous evaluations, their performance in challenging real-world applications is unclear and needs to be evaluated. This study benchmarks transfer learning and several meta-learning algorithms, including (First-Order) Model-Agnostic Meta-Learning ((FO)-MAML), Almost No Inner Loop (ANIL), and Task-Informed Meta-Learning (TIML), on the real-world EuroCropsML time series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to simpler transfer learning methods when applied to crop type classification tasks in Estonia after pre-training on data from Latvia. However, this improvement comes at the cost of increased computational demands and training time. Moreover, we find that the transfer of knowledge between geographically disparate regions, such as Estonia and Portugal, poses significant challenges to all investigated algorithms. These insights underscore the trade-offs between accuracy and computational resource requirements in selecting machine learning methods for real-world crop type classification tasks and highlight the difficulties of transferring knowledge between different regions of the Earth. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating transfer and meta-learning methods for crop type classification under real-world conditions.
The corresponding code is publicly available at https://github.com/dida-do/eurocrops-meta-learning.

[186] InfoClus: Informative Clustering of High-dimensional Data Embeddings

Fuyin Lai,Edith Heiter,Guillaume Bied,Jefrey Lijffijt

Main category: cs.LG

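TLDR: The paper proposes InfoClus, a new method that helps users understand and explore data by partitioning and explaining low-dimensional embeddings of high-dimensional data.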

Details

Motivation: Low-dimensional embeddings of high-dimensional data are hard to interpret, so a method is needed that partitions them automatically and explains each part. Method: InfoClus combines information theory with sparse explanations and jointly optimizes the partitioning and explanations through greedy search. Result: Qualitative and quantitative analysis on three data sets shows that InfoClus has advantages over existing methods (RVX and VERA). Conclusion: InfoClus provides good starting points for the analysis of dimensionality-reduction-based scatter plots. Abstract: Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.

[187] Revealing Covert Attention by Analyzing Human and Reinforcement Learning Agent Gameplay

Henrik Krauss,Takehisa Yairi

Main category: cs.LG

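TLDR: Proposes a new method for revealing human covert attention patterns from gameplay data, using reinforcement learning techniques to generate attention maps and comparing them with eye-tracking data.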

Details

Motivation: To study attention differences between humans and AI agents in games and to test whether human covert attention patterns can be revealed from gameplay data alone. Method: The CTR attention network generates sparse attention maps from human and RL agent gameplay, which are compared with the eye-tracking-based TIOA model. Result: Human CTR attention maps align more closely with the TIOA model; human attention concentrates on the player and nearby opponents, whereas agent attention is broader. Conclusion: The CTR network can reveal human covert attention patterns from gameplay data without additional data, laying groundwork for RL agents with human-like attention. Abstract: This study introduces a novel method for revealing human covert attention patterns using gameplay data alone, utilizing offline attention techniques from reinforcement learning (RL). We propose the contextualized, task-relevant (CTR) attention network, which generates attention maps from both human and RL agent gameplay in Atari environments. These maps are sparse yet retain the necessary information for the current player's decision making. We compare the CTR-derived attention maps with a temporally integrated overt attention (TIOA) model based on eye-tracking data, serving as a point of comparison and discussion. Visual inspection reveals distinct attention patterns: human CTR maps focus on the player and rather nearby opponents, occasionally shifting between stronger focus and broader views - sometimes even attending to empty space ahead. In contrast, agent maps maintain a consistent broad focus on most objects, including distant ones and the player. Quantitative analysis further demonstrates that human CTR maps align more closely with TIOA than agent maps do. Our findings indicate that the CTR attention network can effectively reveal human covert attention patterns from gameplay alone, without the need for additional data like brain activity recordings. This work contributes to understanding human-agent attention differences and enables the development of RL agents augmented with human covert attention.

[188] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

Lijun Sheng,Jian Liang,Zilei Wang,Ran He

Main category: cs.LG

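TLDR: The paper proposes R-TPT, which uses test-time prompt tuning and a reliability-based weighted ensembling strategy to strengthen vision-language models against adversarial attacks, without labeled data and with high flexibility.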

Details

Motivation: Vision-language models such as CLIP face a higher risk of adversarial attacks because of their inherent vulnerability and the common practice of choosing from a small set of open-source models; existing defenses rely on labeled data and lack flexibility. Method: R-TPT optimizes a reformulated marginal entropy objective and adds a reliability-based weighted ensembling strategy to mitigate adversarial attacks at inference time. Result: Extensive experiments under various attacks show that R-TPT is effective at strengthening the defense. Conclusion: R-TPT improves adversarial defense without labeled data while keeping inference tasks flexible; the code is open source. Abstract: Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
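The relationship between the marginal entropy objective and the pointwise entropy term the abstract retains can be checked numerically. For class probabilities predicted on several augmented views, the entropy of the averaged prediction equals the mean per-view entropy plus the mean KL divergence from each view to the average. Reading that KL term as the one R-TPT removes is an assumption based on the abstract's wording; the probabilities and function names below are made up.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl(p, q):
    """KL divergence KL(p || q) for probability vectors with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def decompose(views):
    """Return (marginal entropy, mean pointwise entropy, mean KL to the average)."""
    n = len(views)
    mean_pred = [sum(col) / n for col in zip(*views)]
    marginal = entropy(mean_pred)
    pointwise = sum(entropy(p) for p in views) / n
    divergence = sum(kl(p, mean_pred) for p in views) / n
    return marginal, pointwise, divergence


# Hypothetical class probabilities for three augmented views of one image.
views = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
marginal, pointwise, divergence = decompose(views)

# Identity: H(mean prediction) = mean H(view) + mean KL(view || mean prediction).
assert math.isclose(marginal, pointwise + divergence)
print(marginal, pointwise, divergence)
```

Minimizing only the pointwise term sharpens each view's prediction individually without also forcing the views to agree with one another.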

[189] Mamba-Based Ensemble learning for White Blood Cell Classification

Lewis Clifton,Xin Tian,Duangdao Palasuwan,Phandee Watanaboonyongcharoen,Ponlapat Rojnuckarin,Nantheera Anantrasirichai

Main category: cs.LG

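TLDR: The paper proposes a new framework based on Mamba models and ensemble learning to improve white blood cell classification, addressing data imbalance and limited computational resources.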

Details

Motivation: Manual white blood cell classification is time-consuming and error-prone, and existing deep learning methods, though effective, are limited by data imbalance and computational demands. Method: A new framework combines Mamba models (linear complexity) with ensemble learning, and a new dataset, Chula-WBC-8, is introduced. Result: Mamba models are shown to be effective for white blood cell classification, improving classification efficiency without sacrificing accuracy. Conclusion: Mamba models offer an efficient, scalable solution for white blood cell classification in resource-constrained environments. Abstract: White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of modern technologies, such as Transformer-based models which do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at https://github.com/LewisClifton/Mamba-WBC-Classification.

cs.CY [Back]

[190] Will AI shape the way we speak? The emerging sociolinguistic influence of synthetic voices

Éva Székely,Jūra Miniota,Míša Hejná

Main category: cs.CY

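TLDR: The paper discusses how conversational voice interfaces influence human communication, especially through the social identity cues carried by voice, and calls for interdisciplinary research on the potential societal impact.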

Details

Motivation: As speech and language technologies advance, conversational voice interfaces increasingly influence human communication, and the social identity information carried by voice may have far-reaching societal effects. Method: The paper analyzes acoustic-prosodic entrainment and linguistic accommodation in voice interaction to discuss how voice AI may shape users' speech patterns. Result: Because it is interactive, voice AI may influence users' linguistic habits more strongly than passive media, with potential effects on social identity and public perception. Conclusion: The paper calls for interdisciplinary research on the societal impact of voice AI to better understand its potential to shape and control social identity. Abstract: The growing prevalence of conversational voice interfaces, powered by developments in both speech and language technologies, raises important questions about their influence on human communication. While written communication can signal identity through lexical and stylistic choices, voice-based interactions inherently amplify socioindexical elements - such as accent, intonation, and speech style - which more prominently convey social identity and group affiliation. There is evidence that even passive media such as television is likely to influence the audience's linguistic patterns. Unlike passive media, conversational AI is interactive, creating a more immersive and reciprocal dynamic that holds a greater potential to impact how individuals speak in everyday interactions. Such heightened influence can be expected to arise from phenomena such as acoustic-prosodic entrainment and linguistic accommodation, which occur naturally during interaction and enable users to adapt their speech patterns in response to the system. While this phenomenon is still emerging, its potential societal impact could provide organisations, movements, and brands with a subtle yet powerful avenue for shaping and controlling public perception and social identity. We argue that the socioindexical influence of AI-generated speech warrants attention and should become a focus of interdisciplinary research, leveraging new and existing methodologies and technologies to better understand its implications.

[191] Exploring Persona-dependent LLM Alignment for the Moral Machine Experiment

Jiseon Kim,Jea Kwon,Luiz Felipe Vecchietti,Alice Oh,Meeyoung Cha

Main category: cs.CY

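TLDR: The study examines how well the decisions of large language models (LLMs) in moral dilemmas align with human judgment, finding that decisions vary substantially by persona and that political orientation has a strong influence.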

Details

Motivation: To investigate the moral decision-making behavior of LLMs in real-world applications and how it differs from human judgment. Method: Using the moral machine experiment, the study compares LLM and human decisions under personas reflecting different sociodemographics. Result: LLM decisions vary substantially by persona, and political persona strongly shapes the direction and degree of the decisions. Conclusion: The ethical risks of deploying LLMs in applications that involve moral decisions deserve attention. Abstract: Deploying large language models (LLMs) with agency in real-world applications raises critical questions about how these models will behave. In particular, how will their decisions align with humans when faced with moral dilemmas? This study examines the alignment between LLM-driven decisions and human judgment in various contexts of the moral machine experiment, including personas reflecting different sociodemographics. We find that the moral decisions of LLMs vary substantially by persona, showing greater shifts in moral decisions for critical tasks than humans. Our data also indicate an interesting partisan sorting phenomenon, where political persona predominates the direction and degree of LLM decisions. We discuss the ethical implications and risks associated with deploying these models in applications that involve moral decisions.

physics.med-ph [Back]

[192] Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification

Zhenyu Yang,Haiming Zhu,Rihui Zhang,Haipeng Zhang,Jianliang Wang,Chunhao Wang,Minbin Chen,Fang-Fang Yin

Main category: physics.med-ph

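TLDR: RE-ViT combines radiomic features with a ViT architecture, using early fusion to improve the performance and robustness of medical image classification.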

Details

Motivation: ViTs are limited in medical image analysis because they are data-intensive and lack domain-specific inductive biases, while radiomics is interpretable but hard to scale; RE-ViT aims to combine the strengths of both. Method: Images are divided into patches; radiomic features are extracted per patch and fused with pixel embeddings, then normalized, positionally encoded, and passed to the ViT encoder, with a [CLS] token used for classification. Result: RE-ViT reaches state-of-the-art performance on three datasets (BUSI, ChestXray2017, Retinal OCT), with AUCs of 0.950, 0.989, and 0.986 respectively. Conclusion: RE-ViT successfully fuses radiomics with ViT and shows superior performance and generalizability in multimodal medical image classification. Abstract: Background: Deep learning has significantly advanced medical image analysis, with Vision Transformers (ViTs) offering a powerful alternative to convolutional models by modeling long-range dependencies through self-attention. However, ViTs are inherently data-intensive and lack domain-specific inductive biases, limiting their applicability in medical imaging. In contrast, radiomics provides interpretable, handcrafted descriptors of tissue heterogeneity but suffers from limited scalability and integration into end-to-end learning frameworks. In this work, we propose the Radiomics-Embedded Vision Transformer (RE-ViT) that combines radiomic features with data-driven visual embeddings within a ViT backbone. Purpose: To develop a hybrid RE-ViT framework that integrates radiomics and patch-wise ViT embeddings through early fusion, enhancing robustness and performance in medical image classification. Methods: Following the standard ViT pipeline, images were divided into patches. For each patch, handcrafted radiomic features were extracted and fused with linearly projected pixel embeddings. The fused representations were normalized, positionally encoded, and passed to the ViT encoder. A learnable [CLS] token aggregated patch-level information for classification. We evaluated RE-ViT on three public datasets (including BUSI, ChestXray2017, and Retinal OCT) using accuracy, macro AUC, sensitivity, and specificity. RE-ViT was benchmarked against CNN-based (VGG-16, ResNet) and hybrid (TransMed) models. Results: RE-ViT achieved state-of-the-art results: on BUSI, AUC=0.950+/-0.011; on ChestXray2017, AUC=0.989+/-0.004; on Retinal OCT, AUC=0.986+/-0.001, outperforming the other comparison models.
Conclusions: The RE-ViT framework effectively integrates radiomics with ViT architectures, demonstrating improved performance and generalizability across multimodal medical image classification tasks.
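The early-fusion step described in the Methods (split into patches, compute handcrafted descriptors per patch, fuse with a linear projection of the pixels) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not say how features are fused, so concatenation is assumed here, and the mean/variance/range descriptors are simple stand-ins for real radiomic features. All function names are hypothetical.

```python
from statistics import mean, pvariance


def split_into_patches(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch blocks.

    Each block is returned as a flat list of pixel values, in row-major order.
    """
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def handcrafted_features(pixels):
    # Stand-in descriptors (assumption): real radiomic features would be
    # texture and intensity statistics computed by a radiomics toolkit.
    return [mean(pixels), pvariance(pixels), max(pixels) - min(pixels)]


def project(pixels, weights):
    # Linear projection of the flattened patch: one output per weight row.
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def fuse_patches(image, patch, weights):
    """Build one token per patch: projected pixels + handcrafted features.

    Fusion by concatenation is an assumption, not the authors' stated method.
    """
    return [
        project(pixels, weights) + handcrafted_features(pixels)
        for pixels in split_into_patches(image, patch)
    ]
```

In a full pipeline these tokens would then be normalized, given positional encodings, and passed to the encoder together with a [CLS] token; those steps are omitted here.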

q-bio.QM [Back]

[193] Cryo-em images are intrinsically low dimensional

Luke Evans,Octavian-Vlad Murad,Lars Dingeldein,Pilar Cossio,Roberto Covino,Marina Meila

Main category: q-bio.QM

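TLDR: The paper applies manifold learning to CryoSBI latent representations, showing that the high-dimensional data lie on low-dimensional smooth manifolds and establishing a link between the latent structure and physical parameters.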

Details

Motivation: To explore the geometric structure of CryoSBI latent representations in order to better understand the inference of biomolecular conformations. Method: Manifold learning techniques (such as Diffusion Maps) are applied to the latent representations of simulated and experimental data. Result: The data lie on low-dimensional smooth manifolds, the simulated data cover the experimental data, and the latent structure is directly linked to physical parameters. Conclusion: The findings validate the CryoSBI approach and offer new opportunities to improve inference strategies by exploiting the manifold geometry. Abstract: Simulation-based inference provides a powerful framework for cryo-electron microscopy, employing neural networks in methods like CryoSBI to infer biomolecular conformations via learned latent representations. This latent space represents a rich opportunity, encoding valuable information about the physical system and the inference process. Harnessing this potential hinges on understanding the underlying geometric structure of these representations. We investigate this structure by applying manifold learning techniques to CryoSBI representations of hemagglutinin (simulated and experimental). We reveal that these high-dimensional data inherently populate low-dimensional, smooth manifolds, with simulated data effectively covering the experimental counterpart. By characterizing the manifold's geometry using Diffusion Maps and identifying its principal axes of variation via coordinate interpretation methods, we establish a direct link between the latent structure and key physical parameters. Discovering this intrinsic low-dimensionality and interpretable geometric organization not only validates the CryoSBI approach but enables us to learn more from the data structure and provides opportunities for improving future inference strategies by exploiting this revealed manifold geometry.

cs.SI [Back]

[194] Exposure to Content Written by Large Language Models Can Reduce Stigma Around Opioid Use Disorder in Online Communities

Shravika Mittal,Darshi Shah,Shin Won Do,Mai ElSherief,Tanushree Mitra,Munmun De Choudhury

Main category: cs.SI

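TLDR: The study shows that responses generated by large language models (LLMs) can reduce stigmatizing attitudes toward opioid use disorder (OUD) and its treatment medications in online communities.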

Details

Motivation: Widespread stigma, both online and offline, hinders harm reduction efforts for opioid use disorder (OUD), particularly stigma directed at medications for addiction treatment (MAT) and at people with the condition. The study examines whether LLMs can reduce this stigma in online communities. Method: In pre-registered randomized controlled experiments, participants read LLM-generated responses, human-written responses, or no responses to OUD-related content, in two setups: a single exposure (N=2,141) and repeated exposure over 14 days (N=107). Result: Participants reported the least stigmatized attitudes toward MAT after reading LLM-generated responses in both setups. Conclusion: LLMs can serve as an education-based intervention that fosters inclusive discussion of OUD and promotes positive attitudes toward MAT. Abstract: Widespread stigma, both in the offline and online spaces, acts as a barrier to harm reduction efforts in the context of opioid use disorder (OUD). This stigma is prominently directed towards clinically approved medications for addiction treatment (MAT), people with the condition, and the condition itself. Given the potential of artificial intelligence based technologies in promoting health equity, and facilitating empathic conversations, this work examines whether large language models (LLMs) can help abate OUD-related stigma in online communities. To answer this, we conducted a series of pre-registered randomized controlled experiments, where participants read LLM-generated, human-written, or no responses to help-seeking OUD-related content in online communities. The experiment was conducted under two setups, i.e., participants read the responses either once (N = 2,141), or repeatedly for 14 days (N = 107). We found that participants reported the least stigmatized attitudes toward MAT after consuming LLM-generated responses under both setups. This study offers insights into strategies that can foster inclusive online discourse on OUD, e.g., based on our findings LLMs can be used as an education-based intervention to promote positive attitudes and increase people's propensity toward MAT.

eess.IV [Back]

[195] Integrating electrocardiogram and fundus images for early detection of cardiovascular diseases

K. A. Muthukumar,Dhruva Nandi,Priya Ranjan,Krithika Ramachandran,Shiny PJ,Anirban Ghosh,Ashwini M,Aiswaryah Radhakrishnan,V. E. Dhandapani,Rajiv Janardhanan

Main category: eess.IV

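TLDR: Proposes a new method that combines electrocardiogram (ECG) readings and retinal fundus images for early diagnosis and priority-based triage of cardiovascular diseases (CVD). Features are extracted with FFT and EMD, and a neural network classifier achieves 84% accuracy.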

Details

Motivation: Cardiovascular diseases are a major global health problem and call for more advanced diagnostic techniques. Combining the retinal vascular network with the dynamic information in ECG can provide a more comprehensive diagnostic perspective. Method: FFT is used to transform the ECG and fundus images into the frequency domain, EMD distances are computed, and the concatenated features are fed into a neural network classifier. Result: Preliminary tests yield an accuracy of 84%, indicating the potential of the approach. Conclusion: The method could improve CVD diagnosis in resource-limited healthcare settings and will be further refined and validated. Abstract: Cardiovascular diseases (CVD) are a predominant health concern globally, emphasizing the need for advanced diagnostic techniques. In our research, we present an avant-garde methodology that synergistically integrates ECG readings and retinal fundus images to facilitate the early disease tagging as well as triaging of the CVDs in the order of disease priority. Recognizing the intricate vascular network of the retina as a reflection of the cardiovascular system, along with the dynamic cardiac insights from ECG, we sought to provide a holistic diagnostic perspective. Initially, a Fast Fourier Transform (FFT) was applied to both the ECG and fundus images, transforming the data into the frequency domain. Subsequently, the Earth Mover's Distance (EMD) was computed for the frequency-domain features of both modalities. These EMD values were then concatenated, forming a comprehensive feature set that was fed into a Neural Network classifier. This approach, leveraging the FFT's spectral insights and EMD's capability to capture nuanced data differences, offers a robust representation for CVD classification. Preliminary tests yielded a commendable accuracy of 84 percent, underscoring the potential of this combined diagnostic strategy. As we continue our research, we anticipate refining and validating the model further to enhance its clinical applicability in resource-limited healthcare ecosystems prevalent across the Indian sub-continent and also the world at large.
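The two building blocks named in the abstract, a frequency-domain transform and the Earth Mover's Distance, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not state which pairs of spectra are compared, so the sketch simply assumes the EMD is taken between two normalized magnitude spectra defined on the same frequency bins. A naive DFT stands in for the FFT to keep the example dependency-free, and all function names are hypothetical.

```python
import cmath


def magnitude_spectrum(signal):
    """Magnitude of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def normalize(values):
    """Scale non-negative values so they sum to 1 (a discrete distribution)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [v / total for v in values]


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same unit-spaced bins.

    In one dimension this equals the sum of absolute differences between
    the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    cdf_gap, total = 0.0, 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total
```

A feature for one sample could then be `emd_1d(normalize(magnitude_spectrum(x)), normalize(magnitude_spectrum(ref)))` for some reference signal `ref`; the choice of reference is an assumption of this sketch.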

[196] PathSeqSAM: Sequential Modeling for Pathology Image Segmentation with SAM2

Mingyang Zhu,Yinting Liu,Mingyu Li,Jiacheng Wang

Main category: eess.IV

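TLDR: PathSeqSAM uses SAM2's memory mechanisms to treat 2D pathology slices as sequential video frames, and introduces a distance-aware attention mechanism and LoRA for domain adaptation, improving segmentation quality.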

Details

Motivation: Existing methods process 2D slices independently and ignore cross-slice information, which limits segmentation quality. Method: 2D slices are treated as video frames using SAM2's memory mechanisms, combined with distance-aware attention and LoRA for domain adaptation. Result: On the KPI Challenge 2024 dataset, PathSeqSAM performs better on challenging cases that depend on cross-slice context. Conclusion: By exploiting cross-slice information, PathSeqSAM improves the quality of pathology image segmentation. Abstract: Current methods for pathology image segmentation typically treat 2D slices independently, ignoring valuable cross-slice information. We present PathSeqSAM, a novel approach that treats 2D pathology slices as sequential video frames using SAM2's memory mechanisms. Our method introduces a distance-aware attention mechanism that accounts for variable physical distances between slices and employs LoRA for domain adaptation. Evaluated on the KPI Challenge 2024 dataset for glomeruli segmentation, PathSeqSAM demonstrates improved segmentation quality, particularly in challenging cases that benefit from cross-slice context. We have publicly released our code at https://github.com/JackyyyWang/PathSeqSAM.

[197] Efficient and Robust Remote Sensing Image Denoising Using Randomized Approximation of Geodesics' Gramian on the Manifold Underlying the Patch Space

Kelum Gajamannage,Dilhani I. Jayathilake,Maria Vasilyeva

Main category: eess.IV

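TLDR: Proposes an efficient remote sensing image denoising method that requires no additional training samples, based on a low-rank manifold and a randomized approximation of the singular value spectrum.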

Details

Motivation: Remote sensing images are degraded by environmental factors and imaging-system issues; existing denoising algorithms struggle with complex textures, and neural-network approaches are resource-intensive. Method: The image is partitioned into patches, a low-rank manifold is used to represent the noise-free version, the manifold is revealed through a randomized approximation of the singular value spectrum, and each channel is denoised separately before the channels are merged. Result: The method is efficient and robust, requires no additional training samples, and suits remote sensing images with complex textures. Conclusion: The method offers a resource-efficient and well-performing solution for remote sensing image denoising. Abstract: Remote sensing images are widely utilized in many disciplines such as feature recognition and scene semantic segmentation. However, due to environmental factors and the issues of the imaging system, the image quality is often degraded which may impair subsequent visual tasks. Even though denoising remote sensing images plays an essential role before applications, the current denoising algorithms fail to attain optimum performance since these images possess complex features in the texture. Denoising frameworks based on artificial neural networks have shown better performance; however, they require exhaustive training with heterogeneous samples that extensively consume resources like power, memory, computation, and latency. Thus, here we present a computationally efficient and robust remote sensing image denoising method that doesn't require additional training samples. This method partitions patches of a remote-sensing image in which a low-rank manifold, representing the noise-free version of the image, underlies the patch space. An efficient and robust approach to revealing this manifold is a randomized approximation of the singular value spectrum of the geodesics' Gramian matrix of the patch space. The method asserts a unique emphasis on each color channel during denoising so the three denoised channels are merged to produce the final image.

[198] AgentPolyp: Accurate Polyp Segmentation via Image Enhancement Agent

Pu Wang,Zhihua Zhang,Dianjie Lu,Guijuan Zhang,Youshan Zhang,Zhuoran Zheng

Main category: eess.IV

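TLDR: AgentPolyp is a framework that combines CLIP-based semantic guidance and dynamic image enhancement with a lightweight neural network, addressing noise in polyp images to improve segmentation.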

Details

Motivation: Because of environmental and human factors, polyp images often suffer from dim lighting, blur, and overexposure, which hampers segmentation. Method: The framework assesses image quality through CLIP-based semantic analysis and uses reinforcement learning to dynamically apply multi-modal enhancement operations (e.g., denoising, contrast adjustment), with a quality-assessment feedback loop that jointly optimizes enhancement and segmentation. Result: The modular architecture supports plug-and-play extension with various enhancement algorithms and segmentation networks and meets deployment requirements for endoscopic devices. Conclusion: AgentPolyp can improve the robustness of polyp image segmentation and has potential for practical use. Abstract: Since human and environmental factors interfere, captured polyp images usually suffer from issues such as dim lighting, blur, and overexposure, which pose challenges for downstream polyp segmentation tasks. To address the challenges of noise-induced degradation in polyp images, we present AgentPolyp, a novel framework integrating CLIP-based semantic guidance and dynamic image enhancement with a lightweight neural network for segmentation. The agent first evaluates image quality using CLIP-driven semantic analysis (e.g., identifying "low-contrast polyps with vascular textures") and adapts reinforcement learning strategies to dynamically apply multi-modal enhancement operations (e.g., denoising, contrast adjustment). A quality assessment feedback loop optimizes pixel-level enhancement and segmentation focus in a collaborative manner, ensuring robust preprocessing before neural network segmentation. This modular architecture supports plug-and-play extensions for various enhancement algorithms and segmentation networks, meeting deployment requirements for endoscopic devices.

[199] Efficient Medical Image Restoration via Reliability Guided Learning in Frequency Domain

Pengcheng Zheng,Kecheng Chen,Jiaxin Huang,Bohao Chen,Ju Liu,Yazhou Ren,Xiaorong Pu

Main category: eess.IV

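TLDR: LRformer is a lightweight Transformer-based method that uses reliability-guided learning in the frequency domain to address computational efficiency and result reliability in medical image restoration.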

Details

Motivation: Existing deep learning methods for medical image restoration are computationally inefficient and ignore the reliability of their results, both of which matter greatly in clinical settings. Method: LRformer combines a Reliable Lesion-Semantic Prior Producer (RLPP) with a Guided Frequency Cross-Attention (GFCA) solver, using FFT to reduce computational complexity. Result: Experiments show that LRformer is both efficient and effective across a range of tasks. Conclusion: LRformer delivers superior performance and efficiency in medical image restoration. Abstract: Medical image restoration tasks aim to recover high-quality images from degraded observations and are urgently needed in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle with rendering computationally-efficient reconstruction results. Moreover, they usually ignore the reliability of the restoration results, which is much more urgent in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently-reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugate symmetry property of FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results in various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
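The "nearly half" saving rests on a general property of the Fourier transform of real-valued signals: the coefficients are conjugate symmetric, so only about half of them need to be stored or processed. The sketch below demonstrates that property only; it is not the GFCA solver, and the function names are hypothetical. A naive DFT is used in place of an FFT so the example needs no external libraries.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal))
        for k in range(n)
    ]


def half_spectrum(signal):
    """Keep only the n // 2 + 1 non-redundant coefficients of a real signal."""
    return dft(signal)[: len(signal) // 2 + 1]


def restore_full_spectrum(half, n):
    """Rebuild all n coefficients using X[k] = conj(X[n - k]) for real input."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full
```

For a real signal of length 8, `half_spectrum` keeps 5 of the 8 coefficients and `restore_full_spectrum` recovers the remaining 3 from their mirrored partners.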

cs.RO [Back]

[200] ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping

Shun Iwase,Zubair Irshad,Katherine Liu,Vitor Guizilini,Robert Lee,Takuya Ikeda,Ayako Amma,Koichi Nishiwaki,Kris Kitani,Rares Ambrus,Sergey Zakharov

Main category: cs.RO

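TLDR: ZeroGrasp is a framework that simultaneously performs 3D reconstruction and grasp pose prediction. It uses occlusion reasoning and modeling of spatial relationships between objects to improve performance, is trained on a synthetic dataset, and achieves near real-time operation with high accuracy.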

Details

Motivation: Existing methods output grasp poses directly from partial information without modeling scene geometry, leading to suboptimal motion and even collisions. Method: ZeroGrasp couples 3D reconstruction with grasp pose prediction, uses occlusion reasoning and spatial relationship modeling, and is trained on a large-scale synthetic dataset. Result: It achieves the best performance on the GraspNet-1B benchmark and in real-world robot experiments, and generalizes to novel objects. Conclusion: By leveraging synthetic data, ZeroGrasp achieves strong performance and generalization, offering a new approach to robotic grasping. Abstract: Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects is beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which comprises 1M photo-realistic images, high-resolution 3D reconstructions and 11.3B physically-valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects by leveraging synthetic data.

[201] Acquisition of high-quality images for camera calibration in robotics applications via speech prompts

Timm Linder,Kadir Yilmaz,David B. Adrian,Bastian Leibe

Main category: cs.RO

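TLDR: Proposes a voice-command-controlled image acquisition technique for camera calibration that improves the robustness and user experience of the calibration process.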

Details

Motivation: Conventional camera calibration relies on calibration targets that must be static during capture; otherwise the images suffer from motion blur or rolling-shutter artifacts. Existing workarounds, such as triggering capture with a remote control or filtering out blurry frames in postprocessing, are less robust and less user-friendly. Method: A speech-to-text model with per-word timestamps is used so that voice commands trigger image capture with precise temporal alignment, avoiding blur and motion artifacts. Result: Experiments show that the method is fast and efficient, successfully calibrates complex multi-camera setups, and improves user experience. Conclusion: Voice-controlled calibration image acquisition compares favorably with conventional approaches and is suitable for practical robot vision applications. Abstract: Accurate intrinsic and extrinsic camera calibration can be an important prerequisite for robotic applications that rely on vision as input. While there is ongoing research on enabling camera calibration using natural images, many systems in practice still rely on using designated calibration targets with e.g. checkerboard patterns or April tag grids. Once calibration images from different perspectives have been acquired and feature descriptors detected, those are typically used in an optimization process to minimize the geometric reprojection error. For this optimization to converge, input images need to be of sufficient quality and particularly sharpness; they should neither contain motion blur nor rolling-shutter artifacts that can arise when the calibration board was not static during image capture. In this work, we present a novel calibration image acquisition technique controlled via voice commands recorded with a clip-on microphone, that can be more robust and user-friendly than e.g. triggering capture with a remote control, or filtering out blurry frames from a video sequence in postprocessing. To achieve this, we use a state-of-the-art speech-to-text transcription model with accurate per-word timestamping to capture trigger words with precise temporal alignment. Our experiments show that the proposed method improves user experience by being fast and efficient, allowing us to successfully calibrate complex multi-camera setups.

[202] Next-Future: Sample-Efficient Policy Learning for Robotic-Arm Tasks

Fikrican Özgür,René Zurbrügg,Suryansh Kumar

Main category: cs.RO

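TLDR: The paper proposes a new replay strategy called "Next-Future" that improves sample efficiency and accuracy in multi-goal reinforcement learning, particularly for robotic manipulation tasks.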

Details

Motivation: Hindsight Experience Replay (HER) is widely used in multi-goal reinforcement learning, but its heuristic-based replay method lacks a principled framework, which limits performance. Method: The proposed "Next-Future" strategy focuses on rewarding single-step transitions to improve the efficiency and accuracy of learning multi-goal Markov decision processes. Result: Across eight robotic manipulation tasks, sample efficiency improves substantially on seven tasks and success rates are higher on six. Conclusion: "Next-Future" is more efficient and practical for multi-goal reinforcement learning and is suitable for complex robotic tasks. Abstract: Hindsight Experience Replay (HER) is widely regarded as the state-of-the-art algorithm for achieving sample-efficient multi-goal reinforcement learning (RL) in robotic manipulation tasks with binary rewards. HER facilitates learning from failed attempts by replaying trajectories with redefined goals. However, it relies on a heuristic-based replay method that lacks a principled framework. To address this limitation, we introduce a novel replay strategy, "Next-Future", which focuses on rewarding single-step transitions. This approach significantly enhances sample efficiency and accuracy in learning multi-goal Markov decision processes (MDPs), particularly under stringent accuracy requirements -- a critical aspect for performing complex and precise robotic-arm tasks. We demonstrate the efficacy of our method by highlighting how single-step learning enables improved value approximation within the multi-goal RL framework. The performance of the proposed replay strategy is evaluated across eight challenging robotic manipulation tasks, using ten random seeds for training. Our results indicate substantial improvements in sample efficiency for seven out of eight tasks and higher success rates in six tasks. Furthermore, real-world experiments validate the practical feasibility of the learned policies, demonstrating the potential of "Next-Future" in solving complex robotic-arm tasks.
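For readers unfamiliar with the baseline, the goal-relabeling idea behind HER ("replaying trajectories with redefined goals") can be sketched as below. This illustrates the commonly used "future" relabeling heuristic of standard HER, not the paper's "Next-Future" strategy, whose details are not given in the abstract. The transition layout, the 0/-1 reward convention, and the function names are assumptions made for this example.

```python
import random


def binary_reward(achieved, goal):
    # Sparse binary reward (assumed convention): 0 on success, -1 otherwise.
    return 0.0 if achieved == goal else -1.0


def her_future_relabel(episode, k=4, rng=random):
    """Relabel each transition with goals achieved later in the same episode.

    episode: list of (state, action, next_state, achieved_goal) tuples, where
    achieved_goal is the goal reached after taking the action.
    Returns (state, action, next_state, new_goal, reward) tuples, k per step.
    """
    relabeled = []
    for t, (state, action, next_state, achieved) in enumerate(episode):
        for _ in range(k):
            # Pick a time step from t to the end of the episode and pretend
            # the goal achieved there was the intended goal all along.
            future = rng.randrange(t, len(episode))
            new_goal = episode[future][3]
            reward = binary_reward(achieved, new_goal)
            relabeled.append((state, action, next_state, new_goal, reward))
    return relabeled
```

Even an episode that never reached its original goal yields some transitions with a success reward after relabeling, which is what lets the agent learn from failed attempts.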

Table of Contents

cs.CV [Back]

[1] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

[2] Enhancing Image Restoration through Learning Context-Rich and Detail-Accurate Features

[3] Data Augmentation Through Random Style Replacement

[4] H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

[5] AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark

[6] Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

[7] SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging

[8] Relation-Rich Visual Document Generator for Visual Information Extraction

[9] Perturbed State Space Feature Encoders for Optical Flow with Event Cameras

[10] H-MoRe: Learning Human-centric Motion Representation for Action Analysis

[11] NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results

[12] The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

[13] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

[14] Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization

[15] CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates

[16] Hearing Anywhere in Any Environment

[17] Real-time Seafloor Segmentation and Mapping

[18] ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models

[19] SeeTree -- A modular, open-source system for tree detection and orchard localization

[20] Minimal Sensing for Orienting a Solar Panel

[21] Rainy: Unlocking Satellite Calibration for Deep Learning in Precipitation

[22] Visual Language Models show widespread visual deficits on neuropsychological tests

[23] 3D Wavelet Convolutions with Extended Receptive Fields for Hyperspectral Image Classification

[24] The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability

[25] Tabular foundation model to detect empathy from visual cues

[26] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

[27] PatrolVision: Automated License Plate Recognition in the wild

[28] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

[29] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

[30] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

[31] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

[32] A comprehensive review of remote sensing in wetland classification and mapping

[33] Enhancing Features in Long-tailed Data Using Large Vision Model

[34] LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

[35] DAAF: Degradation-Aware Adaptive Fusion Framework for Robust Infrared and Visible Images Fusion

[36] Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

[37] Weather-Aware Object Detection Transformer for Domain Adaptation

[38] Large Language Model-Informed Feature Discovery Improves Prediction and Interpretation of Credibility Perceptions of Visual Content

[39] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task

[40] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

[41] PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving

[42] CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors

[43] Fine-Grained Rib Fracture Diagnosis with Hyperbolic Embeddings: A Detailed Annotation Framework and Multi-Label Classification Model

[44] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

[45] Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering

[46] Cross-Frequency Implicit Neural Representation with Self-Evolving Parameters

[47] Recognition of Geometrical Shapes by Dictionary Learning

[48] An Efficient and Mixed Heterogeneous Model for Image Restoration

[49] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

[50] Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

[51] Adaptive Decision Boundary for Few-Shot Class-Incremental Learning

[52] Deep Learning in Concealed Dense Prediction

[53] Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera

[54] DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification

[55] PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation

[56] TMCIR: Token Merge Benefits Composed Image Retrieval

[57] MediSee: Reasoning-based Pixel-level Perception in Medical Images

[58] GATE3D: Generalized Attention-based Task-synergized Estimation in 3D*

[59] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

[60] DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environment

[61] Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

[62] TADACap: Time-series Adaptive Domain-Aware Captioning

[63] Defending Against Frequency-Based Attacks with Diffusion Models

[64] QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models

[65] Leveraging LLMs and attention-mechanism for automatic annotation of historical maps

[66] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

[67] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

[68] Improving fingerprint presentation attack detection by an approach integrated into the personal verification stage

[69] Change State Space Models for Remote Sensing Change Detection

[70] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

[71] Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

[72] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

[73] S$^2$Teacher: Step-by-step Teacher for Sparsely Annotated Oriented Object Detection

[74] Flyweight FLIM Networks for Salient Object Detection in Biomedical Images

[75] K-means Enhanced Density Gradient Analysis for Urban and Transport Metrics Using Multi-Modal Satellite Imagery

[76] Visual Re-Ranking with Non-Visual Side Information

[77] Taming Consistency Distillation for Accelerated Human Image Animation

[78] GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention