cs.CV

[1] Spectral Dictionary Learning for Generative Image Modeling

Andrew Kiruluta

Main category: cs.CV

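TL;DR: Proposes a novel spectral generative model for image synthesis that reconstructs images from explicitly parameterized spectral basis functions, offering high interpretability and physical meaning.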

Details Motivation: Conventional generative models (variational, adversarial, and diffusion) rely on stochastic inference or adversarial training and lack interpretability. This paper aims to provide a deterministic, interpretable spectral generation method. Method: Images are flattened into one-dimensional signals and reconstructed as linear combinations of learned spectral basis functions (parameterized by frequency, phase, and amplitude); a probabilistic model is then fitted to the mixing coefficients to generate new images. Result: On the CIFAR-10 benchmark, the model achieves competitive reconstruction quality and perceptual fidelity, with improved training stability and computational efficiency. Conclusion: The model opens a new route to controlled synthesis; direct control over the spectral dictionary improves interpretability and suits new applications in image manipulation and analysis. Abstract: We propose a novel spectral generative model for image synthesis that departs radically from the common variational, adversarial, and diffusion paradigms. In our approach, images, after being flattened into one-dimensional signals, are reconstructed as linear combinations of a set of learned spectral basis functions, where each basis is explicitly parameterized in terms of frequency, phase, and amplitude. The model jointly learns a global spectral dictionary with time-varying modulations and per-image mixing coefficients that quantify the contributions of each spectral component. Subsequently, a simple probabilistic model is fitted to these mixing coefficients, enabling the deterministic generation of new images by sampling from the latent space. This framework leverages deterministic dictionary learning, offering a highly interpretable and physically meaningful representation compared to methods relying on stochastic inference or adversarial training. Moreover, the incorporation of frequency-domain loss functions, computed via the short-time Fourier transform (STFT), ensures that the synthesized images capture both global structure and fine-grained spectral details, such as texture and edge information. Experimental evaluations on the CIFAR-10 benchmark demonstrate that our approach not only achieves competitive performance in terms of reconstruction quality and perceptual fidelity but also offers improved training stability and computational efficiency. This new type of generative model opens up promising avenues for controlled synthesis, as the learned spectral dictionary affords a direct handle on the intrinsic frequency content of the images, thus providing enhanced interpretability and potential for novel applications in image manipulation and analysis.
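To make the reconstruction step concrete, the following is a minimal sketch of a flattened signal built as a linear combination of sinusoidal atoms, each defined by a frequency, a phase, and an amplitude. It is not the author's implementation: the sinusoidal form, the function names, the three-atom dictionary, and the coefficient values are all assumptions for illustration, and the time-varying modulations and STFT loss mentioned in the abstract are omitted.

```python
import math

def spectral_basis(freq, phase, amplitude, length):
    """One sinusoidal atom sampled at `length` points of the flattened signal."""
    return [amplitude * math.sin(2 * math.pi * freq * n / length + phase)
            for n in range(length)]

def reconstruct(coeffs, atoms):
    """Linear combination of atoms weighted by per-image mixing coefficients."""
    length = len(atoms[0])
    return [sum(c * atom[n] for c, atom in zip(coeffs, atoms))
            for n in range(length)]

# Hypothetical dictionary of three atoms: (frequency, phase, amplitude).
params = [(1.0, 0.0, 1.0), (2.0, math.pi / 2, 0.5), (5.0, 0.0, 0.25)]
length = 32 * 32 * 3  # a CIFAR-10 image flattened to one dimension
atoms = [spectral_basis(f, p, a, length) for f, p, a in params]

# Illustrative per-image mixing coefficients; in the paper these are learned.
coeffs = [0.8, -0.3, 1.2]
signal = reconstruct(coeffs, atoms)
```

In a setup like this, generating a new image would amount to sampling a fresh `coeffs` vector from the fitted probabilistic model and calling `reconstruct` again with the same dictionary.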

[2] SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos

Yuxin Yao,Yan Zhang,Zhening Huang,Joan Lasenby

Main category: cs.CV

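TL;DR: SmallGS is a camera pose estimation framework for small-baseline videos that optimizes camera poses with Gaussian splatting and incorporates pretrained visual features for robustness, outperforming existing methods in dynamic scenes.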

Details Motivation: Small-baseline videos are common on social media, but existing pose estimation methods struggle with them because of ambiguous features, drift accumulation, and insufficient triangulation constraints. Method: SmallGS reconstructs the scene with Gaussian splatting and optimizes camera poses using pretrained features (e.g., DINOv2), without requiring explicit feature correspondences or strong parallax motion. Result: On TUM-Dynamics sequences, SmallGS outperforms MonST3R and DROID-SLAM in camera pose estimation for small-baseline videos. Conclusion: SmallGS provides an efficient and robust camera pose estimation solution for small-baseline videos. Abstract: Dynamic videos with small baseline motions are ubiquitous in daily life, especially on social media. However, these videos present a challenge to existing pose estimation frameworks due to ambiguous features, drift accumulation, and insufficient triangulation constraints. Gaussian splatting, which maintains an explicit representation for scenes, provides a reliable novel view rasterization when the viewpoint change is small. Inspired by this, we propose SmallGS, a camera pose estimation framework that is specifically designed for small-baseline videos. SmallGS optimizes sequential camera poses using Gaussian splatting, which reconstructs the scene from the first frame in each video segment to provide a stable reference for the rest. The temporal consistency of Gaussian splatting within limited viewpoint differences reduces the requirement of sufficient depth variations in traditional camera pose estimation. We further incorporate pretrained robust visual features, e.g., DINOv2, into Gaussian splatting, where high-dimensional feature map rendering enhances the robustness of camera pose estimation. By freezing the Gaussian splatting and optimizing camera viewpoints based on rasterized features, SmallGS effectively learns camera poses without requiring explicit feature correspondences or strong parallax motion. We verify the effectiveness of SmallGS on small-baseline videos from TUM-Dynamics sequences, where it achieves impressive accuracy in camera pose estimation compared to MonST3R and DROID-SLAM in dynamic scenes. Our project page is at: https://yuxinyao620.github.io/SmallGS

[3] Object Learning and Robust 3D Reconstruction

Sara Sabour

Main category: cs.CV

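TL;DR: The thesis discusses architecture designs and training methods for unsupervised neural networks that segment objects in images, and extends to object detection and removal in 3D applications.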

Details Motivation: Address the challenge of distinguishing foreground from background in 2D unsupervised object segmentation, and explore the detection and removal of dynamic objects in 3D scenes. Method: Uses motion cues (FlowCapsules) for 2D object segmentation and 3D geometric consistency to detect dynamic objects. Result: Presents unsupervised object segmentation methods and designs robust optimization kernels to improve 3D modelling. Conclusion: Demonstrates the merits of unsupervised object-based approaches and suggests future research directions, encouraging the community to explore explicit object representations. Abstract: In this thesis we discuss architectural designs and training methods for a neural network to have the ability to dissect an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing between foreground objects of interest and background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last part of this thesis focuses on 3D applications where the goal is detection and removal of the object of interest from the input images. In these tasks, we leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects. Our transient object masks are then used for designing robust optimization kernels to improve 3D modelling in a casual capture setup. One of our goals in this thesis is to show the merits of unsupervised object based approaches in computer vision. Furthermore, we suggest possible directions for defining objects of interest or foreground objects without requiring supervision. Our hope is to motivate and excite the community into further exploring explicit object representations in image understanding tasks.

[4] CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss

Dileepa Pitawela,Gustavo Carneiro,Hsiang-Ting Chen

Main category: cs.CV

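TL;DR: CLOC is a contrastive learning method for ordinal classification that learns ordered representations by optimizing a multi-margin n-pair loss (MMNP), addressing the failure of existing methods to account for the differing importance of adjacent classes.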

Details Motivation: In ordinal classification, misclassifications between adjacent classes carry different consequences, but existing methods ignore this difference; CLOC aims to address it. Method: Proposes CLOC, which learns ordered representations with a multi-margin n-pair loss (MMNP) and supports flexible decision boundaries across key adjacent categories. Result: On five real-world image datasets and one synthetic dataset, CLOC outperforms existing methods and demonstrates interpretability and controllability. Conclusion: By optimizing multiple margins, CLOC improves ordinal classification performance and meets clinical and practical needs. Abstract: In ordinal classification, misclassifying neighboring ranks is common, yet the consequences of these errors are not the same. For example, misclassifying benign tumor categories is less consequential, compared to an error at the pre-cancerous to cancerous threshold, which could profoundly influence treatment choices. Despite this, existing ordinal classification methods do not account for the varying importance of these margins, treating all neighboring classes as equally significant. To address this limitation, we propose CLOC, a new margin-based contrastive learning method for ordinal classification that learns an ordered representation based on the optimization of multiple margins with a novel multi-margin n-pair loss (MMNP). CLOC enables flexible decision boundaries across key adjacent categories, facilitating smooth transitions between classes and reducing the risk of overfitting to biases present in the training data. We provide empirical discussion regarding the properties of MMNP and show experimental results on five real-world image datasets (Adience, Historical Colour Image Dating, Knee Osteoarthritis, Indian Diabetic Retinopathy Image, and Breast Carcinoma Subtyping) and one synthetic dataset simulating clinical decision bias. Our results demonstrate that CLOC outperforms existing ordinal classification methods and show the interpretability and controllability of CLOC in learning meaningful, ordered representations that align with clinical and practical needs.

[5] Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning

Mingxuan Cui,Qing Guo,Yuyi Wang,Hongkai Yu,Di Lin,Qin Zou,Ming-Ming Cheng,Xi Li

Main category: cs.CV

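TL;DR: The paper proposes a 3D inpainting method (3DGI) based on 3D Gaussian Splatting (3DGS) that achieves high-quality, seamless inpainting through visibility-uncertainty guidance and scene concept learning.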

Details Motivation: Address the challenge of effectively leveraging complementary visual and semantic information across multiple views in 3D scene inpainting. Method: Proposes the VISTA framework, which combines visibility-uncertainty-guided 3DGI with scene concept learning and uses a diffusion model to fill masked regions. Result: Outperforms state-of-the-art techniques on static and dynamic scene inpainting tasks, producing high-quality, artifact-free results. Conclusion: VISTA provides an efficient and general solution for 3D scene inpainting across diverse scenes. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS capabilities to inpainting, where masked objects in a scene are replaced with new contents that blend seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) is challenging in effectively leveraging complementary visual and semantic cues from multiple input views, as occluded areas in one view may be visible in others. To address this, we propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI in utilizing complementary visual cues. We also employ uncertainties to learn a semantic concept of the scene without the masked object and use a diffusion model to fill masked objects in input images based on the learned concept. Finally, we build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning. VISTA generates high-quality 3DGS models capable of synthesizing artifact-free and naturally inpainted novel views. Furthermore, our approach extends to handling dynamic distractors arising from temporal object changes, enhancing its versatility in diverse scene reconstruction scenarios. We demonstrate the superior performance of our method over state-of-the-art techniques using two challenging datasets: the SPIn-NeRF dataset, featuring 10 diverse static 3D inpainting scenes, and an underwater 3D inpainting dataset derived from UTB180, including fast-moving fish as inpainting targets.

[6] Subject-driven Video Generation via Disentangled Identity and Motion

Daneul Kim,Jingxu Zhang,Wonjoon Jin,Sunghyun Cho,Qi Dai,Jaesik Park,Chong Luo

Main category: cs.CV

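TL;DR: Proposes a zero-shot, tuning-free subject-driven video generation model that decouples subject learning from temporal dynamics and trains a video customization model directly on an image customization dataset.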

Details Motivation: Traditional video customization methods rely on large annotated video datasets, which are computationally expensive and annotation-heavy. This paper aims to train video models directly from image datasets to reduce resource requirements. Method: Video customization is factorized into two parts: (1) identity injection through an image customization dataset and (2) preservation of temporal dynamics through image-to-video training. The method also uses random image token dropping with randomized initialization and a stochastic switching optimization strategy. Result: The model performs strongly in zero-shot settings, with better subject consistency and scalability than existing methods. Conclusion: The method addresses the resource problem in video customization while improving generation quality. Abstract: We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner without additional tuning. Traditional tuning-free methods for video customization often rely on large, annotated video datasets, which are computationally expensive and require extensive annotation. In contrast to the previous approach, we introduce the use of an image customization dataset directly for training video customization models, factorizing video customization into two parts: (1) identity injection through an image customization dataset and (2) temporal modeling preservation with a small set of unannotated videos through the image-to-video training method. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.

[7] Learning Underwater Active Perception in Simulation

Alexandre Cardaillac,Donald G. Dansereau

Main category: cs.CV

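TL;DR: Proposes an active perception framework based on a multi-layer perceptron (MLP) for acquiring high-quality images under varying underwater conditions.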

Details Motivation: Underwater turbidity and backscattering degrade visibility for robotic operations, and traditional methods impose manoeuvring and setup constraints. Method: An MLP predicts image quality; a synthetic dataset covering ten water types is generated, and Blender is modified to simulate underwater light propagation. Result: The method is validated in simulation, with significant improvements in visual coverage and image quality. Conclusion: The method is simple and efficient, applies to a broad range of underwater conditions, and its code is open source. Abstract: When employing underwater vehicles for the autonomous inspection of assets, it is crucial to consider and assess the water conditions. Indeed, they have a significant impact on the visibility, which also affects robotic operations. Turbidity can jeopardise the whole mission as it may prevent correct visual documentation of the inspected structures. Previous works have introduced methods to adapt to turbidity and backscattering; however, they also include manoeuvring and setup constraints. We propose a simple yet efficient approach to enable high-quality image acquisition of assets in a broad range of water conditions. This active perception framework includes a multi-layer perceptron (MLP) trained to predict image quality given a distance to a target and artificial light intensity. We generated a large synthetic dataset including ten water types with different levels of turbidity and backscattering. For this, we modified the modelling software Blender to better account for the underwater light propagation properties. We validated the approach in simulation and showed significant improvements in visual coverage and quality of imagery compared to traditional approaches. The project code is available on our project page at https://roboticimaging.org/Projects/ActiveUW/.

[8] VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

Xinyu Chen,Yunxin Li,Haoyuan Shi,Baotian Hu,Wenhan Luo,Yaowei Wang,Min Zhang

Main category: cs.CV

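TL;DR: VideoVista-CulturalLingo is a multicultural, multilingual, multi-domain video comprehension benchmark that aims to fill gaps in existing benchmarks; 24 video large models are evaluated on it.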

Details Motivation: Existing video evaluation benchmarks are mostly limited to a single language and cultural context and cannot comprehensively assess the cross-cultural understanding of AI systems. Method: Builds an evaluation dataset of 1,389 videos and 3,134 QA pairs covering Chinese, North American, and European cultures, with questions in Chinese and English. Result: Existing models perform worse on Chinese-centric questions (especially those about Chinese history); open-source models perform poorly on temporal understanding tasks; mainstream models do well on science questions, while open-source models are weak in mathematics. Conclusion: VideoVista-CulturalLingo exposes the cross-cultural and multilingual shortcomings of existing models and provides a reference for future research. Abstract: Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multi-linguistics, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models. From the experiment results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance in general scientific questions, while open-source models demonstrate weak performance in mathematics.

[9] A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw

Wenwen Li,Chia-Yu Hsu,Sizhe Wang,Zhining Gu,Yili Yang,Brendan M. Rogers,Anna Liljedahl

Main category: cs.CV

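TL;DR: The paper proposes a deep-learning-based multimodal fusion approach for accurately detecting Retrogressive Thaw Slumps (RTS) in the Arctic, addressing the challenges of their small scale, vague boundaries, and spatiotemporal variation.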

Details Motivation: RTS are important indicators of permafrost degradation in the Arctic, but their small scale and vague boundaries make them hard to identify accurately with traditional detection methods. Method: Uses Cascade Mask R-CNN with a multi-scale vision transformer backbone and introduces two new strategies: feature-level residual cross-modality attention fusion, and pre-trained unimodal learning followed by multimodal fine-tuning. Result: Experiments show the approach outperforms existing models on RTS detection and offers insights into the efficient use of multimodal data. Conclusion: The study improves RTS detection accuracy and supports the understanding of permafrost landforms and their environmental implications. Abstract: Retrogressive Thaw Slumps (RTS) in Arctic regions are distinct permafrost landforms with significant environmental impacts. Mapping these RTS is crucial because their appearance serves as a clear indication of permafrost thaw. However, their small scale compared to other landform features, vague boundaries, and spatiotemporal variation pose significant challenges for accurate detection. In this paper, we employed a state-of-the-art deep learning model, the Cascade Mask R-CNN with a multi-scale vision transformer-based backbone, to delineate RTS features across the Arctic. Two new strategies were introduced to optimize multimodal learning and enhance the model's predictive performance: (1) a feature-level, residual cross-modality attention fusion strategy, which effectively integrates feature maps from multiple modalities to capture complementary information and improve the model's ability to understand complex patterns and relationships within the data; (2) pre-trained unimodal learning followed by multimodal fine-tuning to alleviate high computing demand while achieving strong model performance. Experimental results demonstrated that our approach outperformed existing models adopting data-level fusion, feature-level convolutional fusion, and various attention fusion strategies, providing valuable insights into the efficient utilization of multimodal data for RTS mapping. This research contributes to our understanding of permafrost landforms and their environmental implications.

[10] Dual Prompting Image Restoration with Diffusion Transformers

Dehong Kong,Fan Li,Zhixin Wang,Jiaqi Xu,Renjing Pei,Wenbo Li,WenQi Ren

Main category: cs.CV

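TL;DR: DPIR is a new image restoration method that improves restoration quality through a dual prompting control branch and conditional information extracted from multiple perspectives.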

Details Motivation: Existing restoration methods built on latent diffusion models with U-Net backbones remain limited in quality, while diffusion transformers (DiTs) are a promising alternative; DPIR aims to improve restoration through dual prompting and multi-perspective condition extraction. Method: DPIR has two branches: a low-quality image conditioning branch and a dual prompting control branch that combines global-local visual prompts with textual prompts. Result: Experiments show that DPIR delivers superior image restoration performance. Conclusion: Through dual prompting and multi-perspective condition extraction, DPIR offers a more effective solution for image restoration. Abstract: Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality with scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information of low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch utilizes a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, textual description alone cannot fully capture its rich visual characteristics. Therefore, a dual prompting module is designed to provide DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts serve as extra conditional control and, alongside textual prompts, form dual prompts that greatly enhance the quality of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.

[11] FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model

Kaicheng Pang,Xingxing Zou,Waikeung Wong

Main category: cs.CV

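TL;DR: FashionM3 is a multimodal, multitask, multiround fashion assistant built on a vision-language model that improves the user experience through personalized recommendation, alternative suggestion, product image generation, and virtual try-on.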

Details Motivation: Fashion styling and personalized recommendation carry substantial economic value in modern retail, and vision-language models open new opportunities for the retail industry. Method: Fine-tunes a vision-language model to build FashionM3, a multitask assistant with multiround interaction, trained on the FashionRec dataset of 331,124 multimodal dialogue samples. Result: Quantitative and qualitative evaluations and user studies show that FashionM3 performs strongly in recommendation effectiveness and practical value. Conclusion: FashionM3 delivers personalized suggestions through multiround interaction, demonstrating its potential and practicality for fashion recommendation. Abstract: Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value in the fashion industry. With the advent of vision-language models (VLM), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant, built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3's superior performance in recommendation effectiveness and practical value as a fashion assistant.

[12] VEU-Bench: Towards Comprehensive Understanding of Video Editing

Bozheng Li,Yongliang Wu,Yi Lu,Jiashuo Yu,Licheng Tang,Jiawang Cao,Wenqing Zhu,Yuyang Sun,Jay Wu,Wenbo Zhu

Main category: cs.CV

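TL;DR: The paper introduces VEU-Bench, a comprehensive benchmark for video editing understanding (VEU) tasks, reveals the shortcomings of current video large language models (Vid-LLMs) on these tasks, and proposes an expert model named Oscars that substantially improves performance.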

Details Motivation: Videos widely shared on the internet are usually edited, yet the capabilities of existing video large language models on video editing understanding tasks remain largely unexplored. Method: Proposes VEU-Bench, a benchmark covering 19 fine-grained tasks, and develops the Oscars model, which improves performance through fine-tuning. Result: Experiments show that current Vid-LLMs perform poorly on VEU tasks, while Oscars outperforms existing open-source Vid-LLMs by over 28.3% in accuracy; incorporating VEU data also improves performance on general video understanding benchmarks. Conclusion: VEU-Bench and Oscars fill a gap in video editing understanding and offer an effective route to improving Vid-LLM performance. Abstract: Widely shared videos on the internet are often edited. Recently, although Video Large Language Models (Vid-LLMs) have made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To enhance the annotation of VEU automatically, we built an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, our findings reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks.

[13] Fine-Tuning Adversarially-Robust Transformers for Single-Image Dehazing

Vlad Vasilescu,Ana Neacsu,Daniela Faur

Main category: cs.CV

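TL;DR: The paper studies the vulnerability of single-image dehazing to adversarial noise and proposes two lightweight fine-tuning strategies to improve the robustness of pre-trained transformers.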

Details Motivation: Single-image dehazing is important in remote sensing applications, but existing methods are not sufficiently robust to adversarial noise, which can cause significant performance drops. Method: Proposes two lightweight fine-tuning strategies to strengthen the adversarial robustness of pre-trained transformers. Result: Experiments show that even a 1-pixel change can decrease PSNR by as much as 2.8 dB, while the proposed methods keep comparable clean performance and significantly increase adversarial robustness. Conclusion: The paper demonstrates the methods' applicability in remote sensing scenarios and releases the code for adversarial fine-tuning and attack algorithms. Abstract: Single-image dehazing is an important topic in remote sensing applications, enhancing the quality of acquired images and increasing object detection precision. However, the reliability of such structures has not been sufficiently analyzed, which exposes them to the risk of imperceptible perturbations that can significantly hinder their performance. In this work, we show that state-of-the-art image-to-image dehazing transformers are susceptible to adversarial noise, with even a 1-pixel change being able to decrease the PSNR by as much as 2.8 dB. Next, we propose two lightweight fine-tuning strategies aimed at increasing the robustness of pre-trained transformers. Our methods result in comparable clean performance, while significantly increasing the protection against adversarial data. We further present their applicability in two remote sensing scenarios, showcasing their robust behavior for out-of-distribution data. The source code for adversarial fine-tuning and attack algorithms can be found at github.com/Vladimirescu/RobustDehazing.
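The robustness result above is reported as a PSNR drop in dB. As a reminder of how such a figure is computed, here is a minimal sketch of the standard PSNR formula, 10·log10(MAX² / MSE), applied to flat lists of pixel values. The pixel values are made up for illustration and are not taken from the paper; the function name and the 8-bit `max_value` default are assumptions.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Illustrative values only: a clean target, a dehazed output, and the output
# produced when the input has been adversarially perturbed.
clean = [100.0] * 16
dehazed = [102.0] * 16
dehazed_under_attack = [103.0] * 16

# The drop in PSNR caused by the attack, in dB.
drop = psnr(clean, dehazed) - psnr(clean, dehazed_under_attack)
```

Because PSNR is logarithmic in the mean squared error, a drop of a few dB corresponds to a multiplicative increase in error rather than an additive one.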

[14] Token Sequence Compression for Efficient Multimodal Computing

Yasmine Omri,Parth Shroff,Thierry Tambe

Main category: cs.CV

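TL;DR: The paper proposes an adaptive compression approach that optimizes vision-language models through visual token selection and merging, reducing redundancy and improving efficiency.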

Details Motivation: Current vision encoders are redundant and inefficient, so a more efficient way of processing multimodal data is needed. Method: Studies a range of visual token selection and merging methods through benchmarking and qualitative analysis, and proposes cluster-based token aggregation. Result: Experiments show that simple cluster-level token aggregation outperforms prior state-of-the-art token selection and merging methods. Conclusion: The study offers new ideas for efficient encoding and processing of high-dimensional data and advances scalable, sustainable multimodal systems. Abstract: The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency in current vision encoders, and seek to construct an adaptive compression method for multimodal data. In this work, we characterize a panoply of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art works in token selection and merging, including merging at the vision encoder level and attention-based approaches. We underline the redundancy in current vision encoders, and shed light on several puzzling trends regarding principles of visual token selection through cross-modal attention visualizations. This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
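One way to picture cluster-level token aggregation is to group visual token embeddings with k-means and keep one mean vector per cluster. The sketch below does exactly that in plain Python. The abstract does not specify the clustering algorithm, so k-means, the function names, and the toy two-dimensional tokens are all assumptions rather than the authors' method.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def aggregate_tokens(tokens, num_clusters, iterations=10, seed=0):
    """Compress `tokens` to `num_clusters` vectors using k-means centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(tokens, num_clusters)
    for _ in range(iterations):
        clusters = [[] for _ in range(num_clusters)]
        for token in tokens:
            nearest = min(range(num_clusters),
                          key=lambda k: squared_distance(token, centroids[k]))
            clusters[nearest].append(token)
        # Keep the previous centroid when a cluster ends up empty.
        centroids = [mean_vector(members) if members else centroids[k]
                     for k, members in enumerate(clusters)]
    return centroids

# Toy "visual tokens": two obvious groups, compressed from four tokens to two.
tokens = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
compressed = aggregate_tokens(tokens, num_clusters=2)
```

In a real vision-language model the tokens would be high-dimensional encoder outputs and the compressed sequence would be passed to the language model in place of the full one.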

[15] DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

Aniruddha Bala,Rohit Chowdhury,Rohan Jaiswal,Siddharth Roheda

Main category: cs.CV

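TL;DR: The paper proposes generating adversarial perturbations in the frequency domain by modifying DCT coefficients, protecting images from malicious editing while reducing visual artifacts and remaining robust to noise purification techniques.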

Details Motivation: Advances in diffusion models make it easy to edit images with text prompts, which raises image security concerns. Existing defenses add noise in pixel space that is easily noticed and is not robust to purification techniques such as JPEG compression. Method: Proposes an optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying DCT coefficients, using the JPEG pipeline to generate adversarial images. Result: Experiments across multiple tasks and datasets show that the approach prevents malicious editing, introduces fewer visual artifacts, and remains robust to noise purification techniques. Conclusion: The method protects images from malicious editing while balancing visual quality and robustness. Abstract: Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.
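To illustrate what perturbing an image in the DCT domain means, the sketch below transforms one row of eight pixels with an orthonormal DCT-II, nudges a single coefficient, and transforms back. This is only a toy: the paper optimizes the perturbation against an editing model, whereas the fixed offset, the choice of coefficient, and the pixel values here are assumptions made for illustration, and JPEG quantization is left out.

```python
import math

N = 8  # JPEG works on 8x8 blocks; a single row of a block is shown here

def dct(block):
    """Orthonormal DCT-II of a length-N sequence."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                               for n, x in enumerate(block)))
    return out

def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos(math.pi * (n + 0.5) * k / N)
        out.append(total)
    return out

row = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
coeffs = dct(row)
coeffs[3] += 4.0          # hypothetical perturbation of one mid-frequency coefficient
protected = idct(coeffs)  # back to pixel space

# With an orthonormal transform, the pixel-space energy of the change equals
# the coefficient-space energy (here 4.0 ** 2), spread across all eight pixels.
pixel_energy = sum((a - b) ** 2 for a, b in zip(protected, row))
```

A single coefficient change spreads smoothly over the whole block in pixel space, which is one intuition for why frequency-domain perturbations can be less visible than per-pixel noise.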

[16] CAMU: Context Augmentation for Meme Understanding

Girish A. Koushik,Diptesh Kanojia,Helen Treharne,Aditya Joshi

Main category: cs.CV

TL;DR: The CAMU framework combines vision-language models with parameter-efficient fine-tuning to improve hateful content detection in social media memes.

Details Motivation: Social media memes combine visual and textual cues within complex cultural contexts, which makes hateful content hard to detect with traditional methods. Method: CAMU uses large vision-language models to generate descriptive captions, a caption-scoring network to emphasise hate-relevant content, and parameter-efficient fine-tuning of CLIP's text encoder to improve multimodal understanding. Result: On the Hateful Memes dataset, CAMU reaches an accuracy of 0.807 and an F1-score of 0.806, on par with the existing SoTA framework but more efficient; on the MultiOFF dataset its F1-score is 0.673. Conclusion: CAMU shows efficient and generalisable detection of hateful and offensive content and highlights the importance of visual grounding and text representations. Abstract: Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions, a caption-scoring neural network to emphasise hate-relevant content, and parameter-efficient fine-tuning of CLIP's text encoder for an improved multimodal understanding of memes. Experiments on publicly available hateful meme datasets show that simple projection layer fine-tuning yields modest gains, whereas selectively tuning deeper text encoder layers significantly boosts performance on all evaluation metrics. Moreover, our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, on par with the existing SoTA framework while being much more efficient, offering practical advantages in real-world scenarios that rely on fixed decision thresholds. CAMU also achieves the best F1-score of 0.673 on the MultiOFF dataset for offensive meme identification, demonstrating its generalisability. Additional analyses on benign confounders reveal that robust visual grounding and nuanced text representations are crucial for reliable hate and offence detection. We will publicly release CAMU along with the resultant models for further research. Disclaimer: This paper includes references to potentially disturbing, hateful, or offensive content due to the nature of the task.
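Since the abstract reports accuracy and F1-score and refers to fixed decision thresholds, a small sketch of how both metrics follow from thresholded scores may help. The scores and labels below are invented, the 0.5 threshold is an assumption, and the F1 shown is the binary F1 on the positive class; the paper may use a different averaging.

```python
def accuracy_and_f1(scores, labels, threshold=0.5):
    """Accuracy and positive-class F1 after thresholding predicted scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Invented model scores (probability of "hateful") and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]
accuracy, f1 = accuracy_and_f1(scores, labels)
```

With a fixed threshold, both numbers depend on how well calibrated the scores are, which is why a model can rank examples well and still score poorly at the deployed threshold.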

[17] Masked strategies for images with small objects

H. Martin Gillis,Ming Hill,Paul Hollensen,Alan Fine,Thomas Trappenberg

Main category: cs.CV

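TL;DR: The study examines masked autoencoders (MAE) for image reconstruction and semantic segmentation of blood components, finding that smaller mask ratios and patch sizes improve reconstruction and that pre-trained weights benefit the segmentation of small blood components.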

Details Motivation: In blood component detection and classification, identifying objects only a few pixels in size against a background of similar objects is challenging, and conventional deep learning methods perform poorly on out-of-domain images. Method: Uses an MAE to learn ViT encoder representations while varying mask ratios and patch sizes, then applies the encoder weights to a U-Net Transformer for semantic segmentation. Result: Smaller mask ratios and patch sizes improve MAE image reconstruction; pre-trained weights improve the segmentation of smaller blood components. Conclusion: The method offers an efficient strategy for segmenting and classifying small objects. Abstract: Detecting and classifying small blood components in hematology analytics is a significant challenge, in particular when objects exist as small, pixel-sized entities in a large context of similar objects. Deep learning approaches using supervised models with pre-trained weights, such as residual networks and vision transformers, have demonstrated success for many applications. Unfortunately, when applied to images outside the domain of learned representations, these methods often result in less than acceptable performance. One strategy to overcome this is to use self-supervised models, where representations are learned and weights are then applied for downstream applications. Recently, masked autoencoders have proven to be effective to obtain representations that capture global context information. By masking regions of an image and having the model learn to reconstruct both the masked and non-masked regions, weights can be used for various applications. However, if the sizes of the objects in images are less than the size of the mask, the global context information is lost, making it almost impossible to reconstruct the image. In this study, we investigated the effect of mask ratios and patch sizes for blood components using a MAE to obtain learned ViT encoder representations. We then applied the encoder weights to train a U-Net Transformer for semantic segmentation to obtain both local and global contextual information. Our experimental results demonstrate that both smaller mask ratios and patch sizes improve the reconstruction of images using a MAE. We also show the results of semantic segmentation with and without pre-trained weights, where smaller-sized blood components benefited from pre-training. Overall, our proposed method offers an efficient and effective strategy for the segmentation and classification of small objects.
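The two knobs studied above, mask ratio and patch size, can be made concrete with a small sketch of random patch masking as commonly used in masked autoencoders. The function name, the 224-pixel image size, and the specific settings are assumptions for illustration; the abstract does not state which values the authors tested.

```python
import random

def random_patch_mask(image_size, patch_size, mask_ratio, seed=0):
    """Return (masked, visible) patch indices for a square image."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return sorted(masked), visible

# A common MAE-style setting: 16-pixel patches with 75% of patches hidden.
masked, visible = random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75)

# A smaller patch size and mask ratio leave more, finer-grained context visible.
masked_small, visible_small = random_patch_mask(image_size=224, patch_size=8,
                                                mask_ratio=0.5)
```

With 16-pixel patches an object only a few pixels wide can vanish entirely inside one masked patch, whereas 8-pixel patches and a lower ratio make it more likely that part of the object or its immediate surroundings stays visible to the encoder.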

[18] From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

Yabing Wang,Zhuotao Tian,Qingpei Guo,Zheng Qin,Sanping Zhou,Ming Yang,Le Wang

Main category: cs.CV

TL;DR: The paper proposes a two-stage framework for zero-shot composed image retrieval that addresses insufficient pseudo-word token representation, discrepancies between training and inference, and reliance on large-scale synthetic data.

Details Motivation: Because annotating composed image retrieval (CIR) data is costly, zero-shot CIR has become a promising alternative. Existing projection-based methods suffer from limited pseudo-word token representation capacity, inconsistency between training and inference, and reliance on synthetic data. Method: A two-stage framework: the first stage strengthens image-to-pseudo-word token learning with a visual semantic injection module and a soft text alignment objective; the second stage optimizes the text encoder with a small amount of synthetic data to extract compositional semantics. Result: Experiments on three public datasets show performance superior to existing methods. Conclusion: The proposed framework significantly improves zero-shot CIR performance and places low demands on the quality of the synthetic data. Abstract: Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient pseudo-word token representation capacity, (2) discrepancies between training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework where the training is accomplished from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer and fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data, and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments were conducted on three public datasets, achieving superior performance compared to existing approaches.

[19] RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation

Zheng Qin,Le Wang,Yabing Wang,Sanping Zhou,Gang Hua,Wei Tang

Main category: cs.CV

TL;DR: RSRNav models the spatial relationship between the goal and the current observation, addressing the weak directional information of semantic features and viewpoint inconsistency in ImageNav, and significantly improves navigation performance.

Details Motivation: To address the performance drop in ImageNav caused by inaccurate directional information in semantic features and by viewpoint inconsistency between training and application. Method: Spatial relationships between the goal and the current observation are constructed (fine-grained cross-correlation and direction-aware correlation) and passed to the policy network for action prediction. Result: Strong performance on three benchmark datasets, especially in the "user-matched goal" setting, demonstrating potential for real-world use. Conclusion: By modeling spatial relationships, RSRNav significantly improves the accuracy and robustness of image-goal navigation. Abstract: Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images, then passing them to a policy network. However, challenges remain: (1) Semantic features often fail to provide accurate directional information, leading to superfluous actions, and (2) performance drops significantly when viewpoint inconsistencies arise between training and application. To address these challenges, we propose RSRNav, a simple yet effective method that reasons spatial relationships between the goal and current observations as navigation guidance. Specifically, we model the spatial relationship by constructing correlations between the goal and current observations, which are then passed to the policy network for action prediction. These correlations are progressively refined using fine-grained cross-correlation and direction-aware correlation for more precise navigation. Extensive evaluation of RSRNav on three benchmark datasets demonstrates superior navigation performance, particularly in the "user-matched goal" setting, highlighting its potential for real-world applications.

[20] Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

Yuanbing Ouyang,Yizhuo Liang,Qingpeng Li,Xinfei Guo,Yiming Luo,Di Wu,Hao Wang,Yushan Pan

Main category: cs.CV

TL;DR: LVTP is a progressive token pruning framework based on multi-scale Tsallis entropy and low-level visual features that substantially reduces computational cost while preserving semantic segmentation performance.

Details Motivation: Vision Transformers perform well in semantic segmentation but are computationally expensive, and existing token pruning methods do not sufficiently account for the characteristics of visual data. Method: The LVTP framework combines multi-scale Tsallis entropy with low-level visual features and optimizes pruning through a dynamic scoring mechanism and two rounds of clustering. Result: Computational cost is reduced by 20%-45% across multiple datasets with negligible performance loss, with particularly good results in complex edge regions. Conclusion: LVTP achieves a better balance between computational cost and accuracy without additional training or architectural changes. Abstract: Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features with two rounds of clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while optimizing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
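
For readers unfamiliar with Tsallis entropy, the following sketch shows the standard definition, S_q(p) = (1 - Σ p_i^q) / (q - 1), which reduces to Shannon entropy as q approaches 1. The "multi-scale" combination below is only a guess at the general idea: the choice of q values, the uniform weights, and the per-token probability vectors are hypothetical and are not taken from the LVTP paper.

```python
import math

def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution; Shannon entropy in the limit q -> 1."""
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def multi_scale_score(probs, qs=(0.5, 1.0, 2.0), weights=None):
    """Weighted sum of Tsallis entropies at several q values (assumed scheme)."""
    if weights is None:
        weights = [1.0 / len(qs)] * len(qs)
    return sum(w * tsallis_entropy(probs, q) for w, q in zip(weights, qs))

# Hypothetical per-token class probabilities: one ambiguous token, one confident token.
ambiguous = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]
print(multi_scale_score(ambiguous), multi_scale_score(confident))
```

Under this reading, a token with a flat distribution scores higher than a confidently classified one, so a pruning rule could keep high-scoring tokens and merge or drop the rest. How LVTP actually weights the scales and uses the score is described in the paper, not here.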

[21] Federated Client-tailored Adapter for Medical Image Segmentation

Guyue Hu,Siyuan Song,Yukun Kang,Zhu Yin,Gangming Zhao,Chenglong Li,Jin Tang

Main category: cs.CV

TL;DR: A Federated Client-tailored Adapter (FCA) framework is proposed for medical image segmentation, addressing the training instability caused by distributed data islands and client-wise domain heterogeneity.

Details Motivation: Existing methods rely on centralized learning, which does not suit distributed data scenarios, and federated learning suffers from unstable training due to client-wise domain heterogeneity (such as distribution diversity and class imbalance). Method: The FCA framework uses universal knowledge from off-the-shelf medical foundation models to stabilize training, and develops two client-tailored updating strategies that update the common and individual components separately. Result: Experiments on three large-scale datasets validate the effectiveness and superiority of the FCA framework. Conclusion: The FCA framework achieves stable, client-tailored medical image segmentation without sharing sensitive data. Abstract: Medical image segmentation in X-ray images is beneficial for computer-aided diagnosis and lesion localization. Existing methods mainly fall into a centralized learning paradigm, which is inapplicable in the practical medical scenario that only has access to distributed data islands. Federated Learning has the potential to offer a distributed solution but struggles with heavy training instability due to client-wise domain heterogeneity (including distribution diversity and class imbalance). In this paper, we propose a novel Federated Client-tailored Adapter (FCA) framework for medical image segmentation, which achieves stable and client-tailored adaptive segmentation without sharing sensitive local data. Specifically, the federated adapter stirs universal knowledge in off-the-shelf medical foundation models to stabilize the federated training process. In addition, we develop two client-tailored federated updating strategies that adaptively decompose the adapter into common and individual components, then globally and independently update the parameter groups associated with common client-invariant and individual client-specific units, respectively. They further stabilize the heterogeneous federated learning process and realize optimal client-tailored instead of sub-optimal global-compromised segmentation models. Extensive experiments on three large-scale datasets demonstrate the effectiveness and superiority of the proposed FCA framework for federated medical segmentation.
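
The idea of updating common parameters globally while keeping individual parameters local can be sketched as below. This is a simplified stand-in, not the FCA algorithm: the abstract says the adapter is decomposed adaptively, whereas this sketch assumes the split is given by hand through a hypothetical `common_keys` list and uses plain unweighted averaging.

```python
def federated_round(clients, common_keys):
    """Average the parameter groups named in common_keys across all clients.

    clients: list of dicts mapping a parameter-group name to a list of floats.
    Groups not listed in common_keys stay client-specific and are left untouched.
    """
    n = len(clients)
    for key in common_keys:
        length = len(clients[0][key])
        averaged = [sum(c[key][i] for c in clients) / n for i in range(length)]
        for c in clients:
            c[key] = list(averaged)  # every client receives the shared value
    return clients

# Two hypothetical clients with one shared and one client-specific parameter group.
clients = [
    {"common": [1.0, 3.0], "individual": [10.0]},
    {"common": [3.0, 5.0], "individual": [20.0]},
]
federated_round(clients, common_keys=["common"])
print(clients)
```

Only the averaged groups would need to leave a client, which matches the abstract's point that sensitive local data is never shared.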

[22] ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification

Shuanglin Yan,Neng Dong,Shuang Li,Rui Yan,Hao Tang,Jing Qin

Main category: cs.CV

TL;DR: A Body Shape-aware Textual Alignment (BSaTa) framework is proposed that explicitly models body shape information to improve visible-infrared person re-identification (VIReID).

Details Motivation: Existing methods rely only on identity label supervision, which makes it hard to fully extract high-level semantic information, and they do not explicitly model body shape features, which are crucial for cross-modal matching. Method: A Body Shape Textual Alignment (BSTA) module and a Text-Visual Consistency Regularizer (TVCR) are designed, together with a Shape-aware Representation Learning (SRL) mechanism that combines multi-text supervision and distribution consistency constraints. Result: Superior performance on the SYSU-MM01 and RegDB datasets. Conclusion: By explicitly modeling body shape information, the BSaTa framework effectively improves VIReID performance. Abstract: Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.

[23] A Large Vision-Language Model based Environment Perception System for Visually Impaired People

Zezhou Chen,Zhaoxiang Liu,Kai Wang,Kohou Wang,Shiguo Lian

Main category: cs.CV

TL;DR: The paper proposes an environment perception system based on a Large Vision-Language Model (LVLM) that helps visually impaired people understand their surroundings through a wearable device and obtain scene descriptions and object information interactively.

Details Motivation: The complexity of natural scenes makes it hard for visually impaired people to perceive their environment, which limits their personal and social activities. Method: The system combines an LVLM with a segmentation model, captures the scene with a wearable device, and provides a global description, object categories, and detailed object descriptions. The segmentation result of the RGB image is fed to the LVLM as external knowledge to reduce hallucination. Result: Technical experiments on POPE, MME, and LLaVA-QA90 show that the system gives more accurate scene descriptions than Qwen-VL-Chat; exploratory experiments confirm that it effectively helps visually impaired people perceive their environment. Conclusion: By combining an LVLM with a segmentation model, the system significantly improves environment perception for visually impaired people. Abstract: It is a challenging task for visually impaired people to perceive their surrounding environment due to the complexity of the natural scenes. Their personal and social activities are thus highly limited. This paper introduces a Large Vision-Language Model (LVLM) based environment perception system which helps them to better understand the surrounding environment, by capturing the current scene they face with a wearable device, and then letting them retrieve the analysis results through the device. The visually impaired people could acquire a global description of the scene by long pressing the screen to activate the LVLM output, retrieve the categories of the objects in the scene resulting from a segmentation model by tapping or swiping the screen, and get a detailed description of the objects they are interested in by double-tapping the screen. To help visually impaired people more accurately perceive the world, this paper proposes incorporating the segmentation result of the RGB image as external knowledge into the input of LVLM to reduce the LVLM's hallucination. Technical experiments on POPE, MME and LLaVA-QA90 show that the system could provide a more accurate description of the scene compared to Qwen-VL-Chat; exploratory experiments show that the system helps visually impaired people to perceive the surrounding environment effectively.

[24] Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models

Chen Chen,Daochang Liu,Mubarak Shah,Chang Xu

Main category: cs.CV

TL;DR: PRSS refines classifier-free guidance in diffusion models by combining prompt re-anchoring (PR) and semantic prompt search (SS), achieving a better balance between privacy and utility.

Details Motivation: Text-to-image diffusion models memorize training set images, which can raise originality and privacy concerns, and existing mitigation methods tend to sacrifice utility when improving privacy. Method: The PRSS method combines prompt re-anchoring (PR) and semantic prompt search (SS) to optimize the privacy-utility trade-off. Result: Experiments show that PRSS significantly improves the privacy-utility balance at different privacy levels, reaching a new state of the art. Conclusion: PRSS offers diffusion models a new and effective way to balance privacy and utility. Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
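
As background for this entry, the sketch below shows the standard classifier-free guidance combination that PRSS is said to refine: the guided noise estimate is the unconditional estimate plus a scaled difference between the conditional and unconditional estimates. The function name and the toy vectors are assumptions; the prompt re-anchoring and semantic prompt search steps themselves are not reproduced here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions standing in for a denoiser's outputs (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
print(classifier_free_guidance(eps_uncond, eps_cond, scale=2.0))
```

A scale of 0 returns the unconditional estimate and a scale of 1 returns the conditional one; larger scales push the sample further toward the prompt, which is the knob that trades text alignment against other properties.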

[25] Cabbage: A Differential Growth Framework for Open Surfaces

Xiaoyi Liu,Hao Tang

Main category: cs.CV

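TL;DR: Cabbage is a differential growth framework for simulating 3D open surfaces (such as curling flower petals) that produces high-quality triangular meshes free of self-intersection.

Details Motivation: To simulate the differential growth of open surfaces found in nature, such as the curling of flower petals, and to provide a tool for computational modeling, digital fabrication, and education. Method: Cabbage-Shell is driven by edge subdivision; shell forces expand the surface, feature-aware smoothing and remeshing maintain mesh quality, and corrective collision prevents self-intersection. Result: Cabbage outperforms existing methods in morphological expressiveness, mesh quality, and stability, generating complex patterns over hundreds of simulation steps. Conclusion: Cabbage is the first open-source differential growth framework of this quality; it suits computational modeling, digital fabrication, and education, and provides data for geometry processing and shape analysis. Abstract: We propose Cabbage, a differential growth framework to model buckling behavior in 3D open surfaces found in nature, like the curling of flower petals. Cabbage creates high-quality triangular meshes free of self-intersection. Cabbage-Shell is driven by edge subdivision which differentially increases discretization resolution. Shell forces expand the surface, generating buckling over time. Feature-aware smoothing and remeshing ensure mesh quality. Corrective collision effectively prevents self-collision even in tight spaces. We additionally provide Cabbage-Collision, an approximate alternative, followed by CAD-ready surface generation. Cabbage is the first open-source effort with this calibre and robustness, outperforming SOTA methods in morphological expressiveness and mesh quality, and it stably generates large, complex patterns over hundreds of simulation steps. It serves not only as a tool for computational modeling, digital fabrication, and education, but also as a source of high-quality, annotated data for geometry processing and shape analysis.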

[26] DMS-Net: Dual-Modal Multi-Scale Siamese Network for Binocular Fundus Image Classification

Guohao Huo,Zibo Lin,Zitong Wang,Ruiting Dai,Hao Tang

Main category: cs.CV

TL;DR: DMS-Net is a dual-modal multi-scale Siamese network for binocular fundus image classification that improves performance with a multi-scale context-aware module and a dual-modal feature fusion module, and performs well on the ODIR-5K dataset.

Details Motivation: Traditional diagnosis methods and single-eye deep learning approaches fail to account for binocular pathological correlations and need improvement. Method: Weight-shared Siamese ResNet-152 backbones extract features, and a multi-scale context-aware module and a dual-modal feature fusion module are introduced. Result: On the ODIR-5K dataset, the model reaches 80.5% accuracy, 86.1% recall, and 83.8% Cohen's kappa. Conclusion: DMS-Net performs well in detecting symmetric pathologies and supporting clinical decision-making. Abstract: Ophthalmic diseases pose a significant global health challenge, yet traditional diagnosis methods and existing single-eye deep learning approaches often fail to account for binocular pathological correlations. To address this, we propose DMS-Net, a dual-modal multi-scale Siamese network for binocular fundus image classification. Our framework leverages weight-shared Siamese ResNet-152 backbones to extract deep semantic features from paired fundus images. To tackle challenges such as lesion boundary ambiguity and scattered pathological distributions, we introduce a Multi-Scale Context-Aware Module (MSCAM) that integrates adaptive pooling and attention mechanisms for multi-resolution feature aggregation. Additionally, a Dual-Modal Feature Fusion (DMFF) module enhances cross-modal interaction through spatial-semantic recalibration and bidirectional attention, effectively combining global context and local edge features. Evaluated on the ODIR-5K dataset, DMS-Net achieves state-of-the-art performance with 80.5% accuracy, 86.1% recall, and 83.8% Cohen's kappa, demonstrating superior capability in detecting symmetric pathologies and advancing clinical decision-making for ocular diseases.
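
Cohen's kappa, one of the metrics reported above, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for single-label predictions. It is a generic illustration with made-up labels; how the paper aggregates kappa over the ODIR-5K labels is not stated in the abstract and is not assumed here.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)

# Made-up labels: 3 of 4 predictions agree, chance agreement is 0.5.
print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Because chance agreement is subtracted, kappa can be noticeably lower than accuracy on imbalanced data, which is why it is often reported alongside accuracy and recall.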

[27] A BERT-Style Self-Supervised Learning CNN for Disease Identification from Retinal Images

Xin Li,Wenhui Zhu,Peijie Qiu,Oana M. Dumitrascu,Amal Youssef,Yalin Wang

Main category: cs.CV

TL;DR: The paper proposes combining a lightweight CNN (nn-MobileNet) with self-supervised learning for medical image analysis, addressing the scarcity of labeled data and high computational demands.

Details Motivation: In medical image analysis, deep learning depends on large amounts of labeled data, but high-quality labels are expensive and hard to obtain. Vision Transformers (ViT) can exploit unlabeled data, but they are computationally demanding and lack locality. Method: The nn-MobileNet framework is combined with BERT-style self-supervised learning, pre-trained on unlabeled retinal images, and validated on the identification of Alzheimer's disease, Parkinson's disease, and other conditions. Result: The pre-trained model significantly improves downstream task performance. Conclusion: A lightweight CNN combined with self-supervised learning can make effective use of unlabeled data and offers a practical option when labels are scarce. Abstract: In the field of medical imaging, the advent of deep learning, especially the application of convolutional neural networks (CNNs), has revolutionized the analysis and interpretation of medical images. Nevertheless, deep learning methods usually rely on large amounts of labeled data. In medical imaging research, the acquisition of high-quality labels is both expensive and difficult. The introduction of Vision Transformers (ViT) and self-supervised learning provides a pre-training strategy that utilizes abundant unlabeled data, effectively alleviating the label acquisition challenge while broadening the breadth of data utilization. However, ViT's high computational density and substantial demand for computing power, coupled with the lack of localization characteristics of its operations on image patches, limit its efficiency and applicability in many application scenarios. In this study, we employ nn-MobileNet, a lightweight CNN framework, to implement a BERT-style self-supervised learning approach. We pre-train the network on the unlabeled retinal fundus images from the UK Biobank to improve downstream application performance. We validate the results of the pre-trained model on Alzheimer's disease (AD), Parkinson's disease (PD), and various retinal diseases identification. The results show that our approach can significantly improve performance in the downstream tasks. In summary, this study combines the benefits of CNNs with the capabilities of advanced self-supervised learning in handling large-scale unlabeled data, demonstrating the potential of CNNs in the presence of label scarcity.

[28] POET: Prompt Offset Tuning for Continual Human Action Adaptation

Prachi Garg,Joseph K J,Vineeth N Balasubramanian,Necati Cihan Camgoz,Chengde Wan,Kenrick Kin,Weiguang Si,Shugao Ma,Fernando De La Torre

Main category: cs.CV

TL;DR: POET proposes a privacy-aware few-shot continual action recognition method that tunes spatio-temporal learnable prompt offsets on a lightweight backbone, without storing sensitive user data.

Details Motivation: Extended reality (XR) devices need personalized action recognition, but existing models are static and limited to a fixed set of classes; users need to add new classes efficiently from few samples while protecting their privacy. Method: POET (Prompt-Offset Tuning) is proposed, based on a lightweight backbone and spatio-temporal learnable prompt offset tuning, and is the first application of such prompt tuning to Graph Neural Networks. Result: On the NTU RGB+D and SHREC-2017 datasets, POET outperforms the baseline methods. Conclusion: POET provides an efficient solution for privacy-aware continual action recognition that suits XR devices. Abstract: As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, while this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) NTU RGB+D dataset for activity recognition, and (ii) SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms comprehensive benchmarks. Source code at https://github.com/humansensinglab/POET-continual-action-recognition.

[29] S3MOT: Monocular 3D Object Tracking with Selective State Space Model

Zhuohao Yan,Shaoquan Feng,Xingxing Li,Yuxuan Zhou,Chunxi Xia,Shengyu Li

Main category: cs.CV

TL;DR: Three techniques (HSSM, FCOE, VeloSSM) are proposed to improve monocular 3D multi-object tracking, reaching 76.86 HOTA on the KITTI benchmark and outperforming the previous best method.

Details Motivation: Monocular 3D multi-object tracking is essential in robotics and computer vision, but existing methods struggle to mine 3D spatiotemporal associations from 2D video streams. Method: (1) HSSM: an efficient data association mechanism; (2) FCOE: uses dense feature maps directly to improve re-identification accuracy; (3) VeloSSM: models temporal dependencies in velocity to improve 6-DoF pose estimation. Result: 76.86 HOTA at 31 FPS on the KITTI test benchmark, exceeding the previous best method by +2.63 HOTA and +3.62 AssA. Conclusion: The method is efficient and robust for monocular 3D multi-object tracking, and the code and models are open source. Abstract: Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
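
To make the contrast in this abstract concrete, the sketch below shows the kind of traditional baseline it refers to: a linear assignment between tracks and detections driven by a hand-crafted cost, here the Euclidean distance between centers. It is not HSSM. For brevity it searches all assignments by brute force, which is only practical for a handful of objects; the Hungarian algorithm solves the same problem in polynomial time. The function name and coordinates are hypothetical.

```python
import itertools
import math

def associate(tracks, detections):
    """Match each track to a distinct detection, minimising total center distance.

    Assumes len(tracks) <= len(detections). Returns (pairs, total_cost), where
    pairs is a list of (track_index, detection_index).
    """
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(detections)), len(tracks)):
        # Hand-crafted association cost: sum of Euclidean distances.
        cost = sum(math.dist(tracks[i], detections[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost

# Two hypothetical tracks and three detections (one detection is a distractor).
tracks = [(0.0, 0.0), (10.0, 10.0)]
detections = [(9.0, 9.0), (1.0, 1.0), (50.0, 50.0)]
print(associate(tracks, detections))
```

The cost function here is fixed by the designer, which is the limitation the abstract contrasts with HSSM's learned, context-dependent association.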

[30] Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

Weipeng Tan,Chuming Lin,Chengming Xu,FeiFan Xu,Xiaobin Hu,Xiaozhong Ji,Junwei Zhu,Chengjie Wang,Yanwei Fu

Main category: cs.CV

TL;DR: The paper proposes the DICE-Talk framework, which disentangles identity from emotion and lets emotions with similar characteristics cooperate, addressing the shortcomings of existing emotional talking head generation methods in emotional expression, identity preservation, and the learning of emotion correlations.

Details Motivation: Existing emotional talking head generation methods underuse emotional cues, leak identity into emotion representations, and learn emotion correlations in isolation. Method: A disentangled emotion embedder, a correlation-enhanced emotion conditioning module, and an emotion discrimination objective are proposed, implemented through cross-modal attention, Emotion Banks, and latent-space classification. Result: Strong performance on the MEAD and HDTF datasets, with emotion accuracy above existing methods and competitive lip-sync performance. Conclusion: DICE-Talk generates identity-preserving, emotionally rich talking heads and adapts to unseen identities. Abstract: Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.

[31] Study on Real-Time Road Surface Reconstruction Using Stereo Vision

Deepak Ghimire,Byoungjun Kim,Donghoon Kim,SungHwan Jeong

Main category: cs.CV

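TL;DR: This paper optimizes the RoadBEV framework with Isomorphic Global Structured Pruning and a redesigned head network, improving the efficiency and accuracy of real-time road surface reconstruction.

Details Motivation: Road surface reconstruction is crucial for autonomous driving and requires efficient real-time inference on edge devices. Method: Isomorphic Global Structured Pruning is applied to the stereo feature extraction backbone, and the head network is redesigned with an optimized hourglass structure, dynamic attention heads, reduced feature channels, mixed precision inference, and efficient probability volume computation. Result: The method increases inference speed and lowers reconstruction error. Conclusion: The optimized framework is suitable for real-time road surface reconstruction in autonomous driving. Abstract: Road surface reconstruction plays a crucial role in autonomous driving, providing essential information for safe and smooth navigation. This paper enhances the RoadBEV [1] framework for real-time inference on edge devices by optimizing both efficiency and accuracy. To achieve this, we propose applying Isomorphic Global Structured Pruning to the stereo feature extraction backbone, reducing network complexity while maintaining performance. Additionally, the head network is redesigned with an optimized hourglass structure, dynamic attention heads, reduced feature channels, mixed precision inference, and efficient probability volume computation. Our approach improves inference speed while achieving lower reconstruction error, making it well-suited for real-time road surface reconstruction in autonomous driving.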

[32] Salient Region-Guided Spacecraft Image Arbitrary-Scale Super-Resolution Network

Jingfan Yang,Hu Gao,Ying Zhang,Depeng Dang

Main category: cs.CV

TL;DR: A salient region-guided spacecraft image arbitrary-scale super-resolution network (SGSASR) is proposed, which identifies the spacecraft core region and selectively fuses features to improve super-resolution quality.

Details Motivation: On spacecraft images, existing arbitrary-scale super-resolution methods overlook the difference between the core region and the black background, introducing irrelevant noise. Method: A spacecraft core region recognition block (SCRRB) and an adaptive-weighted feature fusion enhancement mechanism (AFFEM) are designed to selectively fuse core region features with general image features. Result: Experiments show that SGSASR outperforms existing methods. Conclusion: Through salient region guidance and feature fusion, SGSASR effectively improves the super-resolution quality of spacecraft images. Abstract: Spacecraft image super-resolution seeks to enhance low-resolution spacecraft images into high-resolution ones. Although existing arbitrary-scale super-resolution methods perform well on general images, they tend to overlook the difference in features between the spacecraft core region and the large black space background, introducing irrelevant noise. In this paper, we propose a salient region-guided spacecraft image arbitrary-scale super-resolution network (SGSASR), which uses features from the spacecraft core salient regions to guide latent modulation and achieve arbitrary-scale super-resolution. Specifically, we design a spacecraft core region recognition block (SCRRB) that identifies the core salient regions in spacecraft images using a pre-trained saliency detection model. Furthermore, we present an adaptive-weighted feature fusion enhancement mechanism (AFFEM) to selectively aggregate the spacecraft core region features with general image features using a dynamic weight parameter to enhance the response of the core salient regions. Experimental results demonstrate that the proposed SGSASR outperforms state-of-the-art approaches.

[33] MASF-YOLO: An Improved YOLOv11 Network for Small Object Detection on Drone View

Liugang Lu,Dabin He,Congxiang Liu,Zhixiang Deng

Main category: cs.CV

TL;DR: A new YOLOv11-based object detection network, MASF-YOLO, is proposed; through multi-scale feature aggregation and adaptive fusion it significantly improves small object detection accuracy in drone imagery, and its advantage is validated on the VisDrone2019 dataset.

Details Motivation: In UAV images, targets occupy very few pixels, object scales vary widely, and backgrounds are complex, which limits the practical use of object detection. Method: A Multi-scale Feature Aggregation Module (MFAM), an Improved Efficient Multi-scale Attention Module (IEMA), and a Dimension-Aware Selective Integration Module (DASI) are designed to improve small object detection and suppress background noise. Result: On the VisDrone2019 validation set, MASF-YOLO-s improves on YOLOv11-s by 4.6% in mAP@0.5 and 3.5% in mAP@0.5:0.95; it also outperforms YOLOv11-m with fewer parameters and lower computational cost. Conclusion: MASF-YOLO has a clear competitive advantage in both detection accuracy and model efficiency. Abstract: With the rapid advancement of Unmanned Aerial Vehicle (UAV) and computer vision technologies, object detection from UAV perspectives has emerged as a prominent research area. However, challenges for detection brought by the extremely small proportion of target pixels, significant scale variations of objects, and complex background information in UAV images have greatly limited the practical applications of UAVs. To address these challenges, we propose a novel object detection network Multi-scale Context Aggregation and Scale-adaptive Fusion YOLO (MASF-YOLO), which is developed based on YOLOv11. Firstly, to tackle the difficulty of detecting small objects in UAV images, we design a Multi-scale Feature Aggregation Module (MFAM), which significantly improves the detection accuracy of small objects through parallel multi-scale convolutions and feature fusion. Secondly, to mitigate the interference of background noise, we propose an Improved Efficient Multi-scale Attention Module (IEMA), which enhances the focus on target regions through feature grouping, parallel sub-networks, and cross-spatial learning. Thirdly, we introduce a Dimension-Aware Selective Integration Module (DASI), which further enhances multi-scale feature fusion capabilities by adaptively weighting and fusing low-dimensional features and high-dimensional features. Finally, we conducted extensive performance evaluations of our proposed method on the VisDrone2019 dataset. Compared to YOLOv11-s, MASF-YOLO-s achieves improvements of 4.6% in mAP@0.5 and 3.5% in mAP@0.5:0.95 on the VisDrone2019 validation set. 
Remarkably, MASF-YOLO-s outperforms YOLOv11-m while requiring only approximately 60% of its parameters and 65% of its computational cost. Furthermore, comparative experiments with state-of-the-art detectors confirm that MASF-YOLO-s maintains a clear competitive advantage in both detection accuracy and model efficiency.
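
The mAP@0.5 metric quoted above counts a detection as correct only if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below, with made-up box coordinates, computes IoU and shows why small objects are harder under this rule: the same 2-pixel localization error costs a small box far more overlap than a large one.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Same 2-pixel horizontal shift applied to a 4x4 box and a 40x40 box (hypothetical values).
small = box_iou((0, 0, 4, 4), (2, 0, 6, 4))
large = box_iou((0, 0, 40, 40), (2, 0, 42, 40))
print(f"small box IoU: {small:.3f}, large box IoU: {large:.3f}")
```

In this toy case the small box falls below the 0.5 threshold while the large box stays well above it, which is one reason small-object benchmarks such as drone imagery are sensitive to localization quality.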

[34] ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

Yi-Xing Peng,Qize Yang,Yu-Ming Tang,Shenghao Fu,Kun-Yu Lin,Xihan Wei,Wei-Shi Zheng

Main category: cs.CV

TL;DR: ActionArt is a fine-grained video-caption dataset for improving multimodal models' understanding of human actions, with proxy tasks that reduce reliance on costly manual annotation.

Details Motivation: Fine-grained understanding of human actions and poses is essential for AI applications, but existing models fall short because finely annotated data is scarce. Method: The ActionArt dataset is developed, containing diverse human action videos with detailed annotations, and proxy tasks driven by automatically generated data are proposed. Result: Experiments show that the proxy tasks significantly narrow the performance gap to manually annotated data. Conclusion: Proxy tasks are an effective way to improve fine-grained understanding while reducing dependence on manual annotation. Abstract: Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each accompanied by detailed annotations that meticulously label every limb movement. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions. Experimental results indicate that, while current large multimodal models perform commendably on various tasks, they often fall short in achieving fine-grained understanding. We attribute this limitation to the scarcity of meticulously annotated data, which is both costly and difficult to scale manually. To address this, we propose proxy tasks to enhance the model perception ability in both spatial and temporal dimensions. These proxy tasks are carefully crafted to be driven by data automatically generated from existing MLLMs, thereby reducing the reliance on costly manual labels. Experimental results show that the proposed proxy tasks significantly narrow the gap toward the performance achieved with manually annotated fine-grained data.

[35] E-InMeMo: Enhanced Prompting for Visual In-Context Learning

Jiahao Zhang,Bowen Wang,Hong Liu,Liangzhi Li,Yuta Nakashima,Hajime Nagahara

Main category: cs.CV

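TL;DR: The paper proposes E-InMeMo, which uses learnable perturbations to optimize prompt quality for visual in-context learning (ICL), significantly improving performance.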

Details Motivation: The success of visual ICL depends on prompt quality, which existing methods do not sufficiently optimize. Method: Introduces learnable perturbations into in-context pairs to optimize prompting. Result: On standard vision tasks, E-InMeMo significantly outperforms existing methods, for example improving mIoU by 7.99 for foreground segmentation and by 17.04 for single object detection. Conclusion: E-InMeMo is a lightweight and effective strategy for enhancing visual ICL. Abstract: Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
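
The mIoU figures quoted above refer to mean intersection-over-union. As a reading aid, here is a minimal sketch of the standard metric on flat label lists; the function name `mean_iou` and the toy labels are illustrative assumptions, not code from the E-InMeMo repository, and the function returns a fraction in [0, 1] whereas the abstract appears to report points on a 0-100 scale.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are equal-length flat lists of integer class labels
    (hypothetical toy format; real masks would be flattened first).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(score, 4))
```

Skipping classes that are absent from both prediction and target is one common convention; other evaluation protocols count them differently, so absolute numbers are only comparable under the same protocol.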

[36] PerfCam: Digital Twinning for Production Lines Using 3D Gaussian Splatting and Vision Models

Michel Gokan Khan,Renan Guarese,Fabian Johnson,Xi Vincent Wang,Anders Bergman,Benjamin Edvinsson,Mario Romero,Jérémy Vachier,Jan Kronqvist

Main category: cs.CV

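TL;DR: PerfCam is an open-source digital twinning framework that combines 3D Gaussian Splatting with computer vision models for object tracking and KPI extraction in industrial production lines.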

Details Motivation: Provide real-time KPIs and digital twin capabilities for industrial production lines, supporting operational analytics in smart manufacturing environments. Method: Uses 3D reconstruction and convolutional neural networks (CNNs) for semi-automated object tracking and spatial mapping. Result: PerfCam's effectiveness is validated on realistic production lines in the pharmaceutical industry, and a dataset is published openly to support further research. Conclusion: PerfCam delivers actionable insights through precise digital twin capabilities, making it an effective tool in smart manufacturing environments. Abstract: We introduce PerfCam, an open source Proof-of-Concept (PoC) digital twinning framework that combines camera and sensory data with 3D Gaussian Splatting and computer vision models for digital twinning, object tracking, and Key Performance Indicators (KPIs) extraction in industrial production lines. By utilizing 3D reconstruction and Convolutional Neural Networks (CNNs), PerfCam offers a semi-automated approach to object tracking and spatial mapping, enabling digital twins that capture real-time KPIs such as availability, performance, Overall Equipment Effectiveness (OEE), and rate of conveyor belts in the production line. We validate the effectiveness of PerfCam through a practical deployment within realistic test production lines in the pharmaceutical industry and contribute an openly published dataset to support further research and development in the field. The results demonstrate PerfCam's ability to deliver actionable insights through its precise digital twin capabilities, underscoring its value as an effective tool for developing usable digital twins in smart manufacturing environments and extracting operational analytics.
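
The KPIs named in the abstract are related through the standard OEE definition, OEE = availability × performance × quality. The sketch below assumes that textbook definition; the abstract does not say how PerfCam computes these values, and the function name, argument names, and sample numbers are hypothetical.

```python
def oee(planned_time_s, run_time_s, ideal_cycle_time_s, total_count, good_count):
    """Return (availability, performance, quality, oee) as fractions.

    Textbook OEE definition, assumed here for illustration only.
    """
    if planned_time_s <= 0 or run_time_s <= 0 or total_count <= 0:
        raise ValueError("times and counts must be positive")
    availability = run_time_s / planned_time_s          # share of planned time spent running
    performance = (ideal_cycle_time_s * total_count) / run_time_s  # actual vs. ideal throughput
    quality = good_count / total_count                  # share of units without defects
    return availability, performance, quality, availability * performance * quality


# Hypothetical shift: 1000 s planned, 800 s running, 2 s ideal cycle time,
# 300 units counted on the conveyor belt, 270 of them good.
a, p, q, score = oee(1000, 800, 2, 300, 270)
print(f"availability={a:.2f} performance={p:.2f} quality={q:.2f} OEE={score:.2f}")
```

With these numbers the product is 0.8 × 0.75 × 0.9 = 0.54. In a vision-based setup, the unit counts and running time would come from the object tracker rather than from machine logs.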

[37] Label-independent hyperparameter-free self-supervised single-view deep subspace clustering

Lovro Sindicic,Ivica Kopriva

Main category: cs.CV

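TL;DR: The paper proposes a new single-view deep subspace clustering method that addresses five main problems of existing DSC algorithms: underused information, the assumption of independent tasks, the need for hyperparameter tuning, label-dependent termination, and reliance on post-processing.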

Details Motivation: Existing deep subspace clustering algorithms suffer from underused information, the assumption of independent tasks, the need for hyperparameter tuning, label-dependent termination, and reliance on post-processing, all of which limit their adoption. Method: Proposes a single-view DSC approach comprising layer-wise self-expression loss minimization, subspace-structured norm optimization, a multi-stage sequential learning framework, a relative-error-based self-stopping mechanism, and a strategy that retains a fixed number of coefficients. Result: Experiments on six datasets show that the method outperforms most linear SC algorithms with tuned hyperparameters and is competitive with the best linear methods. Conclusion: By improving information use, task integration, and termination, the method significantly improves the performance and practicality of deep subspace clustering. Abstract: Deep subspace clustering (DSC) algorithms face several challenges that hinder their widespread adoption across various application domains. First, clustering quality is typically assessed using only the encoder's output layer, disregarding valuable information present in the intermediate layers. Second, most DSC approaches treat representation learning and subspace clustering as independent tasks, limiting their effectiveness. Third, they assume the availability of a held-out dataset for hyperparameter tuning, which is often impractical in real-world scenarios. Fourth, learning termination is commonly based on clustering error monitoring, requiring external labels. Finally, their performance often depends on post-processing techniques that rely on labeled data. To address these limitations, we introduce a novel single-view DSC approach that: (i) minimizes a layer-wise self-expression loss using a joint representation matrix; (ii) optimizes a subspace-structured norm to enhance clustering quality; (iii) employs a multi-stage sequential learning framework, consisting of pre-training and fine-tuning, enabling the use of multiple regularization terms without hyperparameter tuning; (iv) incorporates a relative error-based self-stopping mechanism to terminate training without labels; and (v) retains a fixed number of leading coefficients in the learned representation matrix based on prior knowledge. We evaluate the proposed method on six datasets representing faces, digits, and objects. The results show that our method outperforms most linear SC algorithms with carefully tuned hyperparameters while maintaining competitive performance with the best performing linear approaches.
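
Point (iv) of the abstract describes a label-free stopping rule based on relative error. The abstract does not give the exact criterion, so the following is only a generic sketch of the idea: training stops once the relative change between consecutive loss values stays below a tolerance. The function name `should_stop`, the tolerance `tol`, and the `patience` parameter are assumptions for illustration.

```python
def should_stop(losses, tol=1e-3, patience=1):
    """Return True when the relative loss change stays below tol.

    losses   -- history of training-loss values, oldest first
    tol      -- hypothetical tolerance on |L[t-1] - L[t]| / |L[t-1]|
    patience -- number of consecutive small changes required
    No labels are involved: only the loss history is inspected.
    """
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    for prev, curr in zip(recent[:-1], recent[1:]):
        rel_change = abs(prev - curr) / max(abs(prev), 1e-12)
        if rel_change >= tol:
            return False
    return True


history = [1.0, 0.5, 0.4999]
print(should_stop(history))  # the last step changed the loss by about 0.02 %
```

Because the rule looks only at the loss trajectory, it can be applied in settings where no ground-truth cluster labels are available, which is the property the abstract emphasizes.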

[38] What is the Added Value of UDA in the VFM Era?

Brunó B. Englert,Tommie Kerssies,Gijs Dubbelman

Main category: cs.CV

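TL;DR: The study examines how unsupervised domain adaptation (UDA) behaves in more realistic and diverse data scenarios, finding that UDA outperforms source-only fine-tuning in synthetic-data scenarios but adds no value with diverse real data.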

Details Motivation: Clarify how UDA performs in realistic applications, especially compared with source-only fine-tuning, in order to support robust autonomous driving. Method: Evaluates UDA on semantic segmentation in synth-to-real and real-to-real scenarios and studies the effect of a small amount of labeled target data. Result: UDA outperforms source-only fine-tuning in synthetic-data scenarios (+2 mIoU) but has no advantage with diverse real data. With a small number of labels, UDA matches the fully-supervised model (85 mIoU). Conclusion: UDA remains valuable in specific scenarios (such as synthetic data) but should be used with care on diverse real data; further work is needed on how UDA can best support autonomous driving at scale. Abstract: Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs have strong generalization from their pre-training, more straightforward, source-only fine-tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative of real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) if source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA's improvement over source-only fine-tuning of VFMs reduces from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is always higher in all synthetic data scenarios than source-only fine-tuning and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.

[39] Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

Yin Tang,Jiankai Li,Hongyu Yang,Xuan Dong,Lifeng Fan,Weixin Li

Main category: cs.CV

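TL;DR: The paper proposes a new method named MCCL that addresses the challenges of image intent recognition through multi-grained compositional visual clue learning, significantly improving on the accuracy of existing methods.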

Details Motivation: The spread of social media makes image intent recognition important for individuals and society, but traditional methods struggle with the diversity and subjectivity of intent. Method: MCCL combines visual clue composition with multi-grained features, adopts class-specific prototypes to address data imbalance, and uses a graph convolutional network for multi-label classification. Result: Achieves state-of-the-art performance on the Intentonomy and MDID datasets while offering good interpretability. Conclusion: The method offers a new direction for understanding complex and diverse forms of human expression. Abstract: In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g. "enjoy life". Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Demonstrated by a state-of-the-art performance on the Intentonomy and MDID datasets, our approach advances the accuracy of existing methods while also possessing good interpretability. Our work provides an attempt for future explorations in understanding complex and miscellaneous forms of human expression.

[40] LiDAR-Guided Monocular 3D Object Detection for Long-Range Railway Monitoring

Raul David Dominguez Sanchez,Xavier Diaz Ortiz,Xingcheng Zhou,Max Peter Ronecker,Michael Karner,Daniel Watzenig,Alois Knoll

Main category: cs.CV

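TL;DR: The paper proposes a deep-learning-based method for long-range 3D object detection from monocular images, designed for railway automation, that detects objects up to 250 meters away.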

Details Motivation: The German railway system needs a high level of automation to cope with legacy infrastructure and growing train traffic; long-range perception is key, and the perception range of conventional automotive systems is insufficient. Method: Inspired by the Faraway-Frustum approach, the method uses only monocular images and incorporates LiDAR data during training to improve depth estimation. It has four modules: a modified YOLOv9, a depth estimation network, and short- and long-range 3D detection heads. Result: Validated on the OSDaR23 dataset, the method effectively detects objects up to 250 meters away. Conclusion: The method shows potential for railway automation; further improvement is needed. Abstract: Railway systems, particularly in Germany, require high levels of automation to address legacy infrastructure challenges and increase train traffic safely. A key component of automation is robust long-range perception, essential for early hazard detection, such as obstacles at level crossings or pedestrians on tracks. Unlike automotive systems with braking distances of ~70 meters, trains require perception ranges exceeding 1 km. This paper presents a deep-learning-based approach for long-range 3D object detection tailored for autonomous trains. The method relies solely on monocular images, inspired by the Faraway-Frustum approach, and incorporates LiDAR data during training to improve depth estimation. The proposed pipeline consists of four key modules: (1) a modified YOLOv9 for 2.5D object detection, (2) a depth estimation network, and (3-4) dedicated short- and long-range 3D detection heads. Evaluations on the OSDaR23 dataset demonstrate the effectiveness of the approach in detecting objects up to 250 meters. Results highlight its potential for railway automation and outline areas for future improvement.

[41] Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

Kun Li,Jianhui Wang,Yangfan He,Xinyuan Song,Ruoyu Wang,Hongyang He,Wenxin Zhang,Jiaqi Chen,Keqin Li,Sida Li,Miao Zhang,Tianyu Shi,Xueqian Wang

Main category: cs.CV

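TL;DR: Proposes a Visual Co-Adaptation (VCA) framework that optimizes generated images using a multi-round dialogue dataset and human feedback, significantly improving image consistency and alignment with user intent.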

Details Motivation: Text-driven image generation struggles to align high-resolution outputs with fine-grained user preferences, so multi-round interaction is required. Method: Combines human feedback with a multi-round dialogue dataset, applies multiple reward functions (diversity, consistency, and preference feedback), and fine-tunes the diffusion model via LoRA. Result: Experiments show that the method outperforms existing baselines in image consistency and alignment with user intent, with significantly higher user satisfaction. Conclusion: The VCA framework performs well in multi-round dialogue scenarios and offers an effective solution for interactive optimization in generative AI. Abstract: Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
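
The abstract mentions fine-tuning through LoRA. As background, the sketch below shows the core of low-rank adaptation on a single weight matrix: the pre-trained matrix W stays frozen and only two small factors B and A are learned, giving an effective weight W + (alpha / r) · B · A. It uses plain Python lists for self-containment; the helper names and toy values are illustrative assumptions and say nothing about which layers or ranks the VCA framework actually adapts.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w -- frozen pre-trained weight, shape (d_out, d_in)
    b -- trainable factor, shape (d_out, r)
    a -- trainable factor, shape (r, d_in)
    alpha -- scaling hyperparameter; r is the rank, read off from a
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update of shape (d_out, d_in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 update (all values hypothetical).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
print(lora_effective_weight(w, a, b, alpha=2.0))
```

Only B and A (here 4 numbers in total) would receive gradients; for realistic layer sizes with small r this is a small fraction of the parameters in W, which is what makes reward-driven fine-tuning of a large diffusion model affordable.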

[42] A Data-Centric Approach to 3D Semantic Segmentation of Railway Scenes

Nicolas Münger,Max Peter Ronecker,Xavier Diaz,Michael Karner,Daniel Watzenig,Jan Skaloud

Main category: cs.CV

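TL;DR: The paper proposes two data augmentation methods for railway scenes that improve LiDAR semantic segmentation accuracy at long range, validated on the OSDaR23 dataset.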

Details Motivation: Address the weak long-range performance of LiDAR semantic segmentation in railway scenes, particularly for persons and tracks. Method: (1) Person instance pasting, which improves distant pedestrian segmentation by injecting realistic variations; (2) track sparsification, which adjusts point cloud density to improve distant track segmentation. Result: Both methods significantly improve long-range segmentation while preserving close-range accuracy, and the first 3D semantic segmentation benchmark on OSDaR23 is established. Conclusion: Data-centric approaches can effectively address specific challenges in autonomous train perception. Abstract: LiDAR-based semantic segmentation is critical for autonomous trains, requiring accurate predictions across varying distances. This paper introduces two targeted data augmentation methods designed to improve segmentation performance on the railway-specific OSDaR23 dataset. The person instance pasting method enhances segmentation of pedestrians at distant ranges by injecting realistic variations into the dataset. The track sparsification method redistributes point density in LiDAR scans, improving track segmentation at far distances with minimal impact on close-range accuracy. Both methods are evaluated using a state-of-the-art 3D semantic segmentation network, demonstrating significant improvements in distant-range performance while maintaining robustness in close-range predictions. We establish the first 3D semantic segmentation benchmark for OSDaR23, demonstrating the potential of data-centric approaches to address railway-specific challenges in autonomous train perception.

[43] Unify3D: An Augmented Holistic End-to-end Monocular 3D Human Reconstruction via Anatomy Shaping and Twins Negotiating

Nanjie Yao,Gangjian Zhang,Wenhao Shen,Jian Shu,Hao Wang

Main category: cs.CV

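TL;DR: Proposes an end-to-end monocular 3D human reconstruction method that predicts a 3D avatar directly from a 2D image, without an explicit intermediate geometric representation.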

Details Motivation: Existing methods rely on a preceding model to produce an explicit geometric representation, which limits the integrity of the reconstruction task. Method: Uses an end-to-end network with an anatomy shape extraction module and a feature-interaction module between two U-Nets of different modalities, together with a data augmentation strategy. Result: Outperforms existing methods on two test sets and on in-the-wild cases. Conclusion: Through end-to-end design and feature interaction, the method significantly improves 3D human reconstruction. Abstract: Monocular 3D clothed human reconstruction aims to create a complete 3D avatar from a single image. To tackle the human geometry lacking in one RGB image, current methods typically resort to a preceding model for an explicit geometric representation. For the reconstruction itself, focus is on modeling both it and the input image. This routine is constrained by the preceding model, and overlooks the integrity of the reconstruction task. To address this, this paper introduces a novel paradigm that treats human reconstruction as a holistic process, utilizing an end-to-end network for direct prediction from 2D image to 3D avatar, eliminating any explicit intermediate geometry display. Based on this, we further propose a novel reconstruction framework consisting of two core components: the Anatomy Shaping Extraction module, which captures implicit shape features taking into account the specialty of human anatomy, and the Twins Negotiating Reconstruction U-Net, which enhances reconstruction through feature interaction between two U-Nets of different modalities. Moreover, we propose a Comic Data Augmentation strategy and construct 15k+ 3D human scans to bolster model performance on more complex inputs. Extensive experiments on two test sets and many in-the-wild cases show the superiority of our method over SOTA methods. Our demos can be found at: https://e2e3dgsrecon.github.io/e2e3dgsrecon/.

[44] Dense Geometry Supervision for Underwater Depth Estimation

Wenxiang Gua,Lin Qia

Main category: cs.CV

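TL;DR: This paper proposes a new monocular depth estimation method for underwater scenes; by building a cost-efficient dataset and designing a texture-depth fusion module, it significantly improves model accuracy and adaptability in underwater environments.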

Details Motivation: Monocular depth estimation for underwater scenes is under-researched and lacks supporting data and methods; this paper aims to address these challenges. Method: Builds a cost-efficient underwater dataset and designs a texture-depth fusion module based on underwater optical imaging principles. Result: Experiments on the FLSea dataset show that the approach significantly improves the accuracy and adaptability of underwater depth estimation. Conclusion: The paper offers a cost-effective solution for underwater monocular depth estimation with potential for practical use. Abstract: The field of monocular depth estimation is continually evolving with the advent of numerous innovative models and extensions. However, research on monocular depth estimation methods specifically for underwater scenes remains limited, compounded by a scarcity of relevant data and methodological support. This paper proposes a novel approach to address the existing challenges in current monocular depth estimation methods for underwater environments. We construct an economically efficient dataset suitable for underwater scenarios by employing multi-view depth estimation to generate supervisory signals and corresponding enhanced underwater images. We introduce a texture-depth fusion module, designed according to the underwater optical imaging principles, which aims to effectively exploit and integrate depth information from texture cues. Experimental results on the FLSea dataset demonstrate that our approach significantly improves the accuracy and adaptability of models in underwater settings. This work offers a cost-effective solution for monocular underwater depth estimation and holds considerable promise for practical applications.

[45] BiasBench: A reproducible benchmark for tuning the biases of event cameras

Andreas Ziegler,David Joseph,Thomas Gossard,Emil Moldovan,Andreas Zell

Main category: cs.CV

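TL;DR: BiasBench is a new event dataset for systematically testing the bias configurations of event cameras; the paper also proposes a reinforcement-learning-based method for online bias adjustment.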

Details Motivation: The output quality of event cameras depends on their bias configuration, but automatic configuration tools and systematic testing frameworks are lacking. Method: Presents the BiasBench dataset, covering multiple scenes and bias configurations, and develops a reinforcement-learning-based method for online bias adjustment. Result: BiasBench provides a basis for reproducible testing, and the reinforcement learning method can adjust biases effectively. Conclusion: BiasBench and the reinforcement learning method provide practical tools and solutions for configuring event camera biases. Abstract: Event-based cameras are bio-inspired sensors that detect light changes asynchronously for each pixel. They are increasingly used in fields like computer vision and robotics because of several advantages over traditional frame-based cameras, such as high temporal resolution, low latency, and high dynamic range. As with any camera, the output's quality depends on how well the camera's settings, called biases for event-based cameras, are configured. While frame-based cameras have advanced automatic configuration algorithms, there are very few such tools for tuning these biases. A systematic testing framework would require observing the same scene with different biases, which is tricky since event cameras only generate events when there is movement. Event simulators exist, but since biases heavily depend on the electrical circuit and the pixel design, available simulators are not well suited for bias tuning. To allow reproducibility, we present BiasBench, a novel event dataset containing multiple scenes with settings sampled in a grid-like pattern. We present three different scenes, each with a quality metric of the downstream application. Additionally, we present a novel, RL-based method to facilitate online bias adjustments.
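
To make concrete why a static scene produces no events and why threshold-like settings change the output, here is a sketch of the commonly used idealized event-generation model for a single pixel: an event fires whenever the log intensity has moved by at least a contrast threshold since the last event. This is a textbook simplification, not the BiasBench pipeline; as the abstract notes, real biases depend on the circuit and pixel design, and the function name and default thresholds below are assumptions.

```python
import math


def generate_events(intensities, threshold_on=0.2, threshold_off=0.2):
    """Idealized single-pixel event model.

    intensities -- positive brightness samples over time
    Returns a list of (sample_index, polarity) with polarity +1 (brighter)
    or -1 (darker). Thresholds are in log-intensity units (hypothetical
    stand-ins for the effect of contrast-related biases).
    """
    events = []
    reference = math.log(intensities[0])  # log intensity at the last event
    for i, value in enumerate(intensities[1:], start=1):
        log_value = math.log(value)
        while log_value - reference >= threshold_on:
            reference += threshold_on
            events.append((i, +1))
        while reference - log_value >= threshold_off:
            reference -= threshold_off
            events.append((i, -1))
    return events


static_pixel = [1.0, 1.0, 1.0]
brightening_pixel = [1.0, math.exp(0.5)]
print(generate_events(static_pixel))       # no change, so no events
print(generate_events(brightening_pixel))  # two ON events at sample 1
```

Lowering the thresholds in this toy model yields more events for the same brightness change, which hints at why the same scene must be recorded under each setting to compare configurations fairly.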

[46] Event-Based Eye Tracking. 2025 Event-based Vision Workshop

Qinyu Chen,Chang Gao,Min Liu,Daniele Perrone,Yan Ru Pei,Zuowen Wang,Zhuo Zou,Shihang Tan,Tao Han,Guorui Lu,Zhen Xu,Junyuan Ding,Ziteng Wang,Zongwei Wu,Han Han,Yuliang Wu,Jinze Chen,Wei Zhai,Yang Cao,Zheng-jun Zha,Nuwan Bandara,Thivya Kandappu,Archan Misra,Xiaopeng Lin,Hongxiang Huang,Hongwei Ren,Bojun Cheng,Hoang M. Truong,Vinh-Thuan Ly,Huy G. Tran,Thuan-Phat Nguyen,Tram T. Doan

Main category: cs.CV

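TL;DR: This paper reviews the 2025 Event-Based Eye Tracking Challenge, summarizing the innovative methods of the top-ranked teams and discussing the hardware design perspective.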

Details Motivation: Advance event-based eye tracking research by summarizing the strongest methods from the challenge. Method: Reviews the methods of the top-ranked teams and reports their accuracy, model size, and number of operations. Result: Summarizes the performance metrics of each team's method and discusses the influence of hardware design. Conclusion: The challenge provides an important reference for event-based eye tracking research; future work should combine hardware optimization to improve performance further. Abstract: This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing eye movement recorded by event cameras. We review and summarize the innovative methods from the top-ranked teams in the challenge to advance future event-based eye tracking research. For each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.

[47] SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology

Elena Plekhanova,Damien Robert,Johannes Dollinger,Emilia Arens,Philipp Brun,Jan Dirk Wegner,Niklaus Zimmermann

Main category: cs.CV

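TL;DR: The paper proposes a phenology-informed sampling strategy and the SSL4Eco dataset, improving representation quality for global ecological tasks through self-supervised learning and reaching state-of-the-art performance on multiple downstream tasks.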

Details Motivation: As the biodiversity and climate crises worsen, macroecological research such as global biodiversity mapping becomes more urgent. However, existing remote sensing data is regionally biased and handles seasonality poorly, which limits model generalization. Method: Proposes a simple phenology-informed sampling strategy and builds SSL4Eco, a multi-date Sentinel-2 dataset, on which an existing model is trained with a season-contrastive objective. Result: The SSL4Eco-pretrained model reaches state-of-the-art performance on 7 of 8 downstream tasks, significantly improving representation quality. Conclusion: The study shows the importance of dataset construction and releases code, data, and model weights to support macroecological and computer vision research. Abstract: With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce the corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state-of-the-art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.

[48] Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator

Minjae Kang,Martim Brandão

Main category: cs.CV

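TL;DR: The paper proposes an Audio-Visual Generation and Separation model (AV-GAS) for generating images from mixed audio, addressing the inability of existing methods to handle multi-class audio.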

Details Motivation: Existing audio-visual generative models handle only single-class audio and cannot cope with mixed audio. Method: Proposes the AV-GAS model, which uses an audio-visual separator to generate images for multi-class audio, and introduces new evaluation metrics, CRS and a modified R@K. Result: On the VGGSound dataset, the model outperforms the state of the art, with 7% higher CRS and 4% higher R@2*. Conclusion: AV-GAS addresses the challenge of generating images from mixed audio and introduces a new task and new evaluation metrics. Abstract: Recent audio-visual generative models have made substantial progress in generating images from audio. However, existing approaches focus on generating images from single-class audio and fail to generate images from mixed audio. To address this, we propose an Audio-Visual Generation and Separation model (AV-GAS) for generating images from soundscapes (mixed audio containing multiple classes). Our contribution is threefold: First, we propose a new challenge in the audio-visual generation task, which is to generate an image given a multi-class audio input, and we propose a method that solves this task using an audio-visual separator. Second, we introduce a new audio-visual separation task, which involves generating separate images for each class present in a mixed audio input. Lastly, we propose new evaluation metrics for the audio-visual generation task: Class Representation Score (CRS) and a modified R@K. Our model is trained and evaluated on the VGGSound dataset. We show that our method outperforms the state-of-the-art, achieving 7% higher CRS and 4% higher R@2* in generating plausible images with mixed audio.

[49] Enhancing Long-Term Re-Identification Robustness Using Synthetic Data: A Comparative Analysis

Christian Pionzewski,Rebecca Rademacher,Jérôme Rutinowski,Antonia Ponikarov,Stephan Matzke,Tim Chilla,Pia Schreynemackers,Alice Kirchheim

Main category: cs.CV

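TL;DR: The paper explores the impact of synthetic training data and of predicting material wear and aging on re-identification, improves performance through experiments with a dynamically updated gallery, and releases a new open-source dataset.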

Details Motivation: Study the impact of synthetic training data on re-identification of aging materials and explore dynamic gallery update strategies to improve performance. Method: Tests different experimental setups and gallery expansion strategies, using synthetic training data and a dynamically updated gallery. Result: A dynamically updated gallery raises Rank-1 accuracy by 24%, and synthetic training data raises it by up to 13%. Conclusion: Synthetic data and the dynamic gallery strategy significantly improve performance, and the new dataset provides a resource for further research. Abstract: This contribution explores the impact of synthetic training data usage and the prediction of material wear and aging in the context of re-identification. Different experimental setups and gallery set expanding strategies are tested, analyzing their impact on performance over time for aging re-identification subjects. Using a continuously updating gallery, we were able to increase our mean Rank-1 accuracy by 24%, as material aging was taken into account step by step. In addition, using models trained with 10% artificial training data, Rank-1 accuracy could be increased by up to 13%, in comparison to a model trained on only real-world data, significantly boosting generalized performance on hold-out data. Finally, this work introduces a novel, open-source re-identification dataset, pallet-block-2696. This dataset contains 2,696 images of Euro pallets, taken over a period of 4 months. During this time, natural aging processes occurred and some of the pallets were damaged during their usage. These wear and tear processes significantly changed the appearance of the pallets, providing a dataset that can be used to generate synthetically aged pallets or other wooden materials.
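
The effect of a continuously updating gallery on Rank-1 accuracy can be illustrated with a toy nearest-neighbour example. The sketch below is an assumption-laden illustration, not the authors' protocol: features are 2D points, matching uses squared Euclidean distance, and each query is added to the gallery under its true identity after it has been scored (the abstract does not specify how the gallery is updated).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank1_accuracy(queries, gallery, update_gallery=False):
    """Fraction of queries whose nearest gallery entry has the same identity.

    queries, gallery -- lists of (feature_vector, identity), queries in
                        temporal order
    update_gallery   -- if True, append each query to the gallery after it
                        has been scored (hypothetical update rule)
    """
    gallery = list(gallery)  # do not modify the caller's list
    hits = 0
    for feat, identity in queries:
        _, nearest_id = min(gallery, key=lambda entry: sq_dist(feat, entry[0]))
        hits += int(nearest_id == identity)
        if update_gallery:
            gallery.append((feat, identity))
    return hits / len(queries)


# Pallet "A" drifts in feature space as it ages; pallet "B" stays put.
gallery = [((0.0, 0.0), "A"), ((10.0, 0.0), "B")]
queries = [((3.0, 0.0), "A"), ((6.0, 0.0), "A")]
print(rank1_accuracy(queries, gallery))                       # static gallery
print(rank1_accuracy(queries, gallery, update_gallery=True))  # updated gallery
```

In this toy case the second, more strongly aged observation of "A" is closer to "B" than to the original image of "A", so the static gallery misses it, while the updated gallery matches it through the intermediate observation. This mirrors the step-by-step handling of aging described in the abstract.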

[50] Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy

Zhengru Fang,Zhenghao Liu,Jingjing Wang,Senkang Hu,Yu Guo,Yiqin Deng,Yuguang Fang

Main category: cs.CV

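TL;DR: Proposes a task-oriented communication framework built around the O-VIB encoder for efficient and precise UAV localization in GPS-denied environments.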

Details Motivation: Solve UAV localization in GPS-denied urban environments while coping with bandwidth, memory, and processing constraints. Method: Uses a multi-camera system to extract compact multi-view features and the O-VIB encoder to prune non-informative features and reduce redundancy. Result: O-VIB achieves high-precision localization under strict bandwidth limits. Conclusion: The O-VIB framework offers an efficient solution for UAV localization in the low altitude economy. Abstract: To support the Low Altitude Economy (LAE), precise unmanned aerial vehicle (UAV) localization is required in urban areas where global positioning system (GPS) signals are unavailable. Vision-based methods offer a viable alternative but face severe bandwidth, memory and processing constraints on lightweight UAVs. Inspired by mammalian spatial cognition, we propose a task-oriented communication framework, where UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers. We introduce the Orthogonally-constrained Variational Information Bottleneck encoder (O-VIB), which incorporates automatic relevance determination (ARD) to prune non-informative features while enforcing orthogonality to minimize redundancy. This enables efficient and accurate localization with minimal transmission cost. Extensive evaluation on a dedicated LAE UAV dataset shows that O-VIB achieves high-precision localization under stringent bandwidth budgets. Code and dataset will be made publicly available: github.com/fangzr/TOC-Edge-Aerial.
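
The abstract states that O-VIB enforces orthogonality to minimize redundancy but does not give the form of the constraint. One common way to express such a constraint is a soft penalty, the squared Frobenius norm of W·Wᵀ − I over the rows of a projection matrix. The sketch below shows that generic penalty as an assumption; it is not taken from the O-VIB code.

```python
def orthogonality_penalty(w):
    """Squared Frobenius norm of (W @ W^T - I) for a matrix given as rows.

    The penalty is zero exactly when the rows are orthonormal, and it grows
    when rows overlap (carry redundant information) or are not unit length.
    """
    n = len(w)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(w[i], w[j]))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total


orthonormal_rows = [[1.0, 0.0], [0.0, 1.0]]
duplicated_rows = [[1.0, 0.0], [1.0, 0.0]]
print(orthogonality_penalty(orthonormal_rows))  # no redundancy
print(orthogonality_penalty(duplicated_rows))   # two identical feature directions
```

Added to a training loss with a small weight, a term like this pushes the transmitted feature dimensions toward carrying non-overlapping information, which fits the stated goal of reducing transmission cost.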

[51] STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting

Yunze Deng,Haijun Xiong,Bin Feng,Xinggang Wang,Wenyu Liu

Main category: cs.CV

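TL;DR: STP4D is a novel text-to-4D generation method that integrates spatio-temporal-prompt consistency modeling to address the temporal inconsistency and geometric distortion of existing methods.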

Details Motivation: Existing text-to-4D generation methods lack a unified framework for spatio-temporal modeling and prompt alignment, which leads to low-quality content or content that deviates from the text. Method: STP4D uses three modules, Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, and is among the first methods to use a diffusion model to generate 4D Gaussians. Result: Experiments show that STP4D generates high-quality 4D content efficiently (approximately 4.6 s per asset), surpassing existing methods in both quality and speed. Conclusion: STP4D achieves high-quality text-to-4D generation within a unified framework and offers a new direction for the field. Abstract: Text-to-4D generation is rapidly developing and widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided texts. Therefore, we propose STP4D, a novel approach that aims to integrate comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, which collaborate to accomplish this goal. Furthermore, STP4D is among the first methods to exploit the Diffusion model to generate 4D Gaussians, combining the fine-grained modeling capabilities and the real-time rendering process of 4DGS with the rapid inference speed of the Diffusion model. Extensive experiments demonstrate that STP4D excels in generating high-fidelity 4D content with exceptional efficiency (approximately 4.6s per asset), surpassing existing methods in both quality and speed.

[52] Depth3DLane: Monocular 3D Lane Detection via Depth Prior Distillation

Dongxin Lyu,Han Huang,Cheng Tan,Zimu Li

Main category: cs.CV

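TL;DR: Proposes a BEV-based framework that improves monocular 3D lane detection through multi-scale depth features and depth prior distillation, addressing the limitations of IPM-based methods.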

Details Motivation: Monocular 3D lane detection is challenging because depth information is hard to obtain, and traditional IPM methods reconstruct 3D information inaccurately because of the flat-ground assumption and the loss of contextual information. Method: Combines a Hierarchical Depth-Aware Head (providing multi-scale depth features), Depth Prior Distillation (transferring semantic depth knowledge from a teacher model), and a Conditional Random Field module (enhancing lane continuity). Result: Experiments show that the method outperforms existing methods in z-axis error and overall performance. Conclusion: The proposed framework significantly improves the accuracy of 3D lane detection, and the code has been released. Abstract: Monocular 3D lane detection is challenging due to the difficulty in capturing depth information from single-camera images. A common strategy involves transforming front-view (FV) images into bird's-eye-view (BEV) space through inverse perspective mapping (IPM), facilitating lane detection using BEV features. However, IPM's flat-ground assumption and loss of contextual information lead to inaccuracies in reconstructing 3D information, especially height. In this paper, we introduce a BEV-based framework to address these limitations and improve 3D lane detection accuracy. Our approach incorporates a Hierarchical Depth-Aware Head that provides multi-scale depth features, mitigating the flat-ground assumption by enhancing spatial awareness across varying depths. Additionally, we leverage Depth Prior Distillation to transfer semantic depth knowledge from a teacher model, capturing richer structural and contextual information for complex lane structures. To further refine lane continuity and ensure smooth lane reconstruction, we introduce a Conditional Random Field module that enforces spatial coherence in lane predictions. Extensive experiments validate that our method achieves state-of-the-art performance in terms of z-axis error and outperforms other methods in the field in overall performance. The code is released at: https://anonymous.4open.science/r/Depth3DLane-DCDD.

[53] SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations

Shuting Zhao,Linxin Bai,Liangjing Shao,Ye Zhang,Xinrong Chen

Main category: cs.CV

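TL;DR: SSD-Poser is a lightweight, efficient model for real-time full-body pose estimation from sparse observations; it combines a hybrid encoder with a frequency-aware decoder and markedly improves motion smoothness and computational efficiency.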

Details Motivation: In AR/VR applications, HMDs provide only head and hand signals, so full-body pose estimation remains challenging, and existing methods struggle to balance precise reconstruction with real-time speed. Method: Designs the SSD-Poser model, which uses State Space Attention Encoders and a Frequency-Aware Decoder to improve pose reconstruction and motion smoothness. Result: On the AMASS dataset, SSD-Poser shows strong accuracy and computational efficiency, outperforming existing methods. Conclusion: SSD-Poser addresses the challenge of real-time full-body pose estimation and offers an efficient solution for AR/VR applications. Abstract: The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.

[54] TSCL:Multi-party loss Balancing scheme for deep learning Image steganography based on Curriculum learning

Fengchun Liu,Tong Zhang,Chunying Zhang

Main category: cs.CV

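TL;DR: Proposes a Two-stage Curriculum Learning loss scheduler (TSCL) that balances the multiple losses in deep learning image steganography algorithms, improving steganographic quality, decoding accuracy, and security.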

Details Motivation: Fixed loss weights, as used in conventional methods, cannot adapt to task importance or to the training process, so the weights need to be adjusted dynamically. Method: TSCL has two stages, a priori curriculum control and loss dynamics control, which respectively adjust the loss weights and evaluate the learning speed of each task. Result: Experiments on the ALASKA2, VOC2012, and ImageNet datasets confirm improvements in steganographic quality, decoding accuracy, and security. Conclusion: By adjusting loss weights dynamically, TSCL significantly improves the performance of deep learning image steganography algorithms. Abstract: For deep learning-based image steganography frameworks, in order to ensure the invisibility and recoverability of the information embedding, the loss function usually contains several losses such as embedding loss, recovery loss and steganalysis loss. In previous research works, fixed loss weights are usually chosen for training optimization, and this setting is not linked to the importance of the steganography task itself and the training process. In this paper, we propose a Two-stage Curriculum Learning loss scheduler (TSCL) for balancing multinomial losses in deep learning image steganography algorithms. TSCL consists of two phases: a priori curriculum control and loss dynamics control. By controlling the loss weights in the multi-party adversarial training, the first phase first focuses the model on learning the information embedding of the original image, then shifts its learning focus to improving the decoding accuracy, and finally makes the model learn to generate a steganographic image that is resistant to steganalysis. In the second phase, the learning speed of each training task is evaluated by calculating the loss drop between consecutive iteration rounds, in order to balance the learning of each task. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed TSCL strategy improves the quality of steganography, decoding accuracy and security.
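The abstract does not give the formula used in the loss dynamics control phase. As a minimal sketch of the general idea (reweighting tasks by how much their loss dropped between two rounds), the following uses a softmax over loss ratios. The function name `loss_drop_weights`, the `temperature` parameter, and the softmax rule itself are illustrative assumptions, not the TSCL scheduler.

```python
import math


def loss_drop_weights(prev_losses, curr_losses, temperature=1.0):
    """Return one weight per task from the loss ratio between two rounds.

    Tasks whose loss dropped less (slower learners) receive larger weights.
    The softmax rule is an illustrative assumption, not the TSCL formula.
    """
    # A ratio near 1.0 (or above) means the task has barely improved.
    ratios = [curr / prev for prev, curr in zip(prev_losses, curr_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    n = len(ratios)
    # Scale so that the weights sum to the number of tasks.
    return [n * e / total for e in exps]


# Hypothetical embedding, recovery, and steganalysis losses over two rounds.
weights = loss_drop_weights([1.0, 1.0, 1.0], [0.5, 0.9, 1.0])
```

In this toy call the first task improved the most and therefore receives the smallest weight, while the stalled third task receives the largest.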

[55] Revisiting Data Auditing in Large Vision-Language Models

Hongyu Zhu,Sichu Liang,Wenwen Wang,Boheng Li,Tongxin Yuan,Fangqi Li,ShiLin Wang,Zhuosheng Zhang

Main category: cs.CV

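TL;DR: The paper shows that current membership inference (MI) benchmarks suffer from distribution shift, proposes a metric based on optimal transport, and builds new unbiased benchmarks. It finds that existing MI methods perform poorly under realistic conditions, and it examines the theoretical upper bound of MI and the auditing scenarios that remain practical.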

Details Motivation: As large vision-language models (VLMs) develop, the need for data auditing grows more urgent, but distribution shift inflates the reported performance of existing MI methods, so a more realistic evaluation is needed. Method: Analyzes the nature of the distribution shift, proposes a metric based on optimal transport, builds unbiased benchmarks, and explores the theoretical upper bound of MI and the scenarios in which it is practical. Result: Under unbiased conditions, existing MI methods perform close to random guessing; the theoretical analysis shows a high irreducible error rate, but auditing remains feasible with fine-tuning, access to ground-truth texts, and set-based inference. Conclusion: The study systematically reveals the limits of MI for VLMs and points out possible directions for trustworthy data auditing. Abstract: With the surge of large language models (LLMs), Large Vision-Language Models (VLMs)--which integrate vision encoders with LLMs for accurate visual grounding--have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios--fine-tuning, access to ground-truth texts, and set-based inference--where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
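The AUC figures quoted for membership inference can be read as the probability that a randomly chosen member receives a higher score than a randomly chosen non-member. The sketch below computes that quantity directly; the function name `auc` and the toy scores are illustrative assumptions and are not taken from the paper's benchmarks.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy scores: a perfectly separable case and an uninformative one.
separable = auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
chance = auc([0.5, 0.5], [0.5, 0.5])
```

An attack that performs "only marginally better than chance" corresponds to a value close to 0.5 on this scale.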

[56] Interpretable Affordance Detection on 3D Point Clouds with Probabilistic Prototypes

Maximilian Xiling Li,Korbinian Rudolf,Nils Blank,Rudolf Lioutikov

Main category: cs.CV

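TL;DR: The paper proposes a prototype-learning approach to affordance detection on 3D point clouds that remains interpretable while keeping performance high.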

Details Motivation: Conventional deep learning models (such as PointNet++ and DGCNN) act as black boxes in affordance detection on 3D point clouds and offer no transparency about their decisions. Prototypical learning (such as ProtoPNet) provides an interpretable alternative, but it has so far been applied mainly to image tasks. Method: Applies prototypical learning to models for affordance detection on 3D point clouds. Result: On the 3D-AffordanceNet benchmark dataset, prototypical models perform on par with state-of-the-art black-box models while being inherently interpretable. Conclusion: Their performance and interpretability make prototypical models a promising candidate for human-robot interaction scenarios that require more trust and safety. Abstract: Robotic agents need to understand how to interact with objects in their environment, both autonomously and during human-robot interactions. Affordance detection on 3D point clouds, which identifies object regions that allow specific interactions, has traditionally relied on deep learning models like PointNet++, DGCNN, or PointTransformerV3. However, these models operate as black boxes, offering no insight into their decision-making processes. Prototypical Learning methods, such as ProtoPNet, provide an interpretable alternative to black-box models by employing a "this looks like that" case-based reasoning approach. However, they have been primarily applied to image-based tasks. In this work, we apply prototypical learning to models for affordance detection on 3D point clouds. Experiments on the 3D-AffordanceNet benchmark dataset show that prototypical models achieve competitive performance with state-of-the-art black-box models and offer inherent interpretability. This makes prototypical models a promising candidate for human-robot interaction scenarios that require increased trust and safety.

[57] COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Haozhen Yan,Yan Hong,Jiahui Zhan,Yikun Ji,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang

Main category: cs.CV

TL;DR: The paper presents the COCOInpaint benchmark, which focuses on detecting inpainting-based image manipulation and fills a gap left by existing IMDL methods.

Details Motivation: Existing image manipulation detection methods focus mainly on splicing or copy-move forgeries and lack a dedicated benchmark for inpainting-based manipulation. Method: Builds the COCOInpaint benchmark, which offers high-quality inpainting samples, diverse generation scenarios, and large-scale coverage. Result: Provides 258,266 inpainted images, emphasizes the intrinsic inconsistencies between inpainted and authentic regions, and establishes a rigorous evaluation protocol. Conclusion: The COCOInpaint benchmark will be released publicly to support future research on inpainting detection. Abstract: Recent advancements in image manipulation have achieved unprecedented progress in generating photorealistic content, but have simultaneously eliminated barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries, lacking dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCOInpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.

[58] Fast Autoregressive Models for Continuous Latent Generation

Tiankai Hang,Jianmin Bao,Fangyun Wei,Dong Chen

Main category: cs.CV

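TL;DR: FAR replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient continuous-space image generation with 2.3x faster inference while maintaining generation quality.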

Details Motivation: Autoregressive models are computationally expensive for continuous-domain image generation; MAR removes the need for quantization but is slow at inference. Method: Proposes FAR, which replaces MAR's diffusion head with a lightweight shortcut head, supports efficient few-step sampling, and integrates seamlessly with causal Transformers. Result: FAR runs inference 2.3x faster than MAR while keeping competitive FID and IS scores. Conclusion: FAR establishes an efficient autoregressive approach to continuous-space image generation that balances quality and scalability. Abstract: Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.

[59] Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao,Beier Zhu,Qianru Sun,Hanwang Zhang

Main category: cs.CV

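TL;DR: UV-CoT is an unsupervised visual chain-of-thought framework that achieves image-level reasoning through preference optimization, needs no bounding-box annotations, and markedly improves the visual understanding of multimodal large language models.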

Details Motivation: Existing methods focus mainly on textual chain-of-thought (CoT) and neglect visual cues; the existing visual CoT work relies on large amounts of labeled data and generalizes poorly to unseen cases. Method: Proposes the UV-CoT framework, which compares preferences between model-generated bounding boxes, using an automatic data generation pipeline and the rankings of an evaluator model to train the target model. Result: Outperforms existing textual and visual CoT methods on six datasets, and zero-shot testing on four unseen datasets shows strong generalization. Conclusion: By emulating human perception (identifying key regions and reasoning from them), UV-CoT improves visual comprehension, particularly in spatial reasoning tasks. Abstract: Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception--identifying key regions and reasoning based on them--UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available at https://github.com/kesenzhao/UV-CoT.

[60] A Multimodal Hybrid Late-Cascade Fusion Network for Enhanced 3D Object Detection

Carlo Sgaravatti,Roberto Basla,Riccardo Pieroni,Matteo Corno,Sergio M. Savaresi,Luca Magri,Giacomo Boracchi

Main category: cs.CV

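TL;DR: Proposes a hybrid late-cascade scheme based on LiDAR and RGB cameras for 3D object detection from multimodal inputs; late fusion reduces LiDAR false positives and cascade fusion recovers LiDAR false negatives.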

Details Motivation: To address false positives and false negatives in LiDAR detection and improve 3D object detection from multimodal inputs. Method: Uses a hybrid late-cascade scheme that combines an RGB detection network with a 3D LiDAR detector, reducing false positives by matching projected LiDAR boxes with RGB detections and recovering false negatives with epipolar constraints and frustums. Result: Performs well on the KITTI benchmark, with notable gains for Pedestrians and Cyclists. Conclusion: The method is flexible and efficient, works on top of existing single-modal detectors, and significantly improves multimodal detection performance. Abstract: We present a new way to detect 3D objects from multimodal inputs, leveraging both LiDAR and RGB cameras in a hybrid late-cascade scheme, that combines an RGB detection network and a 3D LiDAR detector. We exploit late fusion principles to reduce LiDAR False Positives, matching LiDAR detections with RGB ones by projecting the LiDAR bounding boxes on the image. We rely on cascade fusion principles to recover LiDAR False Negatives leveraging epipolar constraints and frustums generated by RGB detections of separate views. Our solution can be plugged on top of any underlying single-modal detectors, enabling a flexible training process that can take advantage of pre-trained LiDAR and RGB detectors, or train the two branches separately. We evaluate our results on the KITTI object detection benchmark, showing significant performance improvements, especially for the detection of Pedestrians and Cyclists.
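The late-fusion step described in the abstract (projecting LiDAR boxes onto the image and matching them with RGB detections) can be sketched as below. This is a minimal illustration under stated assumptions: the pinhole matrix `P`, the cube corners, the example RGB boxes, and the 0.5 IoU threshold are all hypothetical, and the abstract does not specify the authors' actual matching rule. All points are assumed to lie in front of the camera.

```python
def project_point(P, xyz):
    """Project a camera-frame 3D point with a 3x4 matrix P; returns pixel (u, v)."""
    x, y, z = xyz
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    return h[0] / h[2], h[1] / h[2]


def projected_box(P, corners):
    """Axis-aligned 2D box (u_min, v_min, u_max, v_max) around the projected corners."""
    pixels = [project_point(P, c) for c in corners]
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return min(us), min(vs), max(us), max(vs)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical pinhole camera: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0], [0.0, 100.0, 50.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
# Corners of a 2 m cube centred 10 m in front of the camera.
corners = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (9.0, 11.0)]
lidar_2d = projected_box(P, corners)
rgb_hit = (38.0, 38.0, 62.0, 62.0)
rgb_miss = (80.0, 80.0, 95.0, 95.0)
keep = iou(lidar_2d, rgb_hit) >= 0.5   # supported by an RGB detection
drop = iou(lidar_2d, rgb_miss) >= 0.5  # no support: candidate false positive
```

A LiDAR detection with no sufficiently overlapping RGB box would be treated as a likely false positive in this sketch; the cascade step that recovers false negatives is not shown.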

[61] LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

Rui Li,Biao Zhang,Zhenyu Li,Federico Tombari,Peter Wonka

Main category: cs.CV

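TL;DR: LaRI is a new method for reasoning about unseen geometry from a single image; it uses layered point maps to model the multiple surfaces intersected by camera rays, enabling efficient and complete geometric reasoning.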

Details Motivation: Conventional depth estimation is limited to the visible surface and cannot handle multiple surfaces or occluded geometry. LaRI aims to solve this and to unify object-level and scene-level tasks. Method: LaRI represents multiple surfaces with layered point maps and predicts a ray stopping index to identify valid intersecting pixels and layers. It also builds a training data generation pipeline for synthetic and real-world data. Result: On object-level tasks, LaRI reaches comparable performance using only 4% of the training data and 17% of the parameters; on scene-level tasks, it reasons about occluded geometry in a single forward pass. Conclusion: LaRI is an efficient, general method for geometric reasoning that applies to both object-level and scene-level tasks. Abstract: We present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation that is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning to unify object- and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. We build a complete training data generation pipeline for synthetic and real-world data, including 3D objects and scenes, with necessary data cleaning steps and coordination between rendering engines. As a generic method, LaRI's performance is validated in two scenarios: It yields comparable object-level results to the recent large generative model using 4% of its training data and 17% of its parameters. Meanwhile, it achieves scene-level occluded geometry reasoning in a single feed-forward pass.

[62] Iterative Event-based Motion Segmentation by Variational Contrast Maximization

Ryo Yamaki,Shintaro Shiba,Guillermo Gallego,Yoshimitsu Aoki

Main category: cs.CV

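TL;DR: Proposes an iterative motion segmentation method for event cameras that classifies events into background and foreground, extending the Contrast Maximization framework and markedly improving the accuracy of moving object detection.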

Details Motivation: Event cameras capture changes in the scene, but the event data must be classified into different motions to achieve motion segmentation, which matters for tasks such as object detection. Method: Proposes an iterative motion segmentation method that classifies events into background (the dominant motion hypothesis) and foreground (independent motion residuals), extending the Contrast Maximization framework. Result: Experiments show that the method successfully classifies event clusters on public and self-recorded datasets, produces sharp motion-compensated edge-like images, and improves moving object detection accuracy by more than 30%. Conclusion: The method broadens the sensitivity of the Contrast Maximization framework and contributes to the theory of event-based motion segmentation. Abstract: Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e., motion segmentation), which is useful for various tasks such as object detection and visual servoing. We propose an iterative motion segmentation method, by classifying events into background (e.g., dominant motion hypothesis) and foreground (independent motion residuals), thus extending the Contrast Maximization framework. Experimental results demonstrate that the proposed method successfully classifies event clusters both for public and self-recorded datasets, producing sharp, motion-compensated edge-like images. The proposed method achieves state-of-the-art accuracy on moving object detection benchmarks with an improvement of over 30%, and demonstrates its possibility of applying to more complex and noisy real-world scenes. We hope this work broadens the sensitivity of Contrast Maximization with respect to both motion parameters and input events, thus contributing to theoretical advancements in event-based motion segmentation estimation. https://github.com/aoki-media-lab/event_based_segmentation_vcmax
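For readers unfamiliar with the Contrast Maximization framework that this work extends, the core idea can be sketched as follows: events are warped to a reference time along a candidate motion, accumulated into an image, and the image variance is used as the score. The sketch assumes a single constant 2D velocity and nearest-pixel accumulation; the function name `contrast` and the synthetic events are hypothetical, and the paper's iterative, variational formulation is not reproduced here.

```python
def contrast(events, velocity, width, height, t_ref=0.0):
    """Variance of the image of warped events for one candidate velocity.

    events: iterable of (x, y, t); velocity: (vx, vy) in pixels per time unit.
    """
    vx, vy = velocity
    image = [[0] * width for _ in range(height)]
    for x, y, t in events:
        # Warp each event back to the reference time along the candidate motion.
        xw = int(round(x - vx * (t - t_ref)))
        yw = int(round(y - vy * (t - t_ref)))
        if 0 <= xw < width and 0 <= yw < height:
            image[yw][xw] += 1
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) ** 2 for v in pixels) / len(pixels)


# Synthetic events: one point moving at 2 px per time unit along x.
events = [(2 + 2 * t, 3, float(t)) for t in range(5)]
sharp = contrast(events, (2.0, 0.0), width=16, height=8)
blurred = contrast(events, (0.0, 0.0), width=16, height=8)
```

The correct velocity stacks all events on one pixel and yields the higher variance, which is why maximizing contrast over motion parameters recovers the motion.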

[63] NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

Haotian Dong,Xin Wang,Di Lin,Yipeng Wu,Qin Chen,Ruonan Liu,Kairui Yang,Ping Li,Qing Guo

Main category: cs.CV

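TL;DR: Proposes NoiseController, which improves the spatiotemporal consistency of video generation through multi-level noise decomposition, multi-frame noise collaboration, and joint denoising.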

Details Motivation: High-quality video generation is important for fields such as the film industry and autonomous driving, but existing methods often neglect global spatiotemporal information, which leads to insufficient spatiotemporal consistency. Method: Uses multi-level noise decomposition (scene level and individual level), multi-frame noise collaboration (spatiotemporal collaboration matrices), and joint denoising (parallel U-Nets) to strengthen video consistency. Result: State-of-the-art performance is demonstrated on public datasets. Conclusion: Through its noise decomposition and collaboration mechanisms, NoiseController markedly improves the spatiotemporal consistency of video generation. Abstract: High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistencies remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during video generation. In this paper, we propose the NoiseController, consisting of Multi-Level Noise Decomposition, Multi-Frame Noise Collaboration, and Joint Denoising, to enhance spatiotemporal consistencies in video generation. In multi-level noise decomposition, we first decompose initial noises into scene-level foreground/background noises, capturing distinct motion properties to model multi-view foreground/background variations. Furthermore, each scene-level noise is further decomposed into individual-level shared and residual components. The shared noise preserves consistency, while the residual component maintains diversity. In multi-frame noise collaboration, we introduce an inter-view spatiotemporal collaboration matrix and an intra-view impact collaboration matrix, which capture mutual cross-view effects and historical cross-frame impacts to enhance video quality. The joint denoising contains two parallel denoising U-Nets to remove each scene-level noise, mutually enhancing video generation. We evaluate our NoiseController on public datasets focusing on video generation and downstream tasks, demonstrating its state-of-the-art performance.

[64] RGS-DR: Reflective Gaussian Surfels with Deferred Rendering for Shiny Objects

Georgios Kouros,Minye Wu,Tinne Tuytelaars

Main category: cs.CV

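TL;DR: RGS-DR is a novel inverse rendering method for reconstructing and rendering glossy, reflective objects, with support for flexible relighting and scene editing.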

Details Motivation: Existing methods (such as NeRF and 3D Gaussian Splatting) handle view-dependent effects poorly; RGS-DR aims to estimate geometry and surface normals accurately with a 2D Gaussian surfel representation and thereby improve inverse rendering quality. Method: RGS-DR explicitly models geometric and material properties with learnable primitives that are rasterized into a deferred shading pipeline, reducing rendering artifacts and preserving sharp reflections. It also approximates environment lighting integrals with a multi-level cube mipmap and refines appearance modeling with a residual pass based on a spherical mipmap. Result: Experiments show that RGS-DR reconstructs and renders glossy objects with high quality, often outperforming state-of-the-art reconstruction methods that cannot relight. Conclusion: Through its geometry and material modeling, RGS-DR markedly improves inverse rendering of glossy objects and supports flexible relighting and editing. Abstract: We introduce RGS-DR, a novel inverse rendering method for reconstructing and rendering glossy and reflective objects with support for flexible relighting and scene editing. Unlike existing methods (e.g., NeRF and 3D Gaussian Splatting), which struggle with view-dependent effects, RGS-DR utilizes a 2D Gaussian surfel representation to accurately estimate geometry and surface normals, an essential property for high-quality inverse rendering. Our approach explicitly models geometric and material properties through learnable primitives rasterized into a deferred shading pipeline, effectively reducing rendering artifacts and preserving sharp reflections. By employing a multi-level cube mipmap, RGS-DR accurately approximates environment lighting integrals, facilitating high-quality reconstruction and relighting. A residual pass with spherical-mipmap-based directional encoding further refines the appearance modeling. Experiments demonstrate that RGS-DR achieves high-quality reconstruction and rendering of shiny objects, often outperforming reconstruction-exclusive state-of-the-art methods incapable of relighting.

[65] An Improved ResNet50 Model for Predicting Pavement Condition Index (PCI) Directly from Pavement Images

Andrews Danyo,Anthony Dontoh,Armstrong Aboah

Main category: cs.CV

TL;DR: Proposes an improved ResNet50 model with CBAM that predicts PCI directly from pavement images and significantly reduces prediction error.

Details Motivation: To improve the accuracy of Pavement Condition Index (PCI) prediction in support of infrastructure maintenance. Method: Integrates a Convolutional Block Attention Module (CBAM) into the ResNet50 architecture so that the model autonomously prioritizes critical image features. Result: The improved ResNet50-CBAM model reaches a MAPE of 58.16%, better than the 70.76% and 65.48% of the baseline ResNet50 and DenseNet161 models. Conclusion: Attention mechanisms can improve the accuracy and efficiency of pavement condition analysis. Abstract: Accurately predicting the Pavement Condition Index (PCI), a measure of roadway conditions, from pavement images is crucial for infrastructure maintenance. This study proposes an enhanced version of the Residual Network (ResNet50) architecture, integrated with a Convolutional Block Attention Module (CBAM), to predict PCI directly from pavement images without additional annotations. By incorporating CBAM, the model autonomously prioritizes critical features within the images, improving prediction accuracy. Compared to the original baseline ResNet50 and DenseNet161 architectures, the enhanced ResNet50-CBAM model achieved a significantly lower mean absolute percentage error (MAPE) of 58.16%, compared to the baseline models that achieved 70.76% and 65.48% respectively. These results highlight the potential of using attention mechanisms to refine feature extraction, ultimately enabling more accurate and efficient assessments of pavement conditions. This study emphasizes the importance of targeted feature refinement in advancing automated pavement analysis through attention mechanisms.
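The reported metric, mean absolute percentage error, averages the absolute error relative to the true value and expresses it as a percentage. A minimal sketch follows; the function name `mape` and the sample PCI values are illustrative assumptions and do not come from the study's data.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Undefined when a true value is 0, so such samples must be handled separately.
    """
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical PCI values: errors of 20% and 25% average to 22.5%.
example = mape([50.0, 80.0], [40.0, 100.0])
```

Because each error is divided by the true value, samples with a low true PCI contribute disproportionately, which is worth keeping in mind when reading MAPE figures.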

[66] Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Shivam Duggal,Yushi Hu,Oscar Michel,Aniruddha Kembhavi,William T. Freeman,Noah A. Smith,Ranjay Krishna,Antonio Torralba,Ali Farhadi,Wei-Chiu Ma

Main category: cs.CV

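TL;DR: Eval3D is a fine-grained, interpretable evaluation tool for 3D generation that assesses the quality of 3D assets through consistency across multiple models and outperforms existing methods.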

Details Motivation: Existing 3D generation systems often fail to guarantee quality and consistency, and reliable evaluation tools are lacking. Method: Uses a diverse set of foundation models and tools as probes to measure the consistency of 3D assets across different aspects. Result: Eval3D provides pixel-wise measurements and accurate spatial feedback, and aligns more closely with human judgments. Conclusion: Eval3D exposes the limitations of current 3D generation models and points to directions for improvement. Abstract: Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.

[67] Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models

Patrick Müller,Alexander Braun,Margret Keuper

Main category: cs.CV

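TL;DR: The paper proposes two datasets (OpticsBench and LensCorruptions) for evaluating model robustness to realistic optical blur and finds that performance varies significantly across existing models.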

Details Motivation: Deep neural networks are widely used in computer vision, but their robustness to disturbances such as blur needs further study, and existing blur simulations are overly simplistic. Method: Uses Zernike polynomials to generate two blur datasets, OpticsBench and LensCorruptions, which model single-parameter aberrations and combinations corresponding to real lenses, respectively. Result: Experiments on ImageNet and MSCOCO show that the performance of different pre-trained models varies significantly on both datasets. Conclusion: Realistic blur effects should be considered in order to evaluate model robustness more accurately. Abstract: Deep neural networks (DNNs) have proven to be successful in various computer vision applications, to the point that models are even used for inference in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model's robustness against blur.
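To make the distinction between a single-parameter aberration and a linear combination concrete, the sketch below evaluates a wavefront error as a weighted sum of three normalized Zernike terms on the unit pupil. The coefficient values are hypothetical, the function names are illustrative, and the step from wavefront to blur kernel (via the Fourier transform of the pupil function) is not shown; this is not the datasets' generation code.

```python
import math


def defocus(rho, theta):
    """Normalized Zernike term with n=2, m=0."""
    return math.sqrt(3) * (2 * rho ** 2 - 1)


def astigmatism(rho, theta):
    """Normalized Zernike term with n=2, m=2 (cosine form)."""
    return math.sqrt(6) * rho ** 2 * math.cos(2 * theta)


def coma(rho, theta):
    """Normalized Zernike term with n=3, m=1 (cosine form)."""
    return math.sqrt(8) * (3 * rho ** 3 - 2 * rho) * math.cos(theta)


def wavefront(terms, rho, theta):
    """Wavefront error as a linear combination of (coefficient, polynomial) pairs."""
    return sum(c * z(rho, theta) for c, z in terms)


# Single-parameter aberration: only the defocus coefficient is non-zero.
primary = wavefront([(0.5, defocus)], rho=1.0, theta=0.0)
# Linear combination of several terms with hypothetical coefficients.
mixed = wavefront([(0.5, defocus), (0.25, coma)], rho=1.0, theta=0.0)
```

Varying one coefficient at a time corresponds to the primary-aberration setting, while sampling whole coefficient vectors corresponds to the lens-like combinations described in the abstract.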

[68] E-VLC: A Real-World Dataset for Event-based Visible Light Communication And Localization

Shintaro Shiba,Quan Kong,Norimasa Kobori

Main category: cs.CV

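TL;DR: The paper presents the first public dataset for evaluating event cameras on LED signal decoding and localization, and proposes a new localization method based on Contrast Maximization.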

Details Motivation: No public dataset currently exists for evaluating event cameras on LED signal decoding and localization, especially across varied real-world settings. Method: Proposes a new localization method that uses Contrast Maximization for motion estimation and compensation, and provides a dataset with an event camera, a frame camera, and precisely synchronized ground-truth poses. Result: Experiments show that event-based LED localization outperforms conventional frame-based AR-marker localization and that the proposed method is effective for localization. Conclusion: The dataset is expected to serve as a benchmark for evaluating event cameras on mobile devices and to broaden the applications of event cameras. Abstract: Optical communication using modulated LEDs (e.g., visible light communication) is an emerging application for event cameras, thanks to their high spatio-temporal resolutions. Event cameras can be used simply to decode the LED signals and also to localize the camera relative to the LED marker positions. However, there is no public dataset to benchmark the decoding and localization in various real-world settings. We present, to the best of our knowledge, the first public dataset that consists of an event camera, a frame camera, and ground-truth poses that are precisely synchronized with hardware triggers. It provides various camera motions with various sensitivities in different scene brightness settings, both indoor and outdoor. Furthermore, we propose a novel method of localization that leverages the Contrast Maximization framework for motion estimation and compensation. The detailed analysis and experimental results demonstrate the advantages of LED-based localization with events over the conventional AR-marker--based one with frames, as well as the efficacy of the proposed method in localization. We hope that the proposed dataset serves as a future benchmark for both motion-related classical computer vision tasks and LED marker decoding tasks simultaneously, paving the way to broadening applications of event cameras on mobile devices. https://woven-visionai.github.io/evlc-dataset

[69] Augmenting Perceptual Super-Resolution via Image Quality Predictors

Fengjia Zhang,Samrudhdhi B. Rangrej,Tristan Aumentado-Armstrong,Afsaneh Fazly,Alex Levinshtein

Main category: cs.CV

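TL;DR: The paper explores using non-reference image quality assessment (NR-IQA) models in super-resolution to improve the tradeoff between perceptual quality and distortion.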

Details Motivation: Super-resolution is inherently ill-posed, and conventional methods tend to produce blurry images. The study aims to use NR-IQA models to improve image quality so that results better match human perception. Method: Analyzes how NR-IQA metrics behave on super-resolution data and proposes two ways to apply them: altering data sampling and directly optimizing a differentiable quality score. Result: Experiments show a better balance between perceptual quality and distortion that is closer to human preference. Conclusion: NR-IQA models offer super-resolution an optimization direction that is better aligned with human perception. Abstract: Super-resolution (SR), a classical inverse problem in computer vision, is inherently ill-posed, inducing a distribution of plausible solutions for every input. However, the desired result is not simply the expectation of this distribution, which is the blurry image obtained by minimizing pixelwise error, but rather the sample with the highest image quality. A variety of techniques, from perceptual metrics to adversarial losses, are employed to this end. In this work, we explore an alternative: utilizing powerful non-reference image quality assessment (NR-IQA) models in the SR context. We begin with a comprehensive analysis of NR-IQA metrics on human-derived SR data, identifying both the accuracy (human alignment) and complementarity of different metrics. Then, we explore two methods of applying NR-IQA models to SR learning: (i) altering data sampling, by building on an existing multi-ground-truth SR framework, and (ii) directly optimizing a differentiable quality score. Our results demonstrate a more human-centric perception-distortion tradeoff, focusing less on non-perceptual pixel-wise distortion, instead improving the balance between perceptual fidelity and human-tuned NR-IQA measures.

cs.GR [Back]

[70] iVR-GS: Inverse Volume Rendering for Explorable Visualization via Editable 3D Gaussian Splatting

Kaiyuan Tang,Siyuan Yao,Chaoli Wang

Main category: cs.GR

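TL;DR: iVR-GS is a new inverse volume rendering method that uses Gaussian splatting to reduce rendering cost while supporting scene editing, enabling interactive volume exploration.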

Details Motivation: Existing NVS methods do well on rendering speed and hardware requirements, but preset TF settings limit how users can explore a scene. iVR-GS aims to remove this limitation. Method: Composes multiple iVR-GS models associated with basic TFs that cover different visible parts of the scene; each model contains editable 3D Gaussians that support real-time rendering and editing. Result: On several volume datasets, iVR-GS shows better reconstruction quality and composability than other NVS solutions (Plenoxels, CCNeRF, and 3DGS). Conclusion: iVR-GS provides an efficient and flexible solution for interactive volume exploration. Abstract: In volume visualization, users can interactively explore the three-dimensional data by specifying color and opacity mappings in the transfer function (TF) or adjusting lighting parameters, facilitating meaningful interpretation of the underlying structure. However, rendering large-scale volumes demands powerful GPUs and high-speed memory access for real-time performance. While existing novel view synthesis (NVS) methods offer faster rendering speeds with lower hardware requirements, the visible parts of a reconstructed scene are fixed and constrained by preset TF settings, significantly limiting user exploration. This paper introduces inverse volume rendering via Gaussian splatting (iVR-GS), an innovative NVS method that reduces the rendering cost while enabling scene editing for interactive volume exploration. Specifically, we compose multiple iVR-GS models associated with basic TFs covering disjoint visible parts to make the entire volumetric scene visible. Each basic model contains a collection of 3D editable Gaussians, where each Gaussian is a 3D spatial point that supports real-time scene rendering and editing. We demonstrate the superior reconstruction quality and composability of iVR-GS against other NVS solutions (Plenoxels, CCNeRF, and base 3DGS) on various volume datasets. The code is available at https://github.com/TouKaienn/iVR-GS.

[71] From Cluster to Desktop: A Cache-Accelerated INR framework for Interactive Visualization of Tera-Scale Data

Daniel Zavorotny,Qi Wu,David Bauer,Kwan-Liu Ma

Main category: cs.GR

TL;DR: The paper proposes an INR rendering framework built on a GPU cache that substantially speeds up rendering, making interactive visualization of large-scale scientific data feasible.

Details Motivation: INR rendering is currently slow, which limits interactive visualization on consumer-grade hardware. Method: A scalable, multi-resolution GPU cache reduces redundant data queries and prioritizes novel regions. Result: An average 5x speedup over the state of the art while maintaining high visualization quality. Conclusion: Combined with existing hardware-accelerated INR compressors, the framework supports interactive exploration of large-scale datasets on consumer-grade hardware. Abstract: Machine learning has enabled the use of implicit neural representations (INRs) to efficiently compress and reconstruct massive scientific datasets. However, despite advances in fast INR rendering algorithms, INR-based rendering remains computationally expensive, as computing data values from an INR is significantly slower than reading them from GPU memory. This bottleneck currently restricts interactive INR visualization to professional workstations. To address this challenge, we introduce an INR rendering framework accelerated by a scalable, multi-resolution GPU cache capable of efficiently representing tera-scale datasets. By minimizing redundant data queries and prioritizing novel volume regions, our method reduces the number of INR computations per frame, achieving an average 5x speedup over the state-of-the-art INR rendering method while still maintaining high visualization quality. Coupled with existing hardware-accelerated INR compressors, our framework enables scientists to generate and compress massive datasets in situ on high-performance computing platforms and then interactively explore them on consumer-grade hardware post hoc.
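The caching idea in the abstract can be illustrated with a small CPU-side toy. This is only a sketch under stated assumptions, not the paper's GPU data structure: the `BrickCache` class, the `(level, brick)` key, the least-recently-used eviction policy, and the `evaluate` callback standing in for INR inference are all hypothetical names and choices introduced here.

```python
from collections import OrderedDict


class BrickCache:
    """Toy LRU cache keyed by (resolution level, brick index).

    Assumption: `evaluate` is a hypothetical stand-in for the expensive
    INR inference that produces the values of one brick.
    """

    def __init__(self, capacity, evaluate):
        self.capacity = capacity
        self.evaluate = evaluate
        self.store = OrderedDict()
        self.misses = 0  # number of times the INR had to be evaluated

    def get(self, level, brick):
        key = (level, brick)
        if key in self.store:
            # Cache hit: mark the brick as recently used, skip the INR.
            self.store.move_to_end(key)
            return self.store[key]
        # Cache miss: a novel region, so pay for one INR evaluation.
        self.misses += 1
        value = self.evaluate(level, brick)
        self.store[key] = value
        if len(self.store) > self.capacity:
            # Evict the least recently used brick.
            self.store.popitem(last=False)
        return value


cache = BrickCache(capacity=2, evaluate=lambda level, brick: (level, brick))
cache.get(0, (0, 0, 0))
cache.get(0, (0, 0, 0))  # served from the cache, no second evaluation
```

Repeated queries to the same brick cost one evaluation in total, which is the redundancy the framework is described as removing; how the real system sizes, indexes, and prioritizes its cache is not specified in the abstract.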

cs.CL [Back]

[72] Optimism, Expectation, or Sarcasm? Multi-Class Hope Speech Detection in Spanish and English

Sabur Butt,Fazlourrahman Balouchzahi,Ahmad Imam Amjad,Maaz Amjad,Hector G. Ceballos,Salud Maria Jimenez-Zafra

Main category: cs.CL

TL;DR: PolyHope V2 is a multilingual, fine-grained hope speech dataset of more than 30,000 annotated tweets, intended to improve how well NLP systems recognize complex expressions of hope.

Details Motivation: Hope is a complex and underexplored emotional state with significant effects on education, mental health, and social interaction, yet detecting it remains challenging for NLP. Method: The study introduces the PolyHope V2 dataset, annotated with four hope subtypes (Generalized, Realistic, Unrealistic, Sarcastic), and compares pretrained transformer models with large language models such as GPT-4 and Llama 3. Result: Fine-tuned transformers outperform prompt-based LLMs at distinguishing nuanced hope categories and sarcasm. Conclusion: The dataset and results provide a solid foundation for future emotion recognition tasks that demand greater semantic and contextual sensitivity. Abstract: Hope is a complex and underexplored emotional state that plays a significant role in education, mental health, and social interaction. Unlike basic emotions, hope manifests in nuanced forms ranging from grounded optimism to exaggerated wishfulness or sarcasm, making it difficult for Natural Language Processing systems to detect accurately. This study introduces PolyHope V2, a multilingual, fine-grained hope speech dataset comprising over 30,000 annotated tweets in English and Spanish. This resource distinguishes between four hope subtypes (Generalized, Realistic, Unrealistic, and Sarcastic) and enhances existing datasets by explicitly labeling sarcastic instances. We benchmark multiple pretrained transformer models and compare them with large language models (LLMs) such as GPT-4 and Llama 3 under zero-shot and few-shot regimes. Our findings show that fine-tuned transformers outperform prompt-based LLMs, especially in distinguishing nuanced hope categories and sarcasm. Through qualitative analysis and confusion matrices, we highlight systematic challenges in separating closely related hope subtypes. The dataset and results provide a robust foundation for future emotion recognition tasks that demand greater semantic and contextual sensitivity across languages.

[73] Improving LLM Personas via Rationalization with Psychological Scaffolds

Brihi Joshi,Xiang Ren,Swabha Swayamdipta,Rik Koncel-Kedziorski,Tim Paek

Main category: cs.CL

TL;DR: The PB&J framework improves LLM personas by adding rationales for user behavior generated with the help of psychological theories, outperforming methods based only on demographics or judgments.

Details Motivation: Existing methods rely only on user demographics or prior judgments and fail to capture the reasons behind user behavior. Method: PB&J uses psychological theories (such as the Big 5 Personality Traits and Primal World Beliefs) to generate rationales for user behavior that augment LLM personas. Result: Experiments show that PB&J outperforms conventional methods on public opinion and movie preference prediction, and performs comparably to human-written rationales. Conclusion: Rationales grounded in psychological theory improve the accuracy and explainability of LLM personas. Abstract: Language models prompted with a user description or persona can predict a user's preferences and opinions, but existing approaches to building personas -- based solely on a user's demographic attributes and/or prior judgments -- fail to capture the underlying reasoning behind said user judgments. We introduce PB&J (Psychology of Behavior and Judgments), a framework that improves LLM personas by incorporating rationales of why a user might make specific judgments. These rationales are LLM-generated, and aim to reason about a user's behavior on the basis of their experiences, personality traits or beliefs. This is done using psychological scaffolds -- structured frameworks grounded in theories such as the Big 5 Personality Traits and Primal World Beliefs -- that help provide structure to the generated rationales. Experiments on public opinion and movie preference prediction tasks demonstrate that LLM personas augmented with PB&J rationales consistently outperform methods using only a user's demographics and/or judgments. Additionally, LLM personas constructed using scaffolds describing user beliefs perform competitively with those using human-written rationales.

[74] Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

Zhuang Yu,Shiliang Sun,Jing Zhao,Tengfei Song,Hao Yang

Main category: cs.CL

TL;DR: The study examines the role of pre-trained encoders and decoders in multimodal machine translation, finding that pre-trained decoders clearly improve translation quality, while the effect of pre-trained encoders depends on the quality of visual-text alignment.

Details Motivation: To explore the effectiveness and role of large-scale pre-trained language and vision models in multimodal machine translation. Method: A systematic study of how different training strategies (from training from scratch to using pre-trained and partially frozen components) affect translation performance, with experiments on the Multi30K and CoMMuTE datasets. Result: Pre-trained decoders produce more fluent and accurate outputs, while the effect of pre-trained encoders varies with the quality of visual-text alignment. Conclusion: Pre-training plays a crucial but asymmetrical role in multimodal translation, offering guidance for the architecture design of future multimodal translation systems. Abstract: Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE datasets across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.

[75] RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models

Bang An,Shiyue Zhang,Mark Dredze

Main category: cs.CL

TL;DR: The study finds that the Retrieval-Augmented Generation (RAG) framework can make large language models (LLMs) less safe, and that existing red-teaming methods are less effective in RAG settings.

Details Motivation: Current AI safety research focuses on standard LLMs, so little is known about how the RAG framework affects model safety. Method: A detailed comparative analysis of RAG and non-RAG frameworks across 11 LLMs, together with an investigation of what causes the change in safety. Result: RAG can make models less safe, and existing red-teaming methods are less effective in RAG settings. Conclusion: Safety research and red-teaming methods tailored specifically to RAG LLMs are needed. Abstract: Efforts to ensure the safety of large language models (LLMs) include safety fine-tuning, evaluation, and red teaming. However, despite the widespread use of the Retrieval-Augmented Generation (RAG) framework, AI safety work focuses on standard LLMs, which means we know little about how RAG use cases change a model's safety profile. We conduct a detailed comparative analysis of RAG and non-RAG frameworks with eleven LLMs. We find that RAG can make models less safe and change their safety profile. We explore the causes of this change and find that even combinations of safe models with safe documents can cause unsafe generations. In addition, we evaluate some existing red teaming methods for RAG settings and show that they are less effective than when used for non-RAG settings. Our work highlights the need for safety research and red-teaming methods specifically tailored for RAG LLMs.

[76] DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models

Jianyu Liu,Hangyu Guo,Ranjie Duan,Xingyuan Bu,Yancheng He,Shilong Li,Hui Huang,Jiaheng Liu,Yucheng Wang,Chenchen Jing,Xingwei Qu,Xiao Zhang,Yingshui Tan,Yanan Wu,Jihao Gu,Yangguang Li,Jianke Zhu

Main category: cs.CL

TL;DR: The paper proposes DREAM, which uses multimodal risk disentanglement and reinforcement learning to substantially improve the safety of MLLMs, with experiments showing better results than GPT-4V.

Details Motivation: Multimodal large language models (MLLMs) bring new safety challenges because they integrate visual and textual data, so risks need to be disentangled systematically to strengthen safety. Method: DREAM combines multimodal risk disentanglement analysis with supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Result: DREAM substantially improves MLLM safety, raising the SIUO safe&effective score by 16.17% over GPT-4V without hurting performance on normal tasks. Conclusion: Through systematic risk disentanglement and reinforcement learning, DREAM improves the safety alignment of MLLMs and offers a new direction for multimodal model safety. Abstract: Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (namely oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.

[77] Exploring Personality-Aware Interactions in Salesperson Dialogue Agents

Sijia Cheng,Wen-Yu Chang,Yun-Nung Chen

Main category: cs.CL

TL;DR: The study examines how MBTI-defined user personas affect the interaction quality of sales dialogue agents, finds significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, and releases domain-independent user simulators.

Details Motivation: To understand how different user personas affect the interaction quality and performance of sales dialogue agents, in order to improve adaptability and personalization. Method: Large-scale testing and analysis assess a pre-trained agent's effectiveness, adaptability, and personalization capabilities across a range of MBTI user types. Result: The study reveals significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, indicating that dialogue agents can refine their strategies to match different personality traits. Conclusion: The work provides practical insights for building more adaptive, user-centric sales dialogue systems and supplies tools for future research by releasing domain-independent user simulators. Abstract: The integration of dialogue agents into the sales domain requires a deep understanding of how these systems interact with users possessing diverse personas. This study explores the influence of user personas, defined using the Myers-Briggs Type Indicator (MBTI), on the interaction quality and performance of sales-oriented dialogue agents. Through large-scale testing and analysis, we assess the pre-trained agent's effectiveness, adaptability, and personalization capabilities across a wide range of MBTI-defined user types. Our findings reveal significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, underscoring the future potential for dialogue agents to refine their strategies to better align with varying personality traits. This work not only provides actionable insights for building more adaptive and user-centric conversational systems in the sales domain but also contributes broadly to the field by releasing persona-defined user simulators. These simulators, unconstrained by domain, offer valuable tools for future research and demonstrate the potential for scaling personalized dialogue systems across diverse applications.

[78] PropRAG: Guiding Retrieval with Beam Search over Proposition Paths

Jingjin Wang

Main category: cs.CL

TL;DR: PropRAG improves on standard RAG by using contextually rich propositions and a novel beam search algorithm to explicitly discover multi-step reasoning chains, achieving state-of-the-art performance on several datasets.

Details Motivation: Standard RAG cannot capture the associativity and contextual understanding of human memory, and existing structured RAG methods such as HippoRAG are limited by context loss. PropRAG aims to address both problems. Method: PropRAG uses contextually rich propositions and beam search to explicitly discover multi-step reasoning chains; online retrieval avoids LLM inference entirely, relying on efficient graph traversal and pre-computed embeddings. Result: PropRAG achieves state-of-the-art zero-shot Recall@5 and F1 scores on PopQA, 2Wiki, HotpotQA, and MuSiQue. Conclusion: By improving evidence retrieval and explicit path finding, PropRAG advances non-parametric continual learning. Abstract: Retrieval Augmented Generation (RAG) has become the standard non-parametric approach for equipping Large Language Models (LLMs) with up-to-date knowledge and mitigating catastrophic forgetting common in continual learning. However, standard RAG, relying on independent passage retrieval, fails to capture the interconnected nature of human memory crucial for complex reasoning (associativity) and contextual understanding (sense-making). While structured RAG methods like HippoRAG utilize knowledge graphs (KGs) built from triples, the inherent context loss limits fidelity. We introduce PropRAG, a framework leveraging contextually rich propositions and a novel beam search algorithm over proposition paths to explicitly discover multi-step reasoning chains. Crucially, PropRAG's online retrieval process operates entirely without invoking generative LLMs, relying instead on efficient graph traversal and pre-computed embeddings. This avoids online LLM inference costs and potential inconsistencies during evidence gathering. LLMs are used effectively offline for high-quality proposition extraction and post-retrieval for answer generation. PropRAG achieves state-of-the-art zero-shot Recall@5 results on PopQA (55.3%), 2Wiki (93.7%), HotpotQA (97.0%), and MuSiQue (77.3%), alongside top F1 scores (e.g., 52.4% on MuSiQue). By improving evidence retrieval through richer representation and explicit, LLM-free online path finding, PropRAG advances non-parametric continual learning.
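To make the idea of beam search over proposition paths concrete, here is a minimal sketch. It is not PropRAG's algorithm: the graph layout (`neighbors`), the toy two-dimensional embeddings, and the scoring rule (mean dot-product similarity between the query and each proposition on the path) are assumptions chosen only for illustration. What it does share with the description above is that the search uses pre-computed vectors and graph traversal, with no LLM call.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_score(query_vec, embeddings, path):
    # Assumed scoring rule: mean query similarity over the path.
    return sum(dot(query_vec, embeddings[p]) for p in path) / len(path)


def beam_search_paths(query_vec, embeddings, neighbors, seeds,
                      beam_width=2, max_hops=2):
    """Return the best (score, path) pairs found after at most `max_hops` expansions."""
    beams = [(path_score(query_vec, embeddings, [s]), [s]) for s in seeds]
    beams = heapq.nlargest(beam_width, beams, key=lambda b: b[0])
    for _ in range(max_hops):
        candidates = []
        for score, path in beams:
            expanded = False
            for nxt in neighbors.get(path[-1], []):
                if nxt in path:
                    continue  # avoid cycles
                new_path = path + [nxt]
                candidates.append(
                    (path_score(query_vec, embeddings, new_path), new_path))
                expanded = True
            if not expanded:
                candidates.append((score, path))  # dead end: keep the path as is
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
    return beams


# Hypothetical toy graph of four propositions.
query = [1.0, 0.0]
embeddings = {"a": [0.9, 0.1], "b": [0.2, 0.9], "c": [0.8, 0.3], "d": [0.1, 0.1]}
neighbors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
best = beam_search_paths(query, embeddings, neighbors, seeds=["a"], max_hops=1)
```

With this toy data the top-ranked one-hop path is `["a", "c"]`, because proposition `c` is closer to the query than `b`. A real system would likely use learned embeddings and a more careful path scorer; those details are not given in the abstract.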

[79] Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization

Wataru Kawakami,Keita Suzuki,Junichiro Iwasawa

Main category: cs.CL

TL;DR: The paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain that achieves high accuracy and stable reasoning through two-stage fine-tuning.

Details Motivation: Large language models have great potential in medicine, but clinical adoption is held back by concerns about factual accuracy, language limitations, and the reliability of their reasoning. Method: Two-stage fine-tuning: 1) Continued Pretraining (CPT) on a Japanese medical corpus; 2) Reasoning Preference Optimization (RPO) to improve the generation of reliable reasoning pathways. Result: The model reaches 0.868 accuracy on the IgakuQA benchmark, ahead of models such as GPT-4o, and maintains that accuracy when asked to provide explanations. Conclusion: Optimizing for reliable explanations matters as much as accuracy; the model weights are released to support research on trustworthy LLMs. Abstract: Large Language Models (LLMs) show potential in medicine, yet clinical adoption is hindered by concerns over factual accuracy, language-specific limitations (e.g., Japanese), and critically, their reliability when required to generate reasoning explanations -- a prerequisite for trust. This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain to achieve both high accuracy and stable reasoning. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model: first, Continued Pretraining (CPT) on a comprehensive Japanese medical corpus instills deep domain knowledge. Second, Reasoning Preference Optimization (RPO), a preference-based method, enhances the generation of reliable reasoning pathways while preserving high answer accuracy. Evaluations on the Japanese Medical Licensing Exam benchmark (IgakuQA) show Preferred-MedLLM-Qwen-72B achieves state-of-the-art performance (0.868 accuracy), surpassing strong proprietary models like GPT-4o (0.866). Crucially, unlike baseline or CPT-only models which exhibit significant accuracy degradation (up to 11.5% and 3.8% respectively on IgakuQA) when prompted for explanations, our model maintains its high accuracy (0.868) under such conditions. This highlights RPO's effectiveness in stabilizing reasoning generation. This work underscores the importance of optimizing for reliable explanations alongside accuracy. We release the Preferred-MedLLM-Qwen-72B model weights to foster research into trustworthy LLMs for specialized, high-stakes applications.

[80] Random-Set Large Language Models

Muhammad Mubashar,Shireen Kudukkil Manchingal,Fabio Cuzzolin

Main category: cs.CL

TL;DR: The paper proposes a new Random-Set Large Language Model (RSLLM) for quantifying the uncertainty of LLM-generated text. It predicts finite random sets (belief functions) instead of conventional probability vectors and uses hierarchical clustering to keep the method efficient.

Details Motivation: To study how far LLM-generated text can be trusted and to propose a way of quantifying its uncertainty, improving the trustworthiness and usefulness of the model. Method: RSLLM predicts finite random sets (belief functions) and uses hierarchical clustering to extract a budget of key focal subsets for efficiency. Result: On the CoQA and OBQA datasets, RSLLM outperforms the standard model in answer correctness, shows potential for uncertainty estimation, and can detect hallucinations. Conclusion: By quantifying uncertainty through belief functions, RSLLM improves the trustworthiness and usefulness of LLMs. Abstract: Large Language Models (LLMs) are known to produce very high-quality texts and responses to our queries. But how much can we trust this generated text? In this paper, we study the problem of uncertainty quantification in LLMs. We propose a novel Random-Set Large Language Model (RSLLM) approach which predicts finite random sets (belief functions) over the token space, rather than probability vectors as in classical LLMs. To do so efficiently, we also present a methodology based on hierarchical clustering to extract and use a budget of "focal" subsets of tokens upon which the belief prediction is defined, rather than using all possible collections of tokens, making the method scalable yet effective. RSLLMs encode the epistemic uncertainty induced in their generation process by the size and diversity of their training set via the size of the credal sets associated with the predicted belief functions. The proposed approach is evaluated on the CoQA and OBQA datasets using the Llama2-7b, Mistral-7b and Phi-2 models and is shown to outperform the standard model on both datasets in terms of correctness of answer, while also showing potential in estimating the second-level uncertainty in its predictions and providing the capability to detect when it is hallucinating.
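For readers unfamiliar with belief functions, the sketch below shows the standard Dempster-Shafer quantities that a mass assignment over focal sets of tokens induces: belief (total mass of focal sets contained in an event) and plausibility (total mass of focal sets intersecting it). The focal sets and masses are made-up toy values, and using the gap between plausibility and belief as an uncertainty signal is an illustrative simplification, not the paper's credal-set measure.

```python
def belief(masses, event):
    """Sum of the masses of focal sets fully contained in `event`."""
    event = frozenset(event)
    return sum(m for focal, m in masses.items() if focal <= event)


def plausibility(masses, event):
    """Sum of the masses of focal sets that intersect `event`."""
    event = frozenset(event)
    return sum(m for focal, m in masses.items() if focal & event)


# Hypothetical mass assignment over three focal sets of tokens (sums to 1).
masses = {
    frozenset({"cat"}): 0.5,
    frozenset({"cat", "dog"}): 0.3,
    frozenset({"cat", "dog", "bird"}): 0.2,
}

for token in ("cat", "dog", "bird"):
    low = belief(masses, {token})
    high = plausibility(masses, {token})
    # The interval [low, high] bounds the probability of the token;
    # a wider interval means more mass sits on ambiguous sets.
    print(f"{token}: belief={low:.2f} plausibility={high:.2f}")
```

A classical probability vector is the special case where every focal set is a single token, in which case belief and plausibility coincide.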

[81] Application and Optimization of Large Models Based on Prompt Tuning for Fact-Check-Worthiness Estimation

Yinglong Yu,Hao Shen,Zhengyi Lyu,Qi He

Main category: cs.CL

TL;DR: The paper proposes a prompt-tuning-based classification method for fact-check-worthiness estimation. By designing prompt templates and using in-context learning, it improves the accuracy of judging fact-check-worthiness with limited or unlabeled data. Experiments show that the method beats or matches baselines such as BERT, GPT-3.5, and GPT-4 on metrics including F1 score and accuracy.

Details Motivation: Misinformation is a growing problem in the context of globalization and informatization; the study asks how to estimate fact-check-worthiness efficiently, especially when data is limited or unlabeled. Method: Prompt templates are designed and applied to large language models using prompt tuning, with in-context learning used to improve the accuracy of fact-check-worthiness estimation. Result: Experiments on public datasets show that the method beats or matches baselines such as BERT, GPT-3.5, and GPT-4 on metrics including F1 score and accuracy. Conclusion: The prompt-tuning-based method is effective for fact-check-worthiness estimation and is particularly advantageous when data is limited. Abstract: In response to the growing problem of misinformation in the context of globalization and informatization, this paper proposes a classification method for fact-check-worthiness estimation based on prompt tuning. We construct a model for fact-check-worthiness estimation at the methodological level using prompt tuning. By applying designed prompt templates to large language models, we establish in-context learning and leverage prompt tuning technology to improve the accuracy of determining whether claims have fact-check-worthiness, particularly when dealing with limited or unlabeled data. Through extensive experiments on public datasets, we demonstrate that the proposed method surpasses or matches multiple baseline methods in the classification task of fact-check-worthiness estimation assessment, including classical pre-trained models such as BERT, as well as recent popular large models like GPT-3.5 and GPT-4. Experiments show that the prompt tuning-based method proposed in this study exhibits certain advantages in evaluation metrics such as F1 score and accuracy, thereby effectively validating its effectiveness and advancement in the task of fact-check-worthiness estimation.

[82] Comparative Study on the Discourse Meaning of Chinese and English Media in the Paris Olympics Based on LDA Topic Modeling Technology and LLM Prompt Engineering

Yinglong Yu,Zhaopu Yao,Fang Yuan

Main category: cs.CL

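TL;DR: The study analyzes Chinese and English media coverage of the Paris Olympics using topic modeling, LLM prompt engineering, and corpus phraseology, exploring similarities and differences in discourse construction and attitudinal meaning.

Details Motivation: To study how Chinese and English media coverage of the Paris Olympics differs, revealing attitudes and discourse features across cultural contexts. Method: Topic modeling, LLM prompt engineering, and corpus phraseology methods are used to analyze the reports. Result: Chinese media focus on specific sports, sports spirit, doping controversies, and new technologies, while English media focus on female athletes, medals, and eligibility controversies. Chinese reports are more positive about the opening ceremony and sports spirit; English reports are positive about female athletes but negative about reactions to the opening ceremony and the women's boxing controversies. Conclusion: Chinese and English media differ markedly in their coverage of the Paris Olympics, reflecting different cultural and social values. Abstract: This study analyzes Chinese and English media reports on the Paris Olympics using topic modeling, Large Language Model (LLM) prompt engineering, and corpus phraseology methods to explore similarities and differences in discourse construction and attitudinal meanings. Common topics include the opening ceremony, athlete performance, and sponsorship brands. Chinese media focus on specific sports, sports spirit, doping controversies, and new technologies, while English media focus on female athletes, medal wins, and eligibility controversies. Chinese reports show more frequent prepositional co-occurrences and positive semantic prosody in describing the opening ceremony and sports spirit. English reports exhibit positive semantic prosody when covering female athletes but negative prosody in predicting opening ceremony reactions and discussing women's boxing controversies.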

[83] Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

Atharva Kulkarni,Yuan Zhang,Joel Ruben Antony Moniz,Xiou Ge,Bo-Hsiang Tseng,Dhivya Piraviperumal,Swabha Swayamdipta,Hong Yu

Main category: cs.CL

TL;DR: Through a large-scale empirical evaluation, the paper finds that current hallucination detection metrics fall short, that LLM-based evaluation (such as with GPT-4) performs best, and that mode-seeking decoding methods can reduce hallucinations.

Details Motivation: Hallucinations undermine the reliability of language models, yet the robustness and generalization of existing detection metrics have not been verified. Method: Six sets of hallucination detection metrics are evaluated across 4 datasets, 37 language models, and 5 decoding methods. Result: Metrics often disagree with human judgments; LLM-based evaluation (such as with GPT-4) performs best, and mode-seeking decoding methods reduce hallucinations. Conclusion: More robust metrics and strategies are needed to understand and mitigate hallucinations. Abstract: Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

[84] Temporal Entailment Pretraining for Clinical Language Models over EHR Data

Tatsunori Tanaka,Fi Zheng,Kai Sato,Zhifeng Li,Yuanyun Zhang,Shi Li

Main category: cs.CL

TL;DR: The paper proposes a new pretraining objective for clinical language models that trains on temporally ordered sentence pairs to improve performance on time-dependent clinical tasks.

Details Motivation: Existing methods treat the electronic health record as a static document, ignoring the temporal evolution and causal links in patient trajectories. Method: A temporal entailment pretraining objective treats EHR segments as temporally ordered sentence pairs and trains the model to decide how a later state relates to an earlier one (entailment, contradiction, or neutral). Result: After pretraining on MIMIC-IV data, the model achieves state-of-the-art results on temporal clinical QA, early warning prediction, and disease progression modeling. Conclusion: A temporally structured pretraining task lets models learn latent clinical reasoning and improves generalization. Abstract: Clinical language models have achieved strong performance on downstream tasks by pretraining on domain-specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally-evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state. Through this temporally structured pretraining task, models learn to perform latent clinical reasoning over time, improving their ability to generalize across forecasting and diagnosis tasks. We pretrain on a large corpus derived from MIMIC-IV and demonstrate state-of-the-art results on temporal clinical QA, early warning prediction, and disease progression modeling.

[85] EDU-NER-2025: Named Entity Recognition in Urdu Educational Texts using XLM-RoBERTa with X (formerly Twitter)

Fida Ullah,Muhammad Ahmad,Muhammad Tayyab Zamir,Muhammad Arif,Grigori Sidorov,Edgardo Manuel Felipe Riverón,Alexander Gelbukh

Main category: cs.CL

TL;DR: The paper addresses named entity recognition (NER) for Urdu in the education domain by creating a manually annotated dataset, EDU-NER-2025, and analyzing the annotation challenges and linguistic complexity involved.

Details Motivation: Domain-specific NER for Urdu, such as in education, is under-researched, and the lack of annotated datasets limits models' ability to recognize entities such as academic roles and course names. Method: The authors build EDU-NER-2025, a manually annotated dataset covering 13 education-domain entities, describe the annotation process and guidelines in detail, and analyze the linguistic challenges. Result: The first NER dataset for the Urdu education domain is constructed, and linguistic issues such as morphological complexity and ambiguity are identified. Conclusion: The study fills a resource gap for Urdu education-domain NER and provides a basis for future research. Abstract: Named Entity Recognition (NER) plays a pivotal role in various Natural Language Processing (NLP) tasks by identifying and classifying named entities (NEs) from unstructured data into predefined categories such as person, organization, location, date, and time. While extensive research exists for high-resource languages and general domains, NER in Urdu, particularly within domain-specific contexts like education, remains significantly underexplored. This is due to the lack of annotated datasets for educational content, which limits the ability of existing models to accurately identify entities such as academic roles, course names, and institutional terms, underscoring the urgent need for targeted resources in this domain. To the best of our knowledge, no dataset exists in the domain of the Urdu language for this purpose. To achieve this objective, this study makes three key contributions. First, we created a manually annotated dataset in the education domain, named EDU-NER-2025, which covers the 13 most important unique entities related to the education domain. Second, we describe our annotation process and guidelines in detail and discuss the challenges of labelling the EDU-NER-2025 dataset. Third, we addressed and analyzed key linguistic challenges, such as morphological complexity and ambiguity, which are prevalent in formal Urdu texts.

Þórir Hrafn Harðarson,Hrafn Loftsson,Stefán Ólafsson

Main category: cs.CL

TL;DR: The study asks whether preference-based training techniques (such as RLHF and DPO) can improve language models at generating Icelandic legal summaries. Preference training improves legal accuracy but has little effect on language quality.

Details Motivation: Language models hold considerable promise in the legal domain, but the specialized terminology and style of legal texts pose challenges; the study explores whether preference training can improve model performance. Method: Preference training (RLHF and DPO) is compared with conventional supervised learning on Icelandic legal summary generation. Result: Preference training improves legal accuracy but does not significantly improve the quality of Icelandic language usage, and automated metrics diverge from human evaluations. Conclusion: Preference training has potential in the legal domain, but it should be combined with qualitative assessment to make language models more useful in practice. Abstract: The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models' performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage. Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.

[87] Optimising ChatGPT for creativity in literary translation: A case study from English into Dutch, Chinese, Catalan and Spanish

Shuxiang Du,Ana Guerberof Arenas,Antonio Toral,Kyo Gerrits,Josep Marco Borillo

Main category: cs.CL

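TL;DR: The study examines how ChatGPT's machine translation output varies across configurations (text granularity, temperature setting, and prompting strategy) and finds that a minimal-instruction prompt gives the most creative translations, though still short of human translation.

Details Motivation: To evaluate ChatGPT's creative performance in literary translation and compare how different configurations affect its output. Method: Six configurations are tested in four languages, with translation quality assessed using a Creativity Score formula. Result: The minimal-instruction prompt (temperature 1.0) performs best in Spanish, Dutch, and Chinese, but overall still falls behind human translation. Conclusion: ChatGPT shows potential for creative translation but needs further optimization to approach the level of human translation. Abstract: This study examines the variability of ChatGPT machine translation (MT) outputs across six different configurations in four languages, with a focus on creativity in a literary text. We evaluate GPT translations in different text granularity levels, temperature settings and prompting strategies with a Creativity Score formula. We found that prompting ChatGPT with a minimal instruction yields the best creative translations, with "Translate the following text into [TG] creatively" at a temperature of 1.0 outperforming other configurations and DeepL in Spanish, Dutch, and Chinese. Nonetheless, ChatGPT consistently underperforms compared to human translation (HT).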

[88] Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family

Pierre-Carl Langlais,Pavel Chizhov,Mattia Nee,Carlos Rosas Hinostroza,Matthieu Delsart,Irène Girard,Othman Hicheur,Anastasia Stasenko,Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: Pleias-RAG-350m and Pleias-RAG-1B are a new generation of small reasoning models designed for RAG, search, and source summarization. They outperform SLMs below 4B parameters and stand out in multilingual and citation support.

Details Motivation: To address the weak performance of small models on RAG tasks while supporting multiple languages and citations, for use in resource-constrained settings. Method: The models are mid-trained on a large synthetic dataset that emulates retrieval from multilingual open sources, and support features such as query routing and reranking. Result: They outperform SLMs below 4B parameters on benchmarks such as HotPotQA and 2wiki and are competitive with larger models. Conclusion: Thanks to their small size and efficiency, the models open up new use cases for generative AI. Abstract: We introduce a new generation of small reasoning models for RAG, search, and source summarization. Pleias-RAG-350m and Pleias-RAG-1B are mid-trained on a large synthetic dataset emulating the retrieval of a wide variety of multilingual open sources from the Common Corpus. They provide native support for citation and grounding with literal quotes and reintegrate multiple features associated with RAG workflows, such as query routing, query reformulation, and source reranking. Pleias-RAG-350m and Pleias-RAG-1B outperform SLMs below 4 billion parameters on standardized RAG benchmarks (HotPotQA, 2wiki) and are competitive with popular larger models, including Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B. They are the only SLMs to date maintaining consistent RAG performance across leading European languages and ensuring systematic reference grounding for statements. Due to their size and ease of deployment on constrained infrastructure and higher factuality by design, the models unlock a range of new use cases for generative AI.

[89] Efficient Single-Pass Training for Multi-Turn Reasoning

Ritesh Goru,Shanay Mehta,Prateek Jain

Main category: cs.CL

TL;DR: The paper proposes response token duplication and a custom attention mask to speed up fine-tuning on multi-turn reasoning datasets, substantially reducing training time.

Details Motivation: In multi-turn reasoning tasks, the reasoning tokens an LLM generates are excluded from subsequent inputs, so the whole conversation cannot be processed in a single forward pass, which limits efficiency. Method: Response tokens are duplicated and a custom attention mask enforces the visibility constraints on reasoning tokens, streamlining training. Result: Training time is substantially reduced, enabling efficient fine-tuning on multi-turn reasoning datasets. Conclusion: The method addresses the training-efficiency problem in multi-turn reasoning and offers a practical solution for similar tasks. Abstract: Training Large Language Models (LLMs) to generate explicit reasoning before they produce an answer has been shown to improve their performance across various tasks such as mathematics and coding. However, fine-tuning LLMs on multi-turn reasoning datasets presents a unique challenge: LLMs must generate reasoning tokens that are excluded from subsequent inputs to the LLM. This discrepancy prevents us from processing an entire conversation in a single forward pass, an optimization readily available when we fine-tune on a multi-turn non-reasoning dataset. This paper proposes a novel approach that overcomes this limitation through response token duplication and a custom attention mask that enforces appropriate visibility constraints. Our approach significantly reduces the training time and allows efficient fine-tuning on multi-turn reasoning datasets.
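One way to picture the visibility constraints is the toy mask builder below. The abstract does not spell out the layout, so everything here is an assumption made for illustration: each turn is laid out as user tokens, reasoning tokens, a response copy that sees the reasoning (`resp_loss`, the copy a loss would be computed on), and a duplicated response copy that does not (`resp_ctx`, the copy later turns attend to). The segment names and the `build_mask` helper are hypothetical, and position-id handling is omitted.

```python
def build_mask(segments):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    segments: list of (turn, kind, length) in sequence order, where kind is one of
    "user", "reason", "resp_loss", "resp_ctx" (an assumed layout, see the text above).
    """
    tokens = [(turn, kind) for turn, kind, n in segments for _ in range(n)]
    size = len(tokens)
    mask = [[False] * size for _ in range(size)]
    for q, (q_turn, q_kind) in enumerate(tokens):
        for k in range(q + 1):  # causal: keys at or before the query
            k_turn, k_kind = tokens[k]
            if k_turn < q_turn:
                # Earlier turns are visible only as user text plus the
                # reasoning-free response copy.
                visible = k_kind in ("user", "resp_ctx")
            elif q_kind == "resp_ctx":
                # The context copy must not see this turn's reasoning.
                visible = k_kind in ("user", "resp_ctx")
            else:
                # User, reasoning, and loss-copy tokens see the turn so far,
                # including the reasoning.
                visible = k_kind in ("user", "reason", "resp_loss")
            mask[q][k] = visible
    return mask


# Two turns with one token per segment, to keep the mask readable.
segments = [(0, "user", 1), (0, "reason", 1), (0, "resp_loss", 1),
            (0, "resp_ctx", 1), (1, "user", 1)]
mask = build_mask(segments)
```

In this layout the second turn's user token attends to the first turn's user token and context copy only, which mirrors what the model would see at inference once reasoning has been stripped from the history.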

[90] MAGI: Multi-Agent Guided Interview for Psychiatric Assessment

Guanqun Bi,Zhuang Chen,Zhoufu Liu,Hongkai Wang,Xiyao Xiao,Yuqiang Xie,Wen Zhang,Yongkang Huang,Yuxuan Chen,Libiao Peng,Yi Feng,Minlie Huang

Main category: cs.CL

TL;DR: The MAGI framework automates the MINI structured clinical interview through multi-agent collaboration, combining clinical rigor with conversational adaptability to improve mental health assessment.

Details Motivation: Existing LLM approaches do not follow psychiatric diagnostic protocols, and automating structured clinical interviews could substantially improve access to mental health services. Method: MAGI runs the MINI interview flow dynamically through four specialized agents (navigation, question, judgment, diagnosis) and generates explainable Psychometric Chain-of-Thought reasoning. Result: Tested on 1,002 real-world participants covering several mental health conditions, MAGI performs strongly in clinical rigor, conversational adaptability, and explainability. Conclusion: MAGI is the first framework for LLM-assisted mental health assessment that combines clinical protocols with agent collaboration, and it has practical potential. Abstract: Automating structured clinical interviews could revolutionize mental healthcare accessibility, yet existing large language model (LLM) approaches fail to align with psychiatric diagnostic protocols. We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi-agent collaboration. MAGI dynamically navigates clinical logic via four specialized agents: 1) an interview tree guided navigation agent adhering to the MINI's branching structure, 2) an adaptive question agent blending diagnostic probing, explaining, and empathy, 3) a judgment agent validating whether the responses from participants meet the node, and 4) a diagnosis agent generating Psychometric Chain-of-Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Experimental results on 1,002 real-world participants covering depression, generalized anxiety, social anxiety and suicide show that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.

[91] TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation

Shintaro Ozaki,Kazuki Hayashi,Yusuke Sakai,Jingun Kwon,Hidetaka Kamigaito,Katsuhiko Hayashi,Manabu Okumura,Taro Watanabe

Main category: cs.CL

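TL;DR: TextTIGER improves image generation by augmenting and then summarizing entity descriptions; experiments show that it outperforms caption-only prompts.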

Details Motivation: Image generation models cannot fully memorize entity knowledge, because the number of entities is vast and new ones keep emerging. Method: TextTIGER uses LLMs to augment entity descriptions and then summarize them, which limits the performance degradation caused by longer inputs. Result: It outperforms caption-only prompts on IS, FID, and CLIPScore, and annotators judge the summarized descriptions to be more informative. Conclusion: Refining prompts with augmented and summarized entity descriptions improves image generation. Abstract: Generating images from prompts containing specific entities requires models to retain as much entity-specific knowledge as possible. However, fully memorizing such knowledge is impractical due to the vast number of entities and their continuous emergence. To address this, we propose Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), which augments knowledge on entities included in the prompts and then summarizes the augmented descriptions using Large Language Models (LLMs) to mitigate performance degradation from longer inputs. To evaluate our method, we introduce WiT-Cub (WiT with Captions and Uncomplicated Background-explanations), a dataset comprising captions, images, and an entity list. Experiments on four image generation models and five LLMs show that TextTIGER improves image generation performance in standard metrics (IS, FID, and CLIPScore) compared to caption-only prompts. Additionally, multiple annotators' evaluation confirms that the summarized descriptions are more informative, validating LLMs' ability to generate concise yet rich descriptions. These findings demonstrate that refining prompts with augmented and summarized entity-related descriptions enhances image generation capabilities. The code and dataset will be available upon acceptance.

[92] Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review

Toghrul Abbasli,Kentaroh Toyoda,Yuan Wang,Leon Witt,Muhammad Asif Ali,Yukai Miao,Dan Li,Qingsong Wei

Main category: cs.CL

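TL;DR: A systematic review of uncertainty quantification (UQ) and calibration methods for large language models (LLMs) that fills a gap in the literature and introduces a rigorous benchmark.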

Details Motivation: Hallucination (outputting incorrect information) is one of the main challenges for LLMs, yet existing work lacks an in-depth analysis and comprehensive comparison of UQ and calibration methods for them. Method: A systematic review of representative literature, a rigorous benchmark, and an empirical evaluation of six related methods on two reliability datasets. Result: The empirical evaluation supports the key findings of the review and shows how effective existing methods are. Conclusion: The survey is described as the first dedicated study of calibration methods and related metrics for LLMs, and it outlines future research directions and open challenges. Abstract: Large Language Models (LLMs) have been transformative across many domains. However, hallucination -- confidently outputting incorrect information -- remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, which justify the significant findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.

[93] Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

Lei Shen,Xiaoyu Shen

Main category: cs.CL

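TL;DR: Auto-SLURP is a new benchmark dataset for evaluating LLM-based multi-agent frameworks in smart personal assistant scenarios, addressing the current lack of purpose-built evaluation tools.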

Details Motivation: There is no benchmark dataset designed specifically to evaluate LLM-based multi-agent frameworks; Auto-SLURP aims to fill this gap. Method: The original SLURP dataset is relabeled and combined with simulated servers and external services to build an end-to-end evaluation pipeline. Result: Experiments show that Auto-SLURP is a significant challenge for current state-of-the-art frameworks, indicating that reliable and intelligent multi-agent personal assistants still need work. Conclusion: Auto-SLURP provides an important evaluation tool for multi-agent frameworks and shows that the area needs further research. Abstract: In recent years, multi-agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto-SLURP, a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants. Auto-SLURP extends the original SLURP dataset -- initially developed for natural language understanding tasks -- by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end-to-end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks, highlighting that truly reliable and intelligent multi-agent personal assistants remain a work in progress. The dataset and related code are available at https://github.com/lorashen/Auto-SLURP/.

[94] Pushing the boundary on Natural Language Inference

Pablo Miralles-González,Javier Huertas-Tato,Alejandro Martín,David Camacho

Main category: cs.CL

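TL;DR: The paper applies a reinforcement learning method (GRPO) to NLI, using Chain-of-Thought learning to remove the need for labeled rationales, and validates it on challenging datasets such as ANLI.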

Details Motivation: Current NLI systems rely on supervised learning, and their datasets contain annotation biases that limit generalization and real-world applicability. Method: GRPO reinforcement learning combined with Chain-of-Thought learning, with 7B, 14B, and 32B language models fine-tuned using parameter-efficient techniques (LoRA and QLoRA). Result: The 32B AWQ-quantized model surpasses state-of-the-art results on 7 of 11 adversarial sets (or on all of them when compared against the authors' replication) within a 22GB memory footprint. Conclusion: The work provides a scalable and practical framework for building robust NLI systems while preserving inference quality. Abstract: Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering, and information retrieval. Despite its importance, current NLI systems heavily rely on supervised learning with datasets that often contain annotation artifacts and biases, limiting generalization and real-world applicability. In this work, we apply a reinforcement learning-based approach using Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning in NLI, eliminating the need for labeled rationales and enabling this type of training on more challenging datasets such as ANLI. We fine-tune 7B, 14B, and 32B language models using parameter-efficient techniques (LoRA and QLoRA), demonstrating strong performance across standard and adversarial NLI benchmarks. Our 32B AWQ-quantized model surpasses state-of-the-art results on 7 out of 11 adversarial sets (or on all of them considering our replication) within a 22GB memory footprint, showing that robust reasoning can be retained under aggressive quantization. This work provides a scalable and practical framework for building robust NLI systems without sacrificing inference quality.

[95] A UD Treebank for Bohairic Coptic

Amir Zeldes,Nina Speransky,Nicholas Wagner,Caroline T. Schroeder

Main category: cs.CL

TL;DR: The paper presents the first syntactically annotated corpus of Bohairic Coptic and compares it with Sahidic Coptic, highlighting what makes Bohairic distinct.

Details Motivation: Other Coptic dialects such as Sahidic already have substantial digital resources, but Bohairic Coptic, the main dialect of late Byzantine Egypt and the language of the contemporary Coptic Church, remains under-resourced. Method: The authors build and evaluate the first syntactically annotated corpus of Bohairic Coptic, covering Biblical text, saints' lives, and other works, then compare it with Sahidic Coptic and run cross-dialect parsing experiments. Result: The comparison reveals notable differences between Bohairic and Sahidic Coptic, showing that Bohairic is a related but distinct variety. Conclusion: The work fills a resource gap for Bohairic Coptic and provides a basis for its linguistic analysis. Abstract: Despite recent advances in digital resources for other Coptic dialects, especially Sahidic, Bohairic Coptic, the main Coptic dialect for pre-Mamluk, late Byzantine Egypt, and the contemporary language of the Coptic Church, remains critically under-resourced. This paper presents and evaluates the first syntactically annotated corpus of Bohairic Coptic, sampling data from a range of works, including Biblical text, saints' lives and Christian ascetic writing. We also explore some of the main differences we observe compared to the existing UD treebank of Sahidic Coptic, the classical dialect of the language, and conduct joint and cross-dialect parsing experiments, revealing the unique nature of Bohairic as a related, but distinct variety from the more often studied Sahidic.

[96] HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang,Wenliang Zheng,Aashrith Madasu,Peng Shi,Ryo Kamoi,Hao Zhou,Zhuoyang Zou,Shu Zhao,Sarkar Snigdha Sarathi Das,Vipul Gupta,Xiaoxin Lu,Nan Zhang,Ranran Haoran Zhang,Avitej Iyer,Renze Lou,Wenpeng Yin,Rui Zhang

Main category: cs.CL

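TL;DR: The paper introduces HRScene, a comprehensive benchmark for evaluating how well vision large language models (VLMs) understand high-resolution images (HRIs), comprising 25 real-world datasets and 2 synthetic diagnostic datasets. Experiments show that current VLMs perform poorly on HRI tasks, with an average accuracy of around 50%.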

Details Motivation: Existing VLMs claim to handle high-resolution images, but no benchmark comprehensively evaluates that ability. Method: The HRScene benchmark integrates 25 real-world datasets and 2 synthetic datasets covering a wide range of scenarios and resolutions, re-annotated by 10 graduate-level annotators. Result: VLMs reach an average accuracy of around 50% on real-world tasks and show poor use of image regions on the synthetic data. Conclusion: Current VLMs fall well short on high-resolution image understanding, and further research is needed. Abstract: High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs; however, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-middle, shedding light on future research.

[97] Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers

Jared Moore,Declan Grabb,William Agnew,Kevin Klyman,Stevie Chancellor,Desmond C. Ong,Nick Haber

Main category: cs.CL

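TL;DR: The paper asks whether large language models (LLMs) are suitable replacements for therapists. It finds that LLMs fall short in therapeutic relationships, for example by expressing stigma toward mental health conditions and responding inappropriately, and concludes that they should not replace therapists.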

Details Motivation: To examine whether LLMs can replace therapists, a use case promoted by tech startups and parts of the research community. Method: A mapping review of therapy guides used by major medical institutions, followed by experiments that test how LLMs behave in therapeutic relationships. Result: LLMs express stigma toward mental health conditions and respond inappropriately in naturalistic therapy settings, for example by encouraging delusional thinking. Conclusion: LLMs should not replace therapists, although other supporting roles in clinical therapy can be explored. Abstract: Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to *replace* mental health providers, a use case promoted in the tech startup and research space. We conduct a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance between therapist and client. We then assess the ability of LLMs to reproduce and adhere to these aspects of therapeutic relationships by conducting several experiments investigating the responses of current LLMs, such as `gpt-4o`. Contrary to best practices in the medical community, LLMs 1) express stigma toward those with mental health conditions and 2) respond inappropriately to certain common (and critical) conditions in naturalistic therapy settings -- e.g., LLMs encourage clients' delusional thinking, likely due to their sycophancy. This occurs even with larger and newer LLMs, indicating that current safety practices may not address these gaps. Furthermore, we note foundational and practical barriers to the adoption of LLMs as therapists, such as that a therapeutic alliance requires human characteristics (e.g., identity and stakes). For these reasons, we conclude that LLMs should not replace therapists, and we discuss alternative roles for LLMs in clinical therapy.

[98] BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

Hongyu Wang,Shuming Ma,Furu Wei

Main category: cs.CL

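TL;DR: BitNet v2 uses the H-BitLinear module to address activation outliers in 1-bit LLMs, enabling native 4-bit activation quantization and significantly reducing memory and compute cost.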

Details Motivation: Activation outliers in 1-bit LLMs hinder low-bit quantization; BitNet v2 aims to solve this. Method: The H-BitLinear module applies an online Hadamard transformation that smooths the activation distribution before quantization, enabling 4-bit activations. Result: With 8-bit activations BitNet v2 matches BitNet b1.58, and with 4-bit activations the performance loss is minimal. Conclusion: BitNet v2 significantly reduces memory and compute cost and is suited to efficient deployment. Abstract: Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
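
To make the "Hadamard transformation prior to activation quantization" idea concrete, the following is a minimal sketch, not the H-BitLinear implementation. It assumes an orthonormal fast Walsh-Hadamard transform over a power-of-two vector and symmetric absmax scaling to the INT4 range; the function names, the toy activation vector, and the absmax scheme are all illustrative assumptions that do not come from the abstract. It only shows how a single outlier channel is spread across all channels before rounding.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]

def peak_to_rms(x):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

def quantize_int4_absmax(x):
    """Return (integer codes in [-8, 7], scale); absmax scaling is an assumed scheme."""
    scale = max(abs(v) for v in x) / 7.0 or 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return codes, scale

# Hypothetical activations: small values plus one large outlier channel.
activations = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0]
rotated = fwht(activations)
codes, scale = quantize_int4_absmax(rotated)

print("peak/RMS before rotation:", round(peak_to_rms(activations), 3))
print("peak/RMS after rotation: ", round(peak_to_rms(rotated), 3))
print("INT4 codes of rotated activations:", codes)
```

Because the normalized transform is orthogonal, it preserves the vector's energy and can be undone by applying `fwht` again; what changes is that no single channel dominates the quantization scale.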

[99] PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Yiming Wang,Pei Zhang,Jialong Tang,Haoran Wei,Baosong Yang,Rui Wang,Chenshu Sun,Feitong Sun,Jiran Zhang,Junxuan Wu,Qiqian Cang,Yichang Zhang,Fei Huang,Junyang Lin,Fei Huang,Jingren Zhou

Main category: cs.CL

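TL;DR: PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels, designed to evaluate the multilingual reasoning ability of large language models (LLMs). The study finds significant challenges for current LLMs, including large performance gaps across languages and low input-output language consistency.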

Details Motivation: To evaluate LLMs on multilingual mathematical reasoning and address the limited language diversity and difficulty coverage of existing benchmarks. Method: The authors build the PolyMath benchmark, covering 18 languages and 4 difficulty levels, and run a comprehensive evaluation of advanced LLMs. Result: Deepseek-R1-671B and Qwen-QwQ-32B score only 43.4 and 41.8 on the benchmark, with accuracy below 30% at the highest difficulty level; LLMs also show performance gaps across languages and low input-output language consistency. Conclusion: Controlling the output language through instructions can affect reasoning performance, especially for some low-resource languages, which suggests a direction for improving multilingual capabilities. Abstract: In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced LLMs and find that even Deepseek-R1-671B and Qwen-QwQ-32B achieve only 43.4 and 41.8 benchmark scores, with less than 30% accuracy under the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

[100] Fast-Slow Thinking for Large Vision-Language Model Reasoning

Wenyi Xiao,Leilei Gan,Weilong Dai,Wanggui He,Ziwei Huang,Haoyuan Li,Fangxun Shu,Zhelun Yu,Peng Zhang,Hao Jiang,Fei Wu

Main category: cs.CL

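TL;DR: The FAST framework dynamically adjusts reasoning depth to address the "overthinking" problem of large vision-language models (LVLMs), improving accuracy while reducing compute.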

Details Motivation: Large vision-language models (LVLMs) exhibit "overthinking": they generate lengthy reasoning regardless of question complexity, which wastes resources. Method: The FAST framework has three core components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Result: Across seven reasoning benchmarks, FAST achieves state-of-the-art accuracy, a relative improvement of more than 10% over the base model, while reducing token usage by 32.7-67.3%. Conclusion: FAST balances reasoning length and accuracy and offers a workable answer to overthinking in LVLMs. Abstract: Recent advances in large vision-language models (LVLMs) have revealed an overthinking phenomenon, where models generate verbose reasoning across all tasks regardless of questions. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.

[101] Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions

James D. Finch,Yasasvi Josyula,Jinho D. Choi

Main category: cs.CL

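TL;DR: The paper proposes a new Slot Schema Induction (SSI) method that treats SSI as a text generation task and uses LLMs to generate high-quality dialogue data automatically. It also addresses data leakage and metric problems in SSI evaluation.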

Details Motivation: SSI is essential for task-oriented dialogue systems because it identifies key information slots without manual intervention, but its evaluation has been problematic. Method: SSI is formulated as a text generation task, an LLM-based simulation generates dialogue data automatically, and the evaluation metrics are improved. Result: The paper contributes a new SSI method and evaluation framework that advance dialogue system technology. Conclusion: The approach lays a foundation for future SSI research and advances dialogue understanding and system development. Abstract: In task-oriented dialogue (TOD) systems, Slot Schema Induction (SSI) is essential for automatically identifying key information slots from dialogue data without manual intervention. This paper presents a novel state-of-the-art (SoTA) approach that formulates SSI as a text generation task, where a language model incrementally constructs and refines a slot schema over a stream of dialogue data. To develop this approach, we present a fully automatic LLM-based TOD simulation method that creates data with high-quality state labels for novel task domains. Furthermore, we identify issues in SSI evaluation due to data leakage and poor metric alignment with human judgment. We resolve these by creating new evaluation data using our simulation method with human guidance and correction, as well as designing improved evaluation metrics. These contributions establish a foundation for future SSI research and advance the SoTA in dialogue understanding and system development.

[102] Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues

Leandra Fichtel,Maximilian Spliethöver,Eyke Hüllermeier,Patricia Jimenez,Nils Klowait,Stefan Kopp,Axel-Cyrille Ngonga Ngomo,Amelie Robrecht,Ingrid Scharlau,Lutz Terfloth,Anna-Lisa Vollmer,Henning Wachsmuth

Main category: cs.CL

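TL;DR: The paper studies how well large language models (LLMs) handle co-constructive explanation dialogues and finds that they show some co-constructive behaviors but have limited ability to monitor understanding and adapt their explanations.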

Details Motivation: To examine whether LLMs can act as explainers in dynamically adapted explanation dialogues that serve explainees with different backgrounds and needs. Method: A user study in which explainees interact with LLMs, evaluating the LLMs' co-constructive behavior and its effect on the explainees' understanding. Result: LLMs show co-constructive behaviors such as asking verification questions, which foster explainee engagement and understanding, but they remain limited in adapting explanations dynamically. Conclusion: Current LLMs are of limited effectiveness in co-constructive explanation dialogues, and their ability to adapt dynamically needs further improvement. Abstract: The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee's background and needs, recent research has focused on co-constructive explanation dialogues, where the explainer continuously monitors the explainee's understanding and adapts explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with LLMs, of which some have been instructed to explain a predefined topic co-constructively. We evaluate the explainees' understanding before and after the dialogue, as well as their perception of the LLMs' co-constructive behavior. Our results indicate that current LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees' engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.

[103] TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation

Gwen Yidou Weng,Benjie Wang,Guy Van den Broeck

Main category: cs.CL

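TL;DR: TRACE is a new framework that uses tractable probabilistic reasoning and lightweight control to compute the Expected Attribute Probability (EAP) efficiently, enabling controllable generation from language models.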

Details Motivation: As large language models advance, their outputs need to be controlled to align with human values or specific attributes, but existing methods are expensive, inflexible, or inefficient. Method: TRACE distills a Hidden Markov Model (HMM) from the language model and pairs it with a small classifier that estimates attribute probabilities, which allows exact EAP computation and is used to adjust the model's next-token probabilities. Result: TRACE achieves state-of-the-art detoxification with only 10% decoding overhead and adapts to 76 low-resource personalized LLMs within seconds. Conclusion: TRACE offers an efficient, flexible, and extensible way to control language model outputs. Abstract: As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either tune or post-train LMs for each new attribute - expensive and inflexible - or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable gEneration), a novel framework that efficiently computes EAP and adapts to new attributes through tractable probabilistic reasoning and lightweight control. TRACE distills a Hidden Markov Model (HMM) from an LM and pairs it with a small classifier to estimate attribute probabilities, enabling exact EAP computation over the HMM's predicted futures. This EAP is then used to reweigh the LM's next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art results in detoxification with only 10% decoding overhead, adapts to 76 low-resource personalized LLMs within seconds, and seamlessly extends to composite attributes.
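
The reweighing step mentioned in the abstract can be illustrated with a small sketch. It assumes the usual Bayes-style form, where each candidate token's LM probability is multiplied by the probability that the attribute holds given that token and the result is renormalized; the abstract does not spell out the exact formula, so this product form is an assumption. The EAP values below are made-up numbers standing in for what TRACE would compute from its HMM and classifier, and the function name is hypothetical.

```python
def reweigh_next_token(lm_probs, eap):
    """Combine LM next-token probabilities with per-token attribute probabilities.

    lm_probs: dict mapping token -> p_LM(token | prefix)
    eap:      dict mapping token -> P(attribute | prefix + token), assumed given
    Returns a renormalized distribution proportional to lm_probs[t] * eap[t].
    """
    unnormalized = {tok: p * eap[tok] for tok, p in lm_probs.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        # No candidate satisfies the attribute; fall back to the LM distribution.
        return dict(lm_probs)
    return {tok: w / total for tok, w in unnormalized.items()}

# Hypothetical three-token vocabulary for a detoxification-style attribute.
lm_probs = {"kind": 0.5, "rude": 0.3, "the": 0.2}
eap = {"kind": 0.9, "rude": 0.1, "the": 0.6}  # illustrative values only

controlled = reweigh_next_token(lm_probs, eap)
for token, prob in controlled.items():
    print(f"{token}: {prob:.2f}")
```

Tokens that are likely under the LM but unlikely to lead to an attribute-compliant continuation are pushed down, without retraining the LM itself.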

eess.SP [Back]

[104] Material Identification Via RFID For Smart Shopping

David Wang,Derek Goh,Jiale Zhang

Main category: eess.SP

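TL;DR: The paper proposes a system that detects concealed items by exploiting how containers attenuate and scatter RFID tag signals, combining neural network classification with distance measurements for real-time loss prevention.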

Details Motivation: Cashierless stores struggle to prevent theft when items are concealed, for example in backpacks or pockets. Method: A neural network is trained on RSSI and phase angle data from RFID tags to classify seven common containers, with distance measurements incorporated to handle varying tag-to-reader separations. Result: In a simulated retail environment the system reaches 74% accuracy from single reads and 89% with one-second samples; with distance measurements it achieves 82% accuracy across 0.3-2 m tag-to-reader separations. Conclusion: Combined with computer vision, the system provides proactive loss prevention for cashierless retail using existing infrastructure. Abstract: Cashierless stores rely on computer vision and RFID tags to associate shoppers with items, but concealed items placed in backpacks, pockets, or bags create challenges for theft prevention. We introduce a system that turns existing RFID tagged items into material sensors by exploiting how different containers attenuate and scatter RF signals. Using RSSI and phase angle, we trained a neural network to classify seven common containers. In a simulated retail environment, the model achieves 89% accuracy with one second samples and 74% accuracy from single reads. Incorporating distance measurements, our system achieves 82% accuracy across 0.3-2m tag to reader separations. When deployed at aisle or doorway choke points, the system can flag suspicious events in real time, prompting camera screening or staff intervention. By combining material identification with computer vision tracking, our system provides proactive loss prevention for cashierless retail while utilizing existing infrastructure.

cs.CR [Back]

[105] Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

Narek Maloyan,Dmitry Namiot

Main category: cs.CR

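TL;DR: LLM-as-a-judge systems are vulnerable to prompt injection attacks. The paper proposes a framework that separates content-author attacks from system-prompt attacks and evaluates five models across tasks; attack success reaches 73.8%, and smaller models are more vulnerable.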

Details Motivation: To expose the vulnerability of LLM judge systems, particularly to prompt injection attacks, and provide a basis for defenses. Method: A framework that separates content-author attacks from system-prompt attacks, evaluated on five models and four tasks with fifty prompts per condition. Result: Attack success reaches up to 73.8%, smaller models are more vulnerable, and attack transferability ranges from 50.5% to 62.6%. Conclusion: The authors recommend multi-model committees and comparative scoring, and release all code and datasets. Abstract: LLM-as-a-judge systems used to assess text quality, code correctness, and argument strength are vulnerable to prompt injection attacks. We introduce a framework that separates content-author attacks from system-prompt attacks and evaluate five models (Gemma 3 27B, Gemma 3 4B, Llama 3.2 3B, GPT-4, and Claude 3 Opus) on four tasks with various defenses, using fifty prompts per condition. Attacks achieved up to 73.8% success, smaller models proved more vulnerable, and transferability ranged from 50.5% to 62.6%. Our results contrast with Universal Prompt Injection and AdvPrompter. We recommend multi-model committees and comparative scoring, and release all code and datasets.

[106] Diffusion-Driven Universal Model Inversion Attack for Face Recognition

Hanrui Wang,Shuo Wang,Chun-Shien Lu,Isao Echizen

Main category: cs.CR

TL;DR: DiffUMI is a training-free, diffusion-driven universal model inversion attack for evaluating the privacy risks of face recognition systems.

Details Motivation: Face recognition systems convert raw images into embeddings to protect privacy, but model inversion attacks can still reconstruct private facial images. Existing methods must train a generator for each target model, which is computationally expensive. Method: DiffUMI builds on a pretrained diffusion model, requires no target-specific generator training, and uses an optimized adversarial search to achieve efficient, high-fidelity facial reconstruction. Result: DiffUMI successfully breaches privacy-preserving face recognition systems and is the first to use model inversion to distinguish non-face inputs from face inputs. Conclusion: DiffUMI demonstrates the potential of unconditional diffusion models for model inversion and provides an efficient tool for privacy risk assessment. Abstract: Facial recognition technology poses significant privacy risks, as it relies on biometric data that is inherently sensitive and immutable if compromised. To mitigate these concerns, face recognition systems convert raw images into embeddings, traditionally considered privacy-preserving. However, model inversion attacks pose a significant privacy threat by reconstructing these private facial images, making them a crucial tool for evaluating the privacy risks of face recognition systems. Existing methods usually require training individual generators for each target model, a computationally expensive process. In this paper, we propose DiffUMI, a training-free diffusion-driven universal model inversion attack for face recognition systems. DiffUMI is the first approach to apply a diffusion model for unconditional image generation in model inversion. Unlike other methods, DiffUMI is universal, eliminating the need for training target-specific generators. It operates within a fixed framework and pretrained diffusion model while seamlessly adapting to diverse target identities and models. DiffUMI breaches privacy-preserving face recognition systems with state-of-the-art success, demonstrating that an unconditional diffusion model, coupled with optimized adversarial search, enables efficient and high-fidelity facial reconstruction. Additionally, we introduce a novel application of out-of-domain detection (OODD), marking the first use of model inversion to distinguish non-face inputs from face inputs based solely on embeddings.

math.NA [Back]

[107] Outlier-aware Tensor Robust Principal Component Analysis with Self-guided Data Augmentation

Yangyang Xu,Kexin Li,Li Yang,You-Wei Wen

Main category: math.NA

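TL;DR: The paper proposes a self-guided data augmentation method that uses adaptive weighting to suppress outlier influence, reformulating the TRPCA problem as a standard TPCA problem and improving robustness to structured corruptions.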

Details Motivation: Existing TRPCA methods rely on sparse-outlier assumptions and cannot handle structured corruptions effectively. Method: An adaptive weighting scheme dynamically identifies and downweights outlier contributions, and the resulting problem is solved with an efficient proximal block coordinate descent algorithm. Result: Experiments on synthetic and real-world datasets (such as face recovery and background subtraction) confirm the method's effectiveness, with gains in both accuracy and computational efficiency. Conclusion: The method handles a variety of corruption patterns and outperforms existing techniques. Abstract: Tensor Robust Principal Component Analysis (TRPCA) is a fundamental technique for decomposing multi-dimensional data into a low-rank tensor and an outlier tensor, yet existing methods relying on sparse outlier assumptions often fail under structured corruptions. In this paper, we propose a self-guided data augmentation approach that employs adaptive weighting to suppress outlier influence, reformulating the original TRPCA problem into a standard Tensor Principal Component Analysis (TPCA) problem. The proposed model involves an optimization-driven weighting scheme that dynamically identifies and downweights outlier contributions during tensor augmentation. We develop an efficient proximal block coordinate descent algorithm with closed-form updates to solve the resulting optimization problem, ensuring computational efficiency. Theoretical convergence is guaranteed through a framework combining block coordinate descent with majorization-minimization principles. Numerical experiments on synthetic and real-world datasets, including face recovery, background subtraction, and hyperspectral denoising, demonstrate that our method effectively handles various corruption patterns. The results show improvements in both accuracy and computational efficiency compared to state-of-the-art methods.

eess.AS [Back]

[108] Kimi-Audio Technical Report

KimiTeam,Ding Ding,Zeqian Ju,Yichong Leng,Songxiang Liu,Tong Liu,Zeyu Shang,Kai Shen,Wei Song,Xu Tan,Heyi Tang,Zhengtao Wang,Chu Wei,Yifei Xin,Xinran Xu,Jianwei Yu,Yutao Zhang,Xinyu Zhou,Y. Charles,Jun Chen,Yanru Chen,Yulun Du,Weiran He,Zhenxing Hu,Guokun Lai,Qingcheng Li,Yangyang Liu,Weidong Sun,Jianzhou Wang,Yuzhi Wang,Yuefeng Wu,Yuxin Wu,Dongchao Yang,Hao Yang,Ying Yang,Zhilin Yang,Aoxiong Yin,Ruibin Yuan,Yutong Zhang,Zaida Zhou

Main category: eess.AS

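TL;DR: Kimi-Audio is an open-source audio foundation model for audio understanding, generation, and conversation. The report covers its model architecture, data curation, training recipe, inference deployment, and evaluation.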

Details Motivation: To build a versatile audio model that supports a wide range of audio tasks and to open-source it to support research. Method: A 12.5Hz audio tokenizer, an LLM-based architecture with continuous features as input and discrete tokens as output, and a chunk-wise streaming detokenizer based on flow matching. Result: State-of-the-art performance on multiple audio benchmarks, including speech recognition, audio understanding, and conversation. Conclusion: Kimi-Audio is a capable general-purpose audio model, released with code and toolkits to support the community. Abstract: We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.

cs.SE [Back]

[109] Spatial Reasoner: A 3D Inference Pipeline for XR Applications

Steven Häsler,Philipp Ackermann

Main category: cs.SE

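TL;DR: The paper proposes a spatial reasoning framework that combines geometric facts with symbolic predicates and relations to handle semantic tasks in 3D scenes, such as determining how objects are arranged ('on', 'behind', 'near', and so on).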

Details Motivation: Modern XR systems call for AR/VR applications that understand 3D scenes semantically, but effective spatial reasoning tools are lacking. Method: Oriented 3D bounding box representations are combined with spatial predicates (topology, connectivity, directionality, and others) to build a spatial knowledge graph, and a pipeline-based inference model supports spatial queries and dynamic rule evaluation. Result: The framework efficiently turns geometric data into actionable knowledge, supports scalable spatial reasoning in complex 3D environments, and integrates with machine learning, natural language processing, and related technologies. Conclusion: The framework offers an efficient, technology-independent approach to spatial reasoning in XR applications and encourages the creation of spatial ontologies. Abstract: Modern extended reality (XR) systems provide rich analysis of image data and fusion of sensor input and demand AR/VR applications that can reason about 3D scenes in a semantic manner. We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks such as determining how 3D objects are arranged among each other ('on', 'behind', 'near', etc.). Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates, ranging from topology and connectivity to directionality and orientation, expressed in a formalism related to natural language. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation. Implementations for client- and server-side processing demonstrate the framework's capability to efficiently translate geometric data into actionable knowledge, ensuring scalable and technology-independent spatial reasoning in complex 3D environments. The Spatial Reasoner framework is fostering the creation of spatial ontologies, and seamlessly integrates with and therefore enriches machine learning, natural language processing, and rule systems in XR applications.

eess.IV [Back]

[110] A Deep Bayesian Convolutional Spiking Neural Network-based CAD system with Uncertainty Quantification for Medical Images Classification

Mohaddeseh Chegini,Ali Mahloojifar

Main category: eess.IV

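TL;DR: Proposes a CAD system based on a deep Bayesian convolutional spiking neural network that quantifies predictive uncertainty with Monte Carlo Dropout, improving model reliability and accuracy.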

Details Motivation: Conventional deep spiking neural networks (SNNs) are unreliable because they cannot quantify the uncertainty of their predictions, which must be addressed before they can be applied to medical image classification. Method: Uses Monte Carlo Dropout as a Bayesian approximation to build a deep Bayesian convolutional SNN and applies it to several medical image classification tasks. Result: Experiments show that the proposed model is accurate and reliable and can serve as an alternative to conventional deep learning. Conclusion: The model offers a reliable and efficient solution for medical image classification with potential for practical use. Abstract: Computer-Aided Diagnosis (CAD) systems facilitate accurate diagnosis of diseases. Developing CAD systems with third-generation neural networks, namely Spiking Neural Networks (SNNs), is essential to utilize the benefits of SNNs, such as their event-driven processing, parallelism, low power consumption, and ability to process sparse temporal-spatial information. However, a deep SNN, as a deep learning model, faces challenges with unreliability. To deal with the unreliability caused by the inability to quantify the uncertainty of predictions, we propose a deep Bayesian Convolutional Spiking Neural Network-based CAD system with an uncertainty-aware module. In this study, the Monte Carlo Dropout method, as a Bayesian approximation, is used for uncertainty quantification. This method was applied to several medical image classification tasks. Our experimental results demonstrate that our proposed model is accurate and reliable and will be a proper alternative to conventional deep learning for medical image classification.
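Monte Carlo Dropout itself can be sketched in a few lines: keep dropout active at inference, run several stochastic forward passes, and read uncertainty off the spread of the predictions. The sketch below assumes a toy linear classifier rather than a spiking network; `WEIGHTS`, the dropout rate, and the number of passes are hypothetical, and predictive entropy is one common uncertainty score, not necessarily the one used in the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, weights, drop_p, rng):
    """One forward pass with dropout left ON (inverted dropout on the inputs)."""
    keep = 1.0 - drop_p
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * v for w, v in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(x, weights, drop_p=0.5, passes=100, seed=0):
    """Average `passes` stochastic passes; return (mean_probs, predictive_entropy)."""
    rng = random.Random(seed)
    runs = [stochastic_forward(x, weights, drop_p, rng) for _ in range(passes)]
    mean = [sum(col) / passes for col in zip(*runs)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy


# Hypothetical 2-class weights over 4 input features (illustration only).
WEIGHTS = [[2.0, 2.0, 2.0, 2.0], [-2.0, -2.0, -2.0, -2.0]]

# All features agree: the passes agree with each other, so entropy is low.
confident_mean, confident_h = mc_dropout_predict([1.0, 1.0, 1.0, 1.0], WEIGHTS)
# Features conflict: the prediction flips with the dropout mask, so entropy is high.
ambiguous_mean, ambiguous_h = mc_dropout_predict([1.0, -1.0, 1.0, -1.0], WEIGHTS)
print(confident_h, ambiguous_h)
```

In a diagnostic setting, a high-entropy prediction could be routed to a clinician instead of being reported automatically; that routing policy is a design choice outside what the abstract describes.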

[111] Predicting Dairy Calf Body Weight from Depth Images Using Deep Learning (YOLOv8) and Threshold Segmentation with Cross-Validation and Longitudinal Analysis

Mingsi Liao,Gota Morota,Ye Bi,Rebecca R. Cockrum

Main category: eess.IV

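TL;DR: Uses deep learning to develop an automated method for monitoring calf body weight, easing the labor and time constraints of traditional methods and improving prediction accuracy.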

Details Motivation: Labor and facility constraints make traditional calf body weight monitoring inefficient, and Holstein coat patterns complicate image-based weight estimation. The study explores non-contact measurement. Method: Develops a deep learning segmentation model (YOLOv8) and compares it with threshold-based segmentation, then predicts body weight with linear regression (LR), extreme gradient boosting (XGBoost), and a linear mixed model (LMM). Result: YOLOv8 segmentation outperformed thresholding (IoU = 0.98 vs. 0.89). XGBoost gave the best single-time-point prediction (R² = 0.91), and LMM gave the most accurate longitudinal prediction (R² = 0.99). Conclusion: Deep learning shows potential for automated calf body weight prediction and can improve farm management efficiency. Abstract: Monitoring calf body weight (BW) before weaning is essential for assessing growth, feed efficiency, health, and weaning readiness. However, labor, time, and facility constraints limit BW collection. Additionally, Holstein calf coat patterns complicate image-based BW estimation, and few studies have explored non-contact measurements taken at early time points for predicting later BW. The objectives of this study were to (1) develop deep learning-based segmentation models for extracting calf body metrics, (2) compare deep learning segmentation with threshold-based methods, and (3) evaluate BW prediction using single-time-point cross-validation with linear regression (LR) and extreme gradient boosting (XGBoost) and multiple-time-point cross-validation with LR, XGBoost, and a linear mixed model (LMM). Depth images from Holstein (n = 63) and Jersey (n = 5) pre-weaning calves were collected, with 20 Holstein calves being weighed manually. Results showed that You Only Look Once version 8 (YOLOv8) deep learning segmentation (intersection over union = 0.98) outperformed threshold-based methods (0.89). In single-time-point cross-validation, XGBoost achieved the best BW prediction (R^2 = 0.91, mean absolute percentage error (MAPE) = 4.37%), while LMM provided the most accurate longitudinal BW prediction (R^2 = 0.99, MAPE = 2.39%). These findings highlight the potential of deep learning for automated BW prediction, enhancing farm management.
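The three metrics reported above (intersection over union, R^2, and MAPE) follow standard definitions, sketched here for reference. The masks and weights are made-up illustration values, not data from the study.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (nested lists)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up segmentation masks: 3 shared pixels, 4 pixels in the union.
pred_mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
true_mask = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]

# Made-up body weights in kg.
bw_true = [40.0, 50.0, 60.0]
bw_pred = [42.0, 50.0, 57.0]

print(iou(pred_mask, true_mask))     # 3 / 4
print(r_squared(bw_true, bw_pred))   # 1 - 13 / 200
print(mape(bw_true, bw_pred))        # 100 * (0.05 + 0 + 0.05) / 3
```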

[112] Spectral Bias Correction in PINNs for Myocardial Image Registration of Pathological Data

Bastien C. Baluyot,Marta Varela,Chen Qin

Main category: eess.IV

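TL;DR: Integrates Fourier feature mappings and modulation strategies into physics-informed neural networks (PINNs) to counter spectral bias, improving the accuracy and biomechanical plausibility of myocardial image registration.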

Details Motivation: Myocardial image registration is essential for cardiac strain analysis and disease diagnosis, but spectral bias in neural networks leads to inaccurate modeling of high-frequency deformations, particularly in pathological data. Method: Introduces Fourier feature mappings and modulation strategies into a PINN framework to address spectral bias. Result: Experiments on two datasets show that the method captures high-frequency deformations in cardiomyopathies more accurately, with higher registration accuracy and biomechanically plausible results. Conclusion: The method provides a foundation for scalable cardiac image registration and generalization across patients and pathologies. Abstract: Accurate myocardial image registration is essential for cardiac strain analysis and disease diagnosis. However, spectral bias in neural networks impedes modeling high-frequency deformations, producing inaccurate, biomechanically implausible results, particularly in pathological data. This paper addresses spectral bias in physics-informed neural networks (PINNs) by integrating Fourier Feature mappings and introducing modulation strategies into a PINN framework. Experiments on two distinct datasets demonstrate that the proposed methods enhance the PINN's ability to capture complex, high-frequency deformations in cardiomyopathies, achieving superior registration accuracy while maintaining biomechanical plausibility - thus providing a foundation for scalable cardiac image registration and generalization across multiple patients and pathologies.
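A Fourier feature mapping lifts low-dimensional coordinates into sinusoids of several frequencies before they enter the network, which makes high-frequency targets easier to fit. The sketch below assumes the widely used random Gaussian form, gamma(x) = [cos(2*pi*Bx), sin(2*pi*Bx)]; the abstract does not state which variant the paper uses, and `sigma`, `num_freqs`, and the function names are hypothetical. The paper's modulation strategies are not reproduced.

```python
import math
import random


def make_frequency_matrix(in_dim, num_freqs, sigma, seed=0):
    """Sample a (num_freqs x in_dim) matrix B with entries drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(num_freqs)]


def fourier_features(x, freq_matrix):
    """Map x to [cos(2*pi*Bx), sin(2*pi*Bx)], a vector of length 2 * num_freqs."""
    proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in freq_matrix]
    return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]


# A 2D image coordinate mapped to 16 features; a larger sigma means higher frequencies.
B = make_frequency_matrix(in_dim=2, num_freqs=8, sigma=4.0)
features = fourier_features([0.25, 0.75], B)
print(len(features))
```

The scale `sigma` acts as a bandwidth control: too small and the network stays biased toward smooth deformations, too large and it can fit noise, so it is usually tuned per dataset.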

[113] Physics-Driven Neural Compensation For Electrical Impedance Tomography

Chuyu Wang,Huiting Deng,Dong Liu

Main category: eess.IV

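TL;DR: Proposes PhyNC, an unsupervised deep learning framework that incorporates the physical principles of EIT to address the ill-posed inverse problem and the uneven sensitivity distribution, substantially improving reconstruction accuracy.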

Details Motivation: EIT holds promise for medical and industrial applications but faces two challenges, an ill-posed inverse problem and a non-uniform sensitivity distribution, which traditional methods do not resolve effectively. Method: Proposes the PhyNC framework, which dynamically allocates neural representational capacity to low-sensitivity regions and incorporates physical principles for unsupervised learning. Result: On simulated and experimental data, PhyNC outperforms existing methods in detail preservation and artifact resistance, especially in low-sensitivity regions. Conclusion: PhyNC improves the robustness of EIT reconstruction and offers a flexible framework for other, similar imaging modalities. Abstract: Electrical Impedance Tomography (EIT) provides a non-invasive, portable imaging modality with significant potential in medical and industrial applications. Despite its advantages, EIT encounters two primary challenges: the ill-posed nature of its inverse problem and the spatially variable, location-dependent sensitivity distribution. Traditional model-based methods mitigate ill-posedness through regularization but overlook sensitivity variability, while supervised deep learning approaches require extensive training data and lack generalization. Recent developments in neural fields have introduced implicit regularization techniques for image reconstruction, but these methods typically neglect the physical principles underlying EIT, thus limiting their effectiveness. In this study, we propose PhyNC (Physics-driven Neural Compensation), an unsupervised deep learning framework that incorporates the physical principles of EIT. PhyNC addresses both the ill-posed inverse problem and the sensitivity distribution by dynamically allocating neural representational capacity to regions with lower sensitivity, ensuring accurate and balanced conductivity reconstructions. Extensive evaluations on both simulated and experimental data demonstrate that PhyNC outperforms existing methods in terms of detail preservation and artifact resistance, particularly in low-sensitivity regions. Our approach enhances the robustness of EIT reconstructions and provides a flexible framework that can be adapted to other imaging modalities with similar challenges.

[114] Towards a deep learning approach for classifying treatment response in glioblastomas

Ana Matoso,Catarina Passarinho,Marta P. Loureiro,José Maria Moreira,Patrícia Figueiredo,Rita G. Nunes

Main category: eess.IV

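TL;DR: Proposes a deep learning pipeline that classifies glioblastoma treatment response according to the RANO criteria from consecutive MRI scans, and tests several approaches and model architectures.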

Details Motivation: Assessing treatment response in glioblastoma is complex and time-consuming, and deep learning performs well on classification problems, so the study aims to build the first deep learning pipeline for classification under the RANO criteria. Method: Using the LUMIERE dataset, five approaches were tested: image subtraction, modality combinations, model architectures, pretraining tasks, and adding clinical data. The best-performing model was a Densenet264 taking T1-weighted, T2-weighted, and FLAIR images as input. Result: The best model reached a median balanced accuracy of 50.96%. Saliency Maps often highlighted the tumor region, whereas Grad-CAM usually did not and was effective only in some classes. Conclusion: The study sets a benchmark for future RANO-based assessment of glioblastoma treatment response and underlines the heterogeneity of factors involved in assessing tumor response. Abstract: Glioblastomas are the most aggressive type of glioma, having a 5-year survival rate of 6.9%. Treatment typically involves surgery, followed by radiotherapy and chemotherapy, and frequent magnetic resonance imaging (MRI) scans to monitor disease progression. To assess treatment response, radiologists use the Response Assessment in Neuro-Oncology (RANO) criteria to categorize the tumor into one of four labels based on imaging and clinical features: complete response, partial response, stable disease, and progressive disease. This assessment is very complex and time-consuming. Since deep learning (DL) has been widely used to tackle classification problems, this work aimed to implement the first DL pipeline for the classification of RANO criteria based on two consecutive MRI acquisitions. The models were trained and tested on the open dataset LUMIERE. Five approaches were tested: 1) subtraction of input images, 2) different combinations of modalities, 3) different model architectures, 4) different pretraining tasks, and 5) adding clinical data. The pipeline that achieved the best performance used a Densenet264 considering only T1-weighted, T2-weighted, and Fluid Attenuated Inversion Recovery (FLAIR) images as input without any pretraining. A median Balanced Accuracy of 50.96% was achieved. Additionally, explainability methods were applied. Using Saliency Maps, the tumor region was often successfully highlighted. In contrast, Grad-CAM typically failed to highlight the tumor region, with some exceptions observed in the Complete Response and Progressive Disease classes, where it effectively identified the tumor region.
These results set a benchmark for future studies on glioblastoma treatment response assessment based on the RANO criteria while emphasizing the heterogeneity of factors that might play a role when assessing the tumor's response to treatment.
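Balanced accuracy is the mean of per-class recall, so with four RANO classes a classifier that always predicts one class scores 25% regardless of how common that class is; this is the baseline against which the reported 50.96% is read. A minimal sketch, using made-up labels rather than study data:

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of its size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def accuracy(y_true, y_pred):
    """Plain accuracy, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up, imbalanced labels: CR = complete response, PR = partial response,
# SD = stable disease, PD = progressive disease.
y_true = ["PD"] * 6 + ["SD"] * 2 + ["PR"] + ["CR"]
y_pred = ["PD"] * 10  # a degenerate model that always predicts the majority class

print(accuracy(y_true, y_pred))           # looks acceptable because PD dominates
print(balanced_accuracy(y_true, y_pred))  # exposes that three classes are never found
```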

[115] NUDF: Neural Unsigned Distance Fields for high resolution 3D medical image segmentation

Kristine Sørensen,Oscar Camara,Ole de Backer,Klaus Kofoed,Rasmus Paulsen

Main category: eess.IV

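TL;DR: Proposes a medical image segmentation method based on Neural Unsigned Distance Fields (NUDF) that handles high-resolution processing and complex shape modeling.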

Details Motivation: Traditional medical image segmentation either runs out of memory on high-resolution images or loses detail after downsampling, whereas an NUDF can process high-resolution images efficiently and capture complex shapes. Method: Learns a continuous distance field directly from the image with a Neural Unsigned Distance Field (NUDF), avoiding the need to downsample the image or represent the surface in a binary voxel grid. Result: On left atrial appendage (LAA) segmentation, NUDF produces high-accuracy 3D mesh models with errors on the order of the CT voxel spacing. Conclusion: NUDF performs well for medical image segmentation and is particularly suited to high-resolution modeling of complex shapes. Abstract: Medical image segmentation is often considered as the task of labelling each pixel or voxel as being inside or outside a given anatomy. Processing the images at their original size and resolution often results in insuperable memory requirements, but downsampling the images leads to a loss of important details. Instead of aiming to represent a smooth and continuous surface in a binary voxel-grid, we propose to learn a Neural Unsigned Distance Field (NUDF) directly from the image. The small memory requirements of NUDF allow for high resolution processing, while the continuous nature of the distance field allows us to create high resolution 3D mesh models of shapes of any topology (e.g. open surfaces). We evaluate our method on the task of left atrial appendage (LAA) segmentation from Computed Tomography (CT) images. The LAA is a complex and highly variable shape, being thus difficult to represent with traditional segmentation methods using discrete labelmaps. With our proposed method, we are able to predict 3D mesh models that capture the details of the LAA and achieve accuracy in the order of the voxel spacing in the CT images.
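An unsigned distance field stores, for every query point, the distance to the nearest point on the shape, with no inside/outside sign; that is why it can describe open surfaces. The sketch below computes the field exactly for an open 2D polyline as a stand-in for the quantity the network is trained to predict. The 2D setting and the function names are assumptions for illustration; the paper works in 3D and learns the field from image data.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (2D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment's line and clamp to the segment's ends.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def unsigned_distance(p, polyline):
    """Unsigned distance from p to an open polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(polyline, polyline[1:]))


# An open curve: it encloses nothing, so a signed (inside/outside) field is undefined.
curve = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

# The field can be sampled at any resolution; the curve is its zero level set.
for query in [(0.5, 0.0), (0.5, 0.5), (0.5, -0.5), (-3.0, 4.0)]:
    print(query, unsigned_distance(query, curve))
```

Because the field is a continuous function of the query point, sampling density is decoupled from the input image resolution, which is the property the abstract relies on for high-resolution meshing.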

[116] Partition Map-Based Fast Block Partitioning for VVC Inter Coding

Xinmin Feng,Zhuoyuan Li,Li Li,Dong Liu,Feng Wu

Main category: eess.IV

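TL;DR: Proposes a partition map-based fast block partitioning algorithm for VVC inter coding; a neural network predicts the partition map and a dual-threshold decision scheme is applied, substantially reducing encoding complexity.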

Details Motivation: The recursive partition search in VVC greatly increases encoding complexity, so an efficient way to reduce the computational burden is needed. Method: Improves the partition map by adding an MTT mask, designs a neural network that predicts the partition map from spatial and temporal features, and uses a dual-threshold decision scheme to balance complexity and performance. Result: The method saves 51.30% of encoding time on average at a BDBR cost of 2.12%. Conclusion: The method substantially reduces complexity while largely preserving coding performance and is suited to inter-partition optimization in VVC. Abstract: Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access configuration.

[117] A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography

Yui Lo,Yuqian Chen,Dongnan Liu,Leo Zekelman,Jarrett Rushmore,Yogesh Rathi,Nikos Makris,Alexandra J. Golby,Fan Zhang,Weidong Cai,Lauren J. O'Donnell

Main category: eess.IV

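TL;DR: Tract2Shape is a new multimodal deep learning framework for fast, accurate prediction of white matter tract shape measures; it outperforms existing methods and generalizes well across datasets.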

Details Motivation: Conventional methods for computing white matter tract shape measures are computationally expensive and slow, which limits analysis of large-scale datasets. Method: Proposes the Tract2Shape framework, which combines geometric (point cloud) and scalar (tabular) features, uses a dimensionality reduction algorithm to predict the primary shape components, and is trained and evaluated on the HCP-YA and PPMI datasets. Result: On HCP-YA it outperforms existing models, with the highest Pearson's r and the lowest nMSE; on PPMI it maintains high performance, confirming its generalization ability. Conclusion: Tract2Shape offers an efficient, accurate solution for large-scale white matter shape analysis with broad application potential. Abstract: Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson's r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson's r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets.
This framework lays a promising foundation for future large-scale white matter shape analysis.

[118] HepatoGEN: Generating Hepatobiliary Phase MRI with Perceptual and Adversarial Models

Jens Hooge,Gerard Sanroma-Guell,Faidra Stavropoulou,Alexander Ullmann,Gesine Knobloch,Mark Klemens,Carola Schmidt,Sabine Weckbach,Andreas Bolz

Main category: eess.IV

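TL;DR: Proposes a deep learning approach for synthesizing HBP images and compares three generative models; pGAN achieves the best quantitative performance, while the U-Net is more consistent.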

Details Motivation: Acquiring HBP images takes a long time, which affects patient comfort and scanner throughput, so a way to shorten scans is needed. Method: Uses three generative models (U-Net, pGAN, DDPM) to synthesize HBP images from earlier contrast phases, and introduces a contrast evolution score (CES) to assess training data quality. Result: pGAN had the best quantitative performance, the U-Net was more consistent with fewer artifacts, and DDPM underperformed. Conclusion: Synthetic HBP imaging is feasible and can reduce scan time without compromising diagnostic utility, showing the clinical potential of deep learning in liver MRI. Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays a crucial role in the detection and characterization of focal liver lesions, with the hepatobiliary phase (HBP) providing essential diagnostic information. However, acquiring HBP images requires prolonged scan times, which may compromise patient comfort and scanner throughput. In this study, we propose a deep learning based approach for synthesizing HBP images from earlier contrast phases (precontrast and transitional) and compare three generative models: a perceptual U-Net, a perceptual GAN (pGAN), and a denoising diffusion probabilistic model (DDPM). We curated a multi-site DCE-MRI dataset from diverse clinical settings and introduced a contrast evolution score (CES) to assess training data quality, enhancing model performance. Quantitative evaluation using pixel-wise and perceptual metrics, combined with qualitative assessment through blinded radiologist reviews, showed that pGAN achieved the best quantitative performance but introduced heterogeneous contrast in out-of-distribution cases. In contrast, the U-Net produced consistent liver enhancement with fewer artifacts, while DDPM underperformed due to limited preservation of fine structural details. These findings demonstrate the feasibility of synthetic HBP image generation as a means to reduce scan time without compromising diagnostic utility, highlighting the clinical potential of deep learning for dynamic contrast enhancement in liver MRI. A project demo is available at: https://jhooge.github.io/hepatogen

[119] Nearly isotropic segmentation for medial temporal lobe subregions in multi-modality MRI

Yue Li,Pulkit Khandelwal,Long Xie,Laura E. M. Wisse,Nidhi Mundada,Christopher A. Brown,Emily McGrew,Amanda Denning,Sandhitsu R. Das,David A. Wolk,Paul A. Yushkevich

Main category: eess.IV

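TL;DR: Develops a nearly isotropic segmentation pipeline that improves the accuracy of thickness measurements for medial temporal lobe subregions in T2-weighted MRI.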

Details Motivation: The high in-plane resolution of T2-weighted MRI suits hippocampal subfield segmentation, but its low out-of-plane resolution reduces the accuracy of subregion thickness measurements. Method: Upsamples images and labels and combines this with high-resolution segmentation to create a nearly isotropic atlas, then trains a multi-modality deep learning segmentation model. Result: Experiments show that nearly isotropic subregion segmentation improves the accuracy of cortical thickness as a biomarker of neurodegeneration in T2-weighted MRI. Conclusion: The method increases the value of T2-weighted MRI for research on neurodegenerative disease. Abstract: Morphometry of medial temporal lobe (MTL) subregions in brain MRI is a sensitive biomarker of Alzheimer's disease and other related conditions. While T2-weighted (T2w) MRI with high in-plane resolution is widely used to segment hippocampal subfields due to its higher contrast in the hippocampus, its lower out-of-plane resolution reduces the accuracy of subregion thickness measurements. To address this issue, we developed a nearly isotropic segmentation pipeline that incorporates image and label upsampling and high-resolution segmentation in T2w MRI. First, a high-resolution atlas was created based on an existing anisotropic atlas derived from 29 individuals. Both T1-weighted and T2w images in the atlas were upsampled from their original resolution to a nearly isotropic resolution of 0.4 x 0.4 x 0.52 mm^3 using a non-local means approach. Manual segmentations within the atlas were also upsampled to match this resolution using a UNet-based neural network, which was trained on a cohort consisting of both high-resolution ex vivo and low-resolution anisotropic in vivo MRI with manual segmentations. Second, a multi-modality deep learning-based segmentation model was trained within this nearly isotropic atlas. Finally, experiments showed the nearly isotropic subregion segmentation improved the accuracy of cortical thickness as an imaging biomarker for neurodegeneration in T2w MRI.

[120] RSFR: A Coarse-to-Fine Reconstruction Framework for Diffusion Tensor Cardiac MRI with Semantic-Aware Refinement

Jiahao Huang,Fanwen Wang,Pedro F. Ferreira,Haosen Zhang,Yinzhe Wu,Zhifan Gao,Lei Zhu,Angelica I. Aviles-Rivero,Carola-Bibiane Schonlieb,Andrew D. Scott,Zohya Khalique,Maria Dwornik,Ramyah Rajakulasingam,Ranil De Silva,Dudley J. Pennell,Guang Yang,Sonia Nielles-Vallespin

Main category: eess.IV

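TL;DR: The RSFR framework uses zero-shot semantic priors and a Vision Mamba reconstruction backbone to substantially improve reconstruction quality and parameter estimation accuracy in cardiac diffusion tensor imaging.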

Details Motivation: Cardiac diffusion tensor imaging (DTI) faces technical challenges in clinical use, such as low signal-to-noise ratio and aliasing artefacts, which limit its practical utility. Method: Proposes the RSFR framework, a coarse-to-fine strategy that combines zero-shot semantic priors from the Segment Anything Model with a Vision Mamba reconstruction backbone. Result: Experiments show that RSFR achieves state-of-the-art reconstruction quality and accurate DT parameter estimation at high undersampling rates. Conclusion: RSFR is robust and scalable, has potential for clinical translation, and opens a new direction for quantitative cardiac DTI. Abstract: Cardiac diffusion tensor imaging (DTI) offers unique insights into cardiomyocyte arrangements, bridging the gap between microscopic and macroscopic cardiac function. However, its clinical utility is limited by technical challenges, including a low signal-to-noise ratio, aliasing artefacts, and the need for accurate quantitative fidelity. To address these limitations, we introduce RSFR (Reconstruction, Segmentation, Fusion & Refinement), a novel framework for cardiac diffusion-weighted image reconstruction. RSFR employs a coarse-to-fine strategy, leveraging zero-shot semantic priors via the Segment Anything Model and a robust Vision Mamba-based reconstruction backbone. Our framework integrates semantic features effectively to mitigate artefacts and enhance fidelity, achieving state-of-the-art reconstruction quality and accurate DT parameter estimation under high undersampling rates. Extensive experiments and ablation studies demonstrate the superior performance of RSFR compared to existing methods, highlighting its robustness, scalability, and potential for clinical translation in quantitative cardiac DTI.

cs.LG [Back]

[121] Gradient Descent as a Shrinkage Operator for Spectral Bias

Simon Lucey

Main category: cs.LG

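TL;DR: Examines the connection between activation functions and spline regression/smoothing and its effect on spectral bias in 1D shallow networks, and proposes that gradient descent (GD) can be viewed as a shrinkage operator that controls spectral bias by implicitly selecting which frequency components to retain.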

Details Motivation: To study how the choice of activation function affects spectral bias in neural networks, and to explore the role gradient descent plays. Method: Through theoretical analysis, reinterprets gradient descent as a shrinkage operator, studies how it masks the singular values of the Jacobian, and relates the GD hyperparameters to bandwidth. Result: GD regularization is effective only with monotonic activation functions; non-monotonic activation functions (such as sinc and Gaussian) can serve as iteration-efficient surrogates for spectral bias. Conclusion: The choice of activation function and the GD hyperparameters significantly affect spectral bias, and non-monotonic activation functions are more iteration-efficient. Abstract: We generalize the connection between activation function and spline regression/smoothing and characterize how this choice may influence spectral bias within a 1D shallow network. We then demonstrate how gradient descent (GD) can be reinterpreted as a shrinkage operator that masks the singular values of a neural network's Jacobian. Viewed this way, GD implicitly selects the number of frequency components to retain, thereby controlling the spectral bias. An explicit relationship is proposed between the choice of GD hyperparameters (learning rate & number of iterations) and bandwidth (the number of active components). GD regularization is shown to be effective only with monotonic activation functions. Finally, we highlight the utility of non-monotonic activation functions (sinc, Gaussian) as iteration-efficient surrogates for spectral bias.
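The shrinkage view is easiest to see in the linear least-squares case, used here as a stand-in for the linearized network; the paper's exact operator is not given in the abstract. For a single mode with singular value sigma, t steps of gradient descent from zero with learning rate eta reach the fraction 1 - (1 - eta * sigma^2)^t of the least-squares solution, so modes with large singular values are kept and small ones are suppressed. The learning rate, step count, and singular values below are illustrative assumptions.

```python
def gd_mode(sigma, y, lr, steps):
    """Run gradient descent on 0.5 * (sigma * w - y)**2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= lr * sigma * (sigma * w - y)
    return w


def shrinkage_factor(sigma, lr, steps):
    """Closed form: fraction of the least-squares solution y / sigma reached by GD."""
    return 1.0 - (1.0 - lr * sigma ** 2) ** steps


lr, steps, y = 0.1, 50, 2.0
for sigma in (3.0, 1.0, 0.1):
    iterated = gd_mode(sigma, y, lr, steps)
    predicted = shrinkage_factor(sigma, lr, steps) * y / sigma
    # The factor is near 1 for sigma = 3 and 1, and roughly 0.05 for sigma = 0.1.
    print(sigma, shrinkage_factor(sigma, lr, steps), iterated, predicted)
```

Raising either `lr` or `steps` moves the cut-off toward smaller singular values, which is the sense in which the two hyperparameters jointly set the number of active components.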

cs.MA [Back]

[122] Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

Isadora White,Kolby Nottingham,Ayush Maniar,Max Robinson,Hansen Lillemark,Mehul Maheshwari,Lianhui Qin,Prithviraj Ammanabrolu

Main category: cs.MA

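TL;DR: Studies how well LLMs collaborate on complex embodied reasoning tasks, introduces the MINDcraft platform and the MineCollab benchmark, and finds that the main bottleneck for current LLM agents is efficient natural language communication.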

Details Motivation: Collaboration is ubiquitous and essential in everyday life; the study explores how LLMs can collaborate adaptively on complex embodied reasoning tasks. Method: Develops the MINDcraft platform and the MineCollab benchmark to test the collaborative abilities of LLM agents in Minecraft. Result: The main bottleneck for current LLM agents is efficient natural language communication, with performance dropping by as much as 15%. Conclusion: Existing LLM agents perform poorly at multi-agent collaboration, especially in embodied settings, and methods beyond in-context and imitation learning are needed. Abstract: Collaboration is ubiquitous and essential in day-to-day life -- from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/

cs.SD [Back]

[123] Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

Leena G Pillai,D. Muhammad Noorul Mubarak,Elizabeth Sherly

Main category: cs.SD

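TL;DR: Proposes a new BiLSTM and 1D CNN approach for predicting tongue and lip articulatory features from speech; fixed-weight initialization outperforms adaptive weights.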

Details Motivation: Speech production involves complex articulatory coordination, with the tongue and lips as key articulators; predicting their articulatory features matters for speech research. Method: Uses a stacked BiLSTM with a 1D CNN for post-processing and fixed-weight initialization, trained on datasets of simultaneously recorded speech and EMA. Result: The fixed-weight approach outperforms adaptive weights within fewer training epochs, evaluated in SD, SI, CD, and CC modes. Conclusion: The method provides an efficient model for articulatory feature prediction and advances speech production research. Abstract: Speech production is a complex sequential process which involves the coordination of various articulatory features. Among them, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed-weight initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), corpus dependent (CD) and cross corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weight approach outperformed adaptive-weight initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.

cs.RO [Back]

[124] Set Phasers to Stun: Beaming Power and Control to Mobile Robots with Laser Light

Charles J. Carver,Hadleigh Schwartz,Toma Itagaki,Zachary Englhardt,Kechen Liu,Megan Graciela Nauli Manik,Chun-Cheng Chang,Vikram Iyer,Brian Plancher,Xia Zhou

Main category: cs.RO

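TL;DR: Phaser is a system that delivers wireless power and communication to mobile robots over a narrow-beam laser, combining stereo-vision tracking with high-power beam steering for efficient energy delivery and low-power communication.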

Details Motivation: To meet the need for wireless power and communication on mobile robots, improving energy delivery and reducing the power cost of communication. Method: Designs a semi-automatic calibration procedure that fuses stereo-vision 3D tracking with high-power beam steering, and reuses the laser light as a data channel for low-power communication. Result: The Phaser prototype delivers optical power densities of over 110 mW/cm² and error-free data at multi-meter ranges, with on-board decoding drawing 97% less current than Bluetooth Low Energy. Conclusion: Phaser fully powers and controls battery-free robots and substantially improves their performance. Abstract: We present Phaser, a flexible system that directs narrow-beam laser light to moving robots for concurrent wireless power delivery and communication. We design a semi-automatic calibration procedure to enable fusion of stereo-vision-based 3D robot tracking with high-power beam steering, and a low-power optical communication scheme that reuses the laser light as a data channel. We fabricate a Phaser prototype using off-the-shelf hardware and evaluate its performance with battery-free autonomous robots. Phaser delivers optical power densities of over 110 mW/cm$^2$ and error-free data to mobile robots at multi-meter ranges, with on-board decoding drawing 0.3 mA (97% less current than Bluetooth Low Energy). We demonstrate Phaser fully powering gram-scale battery-free robots to nearly 2x higher speeds than prior work while simultaneously controlling them to navigate around obstacles and along paths. Code, an open-source design guide, and a demonstration video of Phaser are available at https://mobilex.cs.columbia.edu/phaser.

cs.HC [Back]

[125] Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

Chaoran Chen,Zhiping Zhang,Ibrahim Khalilov,Bingcan Guo,Simret A Gebreegziabher,Yanfang Ye,Ziang Xiao,Yaxing Yao,Tianshi Li,Toby Jia-Jun Li

Main category: cs.HC

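TL;DR: Examines the privacy and security risks that arise when LLM-powered GUI agents handle sensitive data, and proposes a human-centered evaluation framework.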

Details Motivation: Existing evaluations focus mainly on performance, while privacy and security risks remain understudied; this gap needs to be filled. Method: Identifies three key risks of GUI agents, compares them with traditional GUI automation and general autonomous agents, and outlines five challenges in integrating human evaluators. Result: Proposes a human-centered evaluation framework that emphasizes risk assessment, informed user consent, and privacy- and security-aware design. Conclusion: Calls for greater attention to privacy and security in the design and evaluation of GUI agents, with human involvement to improve transparency and safety. Abstract: The rise of Large Language Models (LLMs) has revolutionized Graphical User Interface (GUI) automation through LLM-powered GUI agents, yet their ability to process sensitive data with limited human oversight raises significant privacy and security risks. This position paper identifies three key risks of GUI agents and examines how they differ from traditional GUI automation and general autonomous agents. Despite these risks, existing evaluations focus primarily on performance, leaving privacy and security assessments largely unexplored. We review current evaluation metrics for both GUI and general LLM agents and outline five key challenges in integrating human evaluators for GUI agent assessments. To address these gaps, we advocate for a human-centered evaluation framework that incorporates risk assessments, enhances user awareness through in-context consent, and embeds privacy and security considerations into GUI agent design and evaluation.

cs.CE [Back]

[126] SMARTFinRAG: Interactive Modularized Financial RAG Benchmark

Yiwei Zha

Main category: cs.CE

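TL;DR: SMARTFinRAG is an evaluation platform for financial RAG systems that addresses three key gaps: modular architecture, document-centric evaluation, and a bridge between research and implementation.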

Details Motivation: The financial sector is rapidly adopting language model technologies, but evaluating specialized RAG systems remains challenging. Method: Proposes SMARTFinRAG, comprising a modular architecture, a document-centric evaluation paradigm, and a research-implementation interface. Result: Evaluation reveals significant differences in retrieval efficacy and response quality across configurations. Conclusion: The open-source architecture supports transparent research while addressing the practical problems financial institutions face when deploying RAG systems. Abstract: Financial sectors are rapidly adopting language model technologies, yet evaluating specialized RAG systems in this domain remains challenging. This paper introduces SMARTFinRAG, addressing three critical gaps in financial RAG assessment: (1) a fully modular architecture where components can be dynamically interchanged during runtime; (2) a document-centric evaluation paradigm generating domain-specific QA pairs from newly ingested financial documents; and (3) an intuitive interface bridging research-implementation divides. Our evaluation quantifies both retrieval efficacy and response quality, revealing significant performance variations across configurations. The platform's open-source architecture supports transparent, reproducible research while addressing practical deployment challenges faced by financial institutions implementing RAG systems.

cs.IR [Back]

[127] Unsupervised Corpus Poisoning Attacks in Continuous Space for Dense Retrieval

Yongkang Li,Panagiotis Eustratiadis,Simon Lupart,Evangelos Kanoulas

Main category: cs.IR

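TL;DR: Proposes a corpus poisoning attack for dense information retrieval that generates adversarial documents by optimizing directly in the embedding space, overcoming limitations of existing attacks and achieving a fast, effective attack in the unsupervised setting.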

Details Motivation: Attacks in the current literature operate in the discrete lexical space, whereas retrieval happens in the continuous embedding space; existing methods also usually assume the adversary knows the query distribution. The paper addresses both limitations. Method: Proposes optimizing directly in the embedding space by training a perturbation model that maintains the geometric distance between original and adversarial document embeddings while maximizing token-level dissimilarity. The attack is unsupervised. Result: Experiments show the method generates adversarial documents in under two minutes, four times faster than existing methods, and the generated text is more natural (low perplexity) and harder to detect. Conclusion: The method achieves a fast, effective corpus poisoning attack in the unsupervised setting, overcomes the limitations of existing methods, and performs well across multiple tasks and settings. Abstract: This paper concerns corpus poisoning attacks in dense information retrieval, where an adversary attempts to compromise the ranking performance of a search algorithm by injecting a small number of maliciously generated documents into the corpus. Our work addresses two limitations in the current literature. First, attacks that perform adversarial gradient-based word substitution search do so in the discrete lexical space, while retrieval itself happens in the continuous embedding space. We thus propose an optimization method that operates in the embedding space directly. Specifically, we train a perturbation model with the objective of maintaining the geometric distance between the original and adversarial document embeddings, while also maximizing the token-level dissimilarity between the original and adversarial documents. Second, it is common for related work to have a strong assumption that the adversary has prior knowledge about the queries. In this paper, we focus on a more challenging variant of the problem where the adversary assumes no prior knowledge about the query distribution (hence, unsupervised). Our core contribution is an adversarial corpus attack that is fast and effective. We present comprehensive experimental results on both in- and out-of-domain datasets, focusing on two related tasks: a top-1 attack and a corpus poisoning attack. We consider attacks under both a white-box and a black-box setting. Notably, our method can generate successful adversarial examples in under two minutes per target document; four times faster compared to the fastest gradient-based word substitution methods in the literature with the same hardware.
Furthermore, our adversarial generation method generates text that is more likely to occur under the distribution of natural text (low perplexity), and is therefore more difficult to detect.